BLOG_POST / secrets-scanning-zero-dev-setup

Secrets scanning for a 200+ repo GitHub org, with zero developer setup

tl;dr summary

We built secrets scanning that developers never have to think about. Every push is scanned exactly once (scans are deduplicated by commit SHA), findings are stored as metadata and hashes with no secret values, and alerts are routed to the right humans fast.

When your GitHub organization has 200+ repositories and hundreds of pushes per day, secrets eventually get committed. Not because people are reckless, but because humans are busy, juniors are learning, and legacy repos have surprising gravitational pull. If your work includes customer projects, those leaks are not just embarrassing. They are contractual risk, breach risk, and reputation risk.

We built an organization-wide secrets-scanning system that runs on every push, on every branch, across all repos in the org, without requiring any developer setup. No local hooks. No CI integration. No “install this tool and keep it updated.” No merge blocking. Just fast detection, fast routing, and a workflow that makes remediation boring.

Boring is the goal. Nobody wants adrenaline in incident response.


The problem (business and culture, not just regex)

The raw problem is simple: with enough commits, someone will accidentally commit an API key, a cloud credential, an SSH key, or a private certificate. The messy part is everything around it:

People ship under deadline pressure. Interns and juniors do what the codebase teaches them, and legacy repos are excellent teachers of bad habits. Meanwhile, many teams still treat security as something you “add later” with a tool rollout that requires buy-in from every developer, every repo owner, and every CI pipeline.

That approach collapses under scale. If the scanner depends on developers opting in, then the repos that need it most will be the last ones to adopt it.

So we set goals that matched reality:

We wanted detection that is automatic, organization-wide, and invisible to developers. We wanted feedback that lands quickly (minutes, not days), and we wanted the system to be cheap enough that nobody starts negotiating with security about whether it is “worth it” to scan pushes.

We also wanted sovereignty. Not in the geopolitical sense. In the “this is our logic, our data model, our integrations” sense. If it breaks, we can fix it. If we need a tweak, we do not wait for a vendor roadmap.

The non-goals mattered just as much:

We did not block pushes or merges. We did not scan anything outside the org's GitHub repositories. And we did not ask developers to install anything locally or retrofit CI across 200+ repos.

This is “security by design” without being “security by paperwork”.


Scope and definitions (so the system stays honest)

In scope: every repository in the GitHub org, every push event, every branch. The scan target is the repository checked out exactly at the pushed commit SHA. That matters because it creates a clear, reproducible unit of work: “scan this snapshot”.

Out of scope: anything not pushed to GitHub org repos. No developer laptops. No external Git hosting. No artifact registries. No build logs. That is a separate project with a separate risk model.

What counts as a secret: API keys, SSH keys, TLS certs and private keys, cloud credentials (AWS/GCP), mailing service credentials, and “high-entropy strings” that are likely passwords. In practice, we treat any secret in git history as unacceptable. There is no safe window where it was “only in a branch”. Git history is a very effective long-term storage system, and that is a compliment only until it is not.


The architecture in one sentence

GitHub push webhook → webhook receiver verifies authenticity → idempotency by commit SHA → AWS Lambda clones repo at the SHA → TruffleHog scans filesystem with verification → JSON findings are normalized → store metadata and SHA-256 hashes only → Slack + read-only dashboard.

That is the entire system. It is deliberately direct.


Current system flow (Mermaid)

flowchart TD
    A["GitHub org push"] --> B["Webhook receiver"]
    B --> C["Idempotency check (commit SHA)"]
    C --> D["Lambda scan job"]
    D --> E["git clone --depth 1"]
    E --> F["checkout commit SHA"]
    F --> G["TruffleHog filesystem scan + verify"]
    G --> H["JSON findings"]
    H --> I["Hash-only findings store"]
    I --> J["Slack alert"]
    I --> K["Read-only dashboard"]


Sequence diagram (push to alert)

sequenceDiagram
    participant GH as GitHub
    participant WR as Webhook Receiver
    participant L as AWS Lambda (scan)
    participant Git as Git (clone/checkout)
    participant TH as TruffleHog
    participant FS as Findings Store
    participant SL as Slack
    participant D as Internal Dashboard

    GH->>WR: push webhook (event + signature)
    WR->>WR: verify HMAC signature
    WR->>WR: idempotency check (commit SHA)
    WR->>L: invoke scan (commit SHA, repo)
    L->>Git: clone --depth 1
    Git-->>L: repo workspace at commit
    L->>TH: scan filesystem + verify
    TH-->>L: findings JSON
    L->>FS: upsert current findings (hash-only)
    L->>SL: notify (no secret values)
    D->>FS: dashboard reads current state

Hosting and operational knobs

The scanning jobs run in AWS Lambda, deployed in eu-west-3 (Paris). Each scan has 1024 MB of memory. We also enforce a global concurrency cap on scan jobs. That cap is important because the workload pattern is bursty. A single busy hour can contain a large slice of the day’s pushes.
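As a rough illustration of those knobs, here is a minimal sketch in AWS CDK. The stack name, function name, runtime, timeout, and concurrency value are assumptions for the example; only the region, memory size, and the existence of a concurrency cap come from the system described here.

import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';

// Hypothetical stack: illustrates the hosting knobs, not our actual deployment code.
export class SecretsScanStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id, { env: { region: 'eu-west-3' } });

    new lambda.Function(this, 'ScanJob', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('dist/scan-job'),
      memorySize: 1024,                   // 1024 MB per scan
      timeout: cdk.Duration.minutes(5),   // generous ceiling for large repos (illustrative)
      reservedConcurrentExecutions: 10,   // global cap for bursty push traffic (illustrative value)
    });
  }
}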

The system is designed so you can scale the scanner up or down without changing developer behavior. That is the entire point.

The tradeoff is time-to-detection. Compared to the earlier stateful PoC (more on that later), Lambda scanning is slower. A typical push-to-detection latency is about 40 seconds. In our experience, that is fast enough to keep a human feedback loop tight without making operations complicated.


Webhook receiver: authenticity and idempotency

The receiver has two jobs that must be boring and correct: validate the webhook, and make sure we do not scan the same thing repeatedly.

Authenticity is enforced via GitHub’s HMAC signature verification. We chose not to rely on IP allowlists, WAF rules, or other perimeter controls. Not because those are bad, but because for this system the signature is the primary control and is straightforward to validate correctly. If the signature is invalid, we reject the request.
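The check itself is small. A minimal sketch using Node's built-in crypto (the function name mirrors the pseudocode below; our receiver's actual code may differ):

import { createHmac, timingSafeEqual } from 'node:crypto';

// Verify GitHub's X-Hub-Signature-256 header against the raw request body.
// `secret` is the webhook secret configured on the org webhook.
export function verifyGithubHmac(rawBody: Buffer, signatureHeader: string | undefined, secret: string): boolean {
  if (!signatureHeader || !signatureHeader.startsWith('sha256=')) return false;

  const expected = 'sha256=' + createHmac('sha256', secret).update(rawBody).digest('hex');
  const a = Buffer.from(signatureHeader);
  const b = Buffer.from(expected);

  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}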

Idempotency is enforced using the pushed commit SHA as the dedupe key. Webhooks can be delivered more than once, and delivery retries are normal behavior. Deduping by commit SHA makes duplicate deliveries cheap: if we already scanned that SHA, we no-op.

Here is the shape of the logic in intentionally boring pseudocode:

function handlePushWebhook(req) {
  if (!verifyGithubHmac(req)) return 401;

  const sha = req.payload.after; // pushed commit SHA
  if (alreadyScanned(sha)) return 200; // no-op

  markScanned(sha);
  invokeLambdaScan({ sha, repo: req.payload.repository.full_name, branch: req.payload.ref });

  return 202; // accepted
}

We store just enough state to remember which SHAs have been scanned. That state is not sensitive and does not contain secret material.
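This post does not prescribe a particular store for that state. Any backend with an atomic “insert if absent” works; here is one hedged sketch using DynamoDB (table name and schema are made up for the example). Collapsing the check and the mark into a single conditional write also closes the small race window the two-step pseudocode leaves open.

import { DynamoDBClient, PutItemCommand, ConditionalCheckFailedException } from '@aws-sdk/client-dynamodb';

const ddb = new DynamoDBClient({ region: 'eu-west-3' });

// Returns true if this commit SHA has not been scanned yet, recording it atomically.
export async function claimSha(sha: string): Promise<boolean> {
  try {
    await ddb.send(new PutItemCommand({
      TableName: 'scanned-commits', // hypothetical table keyed on `sha`
      Item: { sha: { S: sha }, scanned_at: { S: new Date().toISOString() } },
      ConditionExpression: 'attribute_not_exists(sha)',
    }));
    return true; // first delivery for this commit: scan it
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) return false; // duplicate delivery: no-op
    throw err;
  }
}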


The scan job: clone, checkout, scan, normalize

Each scan job is stateless and repeatable:

  1. git clone --depth 1 to keep clone time and storage low.
  2. checkout the pushed commit SHA.
  3. run TruffleHog in filesystem mode with verification enabled.
  4. ingest the JSON output, normalize fields, and write findings to storage.
  5. send a Slack notification (metadata only).
  6. dashboard reads the current open state.

We scan the repository as a filesystem snapshot, not by walking historical git blobs. We also exclude the .git directory to avoid scanning object data. .gitignore is respected so we reduce noise from generated files.

This keeps scanning aligned with the question we actually care about: “Did someone just push a secret that now exists in the repo contents at this commit?”

If you want to scan full history, you can. It is just a different problem, a different runtime profile, and often a different operational budget.
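For steps 1 and 2, here is a hedged sketch of the clone-and-checkout step, assuming Node's child_process and a throwaway temp directory (our actual scan job may be structured differently):

import { execFileSync } from 'node:child_process';
import { mkdtempSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Check out a repository at an exact commit in a throwaway workspace.
// `repo` is "org/name", `ref` is the pushed ref (e.g. "refs/heads/feature/x"),
// `token` is a read-only fine-grained PAT. The caller deletes the workspace afterwards.
export function checkoutAtSha(repo: string, ref: string, sha: string, token: string): string {
  const workdir = mkdtempSync(join(tmpdir(), 'scan-'));
  const branch = ref.replace(/^refs\/heads\//, '');
  const url = `https://x-access-token:${token}@github.com/${repo}.git`;

  // Shallow clone of the pushed branch keeps transfer and disk usage small.
  execFileSync('git', ['clone', '--depth', '1', '--branch', branch, url, workdir], { stdio: 'ignore' });

  // If the branch tip has already moved past the pushed SHA, fetch that commit explicitly.
  // (GitHub serves reachable commits by SHA; other hosts may be configured differently.)
  execFileSync('git', ['fetch', '--depth', '1', 'origin', sha], { cwd: workdir, stdio: 'ignore' });
  execFileSync('git', ['checkout', '--detach', sha], { cwd: workdir, stdio: 'ignore' });

  return workdir;
}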


TruffleHog configuration and philosophy

We run an unmodified TruffleHog binary, pulled in at build time and rebuilt whenever a newer release is available. Verification is enabled for detectors that support it, which reduces false positives. JSON output is used so we can ingest findings in a structured way.

Configuration choices in plain language:

  • Filesystem scanning: scan what the repo looks like at the commit.
  • JSON output: makes storage and dedupe manageable.
  • Exclude .git: avoids rescanning blobs.
  • Respect .gitignore: reduces noise.
  • No custom detectors yet: when gaps are found, we prefer contributing upstream rather than maintaining a fork.

That last point is a sovereignty detail that surprises people. Sovereignty does not mean “fork everything and carry patches forever.” In many cases it means “own the integration and the system around the tool, and contribute fixes upstream when the tool is missing something.”
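Putting those configuration choices together, the scan step can be as small as the sketch below. Treat the JSON field mapping as an assumption: TruffleHog's output fields have shifted between releases, so pin a version and check its schema rather than trusting this example.

import { execFileSync } from 'node:child_process';
import { rmSync } from 'node:fs';
import { join } from 'node:path';

interface Finding {
  detector: string;  // which detector fired (provider label)
  file: string;      // path within the repo
  verified: boolean; // did TruffleHog confirm the credential is live?
  raw: string;       // secret value: hashed immediately, never persisted
}

// Run TruffleHog in filesystem mode over a checked-out workspace and parse
// its newline-delimited JSON output. Verification is on by default in recent releases.
export function scanWorkspace(workdir: string): Finding[] {
  // Drop .git so we scan the working tree, not packed object data.
  rmSync(join(workdir, '.git'), { recursive: true, force: true });

  const out = execFileSync('trufflehog', ['filesystem', workdir, '--json'], { encoding: 'utf8' });

  return out
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line))
    .map((f) => ({
      detector: String(f.DetectorName ?? ''),
      file: String(f.SourceMetadata?.Data?.Filesystem?.file ?? ''),
      verified: Boolean(f.Verified),
      raw: String(f.Raw ?? ''),
    }));
}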


Findings storage: hash-only, current-state-only

The system stores only “current open findings”. When a finding is resolved, the record is deleted. That is a deliberate design decision.

This is the canonical schema we treat as the system model:

CREATE TABLE IF NOT EXISTS secrets (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    repo TEXT NOT NULL,
    branch TEXT,
    file_path TEXT NOT NULL,
    provider TEXT,
    hash TEXT,
    first_seen_at TEXT NOT NULL,
    last_seen_at TEXT NOT NULL,
    resolved_at TEXT
)

What we store, and what we refuse to store

We store:

  • repo name
  • branch (from the push)
  • file path
  • provider label (what detector flagged it)
  • a SHA-256 hash
  • first and last seen timestamps

We do not store secret values. Not in the database. Not in Slack. Not in the dashboard.

To compute the hash, the secret value is handled in-memory for the minimum time possible. We hash it, keep the hash, then discard the value. The hash gives us a stable identifier for deduplication and “is this the same secret resurfacing?” style tracking, without retaining the actual credential.
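In code, that reduction is essentially a one-liner; the sketch below shows the shape:

import { createHash } from 'node:crypto';

// Reduce a detected secret to a stable, non-reversible identifier.
// The plaintext exists only inside this function's scope and is never persisted.
export function fingerprint(secretValue: string): string {
  return createHash('sha256').update(secretValue, 'utf8').digest('hex');
}

Two scans that see the same credential produce the same fingerprint, so they collapse into one row without either scan keeping the plaintext.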

“Current state only” lifecycle (Mermaid)

stateDiagram-v2
    [*] --> Open: first detection
    Open --> Open: still present (last_seen_at updated)
    Open --> Resolved: no longer detected
    Resolved --> [*]: row dropped (current-state only)

Deleting resolved rows has a clear downside: you lose historical analytics. Without a separate, append-only event log you cannot answer questions like “how long did that exact secret live?” We accepted that tradeoff because the system’s primary job is not analytics. It is fast remediation.

If we later want analytics, we can add append-only event logging without changing the core “current state” store.
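Here is a hedged sketch of that current-state reconciliation against the schema above, assuming SQLite via better-sqlite3 (the store the PoC used; the current one may differ) and a unique index on (repo, file_path, hash) to make the upsert well-defined:

import Database from 'better-sqlite3';

const db = new Database('findings.db'); // illustrative path
db.exec('CREATE UNIQUE INDEX IF NOT EXISTS idx_secret ON secrets (repo, file_path, hash)');

// Upsert everything the current scan found, then drop anything in this repo
// that the scan no longer sees: current-state only, no history kept.
export function reconcile(repo: string, branch: string, findings: { file: string; provider: string; hash: string }[]): void {
  const now = new Date().toISOString();

  const upsert = db.prepare(`
    INSERT INTO secrets (repo, branch, file_path, provider, hash, first_seen_at, last_seen_at)
    VALUES (@repo, @branch, @file, @provider, @hash, @now, @now)
    ON CONFLICT (repo, file_path, hash) DO UPDATE SET last_seen_at = @now, branch = @branch
  `);
  for (const f of findings) upsert.run({ repo, branch, now, ...f });

  db.prepare('DELETE FROM secrets WHERE repo = ? AND last_seen_at < ?').run(repo, now);
}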


Dashboard and Slack notifications

We send notifications to Slack and maintain a read-only internal dashboard. The dashboard is designed for triage: show what is currently open, allow filtering by repo and provider, and give owners a clear list of what to fix.

The Slack message includes enough context to act quickly, and nothing that leaks secrets. A sanitized example looks like this:

🚨 Secret detected

Repo: customer-project-api
Branch: feature/onboarding
Commit: 1a2b3c4 (pushed by j.doe)
File: config/dev.env
Provider: AWS (verified)

Next steps: remove from repo + rotate/revoke if needed
Dashboard: internal read-only link

The “verified” marker is meaningful because it reduces the “this is probably a false positive, I will ignore it” reflex. That reflex is how small leaks become incidents.
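Delivery is deliberately low-tech: a formatted string posted to a Slack incoming webhook. A minimal sketch (the webhook URL variable and the finding shape are illustrative):

// Post a metadata-only alert to a Slack incoming webhook.
// No secret values ever enter this payload.
export async function notifySlack(finding: {
  repo: string; branch: string; commit: string; author: string;
  file: string; provider: string; verified: boolean;
}): Promise<void> {
  const text = [
    '🚨 Secret detected',
    `Repo: ${finding.repo}`,
    `Branch: ${finding.branch}`,
    `Commit: ${finding.commit} (pushed by ${finding.author})`,
    `File: ${finding.file}`,
    `Provider: ${finding.provider}${finding.verified ? ' (verified)' : ''}`,
    'Next steps: remove from repo + rotate/revoke if needed',
  ].join('\n');

  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }),
  });
}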


Routing and triage workflow

Routing is intentionally boring. The lead dev on the project is responsible, and the commit author is included for context. We do this via an internal directory mapping plus author attribution. The goal is not to shame. The goal is to put the alert in front of the person who can fix it fastest.
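In code, the routing is little more than a lookup. The mapping and handles below are invented for the example; ours lives in an internal directory:

// Hypothetical routing table: repo → lead dev's Slack handle.
const repoLeads: Record<string, string> = {
  'customer-project-api': '@lead.dev',
  'legacy-billing': '@other.lead',
};

// The lead owns the fix; the commit author is added for context.
export function routeAlert(repo: string, commitAuthor: string): string[] {
  const lead = repoLeads[repo] ?? '@security-team'; // fallback when ownership is unclear
  return [lead, `@${commitAuthor}`];
}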

Severity is also intentionally boring: everything is treated as critical. We do not play the “maybe this secret is low value” game. When secrets are in git history, the safe posture is to assume they are compromised.

The remediation workflow is what you would expect, and that is the point:

  1. Remove the secret from the repository.
  2. Rotate or revoke the credential depending on the provider and compromise likelihood.
  3. Confirm the scanner no longer detects it (finding disappears, row dropped).
  4. Prevent recurrence by moving secrets into a manager or environment injection.

We do not create automated PRs today. That is a deliberate choice. Automation can be helpful, but it can also add friction and complexity. We prioritized “detect and route” first, because that alone can eliminate the majority of risk quickly.


Observed outcomes after rollout

A baseline scan surfaced about 100 findings. Most were resolved quickly. The max time-to-fix observed was one business day, and the average time-to-fix was under three hours.

The most important outcome was cultural: the system taught developers what “good” looks like by giving immediate feedback. It is much easier to learn from a Slack alert ten minutes after a mistake than from a security incident report two months later.

It also reinforced a practical truth: the repos that leak secrets are often the oldest, the most copy-pasted, and the most “just get it working” in local dev. That is not a moral failure. It is a signal that those repos need updated patterns and templates.


Metrics and charts

[Chart] Findings per week: weekly new vs resolved findings.

[Chart] Open findings at week end: open findings trending down after the baseline cleanup.

[Chart] MTTR distribution: mean time to remediate, by bucket.

[Chart] Top leaked providers: top provider categories from findings.

[Chart] Repo hygiene snapshot: clean repos vs repos with open findings.

[Chart] Clean repo rate: share of repos with no open findings.

The previous version: a fast PoC with a slow-motion footgun

Before the current system, we built a PoC on a small EC2 instance (t4g.micro) with a stateful workspace:

  • webhook receiver
  • simple FIFO queue backed by a DB table
  • worker process
  • per-repo lock
  • cached workspace updated via git pull
  • SQLite state
  • Slack delta notifications

The PoC was fast. Average scans were about 5.5 seconds, and p95 around 12.76 seconds. The persistent workspace helped us surface edge cases and contribute upstream improvements.

The downside was operational. The caching directory needed cleanup. Cleanup was missed. Storage grew beyond expectations. The system was still “working”, but the operational risk was unacceptable. This is the classic problem with stateful workers: the performance is great until the state starts having opinions about your disk.

So we moved to Lambda-based stateless scanning.

The current system is slower, roughly 40 seconds from push to detection, but it is operationally simpler and scales cleanly with load. In a security system, “predictable and boring” is often worth more than “fast and clever.”


Security posture of the scanner

This system is not a perimeter fortress. It is a targeted control with a specific threat model.

  • Webhook authenticity is enforced by GitHub HMAC signature verification.
  • Replay and duplicate deliveries are handled by idempotency (commit SHA dedupe).
  • The GitHub token used for cloning is a fine-grained PAT with read-only permissions.
  • There are no additional network perimeter controls like IP allowlisting or WAF in front of the receiver. HMAC is the primary control.

That last choice is a tradeoff. If you later want defense-in-depth, you can add perimeter controls. The system design does not prevent it. We just did not make it a prerequisite for shipping a useful scanner.


Cost model (order of magnitude)

With Lambda at 1 GB memory and roughly 40 seconds runtime per scan:

  • compute: 1 GB × 40 seconds = 40 GB-seconds per push
  • cost per push: 40 GB-seconds × (price per GB-second) + request cost (worked through in the sketch below)
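A back-of-the-envelope version with indicative numbers (the per-GB-second and per-request prices are public list prices at the time of writing, and the push volume is an illustrative guess, so treat the output as an order of magnitude only):

// Rough cost-per-push arithmetic; prices and volume are assumptions, not billing data.
const GB_SECONDS_PER_PUSH = 1 * 40;           // 1 GB × 40 s
const PRICE_PER_GB_SECOND = 0.0000166667;     // USD, indicative Lambda list price
const PRICE_PER_REQUEST = 0.20 / 1_000_000;   // USD, indicative request price

const costPerPush = GB_SECONDS_PER_PUSH * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST;
const costPerMonth = costPerPush * 300 * 30;  // ~300 pushes/day, illustrative volume

console.log(costPerPush.toFixed(5));  // ≈ 0.00067 USD per push
console.log(costPerMonth.toFixed(2)); // ≈ 6.00 USD per month, before free tier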

In practice, the cost is close to zero for us because we stay within the AWS free tier.


Where this goes next (optional future): automated remediation PRs

One obvious next step is to have the system open a PR that removes the secret and replaces it with an environment reference, potentially assisted by LLM agents. That can be valuable, especially in large orgs where the same mistake repeats across repos.

We keep it as a roadmap item for a reason. Auto-remediation can easily become a source of friction if it is noisy or intrusive. The current system already delivers value by detecting, routing, and enforcing a tight feedback loop. If we add automation, it should be opt-in, conservative, and human-approved.
