Secrets scanning for a 200+ repo GitHub org, with zero developer setup
tl;dr summary
We built secrets scanning that developers never have to think about. Every push is scanned, findings are deduplicated by commit SHA, stored without secret values, and routed to the right humans fast.
When your GitHub organization has 200+ repositories and hundreds of pushes per day, secrets eventually get committed. Not because people are reckless, but because humans are busy, juniors are learning, and legacy repos have surprising gravitational pull. If your work includes customer projects, those leaks are not just embarrassing. They are contractual risk, breach risk, and reputation risk.
We built an organization-wide secrets-scanning system that runs on every push, on every branch, across all repos in the org, without requiring any developer setup. No local hooks. No CI integration. No “install this tool and keep it updated.” No merge blocking. Just fast detection, fast routing, and a workflow that makes remediation boring.
Boring is the goal. Nobody wants adrenaline in incident response.
The problem (business and culture, not just regex)
The raw problem is simple: with enough commits, someone will accidentally commit an API key, a cloud credential, an SSH key, or a private certificate. The messy part is everything around it:
People ship under deadline pressure. Interns and juniors do what the codebase teaches them, and legacy repos are excellent teachers of bad habits. Meanwhile, many teams still treat security as something you “add later” with a tool rollout that requires buy-in from every developer, every repo owner, and every CI pipeline.
That approach collapses under scale. If the scanner depends on developers opting in, then the repos that need it most will be the last ones to adopt it.
So we set goals that matched reality:
We wanted detection that is automatic, organization-wide, and invisible to developers. We wanted feedback that lands quickly (minutes, not days), and we wanted the system to be cheap enough that nobody starts negotiating with security about whether it is “worth it” to scan pushes.
We also wanted sovereignty. Not in the geopolitical sense. In the “this is our logic, our data model, our integrations” sense. If it breaks, we can fix it. If we need a tweak, we do not wait for a vendor roadmap.
The non-goals mattered just as much:
We did not block pushes or merges. We did not scan anything outside the GitHub org's repositories. We avoided asking developers to install anything locally or to retrofit CI on 200+ repos.
This is “security by design” without being “security by paperwork”.
Scope and definitions (so the system stays honest)
In scope: every repository in the GitHub org, every push event, every branch. The scan target is the repository checked out exactly at the pushed commit SHA. That matters because it creates a clear, reproducible unit of work: “scan this snapshot”.
Out of scope: anything not pushed to GitHub org repos. No developer laptops. No external Git hosting. No artifact registries. No build logs. That is a separate project with a separate risk model.
What counts as a secret: API keys, SSH keys, TLS certs and private keys, cloud credentials (AWS/GCP), mailing service credentials, and “high entropy strings” that are likely passwords. In practice, we treat any secret in git history as unacceptable. There is no safe window where it was “only in a branch”. Git history is a very effective long-term storage system, and that is a compliment only until it is not.
The architecture in one sentence
GitHub push webhook → webhook receiver verifies authenticity → idempotency by commit SHA → AWS Lambda clones repo at the SHA → TruffleHog scans filesystem with verification → JSON findings are normalized → store metadata and SHA-256 hashes only → Slack + read-only dashboard.
That is the entire system. It is deliberately direct.
Current system flow (Mermaid)
Sequence diagram (push to alert)
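Reconstructed from the one-sentence architecture above, the flow as Mermaid source (participant names are shorthand):

sequenceDiagram
    participant GH as GitHub
    participant RX as Webhook receiver
    participant FN as Lambda scanner
    participant DB as Findings store
    participant SL as Slack + dashboard
    GH->>RX: push webhook (HMAC-signed)
    RX->>RX: verify signature, dedupe by commit SHA
    RX->>FN: invoke scan(repo, branch, sha)
    FN->>GH: shallow clone at the pushed SHA
    FN->>FN: TruffleHog filesystem scan (verification on)
    FN->>DB: store metadata + SHA-256 hashes only
    FN->>SL: alert and dashboard update, metadata only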
Hosting and operational knobs
The scanning jobs run in AWS Lambda, deployed in eu-west-3 (Paris). Each scan has 1024 MB of memory. We also enforce a global concurrency cap on scan jobs. That cap is important because the workload pattern is bursty. A single busy hour can contain a large slice of the day’s pushes.
The system is designed so you can scale the scanner up or down without changing developer behavior. That is the entire point.
The tradeoff is time-to-detection. Compared to the earlier stateful PoC (more on that later), Lambda scanning is slower. A typical push-to-detection latency is about 40 seconds. In our experience, that is fast enough to keep a human feedback loop tight without making operations complicated.
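For orientation, those knobs map to a few lines of infrastructure code. A minimal sketch using AWS CDK in JavaScript (CDK is used purely for illustration; the construct name, image path, and the exact concurrency value are made up):

const { Duration } = require("aws-cdk-lib");
const lambda = require("aws-cdk-lib/aws-lambda");

// Inside a CDK stack deployed to eu-west-3: a container image bundling git and the TruffleHog binary.
const scanFn = new lambda.DockerImageFunction(this, "SecretsScanFn", {
  code: lambda.DockerImageCode.fromImageAsset("scanner/"), // Dockerfile installs git + trufflehog
  memorySize: 1024,                                        // 1 GB per scan job
  timeout: Duration.minutes(5),                            // generous ceiling; typical scans finish in ~40 s
  reservedConcurrentExecutions: 10,                        // global cap to smooth bursty push hours
});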
Webhook receiver: authenticity and idempotency
The receiver has two jobs that must be boring and correct: validate the webhook, and make sure we do not scan the same thing repeatedly.
Authenticity is enforced via GitHub’s HMAC signature verification. We chose not to rely on IP allowlists, WAF rules, or other perimeter controls. Not because those are bad, but because for this system the signature is the primary control and is straightforward to validate correctly. If the signature is invalid, we reject the request.
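For concreteness, a minimal sketch of that check in Node.js; the raw-body access and the environment variable name are assumptions about wiring, not part of GitHub's contract:

const crypto = require("node:crypto");

function verifyGithubHmac(req) {
  // GitHub signs the raw request body and sends "sha256=<hex digest>" in this header.
  const received = req.headers["x-hub-signature-256"] || "";
  const expected =
    "sha256=" +
    crypto
      .createHmac("sha256", process.env.GITHUB_WEBHOOK_SECRET)
      .update(req.rawBody) // must be the raw bytes, not re-serialized JSON
      .digest("hex");

  // Constant-time comparison; lengths must match or timingSafeEqual throws.
  const a = Buffer.from(received);
  const b = Buffer.from(expected);
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}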
Idempotency is enforced using the pushed commit SHA as the dedupe key. Webhooks can be delivered more than once, and delivery retries are normal behavior. Deduping by commit SHA makes duplicate deliveries cheap: if we already scanned that SHA, we no-op.
Here is the shape of the logic in intentionally boring pseudocode:
function handlePushWebhook(req) {
  if (!verifyGithubHmac(req)) return 401;
  const sha = req.payload.after; // pushed commit SHA
  if (alreadyScanned(sha)) return 200; // no-op for duplicate deliveries
  markScanned(sha);
  invokeLambdaScan({ sha, repo: req.payload.repository.full_name, branch: req.payload.ref });
  return 202; // accepted; the scan runs asynchronously
}
We store just enough state to remember which SHAs have been scanned. That state is not sensitive and does not contain secret material.
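One possible shape for that state, in the same SQLite dialect as the findings schema below (table and column names are illustrative). The alreadyScanned/markScanned pair collapses into a single atomic statement:

CREATE TABLE IF NOT EXISTS scanned_shas (
  sha TEXT PRIMARY KEY,
  scanned_at TEXT NOT NULL
);

-- "Insert if absent": if no row was inserted, this SHA was already scanned and the webhook is a no-op.
INSERT OR IGNORE INTO scanned_shas (sha, scanned_at) VALUES (:sha, :now);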
The scan job: clone, checkout, scan, normalize
Each scan job is stateless and repeatable:
- git clone --depth 1 to keep clone time and storage low.
- checkout the pushed commit SHA.
- run TruffleHog in filesystem mode with verification enabled.
- ingest the JSON output, normalize fields, and write findings to storage.
- send a Slack notification (metadata only).
- dashboard reads the current open state.
We scan the repository as a filesystem snapshot, not by walking historical git blobs. We also exclude the .git directory to avoid scanning object data. .gitignore is respected so we reduce noise from generated files.
This keeps scanning aligned with the question we actually care about: “Did someone just push a secret that now exists in the repo contents at this commit?”
If you want to scan full history, you can. It is just a different problem, a different runtime profile, and often a different operational budget.
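Put together, the per-push job fits in one small handler. A sketch in Node.js, assuming git and the trufflehog binary are bundled into the Lambda image; the helper names, the credential embedding, and the exclusion handling are illustrative:

const { execFileSync } = require("node:child_process");

exports.handler = async ({ repo, branch, sha }) => {
  const workdir = `/tmp/scan-${sha}`;
  const token = process.env.GITHUB_TOKEN; // fine-grained, read-only PAT
  const cloneUrl = `https://x-access-token:${token}@github.com/${repo}.git`;

  // Shallow clone of the pushed branch, then pin the working tree to the pushed SHA.
  const branchName = branch.replace("refs/heads/", "");
  execFileSync("git", ["clone", "--depth", "1", "--branch", branchName, cloneUrl, workdir]);
  execFileSync("git", ["-C", workdir, "checkout", sha]);

  // Filesystem scan; verification is on by default and output is one JSON object per finding.
  // .git exclusion and .gitignore handling happen around this call and depend on the TruffleHog version.
  const raw = execFileSync("trufflehog", ["filesystem", workdir, "--json"], {
    maxBuffer: 64 * 1024 * 1024,
  });

  const findings = raw.toString().trim().split("\n").filter(Boolean).map(JSON.parse);
  await persistAndNotify({ repo, branch, sha, findings }); // stores hashes + metadata, sends the Slack alert
};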
TruffleHog configuration and philosophy
We run an unmodified TruffleHog binary, rebuilt from the latest release at build time whenever a newer one is available. Verification is enabled for detectors that support it, which reduces false positives. JSON output is used so we can ingest findings in a structured way.
Configuration choices in plain language:
- Filesystem scanning: scan what the repo looks like at the commit.
- JSON output: makes storage and dedupe manageable.
- Exclude .git: avoids rescanning blobs.
- Respect .gitignore: reduces noise.
- No custom detectors yet: when gaps are found, we prefer contributing upstream rather than maintaining a fork.
That last point is a sovereignty detail that surprises people. Sovereignty does not mean “fork everything and carry patches forever.” In many cases it means “own the integration and the system around the tool, and contribute fixes upstream when the tool is missing something.”
Findings storage: hash-only, current-state-only
The system stores only “current open findings”. When a finding is resolved, the record is deleted. That is a deliberate design decision.
This is the canonical schema we treat as the system model:
CREATE TABLE IF NOT EXISTS secrets (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  repo TEXT NOT NULL,
  branch TEXT,
  file_path TEXT NOT NULL,
  provider TEXT,            -- which detector flagged it (AWS, GCP, SSH key, ...)
  hash TEXT,                -- SHA-256 of the secret value; the value itself is never stored
  first_seen_at TEXT NOT NULL,
  last_seen_at TEXT NOT NULL,
  resolved_at TEXT
)
What we store, and what we refuse to store
We store:
- repo name
- branch (from the push)
- file path
- provider label (what detector flagged it)
- a SHA-256 hash
- first and last seen timestamps
We do not store secret values. Not in the database. Not in Slack. Not in the dashboard.
To compute the hash, the secret value is handled in-memory for the minimum time possible. We hash it, keep the hash, then discard the value. The hash gives us a stable identifier for deduplication and “is this the same secret resurfacing?” style tracking, without retaining the actual credential.
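A minimal sketch of that step in Node.js (the function name is made up):

const crypto = require("node:crypto");

// Hash in memory, return only the digest; the caller drops the raw value immediately after.
function fingerprintSecret(rawValue) {
  return crypto.createHash("sha256").update(rawValue, "utf8").digest("hex");
}

The same repo, file path, and fingerprint showing up again means the same secret resurfacing, with nothing sensitive retained.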
“Current state only” lifecycle (Mermaid)
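A reconstruction of that lifecycle as Mermaid source, based on the schema and the behavior described below:

stateDiagram-v2
    [*] --> Open: finding detected (first_seen_at set)
    Open --> Open: still detected on a later scan (last_seen_at updated)
    Open --> Resolved: no longer detected at the latest scan
    Resolved --> [*]: row deleted (current state only)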
Deleting resolved rows has a clear downside: you lose historical analytics unless you separately log events. You cannot answer questions like “how long did that exact secret live?” unless you have an external event log. We accepted that tradeoff because the system’s primary job is not analytics. It is fast remediation.
If we later want analytics, we can add append-only event logging without changing the core “current state” store.
Dashboard and Slack notifications
We send notifications to Slack and maintain a read-only internal dashboard. The dashboard is designed for triage: show what is currently open, allow filtering by repo and provider, and give owners a clear list of what to fix.
The Slack message includes enough context to act quickly, and nothing that leaks secrets. A sanitized example looks like this:
🚨 Secret detected
Repo: customer-project-api
Branch: feature/onboarding
Commit: 1a2b3c4 (pushed by j.doe)
File: config/dev.env
Provider: AWS (verified)
Next steps: remove from repo + rotate/revoke if needed
Dashboard: internal read-only link
The “verified” marker is meaningful because it reduces the “this is probably a false positive, I will ignore it” reflex. That reflex is how small leaks become incidents.
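Producing that message is a few lines of code. A sketch posting to a Slack incoming webhook in Node.js 18+ (the environment variable name and the finding fields mirror the schema above but are otherwise assumptions):

async function notifySlack(finding) {
  const lines = [
    "🚨 Secret detected",
    `Repo: ${finding.repo}`,
    `Branch: ${finding.branch}`,
    `File: ${finding.file_path}`,
    `Provider: ${finding.provider}${finding.verified ? " (verified)" : ""}`,
    "Next steps: remove from repo + rotate/revoke if needed",
  ];

  await fetch(process.env.SLACK_WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: lines.join("\n") }), // metadata only: the secret value never appears here
  });
}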
Routing and triage workflow
Routing is intentionally boring. The lead dev on the project is responsible, and the commit author is included for context. We do this via an internal directory mapping plus author attribution. The goal is not to shame. The goal is to put the alert in front of the person who can fix it fastest.
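The mapping can be as small as a checked-in lookup table. An illustrative sketch (repo names, handles, channels, and the fallback are all made up):

const routing = {
  "customer-project-api": { lead: "@lead.dev", channel: "#proj-customer-api" },
  "legacy-billing": { lead: "@billing.lead", channel: "#proj-billing" },
};

function routeAlert(repo, pusher) {
  // The lead dev owns the fix; the commit author is included for context, not blame.
  const route = routing[repo] ?? { lead: "@security", channel: "#security-alerts" };
  return { channel: route.channel, mentions: [route.lead, `@${pusher}`] };
}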
Severity is also intentionally boring: everything is treated as critical. We do not play the “maybe this secret is low value” game. Once a secret is in git history, the safe posture is to assume it is compromised.
The remediation workflow is what you would expect, and that is the point:
- Remove the secret from the repository.
- Rotate or revoke the credential depending on the provider and compromise likelihood.
- Confirm the scanner no longer detects it (finding disappears, row dropped).
- Prevent recurrence by moving secrets into a manager or environment injection.
We do not create automated PRs today. That is a deliberate choice. Automation can be helpful, but it can also add friction and complexity. We prioritized “detect and route” first, because that alone can eliminate the majority of risk quickly.
Observed outcomes after rollout
A baseline scan surfaced about 100 findings. Most were resolved quickly. The max time-to-fix observed was one business day, and the average time-to-fix was under three hours.
The most important outcome was cultural: the system taught developers what “good” looks like by giving immediate feedback. It is much easier to learn from a Slack alert ten minutes after a mistake than from a security incident report two months later.
It also reinforced a practical truth: the repos that leak secrets are often the oldest, the most copy-pasted, and the most “just get it working” in local dev. That is not a moral failure. It is a signal that those repos need updated patterns and templates.
Metrics and charts
- Findings per week
- Open findings at week end
- MTTR distribution
- Top leaked providers
- Repo hygiene snapshot
The previous version: a fast PoC with a slow-motion footgun
Before the current system, we built a PoC on a small EC2 instance (t4g.micro) with a stateful workspace:
- webhook receiver
- simple FIFO queue backed by a DB table
- worker process
- per-repo lock
- cached workspace updated via git pull
- SQLite state
- Slack delta notifications
The PoC was fast. Average scans were about 5.5 seconds, and p95 around 12.76 seconds. The persistent workspace helped us surface edge cases and contribute upstream improvements.
The downside was operational. The caching directory needed cleanup. Cleanup was missed. Storage grew beyond expectations. The system was still “working”, but the operational risk was unacceptable. This is the classic problem with stateful workers: the performance is great until the state starts having opinions about your disk.
So we moved to Lambda-based stateless scanning.
The current system is slower, roughly 40 seconds from push to detection, but it is operationally simpler and scales cleanly with load. In a security system, “predictable and boring” is often worth more than “fast and clever.”
Security posture of the scanner
This system is not a perimeter fortress. It is a targeted control with a specific threat model.
- Webhook authenticity is enforced by GitHub HMAC signature verification.
- Replay and duplicate deliveries are handled by idempotency (commit SHA dedupe).
- The GitHub token used for cloning is a fine-grained PAT with read-only permissions.
- There are no additional network perimeter controls like IP allowlisting or WAF in front of the receiver. HMAC is the primary control.
That last choice is a tradeoff. If you later want defense-in-depth, you can add perimeter controls. The system design does not prevent it. We just did not make it a prerequisite for shipping a useful scanner.
Cost model (order of magnitude)
With Lambda at 1 GB memory and roughly 40 seconds runtime per scan:
- compute: 1 GB × 40 seconds = 40 GB-seconds per push
- cost per push: 40 × (GB-seconds price) + request cost
In practice, the cost is close to zero for us because the workload fits comfortably inside the AWS free tier.
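A rough worked example, assuming around 300 pushes per day (an illustrative number) and the commonly published Lambda pricing of about $0.0000167 per GB-second plus $0.20 per million requests:

- 300 pushes/day ≈ 9,000 scans/month
- 9,000 scans × 40 GB-seconds ≈ 360,000 GB-seconds/month
- 360,000 GB-seconds × $0.0000167 ≈ $6/month of compute before free tier
- the monthly Lambda free tier (400,000 GB-seconds and 1M requests) absorbs all of it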
Where this goes next (optional future): automated remediation PRs
One obvious next step is to have the system open a PR that removes the secret and replaces it with an environment reference, potentially assisted by LLM agents. That can be valuable, especially in large orgs where the same mistake repeats across repos.
We keep it as a roadmap item for a reason. Auto-remediation can easily become a source of friction if it is noisy or intrusive. The current system already delivers value by detecting, routing, and enforcing a tight feedback loop. If we add automation, it should be opt-in, conservative, and human-approved.