Introduction
Public code repositories leak a lot.
Developers accidentally commit config files, API endpoints, test tokens and sometimes real secrets.
Finding those early is a huge win, but you must act carefully.
This Part teaches how to search GitHub and other public repos, how to verify findings safely, and how to report or handle leaks without crossing ethical or legal lines.
You will get tools, commands, regex patterns and a clear workflow you can use immediately.
Why repo scraping matters
- Repositories often contain hard-coded endpoints and keys that lead directly to interesting assets.
- A config file or an environment file can save hours of guessing.
- Public leaks are high-confidence leads because they often include real app context.
- Proper handling converts a risky finding into a responsible report and value.
Public repos tell a story. Read them carefully, and use the story to plan focused tests.
What to check – short checklist
- Environment and config files – .env, config.js, settings.py
- Secrets and tokens – API keys, OAuth tokens, private keys (PEM), session secrets
- Hard-coded URLs and endpoints – API hosts, staging domains, callbacks
- CI/CD files – .github/workflows, .gitlab-ci.yml, deploy scripts
- Dockerfiles and k8s manifests – image repositories, registry creds
- Logs and backups – old logs or DB dumps accidentally committed
- Submodule references and .gitmodules – hidden nested repos or private links
Always validate before claiming anything as sensitive.
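As a rough illustration of the checklist, the sketch below tags repository paths with a checklist category based on the file name. The category labels, the extra suffixes (.log, .sql, .bak) and the function name classify_path are assumptions made for this example, not part of any tool.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical lookup tables that mirror the checklist above.
EXACT_NAMES = {
    "config.js": "environment/config",
    "settings.py": "environment/config",
    ".gitlab-ci.yml": "ci/cd",
    "Dockerfile": "container/k8s",
    ".gitmodules": "submodules",
}
SUFFIXES = {
    ".pem": "secrets",
    ".log": "logs/backups",
    ".sql": "logs/backups",
    ".bak": "logs/backups",
}


def classify_path(path: str) -> Optional[str]:
    """Return a checklist category for a repo-relative path, or None."""
    p = PurePosixPath(path)
    if ".github/workflows" in path:
        return "ci/cd"
    # Matches .env as well as variants such as .env.production
    if p.name == ".env" or p.name.startswith(".env."):
        return "environment/config"
    if p.name in EXACT_NAMES:
        return EXACT_NAMES[p.name]
    return SUFFIXES.get(p.suffix.lower())
```

A match only says the file deserves a look. It says nothing about whether the content is sensitive.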
Tools you will use
- Web search & GitHub web UI – quick initial checks.
- GitHub CLI (gh) – API access for scripted searches.
- truffleHog / truffleHog3 – deep secret scanners for git history.
- gitleaks – flexible rules and fast scanning.
- gitrob – repo reconnaissance for orgs.
- ripgrep (rg) / grep – local regex search.
- GH Archive / BigQuery – for large-scale research (advanced).
- Custom regex and small Python scripts – for bespoke patterns.
Use minimal tools first, then scale up as needed.
Safe, step-by-step workflow (copy-paste ready)
1. Quick web searches
Start small. Use site searches to find obvious leaks.
site:github.com "example.com" "api_key"
site:github.com "example.com" "Authorization: Bearer"
site:github.com "example.com" ".env"
Explanation: find obvious references to your target domain and common secret markers.
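If you repeat these searches for several domains, a tiny helper can generate the query strings. This is a minimal sketch; build_dorks and DEFAULT_MARKERS are hypothetical names, and the markers are simply the ones used in the queries above.

```python
from typing import List, Sequence

# Secret markers taken from the example queries above.
DEFAULT_MARKERS = ("api_key", "Authorization: Bearer", ".env")


def build_dorks(domain: str, markers: Sequence[str] = DEFAULT_MARKERS) -> List[str]:
    """Build one site:github.com query per marker for the given domain."""
    return [f'site:github.com "{domain}" "{marker}"' for marker in markers]


for query in build_dorks("example.com"):
    print(query)
```

Paste the printed queries into a search engine by hand. Automating the search engine itself may break its terms of service.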
2. GitHub code search via web UI
Use GitHub advanced search with queries like:
org:target-org "example.com"
filename:.env
path:/ config api_key
GitHub search highlights matches and is fast for small checks.
3. Use GH CLI for scripted searches
If you have gh and a token, use the API for structured results.
gh auth login
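gh api -X GET -H "Accept: application/vnd.github.v3.text-match+json" \
/search/code -f q='example.com in:file filename:.env' -f per_page=100 | jq .
Explanation: this returns code matches with contextual snippets. The -X GET flag is needed because gh api switches to POST as soon as you pass -f parameters, and the search endpoint only accepts GET. The filename: qualifier matches the file name itself, while path: matches the directory. Keep requests reasonable to avoid rate limits.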
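For scripting without gh, the same search can be prepared with the Python standard library. The sketch below only builds the request and flattens a response that has already been fetched; it assumes the response layout of the REST code search (items, repository.full_name, path, text_matches[].fragment). The function names are hypothetical.

```python
import urllib.parse
import urllib.request
from typing import List, Tuple

API_URL = "https://api.github.com/search/code"


def build_search_request(query: str, token: str, per_page: int = 100) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for the code search endpoint."""
    params = urllib.parse.urlencode({"q": query, "per_page": per_page})
    return urllib.request.Request(
        f"{API_URL}?{params}",
        headers={
            # Same media type as the gh example: asks for text-match snippets.
            "Accept": "application/vnd.github.v3.text-match+json",
            "Authorization": f"Bearer {token}",
        },
        method="GET",
    )


def summarize_matches(response: dict) -> List[Tuple[str, str, str]]:
    """Flatten a decoded search response into (repo, path, fragment) rows."""
    rows = []
    for item in response.get("items", []):
        repo = item.get("repository", {}).get("full_name", "?")
        for match in item.get("text_matches", []):
            rows.append((repo, item.get("path", "?"), match.get("fragment", "")))
    return rows
```

To send the request, pass it to urllib.request.urlopen and decode the body with json.load. Use your own token, and pause between calls to stay under the rate limit.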
4. Run gitleaks against a repo or organisation
Install and run gitleaks locally for quick checks.
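git clone https://github.com/target-org/repo.git
gitleaks detect -s ./repo --report-path gitleaks-report.json
Explanation: gitleaks uses rules to surface common secret formats and keys. In gitleaks v8 the -s (--source) flag expects a local path, so clone the repository first; the report file is set with --report-path.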
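A JSON report is easier to triage once it is grouped by rule. The sketch below assumes the report is a JSON array whose entries carry RuleID, File and Commit keys, as in gitleaks v8 reports; check your own report if you run a different version. The function name is hypothetical, and the Secret field is deliberately never read so raw values stay out of your notes.

```python
import json
from collections import Counter
from typing import List, Tuple


def summarize_gitleaks(report_json: str) -> Tuple[Counter, List[Tuple[str, str]]]:
    """Count findings per rule and list (file, short commit) locations."""
    findings = json.loads(report_json)
    by_rule = Counter(f.get("RuleID", "unknown") for f in findings)
    # Keep only location metadata; the secret value itself is not copied.
    locations = [(f.get("File", "?"), f.get("Commit", "?")[:7]) for f in findings]
    return by_rule, locations
```

Typical use is to read the report file with open(...).read() and pass the text to summarize_gitleaks.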
5. Scan git history with truffleHog or truffleHog3
History reveals secrets removed in later commits.
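trufflehog git https://github.com/target-org/repo.git --json --no-verification > truffle_report.json
Explanation: this checks commit history for high-entropy strings and known patterns. Current truffleHog releases try to verify findings by calling the provider with the discovered credential, so pass --no-verification to stay within the rule of never using leaked secrets.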
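To see what "high-entropy" means in practice, here is a minimal Shannon entropy calculation. It illustrates the idea only. It is not truffleHog's actual detection logic, and any threshold you choose is a heuristic.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of a string in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A repeated character carries no information; random-looking strings score higher.
print(shannon_entropy("aaaaaaaa"))
print(shannon_entropy("abcd1234TEST"))
```

Hashes, UUIDs and minified code also score high, which is why entropy hits always need a manual look at the context.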
6. Use gitrob for organisation-level recon
gitrob finds repositories likely to be interesting in an org.
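gitrob -github-access-token YOUR_TOKEN target-org
Explanation: it clones the org's repositories and flags files that match signatures of potentially sensitive data. Flag names differ between gitrob releases, so confirm them with gitrob -h before running.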
7. Local regex hunting with ripgrep
For a cloned repo:
rg --hidden -S -n --glob '!node_modules' "api[_-]?key|access[_-]?token|secret" .
Explanation: fast keyword-based scans. Tune regex to reduce noise.
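When ripgrep is not installed, a short Python walk gives a comparable keyword scan. This is a sketch with the same keyword regex; scan_tree is a hypothetical name, and unlike the rg command it also skips the .git directory.

```python
import os
import re
from typing import List, Tuple

# Same keywords as the rg command, matched case-insensitively.
KEYWORDS = re.compile(r"api[_-]?key|access[_-]?token|secret", re.IGNORECASE)
SKIP_DIRS = {"node_modules", ".git"}


def scan_tree(root: str) -> List[Tuple[str, int, str]]:
    """Return (relative path, line number, line) for every keyword hit."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        if KEYWORDS.search(line):
                            rel = os.path.relpath(path, root)
                            hits.append((rel, lineno, line.strip()))
            except OSError:
                continue  # unreadable file, move on
    return hits
```

Call scan_tree(".") from the root of a cloned repo. Expect noise from documentation and tests until you tighten the regex.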
Common regex patterns and heuristics
Use these patterns as starting points. Tune for your context.
- API keys: (?i)api[_-]?key\s*[:=]\s*['"][0-9A-Za-z\-_]{16,}
- Bearer tokens: Authorization\s*:\s*Bearer\s+[A-Za-z0-9\-\._~\+\/]+=*
- AWS keys: AKIA[0-9A-Z]{16}
- PEM private key header: -----BEGIN PRIVATE KEY-----
- Passwords in env: (?i)password\s*[:=]\s*['"].+['"]
Always validate matches; entropy alone is not proof.
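The patterns above can be compiled once and applied line by line. In this sketch the rule names and find_candidates are hypothetical. The function returns only the rule name and line number, so the matched value never lands in your logs.

```python
import re
from typing import List, Tuple

# The five starting-point patterns from the list above, keyed by a label.
PATTERNS = {
    "api_key": re.compile(r"""(?i)api[_-]?key\s*[:=]\s*['"][0-9A-Za-z\-_]{16,}"""),
    "bearer_token": re.compile(r"Authorization\s*:\s*Bearer\s+[A-Za-z0-9\-._~+/]+=*"),
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "pem_private_key": re.compile(r"-----BEGIN PRIVATE KEY-----"),
    "env_password": re.compile(r"""(?i)password\s*[:=]\s*['"].+['"]"""),
}


def find_candidates(text: str) -> List[Tuple[str, int]]:
    """Return (rule name, line number) for each line that matches a pattern."""
    results = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                results.append((name, lineno))
    return results
```

Every hit is a candidate, not a confirmed secret. Open the file and check the surrounding context before you record it.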
Safe verification: how to confirm without causing harm
Never use leaked credentials. Do not log into accounts with found keys.
Safe verification steps:
- Check pattern context. Is the string in a test file, example folder, or a real config?
- Check format. Does the token format match a known provider (AWS, GCP, Stripe)?
- Use provider metadata endpoints (read-only) if available. For example, some APIs allow read-only token verification without state change. Use only public, documented endpoints.
- Check token headers or scope without using full privileges. For example, for JWTs you can base64-decode the header and payload to inspect the iss, sub and exp fields without validating the signature. This is non-destructive.
Example: Inspecting a JWT safely
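echo "eyJhbGciOi..." | awk -F. '{print $1}' | tr '_-' '/+' | base64 -d 2>/dev/null
echo "eyJhbGciOi..." | awk -F. '{print $2}' | tr '_-' '/+' | base64 -d 2>/dev/null
Explanation: decoding a JWT client-side reveals issuer and expiry. JWT segments are base64url-encoded without padding, so tr maps the URL-safe characters back to the standard alphabet and 2>/dev/null hides the padding warning that GNU base64 may print. Do not attempt to use the token.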
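The same inspection is more robust in Python, which can restore the missing padding before decoding. This is a minimal sketch with hypothetical function names. It reads claims only, never checks the signature, and never sends the token anywhere.

```python
import base64
import json


def decode_segment(segment: str) -> dict:
    """Decode one base64url JWT segment, re-adding the stripped padding."""
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def inspect_jwt(token: str) -> dict:
    """Return the algorithm and a few standard claims without verification."""
    header_b64, payload_b64, _signature = token.split(".")
    header = decode_segment(header_b64)
    payload = decode_segment(payload_b64)
    return {
        "alg": header.get("alg"),
        "iss": payload.get("iss"),
        "sub": payload.get("sub"),
        "exp": payload.get("exp"),
    }
```

An exp value in the past suggests the token has expired, but the issuer may still honour it, so treat the finding as live until the owner confirms otherwise.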
How to check if a token is valid without abusing it
Use provider-specific, read-only validation endpoints when permitted in program rules. If no such endpoint exists, assume the token is sensitive and treat it as active until proven otherwise by the owner.
General advice:
- Do not attempt authentication with discovered tokens.
- If you need validation for a report, ask the owner to confirm activity or provide logs.
- For bug bounty programs, include minimal proof (masked token and metadata) and ask the program owner to verify.
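For the masked token mentioned above, a small helper keeps redaction consistent across reports. This is a sketch with hypothetical names. The short SHA-256 fingerprint is an optional extra that lets the owner confirm which value you mean, and it is an assumption of this example rather than a program requirement.

```python
import hashlib


def mask_token(token: str, keep: int = 4) -> str:
    """Keep the first few characters and replace the rest with asterisks."""
    if len(token) <= keep:
        return "*" * len(token)
    return token[:keep] + "*" * (len(token) - keep)


def fingerprint(token: str) -> str:
    """Return a short SHA-256 prefix the owner can compare against their key."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:12]


print(mask_token("abcd1234TEST"))
```

Skip the fingerprint for short or guessable values, because a hash of a weak secret can be brute-forced.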
Responsible disclosure checklist for repo leaks
- Do not rotate or revoke keys yourself. Notify the owner.
- Provide minimal proof – a redacted snippet, filename, commit hash.
- Suggest exact remediation – rotate the key, add the file to .gitignore, remove it from history with git filter-repo or bfg, then force push.
- Offer to help verify after rotation – without using secrets.
- Follow program rules and timelines – some programs require private disclosure channels.
Example report snippet
Finding: AWS access key accidentally committed
Location: repo-name/path/.env (commit: 1a2b3c)
Evidence: AKIA************ (redacted)
Suggested fix:
1. Rotate the exposed keys immediately.
2. Remove key from repo history using `git filter-repo` or `bfg`.
3. Add the file to .gitignore and ensure CI secrets use vault or env vars.
Contact: your-email@example.com
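If you file reports often, you can render this layout from structured data so every report carries the same fields. The sketch below reproduces the snippet above. The Finding class and render_report are hypothetical names, and the evidence is expected to be redacted before it is passed in.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    location: str
    commit: str
    redacted_evidence: str  # must already be masked by the caller
    contact: str


def render_report(finding: Finding) -> str:
    """Render a finding using the report layout shown above."""
    lines = [
        f"Finding: {finding.title}",
        f"Location: {finding.location} (commit: {finding.commit})",
        f"Evidence: {finding.redacted_evidence} (redacted)",
        "Suggested fix:",
        "1. Rotate the exposed keys immediately.",
        "2. Remove key from repo history using `git filter-repo` or `bfg`.",
        "3. Add the file to .gitignore and ensure CI secrets use vault or env vars.",
        f"Contact: {finding.contact}",
    ]
    return "\n".join(lines)
```

Adjust the suggested fixes per finding. The three lines here only fit a committed key.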
What to do after you find a secret
- Tag the finding in your tracker with high priority.
- If it is your own asset, rotate the secret immediately.
- If it belongs to another org, follow their disclosure policy and provide redacted evidence.
- Do not test the secret by using it. Do not exfiltrate data.
- If the leak exposes PII, warn the owner and follow safe-handling guidance.
Practical use-cases
- Finding staging API endpoints that bypass production auth.
- Locating CI tokens that can reveal deployment targets.
- Discovering hard-coded webhook URLs used by internal tools.
- Spotting forgotten backup files or DB dumps in repos.
Each of these leads to high-value tests in later Parts.
Mini lab exercise – 30-40 minutes
- Create a small test repo you control. Commit a .env file with dummy tokens such as TEST_API_KEY=abcd1234TEST. Push to GitHub as private, then make it public for the test. This is your lab data.
- Run basic searches:
gh auth login
gh api -X GET -H "Accept: application/vnd.github.v3.text-match+json" /search/code -f q='TEST_API_KEY in:file repo:your-username/your-repo' -f per_page=100 | jq .
- Run gitleaks locally against a clone of the repo:
git clone https://github.com/your-username/your-repo.git
gitleaks detect -s ./your-repo --report-path gitleaks-test.json
- Run truffleHog to check commit history:
trufflehog git https://github.com/your-username/your-repo.git --json --no-verification > truffle_test.json
- Practice writing a responsible disclosure note for your own test repo, then remove the secret and clean history using git filter-repo or bfg.
This exercise helps you understand how tools detect secrets and how to remediate them properly.
Common mistakes and how to avoid them
Mistake: Claiming a false positive as a secret.
Fix: Verify context before reporting. Check if token is example data or belongs to a test service.
Mistake: Using leaked credentials to confirm access.
Fix: Never use found credentials. Always request owner verification.
Mistake: Not checking commit history.
Fix: Use truffleHog to inspect history for secrets, and git filter-repo or bfg to clean it.
Mistake: Poor report evidence (too much or too little).
Fix: Provide filename, commit hash, and minimal redacted evidence.
Commands summary – copy-paste
Quick web search
site:github.com "example.com" ".env"
site:github.com "example.com" "api_key"
GH CLI search
gh api -X GET -H "Accept: application/vnd.github.v3.text-match+json" \
/search/code -f q='example.com in:file filename:.env' -f per_page=100 | jq .
gitleaks scan
git clone https://github.com/target-org/repo.git
gitleaks detect -s ./repo --report-path gitleaks-report.json
truffleHog scan
trufflehog git https://github.com/target-org/repo.git --json --no-verification > truffle_report.json
ripgrep local scan
rg --hidden -S -n --glob '!node_modules' "api[_-]?key|access[_-]?token|secret" .
JWT inspection (non-destructive)
echo "JWT_TOKEN" | awk -F. '{print $1}' | tr '_-' '/+' | base64 -d 2>/dev/null
echo "JWT_TOKEN" | awk -F. '{print $2}' | tr '_-' '/+' | base64 -d 2>/dev/null
Short checklist – copy into your notes
- Run quick GitHub site searches.
- Use gh for structured queries.
- Scan repo history with truffleHog.
- Run gitleaks for quick detection.
- Validate context and do safe verification only.
- Prepare a clear, redacted report and follow disclosure rules.
Next steps and where this feeds
- Tokens and endpoints discovered here feed Part 11 for URL collection and Part 13 for JS analysis.
- CI and deploy tokens connect to Part 8 for cloud and IP mapping.
- Any leaked secrets are high-priority items for reporting and remediation.
Closing notes
Repo scraping gives high-confidence leads if you handle the findings responsibly.
Do not use leaked credentials. Verify context. Report clearly and politely. Rotate keys you own, and help other owners rotate theirs and clean their history.
Next post preview
Part 5 – Certificate Transparency, CT logs and cert-based discovery.
We will show parsing techniques, watchers, and how to use cert data to expand your surface.
Disclaimer
This content is for educational purposes only. Use it ethically and only against targets you own or have explicit permission to test. Do not use any techniques described here in ways that break laws, platform rules or third-party rights. If in doubt, stop and get permission.

