Part 4 – GitHub and Repo Scraping for Endpoints and Secrets

Introduction

Public code repositories leak a lot.
Developers accidentally commit config files, API endpoints, test tokens and sometimes real secrets.
Finding those early is a huge win, but you must act carefully.

This Part teaches how to search GitHub and other public repos, how to verify findings safely, and how to report or handle leaks without crossing ethical or legal lines.
You will get tools, commands, regex patterns and a clear workflow you can use immediately.

Why repo scraping matters

Repositories often contain hard-coded endpoints and keys that lead directly to interesting assets.
A config file or an environment file can save hours of guessing.
Public leaks are high-confidence leads because they often include real app context.
Handled properly, a risky finding becomes a responsible report that has real value for the owner.

Public repos tell a story. Read them carefully, and use the story to plan focused tests.

What to check – short checklist

Environment and config files – .env, config.js, settings.py
Secrets and tokens – API keys, OAuth tokens, private keys (PEM), session secrets
Hard-coded URLs and endpoints – API hosts, staging domains, callbacks
CI/CD files – .github/workflows, .gitlab-ci.yml, deploy scripts
Dockerfiles and k8s manifests – image repositories, registry creds
Logs and backups – old logs or DB dumps accidentally committed
Submodule references and .gitmodules – hidden nested repos or private links

Always validate before claiming anything as sensitive.
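
As a rough illustration of the checklist above, the sketch below tags file paths (for example, the output of git ls-files) with checklist categories. The category names, the glob patterns, and the classify function are assumptions chosen for this example; extend them for your target.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical file-name globs derived from the checklist; not exhaustive.
NAME_PATTERNS = {
    "env/config": ["*.env", ".env.*", "config.js", "settings.py"],
    "ci/cd": [".gitlab-ci.yml"],
    "container/k8s": ["Dockerfile", "Dockerfile.*"],
    "logs/backups": ["*.log", "*.sql", "*.dump", "*.bak"],
    "private keys": ["*.pem"],
    "submodules": [".gitmodules"],
}

def classify(path):
    """Return the checklist categories whose patterns match this path."""
    name = PurePosixPath(path).name
    hits = [
        category
        for category, patterns in NAME_PATTERNS.items()
        if any(fnmatchcase(name, p) for p in patterns)
    ]
    # Workflow files are identified by directory rather than by file name.
    if path.startswith(".github/workflows/") and "ci/cd" not in hits:
        hits.append("ci/cd")
    return hits

for p in ["app/.env", ".github/workflows/deploy.yml", "src/main.py"]:
    print(p, classify(p))
```

A match only means the file deserves a look; it says nothing about whether the contents are sensitive.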

Tools you will use

Web search & GitHub web UI – quick initial checks.
GitHub CLI (gh) – API access for scripted searches.
truffleHog / truffleHog3 – deep secret scanners for git history.
gitleaks – flexible rules and fast scanning.
gitrob – repo reconnaissance for orgs.
ripgrep / rg / grep – local regex search.
GH Archive / BigQuery – for large-scale research (advanced).
Custom regex and small Python scripts – for bespoke patterns.

Use minimal tools first, then scale up as needed.

Safe, step-by-step workflow (copy-paste ready)

1. Quick web searches

Start small. Use site searches to find obvious leaks.

site:github.com "example.com" "api_key"
site:github.com "example.com" "Authorization: Bearer"
site:github.com "example.com" ".env"

Explanation: find obvious references to your target domain and common secret markers.

2. GitHub code search via web UI

Use GitHub advanced search with queries like:

org:target-org "example.com"
filename:.env
path:config api_key

GitHub search highlights matches and is fast for small checks.

3. Use GH CLI for scripted searches

If you have gh and a token, use the API for structured results.

gh auth login
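gh api -X GET -H "Accept: application/vnd.github.v3.text-match+json" \
  /search/code -f q='example.com in:file path:.env' -f per_page=100 | jq .

Explanation: this returns code matches with contextual snippets. The -X GET is needed because gh api switches to POST as soon as -f fields are present, and the search endpoint only accepts GET. Keep requests reasonable to avoid rate limits.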
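
For scripts that do not go through gh, the same search can be expressed as a plain HTTPS request. The sketch below only builds the request so that the method, query string, and headers are visible; nothing is sent. The function name build_code_search_request and the YOUR_TOKEN placeholder are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = "https://api.github.com/search/code"

def build_code_search_request(query, token, per_page=100):
    """Build (but do not send) a GET request for GitHub's REST code search."""
    url = f"{SEARCH_URL}?{urlencode({'q': query, 'per_page': per_page})}"
    headers = {
        # Ask for text-match metadata so results include contextual snippets.
        "Accept": "application/vnd.github.text-match+json",
        "Authorization": f"Bearer {token}",
    }
    return Request(url, method="GET", headers=headers)

req = build_code_search_request("example.com in:file path:.env", token="YOUR_TOKEN")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it; add a delay between calls if you loop over many queries.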

4. Run gitleaks against a repo or organisation

Install and run gitleaks locally for quick checks.

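git clone https://github.com/target-org/repo.git
gitleaks detect -s repo --report-path gitleaks-report.json

Explanation: gitleaks uses rules to surface common secret formats and keys. In gitleaks v8 the -s (source) flag expects a local path and the report flag is --report-path, so clone the repo first.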
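
Once a report exists, a short script can condense it for your tracker. This sketch assumes the JSON report is a list of findings with RuleID, File, and Commit fields, as gitleaks v8 writes them; the sample data is made up, and the secret values are deliberately never printed.

```python
import json
from collections import defaultdict

def summarize(report_json):
    """Group findings by (rule, file) and collect short commit hashes."""
    groups = defaultdict(set)
    for finding in json.loads(report_json):
        key = (finding["RuleID"], finding["File"])
        groups[key].add(finding["Commit"][:7])
    return {key: sorted(commits) for key, commits in groups.items()}

# Made-up sample in the assumed report shape.
sample = json.dumps([
    {"RuleID": "aws-access-token", "File": ".env", "Commit": "1a2b3c4d5e", "Secret": "x"},
    {"RuleID": "aws-access-token", "File": ".env", "Commit": "9f8e7d6c5b", "Secret": "y"},
    {"RuleID": "generic-api-key", "File": "config.js", "Commit": "1a2b3c4d5e", "Secret": "z"},
])
summary = summarize(sample)
for (rule, path), commits in summary.items():
    print(rule, path, commits)
```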

5. Scan git history with truffleHog or truffleHog3

History reveals secrets removed in later commits.

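trufflehog git https://github.com/target-org/repo.git --no-verification --json > truffle_report.json

Explanation: this checks commit history for known credential patterns (older truffleHog releases and truffleHog3 also flag high-entropy strings). TruffleHog v3 tries by default to verify each detected credential against the provider's API; --no-verification turns that off so you never exercise someone else's secret.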
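
To make "high-entropy string" concrete, here is a minimal sketch of a Shannon entropy check of the kind entropy-based scanners rely on. The min_len and threshold values are arbitrary assumptions, not the settings of any particular tool.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Bits of entropy per character, from the string's own character frequencies."""
    if not s:
        return 0.0
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

def looks_random(s, min_len=20, threshold=4.0):
    """Flag long strings whose characters are spread out enough to look generated."""
    return len(s) >= min_len and shannon_entropy(s) >= threshold

print(shannon_entropy("aaaa"), shannon_entropy("abcd"))
print(looks_random("abcdefghijklmnopqrstuvwx"))
```

The last call flags a plain alphabet run, which shows why an entropy hit needs context before it counts as a finding.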

6. Use gitrob for organisation-level recon

gitrob finds repositories likely to be interesting in an org.

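gitrob -github-access-token YOUR_TOKEN target-org

Explanation: gitrob takes an organisation or user name (not a URL), clones its repositories, and flags files whose names or paths match signatures of potentially sensitive content. The project is no longer actively maintained, so treat it as an optional extra.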

7. Local regex hunting with ripgrep

For a cloned repo:

rg --hidden -S -n --glob '!node_modules' "api[_-]?key|access[_-]?token|secret" .

Explanation: fast keyword-based scans. Tune regex to reduce noise.

Common regex patterns and heuristics

Use these patterns as starting points. Tune for your context. The (?i) prefix works in ripgrep, Python and PCRE, but not in POSIX grep -E; use grep -i there.

API keys: (?i)api[_-]?key\s*[:=]\s*['"][0-9A-Za-z\-_]{16,}
Bearer tokens: Authorization\s*:\s*Bearer\s+[A-Za-z0-9\-\._~\+\/]+=*
AWS keys: AKIA[0-9A-Z]{16}
PEM private key header: -----BEGIN ([A-Z]+ )?PRIVATE KEY-----
Passwords in env: (?i)password\s*[:=]\s*['"][^'"]+['"]

Always validate matches; a pattern hit or high entropy alone is not proof.
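
A minimal sketch of how these patterns might be wired into a small Python scanner. The PEM pattern is loosened to accept RSA, EC and OPENSSH headers and the password pattern stops at the closing quote; both tweaks, and the scan function itself, are assumptions for illustration. Only the line number and pattern name are returned, so the matched secret never lands in your output.

```python
import re

PATTERNS = {
    "api_key": re.compile(r"""(?i)api[_-]?key\s*[:=]\s*['"][0-9A-Za-z\-_]{16,}"""),
    "bearer": re.compile(r"Authorization\s*:\s*Bearer\s+[A-Za-z0-9\-._~+/]+=*"),
    "aws_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "pem_header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "password": re.compile(r"""(?i)password\s*[:=]\s*['"][^'"]+['"]"""),
}

def scan(text):
    """Return (line_number, pattern_name) for every pattern hit in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, rx in PATTERNS.items():
            if rx.search(line):
                hits.append((lineno, name))
    return hits

sample = "\n".join([
    'API_KEY = "abcd1234abcd1234abcd"',
    "AWS=AKIAIOSFODNN7EXAMPLE",
    'password: "hunter2"',
    'name = "demo"',
])
print(scan(sample))
```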

Safe verification: how to confirm without causing harm

Never use leaked credentials. Do not log into accounts with found keys.

Safe verification steps:

Check pattern context. Is the string in a test file, example folder, or a real config?
Check format. Does the token format match a known provider (AWS, GCP, Stripe)?
Use provider metadata endpoints (read-only) only if the program rules explicitly allow it, because calling them still exercises the credential. For example, some APIs allow read-only token verification without state change. Use only public, documented endpoints.
Check token headers or scope without using full privileges. For example, for JWTs you can base64-decode the header and payload to inspect iss, sub, exp fields without validating signature. This is non-destructive.
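
The first two checks above can be partly automated without touching any network. The sketch below guesses the provider from well-known token prefixes and the context from the file path; the prefix table is illustrative rather than exhaustive, and the path hints and the triage function are hypothetical.

```python
import re

# Widely documented token prefixes; illustrative, not exhaustive.
PROVIDER_HINTS = [
    (re.compile(r"^AKIA[0-9A-Z]{16}$"), "AWS access key ID"),
    (re.compile(r"^sk_(live|test)_"), "Stripe secret key"),
    (re.compile(r"^ghp_"), "GitHub personal access token"),
    (re.compile(r"^AIza"), "Google API key"),
]

# Hypothetical path hints that suggest sample or test data.
LOW_SIGNAL_PATH = re.compile(
    r"(^|/)(tests?|examples?|samples?|fixtures|docs)(/|$)|\.example$",
    re.IGNORECASE,
)

def triage(path, token):
    """Return (provider_guess, context_guess) using offline heuristics only."""
    provider = next(
        (name for rx, name in PROVIDER_HINTS if rx.search(token)), "unknown"
    )
    context = (
        "likely sample/test" if LOW_SIGNAL_PATH.search(path) else "possible real config"
    )
    return provider, context

print(triage("config/.env", "AKIAIOSFODNN7EXAMPLE"))
print(triage("tests/fixtures/keys.py", "sk_test_abc"))
```

Both guesses are hints for prioritising, not verdicts: a key in a tests folder can still be live.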

Example: Inspecting a JWT safely

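echo "eyJhbGciOi..." | awk -F. '{print $1}' | tr '_-' '/+' | base64 -d
echo "eyJhbGciOi..." | awk -F. '{print $2}' | tr '_-' '/+' | base64 -d

Explanation: JWT segments are base64url-encoded without padding, so tr maps them back to the standard alphabet first. If base64 reports invalid input, append = characters until the segment length is a multiple of four. Decoding a JWT client-side reveals issuer and expiry. Do not attempt to use the token.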
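
The same inspection in Python handles the base64url alphabet and the missing padding explicitly. This is a minimal sketch that assumes the common three-segment signed JWT; the function names are made up for the example, and the demo token is built locally rather than taken from anywhere.

```python
import base64
import json

def decode_segment(segment):
    """Decode one base64url JWT segment, restoring the padding JWTs omit."""
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

def inspect_jwt(token):
    """Return (header, payload) as dicts. The signature is NOT verified."""
    header_b64, payload_b64, _signature = token.split(".")
    return decode_segment(header_b64), decode_segment(payload_b64)

def _encode(obj):
    # Helper used only to build a harmless demo token.
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

demo = ".".join([_encode({"alg": "HS256"}), _encode({"iss": "demo", "exp": 0}), "sig"])
header, payload = inspect_jwt(demo)
print(header, payload)
```

Nothing here contacts the issuer, so the inspection stays passive.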

How to check if a token is valid without abusing it

Use provider-specific, read-only validation endpoints when permitted in program rules. If no such endpoint exists, assume the token is sensitive and treat it as active until proven otherwise by the owner.

General advice:

Do not attempt authentication with discovered tokens.
If you need validation for a report, ask the owner to confirm activity or provide logs.
For bug bounty programs, include minimal proof (masked token and metadata) and ask the program owner to verify.

Responsible disclosure checklist for repo leaks

Do not rotate or revoke keys yourself. Notify the owner.
Provide minimal proof – a redacted snippet, filename, commit hash.
Suggest exact remediation – rotate key, add .gitignore, remove from history with git filter-repo or bfg, then force push.
Offer to help verify after rotation – without using secrets.
Follow program rules and timelines – some programs require private disclosure channels.

Example report snippet

Finding: AWS access key accidentally committed
Location: repo-name/path/.env (commit: 1a2b3c)
Evidence: AKIA************ (redacted)
Suggested fix:
1. Rotate the exposed keys immediately.
2. Remove key from repo history using `git filter-repo` or `bfg`.
3. Add the file to .gitignore and ensure CI secrets use vault or env vars.
Contact: your-email@example.com
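
If you file reports often, a small helper keeps the redaction consistent. The sketch below is one possible shape, not a required format; the mask and build_report names and the choice to show at most four leading characters are assumptions.

```python
def mask(secret, keep=4):
    """Show a few leading characters and star out the rest."""
    keep = min(keep, len(secret) // 2)  # never reveal more than half of a short secret
    return secret[:keep] + "*" * (len(secret) - keep)

def build_report(finding, location, commit, secret, contact):
    """Assemble a minimal, redacted report body."""
    return "\n".join([
        f"Finding: {finding}",
        f"Location: {location} (commit: {commit[:7]})",
        f"Evidence: {mask(secret)} (redacted)",
        f"Contact: {contact}",
    ])

report = build_report(
    finding="AWS access key accidentally committed",
    location="repo-name/path/.env",
    commit="1a2b3c4d5e6f",
    secret="AKIAIOSFODNN7EXAMPLE",  # AWS's documented example key ID, not a real one
    contact="your-email@example.com",
)
print(report)
```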

What to do after you find a secret

Tag the finding in your tracker with high priority.
If it is your own asset, rotate the secret immediately.
If it belongs to another org, follow their disclosure policy and provide redacted evidence.
Do not test the secret by using it. Do not exfiltrate data.
If the leak exposes PII, warn the owner and follow safe-handling guidance.

Practical use-cases

Finding staging API endpoints that bypass production auth.
Locating CI tokens that can reveal deployment targets.
Discovering hard-coded webhook URLs used by internal tools.
Spotting forgotten backup files or DB dumps in repos.

Each of these leads to high-value tests in later Parts.

Mini lab exercise – 30-40 minutes

Create a small test repo you control. Commit a .env file with dummy tokens such as TEST_API_KEY=abcd1234TEST. Push to GitHub as private, then make it public for the test; code search may take a while to index a newly public repo. This is your lab data.
Run basic searches:

gh auth login
gh api -X GET -H "Accept: application/vnd.github.v3.text-match+json" /search/code -f q='TEST_API_KEY in:file repo:your-username/your-repo' -f per_page=100 | jq .

Run gitleaks locally against the repo:

git clone https://github.com/your-username/your-repo.git
gitleaks detect -s your-repo --report-path gitleaks-test.json

Run truffleHog to check commit history:

trufflehog git https://github.com/your-username/your-repo.git --json > truffle_test.json

Practice writing a responsible disclosure note for your own test repo, then remove the secret and clean history using git filter-repo or bfg.

This exercise helps you understand how tools detect secrets and how to remediate them properly.

Common mistakes and how to avoid them

Mistake: Reporting a false positive as a real secret.
Fix: Verify context before reporting. Check if token is example data or belongs to a test service.

Mistake: Using leaked credentials to confirm access.
Fix: Never use found credentials. Always request owner verification.

Mistake: Not checking commit history.
Fix: Use truffleHog or gitleaks to inspect history, and git filter-repo or bfg to clean it.

Mistake: Poor report evidence (too much or too little).
Fix: Provide filename, commit hash, and minimal redacted evidence.

Commands summary – copy-paste

Quick web search

site:github.com "example.com" ".env"
site:github.com "example.com" "api_key"

GH CLI search

gh api -X GET -H "Accept: application/vnd.github.v3.text-match+json" \
  /search/code -f q='example.com in:file path:.env' -f per_page=100 | jq .

gitleaks scan

git clone https://github.com/target-org/repo.git
gitleaks detect -s repo --report-path gitleaks-report.json

truffleHog scan

trufflehog git https://github.com/target-org/repo.git --no-verification --json > truffle_report.json

ripgrep local scan

rg --hidden -S -n --glob '!node_modules' "api[_-]?key|access[_-]?token|secret" .

JWT inspection (non-destructive)

echo "JWT_TOKEN" | awk -F. '{print $1}' | tr '_-' '/+' | base64 -d
echo "JWT_TOKEN" | awk -F. '{print $2}' | tr '_-' '/+' | base64 -d

Short checklist – copy into your notes

Run quick GitHub site searches.
Use gh for structured queries.
Scan repo history with truffleHog.
Run gitleaks for quick detection.
Validate context and do safe verification only.
Prepare a clear, redacted report and follow disclosure rules.

Next steps and where this feeds

Tokens and endpoints discovered here feed Part 11 for URL collection and Part 13 for JS analysis.
CI and deploy tokens connect to Part 8 for cloud and IP mapping.
Any leaked secrets are high-priority items for reporting and remediation.

Closing notes

Repo scraping gives high-confidence leads if you handle the findings responsibly.
Do not use leaked credentials. Verify context. Report clearly and politely. Rotate keys you own, and help other owners clean their history.

Next post preview

Part 5 – Certificate Transparency, CT logs and cert-based discovery.
We will show parsing techniques, watchers, and how to use cert data to expand your surface.

Disclaimer

This content is for educational purposes only. Use it ethically and only against targets you own or have explicit permission to test. Do not use any techniques described here in ways that break laws, platform rules or third-party rights. If in doubt, stop and get permission.

CyberXsociety