Part 11 – URL Collection and Analysis (Active + Passive Combined)

Introduction

Now things start getting real.

So far, you have found:

Subdomains
IPs
Live hosts
Running services

But vulnerabilities do not live in domains.
They live in URLs and endpoints.

This part is where recon becomes practical.
You are no longer mapping.
You are preparing to test.

Why URL collection matters

URLs show actual functionality
Hidden endpoints often lead to bugs
Old URLs reveal forgotten features
Parameters become entry points for testing

A single good endpoint is worth more than 100 subdomains.

What you are trying to collect

All reachable URLs
Hidden endpoints
API paths
Parameters
Old and archived URLs

This becomes your testing dataset.

Two approaches you will combine

Passive collection

No interaction with target.

Sources:

Wayback Machine
Common Crawl
Public datasets

Active collection

Direct interaction with target.

Methods:

Crawling
Spidering
Endpoint discovery

Best results come from combining both.

Tools you will use

gau – gather URLs from multiple sources
waybackurls – archive URLs
katana – modern crawler
hakrawler – lightweight crawler
httpx – validation
uro – URL deduplication
grep / gf – filtering
jq – parsing

Step-by-step URL collection workflow

1. Passive URL collection (start here)

Use gau:

gau example.com > gau_urls.txt

Use waybackurls:

waybackurls example.com >> gau_urls.txt

Merge and clean:

sort -u gau_urls.txt > passive_urls.txt

What you get

Historical endpoints
Old APIs
Deprecated paths

These are high-value: old endpoints are often still deployed but no longer linked or maintained.

2. Active crawling with katana

katana -u https://example.com -d 3 -o katana_urls.txt

Explanation

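-d 3 limits the crawl to three levels of links from the start URL
-o writes the discovered URLs to katana_urls.txt
Add -jc if you also want katana to parse JavaScript files for endpoints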

This finds:

Live pages
JS-linked endpoints
Hidden paths

3. Combine passive and active results

cat passive_urls.txt katana_urls.txt | sort -u > all_urls.txt

Now you have a unified dataset.
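
A quick sanity check on the merged list is to count URLs per host, so you can see where the data is concentrated. The sketch below uses only standard tools; the sample URLs and the file names hosts_demo_urls.txt and hosts_demo_counts.txt are made up for illustration, and in practice you would point the awk command at all_urls.txt.

```shell
# Hypothetical sample standing in for all_urls.txt
cat > hosts_demo_urls.txt <<'EOF'
https://example.com/login
https://example.com/api/v1/user?id=1
https://api.example.com/v2/orders
http://shop.example.com/cart?item=5
https://api.example.com/v2/users
EOF

# Splitting scheme://host/path on "/" leaves the host in field 3.
# Count URLs per host and list the busiest hosts first.
awk -F/ '{print $3}' hosts_demo_urls.txt | sort | uniq -c | sort -rn > hosts_demo_counts.txt
cat hosts_demo_counts.txt
```

Hosts with many URLs usually deserve a closer look first, but a host with only a handful of URLs can still hold the one endpoint that matters.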

4. Remove noise (very important)

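Many URLs are useless for testing:

Images
CSS
Fonts
JavaScript files (not test targets themselves; all_urls.txt still has them)

Filter them:

grep -Eiv "\.(jpg|jpeg|png|gif|css|js|woff2?|svg|ico)(\?|$)" all_urls.txt > clean_urls.txt

The (\?|$) part also catches static files that carry a cache-busting query string, such as app.css?v=2.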

Now your list is usable.
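
Before trusting a noise filter on thousands of lines, it helps to run it on a small hand-made sample. The sketch below uses a pattern that matches a static extension either at the end of the URL or right before a query string. The sample URLs and the noise_demo_* file names are invented for the example.

```shell
# Hypothetical sample standing in for all_urls.txt
cat > noise_demo_all.txt <<'EOF'
https://example.com/logo.png
https://example.com/static/app.css?v=42
https://example.com/fonts/inter.woff2
https://example.com/search?q=test
https://example.com/api/v1/user?id=1
https://example.com/reports/export.json
EOF

# -i ignores case, -v inverts the match (keep lines that do NOT look static).
# (\?|$) requires the extension to end the URL or to be followed by "?",
# so "export.json" is not mistaken for a ".js" file.
grep -Eiv '\.(jpg|jpeg|png|gif|css|js|woff2?|svg|ico)(\?|$)' noise_demo_all.txt > noise_demo_clean.txt
cat noise_demo_clean.txt
```

On this sample the three static files are dropped and the search, user, and export.json URLs are kept.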

Parameter extraction (critical step)

You want URLs like:

https://example.com/page?id=123

Extract parameterised URLs:

grep -E "\?[^=]+=" clean_urls.txt > params.txt

These are your attack points.
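
Once you have parameterised URLs, a list of the parameter names themselves is often more useful than the raw URLs. The following is a minimal sketch with standard tools; the sample URLs and the file names params_demo.txt and params_demo_counts.txt are placeholders, and you would normally feed it params.txt.

```shell
# Hypothetical sample standing in for params.txt
cat > params_demo.txt <<'EOF'
https://example.com/page?id=123
https://example.com/search?q=test&page=2
https://example.com/api/v1/user?id=7&redirect=/home
https://example.com/item?id=9#reviews
EOF

# grep -o prints every "?name=" or "&name=" match on its own line,
# tr strips the delimiters, and uniq -c counts how often each name appears.
grep -oE '[?&][^=&#]+=' params_demo.txt | tr -d '?&=' | sort | uniq -c | sort -rn > params_demo_counts.txt
cat params_demo_counts.txt
```

Names such as id or redirect that show up across many endpoints are natural first candidates when you start testing.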

Normalising URLs

Avoid duplicates:

uro < clean_urls.txt > final_urls.txt

This collapses:

URLs that differ only in parameter values
Repeated endpoints

Clean data = faster testing.
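
To make the idea of normalising concrete, here is a rough awk sketch that keeps one URL per combination of path and parameter names. It is not how uro works internally, only an illustration of the concept, and the sample URLs and dedup_demo_* file names are made up.

```shell
# Hypothetical sample standing in for clean_urls.txt
cat > dedup_demo_in.txt <<'EOF'
https://example.com/page?id=1
https://example.com/page?id=2
https://example.com/page?id=3&lang=en
https://example.com/about
https://example.com/about
EOF

# Build a key from the URL with every parameter value blanked out,
# then print only the first URL seen for each key.
awk '{
  key = $0
  gsub(/=[^&#]*/, "=", key)
  if (!(key in seen)) { seen[key] = 1; print }
}' dedup_demo_in.txt > dedup_demo_out.txt
cat dedup_demo_out.txt
```

On this sample, page?id=1 and page?id=2 collapse into one entry, page?id=3&lang=en survives because it has a different parameter set, and the duplicate /about line is dropped.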

Finding interesting endpoints

Filter for keywords:

grep -Ei "api|admin|login|auth|internal|debug" final_urls.txt > interesting.txt

Focus on:

/api
/admin
/login
/auth
/internal

These are high-value.

Extracting endpoints from JavaScript

JS files often contain hidden URLs.

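First collect JS files. The noise filter above removed them from clean_urls.txt, so take them from all_urls.txt:

grep -E "\.js(\?|$)" all_urls.txt > js_files.txt

Then extract endpoints:

while read -r url; do curl -s "$url"; done < js_files.txt | grep -oE "https?://[^\"' ]+" | sort -u > js_endpoints.txt

This pattern only catches absolute URLs. Relative paths such as /api/v1/user need a separate pattern.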

This reveals:

APIs
Hidden routes
Internal services

Very powerful step.
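
JavaScript often references endpoints as relative paths inside quoted strings rather than as full URLs. The sketch below pulls those out of a local file with grep; the JS snippet, the paths in it, and the jspaths_demo_* file names are invented for illustration, and the character class is a deliberately simple assumption that will miss unusual paths.

```shell
# Hypothetical JavaScript content standing in for a downloaded .js file
cat > jspaths_demo_app.js <<'EOF'
const base = "https://api.example.com/v2";
fetch("/api/v1/user?id=" + userId);
axios.post('/internal/report', payload);
const logo = "/static/img/logo.png";
EOF

# Match a quote, a leading "/", a run of typical path characters, and a closing quote.
# tr then removes the quotes so only the path remains.
grep -oE "[\"']/[A-Za-z0-9_./?=&-]+[\"']" jspaths_demo_app.js | tr -d "\"'" | sort -u > jspaths_demo_out.txt
cat jspaths_demo_out.txt
```

The absolute URL on the first line is skipped on purpose, since the quote there is not followed by a slash. The extracted paths still need to be joined with the right host before you can request them.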

Validating collected URLs

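Not every collected URL still responds.

Check:

httpx -l final_urls.txt -status-code -silent -o live_urls.txt

live_urls.txt now lists every URL that answered, with its status code in brackets. A 404 still counts as a response, so filter on the code (for example with -mc 200,301,302,401,403) before treating the list as working endpoints.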
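
The status-code output can also be split with standard tools after the run. This sketch assumes each line of the output file looks like "URL [code]" with the code as the last field; check your own file first, since the exact format depends on the flags used. The sample lines and the status_demo_* file names are made up.

```shell
# Hypothetical sample standing in for live_urls.txt (assumed "URL [code]" format)
cat > status_demo_live.txt <<'EOF'
https://example.com/login [200]
https://example.com/admin [403]
https://example.com/old-api [404]
https://example.com/api/v1/user?id=1 [200]
https://example.com/go [302]
EOF

# Drop 404 responses and keep only the URL column.
awk '$NF != "[404]" {print $1}' status_demo_live.txt > status_demo_responding.txt

# Keep 401/403 responses in their own file for a later look.
awk '$NF == "[401]" || $NF == "[403]" {print $1}' status_demo_live.txt > status_demo_protected.txt

cat status_demo_responding.txt
```

Keeping the 401 and 403 responses separate is a judgment call: they are not usable as-is, but they usually point at functionality that exists and is access-controlled.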

Prioritisation strategy

Focus on:

High priority:

URLs with parameters
API endpoints
Auth-related paths
Admin panels

Medium priority:

Static pages
Informational endpoints

Low priority:

Repeated or duplicate URLs
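
The tiers above can be roughed out automatically before you review anything by hand. This is only a sketch of one possible rule set, not a fixed method: it tags a URL as high if it has a query parameter or an api, admin, login, or auth path segment, and as medium otherwise. Duplicates are assumed to be gone already after normalisation. The sample URLs and the prio_demo_* file names are invented.

```shell
# Hypothetical sample standing in for final_urls.txt
cat > prio_demo_in.txt <<'EOF'
https://example.com/about
https://example.com/api/v1/user?id=1
https://example.com/admin/settings
https://example.com/blog/post-1
https://example.com/login?next=/home
EOF

# Prefix each URL with its tier, then sort so "high" rows come before "medium".
awk '{
  prio = "medium"
  if ($0 ~ /\?[^=]+=/ || $0 ~ /\/(api|admin|login|auth)(\/|\?|$)/) prio = "high"
  print prio "\t" $0
}' prio_demo_in.txt | sort -k1,1 > prio_demo_out.txt
cat prio_demo_out.txt
```

Treat the tags as a starting order for manual review. A keyword match says nothing about whether the endpoint is actually vulnerable.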

Real-world use-cases

Finding /api/v1/user?id= endpoint
Discovering hidden /admin-panel
Identifying /debug endpoints
Extracting internal APIs from JS
Finding old vulnerable endpoints from Wayback

These are common bug bounty wins.

Mini lab exercise (30–40 minutes)

Run passive collection:

gau example.com > urls.txt
waybackurls example.com >> urls.txt

Run active crawl:

katana -u https://example.com -o katana.txt

Merge:

cat urls.txt katana.txt | sort -u > all.txt

Filter:

grep -Ev "\.(jpg|png|css|js)$" all.txt > clean.txt

Extract params:

grep "=" clean.txt > params.txt

Open 3 endpoints and note:

What they do
What you can test

Common mistakes and fixes

Mistake: Only using one tool
Fix: Combine passive and active

Mistake: Not filtering noise
Fix: Remove static files early

Mistake: Ignoring parameters
Fix: Parameters are key attack points

Mistake: Not validating URLs
Fix: Always use httpx

Quick command summary

Passive collection:

gau example.com
waybackurls example.com

Active crawl:

katana -u https://example.com

Filter:

grep -Ev "\.(jpg|png|css|js)$"

Params:

grep "="

Validation:

httpx -l urls.txt

What to do after this Part

Start parameter fuzzing
Test for XSS, SQLi, IDOR
Analyse APIs
Move into vulnerability testing phase

You are no longer only doing recon.
You are preparing for the exploitation phase.

Next post preview

Part 12 – Visual Recon and Quick Triage (Screenshots, Patterns, Grouping)

We will cover:

Screenshot-based recon
Fast manual triage
Identifying patterns visually
Grouping similar apps

This will help you move faster with clarity.

Closing thought

Domains show structure.
URLs show behaviour.

And behaviour is where vulnerabilities live.

Disclaimer

This content is for educational purposes only. Use it ethically and only against targets you own or have explicit permission to test. Do not use any techniques described here in ways that break laws, platform rules, or third-party rights. If in doubt, stop and get permission.

CyberXsociety