Part 11 – URL Collection and Analysis (Active + Passive Combined)

Introduction

Now things start getting real.

So far, you have found:

  • Subdomains
  • IPs
  • Live hosts
  • Running services

But vulnerabilities do not live in domains.
They live in URLs and endpoints.

This part is where recon becomes practical.
You are no longer mapping.
You are preparing to test.


Why URL collection matters

  • URLs show actual functionality
  • Hidden endpoints often lead to bugs
  • Old URLs reveal forgotten features
  • Parameters become entry points for testing

A single good endpoint is worth more than 100 subdomains.


What you are trying to collect

  • All reachable URLs
  • Hidden endpoints
  • API paths
  • Parameters
  • Old and archived URLs

This becomes your testing dataset.


Two approaches you will combine

Passive collection

No direct interaction with the target. The data comes from third-party archives.

Sources:

  • Wayback Machine
  • Common Crawl
  • Public datasets

Active collection

Direct interaction with the target.

Methods:

  • Crawling
  • Spidering
  • Endpoint discovery

Best results come from combining both.


Tools you will use

  • gau – gather URLs from multiple sources
  • waybackurls – archive URLs
  • katana – modern crawler
  • hakrawler – lightweight crawler
  • httpx – validation
  • uro – URL deduplication
  • grep / gf – filtering
  • jq – parsing

Step-by-step URL collection workflow


1. Passive URL collection (start here)

Use gau:

gau example.com > gau_urls.txt

Use waybackurls:

waybackurls example.com >> gau_urls.txt

Merge and clean:

sort -u gau_urls.txt > passive_urls.txt

What you get

  • Historical endpoints
  • Old APIs
  • Deprecated paths

These are high-value.
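
It helps to check whether the second tool actually added anything. The sketch below is one way to do that with comm. The sample URLs and file names are made up and stand in for real gau and waybackurls output.

```shell
# Hypothetical sample data standing in for real gau / waybackurls output.
workdir=$(mktemp -d)
printf '%s\n' \
  'https://example.com/login' \
  'https://example.com/api/v1/user?id=1' > "$workdir/gau.txt"
printf '%s\n' \
  'https://example.com/login' \
  'https://example.com/old/export.php?file=report' > "$workdir/wayback.txt"

# comm needs sorted input.
sort -u "$workdir/gau.txt" > "$workdir/gau.sorted"
sort -u "$workdir/wayback.txt" > "$workdir/wayback.sorted"

# -13 hides lines unique to file 1 and lines common to both,
# leaving only the URLs that the second source alone contributed.
only_wayback=$(comm -13 "$workdir/gau.sorted" "$workdir/wayback.sorted")
echo "$only_wayback"

rm -rf "$workdir"
```

If this prints nothing for a real target, the second source added no new URLs for that domain.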


2. Active crawling with katana

katana -u https://example.com -d 3 -jc -o katana_urls.txt

Explanation

  • -d 3 sets the maximum crawl depth
  • -jc also parses JavaScript files for endpoints
  • -o writes the results to a file

This finds:

  • Live pages
  • JS-linked endpoints
  • Hidden paths

3. Combine passive and active results

cat passive_urls.txt katana_urls.txt | sort -u > all_urls.txt

Now you have a unified dataset.
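
Before filtering, it can be useful to see which hosts dominate the dataset. This is a minimal sketch, assuming every line is a full URL with a scheme. The function name count_hosts and the sample URLs are hypothetical.

```shell
count_hosts() {
  # Splitting "scheme://host/path" on "/" puts the host in field 3.
  awk -F/ '{ print $3 }' | sort | uniq -c | sort -rn
}

# Hypothetical sample input; in practice: count_hosts < all_urls.txt
printf '%s\n' \
  'https://api.example.com/v1/users' \
  'https://api.example.com/v1/orders?id=5' \
  'https://www.example.com/about' | count_hosts
```

A host with thousands of near-identical URLs is usually a sign that normalisation will shrink the list a lot.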


4. Remove noise (very important)

Many URLs are useless:

  • Images
  • CSS
  • Fonts

Filter them:

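grep -Eiv "^[^?#]*\.(jpg|jpeg|png|gif|css|js|woff|woff2|ttf|svg|ico)(\?|#|$)" all_urls.txt > clean_urls.txt

The pattern is case-insensitive and checks the extension of the path only, so style.css?v=3 is removed while download?file=report.png is kept.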

Now your list is usable.
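
A quick way to sanity-check a static-file filter is to run it on a few hand-picked URLs. The sketch below uses a path-anchored pattern, which is an assumption and may differ from the exact pattern you run. The function name filter_static and the sample URLs are made up.

```shell
filter_static() {
  # -i ignores case, -v inverts the match.
  # ^[^?#]* restricts the extension check to the path, and (\?|#|$)
  # accepts the extension at the end of the URL or before a query/fragment.
  grep -Eiv '^[^?#]*\.(jpg|jpeg|png|gif|css|js|woff|woff2|ttf|svg|ico)(\?|#|$)'
}

# Hypothetical sample input covering the edge cases.
printf '%s\n' \
  'https://example.com/logo.PNG' \
  'https://example.com/style.css?v=3' \
  'https://example.com/api/v1/user?id=1' \
  'https://example.com/download?file=report.png' | filter_static
```

Only the last two URLs should survive: the API call and the download endpoint whose parameter value merely ends in .png.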


Parameter extraction (critical step)

You want URLs like:

https://example.com/page?id=123

Extract parameterised URLs:

grep -E "\?.*=" clean_urls.txt > params.txt

These are your attack points.
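
Knowing which parameter names occur most often helps decide where to start. This is a rough sketch using only grep, tr, sort and uniq. The function name param_names and the sample URLs are hypothetical.

```shell
param_names() {
  # Pull out every "?name=" or "&name=" token, strip the delimiters,
  # then count how often each parameter name occurs.
  grep -oE '[?&][^=&#]+=' | tr -d '?&=' | sort | uniq -c | sort -rn
}

# Hypothetical sample input; in practice: param_names < params.txt
printf '%s\n' \
  'https://example.com/page?id=123' \
  'https://example.com/item?id=7&ref=home' \
  'https://example.com/search?q=test&id=9' | param_names
```

Names such as id, file, url or redirect near the top of the list are usually worth a first look.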


Normalising URLs

Avoid duplicates:

uro < clean_urls.txt > final_urls.txt

This removes:

  • URLs that differ only in parameter values
  • Repeated or incremental endpoints (for example /page/1/ and /page/2/)

Clean data = faster testing.
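
To see why normalising shrinks the list, here is a crude approximation of the idea: replace every parameter value with a placeholder and dedupe. It is not how uro works internally, only an illustration. The function name collapse_values and the FUZZ placeholder are assumptions.

```shell
collapse_values() {
  # Replace each parameter value with a fixed placeholder, then dedupe,
  # so URLs that differ only in values collapse into one line.
  sed -E 's/=[^&#]*/=FUZZ/g' | sort -u
}

# Hypothetical sample input: four URLs, three distinct shapes.
printf '%s\n' \
  'https://example.com/page?id=1' \
  'https://example.com/page?id=2' \
  'https://example.com/page?id=2&sort=asc' \
  'https://example.com/about' | collapse_values
```

The two /page?id= URLs collapse into one entry, while the variant with an extra sort parameter stays separate.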


Finding interesting endpoints

Filter for keywords:

grep -Ei "api|admin|login|auth|debug|internal" final_urls.txt > interesting.txt

Focus on:

  • /api
  • /admin
  • /login
  • /auth
  • /debug
  • /internal

These are high-value.
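
A per-keyword count shows where the interesting surface is concentrated. This is a small sketch. The function name count_keywords, the keyword list and the sample file are assumptions, and matching on "/keyword" is deliberately loose.

```shell
count_keywords() {
  # $1 is a file with one URL per line.
  for kw in api admin login auth debug internal; do
    # grep -c prints the number of matching lines (0 if none); -i ignores case.
    printf '%s %s\n' "$kw" "$(grep -ci -- "/$kw" "$1")"
  done
}

# Hypothetical sample file standing in for final_urls.txt.
sample=$(mktemp)
printf '%s\n' \
  'https://example.com/api/v1/user?id=1' \
  'https://example.com/API/v2/orders' \
  'https://example.com/admin-panel' \
  'https://example.com/blog/post-1' > "$sample"

report=$(count_keywords "$sample")
echo "$report"
rm -f "$sample"
```

A keyword with only a handful of hits is often more interesting than one with thousands, because it is quick to review by hand.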


Extracting endpoints from JavaScript

JS files often contain hidden URLs.

First collect JS files. Read from all_urls.txt, the unfiltered list, because the noise filter drops .js files:

grep -E "\.js(\?|$)" all_urls.txt > js_files.txt

Then extract endpoints:

while read -r url; do curl -s --max-time 10 "$url"; done < js_files.txt | grep -oE "https?://[^\"' ]+" | sort -u > js_endpoints.txt

This reveals:

  • APIs
  • Hidden routes
  • Internal services

Very powerful step.
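
Absolute URLs are only part of the picture, since JS code often references relative paths such as "/api/v1/users". The sketch below pulls quoted strings that start with a slash. The function name extract_paths and the sample JS lines are made up, and the pattern is a rough heuristic that will produce some false positives.

```shell
extract_paths() {
  # Match single- or double-quoted strings that start with "/" and contain
  # only typical path characters, then strip the quotes and dedupe.
  grep -oE "[\"'](/[A-Za-z0-9_./-]+)[\"']" | tr -d "\"'" | sort -u
}

# Hypothetical JS content; in practice pipe the downloaded JS into it.
printf '%s\n' \
  'fetch("/api/v1/users");' \
  "const img = '/static/logo.png';" \
  'axios.get("https://api.example.com/v2/orders")' | extract_paths
```

The absolute URL on the third line is skipped here on purpose, because the https:// pattern already covers it.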


Validating collected URLs

Not all URLs are alive.

Check:

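httpx -l final_urls.txt -status-code -silent -o live_urls.txt

Now you have every URL that responded, with its status code. A 404 or 500 is still a response, so read the codes before treating an endpoint as working.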
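
Status codes make the list easier to split. The sketch below keeps only chosen codes and prints bare URLs. It assumes each line looks like "URL [code]" in plain text without colour codes, which may not match your httpx version or flags. The function name keep_status is hypothetical.

```shell
keep_status() {
  # $1 is a |-separated list of status codes, e.g. "200|401|403".
  # Assumed input format per line: https://host/path [200]
  awk -v codes="$1" '$2 ~ ("^\\[(" codes ")\\]$") { print $1 }'
}

# Hypothetical sample input; in practice: keep_status "200|401|403" < live_urls.txt
printf '%s\n' \
  'https://example.com/login [200]' \
  'https://example.com/old [404]' \
  'https://example.com/admin [403]' | keep_status "200|401|403"
```

Keeping 401 and 403 is a deliberate choice: a protected endpoint exists, even if it does not answer yet.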


Prioritisation strategy

Focus on:

High priority:

  • URLs with parameters
  • API endpoints
  • Auth-related paths
  • Admin panels

Medium priority:

  • Static pages
  • Informational endpoints

Low priority:

  • Repeated or duplicate URLs
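
The tiers above can be roughed out with sort and awk. This is only a sketch: the function name tag_priority, the HIGH and MEDIUM labels and the keyword list are assumptions, and duplicates (the low tier) are simply dropped.

```shell
tag_priority() {
  # sort -u removes duplicates; awk then labels what is left.
  sort -u | awk '{
    url = tolower($0)
    # High: has a query parameter, or the path contains an API/auth/admin keyword.
    if (url ~ /\?.*=/ || url ~ /\/(api|auth|login|admin)/) print "HIGH   " $0
    else print "MEDIUM " $0
  }'
}

# Hypothetical sample input with one duplicate.
printf '%s\n' \
  'https://example.com/about' \
  'https://example.com/about' \
  'https://example.com/api/v1/user?id=1' \
  'https://example.com/admin-panel' | tag_priority
```

Pipe the result through grep '^HIGH' to get a first working list.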

Real-world use-cases

  • Finding /api/v1/user?id= endpoint
  • Discovering hidden /admin-panel
  • Identifying /debug endpoints
  • Extracting internal APIs from JS
  • Finding old vulnerable endpoints from Wayback

These are common bug bounty wins.


Mini lab exercise (30–40 minutes)

  1. Run passive collection:
gau example.com > urls.txt
waybackurls example.com >> urls.txt
  2. Run active crawl:
katana -u https://example.com -o katana.txt
  3. Merge:
cat urls.txt katana.txt | sort -u > all.txt
  4. Filter:
grep -Ev "\.(jpg|png|css|js)$" all.txt > clean.txt
  5. Extract params:
grep "=" clean.txt > params.txt
  6. Open 3 endpoints and note:
  • What they do
  • What you can test

Common mistakes and fixes

Mistake: Only using one tool
Fix: Combine passive and active

Mistake: Not filtering noise
Fix: Remove static files early

Mistake: Ignoring parameters
Fix: Parameters are key attack points

Mistake: Not validating URLs
Fix: Always use httpx


Quick command summary

Passive collection:

gau example.com
waybackurls example.com

Active crawl:

katana -u https://example.com

Filter:

grep -Ev "\.(jpg|png|css|js)$"

Params:

grep "="

Validation:

httpx -l urls.txt

What to do after this Part

  • Start parameter fuzzing
  • Test for XSS, SQLi, IDOR
  • Analyze APIs
  • Move into vulnerability testing phase

You are no longer only doing recon.
You are preparing for the exploitation phase.


Next post preview

Part 12 – Visual Recon and Quick Triage (Screenshots, Patterns, Grouping)

We will cover:

  • Screenshot-based recon
  • Fast manual triage
  • Identifying patterns visually
  • Grouping similar apps

This will help you move faster with clarity.


Closing thought

Domains show structure.
URLs show behaviour.

And behaviour is where vulnerabilities live.


Disclaimer

This content is for educational purposes only. Use it ethically and only against targets you own or have explicit permission to test. Do not use any techniques described here in ways that break laws, platform rules, or third-party rights. If in doubt, stop and get permission.
