A wordlist for content discovery, optimized for Burp Suite Professional and dirsearch, built by scraping and analyzing `/robots.txt` from the top 100k most visited domains in February 2026.
| File | Tool | Description |
|---|---|---|
| `dirsearch-robots.txt` | dirsearch | Combined wordlist with %EXT% placeholders for dirsearch's extension handling |
| `burp-robots-files.txt` | Burp Suite | Files only (required by Burp's Content Discovery) |
| `burp-robots-directories.txt` | Burp Suite | Directories only (required by Burp's Content Discovery) |
The underlying content is the same; the Burp lists are simply the dirsearch list split into files and directories.
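For illustration, such a split can be sketched in a few lines of Python. The extension-based heuristic below is an assumption made for demonstration; the repository's actual split logic may differ:

```python
# Minimal sketch: split the combined dirsearch list into the two Burp lists.
# Assumption: an entry whose last path segment contains a dot (or the
# dirsearch-specific %EXT% placeholder) is a file; everything else is a
# directory.
from pathlib import Path

files, dirs = [], []
for entry in Path("dirsearch-robots.txt").read_text().splitlines():
    entry = entry.strip()
    if not entry:
        continue
    last = entry.rstrip("/").rsplit("/", 1)[-1]
    (files if ("." in last or "%EXT%" in last) else dirs).append(entry)

Path("burp-robots-files.txt").write_text("\n".join(files) + "\n")
Path("burp-robots-directories.txt").write_text("\n".join(dirs) + "\n")
```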
The wordlist contains one entry per line and is optimized for recursive scanning:
```bash
python3 dirsearch.py --random-agent -u https://target.com \
  -w dirsearch-robots.txt \
  --recursive -R 3
```

The wordlist uses %EXT% placeholders for server-side files. Define extensions based on the target stack to keep scans efficient and avoid testing irrelevant file types:
```bash
python3 dirsearch.py --random-agent -u https://target.com \
  -w dirsearch-robots.txt \
  --recursive -R 3 \
  -e php,html
```

With `-e php,html`, for example, an entry such as `admin.%EXT%` is requested as both `admin.php` and `admin.html`.

The wordlist is primarily lowercase. Let dirsearch handle case transformations automatically:
```bash
python3 dirsearch.py -u https://target.com \
  -w dirsearch-robots.txt \
  --recursive -R 3 \
  -e php \
  --capital
```

- Choose extensions based on the target stack to avoid unnecessary requests.
- Adjust case transformations depending on the target environment.
- Use recursion for deeper discovery.
- Refer to the dirsearch and Burp documentation for additional tuning options.
In pentests, a common question is: Which wordlist should I use for content discovery?
For many testers, the go-to choice is `SecLists/Discovery/Web-Content`. However, many of those wordlists come with practical limitations:
- **Outdated coverage:** Some lists are up to 9 years old and don't reflect modern applications and technologies.
- **Redundant extensions:** Entries like `file.php`, `file.html`, and `file.json` test the same resource with multiple extensions, many of which may not exist due to the target's technology stack, unnecessarily increasing scan time.
- **Overlap between lists:** The same entries appear across multiple wordlists, leading to duplicate requests.
- **Noisy entries:** Static assets (e.g. JavaScript files) and questionable entries (looking at you, `raft-*.txt`) add bulk without value.
The result is unnecessary requests, increased brute-force time, and less focused testing.
This project aims to create a universal and (relatively) compact wordlist that captures the most common directories and files while leveraging dirsearch's built-in features.
- Crawl `/robots.txt` from the top 100,000 most visited domains.
- Extract and clean paths from `Disallow`/`Allow`/`Noindex` directives.
- Remove noise (see below).
- Sort entries by frequency of occurrence across domains.
- Keep only entries that occur at least 10 times overall.
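A compressed sketch of this pipeline in Python could look as follows. The input file `top-100k-domains.txt` is an assumed name; concurrency, retries, and the full noise filters are omitted:

```python
# Simplified sketch of the collection pipeline described above.
import re
from collections import Counter

import requests

DIRECTIVE = re.compile(r"^(?:Disallow|Allow|Noindex):\s*(\S+)", re.IGNORECASE)

counts = Counter()
for domain in open("top-100k-domains.txt"):  # assumed input: one domain per line
    domain = domain.strip()
    try:
        resp = requests.get(f"https://{domain}/robots.txt", timeout=5)
    except requests.RequestException:
        continue
    if resp.status_code != 200:
        continue
    # Collect each unique path once per domain, so the count reflects how
    # many domains use it, not how often it appears within one file.
    paths = set()
    for line in resp.text.splitlines():
        m = DIRECTIVE.match(line.strip())
        if m:
            # Strip wildcards and query strings to get a clean path.
            path = m.group(1).split("?")[0].replace("*", "").rstrip("$")
            if path.startswith("/") and len(path) > 1:
                paths.add(path)
    counts.update(paths)

# Sort by frequency across domains; keep entries seen at least 10 times.
wordlist = [p for p, n in counts.most_common() if n >= 10]
print("\n".join(wordlist))
```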
To reduce noise and improve scan efficiency, the following categories are removed:
- Sex-related terms
- Non-English/German language-specific words
- Site-specific or highly contextual paths (e.g. product filter URLs from individual shops)
- Language and country codes
- City and brand names
- Static content (JavaScript files, images, fonts, etc.)
- Entries that don't meaningfully contribute to discovery
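Two of these categories, static content and language/country codes, lend themselves to mechanical filtering, as in the illustrative sketch below. The term-based filters (sex-related words, brand and city names, etc.) would additionally require curated blocklists that aren't reproduced here:

```python
# Illustrative noise filters: drop static assets by file extension and
# drop path segments that are bare language/country codes.
import re

STATIC_EXT = re.compile(r"\.(js|css|png|jpe?g|gif|svg|ico|woff2?|ttf|eot)$", re.I)
LANG_CODE = re.compile(r"^/[a-z]{2}(-[a-z]{2})?/?$", re.I)  # e.g. /en/, /de-de/

def keep(path: str) -> bool:
    return not STATIC_EXT.search(path) and not LANG_CODE.match(path)

entries = ["/admin/", "/en-us/", "/logo.png", "/wp-login.php"]
print([p for p in entries if keep(p)])  # ['/admin/', '/wp-login.php']
```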