Crawler Settings
Configure crawl behavior, limits, delays, and URL patterns
The [crawler] section controls how squirrelscan discovers and fetches pages.
Configuration
[crawler]
max_pages = 100
delay_ms = 100
timeout_ms = 30000
concurrency = 5
per_host_concurrency = 5
per_host_delay_ms = 50
include = []
exclude = []
allow_query_params = []
drop_query_prefixes = ["utm_", "gclid", "fbclid"]
respect_robots = true
incremental = true
breadth_first = true
max_prefix_budget = 0.25
user_agent = ""
follow_redirects = true
Crawl Limits
max_pages
Type: number
Default: 100
Range: 1 to 5000 (capped by CLI)
Maximum number of pages to crawl per audit. The literal config default is 100, but when max_pages isn’t explicitly set the effective budget follows the coverage mode: quick = 25, surface = 100, full = 500. The default coverage mode is auth-aware — signed-in Pro accounts default to surface, free/anonymous to quick (see Coverage Modes).
Examples:
Small site:
[crawler]
max_pages = 50
Large site:
[crawler]
max_pages = 2000
CLI override:
squirrel audit https://example.com -m 100
The resolution order is --max-pages / -m > non-default [crawler] max_pages > coverage-mode default (quick 25, surface 100, full 500).
Note: The CLI enforces a hard cap (currently 5,000 pages) regardless of config. When a crawl stops because it hit the limit, the CLI prints a hint naming --max-pages and the cap — see Hitting the page limit.
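The resolution order and hard cap described above can be sketched as a small function. This is an illustrative sketch, not squirrelscan's implementation: the function name, argument names, and the treatment of a config value equal to 100 as "not explicitly set" are assumptions drawn from the wording on this page.

```python
# Hypothetical sketch of the max_pages resolution order described above.
COVERAGE_DEFAULTS = {"quick": 25, "surface": 100, "full": 500}
CONFIG_DEFAULT = 100   # literal [crawler] max_pages default
HARD_CAP = 5000        # CLI hard cap stated on this page

def effective_max_pages(cli_value=None, config_value=CONFIG_DEFAULT, coverage="quick"):
    """Return the page budget for one audit."""
    if cli_value is not None:
        chosen = cli_value                    # --max-pages / -m wins
    elif config_value != CONFIG_DEFAULT:
        chosen = config_value                 # non-default [crawler] max_pages
    else:
        chosen = COVERAGE_DEFAULTS[coverage]  # coverage-mode default
    return min(chosen, HARD_CAP)              # the cap applies regardless of source

print(effective_max_pages(cli_value=100, config_value=2000))  # 100
print(effective_max_pages(config_value=2000))                 # 2000
print(effective_max_pages(coverage="full"))                   # 500
```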
max_depth
Type: number
Default: unset (unlimited)
Maximum crawl depth measured in link hops from the start page (seed and sitemap URLs are depth 0; a link found on a depth-N page is depth N+1). When set, the crawler never follows links past this depth — max_depth = 1 audits the start page plus the pages it links to directly. Unset leaves crawling depth-unbounded (the page budget still applies).
[crawler]
max_depth = 2
CLI override:
squirrel audit https://example.com --max-depth 2
The resolution order is --max-depth > [crawler] max_depth > unlimited.
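To make the depth rule concrete, here is a minimal sketch of a depth-limited, breadth-first traversal over an in-memory link graph. The graph, the function name, and the traversal details are illustrative assumptions; the real crawler also applies the page budget, filters, and robots rules.

```python
from collections import deque

def crawl_order(links, start, max_depth=None):
    """Visit pages breadth-first, never following links past max_depth hops."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])  # the start page is depth 0
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # audit this page, but do not follow its links
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order

# Hypothetical site: "/" links to "/a" and "/b"; "/a" links deeper.
links = {"/": ["/a", "/b"], "/a": ["/a/1"], "/a/1": ["/a/1/x"]}
print(crawl_order(links, "/", max_depth=1))  # ['/', '/a', '/b']
```

With max_depth = 1 only the start page and the pages it links to directly are visited, which matches the description above.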
timeout_ms
Type: number
Default: 30000 (30 seconds)
Range: 1000 to 60000 recommended
Timeout for each page request in milliseconds.
Examples:
Fast timeout for quick sites:
[crawler]
timeout_ms = 10000 # 10 seconds
Slow sites or APIs:
[crawler]
timeout_ms = 45000 # 45 seconds
When a request exceeds the timeout:
- Page marked as failed
- Crawl continues with next URL
- Logged in error output
Rate Limiting
delay_ms
Type: number
Default: 100 (100ms)
Base delay between requests in milliseconds.
Examples:
Fast crawl (be careful):
[crawler]
delay_ms = 50
Polite crawl:
[crawler]
delay_ms = 500
No delay (local development only):
[crawler]
delay_ms = 0
Note: Actual delays depend on per_host_delay_ms and concurrency settings.
per_host_delay_ms
Type: number
Default: 50 (50ms)
Minimum delay between consecutive request starts to the same host. Request
starts stagger independently, so up to per_host_concurrency requests can be in
flight at once while still spacing new requests apart for politeness.
A Crawl-delay directive in the target’s robots.txt always overrides this
value, so sites that declare a slower rate are honored.
Examples:
Very polite:
[crawler]
per_host_delay_ms = 1000 # 1 second
How it works:
With per_host_concurrency = 5 and per_host_delay_ms = 50:
- At most 5 concurrent requests to the same host
- New requests to the host start at least 50ms apart
- Other hosts can be fetched simultaneously
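The staggering behavior can be illustrated with a small simulation for a single host. This is a sketch under stated assumptions, not the crawler's scheduler: the function name is hypothetical, request durations are supplied by the caller, and robots.txt Crawl-delay handling is left out.

```python
import heapq

def schedule_starts(durations_ms, per_host_concurrency=5, per_host_delay_ms=50):
    """Return the start time (ms) of each request to one host."""
    starts = []
    in_flight = []   # min-heap of finish times for running requests
    earliest = 0     # earliest time the next request may start
    for duration in durations_ms:
        start = earliest
        # Forget requests that have already finished by the candidate start.
        while in_flight and in_flight[0] <= start:
            heapq.heappop(in_flight)
        # If every slot is busy, wait for the first request to finish.
        if len(in_flight) >= per_host_concurrency:
            start = heapq.heappop(in_flight)
        heapq.heappush(in_flight, start + duration)
        starts.append(start)
        earliest = start + per_host_delay_ms
    return starts

# Seven requests that each take 400 ms, using the defaults on this page.
print(schedule_starts([400] * 7))  # [0, 50, 100, 150, 200, 400, 450]
```

The first five requests start 50 ms apart; the sixth has to wait until a slot frees up at 400 ms because five requests are already in flight.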
concurrency
Type: number
Default: 5
Range: 1 to 20 recommended
Maximum number of concurrent requests globally.
Examples:
Sequential (single request at a time):
[crawler]
concurrency = 1
Moderate parallelism:
[crawler]
concurrency = 10
High parallelism (use cautiously):
[crawler]
concurrency = 20
Impact:
- Higher = faster crawls
- Higher = more server load
- Bounded by per_host_concurrency
per_host_concurrency
Type: number
Default: 5
Range: 1 to 8 recommended
Maximum number of concurrent requests per host.
Prevents overwhelming a single server even with high global concurrency.
Examples:
One request per host at a time:
[crawler]
per_host_concurrency = 1
Allow a few parallel requests (slightly below the default of 5):
[crawler]
per_host_concurrency = 4
How it interacts with concurrency:
[crawler]
concurrency = 10
per_host_concurrency = 2
- Up to 10 total concurrent requests
- At most 2 concurrent requests to any single host
- Using all 10 slots requires at least 5 different hosts in flight at once
URL Filtering
include
Type: string[]
Default: [] (empty = include all URLs from seed domain)
URL patterns to include. If set, only matching URLs are crawled.
Pattern Syntax:
Uses glob syntax:
- * - Match anything except /
- ** - Match anything including /
- ? - Match single character
- [abc] - Match character set
Examples:
Only crawl blog:
[crawler]
include = ["/blog/**"]
Multiple sections:
[crawler]
include = ["/blog/**", "/docs/**", "/products/**"]
Specific file types:
[crawler]
include = ["*.html", "*.htm"]
Important: When include is set, it overrides the domains setting in [project].
exclude
Type: string[]
Default: [] (empty = exclude nothing)
URL patterns to exclude from crawling.
Takes precedence over include - if a URL matches both, it’s excluded.
Examples:
Exclude admin areas:
[crawler]
exclude = ["/admin/**", "/wp-admin/**"]
Exclude file types:
[crawler]
exclude = ["*.pdf", "*.zip", "*.tar.gz"]
Exclude API endpoints:
[crawler]
exclude = ["/api/**", "/v1/**"]
Exclude query parameters:
[crawler]
exclude = ["*?preview=*", "*?draft=*"]
Common exclusions:
[crawler]
exclude = [
"/admin/**",
"/wp-admin/**",
"/wp-content/**",
"/api/**",
"*.pdf",
"*.zip",
"*.jpg",
"*.png",
"*?preview=*",
"*?print=*"
]
Pattern Matching Examples
| Pattern | Matches | Doesn’t Match |
|---|---|---|
| /blog/* | /blog/post | /blog/post/comment |
| /blog/** | /blog/post, /blog/post/comment | /about |
| *.pdf | /file.pdf, /docs/guide.pdf | /file.html |
| *?preview=* | /page?preview=true | /page |
| /api/*/users | /api/v1/users | /api/v1/v2/users |
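The rows in the table can be reproduced with a small glob-to-regex translator. This is only a sketch of one matching scheme that is consistent with the table, not squirrelscan's actual matcher. In particular, it assumes that a pattern containing no / is compared against the last path segment (including any query string), which is how *.pdf can match /docs/guide.pdf even though * does not cross /. The function names are hypothetical.

```python
import re

def glob_to_regex(pattern):
    """Translate the glob syntax listed above into a compiled regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if pattern.startswith("**", i):
            out.append(".*")      # ** matches anything, including /
            i += 2
            continue
        if ch == "*":
            out.append("[^/]*")   # * matches anything except /
        elif ch == "?":
            out.append("[^/]")    # ? matches a single character
        elif ch == "[" and "]" in pattern[i + 1:]:
            end = pattern.index("]", i + 1)
            out.append("[" + pattern[i + 1:end] + "]")  # simple character set
            i = end + 1
            continue
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out))

def matches(pattern, url_path):
    # Assumption: slash-free patterns are tested against the last segment only.
    target = url_path if "/" in pattern else url_path.rsplit("/", 1)[-1]
    return glob_to_regex(pattern).fullmatch(target) is not None

print(matches("/blog/*", "/blog/post/comment"))   # False
print(matches("/blog/**", "/blog/post/comment"))  # True
print(matches("*.pdf", "/docs/guide.pdf"))        # True
```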
Query Parameters
allow_query_params
Type: string[]
Default: [] (empty = drop all query params for deduplication)
Query parameters to preserve during URL deduplication.
Why this matters:
URLs are deduplicated before crawling:
/page?id=1&utm_source=google → /page?id=1 (utm_source dropped; id survives only if it is listed in allow_query_params)
By default, all query params are dropped; only those listed in allow_query_params are kept.
Examples:
Preserve pagination:
[crawler]
allow_query_params = ["page"]
Preserve filters:
[crawler]
allow_query_params = ["category", "sort", "filter", "q"]
Preserve all query params:
[crawler]
allow_query_params = ["*"]
Use case:
E-commerce site with filters:
[crawler]
allow_query_params = ["category", "price", "brand", "page"]
This preserves:
- /products?category=shoes ✓
- /products?category=shoes&page=2 ✓
This drops:
- /products?utm_source=google ✗ (becomes /products)
- /products?gclid=abc123 ✗ (becomes /products)
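The deduplication rule can be sketched with Python's standard urllib.parse. This is an approximation under assumptions, not the crawler's code: the function name is hypothetical, tracking prefixes (the documented defaults utm_, gclid, fbclid) are checked before the allow list, and "*" in the allow list is treated as "keep everything that is not a tracking param".

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_DROP_PREFIXES = ("utm_", "gclid", "fbclid")

def normalize_url(url, allow_query_params=(), drop_query_prefixes=DEFAULT_DROP_PREFIXES):
    """Return the URL with only the permitted query params kept."""
    parts = urlsplit(url)
    allow_all = "*" in allow_query_params
    kept = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if any(name.startswith(prefix) for prefix in drop_query_prefixes):
            continue  # tracking params are dropped even if allowed
        if allow_all or name in allow_query_params:
            kept.append((name, value))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))

allow = ["category", "price", "brand", "page"]
print(normalize_url("/products?category=shoes&page=2", allow))  # /products?category=shoes&page=2
print(normalize_url("/products?gclid=abc123", allow))           # /products
print(normalize_url("/page?id=1&utm_source=google"))            # /page
```

Two URLs that normalize to the same string would be treated as one page in this sketch.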
drop_query_prefixes
Type: string[]
Default: ["utm_", "gclid", "fbclid"]
Query parameter prefixes to always drop, even if in allow_query_params.
Default tracking params dropped:
- utm_* - Google Analytics (utm_source, utm_medium, etc.)
- gclid - Google Ads
- fbclid - Facebook Ads
Examples:
Drop more tracking params:
[crawler]
drop_query_prefixes = [
"utm_",
"gclid",
"fbclid",
"mc_", # Mailchimp
"_ga", # Google Analytics
"ref", # Referrer
"source" # Generic source tracking
]
Drop nothing:
[crawler]
drop_query_prefixes = []
Crawl Strategy
breadth_first
Type: boolean
Default: true
Use breadth-first crawling for better site coverage.
Breadth-first (default):
- Crawls level-by-level
- Discovers homepage, then all links from homepage, then all links from those pages
- Better site coverage
- Avoids getting stuck in deep paths
Depth-first (false):
- Crawls as deep as possible before backtracking
- Can get stuck in deep sections
- Less even coverage
Example:
Disable breadth-first:
[crawler]
breadth_first = false
Recommendation: Keep true (default) for most sites.
max_prefix_budget
Type: number
Default: 0.25 (25%)
Range: 0.1 to 1.0
Maximum fraction of the crawl budget (max_pages) that any single path prefix may consume.
Prevents the crawler from spending all pages on one section (e.g., /blog/ with 1000+ posts).
How it works:
With max_pages = 500 and max_prefix_budget = 0.25:
- At most 125 pages (25%) from any single path prefix
- Ensures diverse coverage across site sections
Examples:
More strict (better coverage):
[crawler]
max_prefix_budget = 0.15 # Max 15% per prefix
More lenient (deeper coverage):
[crawler]
max_prefix_budget = 0.5 # Max 50% per prefix
Disable budget (not recommended):
[crawler]
max_prefix_budget = 1.0 # No limit
Use case:
Site with large blog:
/blog/post-1
/blog/post-2
...
/blog/post-5000
/about
/contact
With max_prefix_budget = 0.25 and max_pages = 500:
- At most 125 blog posts crawled
- Remaining budget for other sections
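The arithmetic in this use case can be checked with a short sketch. It assumes that a "path prefix" is the first path segment (/blog/post-1 belongs to /blog) and that the per-prefix cap is rounded down; neither detail is stated on this page, and the function names are hypothetical.

```python
import math
from collections import Counter

def first_segment(path):
    # Assumption: the prefix is the first path segment, e.g. "/blog/post-1" -> "/blog".
    head = path.strip("/").split("/")[0]
    return "/" + head

def select_pages(paths, max_pages=500, max_prefix_budget=0.25):
    """Pick pages in discovery order while capping each prefix's share."""
    cap = math.floor(max_pages * max_prefix_budget)  # 500 * 0.25 = 125
    used = Counter()
    selected = []
    for path in paths:
        if len(selected) >= max_pages:
            break
        prefix = first_segment(path)
        if used[prefix] >= cap:
            continue  # this section has used up its share of the budget
        used[prefix] += 1
        selected.append(path)
    return selected

paths = [f"/blog/post-{i}" for i in range(1, 5001)] + ["/about", "/contact"]
selected = select_pages(paths)
print(sum(p.startswith("/blog/") for p in selected))  # 125
print("/about" in selected, "/contact" in selected)   # True True
```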
Request Configuration
user_agent
Type: string
Default: "" (empty = random browser user agent per crawl)
Custom user agent string.
Default behavior (empty string):
Random modern browser user agent, refreshed per crawl:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
Examples:
Custom user agent:
[crawler]
user_agent = "SquirrelScan Bot (https://squirrelscan.com)"
Mobile user agent:
[crawler]
user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 18_0 like Mac OS X) AppleWebKit/605.1.15"
Recommendation: Leave empty (default) for best results with bot protection.
headers
Type: table (map of header name → value)
Default: {} (no custom headers)
Custom HTTP request headers attached to every crawl request — pages, assets,
robots.txt, sitemaps, llms.txt, and markdown probes. Use this to authorize
the crawler with schemes that require signed headers, such as
Shopify / Cloudflare Web Bot Auth.
[crawler]
headers = { "Signature-Agent" = "\"https://shopify.com\"", "Signature-Input" = "...", "Signature" = "..." }
CLI override:
squirrel audit https://example.com \
-H 'Signature-Agent: "https://shopify.com"' \
-H 'Signature-Input: sig1=("@authority");keyid="..."' \
-H 'Signature: sig1=:...:'
The repeatable --header / -H flag takes a Name: Value string (split on the
first colon, so values may contain colons). CLI headers merge over the TOML
[crawler] headers map — a flag with the same name wins. Quoting is preserved
verbatim, so Signature-Agent: "https://shopify.com" keeps its quotes
end-to-end.
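The parsing and merge rules for -H can be sketched as follows. This is an illustrative sketch, not the CLI's code: the function names are hypothetical, and comparing header names case-insensitively when a flag overrides a config entry is an assumption.

```python
def parse_header(raw):
    """Split a 'Name: Value' string on the first colon only."""
    name, sep, value = raw.partition(":")
    if not sep or not name.strip():
        raise ValueError(f"expected 'Name: Value', got {raw!r}")
    return name.strip(), value.strip()  # quotes inside the value are kept verbatim

def merge_headers(config_headers, cli_flags):
    """Overlay CLI -H values on the [crawler] headers map; the flag wins."""
    merged = dict(config_headers)
    for raw in cli_flags:
        name, value = parse_header(raw)
        # Assumption: names match case-insensitively when overriding.
        for existing in list(merged):
            if existing.lower() == name.lower():
                del merged[existing]
        merged[name] = value
    return merged

print(parse_header('Signature-Agent: "https://shopify.com"'))
# ('Signature-Agent', '"https://shopify.com"')
```

Because the split happens at the first colon, the colon inside https:// stays part of the value.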
Cloud audits: applying custom headers to dashboard cloud audits is a
Pro feature. Set them per-website under Settings → Crawl → Custom request
headers; free plans receive 403 upgrade_required. The render worker applies
them to the headless browser via setExtraHTTPHeaders, so they ride every
request the rendered page makes.
follow_redirects
Type: boolean
Default: true
Follow HTTP 3xx redirects.
When true (default):
- Follows redirects automatically
- Crawls final destination URL
- Redirect chains tracked for analysis
When false:
- Stops at redirect
- Does not fetch redirect destination
- Useful for debugging redirect issues
Example:
Disable redirect following:
[crawler]
follow_redirects = false
Recommendation: Keep true (default) for normal audits.
Robots.txt
respect_robots
Type: boolean
Default: true
Obey robots.txt rules and crawl-delay directives.
When true (default):
- Fetches and parses robots.txt
- Respects Disallow: rules
- Honors Crawl-delay: directive
- Polite and ethical
When false:
- Ignores robots.txt
- Crawls all URLs (including disallowed)
- Use only for your own sites
Example:
Ignore robots.txt (testing only):
[crawler]
respect_robots = false
Recommendation: Always keep true (default) when crawling third-party sites.
Crawling politely & re-scanning efficiently
squirrelscan is designed to be a good guest on the sites it audits. Two mechanisms keep load low: politeness controls (how fast it fetches) and incremental re-scanning (avoiding re-downloading pages that haven’t changed).
incremental
Type: boolean
Default: true
Re-scan only pages that changed since the last crawl. When enabled, the crawler
records each page’s ETag and Last-Modified response headers (plus a content
hash) and, on the next audit of the same site, sends conditional requests
(If-None-Match / If-Modified-Since). Unchanged pages return a lightweight
304 Not Modified — no body is transferred and the cached content is reused.
This is the answer to “incremental vs full scans so I don’t overwhelm the host”: incremental crawling is on by default, so repeat audits re-download only what actually changed.
- First audit of a site: there’s nothing cached yet, so every page is fetched in full — incremental has no effect on a cold run.
- Repeat audits: unchanged pages come back as 304s, cutting bandwidth and load on the target server.
[crawler]
incremental = true
CLI overrides:
# Force incremental on for this run — overrides a project config with incremental = false
squirrel audit https://example.com --incremental
# Disable conditional requests — fetch every page in full
squirrel audit https://example.com --no-incremental
# Full re-scan, ignoring all cached content (alias for a cold crawl)
squirrel audit https://example.com --refresh
The resolution order is --refresh (always full) > --incremental /
--no-incremental > [crawler] incremental (default true). --refresh also
ignores the cross-audit freshness cache (see use_cache_control), so use it when
you want a guaranteed clean re-fetch.
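The conditional-request flow behind incremental re-scanning can be sketched without any network code. The cache entry shape (etag, last_modified, body) and the function names are assumptions made for illustration; only the If-None-Match / If-Modified-Since headers and the 304 behavior come from the description above.

```python
def conditional_headers(cached):
    """Build request headers from what was recorded on the previous crawl."""
    headers = {}
    if cached is None:
        return headers  # cold run: nothing cached, fetch in full
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers

def resolve_body(status, body, cached):
    """Return (content, reused) for a response to a conditional request."""
    if status == 304 and cached is not None:
        return cached["body"], True  # unchanged: reuse the cached content
    return body, False               # changed or first fetch: use the new body

# Hypothetical cache entry from an earlier audit.
entry = {"etag": '"abc123"', "last_modified": "Wed, 01 Jan 2025 00:00:00 GMT", "body": "<html>old</html>"}
print(conditional_headers(entry))
print(resolve_body(304, "", entry))  # ('<html>old</html>', True)
```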
Politeness checklist
For a gentle crawl of a third-party site, combine incremental re-scanning with the rate-limit and robots controls documented above:
- respect_robots: obey robots.txt and its Crawl-delay (keep true).
- per_host_delay_ms: space out requests to a single host.
- per_host_concurrency: cap simultaneous requests per host.
- delay_ms / concurrency: global pacing.
- incremental: skip re-downloading unchanged pages on repeat audits.
A Crawl-delay directive in the target’s robots.txt always overrides
per_host_delay_ms, so sites that ask for a slower rate are honored automatically.
What squirrelscan does not access
squirrelscan crawls public HTTP only — the same pages a browser or search engine bot would see. It does not read server logs, databases, or any authenticated/internal source, and it requires no access beyond what the public site serves. Scanning protected or internal sources is not currently supported; if you need it, please open an issue describing your use case so it can be scoped separately.
Complete Examples
Fast Local Development
[crawler]
max_pages = 50
delay_ms = 0
per_host_delay_ms = 0
concurrency = 10
respect_robots = false
Polite Production Crawl
[crawler]
max_pages = 500
delay_ms = 200
per_host_delay_ms = 500
concurrency = 5
per_host_concurrency = 2
respect_robots = true
High-Volume Crawl
[crawler]
max_pages = 2000
delay_ms = 100
per_host_delay_ms = 200
concurrency = 10
per_host_concurrency = 3
breadth_first = true
max_prefix_budget = 0.2
Focused Blog Crawl
[crawler]
max_pages = 200
include = ["/blog/**"]
exclude = ["*.pdf", "/blog/drafts/**"]
allow_query_params = ["page"]
E-commerce Site
[crawler]
max_pages = 1000
include = ["/products/**", "/categories/**"]
exclude = ["/cart/**", "/checkout/**", "/account/**"]
allow_query_params = ["category", "sort", "page", "filter"]
drop_query_prefixes = ["utm_", "gclid", "fbclid", "ref"]
Related
- Project Settings - Domains configuration
- Rules Configuration - Which rules to run
- Examples - More configuration examples