Crawler Settings

Configure crawl behavior, limits, delays, and URL patterns

The [crawler] section controls how squirrelscan discovers and fetches pages.

Configuration

toml

[crawler]
max_pages = 100
delay_ms = 100
timeout_ms = 30000
concurrency = 5
per_host_concurrency = 5
per_host_delay_ms = 50
include = []
exclude = []
allow_query_params = []
drop_query_prefixes = ["utm_", "gclid", "fbclid"]
respect_robots = true
incremental = true
breadth_first = true
max_prefix_budget = 0.25
user_agent = ""
follow_redirects = true

Crawl Limits

`max_pages`

Type: number Default: 100 Range: 1 to 5000 (capped by CLI)

Maximum number of pages to crawl per audit. The literal config default is 100, but when max_pages isn’t explicitly set the effective budget follows the coverage mode: quick = 25, surface = 100, full = 500. The default coverage mode is auth-aware — signed-in Pro accounts default to surface, free/anonymous to quick (see Coverage Modes).

Examples:

Small site:

toml

[crawler]
max_pages = 50

Large site:

toml

[crawler]
max_pages = 2000

CLI override:

bash

squirrel audit https://example.com -m 100

The resolution order is --max-pages / -m > non-default [crawler] max_pages > coverage-mode default (quick 25, surface 100, full 500).

Note: The CLI enforces a hard cap (currently 5,000 pages) regardless of config. When a crawl stops because it hit the limit, the CLI prints a hint naming --max-pages and the cap — see Hitting the page limit.

`max_depth`

Type: number Default: unset (unlimited)

Maximum crawl depth measured in link hops from the start page (seed and sitemap URLs are depth 0; a link found on a depth-N page is depth N+1). When set, the crawler never follows links past this depth — max_depth = 1 audits the start page plus the pages it links to directly. Unset leaves crawling depth-unbounded (the page budget still applies).

toml

[crawler]
max_depth = 2

CLI override:

bash

squirrel audit https://example.com --max-depth 2

The resolution order is --max-depth > [crawler] max_depth > unlimited.

`timeout_ms`

Type: number Default: 30000 (30 seconds) Range: 1000 to 60000 recommended

Timeout for each page request in milliseconds.

Examples:

Fast timeout for quick sites:

toml

[crawler]
timeout_ms = 10000  # 10 seconds

Slow sites or APIs:

toml

[crawler]
timeout_ms = 45000  # 45 seconds

When request exceeds timeout:

Page marked as failed
Crawl continues with next URL
Logged in error output

Rate Limiting

`delay_ms`

Type: number Default: 100 (100ms)

Base delay between requests in milliseconds.

Examples:

Fast crawl (be careful):

toml

[crawler]
delay_ms = 50

Polite crawl:

toml

[crawler]
delay_ms = 500

No delay (local development only):

toml

[crawler]
delay_ms = 0

Note: Actual delays depend on per_host_delay_ms and concurrency settings.

`per_host_delay_ms`

Type: number Default: 50 (50ms)

Minimum delay between consecutive request starts to the same host. Request starts stagger independently, so up to per_host_concurrency requests can be in flight at once while still spacing new requests apart for politeness.

A Crawl-delay directive in the target’s robots.txt always overrides this value, so sites that declare a slower rate are honored.

Examples:

Very polite:

toml

[crawler]
per_host_delay_ms = 1000  # 1 second

How it works:

With per_host_concurrency = 5 and per_host_delay_ms = 50:

At most 5 concurrent requests to same host
New requests to the host start at least 50ms apart
Other hosts can be fetched simultaneously

`concurrency`

Type: number Default: 5 Range: 1 to 20 recommended

Maximum number of concurrent requests globally.

Examples:

Sequential (single request at a time):

toml

[crawler]
concurrency = 1

Moderate parallelism:

toml

[crawler]
concurrency = 10

High parallelism (use cautiously):

toml

[crawler]
concurrency = 20

Impact:

Higher = faster crawls
Higher = more server load
Bounded by per_host_concurrency

`per_host_concurrency`

Type: number Default: 5 Range: 1 to 8 recommended

Maximum number of concurrent requests per host.

Prevents overwhelming a single server even with high global concurrency.

Examples:

One request per host at a time:

toml

[crawler]
per_host_concurrency = 1

Allow more parallel requests:

toml

[crawler]
per_host_concurrency = 4

How it interacts with concurrency:

toml

[crawler]
concurrency = 10
per_host_concurrency = 2

Up to 10 total concurrent requests
At most 2 concurrent requests to any single host
Can fetch from up to 5 different hosts simultaneously

URL Filtering

`include`

Type: string[] Default: [] (empty = include all URLs from seed domain)

URL patterns to include. If set, only matching URLs are crawled.

Pattern Syntax:

Uses glob syntax:

* - Match anything except /
** - Match anything including /
? - Match single character
[abc] - Match character set

Examples:

Only crawl blog:

toml

[crawler]
include = ["/blog/**"]

Multiple sections:

toml

[crawler]
include = ["/blog/**", "/docs/**", "/products/**"]

Specific file types:

toml

[crawler]
include = ["*.html", "*.htm"]

Important: When include is set, it overrides the domains setting in [project].

`exclude`

Type: string[] Default: [] (empty = exclude nothing)

URL patterns to exclude from crawling.

Takes precedence over include - if a URL matches both, it’s excluded.

Examples:

Exclude admin areas:

toml

[crawler]
exclude = ["/admin/**", "/wp-admin/**"]

Exclude file types:

toml

[crawler]
exclude = ["*.pdf", "*.zip", "*.tar.gz"]

Exclude API endpoints:

toml

[crawler]
exclude = ["/api/**", "/v1/**"]

Exclude query parameters:

toml

[crawler]
exclude = ["*?preview=*", "*?draft=*"]

Common exclusions:

toml

[crawler]
exclude = [
  "/admin/**",
  "/wp-admin/**",
  "/wp-content/**",
  "/api/**",
  "*.pdf",
  "*.zip",
  "*.jpg",
  "*.png",
  "*?preview=*",
  "*?print=*"
]

Pattern Matching Examples

Pattern	Matches	Doesn’t Match
`/blog/*`	`/blog/post`	`/blog/post/comment`
`/blog/**`	`/blog/post`, `/blog/post/comment`	`/about`
`*.pdf`	`/file.pdf`, `/docs/guide.pdf`	`/file.html`
`?preview=`	`/page?preview=true`	`/page`
`/api/*/users`	`/api/v1/users`	`/api/v1/v2/users`

Query Parameters

`allow_query_params`

Type: string[] Default: [] (empty = drop all query params for deduplication)

Query parameters to preserve during URL deduplication.

Why this matters:

URLs are deduplicated before crawling:

/page?id=1&utm_source=google → /page?id=1 (utm dropped)

Without configuration, all query params are dropped except those in allow_query_params.

Examples:

Preserve pagination:

toml

[crawler]
allow_query_params = ["page"]

Preserve filters:

toml

[crawler]
allow_query_params = ["category", "sort", "filter", "q"]

Preserve all query params:

toml

[crawler]
allow_query_params = ["*"]

Use case:

E-commerce site with filters:

toml

[crawler]
allow_query_params = ["category", "price", "brand", "page"]

This preserves:

/products?category=shoes ✓
/products?category=shoes&page=2 ✓

This drops:

/products?utm_source=google ✗ (becomes /products)
/products?gclid=abc123 ✗ (becomes /products)

`drop_query_prefixes`

Type: string[] Default: ["utm_", "gclid", "fbclid"]

Query parameter prefixes to always drop, even if in allow_query_params.

Default tracking params dropped:

utm_* - Google Analytics (utm_source, utm_medium, etc.)
gclid - Google Ads
fbclid - Facebook Ads

Examples:

Drop more tracking params:

toml

[crawler]
drop_query_prefixes = [
  "utm_",
  "gclid",
  "fbclid",
  "mc_",       # Mailchimp
  "_ga",       # Google Analytics
  "ref",       # Referrer
  "source"     # Generic source tracking
]

Drop nothing:

toml

[crawler]
drop_query_prefixes = []

Crawl Strategy

`breadth_first`

Type: boolean Default: true

Use breadth-first crawling for better site coverage.

Breadth-first (default):

Crawls level-by-level
Discovers homepage, then all links from homepage, then all links from those pages
Better site coverage
Avoids getting stuck in deep paths

Depth-first (false):

Crawls as deep as possible before backtracking
Can get stuck in deep sections
Less even coverage

Example:

Disable breadth-first:

toml

[crawler]
breadth_first = false

Recommendation: Keep true (default) for most sites.

`max_prefix_budget`

Type: number Default: 0.25 (25%) Range: 0.1 to 1.0

Maximum percentage of crawl budget for any single path prefix.

Prevents the crawler from spending all pages on one section (e.g., /blog/ with 1000+ posts).

How it works:

With max_pages = 500 and max_prefix_budget = 0.25:

At most 125 pages (25%) from any single path prefix
Ensures diverse coverage across site sections

Examples:

More strict (better coverage):

toml

[crawler]
max_prefix_budget = 0.15  # Max 15% per prefix

More lenient (deeper coverage):

toml

[crawler]
max_prefix_budget = 0.5   # Max 50% per prefix

Disable budget (not recommended):

toml

[crawler]
max_prefix_budget = 1.0   # No limit

Use case:

Site with large blog:

/blog/post-1
/blog/post-2
...
/blog/post-5000
/about
/contact

With max_prefix_budget = 0.25 and max_pages = 500:

At most 125 blog posts crawled
Remaining budget for other sections

Request Configuration

`user_agent`

Type: string Default: "" (empty = random browser user agent per crawl)

Custom user agent string.

Default behavior (empty string):

Random modern browser user agent, refreshed per crawl:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36

Examples:

Custom user agent:

toml

[crawler]
user_agent = "SquirrelScan Bot (https://squirrelscan.com)"

Mobile user agent:

toml

[crawler]
user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 18_0 like Mac OS X) AppleWebKit/605.1.15"

Recommendation: Leave empty (default) for best results with bot protection.

`headers`

Type: table (map of header name → value) Default: {} (no custom headers)

Custom HTTP request headers attached to every crawl request — pages, assets, robots.txt, sitemaps, llms.txt, and markdown probes. Use this to authorize the crawler with schemes that require signed headers, such as Shopify / Cloudflare Web Bot Auth.

toml

[crawler]
headers = { "Signature-Agent" = "\"https://shopify.com\"", "Signature-Input" = "...", "Signature" = "..." }

CLI override:

bash

squirrel audit https://example.com \
  -H 'Signature-Agent: "https://shopify.com"' \
  -H 'Signature-Input: sig1=("@authority");keyid="..."' \
  -H 'Signature: sig1=:...:'

The repeatable --header / -H flag takes a Name: Value string (split on the first colon, so values may contain colons). CLI headers merge over the TOML [crawler] headers map — a flag with the same name wins. Quoting is preserved verbatim, so Signature-Agent: "https://shopify.com" keeps its quotes end-to-end.

Cloud audits: applying custom headers to dashboard cloud audits is a Pro feature. Set them per-website under Settings → Crawl → Custom request headers; free plans receive 403 upgrade_required. The render worker applies them to the headless browser via setExtraHTTPHeaders, so they ride every request the rendered page makes.

`follow_redirects`

Type: boolean Default: true

Follow HTTP 3xx redirects.

When true (default):

Follows redirects automatically
Crawls final destination URL
Redirect chains tracked for analysis

When false:

Stops at redirect
Does not fetch redirect destination
Useful for debugging redirect issues

Example:

Disable redirect following:

toml

[crawler]
follow_redirects = false

Recommendation: Keep true (default) for normal audits.

Robots.txt

`respect_robots`

Type: boolean Default: true

Obey robots.txt rules and crawl-delay directives.

When true (default):

Fetches and parses robots.txt
Respects Disallow: rules
Honors Crawl-delay: directive
Polite and ethical

When false:

Ignores robots.txt
Crawls all URLs (including disallowed)
Use only for your own sites

Example:

Ignore robots.txt (testing only):

toml

[crawler]
respect_robots = false

Recommendation: Always keep true (default) when crawling third-party sites.

Crawling politely & re-scanning efficiently

squirrelscan is designed to be a good guest on the sites it audits. Two mechanisms keep load low: politeness controls (how fast it fetches) and incremental re-scanning (avoiding re-downloading pages that haven’t changed).

`incremental`

Type: boolean Default: true

Re-scan only pages that changed since the last crawl. When enabled, the crawler records each page’s ETag and Last-Modified response headers (plus a content hash) and, on the next audit of the same site, sends conditional requests (If-None-Match / If-Modified-Since). Unchanged pages return a lightweight 304 Not Modified — no body is transferred and the cached content is reused.

This is the answer to “incremental vs full scans so I don’t overwhelm the host”: incremental crawling is on by default, so repeat audits re-download only what actually changed.

First audit of a site: there’s nothing cached yet, so every page is fetched in full — incremental has no effect on a cold run.
Repeat audits: unchanged pages come back as 304s, cutting bandwidth and load on the target server.

toml

[crawler]
incremental = true

CLI overrides:

bash

# Force incremental on for this run — overrides a project config with incremental = false
squirrel audit https://example.com --incremental

# Disable conditional requests — fetch every page in full
squirrel audit https://example.com --no-incremental

# Full re-scan, ignoring all cached content (alias for a cold crawl)
squirrel audit https://example.com --refresh

The resolution order is --refresh (always full) > --incremental / --no-incremental > [crawler] incremental (default true). --refresh also ignores the cross-audit freshness cache (see use_cache_control), so use it when you want a guaranteed clean re-fetch.

Politeness checklist

For a gentle crawl of a third-party site, combine incremental re-scanning with the rate-limit and robots controls documented above:

respect_robots — obey robots.txt and its Crawl-delay (keep true).
per_host_delay_ms — space out requests to a single host.
per_host_concurrency — cap simultaneous requests per host.
delay_ms / concurrency — global pacing.
incremental — skip re-downloading unchanged pages on repeat audits.

A Crawl-delay directive in the target’s robots.txt always overrides per_host_delay_ms, so sites that ask for a slower rate are honored automatically.

What squirrelscan does not access

squirrelscan crawls public HTTP only — the same pages a browser or search engine bot would see. It does not read server logs, databases, or any authenticated/internal source, and it requires no access beyond what the public site serves. Scanning protected or internal sources is not currently supported; if you need it, please open an issue describing your use case so it can be scoped separately.

Complete Examples

Fast Local Development

toml

[crawler]
max_pages = 50
delay_ms = 0
per_host_delay_ms = 0
concurrency = 10
respect_robots = false

Polite Production Crawl

toml

[crawler]
max_pages = 500
delay_ms = 200
per_host_delay_ms = 500
concurrency = 5
per_host_concurrency = 2
respect_robots = true

High-Volume Crawl

toml

[crawler]
max_pages = 2000
delay_ms = 100
per_host_delay_ms = 200
concurrency = 10
per_host_concurrency = 3
breadth_first = true
max_prefix_budget = 0.2

Focused Blog Crawl

toml

[crawler]
max_pages = 200
include = ["/blog/**"]
exclude = ["*.pdf", "/blog/drafts/**"]
allow_query_params = ["page"]

E-commerce Site

toml

[crawler]
max_pages = 1000
include = ["/products/**", "/categories/**"]
exclude = ["/cart/**", "/checkout/**", "/account/**"]
allow_query_params = ["category", "sort", "page", "filter"]
drop_query_prefixes = ["utm_", "gclid", "fbclid", "ref"]

Project Settings - Domains configuration
Rules Configuration - Which rules to run
Examples - More configuration examples

Edit this page

Configuration

Crawl Limits

max_pages

max_depth

timeout_ms

Rate Limiting

delay_ms

per_host_delay_ms

concurrency

per_host_concurrency

URL Filtering

include

exclude

Pattern Matching Examples

Query Parameters

allow_query_params

drop_query_prefixes

Crawl Strategy

breadth_first

max_prefix_budget

Request Configuration

user_agent

headers

follow_redirects

Robots.txt

respect_robots

Crawling politely & re-scanning efficiently

incremental

Politeness checklist

What squirrelscan does not access

Complete Examples

Fast Local Development

Polite Production Crawl

High-Volume Crawl

Focused Blog Crawl

E-commerce Site

Related

`max_pages`

`max_depth`

`timeout_ms`

`delay_ms`

`per_host_delay_ms`

`concurrency`

`per_host_concurrency`

`include`

`exclude`

`allow_query_params`

`drop_query_prefixes`

`breadth_first`

`max_prefix_budget`

`user_agent`

`headers`

`follow_redirects`

`respect_robots`

`incremental`