How we read your site (so your customers don't have to).
We read every page on the URLs you authorise, respect every 'no AI' header you set, and quietly skip pages that haven't changed since last time. No flooding. No surprises. No 'why is your bot crawling our internal staging?' email.
JavaScript pages render properly. Static pages stay cheap.
Most chatbot platforms take a shortcut and use a basic page-fetcher. That works fine on plain HTML, but most modern docs sites — Mintlify, Docusaurus, GitBook, anything React-rendered — need JavaScript to render their actual content. Use the cheap fetcher and the bot ends up with a half-empty corpus and answers like "I’m sorry, I couldn’t find anything about that."
Ours uses a managed headless browser, the same way Google’s crawler does. Single-page apps, dynamic FAQs, conditional content — all read the way a human visitor would see them. You get the content you actually published, not the framework boilerplate.
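If you want a feel for the two paths, here's a minimal sketch using Playwright for the rendered case and a plain HTTP fetch for the static one. The function names, user-agent string, and the choice of Playwright are illustrative assumptions, not a description of our managed rendering setup.

```python
# Sketch: render JS-heavy pages with a headless browser, use a cheap HTTP
# fetch for static pages. Illustrative only; names and the library choice
# are assumptions, not our production code.
import requests
from playwright.sync_api import sync_playwright

USER_AGENT = "ExampleBot/1.0"  # illustrative user-agent string

def fetch_static(url: str) -> str:
    """Cheap path: plain HTTP GET, fine for server-rendered HTML."""
    resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    resp.raise_for_status()
    return resp.text

def fetch_rendered(url: str) -> str:
    """Expensive path: headless Chromium, waits for the SPA to settle."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(user_agent=USER_AGENT)
        page.goto(url, wait_until="networkidle")
        html = page.content()  # the fully rendered DOM, not the JS shell
        browser.close()
    return html
```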
We respect every "no AI" signal you set.
Every page request honors your robots.txt, your X-Robots-Tag header, and the newer Content Signals header that lets you say "this page is fine for search but not for AI ingestion." If you say no to a page, we don't fetch it. If a page sets a noindex header, we skip it. If your robots.txt blocks a directory, we don't go there.
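For the technically curious, the checks stack roughly like the sketch below. The robots.txt and X-Robots-Tag handling follow their standard semantics; the Content-Signal header name and the way it's parsed here are placeholders, since that convention is newer and its exact grammar may differ.

```python
# Sketch of the per-URL permission checks, in the order described above.
# robots.txt and X-Robots-Tag use standard semantics; the "Content-Signal"
# header name and its parsing are assumptions.
from urllib import robotparser
from urllib.parse import urlparse
import requests

USER_AGENT = "ExampleBot/1.0"  # illustrative user-agent string

def may_ingest(url: str) -> bool:
    parts = urlparse(url)

    # 1. robots.txt: a disallowed path is never requested at all.
    rp = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return False

    # 2. X-Robots-Tag: a noindex header means the page is skipped.
    #    (A HEAD request keeps the check cheap; real servers vary.)
    resp = requests.head(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
        return False

    # 3. Content Signals: "fine for search, not for AI ingestion" opts out.
    #    Placeholder parse; the real grammar may differ.
    signal = resp.headers.get("Content-Signal", "").lower()
    if signal and "ai" not in signal:
        return False
    return True
```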
Every fetch is logged with what we requested, what we got back, when, and which user-agent. When your security team or a procurement reviewer asks "prove you only fetched the URLs we approved" — we hand them a signed log.
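A log entry could look something like this sketch, with an HMAC signature over the record. The field names and the signing scheme are illustrative assumptions, not the exact format we hand to reviewers.

```python
# Sketch of a signed audit-log entry for one fetch. Field names and the
# HMAC-based signing are assumptions; the point is a verifiable record.
import hashlib, hmac, json, time

SIGNING_KEY = b"replace-with-a-real-secret"  # hypothetical key

def log_fetch(url: str, status: int, user_agent: str, outcome: str) -> dict:
    entry = {
        "url": url,
        "status": status,
        "user_agent": user_agent,
        "outcome": outcome,              # e.g. FETCHED, SKIPPED_ROBOTS, ...
        "fetched_at": int(time.time()),  # when we made the request
    }
    # Signature covers every field above, so the record can be verified later.
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return entry
```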
Recrawls don't waste your bandwidth.
Most pages on your site don’t change between recrawls. We track which pages have changed and which haven’t — the ones that haven’t are skipped before they hit our pipeline. For a typical 10,000-page docs site that updates 2% of pages monthly, that turns a $15 recrawl into a $0.30 recrawl. The savings flow back to you as overage headroom.
Practically: if your docs team ships an update on Tuesday, by Tuesday afternoon the bot is answering with the new content — without us re-reading the 9,800 pages that haven’t changed.
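In rough terms, the change check is a conditional GET plus a content hash, along these lines. The function shape and the way the previous ETag and hash are stored are assumptions for illustration.

```python
# Sketch of the change-detection step: a conditional GET with the ETag we
# stored last time, plus a content-hash comparison as a fallback.
import hashlib
import requests

def recrawl(url: str, last_etag: str | None, last_hash: str | None):
    headers = {"User-Agent": "ExampleBot/1.0"}
    if last_etag:
        headers["If-None-Match"] = last_etag  # ask the origin "changed since?"
    resp = requests.get(url, headers=headers, timeout=30)

    if resp.status_code == 304:
        return None  # unchanged: never enters the cleaning pipeline
    body_hash = hashlib.sha256(resp.content).hexdigest()
    if body_hash == last_hash:
        return None  # headers changed but the content is byte-identical
    return resp.text, resp.headers.get("ETag"), body_hash
```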
We won't get your domain rate-limited.
Behind the scenes, the crawler enforces a polite request rate per origin. If your site returns a "slow down" response, we back off automatically. There’s no flooding your customer-facing production site, no silent failures, no "we crawled 80% and just gave up on the rest."
Translation: we’ve never gotten a customer’s domain blocked by their CDN. We’d rather take an extra hour to crawl cleanly than ten minutes to crawl badly.
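In spirit, the politeness logic is a per-origin delay plus backoff when the origin answers with a 429, as in this simplified sketch. The delay value and the naive retry are placeholders, not our production settings.

```python
# Sketch of per-origin politeness: a minimum delay between requests to the
# same host, plus backoff when the origin says "slow down" (HTTP 429).
import time
from urllib.parse import urlparse
import requests

MIN_DELAY = 1.0                    # illustrative seconds between hits per origin
_last_hit: dict[str, float] = {}   # origin -> timestamp of the last request

def polite_get(url: str) -> requests.Response:
    origin = urlparse(url).netloc
    wait = MIN_DELAY - (time.monotonic() - _last_hit.get(origin, 0.0))
    if wait > 0:
        time.sleep(wait)           # respect the per-origin spacing

    resp = requests.get(url, headers={"User-Agent": "ExampleBot/1.0"}, timeout=30)
    _last_hit[origin] = time.monotonic()

    if resp.status_code == 429:    # origin asked us to slow down
        retry_after = float(resp.headers.get("Retry-After", 60))
        time.sleep(retry_after)
        return polite_get(url)     # naive retry; real code caps attempts
    return resp
```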
See it in action.
Five sample URLs, five different outcomes. Click through to see how the crawler handles a normal page, a page that's already cached, a page blocked by robots.txt, a page with a noindex header, and a page where the origin's Content Signals declaration says not for AI.
https://docs.example.com/getting-started (FETCHED)
- 01 robots.txt: pass. User-agent: * allowed on /docs/
- 02 X-Robots-Tag: pass. Header absent (default permissive).
- 03 Content Signals: pass. crawlPurposes includes ai-search.
- 04 Conditional GET: skip. First fetch, no ETag yet.
Final: Body retrieved (200). Hashed. Queued for cleaning.
Every URL we touch ends in one of these five outcomes. The decision log for every fetch is queryable in your admin panel.
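If it helps to picture that log, the five outcomes map onto something like the enum below. The names are illustrative, chosen to match the sample checks above rather than copied from our schema.

```python
# Sketch of the five terminal outcomes as they might appear in the decision
# log. Names are illustrative placeholders.
from enum import Enum

class FetchOutcome(Enum):
    FETCHED = "fetched"                      # body retrieved, hashed, queued for cleaning
    SKIPPED_UNCHANGED = "skipped_unchanged"  # already cached; conditional GET said not modified
    BLOCKED_ROBOTS = "blocked_robots"        # robots.txt disallows the path
    SKIPPED_NOINDEX = "skipped_noindex"      # X-Robots-Tag: noindex on the response
    SKIPPED_CONTENT_SIGNALS = "skipped_content_signals"  # origin opted out of AI ingestion
```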