AI discovery depends on two things: whether crawlers can access your content reliably, and whether they can extract usable, well‑structured text. This guide explains how to recognize AI crawlers, grant the right access, protect your origin, and ship HTML that large language models can parse.
What counts as an AI crawler?
Two broad categories:
- Retrieval and answer engines that index pages for conversational answers (e.g., assistants and aggregators)
- Training or research crawlers that gather web content for model training or evaluation
Treat both with care—optimize for retrieval visibility while controlling cost and compliance for training bots.
Common bots (identify and handle responsibly)
Bot | Primary purpose | User-Agent token (example) | Robots.txt honored? | Notes
---|---|---|---|---
OpenAI (GPT) | Retrieval/training | GPTBot | Commonly | Use robots.txt and allowlists; published IP ranges support verification
Perplexity | Retrieval | PerplexityBot | Commonly | Provide stable HTML; watch for bursty crawl patterns
Anthropic/Claude | Retrieval | ClaudeBot, Claude-Web | Commonly | Treat as separate agents; verify UA and IP where possible
Common Crawl | Research/training | CCBot | Yes | Consider disallowing if you don't want training reuse
Google (Search/AI Overviews) | Web indexing/AI Overviews | Googlebot, GoogleOther | Yes | AI Overviews relies on standard Google systems
Google (training control) | Training control | Google-Extended | Yes | Controls training reuse; not a content discovery bot
User‑Agents and behavior evolve—log and verify periodically.
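As a starting point for that periodic verification, a minimal log-scan sketch can show which AI crawlers are actually hitting a site. This assumes combined-format access logs; the file path and token list are placeholders to adjust for your own setup.

```python
import re
from collections import Counter

# Placeholder path to a combined-format access log.
LOG_PATH = "access.log"

# UA tokens from the table above; extend as the ecosystem changes.
# Note: Google-Extended is a robots.txt token only and never appears as a User-Agent.
AI_TOKENS = ["GPTBot", "PerplexityBot", "ClaudeBot", "Claude-Web", "CCBot", "GoogleOther"]

# In combined log format the last quoted field is the User-Agent.
ua_pattern = re.compile(r'"[^"]*"\s+"(?P<ua>[^"]*)"\s*$')

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as f:
    for line in f:
        match = ua_pattern.search(line)
        if not match:
            continue
        ua = match.group("ua")
        for token in AI_TOKENS:
            if token in ua:
                counts[token] += 1
                break

for token, hits in counts.most_common():
    print(f"{token}: {hits} requests")
```

Running this daily (or feeding the same token list into your log pipeline) gives a baseline before you tune robots.txt or rate limits.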
Access controls: safe by default
Start permissive for public, evergreen content; restrict private, ephemeral, or high‑cost endpoints.
robots.txt (hints, not auth)
```
# Training control example
User-agent: Google-Extended
Disallow: /

# Retrieval bots with scoped access
User-agent: GPTBot
Allow: /
Disallow: /private/

User-agent: PerplexityBot
Allow: /docs/
Disallow: /beta/

# Anthropic (example)
User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

# Catch-all
User-agent: *
Disallow: /admin/
```
Notes:
- robots.txt is advisory; enforce sensitive areas with auth and network controls.
- crawl-delay is inconsistently supported; prefer rate limiting at the edge.
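Before deploying rules like the ones above, you can replay them per User-Agent with Python's standard-library robot parser. A minimal sketch, with example.com URLs as placeholders for your own site:

```python
from urllib.robotparser import RobotFileParser

# Placeholder site; point this at your own robots.txt.
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

# (agent, url) pairs that should match your intended policy.
checks = [
    ("GPTBot", "https://example.com/private/report.html"),
    ("GPTBot", "https://example.com/docs/intro.html"),
    ("PerplexityBot", "https://example.com/beta/new-feature.html"),
    ("Google-Extended", "https://example.com/docs/intro.html"),
]

for agent, url in checks:
    allowed = parser.can_fetch(agent, url)
    print(f"{agent:16} {'ALLOW' if allowed else 'BLOCK'}  {url}")
```

A check like this is cheap enough to run in CI whenever robots.txt changes.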
Meta and headers
Use page‑level controls for fine‑grained rules.
```
<!-- Block indexing but allow following links -->
<meta name="robots" content="noindex,follow" />

<!-- HTTP header variant (server) -->
X-Robots-Tag: noindex, follow
```
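If pages are served from an application rather than static files, the header variant can be attached per route. The sketch below assumes a Flask app and a hypothetical /drafts/ prefix; adapt the matching rule to your own URL scheme.

```python
from flask import Flask, request

app = Flask(__name__)

@app.after_request
def add_robots_header(response):
    # Hypothetical rule: keep draft pages out of indexes but let links be followed.
    if request.path.startswith("/drafts/"):
        response.headers["X-Robots-Tag"] = "noindex, follow"
    return response

@app.route("/drafts/<slug>")
def draft(slug):
    return f"<h1>Draft: {slug}</h1>"
```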
Rate limiting and origin protection
- Prefer adaptive rate limiting at CDN/WAF; return 429 on overload.
- Burst control: per‑UA and per‑ASN thresholds; whitelist verified addresses.
- Use caching for static and semi‑static pages to reduce origin hit rate.
- Monitor 4xx/5xx spikes by UA; alert on abnormal patterns.
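These controls are usually enforced at the CDN or WAF, but the underlying logic is a token bucket keyed by User-Agent (or ASN). A minimal in-process sketch, with made-up thresholds, might look like this:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate: float          # tokens refilled per second
    capacity: float      # maximum burst size
    tokens: float = 0.0
    updated: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Hypothetical per-UA limits: (requests per second, burst capacity).
LIMITS = {"GPTBot": (2.0, 10.0), "PerplexityBot": (2.0, 10.0), "CCBot": (0.5, 5.0)}
buckets: dict[str, TokenBucket] = {}

def check_request(user_agent: str) -> int:
    """Return an HTTP status: 200 to serve, 429 to shed load."""
    for token, (rate, burst) in LIMITS.items():
        if token in user_agent:
            # Create the bucket on first sight of this UA, starting full.
            bucket = buckets.setdefault(token, TokenBucket(rate, burst, tokens=burst))
            return 200 if bucket.allow() else 429
    return 200  # unknown agents fall through to other controls
```

In production the same idea is expressed as WAF rate rules; the sketch is only to make the 429-on-overload behavior concrete.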
Rendering behavior: make HTML extractable
LLMs and answer engines favor content that is present in HTML at response time.
- Server render or pre‑render primary copy; avoid hiding essential text behind JS.
- Keep heading hierarchy clean (H1/H2/H3) and add anchorable sections.
- Use tables for comparisons/specs; include explicit units and labels.
- Provide summary/key‑takeaways blocks and concise definitions near the top.
- Avoid infinite scroll for core docs; use paginated archives and sitemaps.
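One quick way to check the heading-hierarchy point above is to parse the served HTML and print its outline, using only the standard library. A rough sketch, assuming the page is reachable without JavaScript and using a placeholder URL:

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class OutlineParser(HTMLParser):
    """Collects h1-h3 headings in document order."""
    def __init__(self):
        super().__init__()
        self.current = None
        self.outline = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.current = tag

    def handle_data(self, data):
        if self.current and data.strip():
            self.outline.append((self.current, data.strip()))
            self.current = None

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

# Placeholder URL; fetch the raw HTML a crawler would see, with no JS executed.
html = urlopen("https://example.com/docs/intro").read().decode("utf-8", "replace")
parser = OutlineParser()
parser.feed(html)
for level, text in parser.outline:
    print(f"{level}: {text}")
```

If the outline comes back empty or out of order in the raw response, the content is probably being injected client-side.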
Sitemaps and discovery
- Maintain XML sitemaps with lastmod; separate large sections into index sitemaps.
- Include canonical URLs; avoid duplicate parameterized URLs.
- Keep 200‑OK for canonical pages and consistent language alternates.
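A sitemap with lastmod can be generated from whatever page inventory you already maintain. The sketch below uses only the standard library and a hypothetical list of (URL, last-modified date) pairs:

```python
import xml.etree.ElementTree as ET

# Hypothetical inventory of canonical URLs and their last-modified dates.
PAGES = [
    ("https://example.com/docs/intro", "2025-09-30"),
    ("https://example.com/docs/setup", "2025-10-02"),
]

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NS)
for loc, lastmod in PAGES:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```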
Verification checklist
- Reverse DNS + forward confirm for bot IPs where supported
- UA string consistency across requests
- Stable 200 responses for critical docs (no auth walls, no require‑JS to view copy)
- Real‑user fetch test: curl + headless browser snapshot
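The reverse-DNS step in this checklist can be scripted: resolve the claimed bot IP to a hostname, check that the hostname belongs to the vendor's domain, then resolve it forward and confirm it maps back to the same IP. The suffixes below are examples only; confirm them against each vendor's current guidance (some vendors, such as OpenAI, publish IP range lists instead of relying on reverse DNS).

```python
import socket

# Example suffixes (Googlebot-style); verify against vendor documentation.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")

def verify_bot_ip(ip: str) -> bool:
    """Reverse DNS + forward confirm: the hostname must match a trusted
    suffix and resolve back to the original IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not hostname.endswith(TRUSTED_SUFFIXES):
        return False
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except socket.gaierror:
        return False
    return ip in addresses

print(verify_bot_ip("66.249.66.1"))  # an address from a Googlebot range, for illustration
```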
Monitoring the right metrics
- Crawl volume by UA (daily), bytes served, cache hit ratio
- 2xx/4xx/5xx by UA; median TTFB for bots vs users
- Indexation coverage from sitemaps vs actual hits
- Downstream signals: AI mentions/citations and referral patterns
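Most of these metrics fall out of the same parsed log data. As a sketch, assuming records have already been parsed into dicts with ua, status, and cache fields (hypothetical field names):

```python
from collections import defaultdict

def summarize(records):
    """records: iterable of dicts with 'ua', 'status', and 'cache' keys (hypothetical schema)."""
    by_ua = defaultdict(lambda: {"2xx": 0, "4xx": 0, "5xx": 0, "hits": 0, "cache_hits": 0})
    for r in records:
        stats = by_ua[r["ua"]]
        stats["hits"] += 1
        bucket = f"{r['status'] // 100}xx"   # e.g. 200 -> "2xx"
        if bucket in stats:
            stats[bucket] += 1
        if r["cache"] == "HIT":
            stats["cache_hits"] += 1
    for ua, s in sorted(by_ua.items(), key=lambda kv: -kv[1]["hits"]):
        ratio = s["cache_hits"] / s["hits"]
        print(f"{ua}: {s['hits']} reqs, 2xx={s['2xx']} 4xx={s['4xx']} 5xx={s['5xx']}, cache hit ratio={ratio:.0%}")
```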
FAQ
Is crawl-delay reliable?
Not across the board. Some bots respect it; many do not. Use edge rate limiting and caching for reliability.
Should we block training but allow retrieval?
Many brands do. Use Google-Extended and block known training bots while allowing retrieval UAs. Document the policy.
Do we need JS rendering?
Prefer shipping primary content as HTML. Enhance with JS, but ensure no‑JS snapshots still contain the essential text.
Key takeaways
- Control access with layered defenses: robots hints + auth/network + rate limits
- Ship clean, extractable HTML with clear headings, tables, and summaries
- Monitor bot behavior continuously; UA ecosystems and policies change
Last updated: 2025‑10‑08
About the Author
Vladan Ilic
Founder and CEO