Engineering

Field Guide to AI Crawlers: Access, Rate Limits, and Rendering Behavior

Identify AI crawlers, configure safe access, and optimize rendering so your pages are discoverable and extractable by AI answer engines.

October 8, 2025
4 min read
Vladan Ilic
Tags: GEO, AI crawlers, robots.txt, rendering, rate limiting, technical SEO

AI discovery depends on two things: whether crawlers can access your content reliably, and whether they can extract usable, well‑structured text. This guide explains how to recognize AI crawlers, grant the right access, protect your origin, and ship HTML that large language models can parse.

What counts as an AI crawler?

Two broad categories:

  1. Retrieval and answer engines that index pages for conversational answers (e.g., assistants and aggregators)
  2. Training or research crawlers that gather web content for model training or evaluation

Treat both with care—optimize for retrieval visibility while controlling cost and compliance for training bots.

Common bots (identify and handle responsibly)

| Bot | Primary purpose | User‑Agent token (example) | Robots.txt honored? | Notes |
| --- | --- | --- | --- | --- |
| GPT‑related (OpenAI) | Retrieval/training | GPTBot | Commonly | Use robots.txt and allowlists; supports IP verification procedures |
| Perplexity | Retrieval | PerplexityBot | Commonly | Provide stable HTML; watch bursty crawl patterns |
| Anthropic/Claude | Retrieval | ClaudeBot, ClaudeWeb | Commonly | Treat as separate agents; verify UA and IP where possible |
| Common Crawl | Research/training | CCBot | Yes | Consider disallowing if you don't want training reuse |
| Google | Web/Overviews | Googlebot, GoogleOther | Yes | AI Overviews rely on standard Google systems |
| Google (training control) | Training control | Google-Extended | Yes | Controls training reuse; not a content discovery bot |

User‑Agents and behavior evolve—log and verify periodically.

Access controls: safe by default

Start permissive for public, evergreen content; restrict private, ephemeral, or high‑cost endpoints.

robots.txt (hints, not auth)

# Training control example
User-agent: Google-Extended
Disallow: /

# Retrieval bots with scoped access
User-agent: GPTBot
Allow: /
Disallow: /private/

User-agent: PerplexityBot
Allow: /docs/
Disallow: /beta/

# Anthropic (example)
User-agent: ClaudeBot
Allow: /
User-agent: ClaudeWeb
Allow: /

# Catch-all
User-agent: *
Disallow: /admin/

Notes:

  • robots.txt is advisory—enforce sensitive areas with auth and network controls.
  • crawl-delay is inconsistently supported; prefer rate limiting at the edge.

Meta and headers

Use page‑level controls for fine‑grained rules.

<!-- Block indexing but allow following links -->
<meta name="robots" content="noindex,follow" />

<!-- HTTP header variant (server) -->
X-Robots-Tag: noindex, follow
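
If you prefer setting the header in the application rather than the web server, most frameworks expose a response hook. A minimal Python sketch using Flask (the path prefixes are hypothetical examples, not a recommendation):

from flask import Flask, request

app = Flask(__name__)

# Hypothetical prefixes that should not be indexed but may still be crawled for links.
NOINDEX_PREFIXES = ("/beta/", "/internal-notes/")

@app.after_request
def add_robots_header(response):
    """Attach X-Robots-Tag to responses served under no-index paths."""
    if request.path.startswith(NOINDEX_PREFIXES):
        response.headers["X-Robots-Tag"] = "noindex, follow"
    return response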

Rate limiting and origin protection

  • Prefer adaptive rate limiting at the CDN/WAF; return 429 on overload (a per‑UA sketch follows this list).
  • Burst control: per‑UA and per‑ASN thresholds; allowlist verified addresses.
  • Cache static and semi‑static pages to reduce the origin hit rate.
  • Monitor 4xx/5xx spikes by UA; alert on abnormal patterns.
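
Most of this lives in CDN/WAF configuration, but the underlying idea is a per‑UA token bucket. A minimal Python sketch; the bot tokens and budgets below are illustrative assumptions, not recommendations:

import time
from collections import defaultdict

# Hypothetical budgets: (requests per second, burst size) keyed by UA token.
BUDGETS = {"GPTBot": (2.0, 10), "PerplexityBot": (2.0, 10), "CCBot": (0.5, 5)}
DEFAULT_BUDGET = (5.0, 20)

class TokenBucket:
    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = defaultdict(lambda: TokenBucket(*DEFAULT_BUDGET))

def check_request(user_agent: str) -> int:
    """Return 200 if the request may proceed, 429 if the UA is over budget."""
    token = next((t for t in BUDGETS if t in user_agent), "*")
    if token in BUDGETS and token not in buckets:
        buckets[token] = TokenBucket(*BUDGETS[token])
    return 200 if buckets[token].allow() else 429

In production you would key buckets by verified IP range as well and enforce this at the edge rather than in application code.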

Rendering behavior: make HTML extractable

LLMs and answer engines favor content that is present in HTML at response time.

  • Server render or pre‑render primary copy; avoid hiding essential text behind JS (a quick no‑JS check follows this list).
  • Keep heading hierarchy clean (H1/H2/H3) and add anchorable sections.
  • Use tables for comparisons/specs; include explicit units and labels.
  • Provide summary/key‑takeaways blocks and concise definitions near the top.
  • Avoid infinite scroll for core docs; use paginated archives and sitemaps.
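
A quick way to audit the first point: fetch the raw HTML without executing JavaScript and confirm the headings and essential copy are already present. A sketch using the requests package (the URL and phrases are placeholders):

import re
import requests  # third-party; pip install requests

def extractable_without_js(url: str, must_contain: list[str]) -> bool:
    """Fetch server HTML without running JS and check that key copy is present."""
    html = requests.get(url, headers={"User-Agent": "render-audit/1.0"}, timeout=10).text
    headings = re.findall(r"<h[1-3][^>]*>(.*?)</h[1-3]>", html, flags=re.I | re.S)
    has_headings = len(headings) >= 2  # expect at least an H1 plus one section heading
    has_copy = all(phrase.lower() in html.lower() for phrase in must_contain)
    return has_headings and has_copy

print(extractable_without_js("https://example.com/docs/pricing", ["per seat", "annual billing"]))

If this returns False but the page looks fine in a browser, the copy is being injected client-side and is at risk of being invisible to extraction.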

Sitemaps and discovery

  • Maintain XML sitemaps with lastmod; split large sections into index sitemaps (a generator sketch follows this list).
  • Include canonical URLs; avoid duplicate parameterized URLs.
  • Serve 200 OK for canonical pages and keep language alternates consistent.
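
Generating the sitemap from your content store keeps lastmod dates honest. A minimal generator sketch using only the Python standard library (URLs and dates are placeholders):

from datetime import date
from xml.etree.ElementTree import Element, SubElement, tostring

def build_sitemap(pages: list[tuple[str, date]]) -> bytes:
    """Build a minimal XML sitemap with a <lastmod> for each canonical URL."""
    urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url, last_modified in pages:
        entry = SubElement(urlset, "url")
        SubElement(entry, "loc").text = url
        SubElement(entry, "lastmod").text = last_modified.isoformat()
    return tostring(urlset, encoding="utf-8", xml_declaration=True)

print(build_sitemap([
    ("https://example.com/docs/getting-started", date(2025, 10, 1)),
    ("https://example.com/docs/api", date(2025, 10, 8)),
]).decode())

Large sections would go into separate files referenced from a sitemap index; the pattern is the same.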

Verification checklist

  • Reverse DNS + forward confirm for bot IPs where the operator supports it (a sketch follows this list)
  • UA string consistency across requests
  • Stable 200 responses for critical docs (no auth walls, no JS required to read the copy)
  • Real‑user fetch test: compare a curl response with a headless‑browser snapshot
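
The reverse DNS pattern is: resolve the IP to a hostname, check the hostname suffix, then resolve the hostname back and confirm it includes the original IP. A Python sketch for bots that support rDNS verification; Googlebot documents the googlebot.com/google.com suffixes, while other operators may publish IP ranges instead, so treat the suffix list as an assumption to confirm per vendor:

import socket

VERIFIED_SUFFIXES = (".googlebot.com", ".google.com")  # confirm against each vendor's docs

def verify_bot_ip(ip: str, suffixes=VERIFIED_SUFFIXES) -> bool:
    """Reverse-resolve the IP, check the hostname suffix, then forward-confirm it."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS
    except OSError:
        return False
    if not hostname.endswith(suffixes):
        return False
    try:
        forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except OSError:
        return False
    return ip in forward_ips  # forward confirmation

print(verify_bot_ip("66.249.66.1"))  # an address in Googlebot's published crawl range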

Monitoring the right metrics

  • Crawl volume by UA (daily), bytes served, cache hit ratio (a log‑parsing sketch follows this list)
  • 2xx/4xx/5xx by UA; median TTFB for bots vs. users
  • Indexation coverage from sitemaps vs. actual crawl hits
  • Downstream signals: AI mentions/citations and referral patterns
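
Most of these can be computed straight from access logs. A Python sketch that assumes a combined log format; adjust the regex, bot tokens, and log path to your setup:

import re
from collections import Counter

LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "[^"]*" "(?P<ua>[^"]*)"'
)
BOT_TOKENS = ("GPTBot", "PerplexityBot", "ClaudeBot", "CCBot", "Googlebot")

def crawl_stats(log_path: str):
    """Count requests and 4xx/5xx responses per AI-crawler UA token."""
    hits, errors = Counter(), Counter()
    with open(log_path) as handle:
        for line in handle:
            match = LOG_LINE.search(line)
            if not match:
                continue
            token = next((t for t in BOT_TOKENS if t in match["ua"]), None)
            if token:
                hits[token] += 1
                if match["status"][0] in "45":
                    errors[token] += 1
    return hits, errors

print(crawl_stats("/var/log/nginx/access.log"))  # path is illustrative

Bytes served and cache hit ratio come from the same logs if your edge writes response size and cache status fields.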

FAQ

Is crawl-delay reliable?

Not across the board. Some bots respect it; many do not. Use edge rate limiting and caching for reliability.

Should we block training but allow retrieval?

Many brands do. Use Google-Extended and block known training bots while allowing retrieval UAs. Document the policy.

Do we need JS rendering?

Prefer shipping primary content as HTML. Enhance with JS, but ensure no‑JS snapshots still contain the essential text.

Key takeaways

  • Control access with layered defenses: robots hints + auth/network + rate limits
  • Ship clean, extractable HTML with clear headings, tables, and summaries
  • Monitor bot behavior continuously; UA ecosystems and policies change

Last updated: 2025‑10‑08

Published on October 8, 2025

About the Author


Vladan Ilic

Founder and CEO
