AI discovery depends on two things: whether crawlers can access your content reliably, and whether they can extract usable, well‑structured text. This guide explains how to recognize AI crawlers, grant the right access, protect your origin, and ship HTML that large language models can parse.
What counts as an AI crawler?
Two broad categories:
- Retrieval and answer engines that index pages for conversational answers (e.g., assistants and aggregators)
- Training or research crawlers that gather web content for model training or evaluation
Treat both with care—optimize for retrieval visibility while controlling cost and compliance for training bots.
Common bots (identify and handle responsibly)
Bot | Primary purpose | User-Agent token (example) | Robots.txt honored? | Notes
---|---|---|---|---
OpenAI (GPT) | Retrieval/training | GPTBot | Commonly | Use robots.txt and allowlists; published IP ranges support verification
Perplexity | Retrieval | PerplexityBot | Commonly | Provide stable HTML; watch for bursty crawl patterns
Anthropic/Claude | Retrieval | ClaudeBot, Claude-Web | Commonly | Treat as separate agents; verify UA and IP where possible
Common Crawl | Research/training | CCBot | Yes | Consider disallowing if you don't want training reuse
Google (Search/AI Overviews) | Web indexing/AI Overviews | Googlebot, GoogleOther | Yes | AI Overviews relies on standard Google systems
Google (training control) | Training control | Google-Extended | Yes | Controls training reuse; not a content discovery bot
User‑Agents and behavior evolve—log and verify periodically.
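As a starting point for that periodic verification, a minimal log-scan sketch can show which AI crawlers are actually hitting a site. This assumes combined-format access logs; the file path and token list are placeholders to adjust for your own setup.

```python
import re
from collections import Counter

# Placeholder path to a combined-format access log.
LOG_PATH = "access.log"

# UA tokens from the table above; extend as the ecosystem changes.
# Note: Google-Extended is a robots.txt token only and never appears as a User-Agent.
AI_TOKENS = ["GPTBot", "PerplexityBot", "ClaudeBot", "Claude-Web", "CCBot", "GoogleOther"]

# In combined log format the last quoted field is the User-Agent.
ua_pattern = re.compile(r'"[^"]*"\s+"(?P<ua>[^"]*)"\s*$')

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as f:
    for line in f:
        match = ua_pattern.search(line)
        if not match:
            continue
        ua = match.group("ua")
        for token in AI_TOKENS:
            if token in ua:
                counts[token] += 1
                break

for token, hits in counts.most_common():
    print(f"{token}: {hits} requests")
```

Running this daily (or feeding the same token list into your log pipeline) gives a baseline before you tune robots.txt or rate limits.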
Access controls: safe by default
Start permissive for public, evergreen content; restrict private, ephemeral, or high‑cost endpoints.
robots.txt (hints, not auth)
```
# Training control example
User-agent: Google-Extended
Disallow: /

# Retrieval bots with scoped access
User-agent: GPTBot
Allow: /
Disallow: /private/

User-agent: PerplexityBot
Allow: /docs/
Disallow: /beta/

# Anthropic (example)
User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

# Catch-all
User-agent: *
Disallow: /admin/
```
Notes:
- robots.txt is advisory; enforce sensitive areas with auth and network controls.
- crawl-delay is inconsistently supported; prefer rate limiting at the edge.
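Before deploying rules like the ones above, you can replay them per User-Agent with Python's standard-library robot parser. A minimal sketch, with example.com URLs as placeholders for your own site:

```python
from urllib.robotparser import RobotFileParser

# Placeholder site; point this at your own robots.txt.
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

# (agent, url) pairs that should match your intended policy.
checks = [
    ("GPTBot", "https://example.com/private/report.html"),
    ("GPTBot", "https://example.com/docs/intro.html"),
    ("PerplexityBot", "https://example.com/beta/new-feature.html"),
    ("Google-Extended", "https://example.com/docs/intro.html"),
]

for agent, url in checks:
    allowed = parser.can_fetch(agent, url)
    print(f"{agent:16} {'ALLOW' if allowed else 'BLOCK'}  {url}")
```

A check like this is cheap enough to run in CI whenever robots.txt changes.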
Meta and headers
Use page‑level controls for fine‑grained rules.
```
<!-- Block indexing but allow following links -->
<meta name="robots" content="noindex,follow" />

<!-- HTTP header variant (server) -->
X-Robots-Tag: noindex, follow
```
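If pages are served from an application rather than static files, the header variant can be attached per route. The sketch below assumes a Flask app and a hypothetical /drafts/ prefix; adapt the matching rule to your own URL scheme.

```python
from flask import Flask, request

app = Flask(__name__)

@app.after_request
def add_robots_header(response):
    # Hypothetical rule: keep draft pages out of indexes but let links be followed.
    if request.path.startswith("/drafts/"):
        response.headers["X-Robots-Tag"] = "noindex, follow"
    return response

@app.route("/drafts/<slug>")
def draft(slug):
    return f"<h1>Draft: {slug}</h1>"
```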
Rate limiting and origin protection
- Prefer adaptive rate limiting at CDN/WAF; return 429 on overload.
- Burst control: per‑UA and per‑ASN thresholds; whitelist verified addresses.
- Use caching for static and semi‑static pages to reduce origin hit rate.
- Monitor 4xx/5xx spikes by UA; alert on abnormal patterns.
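These controls are usually enforced at the CDN or WAF, but the underlying logic is a token bucket keyed by User-Agent (or ASN). A minimal in-process sketch, with made-up thresholds, might look like this:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate: float          # tokens refilled per second
    capacity: float      # maximum burst size
    tokens: float = 0.0
    updated: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Hypothetical per-UA limits: (requests per second, burst capacity).
LIMITS = {"GPTBot": (2.0, 10.0), "PerplexityBot": (2.0, 10.0), "CCBot": (0.5, 5.0)}
buckets: dict[str, TokenBucket] = {}

def check_request(user_agent: str) -> int:
    """Return an HTTP status: 200 to serve, 429 to shed load."""
    for token, (rate, burst) in LIMITS.items():
        if token in user_agent:
            # Create the bucket on first sight of this UA, starting full.
            bucket = buckets.setdefault(token, TokenBucket(rate, burst, tokens=burst))
            return 200 if bucket.allow() else 429
    return 200  # unknown agents fall through to other controls
```

In production the same idea is expressed as WAF rate rules; the sketch is only to make the 429-on-overload behavior concrete.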
Rendering behavior: make HTML extractable
LLMs and answer engines favor content that is present in HTML at response time.
- Server render or pre‑render primary copy; avoid hiding essential text behind JS.
- Keep heading hierarchy clean (H1/H2/H3) and add anchorable sections.
- Use tables for comparisons/specs; include explicit units and labels.
- Provide summary/key‑takeaways blocks and concise definitions near the top.
- Avoid infinite scroll for core docs; use paginated archives and sitemaps.
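One quick way to check the heading-hierarchy point above is to parse the served HTML and print its outline, using only the standard library. A rough sketch, assuming the page is reachable without JavaScript and using a placeholder URL:

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class OutlineParser(HTMLParser):
    """Collects h1-h3 headings in document order."""
    def __init__(self):
        super().__init__()
        self.current = None
        self.outline = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.current = tag

    def handle_data(self, data):
        if self.current and data.strip():
            self.outline.append((self.current, data.strip()))
            self.current = None

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

# Placeholder URL; fetch the raw HTML a crawler would see, with no JS executed.
html = urlopen("https://example.com/docs/intro").read().decode("utf-8", "replace")
parser = OutlineParser()
parser.feed(html)
for level, text in parser.outline:
    print(f"{level}: {text}")
```

If the outline comes back empty or out of order in the raw response, the content is probably being injected client-side.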
Sitemaps and discovery
- Maintain XML sitemaps with lastmod; separate large sections into index sitemaps.
- Include canonical URLs; avoid duplicate parameterized URLs.
- Keep 200‑OK for canonical pages and consistent language alternates.
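A sitemap with lastmod can be generated from whatever page inventory you already maintain. The sketch below uses only the standard library and a hypothetical list of (URL, last-modified date) pairs:

```python
import xml.etree.ElementTree as ET

# Hypothetical inventory of canonical URLs and their last-modified dates.
PAGES = [
    ("https://example.com/docs/intro", "2025-09-30"),
    ("https://example.com/docs/setup", "2025-10-02"),
]

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NS)
for loc, lastmod in PAGES:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```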
Verification checklist
- Reverse DNS + forward confirm for bot IPs where supported
- UA string consistency across requests
- Stable 200 responses for critical docs (no auth walls, no require‑JS to view copy)
- Real‑user fetch test: curl + headless browser snapshot
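The reverse-DNS step in this checklist can be scripted: resolve the claimed bot IP to a hostname, check that the hostname belongs to the vendor's domain, then resolve it forward and confirm it maps back to the same IP. The suffixes below are examples only; confirm them against each vendor's current guidance (some vendors, such as OpenAI, publish IP range lists instead of relying on reverse DNS).

```python
import socket

# Example suffixes (Googlebot-style); verify against vendor documentation.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")

def verify_bot_ip(ip: str) -> bool:
    """Reverse DNS + forward confirm: the hostname must match a trusted
    suffix and resolve back to the original IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not hostname.endswith(TRUSTED_SUFFIXES):
        return False
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except socket.gaierror:
        return False
    return ip in addresses

print(verify_bot_ip("66.249.66.1"))  # an address from a Googlebot range, for illustration
```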
Monitoring the right metrics
- Crawl volume by UA (daily), bytes served, cache hit ratio
- 2xx/4xx/5xx by UA; median TTFB for bots vs users
- Indexation coverage from sitemaps vs actual hits
- Downstream signals: AI mentions/citations and referral patterns
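Most of these metrics fall out of the same parsed log data. As a sketch, assuming records have already been parsed into dicts with ua, status, and cache fields (hypothetical field names):

```python
from collections import defaultdict

def summarize(records):
    """records: iterable of dicts with 'ua', 'status', and 'cache' keys (hypothetical schema)."""
    by_ua = defaultdict(lambda: {"2xx": 0, "4xx": 0, "5xx": 0, "hits": 0, "cache_hits": 0})
    for r in records:
        stats = by_ua[r["ua"]]
        stats["hits"] += 1
        bucket = f"{r['status'] // 100}xx"   # e.g. 200 -> "2xx"
        if bucket in stats:
            stats[bucket] += 1
        if r["cache"] == "HIT":
            stats["cache_hits"] += 1
    for ua, s in sorted(by_ua.items(), key=lambda kv: -kv[1]["hits"]):
        ratio = s["cache_hits"] / s["hits"]
        print(f"{ua}: {s['hits']} reqs, 2xx={s['2xx']} 4xx={s['4xx']} 5xx={s['5xx']}, cache hit ratio={ratio:.0%}")
```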
FAQ
Is crawl-delay reliable?
Not across the board. Some bots respect it; many do not. Use edge rate limiting and caching for reliability.
Should we block training but allow retrieval?
Many brands do. Use Google-Extended and block known training bots while allowing retrieval UAs. Document the policy.
Do we need JS rendering?
Prefer shipping primary content as HTML. Enhance with JS, but ensure no‑JS snapshots still contain the essential text.
Key takeaways
- Control access with layered defenses: robots hints + auth/network + rate limits
- Ship clean, extractable HTML with clear headings, tables, and summaries
- Monitor bot behavior continuously; UA ecosystems and policies change
Last updated: 2025‑10‑08
About the Author
Vladan Ilic
Founder and CEO