Defending the Crawl: The Infrastructure Layer That Decides Whether AI Engines Can Reach You

How to let the right AI crawlers in and control load. A reference on GPTBot, ClaudeBot, PerplexityBot, robots.txt for AI, server latency, and SSR for AI bots.

Elizabeth S.

Founder 3 June 2026 7 min read

In this article

01 Which AI crawlers actually visit, and what each one feeds
02 How do I let the right bots in while controlling load?
03 Why latency decides whether you get cited
04 SSR versus JavaScript rendering: what crawlers actually see
05 What to actually check, and where to start

AI crawler access is the layer that decides whether an answer engine can even reach your content before it ever judges the content’s quality. It is governed by four mechanics: which user-agents you allow in robots.txt, how fast your server responds, whether your pages render without JavaScript, and how reliably your origin stays up under crawl load. Get any of these wrong and you are absent from AI answers — not because your writing is weak, but because the fetch failed.

This is the most infrastructure-native problem in generative engine optimization. You can publish the cleanest schema and the sharpest content on the web, but if GPTBot gets a Disallow, if ClaudeBot times out on a slow response, or if your page is an empty shell until client-side JavaScript runs, none of it reaches the model. The crawl is the gate. Everything else is downstream.

Which AI crawlers actually visit, and what each one feeds

Each major AI engine sends distinct, named crawlers, and they do different jobs. Treating them as one undifferentiated stream of “AI bot traffic” is the first mistake. The split that matters is between crawlers that build training corpora and crawlers that fetch content to cite in live answers — the second category is where day-to-day AI search visibility is won or lost. Reaching you is only half of it; what those crawlers then read about you depends on the agent-facing surfaces you publish and secure.

OpenAI runs two relevant agents. GPTBot, identified as GPTBot/1.3, crawls content for training its foundation models. OAI-SearchBot, identified as OAI-SearchBot/1.3, surfaces sites in ChatGPT’s search features. Anthropic mirrors this structure: ClaudeBot (ClaudeBot/1.0) collects web content for model training, while Claude-SearchBot (Claude-SearchBot/1.0) navigates the web to improve search result quality. Perplexity sends PerplexityBot (PerplexityBot/1.0) to surface and link sites in its search results. Google’s situation is different — Google-Extended is not a separate crawler with its own HTTP user-agent at all, but a robots.txt token that controls whether your content trains Gemini.

The practical implication: blocking a training crawler and blocking a search crawler are not the same decision. Block GPTBot and you opt out of one training corpus. Block OAI-SearchBot and you remove yourself from ChatGPT’s live search citations, a much sharper loss if AI-driven discovery matters to your business. Most sites should welcome the search and user-facing crawlers without hesitation, then make a deliberate, separate call on training crawlers. The reference table at the end of this article lists the verified user-agent strings so you can write robots.txt rules that target exactly the agents you intend.
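To keep training and search traffic apart in your own access logs, a minimal sketch might classify requests by the crawler tokens listed in this article's reference table. The function name `classify_ai_crawler` and the sample header strings are hypothetical. Real user-agent headers are longer and can be spoofed, so treat this as a first-pass tally, not as verification of who is crawling.

```python
from typing import Optional, Tuple

# Tokens and purposes come from this article's reference table.
# Google-Extended is omitted: it is a robots.txt token, not a user-agent.
AI_CRAWLERS = {
    "GPTBot": ("OpenAI", "training"),
    "OAI-SearchBot": ("OpenAI", "search"),
    "ClaudeBot": ("Anthropic", "training"),
    "Claude-SearchBot": ("Anthropic", "search"),
    "PerplexityBot": ("Perplexity", "search"),
}


def classify_ai_crawler(user_agent: str) -> Optional[Tuple[str, str, str]]:
    """Return (token, operator, purpose) for a known AI crawler, else None."""
    ua = user_agent.lower()
    for token, (operator, purpose) in AI_CRAWLERS.items():
        # Case-insensitive substring match on the crawler token.
        if token.lower() in ua:
            return token, operator, purpose
    return None


# Hypothetical header values, shortened for illustration.
sample = "Mozilla/5.0 (compatible; OAI-SearchBot/1.3)"
print(classify_ai_crawler(sample))
```

Counting requests per purpose over a day of logs is usually enough to see whether the search crawlers are arriving at all.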

How do I let the right bots in while controlling load?

Use robots.txt: every major AI crawler that matters for search visibility is documented by its operator as respecting it. GPTBot, OAI-SearchBot, ClaudeBot, Claude-SearchBot, and PerplexityBot all honor robots.txt directives. That single fact makes robots.txt the correct control surface, rather than IP blocking or firewall rules guessed from request patterns. Anthropic states this directly: blocking by IP address may not work correctly or persistently, partly because it can impede the crawler’s ability to read your robots.txt in the first place. The documented method is the reliable method.

A clean configuration is explicit. Allow the search and user-facing agents you want; disallow the paths you never want surfaced (checkout flows, account pages, internal search results, faceted parameter URLs that explode into infinite crawl space). Targeting named user-agents means you can welcome OAI-SearchBot and PerplexityBot for citation while making an independent choice about GPTBot for training. One file, per-agent precision.
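As an illustration of per-agent rules, the sketch below embeds a sample robots.txt and evaluates it with Python's standard `urllib.robotparser`. The paths, and the choice to block GPTBot, are assumptions made for the example rather than a recommendation. Note that `robotparser` applies the first matching rule in a group, so the Disallow lines are placed before the catch-all Allow; parsers that use longest-match precedence reach the same result here.

```python
from urllib.robotparser import RobotFileParser

# Sample policy (illustrative paths): welcome search crawlers, keep
# transactional paths out, and make a separate call on a training crawler.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Disallow: /checkout/
Disallow: /account/
Disallow: /search
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /checkout/
Disallow: /account/
"""


def access_matrix(robots_text, agents, paths):
    """Map (agent, path) to True/False according to the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {(a, p): parser.can_fetch(a, p) for a in agents for p in paths}


matrix = access_matrix(
    ROBOTS_TXT,
    agents=["OAI-SearchBot", "PerplexityBot", "GPTBot", "ClaudeBot"],
    paths=["/blog/post", "/checkout/cart"],
)
for (agent, path), allowed in sorted(matrix.items()):
    print(f"{agent:15} {path:16} {'allow' if allowed else 'block'}")
```

ClaudeBot is not named in the sample file, so it falls through to the `*` group. That fallback is worth checking explicitly, because it is where unintended blocks tend to hide.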

There is a quieter risk here that has nothing to do with what you intend: deploy pipelines that overwrite robots.txt. A build that regenerates the file from a template, or a CMS that resets it on publish, can silently replace your explicit Allow rules with the template’s defaults and erase your AI access overnight. We have seen the same failure mode wipe schema markup, canonicals, and llms.txt on deploy. If your robots.txt is generated rather than static, treat it as deploy-critical state and assert it in a post-deploy check. That is the topic we cover in our agentic web readiness post, where the manifest and access layer have to survive every release.
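A post-deploy assertion could look something like the sketch below. It takes the text of the deployed robots.txt and a list of expectations, then reports every mismatch. The function name `robots_violations`, the expectation list, and the paths are hypothetical. How you fetch the deployed file and fail the pipeline depends on your own tooling.

```python
from typing import List, Tuple
from urllib.robotparser import RobotFileParser

# (user-agent token, path, should this agent be allowed?)
Expectation = Tuple[str, str, bool]

# Illustrative expectations; replace with the policy you actually intend.
EXPECTED: List[Expectation] = [
    ("OAI-SearchBot", "/blog/post", True),
    ("PerplexityBot", "/blog/post", True),
    ("OAI-SearchBot", "/checkout/cart", False),
]


def robots_violations(robots_text: str, expected: List[Expectation]) -> List[str]:
    """Return one message per expectation the deployed robots.txt breaks."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    problems = []
    for agent, path, should_allow in expected:
        allowed = parser.can_fetch(agent, path)
        if allowed != should_allow:
            problems.append(
                f"{agent} {path}: expected allow={should_allow}, got allow={allowed}"
            )
    return problems


# A template reset that blocks everything is caught immediately.
reset_by_template = "User-agent: *\nDisallow: /\n"
for message in robots_violations(reset_by_template, EXPECTED):
    print(message)
```

Run as the last step of a release, a non-empty result is a reason to fail the deploy instead of discovering the block weeks later in crawl logs.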

On load: AI crawlers can hit pages at high frequency, and the answer is not to throttle them defensively but to make the request cheap. That means edge caching and a CDN in front of your origin so a crawler fetch is served from cache, not computed fresh each time. Handled this way, “controlling load” stops being a tradeoff against visibility and becomes a side effect of good caching.
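One way to check whether a page can actually be served from the edge is to read its `Cache-Control` response header. The sketch below is a simplified heuristic, not a full implementation of HTTP caching rules: it assumes a CDN behaves like a standard shared cache, ignores headers such as `Expires` and `Vary`, and treats `no-cache` as "origin is contacted every time". The function names are hypothetical.

```python
from typing import Dict, Optional


def parse_cache_control(header: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control header into lowercase directives and values."""
    directives: Dict[str, Optional[str]] = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') or None
    return directives


def shared_cache_ttl(header: str) -> int:
    """Seconds a shared cache may serve the response without asking origin.

    Returns 0 when the header forbids or does not grant shared caching.
    """
    directives = parse_cache_control(header)
    if {"no-store", "private", "no-cache"} & directives.keys():
        return 0
    # For shared caches, s-maxage takes precedence over max-age.
    for name in ("s-maxage", "max-age"):
        value = directives.get(name)
        if value is not None and value.isdigit():
            return int(value)
    return 0


print(shared_cache_ttl("public, max-age=60, s-maxage=3600"))
print(shared_cache_ttl("private, max-age=600"))
```

A result of 0 on a content page suggests every crawler fetch is reaching origin, which is the situation the paragraph above is warning about.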

Why latency decides whether you get cited

Because search crawlers and user-initiated fetchers run on a budget, and a slow page is a timed-out candidate. When a user asks Perplexity or ChatGPT a question and the engine fetches live sources to ground its answer, it is not going to wait indefinitely for your server. A page that responds slowly is functionally the same as a page that does not respond: it does not make it into the answer. Speed is not a comfort feature for AI crawlers; it is an eligibility filter.

This reframes server response time. In traditional SEO, a slow time-to-first-byte costs you a fraction of a ranking position. In AI search, where a fetcher is assembling an answer in real time across several candidate sources, a slow response can cost you the citation entirely. The fast page gets read and quoted. The slow one gets dropped, and the engine moves to the next source it can actually reach in time.
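To put a number on this, a rough sketch can time how long a request takes to return its first body byte when it presents a crawler-style user-agent. No engine publishes an exact timeout, so `budget_seconds` is a threshold you choose, and both function names are hypothetical. The measurement includes DNS, connection setup, and TLS from wherever the script runs, so it approximates what a crawler experiences without reproducing it.

```python
import time
import urllib.error
import urllib.request
from typing import Optional, Tuple


def time_to_first_byte(
    url: str, user_agent: str, timeout: float = 5.0
) -> Tuple[Optional[int], float]:
    """Return (HTTP status or None on network failure, elapsed seconds)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    start = time.perf_counter()
    status: Optional[int]
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read(1)  # block until the first body byte arrives
            status = response.status
    except urllib.error.HTTPError as error:
        status = error.code  # a 4xx/5xx is still a response
    except OSError:
        status = None  # timeout, DNS failure, or refused connection
    return status, time.perf_counter() - start


def within_budget(status: Optional[int], seconds: float, budget_seconds: float) -> bool:
    """True only for a successful response that arrived inside the budget."""
    return status == 200 and seconds <= budget_seconds


# Example usage (performs a real network request, so it is left commented out):
# status, seconds = time_to_first_byte("https://example.com/", "GPTBot/1.3")
# print(status, round(seconds, 3), within_budget(status, seconds, 1.0))
```

Running it several times, and from more than one region, gives a more honest picture than a single sample.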

Uptime is the same signal at a longer timescale. A page that 500s or times out when the crawler arrives is, from the engine’s perspective, a source that does not exist. Reliability under crawl load is therefore an AI-visibility metric, not just an ops metric. This is one reason static and edge-served architectures consistently outperform heavy dynamic stacks for AI discoverability — a point we make at length in why not WordPress, where the render path and the database dependency are the difference between a page that is always ready and one that is sometimes too slow to be cited.

SSR versus JavaScript rendering: what crawlers actually see

Many AI crawlers read the raw HTML your server returns and do not execute JavaScript, so content that only appears after client-side rendering is invisible to them. If your page ships an empty <div id="root"> and paints the real content in the browser, a crawler that does not run JS sees the empty div. Your article, your product description, your schema — all of it absent from the response the model ingests. This is the single most common reason a technically “published” page contributes nothing to AI answers.

The fix is to put the content in the initial HTML response. Server-side rendering, static site generation, or pre-rendering all achieve this: the meaningful content is present the moment the server responds, before any JavaScript runs. For content that needs to be reliably crawlable by AI engines, this is not an optimization to schedule later. It is a precondition. A page is either readable in the raw HTML or it is a gamble on which crawlers happen to render JS this quarter.
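A quick way to test a page is to look for a known phrase in the text of the raw HTML, ignoring anything inside script tags. The sketch below uses Python's standard `html.parser` for that. The function names and the sample pages are hypothetical, and the check is deliberately strict: content that exists only as inline JSON in a script tag counts as missing, which is an assumption about how a non-rendering crawler reads the page.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping script, style, and template content."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(raw_html: str) -> str:
    """Return the text a non-JavaScript client would find in the markup."""
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    return " ".join(extractor.chunks)


def phrase_in_raw_html(raw_html: str, phrase: str) -> bool:
    return phrase.lower() in visible_text(raw_html).lower()


CSR_SHELL = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
SSR_PAGE = '<html><body><div id="root"><h1>Defending the Crawl</h1></div></body></html>'

print(phrase_in_raw_html(CSR_SHELL, "Defending the Crawl"))
print(phrase_in_raw_html(SSR_PAGE, "Defending the Crawl"))
```

Feeding it the body returned by a plain HTTP fetch of your own URL, with a sentence you know is on the page, shows whether that sentence survives without a browser.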

This connects directly to caching and latency. A pre-rendered page is also a cacheable page, and a cacheable page is a fast page. Server-side rendering, edge caching, and crawler eligibility are not three separate projects — they are one architecture decision viewed from three angles. Resolve the render path correctly and the latency and cacheability benefits follow.

What to actually check, and where to start

Start with a concrete inventory of three checks. First, pull your robots.txt and confirm exactly which AI user-agents are allowed and which paths are disallowed. Second, measure server response time against the budget a crawler is likely to give you, not just against what human users tolerate. Third, view your key pages as raw HTML: curl the URL and read what comes back before any JavaScript executes. Those three checks tell you whether AI engines can reach you, reach you fast enough, and read what they reach. Most sites fail at least one without knowing it. Keeping that access intact release after release is the durability discipline that stops a single deploy from quietly undoing it.

This is precisely the scope of an Infrastructure Health Check — a one-off, five-day engagement that audits robots.txt and crawler access, measures crawler latency, and reviews CDN and edge caching alongside the rest of your architecture, ending in a prioritized report. It is built for the case where the content is fine but the access layer is quietly failing. If you would rather see where it sits in the wider offer, the pricing page lays out the GEO foundations work it pairs with.

The crawl is the part of generative engine optimization that no amount of writing can compensate for. An engine cannot cite a page it cannot fetch, cannot fetch a page it is disallowed from, cannot wait for a page that responds too slowly, and cannot read a page that renders only in a browser it does not run. Fix the access layer first — the robots.txt directives, the response times, the render path, the edge cache — and every downstream investment in schema, entities, and content finally has a path to the model. Defend the crawl, and the rest of your GEO work stops leaking out the bottom.

Reference

Source: OpenAI, Anthropic, Perplexity & Google official crawler docs

Major AI crawler user-agents and what they feed

Crawler	Operator	User-agent	What it feeds
GPTBot	OpenAI	GPTBot/1.3	Foundation model training
OAI-SearchBot	OpenAI	OAI-SearchBot/1.3	ChatGPT search results
ClaudeBot	Anthropic	ClaudeBot/1.0	Claude model training
Claude-SearchBot	Anthropic	Claude-SearchBot/1.0	Claude search indexing
PerplexityBot	Perplexity	PerplexityBot/1.0	Perplexity search results
Google-Extended	Google	robots.txt token (no UA string)	Gemini model training opt-in/out

Frequently asked

Questions buyers ask before booking

What is an AI crawler?

An AI crawler is an automated agent operated by an AI company that fetches web pages to either train models or answer live user queries with citations. Examples include OpenAI's GPTBot and OAI-SearchBot, Anthropic's ClaudeBot and Claude-SearchBot, and Perplexity's PerplexityBot.

Does GPTBot respect robots.txt?

Yes. GPTBot respects robots.txt directives, as do OAI-SearchBot, ClaudeBot, Claude-SearchBot, and PerplexityBot. This means a robots.txt rule is the reliable, documented way to allow or block these crawlers.

How do I let AI crawlers in but keep control?

Use robots.txt to explicitly allow the named user-agents you want (for example GPTBot, OAI-SearchBot, ClaudeBot, Claude-SearchBot, and PerplexityBot) and disallow the paths you do not want crawled. Pair this with fast server response times and edge caching so high-frequency crawling does not strain origin infrastructure.

Will blocking Google-Extended hurt my Google Search ranking?

No. Google-Extended is a robots.txt token that only controls whether your content is used to train Gemini models. Google's documentation confirms it does not impact a site's inclusion in Google Search and is not used as a ranking signal.

Why does server-side rendering matter for AI crawlers?

Many AI crawlers read the raw HTML returned by your server and do not execute JavaScript. If your content only appears after client-side rendering, those crawlers see an empty shell. Server-side rendering or static pre-rendering ensures the content is present in the initial response.