How to Allow AI Crawlers to Access Your Site (robots.txt Guide)
By Minel Gunesoglu, founder of Is My Brand in AI · Last updated June 14, 2026
TL;DR: To allow AI crawlers, make sure nothing in robots.txt blocks the bots you want, and add explicit Allow rules using their user-agent tokens: GPTBot and OAI-SearchBot for OpenAI, ClaudeBot for Anthropic, PerplexityBot for Perplexity. Spelling matters, so copy each token exactly as the operator publishes it. Training bots and search bots are separate jobs, and a firewall can block a bot even when robots.txt allows it, so verify what the crawler actually sees.
Most guides on this topic teach you how to block AI crawlers. That made sense during the 2025 scraping panic, when half the web rushed to wall off GPTBot. But if you want your brand to show up when someone asks ChatGPT, Claude, or Perplexity a question, the goal is the opposite: you want to allow AI crawlers to reach your pages, read them cleanly, and carry your content into the answer. This guide is the allow-side manual. It covers which bots to let in, the exact robots.txt syntax that actually works, the one trap that silently undoes all of it, and how to confirm a crawler can really get through.
I run a small lab on this site. I built the free AI bot checker linked below, I run IndexNow live on ismybrandinai.com, and I keep my own robots.txt deliberately open to the engines I want citing me. Everything here is the order I work in on my own site, not theory from a deck. And I will be honest throughout about what robots.txt can and cannot do, because this is a topic where confident-sounding advice often gets the mechanics wrong.
On this page
- Why Allowing AI Crawlers Matters Now
- Training Crawlers vs Search Crawlers: The Core Distinction
- The Full AI Crawler Reference Table
- See Which AI Crawlers Your Site Actually Allows
- The robots.txt Allow Template (Copy, Paste, Adjust)
- Your robots.txt May Be Lying to You
- The Honest Tradeoff: What Allowing and Blocking Really Do
- How to Verify a Crawler Can Really Reach You
- Frequently Asked Questions
Why Allowing AI Crawlers Matters Now
The plumbing of search changed underneath a lot of site owners without them touching a thing. AI answer engines like ChatGPT, Perplexity, and Google's AI surfaces increasingly read content through their own dedicated crawlers, then quote a passage straight into the answer a user reads. If your robots.txt turns one of those crawlers away, you are not just opting out of scraping. You are removing yourself from the pool of sources that engine can cite. No access, no citation.
Two things made this urgent. First, a wave of defensive blocking. When AI scraping became a headline in 2025, a lot of sites added blanket disallows, and many CDNs and hosting platforms shipped default rules or one-click "block AI bots" toggles. Plenty of those blocks are still in place, quietly, on sites whose owners now want AI visibility and do not realize they are walled off. Second, the engines split their crawlers into separate jobs, which means a single old block can hit the exact bot that feeds live answers while leaving the harmless one untouched. The result is a lot of sites that think they are open and are not.
The fix is not hard once you see the moving parts. The hard part is that the file you would read to check your own status can be telling you a comforting lie, which is the whole reason a live check beats eyeballing the text. I will come back to that. First, the distinction that everything else hangs on.
Training Crawlers vs Search Crawlers: The Core Distinction
This is the single idea that makes the rest of the page make sense, and it is the one most "block AI bots" guides flatten into mush. The major AI companies do not run one crawler. They run at least two, with different jobs, and you can treat them differently.
A training crawler collects content to help build and improve future versions of a model. Its visits do not directly produce a citation today. OpenAI's GPTBot and Anthropic's ClaudeBot fall here. Google-Extended is the same category for Google's generative models. Blocking a training crawler is a reasonable choice if your only concern is "I do not want my work used to train models," and it carries almost no cost to your live AI search visibility.
A search crawler (sometimes called a retrieval or grounding crawler) fetches pages so the engine can find, index, and cite them inside answers, right now. OpenAI's OAI-SearchBot and Perplexity's PerplexityBot fall here. These are the bots that decide whether you can appear when a user asks a live question. If AI search visibility is your goal, these are the crawlers you must let in.
There is a third, smaller category worth naming: the user-action fetcher. When a person pastes your URL into ChatGPT or clicks a Perplexity citation, the engine fetches that one page on demand. OpenAI's ChatGPT-User and Perplexity-User do this. They are not bulk crawlers, they act on a real user's request, and blocking them can break the experience of someone who deliberately handed your link to the assistant.
The practical upshot: "should I allow AI crawlers?" is not one question. You can block the training bots to keep your content out of model training and still allow the search and user bots so you stay citable in live answers. Or you can open everything. What you should not do is block by accident and lose the search bots without meaning to, which is exactly what a blanket disallow does.
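To make the split concrete, here is a minimal Python sketch that turns that decision into robots.txt text. The role grouping follows the categories described above; the function name `build_robots` and the `allow_training` switch are hypothetical, introduced only for illustration, and the token list should be checked against each operator's current documentation.

```python
# Tokens grouped by the role this guide assigns them. Verify against each
# operator's own crawler documentation before relying on the list.
CRAWLERS = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "user": ["ChatGPT-User", "Perplexity-User"],
}


def build_robots(allow_training: bool) -> str:
    """Build robots.txt text: search and user bots are always allowed,
    training bots only when allow_training is True."""
    groups = []
    for role, tokens in CRAWLERS.items():
        blocked = role == "training" and not allow_training
        rule = "Disallow: /" if blocked else "Allow: /"
        for token in tokens:
            groups.append(f"User-agent: {token}\n{rule}")
    # Catch-all so ordinary crawlers are not disallowed by accident.
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


print(build_robots(allow_training=False))
```

Calling `build_robots(allow_training=True)` produces the open-everything variant. Either way, the search and user tokens never receive a Disallow, which is the accident this section warns about.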
The Full AI Crawler Reference Table
Here is every major AI crawler you need to know, what its operator uses it for, and whether to allow it if your goal is AI search visibility. The "user-agent token" column is the string you actually write in robots.txt. Get that token exactly right, because robots.txt matches it as written. Each operator publishes its own crawler list (OpenAI at developers.openai.com, Anthropic at its support and crawling docs, Perplexity at docs.perplexity.ai, Cloudflare maintains a public bot directory), and those first-party pages are the source of truth as tokens change.
| Operator | User-agent token | Purpose | Allow for AI visibility? | Respects robots.txt? |
|---|---|---|---|---|
| OpenAI | GPTBot | Training | Optional (allow if fine with training; no live-citation cost to blocking) | Yes, per OpenAI docs |
| OpenAI | OAI-SearchBot | Search / citation | Yes, allow | Yes, per OpenAI docs |
| OpenAI | ChatGPT-User | User action (on-demand fetch) | Yes, allow | Yes, per OpenAI docs |
| Anthropic | ClaudeBot | Training | Optional (allow if fine with training) | Yes, per Anthropic docs |
| Anthropic | Claude-SearchBot | Search / citation | Yes, allow | Yes, per Anthropic docs |
| Perplexity | PerplexityBot | Search / indexing for citation | Yes, allow | Stated yes; see caveat below |
| Perplexity | Perplexity-User | User action (on-demand fetch) | Yes, allow | Acts on user request; may not treat rules as a hard block |
| Google | Google-Extended | Generative-model training control | Optional (does not control AI Overviews; see note) | Yes, it is a robots.txt token |
| Microsoft / Bing | Bingbot | Search index (feeds Copilot) | Yes, allow | Yes |
| Common Crawl | CCBot | Open dataset (used by many model builders) | Optional (dataset feeds third-party training) | Yes |
| ByteDance | Bytespider | Training / data collection | Your call (data collection, not a citation engine) | Historically inconsistent; verify behavior |
A few rows deserve a sentence. Google-Extended is the one people misread most: it controls whether Google can use your content to train its generative models, and it does not control whether you appear in Google's AI Overviews. Blocking it will not pull you out of those AI answers, a confusion I see constantly; the full breakdown of that mistake and the right tool for it lives in how to opt out of Google AI Overviews. Bingbot earns a place here because Microsoft Copilot grounds on the Bing index, so a Bing block quietly costs you Copilot citations; the deeper Copilot ranking story is in how to rank on Microsoft Copilot. And Bytespider is the row to watch, because its compliance history has been the least consistent of the group.
See Which AI Crawlers Your Site Actually Allows
Check your own robots.txt before you change anything. Paste your URL into our free AI bot checker and it reads your live robots.txt, then tells you plainly which AI crawlers can reach you and which are blocked right now. No signup, no account. I built this checker because every guide says "go check your robots.txt" and none of them hand you a tool that does it for you. Start there, see your real status, then come back and fix what needs fixing.
It is genuinely the fastest way to know where you stand. Most people who run it discover at least one block they did not know about, usually a leftover from a CDN toggle or a template they inherited. Knowing your starting point turns the rest of this guide from abstract into a two-minute edit.
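For a rough local look at the robots.txt half of that check, Python's standard-library parser can report which tokens a given file allows. This is a sketch under stated assumptions: the sample file `INHERITED_ROBOTS` is a hypothetical leftover template, `audit` is an illustrative helper name, and Python's parser is only one interpretation of the rules, so individual crawlers may read edge cases differently. It also sees only the file, not anything a CDN or firewall does to the request.

```python
from urllib.robotparser import RobotFileParser

# Tokens from the reference table above.
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "Claude-SearchBot", "PerplexityBot", "Perplexity-User",
    "Google-Extended", "Bingbot", "CCBot", "Bytespider",
]

# Hypothetical inherited file with two leftover AI blocks.
INHERITED_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /wp-admin/
"""


def audit(robots_txt: str, path: str = "/") -> dict:
    """Return {token: True/False} for whether each token may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, path) for token in AI_TOKENS}


status = audit(INHERITED_ROBOTS)
blocked = [token for token, ok in status.items() if not ok]
print("Blocked:", ", ".join(blocked) or "none")
```

To audit a live file instead of a pasted string, `parser.set_url("https://yoursite.com/robots.txt")` followed by `parser.read()` can replace the `parse` call.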
The robots.txt Allow Template (Copy, Paste, Adjust)
Here is a clean robots.txt block that allows the AI search and user crawlers you most likely want, while leaving you free to decide on the training bots. Drop it into the robots.txt file at your domain root (yoursite.com/robots.txt), adjust to taste, and host it as plain text.
# Allow OpenAI's search + user crawlers (citation + on-demand fetch)
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
# Allow OpenAI's training crawler too, only if you're fine with training use
User-agent: GPTBot
Allow: /
# Allow Anthropic's search + (optionally) training crawler
User-agent: Claude-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
# Allow Perplexity's crawler + user fetch
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
# Allow Bing (feeds Microsoft Copilot)
User-agent: Bingbot
Allow: /
# Keep a catch-all so normal crawling isn't accidentally disallowed
User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
Three things make or break this file, and they are the errors I see most:
The token must be spelled exactly. A crawler looks for its own product token on the User-agent: line. GPTBot works; GPT-Bot, GPT Bot, or pasting the bot's full browser-style user-agent string into the User-agent: line does not. The robots.txt standard (RFC 9309) tells crawlers to match the token case-insensitively, so gptbot should match for a compliant crawler, but copying the published casing is the safe habit. If you want to allow OpenAI's training crawler, the line is literally User-agent: GPTBot and nothing else. A misspelling or an extra hyphen means your rule silently matches nothing, and the bot falls through to whatever your catch-all says.
An Allow without a Disallow is the default anyway, so most "allow" work is really removing a block. If your file has no rule that disallows a bot, that bot is already allowed. The reason to write explicit Allow: / lines is clarity and to override a broader disallow. So the real fix on most sites is to find and delete the line that says User-agent: GPTBot followed by Disallow: / (or a blanket User-agent: * / Disallow: /), not to add anything.
Specificity matters more than order. A crawler follows the most specific user-agent block that names it and ignores the catch-all, wherever that block sits in the file. If you keep a restrictive User-agent: * rule, make sure each AI bot you want has its own explicit block, or it will inherit the catch-all's disallow.
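The first and third points can be seen side by side in a small sketch using Python's standard-library robots.txt parser. The two sample files are hypothetical, `may_fetch` is an illustrative helper, and Python's parser stands in for a real crawler here, so treat the result as an indication rather than a guarantee of how any particular bot behaves.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, token: str, path: str = "/") -> bool:
    """Parse robots.txt text and ask whether token may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, path)


# Hypothetical file: restrictive catch-all, with the token misspelled.
WITH_TYPO = """\
User-agent: GPT-Bot
Allow: /

User-agent: *
Disallow: /
"""

# Same file with the token spelled as published.
FIXED = WITH_TYPO.replace("GPT-Bot", "GPTBot")

# The misspelled block matches nothing, so GPTBot inherits the catch-all.
print("typo file, GPTBot:", may_fetch(WITH_TYPO, "GPTBot"))
# The corrected block applies to GPTBot and overrides the catch-all.
print("fixed file, GPTBot:", may_fetch(FIXED, "GPTBot"))
# A bot with no block of its own still inherits the catch-all's disallow.
print("fixed file, OAI-SearchBot:", may_fetch(FIXED, "OAI-SearchBot"))
```

The last line is the easy one to miss: fixing GPTBot does nothing for OAI-SearchBot, which needs its own block while the restrictive catch-all stays in place.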
If you also want to hand AI models a clean map of your best pages, that is a different file with a different job. robots.txt is the access gate; llms.txt is a content map, advisory, not an access control. Different files, different jobs. Our llms.txt explainer covers when that one is worth the effort, and our free llms.txt generator builds you a valid one in a couple of minutes.
Your robots.txt May Be Lying to You
This is the part that justifies a live checker over a careful read of your file, and it is the gap almost no competitor guide closes. Your robots.txt can say "come on in" while something above it slams the door, and you will never see it by reading the file.
Here is the mechanism. A request from an AI crawler hits your CDN or web application firewall before it ever reaches the robots.txt your server would serve. Many of those layers ship with bot-management rules, and several now include a one-click "block AI bots" or "block AI scrapers" control. Some platforms turned versions of this on by default. When that layer recognizes a crawler's signature, it can return a block, a challenge page, or a 403 directly, and your origin's robots.txt never gets a vote. The crawler experiences a closed door; your file, read by a human, looks wide open.
So you can have a perfectly permissive robots.txt and still be invisible to GPTBot or PerplexityBot because a firewall rule, a hosting default, or a security plugin is intercepting the request upstream. This is the single most common reason "but my robots.txt allows it" turns out to be wrong. Reading the file tells you your intent. It does not tell you what the bot actually receives.
That is exactly why our AI bot checker is built around what the request really resolves to rather than just parsing your robots.txt text, and why I do not trust a clean-looking file on its own. If you suspect an upstream block, the place to look after the checker is your CDN or firewall's bot-management settings, plus any security plugin on your CMS, where a "block AI bots" toggle may be doing the damage your robots.txt would never reveal.
The Honest Tradeoff: What Allowing and Blocking Really Do
I want to be straight about the limits, because the topic is full of overconfident claims in both directions.
Blocking a training bot does not remove you from ChatGPT. This is the big one. If you block GPTBot, you are opting out of OpenAI's training crawl going forward. You are not deleting yourself from ChatGPT, and you are not blocking the search crawler that handles live citations, which is a separate token (OAI-SearchBot). Content already learned during prior training does not un-learn itself because you added a line today. So "I blocked GPTBot, why does ChatGPT still know me?" has a clean answer: you blocked the wrong half, and training is not the same as live retrieval. The same logic holds for ClaudeBot and Claude's search crawler.
Allowing a crawler is necessary but not sufficient. Letting a search bot in puts you in the pool of pages it can cite. It does not guarantee a citation any more than letting Googlebot crawl you guarantees a number-one ranking. Access is the entry ticket; whether you actually get quoted depends on whether your page answers the question cleanly. The on-page craft of being citable, not just crawlable, is its own subject, covered in how to get cited by ChatGPT and across the broader picture in AI search visibility.
robots.txt is a request, not a wall, and not every bot honors it equally. The major operators state that their crawlers respect robots.txt, and the well-behaved ones do. But robots.txt is a convention, not an enforcement mechanism. Less scrupulous crawlers have historically ignored it, and even with reputable engines, the user-action fetchers (the ones acting on a real person's request) may not treat your rules as a hard block, since the human, not the bot, initiated the fetch. If you need to guarantee a bot cannot reach something, robots.txt is the wrong tool; you need firewall-level enforcement. robots.txt is excellent at telling cooperative crawlers your preference, and it is the right and standard way to allow the engines you want. Just do not mistake a preference for a lock.
Put those three together and the allow-side strategy is simple and honest: open the search and user crawlers so you are citable, decide on the training crawlers based on how you feel about model training (knowing it costs you nothing in live visibility), and verify the door is actually open at the request level, not just in the file.
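To show the difference between a preference and a lock, here is a minimal sketch of enforcement at the request level, written as Python WSGI middleware. Everything named here is an assumption for illustration: `block_bots`, the `BLOCKED_TOKENS` choice, and the toy `site` app. Real enforcement normally lives in a CDN or web application firewall, and a user-agent header can be spoofed, so matching on it alone is not a guarantee either.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical choice of tokens to refuse, lowercase for matching.
BLOCKED_TOKENS = ("bytespider",)


def block_bots(app, blocked=BLOCKED_TOKENS):
    """Wrap a WSGI app; answer 403 when the User-Agent contains a blocked token."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Toy application standing in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(user_agent: str) -> str:
    """Send one fake request through the wrapped app and return its status line."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    seen = []
    block_bots(site)(environ, lambda status, headers: seen.append(status))
    return seen[0]


print(status_for("Mozilla/5.0 (compatible; Bytespider)"))
print(status_for("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))
```

Unlike a robots.txt line, this refusal does not depend on the crawler's cooperation. The same mechanism, switched on by a default or a forgotten toggle, is what shuts out the bots you wanted to let in.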
How to Verify a Crawler Can Really Reach You
Reading your robots.txt is step zero, not the finish line. Here is how I confirm a bot can actually get through, in order of how much each tells you.
Start with the live checker. Run your URL through our AI bot checker first, because it resolves what an AI crawler's request actually returns, which catches both robots.txt rules and the upstream firewall blocks that reading the file alone hides. If it comes back clean, you have your answer in seconds. If it flags a block, it tells you which bot and gives you the line to fix.
Cross-check with a manual request. From a terminal, you can fetch your page while presenting a crawler's user-agent and watch the status code. A 200 with your real HTML means the door is open for that signature; a 403, a challenge page, or a redirect to a block page means something upstream is intercepting it. Test a couple of the AI tokens, not just Googlebot, since firewalls often single out AI crawlers specifically. This is the manual version of what the checker automates.
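The manual request can also be scripted. The sketch below uses only Python's standard library and makes several assumptions worth stating: `stand_in_user_agent` builds a simplified, hypothetical user-agent string rather than any operator's real one (copy the full current strings from their docs for a closer test), the status codes treated as a block are a judgment call, and `expected_text` is a phrase you choose that appears on the real page. A firewall that verifies crawler IP ranges may also treat a request from your machine differently from the real bot.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def stand_in_user_agent(token: str) -> str:
    """Simplified stand-in UA string; NOT the operator's official string."""
    return f"Mozilla/5.0 (compatible; {token}/1.0)"


def classify(status, requested_url, final_url, body="", expected_text=""):
    """Turn one response into a rough verdict."""
    if status is None:
        return "no response"
    if status in (401, 403, 429, 503):
        return "blocked or challenged"
    if status == 200:
        if final_url != requested_url:
            return "redirected"  # confirm the landing page is not a block page
        if expected_text and expected_text not in body:
            return "200 but expected content missing"  # possible challenge page
        return "open"
    return "inconclusive"


def probe(url, token, expected_text="", timeout=10):
    """Fetch url while presenting a crawler-like user agent."""
    request = Request(url, headers={"User-Agent": stand_in_user_agent(token)})
    try:
        with urlopen(request, timeout=timeout) as response:
            body = response.read(200_000).decode("utf-8", errors="replace")
            return classify(response.status, url, response.geturl(),
                            body, expected_text)
    except HTTPError as error:  # 4xx and 5xx responses arrive as exceptions
        return classify(error.code, url, url)
    except OSError:  # DNS failure, refused connection, timeout
        return classify(None, url, url)


# Example usage (makes real network requests):
# for token in ("GPTBot", "OAI-SearchBot", "PerplexityBot"):
#     print(token, probe("https://yoursite.com/", token))
```

Comparing the verdict for an AI token against the verdict for an ordinary browser user agent is the quickest way to see whether AI crawlers are being singled out.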
Mind the limits of Google's tooling. Google Search Console's robots.txt report and URL inspection are genuinely useful, but they speak for Googlebot. They will not tell you whether GPTBot, ClaudeBot, or PerplexityBot can reach you, because those are not Google's crawlers. Confirming your robots.txt is valid in Search Console is worth doing, but do not read a green check there as proof the AI crawlers are getting in. That is precisely the gap the checker fills.
The pattern: verify the AI crawlers specifically, at the request level, not by trusting a file you read or a tool that only speaks for Googlebot. A clean robots.txt is necessary. It is not sufficient.
Frequently Asked Questions
How do I allow GPTBot in robots.txt?
Add a block with the exact token: User-agent: GPTBot on one line and Allow: / on the next. If your file currently has User-agent: GPTBot followed by Disallow: /, the fix is to change that Disallow to Allow or remove the disallow entirely. Spelling has to be exact, so GPT-Bot will not match; the robots.txt standard calls for case-insensitive matching, but copying the published casing (GPTBot) is the safe habit. Remember that GPTBot is OpenAI's training crawler; to be citable in live ChatGPT answers you also want OAI-SearchBot and ChatGPT-User allowed.
Will blocking GPTBot remove me from ChatGPT?
No. Blocking GPTBot opts you out of OpenAI's training crawl going forward; it does not delete you from ChatGPT and does not block the separate search crawler (OAI-SearchBot) that handles live citations. If your goal is to stay visible in ChatGPT's answers, leave the search and user crawlers allowed regardless of what you decide about training.
What is the difference between a training crawler and a search crawler?
A training crawler (GPTBot, ClaudeBot, Google-Extended) collects content to help build future model versions; blocking it has little effect on whether you appear in today's AI answers. A search crawler (OAI-SearchBot, PerplexityBot, Claude-SearchBot) fetches pages so the engine can find and cite you in live answers right now. If AI visibility is the goal, the search crawlers are the ones you must allow.
My robots.txt allows the bots, so why am I still blocked?
Almost always because a layer above robots.txt is intercepting the request. CDNs, web application firewalls, and security plugins often ship "block AI bots" rules, sometimes on by default, that return a block before the crawler ever reaches your robots.txt. Reading the file shows your intent; it does not show what the bot receives. Check the request itself with our AI bot checker, then look at your CDN or firewall's bot-management settings.
Does robots.txt actually stop a crawler?
For cooperative, reputable engines, yes, they honor it as a stated preference. But robots.txt is a convention, not an enforcement mechanism. Some crawlers ignore it, and user-action fetchers acting on a real person's request may not treat it as a hard block. To allow the engines you want, robots.txt is exactly the right tool. To guarantee something stays out, you need firewall-level enforcement, not a robots.txt line.
Should I add an llms.txt file too?
That is a separate decision. llms.txt is an advisory content map, not an access control, so it does not allow or block anything; robots.txt is the gate. If you want to point AI models at your best pages, our llms.txt explainer covers whether it is worth it, and the generator builds one for you. But it does nothing to fix a crawler block, which is purely a robots.txt and firewall question.
Allowing AI crawlers comes down to one honest chain: know which bot does which job, write the exact tokens, delete the leftover blocks, and verify at the request level that the door is actually open, not just open in the file. The training-versus-search split is the idea that makes the rest fall into place, and the firewall-above-robots.txt trap is the reason a clean-looking file fools so many people. Start where it is fastest: run your site through the free AI bot checker, see which AI crawlers can reach you right now, fix any block in a line, and then go make the pages worth citing.
This guide is part of our series on AI search visibility, anchored by how to rank on ChatGPT. Written by Minel Gunesoglu, founder of Is My Brand in AI — more about us. I run the free AI bot checker and IndexNow live on this site, and where exact figures are not public I have described the mechanics honestly rather than inventing numbers. Reviewed June 14, 2026.