Who Blocks AI Crawlers? robots.txt in the Top 10,000 Sites

Most of the web’s biggest sites still let AI crawlers in. But the ones pulling up the drawbridge are making a surprisingly specific choice about which AI gets access, and it’s not the choice I expected.

I checked the robots.txt of the top 10,000 sites in the Majestic Million to see how many allow or block the major AI crawlers, which bots get blocked most, and whether the bigger players are more protective.

22.6% of top sites block at least one AI crawler

17.8% block CCBot, the most-blocked crawler

6.6% block OpenAI's search bot, the least-blocked

13.1% of llms.txt publishers block an AI crawler, about half the rate of the rest

AI crawler rules across the top 10,000 sites, June 2026.

How I read 10,000 robots.txt files

I fetched /robots.txt for each of the top 10,000 domains and parsed it the way a compliant crawler does, matching each bot to its most specific user-agent group and checking whether that group allows or blocks the site root.

Not every site gives a clean answer. Some sit behind bot protection that refused the request, some timed out, and some returned an error page instead of a file. I set those aside rather than guess. That leaves 7,870 sites with a determinable result: either a readable robots.txt, or a clean 404, which means no rules and an open door by default. Every percentage below is out of those 7,870.

Most sites still leave AI crawlers alone

The headline is reassuring if you’re rooting for an open web: 77% of these sites are open to every major AI crawler. About 22.6% block at least one, and only 330 sites block all ten I checked.

How many AI crawlers sites block

Open to every AI crawler 6089 (77%)
Block some (1-9) 1451 (18%)
Block all 10 330 (4%)

Sites with a determinable robots.txt, by how many of the ten AI crawlers they block.

So blocking is a minority behavior. But who gets blocked, and by whom, is where it gets interesting.

Training crawlers get blocked, search crawlers get a pass

Here’s the part I didn’t expect. When sites block, they overwhelmingly block the crawlers that collect data for model training, while leaving the crawlers that power AI search and citations alone.

The clearest proof is within a single company. OpenAI runs two crawlers: GPTBot, which gathers training data, and OAI-SearchBot, which builds the index ChatGPT cites when it searches the web. Holding the company constant removes any “people just trust one lab over another” explanation, and what’s left is a clean split.

One company, two crawlers

GPTBot (training)

17.6% 1,388

OAI-SearchBot (search)

6.6% 520

GPTBot collects training data; OAI-SearchBot powers ChatGPT's live citations. Share of the 7,870 sites that block each.

Sites are nearly three times more willing to be cited than to be trained on. And it isn’t just OpenAI: the same split runs across every lab.

Block rate by AI crawler

CCBot

17.8% 1,400

GPTBot

17.6% 1,388

Bytespider

16.4% 1,291

ClaudeBot

16.0% 1,256

Google-Extended

14.5% 1,141

Meta-ExternalAgent

13.5% 1,059

Applebot-Extended

13.1% 1,030

anthropic-ai

11.3% 888

PerplexityBot

10.4% 822

OAI-SearchBot

6.6% 520

Share of the 7,870 sites that block each crawler.

The most-blocked crawlers (CCBot, GPTBot, Bytespider, ClaudeBot, Google-Extended) are all training collectors, and the least-blocked are the ones tied to live search and answers. One nuance on the leader: Common Crawl’s CCBot has been around since 2011, so some of its blocks predate the AI moment and reflect old bandwidth decisions rather than a fresh call on training. The pattern doesn’t lean on it, though. GPTBot, a purpose-built AI training crawler, sits right behind. Whether sites are doing this deliberately or just copying blocklists that circulated when “block AI” meant “block training,” the effect is the same: they’re protecting their content from being learned while staying open to being found.

Bigger sites block more

Blocking also tracks with size. The top 500 sites block an AI crawler about a third of the time, while the rate drifts down to roughly 20% in the 5,000 to 10,000 range.

Block at least one AI crawler, by rank

1-100

32.9% 28 / 85

101-500

33.1% 108 / 326

501-1k

25.5% 103 / 404

1k-2k

26.7% 214 / 803

2k-5k

23.6% 574 / 2,436

5k-10k

19.8% 754 / 3,816

Share of sites in each Majestic Million rank tier that block at least one AI crawler.

That fits what you’d expect. Larger sites are more likely to be publishers with content-licensing concerns, to have legal teams with opinions, and to have the resources to manage bot policy at all. Smaller sites mostly haven’t touched the question. The very top tiers are small samples, though (the top 100 is only 85 sites with a clear answer), so I’d read the broad gradient here, not the exact tier-by-tier order.

llms.txt publishers block the least

One cross-reference, since I had the data from a companion study. Sites that publish an llms.txt file, an explicit invitation for AI tools to read them, block an AI crawler only 13.1% of the time, versus 23.6% for everyone else.

Block rate: llms.txt adopters vs everyone else

llms.txt adopters

13.1% 95 / 724

Everyone else

23.6% 1,686 / 7,146

Sites publishing an llms.txt are far more likely to leave AI crawlers open.

That’s the consistency you’d hope for: the sites rolling out a welcome mat are also the ones leaving the door unlocked. It’s the inverse of the contradiction I found in the llms.txt study, where a small slice did both at once. One caveat: this is a correlation, not proof of cause. llms.txt adopters are a self-selected, AI-friendly group, so the gap probably says more about who they already are than about what publishing the file does to them.

So what should you do?

Decide your AI crawler policy on purpose, then make robots.txt say exactly that.

If you want to be found and cited in AI answers, which most brands competing for visibility do, the search and live-fetch crawlers are the ones that matter, and blocking them quietly removes you from those surfaces. Training crawlers are a separate, legitimate decision: blocking them keeps your content out of future models, at the cost of those models knowing less about you.

The common mistake is doing it by accident, with an old blanket rule that catches more than you meant. If you’re not sure what yours allows, the AI Crawler Access Checker reads your live robots.txt and tells you, bot by bot.

A note on the data

This is a snapshot of the top 10,000 Majestic Million domains, crawled in June 2026. robots.txt is a stated policy, not an enforcement mechanism: compliant crawlers honor it, but some (Bytespider has been widely reported to ignore it) do not, so a block here is an instruction, not a guarantee. The 2,130 sites without a determinable robots.txt are left out of every rate above. If you want the underlying data or have a question about the method, get in touch.

Working through the same shift?

I write about SEO, GEO, and getting found by AI search.
If this was useful, I'd love to compare notes.

Connect on LinkedIn More research