AI Training Audit
Is your website opted in to AI training?
An informational scan of your /robots.txt, /ai.txt and /llms.txt. Reports which AI crawlers you currently allow or block — no compliance verdict, just clarity.
Coverage
What we read
Three small text files at the root of a domain signal how AI crawlers should treat your content. None are required. Most websites have none of them.
- № 01
/robots.txt
The standard. We check whether you're allowing or blocking the major AI crawlers — GPTBot, ClaudeBot, Google-Extended, PerplexityBot, CCBot, Applebot-Extended, Bytespider, Meta-ExternalAgent, and around a dozen more.
- № 02
/ai.txt
Spawning.ai's proposed opt-out file for AI training datasets. Uncommon, but if you have one we'll surface it so you know it's still being served.
- № 03
/llms.txt
An emerging standard for sites that want to be readable by LLMs — think of it as a sitemap for AI assistants. Presence is a positive signal that you've considered AI consumption of your content.
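As a rough illustration of what an /llms.txt file might contain, the sketch below assembles a small markdown index in Python. The layout (a title heading, a one-line summary, then sections of described links) follows the public llms.txt proposal as we understand it. The function name build_llms_txt, the example.com URL and the section name are hypothetical.

```python
def build_llms_txt(title, summary, sections):
    """Assemble a markdown index in the style of the llms.txt proposal.

    sections maps a section name to a list of
    (link_title, url, description) tuples. All names here are illustrative.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for name, links in sections.items():
        lines.append(f"## {name}")
        lines.append("")
        for link_title, url, description in links:
            lines.append(f"- [{link_title}]({url}): {description}")
        lines.append("")
    # Collapse trailing blank lines to a single final newline.
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site, invented for illustration.
example = build_llms_txt(
    "Example Ltd",
    "Hypothetical company used only for illustration.",
    {"Docs": [("Pricing", "https://example.com/pricing.md", "Plans and fees")]},
)
print(example)
```

The resulting text would be served as a static file at the root of the domain, alongside /robots.txt.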
Why this matters
No "right" answer. Just clarity.
Some businesses want to be cited by ChatGPT, Claude and Perplexity — visibility in AI answers is the new SEO. Others want to keep their content out of training corpora — IP, licensing, brand control. Both are legitimate.
The audit doesn't pick a side. It tells you what your robots.txt actually says, so you can decide whether that matches your intent.
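To make "what your robots.txt actually says" concrete, here is a minimal sketch of that kind of check using Python's standard urllib.robotparser. It is not the scanner's own implementation: the function name ai_stance, the sample file and the shortened crawler list are assumptions, and it only tests the site root.

```python
from urllib.robotparser import RobotFileParser

# A subset of the AI crawler tokens named on this page.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot", "CCBot"]


def ai_stance(robots_txt, crawlers=AI_CRAWLERS):
    """Return {crawler: 'allowed' | 'blocked'} for the site root ("/")."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        name: "allowed" if parser.can_fetch(name, "/") else "blocked"
        for name in crawlers
    }


# Hypothetical robots.txt: block GPTBot, allow everyone else.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
stance = ai_stance(sample)
print(stance)
```

The standard-library parser matches user-agent tokens by substring, so individual crawlers may interpret edge cases differently. Treat the result as a reading of your file, not a guarantee of crawler behaviour.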
Crawlers
Who's asking
The major AI crawlers the scan looks for, what each one is for, and the user-agent string to block or allow in /robots.txt if you want to change your stance.
- № 01
GPTBot
OpenAI's crawler for training future ChatGPT models. Honours /robots.txt. Block with User-agent: GPTBot / Disallow: /. Separate from OAI-SearchBot (live ChatGPT search results, usually worth allowing) and ChatGPT-User (on-demand fetches by ChatGPT users).
- № 02
ClaudeBot
Anthropic's crawler for training Claude. Honours /robots.txt. Block with User-agent: ClaudeBot / Disallow: /. Claude-Web is the older identifier, still seen in some logs.
- № 03
Google-Extended
Google's opt-out token for Gemini training and AI features built on its index. Blocking Google-Extended does not affect normal Google Search ranking: the regular Googlebot is governed by its own, separate rules. This is the cleanest "no AI training, keep my SEO" switch available today.
- № 04
PerplexityBot
Perplexity's crawler. Used for both live answer-engine retrieval and training. Perplexity has been the most controversial of the AI crawlers — reports of ignoring robots.txt in 2024 led to many sites switching to IP-based blocks. Worth checking your logs as well as your robots.txt.
- № 05
CCBot (Common Crawl)
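The crawler behind the Common Crawl dataset, which has been the upstream training source for many open LLMs (LLaMA, Falcon, Mistral, etc.). Blocking CCBot is one of the highest-leverage opt-outs: it keeps your pages out of future Common Crawl snapshots, which feed dozens of downstream models. Pages already captured in earlier snapshots are not removed by a robots.txt change.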
- № 06
Applebot-Extended
Apple's opt-out token for Apple Intelligence and Siri training. Mirrors the Google-Extended pattern — blocking it does not affect Spotlight or Safari search. Both Apple and Google adopted this two-tier model so publishers can opt out of AI training without losing classical search visibility.
- № 07
Meta-ExternalAgent & Bytespider
Meta-ExternalAgent is Meta's training crawler for Llama. Bytespider is ByteDance's, feeding into Doubao and other products. Both honour robots.txt and are commonly blocked by publishers who want to keep their content out of social-media and Chinese-market training datasets.
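The user-agent tokens listed above can be turned into robots.txt groups mechanically. The sketch below, in Python, renders one block per token and then uses the standard urllib.robotparser to confirm that agents outside the list, such as Googlebot and OAI-SearchBot, are left untouched. The BLOCK list is one possible policy rather than a recommendation, and block_rules is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Tokens taken from the list above. Which ones to block is a policy
# choice for the site owner, not a rule.
BLOCK = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "Applebot-Extended"]


def block_rules(tokens):
    """Render one 'User-agent / Disallow: /' group per token."""
    groups = [f"User-agent: {token}\nDisallow: /\n" for token in tokens]
    return "\n".join(groups)


robots_txt = block_rules(BLOCK)
print(robots_txt)

# Parse the generated text and check a few agents against the site root.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
for agent in ["GPTBot", "Google-Extended", "Googlebot", "OAI-SearchBot"]:
    print(agent, parser.can_fetch(agent, "/"))
```

Only agents with an explicit group are blocked. With no wildcard group present, everything else defaults to allowed.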
FAQ
Common questions
What UK site owners ask before they touch their robots.txt. The scanner is intentionally non-prescriptive — the answers below explain the tradeoffs rather than picking a side.
- Q1
Should I block AI crawlers from my website?
No universal answer. Block them if your content is licensed, IP-sensitive, or your business model depends on direct site visits. Allow them if AI visibility is part of your distribution — being cited by ChatGPT or Perplexity is increasingly how customers find SMEs. Most publishers split the difference: block training crawlers (GPTBot, ClaudeBot, CCBot) and allow live-retrieval crawlers (OAI-SearchBot, PerplexityBot when used for retrieval).
- Q2
How do I opt out of ChatGPT training?
Add two lines to your /robots.txt: User-agent: GPTBot followed by Disallow: /. OpenAI honours this; it published the GPTBot identifier specifically for this purpose. Your site stays fully visible in ChatGPT's web search mode (which uses a different crawler) and in classical Google Search.
- Q3
Does blocking Google-Extended affect Google Search rankings?
No. Google explicitly designed Google-Extended as an opt-out token that controls Gemini training and AI features only — classical Google Search uses the regular Googlebot directive, which is unaffected. This is the cleanest way to opt out of Google's AI training without losing search visibility.
- Q4
What is llms.txt and do I need one?
/llms.txt is an emerging convention, proposed in late 2024, for sites that want to be readable by LLMs. Think of it as a sitemap for AI assistants: a markdown index of your most important content with descriptions, optimised for LLM context windows. Not required, but having one is a positive signal you've thought about AI discoverability.
- Q5
Do AI crawlers actually honour robots.txt?
The major ones do: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent and CCBot all publish identifiers and honour the standard. The exception has been Perplexity — multiple investigations in 2024 found their crawler ignoring directives. If you specifically want to block Perplexity, plan for IP-based blocking in addition to robots.txt as a backstop.
- Q6
Is blocking AI crawlers legally required?
No UK statute requires it. UK GDPR may apply if your content includes personal data being scraped at scale, but that is a fact-specific legal question. The text and data mining (TDM) opt-out in Article 4(3) of the EU Copyright Directive, which the EU AI Act requires providers of general-purpose AI models to respect, applies within the EU. Many UK publishers signal an opt-out via robots.txt or ai.txt as good practice even though enforcement is unclear. None of this is legal advice.
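For the log check suggested under Q5, a first pass can be as simple as counting access-log lines that mention each crawler token. The Python sketch below assumes the user-agent string appears somewhere in each line, as it does in common web server log formats. The log lines, IP addresses and user-agent strings are invented for illustration, and count_ai_hits is a hypothetical helper.

```python
from collections import Counter

# A subset of the crawler tokens discussed on this page.
AI_TOKENS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider"]


def count_ai_hits(log_lines, tokens=AI_TOKENS):
    """Count log lines that mention each crawler token (case-insensitive)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                hits[token] += 1
    return hits


# Hypothetical access-log lines, invented for illustration.
sample_log = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:10:00:05 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '203.0.113.5 - - [01/Jan/2025:10:00:09 +0000] "GET /blog HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
]
hits = count_ai_hits(sample_log)
print(hits)
```

Hits from a crawler you have disallowed suggest the directive is being ignored. A crawler that disguises its user agent will not show up this way, which is why IP-based blocking is mentioned as a backstop.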
Last word
Run the scan. See your stance.
Free, instant, and without registration.