The web has a new class of visitor: the AI crawler. Bots from OpenAI, Anthropic, Google, Amazon, and others are indexing content at a scale and frequency that dwarf traditional search engine bots - and most site owners have no idea it's happening.
Standard analytics tools are blind to them. AI crawlers don't execute JavaScript. Google Analytics sees nothing. Server logs fill up with User-Agent strings that most people never look at. The result is a growing layer of traffic - in some cases substantial traffic - that is completely invisible to conventional measurement.
This article presents 54 days of AI crawler data collected from a real, production site using server-level instrumentation. The numbers tell a story that changes how you should think about who you let in - and who you should show the door.
A note on methodology: This data was collected using an NGINX Lua module that detects AI crawler User-Agent strings and fires events to Google Analytics 4 via the Measurement Protocol API - completely bypassing the JavaScript dependency that makes AI crawlers invisible to standard analytics. The full implementation is documented here: Tracking AI Crawlers with NGINX and Google Analytics 4. This approach requires control over your server infrastructure. Shared hosting environments cannot implement it at this layer - which is precisely why this data is scarce.
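To make the methodology concrete, the server-side detection it describes can be sketched roughly as follows. This is a minimal illustration, not the article's actual module: it assumes OpenResty (lua-nginx-module plus lua-resty-http), and the `measurement_id` and `api_secret` values are placeholders for your own GA4 credentials.

```nginx
# Minimal sketch only - assumes OpenResty with lua-resty-http installed.
# Replace G-XXXXXXXXXX and SECRET with your GA4 Measurement Protocol credentials.
location / {
    log_by_lua_block {
        local ua = ngx.var.http_user_agent or ""
        local bots = { "GPTBot", "ChatGPT-User", "ClaudeBot",
                       "Amazonbot", "PerplexityBot" }
        for _, bot in ipairs(bots) do
            -- plain-text find (no Lua patterns), so "ChatGPT-User" is safe
            if ua:find(bot, 1, true) then
                local uri = ngx.var.uri  -- capture before leaving this phase
                -- cosockets are unavailable in the log phase, so defer the
                -- outbound HTTP call with a zero-delay timer
                ngx.timer.at(0, function()
                    local httpc = require("resty.http").new()
                    httpc:request_uri(
                        "https://www.google-analytics.com/mp/collect"
                        .. "?measurement_id=G-XXXXXXXXXX&api_secret=SECRET",
                        {
                            method = "POST",
                            body = string.format(
                                '{"client_id":"555","events":[{"name":"ai_crawler","params":{"crawler":"%s","page":"%s"}}]}',
                                bot, uri),
                        })
                end)
                break
            end
        end
    }
}
```

Because the event fires from the log phase via a timer, the crawler's response is never delayed by the analytics call.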
The AI Crawler Numbers
Over 54 days - February 16 through April 10, 2026 - the instrumentation recorded 28,878 AI crawler events across 4,570 unique pages, averaging 534 events per day.
The peak was February 17th with 1,549 events. The visible drop around March 4th is not a decline in overall AI crawler interest - it is the result of a single decision: blocking one crawler. That story is below.
Two Kinds of Crawlers
Before the data makes sense, a framework is needed. AI crawlers are not a monolith. They divide cleanly into two categories based on one question: does this crawler have a return path to your content?
Return-value crawlers surface your content to humans in real time. When ChatGPT-User visits your site, it is because a person is actively chatting with ChatGPT and the model has decided your page answers their question. That person can see a citation. They can click through. You gave something, and there is a mechanism to get something back.
Extraction-only crawlers collect content for training data. There is no search result, no citation, no referral. GPTBot is building the next version of a language model. ClaudeBot is doing the same for Claude. AmazonBot is doing it for whatever Amazon is building. The transaction is one-directional: your content leaves, nothing comes back.
Neither category is inherently malicious. Training data serves a legitimate purpose. But understanding which crawlers fall into which category is the foundation of a deliberate access policy.
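The two-category split amounts to a lookup against known User-Agent tokens. A sketch, with the crawler lists drawn from the crawlers discussed in this article (any token not covered here falls through to "unknown"):

```python
# Illustrative classification of the crawlers discussed in this article.
# "Return-value" crawlers have a live path back to a human reader;
# "extraction-only" crawlers collect content with no referral mechanism.
RETURN_VALUE = {"ChatGPT-User", "PerplexityBot"}
EXTRACTION_ONLY = {"GPTBot", "ClaudeBot", "Amazonbot"}

def classify(user_agent: str) -> str:
    """Return 'return-value', 'extraction-only', or 'unknown' for a UA string."""
    for bot in RETURN_VALUE:
        if bot in user_agent:
            return "return-value"
    for bot in EXTRACTION_ONLY:
        if bot in user_agent:
            return "extraction-only"
    return "unknown"

print(classify("Mozilla/5.0 ChatGPT-User/1.0"))  # return-value
print(classify("Mozilla/5.0 GPTBot/1.1"))        # extraction-only
```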
The AmazonBot Problem
AmazonBot was active for 16 of the 54 days in this dataset before being blocked on March 4th. In those 16 days it generated 6,852 events - an average of 428 crawls per day.
To put that in context: while AmazonBot was active, total AI crawler traffic averaged 887 events per day, meaning AmazonBot's 428 nearly matched the roughly 459 per day generated by every other source combined. The overall average across all crawlers for the full 54-day period, by comparison, was just 534 per day.
Extrapolated to the full 54-day period, AmazonBot would have generated an estimated 23,126 events - representing 51% of all AI crawler traffic.
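The extrapolation follows directly from the reported figures and can be reproduced in a few lines:

```python
# Reproducing the extrapolation in the text from the reported figures.
days_active, events_active = 16, 6_852   # AmazonBot's actual run
total_days, total_events = 54, 28_878    # full dataset

daily_rate = events_active / days_active      # events per day while active
projected = daily_rate * total_days           # full-period projection

# What the 54 days would have looked like had AmazonBot never been
# blocked: everyone else's actual traffic plus AmazonBot's projection.
others = total_events - events_active
share = projected / (others + projected)

print(round(daily_rate))    # 428
print(round(projected))     # 23126
print(round(share * 100))   # 51
```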
The return on that investment: zero. Amazon has no search engine that cites sources. No AI assistant with a referral mechanism. No path from AmazonBot's crawl to a human arriving at your site. It is pure extraction at industrial scale.
Blocking it cut the daily average from 887 events to 386 - a 56% reduction in AI crawler load - with no corresponding loss in visibility, citations, or referral potential.
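One way to implement such a block - and, per the framework below, AmazonBot does honor robots.txt - is a plain disallow rule. Amazon documents its User-Agent token as Amazonbot:

```
User-agent: Amazonbot
Disallow: /
```

Note that this is an all-or-nothing lever: Amazon does not support the crawl-delay directive, so there is no robots.txt middle ground between full access and none.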
The lesson is not that Amazon is uniquely bad. The lesson is that crawler volume and crawler value are completely uncorrelated. The most aggressive crawler in this dataset was also the least valuable.
The ChatGPT-User Finding
The single most important number in this dataset is 9,491.
That is the count of ChatGPT-User events over 54 days - approximately 176 per day. ChatGPT-User is not a training crawler. It is the real-time browsing agent that fires when a person is actively chatting with ChatGPT and the model browses the web to answer their question.
Every one of those 9,491 events represents a human conversation in which this site's content was surfaced as a relevant answer. Not someday in a future model. Right now, in an active session.
This distinction - between GPTBot (training) and ChatGPT-User (live in-chat browsing) - is one that most site owners do not know exists. They are listed as separate crawlers in your server logs, they serve completely different functions, and they have completely different implications for your visibility in the AI-driven web.
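Because they are separate user agents, robots.txt can treat them independently. A sketch of one deliberate policy - opting out of training while staying visible in live browsing (the Allow record is the default behavior, shown here only for explicitness):

```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
```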
GPTBot generated 3,644 crawl events in this period. That content may or may not influence a future model training run. ChatGPT-User generated 9,491 - each one serving a person who needed an answer today.
If you are thinking about AI visibility, ChatGPT-User is the number to watch.
What They're Actually Crawling
4,570 unique pages were crawled across the 54-day period. The distribution is revealing.
The homepage absorbed 2,528 crawls - the most of any single destination. robots.txt recorded 1,116 hits. The three sitemap files combined for 2,539. Together, the homepage and these infrastructure files account for 6,183 events - bots doing reconnaissance before they ever touch content.
Among content pages, the most crawled were an extension page for an Amazon Product Advertising API plugin (1,869 events) and a blog post troubleshooting Amazon API errors (970 events). The likely explanation: AmazonBot, crawling documentation about Amazon's own product during its 16-day run.
The AI crawler tracking article itself - the one describing exactly this instrumentation - recorded 281 crawls. AI agents reading about AI agent tracking. The irony is noted.
A Framework for Access Decisions
Not every site owner needs to block aggressively. But every site owner should have a deliberate policy rather than an accidental one. The following decision framework is derived directly from what this data reveals:
The questions are simple:
Does the crawler have a return path to your content - a citation mechanism, a search result, a live browsing agent that surfaces your page to humans? If yes, it earns its bandwidth.
Is the crawler volume proportionate to its value? A crawler hitting 428 pages per day with no return mechanism is a different proposition than one hitting 50.
Does the crawler respect robots.txt? A bot that ignores your directives has already answered the question of whether it deserves access. For the record, AmazonBot does respect robots.txt - aggressive as it is, it honors a disallow rule.
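The three questions above can be sketched as a small decision function. The volume threshold and the return strings are illustrative choices, not prescriptions from the article:

```python
from dataclasses import dataclass

@dataclass
class Crawler:
    name: str
    has_return_path: bool   # citation, search result, or live browsing agent
    daily_volume: float     # average crawls per day
    respects_robots: bool

def access_decision(c: Crawler, volume_threshold: float = 200) -> str:
    """Apply the three framework questions in order; threshold is illustrative."""
    if not c.respects_robots:
        return "block at server level"    # robots.txt won't be honored anyway
    if c.has_return_path:
        return "allow"                    # it earns its bandwidth
    if c.daily_volume > volume_threshold:
        return "block via robots.txt"     # high volume, no return mechanism
    return "allow, monitor"               # low-volume extraction

# Figures from the dataset, return-path judgments as in the article
print(access_decision(Crawler("AmazonBot", False, 428, True)))    # block via robots.txt
print(access_decision(Crawler("ChatGPT-User", True, 176, True)))  # allow
```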
The Broader Implication
This data was collected from one site over 54 days. It is not a representative sample of the web. But the patterns it reveals - the volume disparity between crawlers, the value gap between training bots and live browsing agents, the outsized impact of a single aggressive extractor - are structural, not coincidental.
Most sites are flying completely blind on this layer of traffic. The instrumentation required to see it demands server-level access that shared hosting cannot provide. The result is that decisions about robots.txt, crawler access, and AI visibility are being made without data by almost everyone.
That is the gap this article exists to close - at least partially, for the people who can act on it.
The full implementation for collecting this data yourself is documented here: Tracking AI Crawlers with NGINX and Google Analytics 4.
Frequently Asked Questions

What is the difference between ChatGPT-User and GPTBot?

They are completely separate crawlers serving different purposes. GPTBot crawls content to train OpenAI's foundation models - future model versions. ChatGPT-User is the real-time browsing agent that fires when a person is actively chatting with ChatGPT and the model fetches a live web page to answer their question. OpenAI has publicly stated that ChatGPT-User is not used for training generative AI foundation models. You can block one without affecting the other via robots.txt.

Can I block GPTBot without losing visibility in ChatGPT responses?

Yes. Blocking GPTBot prevents your content from being used in training data but does not affect ChatGPT-User. As long as ChatGPT-User is permitted, ChatGPT can still browse your site in real time during active user conversations and cite your pages in responses. The settings are fully independent.

What does AmazonBot actually do with the content it collects?

According to Amazon's own documentation, AmazonBot is used to improve their products and services and may be used to train Amazon AI models. It is distinct from Amzn-SearchBot, which powers search experiences in Alexa and Rufus and explicitly does not crawl for AI model training. AmazonBot is the training crawler. There is no referral mechanism - no Amazon search product sends users to external sites based on AmazonBot's crawl data.

Why can't standard analytics tools track AI crawlers?

AI crawlers fetch HTML and leave. They do not execute JavaScript. Since Google Analytics, Adobe Analytics, and virtually every other standard analytics platform relies on a JavaScript tag firing in the browser to record a visit, AI crawler requests pass through completely undetected. The only reliable way to capture this traffic is at the server layer - before any JavaScript dependency enters the picture.

Should I block ClaudeBot?

ClaudeBot is Anthropic's training crawler and falls into the extraction-only category - it collects content to train Claude models. However, blocking it does not affect Claude's web search or citation capabilities, in the same way that blocking GPTBot doesn't affect ChatGPT-User. Anthropic's real-time browsing agent operates separately. If training data contribution is your concern, blocking ClaudeBot is a reasonable choice that carries no cost to your current AI visibility.

How do I block AmazonBot without affecting legitimate Amazon search crawlers?

Amazon operates three distinct crawlers. AmazonBot is the training and general-purpose crawler. Amzn-SearchBot powers Alexa and Rufus search experiences and explicitly does not crawl for AI model training. Amzn-User handles real-time user queries. You can block AmazonBot specifically in robots.txt while allowing the others - the settings are fully independent.

Does blocking AI crawlers affect traditional search engine rankings?

No. Google, Bing, and other traditional search engines use entirely separate crawlers - Googlebot, Bingbot - that are not affected by rules targeting AI crawler user agents. Blocking GPTBot, ClaudeBot, or AmazonBot has no effect on your traditional search rankings.

Is AmazonBot's aggressive crawl behavior well-documented?

Yes - it is a known issue reported across multiple webmaster communities. Reports of AmazonBot crawling entire sites in under an hour, generating request volumes far exceeding any other crawler, and consuming disproportionate bandwidth appear consistently in webmaster forums. Amazon does not support the crawl-delay directive, which removes one of the standard tools site owners use to moderate aggressive crawlers.

What is PerplexityBot and why does it matter despite low volume?

PerplexityBot is the crawler for Perplexity AI, a search engine built entirely around AI-generated answers with cited sources. Unlike training crawlers, every PerplexityBot crawl is directly connected to Perplexity's citation and referral system - when your content is indexed, it can appear as a cited source with a clickable link in Perplexity's answers. Low volume with direct citation return makes it one of the higher-value crawlers in the dataset despite its small numbers.