A growing number of websites are taking steps to ban AI bot traffic so that their work isn’t used as training data and their servers aren’t overwhelmed by non-human users. However, some companies are ignoring the bans and scraping anyway.
Online traffic analysis conducted by BuiltWith, a web metrics biz, indicates that the number of publishers trying to prevent AI bots from scraping content for use in model training has surged since July.
About 5.6 million websites have now added OpenAI’s GPTBot to the disallow lists in their robots.txt files, up from about 3.3 million at the start of July 2025. That’s an increase of almost 70 percent.
Websites can signal to visiting crawlers whether they allow automated requests to harvest information through entries in their robots.txt files. Compliance with these directives is voluntary, but repeated failure to respect these rules may come up in litigation, as it did in Reddit’s scraping lawsuit against Anthropic earlier this year.
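These directives are plain text. A robots.txt entry that turns OpenAI’s training crawler away from an entire site, for instance, looks like this (GPTBot is the user-agent token OpenAI documents for that crawler):

```
# robots.txt: disallow OpenAI's training crawler site-wide
User-agent: GPTBot
Disallow: /
```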
Speaking of Anthropic, the company’s ClaudeBot is also increasingly wearing out its welcome. ClaudeBot is now blocked at about 5.8 million websites, up from 3.2 million in early July. The company’s Claude-SearchBot – used for surfacing sites in Claude search results – also faces a rising block rate.
The situation is similar for AppleBot, now blocked at about 5.8 million websites, up from about 3.2 million in July.
Even GoogleBot – which indexes data for search – faces growing resistance, perhaps because it’s also used for the AI Overviews now surfaced atop search results. BuiltWith reports that 18 million sites now ban the bot, which would also mean that those sites could not be indexed in Google Search.
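Granularity may be part of the problem. Google offers a separate robots.txt token, Google-Extended, that opts a site out of training Google’s AI models without affecting Googlebot’s search crawling, though it does not govern AI Overviews, which draw on ordinary search indexing. For sites that want that split anyway, the entries look roughly like this:

```
# Illustrative: block Google AI model training, keep search crawling
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
```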
As of July, about half of news sites blocked GPTBot, according to Arc XP, a publishing platform biz spun out of The Washington Post.
Anthropic, OpenAI, and Google did not immediately respond to requests for comment.
Anirudh Agarwal, CEO of web marketing consultancy OutreachX, said in an emailed statement that how often GPTBot is getting turned away is noteworthy because it signals how publishers think about AI crawlers generally: if OpenAI’s GPTBot is being blocked, every other AI crawler faces the same possibility.
Tollbit, a biz that aims to help publishers monetize AI traffic through access fees for crawlers, said in its Q2 2025 report that, in the past year, there’s been a 336 percent increase in sites blocking AI crawlers.
The company also said that, across all AI bots, 13.26 percent of requests ignored robots.txt directives in Q2 2025, up from 3.3 percent in Q4 2024. This alleged behavior has been challenged in court by Reddit as noted above, and in a lawsuit filed by major news publishers against Perplexity in 2024.
But bot blocking efforts have become more complicated because AI firms like OpenAI and Perplexity have launched browsers that incorporate their AI models. According to the Tollbit report, "The latest AI browsers like Perplexity Comet, and devtools like Firecrawl or Browserless are indistinguishable from humans in site logs." So publishers that block Comet or the like might just be blocking human traffic. As a result, Tollbit argues, it’s critical that non-human site traffic accurately identifies itself.
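Tollbit’s point cuts both ways: honest self-identification is cheap to implement. As a minimal sketch (not any vendor’s actual code, and with a hypothetical ExampleAIBot user-agent string), a compliant crawler using Python’s standard library would announce itself and check robots.txt before fetching anything:

```python
# Minimal sketch of a well-behaved crawler: send an identifiable
# User-Agent and honor robots.txt before fetching.
# "ExampleAIBot" is a hypothetical token, not any vendor's real crawler.
import urllib.robotparser
import urllib.request
from urllib.parse import urlsplit

USER_AGENT = "ExampleAIBot/1.0 (+https://example.com/bot-info)"

def polite_fetch(url: str) -> bytes | None:
    parts = urlsplit(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()  # fetch and parse the site's robots.txt
    if not robots.can_fetch(USER_AGENT, url):
        return None  # the site disallows this agent for this path: stop
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    body = polite_fetch("https://example.com/some-article")
    print("fetched" if body is not None else "disallowed by robots.txt")
```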
For organizations that are not major publishers, the AI bot onslaught can be overwhelming. In October, blogging service Bear reported an outage caused by AI bot traffic, a problem also noted by Belgium-based blogger Wouter Groeneveld. And developer David Gerard, who runs the AI-skeptic blog Pivot-to-AI, wrote on Mastodon last month about how RationalWiki.org was struggling to keep AI bots at bay.
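For small operators without a commercial bot-management layer, the blunt first line of defense is usually a server-level User-Agent filter, something like the illustrative nginx fragment below, though it only stops crawlers that identify themselves honestly:

```
# Illustrative nginx fragment (map goes in the http context), not taken
# from any site mentioned above: refuse requests whose User-Agent
# matches known AI-crawler tokens.
map $http_user_agent $is_ai_bot {
    default     0;
    ~*GPTBot    1;
    ~*ClaudeBot 1;
}

server {
    listen 80;
    if ($is_ai_bot) {
        return 403;  # or 429 to signal rate-limiting instead
    }
}
```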
Will Allen, VP of product at Cloudflare, told The Register in an interview last month that the company sees "a lot of people that are out there trying to scrape large amounts of data, ignoring any robots.txt directives, and ignoring other attempts to block them."
Bot traffic, said Allen, is increasing, which in and of itself isn’t necessarily a bad thing. But it does mean, he said, that there are more attacks and more people trying to get around paywalls and content restrictions.
Cloudflare, over the summer, launched a service called Pay per crawl in a bid to allow content owners to offer automated access for a price.
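Cloudflare has said the scheme revives HTTP’s long-dormant 402 Payment Required status code. The rough shape of such an exchange might look like the sketch below; the header names are placeholders for illustration, not Cloudflare’s actual API:

```
# Hypothetical pay-per-crawl exchange; header names are placeholders
GET /article HTTP/1.1
Host: publisher.example
User-Agent: ExampleAIBot/1.0

HTTP/1.1 402 Payment Required
Crawl-Price: USD 0.01          # server quotes a per-request fee

GET /article HTTP/1.1
Host: publisher.example
User-Agent: ExampleAIBot/1.0
Crawl-Payment: USD 0.01        # crawler accepts; server now returns 200 OK
```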
Allen declined to disclose which sites have signed up to participate in the beta testing but said it’s clear that new economic options would be helpful.
"We have a thesis or two about how that could evolve," he said. "But really, we think there’s going to be a lot of different evolution, a lot of different experimentation. And so we’re keeping a pretty tight private beta for our Pay per crawl product just to really learn, from both sides of the market – people who are looking to access content at scale and people who are looking to protect content." ®