Nonprofit Common Crawl issued a November 4 statement defending its data collection methods, citing technical constraints that prevent content deletion.

Common Crawl Foundation released a detailed response on November 4, 2025, addressing allegations that the organization has misled publishers about removing their content from its web archives. The statement, published on the nonprofit’s blog, comes in direct response to an Atlantic investigation by Alex Reisner that questioned the organization’s handling of takedown requests.
Executive Director Rich Skrenta stated in the response that Common Crawl “communicates honestly with publishers who contact us” and emphasized that the foundation has “always operated in good faith, in public view, and in accordance with our mission to serve the common good.” The organization rejected characterizations that it has lied to publishers, calling such allegations “untrue” and stating they “misrepresent both how Common Crawl operates and the values that guide our work.”
Common Crawl has operated as a nonprofit foundation since 2007 with a stated mission to make a public, open archive of the web freely available to researchers, educators, journalists, and developers. The organization’s web crawler, known as CCBot, collects data from publicly accessible web pages. According to the foundation’s statement, “We do not go ‘behind paywalls,’ do not log in to any websites, and do not employ any method designed to evade access restrictions.”
Technical barriers explained
Skrenta provided technical explanations for the challenges the organization faces in removing content from its archives. Common Crawl’s archives are stored in an immutable format known as WARC files, which are used by libraries and archivists worldwide. The foundation cannot edit those files after publication without breaking their integrity. “Because Common Crawl’s archives are stored in an immutable format (WARC files) used by libraries and archivists worldwide, we cannot ‘edit’ those files after publication without breaking their integrity,” the statement explained.
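The append-only design of the WARC format can be illustrated with a short sketch. The snippet below is not Common Crawl's own tooling; it assumes the open-source warcio Python library and a hypothetical local archive file, and it simply iterates over records in sequence, since the format offers no way to edit a record in place.

```python
# Illustrative sketch using the open-source warcio library (not Common Crawl's
# own tooling). WARC archives are written and read sequentially; records can be
# appended or copied into a new file, but not edited in place.
from warcio.archiveiterator import ArchiveIterator

def list_captured_urls(warc_path):
    """Yield the target URL of every response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                yield record.rec_headers.get_header("WARC-Target-URI")

# Hypothetical usage with a local archive segment:
# for url in list_captured_urls("CC-MAIN-example.warc.gz"):
#     print(url)
```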
Instead of editing published archives, the foundation removes or filters affected URLs from subsequent crawls and makes them inaccessible through public tools and indices. The organization characterized this approach as “standard practice in large-scale web archiving” rather than concealment. The statement noted that file storage system logs show no content file modifications in Common Crawl’s archives since 2016, which the foundation attributes to the technical design of its immutable storage format rather than inaction on removal requests.
The nonprofit addressed questions about its public “Index Server” and “CC Index” tools, which some publishers have used to verify removal status. According to the statement, these tools “are designed for efficient search, not for legal confirmation of every URL’s removal status.” A “no captures” result in a search interface reflects how indices are generated rather than what is stored internally. The foundation rejected suggestions that this represents deception, calling it a misrepresentation of technical facts.
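For readers who want to see how the public index behaves, the sketch below queries the CDX endpoint at index.commoncrawl.org for a given URL. The crawl label is a placeholder, and, as the foundation argues, an empty result reflects what the index exposes rather than a legal confirmation of deletion.

```python
# Sketch of a lookup against Common Crawl's public CDX index server. The crawl
# label below is a placeholder; available indexes are listed at
# https://index.commoncrawl.org/. A 404 or empty response means "no captures"
# in that index, which is not the same thing as proof of removal from storage.
import json
import urllib.error
import urllib.parse
import urllib.request

def lookup_captures(url, crawl="CC-MAIN-2024-33"):
    """Return whatever index records the public endpoint reports for a URL."""
    query = urllib.parse.urlencode({"url": url, "output": "json"})
    endpoint = f"https://index.commoncrawl.org/{crawl}-index?{query}"
    try:
        with urllib.request.urlopen(endpoint) as resp:
            return [json.loads(line) for line in resp.read().splitlines() if line]
    except urllib.error.HTTPError as err:
        if err.code == 404:  # the index reports no captures for this URL
            return []
        raise
```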
Funding and independence
Common Crawl disclosed its funding sources in response to questions about relationships with the artificial intelligence industry. For over fifteen years, the organization received support almost entirely from the Elbaz Family Foundation Trust. In recent years, as public and research interest in large-scale text analysis has grown, several organizations have contributed donations to support the cost of running and maintaining the public archive.
The foundation received donations from OpenAI totaling $250,000 and Anthropic totaling $250,000 in 2023, along with other organizations involved in AI development. These donations represent a small fraction of overall operating needs, according to the statement. Skrenta told Reisner that running Common Crawl costs “millions of dollars.” The foundation emphasized that these contributions are disclosed publicly in financial statements and that “no donor, corporate or otherwise, has any control over what we collect, publish, or remove.”
Common Crawl rejected characterizations that it has become aligned with AI industry interests. “Common Crawl is not - and has never been - ‘doing the AI industry’s dirty work,’” the statement declared. The organization provides open data for everyone, including researchers studying misinformation, linguistics, digital preservation, machine translation and public health. Tens of thousands of academic papers and public-interest projects have relied on Common Crawl over the years, many entirely unrelated to artificial intelligence.
Removal process complexity
The foundation acknowledged that its small team manages an archive of many petabytes, a scale that makes real-time deletion technically complex. The organization stated it continues to work diligently to meet removal requests and communicate progress transparently. “No one at Common Crawl has ever claimed this work was instantaneous or complete; rather, we have been open about its complexity and ongoing nature,” the statement explained.
When publishers request content removal, Common Crawl responds promptly and initiates a removal process that reflects the technical design of its dataset. The organization has engaged directly with organizations such as The New York Times, the Danish Rights Alliance, and others that have requested data removal or clarification. According to the statement, “In every case, we have responded, cooperated, and implemented the requested changes to the extent technically possible.”
The Atlantic investigation found many Times articles still present in the archives despite a July 2023 removal request. Times spokesperson Charlie Stadtlander initially told Business Insider in November 2023 that Common Crawl had complied with the removal request. After being informed of the investigation’s findings, Stadtlander stated in 2025 that “our understanding from them is that they have deleted the majority of the Times’s content, and continue to work on full removal.”
The Danish Rights Alliance described a similar interaction in documents shown to Reisner. The organization’s head of content protection and enforcement, Thomas Heldrup, showed a redacted email exchange beginning in July 2024 requesting member content removal. In December 2024, more than six months after the initial request, Common Crawl’s attorney wrote that “approximately 50% of this content has been removed.” Other publishers received similar messages indicating removal was 50 percent, 70 percent, and then 80 percent complete.
Operational transparency
Common Crawl emphasized its commitment to transparency throughout its statement. The foundation publishes its crawling code and documentation publicly, identifies itself clearly as “CCBot” in its user agent string, honors robots.txt exclusions, and complies with takedown and removal requests sent in good faith. “These principles have not changed in over a decade,” the statement noted.
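As a rough illustration of the robots.txt mechanism the statement refers to, the sketch below uses Python's standard-library parser to check whether a site's robots.txt excludes the CCBot user agent; the domain and path are placeholders.

```python
# Minimal sketch of a robots.txt check for the CCBot user agent, using the
# Python standard library. The domain and article path are placeholders.
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# A publisher that wants to exclude Common Crawl's crawler publishes:
#   User-agent: CCBot
#   Disallow: /
print(parser.can_fetch("CCBot", "https://example.com/some-article"))
```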
The organization’s approach has remained consistent since its founding. Every dataset published is openly documented, every update is logged and timestamped, and the code is public. “Our operations are open for inspection,” the statement declared. The foundation characterized accusations of “masking its archives” or “lying to publishers” as not just inaccurate but undermining a valuable public resource that exists precisely to promote transparency.
Skrenta told Reisner that removal requests are “a pain in the ass” but insisted the foundation complies. In a second conversation, Skrenta was more forthcoming about technical constraints, stating Common Crawl is “making an earnest effort” to remove content but that the file format storing its archives is “meant to be immutable. You can’t delete anything from it.” The executive director did not answer questions about where the 50, 70, and 80 percent removal figures originate.
Robot rights and fair use
Skrenta has publicly framed the debate around AI training data in terms of robot rights and fair use. The executive director told Reisner that “the robots are people too,” suggesting they should be allowed to “read the books” for free. This perspective aligns with arguments from AI companies that using copyrighted material constitutes fair use under existing law.
Common Crawl founder Gil Elbaz stated in a 2012 interview that “we just have to make sure that people use it in the right way. Fair use says you can do certain things with the world’s data.” The organization sent a letter in 2023 urging the U.S. Copyright Office not “to hinder the development of intelligent machines.” The letter included two illustrations of robots reading books.
When Reisner asked about publishers excluding themselves from what Skrenta called “Search 2.0,” referring to generative AI products now widely used to find information online, Skrenta stated that publishers “shouldn’t have put your content on the internet if you didn’t want it to be on the internet.”
The executive director expressed little concern about the importance of individual publications. He told Reisner that The Atlantic is not a crucial part of the internet. “Whatever you’re saying, other people are saying too, on other sites,” Skrenta said. He did express reverence for Common Crawl’s archive, viewing it as a record of civilization’s achievements. Skrenta told Reisner he wants to “put it on a crystal cube and stick it on the moon” so that “if the Earth blows up,” aliens might be able to reconstruct human history. “The Economist and The Atlantic will not be on that cube,” Skrenta told Reisner. “Your article will not be on that cube. This article.”
Former Mozilla researcher Stefan Baack pointed out in his 2024 report that Common Crawl could require attribution whenever its scraped content is used. This would help publishers track use of their work, including when it appears in training data of AI models not supposed to have access. Attribution requirements are common for open datasets and would cost Common Crawl nothing. When asked if he had considered this suggestion, Skrenta told Reisner he had read Baack’s report but didn’t plan on implementing the recommendation because it wasn’t Common Crawl’s responsibility. “We can’t police that whole thing,” Skrenta said. “It’s not our job. We’re just a bunch of dusty bookshelves.”
AI training data ecosystem
Common Crawl has become integral to the artificial intelligence training data ecosystem. According to Baack, “generative AI in its current form would probably not be possible without Common Crawl.” In 2020, OpenAI used Common Crawl’s archives to train GPT-3. OpenAI claimed the program could generate “news articles which human evaluators have difficulty distinguishing from articles written by humans.” In 2022, an iteration of that model, GPT-3.5, became the basis for ChatGPT.
The organization has been publishing regular crawls since 2013, each containing 1 billion to 4 billion webpages, and adds new crawls to its archive every few weeks. When training AI models, developers such as OpenAI and Google typically filter Common Crawl’s archives to remove unwanted material including racism, profanity, and low-quality prose. Each developer employs its own filtering strategy, leading to a proliferation of Common Crawl-based training datasets including c4 (created by Google), FineWeb, DCLM, and more than 50 others.
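That filtering step can be sketched in a few lines. The heuristics below are illustrative placeholders only; the actual rules behind c4, FineWeb, DCLM, and other derivatives differ and are described in their respective papers.

```python
# Illustrative quality filter of the kind applied to Common Crawl text before
# model training. These heuristics are placeholders, not the rules used by any
# particular dataset such as c4 or FineWeb.
def keep_document(text, blocklist=("lorem ipsum",)):
    """Decide whether one extracted page passes a few simple quality checks."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 3:  # too little prose to be useful
        return False
    if any(term in text.lower() for term in blocklist):  # unwanted content
        return False
    terminal = sum(line.endswith((".", "!", "?")) for line in lines)
    return terminal / len(lines) >= 0.5  # mostly complete sentences
```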
Common Crawl doesn’t only supply raw text. The organization has been helping assemble and distribute AI training datasets itself. Its developers have co-authored multiple papers about large language model training data curation. They sometimes appear at conferences showing AI developers how to use Common Crawl for training. Common Crawl even hosts several AI training datasets derived from its crawls, including one for Nvidia. In its paper on the dataset, Nvidia thanks certain Common Crawl developers for their advice.
Publisher response and industry context
The marketing community faces mounting challenges from unauthorized AI training data collection. Over 35% of the world’s top 1000 websites now block OpenAI’s GPTBot web crawler, representing a seven-fold increase from August 2023 when only 5% blocked the crawler. Common Crawl’s CCBot has become the scraper most widely blocked by the top 1,000 websites in the past year, surpassing even OpenAI’s GPTBot.
More than 80 media executives gathered in New York during the week of July 30, 2025, under the IAB Tech Lab banner to address what many consider an existential threat to digital publishing. Mediavine Chief Revenue Officer Amanda Martin joined representatives from Google, Meta, and numerous other industry leaders in confronting AI companies that scrape publisher content without consent or compensation. Notably absent from the gathering were OpenAI, Anthropic, and Perplexity.
Cloudflare research released August 29, 2025, revealed stark imbalances between how much content AI platforms crawl for training purposes versus traffic they refer back to publishers. Anthropic crawled 38,000 pages for every referred page visit in July 2025, while OpenAI maintained a ratio of 1,091 crawls per referral. Training-related crawling now drives nearly 80% of all AI bot activity, representing an increase from 72% documented one year earlier.
Blocking only prevents future content from being scraped. It doesn’t affect webpages Common Crawl has already collected and stored in its archives. The foundation has been publishing these regular installments since 2013, creating an archive that spans more than a decade of web content.
Legal developments
A federal court dismissed Raw Story’s lawsuit against OpenAI on November 10, 2024, citing lack of standing under the Digital Millennium Copyright Act. Raw Story Media and AlterNet Media collectively published over 400,000 articles allegedly scraped and included in OpenAI’s training datasets WebText, WebText2, and Common Crawl. The court’s decision suggests that simply having content included in AI training datasets, without specific instances of harmful use, may not be sufficient for legal standing.
IAB Europe released technical standards in September 2025 requiring AI platforms to compensate publishers for content ingestion. According to IAB Europe Data Analyst Dimitris Beis, the framework addresses “a paradigm of publisher remuneration for content ingestion” through three core mechanisms: content access controls, discovery protocols, and monetization APIs. The framework emerges from documented traffic disruptions, with referrals from AI platforms increasing 357% year-over-year in June 2025.
Cloudflare launched a pay-per-crawl service on July 2, 2025, allowing content creators to charge AI crawlers for access. The service affects major AI crawlers including CCBot (Common Crawl), ChatGPT-User (OpenAI), ClaudeBot (Anthropic), and GPTBot (OpenAI). Publishers control three distinct options for each crawler: allow free access, charge at configured domain-wide pricing, or block access entirely.
Implications for publishers
The developments create significant implications for the marketing community. TikTok emerged as the most scraped website in 2025, jumping from outside the top 10 with 321% traffic growth, according to research released September 9, 2025, by web scraping company Decodo. Video and social media platforms now represent 38% of all scraping activity, reflecting demand for multimodal AI training data.
Meta’s leaked scraping operations revealed systematic data collection from approximately 6 million unique websites, according to documents published by Drop Site News on August 6, 2025. The comprehensive operation encompasses roughly 100,000 of the internet’s most-trafficked domains, demonstrating the scope of modern data collection efforts for AI training.
Publishers already struggle with identity challenges: according to Wunderkind research, 84% cannot identify more than 25% of their website visitors. Traditional reliance on search traffic for building audience relationships through newsletter subscriptions, social media follows, and direct website bookmarks becomes compromised as AI features provide answers without directing users to source websites.
Reddit filed a lawsuit against Anthropic on June 4, 2025, alleging the AI company violated contractual agreements and engaged in unfair business practices by using Reddit content without authorization to train its Claude chatbot. The 28-page complaint seeks damages and injunctive relief for what Reddit characterizes as “commercial exploitation” of user-generated content valued at tens of billions of dollars.
The case documents reveal that Anthropic researchers, including CEO Dario Amodei, acknowledged using “Reddit comments” as training data to improve AI model performance. Reddit alleges this unauthorized use violates explicit terms in its User Agreement prohibiting commercial exploitation without written consent.
Common Crawl’s statement concluded by inviting dialogue about the ethics and responsibilities of web archiving. “We recognize that the digital landscape is changing and that publishers face real challenges in balancing openness with commercial sustainability,” the foundation stated. The organization invited The Atlantic and all interested parties to engage directly to verify claims, inspect datasets, and better understand the realities of open web archiving at scale. “We will continue to listen, improve our tools, and uphold our commitment to public transparency,” the statement concluded.
Timeline
- 2012: Gil Elbaz, founder of Common Crawl, states in interview that “we just have to make sure that people use it in the right way. Fair use says you can do certain things with the world’s data”
- 2013: Common Crawl begins publishing regular crawls containing 1 billion to 4 billion webpages every few weeks
- 2016: Last modification of content files in Common Crawl’s archives, according to system logs
- 2020: OpenAI uses Common Crawl’s archives to train GPT-3
- 2022: GPT-3.5 becomes basis for ChatGPT
- July 2023: The New York Times sends notice to Common Crawl requesting removal of previously scraped Times content
- August 2023: OpenAI’s GPTBot blocked by 5% of top 1000 websites
- November 2023: Times spokesperson tells Business Insider that Common Crawl complied with removal request
- 2023: Common Crawl receives donations from OpenAI ($250,000), Anthropic ($250,000), and other AI development organizations after 15 years of support from Elbaz Family Foundation Trust
- June 2024: Cloudflare introduces feature to block AI scrapers and crawlers
- July 2024: Danish Rights Alliance initiates email exchange with Common Crawl requesting member content removal
- August 3, 2024: Study shows 35.7% of top 1000 websites now block OpenAI’s GPTBot, seven-fold increase from 2023
- November 10, 2024: Federal court dismisses Raw Story’s DMCA lawsuit against OpenAI over training data
- December 2024: Common Crawl’s attorney tells Danish Rights Alliance that “approximately 50% of this content has been removed”
- June 4, 2025: Reddit files lawsuit against Anthropic over unauthorized Claude AI training
- July 2, 2025: Cloudflare launches pay-per-crawl service allowing publishers to charge AI crawlers for access
- July 30, 2025: Over 80 media executives rally against AI scraping at IAB Tech Lab summit
- August 6, 2025: Meta leaked scraping list reveals harvesting from 6 million unique websites
- August 29, 2025: Cloudflare releases data showing Anthropic crawls 38,000 pages per referral while training-related crawling reaches 79% of AI bot traffic
- September 9, 2025: TikTok emerges as most scraped website with 321% traffic growth as video platforms represent 38% of scraping activity
- September 2025: IAB Europe unveils framework for AI publisher compensation requiring content ingestion payments
- November 4, 2025: The Atlantic publishes investigation by Alex Reisner revealing Common Crawl’s practices
- November 4, 2025: Common Crawl Foundation releases detailed response defending its operations and technical constraints
Summary
Who: Common Crawl Foundation, a nonprofit organization directed by Rich Skrenta and founded by Gil Elbaz, supplies archived web content to major AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon. Major news organizations including The New York Times, The Economist, Los Angeles Times, Wall Street Journal, The New Yorker, Harper’s, The Atlantic, BBC, Reuters, Wired, Financial Times, and Washington Post have their content included in the archives.
What: Common Crawl maintains petabyte-scale archives that AI companies use to train large language models. The organization issued a November 4, 2025, statement defending its operations and explaining technical constraints that prevent content deletion from immutable WARC file archives. Although the foundation says it complies with publisher removal requests, file system logs show no content file modifications since 2016. The organization received $250,000 donations each from OpenAI and Anthropic in 2023 and helps assemble AI training datasets. Common Crawl emphasized its transparency practices, its mission to serve the public good, and its rejection of allegations that it has misled publishers.
When: Common Crawl has been scraping webpages since the early 2010s and publishing regular crawls since 2013. The New York Times requested content removal in July 2023. The Danish Rights Alliance initiated removal requests in July 2024. Alex Reisner published his Atlantic investigation on November 4, 2025. Common Crawl released its response statement the same day. OpenAI used Common Crawl archives to train GPT-3 in 2020, which led to ChatGPT’s launch in 2022. No content files in the archives appear modified since 2016.
Where: Common Crawl operates by scraping billions of webpages globally to build archives measured in petabytes. The archived content has appeared in training data of thousands of AI models. Common Crawl adds new crawls to its archive every few weeks, each containing 1 billion to 4 billion webpages. The organization hosts AI training datasets including one for Nvidia and makes archives freely available for download through platforms including Hugging Face.
Why: This matters for the marketing community because publishers face declining traffic revenues as AI platforms consume content at unprecedented scales while providing minimal referrals. Training-related crawling now drives 79% of all AI bot activity, with Anthropic crawling 38,000 pages per referral according to August 2025 Cloudflare data. Publishers struggle with 84% unable to identify more than 25% of website visitors. Traditional audience-building strategies through search traffic become compromised as AI features provide answers without directing users to source websites. The developments represent an existential threat to digital publishing business models, with referrals from AI platforms increasing 357% year-over-year while publishers receive no compensation for content used in AI training. More than 80 media executives convened in July 2025 to address these challenges, implementing technical standards for AI publisher compensation through frameworks established by IAB Tech Lab and IAB Europe. Common Crawl’s response highlights the technical and philosophical tensions between open data principles and publisher compensation demands.