A Profile of Common Crawl (theatlantic.com)
Alex Reisner, The Atlantic:
The Common Crawl Foundation is little known outside of Silicon Valley. For more than a decade, the nonprofit has been scraping billions of webpages to build a massive archive of the internet. This database — large enough to be measured in petabytes — is made freely available for research. In recent years, however, this archive has been put to a controversial purpose: AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon have used it to train large language models. In the process, my reporting has found, Common Crawl has opened a back door for AI companies to train their models with paywalled articles from major news websites. And the foundation appears to be lying to publishers about this — as well as masking the actual contents of its archives.
What I particularly like about this investigation is how Reisner actually checked Common Crawl’s claims against its archives. That does not sound like much, but it carries far more weight than a typical “experts say” paragraph.
Reisner:
Common Crawl doesn’t log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you’re a subscriber and hides the content if you’re not. Common Crawl’s scraper never executes that code, so it gets the full articles. Thus, by my estimate, the foundation’s archives contain millions of articles from news organizations around the world, including The Economist, the Los Angeles Times, The Wall Street Journal, The New York Times, The New Yorker, Harper’s, and The Atlantic.
Publishers configure their websites like this for a couple of reasons, one being search: crawlers can index the full text of every article. I get why it feels wrong that Common Crawl takes advantage of this, and that it effectively grants full-text access for A.I. training data sets; I am not arguing it is wrong to treat this as a violation. But if publishers wanted a harder paywall, that is entirely possible. One of the problems with A.I. is that reasonable trade-offs have, quite suddenly, become wide-open back doors.
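To make the mechanism concrete, here is a minimal sketch of why a scraper that never executes JavaScript defeats this kind of client-side paywall: the server sends the full article in the HTML, and it is only a script running in the browser that checks your subscription and hides the body afterward. Everything in the snippet is illustrative; the URL is hypothetical and the user-agent string is just a placeholder, not a claim about how any particular crawler identifies itself.

```python
# Illustrative sketch: fetch the raw HTML the server sends, without running
# any of the page's JavaScript. On a client-side paywall, the full article
# body is already present in that markup; the script that would hide it for
# non-subscribers simply never executes here.

import requests
from bs4 import BeautifulSoup

resp = requests.get(
    "https://news.example.com/some-paywalled-article",  # hypothetical URL
    headers={"User-Agent": "example-scraper/1.0"},      # placeholder UA
    timeout=30,
)
soup = BeautifulSoup(resp.text, "html.parser")

# The article text sits in the markup for search engines (and anyone else)
# to read; no login, no script execution required.
article = soup.find("article")
if article is not None:
    print(article.get_text(separator="\n", strip=True))
```

A hard paywall, by contrast, would never put the article body in the response for an unauthenticated request in the first place, which is why I say the softer configuration is a deliberate trade-off rather than an oversight.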
Reisner:
In our conversation, Skrenta downplayed the importance of any particular newspaper or magazine. He told me that The Atlantic is not a crucial part of the internet. “Whatever you’re saying, other people are saying too, on other sites,” he said. Throughout our conversation, Skrenta gave the impression of having little respect for (or understanding of) how original reporting works.
Here is another problem with A.I.: no single website is individually meaningful, yet if all major publishers actually managed to remove their material from these data sets, A.I. tools would be meaningfully less capable. It is the same problem found in targeted advertising: nobody’s individual data is very important, yet collectively it has eroded our privacy to a criminal degree. Same problem as emissions, too, while I am at it. And the response to these collective problems is so frequently an individualized solution: opt out, decline tracking, ride your bike. It is simply not enough.