The Nonprofit Feeding the Entire Internet to AI Companies
theatlantic.com·3d·

Editor’s note: This work is part of AI Watchdog, The Atlantic’s ongoing investigation into the generative-AI industry.


The Common Crawl Foundation is little known outside of Silicon Valley. For more than a decade, the nonprofit has been scraping billions of webpages to build a massive archive of the internet. This database—large enough to be measured in petabytes—is made freely available for research. In recent years, however, this archive has been put to a controversial purpose: AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon have used it to train large language models. In the process, my reporting has found, Common Crawl has opened a back do…
