Published July 13, 2025 | Version 1.0.0
Dataset Open
Description
Details
This dataset provides a lightweight extraction of approximately 200 million org-level (registerable) domains referenced in Common Crawl’s main indexes. The aim is to provide an open, free, and mostly useful subset of global DNS entries: websites that were in use at the time of crawling.
To keep the data minimal and useful at smaller scales, each file contains only 3 columns: the domain name, the time first crawled, and the time last crawled. Data is provided for each year and for the entire range; filtered.tsv.gz is a smaller subset from which domains seen only once have been removed.
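As a rough illustration of the three-column layout, the following Python sketch streams records from one of the gzip-compressed TSV files. It assumes the columns appear in the order listed (domain, first crawled, last crawled) and that the files have no header row; neither is confirmed here. The timestamp format is not documented on this page, so both times are kept as raw strings. The names `DomainRecord` and `read_domains` are illustrative, not part of the project.

```python
import csv
import gzip
from typing import Iterator, NamedTuple


class DomainRecord(NamedTuple):
    domain: str
    first_crawled: str  # raw text; the timestamp format is an unknown here
    last_crawled: str


def read_domains(path: str) -> Iterator[DomainRecord]:
    """Stream records from a gzip-compressed, tab-separated file.

    Assumes three columns per row and no header row (both assumptions).
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) != 3:
                continue  # skip blank or malformed lines
            yield DomainRecord(*row)
```

Because the function is a generator, a file can be scanned without loading it into memory, for example `sum(1 for _ in read_domains("2012-domains.tsv.gz"))` to count rows.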
Source code
The scripts to recreate this dataset can be found in [the project’s GitHub repo](https://github.com/bitpla…
Caveats
This data has not yet been tested in production apps, and so may contain errors. Two known issues are:
- URL-encoded names have not been deduplicated.
- The extractor uses a naive rule to find the main domain (first domain that’s longer than 3 characters, and would not create more than 3 subdomains). This may result in lost or duplicate data.
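Until the first issue is addressed upstream, consumers may want to merge URL-encoded duplicates themselves. The sketch below is one possible approach, not the project's method: it percent-decodes each name, lowercases it (domain names are case-insensitive), and keeps the earliest first-crawled and latest last-crawled value per decoded name. It assumes the time values sort chronologically when compared as given, which is not confirmed for this dataset. The function name `merge_encoded_duplicates` is hypothetical.

```python
from typing import Dict, Iterable, Tuple
from urllib.parse import unquote


def merge_encoded_duplicates(
    records: Iterable[Tuple[str, str, str]],
) -> Dict[str, Tuple[str, str]]:
    """Collapse rows whose domain differs only by percent-encoding or case.

    records: (domain, first_crawled, last_crawled) tuples.
    Assumption: time values compare in chronological order as given.
    """
    merged: Dict[str, Tuple[str, str]] = {}
    for domain, first_seen, last_seen in records:
        key = unquote(domain).lower()
        if key in merged:
            old_first, old_last = merged[key]
            merged[key] = (min(old_first, first_seen), max(old_last, last_seen))
        else:
            merged[key] = (first_seen, last_seen)
    return merged
```

This holds every distinct name in memory, so it is better suited to a yearly file or a sample than to the full range in one pass.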
Contact
To report issues, suggest changes, or collaborate, open an issue or pull request on the project’s repository.
Files
Files (40.8 GB)
| Name | MD5 | Size |
|---|---|---|
| 2012-domains.tsv.gz | b2ac2b1014086cf286c54ffb714c3a95 | 278.9 MB |
| 2013-domains.tsv.gz | 1373f78db2e631e40dba73861fbc5b32 | 130.8 MB |
| 2014-domains.tsv.gz | 1bf89ab65dedbf9289abe414c8d277c1 | 138.3 MB |
| 2015-domains.tsv.gz | 78fbae35182fac110371d8f68bdb93ad | 129.1 MB |
| 2016-domains.tsv.gz | fba874e7582bc56f76edea74b60d7ebb | 350.5 MB |
| 2017-domains.tsv.gz | 245137b04d62a064509ca462afe50ed7 | 531.4 MB |
| 2018-domains.tsv.gz | 6fe4112d01b254561635313ea43785b9 | 541.5 MB |
| 2019-domains.tsv.gz | 4eb43ff31569310dcd93287d6dcd8790 | 610.2 MB |
| 2020-domains.tsv.gz | e9d9bf31335d1cbcdedbb3306bc731fc | 570.3 MB |
| 2021-domains.tsv.gz | 240edb3b776d632a83b8625c37e07b88 | 561.6 MB |
| 2022-domains.tsv.gz | 64bff270c8218ab07ac9ed508cab475e | 519.3 MB |
| 2023-domains.tsv.gz | 3ceb33a6dc1d70d293b22dc197104ba0 | 482.3 MB |
| 2024-domains.tsv.gz | 8058247c4ec1667dd06bd637343aa55b | 640.0 MB |
| 2025-domains.tsv.gz | 8620629090d666ac940bca8a6c50ca0e | 558.3 MB |
| all_domains.tsv.gz | 06eac2962040837d41546514442d43c9 | 1.9 GB |
| CC-MAIN.tar.gz.tar | 61dbe96665714daee010e6470989241d | 31.3 GB |
| filtered.tsv.gz | 11840d83ccbe2881c0642b3880d4aabe | 1.6 GB |
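Since each file is listed with an MD5 checksum, a download can be verified before use. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are illustrative, and the file is read in chunks so that the multi-gigabyte archives do not need to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 against the value listed for it."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_checksum("filtered.tsv.gz", "11840d83ccbe2881c0642b3880d4aabe")` should return `True` for an intact copy of that file, assuming it was saved under its listed name in the working directory.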