200M unique domains extracted from Common Crawl

Published July 13, 2025 | Version 1.0.0

Dataset Open

Description

Details

This dataset provides a lightweight extraction of approximately 200 million org-level (registerable) domains referenced in Common Crawl’s main indexes. The aim is to provide an open, free and mostly useful subset of global DNS entries: websites that were in use at the time of crawling.

To keep the data minimal and useful at smaller scales, each file contains only three columns: the domain name, the time it was first crawled, and the time it was last crawled. Data is provided for each year and for the entire range. filtered.tsv.gz is a smaller subset that excludes domains seen only once.
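As a rough illustration of working with the three-column layout described above, the sketch below streams one of the gzipped TSV files row by row. The names `DomainRow` and `iter_domains` are hypothetical. The code also makes assumptions: the timestamp format and the presence of a header row are not documented here, so timestamps are kept as raw strings and a header line, if present, would be yielded like any other row.

```python
import csv
import gzip
from typing import Iterator, NamedTuple


class DomainRow(NamedTuple):
    domain: str
    first_seen: str  # raw text; the timestamp format is an unknown here
    last_seen: str   # raw text; the timestamp format is an unknown here


def iter_domains(path: str) -> Iterator[DomainRow]:
    """Stream rows from a gzipped TSV file without loading it into memory."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if len(fields) != 3:
                continue  # skip lines that do not have exactly three columns
            yield DomainRow(*fields)
```

Iterating with `for row in iter_domains("2025-domains.tsv.gz")` keeps memory use flat, which matters for the larger files.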

Source code

The scripts to recreate this dataset can be found in [the project’s github repo](https://github.com/bitpla…

Caveats

This data has not yet been tested in production apps, and so may contain errors. Two known issues are:

URL-encoded names have not been deduplicated.
The extractor uses a naive rule to find the main domain (the first domain that is longer than 3 characters and would not create more than 3 subdomains). This may result in lost or duplicate data.
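One possible way to work around the first issue on the consumer side is sketched below. It is not part of the dataset’s scripts. It assumes that "URL encoded" means percent-encoded characters in the domain column, and that timestamps compare correctly as strings. Both are assumptions, and the function names are hypothetical.

```python
from urllib.parse import unquote


def normalize_domain(name: str) -> str:
    """Percent-decode and lowercase a domain name (an assumed normalization)."""
    return unquote(name).strip().lower()


def merge_encoded_duplicates(rows):
    """Merge rows whose domain names match after normalization.

    rows: iterable of (domain, first_seen, last_seen) tuples.
    Assumes timestamps sort correctly as plain strings.
    Returns a dict mapping normalized domain -> (earliest first_seen, latest last_seen).
    """
    merged = {}
    for domain, first_seen, last_seen in rows:
        key = normalize_domain(domain)
        if key in merged:
            old_first, old_last = merged[key]
            merged[key] = (min(old_first, first_seen), max(old_last, last_seen))
        else:
            merged[key] = (first_seen, last_seen)
    return merged
```

Holding every key in a dictionary is only practical for a yearly file or a filtered subset. The full range would likely need an external sort or a database.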

Contact

To report issues, suggest changes, or collaborate, open an issue or pull request on the project’s repository.

Files (40.8 GB)

Name	MD5	Size
2012-domains.tsv.gz	b2ac2b1014086cf286c54ffb714c3a95	278.9 MB
2013-domains.tsv.gz	1373f78db2e631e40dba73861fbc5b32	130.8 MB
2014-domains.tsv.gz	1bf89ab65dedbf9289abe414c8d277c1	138.3 MB
2015-domains.tsv.gz	78fbae35182fac110371d8f68bdb93ad	129.1 MB
2016-domains.tsv.gz	fba874e7582bc56f76edea74b60d7ebb	350.5 MB
2017-domains.tsv.gz	245137b04d62a064509ca462afe50ed7	531.4 MB
2018-domains.tsv.gz	6fe4112d01b254561635313ea43785b9	541.5 MB
2019-domains.tsv.gz	4eb43ff31569310dcd93287d6dcd8790	610.2 MB
2020-domains.tsv.gz	e9d9bf31335d1cbcdedbb3306bc731fc	570.3 MB
2021-domains.tsv.gz	240edb3b776d632a83b8625c37e07b88	561.6 MB
2022-domains.tsv.gz	64bff270c8218ab07ac9ed508cab475e	519.3 MB
2023-domains.tsv.gz	3ceb33a6dc1d70d293b22dc197104ba0	482.3 MB
2024-domains.tsv.gz	8058247c4ec1667dd06bd637343aa55b	640.0 MB
2025-domains.tsv.gz	8620629090d666ac940bca8a6c50ca0e	558.3 MB
all_domains.tsv.gz	06eac2962040837d41546514442d43c9	1.9 GB
CC-MAIN.tar.gz.tar	61dbe96665714daee010e6470989241d	31.3 GB
filtered.tsv.gz	11840d83ccbe2881c0642b3880d4aabe	1.6 GB
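The MD5 values listed above can be used to check a download for corruption. A minimal sketch using Python’s standard `hashlib` follows. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and the file is read in chunks so that the multi-gigabyte archives do not need to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest against a published hex value."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_checksum("filtered.tsv.gz", "11840d83ccbe2881c0642b3880d4aabe")` should return `True` for an intact copy of that file. MD5 is adequate for detecting transfer errors but not for defending against deliberate tampering.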
