Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Common Crawl is a 501(c)(3) non-profit founded in 2007.
We make wholesale extraction, transformation and analysis of open web data accessible to researchers.
Over 250 billion pages spanning 18 years.
Free and open corpus since 2007.
Cited in over 10,000 research papers.
3–5 billion new pages added each month.
### Featured Papers:
#### Research on Free Expression Online
##### Jeffrey Knockel, Jakub Dalek, Noura Aljizawi, Mohamed Ahmed, Levi Meletti, and Justin Lau
### Banned Books: Analysis of Censorship on Amazon.com
#### Analyzing the Australian Web with Web Graphs: Harmonic Centrality at the Domain Level
##### Xian Gong, Paul X. McCarthy, Marian-Andrei Rizoiu, Paolo Boldi
### Harmony in the Australian Domain Space
#### The Dangers of Hijacked Hyperlinks
##### Kevin Saric, Felix Savins, Gowri Sankar Ramachandran, Raja Jurdak, Surya Nepal
### Hyperlink Hijacking: Exploiting Erroneous URL Links to Phantom Domains
#### Enhancing Computational Analysis
##### Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo
### DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
#### Computation and Language
##### Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Zoraida Callejas
### esCorpius: A Massive Spanish Crawling Corpus
#### The Web as a Graph (Master's Thesis)
##### Marius Løvold Jørgensen, UiT Norges Arktiske Universitet
### BacklinkDB: A Purpose-Built Backlink Database Management System
#### Internet Security: Phishing Websites
##### Asadullah Safi, Satwinder Singh
### A Systematic Literature Review on Phishing Website Detection Techniques
[More on Google Scholar](https://scholar.google.com/scholar?q=common+crawl) | [Curated BibTeX Dataset](https://github.com/commoncrawl/cc-citations/)
### Latest Blog Post:
### Submission to the UK’s Copyright and AI Consultation
Read our submission to the UK government's Copyright and AI consultation, supporting a legal exception for text and data mining (TDM) while respecting creators’ rights.
© 2025 Common Crawl