🚨 Last chance to register 🚨 In a week, the Common Crawl team will be presenting at the Stanford Institute for Human-Centered Artificial Intelligence (HAI) Seminar on “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”. Date: Wednesday, October 22, 2025. Time: starting at 12:00 PM. Attendance: in person or virtual, available upon RSVP. ✅ Please register ASAP at: https://lnkd.in/gBvsjr-r Come learn about what we’ve been working on and meet the team! See you soon 👋
Common Crawl Foundation
Technology, Information and Internet
Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
About us
The Common Crawl Foundation is a California 501(c)(3) registered non-profit founded by Gil Elbaz with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable. Our vision is of a truly open web that allows open access to information and enables greater innovation in research, business and education. We level the playing field by making wholesale extraction, transformation and analysis of web data cheap and easy.
- Website
- http://www.commoncrawl.org
- Industry
- Technology, Information and Internet
- Company size
- 2-10 employees
- Type
- Nonprofit
- Founded
- 2007
Employees at Common Crawl Foundation
- Rich Skrenta, Executive Director at Common Crawl Foundation
- Wayne Yamamoto, Head of Development
- Stephen Burns, Improving discoverability, structure, and machine-readability of the open web. Focused on SEO/AEO signals, large-scale crawl analysis, and open data…
- Sarmeesha Reddy, Engineer, Founder, Advisor
Updates
- Common Crawl has added IBM’s GneissWeb quality and category annotations to its web dataset, enabling users to filter high-quality content and explore topics like medical, education, and technology. https://lnkd.in/g_qib6vG
- Common Crawl Foundation reposted this: Just discovered we got a shoutout in Common Crawl Foundation's trip report from AI_dev Amsterdam! The Common Crawl team wrote about our chat at the conference, and honestly, meeting the folks behind one of the most foundational datasets in AI was a bit of a fanboy moment for Colin and me. When we saw Thom Vaughan, Pedro Ortiz Suarez, and Thijs Dalhuijsen, our immediate reaction was "Oh wow, you're from Common Crawl?!" They seemed genuinely surprised when we told them their dataset is one of the most popular among Daft users. These engineers have quietly built infrastructure that powers countless AI breakthroughs, yet they're humble about the massive impact they've created. The reality is that Common Crawl has become critical infrastructure for the AI revolution - petabytes of web data that researchers and companies use to train models, extract insights, and build applications that wouldn't exist otherwise. Check out their full trip report to see what else went down at AI_dev - including the opening of Internet Archive's new European HQ in Amsterdam (Brewster Kahle was there!): https://lnkd.in/gZ4SBG5i Infrastructure that just works, built by teams that genuinely care about democratizing access to data. That's the future we're all building together.
- Common Crawl Foundation reposted this: Contrary to splashy headlines, AI isn’t killing the internet; it’s reshaping how real people, researchers, and businesses of all sizes access and use it. In my latest article for Financier Worldwide, I examine how the battle for an open web is unfolding, from standards bodies like the IETF to corporate initiatives like Cloudflare’s “pay-per-crawl” model. The stakes are high: Will the open web remain a shared resource, or will it fragment into a gated system where only a few players can afford access? https://lnkd.in/gFEpQUvt Farzaneh Badiei, PhD Luke Hogg Tim Hwang Alliance for Responsible Data Collection Common Crawl Foundation Internet Archive Zoe Darmé Mark Gray Sarah McKenna Rony Shalit The Norton Law Firm Rich Skrenta Jordan Gimbel
- Common Crawl Foundation Opt-Out Registry: Publishers have been sending Common Crawl legal opt-out requests. In the interest of transparency and to better serve our ecosystem, we are publishing the full opt-out list for every legal request we have received. https://lnkd.in/gj-mG8Tb
- In a few weeks, the Common Crawl team will convene in Palo Alto for a seminar at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Our topic of discussion is “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”. Space is limited for those attending in person. There is also an option to join virtually if you are not located in the Bay Area. 👉 Please register at: https://lnkd.in/gBvsjr-r About: The Common Crawl Foundation is dedicated to preserving humanity's knowledge and making it accessible through its free public web dataset, a vital resource since 2008. As AI development accelerates, concerns have emerged regarding the accessibility and transparency of public web data, impacting open datasets in three key ways: robots.txt exclusions, legal demands, and "bot defenses." Two of these are not publicly visible and are not well understood. In this seminar, Common Crawl will present insights from a new data product that uses Common Crawl's crawl metadata to visually explore these three problems, advocating for greater transparency and informed solutions for the future of public web data. Hope to see you there!