Common Crawl Foundation

Technology, Information and Internet

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.

About us

The Common Crawl Foundation is a California 501(c)(3) registered non-profit founded by Gil Elbaz with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable. Our vision is of a truly open web that allows open access to information and enables greater innovation in research, business and education. We level the playing field by making wholesale extraction, transformation and analysis of web data cheap and easy.

Website: http://www.commoncrawl.org
Industry: Technology, Information and Internet
Company size: 2-10 employees
Type: Nonprofit
Founded: 2007

Updates

Common Crawl Foundation

1,414 followers
2w Edited
🚨 Last chance to register 🚨 In a week, the Common Crawl team will be presenting at the Stanford Institute for Human-Centered Artificial Intelligence (HAI) Seminar on “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”. Date: Wednesday, October 22, 2025. Time: starting at 12:00 PM. Attendance: in person or virtual, available upon RSVP. ✅ Please register ASAP at: https://lnkd.in/gBvsjr-r Come learn about what we’ve been working on and meet the team! See you soon 👋

Common Crawl Foundation | Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data | Stanford HAI hai.stanford.edu

Common Crawl Foundation

3w
Common Crawl has added IBM’s GneissWeb quality and category annotations to its web dataset, enabling users to filter for high-quality content and explore topics such as medicine, education, and technology. https://lnkd.in/g_qib6vG

Common Crawl - Blog - Announcing GneissWeb Annotations commoncrawl.org

Common Crawl Foundation reposted this
Sammy Sidhu

CEO at Eventual | We're Hiring! | YC W22
1mo Edited
Just discovered we got a shoutout in Common Crawl Foundation's trip report from AI_dev Amsterdam! The Common Crawl team wrote about our chat at the conference, and honestly, meeting the folks behind one of the most foundational datasets in AI was a bit of a fanboy moment for Colin and me. When we saw Thom Vaughan, Pedro Ortiz Suarez, and Thijs Dalhuijsen, our immediate reaction was "Oh wow, you're from Common Crawl?!" They seemed genuinely surprised when we told them their dataset is one of the most popular among Daft users. These engineers have quietly built infrastructure that powers countless AI breakthroughs, yet they're humble about the massive impact they've created. The reality is that Common Crawl has become critical infrastructure for the AI revolution - petabytes of web data that researchers and companies use to train models, extract insights, and build applications that wouldn't exist otherwise. Check out their full trip report to see what else went down at AI_dev - including the opening of Internet Archive's new European HQ in Amsterdam (Brewster Kahle was there!): https://lnkd.in/gZ4SBG5i Infrastructure that just works, built by teams that genuinely care about democratizing access to data. That's the future we're all building together.
Common Crawl Foundation

1mo
https://lnkd.in/gNMDtuCK

Common Crawl - Blog - From SEO to AIO: Why Your Content Needs to Exist in AI Training Data commoncrawl.org

Common Crawl Foundation reposted this
Jo Levy
1mo
Contrary to splashy headlines, AI isn’t killing the internet; it’s reshaping how real people, researchers, and businesses of all sizes access and use it. In my latest article for Financier Worldwide, I examine how the battle for an open web is unfolding, from standards bodies like the IETF to corporate initiatives like Cloudflare’s “pay-per-crawl” model. The stakes are high: Will the open web remain a shared resource, or will it fragment into a gated system where only a few players can afford access? https://lnkd.in/gFEpQUvt Farzaneh Badiei, PhD; Luke Hogg; Tim Hwang; Alliance for Responsible Data Collection; Common Crawl Foundation; Internet Archive; Zoe Darmé; Mark Gray; Sarah McKenna; Rony Shalit; The Norton Law Firm; Rich Skrenta; Jordan Gimbel

Data scraping, AI and the battle for the open web — Financier Worldwide financierworldwide.com

Common Crawl Foundation

1mo
Common Crawl Foundation Opt-Out Registry: Publishers have been sending Common Crawl legal opt-out requests. In the interest of transparency and to better serve our ecosystem, we are publishing the full opt-out list for every legal request we have received. https://lnkd.in/gj-mG8Tb

Common Crawl - Blog - Common Crawl Foundation Opt-Out Registry commoncrawl.org

Common Crawl Foundation

1mo
In a few weeks, the Common Crawl team will convene in Palo Alto for a seminar at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Our topic is “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”. Space is limited for those attending in person. There is also an option to join virtually if you are not located in the Bay Area. 👉 Please register at: https://lnkd.in/gBvsjr-r About: The Common Crawl Foundation is dedicated to preserving humanity's knowledge and making it accessible through its free public web dataset, a vital resource since 2008. As AI development accelerates, concerns have emerged regarding the accessibility and transparency of public web data, impacting open datasets in three key ways: robots.txt exclusions, legal demands, and "bot defenses." Two of these are not publicly visible and are poorly understood. In this seminar, Common Crawl will present insights from a new data product that uses Common Crawl's crawl metadata to visually explore these three problems, advocating for greater transparency and informed solutions for the future of public web data. Hope to see you there!

Common Crawl Foundation | Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data | Stanford HAI hai.stanford.edu

Common Crawl Foundation

1mo
https://lnkd.in/ggG-DeTR

We’re Walling Off The Open Internet To Stop AI—And It May End Up Breaking Everything Else http://www.techdirt.com

Employees at Common Crawl Foundation

Rich Skrenta

Executive Director at Common Crawl Foundation

Wayne Yamamoto

Head of Development

Stephen Burns

Improving discoverability, structure, and machine-readability of the open web. Focused on SEO/AEO signals, large-scale crawl analysis, and open data…

Sarmeesha Reddy

Engineer, Founder, Advisor
