CommonCrawl
Building an open Web-Scale crawl using Hadoop.
Ahad Rana
Architect / Engineer at CommonCrawl
ahad@commoncrawl.org
Who is CommonCrawl?
• A 501(c)3 non-profit “dedicated to building, maintaining and
making widely available a comprehensive crawl of the
Internet for the purpose of enabling a new wave of
innovation, education and research.”
• Funded through a grant by Gil Elbaz, former Googler and
founder of Applied Semantics, and current CEO of Factual Inc.
• Board members include Carl Malamud and Nova Spivack.
Motivations Behind CommonCrawl
• The Internet is a massively disruptive force.
• Exponential advances in computing capacity, storage and
bandwidth are creating constant flux and disequilibrium in the IT
domain.
• Cloud computing makes large scale, on-demand computing
affordable for even the smallest startup.
• Hadoop provides the technology stack that enables us to crunch
massive amounts of data.
• Having the ability to “Map-Reduce the Internet” opens up lots of
new opportunities for disruptive innovation and we would like to
reduce the cost of doing this by an order of magnitude, at least.
• The growing trend among webmasters of white-listing only the major search engines puts the future of the Open Web at risk and stifles future search innovation and evolution.
Our Strategy
• Crawl broadly and frequently across all TLDs.
• Prioritize the crawl based on simplified criteria (rank and
freshness).
• Upload the crawl corpus to S3.
• Make our S3 bucket widely accessible to as many users as
possible.
• Build support libraries to facilitate access to the S3 data via Hadoop (see the S3 input sketch after this list).
• Focus on doing a few things really well.
• Listen to customers and open up more metadata and services
as needed.
• We are not a comprehensive crawl, and may never be.
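The support-library bullet above points at how the corpus is meant to be consumed. As a hedged illustration of the underlying mechanism (the bucket name, path, and class name below are hypothetical placeholders, not CommonCrawl's actual layout or libraries), a Hadoop job can read data directly out of S3 through the s3n:// filesystem:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

// Hedged sketch: a Hadoop job reading an uploaded corpus straight out of S3
// via the s3n:// filesystem. Bucket name, path, and class name are
// hypothetical, not CommonCrawl's actual layout or access libraries.
public class S3InputSketch {
  public static void configure(JobConf job) {
    job.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");     // placeholder credentials
    job.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY"); // placeholder credentials
    FileInputFormat.addInputPath(job,
        new Path("s3n://example-crawl-bucket/crawl-001/"));  // hypothetical bucket/path
  }
}
```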
Some Numbers
• URLs in Crawl DB – 14 billion
• URLs with inverse link graph – 1.6 billion
• URLs with content in S3 – 2.5 billion
• Recently crawled documents – 500 million
• Uploaded documents after deduping – 300 million
• Newly discovered URLs – 1.9 billion
• # of vertices in Page Rank graph (recent calculation) – 3.5 billion
• # of edges in Page Rank graph (recent calculation) – 17 billion
Current System Design
• Batch oriented crawl list generation.
• High volume crawling via independent crawlers.
• Crawlers dump data into HDFS.
• Map-Reduce jobs parse and extract metadata from crawled documents in bulk, independently of the crawlers.
• Periodically, we ‘checkpoint’ the crawl, which involves, among
other things:
– Post processing of crawled documents (deduping etc.)
– ARC file generation
– Link graph updates
– Crawl database updates.
– Crawl list regeneration.
Our Cluster Config
• Modest internal cluster consisting of 24 Hadoop nodes, 4 crawler nodes, and 2 NameNode / Database servers.
• Each Hadoop node has 6 x 1.5 TB drives and Dual-QuadCore
Xeons with 24 or 32 GB of RAM.
• 9 map tasks per node, on average 4 reducers per node, and BLOCK compression using LZO (see the configuration sketch below).
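As a rough sketch of how settings along these lines are usually expressed in a Hadoop 0.20-era deployment; an illustration under stated assumptions, not CommonCrawl's actual configuration files:

```java
import org.apache.hadoop.mapred.JobConf;

// Illustration only: property values mirror the slide, but this is not
// CommonCrawl's actual cluster or job configuration.
public class ClusterConfigSketch {
  public static JobConf buildConf() {
    JobConf conf = new JobConf();

    // Per-node task slots (normally set cluster-wide in mapred-site.xml).
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 9);
    conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 4);

    // BLOCK-compressed SequenceFile output using the hadoop-lzo codec.
    conf.setBoolean("mapred.output.compress", true);
    conf.set("mapred.output.compression.type", "BLOCK");
    conf.set("mapred.output.compression.codec", "com.hadoop.compression.lzo.LzoCodec");

    // Compressing intermediate map output is usually worthwhile at this scale.
    conf.setBoolean("mapred.compress.map.output", true);
    return conf;
  }
}
```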
Crawler Design Overview
Crawler Design Details
• Java codebase.
• Asynchronous IO model using custom NIO based HTTP stack.
• Many worker threads that synchronize with the main thread via asynchronous message queues (see the handoff sketch after this list).
• Can sustain a crawl rate of ~250 URLs per second.
• Up to 500 active HTTP connections at any one time.
• Currently, no document parsing in crawler process.
• We currently run 8 crawlers and crawl on average ~100 million
URLs per day, when crawling.
• During the post-processing phase, we process on average 800 million documents.
• After deduping, we package and upload on average approximately 500 million documents to S3.
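A minimal sketch of the message-queue handoff between an event-loop thread and worker threads. The real crawler uses a custom NIO-based HTTP stack; the class and field names below are hypothetical and only illustrate the pattern:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical names; illustrates one direction of the queue-based handoff.
public class FetchPipelineSketch {

  // A completed fetch handed off from the single NIO event-loop thread.
  static final class FetchResult {
    final String url;
    final byte[] payload;
    FetchResult(String url, byte[] payload) { this.url = url; this.payload = payload; }
  }

  // Bounded queue: back-pressure so the event loop never outruns the workers.
  private final BlockingQueue<FetchResult> completed = new LinkedBlockingQueue<>(10_000);

  // Called from the NIO event-loop thread when a response finishes.
  void onFetchComplete(FetchResult result) throws InterruptedException {
    completed.put(result); // blocks briefly if workers are saturated
  }

  // Worker threads drain the queue and do per-document bookkeeping
  // (no parsing in the crawler process, per the slides).
  void startWorkers(int count) {
    for (int i = 0; i < count; i++) {
      Thread t = new Thread(() -> {
        try {
          while (true) {
            FetchResult r = completed.take();
            // e.g. append r.payload to a local spool file destined for HDFS
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }, "crawler-worker-" + i);
      t.setDaemon(true);
      t.start();
    }
  }
}
```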
Crawl Database
• Primary keys are 128 bit URL fingerprints, consisting of a 64 bit domain fingerprint and a 64 bit URL fingerprint (Rabin hash).
• Keys are distributed via a modulo operation on the URL portion of the fingerprint only (see the key-scheme sketch after this list).
• Currently, we run 4 reducers per node, and there is one node
down, so we have 92 unique shards.
• Keys in each shard are sorted by Domain FP, then URL FP.
• We like the 64 bit domain id, since it is a generated key, but it
is wasteful.
• We may move to a 32 bit root domain id / 32 bit domain id + 64 bit URL fingerprint key scheme in the future, and then sort by root domain, then domain, then URL FP within each shard.
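A minimal sketch of this key scheme, assuming hypothetical class and method names; the production fingerprints use Rabin hashes, for which a simple FNV-1a hash stands in here:

```java
// Sketch of the 128-bit (domain FP, URL FP) key described above.
public final class UrlFingerprint implements Comparable<UrlFingerprint> {
  public final long domainFP; // 64-bit fingerprint of the domain
  public final long urlFP;    // 64-bit fingerprint of the full URL

  public UrlFingerprint(String domain, String url) {
    this.domainFP = hash64(domain);
    this.urlFP = hash64(url);
  }

  // Shard assignment uses only the URL portion of the 128-bit key.
  public int shard(int numShards) {
    return (int) Long.remainderUnsigned(urlFP, numShards);
  }

  // Within a shard, keys sort by domain fingerprint first, then URL fingerprint.
  @Override
  public int compareTo(UrlFingerprint o) {
    int c = Long.compareUnsigned(domainFP, o.domainFP);
    return c != 0 ? c : Long.compareUnsigned(urlFP, o.urlFP);
  }

  // Stand-in 64-bit hash (FNV-1a), not the Rabin hash used in production.
  static long hash64(String s) {
    long h = 0xcbf29ce484222325L;
    for (int i = 0; i < s.length(); i++) {
      h ^= s.charAt(i);
      h *= 0x100000001b3L;
    }
    return h;
  }
}
```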
Crawl Database – Continued
• Values in the Crawl Database consist of extensible Metadata
structures.
• We currently use our own DDL and compiler for generating
structures (vs. using Thrift/ProtoBuffers/Avro).
• Avro / ProtoBufs were not available when we started, and we added lots of Hadoop-friendly features to our version (multipart [key] attributes lead to auto-generated WritableComparable-derived classes with built-in RawComparator support, etc.; a sketch of such a key class follows this list).
• Our compiler also generates RPC stubs, with Google ProtoBuf style
message passing semantics (Message w/ optional Struct In, optional
Struct Out) instead of Thrift style semantics (Method with multiple
arguments and a return type).
• We prefer the former because it is better suited to our preference for asynchronous-style RPC programming.
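A hand-written sketch of the kind of key class such a DDL compiler might emit for a two-part (domain FP, URL FP) key, using Hadoop's WritableComparable and WritableComparator APIs; the class name and field layout are assumptions, not CommonCrawl's generated code:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical two-part key with a byte-level (raw) comparator.
public class UrlFPKey implements WritableComparable<UrlFPKey> {
  private long domainFP;
  private long urlFP;

  public void set(long domainFP, long urlFP) { this.domainFP = domainFP; this.urlFP = urlFP; }

  @Override public void write(DataOutput out) throws IOException {
    out.writeLong(domainFP);
    out.writeLong(urlFP);
  }

  @Override public void readFields(DataInput in) throws IOException {
    domainFP = in.readLong();
    urlFP = in.readLong();
  }

  @Override public int compareTo(UrlFPKey o) {
    int c = Long.compareUnsigned(domainFP, o.domainFP);
    return c != 0 ? c : Long.compareUnsigned(urlFP, o.urlFP);
  }

  // Raw comparator: compares serialized bytes directly, avoiding
  // deserialization during the sort/merge phases.
  public static class Comparator extends WritableComparator {
    public Comparator() { super(UrlFPKey.class); }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      int c = Long.compareUnsigned(readLong(b1, s1), readLong(b2, s2));
      return c != 0 ? c : Long.compareUnsigned(readLong(b1, s1 + 8), readLong(b2, s2 + 8));
    }
  }

  static { WritableComparator.define(UrlFPKey.class, new Comparator()); }
}
```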
Map-Reduce Pipeline – Parse/Dedupe/Arc Generation
Phase 1
Phase 2
Map-Reduce Pipeline – Link Graph Construction
Link Graph Construction
Inverse Link Graph Construction
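As a generic illustration of inverse link graph construction (not CommonCrawl's actual job): the map phase inverts each forward edge and the reduce phase collects the inlinks for every target page. The input line format and class names below are assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumed forward-graph input: "source<TAB>target1,target2,..."
public class InverseLinkGraphSketch {

  public static class InvertMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      if (parts.length < 2 || parts[1].isEmpty()) return;
      Text source = new Text(parts[0]);
      for (String target : parts[1].split(",")) {
        ctx.write(new Text(target), source); // edge reversed: target -> source
      }
    }
  }

  public static class CollectReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text target, Iterable<Text> sources, Context ctx)
        throws IOException, InterruptedException {
      // Emit "target<TAB>inlink1,inlink2,..."
      StringBuilder inlinks = new StringBuilder();
      for (Text s : sources) {
        if (inlinks.length() > 0) inlinks.append(',');
        inlinks.append(s);
      }
      ctx.write(target, new Text(inlinks.toString()));
    }
  }
}
```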
Map-Reduce Pipeline – PageRank Edge Graph Construction
Page Rank Process
Distribution Phase
Calculation Phase
Generate Page Rank Values
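As a generic illustration of how a distribution phase and a calculation phase map onto one MapReduce PageRank iteration (not CommonCrawl's actual jobs): the map distributes each page's rank evenly to its outlinks, and the reduce sums the incoming contributions and applies damping. The input line format, dangling-page handling, and class names below are assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumed input: "pageId<TAB>rank<TAB>n1,n2,..." (one line per page).
public class PageRankIterationSketch {

  public static class DistributeMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      String page = parts[0];
      double rank = Double.parseDouble(parts[1]);
      String[] links = parts.length > 2 && !parts[2].isEmpty()
          ? parts[2].split(",") : new String[0];

      // Preserve the adjacency list so the reducer can re-emit the structure.
      ctx.write(new Text(page), new Text("LINKS\t" + (parts.length > 2 ? parts[2] : "")));

      // Distribution phase: send an equal share of this page's rank to each
      // neighbour (dangling pages simply drop their rank in this sketch).
      for (String target : links) {
        ctx.write(new Text(target), new Text(Double.toString(rank / links.length)));
      }
    }
  }

  public static class CalculateReducer extends Reducer<Text, Text, Text, Text> {
    private static final double DAMPING = 0.85;

    @Override
    protected void reduce(Text page, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0.0;
      String links = "";
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("LINKS\t")) {
          links = s.substring("LINKS\t".length());
        } else {
          sum += Double.parseDouble(s);
        }
      }
      // Calculation phase: damped sum of incoming contributions.
      double newRank = (1.0 - DAMPING) + DAMPING * sum;
      ctx.write(page, new Text(newRank + "\t" + links));
    }
  }
}
```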
The Need For a Smarter Merge
• The pipelined nature of HDFS means each reducer writes its output to local disk first, then to (replication level – 1) other nodes.
• If intermediate record sets are already sorted, having to run an identity-mapper / shuffle / merge-sort phase just to join two sorted record sets is very expensive.
Our Solution:
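One generic way to avoid that extra pass, when two datasets are already sorted on the same key and partitioned identically, is a map-side join. The sketch below uses Hadoop's CompositeInputFormat purely to illustrate the technique; it is not necessarily the solution the pipeline actually uses, and the paths are hypothetical.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

// Generic map-side join sketch: both inputs must already be sorted on the same
// key and split into the same number of identically-partitioned files, so each
// map task merges one pair of shards directly, with no shuffle/merge-sort
// pass. Mappers receive TupleWritable values holding one record from each side.
public class MapSideJoinSketch {
  public static void configure(JobConf job) {
    job.setInputFormat(CompositeInputFormat.class);
    job.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class,
        "/data/crawldb",     // hypothetical: sorted, sharded crawl database
        "/data/linkgraph")); // hypothetical: same sharding and sort order
  }
}
```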