WEB CRAWLERS

Mr. Abhishek Gupta
content
• What is a web crawler?
• Why is web crawler required?
• How does web crawler work?
• Crawling strategies
Breadth first search traversal
depth first search traversal
• Architecture of web crawler
• Crawling policies
• Distributed crawling
WEB CRAWLERS
 The process or program used by search engines to
download pages from the web for later processing by a
search engine, which indexes the downloaded pages to
provide fast searches.
 A program or automated script which browses the World
Wide Web in a methodical, automated manner.
 Also known as web spiders and web robots.
 Less-used names: ants, bots, and worms.
content
• What is a web crawler?
• How does web crawler work?
• Crawling strategies
Breadth first search traversal
depth first search traversal
• Architecture of web crawler
• Crawling policies
• Distributed crawling
WHY CRAWLERS?
 The Internet has a wide expanse of information.
 Finding relevant information requires an efficient mechanism.
 Web crawlers provide that mechanism to the search engine.
content
• What is a web crawler?
• Why is web crawler required?
• How does web crawler work?
• Crawling strategies
Breadth first search traversal
depth first search traversal
• Architecture of web crawler
• Crawling policies
• Distributed crawling
How does a web crawler work?
• It starts with a list of URLs to visit, called the seeds. As the
crawler visits these URLs, it identifies all the hyperlinks in the
page and adds them to the list of URLs to visit, called the
crawl frontier.
• URLs from the frontier are recursively visited according
to a set of policies.
Googlebot, Google's Web Crawler
 New URLs can be specified here. This is Google's web crawler.
Crawling Algorithm
Initialize queue (Q) with the initial set of known URLs.
Until Q is empty or the page or time limit is exhausted:
  Pop URL, L, from the front of Q.
  If L is not an HTML page (.gif, .jpeg, .ps, .pdf, .ppt, ...),
    continue loop (get next URL).
  If L has already been visited, continue loop (get next URL).
  Download page, P, for L.
  If P cannot be downloaded (e.g. 404 error, robot excluded),
    continue loop (get next URL), else:
  Index P (e.g. add to inverted index or store cached copy).
  Parse P to obtain the list of new links, N.
  Append N to the end of Q.
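A minimal Python sketch of this loop is given below. The helper names fetch_page, extract_links, and index_page are hypothetical placeholders for the download, parsing, and indexing steps; they are not part of the original slides.

# Minimal sketch of the crawling algorithm above.
from collections import deque
from urllib.parse import urljoin, urldefrag

NON_HTML_SUFFIXES = (".gif", ".jpeg", ".jpg", ".ps", ".pdf", ".ppt")

def crawl(seed_urls, fetch_page, extract_links, index_page, page_limit=1000):
    queue = deque(seed_urls)          # Q: URLs waiting to be crawled
    visited = set()                   # URLs already processed
    pages_indexed = 0
    while queue and pages_indexed < page_limit:
        url = queue.popleft()                        # pop L from the front of Q
        if url.lower().endswith(NON_HTML_SUFFIXES):  # skip non-HTML resources
            continue
        if url in visited:                           # already visited L
            continue
        visited.add(url)
        page = fetch_page(url)                       # download page P for L
        if page is None:                             # e.g. 404 or robot excluded
            continue
        index_page(url, page)                        # index P
        pages_indexed += 1
        for link in extract_links(page):             # parse P for new links N
            link, _ = urldefrag(urljoin(url, link))  # resolve relative URLs
            if link not in visited and link not in queue:
                queue.append(link)                   # append N to the end of Q
    return visited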
Keeping Track of Webpages to
Index
content
• What is a web crawler?
• Why is web crawler required?
• How does web crawler work?
• Crawling strategies
Breadth first search traversal
depth first search traversal
• Architecture of web crawler
• Crawling policies
• Distributed crawling
Crawling Strategies
 An alternate way of looking at the problem:
 The web is a huge directed graph, with documents as vertices
and hyperlinks as edges.
 We need to explore the graph using a suitable graph
traversal algorithm.
 W.r.t. the previous example: nodes are represented by
rectangles and directed edges are drawn as arrows.
Breadth-First Traversal
Given any graph and a set of seeds at which to start, the graph can be
traversed using the following algorithm:
1. Put all the given seeds into the queue;
2. Prepare to keep a list of "visited" nodes (initially empty);
3. As long as the queue is not empty:
   a. Remove the first node from the queue;
   b. Append that node to the list of "visited" nodes;
   c. For each edge starting at that node:
      i. If the node at the end of the edge already appears on the list of
         "visited" nodes or is already in the queue, then do nothing more
         with that edge;
      ii. Otherwise, append the node at the end of the edge to the end of
          the queue.
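The same steps can be written as a short Python sketch; the dict-of-lists graph representation used here is an assumption for illustration.

from collections import deque

def bfs(graph, seeds):
    queue = deque(seeds)        # 1. put all seeds into the queue
    visited = []                # 2. empty list of "visited" nodes
    while queue:                # 3. as long as the queue is not empty
        node = queue.popleft()  # 3a. remove the first node
        visited.append(node)    # 3b. append it to the visited list
        for neighbor in graph.get(node, []):      # 3c. each outgoing edge
            if neighbor in visited or neighbor in queue:
                continue                          # 3c-i. already known
            queue.append(neighbor)                # 3c-ii. enqueue it
    return visited

# Example: bfs({"A": ["B", "C"], "B": ["C"], "C": []}, ["A"]) returns ["A", "B", "C"]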
Breadth First Crawlers
content
• What is a web crawler?
• Why is web crawler required?
• How does web crawler work?
• Crawling strategies
Breadth first search traversal
Depth first search traversal
• Architecture of web crawler
• Crawling policies
• Distributed crawling
Depth First Crawlers
Uses the depth-first search (DFS) algorithm (a sketch in code follows):
• Get the 1st unvisited link from the start page.
• Visit that link and get its 1st unvisited link.
• Repeat the above step until there are no unvisited links left.
• Go to the next unvisited link at the previous level and repeat
from the 2nd step.
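A recursive Python sketch of this strategy, where extract_links(url) is a hypothetical helper that downloads the page at url and returns its outgoing links, and the depth cap is an added safeguard not in the slide:

def dfs_crawl(url, extract_links, visited=None, max_depth=5):
    if visited is None:
        visited = set()
    if url in visited or max_depth < 0:
        return visited
    visited.add(url)                    # visit the page
    for link in extract_links(url):     # take its links in order
        dfs_crawl(link, extract_links, visited, max_depth - 1)  # go deeper first
    return visited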
Depth first traversal
Depth-First vs. Breadth-First
• depth-first goes off into one branch until it reaches a
leaf node
• not good if the goal node is on another branch
• neither complete nor optimal
• uses much less space than breadth-first
• far fewer visited nodes to keep track of
• smaller fringe
• breadth-first is more careful, checking all alternatives
• complete and optimal
• but very memory-intensive
content
• What is a web crawler?
• Why is web crawler required?
• How does web crawler work?
• Crawling strategies
Breadth first search traversal
Depth first search traversal
• Architecture of web crawler
• Crawling policies
• Distributed crawling
Architecture of search engine
ARCHITECTURE OF CRAWLER
[Block diagram of the crawler: URL Frontier, DNS, Fetch (www), Parse,
Content Seen? (doc fingerprint set), URL Filter (robots templates),
Dup URL Elim (URL set)]
Architecture
 URL Frontier: contains the URLs yet to be fetched in the current
crawl. At first, a seed set is stored in the URL Frontier, and the crawler
begins by taking a URL from the seed set.
 DNS: domain name service resolution. Looks up the IP address for a
domain name.
 Fetch: generally uses the HTTP protocol to fetch the URL.
 Parse: the page is parsed; text, links, and other content (images,
videos, etc.) are extracted.
 Content Seen?: tests whether a web page with the same content has
already been seen at another URL. This requires a way to compute a
fingerprint of a web page.
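One simple way to realize the fingerprint test is sketched below in Python; the exact-hash approach is an illustrative assumption, since production crawlers often use shingling or simhash so that near-duplicate pages are also caught.

import hashlib

seen_fingerprints = set()

def content_seen(page_bytes: bytes) -> bool:
    fingerprint = hashlib.sha256(page_bytes).hexdigest()  # fingerprint of the page
    if fingerprint in seen_fingerprints:
        return True                                       # same content seen before
    seen_fingerprints.add(fingerprint)
    return False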
Architecture (cont.)
 URL Filter: decides whether an extracted URL should be excluded
from the frontier (e.g. by robots.txt).
 The URL should also be normalized (relative links resolved against
the base URL), e.g.:
 en.wikipedia.org/wiki/Main_Page
 <a href="/wiki/Wikipedia:General_disclaimer"
title="Wikipedia:General disclaimer">Disclaimers</a>
 Dup URL Elim: the URL is checked against those already seen, for
duplicate elimination.
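The normalization of the example link above can be done with Python's standard urljoin:

from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Main_Page"
href = "/wiki/Wikipedia:General_disclaimer"
print(urljoin(base, href))
# prints https://en.wikipedia.org/wiki/Wikipedia:General_disclaimer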
content
• What is a web crawler?
• Why is web crawler required?
• How does web crawler work?
• Crawling strategies
Breadth first search traversal
Depth first search traversal
• Architecture of web crawler
• Crawling policies
• Distributed crawling
Crawling Policies
• Selection Policy that states which pages to download.
• Re-visit Policy that states when to check for changes to
the pages.
• Politeness Policy that states how to avoid overloading
Web sites.
• Parallelization Policy that states how to coordinate
distributed Web crawlers.
Selection policy
 Search engines cover only a fraction of the Internet.
 This requires downloading relevant pages, hence a good
selection policy is very important.
 Common Selection policies:
Restricting followed links
Path-ascending crawling
Focused crawling
Crawling the Deep Web
Re-Visit Policy
 The web is dynamic; crawling takes a long time.
 Cost factors play an important role in crawling.
 Freshness and Age are the commonly used cost functions.
 Objective of the crawler: high average freshness and
low average age of web pages.
 Two re-visit policies:
Uniform policy
Proportional policy
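For concreteness, the two cost functions can be written as below, following the usual definitions attributed to Cho and Garcia-Molina; the slide itself does not spell them out, so treat this as an assumption.

def freshness(copy_is_up_to_date: bool) -> int:
    # 1 if the local copy equals the live page at this moment, else 0
    return 1 if copy_is_up_to_date else 0

def age(copy_is_up_to_date: bool, now: float, modified_at: float) -> float:
    # 0 while the copy is fresh; otherwise, time elapsed since the page changed
    return 0.0 if copy_is_up_to_date else now - modified_at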
Politeness Policy
Crawlers can have a crippling impact on the overall
performance of a site.
The costs of using Web crawlers include:
Network resources
Server overload
Server/router crashes
Network and server disruption
A partial solution to these problems is the robots
exclusion protocol.
Robot Exclusion
• How to control those robots!
Web sites and pages can specify that robots should not
crawl/index certain areas.
Two components:
• Robots Exclusion Protocol (robots.txt): site-wide specification of
excluded directories.
• Robots META Tag: Individual document tag to exclude indexing or
following links.
Robots Exclusion Protocol
• Site administrator puts a “robots.txt” file at the root
of the host’s web directory.
• http://www.ebay.com/robots.txt
• http://www.cnn.com/robots.txt
• http://clgiles.ist.psu.edu/robots.txt
• The file is a list of excluded directories for a given robot
(user-agent).
• Exclude all robots from the entire site:
User-agent: *
Disallow: /
• Newer extensions also support an Allow: directive.
• Look at real sites' robots.txt files for interesting examples.
Robot Exclusion Protocol Examples
• Exclude specific directories:
User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/
Disallow: /users/paranoid/

• Exclude a specific robot:
User-agent: GoogleBot
Disallow: /

• Allow a specific robot:
User-agent: GoogleBot
Disallow:
User-agent: *
Disallow: /
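In code, a crawler can honor these rules with Python's built-in robots.txt parser, as sketched below; the target URL and the "MyCrawler" user-agent name are illustrative.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.ebay.com/robots.txt")
rp.read()                                  # download and parse robots.txt
if rp.can_fetch("MyCrawler", "http://www.ebay.com/some/item.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")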
Robot Exclusion Protocol Details Are Not Well Defined
• Use blank lines only to separate the disallow sections for
different User-agents.
• One directory per "Disallow" line.
• No regex (regular expression) patterns in directories.
Parallelization Policy
The crawler runs multiple processes in parallel.
The goals are:
To maximize the download rate.
To minimize the overhead from parallelization.
To avoid repeated downloads of the same page.
The crawling system requires a policy for assigning the
new URLs discovered during the crawling process.
content
• What is a web crawler?
• Why is web crawler required?
• How does web crawler work?
• Crawling strategies
Breadth first search traversal
Depth first search traversal
• Architecture of web crawler
• Crawling policies
• Distributed crawling
Figure: parallel crawler
DISTRIBUTED WEB CRAWLING
• A distributed computing technique whereby search
engines employ many computers to index the Internet
via web crawling.
• The idea is to spread out the required resources of
computation and bandwidth to many computers and
networks.
• Types of distributed web crawling:
1. Dynamic Assignment
2. Static Assignment
DYNAMIC ASSIGNMENT
• With this, a central server assigns new URLs to different
crawlers dynamically. This allows the central server to
dynamically balance the load of each crawler.
• Configurations of crawling architectures with dynamic
assignment:
• A small crawler configuration, in which there is
a central DNS resolver and central queues per website,
and distributed downloaders.
• A large crawler configuration, in which the DNS resolver
and the queues are also distributed.
STATIC ASSIGNMENT
• Here a fixed rule, stated at the beginning of the crawl,
defines how to assign new URLs to the crawlers.
• A hashing function can be used to transform URLs into a
number that corresponds to the index of the
corresponding crawling process.
• To reduce the overhead due to the exchange of URLs
between crawling processes, when links point from one
website to another, the exchange should be done in
batches.
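A minimal Python sketch of such a rule is given below; hashing the URL's host rather than the full URL keeps each website with one process, and the process count of 4 is an arbitrary example.

import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 4

def assigned_crawler(url: str) -> int:
    host = urlparse(url).netloc                      # assign per website (host)
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS            # index of the owning process

# All URLs on the same host map to the same crawler, so only links that point
# to a different host need to be exchanged between processes (in batches).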
FOCUSED CRAWLING
• Focused crawling was first introduced by Chakrabarti.
• A focused crawler ideally would like to download only
web pages that are relevant to a particular topic and
avoid downloading all others.
• It assumes that some labeled examples of relevant and
non-relevant pages are available.
STRATEGIES OF FOCUSED
CRAWLING
• A focused crawler predicts the probability that a link to a
particular page is relevant before actually downloading
the page. A possible predictor is the anchor text of links.
• In another approach, the relevance of a page is
determined after downloading its content. Relevant
pages are sent to content indexing and their contained
URLs are added to the crawl frontier; pages that fall
below a relevance threshold are discarded.
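A toy Python sketch of the first strategy, scoring a link's likely relevance from its anchor text before downloading the page; the keyword set and threshold are illustrative assumptions, not part of the slides.

TOPIC_KEYWORDS = {"crawler", "spider", "search", "indexing"}
RELEVANCE_THRESHOLD = 0.25

def anchor_relevance(anchor_text: str) -> float:
    words = set(anchor_text.lower().split())
    if not words:
        return 0.0
    return len(words & TOPIC_KEYWORDS) / len(words)   # fraction of topic words

def should_enqueue(anchor_text: str) -> bool:
    return anchor_relevance(anchor_text) >= RELEVANCE_THRESHOLD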
EXAMPLES
• Yahoo! Slurp: Yahoo Search's crawler.
• Msnbot: Microsoft's Bing web crawler.
• Googlebot: Google's web crawler.
• WebCrawler: used to build the first publicly available
full-text index of a subset of the Web.
• World Wide Web Worm: used to build a simple index of
document titles and URLs.
• Web Fountain: a distributed, modular crawler written in C++.
• Slug: a semantic web crawler.
Important questions
1) Draw a neat labeled diagram to explain how a web
crawler works.
2) What is the function of a crawler?
3) How does the crawler know whether it can crawl and index data
from a website? Explain.
4) Write a note on robots.txt.
5) Discuss the architecture of a search engine.
6) Explain the difference between a crawler and a focused crawler.