How Web Search Engines Work?
Apurva Jadhav, apurvajadhav[at]gmail[dot]com

Outline
  • Text search
  • Indexing
  • Query processing
  • Relevance ranking
  • Vector space model
  • Performance measures
  • Link analysis to rank Web pages

Text Search
  • Given a query q, such as "honda car", find the documents that contain the terms honda and car.
  • The index structure used for text search is the inverted index.
  • Steps for constructing an inverted index:
      1. Document preprocessing: tokenization, stemming, and removal of stop words
      2. Indexing
  • Pipeline: documents → Tokenizer → Stemming → Indexer → Inverted index

Document Preprocessing
Preprocessing includes the following steps:
  • Remove all HTML tags.
  • Tokenization: break the document into its constituent words, or terms.
  • Remove common stop words such as the, a, an, and at.
  • Stemming: replace each word with its root, for example shirts → shirt.
  • Assign a unique token ID to each token.
  • Assign a unique document ID to each document.
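
The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names, the four-word stop list (taken from the slide's examples), and the toy stemmer that only strips a trailing "s" are all assumptions. A real system would use a proper HTML parser and an established stemming algorithm.

```python
import re

# Stop words from the slide's examples only; real stop lists are much longer.
STOP_WORDS = {"the", "a", "an", "at"}


def naive_stem(token):
    """Toy stemmer (assumption): strip one trailing 's', e.g. shirts -> shirt."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(html):
    """Strip tags, tokenize, drop stop words, and stem."""
    text = re.sub(r"<[^>]+>", " ", html)             # remove HTML tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # tokenize into lowercase terms
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def assign_ids(items, table):
    """Give each distinct item the next unused integer ID, reusing known IDs."""
    return [table.setdefault(item, len(table)) for item in items]


tokens = preprocess("<p>The red shirts at a shop</p>")
print(tokens)                         # ['red', 'shirt', 'shop']

token_ids = {}
print(assign_ids(tokens, token_ids))  # [0, 1, 2]
```

The same assign_ids helper could hand out document IDs if it is given a separate table.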

Inverted Index
  • For each term t, the inverted index stores a list of the IDs of the documents that contain t. This list is the postings list of t.
  • Example documents:
      Doc1: The quick brown fox jumps over the lazy dog
      Doc2: Fox News is the number one cable news channel
  • Postings lists for two of the terms:
      Term   Postings list
      fox    Doc1, Doc2
      dog    Doc1
  • Each postings list is sorted by document ID.
  • The index supports advanced query operators such as AND, OR, and NOT.
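
As a rough sketch of the data structure, the following Python builds an in-memory inverted index for the two example documents. The function name and the dict-of-sorted-lists layout are assumptions made for illustration, and the documents are only lowercased and split on spaces rather than fully preprocessed.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs maps a document ID to its list of terms.

    Returns a dict mapping each term to its postings list,
    a list of document IDs sorted in ascending order.
    """
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


docs = {
    1: "the quick brown fox jumps over the lazy dog".split(),
    2: "fox news is the number one cable news channel".split(),
}
index = build_inverted_index(docs)
print(index["fox"])  # [1, 2]
print(index["dog"])  # [1]
```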

Query Processing
Consider the query honda car.
  • Retrieve the postings list for honda.
  • Retrieve the postings list for car.
  • Merge (intersect) the two postings lists. Because both lists are sorted by document ID, merging lists of lengths m and n takes O(m + n) time.
  • Example:
      honda:  1, 2, 4, 8, 16
      car:    3, 8, 9, 16
      result: 8, 16
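
The linear-time merge can be sketched as a two-pointer walk over the sorted lists. The function name intersect is an assumption, and the sample lists are illustrative values chosen so that the result is 8, 16.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by document ID in O(m + n) time."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])  # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the list with the smaller ID
        else:
            j += 1
    return result


honda = [1, 2, 4, 8, 16]
car = [3, 8, 9, 16]
print(intersect(honda, car))  # [8, 16]
```

Each step advances at least one pointer, which is why the total work is bounded by m + n.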

Inverted Index Construction
Estimate the size of the index:
  • Use a 32-bit integer (4 bytes) to represent a document ID.
  • Assume a document contains 100 unique terms on average, so each document ID occurs in 100 postings lists on average.
  • Index size = 4 * 100 * (number of documents) bytes.
  • At Web scale, this runs into hundreds of GB, so the index cannot simply be held in RAM.
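
To make the estimate concrete, here is the same arithmetic as a small Python function. The collection size of one billion documents is a hypothetical figure chosen only for illustration. The slides do not state a document count.

```python
def index_size_bytes(num_docs, bytes_per_doc_id=4, unique_terms_per_doc=100):
    """Index size = bytes per document ID * unique terms per document * documents."""
    return bytes_per_doc_id * unique_terms_per_doc * num_docs


size = index_size_bytes(1_000_000_000)  # hypothetical: one billion documents
print(size / 10**9, "GB")               # 400.0 GB
```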

Inverted Index Construction
  • For each document, write (term, document ID) pairs to a file on disk. Note that this file is the same size as the index.
  • Example documents:
      Doc1: The quick brown fox jumps over the lazy dog
      Doc2: Fox News is the number one cable news channel
  • Pairs as written to the file:
      Term     Doc ID
      quick    1
      brown    1
      fox      1
      jumps    1
      over     1
      ...
      fox      2
      news     2
      number   2
      one      2
      ...
  • Sort this file by term, using a disk-based external sort.

Inverted Index Construction
  • After the sort, the pairs are ordered by term:
      Term     Doc ID
      brown    1
      fox      1
      fox      2
      jumps    1
      over     1
      ...
      news     2
      number   2
      one      2
      ...
  • The sorted result is split into a dictionary file and a postings file.
  • Dictionary file: one entry per distinct term (brown, fox, jumps, over, ..., news, number, one, ...), each with the offset of its postings list in the postings file. The first term, brown, has offset 0.
  • Postings file: the document IDs in sorted order, starting 1, 1, 2, 1, ...
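
The sort-then-split construction can be sketched in memory as follows. This is a simplification under stated assumptions: Python's sorted stands in for the disk-based external sort, the "files" are plain Python objects, and the dictionary stores a (offset, count) pair per term. Storing the count is an assumption added here so that a lookup knows where a postings list ends. The slides only mention the offset.

```python
def build_index_files(docs):
    """docs maps a document ID to its list of terms.

    Returns (dictionary, postings), where postings is one flat list of
    document IDs and dictionary maps each term to (offset, count).
    """
    # Emit unique (term, doc ID) pairs, then sort by term and doc ID.
    pairs = sorted({(term, doc_id) for doc_id, terms in docs.items() for term in terms})

    dictionary = {}
    postings = []
    for term, doc_id in pairs:
        if term not in dictionary:
            dictionary[term] = [len(postings), 0]  # postings list starts here
        dictionary[term][1] += 1
        postings.append(doc_id)
    return {t: tuple(entry) for t, entry in dictionary.items()}, postings


def lookup(term, dictionary, postings):
    """Return the postings list of a term, or an empty list if it is unknown."""
    offset, count = dictionary.get(term, (0, 0))
    return postings[offset:offset + count]


docs = {1: ["brown", "fox", "jumps"], 2: ["fox", "news"]}
dictionary, postings = build_index_files(docs)
print(postings)                             # [1, 1, 2, 1, 2]
print(dictionary["brown"])                  # (0, 1)
print(lookup("fox", dictionary, postings))  # [1, 2]
```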

Relevance Ranking
  • The inverted index returns a list of documents that contain the query terms.
  • How do we rank these documents?
      • Use the frequency of the query terms in each document.
      • Use the importance, or rareness, of the query terms.
      • Check whether the query terms occur in the title of the document.

Vector Space Model
  • Documents are represented as vectors in a multi-dimensional Euclidean space. Each term (word) of the vocabulary represents one dimension.
  • The weight (coordinate) of document d along the dimension represented by term t is the product of two quantities:
      • Term frequency $TF(d, t)$: the number of times term t occurs in document d.
      • Inverse document frequency $IDF(t)$: not all terms are equally important, and IDF captures the importance, or rareness, of a term:
          $IDF(t) = \log(1 + |D| / |D_t|)$
        where $|D|$ is the total number of documents and $|D_t|$ is the number of documents that contain term t.
  • [Figure: document vector d and query vector q, separated by angle θ, in a space with term axes car, can, and computer]

Vector Space Model
  • Queries are also represented as term vectors.
  • Documents are ranked by their proximity to the query vector.
  • Proximity is measured by the cosine of the angle θ between the document vector d and the query vector q:
      $\cos(\theta) = d \cdot q / (|d| |q|)$
  • The smaller the angle between d and q, the more relevant document d is to query q.
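
A small sketch of TF-IDF weighting and cosine ranking, following the formulas above. The function names and the three toy documents are assumptions, vectors are stored as sparse dicts, and a query term that occurs in no document is given weight 0 to avoid dividing by zero (a choice the slides do not address).

```python
import math
from collections import Counter


def idf(term, docs):
    """IDF(t) = log(1 + |D| / |D_t|); 0.0 if no document contains the term."""
    df = sum(1 for terms in docs.values() if term in terms)
    return math.log(1 + len(docs) / df) if df else 0.0


def tfidf_vector(terms, docs):
    """Sparse vector: term -> TF * IDF."""
    tf = Counter(terms)
    return {t: count * idf(t, docs) for t, count in tf.items()}


def cosine(u, v):
    """cos(theta) = u . v / (|u| |v|), or 0.0 if either vector is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_terms, docs):
    """Return document IDs ordered from most to least similar to the query."""
    q = tfidf_vector(query_terms, docs)
    scores = {d: cosine(tfidf_vector(terms, docs), q) for d, terms in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    1: ["honda", "car", "car"],
    2: ["honda", "bike"],
    3: ["computer", "science"],
}
print(rank(["honda", "car"], docs))  # [1, 2, 3]
```

Document 1 ranks first because it contains both query terms, document 2 contains only one, and document 3 shares no terms with the query and so has cosine 0.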

Performance Measures
  • Search engines return a ranked list of result documents for a given query.
  • To measure accuracy, we use a set of queries Q and, for each query q, a manually identified set of relevant documents $D_q$.
  • Let $Rel(q, k)$ be the number of documents relevant to query q that are returned in the top k positions. We define two measures of accuracy:
      • Recall for query q at position k is the fraction of all relevant documents $D_q$ that are returned in the top k positions:
          $Recall(k) = Rel(q, k) / |D_q|$
      • Precision for query q at position k is the fraction of the top k results that are relevant:
          $Precision(k) = Rel(q, k) / k$
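
Both measures follow directly from the definitions. In the sketch below the function names, the ranked result list, and the relevant set are all made up for illustration.

```python
def rel(results, relevant, k):
    """Rel(q, k): number of relevant documents among the top k results."""
    return sum(1 for doc_id in results[:k] if doc_id in relevant)


def recall_at_k(results, relevant, k):
    """Fraction of all relevant documents that appear in the top k results."""
    return rel(results, relevant, k) / len(relevant)


def precision_at_k(results, relevant, k):
    """Fraction of the top k results that are relevant."""
    return rel(results, relevant, k) / k


results = [3, 7, 1, 9, 4]   # hypothetical ranked result list for one query
relevant = {1, 3, 5, 8}     # hypothetical manually judged relevant set D_q

print(rel(results, relevant, 3))          # 2
print(recall_at_k(results, relevant, 3))  # 0.5
print(precision_at_k(results, relevant, 3))
```

Here two of the top three results are relevant, so recall at 3 is 2/4 and precision at 3 is 2/3.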

Challenges in Ranking Web Pages
  • Spamming: many Web page authors resort to spam, i.e. adding unrelated words to a page, to rank higher in search engines for certain queries.
  • Finding authoritative sources: thousands of documents may contain the given query terms. For example, for the query 'yahoo', www.yahoo.com is the most relevant result.
  • Anchor text: the text of a link gives important information about the document it points to. It is indexed as part of that document.

PageRank
  • PageRank is a measure of the authority, or prestige, of a Web page.
  • It is based on the link structure of the Web graph.
  • A Web page that is linked to (cited) by many other Web pages is popular and has a higher PageRank.
  • It is a query-independent, static ranking of Web pages.
  • Roughly, given two pages that both contain the query terms, the page with the higher PageRank is the more relevant.

PageRank
  • Web pages link to each other through hyperlinks (href attributes in HTML).
  • The Web can therefore be viewed as a directed graph in which Web pages form the set of nodes N and hyperlinks form the set of edges E.
  • Each Web page (node) has a measure of authority, or prestige, called its PageRank.
  • The PageRank of a page v is the sum, over all pages u that link to v, of the PageRank of u divided by the number of outlinks of u:
      $p[v] = \sum_{(u,v) \in E} p[u] / N_u$
    where $N_u$ is the number of outlinks of node u.
  • Example:
      $p[v_1] = p[u_1] + p[u_2] / 2$
      $p[v_2] = p[u_2] / 2 + p[u_3]$
  • [Figure: directed graph on nodes u1, u2, u3, v1, v2, and w1]

PageRank Computation
  • Consider the N x N link matrix L and the PageRank vector p.
  • $L(u, v) = E(u, v) / N_u$, where $E(u, v) = 1$ if and only if there is an edge from u to v, and $N_u$ is the number of outlinks of node u.
  • In matrix form, $p = L^T p$.
  • The PageRank vector is therefore the principal eigenvector of $L^T$, the one with eigenvalue 1.
  • PageRank is computed by the power iteration method.
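
Power iteration on $p = L^T p$ can be sketched without any matrix library by pushing each page's rank along its outlinks. This is a simplified illustration, not production PageRank: the function name and the three-page graph are assumptions, every page is assumed to have at least one outlink, and the graph is assumed to be strongly connected and aperiodic so that the iteration converges. The original PageRank formulation in the Brin and Page paper adds a damping factor to handle graphs that violate these assumptions. It is omitted here to match the formula above.

```python
def pagerank(links, iterations=100):
    """Undamped power iteration.

    links maps each page to the list of pages it links to. Every page
    must appear as a key and must have at least one outlink (assumption).
    """
    nodes = list(links)
    p = {v: 1.0 / len(nodes) for v in nodes}  # start from the uniform vector
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        for u, outlinks in links.items():
            share = p[u] / len(outlinks)      # p[u] / N_u
            for v in outlinks:
                nxt[v] += share               # accumulate the sum over (u, v) in E
        p = nxt
    return p


# Hypothetical three-page graph: a -> b, a -> c, b -> c, c -> a
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print({page: round(score, 3) for page, score in ranks.items()})
# {'a': 0.4, 'b': 0.2, 'c': 0.4}
```

Page b receives only half of a's rank, so it ends up with half the score of a and c.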

References
Books and papers
  • S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data.
  • C. Manning and P. Raghavan. Introduction to Information Retrieval. http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
  • S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7, 1998.
Software
  • Nutch is an open source Java Web crawler. http://lucene.apache.org/nutch/about.html
  • Lucene is an open source Java text search engine. http://lucene.apache.org/

Introduction
  • Web search is the dominant means of online information retrieval.
  • More than 200 million searches are performed each day in the US alone.
  • The aim of a search engine is to find documents relevant to a user's query.
  • Most search engines try to find and rank documents that contain the query terms.

Web Crawlers
  • A Web crawler fetches Web pages. The basic idea is simple:
      add a few seed URLs (e.g. www.yahoo.com) to a queue
      while the queue is not empty:
          URL u = queue.remove()
          fetch Web page W(u)
          extract all hyperlinks from W(u) and add them to the queue
  • To fetch all, or a significant percentage, of Web pages (millions), one needs to engineer a large-scale crawler.
  • A large-scale crawler has to be distributed and multi-threaded.
  • Nutch is an open source Web crawler. http://lucene.apache.org/nutch/about.html
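
The crawl loop above can be sketched in Python as follows. Several things here are assumptions for illustration: the fetch function is passed in as a parameter (so the example runs against a hypothetical in-memory "Web" instead of the network), a seen set and a max_pages limit are added so the loop terminates and does not refetch pages, and the names crawl, LinkExtractor, and FAKE_WEB are invented. A real crawler would also need politeness delays, robots.txt handling, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl. fetch(url) returns HTML text, or None on failure."""
    queue = deque(seeds)
    seen = set(seeds)          # URLs already queued, to avoid refetching
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue           # page could not be fetched; skip it
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical in-memory "Web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b": '<a href="/a">A</a> <a href="/missing">gone</a>',
}
pages = crawl(["http://example.com/"], FAKE_WEB.get)
print(sorted(pages))
# ['http://example.com/', 'http://example.com/a', 'http://example.com/b']
```

Passing fetch as a parameter keeps the queue logic separate from networking, so a real HTTP client can be swapped in without changing the loop.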
