GOOGLING OF GOOGLE: How the Google Search Engine Works
Introduction
The Web is both an excellent medium for sharing information and an attractive platform for delivering products and services. Google is designed to crawl and index the Web efficiently and to produce more satisfying search results than existing search engines. Many web pages are unscrupulous and try to fool search engines into placing them at the top of the rankings. Google uses PageRank and TrustRank techniques to give accurate results for queries.
What is a search engine?
A tool designed to search for information on the web. It works with the help of a crawler, an indexer, and search algorithms, and it gives precise results on the basis of different ranking procedures.
WEB CRAWLER (diagram slides)
Indexer
It collects, parses, and stores data to facilitate fast and accurate information retrieval for a search query. The inverted index stores, for each word, a list of the documents containing it:

Word  | Documents
apple | Document 1, Document 2, Document 3
is    | Document 2, Document 4
red   | Document 5
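As a toy illustration of this structure (not Google's actual code), the following Python sketch builds an inverted index like the table above and answers a conjunctive query against it:

```python
from collections import defaultdict

# Toy document collection; numbering loosely matches the table above.
documents = {
    1: "apple pie",
    2: "apple is sweet",
    3: "green apple",
    4: "this is ripe",
    5: "red",
}

# Build the inverted index: word -> set of docIDs containing it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        inverted_index[word].add(doc_id)

def search(query):
    """Return docIDs containing every word of the query (AND semantics)."""
    results = None
    for word in query.lower().split():
        postings = inverted_index.get(word, set())
        results = postings if results is None else results & postings
    return sorted(results or [])

print(search("apple"))      # [1, 2, 3]
print(search("apple is"))   # [2]
```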
The search engine then matches the query against each indexed document and filters the matching results. Without compression the index would require approximately 250 GB of memory, so compression techniques reduce it to a fraction of this size. Indexes are regularly updated with the help of index merging.
SEARCH ALGORITHM
A search for a query can return millions of important or authoritative pages; the engine then uses a search algorithm to decide which one becomes the listing that comes to the top. There are two key drivers in web search: content analysis and linkage analysis. Well-known algorithms used by different search engines include: 1. PageRank 2. TrustRank 3. Hilltop algorithm 4. Binary search
Different search engines use different algorithms to rank the priority of pages, and different engines look for different things to determine search relevancy. Things that help you rank in one engine could preclude you from ranking in another.
Positive ranking factors: 73% keyword-focused anchor text from external links; 71% external link popularity; 64% diversity of link sources; 56% keyword use anywhere in the title tag; 51% trustworthiness of the domain based on link distance from trusted domains.
Negative ranking factors: 68% cloaking with malicious intent; 56% link acquisition from known link brokers; 51% links from the page to web spam pages; 51% cloaking by user agent; 46% frequent server downtime and site inaccessibility.
OVERALL RANKING FACTORS (chart slide)
Google architecture
Web crawling is done by several distributed crawlers. The fetched web pages are sent to a store server, which compresses the pages and stores them in a repository. The indexer then reads the repository, uncompresses the documents, and parses them. Every web page has an associated ID number called a docID, which is assigned during parsing. Each document is converted into a set of word occurrences called hits, which the indexer distributes into barrels, creating a partially sorted index.
The indexer also parses out all the links in every page and stores important information about them in an anchors file, which records where each link points from and to, together with the text of the link. The URLresolver reads the anchors file, retrieves the anchor text, puts the anchor text into the forward index, and generates a database of links. The link database is used to compute PageRanks for all the documents. The sorter takes the barrels, which are sorted by docID, resorts them by wordID, and produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and creates a new lexicon for the searcher. The searcher uses this lexicon together with the inverted index and PageRank to answer queries.
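To make the anchors-file and link-database step concrete, here is a hypothetical Python sketch (the record formats and names are invented for illustration) that turns anchor records into a docID-keyed link database that a PageRank computation could then consume:

```python
# Each anchor record: (source URL, target URL, anchor text), as emitted by the indexer.
anchors = [
    ("https://a.example", "https://b.example", "best search tips"),
    ("https://a.example", "https://c.example", "crawler basics"),
    ("https://b.example", "https://c.example", "see also"),
]

# Assign a docID to every URL seen (the real system uses checksums and batch merges).
doc_ids = {}
def doc_id(url):
    return doc_ids.setdefault(url, len(doc_ids) + 1)

# Link database: source docID -> list of (target docID, anchor text).
links = {}
for src, dst, text in anchors:
    links.setdefault(doc_id(src), []).append((doc_id(dst), text))

print(links)  # e.g. {1: [(2, 'best search tips'), (3, 'crawler basics')], 2: [(3, 'see also')]}
```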
Crawling deeper into Google's architecture: major data structures
Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched at little cost. Although CPU speeds and bulk input/output rates have increased enormously, Google is still designed to avoid disk seeks whenever possible.
Repository
It contains the full HTML of every web page and compresses it using zlib, a trade-off between speed and compression ratio. The documents are stored one after another, each record prefixed by docID, encoding, URL length, page length, the URL, and then the page itself.
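A rough sketch of what writing a repository record might look like; the exact field layout here is an assumption, only the zlib compression and the docID/length/URL prefixing come from the text above:

```python
import zlib

def pack_record(doc_id, url, page_html):
    """Compress a page and prefix it with docID, URL length, and compressed length."""
    compressed = zlib.compress(page_html.encode())
    header = f"{doc_id}\t{len(url)}\t{len(compressed)}\t{url}\n".encode()
    return header + compressed

record = pack_record(42, "https://example.org/", "<html><body>hello</body></html>")
print(len(record), "bytes on disk for this record")
```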
Document index
It keeps information about each document, including the current document status, a pointer into the repository, a document checksum, and various statistics. URLs are converted into docIDs in batch mode by doing a merge with this file. To find the docID of a particular URL, the URL's checksum is computed and a binary search is performed.
Lexicon
It is used by the indexer as a word storage system and fits in machine memory for a reasonable price. The current lexicon contains 14 million words and takes only 256 MB of main memory.
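A toy version of the URL-to-docID lookup via checksum and binary search (the checksum function, CRC32, and the record layout are illustrative choices, not the actual ones):

```python
import bisect
import zlib

# Sorted list of (URL checksum, docID) pairs, as in the URL-to-docID file.
url_index = sorted(
    (zlib.crc32(url.encode()), doc_id)
    for doc_id, url in enumerate(
        ["https://a.example", "https://b.example", "https://c.example"], start=1
    )
)
checksums = [c for c, _ in url_index]

def lookup_doc_id(url):
    """Find the docID for a URL by binary-searching its checksum."""
    checksum = zlib.crc32(url.encode())
    i = bisect.bisect_left(checksums, checksum)
    if i < len(url_index) and url_index[i][0] == checksum:
        return url_index[i][1]
    return None

print(lookup_doc_id("https://b.example"))        # the docID assigned to that URL
print(lookup_doc_id("https://unknown.example"))  # None
```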
Hit list
A hit list records the occurrences of a particular word in a particular document, including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. There are two types of hits: fancy hits and plain hits. Fancy hits include hits occurring in a URL, title, anchor text, or meta tag; plain hits include everything else. The length of the hit list is stored combined with the wordID in the forward index and with the docID in the inverted index.
Forward index
The forward index is partially sorted and stored in a number of barrels. Each barrel holds a range of wordIDs. If a document contains words that fall into a particular barrel, the docID is recorded in the barrel, followed by a list of wordIDs.
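The sketch below is a loose, illustrative model of hits and forward barrels; real hits are compactly bit-packed, and the barrel ranges here are arbitrary:

```python
from collections import namedtuple

# One "hit": a single occurrence of a word in a document.
Hit = namedtuple("Hit", "position capitalized fancy")  # fancy = title/URL/anchor/meta hit

# Forward-index entry for one document: wordID -> list of hits in that document.
doc_hits = {
    10: [Hit(position=0, capitalized=True, fancy=True),    # word 10 appears in the title
         Hit(position=57, capitalized=False, fancy=False)],
    42: [Hit(position=3, capitalized=False, fancy=False)],
}

# Barrels partition the wordID space; an entry goes to the barrel whose range contains the wordID.
BARREL_RANGES = [(0, 31), (32, 63), (64, 95)]

def barrel_for(word_id):
    for i, (lo, hi) in enumerate(BARREL_RANGES):
        if lo <= word_id <= hi:
            return i
    raise ValueError("wordID out of range")

for word_id in doc_hits:
    print(f"wordID {word_id} -> barrel {barrel_for(word_id)}")
```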
Inverted index
The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that the wordID falls into, which in turn points to a doclist of docIDs together with their hit lists. There are two sets of inverted barrels: one set for hit lists that include title or anchor hits, and another set for all hit lists. Google checks the first set of barrels first and, if there are not enough matches within those barrels, checks the larger ones.
Indexing the web
Any parser designed to run on the entire Web must handle a huge array of possible errors. For maximum speed, Google uses flex to generate a lexical analyzer; making it run at a reasonable speed and remain robust involved a fair amount of work.
Searching techniques
The goal of searching is to provide quality search results efficiently. Once a certain number (currently 40,000) of matching documents are found, the searcher automatically sorts the matched documents by rank and returns the top results. Google considers each type of hit (title, anchor, URL, large font, small font), and each type has its own type-weight; the type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list, and every count is converted into a count-weight. The dot product of the vector of count-weights with the vector of type-weights gives an IR score for the document, and the IR score is combined with PageRank to give the document its final rank.
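A minimal numeric sketch of this scoring scheme; the type-weights, the count-weight capping, and the final blend with PageRank are made-up placeholders, since the real values are not public:

```python
# Hit types considered for a query term in one document.
TYPES = ["title", "anchor", "url", "large_font", "small_font"]

# Hypothetical type-weights (importance of each hit type) and hit counts in this document.
type_weights = {"title": 8.0, "anchor": 6.0, "url": 4.0, "large_font": 2.0, "small_font": 1.0}
hit_counts   = {"title": 1,   "anchor": 3,   "url": 0,   "large_font": 2,   "small_font": 14}

def count_weight(count, cap=8):
    """Counts are damped: beyond `cap`, extra occurrences add nothing."""
    return min(count, cap)

# IR score: dot product of count-weights with type-weights.
ir_score = sum(count_weight(hit_counts[t]) * type_weights[t] for t in TYPES)

# Final rank: IR score blended with PageRank (the blend factors are illustrative).
pagerank = 0.004
final_rank = 0.7 * ir_score + 0.3 * (pagerank * 1000)
print(ir_score, final_rank)
```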
PageRank
PageRank is based on mutual reinforcement between pages. It is a link-analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents. A page that is linked to by many pages with high PageRank receives a high rank itself; if there are no links to a web page, there is no support for that page. A recent analysis of the algorithm showed that the total PageRank score PR(t) of a group t of pages depends on four factors:
PR(t) = PR_static(t) + PR_in(t) - PR_out(t) - PR_sink(t)
Mathematical PageRanks
Page C has a higher PageRank than Page E, even though there are fewer links to C: the one link it has is of much higher value. A web surfer who chooses a random link on every page (but with 15% likelihood jumps to a random page on the whole web) is going to be on Page E 8.1% of the time. (The 15% likelihood of jumping to an arbitrary page corresponds to a damping factor of 85%.) Without damping, all web surfers would eventually end up on Pages A, B, or C, and all other pages would have PageRank zero. Page A is assumed to link to all pages in the web, because it has no outgoing links.
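The random-surfer model above corresponds to the standard PageRank power iteration with a damping factor of 0.85; a compact sketch follows (the example graph is arbitrary):

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in links.items():
            # A page with no outgoing links is treated as linking to every page.
            targets = outlinks or pages
            share = damping * rank[p] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

graph = {"A": [], "B": ["C"], "C": ["B"], "D": ["A", "B"], "E": ["B", "D"]}
for page, score in sorted(pagerank(graph).items()):
    print(page, round(score, 3))
```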
TrustRank (diagram slide)
Google and Web Spam
All deceptive actions that try to increase the ranking of a page in search engines are generally referred to as web spam. It is also described as "any attempt to deceive a search engine's relevancy algorithm". There are three types of web spam:
Content spam: maliciously crafting the content of web pages, for instance by inserting a large number of keywords.
Link spam: changes to the link structure of sites, for example by creating link farms. A link farm is a densely connected set of pages created explicitly to deceive a link-based ranking algorithm.
Cloaking: creating a rogue copy of a popular website that shows content similar to the original to a web crawler but redirects web surfers to unrelated or malicious websites. Spammers can use this technique to achieve high rankings in result pages for certain keywords.
Link-based web spam (diagram slide)
Web spam detection and results
The foundation of the spam detection system is a cost-sensitive decision tree. It incorporates a combined approach based on link and content analysis to detect different types of web spam pages.
Content-based features: number of words in the page, fraction of anchor text, fraction of visible text. A comparative study of these content-based features in the figures below shows the following results:
Figure 1: average word length is much higher in spam pages.
Figure 2: the number of words in a spam page is much higher than in a non-spam page.
Thus, based on these features, content-based spam pages can be detected by a Naïve Bayes classifier, which focuses on the number of times a word is repeated in the content of the page. (Figure 1 and Figure 2 show these distributions.)
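A minimal sketch of such a content-based classifier using word counts and a multinomial Naïve Bayes model; the training pages and labels are toy placeholders, and scikit-learn is assumed to be available:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training pages: 1 = spam, 0 = non-spam.
pages = [
    "cheap cheap cheap viagra buy now buy now cheap",
    "free money free money click here free money",
    "an overview of the pagerank algorithm and link analysis",
    "notes on building a web crawler and inverted index",
]
labels = [1, 1, 0, 0]

# Word-count features capture the "number of times a word is repeated" signal.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(pages)

model = MultinomialNB()
model.fit(X, labels)

test = vectorizer.transform(["buy cheap cheap pills now", "pagerank and trustrank notes"])
print(model.predict(test))  # e.g. [1 0]
```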
Link-based features
The data set is obtained using a web crawler. For each page, its links and contents are obtained. From the data set, a full graph is built. For each host and page, certain features are computed, and link-based features are extracted from the host graph. The link-based classifier operates on three features of the link farm: the estimation of supporters, TrustRank, and PageRank.
It has been observed that normal web pages have a supporter graph that grows exponentially, with the number of supporters increasing with distance. In the case of web spam, however, the graph shows a sudden increase in supporters over a small distance and then drops to zero beyond some distance. The distribution of supporters over distance for spam and non-spam pages is shown in the figure (Distribution of supporters over distance: spam vs. non-spam).
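An illustrative breadth-first sketch of estimating supporters at each distance on a host graph; the graph and distance limit are made up, and a real system would use probabilistic counting at web scale:

```python
def supporters_by_distance(incoming_links, target, max_distance=4):
    """incoming_links: dict mapping host -> list of hosts that link TO it.
    Returns {distance: number of new supporters first reached at that distance}."""
    seen = {target}
    frontier = [target]
    counts = {}
    for distance in range(1, max_distance + 1):
        next_frontier = []
        for host in frontier:
            for supporter in incoming_links.get(host, []):
                if supporter not in seen:
                    seen.add(supporter)
                    next_frontier.append(supporter)
        counts[distance] = len(next_frontier)
        frontier = next_frontier
    return counts

# Toy host graph of in-links.
in_links = {"t": ["a", "b"], "a": ["c", "d"], "b": ["d", "e"], "c": [], "d": ["f"], "e": []}
print(supporters_by_distance(in_links, "t"))  # {1: 2, 2: 3, 3: 1, 4: 0}
```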
System performance
It is important for a search engine to crawl and index efficiently; this way information can be kept up to date and major changes to the system can be tested relatively quickly. In total it took roughly 9 days to download the 26 million pages (including errors), with the last 11 million pages downloaded in just 63 hours, averaging just over 4 million pages per day, or 48.5 pages per second. The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole sorting process takes about 24 hours.
Future work
Google's immediate goals are to improve search efficiency and to scale to approximately 100 million web pages. The team is planning to add simple features supported by commercial search engines, such as boolean operators, negation, and stemming, and to extend the use of link structure and link text. PageRank can be personalized by increasing the weight of a user's home page or bookmarks. Google is also planning to use the other centrality measures. The centrality measures of a node are degree centrality, betweenness centrality, and closeness centrality.
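These three centrality measures can be computed directly with networkx (assumed to be available); a quick sketch on a toy graph:

```python
import networkx as nx

# Small example graph of pages/nodes.
G = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("D", "E")])

print(nx.degree_centrality(G))       # fraction of nodes each node is directly connected to
print(nx.betweenness_centrality(G))  # how often a node lies on shortest paths between others
print(nx.closeness_centrality(G))    # inverse of the average distance to all other nodes
```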
Conclusion
Google employs a number of techniques to improve search quality, including PageRank, anchor text, and proximity information. Google keeps us away from spammy link-exchange hubs and other sources of junk links, and it gives more importance to .gov and .edu web pages. We applied algorithms for web spam detection based on these features of the web farm: content-based (Naïve Bayes classifier) and link-based (PageRank algorithm).
References
Best of the Web 1994 -- Navigators. http://botw.org/1994/awards/navigators.html
Bzip2 Homepage. http://www.muraroa.demon.co.uk/
Google Search Engine. http://google.stanford.edu/
Harvest. http://harvest.transarc.com/
Mauldin, Michael L. Lycos Design Choices in an Internet Search Service. IEEE Expert Interview. http://www.computer.org/pubs/expert/1997/trends/x1008/mauldin.htm
Search Engine Watch. http://www.searchenginewatch.com/
Robots Exclusion Protocol. http://info.webcrawler.com/mak/projects/robots/exclusion.htm
Thank You All!!