GOOGLING OF GOOGLE: How the Google Search Engine Works
Introduction
The Web is both an excellent medium for sharing information and an attractive platform for delivering products and services. Google is designed to crawl and index the Web efficiently and to produce more satisfying search results than existing search engines. Many web pages are unscrupulous and try to fool search engines into placing them at the top of the rankings. Google uses PageRank and TrustRank techniques to give accurate results for queries.
What is a search engine?
A tool designed to search for information on the web. It works with the help of a crawler, an indexer, and search algorithms, and it gives precise results on the basis of different ranking procedures.
WEB CRAWLER (diagram slides)
Indexer
It collects, parses, and stores data to facilitate fast and accurate information retrieval for a search query. The inverted index stores, for each word, a list of the documents containing it:

Word  | Documents
apple | Document 1, Document 2, Document 3
is    | Document 2, Document 4
red   | Document 5
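As a toy illustration of this structure (not Google's actual code), the following Python sketch builds an inverted index like the table above and answers a conjunctive query against it:

```python
from collections import defaultdict

# Toy document collection; numbering loosely matches the table above.
documents = {
    1: "apple pie",
    2: "apple is sweet",
    3: "green apple",
    4: "this is ripe",
    5: "red",
}

# Build the inverted index: word -> set of docIDs containing it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        inverted_index[word].add(doc_id)

def search(query):
    """Return docIDs containing every word of the query (AND semantics)."""
    results = None
    for word in query.lower().split():
        postings = inverted_index.get(word, set())
        results = postings if results is None else results & postings
    return sorted(results or [])

print(search("apple"))      # [1, 2, 3]
print(search("apple is"))   # [2]
```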
The search engine then matches the query against each indexed document and filters the matching results. Without compression the index would require approximately 250 GB of memory, so compression techniques reduce it to a fraction of this size. Indexes are regularly updated with the help of index merging.
SEARCH ALGORITHM
A search for a query can return millions of important or authoritative pages; the engine then uses a search algorithm to decide which one becomes the listing that comes to the top. There are two key drivers in web search: content analysis and linkage analysis. Well-known algorithms used by different search engines include: 1. PageRank 2. TrustRank 3. Hilltop algorithm 4. Binary search
Different search engines use different algorithms to rank the priority of pages, and different engines look for different things to determine search relevancy. Things that help you rank in one engine could preclude you from ranking in another.
Positive ranking factors: 73% keyword-focused anchor text from external links; 71% external link popularity; 64% diversity of link sources; 56% keyword use anywhere in the title tag; 51% trustworthiness of the domain based on link distance from trusted domains.
Negative ranking factors: 68% cloaking with malicious intent; 56% link acquisition from known link brokers; 51% links from the page to web spam pages; 51% cloaking by user agent; 46% frequent server downtime and site inaccessibility.
OVERALL RANKING FACTORS (chart slide)
Google architecture
Web crawling is done by several distributed crawlers. The fetched web pages are sent to a store server, which compresses the pages and stores them in a repository. The indexer then reads the repository, uncompresses the documents, and parses them. Every web page has an associated ID number called a docID, which is assigned during parsing. Each document is converted into a set of word occurrences called hits, which the indexer distributes into barrels, creating a partially sorted index.
The indexer also parses out all the links in every page and stores important information about them in an anchors file, which records where each link points from and to, together with the text of the link. The URLresolver reads the anchors file, retrieves the anchor text, puts the anchor text into the forward index, and generates a database of links. The link database is used to compute PageRanks for all the documents. The sorter takes the barrels, which are sorted by docID, resorts them by wordID, and produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and creates a new lexicon for the searcher. The searcher uses this lexicon together with the inverted index and PageRank to answer queries.
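To make the anchors-file and link-database step concrete, here is a hypothetical Python sketch (the record formats and names are invented for illustration) that turns anchor records into a docID-keyed link database that a PageRank computation could then consume:

```python
# Each anchor record: (source URL, target URL, anchor text), as emitted by the indexer.
anchors = [
    ("https://a.example", "https://b.example", "best search tips"),
    ("https://a.example", "https://c.example", "crawler basics"),
    ("https://b.example", "https://c.example", "see also"),
]

# Assign a docID to every URL seen (the real system uses checksums and batch merges).
doc_ids = {}
def doc_id(url):
    return doc_ids.setdefault(url, len(doc_ids) + 1)

# Link database: source docID -> list of (target docID, anchor text).
links = {}
for src, dst, text in anchors:
    links.setdefault(doc_id(src), []).append((doc_id(dst), text))

print(links)  # e.g. {1: [(2, 'best search tips'), (3, 'crawler basics')], 2: [(3, 'see also')]}
```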
Crawling deeper into Google's architecture: major data structures
Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched at little cost. Although CPU speeds and bulk input/output rates have increased enormously, Google is still designed to avoid disk seeks whenever possible.
Repository
It contains the full HTML of every web page and compresses it using zlib, a trade-off between speed and compression ratio. The documents are stored one after another, each record prefixed by docID, encoding, URL length, page length, the URL, and then the page itself.
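A rough sketch of what writing a repository record might look like; the exact field layout here is an assumption, only the zlib compression and the docID/length/URL prefixing come from the text above:

```python
import zlib

def pack_record(doc_id, url, page_html):
    """Compress a page and prefix it with docID, URL length, and compressed length."""
    compressed = zlib.compress(page_html.encode())
    header = f"{doc_id}\t{len(url)}\t{len(compressed)}\t{url}\n".encode()
    return header + compressed

record = pack_record(42, "https://example.org/", "<html><body>hello</body></html>")
print(len(record), "bytes on disk for this record")
```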
Document index
It keeps information about each document, including the current document status, a pointer into the repository, a document checksum, and various statistics. URLs are converted into docIDs in batch mode by doing a merge with this file. To find the docID of a particular URL, the URL's checksum is computed and a binary search is performed.
Lexicon
It is used by the indexer as a word storage system and fits in machine memory for a reasonable price. The current lexicon contains 14 million words and takes only 256 MB of main memory.
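A toy version of the URL-to-docID lookup via checksum and binary search (the checksum function, CRC32, and the record layout are illustrative choices, not the actual ones):

```python
import bisect
import zlib

# Sorted list of (URL checksum, docID) pairs, as in the URL-to-docID file.
url_index = sorted(
    (zlib.crc32(url.encode()), doc_id)
    for doc_id, url in enumerate(
        ["https://a.example", "https://b.example", "https://c.example"], start=1
    )
)
checksums = [c for c, _ in url_index]

def lookup_doc_id(url):
    """Find the docID for a URL by binary-searching its checksum."""
    checksum = zlib.crc32(url.encode())
    i = bisect.bisect_left(checksums, checksum)
    if i < len(url_index) and url_index[i][0] == checksum:
        return url_index[i][1]
    return None

print(lookup_doc_id("https://b.example"))        # the docID assigned to that URL
print(lookup_doc_id("https://unknown.example"))  # None
```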
Hit list
A hit list records the occurrences of a particular word in a particular document, including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. There are two types of hits: fancy hits and plain hits. Fancy hits include hits occurring in a URL, title, anchor text, or meta tag; plain hits include everything else. The length of the hit list is stored combined with the wordID in the forward index and with the docID in the inverted index.
Forward index
The forward index is partially sorted and stored in a number of barrels. Each barrel holds a range of wordIDs. If a document contains words that fall into a particular barrel, the docID is recorded in the barrel, followed by a list of wordIDs.
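The sketch below is a loose, illustrative model of hits and forward barrels; real hits are compactly bit-packed, and the barrel ranges here are arbitrary:

```python
from collections import namedtuple

# One "hit": a single occurrence of a word in a document.
Hit = namedtuple("Hit", "position capitalized fancy")  # fancy = title/URL/anchor/meta hit

# Forward-index entry for one document: wordID -> list of hits in that document.
doc_hits = {
    10: [Hit(position=0, capitalized=True, fancy=True),    # word 10 appears in the title
         Hit(position=57, capitalized=False, fancy=False)],
    42: [Hit(position=3, capitalized=False, fancy=False)],
}

# Barrels partition the wordID space; an entry goes to the barrel whose range contains the wordID.
BARREL_RANGES = [(0, 31), (32, 63), (64, 95)]

def barrel_for(word_id):
    for i, (lo, hi) in enumerate(BARREL_RANGES):
        if lo <= word_id <= hi:
            return i
    raise ValueError("wordID out of range")

for word_id in doc_hits:
    print(f"wordID {word_id} -> barrel {barrel_for(word_id)}")
```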
Inverted index
The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that the wordID falls into, which in turn points to a doclist of docIDs together with their hit lists. There are two sets of inverted barrels: one set for hit lists that include title or anchor hits, and another set for all hit lists. Google checks the first set of barrels first and, if there are not enough matches within those barrels, checks the larger ones.
Indexing the web
Any parser designed to run on the entire Web must handle a huge array of possible errors. For maximum speed, Google uses flex to generate a lexical analyzer; making it run at a reasonable speed and remain robust involved a fair amount of work.
Searching techniques
The goal of searching is to provide quality search results efficiently. Once a certain number (currently 40,000) of matching documents are found, the searcher automatically sorts the matched documents by rank and returns the top results. Google considers each type of hit (title, anchor, URL, large font, small font), and each type has its own type-weight; the type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list, and every count is converted into a count-weight. The dot product of the vector of count-weights with the vector of type-weights gives an IR score for the document, and the IR score is combined with PageRank to give the document its final rank.
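A minimal numeric sketch of this scoring scheme; the type-weights, the count-weight capping, and the final blend with PageRank are made-up placeholders, since the real values are not public:

```python
# Hit types considered for a query term in one document.
TYPES = ["title", "anchor", "url", "large_font", "small_font"]

# Hypothetical type-weights (importance of each hit type) and hit counts in this document.
type_weights = {"title": 8.0, "anchor": 6.0, "url": 4.0, "large_font": 2.0, "small_font": 1.0}
hit_counts   = {"title": 1,   "anchor": 3,   "url": 0,   "large_font": 2,   "small_font": 14}

def count_weight(count, cap=8):
    """Counts are damped: beyond `cap`, extra occurrences add nothing."""
    return min(count, cap)

# IR score: dot product of count-weights with type-weights.
ir_score = sum(count_weight(hit_counts[t]) * type_weights[t] for t in TYPES)

# Final rank: IR score blended with PageRank (the blend factors are illustrative).
pagerank = 0.004
final_rank = 0.7 * ir_score + 0.3 * (pagerank * 1000)
print(ir_score, final_rank)
```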
PageRank
PageRank is based on mutual reinforcement between pages. It is a link-analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents. A page that is linked to by many pages with high PageRank receives a high rank itself; if there are no links to a web page, there is no support for that page. A recent analysis of the algorithm showed that the total PageRank score PR(t) of a group t of pages depends on four factors:
PR(t) = PR_static(t) + PR_in(t) - PR_out(t) - PR_sink(t)
Mathematical PageRanks
Page C has a higher PageRank than Page E, even though there are fewer links to C: the one link it has is of much higher value. A web surfer who chooses a random link on every page (but with 15% likelihood jumps to a random page on the whole web) is going to be on Page E 8.1% of the time. (The 15% likelihood of jumping to an arbitrary page corresponds to a damping factor of 85%.) Without damping, all web surfers would eventually end up on Pages A, B, or C, and all other pages would have PageRank zero. Page A is assumed to link to all pages in the web, because it has no outgoing links.
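The random-surfer model above corresponds to the standard PageRank power iteration with a damping factor of 0.85; a compact sketch follows (the example graph is arbitrary):

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in links.items():
            # A page with no outgoing links is treated as linking to every page.
            targets = outlinks or pages
            share = damping * rank[p] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

graph = {"A": [], "B": ["C"], "C": ["B"], "D": ["A", "B"], "E": ["B", "D"]}
for page, score in sorted(pagerank(graph).items()):
    print(page, round(score, 3))
```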
TrustRank (diagram slide)
Google and Web Spam
All deceptive actions that try to increase the ranking of a page in search engines are generally referred to as web spam. It is also described as "any attempt to deceive a search engine's relevancy algorithm". There are three types of web spam:
Content spam: maliciously crafting the content of web pages, for instance by inserting a large number of keywords.
Link spam: changes to the link structure of sites, for example by creating link farms. A link farm is a densely connected set of pages created explicitly to deceive a link-based ranking algorithm.
Cloaking: creating a rogue copy of a popular website that shows content similar to the original to a web crawler but redirects web surfers to unrelated or malicious websites. Spammers can use this technique to achieve high rankings in result pages for certain keywords.
Link-based web spam (diagram slide)
Web spam detection and results
The foundation of the spam detection system is a cost-sensitive decision tree. It incorporates a combined approach based on link and content analysis to detect different types of web spam pages.
Content-based features: number of words in the page, fraction of anchor text, fraction of visible text. A comparative study of these content-based features in the figures below shows the following results:
Figure 1: average word length is much higher in spam pages.
Figure 2: the number of words in a spam page is much higher than in a non-spam page.
Thus, based on these features, content-based spam pages can be detected by a Naïve Bayes classifier, which focuses on the number of times a word is repeated in the content of the page. (Figure 1 and Figure 2 show these distributions.)
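A minimal sketch of such a content-based classifier using word counts and a multinomial Naïve Bayes model; the training pages and labels are toy placeholders, and scikit-learn is assumed to be available:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training pages: 1 = spam, 0 = non-spam.
pages = [
    "cheap cheap cheap viagra buy now buy now cheap",
    "free money free money click here free money",
    "an overview of the pagerank algorithm and link analysis",
    "notes on building a web crawler and inverted index",
]
labels = [1, 1, 0, 0]

# Word-count features capture the "number of times a word is repeated" signal.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(pages)

model = MultinomialNB()
model.fit(X, labels)

test = vectorizer.transform(["buy cheap cheap pills now", "pagerank and trustrank notes"])
print(model.predict(test))  # e.g. [1 0]
```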
Link-based features
The data set is obtained using a web crawler. For each page, its links and contents are obtained. From the data set, a full graph is built. For each host and page, certain features are computed, and link-based features are extracted from the host graph. The link-based classifier operates on three features of the link farm: the estimation of supporters, TrustRank, and PageRank.
It has been observed that normal web pages have a supporter graph that grows exponentially, with the number of supporters increasing with distance. In the case of web spam, however, the graph shows a sudden increase in supporters over a small distance and then drops to zero beyond some distance. The distribution of supporters over distance for spam and non-spam pages is shown in the figure (Distribution of supporters over distance: spam vs. non-spam).
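An illustrative breadth-first sketch of estimating supporters at each distance on a host graph; the graph and distance limit are made up, and a real system would use probabilistic counting at web scale:

```python
def supporters_by_distance(incoming_links, target, max_distance=4):
    """incoming_links: dict mapping host -> list of hosts that link TO it.
    Returns {distance: number of new supporters first reached at that distance}."""
    seen = {target}
    frontier = [target]
    counts = {}
    for distance in range(1, max_distance + 1):
        next_frontier = []
        for host in frontier:
            for supporter in incoming_links.get(host, []):
                if supporter not in seen:
                    seen.add(supporter)
                    next_frontier.append(supporter)
        counts[distance] = len(next_frontier)
        frontier = next_frontier
    return counts

# Toy host graph of in-links.
in_links = {"t": ["a", "b"], "a": ["c", "d"], "b": ["d", "e"], "c": [], "d": ["f"], "e": []}
print(supporters_by_distance(in_links, "t"))  # {1: 2, 2: 3, 3: 1, 4: 0}
```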
System performance
It is important for a search engine to crawl and index efficiently; this way information can be kept up to date and major changes to the system can be tested relatively quickly. In total it took roughly 9 days to download the 26 million pages (including errors), with the last 11 million pages downloaded in just 63 hours, averaging just over 4 million pages per day, or 48.5 pages per second. The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole sorting process takes about 24 hours.
Future work
Google's immediate goals are to improve search efficiency and to scale to approximately 100 million web pages. The team is planning to add simple features supported by commercial search engines, such as boolean operators, negation, and stemming, and to extend the use of link structure and link text. PageRank can be personalized by increasing the weight of a user's home page or bookmarks. Google is also planning to use the other centrality measures. The centrality measures of a node are degree centrality, betweenness centrality, and closeness centrality.
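These three centrality measures can be computed directly with networkx (assumed to be available); a quick sketch on a toy graph:

```python
import networkx as nx

# Small example graph of pages/nodes.
G = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("D", "E")])

print(nx.degree_centrality(G))       # fraction of nodes each node is directly connected to
print(nx.betweenness_centrality(G))  # how often a node lies on shortest paths between others
print(nx.closeness_centrality(G))    # inverse of the average distance to all other nodes
```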
Conclusion
Google employs a number of techniques to improve search quality, including PageRank, anchor text, and proximity information. Google keeps us away from spammy link-exchange hubs and other sources of junk links, and it gives more importance to .gov and .edu web pages. We applied algorithms for web spam detection based on these features of the web farm: content-based (Naïve Bayes classifier) and link-based (PageRank algorithm).
References
Best of the Web 1994 -- Navigators. http://botw.org/1994/awards/navigators.html
Bzip2 Homepage. http://www.muraroa.demon.co.uk/
Google Search Engine. http://google.stanford.edu/
Harvest. http://harvest.transarc.com/
Mauldin, Michael L. Lycos Design Choices in an Internet Search Service. IEEE Expert Interview. http://www.computer.org/pubs/expert/1997/trends/x1008/mauldin.htm
Search Engine Watch. http://www.searchenginewatch.com/
Robots Exclusion Protocol. http://info.webcrawler.com/mak/projects/robots/exclusion.htm
Thank You All!!