Web Search Advances & Link Analysis
Meta-Search Engines A search engine that passes the query to several other search engines and integrates their results. Submit queries to host sites. Parse the resulting HTML pages to extract search results. Integrate the multiple rankings into a “consensus” ranking. Present the integrated results to the user. Examples: Metacrawler, SavvySearch, Dogpile.
HTML Structure & Feature Weighting Weight tokens under particular HTML tags more heavily: <TITLE> tokens  (Google seems to like title matches) <H1>,<H2>… tokens <META> keyword tokens Parse page into conceptual sections (e.g. navigation links vs. page content) and weight tokens differently based on section.
Bibliometrics: Citation Analysis Many standard documents include  bibliographies  (or  references ), explicit  citations  to other previously published documents. Using citations as links, standard corpora can be viewed as a graph. The structure of this graph, independent of content, can provide interesting information about the similarity of documents and the structure of information. CF corpus includes citation information.
Impact Factor Developed by Garfield in 1972 to measure the importance (quality, influence) of scientific journals. Measure of how often papers in the journal are cited by other scientists. Computed and published annually by the Institute for Scientific Information (ISI). The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to a paper published in J in year Y−1 or Y−2. Does not account for the quality of the citing article.
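The definition above reduces to a simple ratio; here is a minimal sketch in Python with hypothetical citation counts (the numbers are illustrative, not real journal data):

```python
# Impact factor of journal J in year Y:
#   citations in year Y to papers J published in Y-1 or Y-2,
#   divided by the number of papers J published in those two years.
def impact_factor(citations_in_y, papers_y_minus_1, papers_y_minus_2):
    """Average citations per paper published in the two preceding years."""
    n_papers = papers_y_minus_1 + papers_y_minus_2
    return citations_in_y / n_papers if n_papers else 0.0

# Hypothetical example: 300 citations in year Y to the 120 papers
# the journal published in Y-1 (50 papers) and Y-2 (70 papers).
print(impact_factor(300, 50, 70))  # 2.5
```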
Bibliographic Coupling Measure of similarity of documents introduced by Kessler in 1963. The bibliographic coupling of two documents A and B is the number of documents cited by both A and B, i.e. the size of the intersection of their bibliographies. Maybe want to normalize by the sizes of the bibliographies?
Co-Citation An alternative citation-based measure of similarity introduced by Small in 1973: the number of documents that cite both A and B. Maybe want to normalize by the total number of documents citing either A or B?
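Both measures are simple set operations over a citation graph. A toy sketch (hypothetical documents and references; cites[d] gives the set of documents d cites):

```python
# Toy citation graph: cites[d] = set of documents that d cites.
cites = {
    "A": {"X", "Y", "Z"},
    "B": {"Y", "Z", "W"},
    "C": {"A", "B"},       # C cites both A and B
    "D": {"A", "B", "X"},  # D cites both A and B
}

def bibliographic_coupling(a, b):
    """Number of documents cited by both a and b (Kessler 1963)."""
    return len(cites[a] & cites[b])

def co_citation(a, b):
    """Number of documents that cite both a and b (Small 1973)."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)

print(bibliographic_coupling("A", "B"))  # 2 (shared references Y and Z)
print(co_citation("A", "B"))             # 2 (cited together by C and D)
```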
Citations vs. Links Web links are somewhat different from citations: Many links are navigational. Many pages with high in-degree are portals, not content providers. Not all links are endorsements. Company websites don’t point to their competitors. Citation of the relevant literature is enforced by peer review.
Authorities Authorities  are pages that are recognized as providing significant, trustworthy, and useful information on a topic. In-degree  (number of pointers to a page) is one simple measure of authority. However in-degree treats all links as equal. Should links from pages that are themselves authoritative count more?
Hubs Hubs  are index pages that provide lots of useful links to relevant content pages (topic authorities). Hub pages for IR are included in the course home page: http://www.cs.utexas.edu/users/mooney/ir-course
HITS Algorithm developed by Kleinberg in 1998. Attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web. Based on mutually recursive facts: Hubs point to lots of authorities. Authorities are pointed to by lots of hubs.
Hubs and Authorities Together they tend to form a bipartite graph, with hubs on one side pointing to authorities on the other.
HITS Algorithm Computes hubs and authorities for a particular topic specified by a normal query. First determines a set of relevant pages for the query called the  base  set  S . Analyze the link structure of the web subgraph defined by  S  to find authority and hub pages in this set.
Constructing a Base Subgraph For a specific query Q, let the set of documents returned by a standard search engine (e.g. VSR) be called the root set R. Initialize S to R. Add to S all pages pointed to by any page in R. Add to S all pages that point to any page in R.
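The expansion step can be sketched as follows; out_links and in_links are hypothetical helpers that a real system would back with a crawl index and a “reverse link” query:

```python
# Toy web graph: page -> list of pages it points to.
links = {
    "r1": ["p1"], "r2": ["p1", "p2"],
    "q1": ["r1"], "q2": ["r2"],
    "p1": [], "p2": [],
}

def out_links(page):
    return links.get(page, [])

def in_links(page):
    return [p for p, outs in links.items() if page in outs]

def base_set(root):
    """Expand the root set R with pages R points to and pages pointing into R."""
    S = set(root)
    for r in root:
        S.update(out_links(r))   # pages pointed to by R
        S.update(in_links(r))    # pages that point to R
    return S

R = {"r1", "r2"}
print(sorted(base_set(R)))  # ['p1', 'p2', 'q1', 'q2', 'r1', 'r2']
```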
Base Limitations To limit computational expense: Limit the number of root pages to the top 200 pages retrieved for the query. Limit the number of “back-pointer” pages to a random set of at most 50 pages returned by a “reverse link” query. To eliminate purely navigational links: Eliminate links between two pages on the same host. To eliminate “non-authority-conveying” links: Allow only m (m ≈ 4–8) pages from a given host as pointers to any individual page.
Authorities and In-Degree Even within the base set  S  for a given query, the nodes with highest in-degree are not necessarily authorities (may just be generally popular pages like Yahoo or Amazon). True authority pages are pointed to by a number of hubs (i.e. pages that point to lots of authorities).
Iterative Algorithm Use an iterative algorithm to slowly converge on a mutually reinforcing set of hubs and authorities. Maintain for each page p ∈ S: Authority score: a_p (vector a). Hub score: h_p (vector h). Initialize all a_p = h_p = 1. Maintain normalized scores: Σ_{p∈S} (a_p)² = 1 and Σ_{p∈S} (h_p)² = 1.
HITS Update Rules Authorities are pointed to by lots of good hubs: a_p = Σ_{q: q→p} h_q. Hubs point to lots of good authorities: h_p = Σ_{q: p→q} a_q.
Illustrated Update Rules (diagram): a_4 = h_1 + h_2 + h_3 (page 4’s authority score is the sum of the hub scores of pages 1, 2, 3 pointing to it); h_4 = a_5 + a_6 + a_7 (page 4’s hub score is the sum of the authority scores of pages 5, 6, 7 it points to).
HITS Iterative Algorithm Initialize for all p ∈ S: a_p = h_p = 1. For i = 1 to k: For all p ∈ S: a_p = Σ_{q: q→p} h_q (update auth. scores). For all p ∈ S: h_p = Σ_{q: p→q} a_q (update hub scores). For all p ∈ S: a_p = a_p / c, where c is chosen so that Σ (a_p)² = 1 (normalize a). For all p ∈ S: h_p = h_p / c, where c is chosen so that Σ (h_p)² = 1 (normalize h).
Convergence The algorithm converges to a fixed point if iterated indefinitely. Define A to be the adjacency matrix for the subgraph defined by S: A_ij = 1 for i, j ∈ S iff i → j. The authority vector a converges to the principal eigenvector of AᵀA. The hub vector h converges to the principal eigenvector of AAᵀ. In practice, 20 iterations produce fairly stable results.
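The mutually recursive updates are short to implement. A minimal Python sketch (toy adjacency lists assumed, with L2 normalization each round as above; not Kleinberg’s original code):

```python
def hits(out_links, k=20):
    """Run k rounds of HITS updates over a dict page -> list of out-links."""
    pages = list(out_links)
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(k):
        # authority update: sum the hub scores of in-neighbors
        new_a = {p: sum(h[q] for q in pages if p in out_links[q]) for p in pages}
        # hub update: sum the (new) authority scores of out-neighbors
        new_h = {p: sum(new_a[q] for q in out_links[p]) for p in pages}
        norm_a = sum(v * v for v in new_a.values()) ** 0.5 or 1.0
        norm_h = sum(v * v for v in new_h.values()) ** 0.5 or 1.0
        a = {p: v / norm_a for p, v in new_a.items()}
        h = {p: v / norm_h for p, v in new_h.items()}
    return a, h

# Two hubs both point at "auth", so it should get the top authority score.
graph = {"hub1": ["auth", "other"], "hub2": ["auth"], "auth": [], "other": []}
a, h = hits(graph)
print(max(a, key=a.get))  # 'auth'
print(max(h, key=h.get))  # 'hub1'
```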
Results Authorities for query:  “Java” java.sun.com comp.lang.java FAQ Authorities for query “search engine” Yahoo.com Excite.com Lycos.com Altavista.com Authorities for query “Gates” Microsoft.com roadahead.com
Result Comments In most cases, the final authorities were not in the initial root set generated using Altavista. Authorities were brought in from linked and reverse-linked pages and then HITS computed their high authority score.
Finding Similar Pages Using Link Structure Given a page P, let R (the root set) be t (e.g. 200) pages that point to P. Grow a base set S from R. Run HITS on S. Return the best authorities in S as the best similar pages for P. Finds authorities in the “link neighborhood” of P.
Similar Page Results Given “honda.com” toyota.com ford.com bmwusa.com saturncars.com nissanmotors.com audi.com volvocars.com
HITS for Clustering An ambiguous query can result in the principal eigenvector covering only one of the possible meanings. Non-principal eigenvectors may contain hubs & authorities for the other meanings. Example: “jaguar”: Atari video game (principal eigenvector), NFL football team (2nd non-principal eigenvector), automobile (3rd non-principal eigenvector).
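This can be checked numerically: for a toy subgraph with two disjoint topic communities (an illustrative graph assumed for this sketch), the principal eigenvector of AᵀA picks out the stronger community’s authority and the next eigenvector picks out the other:

```python
import numpy as np

# Two disjoint communities: hubs 0 and 1 point at authority 2 (stronger
# community); hub 3 points at authority 4 (weaker community).
n = 5
A = np.zeros((n, n))
A[0, 2] = A[1, 2] = 1.0   # two hubs -> authority 2
A[3, 4] = 1.0             # one hub  -> authority 4

M = A.T @ A                        # authority ("co-citation") matrix
vals, vecs = np.linalg.eigh(M)     # eigh returns ascending eigenvalues
principal = np.abs(vecs[:, -1])    # top eigenvector
second = np.abs(vecs[:, -2])       # next eigenvector

print(int(np.argmax(principal)))   # 2: authority of the larger community
print(int(np.argmax(second)))      # 4: authority of the other community
```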
PageRank Alternative link-analysis method used by Google  (Brin & Page, 1998) . Does not attempt to capture the distinction between hubs and authorities. Ranks pages just by authority. Applied to the entire web rather than a local neighborhood of pages surrounding the results of a query.
Initial PageRank Idea Just measuring in-degree (citation count) doesn’t account for the authority of the source of a link. Initial page rank equation for page p: R(p) = c Σ_{q: q→p} R(q)/N_q, where N_q is the total number of out-links from page q. A page q “gives” an equal fraction of its authority to all the pages it points to (e.g. p). c is a normalizing constant set so that the ranks of all pages always sum to 1.
Initial PageRank Idea (cont.) Can view it as a process of PageRank “flowing” from pages to the pages they cite. (Diagram of example rank values flowing along links omitted.)
Initial Algorithm Iterate the rank-flowing process until convergence: Let S be the total set of pages. Initialize ∀p ∈ S: R(p) = 1/|S|. Until ranks do not change (much) (convergence): For each p ∈ S: R′(p) = Σ_{q: q→p} R(q)/N_q. For each p ∈ S: R(p) = cR′(p) (normalize).
Sample Stable Fixpoint (diagram of a small graph whose converged ranks on the nodes are 0.4, 0.4, 0.2, 0.2, 0.2, 0.2, and 0.4)
Linear Algebra Version Treat R as a vector over web pages. Let A be a 2-d matrix over pages where A_vu = 1/N_u if u → v, else A_vu = 0. Then R = cAR, so R converges to the principal eigenvector of A.
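A quick numerical check of R = cAR via power iteration, using NumPy on a small strongly connected toy graph (assumed here so that the simple version converges without a rank source):

```python
import numpy as np

# Toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.  A[v, u] = 1/N_u when u -> v.
out = {0: [1, 2], 1: [2], 2: [0]}
n = 3
A = np.zeros((n, n))
for u, vs in out.items():
    for v in vs:
        A[v, u] = 1.0 / len(vs)

R = np.full(n, 1.0 / n)
for _ in range(100):
    R = A @ R
    R = R / R.sum()       # renormalize so the ranks sum to 1

print(np.round(R, 3))     # converges to the principal eigenvector of A
```

For this graph the fixpoint is R = (0.4, 0.2, 0.4): page 2 collects rank from both other pages and passes it all back to page 0.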
Problem with Initial Idea A group of pages that only point to themselves but are pointed to by other pages acts as a “rank sink” and absorbs all the rank in the system. Rank flows into the cycle and can’t get out.
Rank Source Introduce a “rank source”  E  that continually replenishes the rank of each page,  p ,  by a fixed amount  E ( p ).
PageRank Algorithm Let S be the total set of pages. Let ∀p ∈ S: E(p) = α/|S| (for some 0 < α < 1, e.g. 0.15). Initialize ∀p ∈ S: R(p) = 1/|S|. Until ranks do not change (much) (convergence): For each p ∈ S: R′(p) = Σ_{q: q→p} R(q)/N_q + E(p). For each p ∈ S: R(p) = cR′(p) (normalize).
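A minimal sketch of the loop above in Python, run on an assumed toy graph that contains a rank sink (the 1↔2 cycle); the rank source E keeps page 0 from losing all its rank:

```python
def pagerank(out, alpha=0.15, iters=100):
    """Iterate R'(p) = sum_{q->p} R(q)/N_q + E(p), then normalize to sum 1."""
    pages = list(out)
    n = len(pages)
    E = {p: alpha / n for p in pages}      # uniform rank source
    R = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: E[p] for p in pages}
        for q in pages:
            share = R[q] / len(out[q])     # q gives equal fractions away
            for p in out[q]:
                new[p] += share
        total = sum(new.values())          # normalizing constant c = 1/total
        R = {p: v / total for p, v in new.items()}
    return R

graph = {0: [1], 1: [2], 2: [1]}           # pages 1 and 2 form a rank sink
R = pagerank(graph)
print(R[0] > 0 and abs(sum(R.values()) - 1) < 1e-9)  # True
```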
Linear Algebra Version R = c(AR + E). Since ||R||₁ = 1: R = c(A + E × 1)R, where 1 is the vector consisting of all 1’s (so E × 1 is an outer product). So R is an eigenvector of (A + E × 1).
Random Surfer Model PageRank can be seen as modeling a “random surfer” who starts on a random page and then at each point: With probability E(p) randomly jumps to page p. Otherwise, randomly follows a link on the current page. R(p) models the probability that this random surfer will be on page p at any given time. “E jumps” are needed to prevent the random surfer from getting “trapped” in web sinks with no outgoing links.
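The random-surfer reading can be checked by simulation: visit frequencies over a long walk should approximate R(p). A Monte Carlo sketch on an assumed toy graph (pages 1 and 2 form a sink cycle, so they dominate the stationary distribution):

```python
import random

random.seed(0)
out = {0: [1], 1: [2], 2: [1]}
pages = list(out)
alpha = 0.15
visits = {p: 0 for p in pages}
page = 0
steps = 200_000
for _ in range(steps):
    visits[page] += 1
    if random.random() < alpha or not out[page]:
        page = random.choice(pages)        # "E jump" to a random page
    else:
        page = random.choice(out[page])    # follow a random out-link
freq = {p: visits[p] / steps for p in pages}

print(freq[1] > freq[0] and freq[2] > freq[0])  # True: the sink cycle dominates
```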
Speed of Convergence Early experiments on Google used 322 million links. PageRank algorithm converged (within small tolerance) in about 52 iterations. Number of iterations required for convergence is empirically O(log  n ) (where  n  is the number of links). Therefore calculation is quite efficient.
Simple Title Search with PageRank Use simple Boolean search to search web-page titles and rank the retrieved pages by their PageRank. Sample search for “university”: Altavista returned a random set of pages with “university” in the title (seemed to prefer short URLs). Primitive Google returned the home pages of top universities.
Google Ranking The complete Google ranking (based on university publications prior to commercialization) includes: a vector-space similarity component, a keyword-proximity component, an HTML-tag weight component (e.g. title preference), and a PageRank component. Details of current commercial ranking functions are trade secrets.
Personalized PageRank PageRank can be biased (personalized) by changing E to a non-uniform distribution. Restrict “random jumps” to a set of specified relevant pages. For example, let E(p) = 0 except for one’s own home page, for which E(p) = α. This results in a bias towards pages that are closer in the web graph to your own home page.
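The personalization is a one-line change to the iteration: concentrate E on the chosen page. A sketch on an assumed toy graph of two disjoint 2-cycles, with the "home" page in one of them:

```python
# Same PageRank iteration as before, but E is zero everywhere except
# a designated "home" page (hypothetical toy graph; home = page 0).
def personalized_pagerank(out, home, alpha=0.15, iters=100):
    pages = list(out)
    E = {p: (alpha if p == home else 0.0) for p in pages}
    R = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: E[p] for p in pages}
        for q in pages:
            for p in out[q]:
                new[p] += R[q] / len(out[q])
        total = sum(new.values())
        R = {p: v / total for p, v in new.items()}
    return R

# Two disjoint 2-cycles; jumps go only to page 0, so rank concentrates there.
graph = {0: [1], 1: [0], 2: [3], 3: [2]}
R = personalized_pagerank(graph, home=0)
print(R[0] + R[1] > 0.99)  # True: pages near the home page dominate
```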
Google PageRank-Biased Spidering Use PageRank to direct (focus) a spider on “important” pages. Compute PageRank using the current set of crawled pages. Order the spider’s search queue based on current estimated PageRank.
Link Analysis Conclusions Link analysis uses information about the structure of the web graph to aid search. It is one of the major innovations in web search. It is the primary reason for Google’s success.
