Deriving “The Google Matrix”:
       $G = \alpha S + (1 - \alpha)\frac{1}{n} e e^T$
             Lecture 4
 B.S Physics 1993, University of Washington
 M.S EE 1998, Washington State (four patents)
 10+ Years in Search Marketing
 Founder of SEMJ.org (Research Journal)
 Blogger for SemanticWeb.com
 President of Future Farm Inc.
   Build a focused crawler in:
    Java, Python, or Perl
 Point it at the MSU home page. Gather all the URLs and
  store them for later use.
  http://www.montana.edu/robots.txt
 Store all the HTML and label each page with a DocID.
 Read Google’s paper. Next time: PageRank & the
  Google Matrix.
 Contest: Who can store the most unique URLs?
 Due Feb 7th (next week). Send code and URL list.
   #!/usr/bin/python
   ### Basic web crawler in Python: fetch the URL given on the command line
   ## Uses the urllib2 library for URLs and BeautifulSoup for parsing
   #
   from BeautifulSoup import BeautifulSoup
   import sys  # read the URL from the command line
   import urllib2
   #### change the user-agent name
   from urllib import FancyURLopener
   class MyOpener(FancyURLopener):
       version = 'BadBot/1.0'
   print MyOpener.version  # print the user-agent name
   # urllib2.urlopen does not use MyOpener, so send the name explicitly
   request = urllib2.Request(sys.argv[1], headers={'User-Agent': MyOpener.version})
   httpResponse = urllib2.urlopen(request)
   # store the HTML page in an object called htmlPage
   htmlPage = httpResponse.read()
   print htmlPage
   htmlDom = BeautifulSoup(htmlPage)
   # dump page title
   print htmlDom.title.string
   # dump all links in page
   allLinks = htmlDom.findAll('a', {'href': True})
   for link in allLinks:
       print link['href']
   # print name of bot
   print MyOpener.version
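For the unique-URL contest, the links printed by the crawler still need to be made absolute and de-duplicated. The following is a minimal sketch of one way to do that, written for Python 3 with only the standard library (an assumption; the slide code targets Python 2 and BeautifulSoup). The names `LinkCollector` and `unique_links` and the `example.org` URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = set()  # a set stores each URL only once

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            absolute = urljoin(self.base_url, href)    # resolve relative links
            absolute, _fragment = urldefrag(absolute)  # drop "#section" parts
            if absolute.startswith(("http://", "https://")):
                self.urls.add(absolute)


def unique_links(html, base_url):
    """Return the set of distinct absolute http(s) links found in html."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.urls


# Hypothetical page: two spellings of the same URL plus a mailto link.
sample = ('<a href="/a.html">A</a> <a href="a.html#top">A again</a> '
          '<a href="mailto:x@example.org">mail</a>')
print(sorted(unique_links(sample, "http://www.example.org/")))
```

Storing URLs in a set is what makes the count "unique"; stripping fragments is a design choice that treats `a.html` and `a.html#top` as the same document.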
 Heritrix: open-source Java-based crawler
 https://webarchive.jira.com/wiki/display/Heritrix/Heritrix;jsessionid=AE9A595F01CAAB59BBCDC50C8A3ED2A9
 http://www.robotstxt.org/robotstxt.html
 http://www.commoncrawl.org/
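A crawler is expected to check robots.txt before fetching a page. As a rough sketch, Python 3's standard `urllib.robotparser` module can answer that question. The robots.txt content and the `example.org` URLs below are made up for illustration, and `allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules without any network access
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "BadBot/1.0", "http://www.example.org/index.html"))
print(allowed(ROBOTS_TXT, "BadBot/1.0", "http://www.example.org/private/x.html"))
```

In a real crawl the rules would be downloaded once per host (for example with `set_url` and `read`) and cached, rather than parsed from a string.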
[Figure: directed graph of six web pages, numbered 1 to 6]
$r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|}$

• $r(P_i)$ is the PageRank of page $P_i$
• $|P_j|$ is the number of outlinks from page $P_j$
• $B_{P_i}$ is the set of pages pointing into $P_i$
 The $r(P_j)$ values of the inlinking pages are unknown, so a starting value is needed.
     Could initialize all values to 1/n, where n is the number of pages:

   $r_0(P_i) = 1/n$ for all pages $P_i$

 The process is repeated until the values stabilize (this will not
    happen in all cases). Will this converge?
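$r_{k+1}(P_i) = \sum_{P_j \in B_{P_i}} \frac{r_k(P_j)}{|P_j|}$

•   $r_{k+1}(P_i)$ is the PageRank of page $P_i$ at iteration k + 1
•   $r_0(P_i) = 1/n$, where n is the total number of pages
•   $r_k(P_j)$ is the PageRank of page $P_j$ at iteration k
•   $|P_j|$ is the number of outlinks from page $P_j$
•   $B_{P_i}$ is the set of pages pointing into $P_i$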
Iteration 0      Iteration 1       Iteration 2        Rank at k = 2
r0(P1) = 1/6     r1(P1) = 1/18     r2(P1) = 1/36      5
r0(P2) = 1/6     r1(P2) = 5/36     r2(P2) = 1/18      4
r0(P3) = 1/6     r1(P3) = 1/12     r2(P3) = 1/36      5
r0(P4) = 1/6     r1(P4) = 1/4      r2(P4) = 17/72     1
r0(P5) = 1/6     r1(P5) = 5/36     r2(P5) = 11/72     3
r0(P6) = 1/6     r1(P6) = 1/6      r2(P6) = 14/72     2
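The table can be checked with a few lines of Python. This is a sketch under one loud assumption: the slides do not list the edges of the six-page graph, so the `OUTLINKS` dictionary below is an assumed link structure, chosen because it reproduces the Iteration 1 and Iteration 2 columns exactly. `pagerank_step` is a hypothetical helper name.

```python
from fractions import Fraction

# ASSUMED link structure (page -> pages it links to). The slides do not give
# the edges; this choice reproduces the table values.
OUTLINKS = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}


def pagerank_step(rank, outlinks):
    """One pass of r_{k+1}(P_i) = sum over inlinking P_j of r_k(P_j) / |P_j|."""
    new_rank = {page: Fraction(0) for page in outlinks}
    for page, targets in outlinks.items():
        for target in targets:
            # page passes an equal share of its rank to each page it links to
            new_rank[target] += rank[page] / len(targets)
    return new_rank


n = len(OUTLINKS)
r0 = {page: Fraction(1, n) for page in OUTLINKS}  # start every page at 1/n
r1 = pagerank_step(r0, OUTLINKS)
r2 = pagerank_step(r1, OUTLINKS)
for page in sorted(OUTLINKS):
    print(page, r0[page], r1[page], r2[page])
print("total rank after one pass:", sum(r1.values()))
```

Under this assumed graph, page 2 has no outlinks, so the total rank drops from 1 to 5/6 after one pass. That leak is the kind of problem the later adjustments to the matrix are meant to address.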
Matrix multiplication: [n × m] · [m × r] = [n × r]
•   The non-zero elements of row i correspond to the
    outlinking pages of page i

•   The non-zero elements of column i correspond to the
    inlinking pages of page i
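To make the row and column reading concrete, here is a small sketch that builds the row-normalized hyperlink matrix (called H in the following slides) from a link list. The edge list is an assumption, since the slides do not spell out the links of the six-page example, and `hyperlink_matrix` is a hypothetical helper name.

```python
from fractions import Fraction

# ASSUMED link structure for the six-page example (page -> pages it links to).
OUTLINKS = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}


def hyperlink_matrix(outlinks):
    """Build H with H[i][j] = 1/|P_i| if page i links to page j, else 0."""
    pages = sorted(outlinks)
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    H = [[Fraction(0)] * n for _ in range(n)]
    for page, targets in outlinks.items():
        for target in targets:
            H[index[page]][index[target]] = Fraction(1, len(targets))
    return H


H = hyperlink_matrix(OUTLINKS)
for row in H:
    print([str(x) for x in row])

# Row 1 (index 0): non-zero entries mark the pages that page 1 links to.
print([j + 1 for j, x in enumerate(H[0]) if x != 0])
# Column 4 (index 3): non-zero entries mark the pages that link into page 4.
print([i + 1 for i in range(len(H)) if H[i][3] != 0])
```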
PageRank and The Google Matrix
$\pi^{(k+1)T} = \pi^{(k)T} H$

Where: $\pi^T$ is a 1 × n row vector
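The matrix form can be sketched in plain Python as a row vector times a matrix. The entries of `H` below are an assumption (the slides do not list the links of the six-page example); with this choice, one multiplication gives the same numbers as the summation formula. `row_times_matrix` is a hypothetical helper name.

```python
from fractions import Fraction as F

# ASSUMED hyperlink matrix H for the six-page example (row i = outlinks of page i).
H = [
    [0, F(1, 2), F(1, 2), 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [F(1, 3), F(1, 3), 0, 0, F(1, 3), 0],
    [0, 0, 0, 0, F(1, 2), F(1, 2)],
    [0, 0, 0, F(1, 2), 0, F(1, 2)],
    [0, 0, 0, 1, 0, 0],
]


def row_times_matrix(pi, M):
    """Compute the row vector pi^T M for a square matrix M."""
    n = len(M)
    return [sum(pi[i] * M[i][j] for i in range(n)) for j in range(n)]


pi0 = [F(1, 6)] * 6             # pi^(0)T = [1/6, ..., 1/6]
pi1 = row_times_matrix(pi0, H)  # pi^(1)T = pi^(0)T H
pi2 = row_times_matrix(pi1, H)  # pi^(2)T = pi^(1)T H
print([str(x) for x in pi1])
print([str(x) for x in pi2])
```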
• Rank sinks & convergence
• Resembles work done on Markov chains
    • H = transition probability matrix
    • Converges to a unique positive vector if the matrix is
      • Stochastic: each row sums to 1
      • Irreducible: non-zero probability of transitioning
        (possibly in more than one step) from any state to any other state
      • Aperiodic: returns to a state i are not restricted to multiples
        of some fixed number of steps; they can be irregular
      • Primitive: irreducible and aperiodic
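These conditions can be tested mechanically on small matrices. The sketch below is illustrative only: the function names and the two 2 × 2 example matrices are hypothetical, irreducibility is checked by reachability along non-zero entries, and primitivity uses the standard result (Wielandt's bound) that a primitive n × n matrix has an all-positive power with exponent at most (n − 1)² + 1.

```python
def is_row_stochastic(M, tol=1e-12):
    """True if all entries are non-negative and every row sums to 1."""
    return all(min(row) >= 0 and abs(sum(row) - 1) <= tol for row in M)


def is_irreducible(M):
    """True if every state can reach every other state along non-zero entries."""
    n = len(M)
    for start in range(n):
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if M[i][j] > 0 and j not in seen:
                    seen.add(j)
                    stack.append(j)
        if len(seen) < n:
            return False
    return True


def is_primitive(M):
    """True if some power M^k with k <= (n-1)^2 + 1 has only positive entries."""
    n = len(M)
    pattern = [[M[i][j] > 0 for j in range(n)] for i in range(n)]
    power = pattern  # zero/non-zero pattern of M^1
    for _ in range((n - 1) ** 2 + 1):
        if all(all(row) for row in power):
            return True
        # pattern of the next power: boolean matrix product power * pattern
        power = [[any(power[i][k] and pattern[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return False


two_cycle = [[0, 1], [1, 0]]       # irreducible but periodic (period 2)
lazy_cycle = [[0.5, 0.5], [1, 0]]  # irreducible and aperiodic
print(is_irreducible(two_cycle), is_primitive(two_cycle))
print(is_irreducible(lazy_cycle), is_primitive(lazy_cycle))
```

The two-cycle shows why irreducibility alone is not enough: the chain alternates between the two states forever, so it is irreducible but not primitive.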
Markov property: the next state depends only on the current state (no memory)
•   “Random Surfer” model
    • The surfer follows hyperlinks at random
    • Time spent on a page is proportional to its
      importance.
    • Must fix the “dangling node” problem: the surfer gets
      stuck on a node with no outlinks (PDF files, images, etc.)
    • Need to allow the surfer to “teleport”, that is, make
      random jumps.
$S = H + a\left(\frac{1}{n} e^T\right)$

Where: $a_i = 1$ if page i is dangling, otherwise $a_i = 0$.

  $e^T$ (1 × 6 in this example) = row vector of all 1’s, n = number of nodes
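A short sketch of the stochastic adjustment: detect the dangling rows, then replace each of them with a uniform row of 1/n. The entries of `H` are an assumption about the six-page example (the slides do not list its links); in this assumed graph only page 2 is dangling.

```python
from fractions import Fraction as F

# ASSUMED hyperlink matrix H for the six-page example.
H = [
    [0, F(1, 2), F(1, 2), 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [F(1, 3), F(1, 3), 0, 0, F(1, 3), 0],
    [0, 0, 0, 0, F(1, 2), F(1, 2)],
    [0, 0, 0, F(1, 2), 0, F(1, 2)],
    [0, 0, 0, 1, 0, 0],
]
n = len(H)

# Dangling-node vector a: a_i = 1 if row i of H is all zeros, otherwise 0.
a = [1 if sum(row) == 0 else 0 for row in H]

# S = H + a (1/n) e^T: the outer product a e^T adds 1/n to every entry
# of each dangling row and leaves the other rows unchanged.
S = [[H[i][j] + a[i] * F(1, n) for j in range(n)] for i in range(n)]

print(a)
print([str(x) for x in S[1]])
print([str(sum(row)) for row in S])
```

After the adjustment every row of S sums to 1, so S is a stochastic matrix even though H was not.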
Serendipity?: Page and Brin introduced an
 “adjustment”. The random surfer can “teleport”
 by entering a new destination into the browser.
• Teleportation matrix: $E = \frac{1}{n} e e^T$
• α controls the proportion of time a “random
  surfer” follows hyperlinks as opposed to
  teleporting. If α = 0.5, half the time is
  spent on each.
• At α = 0.5, about 34 iterations are required to
  converge to a tolerance of $10^{-10}$.
• Originally set at α = 0.85. As α → 1,
  computation time grows and sensitivity becomes an issue.
$G = \alpha S + (1 - \alpha)\frac{1}{n} e e^T$
$\pi^{(k+1)T} = \pi^{(k)T} G$


* 2002: “World’s largest matrix computation”. Order of the matrix in
   2002: ~$8.1 \times 10^9$
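Putting the pieces together, the power method on G can be sketched for the six-page example as follows. The link list is an assumption (the slides do not give the edges), pages are indexed from 0, and the stopping rule (L1 change below a tolerance) is one common choice rather than anything stated in the slides. `google_matrix` and `power_iteration` are hypothetical helper names.

```python
# ASSUMED links for the six-page example, 0-indexed (page 2 -> index 1 is dangling).
OUTLINKS = [[1, 2], [], [0, 1, 4], [4, 5], [3, 5], [3]]


def google_matrix(outlinks, alpha):
    """Dense G = alpha * S + (1 - alpha) * (1/n) e e^T."""
    n = len(outlinks)
    G = []
    for targets in outlinks:
        if targets:
            s_row = [1.0 / len(targets) if j in targets else 0.0 for j in range(n)]
        else:
            s_row = [1.0 / n] * n  # dangling row of S is uniform
        G.append([alpha * s + (1 - alpha) / n for s in s_row])
    return G


def power_iteration(G, tol=1e-10, max_iter=1000):
    """Iterate pi^T <- pi^T G until the L1 change falls below tol."""
    n = len(G)
    pi = [1.0 / n] * n
    for k in range(1, max_iter + 1):
        new_pi = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(x - y) for x, y in zip(new_pi, pi)) < tol:
            return new_pi, k
        pi = new_pi
    return pi, max_iter


G = google_matrix(OUTLINKS, alpha=0.85)
pi, iterations = power_iteration(G)
print([round(x, 4) for x in pi], iterations)
```

Each entry of the resulting vector can be read as the long-run fraction of time the random surfer spends on that page.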
$G = \alpha H + (\alpha a + (1 - \alpha) e)\frac{1}{n} e^T$
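This form matters in practice because H is sparse while G is completely dense. Multiplying a row vector by G then reduces to $\alpha \pi^T H$ plus a single scalar, $(\alpha \pi^T a + (1 - \alpha) \pi^T e)/n$, added to every entry. The sketch below is one way to implement that step without ever forming G; the link list is an assumed structure for the six-page example (0-indexed), and `sparse_step` is a hypothetical helper name.

```python
# ASSUMED links for the six-page example, 0-indexed (index 1 is dangling).
OUTLINKS = [[1, 2], [], [0, 1, 4], [4, 5], [3, 5], [3]]
ALPHA = 0.85


def sparse_step(pi, outlinks, alpha):
    """One step of pi^T G using only the link lists, never forming G."""
    n = len(pi)
    # pi^T a: total rank currently sitting on dangling pages
    dangling_mass = sum(pi[i] for i in range(n) if not outlinks[i])
    # (alpha * pi^T a + (1 - alpha) * pi^T e) / n is the same for every page
    jump = (alpha * dangling_mass + (1 - alpha) * sum(pi)) / n
    new_pi = [jump] * n
    # alpha * pi^T H: each page shares its rank equally among its outlinks
    for i, targets in enumerate(outlinks):
        for j in targets:
            new_pi[j] += alpha * pi[i] / len(targets)
    return new_pi


pi = [1.0 / 6] * 6
for _ in range(200):
    pi = sparse_step(pi, OUTLINKS, ALPHA)
print([round(x, 4) for x in pi])
```

The work per step is proportional to the number of links rather than to n², which is the reason for rewriting G in terms of H.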
Editor's Notes

  • #3: Never taught this course in MT. Taught for MASCO last Jan.
  • #7: Hypertext Transfer Protocol…
  • #13: Rows n and columns m. Inner dimensions must match.
  • #16: In this example, initialize the pi(0) vector to [1/6, 1/6, 1/6, …], multiply it by H, and you get Iteration 1 in Table 4.1 of the book. This gives the same results as the PageRank formula.
  • #18: A11 could be the probability that we stay where we are. A12 is the probability that we go to s2.
  • #21: The i refers to rows only. If a row is all zeros, then ai = 1. S has the same dimensions as H: a is 6 x 1 and eT is 1 x 6, so their product is a 6 x 6 matrix, which is added to H. eT is all ones.
  • #26: Order of a matrix is m times n!
  • #28: Multiply this by pi(0), which is a 1 x 6 vector [1/6, 1/6, …]. You end up with a PageRank vector of size 1 x 6. Interpretation: if one value is 0.37, then 37% of the time is spent on that page.