Discovering knowledge using web structure mining
1. What is the Web?
1.1 Problems With Web
• Difficulty in finding relevant information
• Personalization of information
• Learning about consumers or individual users
2. Objectives
i. To survey the area of web mining.
ii. To introduce link mining.
iii. To review the HITS and PageRank algorithms.
3. Web Mining: Definition
• The process of discovering potentially useful and previously unknown information or knowledge from web data.
3.1 Web Mining: Subtasks
• Resource finding
• Information selection and pre-processing
• Generalization
• Analysis
3.1 Web Mining Categories
[Diagram: Web Mining divides into three categories, each with its own data source]
• Web Content Mining: text and multimedia documents
• Web Structure Mining: hyperlink structure
• Web Usage Mining: web log records
3.1.1 Web Content Mining
• Scans the data of a Web page to determine the relevance of its content to a search query.
• Two approaches:
  ◦ Agent-based approach
  ◦ Database approach
3.1.2 Web Structure Mining
• Identifies relationships between Web pages.
• Focuses on the following problems:
  ◦ Reducing irrelevant search results.
  ◦ Helping to index information on the web.
3.1.3 Web Usage Mining
• Focuses on techniques that predict user behavior while interacting with the WWW.
• Web log records are analyzed to discover user access patterns.
• The challenges can be divided into three phases:
  ◦ Pre-processing
  ◦ Pattern discovery
  ◦ Pattern analysis
4. Link Mining
• It is located at the intersection of work in:
  ◦ Link analysis
  ◦ Hypertext and web mining
  ◦ Relational learning and inductive logic programming
  ◦ Graph mining
• Some link mining tasks applicable to web structure mining are:
  ◦ Link-based classification
  ◦ Link-based cluster analysis
  ◦ Link type
  ◦ Link strength
  ◦ Link cardinality
(i) Link-based Classification
• Predicts the category of a web page, based on:
  ◦ words that occur on the page
  ◦ links between pages
  ◦ anchor text
  ◦ HTML tags
  ◦ other possible attributes of the web page
• E.g.: Predicting the category of a paper based on its citations and co-citations.
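To make the citation example concrete, here is a minimal sketch that predicts a paper's category by majority vote over the known categories of the papers it cites or is cited by. The function name `classify_by_links`, the paper ids, and the category labels are all hypothetical, and majority voting is only one simple stand-in for a link-based classifier; the slides do not prescribe a method.

```python
from collections import Counter

def classify_by_links(page, links, labels):
    """Predict a label for `page` by majority vote over its labelled neighbours.

    links  -- iterable of (source, target) pairs, e.g. citations
    labels -- dict mapping already-classified pages to their category
    """
    neighbours = set()
    for src, dst in links:
        if src == page:
            neighbours.add(dst)   # pages that `page` links to
        elif dst == page:
            neighbours.add(src)   # pages that link to `page`
    votes = Counter(labels[n] for n in neighbours if n in labels)
    if not votes:
        return None               # no labelled neighbour, no prediction
    return votes.most_common(1)[0][0]

# Hypothetical citation graph: p1 cites p2 and p3, and is cited by p4 and p5.
citations = [("p1", "p2"), ("p1", "p3"), ("p4", "p1"), ("p5", "p1")]
known = {"p2": "databases", "p3": "databases", "p4": "networks", "p5": "databases"}
predicted = classify_by_links("p1", citations, known)
print(predicted)
```

A fuller classifier would combine these link votes with the page's own words, anchor text, and HTML tags listed above.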
(ii) Link-based Cluster Analysis
• Goal: finding naturally occurring subclasses.
• Data is segmented into groups:
  ◦ similar objects are grouped together
  ◦ dissimilar objects are placed in different groups
• Helps in discovering hidden patterns.
• E.g.: Finding diseases with similar transmission patterns.
(iii) Link Type
• Predicting the type of link between two entities.
• Predicting the purpose of a link.
• E.g.: navigational or advertising.
(iv) Link Strength
• Links can be associated with weights.
  ◦ Strong links: higher weight
  ◦ Weak links: lower weight
(v) Link Cardinality
• Refers to the number of inbound links to a web site.
• Link popularity: a combination of factors that weigh the importance of each incoming link.
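Counting inbound links can be sketched in a few lines. The helper name `inbound_link_counts` and the site names are hypothetical, and ignoring duplicate links and self-links is an assumption of this sketch, not something the slides specify.

```python
from collections import Counter

def inbound_link_counts(links):
    """Count distinct inbound links per target site from (source, target) pairs."""
    # Assumption: repeated links and links from a site to itself are not counted.
    unique = {(src, dst) for src, dst in links if src != dst}
    return Counter(dst for _, dst in unique)

# Hypothetical link list between sites.
links = [
    ("a.com", "c.com"),
    ("b.com", "c.com"),
    ("a.com", "c.com"),   # duplicate, counted once
    ("c.com", "a.com"),
    ("d.com", "d.com"),   # self-link, ignored
]
counts = inbound_link_counts(links)
print(counts["c.com"], counts["a.com"], counts["d.com"])
```

A raw count treats every inbound link equally; link popularity, as described above, would additionally weight each incoming link.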
5. Hyperlink-Induced Topic Search (HITS)
• Link analysis algorithm that rates pages.
• Identifies two kinds of pages from the Web hyperlink structure:
  ◦ Authorities: pages that contain valuable information on the subject.
  ◦ Hubs: pages that contain useful links to the authoritative pages.
[Diagram: hubs are web pages with links to other pages; authorities are web pages with content]
HITS Contd…
• Two-step process:
  ◦ Sampling step: a set of relevant pages is collected.
  ◦ Iterative step: hubs and authorities are found using the output of the sampling step.
HITS Contd…
• Sampling step:
  ◦ A query submitted to a search engine yields a root set.
  ◦ The root set is then expanded into a base set.
[Figure: Expanding the root set into the base set]
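The expansion can be sketched as follows, assuming the usual description of HITS: the base set adds every page a root page links to, plus a bounded number of pages that link to each root page. The names `expand_root_set`, `out_links`, `in_links`, and `max_in` are hypothetical, and taking the first `max_in` in-linking pages in sorted order is an arbitrary choice made here for repeatability.

```python
def expand_root_set(root, out_links, in_links, max_in=50):
    """Grow a root set into a base set.

    out_links -- dict: page -> pages it links to
    in_links  -- dict: page -> pages that link to it
    max_in    -- cap on in-linking pages added per root page (assumed parameter)
    """
    base = set(root)
    for page in root:
        # Add every page the root page points to.
        base.update(out_links.get(page, ()))
        # Add at most `max_in` pages that point to the root page.
        base.update(sorted(in_links.get(page, ()))[:max_in])
    return base

# Hypothetical link data for a two-page root set.
out_links = {"r1": ["a", "b"], "r2": ["b"]}
in_links = {"r1": ["x", "y", "z"], "r2": ["x"]}
base = expand_root_set(["r1", "r2"], out_links, in_links, max_in=2)
print(sorted(base))
```

The cap keeps a single very popular root page from flooding the base set with thousands of in-linking pages.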
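HITS Contd…
• Iterative step:
  ◦ Associate a non-negative authority weight x<p> and a non-negative hub weight y<p> with each page p.
  ◦ Computing authority weight: x<p> = sum of y<q> over all pages q that link to p.
  ◦ Computing hub weight: y<p> = sum of x<q> over all pages q that p links to.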
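A minimal sketch of the iterative step, assuming the standard HITS updates: a page's authority weight is the sum of the hub weights of the pages linking to it, and its hub weight is the sum of the authority weights of the pages it links to, with both sets of weights normalised after each round. The function names, the fixed iteration count, and the example graph are hypothetical.

```python
import math

def normalise(weights):
    """Scale weights so that their squares sum to one."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: w / norm for page, w in weights.items()}

def hits(links, iterations=20):
    """Return (authority, hub) weights for a list of (source, target) links."""
    pages = sorted({page for edge in links for page in edge})
    auth = {page: 1.0 for page in pages}
    hub = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update: add up hub weights of pages that link to the page.
        new_auth = {page: 0.0 for page in pages}
        for src, dst in links:
            new_auth[dst] += hub[src]
        # Hub update: add up the new authority weights of pages linked to.
        new_hub = {page: 0.0 for page in pages}
        for src, dst in links:
            new_hub[src] += new_auth[dst]
        auth, hub = normalise(new_auth), normalise(new_hub)
    return auth, hub

# Hypothetical base set: three hub-like pages pointing at two content pages.
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
auth, hub = hits(links)
print(max(auth, key=auth.get))
```

A fixed number of rounds keeps the sketch short; a convergence test on the change in weights would be the more careful stopping rule.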
Problems With HITS Algorithm
• Some problems with the HITS algorithm are:
  ◦ Mutually reinforcing relationships between hosts
  ◦ Automatically generated links
  ◦ Non-relevant nodes
  ◦ Hubs and authorities
  ◦ Topic drift
  ◦ Efficiency
6. PageRank Model
• It is a link analysis algorithm.
• Assigns each web page a numeric value that indicates its importance.
• Computes importance from the number of incoming links and the rank of the pages they come from.
PageRank Contd…
• The rank of a page is divided evenly among its out-links and contributes to the ranks of the pages they point to.
• PageRanks form a probability distribution over web pages, so the sum of all pages' PageRanks will be one.
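PageRank Contd…
• PageRank can be calculated by:
  PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
• T1 … Tn are the pages that point to page A.
• C(A) is defined as the number of links going out of page A.
• d is the damping factor, which is usually set to 0.85.
• d is the probability that a random surfer follows a link on the current page; with probability 1 - d the surfer gets bored and requests another random page.
• In this form the ranks sum to the number of pages N (when every page has at least one out-link); replacing (1 - d) with (1 - d)/N gives ranks that sum to one.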
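The formula can be applied repeatedly until the ranks settle. The sketch below implements exactly the form PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)); the function name, the iteration count, and the three-page graph are hypothetical, and it assumes the link list has no duplicates and every page has at least one out-link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pages = sorted({page for edge in links for page in edge})
    # C(T): number of links going out of each page.
    out_count = {page: 0 for page in pages}
    for src, _ in links:
        out_count[src] += 1
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for src, dst in links:
            # Each page shares its current rank evenly among its out-links.
            new_pr[dst] += d * pr[src] / out_count[src]
        pr = new_pr
    return pr

# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
ranks = pagerank(links)
print(ranks)
```

With this form of the formula the ranks in the example add up to the number of pages (3); dividing the (1 - d) term by the number of pages would instead give ranks that add up to one.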
Applications
• HITS was used in IBM's Clever search engine.
• PageRank is used by Google.
References
• Sekhar Babu Boddu, V.P Krishna Anne, Rajesekhara Rao Kurra and Durgesh Kumar Mishra, "Knowledge Discovery and Retrieval on World Wide Web Using Web Structure Mining", 2010, in Proceedings of the Fourth Asia International Conference on Mathematical/Analytical Modelling and Computer Simulation (AMS), IEEE.
• Lise Getoor, "Link Mining: A New Data Mining Challenge", 2003, SIGKDD Explorations, Volume 4, Issue 2.
• Jon M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", 1998, in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.
• L. Page, S. Brin and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", 1998, Technical report, Stanford University.
• wikipedia.org
• web-datamining.net
• maya.cs.depaul.edu