A Presentation
on
Web Mining
Presented By
Tanjarul Islam Mishu
[@tanjarul26]
Dept. of CSE
Jatiya Kabi Kazi Nazrul Islam University
Spring 2006
Overview
 Web Mining
 Opportunities and Challenges
 Data Mining vs. Web Mining
 Classification of Web Mining Techniques
 Web Content Mining Techniques
 Web Usage Data Sources
 Web Usage Mining Model
Web Mining
 Mining means extracting something useful or valuable
from a baser substance, such as mining gold from the
earth.
 Web mining is the process of using data mining
techniques and algorithms to extract information
directly from the Web.
 It uses Web documents and services,
Web content, hyperlinks and
server logs.
Opportunities and Challenges
 The amount of information on the Web is huge.
 The coverage of Web information is very wide and diverse.
 Data of all types exist on the Web, e.g. text, images, audio and video.
 Much of the Web information is semi-structured due to the nested structure of HTML code.
 Much of the Web information is linked.
 Much of the Web information is redundant.
 The Web is noisy: a Web page is typically a mixture of many kinds of information.
 The Web is dynamic.
Data Mining vs. Web Mining
 Traditional data mining
   Data is structured and relational
   Well-defined tables, columns, rows, keys, and constraints
 Web data mining
   Data is semi-structured and unstructured
   Data is readily available
   Data is rich in features and patterns
Classification of Web Mining Techniques
Web Mining
   Web Structure Mining
   Web Content Mining
   Web Usage Mining
Web-Structure Mining
 Web structure mining is used to identify the relationships between Web pages that are connected by related information or by direct links.
 Structure mining is used to minimize two main problems of the World Wide Web:
   Irrelevant search results
   Inability to index the vast amount of information provided on the Web
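As a rough illustration of how direct links expose relationships between pages, the sketch below counts in-links in a small link graph. The page paths and the `in_link_counts` helper are hypothetical and are not taken from the presentation; in-link counting is only one simple structure-mining signal.

```python
from collections import Counter

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "/index": ["/products", "/about"],
    "/products": ["/products/a", "/products/b", "/index"],
    "/about": ["/index"],
    "/products/a": ["/products"],
    "/products/b": ["/products"],
}

def in_link_counts(graph):
    """Count how many distinct pages link to each target page."""
    counts = Counter()
    for source, targets in graph.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts

counts = in_link_counts(links)
most_linked = counts.most_common(1)[0]  # page with the most in-links
print(most_linked)
```

Pages with many in-links are candidates for being more relevant search results, which is the kind of signal structure mining can feed into ranking.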
Web Content Mining
 The process of information or resource discovery from the content of millions of sources across the World Wide Web
 E.g. Web data contents: text, images, audio, video, metadata and hyperlinks
 It is related to text mining because much of the Web's content is text.
Web Content Mining Techniques
Web Content Mining
   Classification
   Clustering
   Association
Document Classification
 Supervised Learning
   Supervised learning is a machine learning technique for creating a function from training data.
   The learned function can predict a class label for an input object (called classification).
 Techniques used are
   Nearest Neighbor Classifier
   Feature Selection
   Decision Tree
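A minimal sketch of a nearest neighbor classifier for documents, assuming a bag-of-words representation and cosine similarity (both are assumptions, since the slide does not specify them). The training texts, the labels and the helper names are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a document as word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical labeled training documents.
training = [
    ("cheap flights and hotel deals", "travel"),
    ("book a hotel near the beach", "travel"),
    ("python code for data mining", "tech"),
    ("data mining algorithms and code", "tech"),
]

def classify(text):
    """Return the label of the most similar training document (1-NN)."""
    query = bag_of_words(text)
    best = max(training, key=lambda pair: cosine(query, bag_of_words(pair[0])))
    return best[1]

print(classify("hotel deals for the beach"))
print(classify("mining code in python"))
```

The training data plays the role of the labeled examples in supervised learning; the predicted class label is simply that of the closest example.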
Document Clustering
 Unsupervised Learning: a data set of input objects is gathered, without class labels.
 Goal: Develop measures of similarity to cluster a collection of documents/terms into groups such that similarity within a cluster is larger than similarity across clusters.
 Hypothesis: Given a 'suitable' clustering of a collection, if the user is interested in document/term d/t, they are likely to be interested in other members of the cluster to which d/t belongs.
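To make the idea of a similarity measure concrete, here is a small sketch that groups documents with Jaccard similarity over their word sets and a greedy single-pass assignment. Both choices, the threshold of 0.2 and the sample documents are assumptions made for illustration, not the method of the presentation.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two documents."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(docs, threshold=0.2):
    """Greedy single-pass clustering: join the first cluster whose
    seed document is similar enough, otherwise start a new cluster."""
    clusters = []
    for doc in docs:
        for group in clusters:
            if jaccard(doc, group[0]) >= threshold:
                group.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters

# Hypothetical documents.
docs = [
    "web usage mining of server logs",
    "mining server logs for usage patterns",
    "gold mining in the earth",
    "extracting gold from the earth",
]

for group in cluster(docs):
    print(group)
```

With these inputs the two log-related documents end up together and the two gold-related documents end up together, so similarity within each group is higher than across groups.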
Association
Example: Supermarket

Transaction ID   Items Purchased
1                butter, bread, tea
2                bread, tea, sugar, egg
3                diaper
…                …

 An association rule can be:
"If a customer buys tea, in 50% of cases, he/she also buys sugar. This happens in 33% of all transactions."
   50%: confidence
   33%: support
 Association rules can also be applied to hyperlinks.
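The support and confidence figures can be checked with a short sketch. It uses only the three transactions listed in the example (the elided rows are not filled in), and the `support` and `confidence` helpers are illustrative names.

```python
# The three transactions shown in the supermarket example.
transactions = [
    {"butter", "bread", "tea"},
    {"bread", "tea", "sugar", "egg"},
    {"diaper"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that
    also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)

s = support({"tea", "sugar"})        # 1 of 3 transactions
c = confidence({"tea"}, {"sugar"})   # 1 of the 2 tea transactions
print(f"support={s:.0%}, confidence={c:.0%}")
```

On these three transactions the rule "tea implies sugar" has a support of about 33% and a confidence of 50%, matching the figures on the slide.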
Web-Usage Mining
 What is Usage Mining?
Discovering user ‘navigation patterns’ from web data.
Prediction of user behavior while the user interacts
with the web.
Web-Usage Mining
 Usage Mining Techniques
   Data Preparation: Data Collection, Data Selection, Data Cleaning
   Data Mining: Navigation Patterns, Sequential Patterns
Web-Usage Mining
 Data Mining Techniques – Navigation Patterns
[Figure: Web page hierarchy of a Web site, with pages A, B, C, D and E]
Web-Usage Mining
 Data Mining Techniques – Navigation Patterns
Analysis example:
 70% of users who accessed /company/product2 did so by starting at /company and proceeding through /company/new, /company/products and /company/product1.
 80% of users who accessed the site started from /company/products.
 65% of users left the site after four or fewer page references.
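Statistics of this kind can be computed from per-user sessions, i.e. ordered lists of the pages each visitor requested. The sketch below uses a handful of made-up sessions, so its percentages are not those quoted above; the `share` helper is a hypothetical name.

```python
# Hypothetical sessions: the ordered pages requested in each visit.
sessions = [
    ["/company", "/company/new", "/company/products",
     "/company/product1", "/company/product2"],
    ["/company/products", "/company/product1"],
    ["/company/products", "/company/product2"],
    ["/company/products"],
    ["/company", "/company/products", "/company/product1"],
]

def share(predicate):
    """Fraction of sessions for which predicate(session) is true."""
    return sum(1 for s in sessions if predicate(s)) / len(sessions)

# Sessions whose entry page was /company/products.
started_at_products = share(lambda s: s[0] == "/company/products")

# Sessions that ended after four or fewer page references.
short_visits = share(lambda s: len(s) <= 4)

print(f"{started_at_products:.0%} started from /company/products")
print(f"{short_visits:.0%} left after four or fewer page references")
```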
Web-Usage Mining cont…
 Data Mining Techniques – Sequential Patterns
Example: Supermarket

Customer   Transaction Time    Purchased Items
John       6/21/05 5:30 pm     Beer
John       6/22/05 10:20 pm    Brandy
Frank      6/20/05 10:15 am    Juice, Coke
Frank      6/20/05 11:50 am    Beer
Frank      6/20/05 12:50 pm    Wine, Cider
Mary       6/20/05 2:30 pm     Beer
Mary       6/21/05 6:17 pm     Wine, Cider
Mary       6/22/05 5:05 pm     Brandy
Web-Usage Mining cont…
 Data Mining Techniques – Sequential Patterns
Example: Supermarket (cont.)

Customer   Customer Sequence
John       (Beer) (Brandy)
Frank      (Juice, Coke) (Beer) (Wine, Cider)
Mary       (Beer) (Wine, Cider) (Brandy)

Mining result:

Sequential Patterns with Support >= 40%   Supporting Customers
(Beer) (Brandy)                           John, Mary
(Beer) (Wine, Cider)                      Frank, Mary
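A minimal sketch of how the supporting customers of a sequential pattern can be found from the customer sequences listed above. A customer supports a pattern when the pattern's itemsets appear in the customer's sequence in the same order, not necessarily adjacent. The function names are illustrative and this is only the support check, not a full sequential pattern mining algorithm.

```python
# Customer sequences from the supermarket example, as ordered itemsets.
sequences = {
    "John": [{"Beer"}, {"Brandy"}],
    "Frank": [{"Juice", "Coke"}, {"Beer"}, {"Wine", "Cider"}],
    "Mary": [{"Beer"}, {"Wine", "Cider"}, {"Brandy"}],
}

def contains(sequence, pattern):
    """True if the pattern's itemsets occur, in order, within the sequence."""
    position = 0
    for itemset in pattern:
        # Advance to the next transaction that includes this itemset.
        while position < len(sequence) and not itemset <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True

def supporters(pattern):
    """Names of the customers whose sequence contains the pattern."""
    return sorted(name for name, seq in sequences.items()
                  if contains(seq, pattern))

for pattern in ([{"Beer"}, {"Brandy"}], [{"Beer"}, {"Wine", "Cider"}]):
    names = supporters(pattern)
    print(pattern, names, f"support={len(names) / len(sequences):.0%}")
```

Each of the two patterns is supported by two of the three customers, about 67%, which clears the 40% threshold. For (Beer) (Brandy) the supporters are John and Mary, since Frank never bought Brandy.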
Web-Usage Mining
 Data Mining Techniques – Sequential Patterns
Web usage examples
 Within the past week, 30% of users who reached /company/product/ from a Google search had ‘camera’ as the search text.
 60% of users who placed an online order in /company/product1 also placed an order in /company/product4 within 15 days.
Web Usage Data Sources
Web usage data is the record of the actions a user takes with the mouse and keyboard while visiting a site. Sources include:
 Server access logs
 Server referrer logs
 Agent logs
 Client-side cookies
 User profiles
 Search engine logs
 Database logs
Transfer / Access Log
 The transfer/access log contains detailed information about each request that the server receives from users' web browsers.
 Fields recorded: time, date, hostname, file requested, amount of data transferred, status of the request
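As an illustration, the sketch below extracts these fields from one access log line. It assumes the widely used Common Log Format, which the presentation does not name; the sample line and the `parse_line` helper are made up for the example.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format):
# host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it
    does not match the assumed format."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# Hypothetical log line.
sample = ('192.0.2.10 - - [20/Jun/2005:10:15:00 +0000] '
          '"GET /company/products HTTP/1.0" 200 2326')
entry = parse_line(sample)
print(entry["host"], entry["path"], entry["status"], entry["size"])
```

Parsed entries like this are the raw material for the data preparation steps (collection, selection, cleaning) that precede usage mining.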
Agent Log
 The agent log lists the browsers (including version number and platform) that people are using to connect to your server.
 Fields recorded: hostname, version number, platform
Referrer Log
 If a user gets to one of the server's pages by clicking on a link from another site, the URL of that site will appear in this log.
 Fields recorded: URL, referrer URL
Error Log
 The error log keeps a record of errors and failed requests.
 A request may fail if the page contains links to a file that
does not exist or if the user is not authorized to access a
specific page or file.
Web Usage Mining Model
Any Questions?
