Web Mining
Presented by: Akshat Saxena, Anjul Sahu
Definition
  • Web mining is the application of data mining techniques to the web to discover interesting patterns.
Introduction
  • The web is extremely large
  • Data on the web is unstructured
  • This gives data mining good scope
  • Types of data on the web:
      • Content of the actual web pages
      • Intrapage structure
      • Interpage structure
      • Usage data
      • User profiles and cookies
Web Mining Taxonomy
Web Content Mining
  • Extends the work of search engines
  • Improves on traditional crawler techniques
  • Uses data mining for efficiency, effectiveness and scalability
  • Divided into two approaches:
      • Agent-based approach
      • Database-based approach
  • Text mining may or may not be counted as content mining
  • Topics covered: crawlers and personalization
Web Content Mining Subtasks
  • Resource finding: retrieving the intended documents
  • Information selection/pre-processing: selecting and pre-processing specific information from the retrieved documents
  • Generalization: discovering general patterns within and across web sites
  • Analysis: validating and/or interpreting the mined patterns
Text Mining
Web Crawler
  • A program that browses the WWW in a methodical, automated manner
  • Copies pages into a cache and indexes them
  • Starts from a seed URL
  • Searches pages and finds links and keywords
  • Types of crawler:
      • Context focused
      • Focused
      • Incremental
      • Periodic
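The crawl loop described above can be sketched as a breadth-first traversal from a seed URL. This is a minimal illustration, not a production crawler: the `TOY_WEB` dictionary stands in for real HTTP fetching and link extraction, and both it and the `crawl` function are hypothetical names introduced here.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": [],
}

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl from a seed URL; returns pages in visit order."""
    visited = []
    seen = {seed}                 # URLs already queued, to avoid revisiting
    frontier = deque([seed])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)       # a real crawler would cache and index the page here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl("seed", lambda url: TOY_WEB.get(url, []))
print(order)
```

Passing the link-fetching step in as a function keeps the traversal logic separate from network access, so the same loop could be pointed at a real fetcher.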
Focused Crawler
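Focused Crawler
  • Visits only pages of interest
  • Architecture consists of:
      • Hypertext classifier
      • Distiller
      • Crawler
  • Hub pages: pages that link to relevant pages
  • Hard focus: follow a link only if the parent node is relevant
  • Soft focus: prioritize a link by the probability of relevance
  • Harvest rate: a precision rate, i.e. the fraction of crawled pages that are relevant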
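A soft-focus crawl and the harvest rate can be sketched as follows. This is a simplified illustration under stated assumptions: the `relevance` table stands in for the output of a trained classifier, each unvisited link is prioritized by the relevance of the page that links to it, and the 0.5 relevance threshold is arbitrary. All names and data are hypothetical.

```python
import heapq

def focused_crawl(seed, links, relevance, budget):
    """Best-first crawl: links found on more relevant pages are followed first."""
    frontier = [(-1.0, seed)]            # max-heap via negated priority
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        parent_score = relevance[url]    # assumed classifier output for this page
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-parent_score, nxt))
    return visited

def harvest_rate(visited, relevance, threshold=0.5):
    """Fraction of crawled pages judged relevant (a precision measure)."""
    if not visited:
        return 0.0
    relevant = sum(1 for url in visited if relevance[url] >= threshold)
    return relevant / len(visited)

# Hypothetical link graph and relevance scores.
links = {"seed": ["offtopic", "hub"], "hub": ["t1", "t2"], "offtopic": ["j1", "j2"]}
relevance = {"seed": 0.9, "hub": 0.9, "offtopic": 0.1,
             "t1": 0.8, "t2": 0.8, "j1": 0.05, "j2": 0.05}

visited = focused_crawl("seed", links, relevance, budget=5)
rate = harvest_rate(visited, relevance)
print(visited, rate)
```

With a budget of five pages, the links found on the relevant hub are fetched before those found on the off-topic page, which is what keeps the harvest rate high.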
Context Focused Crawler
  • The original focused crawler was static
  • Drawbacks of the focused crawler:
      • Non-relevant pages may link to relevant ones, so they still have to be followed
      • Relevant pages may not link to other relevant ones
  • Uses backward crawling
  • A context focused crawler (CFC) works in two steps:
      • Construct context graphs and classifiers
      • Crawl using these classifiers
Harvest System
  • Uses caching, indexing and crawling
  • Acts as a tool for gathering information from other sources
  • Components:
      • Gatherer: obtains information
      • Broker: provides the index and query interface
  • Essence system: semantic indexing
Virtual Web View
  • Treats the web as a multiple layered database (MLDB)
  • A view of the MLDB is a virtual web view
  • No spiders are used; web sites send their indices to others
  • WebML: an extension of DMQL for web mining
  • WebML keywords: covers, covered by, like, close to
  • Difficult to implement
Personalization
  • Web content is modified to suit the user's desires
  • Personalized, not targeted
  • Uses cookies, user IDs and profile information
  • Legal issues have to be considered
  • Includes clustering, classification or even prediction
Personalization
  • Types:
      • User preference
      • Collaborative filtering
      • Content-based filtering
  • Example: My Yahoo! was the first; now almost every service offers personalization.
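Collaborative filtering can be illustrated with a small user-based sketch: find the user whose ratings are most similar, then suggest what that user rated and the target user has not seen. The ratings data, the use of cosine similarity and the single-nearest-neighbour rule are all assumptions made for this example, not details from the slides.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "ann": {"news": 5, "sports": 1, "finance": 4},
    "bob": {"news": 4, "sports": 1, "finance": 5, "weather": 4},
    "cat": {"news": 1, "sports": 5, "movies": 4},
}

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, ratings):
    """Items rated by the most similar other user that `user` has not rated."""
    others = [(cosine(ratings[user], r), name)
              for name, r in ratings.items() if name != user]
    _, nearest = max(others)
    return sorted(item for item in ratings[nearest] if item not in ratings[user])

print(recommend("ann", ratings))
```

Content-based filtering would instead compare item descriptions with a profile of what the user already liked, without looking at other users.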
Personalization
  • Yahoo was the first to introduce the concept of a 'personalized portal', i.e. a web site whose look and feel and content are personalized to the needs of an individual end user.
  • Mining My Yahoo! usage logs gives Yahoo valuable insight into an individual's web usage habits. This enables Yahoo to provide compelling personalized content, which in turn has led to the tremendous popularity of the Yahoo web site.
Web Structure Mining
  • Creates a model of how the web is organized
  • Classifies web pages
  • Creates similarity measures between web pages
  • Techniques:
      • PageRank
      • The Clever system
      • Hyperlink-Induced Topic Search (HITS)
PageRank™
  • A link analysis algorithm that assigns a numerical weight to a web page.
  • The weight assigned to any given element E is called the PageRank of E and is denoted PR(E).
  • The PageRank value of a page u depends on the PageRank value of each page v in the set B_u (the set of all pages linking to u), divided by the number L(v) of links from page v.
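Written out with the symbols defined above, the simplified relationship (without a damping factor) is:

```latex
PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}
```

Each page v that links to u passes on an equal share of its own PageRank to every page it links to, and u sums the shares it receives.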
PageRank
  • Increases the effectiveness of search engines
  • Based on the number of backlinks
  • Suffers from the rank sink problem
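A rank sink is a group of pages that link only to each other, so rank flows in and never leaves. A common remedy is a damping factor that redistributes a small share of rank to every page on each iteration. The sketch below is a plain iterative implementation under assumptions not taken from the slides: a damping factor of 0.85, a fixed number of iterations, and pages without outgoing links spreading their rank evenly.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict mapping page -> list of outgoing links."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a base share, which keeps sinks from absorbing all rank.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical graph: "b" and "c" link only to each other and form a rank sink.
graph = {"a": ["b"], "b": ["c"], "c": ["b"]}
ranks = pagerank(graph)
print(ranks)
```

In this toy graph page "a" has no backlinks, so it keeps only the base share, while "b" and "c" hold most of the rank without absorbing all of it.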
Clever System
  • Finds both authoritative pages and hubs
      • Authoritative page: the best source
      • Hub: a page that links to authoritative pages
  • Returns the most valuable pages
  • Uses Hyperlink-Induced Topic Search (HITS)
      • Keywords
      • Authority and hub measures
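The hub and authority measures can be sketched with the mutual-reinforcement iteration used by HITS: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The link graph, the iteration count and the Euclidean normalization below are assumptions made for illustration.

```python
from math import sqrt

def hits(links, iterations=20):
    """Return (hub, authority) scores for a dict mapping page -> outgoing links."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link in.
        auth = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                auth[q] += hub[p]
        # Hub: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: h1 and h2 are hubs, a1 and a2 are candidate authorities.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(toy_links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In practice the iteration would run on the subgraph of pages retrieved for a keyword query rather than on the whole web.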
Alternatives to PageRank
  • HITS algorithm
  • IBM Clever project
  • TrustRank
  • PageRank nevertheless remains the most popular algorithm and the one most widely used by search engines
Web Usage Mining
  • Applies mining to web usage data, i.e. web logs or clickstream data
  • Can be viewed from a client perspective or a server perspective
  • Aids personalization
  • Helps in evaluating quality and effectiveness
  • Involves preprocessing, pattern discovery and data structures
Trackers for site usage and analysis
Issues in Web Logs
  • Identifying the exact user
  • Identifying the exact sequence of pages visited
  • Security, privacy and legal issues
Preprocessing
  • Information is not in a presentable format
  • Data cleaning is required
  • Log entry format: (<src id>, <literal>, <timestamp>)
  • Data might be grouped, for example into sessions
  • Path completion
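Grouping log entries into sessions can be sketched as below. The sketch assumes that the literal in each log triple identifies the requested page, that timestamps are in seconds, and that a gap of more than 30 minutes between requests from the same source starts a new session. None of these choices come from the slides, and the log data is made up.

```python
from collections import defaultdict

def sessionize(log, timeout=1800):
    """Split (src_id, page, timestamp) entries into per-source sessions."""
    by_source = defaultdict(list)
    for src, page, ts in sorted(log, key=lambda entry: (entry[0], entry[2])):
        by_source[src].append((ts, page))
    sessions = []
    for src, requests in by_source.items():
        current = [requests[0][1]]
        for (prev_ts, _), (ts, page) in zip(requests, requests[1:]):
            if ts - prev_ts > timeout:   # long gap: close the session, start a new one
                sessions.append((src, current))
                current = []
            current.append(page)
        sessions.append((src, current))
    return sessions

# Hypothetical log entries, deliberately out of order.
log = [("u1", "A", 0), ("u1", "B", 60), ("u2", "A", 100),
       ("u1", "C", 4000), ("u2", "D", 130)]
result = sessionize(log)
print(result)
```

Path completion, which fills in page views missing from the log because of caching, would be a separate step after this grouping.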
Data Structure
  • A data structure is needed to keep track of the patterns identified
  • The data structure used is a trie
  • A trie is a rooted tree in which each path from the root to a node represents a sequence
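A trie over page sequences might look like the following minimal sketch. Each node keeps a count of how many inserted sequences pass through it, so the count at the end of a path gives the number of sequences that start with that path. The class and method names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # page -> TrieNode
        self.count = 0       # number of inserted sequences passing through this node

class PatternTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, sequence):
        """Add one page sequence, incrementing counts along its path."""
        node = self.root
        for page in sequence:
            node = node.children.setdefault(page, TrieNode())
            node.count += 1

    def prefix_count(self, prefix):
        """Number of inserted sequences that begin with `prefix`."""
        node = self.root
        for page in prefix:
            node = node.children.get(page)
            if node is None:
                return 0
        return node.count

trie = PatternTrie()
for seq in (["A", "B", "C"], ["A", "B", "D"], ["A", "C"]):
    trie.insert(seq)
print(trie.prefix_count(["A", "B"]))
```

Sequences that share a prefix share nodes, which is what makes the structure compact for counting many overlapping patterns.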
Pattern Discovery
  • Traversal pattern: the pages visited in a session
  • Properties of traversal patterns:
      • Duplicate references may or may not be allowed
      • Consists only of contiguous page references
      • The pattern may or may not be maximal
  • Association rules: pages accessed together
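Association rules over sessions can be sketched by counting how often pairs of pages occur together. The sketch below is restricted to rules between two pages and uses an arbitrary minimum support of 0.5; the session data and function name are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def page_pair_rules(sessions, min_support=0.5):
    """Return {(x, y): confidence} for rules x -> y whose page pair is frequent."""
    n = len(sessions)
    single = Counter()
    pair = Counter()
    for session in sessions:
        pages = sorted(set(session))          # ignore order and duplicate references
        single.update(pages)
        pair.update(combinations(pages, 2))
    rules = {}
    for (a, b), together in pair.items():
        if together / n >= min_support:       # support: share of sessions with both pages
            rules[(a, b)] = together / single[a]   # confidence of a -> b
            rules[(b, a)] = together / single[b]   # confidence of b -> a
    return rules

# Hypothetical sessions.
sessions = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "D"]]
rules = page_pair_rules(sessions)
print(rules)
```

Here every session containing page C also contains page A, so the rule C -> A has confidence 1.0, while the pair (B, D) is too rare to produce a rule.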
Pattern Discovery
  • Sequential pattern: an ordered set of pages that satisfies a given support and is maximal
  • Discovery is similar to the Apriori algorithm
  • Web access pattern: efficient counting
  • Episodes: partially ordered by access time; users are not identified
  • Pattern analysis
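The support of a sequential pattern can be sketched as the fraction of sessions in which the pattern's pages appear in the same order, not necessarily contiguously. Counting per session rather than per user is a simplifying assumption, and the data is hypothetical.

```python
def is_subsequence(pattern, session):
    """True if the pages of `pattern` occur in `session` in the same order."""
    remaining = iter(session)
    # `page in remaining` advances the iterator, so order is enforced.
    return all(page in remaining for page in pattern)

def sequence_support(pattern, sessions):
    """Fraction of sessions that contain `pattern` as an ordered subsequence."""
    return sum(is_subsequence(pattern, s) for s in sessions) / len(sessions)

# Hypothetical sessions.
visits = [["A", "B", "C"], ["A", "C"], ["C", "A"]]
print(sequence_support(["A", "C"], visits))
print(sequence_support(["C", "A"], visits))
```

An Apriori-style search would grow candidate patterns one page at a time and keep only those whose support stays above the chosen threshold.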
Queries 'N Suggestions
References:
  • http://maya.cs.depaul.edu/~mobasher/webminer/survey/
  • Google.com/Technology
  • http://www.almaden.ibm.com/projects/clever.shtml
Thanks!
{akshatsaxena11, anjulsahu}@gmail.com
