MODULE 6
Hierarchical Clustering method: BIRCH. Density-Based Clustering –
DBSCAN and OPTICS. Advanced Data Mining Techniques: Introduction,
Web Mining – Web Content Mining, Web Structure Mining, Web Usage
Mining. Text Mining. Graph mining: Apriori-based approach for
mining frequent subgraphs. Social Network Analysis: characteristics
of social networks. Link mining: Tasks and challenges
Hierarchical Clustering method
• Figure below shows the application of AGNES (AGglomerative NESting), an
agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a
divisive hierarchical clustering method, to a data set of five objects, {a, b, c, d, e}.
In either agglomerative or divisive hierarchical clustering, one can specify the
desired number of clusters as a termination condition.
BIRCH: Balanced Iterative Reducing and Clustering using
Hierarchies
“How does the BIRCH algorithm work?” It consists of two phases:
• Phase 1: BIRCH scans the database to build an initial in-memory CF (clustering feature)
tree, a multilevel compression of the data that tries to preserve its inherent clustering structure.
• Phase 2: BIRCH applies a selected clustering algorithm to cluster the leaf nodes of the
CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.
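As an illustration (not part of the original slides), the sketch below uses scikit-learn's Birch implementation; the data, threshold, and branching_factor values are arbitrary assumptions chosen only to show the two-phase idea.

```python
# Hedged sketch: clustering with BIRCH via scikit-learn (parameter values are illustrative).
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Three small Gaussian blobs as toy data.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 4])])

# Phase 1 builds the in-memory CF tree; Phase 2 (here, clustering the
# CF-tree leaf entries) produces the final n_clusters groups.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)

print("cluster sizes:", np.bincount(labels))
```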
Density-based methods
OPTICS: Ordering Points To Identify the Clustering
Structure
• DBSCAN can cluster objects given user-specified input parameters such as ε and MinPts.
• Thus it leaves the user with the responsibility of selecting parameter
values that will lead to the discovery of acceptable clusters.
• Such parameter settings are usually empirically set and difficult to
determine, especially for real-world, high-dimensional data sets.
• Most algorithms are very sensitive to such parameter values: slightly
different settings may lead to very different clusterings of the data.
• There does not even exist a global parameter setting for which the result
of a clustering algorithm may accurately describe the intrinsic clustering
structure.
• To help overcome this difficulty, a cluster ordering method called OPTICS
(Ordering Points To Identify the Clustering Structure) was proposed.
• OPTICS computes an augmented cluster ordering for automatic and
interactive cluster analysis.
• This ordering represents the density-based clustering structure of the
data.
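A small illustrative sketch (not from the slides) contrasting DBSCAN, which needs ε and MinPts up front, with OPTICS, which produces an augmented cluster ordering; it assumes scikit-learn, and the toy data and parameter values are arbitrary.

```python
# Hedged sketch: DBSCAN vs. OPTICS with scikit-learn (eps/min_samples chosen arbitrarily).
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(60, 2)) for c in ([0, 0], [2, 2])])

# DBSCAN: both eps (ε) and min_samples (MinPts) must be supplied by the user.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# OPTICS: only min_samples is required; the augmented cluster ordering
# (ordering_ and reachability_) represents the density-based structure.
optics = OPTICS(min_samples=5).fit(X)

print("DBSCAN labels found:", set(db_labels))
print("first 5 points in OPTICS order:", optics.ordering_[:5])
print("their reachability distances:", optics.reachability_[optics.ordering_[:5]])
```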
Graph mining
• Graphs become increasingly important in
modeling complicated structures, such as
circuits, images, chemical compounds, protein
structures, biological networks, social
networks, the Web, workflows, and XML
documents.
• Frequent substructures are the very basic
patterns that can be discovered in a collection
of graphs
Methods for Mining Frequent Subgraphs
• We denote the vertex set of a graph g by V(g) and the
edge set by E(g).
• A label function, L, maps a vertex or an edge to a label.
• A graph g is a subgraph of another graph g′ if there
exists a subgraph isomorphism from g to g′.
• Given a labeled graph data set, D = {G1, G2, ..., Gn},
we define support(g) (or frequency(g)) as the
percentage (or number) of graphs in D in which g is a
subgraph.
• A frequent graph is a graph whose support is no less
than a minimum support threshold, min_sup.
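For illustration only (not from the slides), here is a minimal sketch of computing support(g) over a labeled graph data set using NetworkX subgraph-isomorphism matching; storing vertex labels in a "label" attribute and the toy graphs are assumptions.

```python
# Hedged sketch: support(g) = fraction of graphs in D containing g as a subgraph.
import networkx as nx
from networkx.algorithms import isomorphism


def support(g, D):
    """Fraction of graphs in D with a (node-induced) subgraph isomorphic to g, labels matching."""
    node_match = isomorphism.categorical_node_match("label", None)
    hits = sum(
        1
        for G in D
        if isomorphism.GraphMatcher(G, g, node_match=node_match).subgraph_is_isomorphic()
    )
    return hits / len(D)


def labeled_path(labels):
    """A toy labeled path graph, e.g. 'CCO' -> C-C-O."""
    G = nx.path_graph(len(labels))
    nx.set_node_attributes(G, dict(enumerate(labels)), "label")
    return G


D = [labeled_path("CCO"), labeled_path("CCN"), labeled_path("CO")]
g = labeled_path("CO")
print(support(g, D))  # 2/3: a C-O edge occurs in the first and third graphs
```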
“How can we discover frequent substructures?”
• The discovery of frequent substructures usually
consists of two steps.
• In the first step, we generate frequent
substructure candidates.
• The frequency of each candidate is checked in the
second step.
• An Apriori-based approach is used here.
Apriori-based Approach
• The search for frequent graphs starts with
graphs of small “size,” and proceeds in a
bottom-up manner by generating candidates
having an extra vertex, edge, or path.
• Sk denotes the frequent substructure set of size k.
• AprioriGraph adopts a level-wise mining
methodology.
• At each iteration, the size of newly discovered
frequent substructures is increased by one.
• These new substructures are first generated
by joining two similar but slightly different
frequent subgraphs that were discovered in
the previous call.
• The frequency of the newly formed graphs is
then checked.
• Those found to be frequent are used to
generate larger candidates in the next round.
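The following sketch (my own paraphrase, not the slides' algorithm listing) shows the level-wise generate-and-test loop; size_one, candidate_join, and support are hypothetical placeholder callables standing in for the join and frequency-checking steps described above.

```python
# Hedged sketch of the level-wise AprioriGraph loop; size_one(), candidate_join()
# and support() are placeholders for the steps described in the bullets above.
def apriori_graph(D, min_sup, size_one, candidate_join, support):
    # S1: frequent substructures of size 1 (single vertices or single edges).
    level = [g for g in size_one(D) if support(g, D) >= min_sup]
    frequent = list(level)
    while level:
        # Join similar but slightly different size-k patterns into size-(k+1) candidates.
        candidates = candidate_join(level)
        # Frequency check: keep only candidates meeting min_sup; they seed the next round.
        level = [c for c in candidates if support(c, D) >= min_sup]
        frequent.extend(level)
    return frequent
```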
• The main design complexity of Apriori-based
substructure mining algorithms is the
candidate generation step.
• Recent Apriori-based algorithms for frequent
substructure mining include AGM, FSG, and a
path-join method.
• The AGM algorithm uses a vertex-based
candidate generation method that increases the
substructure size by one vertex at each iteration
of AprioriGraph.
• Two size-k frequent graphs are joined only if they
have the same size-(k-1) subgraph.
• Here, graph size is the number of vertices in the
graph.
• The newly formed candidate includes the size-(k-
1) subgraph in common and the additional two
vertices from the two size-k patterns.
• Because it is undetermined whether there is an
edge connecting the additional two vertices, we
actually can form two substructures.
• The FSG algorithm adopts an edge-based
candidate generation strategy that increases
the substructure size by one edge.
• Two size-k patterns are merged if and only if
they share the same subgraph having k - 1
edges, which is called the core.
• Here, graph size is taken to be the number of
edges in the graph.
• The newly formed candidate includes the core
and the additional two edges from the size-k
patterns.
Web Mining
• Web mining is mining of data related to the
World Wide Web.
• Web data:
 Content of actual Web pages.
 Intrapage structure includes the HTML or XML code for the page.
 Interpage structure is the actual linkage structure between Web pages.
 Usage data that describe how Web pages are accessed by visitors.
 User profiles include demographic and registration information obtained about users.
Web mining tasks can be divided into several classes
• Web content mining examines the content of Web pages
as well as results of Web searching.
• Web content mining is further divided into Web page
content mining and search results mining.
• The first is traditional searching of Web pages via
content, while the second is a further search of pages
found from a previous search.
• With Web structure mining, information is obtained
from the actual organization of pages on the Web.
• Web usage mining looks at logs of Web access. General
access pattern tracking is a type of usage mining that
looks at a history of Web pages visited. This usage may
be general or may be targeted to specific uses or users.
WEB CONTENT MINING
• Web content mining includes different techniques that can be
used to search the Internet.
• One taxonomy of Web mining divided Web content mining
into agent-based and database approaches.
• Agent-based approaches have software systems (agents)that
perform the content mining.
• The database approaches view the Web data as belonging to a
database.
• One problem associated with retrieval of data from Web
documents is that Web pages created using HTML are only
semistructured, thus making querying more difficult.
• HTML may ultimately be replaced by XML (eXtensible Markup Language),
which would provide structured documents and facilitate easier mining.
Crawlers(web crawling)
• A robot (or spider or crawler) is a program that
traverses the hypertext structure in the Web.
• The page (or set of pages) that the crawler starts with
are referred to as the seed URLs.
• By starting at one page, all links from it are recorded
and saved in a queue.
• These new pages are in turn searched and their links
are saved.
• As these robots search the Web, they may collect
information about each page, such as extracting
keywords and storing them in indices for users of the
associated search engine.
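A minimal breadth-first crawler sketch (illustrative only, not from the slides): it assumes the requests and beautifulsoup4 packages and an arbitrary page limit, and a real robot would additionally honor robots.txt, rate limits, and content types.

```python
# Hedged sketch: a tiny breadth-first crawler (assumes requests + beautifulsoup4).
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl(seed_urls, max_pages=20):
    queue = deque(seed_urls)          # links found are recorded and saved in a queue
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        visited.add(url)
        # Extract links; a search-engine crawler would also extract keywords
        # here and store them in its indices.
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))
    return visited
```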
• The focused crawler architecture consists of three primary components.
Harvest System
• Harvest is actually a set of tools that facilitate
gathering of information from diverse sources.
Virtual Web View
• One approach to handling the large amount of somewhat unstructured
data on the Web is to create a multiple layered database (MLDB).
• Each layer of this database is more generalized than the
layer beneath it.
• The MLDB provides an abstracted and condensed view of
a portion of the Web.
• A view of the MLDB, which is called a Virtual Web View
(VWV), can be constructed.
WebML
• WebML, a Web data mining query language, has been
proposed to provide data mining operations on the MLDB.
WEB STRUCTURE MINING
• Web structure mining can be viewed as
creating a model of the Web organization .
1. PageRank
• PageRank is used to measure the importance
of a page .
• The PageRank value for a page is calculated
based on the number of pages that point to it.
• This is actually a measure based on the number
of backlinks to a page.
• Given a page p, we use Bp to denote the set of
pages that point to p, and Fp to denote the set of
links out of p. The PageRank of a page p is then
defined (in its simplest form) as
PR(p) = c · Σ q∈Bp PR(q) / |Fq|,
where c < 1 is a normalization constant.
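As a hedged illustration (not the slides' own formula slide), the sketch below computes the backlink-based PageRank score by simple iteration; it uses the commonly cited damped variant, and the damping factor 0.85 and the toy link graph are assumptions.

```python
# Hedged sketch: iterative PageRank on a tiny link graph (damping factor is an assumption).
def pagerank(out_links, damping=0.85, iters=50):
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # Sum contributions from backlinks Bp: each q in Bp passes rank(q)/|Fq|.
            backlink_sum = sum(rank[q] / len(out_links[q])
                               for q in pages if p in out_links[q])
            new[p] = (1 - damping) / n + damping * backlink_sum
        rank = new
    return rank


links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}   # Fp (out-links) for each page p
print(pagerank(links))   # C accumulates the most rank in this toy graph
```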
2. Clever
• Clever is aimed at finding both authoritative pages and hubs.
• The authors define an authority as the "best source" for the requested information.
• In addition, a hub is a page that contains links to authoritative pages.
• The Clever system identifies authoritative pages and hub pages by creating weights.
• Hyperlink-induced topic search (HITS) finds hubs and authoritative pages.
The HITS technique contains two components:
• Based on a given set of keywords a set of
relevant pages is found.
• Hub and authority measures are associated
with these pages.
Pages with the highest values are returned.
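An illustrative sketch (not from the slides) of the second HITS component: the iterative hub/authority weight computation, run here on an assumed toy link graph with an arbitrary iteration count.

```python
# Hedged sketch: iterative hub/authority scores (HITS) on a small assumed link graph.
def hits(out_links, iters=20):
    pages = list(out_links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority of p grows with the hub scores of pages linking to p.
        auth = {p: sum(hub[q] for q in pages if p in out_links[q]) for p in pages}
        # Hub score of p grows with the authority scores of pages p links to.
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        # Normalize so the weights stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


links = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
hub, auth = hits(links)   # C is the strongest authority, A the strongest hub
print(hub, auth)
```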
WEB USAGE MINING
• Web usage mining performs mining on Web
usage data, or Web logs.
• A Web log is a listing of page reference data.
• Sometimes it is referred to as clickstream data
because each entry corresponds to a mouse
click.
Web usage mining applications
• Personalization for a user can be achieved by
keeping track of previously accessed pages.
• Improve the overall performance of future
accesses.
• Information concerning frequently accessed pages
can be used for caching.
• Identifying common access behaviors can be used
to improve the actual design of Web pages and to
make other modifications to the site.
• Web usage patterns can be used to gather
business intelligence to improve sales and
advertisement.
Web usage mining actually consists of three
separate types of activities
1. Preprocessing activities: reformatting the
Web log data before processing.
2. Pattern discovery: finding hidden patterns
within the log data.
3. Pattern analysis: looking at and interpreting
the results of the discovery activities.
There are many issues associated with using the
Web log for mining purposes:
• Identification of the exact user is not possible
from the log alone.
• With a Web client cache, the exact sequence of
pages a user actually visits is difficult to uncover
from the server site.
• There are many security, privacy, and legal issues
yet to be solved
1. Preprocessing
• The Web usage log probably is not in a format
that is usable by mining applications.
• As with any data to be used in a mining
application, the data may need to be reformatted
and cleansed.
• Steps that are part of the preprocessing phase
include cleansing, user identification, session
identification, path completion, and formatting.
• Problems
 Correct identification of the actual user. User
identification is complicated by the use of proxy
servers, client-side caching, and corporate firewalls.
 Identifying the actual sequence of pages accessed by
a user is complicated by the use of client-side caching.
• In this case, actual pages accessed will be missing from
the server-side log.
• Techniques can be used to complete the log by
predicting missing pages.
• Path completion is an attempt to add page accesses that
do not exist in the log but that actually occurred.
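As an illustration of the session identification step mentioned above (not from the slides), the sketch below groups log entries into sessions per client using a timeout rule; the log fields, IP-based user identification, and the 30-minute timeout are all assumptions.

```python
# Hedged sketch: naive session identification from a Web log via a 30-minute timeout
# (field layout and the timeout value are assumptions).
from datetime import datetime, timedelta

log = [
    ("192.0.2.1", "2020-06-30 10:00:00", "/index.html"),
    ("192.0.2.1", "2020-06-30 10:05:00", "/products.html"),
    ("192.0.2.1", "2020-06-30 11:30:00", "/index.html"),   # gap > 30 min -> new session
]

timeout = timedelta(minutes=30)
sessions, last_seen = {}, {}
for ip, ts, page in log:
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    # Start a new session for this client if it is the first hit or the gap exceeds the timeout.
    if ip not in last_seen or t - last_seen[ip] > timeout:
        sessions.setdefault(ip, []).append([])
    sessions[ip][-1].append(page)
    last_seen[ip] = t

print(sessions)   # {'192.0.2.1': [['/index.html', '/products.html'], ['/index.html']]}
```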
Data Structures
• A basic data structure is called a trie.
• A trie is a rooted tree, where each path from the root to a leaf
represents a sequence.
• Tries are used to store strings for pattern-matching applications.
• Each character in the string is stored on the edge to the node.
• A problem in using tries for many long strings is the space required.
• This waste of space is solved by compressing nodes together
when they have degree one.
• The compressed trie is called a suffix tree. A suffix tree has the
following characteristics:
• Each internal node except the root has at least two children.
• Each edge represents a nonempty subsequence.
• The subsequences represented by sibling edges begin with
different symbols.
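A small sketch (not part of the slides) of a plain trie storing page-reference sequences; merging its degree-one chains would give the compressed, suffix-tree-like structure described above. The dictionary representation and end marker are assumptions.

```python
# Hedged sketch: a plain trie over page-reference sequences (one symbol per edge).
def trie_insert(root, sequence):
    node = root
    for symbol in sequence:
        node = node.setdefault(symbol, {})   # one child dict per edge symbol
    node["$"] = True                          # mark the end of a stored sequence


root = {}
for seq in ["ABC", "ABD", "AC"]:             # e.g. pages A->B->C visited in a session
    trie_insert(root, seq)

print(root)
# {'A': {'B': {'C': {'$': True}, 'D': {'$': True}}, 'C': {'$': True}}}
# Nodes with a single child (degree one) could be merged to form a compressed trie.
```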
2. Pattern Discovery
• The most common data mining technique used
on clickstream data is that of uncovering
traversal patterns. A traversal pattern is a set of
pages visited by a user in a session.
• Several different types of traversal patterns
have been examined.
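For illustration only (not from the slides), here is a tiny sketch that treats a traversal pattern as a set of pages per session and counts how often page pairs co-occur; the toy sessions are assumptions.

```python
# Hedged sketch: counting simple traversal patterns (co-occurring page pairs per session).
from collections import Counter
from itertools import combinations

sessions = [["A", "B", "C"], ["A", "C"], ["B", "C", "A"]]   # toy clickstream sessions

pair_counts = Counter()
for pages in sessions:
    for pair in combinations(sorted(set(pages)), 2):        # unordered page pairs per session
        pair_counts[pair] += 1

print(pair_counts.most_common(3))   # ('A','C') appears in all 3 sessions in this toy data
```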
3. Pattern Analysis
• Once patterns have been identified, they must be
analyzed to determine how that information can
be used.
• Some of the generated patterns may be determined
not to be of interest and deleted.
• Recent work has proposed examining Web logs
not only to identify frequent types of traversal
patterns, but also to identify patterns that are of
interest because of their uniqueness or statistical properties.
• A Web mining query language, MINT, is used.
Text Mining
• Mining of data from text databases.
• Text databases consist of large collections of documents
from various sources, such as news articles, research
papers, books, digital libraries, e-mail messages, and Web
pages.
• Nowadays most of the information in government, industry,
business, and other institutions is stored electronically, in
the form of text databases.
• Data stored in most text databases are semistructured
data in that they are neither completely unstructured nor
completely structured.
• Information retrieval (IR)/Text mining is concerned with
the organization and retrieval of information from a
large number of text-based documents.
• Since information retrieval and database systems each
handle different kinds of data, some database system
problems are usually not present in information
retrieval systems, such as concurrency control,
recovery, transaction management, and update.
• Also, some common information retrieval problems
are usually not encountered in traditional database
systems, such as unstructured documents and
approximate search based on keywords.
Differences between text databases and traditional databases
Basic Measures for Text Retrieval: Precision and Recall
• Let the set of documents relevant to a query
be denoted as {Relevant}, and the set of
documents retrieved be denoted as {Retrieved}.
• The set of documents that are both relevant
and retrieved is denoted as
{Relevant} ∩ {Retrieved}.
• There are two basic measures for assessing
the quality of text retrieval:
• Precision: the percentage of retrieved documents that are in fact relevant,
i.e., |{Relevant} ∩ {Retrieved}| / |{Retrieved}|.
• Recall: the percentage of relevant documents that were in fact retrieved,
i.e., |{Relevant} ∩ {Retrieved}| / |{Relevant}|.
• An F-score, which combines precision and recall, is also commonly used.
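A small worked sketch (illustrative, not from the slides) computing precision and recall from the relevant and retrieved sets; the document identifiers are made up.

```python
# Hedged sketch: precision and recall from the {Relevant} and {Retrieved} sets.
def precision_recall(relevant, retrieved):
    overlap = relevant & retrieved                 # {Relevant} ∩ {Retrieved}
    precision = len(overlap) / len(retrieved)      # fraction of retrieved docs that are relevant
    recall = len(overlap) / len(relevant)          # fraction of relevant docs that were retrieved
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
print(precision_recall(relevant, retrieved))   # (0.666..., 0.5): 2 of 3 retrieved are relevant
```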
Social Network Analysis
What Is a Social Network?
• A social network is a heterogeneous and
multirelational data set represented by a graph.
• The graph is typically very large, with nodes
corresponding to objects and edges
corresponding to links representing relationships
or interactions between objects.
• Examples include electrical power grids,
telephone call graphs, the spread of computer
viruses, and the World Wide Web.
Characteristics of Social Networks
• Nodes’ degrees, that is, the number of edges
incident to each node
• Distances between a pair of nodes, as
measured by the shortest path length.
• Network diameter is the maximum distance
between pairs of nodes.
• Average distance between pairs of nodes.
• Effective diameter, that is, the minimum distance d such that,
for at least 90% of the reachable node pairs, the path length is at most d.
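An illustrative sketch (assuming NetworkX and its built-in karate-club toy network) computing the node degrees, pairwise distances, diameter, and average distance listed above.

```python
# Hedged sketch: basic social-network characteristics on a toy graph (NetworkX assumed).
import networkx as nx

G = nx.karate_club_graph()                       # a small built-in social network

degrees = dict(G.degree())                       # number of edges incident to each node
lengths = dict(nx.all_pairs_shortest_path_length(G))   # shortest-path distances between pairs

print("max degree:", max(degrees.values()))
print("diameter:", nx.diameter(G))               # maximum shortest-path distance
print("average distance:", nx.average_shortest_path_length(G))
# The effective diameter (90th-percentile distance) could be estimated from `lengths`.
```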
• In general, social networks tend to exhibit the
following phenomena:
• Densification power law:
• Traditionally it was believed that the number of edges grows linearly
in the number of nodes (the constant average degree assumption).
Experiments show instead that networks become denser over time,
following the densification power law (or growth power law), which
states that e(t) ∝ n(t)^a, where e(t) and n(t) are the numbers of edges
and nodes of the network at time t, and the exponent a typically lies
strictly between 1 and 2.
• Shrinking diameter:
• It has been experimentally shown that the effective
diameter tends to decrease as the network grows.
Link Mining: Tasks and Challenges
• By considering links (the relationships between
objects), more information is made available to the
mining process.
TASKS
1. Link-based object classification:
• Link-based classification predicts the category of an
object based not only on its attributes, but also on its
links, and on the attributes of linked objects.
• Web page classification is a well-recognized example
of link-based classification
2. Object type prediction.
• This predicts the type of an object, based on
its attributes and its links, and on the
attributes of objects linked to it.
3. Link type prediction.
• This predicts the type or purpose of a link,
based on properties of the objects involved.
• Given Web page data, we can try to predict
whether a link on a page is an advertising link
or a navigational link.
4. Predicting link existence:
• Predict whether a link exists between two
objects. Examples include predicting whether
there will be a link between two Web pages.
5. Link cardinality estimation.
• There are two forms of link cardinality
estimation.
• First, we may predict the number of links to an
object. This is useful, for instance, in predicting
the authoritativeness of a Web page based on the
number of links to it (in-links).
• Similarly, the number of out-links can be used to
identify Web pages that act as hubs, where a hub
is one or a set of Web pages that point to many
authoritative pages of the same topic
6. Object reconciliation.
• In object reconciliation, the task is to predict
whether two objects are, in fact, the same,
based on their attributes and links.
• Examples include predicting whether two
websites are mirrors of each other, and
whether two apparent disease strains are
really the same.
7. Group detection.
• Group detection is a clustering task.
• It predicts when a set of objects belong to the
same group or cluster, based on their
attributes as well as their link structure.
• An area of application is the identification of
Web communities, where a Web community is
a collection of Web pages that focus on a
particular theme or topic.
9. Metadata mining.
• Metadata are data about data. Metadata
provide semi-structured data about
unstructured data, ranging from text and Web
data to multimedia databases.
• It is useful for data integration tasks in many
domains
Challenges
• Logical versus statistical dependencies
• Feature construction.
• Effective use of labeled and unlabeled data.
• Link prediction.
• Community mining from multirelational
networks