MODULE 6
Hierarchical Clustering method: BIRCH. Density-Based Clustering –
DBSCAN and OPTICS. Advanced Data Mining Techniques: Introduction,
Web Mining – Web Content Mining, Web Structure Mining, Web Usage
Mining. Text Mining. Graph mining: Apriori-based approach for
mining frequent subgraphs. Social Network Analysis: characteristics
of social networks. Link mining: Tasks and challenges
Hierarchical Clustering method
• Figure below shows the application of AGNES (AGglomerative NESting), an
agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a
divisive hierarchical clustering method, to a data set of five objects, {a, b, c, d, e}.
In either agglomerative or divisive hierarchical clustering, one can specify the
desired number of clusters as a termination condition.
BIRCH: Balanced Iterative Reducing and Clustering using
Hierarchies
“How does the BIRCH algorithm work?” It consists of two phases:
• Phase 1: BIRCH scans the database to build an initial in-memory CF (clustering feature)
tree, a multilevel compression of the data that tries to preserve its inherent clustering structure.
• Phase 2: BIRCH applies a selected clustering algorithm to cluster the leaf nodes of the
CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.
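As an illustration (not part of the original slides), the sketch below uses scikit-learn's Birch implementation; the data, threshold, and branching_factor values are arbitrary assumptions chosen only to show the two-phase idea.

```python
# Hedged sketch: clustering with BIRCH via scikit-learn (parameter values are illustrative).
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Three small Gaussian blobs as toy data.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 4])])

# Phase 1 builds the in-memory CF tree; Phase 2 (here, clustering the
# CF-tree leaf entries) produces the final n_clusters groups.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)

print("cluster sizes:", np.bincount(labels))
```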
Density-based methods
OPTICS: Ordering Points To Identify the Clustering
Structure
• DBSCAN can cluster objects given user-specified input parameters such as ε and MinPts.
• Thus it leaves the user with the responsibility of selecting parameter
values that will lead to the discovery of acceptable clusters.
• Such parameter settings are usually empirically set and difficult to
determine, especially for real-world, high-dimensional data sets.
• Most algorithms are very sensitive to such parameter values: slightly
different settings may lead to very different clusterings of the data.
• There does not even exist a global parameter setting for which the result
of a clustering algorithm may accurately describe the intrinsic clustering
structure.
• To help overcome this difficulty, a cluster ordering method called OPTICS
(Ordering Points To Identify the Clustering Structure) was proposed.
• OPTICS computes an augmented cluster ordering for automatic and
interactive cluster analysis.
• This ordering represents the density-based clustering structure of the
data.
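A small illustrative sketch (not from the slides) contrasting DBSCAN, which needs ε and MinPts up front, with OPTICS, which produces an augmented cluster ordering; it assumes scikit-learn, and the toy data and parameter values are arbitrary.

```python
# Hedged sketch: DBSCAN vs. OPTICS with scikit-learn (eps/min_samples chosen arbitrarily).
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(60, 2)) for c in ([0, 0], [2, 2])])

# DBSCAN: both eps (ε) and min_samples (MinPts) must be supplied by the user.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# OPTICS: only min_samples is required; the augmented cluster ordering
# (ordering_ and reachability_) represents the density-based structure.
optics = OPTICS(min_samples=5).fit(X)

print("DBSCAN labels found:", set(db_labels))
print("first 5 points in OPTICS order:", optics.ordering_[:5])
print("their reachability distances:", optics.reachability_[optics.ordering_[:5]])
```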
Graph mining
• Graphs become increasingly important in
modeling complicated structures, such as
circuits, images, chemical compounds, protein
structures, biological networks, social
networks, the Web, workflows, and XML
documents.
• Frequent substructures are the very basic
patterns that can be discovered in a collection
of graphs
Methods for Mining Frequent Subgraphs
• We denote the vertex set of a graph g by V(g) and the
edge set by E(g).
• A label function, L, maps a vertex or an edge to a label.
• A graph g is a subgraph of another graph g′ if there
exists a subgraph isomorphism from g to g′.
• Given a labeled graph data set, D = {G1, G2, ..., Gn},
we define support(g) (or frequency(g)) as the
percentage (or number) of graphs in D in which g is a
subgraph.
• A frequent graph is a graph whose support is no less
than a minimum support threshold, min_sup.
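For illustration only (not from the slides), here is a minimal sketch of computing support(g) over a labeled graph data set using NetworkX subgraph-isomorphism matching; storing vertex labels in a "label" attribute and the toy graphs are assumptions.

```python
# Hedged sketch: support(g) = fraction of graphs in D containing g as a subgraph.
import networkx as nx
from networkx.algorithms import isomorphism


def support(g, D):
    """Fraction of graphs in D with a (node-induced) subgraph isomorphic to g, labels matching."""
    node_match = isomorphism.categorical_node_match("label", None)
    hits = sum(
        1
        for G in D
        if isomorphism.GraphMatcher(G, g, node_match=node_match).subgraph_is_isomorphic()
    )
    return hits / len(D)


def labeled_path(labels):
    """A toy labeled path graph, e.g. 'CCO' -> C-C-O."""
    G = nx.path_graph(len(labels))
    nx.set_node_attributes(G, dict(enumerate(labels)), "label")
    return G


D = [labeled_path("CCO"), labeled_path("CCN"), labeled_path("CO")]
g = labeled_path("CO")
print(support(g, D))  # 2/3: a C-O edge occurs in the first and third graphs
```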
“How can we discover frequent substructures?”
• The discovery of frequent substructures usually
consists of two steps.
• In the first step, we generate frequent
substructure candidates.
• The frequency of each candidate is checked in the
second step.
• An Apriori-based approach is used here.
Apriori-based Approach
• The search for frequent graphs starts with
graphs of small “size,” and proceeds in a
bottom-up manner by generating candidates
having an extra vertex, edge, or path.
• Sk denotes the frequent substructure set of size k.
• AprioriGraph adopts a level-wise mining
methodology.
• At each iteration, the size of newly discovered
frequent substructures is increased by one.
• These new substructures are first generated
by joining two similar but slightly different
frequent subgraphs that were discovered in
the previous call.
• The frequency of the newly formed graphs is
then checked.
• Those found to be frequent are used to
generate larger candidates in the next round.
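The following sketch (my own paraphrase, not the slides' algorithm listing) shows the level-wise generate-and-test loop; size_one, candidate_join, and support are hypothetical placeholder callables standing in for the join and frequency-checking steps described above.

```python
# Hedged sketch of the level-wise AprioriGraph loop; size_one(), candidate_join()
# and support() are placeholders for the steps described in the bullets above.
def apriori_graph(D, min_sup, size_one, candidate_join, support):
    # S1: frequent substructures of size 1 (single vertices or single edges).
    level = [g for g in size_one(D) if support(g, D) >= min_sup]
    frequent = list(level)
    while level:
        # Join similar but slightly different size-k patterns into size-(k+1) candidates.
        candidates = candidate_join(level)
        # Frequency check: keep only candidates meeting min_sup; they seed the next round.
        level = [c for c in candidates if support(c, D) >= min_sup]
        frequent.extend(level)
    return frequent
```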
• The main design complexity of Apriori-based
substructure mining algorithms is the
candidate generation step.
• Recent Apriori-based algorithms for frequent
substructure mining include AGM, FSG, and a
path-join method.
• The AGM algorithm uses a vertex-based
candidate generation method that increases the
substructure size by one vertex at each iteration
of AprioriGraph.
• Two size-k frequent graphs are joined only if they
have the same size-(k-1) subgraph.
• Here, graph size is the number of vertices in the
graph.
• The newly formed candidate includes the size-(k-
1) subgraph in common and the additional two
vertices from the two size-k patterns.
• Because it is undetermined whether there is an
edge connecting the additional two vertices, we
actually can form two substructures.
• The FSG algorithm adopts an edge-based
candidate generation strategy that increases
the substructure size by one edge.
• Two size-k patterns are merged if and only if
they share the same subgraph having k - 1
edges, which is called the core.
• Here, graph size is taken to be the number of
edges in the graph.
• The newly formed candidate includes the core
and the additional two edges from the size-k
patterns.
Web Mining
• Web mining is mining of data related to the
World Wide Web.
• Web data:
 Content of actual Web pages.
 Intrapage structure includes the HTML or XML code for the page.
 Interpage structure is the actual linkage structure between Web pages.
 Usage data that describe how Web pages are accessed by visitors.
 User profiles include demographic and registration information obtained about users.
Web mining tasks can be divided into several classes
• Web content mining examines the content of Web pages
as well as results of Web searching.
• Web content mining is further divided into Web page
content mining and search results mining.
• The first is traditional searching of Web pages via
content, while the second is a further search of pages
found from a previous search.
• With Web structure mining, information is obtained
from the actual organization of pages on the Web.
• Web usage mining looks at logs of Web access. General
access pattern tracking is a type of usage mining that
looks at a history of Web pages visited. This usage may
be general or may be targeted to specific uses or users.
WEB CONTENT MINING
• Web content mining includes different techniques that can be
used to search the Internet.
• One taxonomy of Web mining divided Web content mining
into agent-based and database approaches.
• Agent-based approaches have software systems (agents)that
perform the content mining.
• The database approaches view the Web data as belonging to a
database.
• One problem associated with retrieval of data from Web
documents is that Web pages created using HTML are only
semistructured, thus making querying more difficult.
• HTML may ultimately be replaced by XML (eXtensible Markup Language),
which would provide structured documents and facilitate easier mining.
Crawlers(web crawling)
• A robot (or spider or crawler) is a program that
traverses the hypertext structure in the Web.
• The page (or set of pages) that the crawler starts with
are referred to as the seed URLs.
• By starting at one page, all links from it are recorded
and saved in a queue.
• These new pages are in turn searched and their links
are saved.
• As these robots search the Web, they may collect
information about each page, such as extracting
keywords and storing them in indices for users of the
associated search engine.
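A minimal breadth-first crawler sketch (illustrative only, not from the slides): it assumes the requests and beautifulsoup4 packages and an arbitrary page limit, and a real robot would additionally honor robots.txt, rate limits, and content types.

```python
# Hedged sketch: a tiny breadth-first crawler (assumes requests + beautifulsoup4).
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl(seed_urls, max_pages=20):
    queue = deque(seed_urls)          # links found are recorded and saved in a queue
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        visited.add(url)
        # Extract links; a search-engine crawler would also extract keywords
        # here and store them in its indices.
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))
    return visited
```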
• The focused crawler architecture consists of three primary components.
Harvest System
• Harvest is actually a set of tools that facilitate
gathering of information from diverse sources.
Virtual Web View
• One approach to handling the large amount of somewhat unstructured
data on the Web is to create a multiple layered database (MLDB).
• Each layer of this database is more generalized than the
layer beneath it.
• The MLDB provides an abstracted and condensed view of
a portion of the Web.
• A view of the MLDB, which is called a Virtual Web View
(VWV), can be constructed.
WebML
• WebML, a Web data mining query language, has been
proposed to provide data mining operations on the MLDB.
WEB STRUCTURE MINING
• Web structure mining can be viewed as
creating a model of the Web organization .
1. PageRank
• PageRank is used to measure the importance
of a page .
• The PageRank value for a page is calculated
based on the number of pages that point to it.
• This is actually a measure based on the number
of backlinks to a page.
• Given a page p, we use Bp to denote the set of
pages that point to p, and Fp to denote the set of
links out of p. The PageRank of a page p is then
defined (in its simplest form) as
PR(p) = c · Σ q∈Bp PR(q) / |Fq|,
where c < 1 is a normalization constant.
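As a hedged illustration (not the slides' own formula slide), the sketch below computes the backlink-based PageRank score by simple iteration; it uses the commonly cited damped variant, and the damping factor 0.85 and the toy link graph are assumptions.

```python
# Hedged sketch: iterative PageRank on a tiny link graph (damping factor is an assumption).
def pagerank(out_links, damping=0.85, iters=50):
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # Sum contributions from backlinks Bp: each q in Bp passes rank(q)/|Fq|.
            backlink_sum = sum(rank[q] / len(out_links[q])
                               for q in pages if p in out_links[q])
            new[p] = (1 - damping) / n + damping * backlink_sum
        rank = new
    return rank


links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}   # Fp (out-links) for each page p
print(pagerank(links))   # C accumulates the most rank in this toy graph
```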
2. Clever
• Clever is aimed at finding both authoritative pages and hubs.
• The authors define an authority as the "best source" for the requested information.
• In addition, a hub is a page that contains links to authoritative pages.
• The Clever system identifies authoritative pages and hub pages by creating weights.
• Hyperlink-induced topic search (HITS) finds hubs and authoritative pages.
The HITS technique contains two components:
• Based on a given set of keywords a set of
relevant pages is found.
• Hub and authority measures are associated
with these pages.
Pages with the highest values are returned.
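An illustrative sketch (not from the slides) of the second HITS component: the iterative hub/authority weight computation, run here on an assumed toy link graph with an arbitrary iteration count.

```python
# Hedged sketch: iterative hub/authority scores (HITS) on a small assumed link graph.
def hits(out_links, iters=20):
    pages = list(out_links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority of p grows with the hub scores of pages linking to p.
        auth = {p: sum(hub[q] for q in pages if p in out_links[q]) for p in pages}
        # Hub score of p grows with the authority scores of pages p links to.
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        # Normalize so the weights stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


links = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
hub, auth = hits(links)   # C is the strongest authority, A the strongest hub
print(hub, auth)
```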
WEB USAGE MINING
• Web usage mining performs mining on Web
usage data, or Web logs.
• A Web log is a listing of page reference data.
• Sometimes it is referred to as clickstream data
because each entry corresponds to a mouse
click.
Web usage mining applications
• Personalization for a user can be achieved by
keeping track of previously accessed pages.
• Improve the overall performance of future
accesses.
• Information concerning frequently accessed pages
can be used for caching.
• Identifying common access behaviors can be used
to improve the actual design of Web pages and to
make other modifications to the site.
• Web usage patterns can be used to gather
business intelligence to improve sales and
advertisement.
Web usage mining actually consists of three
separate types of activities
1. Preprocessing activities: reformatting the
Web log data before processing.
2. Pattern discovery: finding hidden patterns
within the log data.
3. Pattern analysis: looking at and interpreting
the results of the discovery activities.
There are many issues associated with using the
Web log for mining purposes:
• Identification of the exact user is not possible
from the log alone.
• With a Web client cache, the exact sequence of
pages a user actually visits is difficult to uncover
from the server site.
• There are many security, privacy, and legal issues
yet to be solved
1. Preprocessing
• The Web usage log probably is not in a format
that is usable by mining applications.
• As with any data to be used in a mining
application, the data may need to be reformatted
and cleansed.
• Steps that are part of the preprocessing phase
include cleansing, user identification, session
identification, path completion, and formatting.
• Problems
 Correct identification of the actual user. User
identification is complicated by the use of proxy
servers, client-side caching, and corporate firewalls.
 Identifying the actual sequence of pages accessed by
a user is complicated by the use of client-side caching.
• In this case, actual pages accessed will be missing from
the server-side log.
• Techniques can be used to complete the log by
predicting missing pages.
• Path completion is an attempt to add page accesses that
do not exist in the log but that actually occurred.
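As an illustration of the session identification step mentioned above (not from the slides), the sketch below groups log entries into sessions per client using a timeout rule; the log fields, IP-based user identification, and the 30-minute timeout are all assumptions.

```python
# Hedged sketch: naive session identification from a Web log via a 30-minute timeout
# (field layout and the timeout value are assumptions).
from datetime import datetime, timedelta

log = [
    ("192.0.2.1", "2020-06-30 10:00:00", "/index.html"),
    ("192.0.2.1", "2020-06-30 10:05:00", "/products.html"),
    ("192.0.2.1", "2020-06-30 11:30:00", "/index.html"),   # gap > 30 min -> new session
]

timeout = timedelta(minutes=30)
sessions, last_seen = {}, {}
for ip, ts, page in log:
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    # Start a new session for this client if it is the first hit or the gap exceeds the timeout.
    if ip not in last_seen or t - last_seen[ip] > timeout:
        sessions.setdefault(ip, []).append([])
    sessions[ip][-1].append(page)
    last_seen[ip] = t

print(sessions)   # {'192.0.2.1': [['/index.html', '/products.html'], ['/index.html']]}
```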
Data Structures
• A basic data structure is called a trie.
• A trie is a rooted tree, where each path from the root to a leaf
represents a sequence.
• Tries are used to store strings for pattern-matching applications.
• Each character in the string is stored on the edge to the node.
• A problem in using tries for many long strings is the space required.
• This waste of space is solved by compressing nodes together
when they have degree one.
• The compressed trie is called a suffix tree. A suffix tree has the
following characteristics:
• Each internal node except the root has at least two children.
• Each edge represents a nonempty subsequence.
• The subsequences represented by sibling edges begin with
different symbols.
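A small sketch (not part of the slides) of a plain trie storing page-reference sequences; merging its degree-one chains would give the compressed, suffix-tree-like structure described above. The dictionary representation and end marker are assumptions.

```python
# Hedged sketch: a plain trie over page-reference sequences (one symbol per edge).
def trie_insert(root, sequence):
    node = root
    for symbol in sequence:
        node = node.setdefault(symbol, {})   # one child dict per edge symbol
    node["$"] = True                          # mark the end of a stored sequence


root = {}
for seq in ["ABC", "ABD", "AC"]:             # e.g. pages A->B->C visited in a session
    trie_insert(root, seq)

print(root)
# {'A': {'B': {'C': {'$': True}, 'D': {'$': True}}, 'C': {'$': True}}}
# Nodes with a single child (degree one) could be merged to form a compressed trie.
```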
2. Pattern Discovery
• The most common data mining technique used
on clickstream data is that of uncovering
traversal patterns. A traversal pattern is a set of
pages visited by a user in a session.
• Several different types of traversal patterns
have been examined.
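For illustration only (not from the slides), here is a tiny sketch that treats a traversal pattern as a set of pages per session and counts how often page pairs co-occur; the toy sessions are assumptions.

```python
# Hedged sketch: counting simple traversal patterns (co-occurring page pairs per session).
from collections import Counter
from itertools import combinations

sessions = [["A", "B", "C"], ["A", "C"], ["B", "C", "A"]]   # toy clickstream sessions

pair_counts = Counter()
for pages in sessions:
    for pair in combinations(sorted(set(pages)), 2):        # unordered page pairs per session
        pair_counts[pair] += 1

print(pair_counts.most_common(3))   # ('A','C') appears in all 3 sessions in this toy data
```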
3. Pattern Analysis
• Once patterns have been identified, they must be
analyzed to determine how that information can
be used.
• Some of the generated patterns may be determined
not to be of interest and deleted.
• Recent work has proposed examining Web logs
not only to identify frequent types of traversal
patterns, but also to identify patterns that are of
interest because of their uniqueness or statistical properties.
• A Web mining query language, MINT, is used.
Text Mining
• Mining of data from text databases.
• Text databases consist of large collections of documents
from various sources, such as news articles, research
papers, books, digital libraries, e-mail messages, and Web
pages.
• Nowadays most of the information in government, industry,
business, and other institutions is stored electronically, in
the form of text databases.
• Data stored in most text databases are semistructured
data in that they are neither completely unstructured nor
completely structured.
• Information retrieval (IR)/Text mining is concerned with
the organization and retrieval of information from a
large number of text-based documents.
• Since information retrieval and database systems each
handle different kinds of data, some database system
problems are usually not present in information
retrieval systems, such as concurrency control,
recovery, transaction management, and update.
• Also, some common information retrieval problems
are usually not encountered in traditional database
systems, such as unstructured documents and
approximate search based on keywords.
Differences between text databases and traditional databases
Basic Measures for Text Retrieval: Precision and Recall
• Let the set of documents relevant to a query
be denoted as {Relevant}, and the set of
documents retrieved be denoted as {Retrieved}.
• The set of documents that are both relevant
and retrieved is denoted as
{Relevant} ∩ {Retrieved}.
• There are two basic measures for assessing
the quality of text retrieval:
• Precision: the percentage of retrieved documents that are in fact relevant,
i.e., |{Relevant} ∩ {Retrieved}| / |{Retrieved}|.
• Recall: the percentage of relevant documents that were in fact retrieved,
i.e., |{Relevant} ∩ {Retrieved}| / |{Relevant}|.
• An F-score, which combines precision and recall, is also commonly used.
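A small worked sketch (illustrative, not from the slides) computing precision and recall from the relevant and retrieved sets; the document identifiers are made up.

```python
# Hedged sketch: precision and recall from the {Relevant} and {Retrieved} sets.
def precision_recall(relevant, retrieved):
    overlap = relevant & retrieved                 # {Relevant} ∩ {Retrieved}
    precision = len(overlap) / len(retrieved)      # fraction of retrieved docs that are relevant
    recall = len(overlap) / len(relevant)          # fraction of relevant docs that were retrieved
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
print(precision_recall(relevant, retrieved))   # (0.666..., 0.5): 2 of 3 retrieved are relevant
```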
Social Network Analysis
What Is a Social Network?
• A social network is a heterogeneous and
multirelational data set represented by a graph.
• The graph is typically very large, with nodes
corresponding to objects and edges
corresponding to links representing relationships
or interactions between objects.
• Examples include electrical power grids,
telephone call graphs, the spread of computer
viruses, and the World Wide Web.
Characteristics of Social Networks
• Nodes’ degrees, that is, the number of edges
incident to each node
• Distances between a pair of nodes, as
measured by the shortest path length.
• Network diameter is the maximum distance
between pairs of nodes.
• Average distance between pairs of nodes.
• Effective diameter, that is, the minimum distance d such that,
for at least 90% of the reachable node pairs, the path length is at most d.
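An illustrative sketch (assuming NetworkX and its built-in karate-club toy network) computing the node degrees, pairwise distances, diameter, and average distance listed above.

```python
# Hedged sketch: basic social-network characteristics on a toy graph (NetworkX assumed).
import networkx as nx

G = nx.karate_club_graph()                       # a small built-in social network

degrees = dict(G.degree())                       # number of edges incident to each node
lengths = dict(nx.all_pairs_shortest_path_length(G))   # shortest-path distances between pairs

print("max degree:", max(degrees.values()))
print("diameter:", nx.diameter(G))               # maximum shortest-path distance
print("average distance:", nx.average_shortest_path_length(G))
# The effective diameter (90th-percentile distance) could be estimated from `lengths`.
```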
• In general, social networks tend to exhibit the
following phenomena:
• Densification power law:
• Traditionally it was believed that the number of edges grows linearly
in the number of nodes (the constant average degree assumption).
Experiments show instead that networks become denser over time,
following the densification power law (or growth power law), which
states that e(t) ∝ n(t)^a, where e(t) and n(t) are the numbers of edges
and nodes of the network at time t, and the exponent a typically lies
strictly between 1 and 2.
• Shrinking diameter:
• It has been experimentally shown that the effective
diameter tends to decrease as the network grows.
Link Mining: Tasks and Challenges
• By considering links (the relationships between
objects), more information is made available to the
mining process.
TASKS
1. Link-based object classification:
• Link-based classification predicts the category of an
object based not only on its attributes, but also on its
links, and on the attributes of linked objects.
• Web page classification is a well-recognized example
of link-based classification
2. Object type prediction.
• This predicts the type of an object, based on
its attributes and its links, and on the
attributes of objects linked to it.
3. Link type prediction.
• This predicts the type or purpose of a link,
based on properties of the objects involved.
• Given Web page data, we can try to predict
whether a link on a page is an advertising link
or a navigational link.
4. Predicting link existence:
• Predict whether a link exists between two
objects. Examples include predicting whether
there will be a link between two Web pages.
5. Link cardinality estimation.
• There are two forms of link cardinality
estimation.
• First, we may predict the number of links to an
object. This is useful, for instance, in predicting
the authoritativeness of a Web page based on the
number of links to it (in-links).
• Similarly, the number of out-links can be used to
identify Web pages that act as hubs, where a hub
is one or a set of Web pages that point to many
authoritative pages of the same topic
6. Object reconciliation.
• In object reconciliation, the task is to predict
whether two objects are, in fact, the same,
based on their attributes and links.
• Examples include predicting whether two
websites are mirrors of each other, and
whether two apparent disease strains are
really the same.
7. Group detection.
• Group detection is a clustering task.
• It predicts when a set of objects belong to the
same group or cluster, based on their
attributes as well as their link structure.
• An area of application is the identification of
Web communities, where a Web community is
a collection of Web pages that focus on a
particular theme or topic.
9. Metadata mining.
• Metadata are data about data. Metadata
provide semi-structured data about
unstructured data, ranging from text and Web
data to multimedia databases.
• It is useful for data integration tasks in many
domains
Challenges
• Logical versus statistical dependencies
• Feature construction.
• Effective use of labeled and unlabeled data.
• Link prediction.
• Community mining from multirelational
networks