CS8080 Information Retrieval Techniques: Unit III

UNIT – III: CLASSIFICATION
Topic 1: A CHARACTERIZATION OF TEXT CLASSIFICATION
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES

UNIT III
1. A Characterization of Text Classification
2. Unsupervised Algorithms: Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or Dimensionality Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional Indexing

INTRODUCTION TO CLASSIFICATION

• Scientists have become serious about addressing the question:
• “Can we build a model that learns from available data and automatically makes the right decisions and predictions?”
• The answer can be found in the numerous applications that are emerging from the fields of
1. pattern classification,
2. machine learning, and
3. artificial intelligence.

• Data from various sensing devices, combined with powerful learning algorithms and domain knowledge, has led to many inventions that we now take for granted in everyday life:
• Internet queries via search engines like Google,
• text recognition at the post office,
• barcode scanners at the supermarket,
• the diagnosis of diseases, and
• speech recognition by Siri or Google Now on our mobile phones, to name a few.

• Classification is the data mining process of finding a model (or function) that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.
• That is, it predicts categorical class labels (discrete or nominal).
• It constructs the model from a training set.
• It predicts group membership for data instances.

What is CLASSIFICATION?
• Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
• Classification and prediction help us better understand large data sets.
• Classification predicts categorical (discrete, unordered) labels.
• Prediction models continuous-valued functions.

• How can we classify?
• The approach here is machine learning: classifications are made on the basis of past observations (the learning part).
• We give the machine a set of texts, each tagged with a label, and let the model learn from this data; the trained model can then tell us which category a new input text belongs to.

Applications of Classification
• Classification of (potential) customers for:
• Credit approval, risk prediction, selective marketing
• Performance prediction based on
• selected indicators
• Medical diagnosis based on symptoms or reactions to Therapy
• Application areas:
• Credit approval
• Target marketing
• Medical diagnosis
• Treatment effectiveness analysis
• Performance prediction

When is classification needed?
• Scenarios:
• In each of these examples, the data analysis task is classification,
• where a model or classifier is constructed to predict categorical labels, such as
• “safe” or “risky” for the loan application data;
• “yes” or “no” for the marketing data; or
• “treatment A,” “treatment B,” or “treatment C” for the medical data.
• These categories can be represented by discrete values, where the ordering among values
has no meaning.
• For example,
• the values 1, 2, and 3 may be used to represent treatments A, B, and C,
• where there is no ordering implied among this group of treatment regimes.

Aim: predict categorical class labels
for new tuples/samples
Input: a training set of tuples/samples,
each with a class label
Output: a model (a classifier) based on
the training set and the class labels

Why Classification?
• A classical problem extensively studied by
• statisticians and machine learning researchers
• Predicts categorical class labels.
• Produces a model (classifier).

Typical Applications of Classification
• Example:
• {credit history, salary} → credit approval (Yes/No)
• {Temp, Humidity} → Rain (Yes/No)
• A set of documents → sports, technology, etc.
• Another Example:
• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.
• If 70 <= x < 80 then grade = C.
• If 60 <= x < 70 then grade = D.
• If x < 60 then grade = F.
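The grading rules above amount to a small rule-based classifier. A minimal Python sketch follows; the function name `grade` is illustrative, and it assumes that every score below 60 maps to F so that no score is left without a class.

```python
def grade(score: float) -> str:
    """Map a numeric score to a letter grade using fixed thresholds."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"  # assumption: every score below 60 is an F


for s in (95, 83, 70, 61, 42):
    print(s, grade(s))
```

Each threshold plays the role of a test on the single attribute x, and the returned letter is the predicted class label.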

WHAT IS TEXT CLASSIFICATION?
• Text classification is a machine learning technique that assigns a set of predefined categories to open-ended text.
• Text classifiers can be used to organize, structure, and categorize almost any kind of text, from documents, medical studies, and files to content from all over the web.

What is meant by text classification?
• Text classification or Text Categorization
is the activity of labeling natural
language texts with relevant categories
from a predefined set.
• In layman's terms, text classification is the process of extracting generic tags from unstructured text.
• These generic tags come from a set of
pre-defined categories.

What is meant by text classification or Document classification ?
• Document classification or document categorization is
• a problem in library science, information science and
computer science.
• The task is to assign a document to one or more classes or
categories.
• This may be done "manually" or algorithmically.
• Source: Wikipedia

What is meant by text classification?
• Text classification also known as text tagging or text
categorization is the process of categorizing text into
organized groups.
• By using Natural Language Processing (NLP), text
classifiers can automatically analyze text and then
assign a set of pre-defined tags or categories based on
its content.

Text Classification Examples
• Text classification is becoming an increasingly important part of business because it makes it easy to get insights from data and automate business processes.
• Some of the most common examples and use cases for
automatic text classification include the following:
a) Sentiment Analysis
b) Topic Detection
c) Language Detection

Text Classification Examples
a) Sentiment Analysis: the process of understanding if a given text is
talking positively or negatively about a given subject
(e.g. for brand monitoring purposes).
b) Topic Detection: the task of identifying the theme or topic of a piece
of text
(e.g. know if a product review is about Ease of Use, Customer Support,
or Pricing when analyzing customer feedback).
c) Language Detection: the procedure of detecting the language of a
given text
(e.g. know if an incoming support ticket is written in English or Spanish for
automatically routing tickets to the appropriate team).

A Characterization of Text Classification
• For example,
• new articles can be organized by topics;
• support tickets can be organized by urgency;
• chat conversations can be organized by language;
• brand mentions can be organized by sentiment; and so on.
• Text classification is
• one of the fundamental tasks in natural language processing with broad applications such
as sentiment analysis, topic labeling, spam detection, and intent detection.
• Here’s an example of how it works:
• “The user interface is quite straightforward and easy to use.”
• A text classifier can take this phrase as an input, analyze its content, and then automatically
assign relevant tags, such as UI and Easy To Use.

• A first tactic for categorizing documents is to assign a label to each document,
• but this solves the problem only when users know the labels of the documents they are looking for.
• This tactic does not solve the more general problem of finding documents on a specific topic or subject.

• In that case, a better solution is to group documents by common generic topics and label each group with a meaningful name.
• Each labeled group is called a category or class.
• Document classification is the process of assigning documents to a given cluster or category using a fully supervised learning process.

Why is Text Classification Important?
• It’s estimated that around 80% of all information is unstructured, with text
being one of the most common types of unstructured data.
• Because of the messy nature of text,
• analyzing, understanding, organizing, and sorting through text data is hard and time-consuming, so
most companies fail to use it to its full potential.
• This is where text classification with machine learning comes in.
• Using text classifiers, companies can automatically structure all manner of relevant text, from legal documents, social media, chatbots, surveys, and more, in a fast and cost-effective way.
• This allows companies to
• save time analyzing text data, automate business processes, and make data-driven business
decisions.

Reasons Why Text Classification Is Important
a) Scalability
• Manually analyzing and organizing text is slow and much less accurate.
• Machine learning can automatically analyze millions of surveys, comments, emails,
etc., at a fraction of the cost, often in just a few minutes.
• Text classification tools are scalable to any business needs, large or small.
b) Real-time analysis
• There are critical situations that companies need to identify as soon as possible and
take immediate action (e.g., PR crises on social media).
• Machine learning text classification can follow your brand mentions constantly and in
real time, so you'll identify critical information and be able to take action right away.

Reasons Why Text Classification Is Important
c) Consistent criteria
• Human annotators make mistakes when classifying text data due to
distractions, fatigue, and boredom, and human subjectivity creates inconsistent
criteria.
• Machine learning, on the other hand, applies the same lens and criteria to all
data and results.
• Once a text classification model is properly trained, it applies those criteria consistently and can reach high accuracy.

• Classification can be performed
1. manually by domain experts, or
2. automatically, using well-known and widely used classification algorithms such as decision trees and Naïve Bayes.
• Documents are classified either according to their subjects or according to other attributes (e.g. author, document type, publishing year).

• There are two main kinds of subject classification of documents:
1. the content-based approach, and
2. the request-based approach.
• In content-based classification,
• the weight given to particular subjects in a document decides the class to which the document is assigned.
• For example, it is a rule in some library classification schemes that at least 15% of the content of a book should be about the class to which the book is assigned.
• In automatic classification, the number of times given words appear in a document determines the class.
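To make the word-count idea concrete, here is a minimal Python sketch of content-based classification. The keyword lists in `CLASS_KEYWORDS` and the function name are hypothetical, chosen only for illustration; a real classifier would learn its vocabulary and weights from training data.

```python
import re
from collections import Counter

# Hypothetical keyword lists, one per class; a real system would learn these from data.
CLASS_KEYWORDS = {
    "sports": {"match", "team", "score", "player"},
    "technology": {"software", "computer", "network", "data"},
}


def classify_by_word_counts(text: str) -> str:
    """Return the class whose keywords occur most often in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    totals = {
        label: sum(counts[word] for word in keywords)
        for label, keywords in CLASS_KEYWORDS.items()
    }
    # Ties (including no keyword hits at all) fall back to the first class listed.
    return max(totals, key=totals.get)


print(classify_by_word_counts("The team won the match after a late score."))
```

The document is assigned to whichever class accumulates the highest keyword count, which mirrors the "number of times given words appear" rule in its simplest form.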

• In request-oriented classification, the anticipated requests from users influence how documents are classified.
• The classifier asks:
• “Under which description should this entity be found?” and
• thinks of all the possible queries and decides for which ones the entity at hand is relevant.

Text Classification Applications
• With the help of text classification, businesses can make sense of large
amounts of data using techniques like
• aspect-based sentiment analysis to understand what people are talking about
and how they’re talking about each aspect.
• Text classification can help support teams provide a stellar experience
by
• automating tasks that are better left to computers, saving precious time that
can be spent on more important things.

• Text classification models can help you analyze survey results to discover patterns and
insights like:
• What do people like about our product or service?
• What should we improve?
• What do we need to change?
• By combining both quantitative results and qualitative analyses,
• teams can make more informed decisions without having to spend hours
manually analyzing every single open-ended response.

• Text classification has thousands of use cases and is applied to a wide range
of tasks.
• In some cases, data classification tools work behind the scenes to enhance
app features we interact with on a daily basis (like email spam filtering).
• In some other cases, classifiers are used by marketers, product managers,
engineers, and salespeople to automate business processes and save
hundreds of hours of manual data processing.
• Some of the top applications and use cases of text classification include:
1. Detecting urgent issues
2. Automating customer support processes
3. Listening to the Voice of customer (VoC)

• Automatic document classification tasks can be divided into three types:
1. Unsupervised document classification (document clustering): the classification must be done entirely without reference to external information.
2. Semi-supervised document classification: only part of the documents are labeled by an external method.
3. Supervised document classification: some external method (such as human feedback) provides information on the correct classification of the documents.

Computational Supervised Learning
• Computational supervised learning, also called classification, aims to:
• learn from past experience, and
• use the learned knowledge to classify new data.
• The knowledge is learned by intelligent algorithms.
• Examples:
• Clinical diagnosis for patients
• Cell type classification

Overall Picture of Supervised Learning
[Figure: data from biomedical, financial, government, and scientific domains is used to train classifiers ("M-Doctors") such as decision trees, emerging patterns, SVMs, and neural networks.]

Unsupervised Learning
• Unsupervised learning is a machine learning technique in which models are not supervised using a labeled training dataset. Instead, the models themselves find hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain when learning new things. It can be defined as:
• “Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.”

• Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data.
• The goal of unsupervised learning is to find the underlying structure of a dataset, group the data according to similarities, and represent the dataset in a compressed format.

• Example: suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs.
• The algorithm is never trained on labels for this dataset, so it has no prior knowledge of which features distinguish the images.
• The task of the unsupervised learning algorithm is to identify the image features on its own.

• The unsupervised learning algorithm performs this task by clustering the image dataset into groups according to the similarities between images.
• Put simply, no labeled training data is provided.
• Examples:
• neural network models
• independent component analysis
• clustering

Supervised vs. Unsupervised Learning (Classification vs. Clustering)
• Supervised learning (classification)
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• New data is classified based on the training set.
• Unsupervised learning (clustering)
• The class labels of the training data are unknown.
• Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Topic 2: UNSUPERVISED ALGORITHMS - CLUSTERING

INTRODUCTION TO UNSUPERVISED ALGORITHMS
• Below is a list of some popular unsupervised learning algorithms:
• K-means clustering
• Nearest-neighbor search (the k-NN classifier itself is a supervised method)
• Hierarchical clustering
• Anomaly detection
• Neural networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition


WHAT IS CLUSTERING?
• Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset.
• It can be defined as "a way of grouping the data points into different clusters, consisting of similar data points. The objects with the possible similarities remain in a group that has less or no similarities with another group."

WHAT IS CLUSTERING?
• It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, or behavior, and dividing the data according to the presence or absence of those patterns.
• It is an unsupervised learning method: no supervision is provided to the algorithm, and it works with an unlabeled dataset.
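As an illustration of grouping unlabeled points by similarity, the following is a minimal k-means sketch in plain Python. It is a simplified version for teaching purposes: the choice of the first k points as initial centroids, the fixed iteration count, and the toy 2-D points are all assumptions, not part of the original slides.

```python
import math


def kmeans(points, k, iterations=10):
    """Group 2-D points into k clusters; returns (centroids, labels)."""
    centroids = [points[i] for i in range(k)]  # assumption: first k points seed the clusters
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Hypothetical toy data: two visibly separate groups, no labels given.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

No class labels are supplied; the cluster numbers in `labels` only say which points ended up together. With poorly chosen seeds k-means can settle on a worse grouping, which is why practical implementations use more careful initialization.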

Difference between Supervised and Unsupervised Learning
Supervised Learning | Unsupervised Learning
Algorithms are trained using labeled data. | Algorithms are trained using unlabeled data.
The model takes direct feedback to check whether it is predicting the correct output. | The model does not take any feedback.
The model predicts the output. | The model finds hidden patterns in the data.
Needs supervision to train the model. | Does not need any supervision to train the model.
Can be categorized into classification and regression problems. | Can be categorized into clustering and association problems.
Used where we know the inputs as well as the corresponding outputs. | Used where we have only input data and no corresponding output data.
Generally produces more accurate results. | May give less accurate results than supervised learning.

Advantages of Unsupervised Learning
• Unsupervised learning is used for more complex tasks
as compared to supervised learning because,
• in unsupervised learning, we don't have labeled input data.
• Unsupervised learning is preferable as
• it is easy to get unlabeled data in comparison to labeled
data.

Disadvantages of Unsupervised Learning
• Unsupervised learning is
• intrinsically more difficult than supervised learning as it does not have
corresponding output.
• The result of the unsupervised learning algorithm might be
• less accurate as input data is not labeled, and algorithms do not know the
exact output in advance.

Topic 3: NAÏVE TEXT CLASSIFICATION

INTRODUCTION TO NAÏVE TEXT CLASSIFICATION
• Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem.
• It is not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other, given the class.
• Naive Bayes classifiers have been heavily used for text classification and text analysis problems in machine learning.


The Naive Bayes algorithm
• The dataset is divided into two parts, namely,
feature matrix and the response/target vector.

The Naive Bayes algorithm
• The feature matrix (X) contains all the vectors (rows) of the dataset, in which each vector consists of the values of the features (the predictor variables). The number of features is d, i.e. X = (x1, x2, ..., xd).
• The response/target vector (y) contains the value of the class/group variable for each row of the feature matrix.

The Bayes’ Theorem
• Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred.
• Bayes’ theorem is stated mathematically as follows:
\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \]

• where:
• A and B are events, and P(B) ≠ 0.
• P(A | B) is the probability of event A given that event B is true (has occurred); it is the posterior probability of A.
• Event B is also termed the evidence.
• P(A) is the prior of A (the prior probability, i.e. the probability of the event before the evidence is seen).
• P(B | A) is the probability of B given event A, i.e. the likelihood of the evidence B when A is true.
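A small numeric sketch can make the roles of prior, likelihood, and evidence concrete. The function name `posterior` and the spam figures below are hypothetical, chosen only for illustration; the evidence P(B) is expanded with the law of total probability.

```python
def posterior(prior_a: float, likelihood_b_given_a: float,
              likelihood_b_given_not_a: float) -> float:
    """Return P(A | B) from P(A), P(B | A) and P(B | not A)."""
    # Total probability of the evidence: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence


# Hypothetical numbers: 20% of mail is spam; the word "offer" appears
# in 60% of spam messages and in 5% of non-spam messages.
p_spam_given_offer = posterior(0.2, 0.6, 0.05)
print(round(p_spam_given_offer, 2))  # 0.75
```

Seeing the evidence raises the probability of spam from the prior 0.20 to a posterior of 0.75 under these assumed figures.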


Dealing with text data
• Text analysis is a major application field for machine learning algorithms.
• However, the raw data, a sequence of symbols (i.e. strings), cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents with variable length.

Dealing with text data
• To address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:
• tokenizing strings and giving an integer id to each possible token, for instance by using white-spaces and punctuation as token separators.
• counting the occurrences of tokens in each document.
• In this scheme, features and samples are defined as follows:
• each individual token occurrence frequency is treated as a feature.
• the vector of all the token frequencies for a given document is considered a multivariate sample.
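The tokenize-then-count scheme above can be sketched with the Python standard library alone. This is not scikit-learn's implementation or API; the helper names `tokenize`, `build_vocabulary`, and `vectorize` and the two sample documents are assumptions made for illustration.

```python
import re


def tokenize(text):
    """Split on anything that is not a letter or digit (white-space, punctuation)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Give each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(doc, vocab):
    """Count token occurrences; the result is a fixed-size feature vector."""
    vector = [0] * len(vocab)
    for token in tokenize(doc):
        if token in vocab:
            vector[vocab[token]] += 1
    return vector


docs = ["The cat sat.", "The cat saw the dog!"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Every document, whatever its length, becomes a vector with one entry per vocabulary token, which is the fixed-size numerical input that learning algorithms expect.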

Example 1: Using the Naive Bayesian Classifier
• We will consider the following training set.
• The data samples are described by the attributes age, income, student, and credit.
• The class label attribute, buy, tells whether the person buys a computer; it has two distinct values, yes (class C1) and no (class C2).

RID  Age          Income  Student  Credit     Class (buy)
1    youth        high    no       fair       C2: no
2    youth        high    no       excellent  C2: no
3    middle-aged  high    no       fair       C1: yes
4    senior       medium  no       fair       C1: yes
5    senior       low     yes      fair       C1: yes
6    senior       low     yes      excellent  C2: no
7    middle-aged  low     yes      excellent  C1: yes
8    youth        medium  no       fair       C2: no
9    youth        low     yes      fair       C1: yes
10   senior       medium  yes      fair       C1: yes
11   youth        medium  yes      excellent  C1: yes
12   middle-aged  medium  no       excellent  C1: yes
13   middle-aged  high    yes      fair       C1: yes
14   senior       medium  no       excellent  C2: no

• The sample we wish to classify is
• X = (age = youth, income = medium, student = yes, credit = fair)
• We need to maximize P (X|Ci)P (Ci), for i = 1, 2. P (Ci), the a priori
probability of each class, can be estimated based on the training
samples:
• P(buy =yes ) = 9 /14
• P(buy =no ) = 5 /14

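• From the training set, the conditional probabilities P(xk | Ci) are:
• P(age = youth | buy = yes) = 2/9 = 0.222
• P(age = youth | buy = no) = 3/5 = 0.600
• P(income = medium | buy = yes) = 4/9 = 0.444
• P(income = medium | buy = no) = 2/5 = 0.400
• P(student = yes | buy = yes) = 6/9 = 0.667
• P(student = yes | buy = no) = 1/5 = 0.200
• P(credit = fair | buy = yes) = 6/9 = 0.667
• P(credit = fair | buy = no) = 2/5 = 0.400
• Using the above probabilities, we obtain
• P(X | buy = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
• Similarly,
• P(X | buy = no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019
• To find the class that maximizes P(X | Ci) P(Ci), we compute
• P(X | buy = yes) P(buy = yes) = 0.044 x 0.643 = 0.028
• P(X | buy = no) P(buy = no) = 0.019 x 0.357 = 0.007
• Thus the naive Bayesian classifier predicts buy = yes for sample X.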
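The calculation in Example 1 can be checked with a short Python sketch. The rows are copied from the training table above; the function name `naive_bayes_scores` is illustrative, and the sketch uses raw relative frequencies with no smoothing, exactly as the hand calculation does.

```python
from collections import Counter

# Training rows from the table: (age, income, student, credit, buy)
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle-aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle-aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle-aged", "medium", "no", "excellent", "yes"),
    ("middle-aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def naive_bayes_scores(rows, sample):
    """Return {class: P(X | Ci) * P(Ci)} using raw relative frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(Ci)
        for i, value in enumerate(sample):
            matches = sum(1 for row in rows if row[-1] == label and row[i] == value)
            score *= matches / n_label  # conditional P(x_i | Ci)
        scores[label] = score
    return scores


sample = ("youth", "medium", "yes", "fair")
scores = naive_bayes_scores(ROWS, sample)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The score for buy = yes comes out near 0.028 and the score for buy = no near 0.007, so the sketch agrees with the prediction buy = yes.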

Example 2: Predicting a class label using naïve Bayesian classification
• The training data set is given below.
• The data tuples are described by the attributes Owns Home, Married, Gender, and Employed.
• The class label attribute Risk Class has three distinct values.
• Let C1 correspond to class A, C2 to class B, and C3 to class C.

• The tuple to classify is
• X = (Owns Home = Yes, Married = No, Gender = Female, Employed = Yes)
Owns Home  Married  Gender  Employed  Risk Class
Yes        Yes      Male    Yes       B
No         No       Female  Yes       A
Yes        Yes      Female  Yes       C
Yes        No       Male    No        B
No         Yes      Female  Yes       C
No         No       Female  Yes       A
No         No       Male    No        B
Yes        No       Female  Yes       A
No         Yes      Female  Yes       C
Yes        Yes      Female  Yes       C

• Solution
• There are 10 samples and three classes.
• Risk class A = 3, Risk class B = 3, Risk class C = 4
• The prior probabilities are obtained by dividing these frequencies by the total number of samples in the training data:
• P(A) = 3/10 = 0.3, P(B) = 3/10 = 0.3, P(C) = 4/10 = 0.4

• To compute P(X | Ci) = P({Yes, No, Female, Yes} | Ci) for each of the classes, we need the conditional probability of each attribute value:
• P(Owns Home = Yes | A) = 1/3 = 0.33
• P(Married = No | A) = 3/3 = 1
• P(Gender = Female | A) = 3/3 = 1
• P(Employed = Yes | A) = 3/3 = 1
• P(Owns Home = Yes | B) = 2/3 = 0.67
• P(Married = No | B) = 2/3 = 0.67
• P(Gender = Female | B) = 0/3 = 0
• P(Employed = Yes | B) = 1/3 = 0.33
• P(Owns Home = Yes | C) = 2/4 = 0.5
• P(Married = No | C) = 0/4 = 0
• P(Gender = Female | C) = 4/4 = 1
• P(Employed = Yes | C) = 4/4 = 1

• Using the above probabilities, we obtain
• P(X | A) = P(Owns Home = Yes | A) x P(Married = No | A) x P(Gender = Female | A) x P(Employed = Yes | A) = 0.33 x 1 x 1 x 1 = 0.33
• Similarly, P(X | B) = 0 and P(X | C) = 0.
• To find the class Ci that maximizes P(X | Ci) P(Ci), we compute
• P(X | A) P(A) = 0.33 x 0.3 = 0.099
• P(X | B) P(B) = 0 x 0.3 = 0
• P(X | C) P(C) = 0 x 0.4 = 0
• Therefore X is assigned to class A.

Advantages and Disadvantages
• Advantages:
a) In theory, they have the minimum error rate in comparison to all other classifiers.
b) Easy to implement.
c) Good results are obtained in most cases.
d) They provide theoretical justification for other classifiers that do not explicitly use Bayes' theorem.
• Disadvantages:
a) Lack of available probability data.
b) Inaccuracies in the assumptions made, such as class-conditional independence.

Topic 4: SUPERVISED ALGORITHMS

INTRODUCTION TO SUPERVISED LEARNING
• Supervised learning, also known as supervised machine learning, is a subcategory of machine learning and artificial intelligence.
• It is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately.


Supervised Learning
• As input data is fed into the model, the model adjusts its weights until it has been fitted appropriately; this occurs as part of the cross-validation process.
• Supervised learning helps organizations solve a variety of real-world problems at scale, such as classifying spam into a folder separate from your inbox.

What is the type of supervised learning?
• There are two types of Supervised Learning
techniques:
1. Regression and
2. Classification.
• Classification separates the data, Regression fits the
data.

Example of Supervised Learning
• A good example of supervised learning is text classification.
• In this set of problems, the goal is to predict the class label of a given piece of text.
• One particularly popular task in text classification is to predict the sentiment of a piece of text, such as a tweet or a product review.

INTRODUCTION TO SUPERVISED ALGORITHMS
• What is a supervised algorithm?
• A supervised learning algorithm takes a known set of input data (the learning set) and known responses to that data (the output), and forms a model to generate reasonable predictions for the response to new input data.
• Use supervised learning if you have existing data for the output you are trying to predict.

SUPERVISED ALGORITHMS: EXAMPLES

• Various algorithms and computation techniques are used in supervised machine learning processes.
• The most commonly used learning methods, typically run with tools such as R or Python, are:
1) Neural networks
• Primarily leveraged for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes.

2) Naive Bayes
• Naive Bayes is a classification approach that adopts the principle of class-conditional independence from Bayes' Theorem.
3) Linear regression
• Linear regression is used to identify the relationship between a dependent variable and one or more independent variables and is typically leveraged to make predictions about future outcomes.
4) Logistic regression
• While linear regression is used when the dependent variable is continuous, logistic regression is selected when the dependent variable is categorical, for example when it has binary outputs such as "true" and "false" or "yes" and "no."

5) Support vector machine (SVM)
• A support vector machine is a popular supervised learning model developed by
Vladimir Vapnik, used for both data classification and regression.
6) K-nearest neighbor
• K-nearest neighbor, also known as the KNN algorithm, is a non-parametric
algorithm that classifies data points based on their proximity and association to
other available data.

7) Random forest
• Random forest is another flexible supervised machine learning algorithm used
for both classification and regression purposes.
• The "forest" references a collection of uncorrelated decision trees, which are
then merged together to reduce variance and create more accurate data
predictions.
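As a rough illustration of the K-nearest neighbor item in the list above, the sketch below classifies a query point by majority vote among its k closest labeled points. The function name `knn_predict`, the Euclidean distance, the default k = 3, and the toy training points are all assumptions made for this example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of (point, label). Return the majority label of the k nearest points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled training data: two groups of 2-D points.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]
print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (6.4, 6.2)))
```

There is no separate training phase: the labeled examples are kept as they are and consulted at prediction time, which is why k-NN is described as non-parametric.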

Topic 5: DECISION TREES

INTRODUCTION TO DECISION TREES
• What is a decision tree?
• A decision tree is a structure that includes a root node,
branches, and leaf nodes.
a) Each internal node denotes a test on an attribute,
b) each branch denotes the outcome of a test, and
c) each leaf node holds a class label.
• The topmost node in the tree is the root node.

• ID3, C4.5, and CART adopt a greedy (i.e., non-backtracking)
approach in which decision trees are constructed in a top-
down recursive divide-and-conquer manner.
• Most algorithms for decision tree induction also follow such a
top-down approach, which starts with a training set of
tuples and their associated class labels.
• The training set is recursively partitioned into smaller
subsets as the tree is being built.
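The greedy, top-down, recursive partitioning described above can be sketched in a few lines of Python. This is a simplified ID3-style illustration that selects attributes by information gain; it is not the full ID3, C4.5, or CART algorithm (no pruning, no numeric attributes), and the function names and the four-row toy data set are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on attribute index attr."""
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs):
    """Greedy top-down induction: pick the best attribute, partition, recurse."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    node = {"attr": best, "children": {}}
    remaining = [a for a in attrs if a != best]
    for value in set(row[best] for row in rows):
        pairs = [(r, lab) for r, lab in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [lab for _, lab in pairs]
        node["children"][value] = build_tree(sub_rows, sub_labels, remaining)
    return node


# Hypothetical toy data: (outlook, humidity) -> play
rows = [("sunny", "high"), ("sunny", "low"), ("rainy", "high"), ("rainy", "low")]
labels = ["no", "yes", "no", "yes"]
tree = build_tree(rows, labels, attrs=[0, 1])
print(tree)
```

In this toy set the second attribute separates the classes perfectly, so the greedy step picks it at the root and both partitions become leaves immediately. The choice is never revisited, which is what "non-backtracking" means.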

• Decision tree induction is the learning of decision trees from
class-labeled training tuples.
• A decision tree is a flowchart-like tree structure, where each
internal node (nonleaf node) denotes a test on an attribute,
• each branch represents an outcome of the test, and
• each leaf node (or terminal node) holds a class label.
• The topmost node in a tree is the root node.

• A decision tree is a tree where
• internal node = a test on an attribute
• tree branch = an outcome of the test
• leaf node = class label or class distribution

Benefits of Decision Trees
•The benefits of having a decision tree are as
follows −
a) It does not require any domain knowledge.
b) It is easy to comprehend.
c) The learning and classification steps of a decision tree are
simple and fast.

Brief History of Decision Trees
• CLS (Hunt et al., 1966): cost-driven
• ID3 (Quinlan, 1986, MLJ): information-driven
• C4.5 (Quinlan, 1993): gain ratio + pruning ideas
• CART (Breiman et al., 1984): Gini index

Elegance of Decision Trees

Structure of Decision Trees
• If x1 > a1 & x2 > a2, then it’s A class
• C4.5 and CART are two of the most widely used algorithms
• Easy to interpret, but accuracy is generally unattractive
[Figure: a decision tree with a root node, internal nodes, and leaf nodes; the nodes test x1, x2, x3, and x4 against thresholds such as > a1 and > a2, and the leaves are labeled A or B.]

Example of Decision Tree

Another Example of Decision Tree

Decision Tree classification Tasks

Apply Model to Test Data
(Slide source: Data Mining: Concepts and Techniques)
Decision tree:
Refund?
  Yes → NO
  No → MarSt?
    Married → NO
    Single, Divorced → TaxInc?
      < 80K → NO
      > 80K → YES
Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?
• Start at the root and follow the branch that matches the test record at each node: Refund = No leads to the MarSt test, and MarSt = Married leads to a leaf labeled NO.
• Assign Cheat to “No”
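The traversal above can be sketched in a few lines of Python. This is only an illustration of applying the tree on the slide; the function name `classify`, the dictionary keys, and the treatment of an income of exactly 80K (the slide labels the branches only "< 80K" and "> 80K") are assumptions.

```python
def classify(record):
    """Walk the Refund / MarSt / TaxInc tree and return the predicted Cheat label."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (in thousands).
    # Assumption: exactly 80K is sent down the "No" branch.
    return "Yes" if record["TaxInc"] > 80 else "No"


test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))
```

Only two tests are evaluated for this record, because the Married branch ends in a leaf before the income test is reached.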

Decision Tree Classification Task
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
• Induction: Training Set → Tree Induction algorithm → Learn Model → Model (Decision Tree)
• Deduction: Model (Decision Tree) → Apply Model → Test Set

Constructing a Decision Tree

• Two phases of decision tree generation:
1. tree construction
• at the start, all the training examples are at the root
• partition examples based on selected attributes
• test attributes are selected based on a heuristic or a statistical measure
2. tree pruning
• identify and remove branches that reflect noise or outliers

• Basic step:
Determination of the root node of the tree and
the root node of its sub-trees
• Most Discriminatory Feature
• Every feature can be used to partition the training data
• If the partitions contain a pure class of training instances, then this feature is
most discriminatory

Constructing a Decision Tree:- Example of Partitions
• Categorical feature
• Number of partitions of the training data is equal to the number of values of
this feature
• Numerical feature
• Two partitions

Decision Tree Induction Algorithm
• Around 1980, J. Ross Quinlan, a machine learning researcher, developed a
decision tree algorithm known as ID3 (Iterative Dichotomiser).
• Later, he presented C4.5, the successor of ID3. ID3 and C4.5
adopt a greedy approach.
• In these algorithms, there is no backtracking; the trees are constructed
in a top-down recursive divide-and-conquer manner.
• The algorithm generates a decision tree from the training tuples of data partition D.

Algorithm : Generate_decision_tree
• Input:
• Data partition, D, which is a set of training tuples and their
associated class labels.
• attribute_list, the set of candidate attributes.
• Attribute_selection_method, a procedure to determine the splitting
criterion that best partitions the data tuples into individual
classes. This criterion includes a splitting_attribute and either a
split point or a splitting subset.
• Output: A decision tree

Algorithm : Generate_decision_tree
• Method
create a node N;
if tuples in D are all of the same class, C, then
    return N as a leaf node labeled with the class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D; // majority voting
apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
    attribute_list = attribute_list - splitting_attribute; // remove splitting_attribute
for each outcome j of splitting_criterion
    // partition the tuples and grow subtrees for each partition
    let Dj be the set of data tuples in D satisfying outcome j; // a partition
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
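As a rough illustration of the method, the following Python sketch follows the same recursion for discrete-valued attributes with multiway splits. It assumes ID3-style information gain as the attribute selection method and a nested-dictionary tree representation; the function names, the `31..40` spelling of the middle age range, and the choice to branch only on values present in the partition (so the empty-partition case never arises) are illustrative choices, not part of the algorithm as stated. The data are the 14 buys_computer training tuples from these slides.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def info_gain(rows, attr, target):
    """Reduction in entropy obtained by partitioning rows on attr."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder


def generate_decision_tree(rows, attribute_list, target):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                      # all tuples in the same class
        return labels[0]
    if not attribute_list:                         # majority voting
        return Counter(labels).most_common(1)[0][0]
    best = max(attribute_list, key=lambda a: info_gain(rows, a, target))
    remaining = [a for a in attribute_list if a != best]
    node = {best: {}}
    for value in sorted({r[best] for r in rows}):  # one branch per observed outcome
        partition = [r for r in rows if r[best] == value]
        node[best][value] = generate_decision_tree(partition, remaining, target)
    return node


DATA = """
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
"""
COLUMNS = ["age", "income", "student", "credit_rating", "buys_computer"]
rows = [dict(zip(COLUMNS, line.split())) for line in DATA.strip().splitlines()]
tree = generate_decision_tree(rows, COLUMNS[:-1], "buys_computer")
print(tree)
```

On this data the sketch selects age at the root, then student for the youngest group and credit_rating for the oldest group.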

Constructing Decision Tree Example :- Weather Forecasting

Constructing Decision Tree :- A Simple Dataset
• 9 Play samples
• 5 Don’t Play samples
• A total of 14 samples

Instance #  Outlook   Temp  Humidity  Windy  Class
1           Sunny     75    70        true   Play
2           Sunny     80    90        true   Don’t
3           Sunny     85    85        false  Don’t
4           Sunny     72    95        true   Don’t
5           Sunny     69    70        false  Play
6           Overcast  72    90        true   Play
7           Overcast  83    78        false  Play
8           Overcast  64    65        true   Play
9           Overcast  81    75        false  Play
10          Rain      71    80        true   Don’t
11          Rain      65    70        true   Don’t
12          Rain      75    80        false  Play
(The rows for instances 13 and 14 are missing from the table. According to the partitions of the 14 instances, both are Rain instances with Temp <= 70 and class Play.)

Resulting decision tree (number of training instances at each leaf in parentheses):
outlook?
  sunny → humidity?
    <= 75 → Play (2)
    > 75 → Don’t (3)
  overcast → Play (4)
  rain → windy?
    true → Don’t (2)
    false → Play (3)

Partitioning the total of 14 training instances on Outlook:
• Outlook = sunny: instances 1,2,3,4,5 with classes P,D,D,D,P
• Outlook = overcast: instances 6,7,8,9 with classes P,P,P,P
• Outlook = rain: instances 10,11,12,13,14 with classes D,D,P,P,P

Partitioning the total of 14 training instances on Temperature:
• Temperature <= 70: instances 5,8,11,13,14 with classes P,P,D,P,P
• Temperature > 70: instances 1,2,3,4,6,7,9,10,12 with classes P,D,D,D,P,P,P,D,P
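To see which of the two partitions is more discriminatory, one option is to compare them by information gain, the measure used by ID3. The worked calculation below assumes that measure; it uses only the class counts listed in the two partitions (9 Play and 5 Don't overall), and the symbols H and Gain are notation introduced here.

```latex
\begin{align*}
H(D) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \\
H_{\text{Outlook}}(D) &= \tfrac{5}{14}\,H(2,3) + \tfrac{4}{14}\,H(4,0) + \tfrac{5}{14}\,H(3,2)
  \approx \tfrac{5}{14}(0.971) + 0 + \tfrac{5}{14}(0.971) \approx 0.694 \\
\mathrm{Gain}(\text{Outlook}) &\approx 0.940 - 0.694 = 0.246 \\
H_{\text{Temp}\le 70}(D) &= \tfrac{5}{14}\,H(4,1) + \tfrac{9}{14}\,H(5,4)
  \approx \tfrac{5}{14}(0.722) + \tfrac{9}{14}(0.991) \approx 0.895 \\
\mathrm{Gain}(\text{Temp}\le 70) &\approx 0.940 - 0.895 = 0.045
\end{align*}
```

Under this measure the Outlook split is clearly preferable, mainly because its overcast partition is pure.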

Constructing Decision Tree Example :-
Decision on Buying a Computer / customer likely to
purchase a computer

Constructing Decision Tree Example :-
Decision on Buying a Computer
• The following decision tree is for the concept buys_computer. It indicates
whether a customer at a company is likely to buy a computer.
• Each internal node represents a test on an attribute.
• Each leaf node represents a class.

Constructing Decision Tree :- Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
This follows an example of Quinlan’s ID3 (Playing Tennis).

Constructing Decision Tree :- Output: A Decision Tree for “buys_computer”
age?
  <=30 → student?
    no → no
    yes → yes
  31..40 → yes
  >40 → credit rating?
    excellent → no
    fair → yes
• Calculating the entropy for each attribute of the training dataset indicates that the splitting attribute at the root is age.

A Decision Tree for “buys_computer”

From the training data set, age = youth splits into 2 classes based on the student attribute.

Based on majority voting on the student attribute, RID = 3 is grouped under the yes group.

From the training data set, age = senior splits into 2 classes based on credit rating.

Final Decision Tree

Classification by Decision Tree

• A typical decision tree represents the concept buys_computer;
that is, it predicts whether a customer at
AllElectronics is likely to purchase a computer.
• Internal nodes are denoted by rectangles, and leaf nodes are
denoted by ovals.
• Some decision tree algorithms produce only binary trees
(where each internal node branches to exactly two other
nodes), whereas others can produce nonbinary trees.

• “How are decision trees used for classification?”
• Given a tuple, X, for which the associated class label is
unknown, the attribute values of the tuple are tested against
the decision tree.
• A path is traced from the root to a leaf node, which holds the
class prediction for that tuple.
• Decision trees can easily be converted to classification rules.

Why are decision tree classifiers so popular?
• The construction of decision tree classifiers does not require any domain
knowledge or parameter setting, and therefore is appropriate for
exploratory knowledge discovery.
• Decision trees can handle high dimensional data.
• Their representation of acquired knowledge in tree form is intuitive and
generally easy to assimilate by humans.
• The learning and classification steps of decision tree induction are simple
and fast.
• In general, decision tree classifiers have good accuracy.

Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
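Because one rule is created per root-to-leaf path, rule extraction is a simple tree walk. The sketch below assumes the buys_computer tree is stored as a nested dictionary (an illustrative representation); the function name `extract_rules` and the `31..40` spelling are likewise assumptions.

```python
def extract_rules(tree, target, conditions=()):
    """Yield one IF-THEN rule for each path from the root to a leaf."""
    if not isinstance(tree, dict):                 # leaf: emit the accumulated rule
        clause = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
        yield f'IF {clause} THEN {target} = "{tree}"'
        return
    (attribute, branches), = tree.items()          # each node tests one attribute
    for value, subtree in branches.items():
        yield from extract_rules(subtree, target, conditions + ((attribute, value),))


tree = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}
rules = list(extract_rules(tree, "buys_computer"))
for rule in rules:
    print(rule)
```

The tree has five leaves, so five rules are produced, each a conjunction of the attribute-value pairs along its path.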

Training Set and Its AVC Sets
Training Examples: the 14 buys_computer training tuples (age, income, student, credit_rating, buys_computer).

AVC-set on Age
Age      Buy_Computer = yes  Buy_Computer = no
<=30     2                   3
31..40   4                   0
>40      3                   2

AVC-set on income
income   Buy_Computer = yes  Buy_Computer = no
high     2                   2
medium   4                   2
low      3                   1

AVC-set on Student
student  Buy_Computer = yes  Buy_Computer = no
yes      6                   1
no       3                   4

AVC-set on credit_rating
Credit rating  Buy_Computer = yes  Buy_Computer = no
fair           6                   2
excellent      3                   3
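An AVC-set (attribute value, class label counts) holds everything needed to score a split on that attribute. As a hedged sketch, the code below computes information gain directly from the (yes, no) counts, tallied from the 14 training tuples; information gain as the scoring measure and the names `avc_sets` and `gain` are assumptions made for illustration.

```python
import math


def entropy(counts):
    """Entropy (in bits) of a tuple of class counts; zero counts contribute nothing."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


# (yes, no) counts per attribute value, tallied from the 14 training tuples
avc_sets = {
    "age": {"<=30": (2, 3), "31..40": (4, 0), ">40": (3, 2)},
    "income": {"high": (2, 2), "medium": (4, 2), "low": (3, 1)},
    "student": {"yes": (6, 1), "no": (3, 4)},
    "credit_rating": {"fair": (6, 2), "excellent": (3, 3)},
}


def gain(avc):
    """Information gain of splitting on the attribute described by one AVC-set."""
    n = sum(sum(counts) for counts in avc.values())
    class_totals = [sum(column) for column in zip(*avc.values())]
    remainder = sum(sum(counts) / n * entropy(counts) for counts in avc.values())
    return entropy(class_totals) - remainder


gains = {attribute: gain(avc) for attribute, avc in avc_sets.items()}
best = max(gains, key=gains.get)
print(best, {a: round(g, 3) for a, g in gains.items()})
```

No pass over the raw tuples is needed once the counts exist, which is the point of keeping AVC-sets.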

Topic 6: K-NN CLASSIFIER

K-NN CLASSIFIER

K-NN CLASSIFIER
• K-Nearest Neighbour is one of the simplest machine learning
algorithms based on the supervised learning technique.
• The K-NN algorithm assumes similarity between the new case/data
and the available cases and puts the new case into the category
whose cases it is most similar to.
• The K-NN algorithm stores all the available data and classifies a new data
point based on similarity.
• This means that when new data appears, it can be easily classified into a
well-suited category by using the K-NN algorithm.

K-NN CLASSIFIER
• KNN (K Nearest Neighbors) is a supervised ML classification algorithm.
• It is one of the simplest and most widely used classification algorithms, in
which a new data point is classified based on its similarity to a
specific group of neighboring data points.
• It often gives competitive results.

K-NN CLASSIFIER EXAMPLE
• Example: Suppose we have an image of a creature that looks similar
to both a cat and a dog,
• but we want to know whether it is a cat or a dog. For this identification, we can
use the KNN algorithm, as it works on a similarity measure.
• Our KNN model will compare the features of the new image with
those of the cat and dog images and, based on the most similar features,
put it in either the cat or the dog category.

K-NN CLASSIFIER EXAMPLE

K Nearest Neighbor Classification

INTRODUCTION TO K-NN CLASSIFIER
• K nearest neighbors is a simple algorithm that stores
all available cases and classifies new cases based on a similarity measure (e.g.,
distance functions).
• K represents the number of nearest neighbors.
• It classifies an unknown example with the most common class
among the k closest examples.
• KNN is based on
• “tell me who your neighbors are, and I’ll tell you who you are”

INTRODUCTION TO K-NN CLASSIFIER :- Example
If K = 5, then in this case query instance xq will be classified
as negative since three of its nearest neighbors are classified
as negative.

Different Schemes of KNN
• 1-Nearest Neighbor
• K-Nearest Neighbor using a majority voting scheme
• K-NN using a weighted-sum voting Scheme
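The three schemes differ only in how the neighbors' labels are combined. The sketch below contrasts them on a hypothetical list of five neighbors; the inverse-distance weighting is one common choice assumed here, since the slide does not say which weights are meant.

```python
from collections import Counter, defaultdict


def majority_vote(neighbors):
    """neighbors: list of (distance, label) pairs for the k nearest examples."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def weighted_vote(neighbors, eps=1e-9):
    """Each neighbor votes with weight 1/distance (assumed weighting scheme)."""
    scores = defaultdict(float)
    for distance, label in neighbors:
        scores[label] += 1.0 / (distance + eps)
    return max(scores, key=scores.get)


# Hypothetical neighbors of a query point, sorted by distance
neighbors = [(0.5, "+"), (0.6, "+"), (2.0, "-"), (2.5, "-"), (3.0, "-")]

print(majority_vote(neighbors[:1]))   # 1-nearest neighbor
print(majority_vote(neighbors))       # k = 5, majority voting
print(weighted_vote(neighbors))       # k = 5, weighted-sum voting
```

With these values, majority voting follows the three distant "-" neighbors, while weighted voting follows the two much closer "+" neighbors.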

Different Schemes of KNN

kNN: How to Choose k?
• In theory, if an infinite number of samples is available, the larger k is, the
better the classification
• The limitation is that all k neighbors have to be close
• Possible when an infinite number of samples is available
• Impossible in practice since the number of samples is finite
• k = 1 is often used for efficiency, but it is sensitive to “noise”

• Larger k gives smoother boundaries, which is better for generalization, but only
if locality is preserved. Locality is not preserved if we end up looking at
samples that are too far away and not from the same class.
• Interesting theoretical properties hold if k < sqrt(n), where n is the number of examples.
• Find a heuristically optimal number k of nearest
neighbors, based on RMSE (root-mean-square error).
This is done using cross-validation.
• Cross-validation is a way to retrospectively determine a good K value by using an independent
dataset to validate the K value. Historically, the optimal K for most datasets has been between 3 and 10,
which produces much better results than 1-NN.

Distance Measure in KNN
• Three distance measures are valid for continuous variables.

Distance Measure in KNN
• It should also be noted that all three distance measures are valid only for continuous variables. In the case of categorical variables, the Hamming distance must be used.
• This also brings up the issue of standardizing the numerical variables to between 0 and 1 when there is a
mixture of numerical and categorical variables in the dataset.
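The figure naming the three measures did not survive in this text. The sketch below assumes they are the usual Euclidean, Manhattan, and Minkowski distances, and adds the Hamming distance for categorical variables; the function names and sample points are illustrative.

```python
def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, q):
    """Generalization: q = 1 gives Manhattan, q = 2 gives Euclidean."""
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1 / q)


def hamming(a, b):
    """Number of positions at which two categorical vectors differ."""
    return sum(x != y for x, y in zip(a, b))


print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
print(minkowski((0, 0), (3, 4), 2))
print(hamming(("red", "small"), ("red", "large")))
```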

Simple KNN - Algorithm:
• For each training example, add the example to the list training_examples.
• Given a query instance xq to be classified,
• Let x1, x2, …, xk denote the k instances from training_examples that are nearest to xq.
• Return the most common class among these k instances.
• Steps:
1. Determine the parameter k = number of nearest neighbors
2. Calculate the distance between the query instance and all the training samples.
3. Sort the distances and determine the nearest neighbors based on the k-th minimum distance
4. Gather the categories of the nearest neighbors
5. Use the simple majority of the categories of the nearest neighbors as the prediction value of the query
instance.
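The five steps map almost line for line onto code. The following is a minimal sketch using Euclidean distance (`math.dist`, Python 3.8+); the function name `knn_classify` and the six 2-D training points are hypothetical.

```python
import math
from collections import Counter


def knn_classify(training_examples, xq, k):
    """training_examples: list of (feature_vector, label); xq: query vector."""
    # Step 2: distance between the query instance and every training sample
    distances = [(math.dist(x, xq), label) for x, label in training_examples]
    # Step 3: sort the distances and keep the k smallest
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4 and 5: gather the neighbors' categories and take the simple majority
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training data with two categories
training_examples = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
                     ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]

print(knn_classify(training_examples, (2, 2), k=3))      # Step 1: k = 3
print(knn_classify(training_examples, (6.5, 6.5), k=3))
```

There is no training step: the whole training list is scanned on every query.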

Simple KNN - Algorithm:
• The K-NN algorithm can be used for regression as well as classification, but it is mostly
used for classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead, it stores the dataset and, at the time of
classification, performs an action on the dataset.
• At the training phase, the KNN algorithm just stores the dataset; when it gets new
data, it classifies that data into the category that is most similar to the new
data.

Simple KNN – Algorithm Example
• Example:
• Consider the following data concerning credit default. Age and Loan are two numerical variables
(predictors) and Default is the target.

• Given Training Data set :

• Data to classify:
• Classify an unknown case (Age = 48 and Loan = $142,000) using Euclidean distance.
• Step 1: Determine parameter k
• K = 3
• Step 2: Calculate the distance
• For example, for the training case Age = 33, Loan = $150,000, Default = Y:
D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01

• Step 3: Sort the distances (refer to the above diagram) and mark the ranks up to the k-th, i.e., 1 to 3.
• Step 4: Gather the category of the nearest neighbors
Age  Loan      Default  Distance
33   $150,000  Y        8000
35   $120,000  N        22000
60   $100,000  Y        42000
• With K = 3, there are two Default = Y and one Default = N among the three closest neighbors.
• The prediction for the unknown case is Default = Y.
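For reference, the three tabulated distances from the unknown case (Age = 48, Loan = 142000) can be checked by hand; the subscripts 1 to 3 are labels introduced here for the three neighbors in rank order.

```latex
\begin{align*}
d_1 &= \sqrt{(48-33)^2 + (142000-150000)^2} = \sqrt{225 + 64{,}000{,}000} \approx 8000.01 \\
d_2 &= \sqrt{(48-35)^2 + (142000-120000)^2} = \sqrt{169 + 484{,}000{,}000} \approx 22000.00 \\
d_3 &= \sqrt{(48-60)^2 + (142000-100000)^2} = \sqrt{144 + 1{,}764{,}000{,}000} \approx 42000.00
\end{align*}
```

The age term contributes almost nothing to any of the three values, so the ranking is decided by the loan amount alone.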

Standardized Distance ( Feature Normalization)
• One major drawback in calculating distance measures directly from the training set is in
the case where variables have different measurement scales or there is a mixture of
numerical and categorical variables.
• For example, if one variable is based on annual income in dollars, and the other is based
on age in years then income will have a much higher influence on the distance calculated.
One solution is to standardize the training set as shown below.

Standardized Distance (Feature Normalization)
• Xs = (X - Min) / (Max - Min)
• For example, for loan, X = $40,000:
• Xs = (40000 - 20000) / (220000 - 20000) = 0.10
• In the same way, calculate the standardized values for the age and loan attributes, then
apply the KNN algorithm.
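The standardization step can be sketched as a min-max rescaling of one column at a time. The list of loan amounts below is hypothetical apart from the minimum (20000), the maximum (220000), and the example value (40000) quoted in the slide; the function name is likewise an assumption.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical loan column containing the slide's minimum, maximum, and example value
loans = [20000, 40000, 142000, 220000]
scaled = min_max_scale(loans)
print([round(s, 2) for s in scaled])
```

After every numeric column is rescaled this way, age and loan differences contribute to the distance on a comparable scale.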

Simple KNN – Algorithm
• Advantages
• Can be applied to data from any distribution
• for example, the data does not have to be separable with a linear boundary
• Very simple and intuitive
• Good classification if the number of samples is large enough
• Disadvantages
• Choosing k may be tricky
• The test stage is computationally expensive
• There is no training stage; all the work is done during the test stage
• This is actually the opposite of what we want: usually we can afford a training step that takes a long time, but we
want a fast test step
• Needs a large number of samples for accuracy

How does K-NN work?
• The working of K-NN can be explained on the basis of the algorithm below:
• Step-1: Select the number K of neighbors.
• Step-2: Calculate the Euclidean distance from the new data point to each training data point.
• Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
• Step-4: Among these K neighbors, count the number of data points in
each category.
• Step-5: Assign the new data point to the category for which the number
of neighbors is maximum.
• Step-6: Our model is ready.

How does K-NN work?
• Suppose we have a new data point and we need to put it in
the required category. Consider the below image:

How does K-NN work?
• First, we choose the number of neighbors; here we choose k = 5.
• Next, we calculate the Euclidean distance between the data points. The Euclidean distance is the
distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2), it can be calculated as:
d = Sqrt[(x2-x1)^2 + (y2-y1)^2]

How does K-NN work?
• By calculating the Euclidean distance, we get the nearest neighbors:
three nearest neighbors in category A and two nearest neighbors in
category B, so the new data point is assigned to category A. Consider the below image:

Why do we need a K-NN Algorithm?
• Suppose there are two categories, i.e., Category A
and Category B, and we have a new data point x1. In which
of these categories will this data point lie?
• To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily
identify the category or class of a particular data point.

Why do we need a K-NN Algorithm?
•Consider the below diagram:

Topic 7: SVM CLASSIFIER

SUPPORT VECTOR MACHINE (SVM)

INTRODUCTION TO SVM
• A new classification method for both linear and nonlinear data
• It uses a nonlinear mapping to transform the original training data into a
higher dimension
• With the new dimension, it searches for the linear optimal separating
hyperplane (i.e., “decision boundary”)
• With an appropriate nonlinear mapping to a sufficiently high dimension, data
from two classes can always be separated by a hyperplane
• SVM finds this hyperplane using support vectors (“essential” training tuples)
and margins (defined by the support vectors)

INTRODUCTION TO SVM
• A support vector machine (SVM) is a supervised machine learning
model that uses classification algorithms.
• It is mostly used for classification but is sometimes very useful for
regression as well.
• Basically, SVM finds a hyperplane that creates a boundary between the
types of data.
• In 2-dimensional space, this hyperplane is nothing but a line.

SVM—History and Applications
• Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis’
statistical learning theory in 1960s
• Features: training can be slow but accuracy is high owing to their ability to model
complex nonlinear decision boundaries (margin maximization)
• Used both for classification and prediction
• Applications:
• handwritten digit recognition, object recognition, speaker identification, benchmarking
time-series prediction tests

SVM—General Philosophy

SVM—Margins and Support Vectors

INTRODUCTION TO SVM
• In SVM, we plot each data item in the dataset in an N-dimensional
space, where N is the number of features/attributes
in the data.
• Next, we find the optimal hyperplane to separate the data.
• It follows that, inherently, SVM can
only perform binary classification (i.e., choose between two
classes).
• However, there are various techniques to use for multi-class problems.

Support Vector Machine for Multi-class Problems
• To perform SVM on multi-class problems, we can create a binary classifier for
each class of the data.
• The two results of each classifier will be:
• The data point belongs to that class, OR
• The data point does not belong to that class.
• For example, in a class of fruits, to perform multi-class classification, we can
create a binary classifier for each fruit.
• For, say, the ‘mango’ class,
• there will be a binary classifier to predict if it IS a mango OR it is NOT a mango.
• The classifier with the highest score is chosen as the output of the SVM.
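This one-classifier-per-class scheme can be sketched as follows. Each binary classifier is represented here by an already-trained linear scoring function w · x + b; the weights, the fruit labels other than mango, and the feature vector are hypothetical values chosen only to show the selection step.

```python
def one_vs_rest_predict(classifiers, x):
    """Score x with every binary classifier and return the highest-scoring class."""
    scores = {
        label: sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for label, (w, b) in classifiers.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical trained classifiers: label -> (weight vector, bias)
classifiers = {
    "mango": ((1.0, -1.0), 0.0),    # "is a mango" vs. "is not a mango"
    "apple": ((-1.0, 1.0), 0.0),
    "banana": ((0.5, 0.5), -2.0),
}

label, scores = one_vs_rest_predict(classifiers, (2.0, 0.5))
print(label, scores)
```

Comparing raw scores rather than the yes/no outputs resolves cases where several classifiers, or none, claim the point.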

SVM—Linearly Separable
• A separating hyperplane can be written as
• W · X + b = 0
• where W = {w1, w2, …, wn} is a weight vector and b is a scalar (bias)
• For 2-D, with the bias b written as w0, it can be written as
• w0 + w1 x1 + w2 x2 = 0
• The hyperplanes defining the sides of the margin:
• H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
• H2: w0 + w1 x1 + w2 x2 ≤ –1 for yi = –1
• Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
• This becomes a constrained (convex) quadratic optimization problem: quadratic objective
function and linear constraints → Quadratic Programming (QP) → Lagrangian multipliers
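To make the margin conditions concrete, the sketch below takes a hypothetical, already-found 2-D hyperplane (w = (1, 1), b = w0 = -3) and four hypothetical training tuples, classifies each tuple by the sign of W · X + b, and picks out the tuples lying exactly on H1 or H2. The margin width 2 / ||W|| is a standard result assumed here; it is not stated on the slide.

```python
import math


def decision_value(w, b, x):
    """Value of W . X + b for a point x."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


def classify(w, b, x):
    """+1 on the positive side of the hyperplane, -1 on the negative side."""
    return +1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical separating hyperplane and labeled training tuples (x, y)
w, b = (1.0, 1.0), -3.0
points = [((3.0, 1.0), +1), ((1.0, 1.0), -1), ((4.0, 3.0), +1), ((0.0, 0.0), -1)]

# Tuples with y * (W . X + b) = 1 fall on H1 or H2 and are the support vectors
support_vectors = [x for x, y in points
                   if math.isclose(y * decision_value(w, b, x), 1.0)]
margin = 2 / math.hypot(*w)   # distance between H1 and H2

print(support_vectors, round(margin, 4))
```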

SVM—When Data Is Linearly Separable
• Let data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple and yi is its associated
class label
• There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the
best one (the one that minimizes classification error on unseen data)
• SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal
hyperplane (MMH)

SVM for complex (Non Linearly Separable)
• SVM works very well without any modifications
for linearly separable data.
• Linearly separable data is any data that can be plotted in a graph and separated into
classes using a straight line.
[Figure: A: Linearly Separable Data; B: Non-Linearly Separable Data]

SVM CLASSIFIER

SVM CLASSIFIER
• A vector space method for binary classification problems
• Documents are represented in a t-dimensional space
• Find a decision surface (hyperplane) that best separates the
documents of the two classes
• A new document is classified by its
position relative to the hyperplane
• Simple 2D example: training documents linearly separable

SVM CLASSIFIER
• Simple 2D example: training documents linearly separable

SVM CLASSIFIER
• Line s—The Decision Hyperplane
• maximizes distances to closest docs of each class
• it is the best separating hyperplane
• Delimiting Hyperplanes
• parallel dashed lines that delimit region where to look for a
solution

SVM CLASSIFIER
• Lines that cross the delimiting hyperplanes.
• candidates to be selected as the decision hyperplane
• lines that are parallel to delimiting hyperplanes: best candidates

SVM CLASSIFIER
• Support vectors: documents that belong to, and define, the delimiting
hyperplanes
• Our example in a 2-dimensional system of coordinates

SVM vs. Neural Network
• SVM
1) Relatively new concept
2) Deterministic algorithm
3) Nice generalization properties
4) Hard to learn: learned in batch mode using quadratic programming techniques
5) Using kernels, can learn very complex functions
• Neural Network
1) Relatively old
2) Nondeterministic algorithm
3) Generalizes well but doesn’t have a strong mathematical foundation
4) Can easily be learned in incremental fashion
5) To learn complex functions, use a multilayer perceptron (not that trivial)

Topic 8: FEATURE SELECTION OR DIMENSIONALITY REDUCTION


FEATURE SELECTION OR DIMENSIONALITY REDUCTION

• Feature selection and dimensionality reduction allow us to
minimize the number of features in a dataset by keeping only the features that are
important.
• In other words, we want to retain the features that contain the
most useful information needed by our model to
make accurate predictions, while discarding redundant features that contain
little to no information.

• There are several benefits in performing feature selection and
dimensionality reduction, which include:
• model interpretability,
• minimizing overfitting, and
• reducing the size of the training set and consequently the training time.

• The number of input variables or features for a dataset is referred to
as its dimensionality.
• Dimensionality reduction refers to techniques that reduce the number
of input variables in a dataset.
• More input features often make a predictive modeling task more
challenging to model, more generally referred to as the curse of
dimensionality.

• High-dimensionality statistics and dimensionality reduction
techniques are often used for data visualization.
• Nevertheless these techniques can be used in applied machine
learning to
• simplify a classification or regression dataset in order to better fit a predictive
model.

Problem With Many Input Variables
• If your data is represented using rows and columns, such as in a
spreadsheet, then
• the input variables are the columns that are fed as input to a model to predict
the target variable. Input variables are also called features.
• We can consider the columns of data representing dimensions on an
n-dimensional feature space and the rows of data as points in that
space.
• This is a useful geometric interpretation of a dataset.

Problem With Many Input Variables
• Having a large number of dimensions in the feature space can mean that
the volume of that space is very large, and in turn,
• the points that we have in that space (rows of data) often represent a small and non-
representative sample.
• This can dramatically impact the performance of machine learning
algorithms fit on data with many input features, generally referred to as the
“curse of dimensionality.”
• Therefore, it is often desirable to reduce the number of input features. This
reduces the number of dimensions of the feature space, hence the name
“dimensionality reduction.”


• Dimensionality reduction refers to techniques for reducing the
number of input variables in training data.
• When dealing with high dimensional data, it is often useful to reduce
the dimensionality by projecting the data to a lower dimensional
subspace which captures the “essence” of the data. This is called
dimensionality reduction.

• Fewer input dimensions often mean correspondingly fewer
parameters or a simpler structure in the machine learning model,
referred to as degrees of freedom.
• A model with too many degrees of freedom is likely to overfit the
training dataset and therefore may not perform well on new data.
• It is desirable to have simple models that generalize well, and in turn,
input data with few input variables.
• This is particularly true for linear models where the number of inputs
and the degrees of freedom of the model are often closely related.

Techniques for Dimensionality Reduction

Techniques for Dimensionality Reduction
• There are many techniques that can be used for dimensionality
reduction.
• Feature Selection Methods
• Matrix Factorization
• Manifold Learning
• Autoencoder Methods

Feature Selection Methods
• Feature selection is also called variable selection or attribute
selection.
• It is the automatic selection of attributes in your data (such as
columns in tabular data) that are most relevant to the predictive
modeling problem you are working on.
• feature selection…
• is the process of selecting a subset of relevant features for use in model
construction

• Feature selection is different from dimensionality reduction.
• Both methods seek to reduce the number of attributes in the dataset,
• but a dimensionality reduction method does so by creating new combinations of
attributes,
• whereas feature selection methods include and exclude attributes present in
the data without changing them.

• Examples of dimensionality reduction methods include
• Principal Component Analysis,
• Singular Value Decomposition and
• Sammon’s Mapping.
• Feature selection is itself useful, but it mostly acts as a filter, muting
out features that aren’t useful in addition to your existing features.

Feature Selection Algorithms

Filter Methods
• Filter feature selection methods apply a statistical measure to assign a
score to each feature.
• The features are ranked by score and either kept in or
removed from the dataset.
• The methods are often univariate and consider each feature
independently, or with regard to the dependent variable.
• Examples of filter methods include the chi-squared test,
information gain, and correlation coefficient scores.
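As an illustration of a filter method, the sketch below scores two binary features with the chi-squared statistic for a 2x2 contingency table and ranks them. The counts reuse the student and credit_rating class counts from the buys_computer data in this unit; the function name and the closed-form 2x2 formula (without continuity correction) are choices made for this example.

```python
def chi_squared_2x2(a, b, c, d):
    """Chi-squared statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Rows: feature value; columns: buys_computer = yes / no
tables = {
    "student": (6, 1, 3, 4),          # yes: 6/1, no: 3/4
    "credit_rating": (6, 2, 3, 3),    # fair: 6/2, excellent: 3/3
}

scores = {feature: chi_squared_2x2(*table) for feature, table in tables.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, {f: round(s, 3) for f, s in scores.items()})
```

Each feature is scored on its own against the class, which is what makes the method univariate; a threshold or a top-k cut on the ranking then decides which features are kept.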

Wrapper Methods
• Wrapper methods consider the selection of a set of features as a search problem,
where different combinations are prepared, evaluated, and compared to other
combinations.
• A predictive model is used to evaluate a combination of features and assign a
score based on model accuracy.
• The search process may be
• methodical, such as a best-first search,
• stochastic, such as a random hill-climbing algorithm, or
• heuristic, such as forward and backward passes to add and remove features.
• An example of a wrapper method is the recursive feature elimination algorithm.

Embedded Methods
• Embedded methods learn which features best contribute to the accuracy of the
model while the model is being created.
• The most common type of embedded feature selection method is
regularization.
• Regularization methods, also called penalization methods, introduce
additional constraints into the optimization of a predictive algorithm (such as a regression
algorithm) that bias the model toward lower complexity (fewer coefficients).
• Examples of regularization algorithms are
LASSO, Elastic Net and Ridge Regression.
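For reference, the three penalties are commonly written as below for a least-squares regression. The notation is an assumption of this sketch and does not come from the slides: y is the target vector, X the feature matrix, β the coefficient vector, and λ, λ₁, λ₂ ≥ 0 the penalty weights.

```latex
\begin{aligned}
\text{Ridge:}       \quad & \min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \\
\text{LASSO:}       \quad & \min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \\
\text{Elastic Net:} \quad & \min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2
\end{aligned}
```

The L1 term can drive some coefficients exactly to zero, which is what removes features from the model. The ridge penalty on its own shrinks coefficients toward zero without eliminating them.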

Topic 9: EVALUATION METRICS

EVALUATION METRICS
• Evaluating the Accuracy of a Classifier
• Basic Evaluation Measures for Classifier Performance.
• In bioinformatics, and in machine learning in general, there is large
variation in the measures used to evaluate prediction systems.

A confusion matrix
• For simplicity, the assumption is that each instance can only be assigned
one of two classes:
• Positive or
• Negative
(e.g. a patient's tumor may be malignant or benign).
• Each instance (e.g. a patient) has a Known label, and a Predicted label.
• Some method is used (e.g. cross-validation) to make predictions on each
instance. Each instance then increments one cell in the confusion matrix.
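The counting step can be sketched as follows. The labels and the function name `confusion_matrix` are invented for this illustration, and "malignant" is assumed to be the positive class.

```python
def confusion_matrix(known, predicted, positive="malignant"):
    """Count TP, FN, FP and TN for a two-class problem."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for k, p in zip(known, predicted):
        if k == positive:
            # Known positive: a hit is a TP, a miss is an FN.
            counts["TP" if p == positive else "FN"] += 1
        else:
            # Known negative: a false alarm is an FP, otherwise a TN.
            counts["FP" if p == positive else "TN"] += 1
    return counts


# Hypothetical known and predicted labels for six patients.
known     = ["malignant", "malignant", "benign", "benign",    "benign", "malignant"]
predicted = ["malignant", "benign",    "benign", "malignant", "benign", "malignant"]
cells = confusion_matrix(known, predicted)
print(cells)
```

Each instance increments exactly one cell, so the four counts always sum to the number of instances.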

EVALUATION METRICS: A confusion matrix
                          Predicted Label
                          Positive               Negative
Known Label   Positive    True Positive (TP)     False Negative (FN)
              Negative    False Positive (FP)    True Negative (TN)

CONTINGENCY TABLE

Topic 10: ACCURACY AND ERROR

ACCURACY AND ERROR
• Accuracy
• Accuracy is the proportion of the time that the predicted class
equals the actual class, usually expressed as a percentage.
• Its meaning is straightforward, but it may obscure important
differences in the costs associated with different errors.
• The classic example of such costs is the medical diagnostic situation,
in which one can err by either:
• 1. keeping a healthy patient in the hospital (low cost), or
• 2. sending home a sick patient (very high cost).
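A small sketch makes the point concrete. The two classifiers, their confusion-matrix counts and the cost values below are all hypothetical, chosen only so that both classifiers reach the same accuracy.

```python
def accuracy(m):
    """Fraction of correct predictions from confusion-matrix counts."""
    return (m["tp"] + m["tn"]) / sum(m.values())


def total_cost(m, cost_fn, cost_fp):
    """Total misclassification cost given a per-error cost."""
    return m["fn"] * cost_fn + m["fp"] * cost_fp


COST_FP = 1   # assumed cost of keeping a healthy patient in the hospital
COST_FN = 50  # assumed cost of sending a sick patient home

# Two hypothetical classifiers evaluated on the same 100 patients.
cautious = {"tp": 48, "fn": 2, "fp": 18, "tn": 32}
lenient = {"tp": 32, "fn": 18, "fp": 2, "tn": 48}

for name, m in (("cautious", cautious), ("lenient", lenient)):
    print(name, accuracy(m), total_cost(m, COST_FN, COST_FP))
```

Both classifiers are 80% accurate, yet under these assumed costs the lenient one is far more expensive because it sends more sick patients home.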

ACCURACY AND ERROR
• Classifiers that output class probabilities need to be checked for both the accuracy of
their probabilities (do cases predicted to have a 5% (30%, 80%, etc.)
probability really belong to the target class 5% (30%, 80%, etc.) of
the time?) and their ability to separate the classes in question.
• The accuracy of the probabilities can be measured using many of the same metrics used to
evaluate numerical models (MSE, MAE, etc.).

Accuracy and Error Measures
• All models must be assessed somehow.
• Despite the existence of a bewildering array of performance
measures, much commercial modeling software provides a
surprisingly limited range of options.

Mean Squared Error (MSE)
• Mean Squared Error (MSE) is by far the most common measure of
numerical model performance.
• It is simply the average of the squares of the differences between
the predicted and actual values.
• It is a reasonably good measure of performance, though it could be
argued that it overemphasizes the importance of larger errors.
• Many modeling procedures directly minimize it.
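The emphasis on larger errors is easy to see in a short sketch. The values are invented for illustration: both prediction sets have the same total absolute error, but different MSE.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between predicted and actual values."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual     = [10.0, 10.0, 10.0, 10.0]
many_small = [11.0,  9.0, 11.0,  9.0]  # four errors of size 1
one_large  = [10.0, 10.0, 10.0, 14.0]  # one error of size 4

mse_small = mean_squared_error(actual, many_small)
mse_large = mean_squared_error(actual, one_large)
print(mse_small, mse_large)
```

Squaring makes the single error of 4 count four times as much as the four errors of 1 combined.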

Mean Absolute Error (MAE)
• Mean Absolute Error (MAE) is similar to the Mean Squared Error, but it
uses absolute values instead of squaring.
• This measure is not as popular as MSE, though its meaning is more
intuitive.
• Bias is the average of the differences between the predicted and actual
values.
• With this measure, positive errors cancel out negative ones.
• Bias is intended to assess how much higher or lower predictions are, on
average, than actual values.
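The difference between MAE and bias can be sketched with a few invented values. Bias is taken here as predicted minus actual, so a positive result would mean the predictions run high on average.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the errors, ignoring their sign."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def bias(actual, predicted):
    """Average signed error: positive and negative errors cancel out."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


actual    = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]  # errors: +2, -2, +3, -3

mae = mean_absolute_error(actual, predicted)
b = bias(actual, predicted)
print(mae, b)
```

A bias of zero therefore does not mean the predictions are accurate, only that over- and under-predictions balance.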

Mean Absolute Percent Error (MAPE)
• Mean Absolute Percent Error (MAPE) is the average of the absolute
errors, as a percentage of the actual values.
• This is a relative measure of error, which is useful when larger errors
are more acceptable on larger actual values.
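A minimal sketch of MAPE, using invented values, shows how the same absolute error weighs differently against small and large actual values. Note that the measure is undefined whenever an actual value is zero.

```python
def mean_absolute_percent_error(actual, predicted):
    """Average absolute error as a percentage of the actual values.
    Assumes no actual value is zero."""
    ratios = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


actual    = [10.0, 1000.0]
predicted = [15.0, 1005.0]  # both predictions are off by 5

mape = mean_absolute_percent_error(actual, predicted)
print(mape)
```

The error of 5 on an actual value of 10 contributes 50%, while the same error on 1000 contributes only 0.5%.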

Classification Accuracy: Estimating Error Rates
• Partition: Training-and-testing
o Use two independent data sets, e.g., training set (2/3), test set (1/3)
o Used for data sets with a large number of samples
• Cross-validation
o Divide the data set into k subsamples
o Use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation)
o For data sets of moderate size
• Bootstrapping, or leave-one-out cross-validation
o For small data sets
• Confusion Matrix:
o This matrix shows how well the classifier predicts each of the different classes
o It records the actual class against the detected class for every test instance
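The k-fold scheme listed above can be sketched as an index generator. This is an illustration only: the function name `k_fold_splits` is an assumption, and the indices are split in order, whereas in practice the data would normally be shuffled first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Every sample is used for testing exactly once and for training k-1 times. Setting k equal to the number of samples gives leave-one-out cross-validation.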

Classification Accuracy: Estimating Error Rates
                     Detected
                     Positive             Negative
Actual   Positive    A: True Positive     B: False Negative
         Negative    C: False Positive    D: True Negative
• Recall (the true positive rate) and precision (the positive predictive value) can
be derived from the confusion matrix as follows:
o Recall = A / (A + B)
o Precision = A / (A + C)

Classifier Accuracy Measures
• Accuracy of a classifier M, acc(M):
• percentage of test set tuples that are correctly classified by the model M
• Error rate (misclassification rate) of M = 1 – acc(M)
• Given m classes, CM(i, j), an entry in a confusion matrix, indicates the number of tuples in class i that are
labeled by the classifier as class j
classes               buy_computer = yes   buy_computer = no   total    recognition (%)
buy_computer = yes    6954                 46                  7000     99.34
buy_computer = no     412                  2588                3000     86.27
total                 7366                 2634                10000    95.42

        C1                C2
C1      True positive     False negative
C2      False positive    True negative

Classifier Accuracy Measures
• Alternative accuracy measures (e.g., for cancer diagnosis)
sensitivity = t-pos/pos /* true positive recognition rate */
specificity = t-neg/neg /* true negative recognition rate */
precision = t-pos/(t-pos + f-pos)
accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
• This model can also be used for cost-benefit analysis
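These formulas can be checked against the cell counts of the buy_computer confusion matrix. The sketch below assumes buy_computer = yes is the positive class and uses variable names chosen for this example.

```python
# Cell counts from the buy_computer confusion matrix.
t_pos, f_neg = 6954, 46    # actual buy_computer = yes
f_pos, t_neg = 412, 2588   # actual buy_computer = no

pos = t_pos + f_neg        # 7000 positive tuples
neg = f_pos + t_neg        # 3000 negative tuples

sensitivity = t_pos / pos                # true positive recognition rate
specificity = t_neg / neg                # true negative recognition rate
precision = t_pos / (t_pos + f_pos)

# Accuracy as the class-weighted average of sensitivity and specificity ...
accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
# ... which equals the directly counted fraction of correct tuples.
direct = (t_pos + t_neg) / (pos + neg)

print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
print(f"precision   = {precision:.4f}")
print(f"accuracy    = {accuracy:.4f}")
```

Sensitivity is the same quantity as recall. The overall accuracy is pulled toward the sensitivity because 70% of the tuples are positive.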

Topic 11: ORGANIZING THE CLASSES

ORGANIZING THE CLASSES TAXONOMIES

Topic 12: INDEXING AND SEARCHING

Topic 13: INVERTED INDEXES

FULL INVERTED INDEXES

Topic 14: SEQUENTIAL SEARCHING

SEARCHING

Topic 15: MULTI-DIMENSIONAL INDEXING

MULTI-DIMENSIONAL SEARCH
