SlideShare a Scribd company logo
P1WU
UNIT – III: CLASSIFICATION
Topic 1: A CHARACTERIZATION OF TEXT
CLASSIFICATION
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III
1.A Characterization of
Text Classification
2. Unsupervised
Algorithms: Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or
Dimensionality Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional
Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO CLASSIFICATION
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO CLASSIFICATION
• Scientists became very serious about addressing the question:
• “Can we build a model that learns from available data and
automatically makes the right decisions and predictions?”
• Answer can be found in numerous applications that are emerging
from the fields of
1. pattern classification,
2. machine learning, and
3. artificial intelligence.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO CLASSIFICATION
• Data from various sensoring devices combined with powerful
learning algorithms and domain knowledge led to :
• many great inventions that we now take for granted in our
everyday life:
• Internet queries via search engines like Google,
• text recognition at the post office,
• barcode scanners at the supermarket, the diagnosis of diseases,
• speech recognition by Siri or
• Google Now on our mobile phone, just to name a few.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO CLASSIFICATION
• Classification is:
• the data mining process of
• finding a model (or function) that
• describes and distinguishes data classes or concepts,
• for the purpose of being able to use the model to predict the class of objects
whose class label is unknown.
• That is, predicts categorical class labels (discrete or nominal).
• Classifies the data (constructs a model) based on the training set.
• It predict group membership for data instances.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO CLASSIFICATION
What is CLASSIFICATION?
• Classification and prediction are :
• two forms of data analysis that can used to extract models describing
important data classes or to predict the future data trends.
• C & P help us to provide a better understanding of large data.
• Classification predicts categorical (discrete, unordered) labels.
• Prediction models continuous valued functions.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO CLASSIFICATION
• How can we classify?
• The trick here is Machine Learning which requires us to make classifications based on past
observations (the learning part).
• We give the machine a set of data having texts with labels tagged to it and then we let the model
to learn on all these data which will later give us some useful insight on the categories of text
input we feed.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Applications of Classification
• Classification of (potential) customers for:
• Credit approval, risk prediction, selective marketing
• Performance prediction based on
• selected indicators
• Medical diagnosis based on symptoms or reactions to Therapy
• Application areas:
• Credit approval
• Target marketing
• Medical diagnosis
• Treatment effectiveness analysis
• Performance prediction
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
When is classification needed?
• Scenarios:
• In each of these examples, the data analysis task is classification,
• where a model or classifier is constructed to predict categorical labels, such as
• “safe” or “risky” for the loan application data;
• “yes” or “no” for the marketing data; or
• “treatment A,” “treatment B,” or “treatment C” for the medical data.
• These categories can be represented by discrete values, where the ordering among values
has no meaning.
• For example,
• the values 1, 2, and 3 may be used to represent treatments A, B, and C,
• where there is no ordering implied among this group of treatment regimes.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO CLASSIFICATION
Aim: predict categorical class labels
for new tuples/samples
Input: a training set of tuples/samples,
each with a class label
Output: a model (a classifier) based on
the training set and the class labels
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Why Classification?
• A classical problem extensively studied by
• statisticians and machine learning researchers
• Predicts categorical class labels.
• Produces a model (classifier).
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Typical Applications of Classification
• Example:
• {credit history, salary} credit approval ( Yes/No)
• {Temp, Humidity}  Rain (Yes/No)
• A set of documents  sports, technology, etc.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
• Another Example:
• If x >= 90 then grade =A.
• If 80<=x<90 then grade =B.
• If 70<=x<80 then grade =C.
• If 60<=x<70 then grade =D.
• If x<50 then grade =F.
WHAT ARE TEXT CLASSIFICATION?
• Text classification is a machine
learning technique that assigns a
set of predefined categories
to open-ended text.
• Text classifiers can be used to
organize, structure, and categorize
pretty much any kind of text –
from documents, medical studies
and files, and all over the web.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
What is meant by text classification?
• Text classification or Text Categorization
is the activity of labeling natural
language texts with relevant categories
from a predefined set.
• In laymen terms, text classification is a
process of extracting generic tags from
unstructured text.
• These generic tags come from a set of
pre-defined categories.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
What is meant by text classification or Document classification ?
• Document classification or document categorization is
• a problem in library science, information science and
computer science.
• The task is to assign a document to one or more classes or
categories.
• This may be done "manually" or algorithmically.
•Wikipedia
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
What is meant by text classification?
• Text classification also known as text tagging or text
categorization is the process of categorizing text into
organized groups.
• By using Natural Language Processing (NLP), text
classifiers can automatically analyze text and then
assign a set of pre-defined tags or categories based on
its content.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Text Classification Examples
• Text classification is becoming
• an increasingly important part of businesses as it allows to
easily get insights from data and automate business processes.
• Some of the most common examples and use cases for
automatic text classification include the following:
a) Sentiment Analysis
b) Topic Detection
c) Language Detection
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Text Classification Examples
a) Sentiment Analysis: the process of understanding if a given text is
talking positively or negatively about a given subject
(e.g. for brand monitoring purposes).
b) Topic Detection: the task of identifying the theme or topic of a piece
of text
(e.g. know if a product review is about Ease of Use, Customer Support,
or Pricing when analyzing customer feedback).
c) Language Detection: the procedure of detecting the language of a
given text
(e.g. know if an incoming support ticket is written in English or Spanish for
automatically routing tickets to the appropriate team).
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
A Characterization of Text Classification
• For example,
• new articles can be organized by topics;
• support tickets can be organized by urgency;
• chat conversations can be organized by language;
• brand mentions can be organized by sentiment; and so on.
• Text classification is
• one of the fundamental tasks in natural language processing with broad applications such
as sentiment analysis, topic labeling, spam detection, and intent detection.
• Here’s an example of how it works:
• “The user interface is quite straightforward and easy to use.”
• A text classifier can take this phrase as an input, analyze its content, and then automatically
assign relevant tags, such as UI and Easy To Use.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
A Characterization of Text Classification
• First tactic for categorizing documents is to assign a
label to each document,
• but this solve the problem only when the users know the
labels of the documents they looking for.
• This tactic does not solve more generic problem of
finding documents on specific topic or subject.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
A Characterization of Text Classification
• For that case, better solution is to
• group documents by common generic topics and label each group
with a meaningful name.
• Each labeled group is called category or class.
• Document classification is
• the process of categorizing documents under a given cluster or
category using fully supervised learning process.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Why is Text Classification Important?
• It’s estimated that around 80% of all information is unstructured, with text
being one of the most common types of unstructured data.
• Because of the messy nature of text,
• analyzing, understanding, organizing, and sorting through text data is hard and time-consuming, so
most companies fail to use it to its full potential.
• This is where text classification with machine learning comes in.
• Using text classifiers, companies can automatically structure all manner of
relevant text, from
• , legal documents, social media, chatbots, surveys, and more in a fast and cost-effective way.
• This allows companies to
• save time analyzing text data, automate business processes, and make data-driven business
decisions.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Reasons for: Text Classification Important
a) Scalability
• Manually analyzing and organizing is slow and much less accurate..
• Machine learning can automatically analyze millions of surveys, comments, emails,
etc., at a fraction of the cost, often in just a few minutes.
• Text classification tools are scalable to any business needs, large or small.
b) Real-time analysis
• There are critical situations that companies need to identify as soon as possible and
take immediate action (e.g., PR crises on social media).
• Machine learning text classification can follow your brand mentions constantly and in
real time, so you'll identify critical information and be able to take action right away.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Reasons for: Text Classification Important
c) Consistent criteria
• Human annotators make mistakes when classifying text data due to
distractions, fatigue, and boredom, and human subjectivity creates inconsistent
criteria.
• Machine learning, on the other hand, applies the same lens and criteria to all
data and results.
• Once a text classification model is properly trained it performs with
unsurpassed accuracy.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
A Characterization of Text Classification
• Classification could be performed
1. manually by domain experts or
2. automatically using well- known and
• widely used classification algorithms such as decision tree and
Naïve Bayes.
• Documents are classified according to
• other attributes (e.g. author, document type, publishing year
etc.) or according to their subjects.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
A Characterization of Text Classification
• there are two main kind of subject classification of documents:
1. The content based approach and
2. the request based approach.
• In Content based classification,
• the weight that is given to subjects in a document decides the class to which the document is assigned.
• For example, it is a rule in some library classification that at least 15% of the content of a book
should be about the class to which the book is assigned.
• In automatic classification, the number of times given words appears in a document determine the
class.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
A Characterization of Text Classification
• In Request oriented
classification, the anticipated
request from users is impacting
how documents are being
classified.
• The classifier asks himself:
• “Under which description should this
entity be found?” and
• “think of all the possible queries and
decide for which ones the entity at
hand is relevant”.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Text Classification Applications
• With the help of text classification, businesses can make sense of large
amounts of data using techniques like
• aspect-based sentiment analysis to understand what people are talking about
and how they’re talking about each aspect.
• Text classification can help support teams provide a stellar experience
by
• automating tasks that are better left to computers, saving precious time that
can be spent on more important things.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Text Classification Applications
• models can help you analyze survey results to discover patterns and
insights like:
• What do people like about our product or service?
• What should we improve?
• What do we need to change?
• By combining both quantitative results and qualitative analyses,
• teams can make more informed decisions without having to spend hours
manually analyzing every single open-ended response.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Text Classification Applications
• Text classification has thousands of use cases and is applied to a wide range
of tasks.
• In some cases, data classification tools work behind the scenes to enhance
app features we interact with on a daily basis (like email spam filtering).
• In some other cases, classifiers are used by marketers, product managers,
engineers, and salespeople to automate business processes and save
hundreds of hours of manual data processing.
• Some of the top applications and use cases of text classification include:
1. Detecting urgent issues
2. Automating customer support processes
3. Listening to the Voice of customer (VoC)
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
A Characterization of Text Classification
• Automatic document classification tasks can be divided into three
types
1. Unsupervised document classification (document clustering): the
classification must be done totally without reference to external information.
2. Semi-supervised document classification: parts of the documents are labeled
by the external method.
3. Supervised document classification where some external method (such as
human feedback) provides information on the correct classification for
documents
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Computational Supervised Learning
• Computational Supervised Learning is also called classification aimed
to:
• Learn from past experience, and
• use the learned knowledge to classify new data
• Knowledge learned by intelligent algorithms
• Examples:
• Clinical diagnosis for patients
• Cell type classification
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Overall Picture of Supervised Learning
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Biomedical
Financial
Government
Scientific
Decision trees
Emerging patterns
SVM
Neural networks
Classifiers (M-Doctors)
Unsupervised Learning
• Unsupervised learning is a machine learning technique in which
models are not supervised using training dataset. Instead, models itself
find the hidden patterns and insights from the given data. It can be
compared to learning which takes place in the human brain while
learning new things. It can be defined as:
• “Unsupervised learning is a type of machine learning in which models
are trained using unlabeled dataset and are allowed to act on that data
without any supervision”.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Unsupervised Learning
Unsupervised learning cannot be directly applied to a regression or
classification problem because unlike supervised learning, we have the
input data but no corresponding output data.
The goal of unsupervised learning is to
find the underlying structure of dataset, group that data according to
similarities, and represent that dataset in a compressed format.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Unsupervised Learning
Example: Suppose the unsupervised learning algorithm is given an input
dataset containing images of different types of cats and dogs.
The algorithm is never trained upon the given dataset, which means it
does not have any idea about the features of the dataset.
The task of the unsupervised learning algorithm is to identify the image
features on their own.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Unsupervised Learning
• . Unsupervised learning algorithm will
• perform this task by clustering the image dataset into the groups according to
similarities between images.
• By Simply,
• no training data is provided Examples:
• neural network models
• independent component analysis
• clustering
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Supervised vs. Unsupervised Learning
classification Vs clustering
• Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of training data is unknown
• Given a set of measurements, observations, etc. with the aim of establishing
the existence of classes or clusters in the data
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – III: CLASSIFICATION
Topic 2: UNSUPERVIZED ALGORITHMS -
CLUSTERING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III
1.A Characterization of Text
Classification
2. Unsupervised
Algorithms: Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or
Dimensionality Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional
Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO UNSUPERVIZED ALGORITHMS
• Below is the list of some popular unsupervised learning algorithms:
• K-means clustering
• KNN (k-nearest neighbors)
• Hierarchal clustering
• Anomaly detection
• Neural Networks
• Principle Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO UNSUPERVIZED ALGORITHMS
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
WHAT ARE CLUSTERING?
• Clustering or cluster analysis is a
machine learning technique, which
groups the unlabelled dataset.
• It can be defined as "A way of
grouping the data points into
different clusters, consisting of
similar data points. The objects with
the possible similarities remain in a
group that has less or no similarities
with another group."
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
WHAT ARE CLUSTERING?
• It does it by
• finding some similar patterns in the unlabelled dataset
such as shape, size, color, behavior, etc., and divides them
as per the presence and absence of those similar patterns.
• It is an unsupervised learning method,
• hence no supervision is provided to the algorithm, and it
deals with the unlabeled dataset.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Difference between Supervised and Unsupervised Learning
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Supervised Learning Unsupervised Learning
Supervised learning algorithms aretrained using labeled data. Unsupervised learning algorithmsare trained using unlabeled data.
Supervised learning model takesdirect feedback to check if it is
predicting correct output or not.
Unsupervised learning model doesnot take any feedback.
Supervised learning model predictsthe output. Unsupervised learning model findsthe hidden patterns in data.
Supervised learning needs supervision to train the model. Unsupervised learning does not needany supervision to train the model.
Supervised learning can becategorized
in Classification and Regression problems.
Unsupervised Learning can beclassified in Clustering and
Associations problems.
Supervised learning can be used for those cases where we
know theinput as well as corresponding outputs.
Unsupervised learning can be used for those cases where we have
onlyinput data and no corresponding output data.
Supervised learning model produces an accurate result. Unsupervised learning model may give less accurate result as compared
to supervised learning.
It includes various algorithms such It includes various algorithms such
Advantages of Unsupervised Learning
• Unsupervised learning is used for more complex tasks
as compared to supervised learning because,
• in unsupervised learning, we don't have labeled input data.
• Unsupervised learning is preferable as
• it is easy to get unlabeled data in comparison to labeled
data.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Disadvantages of Unsupervised Learning
• Unsupervised learning is
• intrinsically more difficult than supervised learning as it does not have
corresponding output.
• The result of the unsupervised learning algorithm might be
• less accurate as input data is not labeled, and algorithms do not know the
exact output in advance.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – III: CLASSIFICATION
Topic 3: NAÏVE TEXT CLASSIFICATION
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or
Dimensionality Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional
Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
NAÏVE TEXT CLASSIFICATION
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO NAÏVE TEXT CLASSIFICATION
• Naive Bayes classifiers are a collection of classification
algorithms based on Bayes Theorem.
• It is not a single algorithm but a family of algorithms where all
of them share a common principle, i.e. every pair of features
being classified is independent of each other.
• Naive Bayes classifiers have been heavily used
for text classification and text analysis machine learning
problems.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO NAÏVE TEXT CLASSIFICATION
• Text Analysis is a major application field for machine learning
algorithms.
• However the raw data,
• a sequence of symbols (i.e. strings) cannot be fed directly to the algorithms
themselves as most of them expect numerical feature vectors with a fixed size
rather than the raw text documents with variable length.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
The Naive Bayes algorithm
• Naive Bayes classifiers are a collection of classification
algorithms based on Bayes’ Theorem.
• It is not a single algorithm but a family of algorithms where all
of them share a common principle,
• i.e. every pair of features being classified is independent of each other.
• The dataset is divided into two parts, namely,
feature matrix and the response/target vector.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
The Naive Bayes algorithm
• The Feature matrix (X) contains all the vectors(rows) of the
dataset in which each vector consists of the value of
dependent features. The number of features is d i.e. X =
(x1,x2,x2, xd).
• The Response/target vector (y) contains the value of
class/group variable for each row of feature matrix.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
The Bayes’ Theorem
Bayes’ Theorem finds the probability of an event
occurring given the probability of another event that
has already occurred.
Bayes’ theorem is stated mathematically as follows:
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
The Bayes’ Theorem
• where:
• A and B are called events.
• P(A | B) is the probability of event A, given the event B is true (has occured)
• Event B is also termed as evidence.
P(A) is the priori of A (the prior independent probability, i.e. probability of event
before evidence is seen).
• P(B | A) is the probability of B given event A, i.e. probability of event B after evidence
A is seen.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
The Bayes’ Theorem
• Summary
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Dealing with text data
• Text Analysis is a major application field for machine learning
algorithms.
However the raw data, a sequence of symbols (i.e. strings) cannot be fed
directly to the algorithms themselves as most of them expect
numerical feature vectors with a fixed size rather than the raw text
documents with variable length.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Dealing with text data
• In order to address this, scikit-learn provides utilities for the most
common ways to extract numerical features from text content,
namely:
• tokenizing strings and giving an integer id for each possible token, for
instance by using w ite-spaces and punctuation as token separators.
• counting the occurrences of tokens in each document.
• In this scheme, features and samples are defined as follows:
• each individual token occurrence frequency is treated as a feature.
• the vector of all the token frequencies for a given document is considered a multivariate
sample.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 1 : Using the Naive Bayesian Classifier
• We will consider the following training set.
• The data samples are described by attributes age, income, student,
and credit.
• The class label attribute, buy, tells whether the person buys a
computer, has two distinct values, yes (class C1) and no (class C2).
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 1 : Using the Naive Bayesian Classifier
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
RID Age Income student Credit Ci: buy
1 Youth High no Fair C2: no
2 Youth High no Excellent C2: no
3 middle-aged High no Fair C1: yes
4 Senior medium no Fair C1: yes
5 Senior Low yes Fair C1: yes
6 Senior Low yes Excellent C2: no
7 middle-aged Low yes Excellent C1: yes
8 Youth medium no Fair C2: no
9 Youth Low yes Fair C1: yes
10 Senior medium yes Fair C1: yes
11 Youth medium yes Excellent C1: yes
12 middle-aged medium no Excellent C1: yes
13 middle-aged High yes Fair C1: yes
14 Senior medium no Excellent C2: no
Example 1 : Using the Naive Bayesian Classifier
• The sample we wish to classify is
• X = (age = youth, income = medium, student = yes, credit = fair)
• We need to maximize P (X|Ci)P (Ci), for i = 1, 2. P (Ci), the a priori
probability of each class, can be estimated based on the training
samples:
• P(buy =yes ) = 9 /14
• P(buy =no ) = 5 /14
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 1 : Using the Naive Bayesian Classifier
• To compute P (X|Ci), for i = 1, 2, we compute the following conditional
probabilities:
• P(age=youth | buy =yes ) = 2/9
• P(income =medium | buy =yes ) = 4/9
• P(student =yes | buy =yes ) = 6/9
• P(credit =fair | buy =yes ) = 6/9
• P(age=youth | buy =no ) = 3/5
• P(income =medium | buy =no ) = 2/5
• P(student =yes | buy =no ) = 1/5
• P(credit =fair | buy =no ) = 2/5
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 1 : Using the Naive Bayesian Classifier
• Using the above probabilities, we obtain
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 1 : Using the Naive Bayesian Classifier
• Similarly
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
To find the class that maximizes P (X|Ci)P (Ci), we compute
Thus the naive Bayesian classifier predicts buy = yes for sample X
Example 2: Predicting a class label using naïve Bayesian classification
• Predicting a class label using naïve Bayesian classification.
• The training data set is given below:
• The data tuples are described by the attributes Owns Home?, Married,
Gender and Employed.
• The class label attribute Risk Class has three distinct values.
• Let C1 corresponds to the class A, and C2 corresponds to the class B
and C3 corresponds to the class C.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 1 : Using the Naive Bayesian Classifier
• The tuple is to classify is,
• X = (Owns Home = Yes, Married = No, Gender = Female, Employed = Yes)
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Owns Home Married Gender Employed Risk Class
Yes Yes Male Yes B
No No Female Yes A
Yes Yes Female Yes C
Yes No Male No B
No Yes Female Yes C
No No Female Yes A
No No Male No B
Yes No Female Yes A
No Yes Female Yes C
Yes Yes Female Yes C
Example 2: Predicting a class label using naïve Bayesian classification
• Solution
• There are 10 samples and three classes.
• Risk class A = 3 Risk class B = 3 Risk class C = 4
•
• The prior probabilities are obtained by dividing these frequencies by
the total number in the training data,
• P(A) = 3/10 = 0.3 P(B) = 3/10 = 0.3 P(C) = 4/10 = 0.4
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 2: Predicting a class label using naïve Bayesian classification
• To compute P(X/Ci) =P {yes, no, female, yes}/Ci) for each of the classes, the conditional probabilities for each:
• P(Owns Home = Yes/A) = 1/3 =0.33
• P(Married = No/A) = 3/3 =1
• P(Gender = Female/A) = 3/3 = 1
• P(Employed = Yes/A) = 3/3 = 1
•
• P(Owns Home = Yes/B) = 2/3 =0.67
• P(Married = No/B) = 2/3 =0.67
• P(Gender = Female/B) = 0/3 = 0
• P(Employed = Yes/B) = 1/3 = 0.33
•
• P(Owns Home = Yes/C) = 2/4 =0.5
• P(Married = No/C) = 0/4 =0
• P(Gender = Female/C) = 4/4 = 1
• P(Employed = Yes/C) = 4/4 = 1
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example 2: Predicting a class label using naïve Bayesian classification
• Using the above probabilities, we obtain
• P(X/A)= P(Owns Home = Yes/A) X
• P(Married = No/A) x
• P(Gender = Female/A) X
• P(Employed = Yes/A)
= 0.33 x 1 x 1 x 1 = 0.33
• Similarly, P(X/B)= 0 , P(X/C) =0
•
• To find the class, G, that maximizes, P(X/Ci)P(Ci), we compute,
• P(X/A) P(A) = 0.33 X 0.3 = 0.099
• P(X/B) P(B) =0 X 0.3 = 0
• P(X/C) P(C) = 0 X 0.4 = 0.0
• Therefore x is assigned to class A
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Advantages and Disadvantages
• Advantages:
a) Have the minimum error rate in comparison to all other classifiers.
b) Easy to implement
c) Good results obtained in most of the cases.
d) They provide theoretical justification for other classifiers that do not
explicitly use
• Disadvantages:
a) Lack of available probability data.
b) Inaccuracies in the assumption.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – III: CLASSIFICATION
Topic 4: SUPERVISED ALGORITHMS
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or
Dimensionality Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional
Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SUPERVIZED LEARNING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO SUPERVIZED LEARNING
• Supervised learning, also
known as supervised machine
learning, is a subcategory of
machine learning and artificial
intelligence.
• It is defined by its use of
labeled datasets to train
algorithms that to classify data
or predict outcomes accurately.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
What is supervised learning?
• Supervised learning, also known as supervised machine
learning, is
• a subcategory of machine learning and artificial intelligence.
• It is defined by
• its use of labeled datasets to train algorithms that to classify
data or predict outcomes accurately.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Supervised Learning
• It is defined by its use of labeled datasets to
• train algorithms that to classify data or predict outcomes accurately.
• As input data is fed into the model, it adjusts its weights until the
model has been fitted appropriately,
• which occurs as part of the cross validation process.
• Supervised learning helps organizations solve for
• a variety of real-world problems at scale, such as classifying spam in a
separate folder from your inbox.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
What is the type of supervised learning?
• There are two types of Supervised Learning
techniques:
1. Regression and
2. Classification.
• Classification separates the data, Regression fits the
data.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Example of Supervised Learning
• A great example of supervised learning is text classification
problems.
• In this set of problems, the goal is to
• predict the class label of a given piece of text.
• One particularly popular topic in text classification is to
• predict the sentiment of a piece of text, like a tweet or a product
review.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SUPERVIZED ALGORITHMS
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO SUPERVIZED ALGORITHMS
• Which are supervised algorithm?
• A supervised learning algorithm takes
• a known set of input data (the learning set) and known responses to the data
(the output), and forms a model to generate reasonable predictions for the
response to the new input data.
• Use supervised learning if you have existing data for the output
you are trying to predict.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SUPERVIZED ALGORITHMS EXAMPLE
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO SUPERVIZED ALGORITHMS
• Various algorithms and computation techniques are used in
supervised machine learning processes.
• Most commonly used learning methods, typically calculated
through use of programs like R or Python are:
1) Neural networks
• Primarily leveraged for deep learning algorithms, neural networks process
training data by mimicking the interconnectivity of the human brain through
layers of nodes
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO SUPERVIZED ALGORITHMS
2) Naive Bayes
• Naive Bayes is classification approach that adopts the principle of class conditional
independence from the Bayes Theorem.
3) Linear regression
• Linear regression is used to identify the relationship between a dependent variable and
one or more independent variables and is typically leveraged to make predictions about
future outcomes.
4) Logistic regression
• While linear regression is leveraged when dependent variables are continuous, logistical
regression is selected when the dependent variable is categorical, meaning they have
binary outputs, such as "true" and "false" or "yes" and "no."
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO SUPERVIZED ALGORITHMS
5) Support vector machine (SVM)
• A support vector machine is a popular supervised learning model developed by
Vladimir Vapnik, used for both data classification and regression.
6) K-nearest neighbor
• K-nearest neighbor, also known as the KNN algorithm, is a non-parametric
algorithm that classifies data points based on their proximity and association to
other available data.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO SUPERVIZED ALGORITHMS
7) Random forest
• Random forest is another flexible supervised machine learning algorithm used
for both classification and regression purposes.
• The "forest" references a collection of uncorrelated decision trees, which are
then merged together to reduce variance and create more accurate data
predictions.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – III: CLASSIFICATION
Topic 5: DECISION TREES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification 4.
Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or
Dimensionality Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional
Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
DECISION TREES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO DECISION TREES
• What is a decision tree?
• A decision tree is a structure that includes a root node,
branches, and leaf nodes.
a) Each internal node denotes a test on an attribute,
b) each branch denotes the outcome of a test, and
c) each leaf node holds a class label.
• The topmost node in the tree is the root node.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO DECISION TREES
• ID3, C4.5, and CART adopt a greedy (i.e., non-backtracking)
approach in which decision trees are constructed in a top-
down recursive divide-and-conquer manner.
• Most algorithms for decision tree induction also follow such a
top-down approach, which starts with a training set of
tuples and their associated class labels.
• The training set is recursively partitioned into smaller
subsets as the tree is being built.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO DECISION TREES
• Decision tree induction is the learning of decision trees from
class-labeled training tuples.
• A decision tree is a flowchart-like tree structure, where each
internal node (nonleaf node) denotes a test on an attribute,
• each branch represents an outcome of the test, and
• each leaf node (or terminal node) holds a class label.
• The top most node in a tree is the root node.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO DECISION TREES
•A decision tree is a tree where
• internal node = a test on an attribute
• tree branch = an outcome of the test
• leaf node = class label or class distribution
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Benefits of Decision Trees
•The benefits of having a decision tree are as
follows −
a) It does not require any domain knowledge.
b) It is easy to comprehend.
c) The learning and classification steps of a decision tree are
simple and fast.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Brief History of Decision Trees
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
CLS (Hunt etal. 1966)--- cost driven
ID3 (Quinlan, 1986 MLJ) --- Information-driven
C4.5 (Quinlan, 1993) --- Gain ratio + Pruning ideas
CART (Breiman et al. 1984) --- Gini Index
Elegance of Decision Trees
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Structure of Decision Trees
• If x1 > a1 & x2 > a2, then it’s A class
• C4.5, CART, two of the most widely used
• Easy interpretation, but accuracy generally unattractive
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Leaf nodes
Internal nodes
Root node
A
B
B A
A
x1
x2
x4
x3
> a1
> a2
Example of Decision Tree
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Another Example of Decision Tree
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Decision Tree classification Tasks
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
5/16/2022 Data Mining: Concepts and Techniques 15
Apply Model to Test Data
Refund
MarSt
TaxInc
YES
NO
NO
NO
Yes No
Married
Single, Divorced
< 80K > 80K
Refund Marital
Status
Taxable
Income Cheat
No Married 80K ?
10
Test Data
Assign Cheat to “No”
5/16/2022 Data Mining: Concepts and Techniques 16
Apply Model to Test Data
Refund
MarSt
TaxInc
YES
NO
NO
NO
Yes No
Married
Single, Divorced
< 80K > 80K
Refund Marital
Status
Taxable
Income Cheat
No Married 80K ?
10
Test Data
5/16/2022 Data Mining: Concepts and Techniques 17
Apply Model to Test Data
Refund
MarSt
TaxInc
YES
NO
NO
NO
Yes No
Married
Single, Divorced
< 80K > 80K
Refund Marital
Status
Taxable
Income Cheat
No Married 80K ?
10
Test Data
5/16/2022 Data Mining: Concepts and Techniques 18
Apply Model to Test Data
Refund
MarSt
TaxInc
YES
NO
NO
NO
Yes No
Married
Single, Divorced
< 80K > 80K
Refund Marital
Status
Taxable
Income Cheat
No Married 80K ?
10
Test Data
5/16/2022 Data Mining: Concepts and Techniques 19
Apply Model to Test Data
Refund
MarSt
TaxInc
YES
NO
NO
NO
Yes No
Married
Single, Divorced
< 80K > 80K
Refund Marital
Status
Taxable
Income Cheat
No Married 80K ?
10
Test Data
5/16/2022 Data Mining: Concepts and Techniques 20
Apply Model to Test Data
Refund
MarSt
TaxInc
YES
NO
NO
NO
Yes No
Married
Single, Divorced
< 80K > 80K
Refund Marital
Status
Taxable
Income Cheat
No Married 80K ?
10
Test Data
Assign Cheat to “No”
5/16/2022 Data Mining: Concepts and Techniques 21
Decision Tree Classification Task
Apply
Model
Induction
Deduction
Learn
Model
Model
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No
8 No Small 85K Yes
9 No Medium 75K No
10 No Small 90K Yes
10
Tid Attrib1 Attrib2 Attrib3 Class
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ?
10
Test Set
Tree
Induction
algorithm
Training Set
Decision Tree
Constructing a Decision Tree
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Constructing a Decision Tree
• Two phases of decision tree generation:
1. tree construction
• at start, all the training examples at the root
• partition examples based on selected attributes
• test attributes are selected based on a heuristic or a statistical measure
2. tree pruning
• identify and remove branches that reflect noise or outliers
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Constructing a Decision Tree
• Basic step:
Determination of the root node of the tree and
the root node of its sub-trees
• Most Discriminatory Feature
• Every feature can be used to partition the training data
• If the partitions contain a pure class of training instances, then this feature is
most discriminatory
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Constructing a Decision Tree:- Example of Partitions
• Categorical feature
• Number of partitions of the training data is equal to the number of values of
this feature
• Numerical feature
• Two partitions
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Decision Tree Induction Algorithm
• A machine researcher named J. Ross Quinlan in 1980 developed a
decision tree algorithm known as ID3 (Iterative Dichotomiser).
• Later, he presented C4.5, which was the successor of ID3. ID3 and C4.5
adopt a greedy approach.
• In this algorithm, there is no backtracking; the trees are constructed
in a top-down recursive divide-and-conquer manner.
• Generating a decision tree form training tuples of data partition D
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Algorithm : Generate_decision_tree
• Input:
• Data partition, D, which is a set of training tuples and their
associated class labels.
• attribute_list, the set of candidate attributes.
• Attribute selection method, a procedure to determine the splitting
criterion that best partitions that the data tuples into individual
classes. This criterion includes a splitting_attribute and either a
splitting point or splitting subset.
Output: A Decision Tree
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Algorithm : Generate_decision_tree
• Method
1) create a node N;
2) if tuples in D are all of the same class, C then
3) return N as leaf node labeled with class C;
4) if attribute_list is empty then
5) return N as leaf node with labeled with majority class in D;//majority voting
6) apply attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
7) label node N with splitting_criterion;
8) if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
9) attribute_list = attribute_list - splitting attribute; // remove splitting attribute
10) for each outcome j of splitting criterion
11) // partition the tuples and grow subtrees for each partition
12) let Dj be the set of data tuples in D satisfying outcome j; // a partition
13) if Dj is empty then
14) attach a leaf labeled with the majority class in D to node N;
else attach the node returned by Generate_decision tree(Dj, attribute list) to node N;
end for
15) return N;
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Constructing Decision Tree Example :- Weather Forecasting
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Constructing Decision Tree :- A Simple Dataset
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
9 Play samples
5 Don’t
A total of 14.
Constructing Decision Tree :- A Simple Dataset
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Outlook Temp Humidity Windy class
Sunny 75 70 true Play
Sunny 80 90 true Don’t
Sunny 85 85 false Don’t
Sunny 72 95 true Don’t
Sunny 69 70 false Play
Overcast 72 90 true Play
Overcast 83 78 false Play
Overcast 64 65 true Play
Overcast 81 75 false Play
Rain 71 80 true Don’t
Rain 65 70 true Don’t
Rain 75 80 false Play
Rain 68 80 false Play
Rain 70 96 false Play
Instance #
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Constructing Decision Tree :- A Simple Dataset
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
2
outlook
windy
humidity
Play
Play
Play
Don’t
Don’t
sunny
overcast
rain
<= 75
> 75 false
true
2
4
3
3
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Total 14 training
instances
1,2,3,4,5
P,D,D,D,P
6,7,8,9
P,P,P,P
10,11,12,13,14
D, D, P, P, P
Outlook =
sunny
Outlook =
overcast
Outlook =
rain
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Total 14 training
instances
5,8,11,13,14
P,P, D, P, P
1,2,3,4,6,7,9,10,12
P,D,D,D,P,P,P,D,P
Temperature
<= 70
Temperature
> 70
Constructing Decision Tree :- A Simple Dataset
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
2
outlook
windy
humidity
Play
Play
Play
Don’t
Don’t
sunny
overcast
rain
<= 75
> 75 false
true
2
4
3
3
Constructing Decision Tree Example :-
Decision on Buying a Computer / customer likely to
purchase a computer
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Constructing Decision Tree Example :-
Decision on Buying a Computer
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
The following decision tree is for the concept buy_computer that
indicates :
Whether a customer at a company is likely to buy a computer or not?
Each internal node represents a test on an attribute.
Each leaf node represents a class.
Constructing Decision Tree :- Training Dataset
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
This follows
an
example of
Quinlan’s
ID3 (Playing
Tennis)
Constructing Decision Tree :- Output: A Decision Tree for
“buys_computer”
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
age?
overcast
student? credit rating?
<=30 >40
no yes yes
yes
31..40
fair
excellent
yes
no
From the training dataset , calculate entropy value, which indicates that splitting attribute is: age
A Decision Tree for “buys_computer”
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
A Decision Tree for “buys_computer”
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
From the training data set , age= youth has 2 classes based on student attribute
A Decision Tree for “buys_computer”
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
based on majority voting in student attribute , RID=3 is grouped under yes group.
A Decision Tree for “buys_computer”
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
From the training data set , age= senior has 2 classes based on credit rating.
A Decision Tree for “buys_computer”
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Final Decision Tree
Classification by Decision Tree
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Classification by Decision Tree
• A typical decision tree that represents the concept buys
computer, that is, it predicts whether a customer at
AllElectronics is likely to purchase a computer.
• Internal nodes are denoted by rectangles, and leaf nodes are
denoted by ovals.
• Some decision tree algorithms produce only binary trees
(where each internal node branches to exactly two other
nodes), whereas others can produce non binary trees.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Classification by Decision Tree
• “How are decision trees used for classification?”
• Given a tuple, X, for which the associated class label is
unknown, the attribute values of the tuple are tested against
the decision tree.
• A path is traced from the root to a leaf node, which holds the
class prediction for that tuple.
• Decision trees can easily be converted to classification rules.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Classification by Decision Tree
Why are decision tree classifiers so popular?
• The construction of decision tree classifiers does not require any domain
knowledge or parameter setting, and therefore is appropriate for
exploratory knowledge discovery.
• Decision trees can handle high dimensional data.
• Their representation of acquired knowledge in tree form is intuitive and
generally easy to assimilate by humans.
• The learning and classification steps of decision tree induction are simple
and fast.
• In general, decision tree classifiers have good accuracy.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Classification by Decision Tree
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “<=30” AND credit_rating = “fair” THEN buys_computer = “no”
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Classification by Decision Tree
Training Set and Its AVC Sets
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
student Buy_Computer
yes no
yes 6 1
no 3 4
Age Buy_Computer
yes no
<=30 3 2
31..40 4 0
>40 3 2
Credit
rating
Buy_Computer
yes no
fair 6 2
excellent 3 3
age income studentcredit_rating
buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
AVC-set on income
AVC-set on Age
AVC-set on Student
Training Examples
income Buy_Computer
yes no
high 2 2
medium 4 2
low 3 1
AVC-set on
credit_rating
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – III: CLASSIFICATION
Topic 6: K-NN CLASSIFIER
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification 4.
Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or
Dimensionality Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional
Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
K-NN CLASSIFIER
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
K-NN CLASSIFIER
• K-Nearest Neighbour is one of the simplest Machine Learning
algorithms based on Supervised Learning technique.
• K-NN algorithm assumes the similarity between the new case/data
and available cases and put the new case into the category that is
most similar to the available categories.
• K-NN algorithm stores all the available data and classifies a new data
point based on the similarity.
• This means when new data appears then it can be easily classified into a well
suite category by using K- NN algorithm.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
K-NN CLASSIFIER
• supervised ML classification algorithm-KNN(K Nearest Neighbors)
algorithm.
• It is one of the simplest and widely used classification algorithms in
which a new data point is classified based on similarity in the
specific group of neighboring data points.
• This gives a competitive result.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
K-NN CLASSIFIER EXAMPLE
• Example: Suppose, we have an image of a creature that looks similar
to cat and dog,
• but we want to know either it is a cat or dog. So for this identification, we can
use the KNN algorithm, as it works on a similarity measure.
• Our KNN model will find the similar features of the new data set to
the cats and dogs images and based on the most similar features it
will put it in either cat or dog category.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
K-NN CLASSIFIER EXAMPLE
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
K Nearest Neighbor Classification
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO K-NN CLASSIFIER
• K nearest neighbors is a simple algorithm that stores
• all available cases and classifies new cases based on a similarity measure (e.g.,
distance functions).
• K represents number of nearest neighbors.
• It classify an unknown example with the most common class
among k closest examples.
• KNN is based on
• “tell me who your neighbors are, and I’ll tell you who you are”
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO K-NN CLASSIFIER :- Example
If K = 5, then in this case query instance xq will be classified
as negative since three of its nearest neighbors are classified
as negative.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Different Schemes of KNN
• 1-Nearest Neighbor
• K-Nearest Neighbor using a majority voting scheme
• K-NN using a weighted-sum voting Scheme
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Different Schemes of KNN
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Different Schemes of KNN
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
kNN: How to Choose k?
• In theory, if infinite number of samples available, the larger is k, the
better is classification
• The limitation is that all k neighbors have to be close
• Possible when infinite no of samples available
• Impossible in practice since no of samples is finite k = 1 is often used for efficiency, but
sensitive to “noise”
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
kNN: How to Choose k?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
kNN: How to Choose k?
• Larger k gives smoother boundaries, better for generalization But only
if locality is preserved. Locality is not preserved if end up looking at
samples too far away, not from the same class.
• Interesting theoretical properties if k < sqrt(n), n is # of examples .
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Find a heuristically optimal number k of nearest
neighbors, based on RMSE(root-mean-square error).
This is done using cross validation.
Cross-validation is another way to retrospectively determine a good K value by using an independent
dataset to validate the K value. Historically, the optimal K for most datasets has been between 3-10.
That produces much better results than 1NN.
Distance Measure in KNN
• There are three distance measures are valid for continuous variables.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Distance Measure in KNN
• It should also be noted that all In the instance of categorical variables the Hamming distance must be used.
• It also brings up the issue of standardization of the numerical variables between 0 and 1 when there is a
mixture of numerical and categorical variables in the dataset.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Simple KNN - Algorithm:
• For each training example , add the example to the list of training_examples.
• Given a query instance xq to be classified,
• Let x1 ,x2….xk denote the k instances from training_examples that are nearest to xq .
• Return the class that represents the maximum of the k instances
• Steps:
1. Determine parameter k= no of nearest neighbor
2. Calculate the distance between the query instance and all the training samples.
3. Sort the distance and determine nearest neighbor based on the k –th minimum distance
4. Gather the category of the nearest neighbors
5. Use simple majority of the category of nearest neighbors as the prediction value of the query
instance.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Simple KNN - Algorithm:
• K-NN algorithm can be used for Regression as well as for Classification but mostly
it is used for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
• KNN algorithm at the training phase just stores the dataset and when it gets new
data, then it classifies that data into a category that is much similar to the new
data.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Simple KNN – Algorithm Example
• Example:
• Consider the following data concerning credit default. Age and Loan are two numerical variables
(predictors) and Default is the target.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Simple KNN – Algorithm Example
• Given Training Data set :
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Simple KNN – Algorithm Example
• Data to Classify:
• to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance.
•
• Step1: Determine parameter k
• K=3
•
• Step 2: Calculate the distance
• D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 >> Default=Y
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Simple KNN – Algorithm Example
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Simple KNN – Algorithm Example
• Step 3: Sort the distance ( refer above diagram) and mark upto kth rank i.e 1 to 3.
•
• Step 4: Gather the category of the nearest neighbors
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Age Loan Default Distance
33 $150000 Y 8000
35 $120000 N 22000
60 $100000 Y 42000
With K=3, there are two Default=Y and one Default=N out of three closest neighbors.
The prediction for the unknown case is Default=Y.
Standardized Distance ( Feature Normalization)
• One major drawback in calculating distance measures directly from the training set is in
the case where variables have different measurement scales or there is a mixture of
numerical and categorical variables.
• For example, if one variable is based on annual income in dollars, and the other is based
on age in years then income will have a much higher influence on the distance calculated.
One solution is to standardize the training set as shown below.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Standardized Distance ( Feature Normalization)
• For ex loan , X =$ 40000 ,
• Xs = 40000- 20000 = 0.11
• 220000-20000
•
Same way , calculate the standardized values for age and loan attributes, then
apply the KNN algorithm.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Simple KNN – Algorithm
• Advantages
• Can be applied to the data from any distribution
• for example, data does not have to be separable with a linear boundary
• Very simple and intuitive
• Good classification if the number of samples is large enough
•
• Disadvantages
• Choosing k may be tricky
• Test stage is computationally expensive
• No training stage, all the work is done during the test stage
• This is actually the opposite of what we want. Usually we can afford training step to take a long time, but we
want fast test step
• Need large number of samples for accuracy
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
How does K-NN work?
• he K-NN working can be explained on the basis of the below algorithm:
• Step-1: Select the number K of the neighbors
• Step-2: Calculate the Euclidean distance of K number of neighbors
• Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
• Step-4: Among these k neighbors, count the number of the data points in
each category.
• Step-5: Assign the new data points to that category for which the number
of the neighbor is maximum.
• Step-6: Our model is ready.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
How does K-NN work?
• Suppose we have a new data point and we need to put it in
the required category. Consider the below image:
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
How does K-NN work?
• Firstly, we will choose the number of neighbors, so we will choose the k=5.
• Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the
distance between two points, which we have already studied in geometry. It can be calculated as:
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
How does K-NN work?
• By calculating the Euclidean distance we got the nearest neighbors, as
three nearest neighbors in category A and two nearest neighbors in
category B. Consider the below image:
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Why do we need a K-NN Algorithm?
•Suppose there are two categories, i.e., Category A
and Category B, and we have a new data point x1, so
this data point will lie in which of these categories.
•To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily
identify the category or class of a particular dataset. :
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Why do we need a K-NN Algorithm?
•Consider the below diagram:
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – III: CLASSIFICATION
Topic 7: SVM CLASSIFIER
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or
Dimensionality Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional
Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SUPPORT VECTOR MACHINE (SVM)
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO SVM
• A new classification method for both linear and nonlinear data
• It uses a nonlinear mapping to transform the original training data into a
higher dimension
• With the new dimension, it searches for the linear optimal separating
hyperplane (i.e., “decision boundary”)
• With an appropriate nonlinear mapping to a sufficiently high dimension, data
from two classes can always be separated by a hyperplane
• SVM finds this hyperplane using support vectors (“essential” training tuples)
and margins (defined by the support vectors)
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO SVM
• A support vector machine (SVM) is a supervised machine learning
model that uses classification algorithms.
• It is more preferred for classification but is sometimes very useful for
regression as well.
• Basically, SVM finds a hyper-plane that creates a boundary between the
types of data.
• In 2- dimensional space, this hyper-plane is nothing but a line.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SVM—History and Applications
• Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis’
statistical learning theory in 1960s
• Features: training can be slow but accuracy is high owing to their ability to model
complex nonlinear decision boundaries (margin maximization)
• Used both for classification and prediction
• Applications:
• handwritten digit recognition, object recognition, speaker identification, benchmarking time-
series prediction tests
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SVM—General Philosophy
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SVM—Margins and Support Vectors
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO SVM
• In SVM, we plot each data item in the dataset in an N-
dimensional space, where N is the number of features/attributes
in the data.
• Next, find the optimal hyperplane to separate the data.
• So by this, you must have understood that inherently, SVM can
only perform binary classification (i.e., choose between two
classes).
• However, there are various techniques to use for multi-class problems.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Support Vector Machine for Multi- class Problems
• To perform SVM on multi-class problems, we can create a binary classifier for
each class of the data.
• The two results of each classifier will be :
• The data point belongs to that class OR
• The data point does not belong to that class.
• For example, in a class of fruits, to perform multi-class classification, we can
create a binary classifier for each fruit.
• For say, the ‘mango’ class,
• there will be a binary classifier to predict if it IS a mango OR it is NOT a mango.
• The classifier with the highest score is chosen as the output of the SVM.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SVM—Linearly Separable
• A separating hyperplane can be written as
• W ● X + b = 0
• where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
• For 2-D it can be written as
• w0 + w1 x1 + w2 x2 = 0
• The hyperplane defining the sides of the margin:
• H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
• H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
• Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
• This becomes a constrained (convex) quadratic optimization problem: Quadratic objective
function and linear constraints  Quadratic Programming (QP)  Lagrangian multipliers
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SVM—When Data Is Linearly Separable
• Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples associated with
the class labels yi
• There are infinite lines (hyperplanes) separating the two classes but we want to find the
best one (the one that minimizes classification error on unseen data)
• SVM searches for the hyperplane with the largest margin, i.e., maximum marginal
hyperplane (MMH)
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SVM for complex (Non Linearly Separable)
SVM for complex (Non Linearly Separable) SVM works very well without any modifications
for linearly separable data.
Linearly Separable Data is any data that can be plotted in a graph and can be separated into
classes using a straight line.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
A: Linearly Separable Data B: Non-Linearly Separable Data
SVM CLASSIFIER
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SVM CLASSIFIER
• A vector space method for binary classification problems
documents represented in t-dimensional space
• find a decision surface (hyperplane) that best separate
documents of two classes new document classified by its
position relative to hyperplane.
• Simple 2D example: training documents linearly separable
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SVM CLASSIFIER
• Simple 2D example: training documents linearly separable
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SVM CLASSIFIER
• Line s—The Decision Hyperplane
• maximizes distances to closest docs of each class
• it is the best separating hyperplane
• Delimiting Hyperplanes
• parallel dashed lines that delimit region where to look for a
solution
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SVM CLASSIFIER
• Lines that cross the delimiting hyperplanes.
• candidates to be selected as the decision hyperplane
• lines that are parallel to delimiting hyperplanes: best candidates
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SVM CLASSIFIER
• Support vectors: documents that belong to, and define, the delimiting
hyperplanes Our example in a 2-dimensional system of coordinates
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SVM CLASSIFIER
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SVM CLASSIFIER
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SVM vs. Neural Network
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
• SVM
1) Relatively new concept
2) Deterministic algorithm
3) Nice Generalization
properties
4) Hard to learn – learned in
batch mode using quadratic
programming techniques
5) Using kernels can learn very
complex functions
• Neural Network
1) Relatively old
2) Nondeterministic algorithm
3) Generalizes well but doesn’t
have strong mathematical
foundation
4) Can easily be learned in
incremental fashion
5) To learn complex functions—
use multilayer perceptron (not
that trivial)
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – III: CLASSIFICATION
Topic 8 FEATURE SELECTION OR
DIMENSIONALITY REDUCTION
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or
Dimensionality
Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
FEATURE SELECTION OR DIMENSIONALITY REDUCTION
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
FEATURE SELECTION OR DIMENSIONALITY REDUCTION
• Feature selection and dimensionality reduction allow us to
• minimize the number of features in a dataset by only keeping features that are
important.
• In other words, we want to retain features that contain the
most useful information that is needed by our model to
• make accurate predictions while discarding redundant features that contain
little to no information.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
FEATURE SELECTION OR DIMENSIONALITY REDUCTION
• There are several benefits in performing feature selection and
dimensionality reduction which include:
• model interpretability,
• minimizing overfitting
as well as
• reducing the size of the training set and consequently training time.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Dimensionality Reduction
• The number of input variables or features for a dataset is referred to
as its dimensionality.
• Dimensionality reduction refers to techniques that reduce the number
of input variables in a dataset.
• More input features often make a predictive modeling task more
challenging to model, more generally referred to as the curse of
dimensionality.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Dimensionality Reduction
• High-dimensionality statistics and dimensionality reduction
techniques are often used for data visualization.
• Nevertheless these techniques can be used in applied machine
learning to
• simplify a classification or regression dataset in order to better fit a predictive
model.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Problem With Many Input Variables
• If your data is represented using rows and columns, such as in a
spreadsheet, then
• the input variables are the columns that are fed as input to a model to predict
the target variable. Input variables are also called features.
• We can consider the columns of data representing dimensions on an
n-dimensional feature space and the rows of data as points in that
space.
• This is a useful geometric interpretation of a dataset.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Problem With Many Input Variables
• Having a large number of dimensions in the feature space can mean that
the volume of that space is very large, and in turn,
• the points that we have in that space (rows of data) often represent a small and non-
representative sample.
• This can dramatically impact the performance of machine learning
algorithms fit on data with many input features, generally referred to as the
“curse of dimensionality.”
• Therefore, it is often desirable to reduce the number of input features. This
reduces the number of dimensions of the feature space, hence the name
“dimensionality reduction.”
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Dimensionality Reduction
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Dimensionality Reduction
• Dimensionality reduction refers to techniques for reducing the
number of input variables in training data.
• When dealing with high dimensional data, it is often useful to reduce
the dimensionality by projecting the data to a lower dimensional
subspace which captures the “essence” of the data. This is called
dimensionality reduction.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Dimensionality Reduction
• Fewer input dimensions often mean correspondingly fewer
parameters or a simpler structure in the machine learning model,
referred to as degrees of freedom.
• A model with too many degrees of freedom is likely to overfit the
training dataset and therefore may not perform well on new data.
• It is desirable to have simple models that generalize well, and in turn,
input data with few input variables.
• This is particularly true for linear models where the number of inputs
and the degrees of freedom of the model are often closely related.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Techniques for Dimensionality Reduction
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Techniques for Dimensionality Reduction
• There are many techniques that can be used for dimensionality
reduction.
• Feature Selection Methods
• Matrix Factorization
• Manifold Learning
• Auto encoder Methods
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Feature Selection Methods
• Feature selection is also called variable selection or attribute
selection.
• It is the automatic selection of attributes in your data (such as
columns in tabular data) that are most relevant to the predictive
modeling problem you are working on.
• feature selection…
• is the process of selecting a subset of relevant features for use in model
construction
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Feature Selection Methods
• Feature selection is different from dimensionality reduction.
• Both methods seek to reduce the number of attributes in the dataset,
• but a dimensionality reduction method do so by creating new combinations of
attributes,
• where as feature selection methods include and exclude attributes present in
the data without changing them.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Feature Selection Methods
• Examples of dimensionality reduction methods include
• Principal Component Analysis,
• Singular Value Decomposition and
• Sammon’s Mapping.
• Feature selection is itself useful, but it mostly acts as a filter, muting
out features that aren’t useful in addition to your existing features.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Feature Selection Algorithms
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Filter Methods
• Filter feature selection methods apply a statistical measure to assign a
scoring to each feature.
• The features are ranked by the score and either selected to be kept or
removed from the dataset.
• The methods are often univariate and consider the feature
independently, or with regard to the dependent variable.
• Some examples of some filter methods include the Chi squared test,
information gain and correlation coefficient scores.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Wrapper Methods
• Wrapper methods consider the selection of a set of features as a search problem,
where different combinations are prepared, evaluated and compared to other
combinations.
• A predictive model us used to evaluate a combination of features and assign a
score based on model accuracy.
• The search process may be
• methodical such as a best-first search,
• it may stochastic such as a random hill-climbing algorithm, or
• it may use heuristics, like forward and backward passes to add and remove features.
• An example if a wrapper method is the recursive feature elimination algorithm.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Embedded Methods
• Embedded methods learn which features best contribute to the accuracy of the
model while the model is being created.
• The most common type of embedded feature selection methods are
• regularization methods.
• Regularization methods are also called penalization methods that introduce
• additional constraints into the optimization of a predictive algorithm (such as a regression
algorithm) that bias the model toward lower complexity (fewer coefficients).
• Examples of regularization algorithms are the
• LASSO, Elastic Net and Ridge Regression.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – III: CLASSIFICATION
Topic 9: EVALUATION METRICS
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or Dimensionality
Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
EVALUATION METRICS
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
EVALUATION METRICS
• Evaluating the Accuracy of a Classifier
• Basic Evaluation Measures for Classifier Performance.
• In Bioinformatics and machine learning in general,
• there is a large variation in the measures that are used to evaluate
prediction systems.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
A confusion matrix
• For simplicity, the assumption is that each instance can only be assigned
one of two classes:
• Positive or
• Negative
(e.g. a patient's tumor may be malignant or benign).
• Each instance (e.g. a patient) has a Known label, and a Predicted label.
• Some method is used (e.g. cross-validation) to make predictions on each
instance. Each instance then increments one cell in the confusion matrix.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
EVALUATION METRICS: -A confusion matrix
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Predicted Label
Positive Negative
Known Label
Positive
True Positive
False
Negative
(TP) (FN)
Negative
False
Positive
True
Negative
(FP) (TN)
EVALUATION METRICS
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
CONTINGENCY TABLE
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
CONTINGENCY TABLE
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – III: CLASSIFICATION
Topic 10: ACCURACY AND ERROR
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or Dimensionality
Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
ACCURACY AND ERROR
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
ACCURACY AND ERROR
• Accuracy
• Accuracy is the proportion of the time that the predicted class
equals the actual class, usually expressed as a percentage.
• It's meaning is straightforward, but may obscure important
differences in costs associated with different errors.
• The classic example of such costs is the medical diagnostic situation,
in which one can err be either:
• 1. keeping a healthy patient in the hospital (low cost), or
• 2. sending home a sick patient (very high cost).
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
ACCURACY AND ERROR
• These classifiers need to be checked for both the accuracy of their
probabilities (Do cases predicted to have a 5% (30%, 80%, etc.)
probability really belong to the target class 5% (30%, 80%, etc.) of
the time?) and their ability to separate the classes in question.
Accuracy can be measured using many of the same metrics used to
evaluate numerical models (MSE, MAE, etc.).
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Accuracy and Error Measures
• All models must be assessed somehow.
• Despite the existence of a bewildering array of performance
measures, much commercial modeling software provides a
surprisingly limited range of options.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Mean Squared Error (MSE)
• Mean Squared Error (MSE) is by far the most common measure of
numerical model performance.
• It is simply the average of the squares of the differences between
the predicted and actual values.
• It is a reasonably good measure of performance, though it could be
argued that it overemphasizes the importance of larger errors.
• Many modeling procedures directly minimize it.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Mean Absolute Error (MAE)
• Mean Absolute Error (MAE) is similar to the Mean Squared Error, but it
uses absolute values instead of squaring.
• This measure is not as popular as MSE, though its meaning is more
intuitive .
•
• Bias is the average of the differences between the predicted and actual
values.
• With this measure, positive errors cancel out negative ones.
• Bias is intended to assess how much higher or lower predictions are, on
average, than actual values.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Mean Absolute Percent Error (MAPE)
• Mean Absolute Percent Error (MAPE) is the average of the absolute
errors, as a percentage of the actual values.
• This is a relative measure of error, which is useful when larger errors
are more acceptable on larger actual values.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Classification Accuracy: Estimating Error Rates
• Partition: Training-and-testing
o Use two independent data sets, e.g., training set (2/3), test set(1/3)
o Used for data set with large number of samples
• Cross-validation
o Divide the data set into k subsamples
o Use k-1 subsamples as training data and one sub-sample as test data—k-fold cross-validation
o For data set with moderate size
• Bootstrapping (leave-one-out)
o For small size data
• Confusion Matrix:
o This matrix shows not only how well the classifier predicts different classes
o It describes information about actual and detected classes:
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Classification Accuracy: Estimating Error Rates
Detected
Positive Negative
Actual Positive A: True positive B: False Negative
Negative C: False Positive D: True Negative
• The recall (or the true positive rate) and the precision (or the positive predictive rate) can
be derived from the confusion matrix as follows:
• o Recall = A / A+B
• o Precision = A / A+ C
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Classifier Accuracy Measures
• Accuracy of a classifier M, acc(M):
• percentage of test set tuples that are correctly classified by the model M
• Error rate (misclassification rate) of M = 1 – acc(M)
• Given m classes, CMi,j, an entry in a confusion matrix, indicates # of tuples in class i that are
labeled by the classifier as class j
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
classes buy_computer = yes buy_computer = no total recognition(%)
buy_computer = yes 6954 46 7000 99.34
buy_computer = no 412 2588 3000 86.27
total 7366 2634 10000 95.52
C1 C2
C1 True positive False negative
C2 False positive True negative
Classifier Accuracy Measures
• Alternative accuracy measures (e.g., for cancer diagnosis)
sensitivity = t-pos/pos /* true positive recognition rate */
specificity = t-neg/neg /* true negative recognition rate */
precision = t-pos/(t-pos + f-pos)
accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
• This model can also be used for cost-benefit analysis
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
classes buy_computer = yes buy_computer = no total recognition(%)
buy_computer = yes 6954 46 7000 99.34
buy_computer = no 412 2588 3000 86.27
total 7366 2634 10000 95.52
C1 C2
C1 True positive False negative
C2 False positive True negative
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – III: CLASSIFICATION
Topic 11: ORGANIZING THE CLASSES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or Dimensionality
Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
ORGANIZING THE CLASSES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
ORGANIZING THE CLASSES TAXONOMIES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
ORGANIZING THE CLASSES TAXONOMIES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
ORGANIZING THE CLASSES TAXONOMIES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
ORGANIZING THE CLASSES TAXONOMIES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
ORGANIZING THE CLASSES TAXONOMIES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – III: CLASSIFICATION
Topic 12: INDEXING AND SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or Dimensionality
Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INDEXING AND SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INDEXING AND SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INDEXING AND SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INDEXING AND SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – III: CLASSIFICATION
Topic 13: INVERTED INDEXES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or Dimensionality
Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INVERTED INDEXES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INVERTED INDEXES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INVERTED INDEXES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INVERTED INDEXES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
INVERTED INDEXES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
FULL INVERTED INDEXES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
FULL INVERTED INDEXES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
FULL INVERTED INDEXES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
FULL INVERTED INDEXES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
FULL INVERTED INDEXES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
FULL INVERTED INDEXES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
FULL INVERTED INDEXES
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – III: CLASSIFICATION
Topic 14: SEQUENTIAL SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or Dimensionality
Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SEQUENTIAL SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SEQUENTIAL SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SEQUENTIAL SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SEQUENTIAL SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SEQUENTIAL SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SEQUENTIAL SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
SEQUENTIAL SEARCHING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – III: CLASSIFICATION
Topic 15: MULTI-DIMENSIONAL INDEXING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or Dimensionality
Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional
Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
MULTI-DIMENSIONAL INDEXING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
MULTI-DIMENSIONAL INDEXING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
MULTI-DIMENSIONAL INDEXING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
MULTI-DIMENSIONAL INDEXING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
MULTI-DIMENSIONAL SEARCH
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
MULTI-DIMENSIONAL SEARCH
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES

More Related Content

PDF
CS8080 INFORMATION RETRIEVAL TECHNIQUES - IRT - UNIT - I PPT IN PDF
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS8080 IRT UNIT I NOTES.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS8080_IRT_UNIT - III T1 A CHARACTERIZATION OF TEXT CLASSIFICATION.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
Chapter 1 Introduction to Information Storage and Retrieval.pdf
Habtamu100
 
PPTX
Introduction to Information Retrieval
Roi Blanco
 
PDF
Introduction to Information Retrieval & Models
Mounia Lalmas-Roelleke
 
PPTX
Information retrieval (introduction)
Primya Tamil
 
PPTX
The impact of web on ir
Primya Tamil
 
CS8080 INFORMATION RETRIEVAL TECHNIQUES - IRT - UNIT - I PPT IN PDF
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS8080_IRT_UNIT - III T1 A CHARACTERIZATION OF TEXT CLASSIFICATION.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
Chapter 1 Introduction to Information Storage and Retrieval.pdf
Habtamu100
 
Introduction to Information Retrieval
Roi Blanco
 
Introduction to Information Retrieval & Models
Mounia Lalmas-Roelleke
 
Information retrieval (introduction)
Primya Tamil
 
The impact of web on ir
Primya Tamil
 

What's hot (20)

PPTX
Term weighting
Primya Tamil
 
PPT
Inverted index
Krishna Gehlot
 
PDF
CS8080_IRT_UNIT - III T4 SUPERVISED ALGORITHMS.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
Anomaly detection
QuantUniversity
 
PPTX
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
Simplilearn
 
PPTX
Data mining: Classification and prediction
DataminingTools Inc
 
PPTX
Boolean,vector space retrieval Models
Primya Tamil
 
PDF
Anomaly detection
Hitesh Mohapatra
 
PPTX
Information retrieval introduction
nimmyjans4
 
PPTX
Lecture #01
Konpal Darakshan
 
PPTX
Classification in data mining
Sulman Ahmed
 
PPTX
Query processing
Ravinder Kamboj
 
PPT
2.3 bayesian classification
Krish_ver2
 
PPTX
Vector space model of information retrieval
Nanthini Dominique
 
PPT
Unit 1 chapter 1 Design and Analysis of Algorithms
P. Subathra Kishore, KAMARAJ College of Engineering and Technology, Madurai
 
PPTX
Data science.chapter-1,2,3
varshakumar21
 
PPTX
Association rule mining.pptx
maha797959
 
PPT
1.8 discretization
Krish_ver2
 
PDF
CS6010 Social Network Analysis Unit III
pkaviya
 
Term weighting
Primya Tamil
 
Inverted index
Krishna Gehlot
 
CS8080_IRT_UNIT - III T4 SUPERVISED ALGORITHMS.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
Anomaly detection
QuantUniversity
 
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
Simplilearn
 
Data mining: Classification and prediction
DataminingTools Inc
 
Boolean,vector space retrieval Models
Primya Tamil
 
Anomaly detection
Hitesh Mohapatra
 
Information retrieval introduction
nimmyjans4
 
Lecture #01
Konpal Darakshan
 
Classification in data mining
Sulman Ahmed
 
Query processing
Ravinder Kamboj
 
2.3 bayesian classification
Krish_ver2
 
Vector space model of information retrieval
Nanthini Dominique
 
Unit 1 chapter 1 Design and Analysis of Algorithms
P. Subathra Kishore, KAMARAJ College of Engineering and Technology, Madurai
 
Data science.chapter-1,2,3
varshakumar21
 
Association rule mining.pptx
maha797959
 
1.8 discretization
Krish_ver2
 
CS6010 Social Network Analysis Unit III
pkaviya
 
Ad

Similar to CS8080 information retrieval techniques unit iii ppt in pdf (20)

PDF
CS8080_IRT_UNIT - III T11 ORGANIZING THE CLASSES.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS8080_IRT_UNIT - III T11 ORGANIZING THE CLASSES.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS8080_IRT_UNIT - III T3 NAIVE TEXT CLASSIFICATION.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS8080_IRT_UNIT - III T12 INDEXING AND SEARCHING.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS8080_IRT_UNIT - III T14 SEQUENTIAL SEARCHING.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS8080_IRT_UNIT - III T13 INVERTED INDEXES.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
AN ELABORATION OF TEXT CATEGORIZATION AND AUTOMATIC TEXT CLASSIFICATION THROU...
cseij
 
PDF
CS8080_IRT_UNIT - III T9 EVALUATION METRICS.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PPT
Data Mining
shrapb
 
PPT
Lecture1
sumit621
 
PPTX
JM Information Retrieval Techniques Unit III
JeyamohanHAsstProfCS
 
PDF
Text Classification/Categorization
Oswal Abhishek
 
PPTX
Presentation data mining
Abdul Haseeb
 
PPTX
Ir 01
Mohammed Romi
 
PPTX
Text mining
Koshy Geoji
 
PPTX
Text Mining.pptx
vrundadevani
 
PDF
CS8080_IRT_UNIT - III T15 MULTI-DIMENSIONAL INDEXING.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
Text Classification, Sentiment Analysis, and Opinion Mining
Fabrizio Sebastiani
 
PDF
Machine learning in automated text categorization
unyil96
 
CS8080_IRT_UNIT - III T11 ORGANIZING THE CLASSES.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS8080_IRT_UNIT - III T11 ORGANIZING THE CLASSES.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS8080_IRT_UNIT - III T3 NAIVE TEXT CLASSIFICATION.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS8080_IRT_UNIT - III T12 INDEXING AND SEARCHING.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS8080_IRT_UNIT - III T14 SEQUENTIAL SEARCHING.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS8080_IRT_UNIT - III T13 INVERTED INDEXES.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
AN ELABORATION OF TEXT CATEGORIZATION AND AUTOMATIC TEXT CLASSIFICATION THROU...
cseij
 
CS8080_IRT_UNIT - III T9 EVALUATION METRICS.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
Data Mining
shrapb
 
Lecture1
sumit621
 
JM Information Retrieval Techniques Unit III
JeyamohanHAsstProfCS
 
Text Classification/Categorization
Oswal Abhishek
 
Presentation data mining
Abdul Haseeb
 
Text mining
Koshy Geoji
 
Text Mining.pptx
vrundadevani
 
CS8080_IRT_UNIT - III T15 MULTI-DIMENSIONAL INDEXING.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
Text Classification, Sentiment Analysis, and Opinion Mining
Fabrizio Sebastiani
 
Machine learning in automated text categorization
unyil96
 
Ad

More from AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING (19)

PPTX
JAVA PROGRAM CONSTRUCTS OR LANGUAGE BASICS.pptx
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PPTX
CS3391 OOP UT-I T4 JAVA BUZZWORDS.pptx
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PPTX
CS3391 OOP UT-I T1 OVERVIEW OF OOP
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PPTX
CS3391 OOP UT-I T3 FEATURES OF OBJECT ORIENTED PROGRAMMING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PPTX
CS3391 OOP UT-I T2 OBJECT ORIENTED PROGRAMMING PARADIGM.pptx
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS3391 -OOP -UNIT – V NOTES FINAL.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS3391 -OOP -UNIT – IV NOTES FINAL.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS3391 -OOP -UNIT – III NOTES FINAL.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS3391 -OOP -UNIT – II NOTES FINAL.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS3391 -OOP -UNIT – I NOTES FINAL.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS8080 IRT UNIT - III SLIDES IN PDF.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS8080_IRT_UNIT - III T10 ACCURACY AND ERROR.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS8080_IRT_UNIT - III T8 FEATURE SELECTION OR DIMENSIONALITY REDUCTION.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS8080_IRT_UNIT - III T7 SVM CLASSIFIER.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS8080_IRT_UNIT - III T6 K-NN CLASSIFIER.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS8080_IRT_UNIT - III T5 DECISION TREES.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
PDF
CS8080_IRT_UNIT - III T2 UNSUPERVISED ALGORITHMS -CLUSTERING.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
JAVA PROGRAM CONSTRUCTS OR LANGUAGE BASICS.pptx
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS3391 OOP UT-I T4 JAVA BUZZWORDS.pptx
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS3391 OOP UT-I T1 OVERVIEW OF OOP
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS3391 OOP UT-I T3 FEATURES OF OBJECT ORIENTED PROGRAMMING
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS3391 OOP UT-I T2 OBJECT ORIENTED PROGRAMMING PARADIGM.pptx
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS3391 -OOP -UNIT – V NOTES FINAL.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS3391 -OOP -UNIT – IV NOTES FINAL.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS3391 -OOP -UNIT – III NOTES FINAL.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS3391 -OOP -UNIT – II NOTES FINAL.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS3391 -OOP -UNIT – I NOTES FINAL.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS8080 IRT UNIT - III SLIDES IN PDF.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS8080_IRT_UNIT - III T10 ACCURACY AND ERROR.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS8080_IRT_UNIT - III T8 FEATURE SELECTION OR DIMENSIONALITY REDUCTION.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS8080_IRT_UNIT - III T7 SVM CLASSIFIER.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS8080_IRT_UNIT - III T6 K-NN CLASSIFIER.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS8080_IRT_UNIT - III T5 DECISION TREES.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
CS8080_IRT_UNIT - III T2 UNSUPERVISED ALGORITHMS -CLUSTERING.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 

Recently uploaded (20)

PDF
Biodegradable Plastics: Innovations and Market Potential (www.kiu.ac.ug)
publication11
 
PPTX
database slide on modern techniques for optimizing database queries.pptx
aky52024
 
PDF
AI-Driven IoT-Enabled UAV Inspection Framework for Predictive Maintenance and...
ijcncjournal019
 
PDF
Packaging Tips for Stainless Steel Tubes and Pipes
heavymetalsandtubes
 
PDF
CAD-CAM U-1 Combined Notes_57761226_2025_04_22_14_40.pdf
shailendrapratap2002
 
PDF
Zero carbon Building Design Guidelines V4
BassemOsman1
 
PDF
Cryptography and Information :Security Fundamentals
Dr. Madhuri Jawale
 
PDF
The Effect of Artifact Removal from EEG Signals on the Detection of Epileptic...
Partho Prosad
 
PDF
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
PPTX
Civil Engineering Practices_BY Sh.JP Mishra 23.09.pptx
bineetmishra1990
 
PDF
Introduction to Ship Engine Room Systems.pdf
Mahmoud Moghtaderi
 
PPT
Understanding the Key Components and Parts of a Drone System.ppt
Siva Reddy
 
PPTX
Victory Precisions_Supplier Profile.pptx
victoryprecisions199
 
PDF
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
PDF
LEAP-1B presedntation xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
hatem173148
 
PPTX
IoT_Smart_Agriculture_Presentations.pptx
poojakumari696707
 
PDF
EVS+PRESENTATIONS EVS+PRESENTATIONS like
saiyedaqib429
 
PDF
top-5-use-cases-for-splunk-security-analytics.pdf
yaghutialireza
 
PDF
Construction of a Thermal Vacuum Chamber for Environment Test of Triple CubeS...
2208441
 
PPTX
sunil mishra pptmmmmmmmmmmmmmmmmmmmmmmmmm
singhamit111
 
Biodegradable Plastics: Innovations and Market Potential (www.kiu.ac.ug)
publication11
 
database slide on modern techniques for optimizing database queries.pptx
aky52024
 
AI-Driven IoT-Enabled UAV Inspection Framework for Predictive Maintenance and...
ijcncjournal019
 
Packaging Tips for Stainless Steel Tubes and Pipes
heavymetalsandtubes
 
CAD-CAM U-1 Combined Notes_57761226_2025_04_22_14_40.pdf
shailendrapratap2002
 
Zero carbon Building Design Guidelines V4
BassemOsman1
 
Cryptography and Information :Security Fundamentals
Dr. Madhuri Jawale
 
The Effect of Artifact Removal from EEG Signals on the Detection of Epileptic...
Partho Prosad
 
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
Civil Engineering Practices_BY Sh.JP Mishra 23.09.pptx
bineetmishra1990
 
Introduction to Ship Engine Room Systems.pdf
Mahmoud Moghtaderi
 
Understanding the Key Components and Parts of a Drone System.ppt
Siva Reddy
 
Victory Precisions_Supplier Profile.pptx
victoryprecisions199
 
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
LEAP-1B presedntation xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
hatem173148
 
IoT_Smart_Agriculture_Presentations.pptx
poojakumari696707
 
EVS+PRESENTATIONS EVS+PRESENTATIONS like
saiyedaqib429
 
top-5-use-cases-for-splunk-security-analytics.pdf
yaghutialireza
 
Construction of a Thermal Vacuum Chamber for Environment Test of Triple CubeS...
2208441
 
sunil mishra pptmmmmmmmmmmmmmmmmmmmmmmmmm
singhamit111
 

CS8080 information retrieval techniques unit iii ppt in pdf

  • 1. P1WU UNIT – III: CLASSIFICATION Topic 1: A CHARACTERIZATION OF TEXT CLASSIFICATION AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 2. UNIT III 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 3. INTRODUCTION TO CLASSIFICATION AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 4. INTRODUCTION TO CLASSIFICATION • Scientists became very serious about addressing the question: • “Can we build a model that learns from available data and automatically makes the right decisions and predictions?” • Answer can be found in numerous applications that are emerging from the fields of 1. pattern classification, 2. machine learning, and 3. artificial intelligence. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 5. INTRODUCTION TO CLASSIFICATION • Data from various sensoring devices combined with powerful learning algorithms and domain knowledge led to : • many great inventions that we now take for granted in our everyday life: • Internet queries via search engines like Google, • text recognition at the post office, • barcode scanners at the supermarket, the diagnosis of diseases, • speech recognition by Siri or • Google Now on our mobile phone, just to name a few. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 6. INTRODUCTION TO CLASSIFICATION • Classification is: • the data mining process of • finding a model (or function) that • describes and distinguishes data classes or concepts, • for the purpose of being able to use the model to predict the class of objects whose class label is unknown. • That is, predicts categorical class labels (discrete or nominal). • Classifies the data (constructs a model) based on the training set. • It predict group membership for data instances. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 7. INTRODUCTION TO CLASSIFICATION What is CLASSIFICATION? • Classification and prediction are : • two forms of data analysis that can used to extract models describing important data classes or to predict the future data trends. • C & P help us to provide a better understanding of large data. • Classification predicts categorical (discrete, unordered) labels. • Prediction models continuous valued functions. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 8. INTRODUCTION TO CLASSIFICATION • How can we classify? • The trick here is Machine Learning which requires us to make classifications based on past observations (the learning part). • We give the machine a set of data having texts with labels tagged to it and then we let the model to learn on all these data which will later give us some useful insight on the categories of text input we feed. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 9. Applications of Classification • Classification of (potential) customers for: • Credit approval, risk prediction, selective marketing • Performance prediction based on • selected indicators • Medical diagnosis based on symptoms or reactions to Therapy • Application areas: • Credit approval • Target marketing • Medical diagnosis • Treatment effectiveness analysis • Performance prediction AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 10. When is classification needed? • Scenarios: • In each of these examples, the data analysis task is classification, • where a model or classifier is constructed to predict categorical labels, such as • “safe” or “risky” for the loan application data; • “yes” or “no” for the marketing data; or • “treatment A,” “treatment B,” or “treatment C” for the medical data. • These categories can be represented by discrete values, where the ordering among values has no meaning. • For example, • the values 1, 2, and 3 may be used to represent treatments A, B, and C, • where there is no ordering implied among this group of treatment regimes. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 11. INTRODUCTION TO CLASSIFICATION Aim: predict categorical class labels for new tuples/samples Input: a training set of tuples/samples, each with a class label Output: a model (a classifier) based on the training set and the class labels AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 12. Why Classification? • A classical problem extensively studied by • statisticians and machine learning researchers • Predicts categorical class labels. • Produces a model (classifier). AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 13. Typical Applications of Classification • Example: • {credit history, salary} credit approval ( Yes/No) • {Temp, Humidity}  Rain (Yes/No) • A set of documents  sports, technology, etc. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES • Another Example: • If x >= 90 then grade =A. • If 80<=x<90 then grade =B. • If 70<=x<80 then grade =C. • If 60<=x<70 then grade =D. • If x<50 then grade =F.
  • 14. WHAT ARE TEXT CLASSIFICATION? • Text classification is a machine learning technique that assigns a set of predefined categories to open-ended text. • Text classifiers can be used to organize, structure, and categorize pretty much any kind of text – from documents, medical studies and files, and all over the web. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 15. What is meant by text classification? • Text classification or Text Categorization is the activity of labeling natural language texts with relevant categories from a predefined set. • In laymen terms, text classification is a process of extracting generic tags from unstructured text. • These generic tags come from a set of pre-defined categories. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 16. What is meant by text classification or Document classification ? • Document classification or document categorization is • a problem in library science, information science and computer science. • The task is to assign a document to one or more classes or categories. • This may be done "manually" or algorithmically. •Wikipedia AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 17. What is meant by text classification? • Text classification also known as text tagging or text categorization is the process of categorizing text into organized groups. • By using Natural Language Processing (NLP), text classifiers can automatically analyze text and then assign a set of pre-defined tags or categories based on its content. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 18. Text Classification Examples • Text classification is becoming • an increasingly important part of businesses as it allows to easily get insights from data and automate business processes. • Some of the most common examples and use cases for automatic text classification include the following: a) Sentiment Analysis b) Topic Detection c) Language Detection AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 19. Text Classification Examples a) Sentiment Analysis: the process of understanding if a given text is talking positively or negatively about a given subject (e.g. for brand monitoring purposes). b) Topic Detection: the task of identifying the theme or topic of a piece of text (e.g. know if a product review is about Ease of Use, Customer Support, or Pricing when analyzing customer feedback). c) Language Detection: the procedure of detecting the language of a given text (e.g. know if an incoming support ticket is written in English or Spanish for automatically routing tickets to the appropriate team). AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 20. A Characterization of Text Classification • For example, • new articles can be organized by topics; • support tickets can be organized by urgency; • chat conversations can be organized by language; • brand mentions can be organized by sentiment; and so on. • Text classification is • one of the fundamental tasks in natural language processing with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection. • Here’s an example of how it works: • “The user interface is quite straightforward and easy to use.” • A text classifier can take this phrase as an input, analyze its content, and then automatically assign relevant tags, such as UI and Easy To Use. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 21. A Characterization of Text Classification • First tactic for categorizing documents is to assign a label to each document, • but this solve the problem only when the users know the labels of the documents they looking for. • This tactic does not solve more generic problem of finding documents on specific topic or subject. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 22. A Characterization of Text Classification • For that case, better solution is to • group documents by common generic topics and label each group with a meaningful name. • Each labeled group is called category or class. • Document classification is • the process of categorizing documents under a given cluster or category using fully supervised learning process. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 23. Why is Text Classification Important? • It’s estimated that around 80% of all information is unstructured, with text being one of the most common types of unstructured data. • Because of the messy nature of text, • analyzing, understanding, organizing, and sorting through text data is hard and time-consuming, so most companies fail to use it to its full potential. • This is where text classification with machine learning comes in. • Using text classifiers, companies can automatically structure all manner of relevant text, from • , legal documents, social media, chatbots, surveys, and more in a fast and cost-effective way. • This allows companies to • save time analyzing text data, automate business processes, and make data-driven business decisions. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 24. Reasons for: Text Classification Important a) Scalability • Manually analyzing and organizing is slow and much less accurate.. • Machine learning can automatically analyze millions of surveys, comments, emails, etc., at a fraction of the cost, often in just a few minutes. • Text classification tools are scalable to any business needs, large or small. b) Real-time analysis • There are critical situations that companies need to identify as soon as possible and take immediate action (e.g., PR crises on social media). • Machine learning text classification can follow your brand mentions constantly and in real time, so you'll identify critical information and be able to take action right away. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 25. Reasons for: Text Classification Important c) Consistent criteria • Human annotators make mistakes when classifying text data due to distractions, fatigue, and boredom, and human subjectivity creates inconsistent criteria. • Machine learning, on the other hand, applies the same lens and criteria to all data and results. • Once a text classification model is properly trained it performs with unsurpassed accuracy. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 26. A Characterization of Text Classification • Classification could be performed 1. manually by domain experts or 2. automatically using well- known and • widely used classification algorithms such as decision tree and Naïve Bayes. • Documents are classified according to • other attributes (e.g. author, document type, publishing year etc.) or according to their subjects. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 27. A Characterization of Text Classification • there are two main kind of subject classification of documents: 1. The content based approach and 2. the request based approach. • In Content based classification, • the weight that is given to subjects in a document decides the class to which the document is assigned. • For example, it is a rule in some library classification that at least 15% of the content of a book should be about the class to which the book is assigned. • In automatic classification, the number of times given words appears in a document determine the class. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 28. A Characterization of Text Classification • In Request oriented classification, the anticipated request from users is impacting how documents are being classified. • The classifier asks himself: • “Under which description should this entity be found?” and • “think of all the possible queries and decide for which ones the entity at hand is relevant”. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 29. Text Classification Applications • With the help of text classification, businesses can make sense of large amounts of data using techniques like • aspect-based sentiment analysis to understand what people are talking about and how they’re talking about each aspect. • Text classification can help support teams provide a stellar experience by • automating tasks that are better left to computers, saving precious time that can be spent on more important things. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 30. Text Classification Applications • models can help you analyze survey results to discover patterns and insights like: • What do people like about our product or service? • What should we improve? • What do we need to change? • By combining both quantitative results and qualitative analyses, • teams can make more informed decisions without having to spend hours manually analyzing every single open-ended response. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 31. Text Classification Applications • Text classification has thousands of use cases and is applied to a wide range of tasks. • In some cases, data classification tools work behind the scenes to enhance app features we interact with on a daily basis (like email spam filtering). • In some other cases, classifiers are used by marketers, product managers, engineers, and salespeople to automate business processes and save hundreds of hours of manual data processing. • Some of the top applications and use cases of text classification include: 1. Detecting urgent issues 2. Automating customer support processes 3. Listening to the Voice of customer (VoC) AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 32. A Characterization of Text Classification • Automatic document classification tasks can be divided into three types 1. Unsupervised document classification (document clustering): the classification must be done totally without reference to external information. 2. Semi-supervised document classification: parts of the documents are labeled by the external method. 3. Supervised document classification where some external method (such as human feedback) provides information on the correct classification for documents AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 33. Computational Supervised Learning • Computational Supervised Learning is also called classification aimed to: • Learn from past experience, and • use the learned knowledge to classify new data • Knowledge learned by intelligent algorithms • Examples: • Clinical diagnosis for patients • Cell type classification AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 34. Overall Picture of Supervised Learning AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES Biomedical Financial Government Scientific Decision trees Emerging patterns SVM Neural networks Classifiers (M-Doctors)
  • 35. Unsupervised Learning • Unsupervised learning is a machine learning technique in which models are not supervised using training dataset. Instead, models itself find the hidden patterns and insights from the given data. It can be compared to learning which takes place in the human brain while learning new things. It can be defined as: • “Unsupervised learning is a type of machine learning in which models are trained using unlabeled dataset and are allowed to act on that data without any supervision”. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 36. Unsupervised Learning Unsupervised learning cannot be directly applied to a regression or classification problem because unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of dataset, group that data according to similarities, and represent that dataset in a compressed format. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 37. Unsupervised Learning Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never trained upon the given dataset, which means it does not have any idea about the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on their own. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 38. Unsupervised Learning • . Unsupervised learning algorithm will • perform this task by clustering the image dataset into the groups according to similarities between images. • By Simply, • no training data is provided Examples: • neural network models • independent component analysis • clustering AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 39. Supervised vs. Unsupervised Learning classification Vs clustering • Supervised learning (classification) • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations • New data is classified based on the training set • Unsupervised learning (clustering) • The class labels of training data is unknown • Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 40. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 41. P1WU UNIT – III: CLASSIFICATION Topic 2: UNSUPERVIZED ALGORITHMS - CLUSTERING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 42. UNIT III 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 43. INTRODUCTION TO UNSUPERVIZED ALGORITHMS • Below is the list of some popular unsupervised learning algorithms: • K-means clustering • KNN (k-nearest neighbors) • Hierarchal clustering • Anomaly detection • Neural Networks • Principle Component Analysis • Independent Component Analysis • Apriori algorithm • Singular value decomposition AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 44. INTRODUCTION TO UNSUPERVIZED ALGORITHMS AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 45. WHAT ARE CLUSTERING? • Clustering or cluster analysis is a machine learning technique, which groups the unlabelled dataset. • It can be defined as "A way of grouping the data points into different clusters, consisting of similar data points. The objects with the possible similarities remain in a group that has less or no similarities with another group." AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 46. WHAT ARE CLUSTERING? • It does it by • finding some similar patterns in the unlabelled dataset such as shape, size, color, behavior, etc., and divides them as per the presence and absence of those similar patterns. • It is an unsupervised learning method, • hence no supervision is provided to the algorithm, and it deals with the unlabeled dataset. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 47. Difference between Supervised and Unsupervised Learning AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES Supervised Learning Unsupervised Learning Supervised learning algorithms aretrained using labeled data. Unsupervised learning algorithmsare trained using unlabeled data. Supervised learning model takesdirect feedback to check if it is predicting correct output or not. Unsupervised learning model doesnot take any feedback. Supervised learning model predictsthe output. Unsupervised learning model findsthe hidden patterns in data. Supervised learning needs supervision to train the model. Unsupervised learning does not needany supervision to train the model. Supervised learning can becategorized in Classification and Regression problems. Unsupervised Learning can beclassified in Clustering and Associations problems. Supervised learning can be used for those cases where we know theinput as well as corresponding outputs. Unsupervised learning can be used for those cases where we have onlyinput data and no corresponding output data. Supervised learning model produces an accurate result. Unsupervised learning model may give less accurate result as compared to supervised learning. It includes various algorithms such It includes various algorithms such
  • 48. Advantages of Unsupervised Learning • Unsupervised learning is used for more complex tasks as compared to supervised learning because, • in unsupervised learning, we don't have labeled input data. • Unsupervised learning is preferable as • it is easy to get unlabeled data in comparison to labeled data. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 49. Disadvantages of Unsupervised Learning • Unsupervised learning is • intrinsically more difficult than supervised learning as it does not have corresponding output. • The result of the unsupervised learning algorithm might be • less accurate as input data is not labeled, and algorithms do not know the exact output in advance. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 50. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 51. P1WU UNIT – III: CLASSIFICATION Topic 3: NAÏVE TEXT CLASSIFICATION AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 52. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 53. NAÏVE TEXT CLASSIFICATION AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 54. INTRODUCTION TO NAÏVE TEXT CLASSIFICATION • Naive Bayes classifiers are a collection of classification algorithms based on Bayes Theorem. • It is not a single algorithm but a family of algorithms where all of them share a common principle, i.e. every pair of features being classified is independent of each other. • Naive Bayes classifiers have been heavily used for text classification and text analysis machine learning problems. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 55. INTRODUCTION TO NAÏVE TEXT CLASSIFICATION • Text Analysis is a major application field for machine learning algorithms. • However the raw data, • a sequence of symbols (i.e. strings) cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 56. The Naive Bayes algorithm • Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. • It is not a single algorithm but a family of algorithms where all of them share a common principle, • i.e. every pair of features being classified is independent of each other. • The dataset is divided into two parts, namely, feature matrix and the response/target vector. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 57. The Naive Bayes algorithm • The Feature matrix (X) contains all the vectors(rows) of the dataset in which each vector consists of the value of dependent features. The number of features is d i.e. X = (x1,x2,x2, xd). • The Response/target vector (y) contains the value of class/group variable for each row of feature matrix. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 58. The Bayes’ Theorem Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes’ theorem is stated mathematically as follows: AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 59. The Bayes’ Theorem • where: • A and B are called events. • P(A | B) is the probability of event A, given the event B is true (has occured) • Event B is also termed as evidence. P(A) is the priori of A (the prior independent probability, i.e. probability of event before evidence is seen). • P(B | A) is the probability of B given event A, i.e. probability of event B after evidence A is seen. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 60. The Bayes’ Theorem • Summary AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 61. Dealing with text data • Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols (i.e. strings) cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 62. Dealing with text data • In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely: • tokenizing strings and giving an integer id for each possible token, for instance by using w ite-spaces and punctuation as token separators. • counting the occurrences of tokens in each document. • In this scheme, features and samples are defined as follows: • each individual token occurrence frequency is treated as a feature. • the vector of all the token frequencies for a given document is considered a multivariate sample. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 63. Example 1 : Using the Naive Bayesian Classifier • We will consider the following training set. • The data samples are described by attributes age, income, student, and credit. • The class label attribute, buy, tells whether the person buys a computer, has two distinct values, yes (class C1) and no (class C2). AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 64. Example 1 : Using the Naive Bayesian Classifier AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES RID Age Income student Credit Ci: buy 1 Youth High no Fair C2: no 2 Youth High no Excellent C2: no 3 middle-aged High no Fair C1: yes 4 Senior medium no Fair C1: yes 5 Senior Low yes Fair C1: yes 6 Senior Low yes Excellent C2: no 7 middle-aged Low yes Excellent C1: yes 8 Youth medium no Fair C2: no 9 Youth Low yes Fair C1: yes 10 Senior medium yes Fair C1: yes 11 Youth medium yes Excellent C1: yes 12 middle-aged medium no Excellent C1: yes 13 middle-aged High yes Fair C1: yes 14 Senior medium no Excellent C2: no
  • 65. Example 1 : Using the Naive Bayesian Classifier • The sample we wish to classify is • X = (age = youth, income = medium, student = yes, credit = fair) • We need to maximize P (X|Ci)P (Ci), for i = 1, 2. P (Ci), the a priori probability of each class, can be estimated based on the training samples: • P(buy =yes ) = 9 /14 • P(buy =no ) = 5 /14 AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 66. Example 1 : Using the Naive Bayesian Classifier • To compute P (X|Ci), for i = 1, 2, we compute the following conditional probabilities: • P(age=youth | buy =yes ) = 2/9 • P(income =medium | buy =yes ) = 4/9 • P(student =yes | buy =yes ) = 6/9 • P(credit =fair | buy =yes ) = 6/9 • P(age=youth | buy =no ) = 3/5 • P(income =medium | buy =no ) = 2/5 • P(student =yes | buy =no ) = 1/5 • P(credit =fair | buy =no ) = 2/5 AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 67. Example 1 : Using the Naive Bayesian Classifier • Using the above probabilities, we obtain AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 68. Example 1 : Using the Naive Bayesian Classifier • Similarly AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES To find the class that maximizes P (X|Ci)P (Ci), we compute Thus the naive Bayesian classifier predicts buy = yes for sample X
  • 69. Example 2: Predicting a class label using naïve Bayesian classification • Predicting a class label using naïve Bayesian classification. • The training data set is given below: • The data tuples are described by the attributes Owns Home?, Married, Gender and Employed. • The class label attribute Risk Class has three distinct values. • Let C1 corresponds to the class A, and C2 corresponds to the class B and C3 corresponds to the class C. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 70. Example 1 : Using the Naive Bayesian Classifier • The tuple is to classify is, • X = (Owns Home = Yes, Married = No, Gender = Female, Employed = Yes) AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES Owns Home Married Gender Employed Risk Class Yes Yes Male Yes B No No Female Yes A Yes Yes Female Yes C Yes No Male No B No Yes Female Yes C No No Female Yes A No No Male No B Yes No Female Yes A No Yes Female Yes C Yes Yes Female Yes C
  • 71. Example 2: Predicting a class label using naïve Bayesian classification • Solution • There are 10 samples and three classes. • Risk class A = 3 Risk class B = 3 Risk class C = 4 • • The prior probabilities are obtained by dividing these frequencies by the total number in the training data, • P(A) = 3/10 = 0.3 P(B) = 3/10 = 0.3 P(C) = 4/10 = 0.4 AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 72. Example 2: Predicting a class label using naïve Bayesian classification • To compute P(X/Ci) =P {yes, no, female, yes}/Ci) for each of the classes, the conditional probabilities for each: • P(Owns Home = Yes/A) = 1/3 =0.33 • P(Married = No/A) = 3/3 =1 • P(Gender = Female/A) = 3/3 = 1 • P(Employed = Yes/A) = 3/3 = 1 • • P(Owns Home = Yes/B) = 2/3 =0.67 • P(Married = No/B) = 2/3 =0.67 • P(Gender = Female/B) = 0/3 = 0 • P(Employed = Yes/B) = 1/3 = 0.33 • • P(Owns Home = Yes/C) = 2/4 =0.5 • P(Married = No/C) = 0/4 =0 • P(Gender = Female/C) = 4/4 = 1 • P(Employed = Yes/C) = 4/4 = 1 AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 73. Example 2: Predicting a class label using naïve Bayesian classification • Using the above probabilities, we obtain • P(X/A)= P(Owns Home = Yes/A) X • P(Married = No/A) x • P(Gender = Female/A) X • P(Employed = Yes/A) = 0.33 x 1 x 1 x 1 = 0.33 • Similarly, P(X/B)= 0 , P(X/C) =0 • • To find the class, G, that maximizes, P(X/Ci)P(Ci), we compute, • P(X/A) P(A) = 0.33 X 0.3 = 0.099 • P(X/B) P(B) =0 X 0.3 = 0 • P(X/C) P(C) = 0 X 0.4 = 0.0 • Therefore x is assigned to class A AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 74. Advantages and Disadvantages • Advantages: a) Have the minimum error rate in comparison to all other classifiers. b) Easy to implement c) Good results obtained in most of the cases. d) They provide theoretical justification for other classifiers that do not explicitly use • Disadvantages: a) Lack of available probability data. b) Inaccuracies in the assumption. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 75. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 76. P1WU UNIT – III: CLASSIFICATION Topic 4: SUPERVISED ALGORITHMS AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 77. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 78. SUPERVIZED LEARNING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 79. INTRODUCTION TO SUPERVIZED LEARNING • Supervised learning, also known as supervised machine learning, is a subcategory of machine learning and artificial intelligence. • It is defined by its use of labeled datasets to train algorithms that to classify data or predict outcomes accurately. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 80. What is supervised learning? • Supervised learning, also known as supervised machine learning, is • a subcategory of machine learning and artificial intelligence. • It is defined by • its use of labeled datasets to train algorithms that to classify data or predict outcomes accurately. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 81. Supervised Learning • It is defined by its use of labeled datasets to • train algorithms that to classify data or predict outcomes accurately. • As input data is fed into the model, it adjusts its weights until the model has been fitted appropriately, • which occurs as part of the cross validation process. • Supervised learning helps organizations solve for • a variety of real-world problems at scale, such as classifying spam in a separate folder from your inbox. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 82. What is the type of supervised learning? • There are two types of Supervised Learning techniques: 1. Regression and 2. Classification. • Classification separates the data, Regression fits the data. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 83. Example of Supervised Learning • A great example of supervised learning is text classification problems. • In this set of problems, the goal is to • predict the class label of a given piece of text. • One particularly popular topic in text classification is to • predict the sentiment of a piece of text, like a tweet or a product review. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 84. SUPERVIZED ALGORITHMS AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 85. INTRODUCTION TO SUPERVIZED ALGORITHMS • Which are supervised algorithm? • A supervised learning algorithm takes • a known set of input data (the learning set) and known responses to the data (the output), and forms a model to generate reasonable predictions for the response to the new input data. • Use supervised learning if you have existing data for the output you are trying to predict. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 86. SUPERVIZED ALGORITHMS EXAMPLE AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 87. INTRODUCTION TO SUPERVIZED ALGORITHMS • Various algorithms and computation techniques are used in supervised machine learning processes. • Most commonly used learning methods, typically calculated through use of programs like R or Python are: 1) Neural networks • Primarily leveraged for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 88. INTRODUCTION TO SUPERVIZED ALGORITHMS 2) Naive Bayes • Naive Bayes is classification approach that adopts the principle of class conditional independence from the Bayes Theorem. 3) Linear regression • Linear regression is used to identify the relationship between a dependent variable and one or more independent variables and is typically leveraged to make predictions about future outcomes. 4) Logistic regression • While linear regression is leveraged when dependent variables are continuous, logistical regression is selected when the dependent variable is categorical, meaning they have binary outputs, such as "true" and "false" or "yes" and "no." AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 89. INTRODUCTION TO SUPERVIZED ALGORITHMS 5) Support vector machine (SVM) • A support vector machine is a popular supervised learning model developed by Vladimir Vapnik, used for both data classification and regression. 6) K-nearest neighbor • K-nearest neighbor, also known as the KNN algorithm, is a non-parametric algorithm that classifies data points based on their proximity and association to other available data. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 90. INTRODUCTION TO SUPERVIZED ALGORITHMS 7) Random forest • Random forest is another flexible supervised machine learning algorithm used for both classification and regression purposes. • The "forest" references a collection of uncorrelated decision trees, which are then merged together to reduce variance and create more accurate data predictions. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 91. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 92. P1WU UNIT – III: CLASSIFICATION Topic 5: DECISION TREES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 93. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 94. DECISION TREES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 95. INTRODUCTION TO DECISION TREES • What is a decision tree? • A decision tree is a structure that includes a root node, branches, and leaf nodes. a) Each internal node denotes a test on an attribute, b) each branch denotes the outcome of a test, and c) each leaf node holds a class label. • The topmost node in the tree is the root node. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 96. INTRODUCTION TO DECISION TREES • ID3, C4.5, and CART adopt a greedy (i.e., non-backtracking) approach in which decision trees are constructed in a top- down recursive divide-and-conquer manner. • Most algorithms for decision tree induction also follow such a top-down approach, which starts with a training set of tuples and their associated class labels. • The training set is recursively partitioned into smaller subsets as the tree is being built. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 97. INTRODUCTION TO DECISION TREES • Decision tree induction is the learning of decision trees from class-labeled training tuples. • A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, • each branch represents an outcome of the test, and • each leaf node (or terminal node) holds a class label. • The top most node in a tree is the root node. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 98. INTRODUCTION TO DECISION TREES •A decision tree is a tree where • internal node = a test on an attribute • tree branch = an outcome of the test • leaf node = class label or class distribution AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 99. Benefits of Decision Trees •The benefits of having a decision tree are as follows − a) It does not require any domain knowledge. b) It is easy to comprehend. c) The learning and classification steps of a decision tree are simple and fast. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 100. Brief History of Decision Trees AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES CLS (Hunt etal. 1966)--- cost driven ID3 (Quinlan, 1986 MLJ) --- Information-driven C4.5 (Quinlan, 1993) --- Gain ratio + Pruning ideas CART (Breiman et al. 1984) --- Gini Index
  • 101. Elegance of Decision Trees AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 102. Structure of Decision Trees • If x1 > a1 & x2 > a2, then it’s A class • C4.5, CART, two of the most widely used • Easy interpretation, but accuracy generally unattractive AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES Leaf nodes Internal nodes Root node A B B A A x1 x2 x4 x3 > a1 > a2
  • 103. Example of Decision Tree AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 104. Another Example of Decision Tree AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 105. Decision Tree classification Tasks AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 106. 5/16/2022 Data Mining: Concepts and Techniques 15 Apply Model to Test Data Refund MarSt TaxInc YES NO NO NO Yes No Married Single, Divorced < 80K > 80K Refund Marital Status Taxable Income Cheat No Married 80K ? 10 Test Data Assign Cheat to “No”
  • 107. 5/16/2022 Data Mining: Concepts and Techniques 16 Apply Model to Test Data Refund MarSt TaxInc YES NO NO NO Yes No Married Single, Divorced < 80K > 80K Refund Marital Status Taxable Income Cheat No Married 80K ? 10 Test Data
  • 108. 5/16/2022 Data Mining: Concepts and Techniques 17 Apply Model to Test Data Refund MarSt TaxInc YES NO NO NO Yes No Married Single, Divorced < 80K > 80K Refund Marital Status Taxable Income Cheat No Married 80K ? 10 Test Data
  • 109. 5/16/2022 Data Mining: Concepts and Techniques 18 Apply Model to Test Data Refund MarSt TaxInc YES NO NO NO Yes No Married Single, Divorced < 80K > 80K Refund Marital Status Taxable Income Cheat No Married 80K ? 10 Test Data
  • 110. 5/16/2022 Data Mining: Concepts and Techniques 19 Apply Model to Test Data Refund MarSt TaxInc YES NO NO NO Yes No Married Single, Divorced < 80K > 80K Refund Marital Status Taxable Income Cheat No Married 80K ? 10 Test Data
  • 111. 5/16/2022 Data Mining: Concepts and Techniques 20 Apply Model to Test Data Refund MarSt TaxInc YES NO NO NO Yes No Married Single, Divorced < 80K > 80K Refund Marital Status Taxable Income Cheat No Married 80K ? 10 Test Data Assign Cheat to “No”
  • 112. 5/16/2022 Data Mining: Concepts and Techniques 21 Decision Tree Classification Task Apply Model Induction Deduction Learn Model Model Tid Attrib1 Attrib2 Attrib3 Class 1 Yes Large 125K No 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium 120K No 5 No Large 95K Yes 6 No Medium 60K No 7 Yes Large 220K No 8 No Small 85K Yes 9 No Medium 75K No 10 No Small 90K Yes 10 Tid Attrib1 Attrib2 Attrib3 Class 11 No Small 55K ? 12 Yes Medium 80K ? 13 Yes Large 110K ? 14 No Small 95K ? 15 No Large 67K ? 10 Test Set Tree Induction algorithm Training Set Decision Tree
  • 113. Constructing a Decision Tree AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 114. Constructing a Decision Tree • Two phases of decision tree generation: 1. tree construction • at start, all the training examples at the root • partition examples based on selected attributes • test attributes are selected based on a heuristic or a statistical measure 2. tree pruning • identify and remove branches that reflect noise or outliers AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 115. Constructing a Decision Tree • Basic step: Determination of the root node of the tree and the root node of its sub-trees • Most Discriminatory Feature • Every feature can be used to partition the training data • If the partitions contain a pure class of training instances, then this feature is most discriminatory AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 116. Constructing a Decision Tree:- Example of Partitions • Categorical feature • Number of partitions of the training data is equal to the number of values of this feature • Numerical feature • Two partitions AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 117. Decision Tree Induction Algorithm • A machine researcher named J. Ross Quinlan in 1980 developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). • Later, he presented C4.5, which was the successor of ID3. ID3 and C4.5 adopt a greedy approach. • In this algorithm, there is no backtracking; the trees are constructed in a top-down recursive divide-and-conquer manner. • Generating a decision tree form training tuples of data partition D AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 118. Algorithm : Generate_decision_tree • Input: • Data partition, D, which is a set of training tuples and their associated class labels. • attribute_list, the set of candidate attributes. • Attribute selection method, a procedure to determine the splitting criterion that best partitions that the data tuples into individual classes. This criterion includes a splitting_attribute and either a splitting point or splitting subset. Output: A Decision Tree AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 119. Algorithm : Generate_decision_tree • Method 1) create a node N; 2) if tuples in D are all of the same class, C then 3) return N as leaf node labeled with class C; 4) if attribute_list is empty then 5) return N as leaf node with labeled with majority class in D;//majority voting 6) apply attribute_selection_method(D, attribute_list) to find the best splitting_criterion; 7) label node N with splitting_criterion; 8) if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees 9) attribute_list = attribute_list - splitting attribute; // remove splitting attribute 10) for each outcome j of splitting criterion 11) // partition the tuples and grow subtrees for each partition 12) let Dj be the set of data tuples in D satisfying outcome j; // a partition 13) if Dj is empty then 14) attach a leaf labeled with the majority class in D to node N; else attach the node returned by Generate_decision tree(Dj, attribute list) to node N; end for 15) return N; AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 120. Constructing Decision Tree Example :- Weather Forecasting AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 121. Constructing Decision Tree :- A Simple Dataset AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES 9 Play samples 5 Don’t A total of 14.
  • 122. Constructing Decision Tree :- A Simple Dataset AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES Outlook Temp Humidity Windy class Sunny 75 70 true Play Sunny 80 90 true Don’t Sunny 85 85 false Don’t Sunny 72 95 true Don’t Sunny 69 70 false Play Overcast 72 90 true Play Overcast 83 78 false Play Overcast 64 65 true Play Overcast 81 75 false Play Rain 71 80 true Don’t Rain 65 70 true Don’t Rain 75 80 false Play Rain 68 80 false Play Rain 70 96 false Play Instance # 1 2 3 4 5 6 7 8 9 10 11 12 13 14
  • 123. Constructing Decision Tree :- A Simple Dataset AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES 2 outlook windy humidity Play Play Play Don’t Don’t sunny overcast rain <= 75 > 75 false true 2 4 3 3
  • 124. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES Total 14 training instances 1,2,3,4,5 P,D,D,D,P 6,7,8,9 P,P,P,P 10,11,12,13,14 D, D, P, P, P Outlook = sunny Outlook = overcast Outlook = rain
  • 125. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES Total 14 training instances 5,8,11,13,14 P,P, D, P, P 1,2,3,4,6,7,9,10,12 P,D,D,D,P,P,P,D,P Temperature <= 70 Temperature > 70
  • 126. Constructing Decision Tree :- A Simple Dataset AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES 2 outlook windy humidity Play Play Play Don’t Don’t sunny overcast rain <= 75 > 75 false true 2 4 3 3
  • 127. Constructing Decision Tree Example :- Decision on Buying a Computer / customer likely to purchase a computer AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 128. Constructing Decision Tree Example :- Decision on Buying a Computer AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES The following decision tree is for the concept buy_computer that indicates : Whether a customer at a company is likely to buy a computer or not? Each internal node represents a test on an attribute. Each leaf node represents a class.
  • 129. Constructing Decision Tree :- Training Dataset AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no This follows an example of Quinlan’s ID3 (Playing Tennis)
  • 130. Constructing Decision Tree :- Output: A Decision Tree for “buys_computer” AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES age? overcast student? credit rating? <=30 >40 no yes yes yes 31..40 fair excellent yes no From the training dataset , calculate entropy value, which indicates that splitting attribute is: age
  • 131. A Decision Tree for “buys_computer” AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 132. A Decision Tree for “buys_computer” AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES From the training data set , age= youth has 2 classes based on student attribute
  • 133. A Decision Tree for “buys_computer” AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES based on majority voting in student attribute , RID=3 is grouped under yes group.
  • 134. A Decision Tree for “buys_computer” AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES From the training data set , age= senior has 2 classes based on credit rating.
  • 135. A Decision Tree for “buys_computer” AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES Final Decision Tree
  • 136. Classification by Decision Tree AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 137. Classification by Decision Tree • A typical decision tree that represents the concept buys computer, that is, it predicts whether a customer at AllElectronics is likely to purchase a computer. • Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. • Some decision tree algorithms produce only binary trees (where each internal node branches to exactly two other nodes), whereas others can produce non binary trees. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 138. Classification by Decision Tree • “How are decision trees used for classification?” • Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. • A path is traced from the root to a leaf node, which holds the class prediction for that tuple. • Decision trees can easily be converted to classification rules. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 139. Classification by Decision Tree Why are decision tree classifiers so popular? • The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and therefore is appropriate for exploratory knowledge discovery. • Decision trees can handle high dimensional data. • Their representation of acquired knowledge in tree form is intuitive and generally easy to assimilate by humans. • The learning and classification steps of decision tree induction are simple and fast. • In general, decision tree classifiers have good accuracy. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 140. Classification by Decision Tree Extracting Classification Rules from Trees • Represent the knowledge in the form of IF-THEN rules • One rule is created for each path from the root to a leaf • Each attribute-value pair along a path forms a conjunction • The leaf node holds the class prediction • Rules are easier for humans to understand • Example IF age = “<=30” AND student = “no” THEN buys_computer = “no” IF age = “<=30” AND student = “yes” THEN buys_computer = “yes” IF age = “31…40” THEN buys_computer = “yes” IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes” IF age = “<=30” AND credit_rating = “fair” THEN buys_computer = “no” AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 141. Classification by Decision Tree Training Set and Its AVC Sets AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES student Buy_Computer yes no yes 6 1 no 3 4 Age Buy_Computer yes no <=30 3 2 31..40 4 0 >40 3 2 Credit rating Buy_Computer yes no fair 6 2 excellent 3 3 age income studentcredit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no AVC-set on income AVC-set on Age AVC-set on Student Training Examples income Buy_Computer yes no high 2 2 medium 4 2 low 3 1 AVC-set on credit_rating
  • 142. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 143. P1WU UNIT – III: CLASSIFICATION Topic 6: K-NN CLASSIFIER AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 144. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 145. K-NN CLASSIFIER AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 146. K-NN CLASSIFIER • K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised Learning technique. • K-NN algorithm assumes the similarity between the new case/data and available cases and put the new case into the category that is most similar to the available categories. • K-NN algorithm stores all the available data and classifies a new data point based on the similarity. • This means when new data appears then it can be easily classified into a well suite category by using K- NN algorithm. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 147. K-NN CLASSIFIER • supervised ML classification algorithm-KNN(K Nearest Neighbors) algorithm. • It is one of the simplest and widely used classification algorithms in which a new data point is classified based on similarity in the specific group of neighboring data points. • This gives a competitive result. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 148. K-NN CLASSIFIER EXAMPLE • Example: Suppose, we have an image of a creature that looks similar to cat and dog, • but we want to know either it is a cat or dog. So for this identification, we can use the KNN algorithm, as it works on a similarity measure. • Our KNN model will find the similar features of the new data set to the cats and dogs images and based on the most similar features it will put it in either cat or dog category. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 149. K-NN CLASSIFIER EXAMPLE AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 150. K Nearest Neighbor Classification AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 151. INTRODUCTION TO K-NN CLASSIFIER • K nearest neighbors is a simple algorithm that stores • all available cases and classifies new cases based on a similarity measure (e.g., distance functions). • K represents number of nearest neighbors. • It classify an unknown example with the most common class among k closest examples. • KNN is based on • “tell me who your neighbors are, and I’ll tell you who you are” AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 152. INTRODUCTION TO K-NN CLASSIFIER :- Example If K = 5, then in this case query instance xq will be classified as negative since three of its nearest neighbors are classified as negative. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 153. Different Schemes of KNN • 1-Nearest Neighbor • K-Nearest Neighbor using a majority voting scheme • K-NN using a weighted-sum voting Scheme AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 154. Different Schemes of KNN AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 155. Different Schemes of KNN AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 156. kNN: How to Choose k? • In theory, if infinite number of samples available, the larger is k, the better is classification • The limitation is that all k neighbors have to be close • Possible when infinite no of samples available • Impossible in practice since no of samples is finite k = 1 is often used for efficiency, but sensitive to “noise” AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 157. kNN: How to Choose k? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 158. kNN: How to Choose k? • Larger k gives smoother boundaries, better for generalization But only if locality is preserved. Locality is not preserved if end up looking at samples too far away, not from the same class. • Interesting theoretical properties if k < sqrt(n), n is # of examples . AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES Find a heuristically optimal number k of nearest neighbors, based on RMSE(root-mean-square error). This is done using cross validation. Cross-validation is another way to retrospectively determine a good K value by using an independent dataset to validate the K value. Historically, the optimal K for most datasets has been between 3-10. That produces much better results than 1NN.
  • 159. Distance Measure in KNN • There are three distance measures are valid for continuous variables. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 160. Distance Measure in KNN • It should also be noted that all In the instance of categorical variables the Hamming distance must be used. • It also brings up the issue of standardization of the numerical variables between 0 and 1 when there is a mixture of numerical and categorical variables in the dataset. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 161. Simple KNN - Algorithm: • For each training example , add the example to the list of training_examples. • Given a query instance xq to be classified, • Let x1 ,x2….xk denote the k instances from training_examples that are nearest to xq . • Return the class that represents the maximum of the k instances • Steps: 1. Determine parameter k= no of nearest neighbor 2. Calculate the distance between the query instance and all the training samples. 3. Sort the distance and determine nearest neighbor based on the k –th minimum distance 4. Gather the category of the nearest neighbors 5. Use simple majority of the category of nearest neighbors as the prediction value of the query instance. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 162. Simple KNN - Algorithm: • K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the Classification problems. • K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying data. • It is also called a lazy learner algorithm because it does not learn from the training set immediately instead it stores the dataset and at the time of classification, it performs an action on the dataset. • KNN algorithm at the training phase just stores the dataset and when it gets new data, then it classifies that data into a category that is much similar to the new data. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 163. Simple KNN – Algorithm Example • Example: • Consider the following data concerning credit default. Age and Loan are two numerical variables (predictors) and Default is the target. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 164. Simple KNN – Algorithm Example • Given Training Data set : AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 165. Simple KNN – Algorithm Example • Data to Classify: • to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance. • • Step1: Determine parameter k • K=3 • • Step 2: Calculate the distance • D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 >> Default=Y AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 166. Simple KNN – Algorithm Example AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 167. Simple KNN – Algorithm Example • Step 3: Sort the distance ( refer above diagram) and mark upto kth rank i.e 1 to 3. • • Step 4: Gather the category of the nearest neighbors AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES Age Loan Default Distance 33 $150000 Y 8000 35 $120000 N 22000 60 $100000 Y 42000 With K=3, there are two Default=Y and one Default=N out of three closest neighbors. The prediction for the unknown case is Default=Y.
  • 168. Standardized Distance ( Feature Normalization) • One major drawback in calculating distance measures directly from the training set is in the case where variables have different measurement scales or there is a mixture of numerical and categorical variables. • For example, if one variable is based on annual income in dollars, and the other is based on age in years then income will have a much higher influence on the distance calculated. One solution is to standardize the training set as shown below. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 169. Standardized Distance ( Feature Normalization) • For ex loan , X =$ 40000 , • Xs = 40000- 20000 = 0.11 • 220000-20000 • Same way , calculate the standardized values for age and loan attributes, then apply the KNN algorithm. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 170. Simple KNN – Algorithm • Advantages • Can be applied to the data from any distribution • for example, data does not have to be separable with a linear boundary • Very simple and intuitive • Good classification if the number of samples is large enough • • Disadvantages • Choosing k may be tricky • Test stage is computationally expensive • No training stage, all the work is done during the test stage • This is actually the opposite of what we want. Usually we can afford training step to take a long time, but we want fast test step • Need large number of samples for accuracy AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 171. How does K-NN work? • he K-NN working can be explained on the basis of the below algorithm: • Step-1: Select the number K of the neighbors • Step-2: Calculate the Euclidean distance of K number of neighbors • Step-3: Take the K nearest neighbors as per the calculated Euclidean distance. • Step-4: Among these k neighbors, count the number of the data points in each category. • Step-5: Assign the new data points to that category for which the number of the neighbor is maximum. • Step-6: Our model is ready. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 172. How does K-NN work? • Suppose we have a new data point and we need to put it in the required category. Consider the below image: AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 173. How does K-NN work? • Firstly, we will choose the number of neighbors, so we will choose the k=5. • Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as: AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 174. How does K-NN work? • By calculating the Euclidean distance we got the nearest neighbors, as three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image: AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 175. Why do we need a K-NN Algorithm? •Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1, so this data point will lie in which of these categories. •To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular dataset. : AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 176. Why do we need a K-NN Algorithm? •Consider the below diagram: AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 177. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 178. P1WU UNIT – III: CLASSIFICATION Topic 7: SVM CLASSIFIER AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 179. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 180. SUPPORT VECTOR MACHINE (SVM) AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 181. INTRODUCTION TO SVM • A new classification method for both linear and nonlinear data • It uses a nonlinear mapping to transform the original training data into a higher dimension • With the new dimension, it searches for the linear optimal separating hyperplane (i.e., “decision boundary”) • With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane • SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors) AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 182. INTRODUCTION TO SVM • A support vector machine (SVM) is a supervised machine learning model that uses classification algorithms. • It is more preferred for classification but is sometimes very useful for regression as well. • Basically, SVM finds a hyper-plane that creates a boundary between the types of data. • In 2- dimensional space, this hyper-plane is nothing but a line. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 183. SVM—History and Applications • Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis’ statistical learning theory in 1960s • Features: training can be slow but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization) • Used both for classification and prediction • Applications: • handwritten digit recognition, object recognition, speaker identification, benchmarking time- series prediction tests AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 184. SVM—General Philosophy AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 185. SVM—Margins and Support Vectors AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 186. INTRODUCTION TO SVM • In SVM, we plot each data item in the dataset in an N- dimensional space, where N is the number of features/attributes in the data. • Next, find the optimal hyperplane to separate the data. • So by this, you must have understood that inherently, SVM can only perform binary classification (i.e., choose between two classes). • However, there are various techniques to use for multi-class problems. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 187. Support Vector Machine for Multi- class Problems • To perform SVM on multi-class problems, we can create a binary classifier for each class of the data. • The two results of each classifier will be : • The data point belongs to that class OR • The data point does not belong to that class. • For example, in a class of fruits, to perform multi-class classification, we can create a binary classifier for each fruit. • For say, the ‘mango’ class, • there will be a binary classifier to predict if it IS a mango OR it is NOT a mango. • The classifier with the highest score is chosen as the output of the SVM. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 188. SVM—Linearly Separable • A separating hyperplane can be written as • W ● X + b = 0 • where W={w1, w2, …, wn} is a weight vector and b a scalar (bias) • For 2-D it can be written as • w0 + w1 x1 + w2 x2 = 0 • The hyperplane defining the sides of the margin: • H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and • H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1 • Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors • This becomes a constrained (convex) quadratic optimization problem: Quadratic objective function and linear constraints  Quadratic Programming (QP)  Lagrangian multipliers AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 189. SVM—When Data Is Linearly Separable • Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples associated with the class labels yi • There are infinite lines (hyperplanes) separating the two classes but we want to find the best one (the one that minimizes classification error on unseen data) • SVM searches for the hyperplane with the largest margin, i.e., maximum marginal hyperplane (MMH) AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 190. SVM for complex (Non Linearly Separable) SVM for complex (Non Linearly Separable) SVM works very well without any modifications for linearly separable data. Linearly Separable Data is any data that can be plotted in a graph and can be separated into classes using a straight line. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES A: Linearly Separable Data B: Non-Linearly Separable Data
  • 191. SVM CLASSIFIER AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 192. SVM CLASSIFIER • A vector space method for binary classification problems documents represented in t-dimensional space • find a decision surface (hyperplane) that best separate documents of two classes new document classified by its position relative to hyperplane. • Simple 2D example: training documents linearly separable AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 193. SVM CLASSIFIER • Simple 2D example: training documents linearly separable AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 194. SVM CLASSIFIER • Line s—The Decision Hyperplane • maximizes distances to closest docs of each class • it is the best separating hyperplane • Delimiting Hyperplanes • parallel dashed lines that delimit region where to look for a solution AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 195. SVM CLASSIFIER • Lines that cross the delimiting hyperplanes. • candidates to be selected as the decision hyperplane • lines that are parallel to delimiting hyperplanes: best candidates AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 196. SVM CLASSIFIER • Support vectors: documents that belong to, and define, the delimiting hyperplanes Our example in a 2-dimensional system of coordinates AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 197. SVM CLASSIFIER AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 198. SVM CLASSIFIER AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 199. SVM vs. Neural Network AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES • SVM 1) Relatively new concept 2) Deterministic algorithm 3) Nice Generalization properties 4) Hard to learn – learned in batch mode using quadratic programming techniques 5) Using kernels can learn very complex functions • Neural Network 1) Relatively old 2) Nondeterministic algorithm 3) Generalizes well but doesn’t have strong mathematical foundation 4) Can easily be learned in incremental fashion 5) To learn complex functions— use multilayer perceptron (not that trivial)
  • 200. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 201. P1WU UNIT – III: CLASSIFICATION Topic 8 FEATURE SELECTION OR DIMENSIONALITY REDUCTION AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 202. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 203. FEATURE SELECTION OR DIMENSIONALITY REDUCTION AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 204. FEATURE SELECTION OR DIMENSIONALITY REDUCTION • Feature selection and dimensionality reduction allow us to • minimize the number of features in a dataset by only keeping features that are important. • In other words, we want to retain features that contain the most useful information that is needed by our model to • make accurate predictions while discarding redundant features that contain little to no information. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 205. FEATURE SELECTION OR DIMENSIONALITY REDUCTION • There are several benefits in performing feature selection and dimensionality reduction which include: • model interpretability, • minimizing overfitting as well as • reducing the size of the training set and consequently training time. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 206. Dimensionality Reduction • The number of input variables or features for a dataset is referred to as its dimensionality. • Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset. • More input features often make a predictive modeling task more challenging to model, more generally referred to as the curse of dimensionality. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 207. Dimensionality Reduction • High-dimensionality statistics and dimensionality reduction techniques are often used for data visualization. • Nevertheless these techniques can be used in applied machine learning to • simplify a classification or regression dataset in order to better fit a predictive model. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 208. Problem With Many Input Variables • If your data is represented using rows and columns, such as in a spreadsheet, then • the input variables are the columns that are fed as input to a model to predict the target variable. Input variables are also called features. • We can consider the columns of data representing dimensions on an n-dimensional feature space and the rows of data as points in that space. • This is a useful geometric interpretation of a dataset. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 209. Problem With Many Input Variables • Having a large number of dimensions in the feature space can mean that the volume of that space is very large, and in turn, • the points that we have in that space (rows of data) often represent a small and non- representative sample. • This can dramatically impact the performance of machine learning algorithms fit on data with many input features, generally referred to as the “curse of dimensionality.” • Therefore, it is often desirable to reduce the number of input features. This reduces the number of dimensions of the feature space, hence the name “dimensionality reduction.” AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 210. Dimensionality Reduction AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 211. Dimensionality Reduction • Dimensionality reduction refers to techniques for reducing the number of input variables in training data. • When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the “essence” of the data. This is called dimensionality reduction. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 212. Dimensionality Reduction • Fewer input dimensions often mean correspondingly fewer parameters or a simpler structure in the machine learning model, referred to as degrees of freedom. • A model with too many degrees of freedom is likely to overfit the training dataset and therefore may not perform well on new data. • It is desirable to have simple models that generalize well, and in turn, input data with few input variables. • This is particularly true for linear models where the number of inputs and the degrees of freedom of the model are often closely related. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 213. Techniques for Dimensionality Reduction AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 214. Techniques for Dimensionality Reduction • There are many techniques that can be used for dimensionality reduction. • Feature Selection Methods • Matrix Factorization • Manifold Learning • Auto encoder Methods AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 215. Feature Selection Methods • Feature selection is also called variable selection or attribute selection. • It is the automatic selection of attributes in your data (such as columns in tabular data) that are most relevant to the predictive modeling problem you are working on. • feature selection… • is the process of selecting a subset of relevant features for use in model construction AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 216. Feature Selection Methods • Feature selection is different from dimensionality reduction. • Both methods seek to reduce the number of attributes in the dataset, • but a dimensionality reduction method do so by creating new combinations of attributes, • where as feature selection methods include and exclude attributes present in the data without changing them. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 217. Feature Selection Methods • Examples of dimensionality reduction methods include • Principal Component Analysis, • Singular Value Decomposition and • Sammon’s Mapping. • Feature selection is itself useful, but it mostly acts as a filter, muting out features that aren’t useful in addition to your existing features. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 218. Feature Selection Algorithms AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 219. Filter Methods • Filter feature selection methods apply a statistical measure to assign a scoring to each feature. • The features are ranked by the score and either selected to be kept or removed from the dataset. • The methods are often univariate and consider the feature independently, or with regard to the dependent variable. • Some examples of some filter methods include the Chi squared test, information gain and correlation coefficient scores. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 220. Wrapper Methods • Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. • A predictive model us used to evaluate a combination of features and assign a score based on model accuracy. • The search process may be • methodical such as a best-first search, • it may stochastic such as a random hill-climbing algorithm, or • it may use heuristics, like forward and backward passes to add and remove features. • An example if a wrapper method is the recursive feature elimination algorithm. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 221. Embedded Methods • Embedded methods learn which features best contribute to the accuracy of the model while the model is being created. • The most common type of embedded feature selection methods are • regularization methods. • Regularization methods are also called penalization methods that introduce • additional constraints into the optimization of a predictive algorithm (such as a regression algorithm) that bias the model toward lower complexity (fewer coefficients). • Examples of regularization algorithms are the • LASSO, Elastic Net and Ridge Regression. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 222. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 223. P1WU UNIT – III: CLASSIFICATION Topic 9: EVALUATION METRICS AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 224. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 225. EVALUATION METRICS AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 226. EVALUATION METRICS • Evaluating the Accuracy of a Classifier • Basic Evaluation Measures for Classifier Performance. • In Bioinformatics and machine learning in general, • there is a large variation in the measures that are used to evaluate prediction systems. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 227. A confusion matrix • For simplicity, the assumption is that each instance can only be assigned one of two classes: • Positive or • Negative (e.g. a patient's tumor may be malignant or benign). • Each instance (e.g. a patient) has a Known label, and a Predicted label. • Some method is used (e.g. cross-validation) to make predictions on each instance. Each instance then increments one cell in the confusion matrix. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 228. EVALUATION METRICS: -A confusion matrix AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES Predicted Label Positive Negative Known Label Positive True Positive False Negative (TP) (FN) Negative False Positive True Negative (FP) (TN)
  • 229. EVALUATION METRICS AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 230. CONTINGENCY TABLE AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 231. CONTINGENCY TABLE AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 232. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 233. P1WU UNIT – III: CLASSIFICATION Topic 10: ACCURACY AND ERROR AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 234. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 235. ACCURACY AND ERROR AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 236. ACCURACY AND ERROR • Accuracy • Accuracy is the proportion of the time that the predicted class equals the actual class, usually expressed as a percentage. • It's meaning is straightforward, but may obscure important differences in costs associated with different errors. • The classic example of such costs is the medical diagnostic situation, in which one can err be either: • 1. keeping a healthy patient in the hospital (low cost), or • 2. sending home a sick patient (very high cost). AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 237. ACCURACY AND ERROR • These classifiers need to be checked for both the accuracy of their probabilities (Do cases predicted to have a 5% (30%, 80%, etc.) probability really belong to the target class 5% (30%, 80%, etc.) of the time?) and their ability to separate the classes in question. Accuracy can be measured using many of the same metrics used to evaluate numerical models (MSE, MAE, etc.). AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 238. Accuracy and Error Measures • All models must be assessed somehow. • Despite the existence of a bewildering array of performance measures, much commercial modeling software provides a surprisingly limited range of options. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 239. Mean Squared Error (MSE) • Mean Squared Error (MSE) is by far the most common measure of numerical model performance. • It is simply the average of the squares of the differences between the predicted and actual values. • It is a reasonably good measure of performance, though it could be argued that it overemphasizes the importance of larger errors. • Many modeling procedures directly minimize it. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 240. Mean Absolute Error (MAE) • Mean Absolute Error (MAE) is similar to the Mean Squared Error, but it uses absolute values instead of squaring. • This measure is not as popular as MSE, though its meaning is more intuitive . • • Bias is the average of the differences between the predicted and actual values. • With this measure, positive errors cancel out negative ones. • Bias is intended to assess how much higher or lower predictions are, on average, than actual values. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 241. Mean Absolute Percent Error (MAPE) • Mean Absolute Percent Error (MAPE) is the average of the absolute errors, as a percentage of the actual values. • This is a relative measure of error, which is useful when larger errors are more acceptable on larger actual values. AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 242. Classification Accuracy: Estimating Error Rates • Partition: Training-and-testing o Use two independent data sets, e.g., training set (2/3), test set(1/3) o Used for data set with large number of samples • Cross-validation o Divide the data set into k subsamples o Use k-1 subsamples as training data and one sub-sample as test data—k-fold cross-validation o For data set with moderate size • Bootstrapping (leave-one-out) o For small size data • Confusion Matrix: o This matrix shows not only how well the classifier predicts different classes o It describes information about actual and detected classes: AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 243. Classification Accuracy: Estimating Error Rates Detected Positive Negative Actual Positive A: True positive B: False Negative Negative C: False Positive D: True Negative • The recall (or the true positive rate) and the precision (or the positive predictive rate) can be derived from the confusion matrix as follows: • o Recall = A / A+B • o Precision = A / A+ C AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 244. Classifier Accuracy Measures • Accuracy of a classifier M, acc(M): • percentage of test set tuples that are correctly classified by the model M • Error rate (misclassification rate) of M = 1 – acc(M) • Given m classes, CMi,j, an entry in a confusion matrix, indicates # of tuples in class i that are labeled by the classifier as class j AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES classes buy_computer = yes buy_computer = no total recognition(%) buy_computer = yes 6954 46 7000 99.34 buy_computer = no 412 2588 3000 86.27 total 7366 2634 10000 95.52 C1 C2 C1 True positive False negative C2 False positive True negative
  • 245. Classifier Accuracy Measures • Alternative accuracy measures (e.g., for cancer diagnosis) sensitivity = t-pos/pos /* true positive recognition rate */ specificity = t-neg/neg /* true negative recognition rate */ precision = t-pos/(t-pos + f-pos) accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg) • This model can also be used for cost-benefit analysis AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES classes buy_computer = yes buy_computer = no total recognition(%) buy_computer = yes 6954 46 7000 99.34 buy_computer = no 412 2588 3000 86.27 total 7366 2634 10000 95.52 C1 C2 C1 True positive False negative C2 False positive True negative
  • 246. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 247. P1WU UNIT – III: CLASSIFICATION Topic 11: ORGANIZING THE CLASSES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 248. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 249. ORGANIZING THE CLASSES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 250. ORGANIZING THE CLASSES TAXONOMIES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 251. ORGANIZING THE CLASSES TAXONOMIES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 252. ORGANIZING THE CLASSES TAXONOMIES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 253. ORGANIZING THE CLASSES TAXONOMIES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 254. ORGANIZING THE CLASSES TAXONOMIES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 255. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 256. P1WU UNIT – III: CLASSIFICATION Topic 12: INDEXING AND SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 257. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 258. INDEXING AND SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 259. INDEXING AND SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 260. INDEXING AND SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 261. INDEXING AND SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 262. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 263. P1WU UNIT – III: CLASSIFICATION Topic 13: INVERTED INDEXES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 264. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 265. INVERTED INDEXES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 266. INVERTED INDEXES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 267. INVERTED INDEXES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 268. INVERTED INDEXES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 269. INVERTED INDEXES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 270. FULL INVERTED INDEXES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 271. FULL INVERTED INDEXES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 272. FULL INVERTED INDEXES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 273. FULL INVERTED INDEXES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 274. FULL INVERTED INDEXES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 275. FULL INVERTED INDEXES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 276. FULL INVERTED INDEXES AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 277. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 278. P1WU UNIT – III: CLASSIFICATION Topic 14: SEQUENTIAL SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 279. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 280. SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 281. SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 282. SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 283. SEQUENTIAL SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 284. SEQUENTIAL SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 285. SEQUENTIAL SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 286. SEQUENTIAL SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 287. SEQUENTIAL SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 288. SEQUENTIAL SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 289. SEQUENTIAL SEARCHING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 290. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 291. P1WU UNIT – III: CLASSIFICATION Topic 15: MULTI-DIMENSIONAL INDEXING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 292. UNIT III : TEXT CLASSIFICATION AND CLUSTERING 1.A Characterization of Text Classification 2. Unsupervised Algorithms: Clustering 3. Naïve Text Classification 4. Supervised Algorithms 5. Decision Tree 6. k-NN Classifier 7. SVM Classifier 8. Feature Selection or Dimensionality Reduction 9. Evaluation metrics 10. Accuracy and Error 11. Organizing the classes 12. Indexing and Searching 13. Inverted Indexes 14. Sequential Searching 15. Multi-dimensional Indexing AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 293. MULTI-DIMENSIONAL INDEXING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 294. MULTI-DIMENSIONAL INDEXING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 295. MULTI-DIMENSIONAL INDEXING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 296. MULTI-DIMENSIONAL INDEXING AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 297. MULTI-DIMENSIONAL SEARCH AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 298. MULTI-DIMENSIONAL SEARCH AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES
  • 299. Any Questions? AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMESTER – VIII PROFESSIONAL ELECTIVE – IV CS8080- INFORMATION RETRIEVAL TECHNIQUES