Copyright © 2014 EMC Corporation. All rights reserved.
Advanced Analytics - Theory and Methods
Advanced Analytics – Theory and Methods
During this lesson the following topics are covered:
• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of
the classifier
Naïve Bayesian Classifiers
The topics covered in this lesson are listed.
Classifiers
• Classification: assign labels to objects.
• Usually supervised: training set of pre-classified examples.
• Our examples:
Where in the catalog should I place this product listing?
Is this email spam?
Is this politician Democrat/Republican/Green?
The primary task performed by classifiers is to assign labels to
objects. Labels in classifiers are
pre-determined unlike in clustering where we discover the
structure and assign labels.
Classifier problems are supervised learning methods. We start with a training set of pre-classified examples and, using the probabilities learned from that set, assign class labels to new objects.
Some use case examples are shown in the slide. Based on the voting pattern on issues, we could classify whether a politician has an affiliation to a particular party or principle. Retailers use classifiers to assign products to the proper catalog entry locations. The classification of email as spam is another widely used application of classifier methods.
Logistic regression, discussed in the previous lesson, can be
viewed and used as a classifier. We
will discuss Naïve Bayesian Classifiers in this lesson and the
use of Decision Trees in the next
lesson.
Naïve Bayesian Classifier
• Determine the most probable class label for each object
– Based on the naïve assumption that the observed attributes are conditionally independent of each other
– For example, a spherical, yellow object less than 60 grams in weight may be classified (labeled) as a tennis ball
• Input variables are discrete
• Output:
– A probability score, proportional to the true probability
– A class label, assigned based on the highest probability score
The Naïve Bayesian Classifier is a probabilistic classifier based on Bayes' Law and naïve conditional independence assumptions. In simple terms, a Naïve Bayes Classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature.
For example, an object can be classified into a particular category based on its attributes such as shape, color, and weight. A reasonable classification for an object that is spherical, yellow, and less than 60 grams in weight may be a tennis ball. Even if these features depend on each other or upon the existence of the other features, a Naïve Bayesian Classifier considers all of these properties to contribute independently to the probability that the object is a tennis ball.
The input variables are generally discrete (categorical) but there
are variations to the
algorithms that work with continuous variables as well. For
this lesson, we will consider only
discrete input variables. Although weight may be considered a
continuous variable, in the
tennis ball example, weight was grouped into intervals in order
to make weight a categorical
variable.
The output typically returns a probability score and class membership. The output from most implementations is a log probability score for each class (we will address this later in the lesson), and we assign the class label that corresponds to the highest log probability score.
Naïve Bayesian Classifier - Use Cases
• Preferred method for many text classification problems; often tried before more complicated approaches
• Use cases
– Spam filtering
– Fraud detection
Naïve Bayesian Classifiers are among the most successful
known algorithms for learning to
classify text documents. Spam filtering is the best known use of
Naïve Bayesian Text
Classification. Bayesian Spam Filtering has become a popular
mechanism to distinguish
illegitimate spam email from legitimate email. Many modern
mail clients implement Bayesian
Spam Filtering.
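As a hedged illustration of Bayesian text classification, the short Python sketch below builds a toy spam filter with scikit-learn's MultinomialNB. The library choice, the example messages, and the labels are assumptions for demonstration only and are not part of the course material.

```python
# Illustrative sketch only: a tiny spam filter using scikit-learn's MultinomialNB.
# The texts and labels below are made up for demonstration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting agenda attached",
         "free money click here", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]      # hypothetical training labels

# Bag-of-words counts feed the multinomial Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["claim your free prize"]))   # likely ['spam']
```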
Naïve Bayesian Classifiers are also used to detect fraud. For example, in auto insurance, based on a training data set with attributes such as driver's rating, vehicle age, vehicle price, whether the claim was filed by the policy holder, police report status, and whether the claim was genuine, we can classify a new claim as genuine or not.
References:
Spam filtering
(http://en.wikipedia.org/wiki/Bayesian_spam_filtering)
http://www.cisjournal.org/archive/vol2no4/vol2no4_1.pdf
Hybrid Recommender System Using Naive Bayes Classifier and Collaborative Filtering (http://eprints.ecs.soton.ac.uk/18483/)
Online applications (http://www.convo.co.uk/x02/)
Building a Training Dataset to Predict Good or Bad Credit
• Predict the credit behavior of a credit card applicant from the applicant's attributes: personal status, job type, housing type, and savings amount
• These are all categorical
variables and are better suited
to Naïve Bayesian Classifier
than to logistic regression.
Let us look into a specific use case example. We present here
the same example we worked
with in Lesson 2 of this module with the Apriori algorithm. The
training dataset consists of
attributes: personal status, job type, housing type and amount
of money in their savings
account. They are represented as categorical variables which are
well suited for Naïve Bayesian
Classifier.
With this training set we want to predict the credit behavior of a
new customer. This problem
could be solved with logistic regression as well. If there are
multiple levels for the outcome you
want to predict, then Naïve Bayesian Classifier is a better
solution.
Next, we will go through the technical basis for Naïve Bayesian
Classifiers and will revisit this
credit dataset later.
Technical Description - Bayes' Law
• C is the class label:
• A is the observed object attributes
• P(C | A) is the probability of C given A is observed
$$P(C \mid A) \;=\; \frac{P(A \cap C)}{P(A)} \;=\; \frac{P(A \mid C)\,P(C)}{P(A)}$$
Bayes' Law states: P(C | A)*P(A) = P(A | C)*P(C) = P(A ^ C).
That is, the conditional probability that C is true given that A is
true, denoted P(C|A), times the probability of A is
the same as the conditional probability that A is true given that
C is true, denoted P(A|C), times the probability of
C. Both of these terms are equal to P(A ^ C), that is, the probability that A and C are simultaneously true. If we divide all
three terms by P(A), then we get the form shown on the slide.
The reason that Bayes’ Law is important is that we may not
know P(C|A) (and we want to), but we do know P(A|C)
and P(C) for each possible value of C from the training data.
As we will see later, it is not necessary to know P(A)
for the purposes of Naïve Bayes Classifiers.
An example using Bayes Law:
John flies frequently and likes to upgrade his seat to first class.
He has determined that, if he checks in for his
flight at least two hours early, the probability that he will get
the upgrade is .75; otherwise, the probability that he
will get the upgrade is .35. With his busy schedule, he checks in
at least two hours before his flight only 40% of the
time. Suppose John didn’t receive an upgrade on his most
recent attempt. What is the probability that he arrived
late?
C = John arrives late
A = John did not receive an upgrade
P(C) = Probability John arrives late = .6
P(A) = Probability John did not receive an upgrade = 1 – ( .4 x
.75 + .6 x .35) = 1 -.51 = .49
P(A|C) = Probability that John did not receive an upgrade given
that he arrived late = 1 - .35 = .65
P(C|A) = Probability that John arrived late given that he did not
receive his upgrade
= P(A|C)P(C)/P(A) = (.65 x .6)/.49 = .80 (approx)
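To make the arithmetic above easy to verify, here is a minimal Python sketch that plugs the same numbers into Bayes' Law; the variable names are illustrative only.

```python
# Checking the slide's arithmetic for John's upgrade example
# (values taken directly from the text above).
p_late = 0.6                                  # P(C): checks in late 60% of the time
p_no_upgrade_given_late = 1 - 0.35            # P(A|C) = 0.65
p_no_upgrade = 1 - (0.4 * 0.75 + 0.6 * 0.35)  # P(A) = 0.49

p_late_given_no_upgrade = p_no_upgrade_given_late * p_late / p_no_upgrade
print(round(p_late_given_no_upgrade, 2))      # 0.8 (approx)
```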
In this simple example, C can take one of two possible values {arriving early, arriving late}, and there is only one
attribute, which can take one of two possible values {received upgrade, did not receive upgrade}. Next, we will
generalize Bayes’ Law to multiple attributes and apply the naïve
independence assumptions.
Apply the Naïve Assumption and Remove a Constant
• For observed attributes A = (a1, a2, …, am), we want to compute

$$P(C_i \mid a_1, a_2, \ldots, a_m) \;=\; \frac{P(a_1, a_2, \ldots, a_m \mid C_i)\,P(C_i)}{P(a_1, a_2, \ldots, a_m)}, \qquad i = 1, 2, \ldots, n$$

and assign the class label, Ci, with the largest P(Ci|A)
• Two simplifications to the calculations:
– Apply the naïve assumption: each aj is conditionally independent of each other given Ci, so

$$P(a_1, a_2, \ldots, a_m \mid C_i) \;=\; P(a_1 \mid C_i)\,P(a_2 \mid C_i)\cdots P(a_m \mid C_i) \;=\; \prod_{j=1}^{m} P(a_j \mid C_i)$$

– Remove the constant denominator P(a1, a2, …, am), which is the same for all class labels
The general approach is to assign the classifier label, Ci, to the
object with attributes A = (a1, a2,
… am) that corresponds to the largest value of P(Ci|A).
The probability that a set of attribute values A (comprised of m variables a1 through am) should be labeled with classification Ci equals the probability of observing a1 through am given that Ci is true, times the probability of Ci, all divided by the probability of the set of attribute values a1 through am.
The conditional independence assumption is that the probability
of observing the value of a
particular attribute given Ci is independent of the other
attributes. This naïve assumption
simplifies the calculation of P(a1, a2, …, am|Ci) as shown on
the slide.
Since P(a1, a2, …, am) appears in the denominator of P(Ci|A),
for all values of i, removing the
denominator will have no impact on the relative probability
scores and will simplify
calculations. Next, these two simplifications to the
calculations will be applied to build the
Naïve Bayesian Classifier.
Building a Naïve Bayesian Classifier
• Applying the two simplifications:

$$P(C_i \mid a_1, a_2, \ldots, a_m) \;\propto\; \prod_{j=1}^{m} P(a_j \mid C_i)\,P(C_i), \qquad i = 1, 2, \ldots, n$$

• To build a Naïve Bayesian Classifier, collect the following statistics from the training data: P(Ci) for each class label and P(aj | Ci) for each attribute value aj and each class label Ci, so that the product above can be evaluated for every class.
Applying the two simplifications, P(Ci|a1, a2, …, am) is
proportional to the product of the
various P(aj|Ci), for j=1,2,…m, times P(Ci). From a training
dataset, these probabilities can be
computed and stored for future classifier assignments. We now
return to the credit applicant
example.
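As a rough sketch of how these statistics might be collected, the Python snippet below tallies class priors and conditional probabilities from a tiny, made-up training set; the records and attribute names are hypothetical and not the actual course dataset.

```python
# Minimal sketch of the training step: estimate P(C_i) and P(a_j | C_i)
# from a toy training set (records below are hypothetical).
from collections import Counter, defaultdict

records = [
    {"housing": "own",  "job": "skilled",   "label": "good"},
    {"housing": "rent", "job": "unskilled", "label": "bad"},
    {"housing": "own",  "job": "skilled",   "label": "good"},
]

class_counts = Counter(r["label"] for r in records)
pair_counts = defaultdict(Counter)          # (attribute, class) -> value counts
for r in records:
    for attr, value in r.items():
        if attr != "label":
            pair_counts[(attr, r["label"])][value] += 1

n = len(records)
prior = {c: cnt / n for c, cnt in class_counts.items()}                  # P(C_i)
cond = {k: {v: cnt / class_counts[k[1]] for v, cnt in vals.items()}
        for k, vals in pair_counts.items()}                              # P(a_j | C_i)

print(prior)
print(cond[("housing", "good")])            # e.g. {'own': 1.0}
```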
Naïve Bayesian Classifiers for the Credit Example
• Class labels: {good, bad}, with P(good) = 0.7 and P(bad) = 0.3
• Conditional probabilities for each attribute value given each class label, for example P(own | good) = 0.75
To build a Naïve Bayesian Classifier we need to collect the
following statistics:
1. Probability of all class labels – the probability of good credit and the probability of bad credit. From all the data available in the training set, we determine P(good) = 0.7 and P(bad) = 0.3.
2. In the training set, there are several attributes:
personal_status, job, housing, and
saving_status. For each attribute and its possible values, we
need to compute the
conditional probabilities given bad or good credit. For
example, relative to the housing
attribute, we need to compute P(own|bad), P(own|good),
P(rent|bad), P(rent|good), etc.
Naïve Bayesian Classifier for a Particular Applicant
• Given applicant attributes of
A= {female single,
owns home,
self-employed,
savings > $1000}
• Since P(good|A) > P(bad|A), assign the applicant the label "good" credit
aj Ci P(aj| Ci)
female single good 0.28
female single bad 0.36
own good 0.75
own bad 0.62
self emp good 0.14
self emp bad 0.17
savings>1K good 0.06
savings>1K bad 0.02
P(good|A) ~ (0.28*0.75*0.14*0.06)*0.7 = 0.0012
P(bad|A) ~ (0.36*0.62*0.17*0.02)*0.3 = 0.0002
Here we have an example of an applicant who is female, single,
owns a home, is self-employed
and has savings over $1000 in her savings account. How will we
classify this person? Will she
be scored as a person with good or bad credit?
Having built the classifier with the training set we find
P(good|A) which is equal to 0.0012 (see
the computation on the slide) and P(bad|A) is 0.0002. Since
P(good|A) is the maximum of the
two probability scores, we assign the label “good” credit.
The score is only proportional to the probability. It doesn't
equal the probability, because we
haven't included the denominator. However, both formulas have
the same denominator, so we
don't need to calculate it in order to know which quantity is
bigger.
Notice, though, how small in magnitude these scores are. When
we are looking at problems
with a large number of attributes, or attributes with a very high
number of levels, these values
can become very small in magnitude.
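For concreteness, the following Python sketch reproduces the two scores on the slide from the tabulated conditional probabilities and the priors P(good) = 0.7 and P(bad) = 0.3; it is an illustration, not the lab implementation.

```python
# Reproducing the scoring on the slide with the conditional probabilities
# from the table; priors come from the earlier slide.
p_attr = {
    "good": [0.28, 0.75, 0.14, 0.06],   # female single, own, self emp, savings>1K
    "bad":  [0.36, 0.62, 0.17, 0.02],
}
prior = {"good": 0.7, "bad": 0.3}

scores = {}
for label, probs in p_attr.items():
    score = prior[label]
    for p in probs:
        score *= p                      # product of P(a_j | C_i), times P(C_i)
    scores[label] = score

print(scores)                           # {'good': ~0.0012, 'bad': ~0.0002}
print(max(scores, key=scores.get))      # 'good'
```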
Naïve Bayesian Implementation Considerations
• Numerical underflow
• Zero probabilities due to unobserved attribute/classifier pairs
– Addressed by smoothing (adjusting the estimated probabilities by a small amount)
• Assign the classifier label, Ci, that maximizes the value of

$$\sum_{j=1}^{m} \log P'(a_j \mid C_i) \;+\; \log P'(C_i)$$

where i = 1, 2, …, n and P' denotes the smoothing-adjusted probabilities
Multiplying several probability values, each possibly close to
zero, invariably leads to the
problem of numerical underflow. So an important
implementation guideline is to compute the
logarithm of the product of the probabilities, which is
equivalent to the summation of the
logarithm of the probabilities. Although the risk of underflow may increase as the number of attributes increases, the use of logarithms should be applied regardless of the number of attribute dimensions.
Additionally, to address the possibility of probabilities equal to
zero, smoothing techniques can
be employed to adjust the probabilities to ensure non-zero
values. Applying a smoothing
technique assigns a small non-zero probability to rare events not
included in the training
dataset. Also, the smoothing addresses the possibility of taking
the logarithm of zero.
The R implementation of Naïve Bayes incorporates the
smoothing directly into the probability
tables. Essentially, the Laplace smoothing that R uses adds one
(or a small value) to every
count. For example, if we have 100 "good" customers, and 20 of them rent their housing, the "raw" P(rent | good) = 20/100 = 0.2; with Laplace smoothing adding one to the counts, the calculation would be P(rent | good) ~ (20 + 1)/(100 + 3) ≈ 0.20388, where there are 3 possible values for housing (own, rent, for free).
Fortunately, the use of the logarithms and the smoothing
techniques are already implemented
in standard software packages for Naïve Bayes Classifiers.
However, if for performance
reasons, the Naïve Bayes Classifier algorithm needs to be coded
directly into an application,
these considerations should be implemented.
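A minimal sketch of these two safeguards, assuming the counts from the rent example above and the conditional probabilities from the earlier credit slide, might look like the following; it is illustrative and not a packaged implementation.

```python
# Sketch of the two implementation safeguards described above:
# Laplace smoothing and summing log-probabilities instead of multiplying.
import math

count_rent_good = 20
count_good = 100
n_housing_levels = 3                   # own, rent, for free

# Laplace smoothing: add 1 to the numerator and the number of levels
# to the denominator, matching the worked example in the text above.
p_rent_given_good = (count_rent_good + 1) / (count_good + n_housing_levels)
print(round(p_rent_given_good, 5))     # ~0.20388

# Log-probability scoring for one class: a sum of logs avoids numerical
# underflow when there are many attributes.
cond_probs = [0.28, 0.75, 0.14, 0.06]  # P(a_j | good) values from the earlier slide
prior_good = 0.7
log_score = sum(math.log(p) for p in cond_probs) + math.log(prior_good)
print(log_score)                       # log of ~0.0012
```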
Diagnostics
• Hold-out data
• Cross-validation
• ROC curve/AUC
The diagnostics we used in regression can also be used to validate the effectiveness of the model we built. The techniques of using hold-out data, performing N-fold cross-validation, and using ROC curves and the Area Under the Curve (AUC) can be deployed with the Naïve Bayesian Classifier as well.
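As a hedged sketch of how these diagnostics could be run, the snippet below uses scikit-learn's CategoricalNB on synthetic integer-coded data; the library, the data, and the parameter choices are assumptions for illustration only.

```python
# Illustrative diagnostics with scikit-learn on synthetic data
# (not part of the course material).
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 4))        # 4 categorical attributes, coded 0-2
y = rng.integers(0, 2, size=200)             # binary class label

# Hold-out data: fit on a training split, evaluate on the held-out split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = CategoricalNB().fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# N-fold cross-validation with ROC AUC as the scoring metric.
print("5-fold AUC:", cross_val_score(CategoricalNB(), X, y, cv=5, scoring="roc_auc"))
```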
Diagnostics: Confusion Matrix

                    Prediction
Actual Class        good           bad            Total
good                671 (TP)       29 (FN)        700
bad                 38 (FP)        262 (TN)       300
Total               709            291            1000

Overall success rate (or accuracy): (TP + TN) / (TP + TN + FP + FN) = (671 + 262)/1000 ≈ 0.93
TPR: TP / (TP + FN) = 671 / (671 + 29) = 671/700 ≈ 0.96
FPR: FP / (FP + TN) = 38 / (38 + 262) = 38/300 ≈ 0.13
FNR: FN / (TP + FN) = 29 / (671 + 29) = 29/700 ≈ 0.04
Precision: TP / (TP + FP) = 671/709 ≈ 0.95
Recall (or TPR): TP / (TP + FN) ≈ 0.96
A confusion matrix is a specific table layout that allows
visualization of the performance of a model. In
the hypothetical example of confusion matrix shown:
Of 1000 credit score samples, the model predicted 709 as good credit and 291 as bad credit. Of the 700 actual good credits, the model predicted 29 as bad; similarly, 38 of the actual bad credits were predicted as good. All correct guesses are located on the diagonal of the table, so it is easy to visually inspect the table for errors: they are represented by any non-zero values outside the diagonal.
We define the overall success rate (or accuracy), a metric of what we got right, as the ratio of the sum of the diagonal values (i.e., TP and TN) to the sum of the entire table. In other words, the confusion table of a good model has large numbers on the diagonal and small (ideally zero) numbers off the diagonal.
We saw a true positive rate (TPR) and a false positive rate
(FPR) when we discussed ROC curves:
• TPR – what percent of positive instances did we correctly
identify.
• FPR – what percent of negatives we marked positive.
Additionally, we can measure the false negative rate (FNR):
• FNR – what percent of positives we marked negative
The computation of TPR, FPR and FNR are shown in the slide.
Precision and Recall are accuracy metrics used by the
information retrieval community; they are often
used to characterize classifiers as well. We will detail these
metrics in lesson 8 of this module.
Note:
• precision – what percent of things we marked positive really
are positive
• recall – what percent of positive instances did we correctly
identify. Recall is equivalent to TPR.
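For reference, the short Python sketch below recomputes these metrics directly from the counts in the table; it simply restates the formulas above.

```python
# Re-computing the confusion-matrix metrics from the counts on the slide.
TP, FN = 671, 29     # actual good: predicted good / predicted bad
FP, TN = 38, 262     # actual bad:  predicted good / predicted bad

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 0.933
tpr       = TP / (TP + FN)                    # 0.959 (recall)
fpr       = FP / (FP + TN)                    # 0.127
fnr       = FN / (TP + FN)                    # 0.041
precision = TP / (TP + FP)                    # 0.946

print(accuracy, tpr, fpr, fnr, precision)
```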
Naïve Bayesian Classifier - Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+):
• Handles missing values quite well
• Robust to irrelevant variables
• Easy to implement
• Easy to score data
• Resistant to over-fitting
• Computationally efficient
• Handles very high dimensional problems
• Handles categorical variables with a lot of levels

Cautions (-):
• Numeric variables have to be discrete (categorized) into intervals
• Sensitive to correlated variables ("double-counting")
• Not good for estimating probabilities; stick to class label or yes/no
The Reasons to Choose (+) and Cautions (-) of the Naïve Bayesian Classifier are listed. Unlike logistic regression, missing values are handled well by the Naïve Bayesian Classifier. It is also very robust to irrelevant variables (the values of irrelevant variables are distributed among all the classes, so their effects are not pronounced).
The model is easy to implement, and we will see in the lab how easily a basic version can be implemented without using any packages. Scoring data (predicting) is very simple, and the model is resistant to over-fitting. (Over-fitting refers to fitting the training data so well that we also fit its idiosyncrasies, aspects of the data that are not relevant to characterizing it.) It is computationally efficient and handles high dimensional problems efficiently. Unlike logistic regression, the Naïve Bayesian Classifier handles categorical variables with a lot of levels.
The Cautions (-) are that it is sensitive to correlated variables because the algorithm double counts their effect. For example, people with low income tend to default, and people with low credit tend to default. It is also true that people with low income tend to have low credit. If we try to score "default" with both low income and low credit as variables, we will see the double-counting effect in our model output and in the scoring.
Though probabilities are provided as an output of the scored data, the Naïve Bayesian Classifier is not very reliable for probability estimation and should be used for class label assignment only. The Naïve Bayesian Classifier in its simple form is used only with categorical variables, and any continuous variables should be discretized into intervals. You will learn more about this in the lab. However, it is not strictly necessary to discretize continuous variables; several standard implementations can handle continuous variables as well.
Check Your Knowledge
1. Consider the following Training Data Set:
• Apply the Naïve Bayesian Classifier to this
data set and compute the probability
score for P(y = 1|X) for X = (1,0,0)
Show your work
2. List some prominent use cases of the Naïve Bayesian
Classifier.
3. What gives the Naïve Bayesian Classifier the advantage of
being
computationally inexpensive?
4. Why should we use log-likelihoods rather than pure
probability
values in the Naïve Bayesian Classifier?
X1 X2 X3 Y
1 1 1 0
1 1 0 0
0 0 0 0
0 1 0 1
1 0 1 1
0 1 1 1
Training Data Set
Your Thoughts?
Record your answers here. More Check Your Knowledge
questions are on the next page.
Check Your Knowledge (Continued)
5. What is a confusion matrix, and how is it used to evaluate the effectiveness of the model?
6. Consider the following data set with two input features, temperature and season
• What is the Naïve Bayesian assumption?
• Is the Naïve Bayesian assumption satisfied for this problem?
Your Thoughts?
Temperature Season Electricity Usage
-10 to 50 F Winter High
50 to 70 F Winter Low
70 to 85 F Summer Low
85 to 110 F Summer High
Record your answers here.
Advanced Analytics – Theory and Methods
During this lesson the following topics were covered:
• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of
the classifier
Naïve Bayesian Classifiers - Summary
This lesson covered these topics. Please take a moment to
review them.