Classification ©Jiawei Han and Micheline Kamber, http://www-sal.cs.uiuc.edu/~hanj/bk2/, Chp 6, modified by Donghui Zhang. Integrated with slides from Prof. Andrew W. Moore, http://www.cs.cmu.edu/~awm/tutorials
Content What is classification? Decision tree Naïve Bayesian Classifier Bayesian Networks Neural Networks Support Vector Machines (SVM)
Classification:   models categorical class labels (discrete or nominal) e.g. given a new customer, does she belong to the “likely to buy a computer” class? Prediction:  models continuous-valued functions e.g. how many computers will a customer buy? Typical Applications credit approval target marketing medical diagnosis treatment effectiveness analysis Classification vs. Prediction
Classification—A Two-Step Process   Model construction : describing a set of predetermined classes Each tuple/sample is assumed to belong to a predefined class, as determined by the  class label attribute The set of tuples used for model construction is  training set The model is represented as classification rules, decision trees, or mathematical formulae Model usage : for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy rate is the percentage of test set samples that are correctly classified by the model Test set is independent of training set, otherwise over-fitting will occur If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Classification Process (1): Model Construction Classification Algorithms IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’  Training Data Classifier (Model)
Classification Process (2): Use the Model in Prediction (Jeff, Professor, 4) Tenured? Classifier Testing Data Unseen Data
Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations New data is classified based on the training set Unsupervised learning (clustering) The class labels of the training data are unknown Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Evaluating Classification Methods Predictive accuracy Speed and scalability time to construct the model time to use the model Robustness handling noise and missing values Scalability efficiency in disk-resident databases  Interpretability:  understanding and insight provided by the model Goodness of rules decision tree size compactness of classification rules
Content What is classification? Decision tree Naïve Bayesian Classifier Bayesian Networks Neural Networks Support Vector Machines (SVM)
Training Dataset This follows an  example from Quinlan’s ID3
Output: A Decision Tree for "buys_computer"
age?
  <=30: student? (no: buys_computer = no; yes: buys_computer = yes)
  31..40: buys_computer = yes
  >40: credit rating? (excellent: buys_computer = no; fair: buys_computer = yes)
Extracting Classification Rules from Trees Represent the knowledge in the form of IF-THEN rules One rule is created for each path from the root to a leaf Each attribute-value pair along a path forms a conjunction The leaf node holds the class prediction Rules are easier for humans to understand Example IF age = "<=30" AND student = "no" THEN buys_computer = "no" IF age = "<=30" AND student = "yes" THEN buys_computer = "yes" IF age = "31..40" THEN buys_computer = "yes" IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no" IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm) Tree is constructed in a  top-down recursive divide-and-conquer manner At start, all the training examples are at the root Attributes are categorical (if continuous-valued, they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,  information gain ) Conditions for stopping partitioning All samples for a given node belong to the same class There are no remaining attributes for further partitioning –  majority voting  is employed for classifying the leaf There are no samples left
Information gain slides adapted from Andrew W. Moore Associate Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm [email_address] 412-268-7599 Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Bits You are watching a set of independent random samples of X You see that X has four possible values So you might see: BAACBADCDADDDA… You transmit data over a binary serial link. You can encode each reading with two bits (e.g. A = 00, B = 01, C = 10, D = 11) 0100001001001110110011111100… P(X=C) = 1/4 P(X=B) = 1/4 P(X=D) = 1/4 P(X=A) = 1/4
Fewer Bits Someone tells you that the probabilities are not equal It’s possible… … to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How? P(X=C) = 1/8 P(X=B) = 1/4 P(X=D) = 1/8 P(X=A) = 1/2
Fewer Bits Someone tells you that the probabilities are not equal: P(X=A) = 1/2, P(X=B) = 1/4, P(X=C) = 1/8, P(X=D) = 1/8 It's possible… … to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How? (This is just one of several ways) Code: A = 0, B = 10, C = 110, D = 111
Fewer Bits Suppose there are three equally likely values: P(X=A) = 1/3, P(X=B) = 1/3, P(X=C) = 1/3 Here's a naïve coding, costing 2 bits per symbol: A = 00, B = 01, C = 10 Can you think of a coding that would need only 1.6 bits per symbol on average? In theory, it can in fact be done with 1.58496 bits per symbol.
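A quick numerical check of both claims, sketched in Python (not part of the original slides):

```python
import math

# Expected code length for the skewed distribution with code A=0, B=10, C=110, D=111
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
lengths = {"A": 1, "B": 2, "C": 3, "D": 3}
expected_bits = sum(probs[s] * lengths[s] for s in probs)
print(expected_bits)            # 1.75 bits per symbol on average

# Theoretical lower bound for three equally likely values
print(math.log2(3))             # 1.58496... bits per symbol
```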
General Case Suppose X can have one of m values: V1, V2, … Vm, with P(X=V1) = p1, P(X=V2) = p2, … P(X=Vm) = pm. What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It's H(X) = -p1 log2 p1 - p2 log2 p2 - … - pm log2 pm = -Σj pj log2 pj, the entropy of X. "High Entropy" means X is from a uniform (boring) distribution: a histogram of the frequency distribution of values of X would be flat, and so the values sampled from it would be all over the place. "Low Entropy" means X is from a varied (peaks and valleys) distribution: a histogram of the frequency distribution of values of X would have many lows and one or two highs, and so the values sampled from it would be more predictable.
Entropy in a nut-shell Low Entropy High Entropy
Entropy in a nut-shell Low Entropy High Entropy ..the values (locations of soup) unpredictable... almost uniformly sampled throughout our dining room ..the values (locations of soup) sampled entirely from within the soup bowl
Exercise: Suppose 100 customers have two classes:  “ Buy Computer” and “Not Buy Computer”. Uniform distribution: 50 buy. Entropy? Skewed distribution: 100 buy. Entropy?
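A small Python sketch for the exercise; the entropy function and the two class distributions are spelled out explicitly:

```python
import math

def entropy(probs):
    """H(X) = -sum(p * log2(p)), with 0 * log(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform distribution: 50 of 100 customers buy -> P(buy) = P(not buy) = 0.5
print(entropy([0.5, 0.5]))   # 1.0 bit

# Skewed distribution: all 100 buy -> P(buy) = 1, P(not buy) = 0
print(entropy([1.0, 0.0]))   # 0 bits (may print as -0.0)
```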
Specific Conditional Entropy Suppose I'm trying to predict output Y and I have input X. X = College Major, Y = Likes "Gladiator". The eight records (X, Y): (Math, Yes), (History, No), (CS, Yes), (Math, No), (Math, No), (CS, Yes), (History, No), (Math, Yes) Let's assume this reflects the true probabilities E.G. From this data we estimate P(LikeG = Yes) = 0.5 P(Major = Math & LikeG = No) = 0.25 P(Major = Math) = 0.5 P(LikeG = Yes | Major = History) = 0 Note: H(X) = 1.5 H(Y) = 1
Specific Conditional Entropy Definition of Specific Conditional Entropy: H(Y | X=v) = The entropy of Y among only those records in which X has value v Example (X = College Major, Y = Likes "Gladiator", table above): H(Y | X=Math) = 1 H(Y | X=History) = 0 H(Y | X=CS) = 0
Conditional Entropy Definition of Conditional Entropy: H(Y | X) = The average specific conditional entropy of Y = if you choose a record at random, what will be the conditional entropy of Y, conditioned on that row's value of X = Expected number of bits to transmit Y if both sides will know the value of X = Σj Prob(X=vj) H(Y | X=vj) Example (X = College Major, Y = Likes "Gladiator"): vj = Math: Prob 0.5, H(Y | X=vj) = 1; vj = History: Prob 0.25, H(Y | X=vj) = 0; vj = CS: Prob 0.25, H(Y | X=vj) = 0 H(Y | X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5
Information Gain Definition of Information Gain: IG(Y | X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X? IG(Y | X) = H(Y) - H(Y | X) Example (X = College Major, Y = Likes "Gladiator"): H(Y) = 1, H(Y | X) = 0.5, thus IG(Y | X) = 1 - 0.5 = 0.5
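The H(Y), H(Y|X) and IG(Y|X) values above can be reproduced directly from the eight records; a minimal Python sketch:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The eight (Major, LikesGladiator) records from the slides
records = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
           ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

X = [x for x, _ in records]
Y = [y for _, y in records]

H_Y = entropy(Y)
H_Y_given_X = sum((X.count(v) / len(X)) * entropy([y for x, y in records if x == v])
                  for v in set(X))
print(H_Y, H_Y_given_X, H_Y - H_Y_given_X)   # 1.0 0.5 0.5
```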
What is Information Gain used for? Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find… IG(LongLife | HairColor) = 0.01 IG(LongLife | Smoker) = 0.2 IG(LongLife | Gender) = 0.25 IG(LongLife | LastDigitOfSSN) = 0.00001 IG tells you how interesting a 2-d contingency table is going to be.
Conditional entropy H(C|age) Specific conditional entropies for each age branch (C = buys_computer): H(C|age<=30) = 2/5 * lg(5/2) + 3/5 * lg(5/3) = 0.971 H(C|age in 31..40) = 1 * lg 1 = 0 H(C|age>40) = 3/5 * lg(5/3) + 2/5 * lg(5/2) = 0.971 Weighting by the branch probabilities: H(C|age) = 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 = 0.694
Select the attribute with lowest conditional entropy H(C|age) = 0.694 H(C|income) = 0.911 H(C|student) = 0.789 H(C|credit_rating) = 0.892 Select "age" to be the tree root!
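A sketch of the attribute-selection step, assuming the standard 14-tuple buys_computer training set that the earlier slides reference (the tuples below are taken from that textbook example, not from this page):

```python
import math
from collections import Counter

# The 14-tuple buys_computer training set (age, income, student, credit_rating, class),
# assumed here to be the standard AllElectronics example the earlier slides refer to.
data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
attributes = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(col):
    # H(C|A) = sum over values v of A of P(A=v) * H(C among records with A=v)
    return sum((sum(1 for r in data if r[col] == v) / len(data))
               * entropy([r[-1] for r in data if r[col] == v])
               for v in set(r[col] for r in data))

for i, name in enumerate(attributes):
    print(name, round(conditional_entropy(i), 3))
# age 0.694, income 0.911, student 0.788 (0.789 on the slide, a rounding difference),
# credit_rating 0.892 -> "age" has the lowest H(C|A), i.e. the highest information gain
```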
Goodness in Decision Tree Induction relatively faster learning speed (than other classification methods) convertible to simple and easy to understand classification rules can use SQL queries for accessing databases comparable classification accuracy with other methods
Scalable Decision Tree Induction Methods in Data Mining Studies SLIQ  (EDBT’96 — Mehta et al.) builds an index for each attribute and only class list and the current attribute list reside in memory SPRINT  (VLDB’96 — J. Shafer et al.) constructs an attribute list data structure  PUBLIC  (VLDB’98 — Rastogi & Shim) integrates tree splitting and tree pruning: stop growing the tree earlier RainForest  (VLDB’98 — Gehrke, Ramakrishnan & Ganti) separates the scalability aspects from the criteria that determine the quality of the tree builds an AVC-list (attribute, value, class label)
Visualization of a   Decision Tree   in SGI/MineSet 3.0
Content What is classification? Decision tree Naïve Bayesian Classifier Bayesian Networks Neural Networks Support Vector Machines (SVM)
Bayesian Classification: Why? Probabilistic learning :  Calculate explicit probabilities for hypothesis, among the most practical approaches to certain types of learning problems Incremental : Each training example can incrementally increase/decrease the probability that a hypothesis is correct.  Prior knowledge can be combined with observed data. Probabilistic prediction :  Predict multiple hypotheses, weighted by their probabilities Standard : Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Classification X: a data sample whose class label is unknown, e.g. X = (Income=medium, Credit_rating=Fair, Age=40). Hi: a hypothesis that a record belongs to class Ci, e.g. Hi = a record belongs to the "buy computer" class. P(Hi), P(X): probabilities. P(Hi|X): a conditional probability: among all records with medium income and fair credit rating, what's the probability of buying a computer? This is what we need for classification! Given X, P(Hi|X) tells us the probability that it belongs to each class. What if we need to determine a single class for X?
Bayesian Theorem Another concept, P(X|Hi): the probability of observing the sample X, given that the hypothesis holds. E.g. among all people who buy a computer, what percentage has the same attribute values as X. We know P(X ^ Hi) = P(Hi|X) P(X) = P(X|Hi) P(Hi), so P(Hi|X) = P(X|Hi) P(Hi) / P(X) We should assign X to the class Ci where P(Hi|X) is maximized; since P(X) is the same for every class, this is equivalent to maximizing P(X|Hi) P(Hi).
Basic Idea Read the training data. Compute P(Hi) for each class. Compute P(Xk|Hi) for each distinct instance Xk of the attribute vector among records in class Ci. To predict the class for a new data sample X: for each class, compute P(X|Hi) P(Hi), and return the class with the largest value. Any problem? Too many combinations of Xk, e.g. 50 ages, 50 credit ratings, 50 income levels → 125,000 combinations of Xk!
Naïve Bayes Classifier A simplifying assumption: attributes are conditionally independent given the class: the probability of observing, say, 2 attribute values y1 and y2 together, given the current class C, is the product of the probabilities of each value taken separately, given the same class: P([y1, y2] | C) = P(y1 | C) * P(y2 | C) No dependence relation between attributes Greatly reduces the number of probabilities to maintain.
Sample quiz questions 1. What data does the naïve Bayesian classifier maintain? 2. Given X = (age<=30, Income=medium, Student=yes, Credit_rating=Fair), buy or not buy?
Naïve Bayesian Classifier: Example Compute P(X|Ci) for each class P(age="<=30" | buys_computer="yes") = 2/9 = 0.222 P(age="<=30" | buys_computer="no") = 3/5 = 0.6 P(income="medium" | buys_computer="yes") = 4/9 = 0.444 P(income="medium" | buys_computer="no") = 2/5 = 0.4 P(student="yes" | buys_computer="yes") = 6/9 = 0.667 P(student="yes" | buys_computer="no") = 1/5 = 0.2 P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667 P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4 X = (age<=30, income=medium, student=yes, credit_rating=fair) P(X|Ci): P(X|buys_computer="yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 P(X|buys_computer="no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019 P(X|Ci)*P(Ci): P(X|buys_computer="yes") * P(buys_computer="yes") = 0.044 x 9/14 = 0.028 P(X|buys_computer="no") * P(buys_computer="no") = 0.019 x 5/14 = 0.007 X belongs to class "buys_computer=yes" Pitfall: don't forget to multiply by P(Ci)
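The same arithmetic as a small Python sketch, with the counts hard-coded from the slide:

```python
# Re-doing the slide's arithmetic for X = (age<=30, income=medium, student=yes, credit=fair)
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9, "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5, "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

scores = {}
for c in priors:
    p = priors[c]
    for value, likelihood in cond[c].items():
        p *= likelihood            # naive assumption: P(X|C) = product of per-attribute likelihoods
    scores[c] = p

print(scores)   # roughly {'yes': 0.028, 'no': 0.007} -> predict buys_computer = yes
```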
Naïve Bayesian Classifier: Comments Advantages: Easy to implement Good results obtained in most of the cases Disadvantages Assumption: class conditional independence, therefore loss of accuracy Practically, dependencies exist among variables E.g., hospitals: patients: Profile: age, family history etc Symptoms: fever, cough etc., Disease: lung cancer, diabetes etc Dependencies among these cannot be modeled by Naïve Bayesian Classifier How to deal with these dependencies? Bayesian Belief Networks
Content What is classification? Decision tree Naïve Bayesian Classifier Bayesian Networks Neural Networks Support Vector Machines (SVM)
Bayesian Networks slides adapted from Andrew W. Moore Associate Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm [email_address] 412-268-7599 Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
What we’ll discuss Recall the numerous and dramatic benefits of Joint Distributions for describing uncertain worlds Reel with terror at the problem with using Joint Distributions Discover how Bayes Net methodology allows us to build Joint Distributions in manageable chunks Discover there’s still a lurking problem… … Start to solve that problem
Why this matters In Andrew’s opinion, the most important technology in the Machine Learning / AI field to have emerged in the last 10 years. A clean, clear, manageable language and methodology for expressing what you’re certain and uncertain about Already, many practical applications in medicine, factories, helpdesks: P(this problem | these symptoms) anomalousness of this observation choosing next diagnostic test | these observations Anomaly Detection Inference Active Data Collection
Ways to deal with Uncertainty Three-valued logic: True / False / Maybe Fuzzy logic (truth values between 0 and 1) Non-monotonic reasoning (especially focused on Penguin informatics) Dempster-Shafer theory (and an extension known as quasi-Bayesian theory) Possibilistic Logic Probability
Discrete Random Variables A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs. Examples A = The US president in 2023 will be male A = You wake up tomorrow with a headache A = You have Ebola
Probabilities We write P(A) as “the fraction of possible worlds in which A is true” We could at this point spend 2 hours on the philosophy of this. But we won’t.
Visualizing A Event space of all possible worlds Its area is 1 Worlds in which A is False Worlds in which A is true P(A) = Area of reddish oval
Interpreting the axioms 0 <= P(A)  <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B) The area of A can’t get any smaller than 0 And a zero area would mean no world could ever have A true
Interpreting the axioms 0 <=  P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B) The area of A can’t get any bigger than 1 And an area of 1 would mean all worlds will have A true
Interpreting the axioms 0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P( A  or  B ) = P( A ) + P( B ) - P( A  and  B ) A B
Interpreting the axioms 0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P( A  or  B ) = P( A ) + P( B ) - P( A  and  B ) P(A or B) B P(A and B) Simple addition and subtraction A B
These Axioms are Not to be Trifled With There have been attempts to do different methodologies for uncertainty Fuzzy Logic Three-valued logic Dempster-Shafer Non-monotonic reasoning But the axioms of probability are the only system with this property: If you gamble using them you can't be unfairly exploited by an opponent using some other system [de Finetti 1931]
Theorems from the Axioms 0 <= P(A) <= 1, P(True) = 1, P(False) = 0 P( A  or  B ) = P( A ) + P( B ) - P( A  and  B ) From these we can prove: P(not A) = P(~A) = 1-P(A) How?
Another important theorem 0 <= P(A) <= 1, P(True) = 1, P(False) = 0 P( A  or  B ) = P( A ) + P( B ) - P( A  and  B ) From these we can prove: P(A) = P(A ^ B) + P(A ^ ~B) How?
Conditional Probability P(A|B) = Fraction of worlds in which B is true that also have A true F H H = “Have a headache” F = “Coming down with Flu” P(H) = 1/10 P(F) = 1/40 P(H|F) = 1/2 “ Headaches are rare and flu is rarer, but if you’re coming down with ‘flu there’s a 50-50 chance you’ll have a headache.”
Conditional Probability H = "Have a headache" F = "Coming down with Flu" P(H) = 1/10 P(F) = 1/40 P(H|F) = 1/2 P(H|F) = Fraction of flu-inflicted worlds in which you have a headache = (#worlds with flu and headache) / (#worlds with flu) = (Area of "H and F" region) / (Area of "F" region) = P(H ^ F) / P(F)
Definition of Conditional Probability P(A|B) = P(A ^ B) / P(B) Corollary: The Chain Rule P(A ^ B) = P(A|B) P(B)
Bayes Rule P(B|A) = P(A ^ B) / P(A) = P(A|B) P(B) / P(A) This is Bayes Rule Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
Using Bayes Rule to Gamble The “Win” envelope has a dollar and four beads in it The “Lose” envelope has three beads and no money Trivial question: someone draws an envelope at random and offers to sell it to you. How much should you pay? R  R  B  B R  B  B $1.00
Using Bayes Rule to Gamble The “Win” envelope has a dollar and four beads in it The “Lose” envelope has three beads and no money Interesting question: before deciding, you are allowed to see one bead drawn from the envelope. Suppose it’s black: How much should you pay?  Suppose it’s red: How much should you pay? $1.00
Another Example Your friend told you that she has two children (not twins). You are on your trip to visit them. Probability that both children are male? You knocked on the door. A male child answered the door. Probability that both children are male? He smiled and said "I'm the big child!" Probability that both children are male? Let B = big child is male, S = small child is male, X = both children are male. Compute P(X), P(X | B ∨ S), P(X | B).
Multivalued Random Variables Suppose A can take on more than 2 values A is a random variable with arity k if it can take on exactly one value out of {v1, v2, .. vk} Thus… P(A=vi ^ A=vj) = 0 if i ≠ j, and P(A=v1 ∨ A=v2 ∨ … ∨ A=vk) = 1
An easy fact about Multivalued Random Variables: Using the axioms of probability… 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B) And assuming that A obeys the multivalued constraints above, it's easy to prove that P(A=v1 ∨ A=v2 ∨ … ∨ A=vi) = Σj=1..i P(A=vj) And thus we can prove Σj=1..k P(A=vj) = 1
Another fact about Multivalued Random Variables: Using the axioms of probability… 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B) And assuming that A obeys the multivalued constraints above, it's easy to prove that P(B ^ [A=v1 ∨ A=v2 ∨ … ∨ A=vi]) = Σj=1..i P(B ^ A=vj) And thus we can prove P(B) = Σj=1..k P(B ^ A=vj)
More General Forms of Bayes Rule P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ] P(A|B ^ X) = P(B|A ^ X) P(A ^ X) / P(B ^ X) For a multivalued A: P(A=vi|B) = P(B|A=vi) P(A=vi) / Σk P(B|A=vk) P(A=vk)
Useful Easy-to-prove facts P(A|B) + P(~A|B) = 1 Σk P(A=vk|B) = 1
From Probability to Bayesian Net Suppose there are some diseases, symptoms and related facts. E.g. flu, headache, fever, lung cancer, smoker. We are interested to know some (conditional or unconditional) probabilities such as P( flu ) P( lung cancer | smoker ) P( flu | headache ^ ~fever ) What shall we do?
The Joint Distribution Recipe for making a joint distribution of M variables: 1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows). 2. For each combination of values, say how probable it is. 3. If you subscribe to the axioms of probability, those numbers must sum to 1. Example: Boolean variables A, B, C
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
Using the Joint Once you have the JD you can ask for the probability of any logical expression involving your attributes: P(E) = Σ over rows matching E of P(row)
Using the Joint P(Poor Male) = 0.4654
Using the Joint P(Poor) = 0.7604
Inference with the Joint P(E1 | E2) = P(E1 ^ E2) / P(E2) = (Σ over rows matching E1 and E2 of P(row)) / (Σ over rows matching E2 of P(row))
Inference with the Joint P( Male  |  Poor ) = 0.4654 / 0.7604 = 0.612
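A minimal sketch of inference with a joint distribution, using the small A, B, C joint from the earlier slide; prob() sums the rows that match a logical expression:

```python
# The (A, B, C) joint distribution from the earlier slide, and brute-force inference
# by summing the matching rows.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(condition):
    """P(E) = sum of P(row) over rows where the logical expression E holds."""
    return sum(p for (a, b, c), p in joint.items() if condition(a, b, c))

p_a = prob(lambda a, b, c: a == 1)                       # P(A)
p_a_and_b = prob(lambda a, b, c: a == 1 and b == 1)      # P(A ^ B)
print(p_a, p_a_and_b, p_a_and_b / p_a)                   # roughly 0.5, 0.35, P(B|A) = 0.7
```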
Joint distributions Good news Once you have a joint distribution, you can ask important questions about stuff that involves a lot of uncertainty Bad news Impossible to create for more than about ten attributes because there are so many numbers needed when you build the damn thing.
Using fewer numbers Suppose there are two events: M: Manuela teaches the class (otherwise it’s Andrew) S: It is sunny The joint p.d.f. for these events contain four entries. If we want to build the joint p.d.f. we’ll have to invent those four numbers.  OR WILL WE?? We don’t have to specify with bottom level conjunctive events such as P(~M^S) IF… … instead it may sometimes be more convenient for us to specify things like: P(M), P(S). But just P(M) and  P(S) don’t derive the joint distribution.  So you can’t answer all questions.
Using fewer numbers Suppose there are two events: M: Manuela teaches the class (otherwise it’s Andrew) S: It is sunny The joint p.d.f. for these events contain four entries. If we want to build the joint p.d.f. we’ll have to invent those four numbers.  OR WILL WE?? We don’t have to specify with bottom level conjunctive events such as P(~M^S) IF… … instead it may sometimes be more convenient for us to specify things like: P(M), P(S). But just P(M) and  P(S) don’t derive the joint distribution.  So you can’t answer all questions. What extra assumption can you make?
Independence "The sunshine levels do not depend on and do not influence who is teaching." This can be specified very simply: P(S | M) = P(S) This is a powerful statement! It required extra domain knowledge. A different kind of knowledge than numerical probabilities. It needed an understanding of causation.
Independence From P(S | M) = P(S), the rules of probability imply: (can you prove these?) P(~S | M) = P(~S) P(M | S) = P(M) P(M ^ S) = P(M) P(S) P(~M ^ S) = P(~M) P(S), P(M ^ ~S) = P(M) P(~S), P(~M ^ ~S) = P(~M) P(~S) And in general: P(M=u ^ S=v) = P(M=u) P(S=v) for each of the four combinations of u = True/False, v = True/False
Independence We've stated: P(M) = 0.6 P(S) = 0.3 P(S | M) = P(S) From these statements, we can derive the full joint pdf over the four (M, S) combinations. And since we now have the joint pdf, we can make any queries we like.
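A tiny sketch of that derivation: the four joint entries follow from P(M), P(S) and the independence assumption alone:

```python
# Deriving the full joint of M and S from P(M), P(S) and P(S|M) = P(S)
p_m, p_s = 0.6, 0.3

joint = {(m, s): (p_m if m else 1 - p_m) * (p_s if s else 1 - p_s)
         for m in (True, False) for s in (True, False)}

for (m, s), p in joint.items():
    print(f"M={m!s:5} S={s!s:5} P={p:.2f}")
# M^S = 0.18, M^~S = 0.42, ~M^S = 0.12, ~M^~S = 0.28 (sums to 1)
```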
A more interesting case M : Manuela teaches the class S : It is sunny L : The lecturer arrives slightly late. Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive later than Manuela.
A more interesting case M: Manuela teaches the class S: It is sunny L: The lecturer arrives slightly late. Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela. Let's begin with writing down the knowledge we're happy about: P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6 Lateness is not independent of the weather and is not independent of the lecturer.
We already know the Joint of S and M, so all we need now is P(L | S=u, M=v) in the 4 cases of u/v = True/False.
P(S | M) = P(S) P(S) = 0.3 P(M) = 0.6 P(L | M ^ S) = 0.05 P(L | M ^ ~S) = 0.1 P(L | ~M ^ S) = 0.1 P(L | ~M ^ ~S) = 0.2 Now we can derive a full joint p.d.f. with a "mere" six numbers instead of seven* (*Savings are larger for larger numbers of variables.)
Question: Express P(L=x ^ M=y ^ S=z) in terms that only need the above expressions, where x, y and z may each be True or False.
A bit of notation P(S | M) = P(S) P(S) = 0.3 P(M) = 0.6 P(L | M ^ S) = 0.05 P(L | M ^ ~S) = 0.1 P(L | ~M ^ S) = 0.1 P(L | ~M ^ ~S) = 0.2 This is drawn as a graph with nodes S, M and L, where S and M each have an arrow into L, annotated with P(S)=0.3, P(M)=0.6 and the four P(L | …) values. Read the absence of an arrow between S and M to mean "it would not help me predict M if I knew the value of S" Read the two arrows into L to mean that if I want to know the value of L it may help me to know M and to know S. This kind of stuff will be thoroughly formalized later
An even cuter trick Suppose we have these three events: M: Lecture taught by Manuela L: Lecturer arrives late R: Lecture concerns robots Suppose: Andrew has a higher chance of being late than Manuela. Andrew has a higher chance of giving robotics lectures. What kind of independence can we find? How about: P(L | M) = P(L)? P(R | M) = P(R)? P(L | R) = P(L)?
Conditional independence Once you know who the lecturer is, then whether they arrive late doesn't affect whether the lecture concerns robots. P(R | M, L) = P(R | M) and P(R | ~M, L) = P(R | ~M) We express this in the following way: "R and L are conditionally independent given M" ..which is also notated by the diagram with arrows M → L and M → R: given knowledge of M, knowing anything else in the diagram won't help us with L, etc.
Conditional Independence formalized R and L are conditionally independent given M if for all x, y, z in {T, F}: P(R=x | M=y ^ L=z) = P(R=x | M=y) More generally: Let S1, S2 and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets, P(S1's assignments | S2's assignments & S3's assignments) = P(S1's assignments | S3's assignments)
Example: "Shoe-size is conditionally independent of Glove-size given height, weight and age" means: for all s, g, h, w, a: P(ShoeSize=s | Height=h, Weight=w, Age=a) = P(ShoeSize=s | Height=h, Weight=w, Age=a, GloveSize=g) It does not mean: for all s, g, h: P(ShoeSize=s | Height=h) = P(ShoeSize=s | Height=h, GloveSize=g)
Conditional independence (net with arrows M → L and M → R) We can write down P(M). And then, since we know L is only directly influenced by M, we can write down the values of P(L | M) and P(L | ~M) and know we've fully specified L's behavior. Ditto for R. P(M) = 0.6 P(L | M) = 0.085 P(L | ~M) = 0.17 P(R | M) = 0.3 P(R | ~M) = 0.6 'R and L conditionally independent given M'
Conditional Independence: P(R | M, L) = P(R | M), P(R | ~M, L) = P(R | ~M) Again, we can obtain any member of the joint prob dist that we desire: P(L=x ^ R=y ^ M=z) = P(L=x | M=z) P(R=y | M=z) P(M=z)
Assume five variables T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Manuela S: It is sunny T only directly influenced by L (i.e. T is conditionally independent of R,M,S given L) L only directly influenced by M and S (i.e. L is conditionally independent of R given M & S) R only directly influenced by M (i.e. R is conditionally independent of L,S, given M) M and S are independent
Making a Bayes net T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Manuela S: It is sunny S M R L T Step One: add variables. Just choose the variables you’d like to be included in the net.
Making a Bayes net T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Manuela S: It is sunny Step Two: add links. The link structure must be acyclic. If node X is given parents Q1, Q2, .. Qn you are promising that any variable that's a non-descendant of X is conditionally independent of X given {Q1, Q2, .. Qn}
Making a Bayes net (same variables) Step Three: add a probability table for each node. The table for node X must list P(X | Parent Values) for each possible combination of parent values: P(S)=0.3 P(M)=0.6 P(R | M)=0.3 P(R | ~M)=0.6 P(L | M^S)=0.05 P(L | M^~S)=0.1 P(L | ~M^S)=0.1 P(L | ~M^~S)=0.2 P(T | L)=0.3 P(T | ~L)=0.8
Making a Bayes net (same net and tables) Two unconnected variables may still be correlated Each node is conditionally independent of all non-descendants in the tree, given its parents. You can deduce many other conditional independence relations from a Bayes net. See the next lecture.
Bayes Nets Formalized A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair  V  ,  E  where: V is a set of vertices. E is a set of directed edges joining vertices.  No loops of any length are allowed. Each vertex in  V  contains the following information: The name of a random variable A probability distribution table indicating how the probability of this variable’s values depends on all possible combinations of parental values.
Building a Bayes Net 1. Choose a set of relevant variables. 2. Choose an ordering for them; call them X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc.) 3. For i = 1 to m: Add the Xi node to the network Set Parents(Xi) to be a minimal subset of {X1 … Xi-1} such that we have conditional independence of Xi and all other members of {X1 … Xi-1} given Parents(Xi) Define the probability table of P(Xi = k | Assignments of Parents(Xi)).
Computing a Joint Entry How to compute an entry in a joint distribution? E.g.: What is P(S ^ ~M ^ L ^ ~R ^ T)? (Net as before: S and M are parents of L, M is the parent of R, L is the parent of T, with P(S)=0.3, P(M)=0.6, P(L | M^S)=0.05, P(L | M^~S)=0.1, P(L | ~M^S)=0.1, P(L | ~M^~S)=0.2, P(R | M)=0.3, P(R | ~M)=0.6, P(T | L)=0.3, P(T | ~L)=0.8)
Computing with Bayes Net P(T ^ ~R ^ L ^ ~M ^ S) = P(T | ~R ^ L ^ ~M ^ S) * P(~R ^ L ^ ~M ^ S) = P(T | L) * P(~R ^ L ^ ~M ^ S) = P(T | L) * P(~R | L ^ ~M ^ S) * P(L ^ ~M ^ S) = P(T | L) * P(~R | ~M) * P(L ^ ~M ^ S) = P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M ^ S) = P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M | S) * P(S) = P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M) * P(S).
The general case P(X1=x1 ^ X2=x2 ^ … ^ Xn=xn) = P(Xn=xn | Xn-1=xn-1 ^ … ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X1=x1) = P(Xn=xn | Xn-1=xn-1 ^ … ^ X1=x1) * P(Xn-1=xn-1 | Xn-2=xn-2 ^ … ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X1=x1) = … = the product over i of P(Xi=xi | assignments of Parents(Xi)) So any entry in the joint pdf table can be computed. And so any conditional probability can be computed.
Where are we now? We have a methodology for building Bayes nets. We don't require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node. We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes. So we can also compute answers to any questions. E.g. what could we do to compute P(R | T, ~S)? Step 1: Compute P(R ^ T ^ ~S), the sum of all the rows in the Joint that match R ^ T ^ ~S (4 joint computes). Step 2: Compute P(T ^ ~S), the sum of all the rows in the Joint that match T ^ ~S (8 joint computes). Step 3: Return P(R ^ T ^ ~S) / P(T ^ ~S). Each of these rows is obtained by the "computing a joint probability entry" method of the earlier slides.
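A sketch of both steps in Python, assuming the CPT values shown for the S, M, L, R, T net; joint() multiplies one CPT entry per variable, and the conditional is computed by the slow enumeration method described above:

```python
from itertools import product

# CPTs of the example net: P(S), P(M), P(L|M,S), P(R|M), P(T|L)
P_S, P_M = 0.3, 0.6
P_L = {(True, True): 0.05, (True, False): 0.1, (False, True): 0.1, (False, False): 0.2}
P_R = {True: 0.3, False: 0.6}
P_T = {True: 0.3, False: 0.8}

def joint(s, m, l, r, t):
    """P(S=s ^ M=m ^ L=l ^ R=r ^ T=t) as a product of one CPT entry per variable."""
    p = (P_S if s else 1 - P_S) * (P_M if m else 1 - P_M)
    p *= P_L[(m, s)] if l else 1 - P_L[(m, s)]
    p *= P_R[m] if r else 1 - P_R[m]
    p *= P_T[l] if t else 1 - P_T[l]
    return p

# One joint entry, e.g. P(T ^ ~R ^ L ^ ~M ^ S) as derived on the earlier slide
print(joint(s=True, m=False, l=True, r=False, t=True))          # about 0.00144

# Slow, exact inference by enumeration: P(R | T ^ ~S)
num = sum(joint(*v) for v in product([True, False], repeat=5)
          if v[3] and v[4] and not v[0])                         # rows matching R ^ T ^ ~S
den = sum(joint(*v) for v in product([True, False], repeat=5)
          if v[4] and not v[0])                                  # rows matching T ^ ~S
print(num / den)
```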
The good news We can do inference. We can compute any conditional probability: P(Some variable | Some other variable values) Suppose you have m binary-valued variables in your Bayes Net and expression E2 mentions k variables. How much work is the above computation?
The sad, bad news Conditional probabilities by enumerating all matching entries in the joint are expensive: Exponential in the number of variables.
The sad, bad news Conditional probabilities by enumerating all matching entries in the joint are expensive: Exponential in the number of variables. But perhaps there are faster ways of querying Bayes nets? In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find there are often many tricks to save you time. So we’ve just got to program our computer to do those tricks too, right?
The sad, bad news Conditional probabilities by enumerating all matching entries in the joint are expensive: Exponential in the number of variables. But perhaps there are faster ways of querying Bayes nets? In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find there are often many tricks to save you time. So we’ve just got to program our computer to do those tricks too, right? Sadder and worse news: General querying of Bayes nets is NP-complete.
Bayes nets inference algorithms A poly-tree is a directed acyclic graph in which no two nodes have more than one path between them. (Figure: a poly-tree over S, M, L, R, T versus a net over X1 .. X5 that is not a poly-tree but is still a legal Bayes net.) If the net is a poly-tree, there is a linear-time algorithm (see a later Andrew lecture). The best general-case algorithms convert a general net to a poly-tree (often at huge expense) and call the poly-tree algorithm. Another popular, practical approach (doesn't assume poly-tree): Stochastic Simulation.
Sampling from the Joint Distribution It's pretty easy to generate a set of variable-assignments at random with the same probability as the underlying joint distribution. How? (Net as before, with P(S)=0.3, P(M)=0.6, P(L | M^S)=0.05, P(L | M^~S)=0.1, P(L | ~M^S)=0.1, P(L | ~M^~S)=0.2, P(R | M)=0.3, P(R | ~M)=0.6, P(T | L)=0.3, P(T | ~L)=0.8)
Sampling from the Joint Distribution 1. Randomly choose S. S = True with prob 0.3 2. Randomly choose M. M = True with prob 0.6 3. Randomly choose L. The probability that L is true depends on the assignments of S and M. E.g. if steps 1 and 2 had produced S=True, M=False, then the probability that L is true is 0.1 4. Randomly choose R. Probability depends on M. 5. Randomly choose T. Probability depends on L
A general sampling algorithm Let's generalize the example above to a general Bayes Net. As before, call the variables X1 .. Xn, where Parents(Xi) must be a subset of {X1 .. Xi-1}. For i = 1 to n: Find the parents, if any, of Xi. Assume n(i) parents. Call them Xp(i,1), Xp(i,2), … Xp(i,n(i)). Recall the values that those parents were randomly given: xp(i,1), xp(i,2), … xp(i,n(i)). Look up in the lookup-table: P(Xi=True | Xp(i,1)=xp(i,1), Xp(i,2)=xp(i,2), … Xp(i,n(i))=xp(i,n(i))) Randomly set xi=True according to this probability. x1, x2, … xn are now a sample from the joint distribution of X1, X2, … Xn.
Stochastic Simulation Example Someone wants to know P(R=True | T=True ^ S=False) We'll do lots of random samplings and count the number of occurrences of the following: Nc: Num. samples in which T=True and S=False. Ns: Num. samples in which R=True, T=True and S=False. N: Number of random samplings Now if N is big enough: Nc/N is a good estimate of P(T=True and S=False). Ns/N is a good estimate of P(R=True, T=True, S=False). P(R | T^~S) = P(R ^ T ^ ~S)/P(T ^ ~S), so Ns/Nc can be a good estimate of P(R | T^~S).
General Stochastic Simulation Someone wants to know P(E1 | E2) We'll do lots of random samplings and count the number of occurrences of the following: Nc: Num. samples in which E2 holds Ns: Num. samples in which E1 and E2 hold N: Number of random samplings Now if N is big enough: Nc/N is a good estimate of P(E2). Ns/N is a good estimate of P(E1, E2). P(E1 | E2) = P(E1 ^ E2)/P(E2), so Ns/Nc can be a good estimate of P(E1 | E2).
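A sketch of the whole procedure for the example query P(R | T ^ ~S): ancestral sampling from the net (CPT values as above) followed by the Ns/Nc estimate:

```python
import random

# Ancestral sampling from the example net, then a rejection-sampling estimate of P(R | T ^ ~S)
P_S, P_M = 0.3, 0.6
P_L = {(True, True): 0.05, (True, False): 0.1, (False, True): 0.1, (False, False): 0.2}
P_R = {True: 0.3, False: 0.6}
P_T = {True: 0.3, False: 0.8}

def sample():
    s = random.random() < P_S
    m = random.random() < P_M
    l = random.random() < P_L[(m, s)]          # parents of L are M and S
    r = random.random() < P_R[m]               # parent of R is M
    t = random.random() < P_T[l]               # parent of T is L
    return s, m, l, r, t

n_c = n_s = 0
for _ in range(200_000):
    s, m, l, r, t = sample()
    if t and not s:                            # sample matches the evidence T ^ ~S
        n_c += 1
        if r:                                  # ...and also the query R
            n_s += 1
print(n_s / n_c)                               # estimate of P(R | T ^ ~S)
```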
Likelihood weighting Problem with Stochastic Sampling: With lots of constraints in E, or unlikely events in E, most of the simulations will be thrown away (they'll have no effect on Nc or Ns). Imagine we're part way through our simulation. In E2 we have the constraint Xi = v. We're just about to generate a value for Xi at random. Given the values assigned to the parents, we see that P(Xi = v | parents) = p. Now we know that with stochastic sampling: we'll generate "Xi = v" a proportion p of the time, and proceed. And we'll generate a different value a proportion 1-p of the time, and the simulation will be wasted. Instead, always generate Xi = v, but weight the answer by weight "p" to compensate.
Likelihood weighting Set Nc := 0, Ns := 0 1. Generate a random assignment of all variables that matches E2. This process returns a weight w. Define w to be the probability that this assignment would have been generated instead of an unmatching assignment during its generation in the original algorithm. Fact: w is the product of all likelihood factors involved in the generation. 2. Nc := Nc + w 3. If our sample matches E1 then Ns := Ns + w 4. Go to 1 Again, Ns/Nc estimates P(E1 | E2)
Case Study I Pathfinder system.  (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA). Diagnostic system for lymph-node diseases. 60 diseases and 100 symptoms and test-results. 14,000 probabilities Expert consulted to make net. 8 hours to determine variables. 35 hours for net topology. 40 hours for probability table values. Apparently, the experts found it quite easy to invent the causal links and probabilities. Pathfinder is now outperforming the world experts in diagnosis.  Being extended to several dozen other medical domains.
Questions What are the strengths of probabilistic networks compared with propositional logic? What are the weaknesses of probabilistic networks compared with propositional logic? What are the strengths of probabilistic networks compared with predicate logic? What are the weaknesses of probabilistic networks compared with predicate logic? (How) could predicate logic and probabilistic networks be combined?
What you should know The meanings and importance of independence and conditional independence. The definition of a Bayes net. Computing probabilities of assignments of variables (i.e. members of the joint p.d.f.) with a Bayes net. The slow (exponential) method for computing arbitrary, conditional probabilities. The stochastic simulation method and likelihood weighting.
Content What is classification? Decision tree Naïve Bayesian Classifier Bayesian Networks Neural Networks Support Vector Machines (SVM)
Neural Networks Assume each record X is a vector of length n.  There are two classes which are valued 1 and -1. E.g. class 1 means “buy computer”, and class -1 means “no buy”. We can incrementally change weights to learn to produce these outputs using the  perceptron learning rule.
A Neuron The n-dimensional input vector x is mapped into variable y by means of a scalar product and a nonlinear function mapping: the neuron computes the weighted sum of the input vector x with the weight vector w, combines it with a bias term, and passes the result through the activation function f to produce the output y.
Multi-Layer Perceptron The input vector xi feeds the input nodes, which connect through weights wij to hidden nodes and then to output nodes that produce the output vector.
Network Training The ultimate objective of training  obtain a set of weights that makes almost all the tuples in the training data classified correctly  Steps Initialize weights with random values  Feed the input tuples into the network one by one For each unit Compute the net input to the unit as a linear combination of all the inputs to the unit Compute the output value using the activation function Compute the error Update the weights and the bias
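The steps above, instantiated for a single sigmoid unit on a toy AND data set (a minimal sketch; the data, learning rate and epoch count are made up for illustration):

```python
import math, random

# A minimal sketch of the training loop for one sigmoid unit (toy AND data, made up here)
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w = [random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)]   # initialize weights randomly
bias = random.uniform(-0.5, 0.5)
lr = 0.5

def output(x):
    net = sum(wi * xi for wi, xi in zip(w, x)) + bias        # net input: weighted sum + bias
    return 1 / (1 + math.exp(-net))                          # activation function (sigmoid)

for epoch in range(2000):
    for x, target in data:                                   # feed the tuples in one by one
        out = output(x)
        err = (target - out) * out * (1 - out)               # error term for the output unit
        w = [wi + lr * err * xi for wi, xi in zip(w, x)]     # update the weights
        bias += lr * err                                     # update the bias

print([round(output(x), 2) for x, _ in data])                # close to [0, 0, 0, 1] after training
```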
Network Pruning and Rule Extraction Network pruning Fully connected network will be hard to articulate N  input nodes,  h  hidden nodes and  m  output nodes lead to  h(m+N)  weights Pruning: Remove some of the links without affecting classification accuracy of the network Extracting rules from a trained network Discretize activation values; replace individual activation value by the cluster average maintaining the network accuracy Enumerate the output from the discretized activation values to find rules between activation value and output Find the relationship between the input and activation value  Combine the above two to have rules relating the output to input
Content What is classification? Decision tree Naïve Bayesian Classifier Bayesian Networks Neural Networks Support Vector Machines (SVM)
Linear Support Vector Machines Goal: find a plane that divides the two sets of points: one class of points has value -1 (e.g. does not buy computer), the other has value 1 (e.g. buys computer). Also, need to maximize the margin.
Linear Support Vector Machines Support Vectors Small Margin Large Margin
Linear Support Vector Machines The hyperplane is defined by vector  w  and value  b . For any point  x  where value( x )=-1,  w·x  + b  ≤ -1. For any point  x  where value( x )=1,  w·x  + b  ≥ 1. Here let  x  = ( x h  , x v ) w =( w h  , w v ), we have: w·x = w h  x h  +  w v  x v  w·x  + b  = -1 w·x  + b  = 1
Linear Support Vector Machines For instance, the hyperplane is defined by vector w = (1, 1/√3) and value b = -11. Some examples: x = (10, 0) → w·x + b = -1 x = (0, 12√3) → w·x + b = 1 x = (0, 0) → w·x + b = -11 A little more detail: w is perpendicular to the minus plane; the margin M = 2/|w|, here M = √3. (Figure: the planes w·x + b = -1 and w·x + b = 1 cross the horizontal axis at 10 and 12, the separating plane at 11, at a 60° angle.)
Linear Support Vector Machines Each record  x i  in training data has a vector and a class, There are two classes: 1 and -1, E.g. class 1 = “buy computer”, class -1 = “not buy”. E.g. a record (age, salary) = (45, 50k), class = 1. Read the training data, find the hyperplane defined by  w  and  b . Should maximize margin, minimize error. Can be done by Quadratic Programming Technique. Predict the class of a new data  x  by looking at the sign of  w·x  + b.  Positive: class 1. Negative: class -1.
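The prediction step is just the sign of w·x + b; a minimal sketch using the example hyperplane from the earlier slide (w = (1, 1/√3), b = -11):

```python
import math

# Prediction side of a linear SVM: the sign of w.x + b
w = (1.0, 1.0 / math.sqrt(3))
b = -11.0

def predict(x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

print(predict((10, 0)))    # -1, lies on the minus plane
print(predict((12, 5)))    #  1, well on the plus side
```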
SVM – Cont. What if the data is not linearly separable? Project the data to a high dimensional space where it is linearly separable and then we can use linear SVM (using kernels). (Figure: a 1-D data set with points at -1, 0, +1 that no single threshold can separate becomes linearly separable after mapping each point to a 2-D point such as (1,0), (0,0), (0,1).)
Non-Linear SVM Classification using SVM (w, b): check the sign of w·x + b. In the non-linear case we can see this as sign(Σi αi yi K(xi, x) + b). The kernel K(xi, x) can be thought of as doing a dot product in some high dimensional space.
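A minimal non-linear SVM sketch using scikit-learn (the library choice is ours, not the slides'): an RBF kernel separates XOR-style data that no single hyperplane can split.

```python
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, 1, 1, -1]                      # XOR labelling: not linearly separable in 2-D

clf = SVC(kernel="rbf", gamma=2.0, C=10.0)   # RBF kernel does the implicit high-dim mapping
clf.fit(X, y)
print(clf.predict([[0, 1], [1, 1]]))    # expected: [ 1 -1 ]
```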
Example of Non-linear SVM
Results
SVM vs. Neural Network SVM Relatively new concept Nice Generalization properties Hard to learn – learned in batch mode using quadratic programming techniques Using kernels can learn very complex functions Neural Network Quite Old Generalizes well but doesn’t have strong mathematical foundation Can easily be learned in incremental fashion To learn complex functions – use multilayer perceptron (not that trivial)
SVM Related Links http://svm.dcs.rhbnc.ac.uk/ http://www.kernel-machines.org/ C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998. SVMlight – Software (in C): http://ais.gmd.de/~thorsten/svm_light BOOK: An Introduction to Support Vector Machines, N. Cristianini and J. Shawe-Taylor, Cambridge University Press
Content What is classification? Decision tree Naïve Bayesian Classifier Bayesian Networks Neural Networks Support Vector Machines (SVM) Bagging and Boosting
Bagging and Boosting General idea: the training data is altered (re-sampled or re-weighted) to produce several training sets; the same classification method (CM) is applied to each, giving classifiers C1, C2, …; their predictions are aggregated into a combined classifier C*.
Bagging  Given a set S of s samples  Generate a bootstrap sample T from S. Cases in S may not appear in T or may appear more than once.  Repeat this sampling procedure, getting a sequence of k independent training sets A corresponding sequence of classifiers C1,C2,…,Ck is constructed for each of these training sets, by using the same classification algorithm  To classify an unknown sample X,let each classifier predict or vote  The Bagged Classifier C* counts the votes and assigns X to the class with the “most” votes
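A sketch of the bagging loop; the base learner here is a deliberately simple 1-nearest-neighbour stub just to keep the example self-contained, and the toy data set is made up:

```python
import random
from collections import Counter

# Bagging: k bootstrap samples, one classifier per sample, majority vote at prediction time.
class OneNN:
    def fit(self, samples):              # samples: list of (x, label) with numeric x
        self.samples = samples
    def predict(self, x):
        return min(self.samples, key=lambda s: abs(s[0] - x))[1]

def bagging(train, k=11):
    classifiers = []
    for _ in range(k):
        bootstrap = [random.choice(train) for _ in range(len(train))]  # sample with replacement
        clf = OneNN()
        clf.fit(bootstrap)
        classifiers.append(clf)
    return classifiers

def bagged_predict(classifiers, x):
    votes = Counter(clf.predict(x) for clf in classifiers)             # each classifier votes
    return votes.most_common(1)[0][0]                                  # class with most votes

train = [(1, "no"), (2, "no"), (3, "no"), (7, "yes"), (8, "yes"), (9, "yes")]
models = bagging(train)
print(bagged_predict(models, 2.5), bagged_predict(models, 8.5))        # likely: no yes
```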
Boosting Technique — Algorithm Assign every example an equal weight  1/N For t = 1, 2, …, T Do  Obtain a hypothesis (classifier) h (t)  under w (t) Calculate the error of  h(t)  and re-weight the examples based on the error . Each classifier is dependent on the previous ones. Samples that are incorrectly predicted are weighted more heavily Normalize w (t+1)  to sum to 1 (weights assigned to different classifiers sum to 1) Output a weighted sum of all the hypothesis, with each hypothesis weighted according to its accuracy on the training set
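The slide leaves the exact re-weighting formula open; the sketch below fills it in with the usual AdaBoost choices (decision stumps on made-up 1-D data):

```python
import math

# An AdaBoost-style instantiation of the generic boosting loop above.
# Weak learner: a decision stump h(x) = sign if x > threshold else -sign.
data = [(1, -1), (2, -1), (3, 1), (4, -1), (5, 1), (6, 1)]    # (x, class in {-1, +1})
n = len(data)
w = [1.0 / n] * n                                             # every example starts with weight 1/N
ensemble = []                                                 # (alpha, threshold, sign) triples

def stump(x, thr, sign):
    return sign if x > thr else -sign

for t in range(5):
    # choose the stump with the smallest weighted error under the current weights w
    err, thr, sign = min(
        (sum(wi for wi, (x, y) in zip(w, data) if stump(x, thr, sign) != y), thr, sign)
        for thr in range(7) for sign in (1, -1))
    err = min(max(err, 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - err) / err)                   # weight of this classifier
    ensemble.append((alpha, thr, sign))
    w = [wi * math.exp(-alpha * y * stump(x, thr, sign)) for wi, (x, y) in zip(w, data)]
    total = sum(w)
    w = [wi / total for wi in w]                              # normalise example weights to sum to 1

def predict(x):                                               # weighted vote of all stumps
    return 1 if sum(a * stump(x, thr, s) for a, thr, s in ensemble) >= 0 else -1

print([predict(x) for x, _ in data])                          # ensemble predictions on training points
```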
Summary Classification is an  extensively studied  problem (mainly in statistics, machine learning & neural networks) Classification is probably one of the most  widely used  data mining techniques with a lot of extensions Scalability  is still an important issue for database applications:  thus combining classification  with database techniques  should be a promising topic Research directions: classification of   non-relational data , e.g., text, spatial, multimedia, etc..

My7class

  • 1. Classification ŠJiawei Han and Micheline Kamber http://www-sal.cs.uiuc.edu/~hanj/bk2/ Chp 6 modified by Donghui Zhang Integrated with slides from Prof. Andrew W. Moore http:// www.cs.cmu.edu/~awm/tutorials
  • 2. Content What is classification? Decision tree NaĂŻve Bayesian Classifier Baysian Networks Neural Networks Support Vector Machines (SVM)
  • 3. Classification: models categorical class labels (discrete or nominal) e.g. given a new customer, does she belong to the “likely to buy a computer” class? Prediction: models continuous-valued functions e.g. how many computers will a customer buy? Typical Applications credit approval target marketing medical diagnosis treatment effectiveness analysis Classification vs. Prediction
  • 4. Classification—A Two-Step Process Model construction : describing a set of predetermined classes Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute The set of tuples used for model construction is training set The model is represented as classification rules, decision trees, or mathematical formulae Model usage : for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy rate is the percentage of test set samples that are correctly classified by the model Test set is independent of training set, otherwise over-fitting will occur If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
  • 5. Classification Process (1): Model Construction Classification Algorithms IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’ Training Data Classifier (Model)
  • 6. Classification Process (2): Use the Model in Prediction (Jeff, Professor, 4) Tenured? Classifier Testing Data Unseen Data
  • 7. Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations New data is classified based on the training set Unsupervised learning (clustering) The class labels of training data is unknown Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data
  • 8. Evaluating Classification Methods Predictive accuracy Speed and scalability time to construct the model time to use the model Robustness handling noise and missing values Scalability efficiency in disk-resident databases Interpretability: understanding and insight provided by the model Goodness of rules decision tree size compactness of classification rules
  • 9. Content What is classification? Decision tree NaĂŻve Bayesian Classifier Baysian Networks Neural Networks Support Vector Machines (SVM)
  • 10. Training Dataset This follows an example from Quinlan’s ID3
  • 11. Output: A Decision Tree for “ buys_computer” age? overcast student? credit rating? no yes fair excellent <=30 >40 no no yes yes yes 30..40
  • 12. Extracting Classification Rules from Trees Represent the knowledge in the form of IF-THEN rules One rule is created for each path from the root to a leaf Each attribute-value pair along a path forms a conjunction The leaf node holds the class prediction Rules are easier for humans to understand Example IF age = “<=30” AND student = “ no ” THEN buys_computer = “ no ” IF age = “<=30” AND student = “ yes ” THEN buys_computer = “ yes ” IF age = “31…40” THEN buys_computer = “ yes ” IF age = “>40” AND credit_rating = “ excellent ” THEN buys_computer = “ yes ” IF age = “<=30” AND credit_rating = “ fair ” THEN buys_computer = “ no ”
  • 13. Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm) Tree is constructed in a top-down recursive divide-and-conquer manner At start, all the training examples are at the root Attributes are categorical (if continuous-valued, they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain ) Conditions for stopping partitioning All samples for a given node belong to the same class There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf There are no samples left
  • 14. Information gain slides adapted from Andrew W. Moore Associate Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm [email_address] 412-268-7599 Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
  • 15. Bits You are watching a set of independent random samples of X You see that X has four possible values So you might see: BAACBADCDADDDA… You transmit data over a binary serial link. You can encode each reading with two bits (e.g. A = 00, B = 01, C = 10, D = 11) 0100001001001110110011111100… P(X=C) = 1/4 P(X=B) = 1/4 P(X=D) = 1/4 P(X=A) = 1/4
  • 16. Fewer Bits Someone tells you that the probabilities are not equal It’s possible… … to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How? P(X=C) = 1/8 P(X=B) = 1/4 P(X=D) = 1/8 P(X=A) = 1/2
  • 17. Fewer Bits Someone tells you that the probabilities are not equal: P(X=A) = 1/2, P(X=B) = 1/4, P(X=C) = 1/8, P(X=D) = 1/8. It's possible to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How? One such code (just one of several ways): A → 0, B → 10, C → 110, D → 111.
  • 18. Fewer Bits Suppose there are three equally likely values: P(X=A) = 1/3, P(X=B) = 1/3, P(X=C) = 1/3. Here's a naïve coding, costing 2 bits per symbol: A → 00, B → 01, C → 10. Can you think of a coding that would need only 1.6 bits per symbol on average? In theory, it can in fact be done with 1.58496 bits per symbol.
  • 19. General Case Suppose X can have one of m values V1, V2, … Vm, with P(X=V1) = p1, P(X=V2) = p2, …, P(X=Vm) = pm. What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It's the entropy of X: H(X) = -p1 log2 p1 - p2 log2 p2 - … - pm log2 pm = -Σj pj log2 pj. "High entropy" means X is from a uniform (boring) distribution: a histogram of the frequency distribution of values of X would be flat, and the values sampled from it would be all over the place. "Low entropy" means X is from a varied (peaks and valleys) distribution: a histogram would have many lows and one or two highs, and the values sampled from it would be more predictable.
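A small helper (not part of the slides) that implements this definition and reproduces the coding examples above:

```python
import math

def entropy(probs):
    """H(X) = -sum p * log2(p), skipping zero-probability values."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))    # 2.0 bits: four equally likely values
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits: the skewed four-value example
print(entropy([1/3, 1/3, 1/3]))             # ~1.585 bits: three equally likely values
```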
  • 22. Entropy in a nutshell Low entropy: the values (locations of soup) are sampled entirely from within the soup bowl. High entropy: the values (locations of soup) are unpredictable, almost uniformly sampled throughout our dining room.
  • 24. Exercise: Suppose 100 customers have two classes: “ Buy Computer” and “Not Buy Computer”. Uniform distribution: 50 buy. Entropy? Skewed distribution: 100 buy. Entropy?
  • 25. Specific Conditional Entropy Suppose I'm trying to predict output Y and I have input X. X = College Major, Y = Likes "Gladiator". The records (X, Y) are: (Math, Yes), (History, No), (CS, Yes), (Math, No), (Math, No), (CS, Yes), (History, No), (Math, Yes). Let's assume this reflects the true probabilities. E.g., from this data we estimate P(LikeG = Yes) = 0.5, P(Major = Math & LikeG = No) = 0.25, P(Major = Math) = 0.5, P(LikeG = Yes | Major = History) = 0. Note: H(X) = 1.5, H(Y) = 1.
  • 26. Specific Conditional Entropy Definition of Specific Conditional Entropy: H(Y | X=v) = the entropy of Y among only those records in which X has value v. Example (same data as above): H(Y | X=Math) = 1, H(Y | X=History) = 0, H(Y | X=CS) = 0.
  • 28. Conditional Entropy Definition of Conditional Entropy: H(Y | X) = the average specific conditional entropy of Y = if you choose a record at random, the expected conditional entropy of Y conditioned on that row's value of X = the expected number of bits needed to transmit Y if both sides will know the value of X = Σj Prob(X=vj) H(Y | X=vj). Example (same data as above): for vj = Math, Prob(X=vj) = 0.5 and H(Y | X=vj) = 1; for History, 0.25 and 0; for CS, 0.25 and 0. So H(Y | X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5.
  • 30. Information Gain Definition of Information Gain: IG(Y | X) = I must transmit Y; how many bits on average would it save me if both ends of the line knew X? IG(Y | X) = H(Y) - H(Y | X). Example (same data as above): H(Y) = 1, H(Y | X) = 0.5, thus IG(Y | X) = 1 - 0.5 = 0.5.
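A short sketch (not from the slides) that computes H(Y), H(Y | X) and IG(Y | X) directly from the College-Major / Gladiator records listed above:

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy of the empirical distribution over the given labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG(Y | X) = H(Y) - sum over v of P(X=v) * H(Y | X=v)."""
    n = len(xs)
    h_y_given_x = 0.0
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        h_y_given_x += (len(subset) / n) * entropy_of(subset)
    return entropy_of(ys) - h_y_given_x

majors = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
likes  = ["Yes",  "No",      "Yes", "No",   "No",   "Yes", "No",      "Yes"]
print(information_gain(majors, likes))   # 0.5, matching the worked example
```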
  • 31. What is Information Gain used for? Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find: IG(LongLife | HairColor) = 0.01, IG(LongLife | Smoker) = 0.2, IG(LongLife | Gender) = 0.25, IG(LongLife | LastDigitOfSSN) = 0.00001. IG tells you how interesting a 2-d contingency table is going to be.
  • 32. Conditional entropy H(C|age) H(C|age<=30) = 2/5 * lg(5/2) + 3/5 * lg(5/3) = 0.971 H(C|age in 30..40) = 1 * lg 1 + 0 * lg 1/0 = 0 H(C|age>40) = 3/5 * lg(5/3) + 2/5 * lg(5/2) = 0.971
  • 33. Select the attribute with lowest conditional entropy H(C|age) = 0.694, H(C|income) = 0.911, H(C|student) = 0.789, H(C|credit_rating) = 0.892. Select "age" to be the tree root! The resulting tree splits on age at the root (branches <=30, 30..40, >40), then tests student under the <=30 branch and credit rating under the >40 branch, while the 30..40 branch is a pure "yes" leaf.
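A sketch of this selection step (not from the slides), reusing entropy_of from the information-gain sketch above. The rows, attribute names, and target name are placeholders for whatever columns the training table has; on the buys_computer table this selection would return "age", matching the slide.

```python
def conditional_entropy(xs, ys):
    """H(Y|X) = sum over values v of P(X=v) * H(Y | X=v)."""
    n = len(xs)
    total = 0.0
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        total += (len(subset) / n) * entropy_of(subset)
    return total

def best_split_attribute(rows, attributes, target):
    """Pick the attribute with the lowest conditional entropy of the class (highest gain).
    rows: list of dicts; attributes: candidate column names; target: class column name."""
    ys = [r[target] for r in rows]
    return min(attributes, key=lambda a: conditional_entropy([r[a] for r in rows], ys))
```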
  • 34. Goodness in Decision Tree Induction relatively faster learning speed (than other classification methods) convertible to simple and easy to understand classification rules can use SQL queries for accessing databases comparable classification accuracy with other methods
  • 35. Scalable Decision Tree Induction Methods in Data Mining Studies SLIQ (EDBT’96 — Mehta et al.) builds an index for each attribute and only class list and the current attribute list reside in memory SPRINT (VLDB’96 — J. Shafer et al.) constructs an attribute list data structure PUBLIC (VLDB’98 — Rastogi & Shim) integrates tree splitting and tree pruning: stop growing the tree earlier RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti) separates the scalability aspects from the criteria that determine the quality of the tree builds an AVC-list (attribute, value, class label)
  • 36. Visualization of a Decision Tree in SGI/MineSet 3.0
  • 37. Content What is classification? Decision tree Naïve Bayesian Classifier Bayesian Networks Neural Networks Support Vector Machines (SVM)
  • 38. Bayesian Classification: Why? Probabilistic learning : Calculate explicit probabilities for hypothesis, among the most practical approaches to certain types of learning problems Incremental : Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data. Probabilistic prediction : Predict multiple hypotheses, weighted by their probabilities Standard : Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
  • 39. Bayesian Classification X: a data sample whose class label is unknown, e.g. X = (Income=medium, Credit_rating=Fair, Age=40). Hi: a hypothesis that a record belongs to class Ci, e.g. Hi = a record belongs to the "buy computer" class. P(Hi), P(X): probabilities. P(Hi|X): a conditional probability: among all records with medium income and fair credit rating, what's the probability of buying a computer? This is what we need for classification! Given X, P(Hi|X) tells us the possibility that it belongs to each class. What if we need to determine a single class for X?
  • 40. Bayesian Theorem Another concept, P(X|Hi): the probability of observing the sample X given that the hypothesis holds, e.g. among all people who buy a computer, what percentage has the same values as X. We know P(X ^ Hi) = P(Hi|X) P(X) = P(X|Hi) P(Hi), so P(Hi|X) = P(X|Hi) P(Hi) / P(X). We should assign X to the class Ci where P(Hi|X) is maximized; since P(X) is the same for every class, this is equivalent to maximizing P(X|Hi) P(Hi).
  • 41. Basic Idea Read the training data. Compute P(Hi) for each class. Compute P(Xk|Hi) for each distinct instance Xk of X among records in class Ci. To predict the class for a new data sample X, compute P(X|Hi) P(Hi) for each class and return the class with the largest value. Any problem? Too many combinations of Xk: e.g. 50 ages, 50 credit ratings, 50 income levels → 125,000 combinations of Xk!
  • 42. Naïve Bayes Classifier A simplified assumption: attributes are conditionally independent given the class. The probability of observing, say, two attribute values y1 and y2 together, given the current class C, is the product of the probabilities of each value taken separately given the same class: P([y1, y2] | C) = P(y1 | C) * P(y2 | C). No dependence relation between attributes is modeled, which greatly reduces the number of probabilities to maintain.
  • 43. Sample quiz questions 1. What data does the naïve Bayesian classifier maintain? 2. Given X = (age<=30, Income=medium, Student=yes, Credit_rating=Fair), buy or not buy?
  • 44. Naïve Bayesian Classifier: Example Compute P(X|Ci) for each class: P(age="<30" | buys_computer="yes") = 2/9 = 0.222; P(age="<30" | buys_computer="no") = 3/5 = 0.6; P(income="medium" | buys_computer="yes") = 4/9 = 0.444; P(income="medium" | buys_computer="no") = 2/5 = 0.4; P(student="yes" | buys_computer="yes") = 6/9 = 0.667; P(student="yes" | buys_computer="no") = 1/5 = 0.2; P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667; P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4. For X = (age<=30, income=medium, student=yes, credit_rating=fair): P(X|buys_computer="yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044; P(X|buys_computer="no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019. P(X|Ci)*P(Ci): P(X|buys_computer="yes") * P(buys_computer="yes") = 0.028; P(X|buys_computer="no") * P(buys_computer="no") = 0.007. X belongs to class "buys_computer=yes". Pitfall: don't forget P(Ci).
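A minimal sketch of a naïve Bayesian classifier over categorical attributes (not from the slides). It uses plain relative-frequency estimates with no smoothing; records are assumed to be dicts of attribute → value, and the function names are mine.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """records: list of dicts {attribute: value}; labels: parallel list of class labels."""
    n = len(labels)
    priors = Counter(labels)                      # class counts, used for P(Ci)
    counts = defaultdict(Counter)                 # (class, attribute) -> Counter over values
    for rec, c in zip(records, labels):
        for a, v in rec.items():
            counts[(c, a)][v] += 1

    def score(x, c):
        p = priors[c] / n                         # P(Ci)
        for a, v in x.items():                    # P(X|Ci) under conditional independence
            p *= counts[(c, a)][v] / priors[c]    # relative frequency, no smoothing
        return p

    def predict(x):
        return max(priors, key=lambda c: score(x, c))

    return predict, score
```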
  • 45. NaĂŻve Bayesian Classifier: Comments Advantages : Easy to implement Good results obtained in most of the cases Disadvantages Assumption: class conditional independence , therefore loss of accuracy Practically, dependencies exist among variables E.g., hospitals: patients: Profile: age, family history etc Symptoms: fever, cough etc., Disease: lung cancer, diabetes etc Dependencies among these cannot be modeled by NaĂŻve Bayesian Classifier How to deal with these dependencies? Bayesian Belief Networks
  • 46. Content What is classification? Decision tree Naïve Bayesian Classifier Bayesian Networks Neural Networks Support Vector Machines (SVM)
  • 47. Bayesian Networks slides adapted from Andrew W. Moore Associate Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm [email_address] 412-268-7599 Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
  • 48. What we’ll discuss Recall the numerous and dramatic benefits of Joint Distributions for describing uncertain worlds Reel with terror at the problem with using Joint Distributions Discover how Bayes Net methodology allows us to build Joint Distributions in manageable chunks Discover there’s still a lurking problem… … Start to solve that problem
  • 49. Why this matters In Andrew's opinion, the most important technology in the Machine Learning / AI field to have emerged in the last 10 years. A clean, clear, manageable language and methodology for expressing what you're certain and uncertain about. Already, many practical applications in medicine, factories, helpdesks: Inference, P(this problem | these symptoms); Anomaly Detection, the anomalousness of this observation; Active Data Collection, choosing the next diagnostic test given these observations.
  • 50. Ways to deal with Uncertainty Three-valued logic: True / False / Maybe. Fuzzy logic (truth values between 0 and 1). Non-monotonic reasoning (especially focused on Penguin informatics). Dempster-Shafer theory (and an extension known as quasi-Bayesian theory). Possibilistic logic. Probability.
  • 51. Discrete Random Variables A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs. Examples A = The US president in 2023 will be male A = You wake up tomorrow with a headache A = You have Ebola
  • 52. Probabilities We write P(A) as “the fraction of possible worlds in which A is true” We could at this point spend 2 hours on the philosophy of this. But we won’t.
  • 53. Visualizing A Event space of all possible worlds Its area is 1 Worlds in which A is False Worlds in which A is true P(A) = Area of reddish oval
  • 54. Interpreting the axioms 0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B) The area of A can’t get any smaller than 0 And a zero area would mean no world could ever have A true
  • 55. Interpreting the axioms 0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B) The area of A can’t get any bigger than 1 And an area of 1 would mean all worlds will have A true
  • 56. Interpreting the axioms 0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P( A or B ) = P( A ) + P( B ) - P( A and B ) A B
  • 57. Interpreting the axioms 0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P( A or B ) = P( A ) + P( B ) - P( A and B ) P(A or B) B P(A and B) Simple addition and subtraction A B
  • 58. These Axioms are Not to be Trifled With There have been attempts to do different methodologies for uncertainty Fuzzy Logic Three-valued logic Dempster-Shafer Non-monotonic reasoning But the axioms of probability are the only system with this property: If you gamble using them you can’t be unfairly exploited by an opponent using some other system [di Finetti 1931]
  • 59. Theorems from the Axioms 0 <= P(A) <= 1, P(True) = 1, P(False) = 0 P( A or B ) = P( A ) + P( B ) - P( A and B ) From these we can prove: P(not A) = P(~A) = 1-P(A) How?
  • 60. Another important theorem 0 <= P(A) <= 1, P(True) = 1, P(False) = 0 P( A or B ) = P( A ) + P( B ) - P( A and B ) From these we can prove: P(A) = P(A ^ B) + P(A ^ ~B) How?
  • 61. Conditional Probability P(A|B) = Fraction of worlds in which B is true that also have A true F H H = “Have a headache” F = “Coming down with Flu” P(H) = 1/10 P(F) = 1/40 P(H|F) = 1/2 “ Headaches are rare and flu is rarer, but if you’re coming down with ‘flu there’s a 50-50 chance you’ll have a headache.”
  • 62. Conditional Probability H = “Have a headache” F = “Coming down with Flu” P(H) = 1/10 P(F) = 1/40 P(H|F) = 1/2 P(H|F) = Fraction of flu-inflicted worlds in which you have a headache = #worlds with flu and headache ------------------------------------ #worlds with flu = Area of “H and F” region ------------------------------ Area of “F” region = P(H ^ F) ----------- P(F) F H
  • 63. Definition of Conditional Probability P(A|B) = P(A ^ B) / P(B). Corollary (the Chain Rule): P(A ^ B) = P(A|B) P(B).
  • 64. Bayes Rule P(B|A) = P(A ^ B) / P(A) = P(A|B) P(B) / P(A). This is Bayes Rule. Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
  • 65. Using Bayes Rule to Gamble The “Win” envelope has a dollar and four beads in it The “Lose” envelope has three beads and no money Trivial question: someone draws an envelope at random and offers to sell it to you. How much should you pay? R R B B R B B $1.00
  • 66. Using Bayes Rule to Gamble The “Win” envelope has a dollar and four beads in it The “Lose” envelope has three beads and no money Interesting question: before deciding, you are allowed to see one bead drawn from the envelope. Suppose it’s black: How much should you pay? Suppose it’s red: How much should you pay? $1.00
  • 67. Another Example Your friend told you that she has two children (not twins). You are on your trip to visit them. Probability that both children are male? You knocked on the door; a male child answered the door. Probability that both children are male? He smiled and said "I'm the big child!" Probability that both children are male? Let B = the big child is male, S = the small child is male, and X = both children are male. The three answers are P(X), P(X | B ∨ S), and P(X | B).
  • 68. Multivalued Random Variables Suppose A can take on more than 2 values A is a random variable with arity k if it can take on exactly one value out of {v 1 ,v 2 , .. v k } Thus…
  • 69. An easy fact about Multivalued Random Variables: Using the axioms of probability (0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B)), and assuming that A obeys P(A=vi ^ A=vj) = 0 for i ≠ j and P(A=v1 or A=v2 or … or A=vk) = 1, it's easy to prove that P(A=v1 or A=v2 or … or A=vi) = Σj=1..i P(A=vj), and thus we can prove Σj=1..k P(A=vj) = 1.
  • 71. Another fact about Multivalued Random Variables: Using the same axioms and assumptions, it's easy to prove that P(B ^ (A=v1 or A=v2 or … or A=vi)) = Σj=1..i P(B ^ A=vj), and thus we can prove P(B) = Σj=1..k P(B ^ A=vj).
  • 73. More General Forms of Bayes Rule P(A|B) = P(B|A) P(A) / (P(B|A) P(A) + P(B|~A) P(~A)); conditioning on additional evidence X, P(A|B ^ X) = P(B|A ^ X) P(A|X) / P(B|X); and for a multivalued A with values v1 … vk, P(A=vi | B) = P(B | A=vi) P(A=vi) / Σj=1..k P(B | A=vj) P(A=vj).
  • 76. From Probability to Bayesian Net Suppose there are some diseases, symptoms and related facts. E.g. flu, headache, fever, lung cancer, smoker. We are interested to know some (conditional or unconditional) probabilities such as P( flu ) P( lung cancer | smoker ) P( flu | headache ^ ~fever ) What shall we do?
  • 77. The Joint Distribution Recipe for making a joint distribution of M variables: 1) Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows). 2) For each combination of values, say how probable it is. 3) If you subscribe to the axioms of probability, those numbers must sum to 1. Example (Boolean variables A, B, C):
A B C | Prob
0 0 0 | 0.30
0 0 1 | 0.05
0 1 0 | 0.10
0 1 1 | 0.05
1 0 0 | 0.05
1 0 1 | 0.10
1 1 0 | 0.25
1 1 1 | 0.10
  • 81. Using the Joint Once you have the JD you can ask for the probability of any logical expression E involving your attributes: P(E) is the sum of P(row) over all rows in which E is true.
  • 82. Using the Joint P(Poor Male) = 0.4654
  • 83. Using the Joint P(Poor) = 0.7604
  • 85. Inference with the Joint P( Male | Poor ) = 0.4654 / 0.7604 = 0.612
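The P(Poor), P(Poor Male) and P(Male | Poor) numbers on these slides come from a larger joint over census-style attributes that did not survive the export. As a small illustration (not from the slides), here is how such queries can be answered from the Boolean A, B, C joint tabulated above, by summing matching rows:

```python
# Joint distribution over the Boolean variables A, B, C from the table above.
joint_abc = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """P(E): sum the probabilities of all rows for which the predicate E holds."""
    return sum(p for row, p in joint_abc.items() if event(*row))

def cond_prob(event, given):
    """P(E | G) = P(E ^ G) / P(G), computed the same way the slide computes P(Male | Poor)."""
    return prob(lambda a, b, c: event(a, b, c) and given(a, b, c)) / prob(given)

print(prob(lambda a, b, c: a == 1 and b == 0))                     # P(A ^ ~B) = 0.15
print(cond_prob(lambda a, b, c: a == 1, lambda a, b, c: c == 1))   # P(A | C) = 0.20 / 0.30
```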
  • 86. Joint distributions Good news Once you have a joint distribution, you can ask important questions about stuff that involves a lot of uncertainty Bad news Impossible to create for more than about ten attributes because there are so many numbers needed when you build the damn thing.
  • 87. Using fewer numbers Suppose there are two events: M: Manuela teaches the class (otherwise it’s Andrew) S: It is sunny The joint p.d.f. for these events contain four entries. If we want to build the joint p.d.f. we’ll have to invent those four numbers. OR WILL WE?? We don’t have to specify with bottom level conjunctive events such as P(~M^S) IF… … instead it may sometimes be more convenient for us to specify things like: P(M), P(S). But just P(M) and P(S) don’t derive the joint distribution. So you can’t answer all questions.
  • 88. Using fewer numbers Suppose there are two events: M: Manuela teaches the class (otherwise it’s Andrew) S: It is sunny The joint p.d.f. for these events contain four entries. If we want to build the joint p.d.f. we’ll have to invent those four numbers. OR WILL WE?? We don’t have to specify with bottom level conjunctive events such as P(~M^S) IF… … instead it may sometimes be more convenient for us to specify things like: P(M), P(S). But just P(M) and P(S) don’t derive the joint distribution. So you can’t answer all questions. What extra assumption can you make?
  • 89. Independence "The sunshine levels do not depend on and do not influence who is teaching." This can be specified very simply: P(S | M) = P(S). This is a powerful statement! It required extra domain knowledge, a different kind of knowledge than numerical probabilities. It needed an understanding of causation.
  • 90. Independence From P(S | M) = P(S), the rules of probability imply (can you prove these?): P(~S | M) = P(~S), P(M | S) = P(M), P(M ^ S) = P(M) P(S), P(~M ^ S) = P(~M) P(S), P(M ^ ~S) = P(M) P(~S), P(~M ^ ~S) = P(~M) P(~S). And in general: P(M=u ^ S=v) = P(M=u) P(S=v) for each of the four combinations of u = True/False, v = True/False.
  • 92. Independence We've stated: P(M) = 0.6, P(S) = 0.3, P(S | M) = P(S). From these statements we can derive the full joint pdf over M and S (each entry is just a product, e.g. P(M ^ S) = 0.6 × 0.3 = 0.18), and since we now have the joint pdf, we can make any queries we like.
  • 93. A more interesting case M: Manuela teaches the class. S: It is sunny. L: The lecturer arrives slightly late. Assume both lecturers are sometimes delayed by bad weather, and Andrew is more likely to arrive late than Manuela. Let's begin by writing down the knowledge we're happy about: P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6. Lateness is not independent of the weather and is not independent of the lecturer. We already know the joint of S and M, so all we need now is P(L | S=u, M=v) in the 4 cases of u/v = True/False: P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2. Now we can derive a full joint p.d.f. with a "mere" six numbers instead of seven (the savings are larger for larger numbers of variables). Question: express P(L=x ^ M=y ^ S=z) in terms that only need the above expressions, where x, y and z may each be True or False.
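One way to answer the slide's question, sketched in code (not from the slides): with S and M independent and L depending only on M and S, every joint entry factors as P(L=x | M=y, S=z) * P(M=y) * P(S=z).

```python
P_S, P_M = 0.3, 0.6
P_L_given_MS = {(True, True): 0.05, (True, False): 0.1,
                (False, True): 0.1, (False, False): 0.2}

def p(value, prob_true):
    """Probability that a Boolean variable takes `value`, given P(variable = True)."""
    return prob_true if value else 1.0 - prob_true

def joint_LMS(l, m, s):
    """P(L=l ^ M=m ^ S=s) = P(L=l | M=m, S=s) * P(M=m) * P(S=s)."""
    return p(l, P_L_given_MS[(m, s)]) * p(m, P_M) * p(s, P_S)

print(joint_LMS(l=True, m=False, s=True))   # P(L ^ ~M ^ S) = 0.1 * 0.4 * 0.3 = 0.012
```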
  • 98. A bit of notation S M L P(s)=0.3 P(M)=0.6 P(L  M^S)=0.05 P(L  M^~S)=0.1 P(L  ~M^S)=0.1 P(L  ~M^~S)=0.2 P(S  M) = P(S) P(S) = 0.3 P(M) = 0.6 P(L  M ^ S) = 0.05 P(L  M ^ ~S) = 0.1 P(L  ~M ^ S) = 0.1 P(L  ~M ^ ~S) = 0.2
  • 99. A bit of notation S M L P(s)=0.3 P(M)=0.6 P(L  M^S)=0.05 P(L  M^~S)=0.1 P(L  ~M^S)=0.1 P(L  ~M^~S)=0.2 Read the absence of an arrow between S and M to mean “it would not help me predict M if I knew the value of S” Read the two arrows into L to mean that if I want to know the value of L it may help me to know M and to know S. This kind of stuff will be thoroughly formalized later P(S  M) = P(S) P(S) = 0.3 P(M) = 0.6 P(L  M ^ S) = 0.05 P(L  M ^ ~S) = 0.1 P(L  ~M ^ S) = 0.1 P(L  ~M ^ ~S) = 0.2
  • 100. An even cuter trick Suppose we have these three events: M : Lecture taught by Manuela L : Lecturer arrives late R : Lecture concerns robots Suppose: Andrew has a higher chance of being late than Manuela. Andrew has a higher chance of giving robotics lectures. What kind of independence can we find? How about: P(L  M) = P(L) ? P(R  M) = P(R) ? P(L  R) = P(L) ?
  • 101. Conditional independence Once you know who the lecturer is, then whether they arrive late doesn’t affect whether the lecture concerns robots. P(R  M,L) = P(R  M) and P(R  ~M,L) = P(R  ~M) We express this in the following way: “ R and L are conditionally independent given M” M L R Given knowledge of M, knowing anything else in the diagram won’t help us with L, etc. ..which is also notated by the following diagram.
  • 102. Conditional Independence formalized R and L are conditionally independent given M if for all x,y,z in {T,F}: P(R=x  M=y ^ L=z) = P(R=x  M=y) More generally: Let S1 and S2 and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets, P(S 1 ’s assignments  S 2 ’s assignments & S 3 ’s assignments)= P(S1’s assignments  S3’s assignments)
  • 103. Example: R and L are conditionally independent given M if for all x,y,z in {T,F}: P(R=x  M=y ^ L=z) = P(R=x  M=y) More generally: Let S1 and S2 and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets, P(S 1 ’s assignments  S 2 ’s assignments & S 3 ’s assignments)= P(S1’s assignments  S3’s assignments) “ Shoe-size is conditionally independent of Glove-size given height weight and age” means forall s,g,h,w,a P(ShoeSize=s|Height=h,Weight=w,Age=a) = P(ShoeSize=s|Height=h,Weight=w,Age=a,GloveSize=g)
  • 104. Example: R and L are conditionally independent given M if for all x,y,z in {T,F}: P(R=x  M=y ^ L=z) = P(R=x  M=y) More generally: Let S1 and S2 and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets, P(S 1 ’s assignments  S 2 ’s assignments & S 3 ’s assignments)= P(S1’s assignments  S3’s assignments) “ Shoe-size is conditionally independent of Glove-size given height weight and age” does not mean forall s,g,h P(ShoeSize=s|Height=h) = P(ShoeSize=s|Height=h, GloveSize=g)
  • 105. Conditional independence M L R We can write down P(M). And then, since we know L is only directly influenced by M, we can write down the values of P(L  M) and P(L  ~M) and know we’ve fully specified L’s behavior. Ditto for R. P(M) = 0.6 P(L  M) = 0.085 P(L  ~M) = 0.17 P(R  M) = 0.3 P(R  ~M) = 0.6 ‘ R and L conditionally independent given M’
  • 106. Conditional independence M, L, R with P(M) = 0.6, P(L | M) = 0.085, P(L | ~M) = 0.17, P(R | M) = 0.3, P(R | ~M) = 0.6, and the conditional independence P(R | M, L) = P(R | M), P(R | ~M, L) = P(R | ~M). Again, we can obtain any member of the joint prob dist that we desire: P(L=x ^ R=y ^ M=z) = P(L=x | M=z) P(R=y | M=z) P(M=z).
  • 107. Assume five variables T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Manuela S: It is sunny T only directly influenced by L (i.e. T is conditionally independent of R,M,S given L) L only directly influenced by M and S (i.e. L is conditionally independent of R given M & S) R only directly influenced by M (i.e. R is conditionally independent of L,S, given M) M and S are independent
  • 108. Making a Bayes net T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Manuela S: It is sunny S M R L T Step One: add variables. Just choose the variables you’d like to be included in the net.
  • 109. Making a Bayes net T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Manuela S: It is sunny S M R L T Step Two: add links. The link structure must be acyclic. If node X is given parents Q 1 ,Q 2 ,..Q n you are promising that any variable that’s a non-descendent of X is conditionally independent of X given {Q 1 ,Q 2 ,..Q n }
  • 110. Making a Bayes net T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Manuela S: It is sunny S M R L T P(s)=0.3 P(M)=0.6 P(R  M)=0.3 P(R  ~M)=0.6 P(T  L)=0.3 P(T  ~L)=0.8 P(L  M^S)=0.05 P(L  M^~S)=0.1 P(L  ~M^S)=0.1 P(L  ~M^~S)=0.2 Step Three: add a probability table for each node. The table for node X must list P(X|Parent Values) for each possible combination of parent values
  • 111. Making a Bayes net T: The lecture started by 10:35 L: The lecturer arrives late R: The lecture concerns robots M: The lecturer is Manuela S: It is sunny Two unconnected variables may still be correlated Each node is conditionally independent of all non-descendants in the tree, given its parents. You can deduce many other conditional independence relations from a Bayes net. See the next lecture. S M R L T P(s)=0.3 P(M)=0.6 P(R  M)=0.3 P(R  ~M)=0.6 P(T  L)=0.3 P(T  ~L)=0.8 P(L  M^S)=0.05 P(L  M^~S)=0.1 P(L  ~M^S)=0.1 P(L  ~M^~S)=0.2
  • 112. Bayes Nets Formalized A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair V , E where: V is a set of vertices. E is a set of directed edges joining vertices. No loops of any length are allowed. Each vertex in V contains the following information: The name of a random variable A probability distribution table indicating how the probability of this variable’s values depends on all possible combinations of parental values.
  • 113. Building a Bayes Net 1) Choose a set of relevant variables. 2) Choose an ordering for them; assume they're called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc). 3) For i = 1 to m: add the Xi node to the network; set Parents(Xi) to be a minimal subset of {X1 … Xi-1} such that we have conditional independence of Xi and all other members of {X1 … Xi-1} given Parents(Xi); define the probability table of P(Xi = k | assignments of Parents(Xi)).
  • 114. Computing a Joint Entry How to compute an entry in a joint distribution? E.g.: what is P(S ^ ~M ^ L ^ ~R ^ T)? (Using the net with P(S) = 0.3, P(M) = 0.6, P(R | M) = 0.3, P(R | ~M) = 0.6, P(T | L) = 0.3, P(T | ~L) = 0.8, P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2.)
  • 115. Computing with Bayes Net P(T ^ ~R ^ L ^ ~M ^ S) = P(T | ~R ^ L ^ ~M ^ S) * P(~R ^ L ^ ~M ^ S) = P(T | L) * P(~R ^ L ^ ~M ^ S) = P(T | L) * P(~R | L ^ ~M ^ S) * P(L ^ ~M ^ S) = P(T | L) * P(~R | ~M) * P(L ^ ~M ^ S) = P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M ^ S) = P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M | S) * P(S) = P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M) * P(S), using the CPTs of the net in the previous slide.
  • 116. The general case P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn) = P(Xn=xn | Xn-1=xn-1 ^ … ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X1=x1) = P(Xn=xn | Xn-1=xn-1 ^ … ^ X1=x1) * P(Xn-1=xn-1 | Xn-2=xn-2 ^ … ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X1=x1) = … = the product over i of P(Xi=xi | assignments of Parents(Xi)). So any entry in the joint pdf table can be computed, and so any conditional probability can be computed.
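A sketch of this computation for the five-variable lecture net (S and M are roots; L depends on M and S; R on M; T on L), using the CPT numbers given in the slides. It reuses P_S, P_M, P_L_given_MS and p() from the sketch after slide 97; the helper names are mine.

```python
P_R_given_M = {True: 0.3, False: 0.6}
P_T_given_L = {True: 0.3, False: 0.8}

def joint(s, m, l, r, t):
    """P(S=s ^ M=m ^ L=l ^ R=r ^ T=t): the product of each node's probability given its parents."""
    return (p(s, P_S) * p(m, P_M) * p(l, P_L_given_MS[(m, s)])
            * p(r, P_R_given_M[m]) * p(t, P_T_given_L[l]))

print(joint(s=True, m=False, l=True, r=False, t=True))   # P(T ^ ~R ^ L ^ ~M ^ S) = 0.00144
```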
  • 117. Where are we now? We have a methodology for building Bayes nets. We don't require exponential storage to hold our probability table; storage is only exponential in the maximum number of parents of any node. We can compute the probability of any given assignment of truth values to the variables, and we can do it in time linear with the number of nodes. So we can also compute answers to any questions. E.g., what could we do to compute P(R | T, ~S)? Step 1: compute P(R ^ T ^ ~S), the sum of all the rows in the joint that match R ^ T ^ ~S (4 joint computes). Step 2: compute P(T ^ ~S), the sum of all the rows in the joint that match T ^ ~S (8 joint computes). Step 3: return P(R ^ T ^ ~S) / P(T ^ ~S). Each of these sums is obtained by the "computing a joint probability entry" method of the earlier slides.
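The three steps above, sketched in code (not from the slides), reusing joint() from the previous sketch:

```python
from itertools import product

def prob_of(event):
    """Sum joint() over all assignments (s, m, l, r, t) for which `event` holds."""
    return sum(joint(*a) for a in product([True, False], repeat=5) if event(*a))

# P(R | T ^ ~S) by enumerating matching joint entries, exactly as the slide describes.
numerator = prob_of(lambda s, m, l, r, t: r and t and not s)     # rows matching R ^ T ^ ~S
denominator = prob_of(lambda s, m, l, r, t: t and not s)         # rows matching T ^ ~S
print(numerator / denominator)                                   # ~0.415
```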
  • 121. The good news We can do inference. We can compute any conditional probability: P( Some variable  Some other variable values )
  • 122. The good news We can do inference. We can compute any conditional probability: P( Some variable  Some other variable values ) Suppose you have m binary-valued variables in your Bayes Net and expression E 2 mentions k variables. How much work is the above computation?
  • 123. The sad, bad news Conditional probabilities by enumerating all matching entries in the joint are expensive: Exponential in the number of variables.
  • 124. The sad, bad news Conditional probabilities by enumerating all matching entries in the joint are expensive: Exponential in the number of variables. But perhaps there are faster ways of querying Bayes nets? In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find there are often many tricks to save you time. So we’ve just got to program our computer to do those tricks too, right?
  • 125. The sad, bad news Conditional probabilities by enumerating all matching entries in the joint are expensive: Exponential in the number of variables. But perhaps there are faster ways of querying Bayes nets? In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find there are often many tricks to save you time. So we’ve just got to program our computer to do those tricks too, right? Sadder and worse news: General querying of Bayes nets is NP-complete.
  • 126. Bayes nets inference algorithms A poly-tree is a directed acyclic graph in which no two nodes have more than one path between them. (The slide contrasts a poly-tree with a net that is not a poly-tree but is still a legal Bayes net.) If the net is a poly-tree, there is a linear-time algorithm (see a later Andrew lecture). The best general-case algorithms convert a general net to a poly-tree (often at huge expense) and call the poly-tree algorithm. Another popular, practical approach (which doesn't assume a poly-tree): stochastic simulation.
  • 127. Sampling from the Joint Distribution It’s pretty easy to generate a set of variable-assignments at random with the same probability as the underlying joint distribution. How? S M R L T P(s)=0.3 P(M)=0.6 P(R  M)=0.3 P(R  ~M)=0.6 P(T  L)=0.3 P(T  ~L)=0.8 P(L  M^S)=0.05 P(L  M^~S)=0.1 P(L  ~M^S)=0.1 P(L  ~M^~S)=0.2
  • 128. Sampling from the Joint Distribution 1. Randomly choose S. S = True with prob 0.3 2. Randomly choose M. M = True with prob 0.6 3. Randomly choose L. The probability that L is true depends on the assignments of S and M. E.G. if steps 1 and 2 had produced S=True, M=False, then probability that L is true is 0.1 4. Randomly choose R. Probability depends on M. 5. Randomly choose T. Probability depends on L S M R L T P(s)=0.3 P(M)=0.6 P(R  M)=0.3 P(R  ~M)=0.6 P(T  L)=0.3 P(T  ~L)=0.8 P(L  M^S)=0.05 P(L  M^~S)=0.1 P(L  ~M^S)=0.1 P(L  ~M^~S)=0.2
  • 129. A general sampling algorithm Let’s generalize the example on the previous slide to a general Bayes Net. As in Slides 16-17 , call the variables X 1 .. X n , where Parents(X i ) must be a subset of { X 1 .. X i-1 }. For i=1 to n : Find parents, if any, of X i . Assume n(i) parents. Call them X p(i,1) , X p(i,2) , … X p(i,n(i)) . Recall the values that those parents were randomly given: x p(i,1) , x p(i,2) , … x p(i,n(i)) . Look up in the lookup-table for: P(X i =True  X p(i,1) = x p(i,1) ,X p(i,2) =x p(i,2) …X p(i,n(i)) =x p(i,n(i)) ) Randomly set x i =True according to this probability x 1 , x 2 ,…x n are now a sample from the joint distribution of X 1 , X 2 ,…X n .
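A sketch of this sampling procedure specialized to the five-node lecture net (not from the slides), reusing the CPT constants from the joint-entry sketch above:

```python
import random

def sample_joint():
    """Draw one assignment (s, m, l, r, t) in topological order, each node given its parents."""
    s = random.random() < P_S
    m = random.random() < P_M
    l = random.random() < P_L_given_MS[(m, s)]
    r = random.random() < P_R_given_M[m]
    t = random.random() < P_T_given_L[l]
    return s, m, l, r, t
```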
  • 130. Stochastic Simulation Example Someone wants to know P( R = True  T = True ^ S = False ) We’ll do lots of random samplings and count the number of occurrences of the following: N c : Num. samples in which T=True and S=False. N s : Num. samples in which R=True , T=True and S=False . N : Number of random samplings Now if N is big enough: N c /N is a good estimate of P( T=True and S=False ). N s /N is a good estimate of P( R=True , T=True , S=False ) . P( R  T^~S ) = P( R^ T^~S )/P( T^~S ), so N s / N c can be a good estimate of P( R  T^~S ).
  • 131. General Stochastic Simulation Someone wants to know P( E 1  E 2 ) We’ll do lots of random samplings and count the number of occurrences of the following: N c : Num. samples in which E 2 N s : Num. samples in which E 1 and E 2 N : Number of random samplings Now if N is big enough: N c /N is a good estimate of P( E 2 ). N s /N is a good estimate of P( E 1 , E 2 ) . P( E 1  E 2 ) = P( E 1 ^ E 2 )/P( E 2 ), so N s / N c can be a good estimate of P( E 1  E 2 ).
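A sketch of the counting estimate above (not from the slides), reusing sample_joint() from the previous sketch; with enough samples, Ns/Nc should approach the exact answer computed by enumeration earlier.

```python
def estimate_conditional(event1, event2, n=200_000):
    """Estimate P(E1 | E2) as Ns / Nc, counting only samples that match E2."""
    nc = ns = 0
    for _ in range(n):
        a = sample_joint()
        if event2(*a):
            nc += 1
            if event1(*a):
                ns += 1
    return ns / nc if nc else float("nan")

print(estimate_conditional(lambda s, m, l, r, t: r,
                           lambda s, m, l, r, t: t and not s))   # estimate of P(R | T ^ ~S)
```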
  • 132. Likelihood weighting Problem with Stochastic Sampling: With lots of constraints in E, or unlikely events in E, then most of the simulations will be thrown away, (they’ll have no effect on Nc, or Ns). Imagine we’re part way through our simulation. In E2 we have the constraint Xi = v We’re just about to generate a value for Xi at random. Given the values assigned to the parents, we see that P(Xi = v  parents) = p . Now we know that with stochastic sampling: we’ll generate “Xi = v” proportion p of the time, and proceed. And we’ll generate a different value proportion 1-p of the time, and the simulation will be wasted. Instead, always generate Xi = v, but weight the answer by weight “p” to compensate.
  • 133. Likelihood weighting Set N c :=0, N s :=0 Generate a random assignment of all variables that matches E 2 . This process returns a weight w. Define w to be the probability that this assignment would have been generated instead of an unmatching assignment during its generation in the original algorithm.Fact: w is a product of all likelihood factors involved in the generation. N c := N c + w If our sample matches E 1 then N s := N s + w Go to 1 Again, N s / N c estimates P( E 1  E 2 )
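A sketch of likelihood weighting for the same query P(R | T ^ ~S), not from the slides: the evidence variables (T = True, S = False) are clamped instead of sampled, and each run is weighted by the probability the evidence values would have been generated. It reuses the CPT constants from the earlier sketches.

```python
def likelihood_weighting(n=100_000):
    """Estimate P(R | T=True ^ S=False) without rejecting any samples."""
    nc = ns = 0.0
    for _ in range(n):
        s = False
        w = 1.0 - P_S                          # weight factor for forcing S = False
        m = random.random() < P_M              # non-evidence variables sampled as usual
        l = random.random() < P_L_given_MS[(m, s)]
        r = random.random() < P_R_given_M[m]
        w *= P_T_given_L[l]                    # weight factor for forcing T = True
        nc += w
        if r:
            ns += w
    return ns / nc

print(likelihood_weighting())                  # should agree with the estimates above
```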
  • 134. Case Study I Pathfinder system. (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA). Diagnostic system for lymph-node diseases. 60 diseases and 100 symptoms and test-results. 14,000 probabilities Expert consulted to make net. 8 hours to determine variables. 35 hours for net topology. 40 hours for probability table values. Apparently, the experts found it quite easy to invent the causal links and probabilities. Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.
  • 135. Questions What are the strengths of probabilistic networks compared with propositional logic? What are the weaknesses of probabilistic networks compared with propositional logic? What are the strengths of probabilistic networks compared with predicate logic? What are the weaknesses of probabilistic networks compared with predicate logic? (How) could predicate logic and probabilistic networks be combined?
  • 136. What you should know The meanings and importance of independence and conditional independence. The definition of a Bayes net. Computing probabilities of assignments of variables (i.e. members of the joint p.d.f.) with a Bayes net. The slow (exponential) method for computing arbitrary, conditional probabilities. The stochastic simulation method and likelihood weighting.
  • 137. Content What is classification? Decision tree Naïve Bayesian Classifier Bayesian Networks Neural Networks Support Vector Machines (SVM)
  • 138. Neural Networks Assume each record X is a vector of length n. There are two classes which are valued 1 and -1. E.g. class 1 means “buy computer”, and class -1 means “no buy”. We can incrementally change weights to learn to produce these outputs using the perceptron learning rule.
  • 139. A Neuron The n-dimensional input vector x is mapped into the output y by means of the scalar product with the weight vector w, offset by a threshold μk, followed by a nonlinear activation function f: the weighted sum Σi wi xi - μk is passed through f to produce the output y.
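Slide 138 mentions the perceptron learning rule; here is a minimal sketch of a single neuron trained with that rule (not from the slides), assuming labels in {-1, +1} and a hard-threshold activation:

```python
def train_perceptron(X, y, epochs=20, lr=0.1):
    """X: list of numeric feature vectors; y: labels in {-1, +1}.
    Returns a classifier x -> -1/+1 based on the learned weights and bias."""
    n = len(X[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            activation = sum(wj * xj for wj, xj in zip(w, xi)) + b
            pred = 1 if activation >= 0 else -1
            if pred != yi:                                   # update only on mistakes
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return lambda x: 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```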
  • 141. Multi-Layer Perceptron An input layer (input vector xi), one or more layers of hidden nodes, and a layer of output nodes producing the output vector, with weights wij on the connections between consecutive layers.
  • 142. Network Training The ultimate objective of training obtain a set of weights that makes almost all the tuples in the training data classified correctly Steps Initialize weights with random values Feed the input tuples into the network one by one For each unit Compute the net input to the unit as a linear combination of all the inputs to the unit Compute the output value using the activation function Compute the error Update the weights and the bias
  • 143. Network Pruning and Rule Extraction Network pruning: a fully connected network is hard to articulate; N input nodes, h hidden nodes and m output nodes lead to h(m+N) weights. Pruning removes some of the links without affecting the classification accuracy of the network. Extracting rules from a trained network: discretize the activation values, replacing each individual activation value by its cluster average while maintaining the network accuracy; enumerate the outputs from the discretized activation values to find rules between activation value and output; find the relationship between the inputs and the activation values; combine the two to obtain rules relating the output to the input.
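  A rough sketch of the pruning idea (my own illustration; predict_fn and the threshold are hypothetical): zero out links with near-zero weights and keep the change only if validation accuracy is preserved.

  import numpy as np

  def prune_weights(W, X_val, y_val, predict_fn, accuracy_before, threshold=0.05):
      """Zero out small-magnitude links; keep the pruning only if accuracy holds.
      predict_fn(W, X) is assumed to return class predictions for inputs X."""
      W_pruned = np.where(np.abs(W) < threshold, 0.0, W)
      acc = np.mean(predict_fn(W_pruned, X_val) == y_val)
      return (W_pruned, acc) if acc >= accuracy_before else (W, accuracy_before)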
  • 144. Content What is classification? Decision tree Naïve Bayesian Classifier Bayesian Networks Neural Networks Support Vector Machines (SVM)
  • 145. Linear Support Vector Machines Goal: find a plane that divides the two sets of points; value = -1 for one class (e.g. does not buy a computer) and value = 1 for the other (e.g. buys a computer). Also, we need to maximize the margin. [Figure: the two classes of points, a separating plane, and its margin.]
  • 146. Linear Support Vector Machines [Figure: two separating planes for the same data, one with a small margin and one with a large margin; the points lying on the margin boundaries are the support vectors.]
  • 147. Linear Support Vector Machines The hyperplane is defined by a vector w and a value b. For any point x where value(x) = -1, w·x + b ≤ -1. For any point x where value(x) = 1, w·x + b ≥ 1. Here, letting x = (xh, xv) and w = (wh, wv), we have w·x = wh·xh + wv·xv. [Figure: the two boundary planes w·x + b = -1 and w·x + b = 1.]
  • 148. Linear Support Vector Machines For instance, suppose the hyperplane is defined by the vector w = (1, 1/√3) and the value b = -11. [Figure: the planes w·x + b = -1 and w·x + b = 1 cross the horizontal axis at 10 and 12, with w·x + b = 0 at 11; the planes make a 60° angle with the axis.] Some examples: x = (10, 0) gives w·x + b = -1; x = (0, 12√3) gives w·x + b = 1; x = (0, 0) gives w·x + b = -11. A little more detail: w is perpendicular to the minus plane, and the margin is M = 2/||w|| = √3.
  • 149. Linear Support Vector Machines Each record xi in the training data has a feature vector and a class. There are two classes, 1 and -1; e.g. class 1 = "buys computer", class -1 = "does not buy"; e.g. a record (age, salary) = (45, 50k) with class = 1. Read the training data and find the hyperplane defined by w and b; it should maximize the margin and minimize the error, which can be done with a quadratic programming technique. Predict the class of a new data point x by looking at the sign of w·x + b: positive means class 1, negative means class -1.
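  A minimal sketch using scikit-learn's SVC with a linear kernel (assuming scikit-learn is available; the toy feature values are made up). The library solves the quadratic program internally and exposes w and b:

  import numpy as np
  from sklearn.svm import SVC

  # toy training data: (age, salary in $1000s) with classes +1 = buys, -1 = does not buy
  X = np.array([[45, 50], [50, 70], [23, 20], [30, 25]], dtype=float)
  y = np.array([1, 1, -1, -1])

  clf = SVC(kernel="linear", C=1e6)        # large C approximates a hard margin
  clf.fit(X, y)

  w, b = clf.coef_[0], clf.intercept_[0]   # the learned hyperplane w·x + b = 0
  x_new = np.array([40, 60], dtype=float)
  print(int(np.sign(w @ x_new + b)))       # predicted class of the new record

  Prediction follows the rule above: the sign of w·x + b gives the class.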
  • 150. SVM – Cont. What if the data is not linearly separable? Project the data into a high-dimensional space where it is linearly separable, and then use linear SVM there (using kernels). [Figure: a one-dimensional example with points at -1, 0 and +1 that is not linearly separable on the line, but becomes separable after mapping each point into two dimensions, e.g. to (1,0), (0,0) and (0,1).]
  • 151. Non-Linear SVM Classification using SVM (w, b): check the sign of w·xi + b. In the non-linear case we can see this as checking the sign of Σi αi yi K(xi, x) + b, where the sum runs over the support vectors. The kernel K(xi, xj) can be thought of as doing a dot product Φ(xi)·Φ(xj) in some high-dimensional space.
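  A brief sketch of the kernel version, again assuming scikit-learn; the RBF kernel and the XOR-style toy data are chosen only for illustration:

  import numpy as np
  from sklearn.svm import SVC

  # XOR-like data that no single hyperplane can separate in the original space
  X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
  y = np.array([-1, -1, 1, 1])

  clf = SVC(kernel="rbf", gamma=2.0, C=10.0)   # K(xi, xj) = exp(-gamma * ||xi - xj||^2)
  clf.fit(X, y)
  print(clf.predict(X))                        # the kernel mapping makes the classes separable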
  • 154. SVM vs. Neural Network SVM: relatively new concept; nice generalization properties; hard to learn – learned in batch mode using quadratic programming techniques; using kernels it can learn very complex functions. Neural Network: quite old; generalizes well but doesn't have a strong mathematical foundation; can easily be learned in incremental fashion; to learn complex functions, use a multilayer perceptron (not that trivial).
  • 155. SVM Related Links http://svm.dcs.rhbnc.ac.uk/ http://www.kernel-machines.org/ C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998. SVM light – software (in C): http://ais.gmd.de/~thorsten/svm_light BOOK: An Introduction to Support Vector Machines, N. Cristianini and J. Shawe-Taylor, Cambridge University Press.
  • 156. Content What is classification? Decision tree Naïve Bayesian Classifier Bayesian Networks Neural Networks Support Vector Machines (SVM) Bagging and Boosting
  • 157. Bagging and Boosting General idea: [Diagram: the training data is perturbed into several altered training sets; the same classification method (CM) builds a classifier from each set (C from the original data, C1, C2, … from the altered sets); the individual classifiers are then aggregated into a combined classifier C*.]
  • 158. Bagging Given a set S of s samples:
  1. Generate a bootstrap sample T from S. Cases in S may not appear in T or may appear more than once.
  2. Repeat this sampling procedure, getting a sequence of k independent training sets.
  3. Construct a corresponding sequence of classifiers C1, C2, …, Ck for these training sets, using the same classification algorithm.
  4. To classify an unknown sample X, let each classifier predict, or vote.
  5. The bagged classifier C* counts the votes and assigns X to the class with the most votes.
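  A compact sketch of bagging with decision trees as the base classifier (my own illustration, assuming scikit-learn and numpy arrays for S; names are invented):

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  def bagged_fit(S_X, S_y, k=25, seed=0):
      rng = np.random.default_rng(seed)
      classifiers = []
      for _ in range(k):
          idx = rng.integers(0, len(S_X), size=len(S_X))   # bootstrap sample T: draw s cases with replacement
          c = DecisionTreeClassifier().fit(S_X[idx], S_y[idx])
          classifiers.append(c)
      return classifiers

  def bagged_predict(classifiers, x):
      votes = [int(c.predict(x.reshape(1, -1))[0]) for c in classifiers]   # each classifier votes
      return max(set(votes), key=votes.count)                              # class with the most votes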
  • 159. Boosting Technique — Algorithm
  1. Assign every example an equal weight 1/N.
  2. For t = 1, 2, …, T do: obtain a hypothesis (classifier) h(t) under the weights w(t); calculate the error of h(t) and re-weight the examples based on the error, so that samples that are incorrectly predicted are weighted more heavily (each classifier thus depends on the previous ones); normalize w(t+1) to sum to 1 (the weights assigned to the different classifiers sum to 1).
  3. Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set.
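  A condensed sketch in the spirit of AdaBoost, one concrete instance of the scheme above (my own illustration, assuming scikit-learn decision stumps as the hypotheses and class labels coded as +1/-1):

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  def boost_fit(X, y, T=20):                 # y in {+1, -1}
      w = np.full(len(X), 1.0 / len(X))      # equal initial weights 1/N
      hyps, alphas = [], []
      for _ in range(T):
          h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
          pred = h.predict(X)
          err = np.sum(w * (pred != y)) / np.sum(w)            # weighted error of h(t)
          alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))    # accuracy-based hypothesis weight
          w *= np.exp(-alpha * y * pred)                       # misclassified samples weighted more heavily
          w /= w.sum()                                         # normalize w(t+1) to sum to 1
          hyps.append(h)
          alphas.append(alpha)
      return hyps, alphas

  def boost_predict(hyps, alphas, X):
      scores = sum(a * h.predict(X) for h, a in zip(hyps, alphas))
      return np.sign(scores)                 # weighted sum of all the hypotheses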
  • 160. Summary Classification is an extensively studied problem (mainly in statistics, machine learning and neural networks). Classification is probably one of the most widely used data mining techniques, with a lot of extensions. Scalability is still an important issue for database applications; thus combining classification with database techniques should be a promising topic. Research directions: classification of non-relational data, e.g. text, spatial, multimedia, etc.