CS 9633 Machine Learning Computational Learning Theory Adapted from notes by Tom Mitchell http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
Theoretical Characterization of Learning Problems Under what conditions is successful learning possible and impossible? Under what conditions is a particular learning algorithm assured of learning successfully?
Two Frameworks PAC (Probably Approximately Correct) Learning Framework:  Identify classes of hypotheses that can and cannot be learned from a polynomial number of training examples Define a natural measure of complexity for hypothesis spaces that allows bounding the number of training examples needed Mistake Bound Framework
Theoretical Questions of Interest Is it possible to identify classes of learning problems that are inherently difficult or easy, independent of the learning algorithm? Can one characterize the number of training examples necessary or sufficient to assure successful learning? How is the number of examples affected if the learner observes a random sample of training data? If the learner is allowed to pose queries to the trainer? Can one characterize the number of mistakes that a learner will make before learning the target function? Can one characterize the inherent computational complexity of a class of learning algorithms?
Computational Learning Theory Relatively recent field and an area of intense research. For some of the questions on the previous page, the partial answer is yes. We will generally focus on certain types of learning problems.
Inductive Learning of Target Function What we are given Hypothesis space Training examples What we want to know How many training examples are sufficient to successfully learn the target function? How many mistakes will the learner make before succeeding?
Questions for Broad Classes of Learning Algorithms Sample complexity How many training examples do we need to converge to a successful hypothesis with a high probability? Computational complexity How much computational effort is needed to converge to a successful hypothesis with a high probability? Mistake Bound How many training examples will the learner misclassify before converging to a successful hypothesis?
PAC Learning Probably Approximately Correct Learning Model Will restrict discussion to learning boolean-valued concepts in noise-free data.
Problem Setting: Instances and Concepts X is the set of all possible instances over which the target function may be defined. C is the set of target concepts the learner is to learn. Each target concept c in C is a subset of X. Equivalently, each target concept c in C is a boolean function c: X → {0,1}: c(x) = 1 if x is a positive example of the concept, c(x) = 0 otherwise.
Problem Setting: Distribution Instances are generated at random using some probability distribution D. D may be any distribution. D is generally not known to the learner. D is required to be stationary (it does not change over time). Training examples x are drawn at random from X according to D and presented with target value c(x) to the learner.
Problem Setting:  Hypotheses Learner  L  considers set of hypotheses  H   After observing a sequence of training examples of the target concept  c ,  L  must output some hypothesis  h  from  H  which is its estimate of  c
Example Problem (Classifying Executables) Three classes (Malicious, Boring, Funny). Features: a1 = GUI present (yes/no), a2 = deletes files (yes/no), a3 = allocates memory (yes/no), a4 = creates new thread (yes/no). Distribution? Hypotheses?
Instance  a1   a2   a3   a4   Class
1         Yes  No   No   Yes  B
2         Yes  No   No   No   B
3         No   Yes  Yes  No   F
4         No   No   Yes  Yes  M
5         Yes  No   No   Yes  B
6         Yes  No   No   No   F
7         Yes  Yes  Yes  No   M
8         Yes  Yes  No   Yes  M
9         No   No   No   Yes  B
10        No   No   Yes  No   M
True Error Definition: The true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.
Error of h with respect to c Figure: instance space X, with the regions covered by c and by h overlapping; the instances on which h and c disagree (the non-overlapping regions) constitute the error of h with respect to c.
Key Points True error is defined over the entire instance space, not just the training data. The error depends strongly on the unknown probability distribution D. The error of h with respect to c is not directly observable to the learner L; L can only observe performance with respect to the training data (training error). Question: How probable is it that the observed training error for h gives a misleading estimate of the true error?
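To make the distinction concrete, here is a minimal Python sketch (an illustration, not from the slides) that estimates the true error error_D(h) by sampling from D. The target c, hypothesis h, and distribution D are hypothetical stand-ins loosely based on the executables example.

```python
import random

def c(x):                      # hypothetical target concept:
    return x[1] and x[2]       # "deletes files AND allocates memory"

def h(x):                      # hypothetical learned hypothesis:
    return x[1]                # "deletes files"

def draw(rng):                 # assumed D: 4 independent fair boolean features
    return tuple(rng.random() < 0.5 for _ in range(4))

rng = random.Random(0)
m = 100_000
mistakes = sum(h(x) != c(x) for x in (draw(rng) for _ in range(m)))
print(f"estimated error_D(h) ~= {mistakes / m:.3f}")   # ~0.25 under this D
```

Changing D changes error_D(h) even though h and c are fixed, which is exactly the point that the error depends strongly on the distribution.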
PAC Learnability Goal: characterize classes of target concepts that can be reliably learned from a reasonable number of randomly drawn training examples and using a reasonable amount of computation. It is unreasonable to expect perfect learning, where error_D(h) = 0: we would need to provide training examples corresponding to every possible instance, and with a random sample of training examples there is always a non-zero probability that the training examples will be misleading.
Weaken Demand on Learner Hypothesis error (Approximately): Will not require a zero-error hypothesis. Require only that the error is bounded by some constant ε that can be made arbitrarily small; ε is the error parameter. Error on training data (Probably): Will not require that the learner succeed on every sequence of randomly drawn training examples. Require only that its probability of failure is bounded by a constant δ that can be made arbitrarily small; δ is the confidence parameter.
Definition of PAC-Learnability Definition: Consider a concept class C defined over a set of instances X of length n and a learner L using hypothesis space H. C is PAC-learnable by L using H if, for all c ∈ C, distributions D over X, ε such that 0 < ε < ½, and δ such that 0 < δ < ½, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).
Requirements of Definition L must, with arbitrarily high probability (1 − δ), output a hypothesis having arbitrarily low error (ε). L's learning must be efficient: its time grows polynomially in the strength of the output hypothesis (1/ε, 1/δ) and in the inherent complexity of the instance space (n) and concept class C (size(c)).
Block Diagram of PAC Learning Model Learning algorithm L; Training sample; Control parameters ε, δ; Hypothesis h
Examples of second requirement Consider the executables problem, where instances are conjunctions of boolean features: a1=yes ∧ a2=no ∧ a3=yes ∧ a4=no. Concepts are conjunctions of a subset of the features: a1=yes ∧ a3=yes ∧ a4=yes
Using the Concept of PAC Learning in Practice We often want to know how many training instances we need in order to achieve a certain level of accuracy with a specified probability. If L requires some minimum processing time per training example, then for C to be PAC-learnable by L, L must learn from a polynomial number of training examples.
Sample Complexity Sample complexity of a learning  problem  is the growth in the required training examples with problem size. Will determine the sample complexity for consistent learners. A learner is consistent if it outputs hypotheses which perfectly fit the training data whenever possible. All algorithms in Chapter 2 are consistent learners.
Recall definition of VS The version space, denoted VS_H,D, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with the training examples in D
VS and PAC learning by consistent learners Every consistent learner outputs a hypothesis belonging to the version space, regardless of the instance space X, hypothesis space H, or training data D. To bound the number of examples needed by any consistent learner, we need only to bound the number of examples needed to assure that the version space contains no unacceptable hypotheses.
ε-exhausted Definition: Consider a hypothesis space H, target concept c, instance distribution D, and set of training examples D of c. The version space VS_H,D is said to be ε-exhausted with respect to c and D if every hypothesis h in VS_H,D has error less than ε with respect to c and D.
Exhausting the version space Figure: hypotheses in hypothesis space H, each labeled with its true error and its training error r. The version space VS_H,D consists of the hypotheses with r = 0 (here one with error = 0.2 and one with error = 0.1); the remaining hypotheses shown have (error, r) of (0.1, 0.2), (0.3, 0.2), (0.3, 0.4), and (0.2, 0.3).
Exhausting the Version Space Only an observer who knows the identity of the target concept can determine with certainty whether the version space is ε-exhausted. But we can bound the probability that the version space will be ε-exhausted after a given number of training examples, without knowing the identity of the target concept and without knowing the distribution from which the training examples were drawn.
Theorem 7.1 Theorem 7.1. ε-exhausting the version space. If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent randomly drawn examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space VS_H,D is not ε-exhausted (with respect to c) is less than or equal to |H|e^(−εm)
Proof of theorem See text
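As an illustration of the theorem (an addition, not in the text), a brute-force simulation can compare the empirical probability that the version space is not ε-exhausted against the |H|e^(−εm) bound. The tiny instance space, uniform D, and arbitrary target below are assumptions chosen so that H can be enumerated exhaustively.

```python
import itertools, math, random

X = list(itertools.product([0, 1], repeat=3))       # 8 instances
H = list(itertools.product([0, 1], repeat=len(X)))  # each h is a truth table; |H| = 256
c = H[37]                                           # arbitrary target concept

def true_error(h):                 # D is uniform over X here
    return sum(h[i] != c[i] for i in range(len(X))) / len(X)

def not_eps_exhausted(m, eps, rng):
    sample = [rng.randrange(len(X)) for _ in range(m)]
    for h in H:                    # does a "bad" hypothesis survive in VS?
        if all(h[i] == c[i] for i in sample) and true_error(h) > eps:
            return True
    return False

rng = random.Random(1)
m, eps, trials = 40, 0.25, 200
p_hat = sum(not_eps_exhausted(m, eps, rng) for _ in range(trials)) / trials
print(f"empirical {p_hat:.3f} vs bound {len(H) * math.exp(-eps * m):.4f}")
```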
Number of Training Examples (Eq. 7.2): m ≥ (1/ε)(ln|H| + ln(1/δ))
Summary of Result The inequality on the previous slide provides a general bound on the number of training examples sufficient for any consistent learner to successfully learn any target concept in H, for any desired values of ε and δ. This number m of training examples is sufficient to assure that any consistent hypothesis will be probably (with probability 1 − δ) approximately (within error ε) correct. The value of m grows linearly with 1/ε, logarithmically with 1/δ, and logarithmically with |H|. The bound can be a substantial overestimate.
Problem Suppose we have the instance space described for the EnjoySports problem: Sky (Sunny, Cloudy, Rainy) AirTemp (Warm, Cold) Humidity (Normal, High) Wind (Strong, Weak) Water (Warm, Cold) Forecast (Same, Change) Hypotheses can be as before (?, Warm, Normal, ?, ?, Same)  (0, 0, 0, 0, 0, 0) How many training examples do we need to have an error rate of less than 10% with a probability of 95%?
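A sketch of the computation for this problem, assuming the usual count of 973 semantically distinct EnjoySport hypotheses (4·3·3·3·3·3 value combinations plus the single everywhere-negative hypothesis) and applying Eq. 7.2:

```python
import math

H_size = 4 * 3 * 3 * 3 * 3 * 3 + 1     # = 973 semantically distinct hypotheses
eps, delta = 0.10, 0.05                # error < 10% with probability >= 95%
m = math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)
print(m)                               # 99 training examples suffice
```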
Limits of Equation 7.2 Equation 7.2 tells us how many training examples suffice to ensure (with probability (1 − δ)) that every hypothesis having zero training error will have a true error of at most ε. Problem: there may be no hypothesis that is consistent with the training data if the concept is not in H. In this case, we want the minimum-error hypothesis.
Agnostic Learning and Inconsistent Hypotheses An agnostic learner does not make the assumption that the concept is contained in the hypothesis space. We may want to consider the hypothesis with the minimum training error. Can derive a bound similar to the previous one: m ≥ (1/(2ε²))(ln|H| + ln(1/δ))
Concepts that are PAC-Learnable Proofs that a type of concept is PAC-Learnable usually consist of two steps: Show that each target concept in C can be learned from a polynomial number of training examples Show that the processing time per training example is also polynomially bounded
PAC Learnability of Conjunctions of Boolean Literals Class C of target concepts described by conjunctions of boolean literals, e.g.: GUI_Present ∧ ¬Opens_files. Is C PAC-learnable? Yes. Will prove by showing that a polynomial number of training examples is needed to learn each concept, and demonstrating an algorithm that uses polynomial time per training example.
Examples Needed to Learn Each Concept Consider a consistent learner that uses hypothesis space H = C. Compute the number m of random training examples sufficient to ensure that L will, with probability (1 − δ), output a hypothesis with maximum error ε. We will use m ≥ (1/ε)(ln|H| + ln(1/δ)). What is the size of the hypothesis space? (|H| = 3^n: each of the n literals can appear positively, appear negated, or be absent from a conjunction.)
Complexity Per Example We just need to show that for some algorithm we can spend a polynomial amount of time per training example. One way to do this is to give an algorithm. In this case, we can use FIND-S as the learning algorithm. FIND-S incrementally computes the most specific hypothesis consistent with each training example. Example trace: Old ∧ Tired +, Old ∧ Happy +, Tired +, Old ∧ ¬Tired −, Rich ∧ Happy +. What is a bound on the time per example?
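One possible rendering of FIND-S for this concept class, as a Python sketch (the attribute names and examples are illustrative, not the slide's exact trace). A hypothesis is a set of literals, each literal an (attribute, value) pair; each example costs a constant number of set operations over at most n literals, so the time per example is linear in n.

```python
def find_s(examples):
    h = None                              # most specific: no positives seen yet
    for x, label in examples:
        if not label:
            continue                      # FIND-S ignores negative examples
        literals = set(x.items())         # literals satisfied by this example
        if h is None:
            h = literals                  # first positive fixes every literal
        else:
            h &= literals                 # drop literals the example violates
    return h

examples = [
    ({"Old": True,  "Tired": True,  "Rich": False}, True),
    ({"Old": True,  "Tired": False, "Rich": False}, True),
    ({"Old": False, "Tired": True,  "Rich": True},  False),
]
print(sorted(find_s(examples)))           # [('Old', True), ('Rich', False)]
```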
Theorem 7.2 PAC-learnability of boolean conjunctions.  The class C of conjunctions of boolean literals is PAC-learnable by the FIND-S algorithm using H=C
Proof of Theorem 7.2 Equation 7.4, m ≥ (1/ε)(n·ln 3 + ln(1/δ)), shows that the sample complexity for this concept class is polynomial in n, 1/ε, and 1/δ, and independent of size(c). To incrementally process each training example, the FIND-S algorithm requires effort linear in n and independent of 1/ε, 1/δ, and size(c). Therefore, this concept class is PAC-learnable by the FIND-S algorithm.
Interesting Results Unbiased learners are not PAC-learnable because they require an exponential number of examples. k-term DNF (disjunctive normal form) is not PAC-learnable: its sample complexity is polynomial, but its computational complexity is not (unless RP = NP). k-CNF (conjunctive normal form) is a superset of k-term DNF, yet it is PAC-learnable.
Sample Complexity with Infinite Hypothesis Spaces Two drawbacks of the previous result: It often does not give a very tight bound on the sample complexity, and it only applies to finite hypothesis spaces. The Vapnik-Chervonenkis dimension of H (VC dimension) will give tighter bounds and applies to many infinite hypothesis spaces.
Shattering a Set of Instances Consider a subset of instances S from the instance space X. Every hypothesis h imposes a dichotomy on S: {x ∈ S | h(x) = 1} and {x ∈ S | h(x) = 0}. Given some set of instances S, there are 2^|S| possible dichotomies. The ability of H to shatter a set of instances is a measure of its capacity to represent target concepts defined over these instances.
Shattering a Hypothesis Space Definition:  A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.
Vapnik-Chervonenkis Dimension The ability to shatter a set of instances is closely related to the inductive bias of the hypothesis space. An unbiased hypothesis space is one that shatters the instance space X. Sometimes X cannot be shattered by H, but a large subset of X can.
Vapnik-Chervonenkis Dimension Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) = ∞.
Shattered Instance Space
Example 1 of VC Dimension Instance space X is the set of real numbers, X = R. H is the set of intervals on the real number line; hypotheses have the form a < x < b. What is VC(H)?
Shattering the real number line Figure: interval hypotheses over the point set {−1.2, 3.4} and over the point set {−1.2, 3.4, 6.7}. What is VC(H)? What is |H|?
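A brute-force sketch (an illustration, not from the slides) that checks shattering directly: enumerate every dichotomy of a point set and search for an interval a < x < b realizing it.

```python
from itertools import combinations, product

def interval_shatters(points):
    pts = sorted(points)
    # candidate endpoints: below all points, between adjacent points, and a
    # spare pair above all points so the empty dichotomy is realizable
    cands = ([pts[0] - 1]
             + [(p + q) / 2 for p, q in zip(pts, pts[1:])]
             + [pts[-1] + 1, pts[-1] + 2])
    for target in product([0, 1], repeat=len(points)):
        realized = any(
            tuple(int(a < x < b) for x in points) == target
            for a, b in combinations(cands, 2)
        )
        if not realized:
            return False
    return True

print(interval_shatters([-1.2, 3.4]))        # True:  VC(H) >= 2
print(interval_shatters([-1.2, 3.4, 6.7]))   # False: (in, out, in) fails
```

No 3-point set can be shattered (the dichotomy that includes the two outer points but excludes the middle one is unrealizable by a single interval), so VC(H) = 2 even though |H| is infinite.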
Example 2 of VC Dimension Set X of instances corresponding to points on the x,y plane. H is the set of all linear decision surfaces. What is VC(H)?
Shattering the x-y plane Figure: dichotomies of 2 instances and of 3 instances realized by linear decision surfaces. VC(H) = ? |H| = ?
Proving limits on VC dimension If we find any set of instances of size d that can be shattered, then VC(H) ≥ d. To show that VC(H) < d, we must show that no set of size d can be shattered.
General result for r dimensional space The VC dimension of linear decision surfaces in an r dimensional space is r+1.
Example 3 of VC dimension The instances in X are conjunctions of exactly three boolean literals, such as young ∧ happy ∧ single. H is the set of hypotheses described by a conjunction of up to 3 boolean literals. What is VC(H)?
Shattering conjunctions of literals Approach: construct a set of instances of size 3 that can be shattered. Let instance i have positive literal l_i and all other literals negative. Representation of instances that are conjunctions of literals l_1, l_2, and l_3 as bit strings: Instance 1: 100, Instance 2: 010, Instance 3: 001. Construction of a dichotomy: to exclude an instance i, add the appropriate ¬l_i to the hypothesis. Extend the argument to n literals. Can VC(H) be greater than n (the number of literals)?
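The construction on this slide can be checked mechanically. The sketch below (an addition, not from the text) enumerates all conjunctions of up to 3 literals and verifies that every dichotomy of the instances 100, 010, 001 is realized.

```python
from itertools import product

instances = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

def hypotheses():
    # each variable appears positively (1), negated (0), or not at all (None)
    for spec in product([None, 0, 1], repeat=3):
        yield lambda x, s=spec: all(v is None or x[i] == v
                                    for i, v in enumerate(s))

shattered = all(
    any(tuple(int(h(x)) for x in instances) == target for h in hypotheses())
    for target in product([0, 1], repeat=3)
)
print(shattered)   # True: these 3 instances are shattered, so VC(H) >= 3
```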
Sample Complexity and the VC dimension Can derive a new bound for the number of randomly drawn training examples that suffice to probably approximately learn a target concept (how many examples do we need to ε-exhaust the version space with probability (1 − δ)?)
Comparing the Bounds Finite-hypothesis-space bound (Eq. 7.2): m ≥ (1/ε)(ln|H| + ln(1/δ)). VC-based bound: m ≥ (1/ε)(4 log₂(2/δ) + 8·VC(H)·log₂(13/ε)). The VC bound applies to infinite hypothesis spaces and grows with VC(H) rather than with ln|H|.
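To see how the two bounds behave, here is a small sketch (an addition) evaluating both for conjunctions of n boolean literals, where |H| = 3^n and VC(H) = n. For modest n the ln|H| bound is actually the smaller of the two, because the constants in the VC bound dominate; the VC bound's advantage is that it also applies when H is infinite.

```python
import math

def bound_finite_H(H_size, eps, delta):        # Eq. 7.2
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def bound_vc(vc, eps, delta):                  # VC-based bound
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc * math.log2(13 / eps)) / eps)

n, eps, delta = 10, 0.1, 0.05
print(bound_finite_H(3 ** n, eps, delta))      # 140 examples
print(bound_vc(n, eps, delta))                 # 5831 examples
```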
Lower Bound on Sample Complexity Theorem 7.3. Lower bound on sample complexity. Consider any concept class C such that VC(C) ≥ 2, any learner L, and any 0 < ε < 1/8 and 0 < δ < 1/100. Then there exists a distribution D and a target concept in C such that, if L observes fewer examples than max[(1/ε)·log(1/δ), (VC(C) − 1)/(32ε)], then with probability at least δ, L outputs a hypothesis h having error_D(h) > ε.