Machine Learning: Foundations. Course Number 0368403401. Prof. Nathan Intrator. Teaching Assistants: Daniel Gill, Guy Amit
Course structure: There will be 4 homework exercises, both theoretical and programming. All programming will be done in Matlab. Course info can be accessed from www.cs.tau.ac.il/~nin. The final exam format has not been decided yet. Office hours: Wednesday 4-5 (contact via email).
Class Notes: Groups of 2-3 students will be responsible for scribing class notes. Class notes are submitted by the next Monday (1 week), with corrections and additions from Thursday to the following Monday. This contributes 30% to the grade.
Class Notes (cont'd): Notes will be written in LaTeX and compiled into PDF via MiKTeX (download from the school site). A style file can be found on the course web site. Figures should be in GIF.
Basic Machine Learning idea: Receive a collection of observations, each associated with some action label. Perform some kind of "machine learning" so as to: receive a new observation, "process" it, and generate an action label based on the previous observations. Main requirement: good generalization.
Learning Approaches: (1) Store observations in memory and retrieve: simple, but little generalization (distance measure?). (2) Learn a set of rules and apply them to new data: sometimes it is difficult to find a good model, but generalization is good. (3) Estimate a "flexible model" from the data: generalization issues, data size issues.
Storage & Retrieval: Simple but computationally intensive, with little generalization. How can retrieval be performed? It requires a "distance measure" between stored observations and a new observation; the distance measure can be given or "learned" (clustering).
Learning a Set of Rules: How to create a "reliable" set of rules from the observed data: tree structures, graphical models. Trade-off: complexity of the rule set vs. generalization.
Estimation of a flexible model: What is a "flexible" model? A universal approximator. Reliability and generalization; data size issues.
Applications: Control (robot arm, driving and navigating a car). Medical applications: diagnosis, monitoring, drug release, gene analysis. Web retrieval based on a user profile. Customized ads: Amazon. Document retrieval: Google.
Related Disciplines Machine Learning AI probability & statistics computational complexity theory control theory information theory philosophy psychology neurophysiology Data Mining decision theory game theory optimization biological evolution statistical mechanics
Example 1: Credit Risk Analysis. Typical customer: a bank. Database: current clients' data, including a basic profile (income, house ownership, delinquent accounts, etc.) and a basic classification. Goal: predict/decide whether to grant credit.
Example 1: Credit Risk Analysis. Rules learned from the data: IF Other-Delinquent-Accounts > 2 AND Number-Delinquent-Billing-Cycles > 1 THEN DENY CREDIT; IF Other-Delinquent-Accounts = 0 AND Income > $30k THEN GRANT CREDIT.
Example 2: Clustering news. Data: Reuters news / Web data. Goal: basic category classification (business, sports, politics, etc.) and classification into subcategories (unspecified). Methodology: consider "typical words" for each category and classify using a "distance" measure.
Example 3: Robot control. Goal: control a robot in an unknown environment. The robot needs both to explore (new places and actions) and to use acquired knowledge to gain benefits. The learning task "controls" what it observes!
Example 4: Medical Application. Goal: monitor multiple physiological parameters. The system needs both to explore and to use acquired knowledge to gain benefits. The learning task "controls" what it observes!
 
History of Machine Learning. 1960s and 70s: models of human learning; high-level symbolic descriptions of knowledge, e.g., logical expressions or graphs/networks, e.g., (Karpinski & Michalski, 1966), (Simon & Lea, 1974). Winston's (1975) structural learning system learned logic-based structural descriptions from examples. (Minsky & Papert, 1969). 1970s: genetic algorithms, developed by Holland (1975). 1970s - present: knowledge-intensive learning. A tabula rasa approach typically fares poorly: "To acquire new knowledge a system must already possess a great deal of initial knowledge." Lenat's CYC project is a good example.
History of Machine Learning (cont’d) 1970’s - present:  Alternative modes of learning  (besides examples) Learning from instruction, e.g.,  (Mostow, 1983) (Gordon & Subramanian, 1993) Learning by analogy, e.g., (Veloso, 1990) Learning from cases, e.g., (Aha, 1991) Discovery (Lenat, 1977) 1991: The first of a series of workshops on  Multistrategy Learning  (Michalski) 1970’s – present:  Meta-learning Heuristics for focusing attention, e.g., (Gordon & Subramanian, 1996) Active selection of examples for learning, e.g., (Angluin, 1987), (Gasarch & Smith, 1988), (Gordon, 1991) Learning how to learn, e.g., (Schmidhuber, 1996)
History of Machine Learning (cont'd). 1980 – The First Machine Learning Workshop was held at Carnegie-Mellon University in Pittsburgh. 1980 – Three consecutive issues of the International Journal of Policy Analysis and Information Systems were specially devoted to machine learning. 1981 – Hinton, Jordan, Sejnowski, Rumelhart, McClelland at UCSD: the back-propagation algorithm and the PDP book. 1986 – The establishment of the Machine Learning journal. 1987 – The beginning of annual international conferences on machine learning (ICML) and of the Snowbird ML conference. 1988 – The beginning of regular workshops on computational learning theory (COLT). 1990s – Explosive growth in the field of data mining, which involves the application of machine learning techniques.
Bottom line from History: 1960 – the Perceptron (Rosenblatt; critiqued by Minsky & Papert). 1960 – Bellman's "curse of dimensionality". 1980 – bounds on statistical estimators (C. Stone). 1990 – beginning of high-dimensional data (hundreds of variables). 2000 – high-dimensional data (thousands of variables).
A Glimpse into the Future. Today's status: first-generation algorithms: neural nets, decision trees, etc. Future: smart remote controls, phones, cars; data and communication networks, software.
Types of models: Supervised learning: given access to classified (labeled) data. Unsupervised learning: given access to data, but with no classification; important for data reduction. Control learning: selects actions and observes consequences; maximizes long-term cumulative return.
Learning with Complete Information: Probability distribution D1 over one class and D2 over the other, equally likely a priori. Compute the probability of "smiley" given a point (x,y), using Bayes formula; let p be that probability.
Task: generate a class label for a point at location (x,y). Decide between S and H by comparing P(S|(x,y)) to P(H|(x,y)). Clearly, one needs to know all these probabilities.
Predictions and Loss Model: How do we determine the optimality of a prediction? We define a loss for every prediction and try to minimize it. Predict a Boolean value: each error costs 1 (no error, no loss). Compare the probability p to 1/2 and predict deterministically with the higher value; this is the optimal prediction for zero-one loss. Note that it cannot recover the probabilities!
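The zero-one-loss argument above can be sketched in a few lines (a minimal illustration; `p` is assumed to be the known probability that the label is 1):

```python
# Sketch: Bayes-optimal prediction under zero-one loss.
def bayes_predict(p):
    # Predict deterministically with the more probable label.
    return 1 if p > 0.5 else 0

def expected_zero_one_loss(p, prediction):
    # Predicting 1 loses 1 with probability 1 - p; predicting 0, with probability p.
    return 1 - p if prediction == 1 else p
```

The expected loss of the optimal prediction is min(p, 1-p), and nothing about the prediction itself reveals the value of p, which is the "cannot recover probabilities" point above.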
Bayes Estimator: A Bayes estimator associated with a prior distribution p and a loss function L is an estimator d that minimizes the Bayes risk r(p,d). For every x, it is given by d(x), the minimizer over estimators d of the posterior expected loss r(p,d|x). The value r(p) = r(p,d^p) is then called the Bayes risk.
Other Loss Models: Quadratic loss. Predict a "real number" q for outcome 1. Loss (q-p)^2 for outcome 1 and ([1-q]-[1-p])^2 for outcome 0; expected loss (p-q)^2, minimized for q = p (the optimal prediction). This recovers the probabilities, but one needs to know p to compute the loss!
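The quadratic-loss computation above can be checked directly (a sketch using the slide's definitions; the grid search is purely illustrative):

```python
# Expected quadratic loss of predicting q when the true probability is p.
def expected_quadratic_loss(q, p):
    loss_outcome_1 = (q - p) ** 2
    loss_outcome_0 = ((1 - q) - (1 - p)) ** 2  # algebraically equals (q - p)^2
    return p * loss_outcome_1 + (1 - p) * loss_outcome_0

# A coarse grid search confirms that the minimizer is q = p.
best_q = min((i / 100 for i in range(101)),
             key=lambda q: expected_quadratic_loss(q, 0.3))
```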
The basic PAC Model: A batch learning model, i.e., the algorithm is trained over some fixed data set. Assumption: a fixed, unknown distribution D of x over a domain X. The error of a hypothesis h w.r.t. a target concept f is e(h) = Pr_D[h(x) ≠ f(x)]. Goal: given a collection of hypotheses H, find h in H that minimizes e(h).
The basic PAC Model (cont'd): As the distribution D is unknown, we are provided with a training set S of m samples, on which we can estimate the error: e'(h) = (1/m) |{ x ∈ S : h(x) ≠ f(x) }|. Basic question: how close is e(h) to e'(h)?
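The empirical error above is a direct computation (a sketch; `h`, `f`, and `S` are illustrative stand-ins):

```python
# e'(h) = (1/m) * |{ x in S : h(x) != f(x) }|
def empirical_error(h, f, S):
    m = len(S)
    return sum(1 for x in S if h(x) != f(x)) / m
```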
Bayesian Theory: A prior distribution over H. Given a sample S, compute a posterior distribution. Maximum Likelihood (ML): maximize Pr[S|h]. Maximum A Posteriori (MAP): maximize Pr[h|S]. Bayesian predictor: Σ_h h(x) Pr[h|S].
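The three rules above can be sketched over a tiny finite class (a toy illustration; the prior and the likelihoods Pr[S|h] are assumed given as dictionaries):

```python
def posterior(prior, likelihood):
    # Pr[h|S] is proportional to Pr[S|h] * Pr[h] (Bayes rule), normalized over H.
    unnorm = {h: likelihood[h] * prior[h] for h in prior}
    z = sum(unnorm.values())
    return {h: w / z for h, w in unnorm.items()}

def ml_hypothesis(likelihood):
    # Maximum Likelihood: maximize Pr[S|h].
    return max(likelihood, key=likelihood.get)

def map_hypothesis(prior, likelihood):
    # Maximum A Posteriori: maximize Pr[h|S].
    post = posterior(prior, likelihood)
    return max(post, key=post.get)

def bayes_prediction(prior, likelihood, h_of_x):
    # Bayesian predictor: sum over h of h(x) * Pr[h|S].
    post = posterior(prior, likelihood)
    return sum(h_of_x[h] * post[h] for h in post)
```

With a strong enough prior, MAP and ML can disagree, which is the whole point of keeping the prior around.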
Some Issues in Machine Learning What algorithms can approximate functions well, and when?  How does number of training examples influence accuracy?  How does complexity of hypothesis representation impact it?  How does noisy data influence accuracy?
More Issues in Machine Learning What are the theoretical limits of learnability?  How can prior knowledge of learner help?  What clues can we get from biological learning  systems?  How can systems alter their own representations?
Complexity vs. Generalization: Hypothesis complexity versus observed error. More complex hypotheses have lower observed error on the training set, but might have higher true error (on the test set).
Criteria for Model Selection: The criteria differ in their assumptions about the a priori likelihood of h. AIC and BIC are two other theory-based model selection methods. Minimum Description Length (MDL): e'(h) + |code length of h|. Structural Risk Minimization (SRM): e'(h) + { log|H| / m }^(1/2), where m is the number of training samples.
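The SRM score above is simple enough to compute directly (a sketch of the formula on the slide; the function name is illustrative):

```python
import math

# Observed error plus the complexity penalty (log|H| / m)^(1/2),
# where m is the number of training samples.
def srm_score(observed_error, class_size, m):
    return observed_error + math.sqrt(math.log(class_size) / m)
```

With more data the penalty shrinks, so richer hypothesis classes become affordable; this is the complexity/generalization trade-off from the previous slide made quantitative.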
Weak Learning: A small class of predicates H. Weak learning: assume that for any distribution D, there is some predicate h ∈ H that predicts better than 1/2 + ε. Combining multiple weak learners yields strong learning.
Boosting Algorithms: Functions: a weighted majority of the predicates. Methodology: change the distribution to target "hard" examples; the weight of an example is exponential in its number of incorrect classifications. Good experimental results and efficient algorithms.
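The reweighting idea above can be sketched as follows (an illustrative fragment, not a full boosting algorithm; over many rounds the multiplicative updates make a weight exponential in the example's mistake count):

```python
import math

def update_weights(weights, mistakes):
    # mistakes[i] = 1 if example i was misclassified this round, else 0.
    new = [w * math.exp(m) for w, m in zip(weights, mistakes)]
    z = sum(new)
    return [w / z for w in new]  # renormalize to a distribution
```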
Computational Methods: How to find a hypothesis h from a collection H with low observed error. In most cases the computational task is provably hard. Some methods apply only to a binary h, others to both binary and more general h.
 
Nearest Neighbor Methods: Classify using nearby examples. Assume a "structured space" and a "metric". (Figure: labeled + and - examples with an unlabeled query point "?".)
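A minimal nearest-neighbor sketch of the idea above, assuming squared Euclidean distance as the metric on 2-D points (all names are illustrative):

```python
def nearest_neighbor(train, query):
    # train: list of ((x, y), label); return the label of the closest point.
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    _, label = min(train, key=lambda pl: dist2(pl[0], query))
    return label
```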
Separating Hyperplane: Perceptron: sign(Σ_i x_i w_i). Find w_1 ... w_n. Limited representation. (Figure: inputs x_1 ... x_n with weights w_1 ... w_n feeding a Σ/sign unit.)
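The perceptron above predicts sign(Σ_i x_i w_i); a sketch with the classic mistake-driven update w ← w + y·x (the training data in the test is illustrative, with x[0] = 1 acting as a bias input):

```python
def sign(a):
    return 1 if a >= 0 else -1

def predict(w, x):
    return sign(sum(wi * xi for wi, xi in zip(w, x)))

def perceptron_train(data, n, epochs=10):
    # data: list of (x, y) with y in {-1, +1}.
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in data:
            if predict(w, x) != y:  # update only on mistakes
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w
```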
Neural Networks: Sigmoidal gates: a = Σ_i x_i w_i and output = 1/(1 + e^(-a)). Learning by "back propagation" of errors.
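The sigmoidal gate above, computed directly (back propagation would then differentiate this output; that part is omitted here):

```python
import math

# a = sum_i x_i * w_i, output = 1 / (1 + e^(-a))
def sigmoid_gate(x, w):
    a = sum(xi * wi for xi, wi in zip(x, w))
    return 1.0 / (1.0 + math.exp(-a))
```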
Decision Trees: (Figure: a tree testing x_1 > 5 at the root and x_6 > 2 at an internal node, with leaves labeled +1, +1, -1.)
Decision Trees: Top-down construction: build the tree greedily, using a local index function: the Gini index G(x) = x(1-x), the entropy H(x), etc. Bottom-up model selection: prune the decision tree while maintaining a low observed error.
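The two local index functions named above, written out as functions of the positive-class fraction x ∈ [0, 1]:

```python
import math

def gini(x):
    return x * (1 - x)

def entropy(x):
    if x in (0.0, 1.0):
        return 0.0  # a pure node has zero impurity
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)
```

Both peak at x = 1/2 (maximally mixed node) and vanish at pure nodes, which is what makes them usable as greedy split criteria.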
Decision Trees: Limited representation, but highly interpretable. Efficient training and retrieval algorithms. Smart cost/complexity pruning. Aim: find a small decision tree with a low observed error.
Support Vector Machine: (Figure: data mapped from n dimensions to m dimensions.)
Support Vector Machine: Project the data to a high-dimensional space. Use a hyperplane in the LARGE space, choosing a hyperplane with a large MARGIN. (Figure: + and - points separated by a margin.)
Reinforcement Learning: Main idea: learning with a delayed reward. Uses dynamic programming and supervised learning. Addresses problems that cannot be addressed by regular supervised methods, e.g., control problems. Dynamic programming searches for optimal policies.
Genetic Programming: A search method. Local mutation operations (change a node in a tree); cross-over operations (replace a subtree by another tree); keep the "best" candidates (trees with a low observed error). Example: decision trees.
Unsupervised learning: Clustering
Basic Concepts in Probability: For a single hypothesis h: given an observed error, bound the true error. Markov inequality: for a nonnegative random variable X and a > 0, Pr[X ≥ a] ≤ E[X]/a.
Basic Concepts in Probability: Chebyshev inequality: Pr[|X - E[X]| ≥ a] ≤ Var(X)/a^2.
Basic Concepts in Probability: Chernoff inequality: for i.i.d. bounded random variables, the empirical mean converges exponentially fast to the true mean; e.g., for m i.i.d. {0,1}-valued samples with mean p, Pr[|(1/m) Σ_i X_i - p| > ε] ≤ 2e^(-2ε²m).
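The convergence claim above can be checked empirically (a sketch with illustrative names; the bound used is the Hoeffding/Chernoff-style term 2·exp(-2ε²m)):

```python
import math
import random

def deviation_frequency(p, m, eps, trials, seed=0):
    # Fraction of trials in which the empirical mean of m coin flips
    # with bias p deviates from p by more than eps.
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(m)) / m
        if abs(mean - p) > eps:
            bad += 1
    return bad / trials

def chernoff_bound(m, eps):
    return 2.0 * math.exp(-2.0 * eps * eps * m)
```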
 
Basic Concepts in Probability: Switching from h_1 to h_2: given the observed errors, predict whether h_2 is better. One can compare total error rates, or, more refined, consider only the cases where h_1(x) ≠ h_2(x).
Course structure (recap): store observations in memory and retrieve (simple, little generalization); learn a set of rules and apply them to new data (sometimes difficult to find a good model, good generalization); estimate a "flexible model" from the data (generalization and data size issues); together with the issues in machine learning listed above. (Adapted from lecture slides for the textbook Machine Learning, T. Mitchell, McGraw Hill, 1997.)
Fourier Transform: f(x) = Σ_z f̂(z) χ_z(x), where χ_z(x) = (-1)^<x,z>. Many simple classes are well approximated using the large coefficients. Efficient algorithms exist for finding the large coefficients.
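The Boolean Fourier basis above can be sketched by brute force over the cube (a toy illustration; the exact coefficient f̂(z) = E_x[f(x) χ_z(x)] is computed by enumeration, which is only feasible for small n):

```python
from itertools import product

def chi(z, x):
    # Parity character chi_z(x) = (-1)^<x, z> for x, z in {0,1}^n.
    return (-1) ** sum(zi * xi for zi, xi in zip(z, x))

def fourier_coefficient(f, z, n):
    pts = list(product([0, 1], repeat=n))
    return sum(f(x) * chi(z, x) for x in pts) / len(pts)
```

A parity function has its entire weight on a single coefficient, which is the extreme case of the "few large coefficients" approximation mentioned above.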
General PAC Methodology: Minimize the observed error. Search for a small-size classifier. Hand-tailored search methods for specific classes.
Other Models: Membership queries: the learner submits a point x and receives its label f(x).

