Machine Learning 101 - AWS Machine Learning Web Day

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Michael Brückner
Manager Machine Learning
25/02/2016
Machine Learning 101

Agenda
• What is Machine Learning and why do we need it?
• Model Building
• Model Evaluation & Tuning

What is Machine Learning?
Methods and Systems that …
• Adapt based on recorded data
• Predict new data based on recorded data
• Optimize an action given a utility function
• Extract hidden structure from the data
• Summarize data into concise descriptions

What is Machine Learning NOT?
Methods and Systems that …
• can yield Garbage-In Knowledge-Out
• perform well without data modeling & feature engineering
• avoid the curse-of-dimensionality
• are a replacement for business rules

Infer-Predict-Decide Cycle
• Inference: Build & evaluate Predictor
• Prediction: Apply the learned Predictor
• Decision Making: Adjust Business loss and get new/more data

What for?
Automate tasks that typically require humans, in order to
• scale
• improve over humans (non-experts)
• preserve privacy
or solve tasks that are impossible for humans

Examples: Personalized Recommendation
• Input:

Examples: Personalized Recommendation
• Output:

Examples: Face Detection & Recognition
Face detection
• Input: image
• Output: face position
Face recognition
• Input: face (image & face position)
• Output: person’s name

Examples: Full-Text Translation
• Input: text in one language
• Output: text in another language

Examples: Spam Filtering
• Input: email (text, images, …)
• Output: spam/non-spam flag
• Challenges:
  • extremely high precision for legitimate emails
  • spam changes constantly
  • noisy ground truth

Supervised Machine Learning
1. Model problem in terms of input data and output data
2. Collect sample of input-output pairs
3. Learn a mapping that produces the output given the input
4. Apply this function on new inputs to make predictions

A Programmer’s Perspective
• Traditional Programming (Predicting): Input Data + Mapping → Computer → Output Data
• Supervised Machine Learning: Input Data + Output Data → Computer → Mapping
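The contrast can be sketched in Python. This is only an illustration: the function names, the cost values, and the labels are hypothetical and do not come from the slides. In the first function a human writes the mapping; in the second the computer derives it from input-output pairs.

```python
# Traditional programming: a human writes the mapping by hand.
def is_expensive_handwritten(cost):
    return cost > 100  # threshold picked by intuition (hypothetical value)


# Supervised learning: the computer derives the mapping from examples.
def learn_threshold(inputs, outputs):
    """Return the threshold t for which the rule 'x > t' agrees with most labels."""
    best_t, best_correct = None, -1
    for t in sorted(set(inputs)):
        correct = sum((x > t) == y for x, y in zip(inputs, outputs))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


# Hypothetical training sample of input-output pairs.
costs = [20, 45, 50, 400, 700, 1000]
labels = [False, False, False, True, True, True]

threshold = learn_threshold(costs, labels)


def learned_mapping(cost):
    """Apply the learned mapping to a new input."""
    return cost > threshold
```

Collecting more pairs and calling `learn_threshold` again adapts the mapping without anyone editing the rule by hand.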

Advantages
• Use data instead of intuition to derive the mapping
• Can solve very complex tasks
• Can adapt to new situations (collect more data)
• Does not require much expert knowledge

Input Data
Description | Type | Cost | Actual Cost | Diff | In Catalogue
Movies | Entertainment | $50 | $28 | $22 | Yes
Music (CDs, MP3s, etc.) | | $500 | $30 | $470 | No
Sporting Events | Entertainment | $0 | $40 | ($40) | No
Dining Out | Food | $1,000 | $1,200 | ($200) | Yes
Groceries | | $100 | $0 | $100 | Yes
Charity 1 | Gifts and Charity | $200 | $200 | $0 | No
Charity 2 | | $500 | $500 | $0 | No
Cable/Satellite | Housing | $100 | $100 | $0 | Yes
Electric | Housing | $45 | $40 | $5 | Yes
Mortgage or Rent | | $700 | $700 | $0 | Yes
Health | Insurance | $400 | $400 | $0 | Yes
Home | Insurance | $400 | $400 | $0 | No
Credit Card 1 | | | | $0 | Yes
Annotations: Dataset, Attribute, Attribute Name, Attribute Value, Text Data, Categorical Data, Numerical Data, Binary Data, Missing Data

Output Data
Description | Type | Cost | Actual Cost | Diff | In Catalogue
Movies | Entertainment | $50 | $28 | $22 | Yes
Music (CDs, MP3s, etc.) | ? | $500 | $30 | $470 | No
Sporting Events | Entertainment | $0 | $40 | ($40) | No
Dining Out | Food | $1,000 | $1,200 | ($200) | Yes
Groceries | ? | $100 | $0 | $100 | Yes
Charity 1 | Gifts and Charity | $200 | $200 | $0 | No
Charity 2 | ? | $500 | $500 | $0 | No
Cable/Satellite | Housing | $100 | $100 | $0 | Yes
Electric | Housing | $45 | $40 | $5 | Yes
Mortgage or Rent | ? | $700 | $700 | $0 | Yes
Health | Insurance | $400 | $400 | $0 | Yes
Home | Insurance | $400 | $400 | $0 | No
Credit Card 1 | ? | | | $0 | Yes
Annotations: Target Attribute, Target Attribute Values ("?" marks an unknown value)

Problem Setting
• Input: vector of observable attributes, x
• Output: target attribute value, y
• Training data: pairs of input and corresponding output, D = (x_1, y_1), …, (x_N, y_N)
• Application data: inputs only
• Goal: learn mapping f_w: x ↦ y (the Predictor)

Challenges in Model Building
• Which function class for Predictor (data modeling)?
• How to pre-process the data (feature engineering)?
• How to learn this Predictor from our training data?
• How to generalize to new data?

Which function class for Predictor?
Types of prediction tasks (output type):
• Binary Classification ⇒ binary target y ∈ {–1, +1}
• Multinomial Classification ⇒ categorical target y ∈ {1, …, K}
• Regression ⇒ numeric target y ∈ [l, u] ⊂ R

Which function class for Binary Classification?
• Decision Tree
[Figure: a decision tree with the threshold tests "x2 > 7?", "x1 < 3?", "x2 < 5?" and "x1 < 1?" (each with a no/yes branch), shown next to the (x1, x2) plane, which the thresholds x1 = 1, x1 = 3, x2 = 5 and x2 = 7 split into regions of + and – points]

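• Linear function
  • binary target attribute values y ∈ {–1, +1}
  • prediction: ŷ(x) = sign(f_w(x))
  • separating hyperplane: H_w = {x | f_w(x) = x^T w + w_0 = 0}
[Figure: hyperplane H_w separating + from – points in the (x1, x2) plane]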
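A minimal sketch of this linear Predictor in plain Python. The weights and inputs are hypothetical values chosen for illustration, and mapping a score of exactly 0 to +1 is an assumption, since the slide does not say how sign(0) is handled.

```python
def f_w(x, w, w0):
    """Linear score f_w(x) = x^T w + w_0."""
    return sum(xi * wi for xi, wi in zip(x, w)) + w0


def predict(x, w, w0):
    """Predicted label sign(f_w(x)); a score of exactly 0 is mapped to +1 (assumption)."""
    return 1 if f_w(x, w, w0) >= 0 else -1


# Hypothetical model parameters for two attributes x1 and x2.
w, w0 = [1.0, -1.0], 0.5

label_a = predict([2.0, 1.0], w, w0)  # score  1.5, positive side of H_w
label_b = predict([0.0, 3.0], w, w0)  # score -2.5, negative side of H_w
```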

• Generalized linear function (Kernel methods)
• Layered Generalized linear function (Neural Networks)
• Ensemble of functions
• …
[Figure: + and – points in the (x1, x2) plane]

How to pre-process the data?
• Predictor’s function class defined for limited input domain
⇒ transform/extract attributes first (pre-processing)
• Number to (normalized) Number:
• z-standardization, min-max normalization
• Number to Category:
• Binning (quantile, equidistant)
• Category to (numeric) Vector:
• One-hot encoding

How to pre-process the data?
• Predictor’s function class defined for limited input domain
⇒ transform/extract attributes first (pre-processing)
• Text to (numeric) Vector:
• Normalization, tokenization, stemming
• Bag-of-Words, Bag-of-NGrams, TF-IDF ⇒ sparse vector
• Latent word embedding (LSI, word2vec, LDA) ⇒ dense vector
• Image to (numeric) Vector:
• HoG, DAISY, color histogram

How to learn a Predictor?
• Loss of Predictor f_w: x ↦ y for a given input-output pair: L(y, f_w(x)), with loss function L, ground truth y and prediction f_w(x)

Loss functions for binary classification (target y ∈ {–1, +1})

Function Class | Loss Function | Learning Algorithm
Decision Trees | 0/1 loss | ID3
Decision Trees | Quadratic loss | CART
Linear function | Quadratic loss | Least-squares regression
Linear function | Logistic loss | Logistic regression
Linear function | Hinge loss | Support Vector Machines
Layered Generalized Linear function | Logistic loss | Neural Networks (Binary Classification)
Layered Generalized Linear function | Quadratic loss | Neural Networks (Regression)
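The loss functions named in the table can be written down for a binary target y ∈ {–1, +1} and a real-valued score f_w(x). These are the standard textbook definitions, not formulas recovered from the slides, and counting a score of exactly 0 as an error in the 0/1 loss is a convention assumed here.

```python
import math


def zero_one_loss(y, score):
    """1 if the sign of the score disagrees with the label, else 0."""
    return 0 if y * score > 0 else 1


def quadratic_loss(y, score):
    """Squared difference between label and score."""
    return (y - score) ** 2


def logistic_loss(y, score):
    """log(1 + exp(-y * score)); small when the score is confidently correct."""
    return math.log(1.0 + math.exp(-y * score))


def hinge_loss(y, score):
    """max(0, 1 - y * score); zero once the margin y * score reaches 1."""
    return max(0.0, 1.0 - y * score)
```

For example, a positive example (y = +1) with score 0.5 is classified correctly, so its 0/1 loss is 0, but its hinge loss is still 0.5 because the margin is below 1.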

• Theoretical Risk: average loss over all possible data
• Empirical Risk: average loss over training data
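The formulas on this slide did not survive extraction. The standard definitions, written with the loss L(y, f_w(x)) and the training data D = (x_1, y_1), …, (x_N, y_N) from the earlier slides, are given below; the symbols R and \hat{R} are notation assumed here.

```latex
% Theoretical risk: expected loss over the (unknown) data distribution
R(w) = \mathbb{E}_{(x,y)}\big[ L(y, f_w(x)) \big]

% Empirical risk: average loss over the N training pairs
\hat{R}(w) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f_w(x_i))
```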

• Prediction ŷ(x) depends on Predictor f_w with model parameters w
• Minimize Risk w.r.t. those model parameters w ⇒ mathematical Optimisation Problem
• Gradient-based first or second-order methods
• Coordinate-descent methods
• (Greedy) Search
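As one instance of a gradient-based first-order method, the sketch below minimizes the empirical risk of a linear Predictor under the logistic loss with plain batch gradient descent. The learning rate, the number of steps, and the toy data are assumptions made for illustration; a real implementation would add a stopping criterion and regularization.

```python
import math


def train_logistic_regression(data, learning_rate=0.1, n_steps=500):
    """Minimize the average logistic loss over (x, y) pairs, y in {-1, +1}."""
    n_features = len(data[0][0])
    w, w0 = [0.0] * n_features, 0.0
    for _ in range(n_steps):
        grad_w, grad_w0 = [0.0] * n_features, 0.0
        for x, y in data:
            score = sum(xi * wi for xi, wi in zip(x, w)) + w0
            # Derivative of log(1 + exp(-y * score)) with respect to the score.
            g = -y / (1.0 + math.exp(y * score))
            for j, xj in enumerate(x):
                grad_w[j] += g * xj
            grad_w0 += g
        # Step against the averaged gradient.
        w = [wj - learning_rate * gj / len(data) for wj, gj in zip(w, grad_w)]
        w0 -= learning_rate * grad_w0 / len(data)
    return w, w0


def predict(x, w, w0):
    """Predicted label: sign of the linear score (0 mapped to +1)."""
    return 1 if sum(xi * wi for xi, wi in zip(x, w)) + w0 >= 0 else -1


# Hypothetical, linearly separable training data.
training_data = [([-2.0, -1.0], -1), ([-1.0, -2.0], -1),
                 ([1.0, 2.0], 1), ([2.0, 1.0], 1)]
w, w0 = train_logistic_regression(training_data)
```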

How to generalize to new data?
[Figure: Error as a function of Model Complexity]

How to generalize to new data?
• Empirical Risk: average loss over training data
• Structural Risk: Empirical Risk plus a Regularizer
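The slide's formula is missing from the extracted text. A common way to write the regularized objective is shown below; the weight λ, the regularizer Ω, and the squared L2 norm as an example are notation and choices assumed here, not taken from the slides.

```latex
% Structural risk: empirical risk plus a weighted regularizer
\hat{R}_{\mathrm{struct}}(w) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f_w(x_i)) + \lambda \, \Omega(w),
\qquad \lambda \ge 0

% Example regularizer (assumption): squared L2 norm of the parameters
\Omega(w) = \lVert w \rVert_2^2
```

A larger λ penalizes complex models more strongly, which ties in with "Amount of regularization" as a hyper-parameter.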

Performance for Binary Classification
Total number of data points (N) | True Target: positive | True Target: negative
Predicted Target: positive | True Positive | False Positive
Predicted Target: negative | False Negative | True Negative

• Accuracy: (TP + TN) / N
• Recall (true positive rate): TP / (TP + FN)
• Precision: TP / (TP + FP)
• Fall-out (false positive rate): FP / (TN + FP)
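These four measures follow directly from the confusion-matrix counts. The sketch below computes them for labels in {–1, +1}; the function names and the example labels are made up, and the code assumes every denominator is non-zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for labels in {-1, +1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred):
    """Accuracy, recall, precision and fall-out from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    n = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / n,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "fall_out": fp / (tn + fp),
    }


# Hypothetical true and predicted targets.
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1, -1, -1]
result = binary_metrics(y_true, y_pred)
```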

Decision function
ŷ(x) = sign(f_w(x) + b), with Predictor f_w and Decision threshold b
AUC (Area Under ROC Curve)
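A sketch of both ideas: applying a decision threshold to a Predictor score, and computing the AUC. The AUC is computed here through its rank interpretation, the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half; this equals the area under the ROC curve. The scores and labels are hypothetical.

```python
def decide(score, b=0.0):
    """Decision function sign(f_w(x) + b); a value of exactly 0 is mapped to +1."""
    return 1 if score + b >= 0 else -1


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == -1]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


y_true = [1, 1, -1, 1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = roc_auc(y_true, scores)
```

Changing `b` moves the operating point along the ROC curve, while the AUC summarizes the ranking quality across all thresholds.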

Training vs. Test Performance
How do we know that a Predictor works well on new data?
Small error on training data ≠ small error on new data (test data)!

Hold-out Evaluation
• Put some data aside before training = test data
• Use this hold-out data for evaluation
• Disadvantages:
• What if we were (un)lucky when choosing the hold-out data?
• We do NOT use all the data for model training!
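A hold-out split can be sketched as follows. The test fraction of 20% and the fixed random seed are arbitrary assumptions; choosing a different seed is exactly the "(un)lucky" effect mentioned above.

```python
import random


def hold_out_split(data, test_fraction=0.2, seed=0):
    """Shuffle the data and put a fraction aside as test data.

    Returns (training_data, test_data).
    """
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


training_data, test_data = hold_out_split(range(10))
```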

K-Fold Cross Validation-based Evaluation
• Split data into K partitions (folds)
• Take all but one partition to train a Predictor
• Evaluate Predictor on the left-out partition
• Repeat this for all partitions
• Average performance for all K evaluations
• Finally train a Predictor on all data
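The procedure above can be sketched in plain Python. The `train` and `evaluate` callables are placeholders: the majority-class Predictor and the toy data below are stand-ins invented for illustration. The sketch assigns points to folds round-robin, so real data should be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the K folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, folds[i]


def cross_validate(data, k, train, evaluate):
    """Average the evaluation result over all K left-out partitions."""
    results = []
    for train_idx, test_idx in k_fold_indices(len(data), k):
        predictor = train([data[i] for i in train_idx])
        results.append(evaluate(predictor, [data[i] for i in test_idx]))
    return sum(results) / k


# Toy stand-ins (assumptions): a Predictor that always outputs the majority label.
def train_majority(pairs):
    labels = [y for _, y in pairs]
    return max(set(labels), key=labels.count)


def accuracy(predictor, pairs):
    return sum(predictor == y for _, y in pairs) / len(pairs)


data = [(i, 1 if i < 8 else -1) for i in range(10)]
cv_accuracy = cross_validate(data, k=5, train=train_majority, evaluate=accuracy)
final_predictor = train_majority(data)  # finally, train on all data
```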

Model Tuning
Learning methods and Predictors have hyper-parameters
• Amount of regularization
• Choice of loss function
• Decision threshold score
• Learning rate
• …

Example: Decision threshold
[Figure; axis label: Decision threshold]

How to choose hyper-parameters?
Grid Search:
• Evaluate Predictor for all grid points (hyper-parameter combinations)
• Take best grid point
Very expensive!


[Figure: hyper-parameter grid with logarithmically scaled axes (powers of 10 and powers of 2)]
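A minimal grid-search sketch. The hyper-parameter names and the `toy_evaluate` function are hypothetical; in practice `evaluate` would train a Predictor with the given hyper-parameters and return, for example, its cross-validated performance. The grid uses powers of 10 and powers of 2, in line with the logarithmic axes of the figure.

```python
import itertools
import math


def grid_search(grid, evaluate):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid: 5 x 3 = 15 evaluations.
grid = {
    "regularization": [10.0 ** e for e in (-2, -1, 0, 1, 2)],
    "learning_rate": [2.0 ** e for e in (-1, 0, 1)],
}


def toy_evaluate(params):
    """Stand-in for a real evaluation; peaks at regularization=10, learning_rate=1."""
    return (-(math.log10(params["regularization"]) - 1.0) ** 2
            - math.log2(params["learning_rate"]) ** 2)


best_params, best_score = grid_search(grid, toy_evaluate)
```

The number of evaluations is the product of the grid sizes along each axis, which is why grid search becomes expensive as hyper-parameters are added.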

How to choose hyper-parameters?
Bayesian Optimisation:
• Learn model to predict evaluation outcomes
• Evaluate Predictor only for promising grid points
• Take best grid point after fixed number of evaluations


[Figure: hyper-parameter grid with logarithmically scaled axes (powers of 10 and powers of 2)]

Common Pitfalls
• Model tuning is part of training
⇒ Do NOT use test data or test CV partitions!
• Use proper grid resolution and axis scaling
• Use same metric for tuning as for evaluation

Thank you!
