Machine Learning
Data science for beginners, session 6
Machine Learning: your 5-7 things
Defining machine learning
The Scikit-Learn library
Machine learning algorithms
Choosing an algorithm
Measuring algorithm performance
Defining Machine Learning
Machine Learning = learning models from data
Which advert is the user most likely to click on?
Who’s most likely to win this election?
Which wells are most likely to fail in the next 6 months?
Machine Learning as Predictive Analytics...
Machine Learning Process
● Get data
● Select a model
● Select hyperparameters for that model
● Fit model to data
● Validate model (and change model, if necessary)
● Use the model to predict values for new data
Today’s library: Scikit-Learn (sklearn)
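In code, that process follows a single pattern; the editor's notes call it import-instantiate-fit-predict. A minimal sketch, using made-up data (not from the slides):

import numpy as np
from sklearn.linear_model import LinearRegression  # 1. import a model class
model = LinearRegression()                         # 2. instantiate it (hyperparameters go here)
X_train = np.array([[1.0], [2.0], [3.0]])          # toy data: 3 samples, 1 feature
y_train = np.array([2.0, 4.1, 5.9])
model.fit(X_train, y_train)                        # 3. fit the model to the data
print(model.predict(np.array([[4.0]])))            # 4. predict values for new data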
Scikit-Learn’s example datasets
● Iris
● Digits
● Diabetes
● Boston
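Each of these loads the same way. A quick sketch (the data and target attributes hold the features and labels; note that load_boston() has been removed from recent Scikit-Learn versions):

from sklearn import datasets
iris = datasets.load_iris()   # likewise load_digits(), load_diabetes()
print(iris.data.shape)        # feature matrix: one row per sample
print(iris.target.shape)      # targets: one label per sample
print(iris.DESCR[:200])       # each example dataset ships with a description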
Select a Model
Algorithm Types
Supervised learning
Regression: learning numbers
Classification: learning classes
Unsupervised learning
Clustering: finding groups
Dimensionality reduction: finding efficient representations
Linear Regression: fit a line to (numerical) data
Linear Regression: First, get your data
import numpy as np
import pandas as pd
gen = np.random.RandomState(42)
num_samples = 40
x = 10 * gen.rand(num_samples)
y = 3 * x + 7 + gen.randn(num_samples)  # samples around the line y = 3x + 7, plus noise
X = pd.DataFrame(x)  # sklearn expects a 2D array of features
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(x, y)
Linear Regression: Fit model to data
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
print('Slope: {}, Intercept: {}'.format(model.coef_, model.intercept_))
Linear Regression: Check your model
Xtest = pd.DataFrame(np.linspace(-1, 11))
predicted = model.predict(Xtest)
plt.scatter(x, y)
plt.plot(Xtest, predicted)
Reality can be a little more like this…
Classification: Predict classes
● Well pump: [working, broken]
● CV: [accept, reject]
● Gender: [male, female, other]
● Iris variety: [iris setosa, iris virginica, iris versicolor]
Classification: The Iris Dataset
(image: an iris flower, with the petal and sepal labelled)
Classification: first get your data
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
Y = iris.target
Classification: Split your data
ntest = 10
np.random.seed(0)  # make the shuffle repeatable
indices = np.random.permutation(len(X))  # shuffle the sample indices
iris_X_train = X[indices[:-ntest]]
iris_Y_train = Y[indices[:-ntest]]
iris_X_test = X[indices[-ntest:]]  # hold back the last ntest samples for testing
iris_Y_test = Y[indices[-ntest:]]
Classifier: Fit Model to Data
from sklearn.neighbors import KNeighborsClassifier
# Classify each point by a vote among its 5 nearest neighbours, using Minkowski distance.
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski')
knn.fit(iris_X_train, iris_Y_train)
Classifier: Check your model
predicted_classes = knn.predict(iris_X_test)
print('kNN predicted classes: {}'.format(predicted_classes))
print('Real classes: {}'.format(iris_Y_test))
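Not on the slide, but handy: Scikit-Learn classifiers have a score() method that reports accuracy on a holdout set.

print('Accuracy: {}'.format(knn.score(iris_X_test, iris_Y_test)))  # fraction of test samples predicted correctly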
Clustering: Find groups in your data
Clustering: get your data
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
Y = iris.target
print("Xs: {}".format(X))
Clustering: Fit model to data
from sklearn import cluster
k_means = cluster.KMeans(n_clusters=3)  # hyperparameter: look for 3 groups
k_means.fit(iris.data)
Clustering: Check your model
print("Generated labels: n{}".format(k_means.labels_))
print("Real labels: n{}".format(Y))
Dimensionality Reduction
Dimensionality reduction: Get your data
Dimensionality reduction: Fit model to data
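The code for these two slides isn't in the transcript; a minimal sketch, assuming PCA on the digits dataset that the editor's notes mention:

from sklearn import datasets, decomposition
digits = datasets.load_digits()              # 64 features per sample (8x8 pixel images)
pca = decomposition.PCA(n_components=2)      # hyperparameter: keep 2 dimensions
X_reduced = pca.fit_transform(digits.data)   # fit model to data, then transform the data
print(X_reduced.shape)                       # (1797, 2): now plottable in two dimensions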
Recap: Choosing an Algorithm
Have: data and expected outputs
Want numbers? Try regression algorithms
Want classes? Try classification algorithms
Have: just data
Want to find structure? Try clustering algorithms
Want to look at it? Try dimensionality reduction
Model Validation
How well does the model fit new data?
“Holdout sets”:
split your data into training and test sets
learn your model with the training set
get a validation score for your test set
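The classifier example earlier built its holdout set by hand with np.random.permutation; Scikit-Learn also ships a helper for this. A sketch, assuming the iris X and Y from the classification section:

from sklearn.model_selection import train_test_split
# Hold out 10 samples for testing; random_state makes the split repeatable.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=10, random_state=0)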
Models are rarely perfect… you might have to change the parameters or the model
● underfitting: the model isn’t complex enough to fit the training data
● overfitting: the model is too complex: it fits the training data well, but does badly on test data
Overfitting and underfitting
The Confusion Matrix
True positive
False positive
False negative
True negative
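Scikit-Learn can compute this table directly. A sketch using the iris predictions from the classifier section:

from sklearn import metrics
# Rows are the real classes, columns the predicted classes; the diagonal holds correct predictions.
print(metrics.confusion_matrix(iris_Y_test, predicted_classes))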
Test Metrics
Precision:
of all the results the classifier marked “true”, how many really were “true”?
Precision = tp / (tp + fp)
Recall:
how many of the things that were really “true” did the classifier mark as “true”?
Recall = tp / (tp + fn)
F1 score:
harmonic mean of precision and recall
F1_score = 2 * precision * recall / (precision + recall)
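A worked example with made-up counts, to show the arithmetic:

tp, fp, fn = 8, 2, 4                                # hypothetical counts
precision = tp / (tp + fp)                          # 8/10 = 0.8
recall = tp / (tp + fn)                             # 8/12 ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.727
print(precision, recall, f1)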
Iris classification: metrics
from sklearn import metrics
print(metrics.classification_report(iris_Y_test, predicted_classes))
Exercises
Explore some algorithms
Notebooks 6.x contain examples of machine learning algorithms. Run them,
play with the numbers in them, break them, think about why they might have
broken.


Editor's Notes

  • #5: What you’re learning isn’t the data, but a model that will help you understand (and possibly also explain) it.
  • #6: We bother making models because we want to start asking questions, and (hopefully) making changes in our world. Image from http://www.rosebt.com/blog/descriptive-diagnostic-predictive-prescriptive-analytics
  • #7: AKA import-instantiate-fit-predict. A hyperparameter is something like “how many clusters of data do I think there are in this dataset?”
  • #8: Lots of great tutorials on http://scikit-learn.org/stable/. You import from this library, which is called “sklearn” in Python code.
  • #9: Iris image from Nociveglia, https://www.flickr.com/photos/40385177@N07/.
  • #11: Supervised versus unsupervised learning: supervised = give the algorithm both input data and the answers for that data (kinda like teaching), and it learns the connection between data and answers; unsupervised = give the algorithm just the data, and it finds the structure in that data Semi-supervised learning (where you only have a few answers) does exist, but isn’t talked about much. There’s also reinforcement learning, where you know if a result is better or worse, but not how much it’s better or worse.
  • #12: Fit a line to a set of datapoints. Use that line to predict new values
  • #13: This will give you 40 random samples around the line y = 3x + 7. NumPy’s rand() selects from a uniform distribution; randn() selects from a standard normal distribution.
  • #14: Note the hyperparameter (fit_intercept). This says that your model doesn’t start at (0,0).
  • #15: predicted_slope = model.coef_ predicted_intercept = model.intercept_
  • #16: 1-feature linear regression on the Diabetes dataset. This is where you need to change your model. In this case, you’d start by trying more features, then adapting the model hyperparameters (e.g. it might not be a straight line that you need to fit) or the model that you use (e.g. linear regression might not be the best model type to use on this dataset).
  • #17: When there are just two classes, it’s called binary classification.
  • #18: Classification: finding the link between data and classes. This is the Iris dataset. It’s one of Scikit-learn’s example datasets.
  • #19: print("Targets: {}".format(iris['target_names'])) print("Target data: {}".format(Y)) print("Features: {}".format(iris['feature_names'])) print("Feature data: {}".format(X))
  • #20: Why do we split into training and test sets? This is called a “holdout” set… we save some of our data, so we can use it to check how well our classifier does on data it hasn’t seen before. print('{} training points, {} test points'.format(len(iris_X_train), len(iris_X_test)))
  • #21: This is the k nearest neighbours algorithm. For every new datapoint, it looks at the N nearest datapoints it has classifications for, and assigns the new datapoint the class that’s most common amongst them. Here, we’re using 5 neighbours. We’re also using the Minkowski distance (https://machinelearning1.wordpress.com/2013/03/25/three-famous-metrics-manhattan-euclidean-minkowski/): this tells the algorithm how to compute the distance between two points, so we can define which points are ‘closest’. Common distance metrics you’ll see in machine learning include: Manhattan, or “city block”, distance: add the distance along the x axis to the distance along the y axis (“city block” because that’s how you navigate in Manhattan); Euclidean distance: the straight-line distance between the two points (e.g. sqrt(x^2 + y^2)); Minkowski distance: a generalisation that includes both Manhattan (p=1) and Euclidean (p=2) distance as special cases.
  • #27: This is the digits example dataset.
  • #28: This is all in notebook 6.5
  • #30: There’s no “best” algorithm for every problem. This is also known as the “no free lunch” theorem. If you have data and an estimate of better/worse: reinforcement learning. There are lots of variants on these algorithms; the Scikit-Learn cheat sheet will help you choose between them: http://scikit-learn.org/stable/tutorial/machine_learning_map/
  • #32: Overfitting: matches the training data well, performs badly on new data… has high variance. Underfitting: doesn’t match the training data well, might perform well on new data… has high bias. Bias/variance tradeoff: adjust your hyperparameters until the model performs well on the test data. See e.g. http://scott.fortmann-roe.com/docs/BiasVariance.html
  • #33: This is all about your parameters, e.g. the difference between fitting a straight line, a quadratic curve or an n-dimensional curve. Figures from Jake VanderPlas’s Python Data Science Handbook. We’ll talk about the bias-variance tradeoff later.
  • #34: False positive is also known as a “type 1 error”; false negative is also known as a “type 2 error”.
  • #35: These numbers are always between 0 and 1. If you want to play with F1, try it in Python, e.g.: import numpy as np p = np.array([.25, .25, .125, .5, .75]) r = np.array([.001, .10, .7, .9, .3]) 2*p*r / (p + r)
  • #36: Support: how many things that are actually this class did we use to calculate these metrics? Precision: of all the “true” results, how many were actually “true”? Recall: how many of the things that were really “true” were marked as “true” by the classifier? F1: combination of precision and recall