Python Machine Learning
using the Scikit-Learn package
Dr. Sarwan Singh
Agenda
• Introduction (SciKit-Learn Toolkit)
• History, contributors
• Data representation in Machine Learning
• Supervised learning example
• Classification model
• Machine Learning Project
using Iris dataset
Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning
Machine learning is a branch of computer science that studies the design of algorithms that can learn.
History
• Scikit-learn was originally authored by data scientist David Cournapeau in 2007
• Google Summer of Code Project
• This project was started in 2007 as a Google Summer of Code project by David
Cournapeau. Later that year, Matthieu Brucher started work on this project as
part of his thesis.
• In 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent
Michel of INRIA took leadership of the project and made the first public
release on February 1st, 2010.
• Since then, several releases have appeared on a roughly three-month cycle, and a
thriving international community has been leading the development.
• Of the various scikits, scikit-learn as well as scikit-image were described as
"well-maintained and popular" in November 2012
Introduction
• Machine learning library written
in Python
• Simple and efficient, for both
experts and non-experts
• Classical, well-established
machine learning algorithms
• BSD 3-clause license
• Characterized by a clean, uniform, and streamlined API
• Community-driven development
• ~20 core developers (mostly researchers)
• 500+ occasional contributors
• All working publicly
together on GitHub
• Emphasis on keeping the project
maintainable
• Style consistency
• Unit-test coverage
• Documentation and examples
• Code review
Pandas → NumPy → Scikit-Learn workflow
• Start with CSV
• Convert to Pandas DataFrame
• Slice and dice in Pandas
• Convert to NumPy array to feed to Scikit-Learn
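A minimal sketch of this workflow (the CSV file name and the column/value names are illustrative assumptions):

import pandas as pd

df = pd.read_csv('iris.csv')                    # start with CSV
setosa = df[df['species'] == 'setosa']          # slice and dice in Pandas
X = setosa.drop(columns='species').to_numpy()   # NumPy array to feed to Scikit-Learn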
Additional web resources:
• UCI Machine Learning Dataset Repository: The University of California at Irvine (UCI) maintains an online
repository of machine learning datasets (at the time of writing, it lists 233 datasets).
The repository is available online: http://archive.ics.uci.edu/ml/
• https://github.com/rasbt/pattern_classification/blob/master/resources/machine_learning_ebooks.md
Data Representation in Scikit-Learn
• Machine learning is about creating models from data
• The best way to think about data within Scikit-Learn is in terms of
tables of data.
• Data as table : A basic table is a two-dimensional grid of data, in
which the rows represent individual elements of the dataset, and the
columns represent quantities related to each of these elements.
• E.g. the Iris dataset, famously analyzed by Ronald Fisher in 1936.
This dataset can be downloaded in the form of a Pandas DataFrame
using the Seaborn library, as shown below
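For example, a minimal load via Seaborn's built-in dataset loader:

import seaborn as sns

iris = sns.load_dataset('iris')   # returns a Pandas DataFrame
print(iris.head())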
Layman’s view of Machine Learning
• Loading the dataset.
• Summarizing the dataset.
• Visualizing the dataset.
• Evaluating some algorithms.
• Making some predictions.
Basics of the Scikit-Learn estimator API
1. Choose a class of model by importing the appropriate estimator class
from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with desired
values.
3. Arrange data into a features matrix and target vector
4. Fit the model to your data by calling the fit() method
of the model instance.
5. Apply the model to new data:
• For supervised learning, often we predict labels for unknown data using the
predict() method.
• For unsupervised learning, we often transform or infer properties of the data
using the transform() or predict() method
Supervised learning example: Simple linear regression
• Let's learn with an example: the common case of fitting a line to (x, y) data.
import matplotlib.pyplot as plt
import numpy as np
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
plt.scatter(x, y);
1. Choose a class of model. - In Scikit-Learn, every class of model is
represented by a Python class.
from sklearn.linear_model import LinearRegression
• Once the model class is selected, hyperparameters are selected.
Supervised learning example:
Simple linear regression
2. Choose model hyperparameters. An important point is that a class of
model is not the same as an instance of a model.
• hyperparameters are parameters that must be set before the model
is fit to data
• In Scikit-Learn, hyperparameters are chosen by passing values at
model instantiation.
model = LinearRegression(fit_intercept=True)
The instantiated model then becomes:
LinearRegression( copy_X=True, fit_intercept=True,
n_jobs=1, normalize=False)
• The model is not yet applied to any data: the Scikit-Learn API makes
very clear the distinction between choice of model and application of
model to data.
Supervised learning example: Simple linear regression
3. Arrange data into a features matrix and target vector.
• Make a two-dimensional features matrix (X) and
a one-dimensional target array (y)
• The target variable y is already in the correct form (a length-n_samples
array)
• Make the data x into a matrix of size [n_samples, n_features].
X = x[:, np.newaxis]
X.shape   # output: (50, 1)
Supervised learning example: Simple linear regression
Earlier state :
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
4. Fit the model to your data.
• apply model to data using fit() method
model.fit(X, y)
Final: LinearRegression( copy_X=True,
fit_intercept=True, n_jobs=1, normalize=False)
• The fit() command causes a number of model-dependent internal
computations to take place, and the results of these computations
are stored in model-specific attributes
• In Scikit-Learn, by convention, all model parameters that were
learned during the fit() process have trailing underscores
Supervised learning example: Simple linear regression
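For example, the learned slope and intercept of the fitted model above can be inspected through these attributes:

print(model.coef_)       # learned slope, close to [2.]
print(model.intercept_)  # learned intercept, close to -1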
4. Fit the model to your data (contd.)
• The two learned parameters, model.coef_ and model.intercept_, represent
the slope and intercept of the simple linear fit to the data. In our data
definition, they are very close to the input slope of 2 and intercept of −1
• In general, Scikit-Learn does not provide tools to draw conclusions
from internal model parameters themselves: interpreting model
parameters is much more a statistical modeling question than a
machine learning question.
• Machine learning focuses instead on what the model predicts.
Supervised learning example: Simple linear regression
5. Predict labels for unknown data.
• Once the model is trained, the main task of supervised machine
learning is to evaluate it based on what it says about new data that
was not part of the training set.
• In Scikit-Learn, the predict() method is used.
xfit = np.linspace(-1, 11)
# coerce x values into a [n_samples, n_features] features matrix
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)
# visualize the result
plt.scatter(x, y)
plt.plot(xfit, yfit);
Supervised learning example: Simple linear regression
What makes up a classification model?
• The structure of the model: In this, we use a threshold on a single
feature.
• The search procedure: In this, we try every possible combination of
feature and threshold.
• The loss function: Using the loss function, we decide which of the
possibilities is less bad (because we can rarely talk about the perfect
solution). We can use the training error or just define this point the
other way around and say that we want the best accuracy.
• Traditionally, we want the loss function to be minimized (a sketch of this model and its search follows).
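A minimal sketch of such a threshold model and its exhaustive search procedure; the names are illustrative, and features is assumed to be an [n_samples, n_features] NumPy array with a boolean labels array:

import numpy as np

def train_threshold_model(features, labels):
    # Exhaustive search: try every (feature, threshold) pair and keep the
    # one whose predictions have the lowest training error (highest accuracy).
    best_acc, best_feature, best_threshold = 0.0, None, None
    for fi in range(features.shape[1]):
        for t in features[:, fi]:          # candidate thresholds taken from the data
            pred = features[:, fi] > t     # predict positive above the threshold
            acc = np.mean(pred == labels)  # training accuracy of this candidate
            if acc > best_acc:
                best_acc, best_feature, best_threshold = acc, fi, t
    return best_feature, best_threshold, best_acc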
• Alternatively, we might have different loss functions. It might be that
one type of error is much more costly than another. In a medical
setting, false negatives and false positives are not equivalent.
• A false negative (when the result of a test comes back negative, but
that is false) might lead to the patient not receiving treatment for a
serious disease.
• A false positive (when the test comes back positive even though the
patient does not actually have that disease) might lead to additional
tests for confirmation purposes or unnecessary treatment (which can
still have costs, including side effects from the treatment).
• With spam filtering, we may face the same problem; incorrectly
deleting a non-spam e-mail can be very dangerous for the user, while
letting a spam e-mail through is just a minor annoyance.
• What the cost function should be is always dependent on the exact
problem you are working on.
• When we present a general-purpose algorithm, we often focus on
minimizing the number of mistakes (achieving the highest accuracy).
• However, if some mistakes are more costly than others, it might be
better to accept a lower overall accuracy to minimize overall costs.
Features and feature engineering
• Choosing the features is a general area normally termed feature engineering; it is
sometimes seen as less glamorous than algorithms, but it may matter
more for performance (a simple algorithm on well-chosen features
will perform better than a fancy algorithm on not-so-good features).
• Feature selection: choosing which of the available features to use.
First Machine Learning
Project using Iris dataset
The "hello world" program of machine learning:
"classification of iris flowers"
Iris virginica · Iris setosa · Iris versicolor
Question
• After looking at a new flower in the field,
could we make a good prediction about
its species from its measurements?
Iris dataset
• The Iris dataset is a classic dataset from the 1930s; it is
one of the first modern examples of statistical
classification.
• The setting is that of Iris flowers, of which there
are multiple species that can be identified by their
morphology.
• Today, the species would be defined by their genomic
signatures, but in the 1930s, DNA had not even been
identified as the carrier of genetic information.
• The following four attributes of each plant were
measured:
• Sepal length , Sepal width, Petal length, Petal width
Iris dataset
• Generally, we can use any measurement from our data as features.
• This is the supervised learning or classification problem; given labeled
examples, we can design a rule that will eventually be applied to
other examples.
• Other modern application examples of pattern classification: Optical
Character Recognition (OCR) in the post office, spam filtering in our
email clients (spam messages vs. "ham" {= not-spam} messages),
barcode scanners in the supermarket, etc.
Hello World of Machine Learning with Iris
• The best small project to start with on a new tool is the classification of
iris flowers. Why the iris dataset?
• Attributes are numeric so you have to figure out how to load and
handle data.
• It is a classification problem, allowing us to practice with perhaps an
easier type of supervised learning algorithm.
• It is a multi-class classification problem (multi-nominal) that may
require some specialized handling.
• It only has 4 attributes and 150 rows, meaning it is small and easily
fits into memory (and a screen or A4 page).
• All of the numeric attributes are in the same units and the same
scale, not requiring any special scaling or transforms to get started.
Iris Dataset
• Iris dataset contains 150
observations of iris
flowers.
• Has four columns of
measurements of the
flowers in centimeters.
• The fifth column is the
species of the flower
observed.
• All observed flowers
belong to one of three
species.
Inputs from: machinelearningmastery, Google, Kaggle, etc.
Summarize dataset
• Take a statistical summary using
describe().
• Group the rows/records based
on class of flower, using
irisDataframe.groupby('class').size(). A sketch follows.
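A minimal sketch of this summary; the CSV file name and column names are illustrative assumptions matching the irisDataframe naming used on this slide:

import pandas as pd

# hypothetical CSV with the column names used in this deck
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
irisDataframe = pd.read_csv('iris.csv', names=names)

print(irisDataframe.shape)                     # (150, 5)
print(irisDataframe.describe())                # count, mean, std, min, quartiles, max
print(irisDataframe.groupby('class').size())   # 50 observations per species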
Data Visualization
Two types of plots:
• Univariate plots to better understand
each attribute.
• Multivariate plots to better understand
the relationships between attributes.
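Continuing the irisDataframe sketch above, univariate plots for each attribute can be drawn with standard Pandas plotting calls:

import matplotlib.pyplot as plt

# box-and-whisker plot per attribute
irisDataframe.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False)
# histogram per attribute
irisDataframe.hist()
plt.show()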
Multivariate plots
• scatterplots of all pairs of
attributes.
• It is helpful to spot structured
relationships between input
variables
• The diagonal grouping of some
pairs of attributes suggests a high
correlation and a predictable
relationship
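A scatter-plot matrix of all attribute pairs can be produced with Pandas (continuing the irisDataframe sketch):

from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

scatter_matrix(irisDataframe)   # scatterplots of all pairs of attributes
plt.show()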
Create a Validation Dataset
Split the loaded dataset into two:
• 80%, which we will use to train our models, and
• 20%, which we will hold back as a validation dataset.
This gives training data in X_train and Y_train for preparing models, and
hold-out data in X_validation and Y_validation for later evaluation. A sketch follows.
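A sketch of this split with Scikit-Learn, assuming the features matrix X and target vector y have been arranged as on the next slide; the fixed random_state keeps the split reproducible:

from sklearn.model_selection import train_test_split

# hold back 20% of the data as a validation set
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1)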
Arranging data into a features matrix and target vector
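A minimal sketch of this step, assuming the DataFrame layout described earlier (four measurement columns followed by the class column):

array = irisDataframe.values          # DataFrame -> NumPy array
X = array[:, 0:4].astype(float)       # features matrix: the four measurements
y = array[:, 4]                       # target vector: the species labels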
K-fold cross validation
• Cross-validation, sometimes called rotation
estimation or out-of-sample testing, is any
of various similar model validation
techniques for assessing how the results of
a statistical analysis will generalize to an
independent data set.
Source: wikipedia.org
• Mainly used in settings where the goal is prediction, and one wants to
estimate how accurately a predictive model will perform in practice.
• In a prediction problem, a model is usually given a dataset of known data on
which training is run (training dataset), and a dataset of unknown data (or
first seen data) against which the model is tested (called the validation
dataset or testing set).
• The goal of cross-validation is to test the model's ability to predict new data
that were not used in estimating it, in order to flag problems like overfitting.
A minimal illustration follows.
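A minimal illustration of K-fold splitting (5 folds here purely for illustration; X and y as arranged earlier):

from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
for train_index, test_index in kfold.split(X):
    # each iteration trains on 4 folds and tests on the remaining held-out fold
    X_tr, X_te = X[train_index], X[test_index]
    y_tr, y_te = y[train_index], y[test_index]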
Test Harness
• Use 10-fold cross-validation to estimate accuracy.
• This will split the dataset into 10 parts, train on 9 and test on 1, and
repeat for all combinations of train-test splits.
• Use the 'accuracy' metric to evaluate models.
• This is the number of correctly predicted instances divided by the total
number of instances in the dataset, multiplied by 100 to give a percentage
(e.g. 95% accurate).
Evaluate 6 different algorithms:
• Logistic Regression (LR)
• Linear Discriminant Analysis (LDA)
• K-Nearest Neighbours (KNN).
• Classification and Regression Trees
(CART).
• Gaussian Naive Bayes (NB).
• Support Vector Machines (SVM).
This is a good mix of simple linear
(LR and LDA) and nonlinear
(KNN, CART, NB and SVM) algorithms.
To ensure the results are directly
comparable, reset the random number
seed before each run so that each
algorithm is evaluated using exactly
the same data splits. A sketch of the harness follows.
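A sketch of this evaluation: the model list follows the slide, hyperparameters are Scikit-Learn defaults except max_iter, raised here so Logistic Regression converges, and variable names follow the earlier train/validation split:

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = [
    ('LR', LogisticRegression(max_iter=1000)),
    ('LDA', LinearDiscriminantAnalysis()),
    ('KNN', KNeighborsClassifier()),
    ('CART', DecisionTreeClassifier()),
    ('NB', GaussianNB()),
    ('SVM', SVC()),
]

results, names = [], []
for name, model in models:
    # same seed for each run -> identical 10-fold splits for every algorithm
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    scores = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(scores)
    names.append(name)
    print('%s: %.3f (%.3f)' % (name, scores.mean(), scores.std()))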
Compare algorithms
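The collected per-fold accuracies can be compared visually with a box plot, one box per algorithm, using the results and names lists from the sketch above:

import matplotlib.pyplot as plt

plt.boxplot(results, labels=names)
plt.title('Algorithm Comparison')
plt.show()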
Fit the model to your data
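A sketch of the final step: fit one model on the full training split and check it against the held-back validation set. The deck does not state which model was chosen, so KNN here is an illustrative assumption:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

final_model = KNeighborsClassifier()            # illustrative choice of final model
final_model.fit(X_train, Y_train)               # fit on the 80% training split
predictions = final_model.predict(X_validation) # predict the held-back 20%

print(accuracy_score(Y_validation, predictions))
print(classification_report(Y_validation, predictions))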