Dr. Parinaz Ameri
Intro to Machine Learning
for non-Data Scientists
Agenda
● 1.5 hours: Introduction to ML algorithms
● 1.5 hours: Implementing algorithms for different use-cases
● 1 hour: Working on a recommendation mini-project
Machine Learning in Daily Life
Source: [xkcd_1838]
Machine Learning Definition
Arthur Samuel (1959):
“Field of study that gives computers the ability to learn without being explicitly
programmed.” [ML_Awad]
Source: [fortune]
Email Spam Filter
A Machine Learning Model
Machine Learning Definition
Tom Mitchell (1997):
“A computer program is said to learn from experience E with respect to some class of tasks
T and performance measure P if its performance at tasks in T, as measured by P, improves
with experience E.” [ML_Mitchell]
E, T and P in a Spam Filter Example
● Task T:
○ Classify emails as Spam or Ham.
● Experience E:
○ Watching you label emails as Spam or Ham.
● Performance measure P:
○ The number (or fraction) of emails that are correctly classified as Spam or Ham.
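The three parts of Mitchell's definition can be made concrete with a very small spam filter. This is only a sketch: the emails are made up, and the choice of word counts with a Naive Bayes classifier from scikit-learn is an assumption for illustration, not something the slides prescribe.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Experience E: emails that a user has already labeled (made-up examples).
train_texts = [
    "win free money now",
    "claim your free prize now",
    "meeting at noon tomorrow",
    "lunch with the team tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Features: how often each word occurs in each email.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Task T: classify emails as spam or ham.
model = MultinomialNB()
model.fit(X_train, train_labels)

# Performance measure P: fraction of unseen emails classified correctly.
test_texts = ["free money prize", "team meeting tomorrow"]
test_labels = ["spam", "ham"]
predictions = list(model.predict(vectorizer.transform(test_texts)))
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

With more labeled emails (more experience E), the accuracy on unseen emails (P) would be expected to improve, which is what "learning" means in this definition.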
Machine Learning Definition
Peter Flach (2012):
“Machine learning is the systematic study of algorithms and systems that improve their
knowledge or performance with experience.” [ML_Flach]
Source: [towardsdatascience]
Machine Learning Main Ingredients
1. Tasks:
○ An abstract representation of a problem we want to solve regarding the domain objects
2. Models:
○ Many tasks can be represented as a mapping from data points to outputs; this mapping is the model.
○ A model is produced as the output of a machine learning algorithm applied to training data.
3. Features:
○ Define a language in which we describe the relevant objects in our domain.
Source: [ML_Flach]
Machine Learning Main Ingredients
Machine Learning Pipeline
[Figure: pipeline stages shown so far: Data Preparation, Training Data, Test Data, Feature Selection]
Source: [Medium]
Machine Learning Pipeline
[Figure: pipeline stages shown so far: Data Preparation, Training Data, Test Data, Feature Selection, ML Algorithm Selection]
Tasks & Learning Algorithms
● Supervised Learning
○ Regression
○ Classification
● Unsupervised Learning
○ Clustering
● Reinforcement Learning
● Recommendation systems
Supervised Learning Algorithms
Data is Labeled = Right Answers are Given
Housing Price Prediction
Regression: predict a continuous-valued output
Breast Cancer (Malignant, Benign)
Classification: predict a discrete-valued output (0, 1)
Features in Classification
Other Features:
- Clump thickness
- Uniformity of cell size
- Uniformity of cell shape
- ...
Exercise 1
Should you treat the following problems with regression or classification?
Problem 1: You want to develop a learning algorithm to examine individual customer accounts
and determine if each account has been hacked.
Problem 2: You have a huge list of identical items and want to predict how many of
them will be sold over the next 3 months.
Unsupervised Learning Algorithms
Data is Not Labeled
[Figure: Supervised Learning vs. Unsupervised Learning, each plotted over features X1 and X2]
Clustering
Clustering in Biology
Source: [researchgate]
More Clustering Applications
Social Network Analysis
Organizing Computing Clusters
Market Segmentation
Exercise 2
Which of the following problems would you address with Unsupervised Learning
algorithms?
1. Given a dataset of patients diagnosed as either having diabetes or not, learn
to classify new patients as having diabetes or not.
2. Given a database of customer data, automatically discover market segments
and group customers into different market segments.
3. Given a dataset of news articles found on the web, group them into sets of
articles about the same story.
4. Given emails labeled as spam/ham, learn a spam filter.
Example of Supervised learning
Source:[radimrehurek]
Machine Learning Pipeline
[Figure: pipeline stages shown so far: Data Preparation, Training Data, Test Data, Feature Selection, ML Algorithm Selection, Building a model]
Models
Learning setting | Predictive model | Descriptive model
Supervised learning | Classification, Regression | Subgrouping
Unsupervised learning | Predictive clustering | Clustering, Association Rule discovery
Model Types
● Geometric
● Probabilistic
● Logical
Building a Linear Regression Model
Mean Squared Error (MSE): measures the average of the squares of the errors
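As a rough illustration of the MSE, the sketch below fits a straight line with NumPy and averages the squared differences between observed and predicted values. The house sizes and prices are invented, and `mean_squared_error` here is a helper defined in the example, not a library function.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical house sizes (m^2) and prices (in thousands); not from the slides.
sizes = np.array([50.0, 80.0, 100.0, 120.0])
prices = np.array([150.0, 240.0, 310.0, 350.0])

# Least-squares straight line: price is approximated by slope * size + intercept.
slope, intercept = np.polyfit(sizes, prices, deg=1)
predicted = slope * sizes + intercept

mse = mean_squared_error(prices, predicted)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, MSE={mse:.2f}")
```

Linear regression picks the line that makes this number as small as possible on the training data, so any other line (for example `3.0 * sizes`) gives an MSE that is at least as large.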
Machine Learning Pipeline
[Figure: pipeline stages shown so far: Data Preparation, Training Data, Test Data, Feature Selection, ML Algorithm Selection, Building a model, Model Evaluation]
Model Validation
● Goodness of fit (fit error)
● Goodness of prediction (prediction error): generalization error
Overfitting: an unnecessary increase in model complexity
Underfitting: a model that is too simple will not fit the data properly
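One way to see both effects is to fit polynomials of increasing degree to the same noisy data and compare the fit error with the prediction error. The sketch below assumes synthetic data (a quadratic trend plus noise) and uses polynomial degree as a stand-in for model complexity; the degrees 0, 2 and 9 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a quadratic trend plus noise.
x_train = np.linspace(-1.0, 1.0, 12)
y_train = x_train ** 2 + rng.normal(scale=0.1, size=x_train.size)
x_test = np.linspace(-0.95, 0.95, 12)
y_test = x_test ** 2 + rng.normal(scale=0.1, size=x_test.size)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

results = {}
for degree in (0, 2, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    fit_error = mse(y_train, np.polyval(coeffs, x_train))        # goodness of fit
    prediction_error = mse(y_test, np.polyval(coeffs, x_test))   # goodness of prediction
    results[degree] = (fit_error, prediction_error)
    print(f"degree {degree}: fit error {fit_error:.4f}, "
          f"prediction error {prediction_error:.4f}")
```

The fit error can only go down as the degree grows. The prediction error typically does not follow it: the constant model (degree 0) is too simple for a curved trend, while a very flexible model tends to chase the noise in the training points.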
k-Fold Cross Validation
k=4 Cross Validation
Source: [wiki]
Mean Squared Prediction Error: computed on q data points that were not used in estimating the model
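The k=4 procedure can be sketched by hand with NumPy: split the data into four folds, estimate the model on three of them, and compute the mean squared prediction error on the held-out fold. The data are synthetic and the straight-line model is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (illustrative assumption): a noisy straight line.
x = np.linspace(0.0, 10.0, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

k = 4
indices = rng.permutation(x.size)     # shuffle before splitting
folds = np.array_split(indices, k)    # k disjoint groups of indices

fold_errors = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    # Estimate the model on the other k-1 folds only.
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)

    # Mean squared prediction error on the q held-out points.
    residuals = y[test_idx] - (slope * x[test_idx] + intercept)
    fold_errors.append(float(np.mean(residuals ** 2)))

cv_error = float(np.mean(fold_errors))
print(fold_errors, cv_error)
```

Every data point is used for testing exactly once, and the average over the four folds gives an estimate of the generalization error. scikit-learn offers ready-made helpers for the same procedure in its `model_selection` module.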
Machine Learning Pipeline
[Figure: complete pipeline: Data Preparation, Training Data, Test Data, Feature Selection, ML Algorithm Selection, Building a model, Model Evaluation, New Data, Prediction Result]
Get your hands dirty
Source: [karlstratos]
Installing Docker with the Anaconda image
1. Install Docker with:
> sudo apt install docker.io
2. Add your current user to the docker group with the following command:
> sudo usermod -a -G docker $USER
3. Restart your computer
4. Register and proceed at https://hub.docker.com/_/anaconda
5. Download the Anaconda Docker image with the following command:
> docker pull continuumio/anaconda
6. Run the container:
> docker run -i -t continuumio/anaconda /bin/bash
7. Test your conda environment:
(base) root@9b9e483ba80e:/opt/conda# conda info
Running Jupyter Notebook
Run the following command in one line from the host machine:
> docker run -i -t -p 8888:8888 continuumio/miniconda /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && mkdir /opt/notebooks && /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks --ip=0.0.0.0 --port=8888 --no-browser --allow-root"
- Open your Notebook in the browser
- Open a terminal and install numpy, pandas, matplotlib, scipy and scikit-learn (the package that is imported as sklearn)
Local Download server
172.90.0.161
Python Libraries for Machine Learning
● NumPy (http://www.numpy.org/):
○ Introduces objects for multidimensional arrays and matrices
○ Provides vectorization of mathematical operations on arrays and matrices
● SciPy (https://www.scipy.org/scipylib/):
○ Collection of algorithms for linear algebra, statistics, optimization, etc.
○ Built on NumPy
● Pandas (http://pandas.pydata.org/):
○ Provides tools for data manipulation and handling missing data
● scikit-learn (https://scikit-learn.org/stable/):
○ Provides machine learning algorithms: classification, regression, clustering, model validation, etc.
● Matplotlib (https://matplotlib.org/):
○ Python 2D plotting library
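"Vectorization" in NumPy means that arithmetic is written once for a whole array instead of in an explicit loop. A minimal sketch with made-up numbers:

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])

# Element-wise arithmetic on whole arrays, with no explicit Python loop.
revenue = prices * quantities     # each price times its quantity
discounted = prices * 0.9         # the scalar is applied to every element

# A 2-D array (matrix) and a matrix-vector product.
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
vector = np.array([1.0, 1.0])
product = matrix @ vector         # sums of the rows of the matrix

print(revenue, discounted, product, matrix.shape)
```

Pandas and scikit-learn build on these arrays, which is why a DataFrame can hand its contents over as a NumPy array.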
Pandas DataFrame Data Types
Pandas type | Python native type | Description
object | str | The most general dtype. Will be assigned to your column if it contains mixed types (numbers and strings).
int64 | int | Integers. 64 refers to the number of bits allocated to hold each value.
float64 | float | Numbers with decimals. If a column contains numbers and NaNs (see Handling Missing Values), pandas will default to float64, in case your missing value has a decimal.
datetime64[ns], timedelta64[ns] | N/A (but see the datetime module in Python's standard library) | Values meant to hold time data. Look into these for time series experiments.
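The dtype rules in the table can be checked on a small data frame. The column names and values below are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "visits": [1, 2, 3],                 # whole numbers
    "price": [9.99, None, 4.50],         # decimals with a missing value
    "units": [1, None, 3],               # whole numbers with a missing value
    "mixed": ["a", 1, "b"],              # strings and numbers together
    "when": pd.to_datetime(["2019-04-24", "2019-04-25", "2019-04-26"]),
})

# One dtype per column.
print(df.dtypes)
```

Note that `units` ends up as a floating-point column even though it only holds whole numbers: the missing entry is stored as NaN, which is a float.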
DataFrame Attributes
df.attribute | description
dtypes | list the types of the columns
columns | list the column names
axes | list the row labels and column names
ndim | number of dimensions
size | number of elements
shape | return a tuple representing the dimensionality
values | NumPy representation of the data
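A short sketch of these attributes on a small, made-up data frame (the workshop dataset is not reproduced here):

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara"],
    "age": [34, 29, 41],
    "salary": [52000.0, 48000.0, 61000.0],
})

print(df.shape)          # (number of rows, number of columns)
print(df.size)           # rows * columns
print(df.ndim)           # number of dimensions
print(list(df.columns))  # column names
print(df.dtypes)         # type of each column
print(df.axes)           # [row labels, column names]
print(df.values)         # NumPy representation of the data
```

Attributes are read without parentheses, unlike methods such as `head()`.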
Exercise with DataFrame Attributes
1. How many records does this data frame have?
2. How many elements are there?
3. What are the column names?
4. What types of columns do we have in this data frame?
DataFrame Methods
df.method() | description
head([n]), tail([n]) | first/last n rows
describe() | generate descriptive statistics (for numeric columns only)
max(), min() | return max/min values for all numeric columns
mean(), median() | return mean/median values for all numeric columns
std() | standard deviation
sample([n]) | returns a random sample of the data frame
dropna() | drop all the records with missing values
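The methods can be tried on a small example. The data frame below is invented; `numeric_only=True` is passed explicitly so that the text column is skipped, and `random_state=0` only makes the random sample repeatable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "salary": [1000.0, 2000.0, np.nan, 4000.0],
    "city": ["Bonn", "Cologne", "Bonn", "Mainz"],
})

print(df.head(2))                     # first 2 rows
print(df.tail(1))                     # last row
print(df.describe())                  # summary of the numeric columns
print(df.mean(numeric_only=True))     # mean of each numeric column
print(df.std(numeric_only=True))      # standard deviation of each numeric column
print(df.sample(2, random_state=0))   # 2 random rows
print(df.dropna())                    # rows without missing values

# Methods can be chained: mean of the first two records only.
first_two_mean = df.head(2).mean(numeric_only=True)
print(first_two_mean)
```

Chaining, as in the last step, is the usual way to combine a subset operation with a summary.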
Exercise with DataFrame Methods
1. Give the summary for the numeric columns in the dataset
2. Calculate standard deviation for all numeric columns
3. What are the mean values of the first 50 records in the dataset?
Hint: use head() method to subset the first 50 records and then calculate the mean
Handling Missing Values
● NaN ("Not a Number") marks missing values
● Often replaced by an arbitrarily chosen value, such as -1 in a feature with only positive numbers, 0, or the median (most common)
● But be aware that the data has then been changed
● Alternatively, ignore the sample or the feature with missing values
Missing Values in Pandas
● Missing values are excluded by the GroupBy method
● Many descriptive statistics methods have a ‘skipna’ option to control whether missing data should be excluded. It is set to True by default.
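Both behaviours can be observed directly. The values below are made up for the example.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])
mean_skipping = values.mean()             # skipna=True by default: NaN is ignored
mean_strict = values.mean(skipna=False)   # NaN, because one value is missing

df = pd.DataFrame({
    "group": ["a", "a", np.nan, "b"],
    "amount": [1.0, 2.0, 5.0, 4.0],
})
# The row whose group key is missing does not appear in the result.
totals = df.groupby("group")["amount"].sum()

print(mean_skipping, mean_strict)
print(totals)
```

The amount 5.0 silently drops out of the grouped totals, which is worth keeping in mind when group sums do not add up to the overall sum.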
Dealing with Missing Values in DF
df.method() | description
dropna() | Drop observations (rows) that contain any missing value
dropna(how='all') | Drop observations where all cells are NA
dropna(axis=1, how='all') | Drop a column if all of its values are missing
dropna(thresh=5) | Drop rows that contain fewer than 5 non-missing values
fillna(0) | Replace missing values with zeros
isnull() | Returns True if the value is missing
notnull() | Returns True for non-missing values
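A compact sketch of these calls on an invented data frame. A threshold of 2 is used instead of 5 because the example has only three columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

rows_complete = df.dropna()                    # empty here: column "c" is NaN in every row
rows_not_empty = df.dropna(how="all")          # drops only the all-NaN row (index 2)
cols_not_empty = df.dropna(axis=1, how="all")  # drops column "c"
rows_thresh = df.dropna(thresh=2)              # keeps rows with at least 2 non-missing values
filled = df.fillna(0)                          # every NaN becomes 0
missing_mask = df.isnull()                     # True where a value is missing
present_mask = df.notnull()                    # True where a value is present

print(rows_not_empty)
print(cols_not_empty)
print(rows_thresh)
print(missing_mask.sum())                      # number of missing values per column
```

None of these calls changes `df` itself; each returns a new object, so the original data stay available for comparison.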
Source: [Print_Lego]
Building a Linear Regression Model
Mean Squared Error (MSE): measures the average of the squares of the errors
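R-Squared
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
where $SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ and $SS_{tot} = \sum_i (y_i - \bar{y})^2$.
Here, $\hat{y}_i$ is the fitted value for observation $i$ and $\bar{y}$ is the mean of $Y$.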
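R-squared can be computed in a few lines once the fitted values are available. The sketch assumes the common definition R^2 = 1 - SS_res / SS_tot, where SS_res sums the squared differences between observed and fitted values and SS_tot sums the squared deviations from the mean of Y; the data points are invented.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot (assumed definition, see the text above)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)       # squared errors of the fitted values
    ss_tot = np.sum((y - y.mean()) ** 2)    # squared deviations from the mean of Y
    return float(1.0 - ss_res / ss_tot)

# Made-up observations that lie close to a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

print(f"R^2 = {r_squared(y, y_hat):.4f}")
```

A value near 1 means the fitted line explains almost all of the variation in Y; a model that always predicts the mean of Y scores exactly 0.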
k-Nearest Neighbors
Distance Measurements
KNN Algorithm
Accuracy
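The three ideas on these slides (a distance measurement, the KNN vote, and accuracy) fit into one short sketch. Euclidean distance and k=3 are assumptions chosen for the example, and the points and labels are made up; scikit-learn provides a full implementation in `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def knn_predict(X_train, y_train, x_new, k=3):
    """Label x_new by majority vote among its k nearest training points."""
    distances = [euclidean_distance(x, x_new) for x in X_train]
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up 2-D points in two well-separated groups.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_train = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

X_test = np.array([[1.2, 1.4], [6.8, 6.2]])
y_test = ["benign", "malignant"]

y_pred = [knn_predict(X_train, y_train, x, k=3) for x in X_test]
print(y_pred, accuracy(y_test, y_pred))
```

KNN has no training step beyond storing the data; all the work happens at prediction time, when the distances are computed.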
K-Means Clustering
K-Means Clustering Algorithm
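A minimal sketch of the standard K-Means iteration (often called Lloyd's algorithm), which alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. The data are made up, and the starting centroids are passed in by hand to keep the run repeatable; practical implementations such as scikit-learn's `KMeans` choose them automatically.

```python
import numpy as np

def k_means(X, initial_centroids, max_iter=100):
    """Alternate between assigning points and moving centroids until stable."""
    centroids = np.array(initial_centroids, dtype=float)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Update step: each centroid moves to the mean of its points
        # (a centroid without points stays where it is).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Unlabeled, made-up points that form two groups.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

labels, centroids = k_means(X, initial_centroids=[X[0], X[-1]])
print(labels)
print(centroids)
```

No labels are given to the algorithm; the group numbers it returns are arbitrary names for the clusters it discovered.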
Future Plans?
Further Learning
● Kaggle: the place to do data science projects
● Seeing Theory: a visual introduction to probability and statistics
● KDnuggets: Machine Learning, Data Science, Data Mining, Big Data, Analytics, AI, Software
Reading Recommendations
● Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Peter Flach
● Python for Data Analysis by Wes McKinney
● https://www.kdnuggets.com/2018/12/feature-engineering-explained.html
References
[ML_Awad] Awad, M., Khanna, R. (2015). Machine Learning. In: Efficient Learning Machines. Apress, Berkeley, CA.
[xkcd_1838] https://xkcd.com/1838/
[fortune] http://fortune.com/2018/06/25/ai-business-breakthrough/
[ML_Flach] Flach, P. (2012). Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press.
[ML_Mitchell] Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill. ISBN: 978-0-07-042807-2
[Medium_Sharma] https://medium.com/datadriveninvestor/how-to-built-a-recommender-system-rs-616c988d64b2
[karlstratos] http://karlstratos.com/drawings/drawings.html
[Print_Lego] https://www.pinterest.com/pin/422071796300372061/
[Medium] https://medium.com/@mehulved1503/feature-selection-and-feature-extraction-in-machine-learning-an-overview-57891c595e96
[researchgate] https://www.researchgate.net/figure/Hierarchical-clustering-of-the-181-genes-corresponding-to-zinc-biology-related-functional_fig6_26688269
References (2)
[radimrehurek] https://radimrehurek.com/data_science_python/
[wiki] https://en.wikipedia.org/wiki/Cross-validation_(statistics)
Icon References
● Icons made by Freepik from www.flaticon.com, licensed under CC 3.0 BY
● Icons made by Pixel perfect from www.flaticon.com, licensed under CC 3.0 BY
● Icons made by Vectors Market from www.flaticon.com, licensed under CC 3.0 BY
● Icons made by Smashicons from www.flaticon.com, licensed under CC 3.0 BY
We organize IT
24.04.2019
Your Contact
Dr. Hamzeh Alavirad
Founder, oranIT GmbH
alavirad@oranit.de
0049-176-8080-7585
Dr. Parinaz Ameri
Co-Founder, oranIT GmbH
ameri@oranit.de
0049-176-3497-0683
