Intro to Machine Learning for non-Data Scientists

Dr. Parinaz Ameri

Agenda
● 1.5 hours: Introduction to ML algorithms
● 1.5 hours: Implementing algorithms for different use-cases
● 1 hour: Working on a recommendation mini-project

Machine Learning in Daily Life

Machine Learning Definition
Arthur Samuel (1959):
“Field of study that gives computers the ability to learn without being explicitly
programmed.” [ML_Awad]
Source: [fortune]

Email Spam Filter
A Machine Learning Model

Tom Mitchell (1997):
“A computer program is said to learn from experience E with respect to some class of tasks
T and performance measure P if its performance at tasks in T, as measured by P, improves
with experience E.” [ML_Mitchell]

E, T and P in a Spam Filter Example
● Task T:
○ Classify emails as Spam or Ham.
● Experience E:
○ Watching you label emails as Spam or Not Spam (Ham).
● Performance measure P:
○ The number (or fraction) of emails that are correctly classified as Spam or Ham.
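As a rough illustration of the performance measure P, the fraction of correctly classified emails can be computed in a few lines of Python. The labels below are made up for this sketch and the helper name `accuracy` is hypothetical, not part of any library.

```python
# Hypothetical labels: what the user marked (truth) vs. what the filter predicted.
true_labels = ["spam", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
predicted_labels = ["spam", "ham", "ham", "ham", "ham", "spam", "spam", "ham"]


def accuracy(truth, predicted):
    """Performance measure P: fraction of emails classified correctly."""
    correct = sum(t == p for t, p in zip(truth, predicted))
    return correct / len(truth)


p = accuracy(true_labels, predicted_labels)
print(f"Correctly classified: {p:.0%}")
```

With more experience E (more labeled emails), a learning spam filter should push this number up on task T.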

Peter Flach (2012):
“Machine learning is the systematic study of algorithms and systems that improve their
knowledge or performance with experience.” [ML_Flach]

Machine Learning Main Ingredients
1. Tasks:
○ An abstract representation of a problem we want to solve regarding the domain objects.
2. Models:
○ Many tasks can be represented as a mapping (a model) from data points to outputs.
○ A model is produced as the output of a machine learning algorithm applied to training data.
3. Features:
○ The language in which we describe the relevant objects in our domain.

Source: [ML_Flach]

Machine Learning Pipeline
Stages: Data Preparation, Training Data, Test Data, Feature Selection

Machine Learning Pipeline (continued)
Stages: Data Preparation, Training Data, Test Data, Feature Selection, ML Algorithm Selection

Tasks & Learning Algorithms
● Supervised Learning
○ Regression
○ Classification
● Unsupervised Learning
○ Clustering
● Reinforcement Learning
● Recommendation systems

Supervised Learning Algorithms
Data is Labeled = Right Answers are Given

Housing Price Prediction
Regression: predict a continuous-valued output

Breast Cancer (Malignant, Benign)
Classification: predict a discrete-valued output (0, 1)

Features in Classification
Other Features:
- Clump thickness
- Uniformity of cell size
- Uniformity of cell shape
- ...

Exercise 1
Should you treat the following problems with regression or classification?
Problem 1: You want to develop a learning algorithm to examine individual customer accounts
and determine if each account has been hacked.
Problem 2: You have a huge list of identical items and want to predict how many of
them will be sold over the next 3 months.

Unsupervised Learning Algorithms
Data is Not Labeled

Unsupervised Learning
Clustering (figure: unlabeled data points plotted over features X1 and X2)
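To make clustering concrete, here is a minimal k-means sketch in plain Python. The slides do not name a specific clustering algorithm, so k-means is an assumption chosen for illustration; the points, the starting centroids, and the helper names are all made up.

```python
# Hypothetical 2-D points (features X1, X2) forming two visible groups.
points = [(1.0, 1.2), (1.5, 0.8), (0.8, 1.0), (8.0, 8.5), (8.5, 7.9), (7.8, 8.2)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(data, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute the centroids."""
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in data
        ]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(data, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        centroids = new_centroids
    return labels, centroids


# No labels are given; the algorithm only sees the points and two starting centroids.
labels, centroids = k_means(points, centroids=[points[0], points[-1]])
print(labels)
print(centroids)
```

Note that the data carries no "right answers": the group numbers are discovered from the positions of the points alone.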

Clustering in Biology
Source: [researchgate]

More Clustering Applications
Social Network Analysis
Organizing Computing Clusters
Market Segmentation

Exercise 2
Which of the following problems would you address with Unsupervised Learning
algorithms?
1. Given a dataset of patients diagnosed as either having diabetes or not, learn
to classify new patients as having diabetes or not.
2. Given a database of customer data, automatically discover market segments
and group customers into different market segments.
3. Given a dataset of news articles found on the web, group them into sets of
articles about the same story.
4. Given emails labeled as spam/ham, learn a spam filter.

Example of Supervised learning
Source: [radimrehurek]

Machine Learning Pipeline (continued)
Stages: Data Preparation, Training Data, Test Data, Feature Selection, ML Algorithm Selection, Building a model

Models
Learning type | Predictive model | Descriptive model
Supervised learning | Classification, Regression | Subgrouping
Unsupervised learning | Predictive clustering | Clustering, Association Rule discovery

Model Types
● Geometric
● Probabilistic
● Logical

Building a Linear Regression Model
Mean Squared Error (MSE): measures the average of the squares of the errors
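A small sketch of fitting a one-feature linear regression and scoring it with MSE, in plain Python. The house sizes and prices are invented for illustration, and `fit_line` and `mean_squared_error` are hypothetical helper names; in practice a library such as scikit-learn would normally do this.

```python
# Hypothetical data: house size (m^2) and price (in thousands); values are made up.
sizes = [50, 70, 90, 110, 130]
prices = [150, 200, 260, 300, 360]


def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


slope, intercept = fit_line(sizes, prices)
predictions = [slope * s + intercept for s in sizes]
mse = mean_squared_error(prices, predictions)
print(f"price = {slope:.2f} * size + {intercept:.2f}, MSE = {mse:.2f}")
```

Least squares picks the slope and intercept that make this MSE as small as possible on the training data.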

Machine Learning Pipeline (continued)
Stages: Data Preparation, Training Data, Test Data, Feature Selection, ML Algorithm Selection, Building a model, Model Evaluation

Model Validation
● Goodness of fit (fit error)
● Goodness of prediction (prediction error): generalization error

Overfitting:
unnecessary increase of model complexity

Underfitting:
a model that is too simple will not fit the data properly
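One way to see underfitting and overfitting is to fit polynomials of increasing degree to the same noisy data and compare the error on the training points with the error on held-out points. This is only a sketch: the sine-shaped data, the noise level, and the degrees 0, 3, and 7 are arbitrary choices, and it assumes NumPy is installed.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic data: a sine curve plus noise (made up for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Every other point is held out to measure the prediction error.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]


def mse(model, xs, ys):
    return float(np.mean((ys - model(xs)) ** 2))


train_mse, test_mse = {}, {}
for degree in (0, 3, 7):  # too simple, moderate, very flexible
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse[degree] = mse(model, x_train, y_train)
    test_mse[degree] = mse(model, x_test, y_test)
    print(f"degree {degree}: fit error {train_mse[degree]:.3f}, "
          f"prediction error {test_mse[degree]:.3f}")
```

The fit error can only go down as the degree grows, which is why it cannot be trusted on its own. The degree-0 model underfits (both errors are high), while a very flexible model typically starts chasing the noise, so its prediction error stops improving or gets worse.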

k=4 Cross Validation
Source: [wiki]
Mean Squared Prediction Error: computed on q data points that were not used in estimating the model
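The idea of k-fold cross validation can be sketched in plain Python: split the data into k folds, fit on k-1 of them, and compute the mean squared prediction error on the fold that was left out. The data and the helper names (`k_fold_indices`, `fit_line`, `cross_validated_mse`) are hypothetical; the folds here are contiguous for readability, whereas real projects usually shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(x, y):
    """Least-squares line through the points: returns (slope, intercept)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x


def cross_validated_mse(x, y, k=4):
    """Average, over k folds, of the MSE on the points held out from fitting."""
    fold_errors = []
    for held_out in k_fold_indices(len(x), k):
        train = [i for i in range(len(x)) if i not in held_out]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        errors = [(y[i] - (slope * x[i] + intercept)) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Made-up, roughly linear data.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.1]
print(f"4-fold mean squared prediction error: {cross_validated_mse(x, y, k=4):.3f}")
```

Every data point is used for testing exactly once and never in the same round in which it was used for fitting.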

Machine Learning Pipeline (complete)
Stages: Data Preparation, Training Data, Test Data, Feature Selection, ML Algorithm Selection, Building a model, Model Evaluation, New Data, Prediction Result
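The pipeline stages can be walked through end to end with scikit-learn. This is a sketch under assumptions, not the workshop's own code: the data is synthetic, and `SelectKBest` and `LinearRegression` are just one possible choice for the feature selection and algorithm selection stages.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Data preparation: synthetic data where only the first of three features matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 4.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Training data / test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Feature selection: keep the single most informative feature.
selector = SelectKBest(score_func=f_regression, k=1).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# ML algorithm selection and building a model
model = LinearRegression().fit(X_train_sel, y_train)

# Model evaluation on data the model has not seen
test_mse = mean_squared_error(y_test, model.predict(X_test_sel))
print(f"Test MSE: {test_mse:.4f}")

# New data -> prediction result
new_data = np.array([[1.0, 0.0, 0.0]])
prediction = model.predict(selector.transform(new_data))
print(f"Prediction for new data: {prediction[0]:.2f}")
```

The selector and the model are fitted on the training data only; the test data is touched just once, for evaluation.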

Get your hands dirty
Source: [karlstratos]

Installing docker with Anaconda image
1. Install Docker:
> sudo apt install docker.io
2. Add your current user to the docker group:
> sudo usermod -a -G docker $USER
3. Restart your computer
4. Register and proceed at https://hub.docker.com/_/anaconda
5. Pull the Anaconda Docker image:
> docker pull continuumio/anaconda
6. Run the container:
> docker run -i -t continuumio/anaconda /bin/bash
7. Test your conda environment:
(base) root@9b9e483ba80e:/opt/conda# conda info

Running Jupyter Notebook
Run the following command on one line from the host machine:
> docker run -i -t -p 8888:8888 continuumio/miniconda /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && mkdir /opt/notebooks && /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks --ip=0.0.0.0 --port=8888 --no-browser --allow-root"
- Open your Notebook in the browser
- Open a terminal and install numpy, pandas, matplotlib, scipy, and scikit-learn (imported as sklearn)
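After installing, a quick check from a notebook cell can confirm that the libraries are importable. This sketch uses only the Python standard library; the list of names is taken from the step above, with scikit-learn checked under its import name `sklearn`.

```python
import importlib.util

# Import names to check; scikit-learn is imported as "sklearn".
required = ["numpy", "pandas", "matplotlib", "scipy", "sklearn"]

# find_spec returns None when a top-level module cannot be found.
status = {name: importlib.util.find_spec(name) is not None for name in required}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```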

Local Download server
172.90.0.161

Python Libraries for Machine Learning
● NumPy (http://www.numpy.org/):
○ Introduces objects for multidimensional arrays and matrices
○ Provides vectorization of mathematical operations on arrays and matrices
● SciPy (https://www.scipy.org/scipylib/):
○ Collection of algorithms for linear algebra, statistics, optimization, etc.
○ Built on NumPy
● Pandas (http://pandas.pydata.org/):
○ Provides tools for data manipulation and handling missing data
● scikit-learn (https://scikit-learn.org/stable/):
○ Provides machine learning algorithms: classification, regression, clustering, model validation, etc.
● Matplotlib (https://matplotlib.org/):
○ Python 2D plotting library

Pandas DataFrame Data Types
Pandas type | Python native type | Description
object | str | The most general dtype. Will be assigned to your column if it contains mixed types (numbers and strings).
int64 | int | Whole numbers. 64 refers to the number of bits of memory allocated to hold each value.
float64 | float | Numbers with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, in case your missing value has a decimal.
datetime64, timedelta[ns] | N/A (but see the datetime module in Python’s standard library) | Values meant to hold time data. Look into these for time series experiments.
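The dtype rules in the table can be checked on a tiny data frame. The column names and values below are made up for this sketch; it assumes pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mixed": [1, "two", 3.0],      # numbers and strings -> object
    "whole": [1, 2, 3],            # integers only -> int64
    "with_nan": [1, 2, np.nan],    # a missing value forces float64
    "when": pd.to_datetime(["2019-04-24", "2019-04-25", "2019-04-26"]),
})

print(df.dtypes)
```

Note how the `with_nan` column holds only whole numbers but is still stored as float64 because of the NaN.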

DataFrame Attributes
df.attribute | description
dtypes | list the types of the columns
columns | list the column names
axes | list the row labels and column names
ndim | number of dimensions
size | number of elements
shape | return a tuple representing the dimensionality
values | NumPy representation of the data
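A short sketch of these attributes on a made-up data frame (the column names and values are hypothetical). Attributes are read without parentheses, unlike methods.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Ben", "Cem"], "age": [31, 45, 27]})

print(df.dtypes)         # type of each column
print(list(df.columns))  # column names
print(df.axes)           # row labels and column names
print(df.ndim)           # number of dimensions
print(df.size)           # number of elements (rows * columns)
print(df.shape)          # (rows, columns)
print(df.values)         # NumPy representation of the data
```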

Exercise with DataFrame Attributes
1. How many records does this data frame have?
2. How many elements are there?
3. What are the column names?
4. What types of columns do we have in this data frame?

DataFrame Methods
df.method() | description
head([n]), tail([n]) | first/last n rows
describe() | generate descriptive statistics (for numeric columns only)
max(), min() | return max/min values for all numeric columns
mean(), median() | return mean/median values for all numeric columns
std() | standard deviation
sample([n]) | returns a random sample of the data frame
dropna() | drop all the records with missing values
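The methods in the table can be tried on a small, made-up data frame. The column names and values are hypothetical, and `random_state=0` is passed to `sample` only to make the sketch repeatable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [31, 45, 27, 52, np.nan],
    "salary": [3000.0, 4200.0, 2800.0, 5100.0, 3900.0],
})

first_two = df.head(2)                   # first 2 rows
last_two = df.tail(2)                    # last 2 rows
summary = df.describe()                  # count, mean, std, min, quartiles, max
mean_age = df["age"].mean()              # the missing value is skipped by default
spread = df.std()                        # standard deviation per column
random_rows = df.sample(n=2, random_state=0)
complete = df.dropna()                   # rows without missing values

print(summary)
print(mean_age)
```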

Exercise with DataFrame Methods
1. Give the summary for the numeric columns in the dataset
2. Calculate standard deviation for all numeric columns
3. What are the mean values of the first 50 records in the dataset?
Hint: use head() method to subset the first 50 records and then calculate the mean

Handling Missing Values
● NaN (“Not a Number”) marks missing values
● Often replaced by an arbitrarily chosen value, such as -1 in a feature with only positive numbers, 0, or the median (most common)
● But be aware that the data has been changed by doing so
● Alternatively, the sample or feature with missing values can be ignored

Missing Values in Pandas
● Missing values are excluded in the GroupBy method
● Many descriptive statistics methods have a ‘skipna’ option to control whether missing data
should be excluded. It is set to True by default.
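Both behaviours can be seen on a tiny, made-up data frame (the column names `team` and `score` are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", np.nan],
    "score": [1.0, np.nan, 3.0, 4.0],
})

total_default = df["score"].sum()              # skipna=True: NaN is ignored
total_strict = df["score"].sum(skipna=False)   # NaN propagates into the result

# Rows whose group key is missing are left out of the groups.
group_means = df.groupby("team")["score"].mean()

print(total_default, total_strict)
print(group_means)
```

The row with the missing team does not form a group of its own, and the missing score in team "a" is skipped when the mean is computed.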

Dealing with Missing Values in DF
df.method() | description
dropna() | Drop observations with missing values
dropna(how='all') | Drop observations where all cells are NA
dropna(axis=1, how='all') | Drop a column if all of its values are missing
dropna(thresh=5) | Drop rows that contain fewer than 5 non-missing values
fillna(0) | Replace missing values with zeros
isnull() | Returns True if the value is missing
notnull() | Returns True for non-missing values
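A sketch of these calls on a made-up data frame with three columns, one of which is entirely missing. `thresh=2` is used instead of 5 only because the example has just three columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [np.nan, np.nan, np.nan],
})

no_missing = df.dropna()                     # rows without any missing value
not_all_missing = df.dropna(how="all")       # drop rows where every cell is missing
useful_columns = df.dropna(axis=1, how="all")  # drop columns that are entirely missing
at_least_two = df.dropna(thresh=2)           # keep rows with at least 2 non-missing values
zero_filled = df.fillna(0)                   # replace missing values with zeros
missing_mask = df.isnull()                   # True where a value is missing
present_mask = df.notnull()                  # True where a value is present

print(useful_columns)
print(missing_mask)
```

None of these calls change `df` itself; each returns a new data frame.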

R-Squared
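$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
where $SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ and $SS_{tot} = \sum_i (y_i - \bar{y})^2$.
Here, $\hat{y}_i$ is the fitted value for observation $i$ and $\bar{y}$ is the mean of $Y$.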
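R-squared can be computed directly from the observed and fitted values. The sketch below assumes the common definition of R-squared as one minus the ratio of the residual sum of squares to the total sum of squares; the numbers and the function name `r_squared` are made up for illustration.

```python
def r_squared(y, y_hat):
    """1 - SS_res / SS_tot, with y the observed and y_hat the fitted values."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)              # total variation around the mean
    return 1 - ss_res / ss_tot


# Hypothetical observed values and fitted values from some model.
observed = [150, 200, 260, 300, 360]
fitted = [150, 202, 254, 306, 358]

r2 = r_squared(observed, fitted)
print(f"R-squared: {r2:.4f}")
```

A value close to 1 means the fitted values explain most of the variation in Y; a model that always predicts the mean of Y scores 0.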

Further Learning
● Kaggle: a place to do data science projects.
● Seeing Theory: a visual introduction to probability and statistics.
● KDnuggets: Machine Learning, Data Science, Data Mining, Big Data, Analytics, AI, Software.

Reading Recommendations
● Machine Learning: The art and science of algorithms that make sense of data by Peter Flach
● Python for Data Analysis by Wes McKinney
● https://www.kdnuggets.com/2018/12/feature-engineering-explained.html

References
[ML_Awad] Awad, M., Khanna, R. (2015). Machine Learning. In: Efficient Learning Machines. Apress, Berkeley, CA.
[xkcd_1838] https://xkcd.com/1838/
[fortune] http://fortune.com/2018/06/25/ai-business-breakthrough/
[ML_Flach] Flach, P. (2012). Machine Learning: The art and science of algorithms that make sense of data. Cambridge University Press.
[ML_Mitchell] Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill. ISBN: 978-0-07-042807-2
[Medium_Sharma] https://medium.com/datadriveninvestor/how-to-built-a-recommender-system-rs-616c988d64b2
[karlstratos] http://karlstratos.com/drawings/drawings.html
[Print_Lego] https://www.pinterest.com/pin/422071796300372061/
[Medium] https://medium.com/@mehulved1503/feature-selection-and-feature-extraction-in-machine-learning-an-overview-57891c595e96
[researchgate] https://www.researchgate.net/figure/Hierarchical-clustering-of-the-181-genes-corresponding-to-zinc-biology-related-functional_fig6_26688269

References (2)
[radimrehurek] https://radimrehurek.com/data_science_python/
[wiki] https://en.wikipedia.org/wiki/Cross-validation_(statistics)

Icon References
● Icons made by: Freepik from www.flaticon.com is licensed by CC 3.0 BY
● Icons made by: Pixel perfect from www.flaticon.com is licensed by CC 3.0 BY
● Icons made by: Vectors Market from www.flaticon.com is licensed by CC 3.0 BY
● Icons made by: Smashicons from www.flaticon.com is licensed by CC 3.0 BY

We organize IT
24.04.2019
Your Contact
Dr. Hamzeh Alavirad
Founder, oranIT GmbH
alavirad@oranit.de
0049-176-8080-7585
Dr. Parinaz Ameri
Co-Founder, oranIT GmbH
ameri@oranit.de
0049-176-3497-0683
