ML_Lec2 introduction to data processing.pdf

Agenda
 Revision
 Data preprocessing
 Data cleaning
 Dealing with missing data.
 Dealing with categorical data.
 Feature scaling
 ▪ Rescaling
 ▪ Mean normalization
 ▪ Standardization
 ▪ Scaling to unit length
 Dimensionality reduction
 ▪ Feature selection
 ▪ Feature extraction
 Partitioning a dataset into training and testing sets

Revision
Learning
 In machine learning and pattern recognition, there is a
relationship between the observations (input features) and the
response (target); we need to understand this relationship.

❑ Learning
▪ The learning process aims to understand the data and predict new
values by finding a mathematical model of the relationship between
the given observations (i.e., inputs and outputs).

❑ Examples of learning problems:
❑ Predict wind speed and solar radiation.
❑ Detect cancers, such as breast cancer, leukemia, and spinal cancer.
❑ Estimate the amount of glucose in the blood.

Data Preprocessing
Introduction
 The quality of the data and the amount of useful information that it
contains are key factors that determine how well a machine learning
algorithm can learn.
 Therefore, it is absolutely critical that we make sure to examine and
preprocess a dataset before we feed it to a learning algorithm.
 We will discuss the essential data preprocessing techniques that will
help us build good machine learning models.

 Data preprocessing is used to manipulate and transform a raw
dataset into a clean, scaled dataset.
 In other words, when a dataset is gathered from different sources, it is
collected in a raw format that is not suitable for analysis.
 Therefore, certain steps are carried out to convert the dataset into a
smaller, clean dataset.
 Data preprocessing is performed before the collected dataset is
used.

 Data Cleaning
 Feature Scaling
 Dimensionality Reduction

Data Preprocessing
Why do we need data preprocessing?
 The main objectives of data preprocessing are to manipulate and
transform raw data into a clean, scaled format.
 In addition, it is often useful to compress the data onto a
lower-dimensional subspace while retaining most of the relevant
information.

Data Preprocessing
Types of data preprocessing

Data Preprocessing
Data cleaning: missing data
 In real-world applications, it is common for the collected data samples to contain
one or more missing values for various reasons.
 These reasons include:
 An error may have occurred in the data collection process.
 Certain measurements may not be applicable.
 Particular fields may simply have been left blank, for instance
in a survey.
 We typically see missing values as blank spaces in our data
table or as placeholder strings such as NaN (Not a Number).
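As an illustration, the sketch below detects missing entries of the kinds described above (blanks and NaN placeholders) and then applies two common ways of handling them: dropping incomplete rows, or replacing each missing value with the mean of its column. The toy table and the helper names `is_missing`, `drop_incomplete_rows`, and `impute_column_means` are hypothetical and not taken from the lecture. Mean imputation is only one possible strategy.

```python
import math

# Hypothetical toy table: each row is a sample, each column a feature.
# Missing entries appear as None, a blank string, or the placeholder "NaN".
raw_rows = [
    [1.0, 2.0, 3.0],
    [5.0, "", 8.0],
    [10.0, 11.0, "NaN"],
    [None, 5.0, 4.0],
]


def is_missing(value):
    """Return True for None, blank strings, 'NaN' strings, and NaN floats."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == "" or value.strip().lower() == "nan"
    return isinstance(value, float) and math.isnan(value)


def drop_incomplete_rows(rows):
    """Keep only the rows that have no missing entries."""
    return [row for row in rows if not any(is_missing(v) for v in row)]


def impute_column_means(rows):
    """Replace each missing entry with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [float(row[j]) for row in rows if not is_missing(row[j])]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if is_missing(row[j]) else float(row[j]) for j in range(n_cols)]
        for row in rows
    ]


complete = drop_incomplete_rows(raw_rows)
imputed = impute_column_means(raw_rows)
print(complete)
print(imputed)
```

Dropping rows is simple but discards data. Imputation keeps every sample at the cost of introducing estimated values.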

Data Preprocessing
Data cleaning: categorical data
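A minimal sketch of two common ways to turn categorical values into numbers, assuming the usual distinction between ordered categories (mapped to integers) and unordered categories (one-hot encoded). The example features, the ordering in `size_order`, and the helper `one_hot` are hypothetical and not taken from the lecture.

```python
# Hypothetical categorical features for four samples.
sizes = ["M", "L", "XL", "M"]                # ordered categories
colors = ["green", "red", "blue", "green"]   # unordered categories

# Ordered categories: map each category to an integer that preserves the order.
# The mapping itself is an assumption chosen for this example.
size_order = {"M": 1, "L": 2, "XL": 3}
encoded_sizes = [size_order[s] for s in sizes]


def one_hot(values):
    """Encode unordered categories as 0/1 indicator vectors, one position per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


categories, encoded_colors = one_hot(colors)
print(encoded_sizes)
print(categories)
print(encoded_colors)
```

One-hot encoding avoids implying an order between categories that have none, which an integer mapping would do.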

Data Preprocessing
Feature scaling

Data Preprocessing
Feature scaling: rescaling
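A minimal sketch of rescaling, assuming the usual min-max definition x' = (x - min(x)) / (max(x) - min(x)), which maps a feature onto the range [0, 1]. The helper name `rescale` and the sample values are hypothetical.

```python
def rescale(values):
    """Min-max rescaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant feature")
    return [(v - lo) / (hi - lo) for v in values]


feature = [10.0, 20.0, 30.0, 50.0]
rescaled = rescale(feature)
print(rescaled)
```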

Data Preprocessing
Feature scaling: mean normalization
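A minimal sketch of mean normalization, assuming the usual definition x' = (x - mean(x)) / (max(x) - min(x)). The result is centered at zero. The helper name `mean_normalize` and the sample values are hypothetical.

```python
def mean_normalize(values):
    """Subtract the mean, then divide by the range (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant feature")
    mean = sum(values) / len(values)
    return [(v - mean) / (hi - lo) for v in values]


feature = [10.0, 20.0, 30.0, 50.0]
normalized = mean_normalize(feature)
print(normalized)
```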

Data Preprocessing
Feature scaling: standardization
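A minimal sketch of standardization, assuming the usual z-score definition z = (x - mean(x)) / std(x), so that the scaled feature has zero mean and unit standard deviation. The population standard deviation (dividing by N) is used here as an assumption. The helper name `standardize` and the sample values are hypothetical.

```python
import statistics


def standardize(values):
    """Subtract the mean, then divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mean) / std for v in values]


feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardized = standardize(feature)
print(standardized)
```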

Data Preprocessing
Feature scaling: scaling to unit length
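A minimal sketch of scaling to unit length, assuming the usual definition x' = x / ||x||, where ||x|| is the Euclidean length of a sample's feature vector. The helper name `scale_to_unit_length` and the sample vector are hypothetical.

```python
import math


def scale_to_unit_length(vector):
    """Divide every component by the Euclidean norm of the vector."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        raise ValueError("cannot scale the zero vector")
    return [v / norm for v in vector]


sample = [3.0, 4.0]
unit = scale_to_unit_length(sample)
print(unit)
```

Unlike the other scaling methods, this one is applied to each sample's feature vector rather than to each feature column.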

Data Preprocessing
Dimensionality Reduction

Data Preprocessing
Partitioning a dataset into training and testing
sets
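A minimal sketch of partitioning a dataset, assuming a random shuffle followed by a fixed split. The helper name `split_train_test`, the 30% test fraction, the seed, and the toy data are all hypothetical choices, not values from the lecture.

```python
import random


def split_train_test(samples, labels, test_fraction=0.3, seed=0):
    """Shuffle the sample indices and split them into training and testing parts."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(len(samples) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [samples[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [samples[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy dataset: ten samples with one feature each and a binary label.
samples = [[float(i)] for i in range(10)]
labels = [i % 2 for i in range(10)]
x_train, x_test, y_train, y_test = split_train_test(samples, labels)
print(len(x_train), len(x_test))
```

The test set is held back and used only to evaluate the model after training. Any scaling parameters, such as the mean or minimum, would normally be computed on the training set and then reused on the test set.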
