What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization
Data Preprocessing
Manash Kumar Mondal
Department of Computer Science and Engineering
University of Kalyani
November, 2024
Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani
1 What is Data?
2 Data Preprocessing
3 Data Preprocessing Steps
4 Data Visualization
What is Data?
Data refers to raw facts, figures, or observations that can be
processed and analyzed to extract meaningful information.
• It can be numbers, text, images, or sound.
• Data can be structured (in databases) or unstructured (text,
images, etc.).
Example:
• A list of temperatures over a week: 25, 30, 28, 31, 29.
What is Data Preprocessing?
Data preprocessing is the process of preparing raw data for analysis
and model training by cleaning, organizing, and transforming it
into a more suitable format:
• Identifying and correcting errors: detecting and removing inaccurate, incomplete, or irrelevant data
• Addressing data-quality issues: handling missing values, noise, inconsistencies, and outliers
• Extracting features: deriving specific features from the raw data, for example from images
• Establishing standards: defining standards and best practices for preparing data
Data Preprocessing Steps
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Data Splitting
Figure 1: Data Preprocessing Steps
Why Preprocess the Data?
Data preprocessing transforms raw data into a format suitable for analysis.
• Why?
  • Improve the accuracy of models.
  • Handle missing or inconsistent data.
  • Make the data easier to work with.
• What does it involve?
  • Data cleaning
  • Data integration
  • Data transformation
Example:
• A survey in which some respondents skipped questions: preprocessing handles the resulting missing values.
Data Cleaning
Data cleaning involves removing errors, inconsistencies, and
irrelevant data.
• Handle missing values
• Correct inconsistencies
• Remove duplicates
Example:
• Replace missing values in a dataset with the mean or median.
data = data.fillna(data.mean(numeric_only=True))
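The three cleaning tasks listed above can also be sketched in plain Python. This is a minimal illustration using only the standard library rather than pandas; the records, the field names, and the choice of the median as the fill value are all assumptions made for the example.

```python
from statistics import median

# Hypothetical survey records; None marks a missing age.
records = [
    {"name": "Asha", "city": "kolkata ", "age": 31},
    {"name": "Ravi", "city": "Kolkata", "age": None},
    {"name": "Mina", "city": "KOLKATA", "age": 25},
    {"name": "Asha", "city": "Kolkata", "age": 31},  # duplicate once city is normalized
]

# 1. Handle missing values: fill gaps with the median of the known ages.
known_ages = [r["age"] for r in records if r["age"] is not None]
fill_value = median(known_ages)
for r in records:
    if r["age"] is None:
        r["age"] = fill_value

# 2. Correct inconsistencies: normalize whitespace and capitalization.
for r in records:
    r["city"] = r["city"].strip().title()

# 3. Remove duplicates while keeping the first occurrence.
seen = set()
cleaned = []
for r in records:
    key = (r["name"], r["city"], r["age"])
    if key not in seen:
        seen.add(key)
        cleaned.append(r)
```

The order matters here: the duplicate row is only recognizable after the city names have been normalized.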
Data Cleaning
Data cleaning removes the errors most commonly found in a dataset.
Three kinds are especially common:
• Outliers: data points that lie far outside the range of the other values.
• Missing data: values that are absent where one is expected.
• Erroneous data: data points that are incorrect.
Outliers
• An outlier is a data point that is distant from all other observations in a dataset.
• Put differently, an outlier behaves differently from the rest of the data.
Figure 2: Outlier
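One common way to make "distant from all other observations" concrete is the 1.5 × IQR rule. The slides do not prescribe a detection method, so the rule, the threshold, and the sample readings below are assumptions chosen for illustration.

```python
from statistics import quantiles

values = [25, 30, 28, 31, 29, 27, 95]  # hypothetical readings; 95 looks suspicious

q1, _, q3 = quantiles(values, n=4)     # first and third quartiles
iqr = q3 - q1                          # interquartile range
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < low or v > high]
inliers = [v for v in values if low <= v <= high]
```

Whether a flagged point should be removed, corrected, or kept is a separate decision that depends on the domain.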
Missing data
What do the N/A values in Figure 3 indicate?
They mark the missing values in the dataset. We can handle them
in several ways:
• Ignore the tuple, i.e., eliminate the rows that contain missing values
• Fill in the missing value manually
• Use a global constant
• Use the attribute mean
• Use the most probable value
Figure 3: Missing data
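The strategies above can be compared side by side in a short sketch. The rows and the "Unknown" constant are hypothetical, and the most frequent value is used here only as a simple stand-in for "the most probable value", which could also come from a predictive model.

```python
from statistics import mean, mode

# Hypothetical rows; None stands for an N/A cell.
rows = [
    {"age": 22, "grade": "A"},
    {"age": None, "grade": "B"},
    {"age": 26, "grade": None},
    {"age": 24, "grade": "A"},
]

# Ignore the tuple: drop any row that has a missing value.
complete_rows = [r for r in rows if None not in r.values()]

# Use a global constant in place of every missing value.
with_constant = [{k: ("Unknown" if v is None else v) for k, v in r.items()} for r in rows]

# Use the attribute mean (numeric attribute).
age_mean = mean(r["age"] for r in rows if r["age"] is not None)

# Use the most probable value, approximated by the most frequent grade.
grade_mode = mode(r["grade"] for r in rows if r["grade"] is not None)

imputed = [
    {"age": age_mean if r["age"] is None else r["age"],
     "grade": grade_mode if r["grade"] is None else r["grade"]}
    for r in rows
]
```

Dropping rows is the simplest option but discards the valid cells in those rows, which is why the fill-in strategies are often preferred when data is scarce.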
Erroneous data
Erroneous data is data that is inconsistent, illogical, contradictory,
or out of range. It can also be data that a program cannot process
or should not accept.
• Incorrect values
• Values outside the boundary tolerance
• Values of an incorrect data type
• Values containing invalid characters
Figure 4: Erroneous data is generally rejected
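The categories of erroneous data listed above map naturally onto validation checks. The following is a minimal sketch; the function name `validate_record`, the age range of 0 to 120, and the set of characters allowed in names are assumptions for this example, not rules from the slides.

```python
import re

def validate_record(record):
    """Return a list of problems found in one record (empty if it is valid)."""
    problems = []
    age = record.get("age")
    # Incorrect data type: age must be an integer (bool is excluded explicitly).
    if not isinstance(age, int) or isinstance(age, bool):
        problems.append("age: wrong data type")
    # Outside boundary tolerance: the assumed valid range is 0 to 120.
    elif not 0 <= age <= 120:
        problems.append("age: out of range")
    # Invalid characters: only letters, spaces, hyphens, and apostrophes are allowed.
    name = record.get("name")
    if not isinstance(name, str) or not re.fullmatch(r"[A-Za-z][A-Za-z '\-]*", name):
        problems.append("name: invalid characters")
    return problems

records = [
    {"name": "Asha Roy", "age": 34},
    {"name": "R@vi", "age": 29},
    {"name": "Mina", "age": 250},
    {"name": "Joy", "age": "forty"},
]
report = {r["name"]: validate_record(r) for r in records}
rejected = {name: problems for name, problems in report.items() if problems}
```

Returning a list of problems rather than a single pass/fail flag makes it possible to report every issue in a record at once.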
A Few Important Terms
• Discrepancy detection (causes: human error, data decay, deliberate errors)
• Metadata
• Unique rule
• Null rule
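The slides list the unique rule and the null rule without defining them. The sketch below assumes the usual reading from the data-mining literature: a unique rule requires every value of an attribute to be distinct, and a null rule states which markers (blanks, question marks, and so on) denote a missing value. The marker set, the rows, and both helper functions are hypothetical.

```python
from collections import Counter

NULL_MARKERS = {"", "?", "N/A", "null", None}  # assumed null rule for this dataset

rows = [
    {"id": 101, "email": "a@example.com"},
    {"id": 102, "email": "?"},
    {"id": 102, "email": "c@example.com"},
    {"id": 104, "email": ""},
]

def unique_rule_violations(rows, attribute):
    """Values of `attribute` that appear more than once."""
    counts = Counter(r[attribute] for r in rows)
    return sorted(v for v, n in counts.items() if n > 1)

def null_rule_violations(rows, attribute):
    """Indexes of rows whose `attribute` holds a null marker."""
    return [i for i, r in enumerate(rows) if r[attribute] in NULL_MARKERS]

duplicate_ids = unique_rule_violations(rows, "id")
missing_emails = null_rule_violations(rows, "email")
```

Checks like these are one way to carry out discrepancy detection automatically once the rules are written down as metadata.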
Quiz
What is the purpose of data cleaning?
1 To improve the speed of data processing
2 To get rid of commonly found errors and mistakes in a dataset
3 To collect new data points
4 To generate more data
Quiz
What is "erroneous data"?
1 Data points that are missing from the dataset
2 Data points that are incorrect or invalid
3 Data points that are repetitive
4 Data points that are too large
Quiz
Which of the following is considered an outlier?
1 A data point that is repeated multiple times
2 A data point existing out of the range
3 A data point that is entirely missing
4 A data point that contains invalid characters
Quiz
Which of these errors are commonly addressed during data
cleaning?
1 Outliers
2 Missing data
3 Erroneous data
4 All of the above
Data Integration
Data integration combines data from different sources into a
unified dataset.
• Merge tables from multiple databases.
• Resolve conflicts between data sources.
Example:
• Merging customer information from two different systems.
import pandas as pd
merged_data = pd.merge(data1, data2, on='customer_id')
Figure 5: Data Integration
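To show what such a merge does under the hood, here is a plain-Python sketch of an inner join on `customer_id`, which is also the default behaviour of `pd.merge`. The two tables and their columns are hypothetical, and the sketch assumes each `customer_id` appears at most once in the billing data.

```python
# Hypothetical extracts from two systems, both keyed by customer_id.
crm = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ravi"},
    {"customer_id": 3, "name": "Mina"},
]
billing = [
    {"customer_id": 1, "balance": 120.0},
    {"customer_id": 3, "balance": 0.0},
    {"customer_id": 4, "balance": 55.5},
]

# Index one source by the join key, then look up each row of the other.
billing_by_id = {row["customer_id"]: row for row in billing}

merged = [
    {**left, **billing_by_id[left["customer_id"]]}
    for left in crm
    if left["customer_id"] in billing_by_id  # inner join: keep matching ids only
]
```

Customers 2 and 4 appear in only one system and are dropped by the inner join; an outer join would keep them with gaps that then need cleaning.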
Data Integration Issues
• Entity identification problem
• Redundancy
• Tuple duplication
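Two of these issues can be illustrated together. In this sketch the entity identification problem is handled with a hand-written field mapping, and tuple duplication with an exact-match filter; the field names, the mapping, and the helper `to_common_schema` are assumptions made for the example.

```python
# Hypothetical: the same customer attributes are named differently in each source.
FIELD_MAP = {"cust_id": "customer_id", "CustomerID": "customer_id", "fullname": "name"}

def to_common_schema(row):
    """Rename source-specific field names to the shared schema."""
    return {FIELD_MAP.get(key, key): value for key, value in row.items()}

source_a = [{"cust_id": 1, "fullname": "Asha"}, {"cust_id": 2, "fullname": "Ravi"}]
source_b = [{"CustomerID": 2, "name": "Ravi"}, {"CustomerID": 3, "name": "Mina"}]

combined = [to_common_schema(r) for r in source_a + source_b]

# Tuple duplication: drop rows that are identical after renaming.
unique_rows = []
seen = set()
for row in combined:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)
```

Real sources rarely match this cleanly; near-duplicates with spelling differences need fuzzier matching than the exact comparison used here.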
Handling Redundant Data in Data Integration
Redundant data often arise when multiple databases are integrated:
• The same attribute may have different names in different databases
• One attribute may be derived from attributes in another table, e.g., annual revenue
• Redundant data can often be detected by correlation analysis
• Careful integration of data from multiple sources helps reduce or avoid redundancies and inconsistencies, and improves mining speed and quality
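Correlation analysis for numeric attributes can be sketched with the Pearson coefficient. The columns below are hypothetical, with annual revenue deliberately derived from monthly revenue, and the 0.95 cut-off is an illustrative assumption rather than a standard value.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

monthly_revenue = [10.0, 12.0, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]  # derived attribute
employees = [5, 3, 8, 4]

r_derived = pearson(monthly_revenue, annual_revenue)
r_other = pearson(monthly_revenue, employees)
redundant = abs(r_derived) > 0.95  # threshold is an illustrative assumption
```

A coefficient close to +1 or -1 suggests one attribute carries little information beyond the other. Note that this helper divides by zero if either column is constant, so such columns should be screened out first.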
Data Transformation
Data transformation converts data into a format that is suitable for analysis.
• Normalize or standardize data
• Apply aggregation
Example:
• Convert a salary column from USD to EUR.
# exchange_rate is the EUR-per-USD rate, defined elsewhere
data['salary_eur'] = data['salary_usd'] * exchange_rate
Figure 6: Data Transformation
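The "normalize or standardize" bullet can be made concrete with two common formulas: min-max normalization, x' = (x - min) / (max - min), and z-score standardization, z = (x - mean) / std. The salary values below are hypothetical, and the population standard deviation is one of two reasonable choices.

```python
from statistics import mean, pstdev

salaries = [30000, 45000, 60000, 90000]  # hypothetical values

# Min-max normalization: rescale to the interval [0, 1].
lo, hi = min(salaries), max(salaries)
normalized = [(s - lo) / (hi - lo) for s in salaries]

# Z-score standardization: zero mean, unit (population) standard deviation.
mu, sigma = mean(salaries), pstdev(salaries)
standardized = [(s - mu) / sigma for s in salaries]
```

Min-max scaling is sensitive to outliers because a single extreme value sets the range, whereas z-scores have no fixed bounds.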
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
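Each of the three operations above can be shown in a few lines. This is a sketch under simple assumptions: a 3-point moving average stands in for smoothing, monthly totals for aggregation, and a city-to-country lookup for one step up a concept hierarchy. All data values are hypothetical.

```python
from collections import defaultdict

# Smoothing: a 3-point moving average dampens the spike at 40.
readings = [20, 22, 40, 21, 23]
smoothed = [sum(readings[i - 1:i + 2]) / 3 for i in range(1, len(readings) - 1)]

# Aggregation: summarize daily sales into one total per month.
daily_sales = [("2024-01-03", 100), ("2024-01-17", 50), ("2024-02-02", 70)]
monthly_total = defaultdict(int)
for date, amount in daily_sales:
    monthly_total[date[:7]] += amount  # the "YYYY-MM" prefix is the month key

# Generalization: climb a concept hierarchy from city to country.
CITY_TO_COUNTRY = {"Kolkata": "India", "Mumbai": "India", "Paris": "France"}
cities = ["Kolkata", "Paris", "Mumbai"]
countries = [CITY_TO_COUNTRY[c] for c in cities]
```

The moving average here drops the first and last points, so the smoothed series is two values shorter than the input.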
Data Reduction
Data reduction reduces the volume of data while preserving its
important characteristics.
• Dimensionality reduction (e.g., PCA)
• Sampling
Example:
• Reducing the number of features in a dataset using PCA.
from sklearn.decomposition import PCA
Data Reduction Strategies
• Data cube aggregation
• Dimensionality reduction
• Data compression
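The PCA example above depends on scikit-learn, so this sketch illustrates the other technique mentioned on the slide, sampling, using only the standard library. The dataset of row ids, the 10% sample size, and the seed are assumptions for the example.

```python
import random

population = list(range(1, 1001))       # hypothetical dataset of 1,000 row ids

rng = random.Random(42)                 # fixed seed so the sample is reproducible
sample = rng.sample(population, k=100)  # 10% simple random sample, no replacement

reduction = 1 - len(sample) / len(population)  # fraction of rows removed
```

Simple random sampling works well when the data is fairly homogeneous; rare groups may need stratified sampling to stay represented.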
Data Splitting
Data splitting divides the dataset into training, validation, and test sets.
• Training set: for training the model
• Validation set: for tuning hyperparameters
• Test set: for evaluating model performance
Example:
• Splitting a dataset into 80% training and 20% testing.
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data, test_size=0.2)
Figure 7: Data Splitting
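The example above produces only two sets. A three-way split into training, validation, and test sets can be sketched with the standard library as follows; the function `split_dataset`, the 70/15/15 proportions, and the seed are assumptions for illustration, not the behaviour of `train_test_split`.

```python
import random

def split_dataset(rows, train_pct=70, val_pct=15, seed=0):
    """Shuffle a copy of `rows` and cut it into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])  # whatever remains is the test set

data = list(range(100))
train_set, val_set, test_set = split_dataset(data)
```

Shuffling before cutting avoids a split that simply follows the order in which the rows were collected, and the three parts never overlap.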
Quiz
Which of the following is NOT a data reduction strategy?
1 Data cube aggregation
2 Dimension reduction
3 Data visualization
4 Data compression
Quiz
What is data splitting?
1 Combining multiple datasets into one
2 Dividing a dataset into subsets for different purposes
3 Compressing a dataset to reduce size
4 Organizing a dataset into alphabetical order
Quiz
What is the purpose of data reduction techniques?
1 To increase the volume of data
2 To represent data with a reduced size while maintaining
integrity
3 To eliminate unnecessary datasets
4 To enhance the speed of data collection
Data Visualization
Data visualization represents data through graphs and charts.
• Helps reveal patterns, trends, and insights.
• Types of visualization: bar charts, line charts, histograms, etc.
Example:
• Visualizing the distribution of ages in a dataset.
import matplotlib.pyplot as plt
plt.hist(data['age'])
plt.show()
Applications: presenting statistics, mapping, showing change over
time, comparing values, and showing connections.
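To see what a histogram of ages actually counts, the binning step can be reproduced without a plotting library. This sketch assumes fixed bins of width 10 and a hypothetical list of ages, and prints a text histogram instead of drawing a chart.

```python
from collections import Counter

ages = [21, 25, 34, 37, 38, 42, 45, 51, 29, 33]  # hypothetical sample
bin_width = 10

# Map each age to the lower edge of its bin, e.g. 34 -> 30.
counts = Counter((age // bin_width) * bin_width for age in ages)

for lower in sorted(counts):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * counts[lower]}")
```

The choice of bin width changes the shape of the picture, so it is worth trying more than one value before drawing conclusions.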
Data Visualization
By using visual elements like charts, graphs, and maps, data
visualization tools provide an accessible way to see and understand
trends and patterns in data.
Examples:
• Histograms
• Bar graphs
• Pie charts
• Donut charts
• Gantt charts
• Line graphs
• Maps
Figure 8: Data Visualization
Conclusion
Data preprocessing is an essential step in data science and machine
learning.
• Proper preprocessing leads to better model accuracy.
• The steps include data cleaning, integration, transformation,
reduction, and splitting.
Remember: Good data is the foundation of good insights.
Thank You
Introduction to Data Preprocessing for Machine Learning

  • 1. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization Data Preprocessing Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani November, 2024 Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 1 / 39
  • 2. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization 1 What is Data? 2 Data Preprocessing 3 Data Preprocessing Steps 4 Data Visualization Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 2 / 39
  • 3. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization 1 What is Data? 2 Data Preprocessing 3 Data Preprocessing Steps 4 Data Visualization Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 3 / 39
  • 4. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization What is Data? Data refers to raw facts, figures, or observations that can be processed and analyzed to extract meaningful information. • It can be numbers, text, images, or sound. • Data can be structured (in databases) or unstructured (text, images, etc.). Example: • A list of temperatures over a week: 25, 30, 28, 31, 29. Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 4 / 39
  • 5. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization 1 What is Data? 2 Data Preprocessing 3 Data Preprocessing Steps 4 Data Visualization Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 5 / 39
  • 6. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization What is Data Preprocessing? Data preprocessing is the process of preparing raw data for analysis and model training by cleaning, organizing, and transforming it into a more suitable format: • Identifying and correcting errors: Detecting and removing inaccurate, incomplete, or irrelevant data • Addressing issues: Addressing issues like missing values, noise, inconsistencies, and outliers • Extracting features: Extracting specific features from images • Establishing standards: Establishing standards and best practices for preparing data Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 6 / 39
  • 7. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization 1 What is Data? 2 Data Preprocessing 3 Data Preprocessing Steps Data Cleaning Data Integration Data Transformation Data Reduction Data Splitting 4 Data Visualization Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 7 / 39
  • 8. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization Data Preprocessing Steps Figure 1: Data Preprocessing Steps Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 8 / 39
  • 9. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization Why Preprocess the Data? Data pre-processing involves transforming raw data into a format suitable for analysis. • Why? • Improve accuracy of models. • Handle missing or inconsistent data. • Make the data easier to work with. • What does it involve? • Data cleaning • Data integration • Data transformation Example: • A survey where some respondents skipped questions. Preprocessing will handle missing values. Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 9 / 39
  • 10. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization 1 What is Data? 2 Data Preprocessing 3 Data Preprocessing Steps Data Cleaning Outliers Missing data Erroneous data Data Integration Data Transformation Data Reduction Data Splitting 4 Data Visualization Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 10 / 39
  • 11. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization Data Cleaning Data cleaning involves removing errors, inconsistencies, and irrelevant data. • Handle missing values • Correct inconsistencies • Remove duplicates Example: • Replace missing values in a dataset with the mean or median. data.fillna(data.mean()) Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 11 / 39
  • 12. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization Data Cleaning Data cleaning helps in getting rid of commonly found errors and mistakes in a data set. These are the 3 commonly found errors in data. • Outliers: Data points existing out of the range. • Missing data: Data points missing at certain places. • Erroneous data: Incorrect data points. Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 12 / 39
  • 13. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization Outliers • An outlier is a data point in a dataset that is distant from all other observations. • An outlier is something that behaves differently from the combination/collection of the data. Figure 2: Outlier Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 13 / 39
  • 14. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization Missing data What do these N/A values indicate? They are the missing values in the data set. We can handle them in several ways: • By eliminating the rows of missing values. • Ignore the tuple • Fill in the missing value manually • Use a global constant • Use attribute mean • Use the most probable value Figure 3: Missing data Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 14 / 39
  • 15. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization Erroneous data Erroneous data is data that is inconsistent, illogical, contradictory, or out of range. It can also be data that a program cannot process or should not accept. • Incorrect • Outside boundary tolerance • Making use of incorrect data type • Making use of invalid characters Figure 4: Erroneous data generally rejected Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 15 / 39
  • 16. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization Few Important Terms • Discrepancy Detection (Human Error, Data Decay, Deliberate Errors) • Metadata • Unique rule • Null rule Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 16 / 39
  • 17. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization Quiz What is the purpose of data cleaning? 1 To improve the speed of data processing 2 To get rid of commonly found errors and mistakes in a dataset 3 To collect new data points 4 To generate more data Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 17 / 39
  • 18. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization Quiz What is "erroneous data"? 1 Data points that are missing from the dataset 2 Data points that are incorrect or invalid 3 Data points that are repetitive 4 Data points that are too large Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 18 / 39
  • 19. What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization Quiz Which of the following is considered an outlier? 1 A data point that is repeated multiple times 2 A data point existing out of the range 3 A data point that is entirely missing 4 A data point that contains invalid characters Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani Data Preprocessing 19 / 39
  • 20. Quiz: Which of these errors are commonly addressed during data cleaning? (1) Outliers (2) Missing data (3) Erroneous data (4) All of the above
  • 21. Outline: 1 What is Data? 2 Data Preprocessing 3 Data Preprocessing Steps (Data Cleaning, Data Integration, Data Transformation, Data Reduction, Data Splitting) 4 Data Visualization
  • 22. Data Integration: Data integration combines data from different sources into a unified dataset. • Merge tables from multiple databases. • Resolve conflicts between data sources. Example: • Merging customer information from two different systems. merged_data = pd.merge(data1, data2, on='customer_id') Figure 5: Data Integration
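A slightly fuller sketch of the merge above, assuming two small hypothetical tables keyed by customer_id (the data and the extra columns are invented for illustration). An outer join with indicator=True is one way to see which customers exist in only one system before resolving conflicts.

```python
import pandas as pd

# Two hypothetical source systems keyed by customer_id (illustrative data).
data1 = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meera"]})
data2 = pd.DataFrame({"customer_id": [2, 3, 4], "city": ["Kolkata", "Delhi", "Pune"]})

# An outer join keeps customers that appear in only one system;
# indicator=True adds a _merge column recording each row's origin.
merged_data = pd.merge(data1, data2, on="customer_id", how="outer", indicator=True)

# Rows found in only one source are candidates for conflict resolution.
unmatched = merged_data[merged_data["_merge"] != "both"]
```

The default inner join used on the slide would silently drop customers 1 and 4, which may or may not be the desired behavior.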
  • 23. Data Integration Issues: • Entity identification problem • Redundancy • Tuple duplication
  • 24. Handling Redundant Data in Data Integration: Redundant data often occur when multiple databases are integrated. • The same attribute may have different names in different databases • One attribute may be a derived attribute in another table, e.g., annual revenue • Redundant data can often be detected by correlation analysis • Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
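As a rough illustration of detecting a derived attribute by correlation analysis, the sketch below builds a hypothetical table in which annual_revenue is computed from monthly_revenue. The column names, data, and the 0.95 threshold are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical integrated table; annual_revenue is derived from monthly_revenue.
df = pd.DataFrame({
    "monthly_revenue": [10.0, 12.5, 9.0, 15.0, 11.0],
    "employees": [5, 9, 4, 7, 12],
})
df["annual_revenue"] = df["monthly_revenue"] * 12

# Absolute Pearson correlation between every pair of numeric columns.
corr = df.corr().abs()

# Report each pair once if its correlation exceeds an assumed threshold.
threshold = 0.95
redundant_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
```

A high correlation only suggests redundancy; whether one of the two columns can be dropped still needs a domain check.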
  • 26. Data Transformation: Data transformation involves converting data into a format that is suitable for analysis. • Normalize or standardize data • Apply aggregation. Example: • Convert a salary column from USD to EUR. data['salary_eur'] = data['salary_usd'] * exchange_rate Figure 6: Data Transformation
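Normalization and standardization, named on this slide, can be sketched as follows. The salary values are invented, and min-max scaling and z-scores are used here as two common choices; the slides do not say which variants the author had in mind.

```python
import pandas as pd

# Hypothetical salary column (illustrative values).
data = pd.DataFrame({"salary_usd": [30000.0, 45000.0, 60000.0, 90000.0]})
col = data["salary_usd"]

# Min-max normalization rescales values to the range [0, 1].
data["salary_minmax"] = (col - col.min()) / (col.max() - col.min())

# Z-score standardization gives zero mean and unit standard deviation
# (ddof=0 uses the population standard deviation).
data["salary_zscore"] = (col - col.mean()) / col.std(ddof=0)
```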
  • 27. Data Transformation: • Smoothing: remove noise from data • Aggregation: summarization, data cube construction • Generalization: concept hierarchy climbing
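Smoothing and aggregation can be illustrated with a small hypothetical sales table. The moving-average window of 3 and the column names are assumptions; a moving average is only one of several smoothing techniques.

```python
import pandas as pd

# Hypothetical sales records (illustrative data).
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East", "West"],
    "amount": [10.0, 14.0, 9.0, 30.0, 12.0, 11.0],
})

# Smoothing: a centered 3-point moving average dampens isolated spikes.
sales["amount_smooth"] = (
    sales["amount"].rolling(window=3, center=True, min_periods=1).mean()
)

# Aggregation: summarize the raw amounts per region.
totals = sales.groupby("region")["amount"].sum()
```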
  • 29. Data Reduction: Data reduction reduces the volume of data while preserving its important characteristics. • Dimensionality reduction (e.g., PCA) • Sampling. Example: • Reducing the number of features in a dataset using PCA. from sklearn.decomposition import PCA. Data Reduction Strategies: • Data cube aggregation • Dimension reduction • Data compression
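A minimal sketch of dimensionality reduction with scikit-learn's PCA. The synthetic data below is an assumption made for the example: five features are built from two underlying factors, so two principal components should capture nearly all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 5 features that are linear combinations of 2 hidden factors.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

# Project the 5 original features onto 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance retained by the kept components.
explained = pca.explained_variance_ratio_.sum()
```

On real data the retained variance is usually lower, and features are often standardized before applying PCA.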
  • 31. Data Splitting: Data splitting divides the dataset into training, validation, and test sets. • Training set: for training the model • Validation set: for tuning hyper-parameters • Test set: for evaluating model performance. Example: • Splitting a dataset into 80% training and 20% testing. from sklearn.model_selection import train_test_split; train_data, test_data = train_test_split(data, test_size=0.2) Figure 7: Data Splitting
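The slide's example produces only two sets. One common way to obtain all three is to call train_test_split twice, sketched below under assumed 60/20/20 proportions; the placeholder data and the random_state value are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder dataset of 100 rows (illustrative).
data = np.arange(100).reshape(-1, 1)

# First split: hold out 20% of all rows as the test set.
train_val, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% becomes the validation set,
# which is 20% of the original data, leaving 60% for training.
train_data, val_data = train_test_split(train_val, test_size=0.25, random_state=42)
```

Fixing random_state makes the split reproducible; for classification tasks the stratify argument can keep class proportions similar across the sets.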
  • 32. Quiz: Which of the following is NOT a data reduction strategy? (1) Data cube aggregation (2) Dimension reduction (3) Data visualization (4) Data compression
  • 33. Quiz: What is data splitting? (1) Combining multiple datasets into one (2) Dividing a dataset into subsets for different purposes (3) Compressing a dataset to reduce size (4) Organizing a dataset into alphabetical order
  • 34. Quiz: What is the purpose of data reduction techniques? (1) To increase the volume of data (2) To represent data with a reduced size while maintaining integrity (3) To eliminate unnecessary datasets (4) To enhance the speed of data collection
  • 36. Data Visualization: Data visualization involves creating graphs and charts to represent data. • Helps to understand patterns, trends, and insights. • Types of visualization: bar charts, line charts, histograms, etc. Example: • Visualizing the distribution of ages in a dataset. import matplotlib.pyplot as plt; plt.hist(data['age']) Applications: presenting statistics, mapping, showing change over time, comparing values, and showing connections.
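A slightly fuller version of the histogram example, assuming a small hypothetical age column. The bin count, labels, and the non-interactive Agg backend are choices made for this sketch so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ages (illustrative data).
data = pd.DataFrame({"age": [22, 25, 31, 35, 35, 41, 47, 52, 58, 63]})

# Draw a histogram with 5 equal-width bins and labeled axes.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(data["age"], bins=5, edgecolor="black")
ax.set_xlabel("Age")
ax.set_ylabel("Count")
ax.set_title("Distribution of ages")
```

In an interactive session the backend line can be dropped and plt.show() called to display the figure; fig.savefig can write it to a file instead.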
  • 37. Data Visualization: By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends and patterns in data. Examples: • Histograms • Bar graphs • Pie charts • Donut charts • Gantt charts • Line graphs • Maps. Figure 8: Data Visualization
  • 38. Conclusion: Data preprocessing is an essential step in data science and machine learning. • Proper preprocessing generally leads to better model accuracy. • The steps include data cleaning, integration, transformation, reduction, and splitting. Remember: Good data is the foundation of good insights.
  • 39. Thank You