Data Preprocessing:
• Data Cleaning: Handling missing values, noise reduction, outlier
detection.
• Data Integration: Merging data from multiple sources.
• Data Reduction: Dimensionality reduction, aggregation.
• Data Transformation: Normalization, scaling, encoding categorical
variables.
• Discretization: Converting continuous data into discrete intervals.
Raw data must go through a series of steps to be made suitable for mining. This
transformation phase is known as data preprocessing, an essential and often
time-consuming stage in the data mining pipeline.
• Data preprocessing is the method of cleaning and transforming
raw data into a structured and usable format, ready for
subsequent analysis.
• The real-world data we gather is riddled with imperfections. There
may be missing values, redundant information, or inconsistencies
that can adversely impact the outcome of data analysis. Data
preprocessing comprises the methodologies employed to turn such raw
data into a rich, structured, and actionable asset.
DATA PREPROCESSING
Data Cleaning: An Overview
Data cleaning, sometimes referred to as data cleansing,
involves detecting and correcting (or removing) errors
and inconsistencies in data to improve its quality.
The objective is to ensure data integrity and enhance
the accuracy of subsequent data analysis
Common Issues Addressed in Data Cleaning:
Missing Values: Data can often have gaps. For instance, a dataset of
patient records might lack age details for some individuals. Such
missing data can skew analysis and lead to incomplete results.
Noisy Data: This refers to random error or variance in a dataset. An
example would be a faulty sensor logging erratic temperature
readings amidst accurate ones
Contd…
Outliers: Data points that deviate significantly from other
observations can distort results. For example, in a dataset of
house prices, an unusually high price due to an erroneous entry
can skew the average.
Duplicate Entries: Redundancies can creep in, especially when
data is collated from various sources. Duplicate rows or records
need to be identified and removed
Inconsistent Data: This could be due to various reasons like
different data entry personnel or multiple sources. A date might
be entered as "January 15, 2020" in one record and "15/01/2020"
in another
Methods and Techniques for Data Cleaning:
1. Imputation: Filling missing data based on statistical methods. For
example, missing numerical values could be replaced by the mean or
median of the entire column.
2. Noise Filtering: Applying techniques to smooth out noisy data. Time-
series data, for example, can be smoothed using moving averages.
3. Outlier Detection: Utilizing statistical methods or visualization tools to
identify and manage outliers. The IQR (Interquartile Range) method is a
popular technique.
4. De-duplication: Algorithms are used to detect and remove duplicate
records. This often involves matching and purging data.
5. Data Validation: Setting up rules to ensure consistency. For instance, a
rule could be that age cannot be more than 150 or less than 0
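A minimal pandas sketch touching all five techniques; the patient-records table, column names, and thresholds are invented for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical patient-records table exhibiting the issues described above.
df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4],
    "age":        [34, np.nan, np.nan, 29, 460],   # missing values and an impossible entry
    "temp_c":     [36.8, 37.1, 37.1, 41.9, 36.5],  # 41.9 is a noisy/erroneous reading
})

# 1. Imputation: fill missing ages with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# 4. De-duplication: drop exact duplicate records.
df = df.drop_duplicates()

# 3. Outlier detection with the IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["temp_c"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["temp_c"] < q1 - 1.5 * iqr) | (df["temp_c"] > q3 + 1.5 * iqr)]

# 5. Validation rule: age must lie between 0 and 150.
df = df[df["age"].between(0, 150)]

# 2. Noise filtering: a 3-point moving average smooths a noisy series.
smoothed = df["temp_c"].rolling(window=3, min_periods=1).mean()
```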
Data Integration:
• Merging data from multiple sources
• Data integration is the process of combining data from
various sources into a unified format that can be used for
analytical, operational, and decision-making purposes.
There are several ways to integrate data
• Data virtualization
• Presents data from multiple sources in a single data set in real-time without
replicating, transforming, or loading the data. Instead, it creates a virtual view
that integrates all the data sources and populates a dashboard with data from
multiple sources after receiving a query.
• Extract, load, transform (ELT)
• A modern twist on ETL that loads data into a flexible repository, like a data lake,
before transformation. This allows for greater flexibility and handling of
unstructured data.
• Application integration
• Allows separate applications to work together by moving and syncing data
between them. This can support operational needs, such as ensuring that an HR
system has the same data as a finance system.
• Here are some examples of data integration:
• Facebook Ads and Google Ads to acquire new users
• Google Analytics to track events on a website and in a mobile app
• MySQL database to store user information and image metadata
• Marketo to send marketing email and nurture leads
DATA INTEGRATION
Data integration is the process of combining data from
multiple sources into a cohesive and consistent view.
This process involves identifying and accessing the different
data sources, mapping the data to a common format, and
reconciling any inconsistencies or discrepancies between the
sources.
The goal of data integration is to make it easier to access and
analyze data that is spread across multiple systems or
platforms, in order to gain a more complete and accurate
understanding of the data.
Contd..
Data integration can be challenging due to the variety of
data formats, structures, and semantics used by different
data sources. Different data sources may use different data
types, naming conventions, and schemas, making it difficult
to combine the data into a single view.
Data integration typically involves a combination of manual
and automated processes, including data profiling, data
mapping, data transformation, and data reconciliation.
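A small pandas sketch of the mapping, transformation, and reconciliation steps on two invented source tables (crm and erp) whose column names and date formats disagree:

```python
import pandas as pd

# Two hypothetical sources with different naming conventions and formats.
crm = pd.DataFrame({"CustID": [1, 2], "FullName": ["Ann Lee", "Bo Chan"],
                    "JoinDate": ["January 15, 2020", "March 02, 2021"]})
erp = pd.DataFrame({"customer_id": [2, 3], "name": ["Bo Chan", "Cy Diaz"],
                    "join_date": ["02/03/2021", "10/07/2022"]})

# Data mapping: rename columns to a common schema.
crm = crm.rename(columns={"CustID": "customer_id", "FullName": "name",
                          "JoinDate": "join_date"})

# Data transformation / reconciliation: parse both date formats into one representation.
crm["join_date"] = pd.to_datetime(crm["join_date"], format="%B %d, %Y")
erp["join_date"] = pd.to_datetime(erp["join_date"], format="%d/%m/%Y")

# Combine into a single consistent view, dropping records that appear in both sources.
unified = pd.concat([crm, erp]).drop_duplicates(subset="customer_id")
```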
Contd..
There are mainly 2 major approaches for data integration –
one is the “tight coupling approach” and another is the
“loose coupling approach”.
Tight Coupling:
This approach involves creating a centralized repository or
data warehouse to store the integrated data. The data is
extracted from various sources, transformed and loaded into
a data warehouse. Data is integrated in a tightly coupled
manner, meaning that the data is integrated at a high level,
such as at the level of the entire dataset or schema.
Contd..
This approach is also known as data warehousing, and it
enables data consistency and integrity, but it can be
inflexible and difficult to change or update.
• Here, a data warehouse is treated as an information
retrieval component.
• In this coupling, data is combined from different sources
into a single physical location through the process of ETL –
Extraction, Transformation, and Loading.
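A minimal sketch of the tight-coupling ETL flow, with SQLite standing in for the warehouse; the file names, table, and columns (order_id, amount, refund_amount) are assumptions:

```python
import sqlite3
import pandas as pd

# Extraction: pull data from two hypothetical source files.
sales = pd.read_csv("sales_2024.csv")      # assumed source file
refunds = pd.read_csv("refunds_2024.csv")  # assumed source file

# Transformation: standardize column names and compute net revenue.
sales = sales.rename(columns=str.lower)
refunds = refunds.rename(columns=str.lower)
net = sales.merge(refunds, on="order_id", how="left").fillna({"refund_amount": 0})
net["net_amount"] = net["amount"] - net["refund_amount"]

# Loading: write the integrated result into the central repository.
with sqlite3.connect("warehouse.db") as conn:
    net.to_sql("fact_net_sales", conn, if_exists="replace", index=False)
```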
Contd..
Loose Coupling:
This approach involves integrating data at the lowest level,
such as at the level of individual data elements or records.
Data is integrated in a loosely coupled manner, meaning that
the data is integrated at a low level, and it allows data to be
integrated without having to create a central repository or
data warehouse.
This approach is also known as data federation, and it
enables data flexibility and easy updates, but it can be
difficult to maintain consistency and integrity across multiple
data sources.
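A loose-coupling sketch along the same lines: the sources stay in place and the combined view is built per query, never materialized centrally. The database files, tables, and columns here are assumptions:

```python
import sqlite3
import pandas as pd

# The sources stay where they are (no central warehouse); we only query them on demand.
hr = sqlite3.connect("hr_system.db")        # assumed source databases
fin = sqlite3.connect("finance_system.db")

employees = pd.read_sql_query("SELECT emp_id, name FROM employees", hr)
salaries = pd.read_sql_query("SELECT emp_id, salary FROM payroll", fin)

# A federated "virtual view": the join exists only in memory, for the lifetime of this query.
view = employees.merge(salaries, on="emp_id", how="inner")

hr.close()
fin.close()
```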
Contd..
Data Reduction
• Data Reduction refers to the process of reducing the volume
of data while maintaining its informational quality.
• Data reduction is the process in which an organization sets
out to limit the amount of data it's storing.
• Data reduction techniques seek to lessen the redundancy
found in the original data set so that large amounts of
originally sourced data can be more efficiently stored as
reduced data.
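A brief sketch of two common reduction tactics on synthetic data: PCA for dimensionality reduction and time-based aggregation:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Dimensionality reduction: project 10 numeric features onto 2 principal components.
X = rng.normal(size=(500, 10))                     # stand-in for high-dimensional data
X_reduced = PCA(n_components=2).fit_transform(X)   # 500 x 2 instead of 500 x 10

# Aggregation: daily transactions summarized to one row per month.
tx = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=365, freq="D"),
                   "amount": rng.gamma(2.0, 50.0, size=365)})
monthly = tx.resample("MS", on="date")["amount"].sum()
```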
Data Transformation:
While data cleaning focuses on rectifying errors,
data transformation is about converting data into
a suitable format or structure for analysis. It’s
about making the data compatible and ready for
the next steps in the data mining process.
Common Data Transformation Techniques:
1. Normalization: Scaling numeric data to fall within a small, specified
range. For example, adjusting variables so they range between 0 and 1
2. Standardization: Shifting data to have a mean of zero and a standard
deviation of one. This is often done so different variables can be
compared on common grounds.
3. Binning: Transforming continuous variables into discrete 'bins'. For
instance, age can be categorized into bins like 0-18, 19-35, and so on.
4. One-hot encoding: Converting categorical data into a binary (0 or 1)
format. For example, the color variable with values 'Red', 'Green', 'Blue'
can be transformed into three binary columns—one for each color.
5. Log Transformation: Applied to handle skewed data or when dealing
with exponential patterns
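A hedged sketch of these five techniques using pandas and scikit-learn on a toy table (the values and bin edges are invented):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"age": [5, 17, 24, 41, 63],
                   "income": [0, 1200, 3800, 9500, 120000],
                   "color": ["Red", "Green", "Blue", "Red", "Green"]})

# 1. Normalization: rescale age to the [0, 1] range.
df["age_norm"] = MinMaxScaler().fit_transform(df[["age"]]).ravel()

# 2. Standardization: mean 0, standard deviation 1.
df["age_std"] = StandardScaler().fit_transform(df[["age"]]).ravel()

# 3. Binning: continuous age into discrete intervals.
df["age_bin"] = pd.cut(df["age"], bins=[0, 18, 35, 100], labels=["0-18", "19-35", "36+"])

# 4. One-hot encoding: one binary column per color.
df = pd.concat([df, pd.get_dummies(df["color"], prefix="color")], axis=1)

# 5. Log transformation: compress the skewed income values (log1p handles zeros).
df["income_log"] = np.log1p(df["income"])
```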
Benefits of Data Cleaning and Transformation:
Enhanced Analysis Accuracy: With cleaner data, algorithms work
more effectively, leading to more accurate insights.
Reduced Complexity: Removing redundant and irrelevant data
reduces dataset size and complexity, making subsequent analysis
faster.
Improved Decision Making: Accurate data leads to better
insights, which in turn facilitates informed decision-making.
Enhanced Data Integrity: Consistency in data ensures integrity,
which is crucial for analytics and reporting.
Data Normalization and Standardization
1. Data Normalization: Normalization scales all numeric variables to the
range between 0 and 1. The goal is to change the values of numeric
columns in the dataset to a common scale, without distorting differences
in the range of values.
2. Benefits of Normalization:
1. Predictability: Ensures that gradient descent (used in many modeling
techniques) converges more quickly.
2. Uniformity: Brings data to a uniform scale, making it easier to
compare different features.
Normalization has its drawbacks. It can be influenced heavily by outliers.
Data Normalization and Standardization
Data Standardization
While normalization adjusts features to a specific range, standardization
adjusts them to have a mean of 0 and a standard deviation of 1. It's also
commonly known as the z-score normalization
Benefits of Standardization:
Centering the Data: It centers the data around 0, which can be useful in
algorithms that assume zero-centered data, like Principal Component
Analysis (PCA).
Handling Outliers: Standardization is less sensitive to outliers compared
to normalization.
Common Scale: Like normalization, it brings features to a common scale
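A tiny numeric sketch (made-up values) of the outlier point above: min-max scaling lets a single extreme value compress everything else, while z-scores keep the regular values spread out.

```python
import numpy as np

x = np.array([10.0, 12.0, 11.0, 13.0, 300.0])   # the last value is an outlier

# Min-max normalization: the outlier pins the range, squashing the normal values near 0.
x_norm = (x - x.min()) / (x.max() - x.min())
# -> [0.0, 0.0069, 0.0034, 0.0103, 1.0]

# Z-score standardization: the regular values stay distinguishable around the mean.
x_std = (x - x.mean()) / x.std()
# -> roughly [-0.51, -0.50, -0.50, -0.49, 2.0]
```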
Discretization:
• In statistics and machine learning, discretization refers to the
process of converting continuous features or variables to
discretized or nominal features.
• Discretization in data mining refers to converting a range of
continuous values into discrete categories.
• For example, suppose we have an Age attribute containing raw continuous
values. Discretization maps each value into a discrete interval (e.g.,
Child, Young Adult, Adult, Senior), yielding the discretized data, as
sketched below.
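A minimal pandas sketch of this example; the raw ages, bin edges, and interval labels are invented for illustration:

```python
import pandas as pd

ages = pd.Series([6, 14, 23, 31, 45, 52, 67, 81])   # assumed raw Age values

# Hand-chosen intervals; pd.cut assigns each continuous value to its bin.
discretized = pd.cut(ages, bins=[0, 18, 35, 60, 120],
                     labels=["Child", "Young Adult", "Adult", "Senior"])
# 6, 14 -> Child; 23, 31 -> Young Adult; 45, 52 -> Adult; 67, 81 -> Senior
```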
Data Visualization
• Data visualization is the representation of data through use of
common graphics, such as charts, plots, infographics and even
animations.
• These visual displays of information communicate complex data
relationships and data-driven insights in a way that is easy to
understand.
• Data visualization is the graphical representation of information and
data. By using visual elements like charts, graphs, maps, and
dashboards, data visualization tools provide an accessible way to see
and understand trends, outliers, and patterns in data.
Key Concepts in Data Visualization:
• Types of Visualizations:
• Bar Charts: Used to compare categories or show changes over time.
• Line Charts: Ideal for showing trends over time.
• Pie Charts: Good for showing proportions.
• Scatter Plots: Useful for observing relationships between variables.
• Heatmaps: Display data in matrix form, with values represented by colors.
• Histograms: Show the distribution of a dataset.
• Box Plots: Provide a summary of data through quartiles and outliers.
• Best Practices:
• Clarity: Ensure that your visualizations are easy to understand.
Avoid clutter.
• Accuracy: Represent data truthfully. Avoid misleading scales or
distorted visuals.
• Context: Include labels, legends, and titles to make your visuals self-
explanatory.
• Consistency: Use consistent colors, fonts, and styles across your
visuals.
• Focus: Highlight the key message you want to convey through your
visualization.
• Tools for Data Visualization:
• Tableau: A powerful tool for creating interactive visualizations and
dashboards.
• Microsoft Power BI: Offers data visualization and business
intelligence capabilities.
• Google Data Studio: Useful for creating reports and dashboards
from various data sources.
• Matplotlib and Seaborn (Python libraries): Widely used for creating
static, animated, and interactive plots in Python (see the short sketch after this list).
• D3.js: A JavaScript library for producing dynamic, interactive data
visualizations in web browsers.
• Excel: A basic tool for creating charts and graphs, suitable for simple
data visualization tasks.
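For instance, a minimal Matplotlib sketch that produces a histogram and a box plot for a synthetic house-price sample:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
prices = rng.lognormal(mean=12, sigma=0.4, size=1000)   # synthetic house-price-like data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of the dataset.
ax1.hist(prices, bins=30, color="steelblue")
ax1.set_title("House price distribution")
ax1.set_xlabel("Price")

# Box plot: quartiles and outliers at a glance.
ax2.boxplot(prices)
ax2.set_title("Quartiles and outliers")

plt.tight_layout()
plt.show()
```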
Data Similarity and Dissimilarity Measures
• Euclidean Distance: The straight-line distance between two data points
in space; a dissimilarity (distance) measure.
• Cosine Similarity: Measures the cosine of the angle between two
vectors.
• Jaccard Similarity: A measure of similarity between two sets.
• Pearson Correlation: A measure of the linear correlation between
two variables.
•Cosine Similarity
• Definition: Measures the cosine of the angle between two
vectors in a multi-dimensional space. It is often used in text
mining and information retrieval.
• Formula:
Cosine Similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)
•Jaccard Similarity
• Definition: Measures the similarity between two sets by
comparing the size of their intersection to the size of their
union.
•Euclidean Distance
• Definition: Measures the straight-line distance between
two points in a multi-dimensional space.
•Manhattan Distance (L1 Norm)
• Definition: Measures the sum of the absolute differences
between coordinates of two points.
•Minkowski Distance
• Definition: A generalization of Euclidean and Manhattan
distances, parameterized by an exponent p.
•Hamming Distance
• Definition: Measures the number of positions at which two
strings of equal length differ. It is often used for binary or
categorical data.
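A compact NumPy sketch of these measures on invented vectors, sets, and bit strings:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))              # L2 straight-line distance
manhattan = np.sum(np.abs(a - b))                      # L1 sum of absolute differences
minkowski = np.sum(np.abs(a - b) ** 3) ** (1 / 3)      # p = 3 generalization
cosine_sim = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

s1, s2 = {"data", "mining", "rules"}, {"data", "mining", "trees"}
jaccard = len(s1 & s2) / len(s1 | s2)                  # |intersection| / |union| = 2/4

hamming = sum(c1 != c2 for c1, c2 in zip("10110", "10011"))   # differing positions = 2
```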