DATA SCIENCE
Dr. Suneel Pappala
Data science
• Data science is an interdisciplinary field that combines statistical
techniques, programming, and domain expertise to extract insights
and knowledge from structured and unstructured data.
• It involves a combination of methods from mathematics, statistics,
computer science, and machine learning to analyze and interpret
complex data.
• The goal of data science is to uncover patterns, make predictions,
and inform decision-making based on data-driven insights.
Key components of data science include:
1. Data Collection: Gathering relevant data from various sources.
2. Data Cleaning: Preparing and cleaning the data to ensure accuracy and
consistency.
3. Data Analysis: Applying statistical models and algorithms to understand the
data and find trends or patterns.
4. Machine Learning: Using algorithms that allow computers to learn from data
and make predictions or decisions without being explicitly programmed for each task.
5. Data Visualization: Presenting the results in an understandable and actionable
format, often through charts, graphs, or dashboards.
6. Decision-Making: Using insights from the analysis to drive business strategy,
optimize processes, or solve specific problems.
Data Collection
• Definition: Gathering raw data from various sources, such as databases, web
scraping, IoT devices, and APIs.
• Example: Collecting customer transaction data from an e-commerce platform or gathering social
media data via APIs.
• Customer Transaction Data (E-commerce Platform):
o Example: Suppose you're working with an online retail store like Amazon or Flipkart. The
platform collects data about every customer transaction, which includes:
▪ Product Details: Item purchased, price, quantity, and category.
▪ Customer Information: User ID, age, location, and browsing history.
▪ Transaction Metadata: Purchase date, time, payment method, and delivery option.
o This transactional data can be used to understand customer behavior, predict future
purchases, or design personalized marketing campaigns.
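As a rough illustration of what collected transaction records might look like once loaded, here is a minimal sketch assuming pandas is available. The column names and the inline CSV are hypothetical stand-ins for a real database export or API response.

```python
import io

import pandas as pd

# Hypothetical export of transaction records. A real pipeline would read
# from a database, an API response, or a file instead of an inline string.
raw_csv = """order_id,user_id,item,category,price,quantity,purchase_date,payment_method
1001,u1,Headphones,Electronics,59.99,1,2024-01-05,card
1002,u2,T-shirt,Apparel,15.00,2,2024-01-05,upi
1003,u1,Novel,Books,9.50,1,2024-01-06,card
"""

# parse_dates converts the purchase_date column to datetime values on load.
transactions = pd.read_csv(io.StringIO(raw_csv), parse_dates=["purchase_date"])

# Quick sanity checks on what was collected.
print(transactions.shape)
print(transactions.dtypes)
```

Inspecting the shape and column types right after loading is a cheap way to catch collection problems before any cleaning starts.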
Data Cleaning (Preprocessing)
• Definition: Removing or fixing errors, handling missing data, and preparing data for analysis.
• Example: Converting data types, handling missing values, filtering out outliers, and removing
duplicates in a dataset.
Converting Data Types
• Problem: Data is sometimes imported with the wrong type. For example, a numeric column (like
price) might be stored as text.
• Example: Suppose a Price column is stored as strings but should be numeric.
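One way to do this conversion is sketched below, assuming pandas; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: prices read in as text, one of them unparseable.
df = pd.DataFrame({"Price": ["19.99", "5", "N/A", "1200.50"]})

# errors="coerce" turns values that cannot be parsed into NaN
# instead of raising an exception.
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")

print(df["Price"].dtype)
print(df["Price"].isna().sum(), "value(s) could not be converted")
```

Coercing rather than raising keeps the load step from failing, but the resulting NaN values then need to be dealt with as missing data.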
Handling Missing Values
• Problem: Datasets often have missing values that need to be addressed before analysis.
• Example: You can either fill missing values (imputation) or drop the rows or columns that
contain them.
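Both options can be sketched in a few lines, assuming pandas. The data is hypothetical, and the choice of the median as the fill value is only one reasonable default, not the only one.

```python
import pandas as pd

# Hypothetical data with gaps in two columns.
df = pd.DataFrame({
    "Quantity": [2, None, 1, 4],
    "Purchase Date": ["2024-01-05", "2024-01-06", None, "2024-01-08"],
})

# Option 1 (imputation): fill missing quantities with the column median.
df["Quantity"] = df["Quantity"].fillna(df["Quantity"].median())

# Option 2 (deletion): drop rows that still lack a purchase date.
df = df.dropna(subset=["Purchase Date"])

print(df)
```

Imputation keeps the row at the cost of an estimated value; deletion keeps only observed values at the cost of fewer rows.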
Filtering Out Outliers
• Problem: Outliers can distort your analysis or model performance, so they
need to be identified and then corrected or removed.
• Example: Suppose a Price column contains an outlier, such as a
product mistakenly priced at $1,000,000.
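The slides do not name a specific outlier rule. The sketch below assumes the common interquartile range (IQR) rule with a 1.5 multiplier, applied with pandas to made-up prices.

```python
import pandas as pd

# Hypothetical prices with one obvious data-entry error.
df = pd.DataFrame({"Price": [12.0, 15.0, 14.0, 13.0, 16.0, 1_000_000.0]})

# IQR rule (an assumption, not taken from the slides): keep values within
# 1.5 * IQR of the first and third quartiles.
q1 = df["Price"].quantile(0.25)
q3 = df["Price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

filtered = df[df["Price"].between(lower, upper)]

print(f"kept {len(filtered)} of {len(df)} rows")
```

Whether a flagged value is an error or a genuine extreme is a domain question, so it is worth inspecting the rejected rows before discarding them.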
Removing Duplicates
• Problem: Duplicate entries in the data can lead to inaccurate results,
especially in aggregations or model training.
• Example: If you have a dataset where some rows are repeated, you can
remove them.
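A minimal sketch of duplicate removal, assuming pandas and a hypothetical orders table in which one row was recorded twice:

```python
import pandas as pd

# Hypothetical orders table in which order 1002 appears twice.
df = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "Price": [59.99, 15.00, 15.00, 9.50],
})

# duplicated() marks every repeat of an earlier, fully identical row.
print(df.duplicated().sum(), "duplicate row(s) found")

# Keep the first occurrence of each row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)
```

By default rows must match in every column to count as duplicates; passing `subset=["order_id"]` would instead compare on the key column only.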
Key Points Addressed:
• Price column converted to numeric.
• Missing values in the Purchase Date and Quantity columns handled.
• Outliers in the Price column removed.
• Duplicate rows removed.
Data Exploration (Exploratory Data Analysis - EDA)
• Definition: Investigating the dataset to uncover initial patterns, trends, and insights using statistical
methods and visualization.
• Example: Plotting histograms, scatter plots, and correlation matrices to understand relationships
between variables.
Plotting a Histogram
• Purpose: A histogram shows the distribution of a single variable. It's useful for understanding the
frequency of values and the shape of the data distribution (e.g., normal, skewed).
• Example: Plotting the distribution of product prices in an e-commerce dataset.
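The binning behind a histogram can be sketched with NumPy alone. The prices below are made up, and the text rendering is only a stand-in: a plotting library would draw bars from the same counts.

```python
import numpy as np

# Hypothetical product prices.
prices = np.array([5, 7, 8, 12, 13, 15, 18, 22, 35, 48])

# Five equal-width bins over a fixed range of 0 to 50.
counts, edges = np.histogram(prices, bins=5, range=(0, 50))

# Print one row per bin, with one '#' per observation in that bin.
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:3.0f}-{right:<3.0f} {'#' * int(count)}")
```

Most of the values fall in the two lowest bins with a thin tail toward higher prices, which is the shape usually described as right-skewed.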
Scatter Plot
• Purpose: A scatter plot is used to visualize the relationship between two variables. It shows how one
variable changes in relation to another.
• Example: Plotting the relationship between product price and quantity sold.
Correlation Matrix
• Purpose: A correlation matrix shows the correlation coefficients between multiple variables. It helps
identify linear relationships between variables. Each coefficient ranges from -1 (perfect negative
correlation) to 1 (perfect positive correlation), with 0 indicating no linear relationship.
• Example: Finding the correlation between price, quantity, and customer ratings.
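A small sketch of computing such a matrix, assuming pandas. The numbers are invented so that price and quantity move in exactly opposite directions.

```python
import pandas as pd

# Hypothetical per-product figures.
df = pd.DataFrame({
    "Price": [10, 20, 30, 40, 50],
    "Quantity": [50, 40, 30, 20, 10],
    "Rating": [3.0, 3.5, 4.5, 4.0, 5.0],
})

# DataFrame.corr() computes pairwise Pearson correlation by default.
corr = df.corr()

print(corr.round(2))
```

The diagonal is always 1 because each variable is perfectly correlated with itself, and the matrix is symmetric, so only one triangle needs to be read.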
Summary of Visualization Techniques
1. Histogram: Shows the distribution of a single variable (e.g., prices, quantities).
o Insight: Reveals the shape and frequency of data (e.g., normal or skewed
distribution).
2. Scatter Plot: Displays the relationship between two continuous variables (e.g., price
vs. quantity sold).
o Insight: Shows patterns, trends, or relationships (e.g., whether a higher price
is associated with fewer items sold).
3. Correlation Matrix: Quantifies linear relationships between multiple variables using
correlation coefficients.
o Insight: Identifies strong or weak correlations between variables (e.g., positive
correlation between price and rating).
Data Transformation/Feature Engineering
• Definition: Transforming raw data into useful formats by creating new
features or adjusting existing ones to make data more suitable for
modeling.
• Example: Normalizing data, converting categorical variables into
numerical values (encoding), or creating new features like 'age group'
from raw 'date of birth'.
• Three common techniques are covered here: normalizing data,
encoding categorical variables, and creating new features from raw
data. These techniques are crucial for preparing your data for
machine learning models.
1. Normalizing Data
• Purpose: Normalization scales the values of continuous variables to a similar range,
typically between 0 and 1. This helps gradient-based machine learning models converge faster and
keeps any one feature from dominating simply because of its scale.
• Example: Normalizing the Price column to bring all prices between 0 and 1.
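Scaling to the 0 to 1 range is commonly done with min-max normalization, x' = (x - min) / (max - min). The sketch below assumes that formula and uses pandas with made-up prices.

```python
import pandas as pd

# Hypothetical prices.
df = pd.DataFrame({"Price": [10.0, 20.0, 40.0, 50.0]})

# Min-max normalization: the smallest value maps to 0, the largest to 1.
# This assumes the column is not constant; otherwise the denominator is zero.
price_min, price_max = df["Price"].min(), df["Price"].max()
df["Price_norm"] = (df["Price"] - price_min) / (price_max - price_min)

print(df)
```

When a model is trained and then applied to new data, the minimum and maximum should come from the training data only, so that new values are scaled consistently.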
2. Converting Categorical Variables into Numerical Values (Encoding)
• Purpose: Most machine learning models require numerical inputs, so categorical features
(e.g., product categories or user locations) need to be converted to numeric form.
• Example: Encoding a Category column (e.g., Electronics, Apparel,
Books) using one-hot encoding or label encoding.
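Both encodings can be sketched with pandas alone. The sample rows are hypothetical, and the label encoding shown assigns integers in alphabetical order of the category names, which is a convenience of this sketch rather than a meaningful ranking.

```python
import pandas as pd

# Hypothetical product categories.
df = pd.DataFrame({"Category": ["Electronics", "Apparel", "Books", "Apparel"]})

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["Category"], prefix="Category", dtype=int)

# Label encoding: one integer per category, assigned in sorted order
# (Apparel=0, Books=1, Electronics=2).
df["Category_label"] = df["Category"].astype("category").cat.codes

print(one_hot)
print(df)
```

Because product categories have no natural order, the integer labels could mislead a model into treating Electronics as "greater than" Apparel, which is why one-hot encoding is usually preferred for unordered categories.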
3. Creating New Features (Feature Engineering)
• Purpose: You can create new features that make the data more
useful for analysis or predictive modeling. For example, converting
raw Date of Birth data into age groups can help the model
capture user demographics.
• Example: Create an Age Group column from a Date of Birth
column.
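A sketch of deriving an age group, assuming pandas. The reference date, bin edges, and group labels are all illustrative choices, not values from the slides; the reference date is fixed so the result is reproducible.

```python
import pandas as pd

# Hypothetical dates of birth.
df = pd.DataFrame({
    "Date of Birth": ["2010-06-15", "1995-03-02", "1980-11-23", "1950-01-30"],
})
reference_date = pd.Timestamp("2025-01-01")  # assumed "today"

dob = pd.to_datetime(df["Date of Birth"])

# Age in completed years: subtract one if the birthday has not yet
# occurred in the reference year.
had_birthday = (dob.dt.month < reference_date.month) | (
    (dob.dt.month == reference_date.month) & (dob.dt.day <= reference_date.day)
)
df["Age"] = reference_date.year - dob.dt.year - (~had_birthday).astype(int)

# Bucket ages into left-closed intervals: [0, 18), [18, 35), [35, 60), [60, 120).
df["Age Group"] = pd.cut(
    df["Age"],
    bins=[0, 18, 35, 60, 120],
    labels=["<18", "18-34", "35-59", "60+"],
    right=False,
)

print(df)
```

Setting `right=False` makes each bin include its lower edge, so an 18-year-old lands in "18-34" rather than "<18".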
Summary of Techniques:
1. Normalization: Rescales numeric values to a similar range (e.g., 0 to 1) to
prevent features with larger values from dominating smaller ones in models.
2. Encoding Categorical Variables:
o One-Hot Encoding: Creates binary columns for each category (useful for
unordered categories).
o Label Encoding: Converts categories into integer labels (best suited to
ordinal data, where the order of the labels is meaningful).
3. Creating New Features: Converts raw data into more meaningful categories
(e.g., Age Group from Date of Birth).

