2. Data Science
Data science is an interdisciplinary field that combines statistical
techniques, programming, and domain expertise to extract insights
and knowledge from structured and unstructured data.
It draws on methods from mathematics, statistics, computer science,
and machine learning to analyze and interpret complex data.
The goal of data science is to uncover patterns, make predictions,
and inform decision-making based on data-driven insights.
3. Key Components of Data Science
1. Data Collection: Gathering relevant data from various sources.
2. Data Cleaning: Preparing and cleaning the data to ensure accuracy and
consistency.
3. Data Analysis: Applying statistical models and algorithms to understand the
data and find trends or patterns.
4. Machine Learning: Using algorithms that allow computers to learn from data
and make predictions or decisions without explicit programming.
5. Data Visualization: Presenting the results in an understandable and actionable
format, often through charts, graphs, or dashboards.
6. Decision-Making: Using insights from the analysis to drive business strategy,
optimize processes, or solve specific problems.
4. Data Collection
Definition: Gathering raw data from various sources, which could include databases, web
scraping, IoT devices, APIs, and more.
Example: Collecting customer transaction data from an e-commerce platform or gathering social
media data via APIs.
Customer Transaction Data (E-commerce Platform):
o Example: Suppose you're working with an online retail store like Amazon or Flipkart. The
platform collects data about every customer transaction, which includes:
Product Details: Item purchased, price, quantity, and category.
Customer Information: User ID, age, location, and browsing history.
Transaction Metadata: Purchase date, time, payment method, and delivery option.
o This transactional data can be used for understanding customer behavior, predicting future
purchases, or designing personalized marketing campaigns.
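As a rough illustration of API-based collection, here is a minimal Python sketch using the requests library. The endpoint URL, token, and field names are placeholders for a hypothetical service, not a real API:

    import requests

    # Hypothetical endpoint and token -- replace with a real API's values.
    URL = "https://api.example.com/v1/transactions"
    HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}

    response = requests.get(URL, headers=HEADERS, params={"limit": 100})
    response.raise_for_status()        # fail loudly on HTTP errors
    transactions = response.json()     # assume a list of transaction records

    # Inspect a few records (field names are assumptions).
    for tx in transactions[:5]:
        print(tx.get("user_id"), tx.get("product"), tx.get("price"))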
5. Data Cleaning (Preprocessing)
Definition: Removing or fixing errors, handling missing data, and preparing data for analysis.
Example: Converting data types, handling missing values, filtering out outliers, and removing
duplicates in a dataset.
Converting Data Types
Problem: Sometimes data is imported in the wrong format. For example, a numeric column (like
price) might be stored as text.
Example: Suppose you have a column Price that is in string format but should be numeric.
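A minimal pandas sketch of the conversion; the column name Price and the sample values are assumed for illustration:

    import pandas as pd

    df = pd.DataFrame({"Price": ["19.99", "5.49", "12.00"]})
    print(df.dtypes)                   # Price is object (string)

    # Coerce to numeric; unparseable entries become NaN instead of raising.
    df["Price"] = pd.to_numeric(df["Price"], errors="coerce")
    print(df.dtypes)                   # Price is now float64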
Handling Missing Values
Problem: Datasets often have missing values that need to be addressed before analysis.
Example: You can either fill missing values (imputation) or drop rows/columns with missing
data.
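A short pandas sketch showing both options; the column names and the choice of median imputation are assumptions for illustration:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"Quantity": [1, np.nan, 3],
                       "Purchase Date": ["2024-01-05", None, "2024-01-07"]})

    # Option 1: impute -- fill missing quantities with the column median.
    df["Quantity"] = df["Quantity"].fillna(df["Quantity"].median())

    # Option 2: drop -- remove rows that still contain any missing value.
    df = df.dropna()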
6. Data Cleaning (Preprocessing)
Filtering Out Outliers
Problem: Outliers can distort your analysis or model performance, so they
need to be handled.
Example: Suppose you have a column Price where an outlier exists, like a
product mistakenly priced at $1,000,000.
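One common approach, sketched here with pandas, is the 1.5 x IQR rule; this threshold is a conventional choice, not the only way to define an outlier:

    import pandas as pd

    df = pd.DataFrame({"Price": [19.99, 24.50, 1_000_000.0, 9.99]})

    # Keep only prices within 1.5 * IQR of the quartiles.
    q1, q3 = df["Price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["Price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]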
Removing Duplicates
Problem: Duplicate entries in the data can lead to inaccurate results,
especially in aggregations or model training.
Example: If you have a dataset where some rows are repeated, you can
remove them.
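A minimal pandas sketch; the OrderID key column is an assumed name:

    import pandas as pd

    df = pd.DataFrame({"OrderID": [101, 101, 102],
                       "Price": [19.99, 19.99, 5.49]})

    # Drop exact duplicate rows, keeping the first occurrence.
    df = df.drop_duplicates()

    # Or deduplicate on a key column only (assumed here to be OrderID).
    df = df.drop_duplicates(subset="OrderID")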
7. Data Cleaning (Preprocessing)
Key Points Addressed:
Price column converted to numeric.
Missing values in the Purchase Date and Quantity
columns handled.
Outliers in the Price column removed.
Duplicate rows removed.
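Tying the four steps together, a hedged end-to-end sketch (all column names are assumptions for illustration):

    import pandas as pd

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        # 1. Convert Price to numeric.
        df["Price"] = pd.to_numeric(df["Price"], errors="coerce")
        # 2. Handle missing values: impute Quantity, drop missing dates.
        df["Quantity"] = df["Quantity"].fillna(df["Quantity"].median())
        df = df.dropna(subset=["Purchase Date"])
        # 3. Remove Price outliers via the 1.5 * IQR rule.
        q1, q3 = df["Price"].quantile([0.25, 0.75])
        iqr = q3 - q1
        df = df[df["Price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
        # 4. Remove duplicate rows.
        return df.drop_duplicates()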
8. Data Exploration (Exploratory Data Analysis - EDA)
Definition: Investigating the dataset to uncover initial patterns, trends, and insights using statistical
methods and visualization.
Example: Plotting histograms, scatter plots, and correlation matrices to understand relationships
between variables.
Plotting a Histogram
Purpose: A histogram shows the distribution of a single variable. It's useful for understanding the
frequency of values and the shape of the data distribution (e.g., normal, skewed).
Example: Plotting the distribution of product prices in an e-commerce dataset.
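A minimal sketch with pandas and matplotlib; the sample prices are made up for illustration:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Assumed sample data; a real dataset would have many more rows.
    df = pd.DataFrame({"Price": [9.99, 12.50, 14.99, 19.99,
                                 22.00, 24.50, 29.99, 99.00]})

    df["Price"].plot(kind="hist", bins=10, edgecolor="black")
    plt.xlabel("Price")
    plt.ylabel("Frequency")
    plt.title("Distribution of Product Prices")
    plt.show()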
Scatter Plot
Purpose: A scatter plot is used to visualize the relationship between two variables. It shows how one
variable changes in relation to another.
Example: Plotting the relationship between product price and quantity sold.
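A minimal sketch, again with assumed sample data:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame({"Price": [5, 10, 15, 20, 25],
                       "Quantity Sold": [120, 95, 70, 40, 25]})

    # Each point is one product: x = price, y = units sold.
    df.plot(kind="scatter", x="Price", y="Quantity Sold")
    plt.title("Price vs. Quantity Sold")
    plt.show()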
9. Correlation Matrix
Purpose: A correlation matrix shows the correlation coefficients between pairs of variables. It helps
identify linear relationships; coefficients range from -1 (perfect negative correlation) to +1 (perfect
positive correlation), with 0 indicating no linear relationship.
Example: Finding the correlation between price, quantity, and customer ratings.
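A minimal pandas sketch with assumed sample data; df.corr() computes pairwise Pearson coefficients between numeric columns:

    import pandas as pd

    df = pd.DataFrame({"Price": [5, 10, 15, 20],
                       "Quantity": [120, 95, 70, 40],
                       "Rating": [3.9, 4.1, 4.4, 4.6]})

    # Pairwise correlation matrix of all numeric columns.
    print(df.corr())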
10. Summary of Visualization Techniques
1. Histogram: Shows the distribution of a single variable (e.g., prices, quantities).
o Insight: Reveals the shape and frequency of data (e.g., normal or skewed
distribution).
2. Scatter Plot: Displays the relationship between two continuous variables (e.g., price
vs. quantity sold).
o Insight: Shows patterns, trends, or relationships (e.g., whether a higher price
leads to fewer items sold).
3. Correlation Matrix: Quantifies relationships between multiple variables using
correlation coefficients.
o Insight: Identifies strong or weak correlations between variables (e.g., positive
correlation between price and rating).
11. Data Transformation/Feature Engineering
Definition: Transforming raw data into useful formats by creating new
features or adjusting existing ones to make data more suitable for
modeling.
Example: Normalizing data, converting categorical variables into
numerical values (encoding), or creating new features like 'age group'
from raw 'date of birth.'
The slides that follow walk through three common feature engineering
techniques: normalizing data, encoding categorical variables, and
creating new features such as age groups from raw dates of birth.
These steps are crucial for preparing data for machine learning models.
12. Data Transformation/Feature Engineering
1. Normalizing Data
Purpose: Normalization scales the values of continuous variables to a similar range,
typically between 0 and 1. This helps machine learning models converge faster and
ensures that no one feature dominates due to its scale.
Example: Normalizing the Price column to bring all prices between 0 and 1.
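A minimal sketch of min-max normalization with pandas (sample values assumed):

    import pandas as pd

    df = pd.DataFrame({"Price": [10.0, 25.0, 40.0, 100.0]})

    # Min-max normalization: rescale Price to the [0, 1] range.
    pmin, pmax = df["Price"].min(), df["Price"].max()
    df["Price_norm"] = (df["Price"] - pmin) / (pmax - pmin)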
2. Converting Categorical Variables into Numerical Values (Encoding)
Purpose: Machine learning models require numerical inputs, so categorical features
(e.g., product categories or user locations) need to be converted to numeric form.
Example: Encoding a Category column (e.g., Electronics, Apparel,
Books) using one-hot encoding or label encoding.
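A minimal pandas sketch of both approaches, using the example categories from above:

    import pandas as pd

    df = pd.DataFrame({"Category": ["Electronics", "Apparel",
                                    "Books", "Apparel"]})

    # One-hot encoding: one binary column per category (unordered data).
    one_hot = pd.get_dummies(df["Category"], prefix="Category")

    # Label encoding: map each category to an integer code. Note the
    # codes follow alphabetical order; for genuinely ordinal data you
    # would set the category order explicitly.
    df["Category_label"] = df["Category"].astype("category").cat.codes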
13. Data Transformation/Feature Engineering
3. Creating New Features (Feature Engineering)
Purpose: You can create new features that make the data more
useful for analysis or predictive modeling. For example, converting
raw Date of Birth data into age groups can help a model capture
user demographics more effectively.
Example: Create an Age Group from a Date of Birth
column.
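A minimal pandas sketch; the date-of-birth values, the reference date, and the bin edges are all assumptions for illustration:

    import pandas as pd

    df = pd.DataFrame({"Date of Birth": ["1990-04-12", "2001-09-30",
                                         "1975-01-02"]})
    dob = pd.to_datetime(df["Date of Birth"])

    # Approximate age in whole years as of a fixed reference date.
    today = pd.Timestamp("2024-01-01")
    df["Age"] = ((today - dob).dt.days // 365).astype(int)

    # Bin ages into labeled groups with pd.cut.
    df["Age Group"] = pd.cut(df["Age"],
                             bins=[0, 18, 35, 55, 120],
                             labels=["0-18", "19-35", "36-55", "56+"])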
14. Data Transformation/Feature Engineering
Summary of Techniques:
1. Normalization: Rescales numeric values to a similar range (e.g., 0 to 1) to
prevent larger numbers from dominating smaller ones in models.
2. Encoding Categorical Variables:
o One-Hot Encoding: Creates binary columns for each category (useful for
unordered categories).
o Label Encoding: Converts categories into numeric labels (useful for
ordinal data).
3. Creating New Features: Converts raw data into more meaningful categories
(e.g., Age Group from Date of Birth).