MODY UNIVERSITY OF SCIENCE
AND TECHNOLOGY
Colloquium Presentation
CS 14.371
Submitted to-
Dr. Pervesh Kumar Bishnoi
Ms. Sonal Shukla
Submitted by-
Akshita Kanther
B.Tech. IIIrd yr C2
Er.No.-180161
Presentation on data preparation with pandas
CONTENTS
Introduction
Why Should We Prepare Our Data
Python
Python Libraries
Pandas
Features of Pandas
Core Components Of Pandas
Pandas Operations
Typical Pipeline For Data Preparation
Common Tasks Involved In Data Preparation
Applications Of Pandas
Companies using Pandas
Summary
INTRODUCTION
Data preparation is the first step after you get your
hands on any kind of dataset. This is the step when you
pre-process raw data into a form that can be easily and
accurately analyzed. Proper data preparation allows for
efficient analysis - it can eliminate errors and
inaccuracies that could have occurred during the data
gathering process and can thus help in removing some
bias resulting from poor data quality. As a result, a large
share of an analyst's time is spent on this vital step.
WHY SHOULD WE PREPARE OUR DATA
Garbage in, garbage out
Reduce errors
Remove duplicate records
Fix missing values
Correct range values
Fix formatting (e.g. date, text, number)
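As a minimal sketch, the fixes listed above might look like this in pandas. The table, the column names, and the valid age range of 0 to 120 are invented for illustration and are not taken from the slides.

```python
import pandas as pd

# Hypothetical raw records: one duplicate row, one missing age,
# one out-of-range age, and dates stored as text.
raw = pd.DataFrame({
    "name": ["Asha", "Asha", "Ravi", "Meena"],
    "age": [29.0, 29.0, None, 250.0],
    "joined": ["2020-01-15", "2020-01-15", "2020-03-01", "2020-07-20"],
})

# Remove duplicate records.
clean = raw.drop_duplicates().reset_index(drop=True)

# Correct range values: cap ages to the assumed valid range 0..120.
clean["age"] = clean["age"].clip(lower=0, upper=120)

# Fix missing values: fill gaps with the median of the known ages.
clean["age"] = clean["age"].fillna(clean["age"].median())

# Fix formatting: parse the text dates into real datetime values.
clean["joined"] = pd.to_datetime(clean["joined"])

print(clean)
```

Filling with the median is only one possible policy; dropping the row or using a domain-specific default may suit other datasets better.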
PYTHON
• Object-oriented, high-level
programming language
• Used as a scripting language to
connect existing components
together
• Simple, easy-to-learn syntax that
emphasizes readability
• Supports modules and
packages
PYTHON LIBRARIES
Many popular Python
toolboxes/libraries:-
• NumPy
• SciPy
• Pandas
• scikit-learn
Visualization libraries:-
• matplotlib
• Seaborn
PANDAS
• Pandas is a software library written for Python
• Pandas has so many uses that it might make sense to list
the things it can't do instead of what it can do
• This tool is essentially your data’s home. Through pandas,
you get acquainted with your data by cleaning,
transforming, and analyzing it
• Pandas is well suited for different kinds of data, such as:
  • Tabular data with heterogeneously-typed columns
  • Ordered and unordered time series data
  • Arbitrary matrix data with row & column labels
  • Unlabelled data
  • Any other form of observational or statistical data sets
To use the pandas library, you first need to import it. Type
this in your Python console:
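The usual form of the import is sketched below. The alias `pd` is a widely used convention rather than a requirement, and the version check is an optional extra added here to confirm that the library is installed.

```python
import pandas as pd  # "pd" is the conventional alias, not a requirement

# Optional check that pandas is installed and importable.
print(pd.__version__)
```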
CORE COMPONENTS OF PANDAS
The two primary components of pandas are:-
• DataFrame
• Series
A Series is essentially a column, and a DataFrame is a
two-dimensional table made up of a collection of Series.
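The relationship between the two components can be sketched as follows. The names and ages are made-up sample data.

```python
import pandas as pd

# A Series is a single labelled column of values.
ages = pd.Series([29, 34, 41], name="age")

# A DataFrame is a table whose columns are Series sharing one index.
people = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "age": ages,
})

# Selecting one column of a DataFrame gives back a Series.
age_column = people["age"]

print(type(age_column).__name__)
print(people.shape)  # (number of rows, number of columns)
```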
PANDAS OPERATIONS
With pandas, you can perform many operations on Series and DataFrames, such as handling missing data
and grouping records (group by). Some of the common operations for data manipulation are listed below:
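A short sketch of such operations is given here, assuming a small invented sales table. The particular operations chosen (slicing, filtering, filling missing values, grouping, and merging) are illustrative picks, not necessarily the exact list from the slide.

```python
import pandas as pd

# Hypothetical data with one missing value in "units".
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [10.0, float("nan"), 7.0, 5.0],
})

first_two = sales.iloc[:2]                    # slice rows by position
north = sales[sales["region"] == "North"]     # filter rows by a condition
filled = sales.fillna({"units": 0})           # replace missing values with 0

# Group by region and total the units in each group.
totals = filled.groupby("region")["units"].sum()

# Merge (join) with a second, equally hypothetical table.
managers = pd.DataFrame({
    "region": ["North", "South"],
    "manager": ["Kiran", "Dev"],
})
merged = filled.merge(managers, on="region", how="left")

print(totals)
print(merged)
```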
TYPICAL PIPELINE FOR DATA PREPARATION
• The first step of a data preparation pipeline is to gather data from various
sources and locations
• Before any processing is done, we wish to discover what the data is about.
At this stage, we understand the data within the context of business goals.
Visualization of the data is also helpful here
• The next stage is to cleanse the data of missing values and invalid values.
We also reformat data to standard forms
• Next we transform the data for a specific outcome or audience
• We can enrich data by merging different datasets to enable richer insights
• Finally, we store the data or directly send it out for analytics
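The stages above can be sketched end to end in pandas. Everything concrete in this example is an assumption made for illustration: the two in-memory CSV sources, the column names, the rule that negative amounts are invalid, and the choice of CSV text as the storage format.

```python
import io
import pandas as pd

# Hypothetical sources; in practice these would be files, databases, or APIs.
ORDERS_CSV = """order_id,customer_id,amount,order_date
1,101,250.0,2021-01-05
2,102,,2021-01-06
3,101,-40.0,2021-01-07
4,103,120.0,2021-01-09
"""
CUSTOMERS_CSV = """customer_id,city
101,Jaipur
102,Delhi
103,Pune
"""

# 1. Gather data from the various sources.
orders = pd.read_csv(io.StringIO(ORDERS_CSV))
customers = pd.read_csv(io.StringIO(CUSTOMERS_CSV))

# 2. Discover what the data is about.
orders.info()
summary = orders.describe()

# 3. Cleanse: drop missing and invalid (negative) amounts, standardize dates.
orders = orders.dropna(subset=["amount"])
orders = orders[orders["amount"] >= 0].copy()
orders["order_date"] = pd.to_datetime(orders["order_date"])

# 4. Transform for a specific outcome: express amounts in thousands.
orders["amount_thousands"] = orders["amount"] / 1000

# 5. Enrich by merging in a second dataset.
enriched = orders.merge(customers, on="customer_id", how="left")

# 6. Store: serialize to CSV text (a file path could be passed instead).
output = enriched.to_csv(index=False)
print(output)
```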
COMMON TASKS INVOLVED IN DATA PREPARATION
Data preparation involves one or more of the following tasks:
• Aggregation: Multiple columns are reduced to fewer columns.
Records are summarized
• Anonymization: Sensitive values are removed for the sake of
privacy
• Augmentation: Expand the dataset size without collecting more
data. For example, image data is augmented via cropping or
rotating
• Blending: Combine and link related data from various sources. For
example, combine an employee's HR data with payroll data
• Decomposing: Decompose a data column that has sub-fields. For
example, "6 ounces butter" is decomposed into three columns
representing value, unit and ingredient
• Deletion: Duplicates and outliers are removed. Exploratory Data
Analysis (EDA) may be used to identify outliers
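Most of these tasks map onto a few pandas calls, as the following sketch suggests. The HR, payroll, recipe, and sensor tables are invented, and the 1.5 × IQR rule used for outliers is one common convention chosen here, not one prescribed by the slides. Augmentation is left out because the slide's example concerns image data rather than tables.

```python
import pandas as pd

# Hypothetical HR and payroll tables; the last HR row is a duplicate.
hr = pd.DataFrame({
    "emp_id": [1, 2, 3, 3],
    "name": ["Asha", "Ravi", "Meena", "Meena"],
    "dept": ["Sales", "Sales", "IT", "IT"],
})
payroll = pd.DataFrame({"emp_id": [1, 2, 3], "salary": [50000, 52000, 61000]})

# Deletion (duplicates): drop exact duplicate records.
hr = hr.drop_duplicates()

# Blending: link HR data with payroll data on a shared key.
staff = hr.merge(payroll, on="emp_id", how="inner")

# Anonymization: remove the sensitive column.
staff = staff.drop(columns=["name"])

# Aggregation: summarize records per department.
by_dept = staff.groupby("dept")["salary"].mean()

# Decomposing: split one text column into value, unit, and ingredient.
recipe = pd.DataFrame({"item": ["6 ounces butter", "2 cups flour"]})
parts = recipe["item"].str.split(" ", n=2, expand=True)
parts.columns = ["value", "unit", "ingredient"]
parts["value"] = parts["value"].astype(float)

# Deletion (outliers): keep readings within 1.5 * IQR of the quartiles.
readings = pd.Series([10, 12, 11, 13, 12, 95])
q1, q3 = readings.quantile(0.25), readings.quantile(0.75)
iqr = q3 - q1
kept = readings[(readings >= q1 - 1.5 * iqr) & (readings <= q3 + 1.5 * iqr)]

print(by_dept)
print(parts)
print(kept.tolist())
```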
APPLICATIONS OF PANDAS
COMPANIES USING PANDAS
SUMMARY
Raw data is usually not suitable for direct analysis.
This is because the data might come from different
sources in different formats. Moreover, real-world
data is not clean. Some data points might be
missing. Some others might be out of range.
There could be duplicates. Data preparation is
therefore an essential task that transforms or
prepares data into a form that's suitable for
analysis.
Data preparation usually assumes that data has already
been collected, although some consider data collection
and data ingestion to be part of data preparation.
Within data preparation, it's common to identify
sub-stages that might include data pre-processing,
data wrangling, and data transformation.