DISCOVER . LEARN . EMPOWER
Lecture – 1
Pandas Basics
APEX INSTITUTE OF TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
MACHINE LEARNING (22CSH-286)
Faculty: Prof. (Dr.) Madan Lal Saini(E13485)
Machine Learning: Course Objectives
The course aims to:
1. Understand and apply various data handling and visualization techniques.
2. Understand basic learning algorithms and techniques and their applications, as well as general questions related to analysing and handling large data sets.
3. Develop skills in supervised and unsupervised learning techniques and implement them to solve real-life problems.
4. Develop basic knowledge of machine learning techniques to build an intelligent machine that makes decisions on behalf of humans.
5. Develop skills for selecting an algorithm and model parameters and applying them to design optimized machine learning applications.
COURSE OUTCOMES
On completion of this course, students shall be able to:
CO1 Describe and apply various data pre-processing and visualization techniques on a dataset.
CO2 Understand basic learning algorithms and analyse their applications, as well as general questions related to analysing and handling large data sets.
CO3 Describe machine learning techniques to build an intelligent machine that makes decisions on behalf of humans.
CO4 Develop supervised and unsupervised learning techniques and implement them to solve real-life problems.
CO5 Analyse the performance of a machine learning model and apply optimization techniques to improve the performance of the model.
Unit-1 Syllabus
Unit-1 Data Pre-processing Techniques
Data Pre-Processing: Data Frame Basics, CSV File, Libraries for Pre-processing, Handling Missing data, Encoding Categorical data, Feature Scaling, Handling Time Series data.
Feature Extraction: Dimensionality Reduction: Feature Selection Techniques, Feature Extraction Techniques; Data Transformation, Data Normalization.
Data Visualization: Different types of plots, Plotting fundamentals using Matplotlib, Plotting fundamentals using Seaborn.
SUGGESTED READINGS
TEXT BOOKS:
• T1: Tom M. Mitchell, “Machine Learning”, McGraw Hill, International Edition, 2018.
• T2: Ethem Alpaydin, “Introduction to Machine Learning”, Eastern Economy Edition, Prentice Hall of India, 2015.
• T3: Andreas C. Müller, Sarah Guido, “Introduction to Machine Learning with Python”, O’Reilly (2018).
REFERENCE BOOKS:
• R1: Sebastian Raschka, Vahid Mirjalili, “Python Machine Learning”, Packt Publisher (2019).
• R2: Aurélien Géron, “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”, Wiley, 2nd Edition, 2022.
• R3: Christopher Bishop, “Pattern Recognition and Machine Learning”, Illustrated Edition, Springer, 2016.
Table of Contents
• Introduction to Pandas
• Data frame
• Series
• Operation
• Plots
Data Structures
• Series: A one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). A pandas Series is comparable to a single column in an Excel sheet.
• import numpy as np
• import pandas as pd
• data = np.array(['d', 'e', 'e', 'k', 's', 'h', 'a'])
• ser = pd.Series(data)
• Data Frame: A two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
• d = pd.date_range('20200301', periods=10)
• df = pd.DataFrame(np.random.randn(10, 4), index=d, columns=['A', 'B', 'C', 'D'])
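To make the two definitions above concrete, here is a minimal sketch. The labels, column names and values are illustrative assumptions, not data from the lecture. It shows a Series with custom labels and a DataFrame whose columns hold different types.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array; these labels are illustrative.
ser = pd.Series([10, 20, 30], index=["x", "y", "z"])

# A DataFrame is a table whose columns may have different dtypes.
df = pd.DataFrame(
    {
        "name": ["Alex", "Bob", "Clarke"],  # strings
        "age": [10, 12, 13],                # integers
        "score": [7.5, 8.0, 6.5],           # floats
    }
)

# Each column of a DataFrame is itself a Series.
age_column = df["age"]

print(ser["y"])          # label-based access on a Series
print(df.dtypes)         # one dtype per column
print(type(age_column))  # a pandas Series
```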
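…continued
df.head()                                    # first 5 rows
df.columns                                   # column labels
df.index                                     # row labels
df.describe()                                # summary statistics per numeric column
df.sort_values(by='C')                       # sort rows by the values in column C
df[0:3]                                      # first three rows (slice by position)
df.loc['20200301':'20200306', ['D', 'C']]    # label-based selection, end label included
df.iloc[3:5, 0:2]                            # position-based selection, end position excluded
df[df['A'] > 0]                              # rows where column A is positive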
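Handle Missing Values
Missing data (null values) in a dataset can cause problems in later stages of the data science life cycle, so it is important to handle it effectively.
• Ex.
• df.isnull().count()    # number of rows per column; the Boolean result of isnull() has no NaN, so this does not count missing values
• df.isnull().sum()      # number of missing values in each column
• df.dropna()            # drop every row that contains a missing value
• df.fillna(value=2)     # replace every missing value with 2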
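The calls listed above can be tried on a small frame. The sketch below uses made-up values with two missing entries. Filling with the column mean is an extra option shown for comparison, not something the slide prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries (np.nan marks a missing value).
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

missing_per_column = df.isnull().sum()  # number of NaN values in each column
present_per_column = df.count()         # number of non-null values in each column

dropped = df.dropna()                   # keep only rows with no missing value
filled_const = df.fillna(value=2)       # replace every NaN with the constant 2
filled_mean = df.fillna(df.mean())      # replace each NaN with its column mean

print(missing_per_column)
print(dropped)
print(filled_mean)
```

None of these calls modify df in place. Each returns a new object, so the original frame still contains its NaN values afterwards.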
Series
data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data, index=[100, 101, 102, 103])
print(s)
Create Data Frame
A DataFrame can be created from:
• List
• Dict
• Series
• NumPy ndarrays
• Another DataFrame
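As a rough illustration of three of the input types listed above, the sketch below builds a DataFrame from a NumPy array, from a dict of lists, and from another DataFrame. All names and values are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# From a NumPy ndarray: row and column labels come from index= and columns=.
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, index=["a", "b", "c"], columns=["one", "two"])

# From a dict of equal-length lists: the keys become column names.
df_from_dict = pd.DataFrame({"Name": ["Alex", "Bob"], "Age": [10, 12]})

# From another DataFrame: copy() yields an independent object.
df_copy = df_from_array.copy()
df_copy.loc["a", "one"] = 100  # does not affect df_from_array

print(df_from_array)
print(df_from_dict)
print(df_copy)
```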
Data Frame Examples
# From a list: a single column with the default label 0
data = [1, 2, 3, 4, 5]
df = pd.DataFrame(data)
print(df)
# From a list of lists, with explicit column names
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
# From a dict of Series: indexes are aligned, so 'one' is NaN in row 'd'
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df['one'])
# Column deletion
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three': pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)
# using the del statement
print("Deleting the first column using DEL function:")
del df['one']
print(df)
# using the pop function (it also returns the removed column)
print("Deleting another column using POP function:")
df.pop('two')
print(df)
Data Frame Functionality
Sr.No.  Attribute or Method  Description
1       T                    Transposes rows and columns.
2       axes                 Returns a list with the row axis labels and column axis labels as the only members.
3       dtypes               Returns the dtypes in this object.
4       empty                True if the DataFrame is entirely empty (no items), that is, if any of the axes has length 0.
5       ndim                 Number of axes / array dimensions.
6       shape                Returns a tuple representing the dimensionality of the DataFrame.
7       size                 Number of elements in the DataFrame.
8       values               NumPy representation of the DataFrame.
9       head()               Returns the first n rows (n defaults to 5).
10      tail()               Returns the last n rows (n defaults to 5).
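The attributes in the table above can be inspected on a small frame. This is a minimal sketch using made-up names and ages.

```python
import pandas as pd

# Hypothetical three-row frame used only to inspect the attributes.
df = pd.DataFrame({"Name": ["Alex", "Bob", "Clarke"], "Age": [10, 12, 13]})

print(df.T)        # transposed: the two columns become two rows
print(df.axes)     # [row index, column index]
print(df.dtypes)   # dtype of each column
print(df.empty)    # False
print(df.ndim)     # 2
print(df.shape)    # (3, 2)
print(df.size)     # 6
print(df.values)   # underlying NumPy array
print(df.head(2))  # first two rows
print(df.tail(1))  # last row
```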
Continued..
• rename(): Relabels an axis based on a mapping (a dict or Series) or an arbitrary function.
• get_dummies(): Returns the DataFrame with one-hot encoded values. It is a top-level function, called as pd.get_dummies(df).
• loc: Purely label-based indexing. When slicing, both the start and the end bound are included.
• iloc: Purely integer-based indexing. As in Python and NumPy, positions are 0-based and the end bound of a slice is excluded.
df = pd.DataFrame(np.random.randn(8, 4), index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'], columns=['A', 'B', 'C', 'D'])
# Select a few rows for multiple columns by passing lists of labels
print(df.loc[['a', 'b', 'f', 'h'], ['A', 'C']])
print(df.loc['a':'h'])
print(df.iloc[:4])
print(df.iloc[1:5, 2:4])
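The slide shows code for loc and iloc only. The sketch below covers rename() and get_dummies() on a small made-up frame. The column names and grades are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alex", "Bob", "Clarke"], "Grade": ["A", "B", "A"]})

# rename() with a dict relabels only the keys that are listed.
renamed = df.rename(columns={"Name": "Student"})

# rename() also accepts a function that is applied to every label.
upper = df.rename(columns=str.upper)

# get_dummies() is a top-level pandas function; each category becomes a column.
encoded = pd.get_dummies(df, columns=["Grade"])

print(renamed.columns.tolist())
print(upper.columns.tolist())
print(encoded)
```

Depending on the pandas version, the indicator columns produced by get_dummies() hold either Booleans or 0/1 integers, so converting them with astype(int) is a safe way to get a uniform result.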
More Functions..
Sr.No. Function Description
1 count() Number of non-null observations
2 sum() Sum of values
3 mean() Mean of Values
4 median() Median of Values
5 mode() Mode of values
6 std() Standard Deviation of the Values
7 min() Minimum Value
8 max() Maximum Value
9 abs() Absolute Value
10 prod() Product of Values
11 cumsum() Cumulative Sum
12 cumprod() Cumulative Product
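The functions in the table can be checked on a short Series. The values below are arbitrary and chosen so that the results are easy to verify by hand.

```python
import pandas as pd

s = pd.Series([2, -3, 4, 4])

print(s.count())             # 4
print(s.sum())               # 7
print(s.mean())              # 1.75
print(s.median())            # 3.0
print(s.mode().tolist())     # [4]
print(s.std())               # sample standard deviation (ddof=1 by default)
print(s.min())               # -3
print(s.max())               # 4
print(s.abs().tolist())      # [2, 3, 4, 4]
print(s.prod())              # -96
print(s.cumsum().tolist())   # [2, -1, 3, 7]
print(s.cumprod().tolist())  # [2, -6, -24, -96]
```

Called on a DataFrame, the same functions work column by column by default.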
Data Frame: filtering
To subset the data we can apply Boolean indexing, commonly known as a filter. For example, to select the rows in which the salary value is greater than $120K:
In [ ]: #Select only those rows where the salary is greater than 120000:
df_sub = df[ df['salary'] > 120000 ]
In [ ]: #Select only those rows that contain female professors:
df_f = df[ df['sex'] == 'Female' ]
Any comparison operator can be used to subset the data:
> greater; >= greater or equal;
< less; <= less or equal;
== equal; != not equal;
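The lecture's salary dataset is not included in this transcript, so the sketch below builds a small stand-in frame with the same column names (rank, sex, salary). It shows that a filter is a Boolean Series and how several conditions can be combined.

```python
import pandas as pd

# Hypothetical professor data; only the column names mirror the slide.
df = pd.DataFrame(
    {
        "rank": ["Prof", "AsstProf", "Prof", "AssocProf"],
        "sex": ["Male", "Female", "Female", "Male"],
        "salary": [150000, 90000, 130000, 110000],
    }
)

mask = df["salary"] > 120000  # Boolean Series, one value per row
high_paid = df[mask]          # keeps the rows where mask is True

# Combine conditions with & (and), | (or), ~ (not).
# Each condition needs its own parentheses.
high_paid_female = df[(df["salary"] > 120000) & (df["sex"] == "Female")]
not_prof = df[~(df["rank"] == "Prof")]

print(mask.tolist())
print(high_paid_female)
print(not_prof)
```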
Data Frames groupby method
Using the "group by" method we can:
• Split the data into groups based on some criteria
• Calculate statistics (or apply a function) to each group
• Similar to the group_by() function of the dplyr package in R
In [ ]: #Group data using rank
df_rank = df.groupby(['rank'])
In [ ]: #Calculate mean value for each numeric column per each group
df_rank.mean(numeric_only=True)
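The split and apply steps described above can be seen separately on a small example. The frame below is a made-up stand-in for the lecture's salary data.

```python
import pandas as pd

# Hypothetical data; only the column names mirror the slide.
df = pd.DataFrame(
    {
        "rank": ["Prof", "AsstProf", "Prof", "AsstProf"],
        "sex": ["Male", "Female", "Female", "Male"],
        "salary": [150000, 90000, 130000, 80000],
    }
)

df_rank = df.groupby("rank")

# Split: iterating yields one (key, sub-DataFrame) pair per group.
group_sizes = {key: len(group) for key, group in df_rank}

# Apply and combine: one row of statistics per group.
mean_by_rank = df_rank.mean(numeric_only=True)           # skips the text column
summary = df_rank["salary"].agg(["min", "max", "mean"])  # several statistics at once

print(group_sizes)
print(mean_by_rank)
print(summary)
```

Passing numeric_only=True matters here because the frame contains a text column (sex), for which a mean is not defined.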
Data Frames groupby method
Once the groupby object is created we can calculate various statistics for each group:
In [ ]: #Calculate mean salary for each professor rank:
df.groupby('rank')[['salary']].mean()
Note: If single brackets are used to specify the column (e.g. ['salary']), the output is a pandas Series object. When double brackets are used, the output is a DataFrame.
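The note on single versus double brackets can be checked directly. This is a minimal sketch with assumed values.

```python
import pandas as pd

df = pd.DataFrame(
    {"rank": ["Prof", "AsstProf", "Prof"], "salary": [150000, 90000, 130000]}
)

as_series = df.groupby("rank")["salary"].mean()   # single brackets
as_frame = df.groupby("rank")[["salary"]].mean()  # double brackets

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```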
Data Frames groupby method
groupby performance notes:
- no grouping/splitting occurs until it's needed. Creating the groupby object
only verifies that you have passed a valid mapping
- by default the group keys are sorted during the groupby operation. You may
want to pass sort=False for potential speedup:
In [ ]: #Calculate mean salary for each professor rank:
df.groupby(['rank'], sort=False)[['salary']].mean()
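The visible effect of sort=False is the order of the group keys in the result. The sketch below uses made-up data and does not measure speed, since any gain depends on the size of the data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rank": ["Prof", "AsstProf", "Prof", "AssocProf"],
        "salary": [150000, 90000, 130000, 110000],
    }
)

sorted_keys = df.groupby("rank")[["salary"]].mean()
unsorted_keys = df.groupby("rank", sort=False)[["salary"]].mean()

print(list(sorted_keys.index))    # ['AssocProf', 'AsstProf', 'Prof']
print(list(unsorted_keys.index))  # ['Prof', 'AsstProf', 'AssocProf']
```

With sort=False the groups appear in order of first occurrence in the data. The computed means are the same either way.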
Graphics to explore the data
To show graphs within a Python notebook, include the inline directive:
In [ ]: %matplotlib inline
The Seaborn package is built on matplotlib but provides a high-level interface for drawing attractive statistical graphics, similar to the ggplot2 library in R. It specifically targets statistical data visualization.
Graphics
Function     Description
distplot     histogram (deprecated since seaborn 0.11; use histplot or displot)
barplot      estimate of central tendency for a numeric variable
violinplot   similar to boxplot, also shows the probability density of the data
jointplot    scatterplot
regplot      regression plot
pairplot     pairplot
boxplot      boxplot
swarmplot    categorical scatterplot
factorplot   general categorical plot (renamed catplot in seaborn 0.9)
Key Features
• Fast and efficient DataFrame object with default and customized
indexing.
• Tools for loading data into in-memory data objects from different file
formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time Series functionality.
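Two of the features listed above, merging and pivoting, are not demonstrated elsewhere in the lecture. The sketch below is one possible illustration using two made-up tables. The table and column names are assumptions.

```python
import pandas as pd

# Two hypothetical tables that share the key column "name".
staff = pd.DataFrame({"name": ["Alex", "Bob", "Clarke"], "dept": ["CS", "EE", "CS"]})
pay = pd.DataFrame({"name": ["Alex", "Bob", "Clarke"], "salary": [100, 120, 140]})

# Merging: join the two tables on their common key.
merged = pd.merge(staff, pay, on="name")

# Pivoting: one row per department, holding the mean salary.
by_dept = merged.pivot_table(values="salary", index="dept", aggfunc="mean")

print(merged)
print(by_dept)
```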
Questions?
• How Do You Handle Missing or Corrupted Data in a Dataset?
• How Can You Choose a Classifier Based on a Training Set Data Size?
• What Are the Three Stages of Building a Model in Machine Learning?
• What Are the Different Types of Machine Learning?
• What is ‘training Set’ and ‘test Set’ in a Machine Learning Model? How
Much Data Will You Allocate for Your Training, Validation, and Test Sets?
References
Book:
• Ethem Alpaydin, “Introduction to Machine Learning”, Eastern Economy Edition, Prentice Hall of India, 2015.
• Andreas C. Müller, Sarah Guido, “Introduction to Machine Learning with Python”, O’Reilly (2018).
Research Paper:
• Bi, Qifang, et al. "What is machine learning? A primer for the epidemiologist." American Journal of Epidemiology 188.12 (2019): 2222-2239.
• Jordan, Michael I., and Tom M. Mitchell. "Machine learning: Trends, perspectives, and prospects." Science 349.6245 (2015): 255-260.
Websites:
• https://www.geeksforgeeks.org/machine-learning/
• https://www.javatpoint.com/machine-learning
Videos:
• https://www.youtube.com/playlist?list=PLIg1dOXc_acbdJo-AE5RXpIM_rvwrerwR
THANK YOU
For queries
Email: madan.e13485@cumail.in
