Table of Contents
1. Common data problems
1.1 Data type constraints
1.2 Common data types
Manipulating and analyzing data with incorrect data types can compromise your analysis as you move along the data science workflow.
When working with new data, you should always check the data types of your columns using the .dtypes attribute or the .info() method, which you'll see in the next exercise. Often, you'll run into columns that should be converted to different data types before starting any analysis.
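Before the exercises, here is a minimal sketch (on a small made-up DataFrame, not the course data) of what .dtypes and .info() report:
# Small made-up DataFrame to illustrate the two inspection tools
import pandas as pd
df = pd.DataFrame({'points': [120, 340],
                   'first_name': ['Ana', 'Bo'],
                   'order_date': ['2020-01-05', '2020-02-11']})
# .dtypes shows the type of each column
print(df.dtypes)
# .info() prints non-null counts, dtypes and memory usage
df.info()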
In this exercise, you’ll first identify different types of data and correctly map them to their respective types.
- Numeric data types
  - Number of points on customer loyalty card
  - Salary earned monthly
  - Number of items bought in a basket
- Text
  - First name
  - Shipping address of a customer
  - City of residence
- Dates
  - Order date of a product
  - Birthdates of clients
1.3 Numeric data or …?
In this exercise, and throughout this chapter, you’ll be working with bicycle ride sharing data in San Francisco called ride_sharing
. It contains information on the start and end stations, the trip duration, and some user information for a bike sharing service.
The user_type column contains information on whether a user is taking a free ride and takes on the following values:
- 1 for free riders.
- 2 for pay per ride.
- 3 for monthly subscribers.
In this instance, you will print the information of ride_sharing
using .info()
and see a firsthand example of how an incorrect data type can flaw your analysis of the dataset.
Instruction 1
- Print the information of ride_sharing.
- Use .describe() to print the summary statistics of the user_type column from ride_sharing.
# Print the information of ride_sharing
print(ride_sharing.info())
# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())
Instruction 2
Question
The summary statistics don't seem to offer much insight into how users are distributed across their purchase types. Why do you think that is?
Possible Answers
□ The user_type column is not of the correct type, it should be converted to str.
□ The user_type column has an infinite set of possible values, it should be converted to category.
■ The user_type column has a finite set of possible values that represent groupings of data, it should be converted to category.
Instruction 3
- Convert user_type into categorical by assigning it the 'category' data type and store it in the user_type_cat column.
- Make sure you converted user_type_cat correctly by using an assert statement.
# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')
# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'
# Print new summary statistics
print(ride_sharing['user_type_cat'].describe())
1.4 Summing strings and concatenating numbers
In the previous exercise, you were able to identify that category
is the correct data type for user_type
and convert it in order to extract relevant statistical summaries that shed light on the distribution of user_type
.
Another common data type problem is importing what should be numerical values as strings, as mathematical operations such as summing and multiplication lead to string concatenation, not numerical outputs.
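A minimal sketch of the problem (on a made-up Series, not the course data): summing a string column concatenates instead of adding.
# A numeric-looking column imported as strings
import pandas as pd
s = pd.Series(['10', '20', '30'])
print(s.sum())                # '102030' -- concatenation, not addition
print(s.astype('int').sum())  # 60 -- the intended numeric sum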
In this exercise, you'll be converting the string column duration to the type int. Before that, however, you will need to strip "minutes" from the column so that pandas reads it as numerical.
Instruction
- Use the .strip() method to strip duration of "minutes" and store it in the duration_trim column.
- Convert duration_trim to int and store it in the duration_time column.
- Write an assert statement that checks if duration_time's data type is now an int.
- Print the average ride duration.
# Strip duration of "minutes"
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip("minutes")
# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')
# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'
# Print formed columns and calculate average ride duration
print(ride_sharing[['duration','duration_trim','duration_time']])
print(ride_sharing['duration_time'].mean())
1.5 Data range constraints
1.6 Tire size constraints
In this lesson, you’re going to build on top of the work you’ve been doing with the ride_sharing
DataFrame. You’ll be working with the tire_sizes
column which contains data on each bike’s tire size.
Bicycle tire sizes could be either 26″, 27″ or 29″ and are here correctly stored as a categorical value. In an effort to cut maintenance costs, the ride sharing provider decided to set the maximum tire size to be 27″.
In this exercise, you will make sure the tire_sizes
column has the correct range by first converting it to an integer, then setting and testing the new upper limit of 27″ for tire sizes.
Instruction
- Convert the tire_sizes column from category to 'int'.
- Use .loc[] to set all values of tire_sizes above 27 to 27.
- Reconvert tire_sizes back to 'category' from int.
- Print the description of tire_sizes.
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')
# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27,'tire_sizes'] = 27
# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')
# Print tire size description
print(ride_sharing['tire_sizes'].describe())
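As a side note, capping a numeric column can also be done without .loc[]; a sketch using Series.clip() on the same column would be:
# Alternative: cap tire sizes at 27 with .clip() instead of .loc[]
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int').clip(upper=27).astype('category')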
1.7 Back to the future
The data pipeline feeding into the ride_sharing DataFrame has been updated to register each ride's date. This information is stored in the ride_date column of the type object, which represents strings in pandas.
A bug was discovered which was recording rides taken today as taken next year. To fix this, you will find all instances of the ride_date column that occur anytime in the future, and set the maximum possible value of this column to today's date. Before doing so, you will need to convert ride_date to a datetime object.
Instruction
- Convert ride_date to a datetime object and store it in the ride_dt column using to_datetime().
- Create the variable today, which stores today's date, by using the dt.date.today() function.
- For all instances of ride_dt in the future, set them to today's date.
- Print the maximum date in the ride_dt column.
# Convert ride_date to datetime
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date'])
# Save today's date
today = dt.date.today()
# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today
# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())
1.8 Uniqueness constraints
1.9 How big is your subset?
You have the following loans
DataFrame which contains loan and credit score data for consumers, and some metadata such as their first and last names. You want to find both complete and incomplete duplicates using .duplicated()
.
| first_name | last_name | credit_score | has_loan |
|---|---|---|---|
| Justin | Saddlemeyer | 600 | 1 |
| Hadrien | Lacroix | 450 | 0 |
Choose the correct usage of .duplicated()
below:
□ loans.duplicated()
Because the default method returns both complete and incomplete duplicates.
□ loans.duplicated(subset = 'first_name')
Because constraining the duplicate rows to the first name lets me find incomplete duplicates as well.
■ loans.duplicated(subset = ['first_name', 'last_name'], keep = False)
Because subsetting on consumer metadata and not discarding any duplicate returns all duplicated rows.
□ loans.duplicated(subset = ['first_name', 'last_name'], keep = 'first')
Because this drops all duplicates.
Subsetting on metadata and keeping all duplicate records gives you a better bird's-eye view of your data and how it is duplicated! You can even subset the loans DataFrame using bracketing and sort the values so you can properly identify the duplicates.
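A minimal sketch of how the keep argument changes what .duplicated() flags, on a tiny made-up DataFrame:
# Tiny made-up example to contrast keep='first' (default) with keep=False
import pandas as pd
df = pd.DataFrame({'first_name': ['Justin', 'Justin', 'Hadrien'],
                   'last_name': ['Saddlemeyer', 'Saddlemeyer', 'Lacroix']})
print(df.duplicated(subset=['first_name', 'last_name']))              # flags only the second Justin row
print(df.duplicated(subset=['first_name', 'last_name'], keep=False))  # flags both Justin rows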
1.10 Finding duplicates
A new update to the data pipeline feeding into ride_sharing
has added the ride_id
column, which represents a unique identifier for each ride.
The update however coincided with radically shorter average ride duration times and irregular user birth dates set in the future. Most importantly, the number of rides taken has increased by 20% overnight, leading you to think there might be both complete and incomplete duplicates in the ride_sharing
DataFrame.
In this exercise, you will confirm this suspicion by finding those duplicates.
Instruction
- Find duplicated rows of ride_id in the ride_sharing DataFrame while setting keep to False.
- Subset ride_sharing on duplicates, sort by ride_id, and assign the results to duplicated_rides.
- Print the ride_id, duration and user_birth_year columns of duplicated_rides in that order.
# Find duplicates
duplicates = ride_sharing.duplicated(subset = 'ride_id',
keep = False)
# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')
# Print relevant columns of duplicated_rides
print(duplicated_rides[['ride_id','duration','user_birth_year']])
1.11 Treating duplicates
In the last exercise, you were able to verify that the new update feeding into ride_sharing
contains a bug generating both complete and incomplete duplicated rows for some values of the ride_id
column, with occasional discrepant values for the user_birth_year
and duration columns.
In this exercise, you will be treating those duplicated rows by first dropping complete duplicates, and then merging the incomplete duplicate rows into one while keeping the average duration
, and the minimum user_birth_year
for each set of incomplete duplicate rows.
Instruction
- Drop complete duplicates in ride_sharing and store the results in ride_dup.
- Create the statistics dictionary which holds minimum aggregation for user_birth_year and mean aggregation for duration.
- Drop incomplete duplicates by grouping by ride_id and applying the aggregation in statistics.
- Find duplicates again and run the assert statement to verify de-duplication.
# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()
# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}
# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby(by = 'ride_id').agg(statistics).reset_index()
# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates == True]
# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0
2. Text and categorical data problems
2.1 Membership constraints
2.2 Members only
Throughout the course so far, you've been exposed to some common problems that you may encounter with your data, from data type constraints, data range constraints, and uniqueness constraints, to now membership constraints for categorical values.
In this exercise, you will map hypothetical problems to their respective categories.
- Membership Constraint
  - A month column with the value 14.
  - A day_of_week column with the value Suntermonday.
  - A has_loan column with the value 12.
  - A GPA column containing a Z- grade.
- Other Constraint
  - An age column with values above 130.
  - A revenue column represented as a string.
  - A birthdate column with values in the future.
2.3 Finding consistency
In this exercise and throughout this chapter, you’ll be working with the airlines
DataFrame which contains survey responses on the San Francisco Airport from airline customers.
The DataFrame contains flight metadata such as the airline, the destination, waiting times as well as answers to key questions regarding cleanliness, safety, and satisfaction. Another DataFrame named categories
was created, containing all correct possible values for the survey columns.
In this exercise, you will use both of these DataFrames to find survey answers with inconsistent values, and drop them, effectively performing an outer and inner join on both these DataFrames as seen in the video exercise.
Instruction 1
- Print the categories DataFrame and take a close look at all possible correct categories of the survey columns.
- Print the unique values of the survey columns in airlines using the .unique() method.
# Print categories DataFrame
print(categories)
# Print unique values of survey columns in airlines
print('Cleanliness: ',
airlines['cleanliness'].unique(),
"\n")
print('Safety: ',
airlines['safety'].unique(),
"\n")
print('Satisfaction: ',
airlines['satisfaction'].unique(),
"\n")
Instruction 2
Question
Take a look at the output. Out of the cleanliness
, safety
and satisfaction
columns, which one has an inconsistent category and what is it?
■ cleanliness because it has an Unacceptable category.
□ cleanliness because it has a Terribly dirty category.
□ satisfaction because it has a Very satisfied category.
□ safety because it has a Neutral category.
Instruction 3
- Create a set out of the cleanliness column in airlines using set() and find the inconsistent category by finding the difference in the cleanliness column of categories.
- Find rows of airlines with a cleanliness value not in categories and print the output.
- Print the rows with the consistent categories of cleanliness only.
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])
# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)
# Print rows with inconsistent category
print(airlines[cat_clean_rows])
# Print rows with consistent categories only
print(airlines[~cat_clean_rows])
2.4 Categorical variables
2.5 Categories of errors
In the video exercise, you saw how to address common problems affecting categorical variables in your data, including white spaces and inconsistencies in your categories, and the problem of creating new categories and mapping existing ones to new ones.
To get a better idea of the toolkit at your disposal, you will be mapping functions and methods from pandas
and Python used to address each type of problem.
- White spaces and inconsistency
  - .str.strip()
  - .str.lower()
  - .str.upper()
- Creating or remapping categories
  - .replace()
  - pandas.qcut()
  - pandas.cut()
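A minimal sketch (on made-up numbers) of the difference between pd.cut() and pd.qcut():
# pd.cut: bins defined by explicit edges you choose
# pd.qcut: bins defined by quantiles, so each bin holds roughly the same number of rows
import pandas as pd
import numpy as np
income = pd.Series([12, 35, 48, 60, 95, 150])
print(pd.cut(income, bins=[0, 50, 100, np.inf], labels=['low', 'medium', 'high']))
print(pd.qcut(income, q=3, labels=['low', 'medium', 'high']))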
2.6 Inconsistent categories
In this exercise, you’ll be revisiting the airlines
DataFrame from the previous lesson.
As a reminder, the DataFrame contains flight metadata such as the airline, the destination, waiting times as well as answers to key questions regarding cleanliness, safety, and satisfaction on the San Francisco Airport.
In this exercise, you will examine two categorical columns from this DataFrame, dest_region
and dest_size
respectively, assess how to address them and make sure that they are cleaned and ready for analysis.
Instruction 1
Print the unique values in dest_region
and dest_size
respectively.
# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())
Instruction 2
Question
From looking at the output, what do you think is the problem with these columns?
□ The dest_region column has only inconsistent values due to capitalization.
□ The dest_region column has inconsistent values due to capitalization and has one value that needs to be remapped.
□ The dest_size column has only inconsistent values due to leading and trailing spaces.
■ Both 2 and 3 are correct.
Instruction 3
- Change the capitalization of all values of dest_region to lowercase.
- Replace 'eur' with 'europe' in dest_region using the .replace() method.
# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower()
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})
Instruction 4
- Strip white spaces from the dest_size column using the .strip() method.
- Verify that the changes have taken effect by printing the unique values of the columns using .unique().
# Remove white spaces from `dest_size`
airlines['dest_size'] = airlines['dest_size'].str.strip()
# Verify changes have been effected
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())
2.7 Remapping categories
To better understand survey respondents from airlines
, you want to find out if there is a relationship between certain responses and the day of the week and wait time at the gate.
The airlines
DataFrame contains the day
and wait_min
columns, which are categorical and numerical respectively. The day
column contains the exact day a flight took place, and wait_min
contains the amount of minutes it took travelers to wait at the gate. To make your analysis easier, you want to create two new categorical variables:
- wait_type: 'short' for 0-60 min, 'medium' for 60-180, and 'long' for 180+.
- day_week: 'weekday' if the day is a weekday, 'weekend' if the day falls on the weekend.
Instruction
- Create the ranges and labels for the wait_type column mentioned in the description above.
- Create the wait_type column from wait_min by using pd.cut(), while inputting label_ranges and label_names in the correct arguments.
- Create the mapping dictionary mapping weekdays to 'weekday' and weekend days to 'weekend'.
- Create the day_week column by using .replace().
# Create ranges for categories
label_ranges = [0, 60, 180, np.inf]
label_names = ['short', 'medium', 'long']
# Create wait_type column
airlines['wait_type'] = pd.cut(airlines['wait_min'],
bins = label_ranges,
labels = label_names)
# Create mappings and replace
mappings = {'Monday':'weekday',
'Tuesday':'weekday',
'Wednesday': 'weekday',
'Thursday': 'weekday',
'Friday': 'weekday',
'Saturday': 'weekend',
'Sunday': 'weekend'}
airlines['day_week'] = airlines['day'].replace(mappings)
Don’t forget, you can always use an assert statement to check your changes passed.
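For example, a sanity check along these lines (assuming the code above ran and the day column has no missing values) would raise nothing if the remapping worked:
# Sanity-check the new columns created above
assert set(airlines['day_week'].unique()) == {'weekday', 'weekend'}
assert airlines['wait_type'].dtype == 'category'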
2.8 Cleaning text data
2.9 Removing titles and taking names
While collecting survey respondent metadata in the airlines
DataFrame, the full name of respondents was saved in the full_name
column. However upon closer inspection, you found that a lot of the different names are prefixed by honorifics such as "Dr."
, "Mr."
, "Ms."
and "Miss"
.
Your ultimate objective is to create two new columns named first_name
and last_name
, containing the first and last names of respondents respectively. Before doing so however, you need to remove honorifics.
Instruction
- Remove "Dr.", "Mr.", "Miss" and "Ms." from full_name by replacing them with an empty string "" in that order.
- Run the assert statement using .str.contains() that tests whether full_name still contains any of the honorifics.
# Replace "Dr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Dr.","")
# Replace "Mr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Mr.","")
# Replace "Miss" with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Miss","")
# Replace "Ms." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Ms.","")
# Assert that full_name has no honorifics
assert airlines['full_name'].str.contains('Ms.|Mr.|Miss|Dr.').any() == False
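One caveat: depending on your pandas version, .str.replace() may treat the pattern as a regular expression, in which case the "." in "Dr." matches any character. Passing regex=False makes the literal intent explicit, for example:
# Literal (non-regex) replacement of an honorific, independent of the pandas default
airlines['full_name'] = airlines['full_name'].str.replace("Dr.", "", regex=False)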
2.10 Keeping it descriptive
To further understand travelers’ experiences in the San Francisco Airport, the quality assurance department sent out a qualitative questionnaire to all travelers who gave the airport the worst score on all possible categories. The objective behind this questionnaire is to identify common patterns in what travelers are saying about the airport.
Their response is stored in the survey_response
column. Upon a closer look, you realized a few of the answers gave the shortest possible character amount without much substance. In this exercise, you will isolate the responses with a character count higher than 40, and make sure your new DataFrame contains responses with 40 characters or more using an assert statement.
Instruction
- Using the airlines DataFrame, store the length of each instance in the survey_response column in resp_length by using .str.len().
- Isolate the rows of airlines with resp_length higher than 40.
- Assert that the smallest survey response length in airlines_survey is now bigger than 40.
# Store length of each row in survey_response column
resp_length = airlines['survey_response'].str.len()
# Find rows in airlines where resp_length > 40
airlines_survey = airlines[resp_length > 40]
# Assert minimum survey_response length is > 40
assert airlines_survey['survey_response'].str.len().min() > 40
# Print new survey_response column
print(airlines_survey['survey_response'])
3. Advanced data problems
3.1 Uniformity
3.2 Ambiguous dates
You have a DataFrame containing a subscription_date
column that was collected from various sources with different Date formats such as YYYY-mm-dd
and YYYY-dd-mm
. What is the best way to unify the formats for ambiguous values such as 2019-04-07
?
□ Set them to NA and drop them.
□ Infer the format of the data in question by checking the format of subsequent and previous values.
□ Infer the format from the original data source.
■ All of the above are possible, as long as we investigate where our data comes from and understand the dynamics affecting it before cleaning it.
3.3 Uniform currencies
In this exercise and throughout this chapter, you will be working with a retail banking dataset stored in the banking
DataFrame. The dataset contains data on the amount of money stored in accounts, their currency, amount invested, account opening date and last transaction date that were consolidated from American and European branches.
You are tasked with understanding the average account size and how investments vary by the size of account, however in order to produce this analysis accurately, you first need to unify the currency amount into dollars.
Instruction
- Find the rows of acct_cur in banking that are equal to 'euro' and store them in acct_eu.
- Find all the rows of acct_amount in banking that fit the acct_eu condition, and convert them to USD by multiplying them by 1.1.
- Find all the rows of acct_cur in banking that fit the acct_eu condition, and set them to 'dollar'.
# Find values of acct_cur that are equal to 'euro'
acct_eu = banking['acct_cur'] == 'euro'
# Convert acct_amount where it is in euro to dollars
banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1
# Unify acct_cur column by changing 'euro' values to 'dollar'
banking.loc[banking['acct_cur'] == 'euro', 'acct_cur'] = 'dollar'
# Assert that only the dollar currency remains
assert banking['acct_cur'].unique() == 'dollar'
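If more than two currencies were involved, one way to generalize this (with made-up exchange rates, purely illustrative) is to map each currency to a rate first:
# Hypothetical generalization: convert several currencies at once (rates are illustrative only)
rates = {'euro': 1.1, 'pound': 1.3, 'dollar': 1.0}
banking['acct_amount'] = banking['acct_amount'] * banking['acct_cur'].map(rates)
banking['acct_cur'] = 'dollar'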
3.4 Uniform dates
After having unified the currencies of your different account amounts, you want to add a temporal dimension to your analysis and see how customers have been investing their money given the size of their account over each year. The account_opened
column represents when customers opened their accounts and is a good proxy for segmenting customer activity and investment over time.
However, since this data was consolidated from multiple sources, you need to make sure that all dates are of the same format. You will do so by converting this column into a datetime
object, while making sure that the format is inferred and potentially incorrect formats are set to missing.
Instruction 1
Print the header of account_opened
from the banking
DataFrame and take a look at the different results.
# Print the header of account_opened
print(banking.account_opened.head())
Instruction 2
Take a look at the output. You tried converting the values to datetime using the default to_datetime() function without changing any arguments; however, you received the following error:
ValueError: month must be in 1..12
Why do you think that is?
□ The to_datetime() function needs to be explicitly told which date format each row is in.
□ The to_datetime() function can only be applied on YY-mm-dd date formats.
■ The 21-14-17 entry is erroneous and leads to an error.
Instruction 3
Convert the account_opened column to datetime, while making sure the date format is inferred and that erroneous formats that raise an error return a missing value.
# Convert account_opened to datetime
banking['account_opened'] = pd.to_datetime(banking['account_opened'],
# Infer datetime format
infer_datetime_format = True,
# Return missing value for error
errors = 'coerce')
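Note that in newer pandas releases (2.0 and later) format inference is the default behavior and the infer_datetime_format argument is deprecated, so a modern equivalent would simply be:
# In pandas 2.0+, to_datetime infers the format by default; infer_datetime_format is deprecated
banking['account_opened'] = pd.to_datetime(banking['account_opened'], errors='coerce')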
Instruction 4
- Extract the year from the amended account_opened column and assign it to the acct_year column.
- Print the newly created acct_year column.
# Get year of account opened
banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')
# Print acct_year
print(banking['acct_year'])
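Keep in mind that .dt.strftime('%Y') stores the year as a string; if you want a numeric year instead, .dt.year is a common alternative:
# Alternative: numeric years rather than '2018'-style strings
# (rows coerced to NaT above will show up as missing here)
banking['acct_year'] = banking['account_opened'].dt.year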
3.5 Cross field validation
3.6 Cross field or no cross field?
Throughout this course, you’ve been immersed in a variety of data cleaning problems from range constraints, data type constraints, uniformity and more.
In this lesson, you were introduced to cross field validation as a means to sanity check your data and make sure you have strong data integrity.
Now, you will map different applicable concepts and techniques to their respective categories.
- Cross field validation
  - Confirming the age provided by users by cross-checking their birthdays.
  - Row-wise operations such as .sum(axis=1).
- Not cross field validation
  - Making sure that a revenue column is a numeric column.
  - Making sure a subscription_date column has no values set in the future.
  - The use of the .astype() method.
3.7 How’s our data integrity?
New data has been merged into the banking
DataFrame that contains details on how investments in the inv_amount
column are allocated across four different funds A, B, C and D.
Furthermore, the age and birthdays of customers are now stored in the age
and birth_date
columns respectively.
You want to understand how customers of different age groups invest. However, you want to first make sure the data you’re analyzing is correct. You will do so by cross field checking values of inv_amount
and age
against the amount invested in different funds and customers’ birthdays.
Instruction 1
- Find the rows where the row-wise sum of the fund_columns in banking is equal to the inv_amount column.
- Store the values of banking with consistent inv_amount in consistent_inv, and those with inconsistent ones in inconsistent_inv.
# Store fund columns to sum against
fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']
# Find rows where fund_columns row sum == inv_amount
inv_equ = banking[fund_columns].sum(axis = 1) == banking['inv_amount']
# Store consistent and inconsistent data
consistent_inv = banking[inv_equ]
inconsistent_inv = banking[~inv_equ]
# Print the number of inconsistent investments
print("Number of inconsistent investments: ",
inconsistent_inv.shape[0])
Instruction 2
- Store today's date into today, and manually calculate customers' ages and store them in ages_manual.
- Find all rows of banking where the age column is equal to ages_manual and then filter banking into consistent_ages and inconsistent_ages.
# Store today's date and find ages
today = dt.date.today()
ages_manual = today.year - banking['birth_date'].dt.year
# Find rows where age column == ages_manual
age_equ = ages_manual == banking['age']
# Store consistent and inconsistent data
consistent_ages = banking[age_equ]
inconsistent_ages = banking[~age_equ]
# Print the number of inconsistent ages
print("Number of inconsistent ages: ", inconsistent_ages.shape[0])
3.8 Completeness
3.9 Is this missing at random?
You’ve seen in the video exercise how there are a variety of missingness types when observing missing data. As a reminder, missingness types can be described as the following:
- Missing Completely at Random: No systematic relationship between a column’s missing values and other or own values.
- Missing at Random: There is a systematic relationship between a column’s missing values and other observed values.
- Missing not at Random: There is a systematic relationship between a column’s missing values and unobserved values.
You have a DataFrame containing customer satisfaction scores for a service. What type of missingness is the following?
A customer satisfaction_score
column with missing values for highly dissatisfied customers.
□ Missing completely at random
□ Missing at random
■ Missing not at random
3.10 Missing Investors
Dealing with missing data is one of the most common tasks in data science. There are a variety of types of missingness, as well as a variety of types of solutions to missing data.
You just received a new version of the banking
DataFrame containing data on the amount held and invested for new and existing customers. However, there are rows with missing inv_amount
values.
You know for a fact that most customers below 25 do not have investment accounts yet, and suspect it could be driving the missingness.
Instruction 1
- Print the number of missing values by column in the banking DataFrame.
- Plot and show the missingness matrix of banking with the msno.matrix() function.
- Isolate the rows of banking with missing values of inv_amount into missing_investors and those with non-missing inv_amount values into investors.
# Print number of missing values in banking
print(banking.isna().sum())
# Visualize missingness matrix
msno.matrix(banking)
plt.show()
# Isolate missing and non missing values of inv_amount
missing_investors = banking[banking['inv_amount'].isna()]
investors = banking[~banking['inv_amount'].isna()]
Instruction 2
Now that you’ve isolated banking into investors
and missing_investors
, use the .describe()
method on both of these DataFrames in the console to understand whether there are structural differences between them. What do you think is going on?
□ The data is missing completely at random and there are no drivers behind the missingness.
■ The inv_amount is missing only for young customers, since the average age in missing_investors is 22 and the maximum age is 25.
□ The inv_amount is missing only for old customers, since the average age in missing_investors is 42 and the maximum age is 59.
Instruction 3
Sort the banking
DataFrame by the age
column and plot the missingness matrix of banking_sorted
.
# Sort banking by age and visualize
banking_sorted = banking.sort_values(by = 'age')
msno.matrix(banking_sorted)
plt.show()
3.11 Follow the money
In this exercise, you’re working with another version of the banking
DataFrame that contains missing values for both the cust_id
column and the acct_amount
column.
You want to produce analysis on how many unique customers the bank has, the average amount held by customers and more. You know that rows with missing cust_id
don’t really help you, and that on average acct_amount
is usually 5 times the amount of inv_amount
.
Instruction
- Use .dropna() to drop missing values of the cust_id column in banking and store the results in banking_fullid.
- Compute the estimated acct_amount of banking_fullid knowing that acct_amount is usually inv_amount * 5 and assign the results to acct_imp.
- Impute the missing values of acct_amount in banking_fullid with the newly created acct_imp using .fillna().
# Drop missing values of cust_id
banking_fullid = banking.dropna(subset = ['cust_id'])
# Compute estimated acct_amount
acct_imp = banking_fullid['inv_amount'] * 5
# Impute missing acct_amount with corresponding acct_imp
banking_imputed = banking_fullid.fillna({'acct_amount':acct_imp})
# Print number of missing values
print(banking_imputed.isna().sum())
4. Record linkage
4.1 Comparing strings
4.2 Minimum edit distance
In the video exercise, you saw how minimum edit distance is used to identify how similar two strings are. As a reminder, minimum edit distance is the minimum number of steps needed to reach from String A to String B, with the operations available being:
- Insertion of a new character.
- Deletion of an existing character.
- Substitution of an existing character.
- Transposition of two existing consecutive characters.
What is the minimum edit distance from 'sign'
to 'sing'
, and which operation(s) gets you there?
□ 2 by substituting 'g' with 'n' and 'n' with 'g'.
■ 1 by transposing 'g' with 'n'.
□ 1 by substituting 'g' with 'n'.
□ 2 by deleting 'g' and inserting a new 'g' at the end.
4.3 The cutoff point
In this exercise, and throughout this chapter, you’ll be working with the restaurants
DataFrame which has data on various restaurants. Your ultimate goal is to create a restaurant recommendation engine, but you need to first clean your data.
This version of restaurants
has been collected from many sources, where the cuisine_type
column is riddled with typos, and should contain only italian
, american
and asian
cuisine types. There are so many unique categories that remapping them manually isn’t scalable, and it’s best to use string similarity instead.
Before doing so, you want to establish the cutoff point for the similarity score using the fuzzywuzzy
's process.extract()
function by finding the similarity score of the most distant typo of each category.
Instruction 1
- Import process from fuzzywuzzy.
- Store the unique cuisine_type values into unique_types.
- Calculate the similarity of 'asian', 'american', and 'italian' to all possible cuisine_types using process.extract(), while returning all possible matches.
# Import process from fuzzywuzzy
from fuzzywuzzy import process
# Store the unique values of cuisine_type in unique_types
unique_types = restaurants['cuisine_type'].unique()
# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian',
unique_types,
limit = len(unique_types)))
# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american',
unique_types,
limit = len(unique_types)))
# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian',
unique_types,
limit = len(unique_types)))
Instruction 2
Take a look at the output, what do you think should be the similarity cutoff point when remapping categories?
■ 80
□ 70
□ 60
4.4 Remapping categories II
In the last exercise, you determined that the distance cutoff point for remapping typos of 'american',
'asian'
, and 'italian'
cuisine types stored in the cuisine_type
column should be 80.
In this exercise, you're going to put it all together by finding matches with similarity scores equal to or higher than 80 by using fuzzywuzzy.process's extract() function, for each correct cuisine type, and replacing these matches with it. Remember, when comparing a string with an array of strings using process.extract(), the output is a list of tuples where each tuple is formatted as:
(closest match, similarity score, index of match)
Instruction 1
- Using the method shown in the video, return all of the unique values for the cuisine_type column of restaurants.
- Okay! Looks like you will need to use some string matching to correct these misspellings!
- As a first step, create a list of matches comparing 'italian' with the restaurant types listed in the cuisine_type column.
# Inspect the unique values of the cuisine_type column
print(restaurants['cuisine_type'].unique())
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian',
restaurants['cuisine_type'],
limit=len(restaurants.cuisine_type))
# Inspect the first 5 matches
print(matches[0:5])
Instruction 2
Now you’re getting somewhere! Now you can iterate through matches
to reassign similar entries.
- Within the for loop, use an if statement to check whether the similarity score in each match is greater than or equal to 80.
- If it is, use .loc to select rows where cuisine_type in restaurants is equal to the current match (which is the first element of match), and reassign them to be 'italian'.
# Iterate through the list of matches to italian
for match in matches:
# Check whether the similarity score is greater than or equal to 80
if match[1] >= 80:
# Select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
    restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = 'italian'
Instruction 3
Finally, you’ll adapt your code to work with every restaurant type in categories
.
- Using the variable
cuisine
to iterate throughcategories
, embed your code from the previous step in an outerfor
loop. - Inspect the final result. This has been done for you.
# Iterate through categories
for cuisine in categories:
# Create a list of matches, comparing cuisine with the cuisine_type column
matches = process.extract(cuisine,
restaurants['cuisine_type'],
limit=len(restaurants.cuisine_type))
# Iterate through the list of matches
for match in matches:
# Check whether the similarity score is greater than or equal to 80
if match[1] >= 80:
# If it is, select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
      restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine
# Inspect the final result
restaurants['cuisine_type'].unique()
4.5 Generating pairs
4.6 To link or not to link?
Similar to joins, record linkage is the act of linking data from different sources regarding the same entity. But unlike joins, record linkage does not require exact matches between different pairs of data, and instead can find close matches using string similarity. This is why record linkage is effective when there are no common unique keys, such as a unique identifier, between the data sources that you can rely upon when linking them.
In this exercise, you will classify each card as either a traditional join problem or a record linkage one.
- Record linkage
  - Using an address column to join two DataFrames, with the address in each DataFrame being formatted slightly differently.
  - Two customer DataFrames containing names and addresses, one with a unique identifier per customer, one without.
  - Merging two basketball DataFrames, with columns team_A, team_B, and time and differently formatted team names between each DataFrame.
- Regular join
  - Consolidating two DataFrames containing details on DataCamp courses, with each DataCamp course having its own unique identifier.
  - Two basketball DataFrames with a common unique identifier per game.
4.7 Pairs of restaurants
In the last lesson, you cleaned the restaurants
dataset to make it ready for building a restaurants recommendation engine. You have a new DataFrame named restaurants_new
with new restaurants to train your model on, that’s been scraped from a new data source.
You’ve already cleaned the cuisine_type
and city
columns using the techniques learned throughout the course. However you saw duplicates with typos in restaurants names that require record linkage instead of joins with restaurants
.
In this exercise, you will perform the first step in record linkage and generate possible pairs of rows between restaurants
and restaurants_new
.
Instruction 1
- Instantiate an indexing object by using the Index() function from recordlinkage.
- Block your pairing on cuisine_type by using indexer's .block() method.
- Generate pairs by indexing restaurants and restaurants_new in that order.
# Create an indexer object and find possible pairs
indexer = recordlinkage.Index()
# Block pairing on cuisine_type
indexer.block('cuisine_type')
# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)
Instruction 2
Now that you’ve generated your pairs, you’ve achieved the first step of record linkage. What are the steps remaining to link both restaurants DataFrames, and in what order?
■ Compare between columns, score the comparison, then link the DataFrames.
□ Clean the data, compare between columns, link the DataFrames, then score the comparison.
□ Clean the data, compare between columns, score the comparison, then link the DataFrames.
4.8 Similar restaurants
In the last exercise, you generated pairs between restaurants
and restaurants_new
in an effort to cleanly merge both DataFrames using record linkage.
When performing record linkage, there are different types of matching you can perform between different columns of your DataFrames, including exact matches, string similarities, and more.
Now that your pairs have been generated and stored in pairs
, you will find exact matches in the city
and cuisine_type
columns between each pair, and similar strings for each pair in the rest_name
column.
Instruction 1
Instantiate a comparison object using the recordlinkage.Compare()
function.
# Create a comparison object
comp_cl = recordlinkage.Compare()
Instruction 2
- Use the appropriate comp_cl method to find exact matches between the city and cuisine_type columns of both DataFrames.
- Use the appropriate comp_cl method to find similar strings with a 0.8 similarity threshold in the rest_name column of both DataFrames.
# Find exact matches on city, cuisine_types
comp_cl.exact('city',
'city',
label='city')
comp_cl.exact('cuisine_type',
'cuisine_type',
label='cuisine_type')
# Find similar matches of rest_name
comp_cl.string('rest_name',
'rest_name',
label='name',
threshold = 0.8)
Instruction 3
Compute the comparison of the pairs by using the .compute()
method of comp_cl
.
# Get potential matches and print
potential_matches = comp_cl.compute(pairs,
restaurants,
restaurants_new)
print(potential_matches)
Instruction 4
Print out potential_matches. The columns are the columns being compared, with a value of 1 for a match and 0 for a non-match for each pair of rows in your DataFrames. To find potential matches, you need to find rows where enough columns have a matching value. You can find them with
potential_matches[potential_matches.sum(axis = 1) >= n]
where n is the minimum number of matching columns you want in order to ensure a proper duplicate find. What do you think the value of n should be?
■ 3 because I need to have matches in all my columns.
□ 2 because matching on any of the 2 columns or more is enough to find potential duplicates.
□ 1 because matching on just 1 column like the restaurant name is enough to find potential duplicates.
4.9 Linking DataFrames
4.10 Getting the right index
Here's a DataFrame named matches containing potential matches between two DataFrames, users_1 and users_2. Each DataFrame's row indices are stored in uid_1 and uid_2 respectively.
| uid_1 | uid_2 | first_name | address_1 | address_2 | marriage_status | date_of_birth |
|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 1 | 1 | 1 | 0 |
| ··· | ··· | ··· | ··· | ··· | ··· | ··· |
| 1 | 3 | 1 | 1 | 1 | 1 | 0 |
| ··· | ··· | ··· | ··· | ··· | ··· | ··· |
How do you extract all values of the uid_1
index column?
□ matches.index.get_level_values(0)
□ matches.index.get_level_values(1)
□ matches.index.get_level_values('uid_1')
■ Both 1 and 3 are correct.
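A minimal sketch (on a made-up two-row MultiIndex, not the actual matches DataFrame) showing that the positional and named forms return the same values:
# Made-up MultiIndex mimicking the (uid_1, uid_2) structure
import pandas as pd
idx = pd.MultiIndex.from_tuples([(0, 3), (1, 3)], names=['uid_1', 'uid_2'])
matches_demo = pd.DataFrame({'first_name': [1, 1]}, index=idx)
print(matches_demo.index.get_level_values(0))        # by position
print(matches_demo.index.get_level_values('uid_1'))  # by name -- same result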
4.11 Linking them together!
In the last lesson, you’ve finished the bulk of the work on your effort to link restaurants and restaurants_new. You’ve generated the different pairs of potentially matching rows, searched for exact matches between the cuisine_type
and city
columns, but compared for similar strings in the rest_name
column. You stored the DataFrame containing the scores in potential_matches
.
Now it’s finally time to link both DataFrames. You will do so by first extracting all row indices of restaurants_new
that are matching across the columns mentioned above from potential_matches
. Then you will subset restaurants_new
on these indices, then append the non-duplicate values to restaurants.
Instruction
- Isolate instances of potential_matches where the row sum is above or equal to 3 by using the .sum() method.
- Extract the second column index from matches, which represents row indices of matching records from restaurants_new, by using the .get_level_values() method.
- Subset restaurants_new for rows that are not in matching_indices.
- Append non_dup to restaurants.
# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]
# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)
# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]
# Append non_dup to restaurants
full_restaurants = restaurants.append(non_dup)
print(full_restaurants)
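One caveat: DataFrame.append() was removed in pandas 2.0, so on newer versions the last step needs pd.concat() instead (a sketch using the same variable names):
# pandas 2.0+ equivalent of restaurants.append(non_dup)
full_restaurants = pd.concat([restaurants, non_dup])
print(full_restaurants)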