Machine Learning - Dummy Variable Conversion

Regression Methods in
Machine Learning
Categorical Variable Conversion
Portland Data Science Group
Andrew Ferlitsch
Community Outreach Officer
July, 2017

Linear Regression
• All features (independent variables) must be real
numbers.
• A feature CANNOT be a categorical value, i.e., a named or
enumerated value.
• Examples of categorical values:
Male vs. Female
Red, Blue, Green
Apple, Banana, Pear, Orange
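
A minimal sketch of why this matters, assuming a linear model that computes a bias plus a weighted sum of the features. The `predict` function, the weights, and the bias are hypothetical and chosen only for illustration.

```python
def predict(weights, bias, features):
    """Linear model: bias plus the weighted sum of the feature values."""
    return bias + sum(w * x for w, x in zip(weights, features))

# Real-valued features work as expected.
numeric_ok = predict([0.5, 2.0], 1.0, [25, 1])

# A named value cannot be multiplied by a real-valued weight.
try:
    predict([0.5, 2.0], 1.0, [25, "Male"])
    categorical_failed = False
except TypeError:
    categorical_failed = True
```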

Categorical Variables
Age Gender Income
25 Male 25000
26 Female 22000
30 Male 45000
24 Female 26000
Independent Variables (Features): Age, Gender
Dependent Variable (Label): Income, the value to predict
Real Values: Age, Income
Categorical Values: Gender

Dummy Variable Conversion
Known in Python's scikit-learn library as OneHotEncoder
For each categorical feature:
1. Scan the dataset and determine all the unique instances.
2. Create a new feature (i.e., dummy variable) in dataset, one
per unique instance.
3. Remove the categorical feature from the dataset.
4. For each sample (row), set a 1 in the feature (dummy
variable) that corresponds to that sample's categorical value.
5. Set a 0 in the remaining features (dummy variables) for that
categorical field.
6. Remove one dummy variable field.
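
The steps above can be sketched in plain Python. This is an illustrative sketch, not the scikit-learn implementation: the function name `dummy_encode`, its `drop` parameter, and the list-of-dicts data layout are assumptions made for this example.

```python
def dummy_encode(rows, column, drop=None):
    """Replace a categorical column with 0/1 dummy variables, minus one."""
    # Step 1: collect the unique values, in order of first appearance.
    categories = []
    for row in rows:
        if row[column] not in categories:
            categories.append(row[column])

    # Step 6: choose the dummy variable to leave out
    # (assumption: default to the last category found).
    dropped = categories[-1] if drop is None else drop
    kept = [c for c in categories if c != dropped]

    encoded = []
    for row in rows:
        # Step 3: copy the row without the categorical feature.
        new_row = {k: v for k, v in row.items() if k != column}
        # Steps 2, 4, 5: one new feature per kept category, set to 1 or 0.
        for c in kept:
            new_row[c] = 1 if row[column] == c else 0
        encoded.append(new_row)
    return encoded

data = [
    {"Age": 25, "Gender": "Male", "Income": 25000},
    {"Age": 26, "Gender": "Female", "Income": 22000},
    {"Age": 30, "Gender": "Male", "Income": 45000},
    {"Age": 24, "Gender": "Female", "Income": 26000},
]
encoded = dummy_encode(data, "Gender", drop="Female")
```

Here each row keeps Age and Income and gains a single Male feature, since the Female dummy variable is the one removed.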

Dummy Variable Trap
Gender Male Female
Male 1 0
Female 0 1
Male 1 0
Female 0 1
Need to Drop one Dummy Variable!
x1 x2 x3
Multicollinearity occurs when one variable can be predicted from another,
e.g., x2 = 1 - x3 for the two dummy variables.
As a result, a regression analysis cannot distinguish between the
contribution of x2 and x3.
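
A small sketch of the trap, assuming the regression model includes an intercept term (the slides do not state this). The coefficient values and the `predict` helper are hypothetical. Because Male + Female = 1 in every row, shifting weight between the intercept and the two dummy coefficients leaves every prediction unchanged, so no unique set of coefficients exists.

```python
# Each row: (age, male, female), with both dummy variables kept.
rows = [(25, 1, 0), (26, 0, 1), (30, 1, 0), (24, 0, 1)]

def predict(b0, b_age, b_male, b_female, row):
    """Linear model with an intercept b0 (an assumption of this sketch)."""
    age, male, female = row
    return b0 + b_age * age + b_male * male + b_female * female

# Two different coefficient sets (hypothetical values).
shift = 300.0
coeffs_a = (1000.0, 800.0, 500.0, 200.0)
coeffs_b = (1000.0 - shift, 800.0, 500.0 + shift, 200.0 + shift)

preds_a = [predict(*coeffs_a, row) for row in rows]
preds_b = [predict(*coeffs_b, row) for row in rows]

# The two dummy variables always sum to 1, so one determines the other.
always_one = all(male + female == 1 for _, male, female in rows)
same_predictions = preds_a == preds_b
```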

Drop One of the Dummy Variables
Age Gender Income
25 Male 25000
26 Female 22000
30 Male 45000
24 Female 26000
Gender is Replaced with Male
Age Male Income
25 1 25000
26 0 22000
30 1 45000
24 0 26000

Drop one of the Dummy Variables
Age Race Income
20 White Apple
26 Hispanic 22000
30 Asian 45000
24 Asian 26000
Age White Asian Income
20 1 0 Apple
26 0 0 22000
30 0 1 45000
24 0 1 26000
Dropped Hispanic (i.e., Hispanic = White: 0, Asian: 0)
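
The Race example can be sketched as follows, using only the Race column from the table. The `decode` helper and the variable names are hypothetical. The sketch shows that the dropped category is still recoverable: a row with all zeros corresponds to the dropped value.

```python
races = ["White", "Hispanic", "Asian", "Asian"]

# Hispanic is dropped; it is represented by White = 0 and Asian = 0.
kept = ["White", "Asian"]
baseline = "Hispanic"

# One [White, Asian] pair of dummy variables per row.
encoded = [[1 if race == c else 0 for c in kept] for race in races]

def decode(flags, kept, baseline):
    """Recover the category; all zeros means the dropped (baseline) value."""
    for category, flag in zip(kept, flags):
        if flag == 1:
            return category
    return baseline

decoded = [decode(flags, kept, baseline) for flags in encoded]
```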
