Regression Methods in
Machine Learning
Categorical Variable Conversion
Portland Data Science Group
Andrew Ferlitsch
Community Outreach Officer
July, 2017
Linear Regression
• All features (independent variables) must be real numbers.
• A feature CANNOT be a categorical value, i.e., a named or
enumerated value.
• Example:
Male vs. Female
Red, Blue, Green
Apple, Banana, Pear, Orange
Categorical Variables
Age Gender Income
25 Male 25000
26 Female 22000
30 Male 45000
24 Female 26000
Independent Variables (Features): Age, Gender
Dependent Variable (Label): Income
Real Values: Age
Categorical Values: Gender
Value to Predict: Income
Dummy Variable Conversion
Known in Python (scikit-learn) as OneHotEncoder
For each categorical feature:
1. Scan the dataset and determine all the unique values of the feature.
2. Create a new feature (i.e., dummy variable) in the dataset, one
per unique value.
3. Remove the categorical feature from the dataset.
4. For each sample (row), set a 1 in the feature (dummy
variable) that corresponds to that sample's categorical value.
5. Set a 0 in the remaining features (dummy variables) for that
categorical field.
6. Remove one dummy variable field.
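The six steps above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the function name dummy_encode is hypothetical, and dropping the alphabetically first category in step 6 is an assumption made here for a stable result.

```python
def dummy_encode(rows, feature):
    """Replace a categorical feature with dummy variables, dropping one."""
    # Step 1: collect the unique values (sorted for a stable column order).
    categories = sorted({row[feature] for row in rows})
    # Step 6 (applied up front): keep all but one category. The dropped
    # category becomes the baseline, encoded as all zeros.
    kept = categories[1:]
    encoded = []
    for row in rows:
        # Step 3: copy the row without the categorical feature.
        new_row = {k: v for k, v in row.items() if k != feature}
        # Steps 2, 4, 5: one new column per kept category,
        # 1 where the sample matches that category, 0 otherwise.
        for category in kept:
            new_row[category] = 1 if row[feature] == category else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"Age": 25, "Gender": "Male", "Income": 25000},
    {"Age": 26, "Gender": "Female", "Income": 22000},
    {"Age": 30, "Gender": "Male", "Income": 45000},
    {"Age": 24, "Gender": "Female", "Income": 26000},
]

encoded = dummy_encode(rows, "Gender")
for row in encoded:
    print(row)
```

With the sample data, "Female" sorts first and is dropped, so Gender is replaced by a single Male column.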
Dummy Variable Trap
Gender Male Female
Male 1 0
Female 0 1
Male 1 0
Female 0 1
Features: x1, x2 (Male), x3 (Female)
Multicollinearity occurs when one variable predicts another,
i.e., x2 = (1 - x3).
As a result, a regression analysis cannot distinguish between the
contribution of x2 and x3.
Need to drop one dummy variable!
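A small sketch of why the contributions cannot be separated, assuming the regression model includes an intercept term (the weights below are made up for illustration). Because Male + Female = 1 in every row, a constant can be moved between the intercept and the two dummy weights without changing any prediction.

```python
# Dummy columns from the Gender example: (Male, Female) per row.
rows = [(1, 0), (0, 1), (1, 0), (0, 1)]


def predict(intercept, w_male, w_female):
    """Linear prediction using only the intercept and the two dummies."""
    return [intercept + w_male * male + w_female * female
            for male, female in rows]


# Two different sets of weights: the second moves 10 out of the
# intercept and adds it to both dummy weights.
a = predict(intercept=10, w_male=5, w_female=0)
b = predict(intercept=0, w_male=15, w_female=10)

print(a)       # [15, 10, 15, 10]
print(a == b)  # True
```

Since both weight sets fit the data equally well, the fit has no unique solution until one dummy variable is dropped.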
Drop One of the Dummy Variables
Before:
Age Gender Income
25 Male 25000
26 Female 22000
30 Male 45000
24 Female 26000
After (Gender is replaced with Male):
Age Male Income
25 1 25000
26 0 22000
30 1 45000
24 0 26000
Before:
Age Race Income
20 White Apple
26 Hispanic 22000
30 Asian 45000
24 Asian 26000
After (dropped Hispanic, i.e., Hispanic = White: 0, Asian: 0):
Age White Asian Income
20 1 0 Apple
26 0 0 22000
30 0 1 45000
24 0 1 26000
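The same conversion can be sketched with pandas, assuming pandas is installed. The slides name OneHotEncoder; pandas.get_dummies is swapped in here as a shorter alternative. The Income column is left out of this sketch, and the Race_ prefix on the new columns comes from pandas, not from the slides.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [20, 26, 30, 24],
    "Race": ["White", "Hispanic", "Asian", "Asian"],
})

# One dummy column per unique value; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["Race"], dtype=int)

# Drop the Hispanic column so Hispanic becomes the all-zeros baseline.
encoded = encoded.drop(columns=["Race_Hispanic"])

print(encoded)
```

Passing drop_first=True to get_dummies would also avoid the trap, but it drops the first category (Asian here) rather than a chosen one.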
