Income Analysis
Ping Yin
11/10/2016
Contents
• Executive Summary
• Introduction
• Purpose
• Methodology
Data Selection
Exploration
Preparation & Transformation
Model Development & Assessment
Model Comparison
• Options and Recommendations
• Summary
• Appendix
Executive Summary
• After data preparation and partitioning, three models are built: one each in SAS Studio, Enterprise Miner (EM), and DataRobot
• The same test dataset is scored by these models
• The model built in EM has the best performance
Introduction
• Can we predict Income level based on age, gender, education, etc.?
• What is my income level after I graduate?
Purpose
• Identify the best predictive model for the Income dataset
• Predict my Income level
• Practice data preparation, model building, and model assessment
Data Selection
• The Income dataset was originally extracted from the 1994 Census Bureau database
• Downloaded from Kaggle.com
• Reasons for choosing it:
• The target variable, Income, is categorical
• Medium size: 10+ columns and 30K+ rows
• Already used in the Macro and DataRobot projects
Exploration
• Data explored in SAS Studio
• 32,561 observations
• 15 variables: 6 Num, 9 Char
• Num: Age Capitalgain Capitalloss Weekhour Edunum Fnlwgt
• Char: Income Relationship Education Occupation Sex Marital Workclass Race Nativecountry
• Target: Income (“>50K”, “<=50K”)
Exploration
Data issues:
• Missing values: Workclass Occupation Nativecountry
• Many levels: Education Marital Workclass Nativecountry
• Numeric variables: Capitalgain Capitalloss
• Variable to screen out: Fnlwgt
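The first two issues above can be checked with a quick per-column profile. The following is a minimal Python sketch, not the SAS Studio steps used in the project. It assumes missing values are coded as "?" in the raw file, and the rows and the `profile` helper are made up for illustration.

```python
from collections import Counter

MISSING = "?"  # assumption: the raw file codes missing values as "?"

# Made-up rows for illustration; column names follow the slides.
rows = [
    {"Workclass": "Private", "Occupation": "Sales", "Nativecountry": "United-States"},
    {"Workclass": "?", "Occupation": "?", "Nativecountry": "United-States"},
    {"Workclass": "State-gov", "Occupation": "Adm-clerical", "Nativecountry": "?"},
    {"Workclass": "Private", "Occupation": "Sales", "Nativecountry": "Mexico"},
]


def profile(rows, column):
    """Return (number of missing values, number of distinct non-missing levels)."""
    counts = Counter(row[column] for row in rows)
    n_missing = counts.pop(MISSING, 0)
    return n_missing, len(counts)


summary = {column: profile(rows, column) for column in rows[0]}
for column, (n_missing, n_levels) in summary.items():
    print(f"{column}: {n_missing} missing, {n_levels} levels")
```

Columns with a non-zero missing count are candidates for imputation, and columns with many levels are candidates for binning.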
Preparation & Transformations
• Solutions:
• Impute missing values using subject matter knowledge:
set missing Workclass and Occupation to “Unemployed”
• Impute missing values using the mode:
set missing Nativecountry to “United-States”
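The two imputation rules can be sketched in Python as follows. This is an illustration only, not the SAS code used in the project. It assumes missing values are coded as "?", and the sample rows and helper names (`impute_constant`, `impute_mode`) are hypothetical.

```python
from collections import Counter

MISSING = "?"  # assumption: the raw file codes missing values as "?"


def impute_constant(rows, column, value):
    """Replace missing entries in `column` with a fixed label."""
    return [{**row, column: value} if row[column] == MISSING else row for row in rows]


def impute_mode(rows, column):
    """Replace missing entries in `column` with the most frequent observed level."""
    observed = Counter(row[column] for row in rows if row[column] != MISSING)
    mode_value = observed.most_common(1)[0][0]
    return impute_constant(rows, column, mode_value)


# Made-up rows for illustration only.
rows = [
    {"Workclass": "Private", "Occupation": "Sales", "Nativecountry": "United-States"},
    {"Workclass": "?", "Occupation": "?", "Nativecountry": "United-States"},
    {"Workclass": "Private", "Occupation": "Sales", "Nativecountry": "?"},
]

# Subject-matter rule for Workclass and Occupation, mode rule for Nativecountry.
cleaned = impute_constant(rows, "Workclass", "Unemployed")
cleaned = impute_constant(cleaned, "Occupation", "Unemployed")
cleaned = impute_mode(cleaned, "Nativecountry")
print(cleaned[1]["Workclass"], cleaned[2]["Nativecountry"])
```

The helpers return new rows rather than modifying the input, so the raw data stays available for comparison.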
Preparation & Transformations
• Solutions:
• Converting Capitalgain and Capitalloss from numeric to character
• Binning multiple-level variables: Education Marital Workclass
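The slides do not give the exact recoding rules, so the Python sketch below uses assumed ones: a zero versus non-zero flag for the capital variables, and a hypothetical grouping of a few Education levels. The names `amount_to_category`, `EDUCATION_BINS`, and `bin_level` are illustrative and not taken from the project.

```python
def amount_to_category(amount):
    """Recode a numeric amount as a two-level character value (assumed rule)."""
    return "None" if amount == 0 else "Some"


# Hypothetical grouping of Education levels into fewer bins.
EDUCATION_BINS = {
    "10th": "No-HS-diploma",
    "11th": "No-HS-diploma",
    "HS-grad": "HS-grad",
    "Some-college": "College",
    "Bachelors": "College",
    "Masters": "Graduate",
    "Doctorate": "Graduate",
}


def bin_level(value, mapping, default="Other"):
    """Map a raw level to its bin; unmapped levels fall into `default`."""
    return mapping.get(value, default)


# Made-up row for illustration only.
row = {"Capitalgain": 2174, "Capitalloss": 0, "Education": "Masters"}
recoded = {
    "Capitalgain": amount_to_category(row["Capitalgain"]),
    "Capitalloss": amount_to_category(row["Capitalloss"]),
    "Education": bin_level(row["Education"], EDUCATION_BINS),
}
print(recoded)
```

Keeping the grouping in a dictionary makes it easy to revise the bins and rerun the models when a grouping turns out to hurt performance.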
Preparation & Transformations
• Solutions:
• Binning Nativecountry and creating a new variable: region
Preparation & Transformations
• Reasons for dropping variable Fnlwgt:
• It is the weight from the Current Population Survey files, not original Census data
• It showed near-zero importance in last week's DataRobot project
Preparation & Transformations
• Reasons for leaving Occupation unchanged:
• 15 levels
• No sound criterion for grouping them
• Reasons for leaving Race and Relationship unchanged:
• 5-6 levels
• Each level is meaningful
Preparation & Transformations
After preparation:
Preparation & Transformations
• Data partition using Strata method
Now it is ready to go!
Training dataset
Test dataset
SAS Studio
Enterprise Miner
DataRobot
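A stratified partition keeps the share of each Income level the same in the training and test datasets. The Python sketch below illustrates the idea and is not the SAS procedure used in the project. The 70% training fraction, the seed, the made-up rows, and the `stratified_split` helper are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, target, train_fraction, seed=0):
    """Split rows into train and test, sampling separately within each target level."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for row in rows:
        by_level[row[target]].append(row)

    train, test = [], []
    for level_rows in by_level.values():
        shuffled = level_rows[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# Made-up data: 20 rows with ">50K" and 80 rows with "<=50K".
rows = [{"id": i, "Income": ">50K" if i < 20 else "<=50K"} for i in range(100)]
train, test = stratified_split(rows, "Income", train_fraction=0.7)

high_train = sum(row["Income"] == ">50K" for row in train)
high_test = sum(row["Income"] == ">50K" for row in test)
print(len(train), len(test), high_train, high_test)
```

Both partitions keep the 20% share of ">50K" rows, which a simple random split would only match approximately.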
Model Development & Assessment: SAS Studio
Model Development & Assessment: EM
Model Development & Assessment: DataRobot
Model Comparison
• The best model in this project:
EM Studio DataRobot
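Because all three models score the same test dataset, they can be ranked on a single metric. The slides do not name the metric, so the Python sketch below assumes the misclassification rate. The labels, predictions, and model names are made up and are not project results.

```python
def misclassification_rate(actual, predicted):
    """Share of test rows whose predicted level differs from the actual level."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


# Made-up test labels and predictions for two hypothetical models.
actual = [">50K", "<=50K", "<=50K", ">50K", "<=50K"]
predictions = {
    "model_a": [">50K", "<=50K", ">50K", ">50K", "<=50K"],
    "model_b": ["<=50K", "<=50K", "<=50K", "<=50K", "<=50K"],
}

rates = {name: misclassification_rate(actual, pred) for name, pred in predictions.items()}
best = min(rates, key=rates.get)
print(rates, best)
```

The model with the lowest rate on the shared test dataset is the one reported as best under this assumed metric.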
Model Comparison: Predict my Income level
Ping Dataset
EM
Studio
DataRobot
Options and Recommendations
Using 60% of the data to build a model
Using 70% of the data to build a model
Options and Recommendations
Macro Project
DataRobot Project
The overall best model
Options and Recommendations
• Factors that may have caused these differences:
• Dropping variable Fnlwgt
• Reducing levels
• Variable transformation: Capitalgain Capitalloss
• These steps increase speed but decrease model performance
Options
• Use DataRobot to build models without handling the “data issues”
• Keep trying in SAS Studio
Summary
• We can predict Income level based on these characteristics
• For the Income dataset, DataRobot is the most robust tool for building models
• Be aware of unexpected outcomes from data preparation
• Iterate back and forth until reaching a satisfactory result
Appendix
Link to Data:
https://www.kaggle.com/uciml/adult-census-Income
Thanks!
