Data mining - Machine Learning

Project Bank Marketing
By: Rupa Dutta

Gautam Buddha - A philosopher and a thinker

Legend has it that he obtained enlightenment sitting under a tree and advocated to the world a new philosophy that is called..

In the world of analytics, we often face the challenge of avoiding both underfitting and overfitting models and finding a balance. A balanced model has a better chance of working on previously unseen data. I would like to present this data mining project as a quest for a reasonable model that fares well against all metrics, i.e. finding that “middle path”.
[Figure: underfitting model, optimal model, overfitting model]

Critical to avoiding an underfitting model is:
• Gather enough, but not too much, data
• Identify and get rid of noise and outliers
• Remove irrelevant features, which can confuse models
• Massage the data well
• Identify nominal, ordinal and continuous features
• Condition the features before feeding them to models

Critical to avoiding an overfitting model is:
• Test, test and test some more
• Cross-validate models against different mixes of the data
• Weigh models against multiple performance metrics
• If possible, test against real, unseen data

Let’s get started…
Business problem at hand
Feature analysis - interesting observation
Feature selection and transformation
Model building and evaluation
Conclusion

Business problem at hand
What we have
Data gathered from a recent campaign by a bank
The campaign was about getting people to sign up for term deposits
We have customer information, along with whether each customer signed up for the term deposit
What we want
A machine learning model that can tell whether a new customer is likely to sign up for a term deposit

Feature analysis - interesting observation

Feature Analysis
Will a new customer sign up for term deposit?
Strong indicator for yes: previous outcome
65% of those who previously said yes said yes again!
Although a lot of previous outcomes were unknown, this is still a good feature
[Bar chart: % who signed up for term deposit, by previous outcome (previously said yes vs. previously said no)]
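The percentages on these feature-analysis slides are sign-up rates per category of one feature. A minimal sketch of that computation follows, assuming the data is a list of dictionaries. The field names `previous_outcome` and `signed_up` and the toy rows are made up for illustration and are not the campaign data.

```python
from collections import defaultdict


def signup_rate_by(records, feature, target="signed_up"):
    """Return {category: percentage of records whose target is 'yes'}."""
    totals = defaultdict(int)
    yeses = defaultdict(int)
    for row in records:
        totals[row[feature]] += 1
        if row[target] == "yes":
            yeses[row[feature]] += 1
    return {cat: 100.0 * yeses[cat] / totals[cat] for cat in totals}


# Made-up toy rows, only to show the shape of the computation.
toy = [
    {"previous_outcome": "yes", "signed_up": "yes"},
    {"previous_outcome": "yes", "signed_up": "yes"},
    {"previous_outcome": "yes", "signed_up": "no"},
    {"previous_outcome": "no", "signed_up": "no"},
    {"previous_outcome": "no", "signed_up": "no"},
    {"previous_outcome": "no", "signed_up": "yes"},
    {"previous_outcome": "no", "signed_up": "no"},
]

rates = signup_rate_by(toy, "previous_outcome")
for category, pct in sorted(rates.items()):
    print(f"{category}: {pct:.1f}% signed up")
```

A large gap between categories suggests a useful feature; a nearly flat profile, as with age, suggests a weak one.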

Feature Analysis
Strong indicator for yes: housing loan
20% of those who did not have a housing loan said yes!
[Bar chart: % who signed up for term deposit, no housing loan vs. housing loan]

Feature Analysis
Strong indicator for yes: loan default
13% of those who had no loan default said yes.
Nobody with a loan default said yes - the type of information that classification algorithms can use
[Bar chart: % who signed up for term deposit, no loan default vs. loan default]

Feature Analysis
Moderate indicator for yes: age
Percentage almost constant across a wide range of ages - not much of a differentiating factor
[Bar chart: % who signed up for term deposit (axis 0 to 40), by age (21 to 60)]

Feature selection and transformation
Just as a country is only as good as its people, a model is only as good as the quality of its input data

Feature Selection table
Feature                          Type          Pre-processing
age                              Continuous    None
job                              Categorical   Converted to binary matrix
marital status                   Categorical   Converted to binary matrix
education                        Categorical   Converted to ordinal: 1 = primary, 2 = secondary, 3 = tertiary
average yearly balance           Continuous    Numerically scaled
contact communication mode       Categorical   Discarded, feature irrelevant
last contact day of the month    Categorical   Discarded, feature irrelevant
last contact month of year       Categorical   Discarded, feature irrelevant
last contact duration            Continuous    Numerically scaled

Feature Selection table (continued)
Feature                                                              Type          Pre-processing
number of contacts performed during this campaign                    Continuous    Numerically scaled
number of days that passed by after the client was last contacted    Continuous    Weak feature, discarded
outcome of the previous marketing campaign                           YES/NO        Converted to binary
has credit in default?                                               YES/NO        Converted to binary
has housing loan?                                                    YES/NO        Converted to binary
has personal loan?                                                   YES/NO        Converted to binary
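The three kinds of pre-processing named in the table can be sketched in plain Python as below. This is an illustration under assumptions: the slides do not say which scaling was used, so standardisation (zero mean, unit variance) is assumed here, and the function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Convert a categorical column to a binary matrix, one column per category."""
    categories = sorted(set(values))
    matrix = [[1 if v == c else 0 for c in categories] for v in values]
    return matrix, categories


def yes_no_to_binary(values):
    """Map 'yes' to 1 and anything else to 0."""
    return [1 if v == "yes" else 0 for v in values]


def standardize(values):
    """Scale a continuous column to zero mean and unit variance (assumed scaling)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical sample values for three of the features in the table.
marital = ["married", "single", "married", "divorced"]
housing = ["yes", "no", "no", "yes"]
balance = [100, 200, 300, 400]

marital_matrix, marital_columns = one_hot(marital)
housing_binary = yes_no_to_binary(housing)
balance_scaled = standardize(balance)
```

Scaling matters mostly for models trained by gradient descent and for neural nets, where features on very different scales can slow or distort training.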

Special mention about the pre-processing done on education -
Analysis showed that the higher the education level, the greater the chance of a person signing up for a term deposit. Converting education to a binary matrix would have caused this ordering information to be lost. Therefore, the categories were manually converted to a numerical scale of 1, 2 and 3, with 1 = primary and 3 = tertiary
[Bar chart: values by education level (Primary, Secondary, Tertiary), axis 0 to 14]
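The ordinal conversion described for education can be sketched as a simple lookup. The mapping 1 = primary, 2 = secondary, 3 = tertiary comes from the slide; how unmapped values such as "unknown" were handled is not stated, so returning `None` for them is an assumption, and `encode_education` is a hypothetical helper name.

```python
# Mapping taken from the slide: 1 = primary, 2 = secondary, 3 = tertiary.
EDUCATION_ORDER = {"primary": 1, "secondary": 2, "tertiary": 3}


def encode_education(values, unknown=None):
    """Replace each education label with its rank; unmapped labels get `unknown`."""
    return [EDUCATION_ORDER.get(v.strip().lower(), unknown) for v in values]


# Hypothetical sample column.
sample = ["Primary", "tertiary", "secondary", "unknown"]
encoded = encode_education(sample)
print(encoded)
```

Unlike a binary matrix, a single ranked column lets a model see that tertiary sits above secondary, which sits above primary.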

The special processing of the “education” feature improved the MCC score of several algorithms, especially gradient descent and AdaBoost, which rely heavily on previous errors
[Bar chart: MCC scores (axis 0.38 to 0.42) for Gradient Descent and AdaBoost]

Choice of models - ensemble models
Random Forest - everyone’s favourite
An ensemble model that combines decision trees
Parameters used
Depth = 5
No of classifiers = 100
AdaBoost - acclaimed
Introduced by Freund and Schapire in the mid-1990s (their work on it received the Gödel Prize in 2003), it is considered one of the best out-of-the-box classifiers. It combines several weak learners and learns from their mistakes.
Less susceptible to overfitting
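The two ensemble models could be set up as in the sketch below, assuming scikit-learn (the slides do not name the library used). The Random Forest settings are the ones on the slide; the AdaBoost settings are not given, so the library default of 50 estimators is used. The data is synthetic stand-in data, not the bank data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Synthetic stand-in data; the real bank features are not reproduced here.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Settings quoted on the slide: depth = 5, number of classifiers = 100.
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)

# No AdaBoost settings are given on the slide; 50 is the scikit-learn default.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)

forest.fit(X, y)
boosted.fit(X, y)

forest_pred = forest.predict(X)
boosted_pred = boosted.predict(X)
```

Predictions on the training data, as here, say nothing about overfitting; that needs held-out data or cross-validation.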

Choice of models - non-ensemble models - linear models worked well on the data!

Matthews Correlation Coefficient scores of each model
Any model with an MCC score greater than 0.40 is considered strong. According to the stats, four different models qualify, with gradient descent scoring the highest. Does that mean Gradient Descent is the right choice? Is it a good fit? The real question is: does it overfit?
[Bar chart: MCC scores of Gradient Descent, AdaBoost, Regression and Neural Net, with bands marked Moderate and Strong]
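For reference, the Matthews Correlation Coefficient is computed from the four cells of the confusion matrix. The sketch below uses the standard formula; the customer counts are hypothetical and only show why MCC can be more telling than accuracy when few customers say yes.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical case: 100 customers, 12 of whom sign up.
# A model that always answers "no" is 88% accurate but has MCC 0.
always_no = mcc(tp=0, tn=88, fp=0, fn=12)

# A model that gets every customer right has MCC 1.
perfect = mcc(tp=12, tn=88, fp=0, fn=0)

# A model that gets every customer wrong has MCC -1.
inverted = mcc(tp=0, tn=0, fp=5, fn=5)

print(always_no, perfect, inverted)
```

MCC ranges from -1 to 1, with 0 meaning no better than chance.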

Let’s seek the answer using evaluation metrics from 5-fold cross-validation
5-fold cross-validation - Metric: Accuracy
[Chart: accuracy of Gradient Descent, AdaBoost, Regression and Neural Net]
5-fold cross-validation - Metric: ROC score
[Chart: ROC score of Gradient Descent, AdaBoost, Regression and Neural Net]
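A 5-fold evaluation against several metrics at once could look like the sketch below, assuming scikit-learn. The data is synthetic, stratified folds are an assumption (the slides only say "5-fold"), and AdaBoost stands in for any of the four models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in data (about 15% positives); not the bank data.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

# The three metrics used in the slides: accuracy, ROC score and MCC.
scoring = {
    "accuracy": "accuracy",
    "roc_auc": "roc_auc",
    "mcc": make_scorer(matthews_corrcoef),
}

# Stratified folds keep the yes/no ratio similar in every fold (an assumption).
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    AdaBoostClassifier(random_state=0), X, y, cv=folds, scoring=scoring
)

for name in scoring:
    fold_scores = results[f"test_{name}"]
    print(f"{name}: mean={fold_scores.mean():.3f} spread={fold_scores.std():.3f}")
```

A model whose scores swing widely from fold to fold is a likelier overfitting suspect than one whose scores stay close together.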

Preferred Model
AdaBoost
MCC score = 0.41, accuracy = 90%, ROC score = 0.88

Conclusion
Linear and ensemble models fitted well
With more effort, better relationships between the features can be gleaned. For example, marital status is strongly related to financial position. Such information can help improve the models further
The quest for an optimal model demonstrated that cross-validation is quite a useful strategy: it can not only save time in testing but also assist in making a better choice of model
In a real-world scenario, it would not hurt to test all four top models on unseen data

Final thoughts...
May the light of Buddha’s wisdom shine on all of us and guide us towards well-fitting models.
