PREDICTING EMPLOYEE ATTRITION
1.1 OBJECTIVE AND SCOPE OF THE STUDY
 The objective of this project is to predict the attrition rate for each employee and to find out who is more likely to leave the organization.
 This will help organizations find ways to prevent attrition, or to plan the hiring of new candidates in advance.
 Attrition is a costly and time-consuming problem for an organization, and it also leads to a loss of productivity.
 The scope of the project extends to companies in all industries.
1.2 ANALYTICS APPROACH
 Check for missing values in the data and, if any are found, process the data accordingly.
 Understand how the features are related to our target variable, attrition.
 Convert the target variable into numeric form.
 Apply feature selection and feature engineering to make the data model-ready.
 Apply various algorithms to check which one is the most suitable.
 Draw recommendations based on our analysis.
1.3 DATA SOURCES
 For this project, an HR dataset named ‘IBM HR Analytics Employee Attrition & Performance’ has been picked, which is available on the IBM website.
 The data contains records of 1,470 employees.
 It has information about each employee’s current employment status, total number of companies worked for in the past, total number of years at the current company and in the current role, education level, distance from home, monthly income, etc.
1.4 TOOLS AND TECHNIQUES
 We have selected Python as our analytics tool.
 Python offers many packages, such as Pandas, NumPy, Matplotlib, and Seaborn.
 Algorithms such as Logistic Regression, Random Forest, Support Vector Machine, and XGBoost have been used for prediction.
2.1 IMPORTING LIBRARIES AND DATA EXTRACTION
 Importing Packages
 Data Extraction
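A minimal sketch of these two steps is shown below. The CSV file name is an assumption based on the standard distribution of this IBM dataset; the DataFrame name attrition_df matches the one used later in this deck.

# importing the core analysis and visualization packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# data extraction: load the IBM HR dataset into a DataFrame
attrition_df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')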
2.2 EXPLORATORY DATA ANALYSIS
 Refers to the process of performing initial investigations on the data so as to discover patterns, spot inconsistencies, test hypotheses, and check assumptions with the help of graphical representations.
 Displaying First 5 Rows
 Displaying rows and columns
 Identifying Missing Values
 Count of “Yes” and “No” values of Attrition
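A sketch of these four checks with pandas, using the attrition_df DataFrame loaded earlier:

# display the first 5 rows
print(attrition_df.head())

# display the number of rows and columns
print(attrition_df.shape)

# identify missing values per column
print(attrition_df.isnull().sum())

# count of "Yes" and "No" values of Attrition
print(attrition_df['Attrition'].value_counts())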
2.3 VISUALIZATION (EDA)
 Attrition vs. “Age”
 Attrition vs. “Distance from Home”
 Attrition vs. “Job Satisfaction”
 Attrition vs. “Performance Rating”
 Attrition vs. “Training Times Last Year”
 Attrition vs. “Work Life Balance”
 Attrition vs. “Years At Company”
 Attrition vs. “Years in Current Role”
 Attrition vs. “Years Since Last Promotion”
 Attrition vs. Categorical Variables
Attrition vs. “Gender, Marital Status, and Overtime”
Attrition vs. “Department, Job Role, and Business Travel”
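As an illustration, the first of these plots could be drawn with seaborn; a sketch for Attrition vs. “Age” (the other numeric plots follow the same pattern):

# compare the age distribution of employees who left vs. those who stayed
import seaborn as sns
import matplotlib.pyplot as plt

sns.kdeplot(data=attrition_df, x='Age', hue='Attrition', common_norm=False)
plt.title('Attrition vs. Age')
plt.show()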
Data Pre-Processing
 Refers to a data mining technique that transforms raw data into an understandable format.
 Useful in making the data ready for analysis.
Steps Involved –
 Taking care of missing data and dropping non-relevant features
 Feature extraction
 Converting categorical features into numeric form, and binarization of the converted categorical features
 Feature scaling
 Understanding the correlation of features with each other
 Splitting the data into training and test data sets
3.1 FEATURE SELECTION
 The process wherein those features are selected that contribute most to the prediction variable or output.
Benefits of feature selection:
 Improves performance
 Improves accuracy
 Provides a better understanding of the data
Dropping non-relevant variables

# dropping all fixed and non-relevant variables
attrition_df.drop(['DailyRate', 'EmployeeCount', 'EmployeeNumber', 'HourlyRate',
                   'MonthlyRate', 'Over18', 'PerformanceRating', 'StandardHours',
                   'StockOptionLevel', 'TrainingTimesLastYear'],
                  axis=1, inplace=True)

Check number of rows and columns
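For example:

# verify the remaining number of rows and columns after the drop
print(attrition_df.shape)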
Feature Extraction
3.2 FEATURE ENGINEERING
Label Encoding
 Label encoding refers to converting categorical variables into numeric form, so as to make them machine-readable.
 It is an important pre-processing step for structured datasets in supervised learning.
 Fit and transform the required columns of the data, then replace the existing text data with the new encoded data.
Converting categorical variables into numeric variables
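A sketch using scikit-learn’s LabelEncoder; encoding every remaining text column in a loop is an assumption for illustration (the deck’s screenshots are not reproduced here):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# fit and transform each categorical (object) column, replacing the
# existing text data with the new encoded data
for col in attrition_df.select_dtypes(include='object').columns:
    attrition_df[col] = le.fit_transform(attrition_df[col])  # e.g. Attrition: No -> 0, Yes -> 1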
 One Hot Encoder
 It is used to perform “binarization” of the categorical features so that they can be included as features to train the model.
 It takes a column of categorical data that has been label encoded and splits it into multiple columns.
 The numbers are replaced by 1s and 0s, depending on which column has what value.
Applying “One Hot Encoder” on label-encoded features
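One way to apply this binarization is pandas’ get_dummies (an alternative to scikit-learn’s OneHotEncoder); the list of multi-category columns below is illustrative:

# split each multi-category column into one 0/1 column per category
multi_cat_cols = ['BusinessTravel', 'Department', 'JobRole', 'MaritalStatus']  # example columns
attrition_df = pd.get_dummies(attrition_df, columns=multi_cat_cols)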
Feature Scaling
 Feature scaling is a method used to standardize the range of independent variables or features of the data.
 It is also known as data normalization.
 It is used to scale the features to a range centred around zero, so that the variances of the features are on the same scale.
 The two most popular methods of feature scaling are standardization and normalization.
Scaling the features
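A sketch of standardization with scikit-learn; scaling everything except the target is an assumption:

from sklearn.preprocessing import StandardScaler

# standardize each feature to zero mean and unit variance
scaler = StandardScaler()
feature_cols = attrition_df.columns.drop('Attrition')
attrition_df[feature_cols] = scaler.fit_transform(attrition_df[feature_cols])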
Correlation Matrix
• Correlation is a statistical technique that determines how one variable moves/changes in relation to another variable.
• It is a bivariate analysis measure that describes the association between different variables.
Usefulness of the correlation matrix –
 If two variables are closely correlated, then we can predict one variable from the other.
 Correlation plays a vital role in locating the important variables on which other variables depend.
 It is used as the foundation for various modeling techniques.
 Proper correlation analysis leads to a better understanding of the data.
Plotting the correlation matrix
Correlation matrix plot
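A sketch of the plot with seaborn:

# compute pairwise correlations and display them as a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(attrition_df.corr(), cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()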
Splitting data into train and test
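A sketch with scikit-learn; the 80/20 split ratio and the random seed are assumptions:

from sklearn.model_selection import train_test_split

X = attrition_df.drop('Attrition', axis=1)  # features
y = attrition_df['Attrition']               # target

# hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)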
Model Building –
 The process of modeling means training a machine learning algorithm to predict the labels from the features, tuning it for the business need, and validating it on holdout data.
 Models used for employee attrition:
 Logistic Regression
 Random Forest
 Support Vector Machine
 XGBoost
4.1 LOGISTIC REGRESSION
 Logistic regression is one of the most basic and widely used machine learning algorithms for solving classification problems.
 It is a method used to predict a dependent variable (Y), given an independent variable (X), when the dependent variable is categorical.
 Linear Regression equation: Y = β0 + β1X + ∈
 Y stands for the dependent variable that needs to be predicted.
 β0 is the Y-intercept, the point where the line crosses the y-axis.
 β1 is the slope of the line (the slope can be negative or positive depending on the relationship between the dependent variable and the independent variable).
 X represents the independent variable that is used to predict our resultant dependent value.
 ∈ denotes the error in the computation.
 Sigmoid Function
p(x) = 1 / (1 + e^(-(β0 + β1x)))
Applying the sigmoid to the linear equation maps its output to a probability between 0 and 1, which is what makes logistic regression suitable for classification.
 Building Logistic Regression Model
 Testing the Model
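A sketch of these two steps with scikit-learn (default hyperparameters; the deck’s exact settings are not shown):

from sklearn.linear_model import LogisticRegression

# build (fit) the logistic regression model on the training data
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# test the model on the held-out data
y_pred = log_reg.predict(X_test)
print('Accuracy:', log_reg.score(X_test, y_test))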
 Confusion Matrix
 The confusion matrix is one of the most crucial metrics commonly used to evaluate classification models.
 It avoids “confusion” by laying out the actual and predicted values in a tabular format.
Standard table of the confusion matrix (Positive class = 1, Negative class = 0) –
 Creating confusion matrix
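For example:

from sklearn.metrics import confusion_matrix

# rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))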
 AUC Score
 Receiver Operating Characteristic (ROC)
 The ROC curve shows the accuracy of a classification model at user-defined threshold values.
 The model’s overall accuracy is summarized using the Area Under the Curve (AUC).
 The area under the curve (AUC), also referred to as the index of accuracy (A) or concordance index, represents the performance of the ROC curve: the higher the area, the better the model.
 Plotting the ROC curve
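A sketch of the AUC computation and ROC plot with scikit-learn:

from sklearn.metrics import roc_auc_score, roc_curve

# score the positive-class probabilities and compute the AUC
y_prob = log_reg.predict_proba(X_test)[:, 1]
print('ROC AUC:', roc_auc_score(y_test, y_prob))

# plot the ROC curve against the chance line
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label='Logistic Regression')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()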
 ROC Curve for Logistic Regression
Using the Logistic Regression algorithm, we got an accuracy score of 79% and an ROC AUC score of 0.77.
4.2 RANDOM FOREST
• Random Forest is a supervised learning algorithm.
• It builds a “forest” of randomized classification trees using the bagging technique and aggregates their predictions.
• In a random forest, only a random subset of the features is considered by the algorithm when splitting a node.
 Building Random Forest Model
 Testing the Model
 Confusion Matrix
 AUC score
 Plotting ROC curve
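A sketch of these steps, mirroring the logistic regression workflow (default hyperparameters assumed):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# build and test the random forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print('Accuracy:', rf.score(X_test, y_test))
print('ROC AUC:', roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))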
Using the Random Forest algorithm, we got an accuracy score of 79% and an ROC AUC score of 0.76.
 ROC Curve for Random Forest
4.3 SUPPORT VECTOR MACHINE
 SVM is a supervised machine learning algorithm used for both regression and classification problems.
 The objective is to find a separating hyperplane in an N-dimensional space.
 Hyperplanes
 Hyperplanes are decision boundaries that help segregate the data points.
 The dimension of the hyperplane depends upon the number of features.
 Support Vectors
 These are the data points closest to the hyperplane; they influence its position and orientation.
 They are used to maximize the margin of the classifier.
 They are considered critical elements of a dataset.
 Kernel Technique
 Used when a non-linear decision boundary is needed.
 The decision boundary is then no longer a straight line in the original feature space.
 Since we have a non-linear classification problem, the kernel technique used here is the Radial Basis Function (RBF).
 It helps in segregating data that are not linearly separable.
 Building SVM Model
 Testing SVM Model
 Confusion Matrix
 AUC Score
 Plotting ROC Curve
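A sketch with scikit-learn’s SVC; probability=True is needed so that predict_proba is available for the AUC score:

from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# build and test the SVM with the RBF kernel described above
svm = SVC(kernel='rbf', probability=True, random_state=42)
svm.fit(X_train, y_train)
print('Accuracy:', svm.score(X_test, y_test))
print('ROC AUC:', roc_auc_score(y_test, svm.predict_proba(X_test)[:, 1]))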
Using the SVM algorithm, we got an accuracy score of 79% and an ROC AUC score of 0.77.
 ROC Curve for SVM
4.4 XGBOOST
 XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework.
 XGBoost belongs to a family of boosting algorithms that convert weak learners into strong learners.
 It is a sequential process: trees are grown one after the other using information from the previously grown trees, so that each new predictor iteratively corrects the errors of the previous model.
 Advantages of XGBoost -
 Regularization
 Parallel Processing
 High Flexibility
 Handling Missing Values
 Tree Pruning
 Built-in Cross-Validation
 Building XGBoost Model
 Testing the Model
 Confusion Matrix
 AUC Score
 Plotting ROC Curve
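A sketch with the xgboost package (default hyperparameters assumed):

from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# build and test the gradient-boosted tree ensemble
xgb = XGBClassifier(eval_metric='logloss')
xgb.fit(X_train, y_train)
print('Accuracy:', xgb.score(X_test, y_test))
print('ROC AUC:', roc_auc_score(y_test, xgb.predict_proba(X_test)[:, 1]))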
Using the XGBoost algorithm, we got an accuracy score of 82% and an ROC AUC score of 0.81.
 ROC Curve for the XGBoost Model
4.5 COMPARISON OF MODELS

Model                     Accuracy    ROC AUC
Logistic Regression       79%         0.77
Random Forest             79%         0.76
Support Vector Machine    79%         0.77
XGBoost                   82%         0.81

 It can be observed from the table that XGBoost outperforms all the other models.
 Hence, based on these results, we can conclude that XGBoost will be the best model to predict future employee attrition for this company.
KEY FINDINGS
 The dataset does not contain any missing values or redundant features.
 The features most strongly positively correlated with the target variable (attrition) are: distance from home, job satisfaction, marital status, overtime, and business travel.
 The features most strongly negatively correlated with the target variable are: performance rating and training times last year.
RECOMMENDATIONS
 Transportation should be provided to employees living in the same area, or else a transportation allowance should be provided.
 Plan and allocate projects in such a way as to avoid the use of overtime.
 Employees who hit their two-year anniversary should be identified as potentially having a higher risk of leaving.
 Gather information on industry benchmarks to determine whether the company is providing competitive wages.
THANK YOU