International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 08 | Aug -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 485
Comparative Analysis of Machine Learning Algorithms for their
Effectiveness in Churn Prediction in the Telecom Industry
Mr. Nand Kumar1, Mr. Chetankumar Naik2
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Churn is a term that describes the attrition
rate of customers in a company. The telecom industry is
highly dynamic in nature and more active in customer
relationship management than most other industries, but
its customer base tends to be fragile: competing companies
make luring offers to win over customers, which results in
customer churn and thereby affects the assessment of the
company's customer base because of this fluidity. This paper
aims to address this problem by identifying the customers who
are likely to switch to other companies, using different
machine learning algorithms and evaluating each for its
effectiveness. Data analysis is done on previously recorded
data to extract the potential features, and their
dependencies, that have an impact on churn. State-of-the-art
classification algorithms such as balanced logistic regression,
random forest and balanced random forest are used. To
optimize the solution, a thorough analysis of performance
is done, and the algorithm that fits best is identified
based on a measure of goodness. The output
obtained will help the company control the attrition rate
by handling customers' issues on time and retaining them with
the company.
Key Words: — Balanced Logistic Regression, Churn,
Classification, Data pre-processing, key feature extraction,
Principal Component Analysis, Random forest
1. INTRODUCTION
The churn rate, also known as the rate of attrition, is the
percentage of subscribers to a service who discontinue
their subscriptions to that service within a given time
period. For a company to expand its clientele, its
growth rate, as measured by the number of new
customers, must exceed its churn rate[1].
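The relationship between churn rate and growth rate described above can be made concrete with a small calculation. The sketch below is in Python (the paper's own snippets are in R); the function names and the customer counts are hypothetical and chosen only for illustration.

```python
def churn_rate(customers_at_start, customers_lost):
    """Percentage of the starting customer base that left during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * customers_lost / customers_at_start


def growth_rate(customers_at_start, customers_gained):
    """Percentage of new customers gained, relative to the starting base."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * customers_gained / customers_at_start


def clientele_expands(customers_at_start, customers_gained, customers_lost):
    """True when the growth rate exceeds the churn rate for the period."""
    return (growth_rate(customers_at_start, customers_gained)
            > churn_rate(customers_at_start, customers_lost))


# Hypothetical quarter: 10,000 subscribers at the start, 800 gained, 500 lost.
print(churn_rate(10000, 500))              # 5.0
print(clientele_expands(10000, 800, 500))  # True
```

With the same starting base, gaining only 300 customers while losing 500 would make the function return False, which is the situation churn prediction tries to prevent.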
There are many reasons for a company to lose its
customers. The primary reasons are the cost of the
product, the product quality and the quality of service. The
valuation of any company is affected if the customer
outflow is very high, and many companies lose their
reputation amongst their stakeholders.
However, for customers to be retained, it is also very
important to measure customer satisfaction.
Evaluating the degree of customer satisfaction is a difficult
task, and it becomes more difficult when the customer base
runs into millions. Value-added services are another
factor in churn. Telecom companies have started a
new offering called Triple play [2], which combines TV,
broadband and phone services, in contrast to the
traditional model of phone services alone. This is seen as
a value-add that retains customers. Triple play not only
helps retain customers but also increases the average
revenue per user (ARPU), directly contributing to the
revenue of the company [3].
1.1 How to Reduce Churning:
1 “After sales service” is of prime importance.
Personal consideration is very important when it
comes to service, so that the customer feels
privileged.
2 Customized offers should be provided based on
the customer's expenditure.
3 Value-added services should be enhanced.
4 Building trust by handling each individual’s query
and resolving queries on time will yield returns.
2. LITERATURE SURVEY
Telecom companies have used two approaches to address
churn:
a) Untargeted approach: Mass marketing and creating
brand loyalty for the product.
b) Targeted approach: Focusing on the customers who are
likely to churn, intervening with personal care and
taking steps to reduce their likelihood of churning.
The targeted approach needs to derive important insights
that are difficult to evaluate manually, since the data is
very dynamic and large. Machine learning techniques
have evolved to a great extent and support intelligent solutions
to the churn issues present in the telecom industry.
Researchers have come up with significant algorithms that are
proven to give valuable insights into the given data. The
problem taken up here is the classification of
customers by their potential to churn, based on
several factors.
The performance of each algorithm depends on the volume
and variety of the data. Hence a generic solution is uncalled for.
The paper aims at a comparative analysis of the various
algorithms used in machine learning for this particular
problem, which can further provide a solution for
problems with similar inputs. A thorough analysis of the
algorithms will give a clear picture of the accuracy of
each. The final result of the analysis will decide on the best
solution with minimal errors.
Details of the data (referred to in Section 3.1):
1 Number of observations: 25000
2 Number of variables: 111 (independent = 110, target = 1)
3 Number of missing values: none
3. METHODOLOGY
1 Understanding the business problem is highly
important; once it is evaluated, a final structure
of the problem can be devised with confidence.
2 The next step involves understanding the data and its
interdependencies; the correlation among different
variables should be examined.
3 PCA for feature selection, new dimensions and
pre-processing
4 Build a logistic regression model for classification
5 Balance the data using SMOTE, then logistic regression
6 Classification using random forest (35 principal components)
7 Classification using random forest (4 important
principal components)
8 Random forest in H2O: parameter tuning
9 The final outcome is analyzed
The accuracy of a test depends on how well the test
separates the group being tested into those with and
without the condition in question (here, customers who churn
and customers who do not). Accuracy is measured by
the area under the ROC curve. An area of 1 represents a
perfect test; an area of 0.5 represents a worthless test [7].
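The area under the ROC curve can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The Python sketch below computes it directly from that definition; it is an illustration only, not the routine used in the paper's R workflow, and the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (churn), 0 for the negative class.
    scores: predicted scores, higher meaning more likely positive.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# A perfect ranking gives 1, an uninformative constant score gives 0.5.
print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))    # 1.0
print(roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))    # 0.5
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # 0.75
```

The double loop is quadratic in the number of cases, which is fine for illustration; for a data set of 25000 observations a rank-based implementation from a statistics package would be the practical choice.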
3.1 Data Exploration
The complex structure of data calls for new, sophisticated
solutions, including data transformation, semantic
representation and the accommodation of new mathematical
theories. The most important problem of
data processing is to make sense of the available data [4].
The data is provided by the telecom company and
needs some preprocessing. The details of the data
are given in the listing that precedes Section 3.
3.2 Principal component analysis for Feature Selection
Principal component analysis (PCA) is a method of
extracting important variables (in the form of components)
from a large set of variables available in a data set. It extracts
a low-dimensional set of features from a high-dimensional
data set with the aim of capturing as much information as
possible. With fewer variables, visualization also becomes
much more meaningful. PCA is more useful when dealing
with data of three or more dimensions [5].
Since the data to be worked on is of very high dimension, it
becomes necessary to eliminate dimensions so that the
model works effectively without unnecessary and
irrelevant computation, which can lead to unwanted
results. By construction, the first principal component
captures the largest share of the variance in the data, followed
by the second component and so on. The data expert, who
understands the problem best, decides on the
right number of components to use in order to provide the
right input to the model.
Fig -1: Principal Components
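One common way to support the choice of the number of components is to look at the cumulative share of variance explained. The Python sketch below assumes the per-component variances are already available (for example from a PCA routine); the threshold and the variance values are hypothetical, and the paper does not state which criterion led to its choice of 35 components.

```python
def components_needed(variances, threshold=0.90):
    """Smallest number of leading components whose cumulative share of the
    total variance reaches the threshold (an assumed, illustrative criterion)."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    ordered = sorted(variances, reverse=True)  # largest variance first
    cumulative = 0.0
    for k, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Hypothetical variances of four components (shares: 0.4, 0.3, 0.2, 0.1).
variances = [4.0, 3.0, 2.0, 1.0]
print(components_needed(variances, threshold=0.5))  # 2
print(components_needed(variances, threshold=0.8))  # 3
```

A stricter threshold keeps more components; the trade-off is between retained information and the computation the downstream model has to do.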
3.3 Stratified Sampling
Stratified sampling with train and validation sets in the ratio 7:3
library(caTools)  # provides sample.split()
set.seed(1000)
# sample.split() keeps the class ratio of target in both parts
split = sample.split(data_Pca_less_Comp$target, SplitRatio = 0.70)
# Split up the data using subset
train = subset(data_Pca_less_Comp, split==TRUE)
test = subset(data_Pca_less_Comp, split==FALSE)
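To show what a stratified 7:3 split does, the following Python sketch splits row indices class by class so that both parts keep the original class ratio. It is a conceptual illustration under assumed names (stratified_split is hypothetical) and does not reproduce the random stream of the R code above.

```python
import random


def stratified_split(labels, split_ratio=0.70, seed=1000):
    """Return (train_indices, test_indices) with the class ratio preserved."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_indices, test_indices = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random order within a class
        n_train = round(split_ratio * len(indices))
        train_indices.extend(indices[:n_train])    # 70% of this class
        test_indices.extend(indices[n_train:])     # remaining 30%
    return sorted(train_indices), sorted(test_indices)


# Hypothetical target column: 80 non-churners (0) and 20 churners (1).
labels = [0] * 80 + [1] * 20
train, test = stratified_split(labels)
print(len(train), len(test))                   # 70 30
print(sum(labels[i] for i in train))           # 14 churners in train
print(sum(labels[i] for i in test))            # 6 churners in test
```

Both parts keep the 20 percent churn share of the full data, which a plain random split would only achieve approximately.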
3.4 Build Logistic regression Model for Classification
Logistic regression is the appropriate regression analysis to
conduct when the dependent variable is dichotomous
(binary). Like all regression analyses, logistic regression
is a predictive analysis. Logistic regression is used to
describe data and to explain the relationship between one
dependent binary variable and one or more nominal,
ordinal, interval or ratio-level independent variables [8].
Logistic regression is fitted on the train data set with the 35
components from PCA. The results below are obtained by
running the model in RStudio. The glm() function is used to
apply logistic regression to the given data.
LogReg <- glm(target~.,data = train,family = "binomial")
Inference:
Accuracy on Train Set: 0.79
Accuracy on Test Set: 0.80
Recall on the Test Set: 0.58
The graph below is obtained in RStudio.
Fig -2: ROC for Logistic regression Model
The area under the curve is 0.8547 for the logistic regression
model.
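The accuracy and recall figures reported above can differ widely when churners are the minority class. The Python sketch below shows how both are computed from actual and predicted labels; the ten labels are invented for illustration and are not the paper's data.

```python
def accuracy_and_recall(actual, predicted, positive=1):
    """Return (accuracy, recall) for the given positive class."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(actual, predicted))
    correct = sum(1 for a, p in pairs if a == p)
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)
    accuracy = correct / len(actual)
    # Recall: share of actual positives that the model found.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return accuracy, recall


# Hypothetical case: 4 churners among 10 customers, only 2 of them detected.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(accuracy_and_recall(actual, predicted))  # (0.8, 0.5)
```

An accuracy of 0.8 here hides the fact that half of the churners were missed, which is why recall is reported alongside accuracy.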
3.5 Logistic Regression Using SMOTE
The recall observed above is low. The data is not severely
imbalanced, but building the model on completely
balanced data could help.
Use SMOTE to balance the data:
library(DMwR)  # provides SMOTE()
train_smote=SMOTE(target~.,data=train,
perc.over=100,perc.under=200)
# check the class proportions after balancing
prop.table(table(train_smote$target))
Fig -3: ROC for Logistic regression Model using SMOTE
Area Under Curve: 0.861
Accuracy on Train Set: 0.78
Accuracy on Test Set: 0.77
Recall on the Test Set: 0.76
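The idea behind SMOTE is to create synthetic minority-class points by interpolating between a minority point and one of its nearest minority neighbours. The Python sketch below illustrates that idea only; it is not the DMwR implementation, and smote_like and smote_class_counts are hypothetical helper names. The count helper assumes the DMwR convention that perc.over/100 new cases are generated per minority case and perc.under/100 majority cases are kept per generated case.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class tuples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, nearest (squared Euclidean distance) first.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment between the two points
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


def smote_class_counts(n_minority, perc_over=100, perc_under=200):
    """Resulting (minority, majority) counts under the assumed convention."""
    n_generated = n_minority * perc_over // 100
    n_majority_kept = n_generated * perc_under // 100
    return n_minority + n_generated, n_majority_kept


# Hypothetical minority points in two dimensions.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(len(smote_like(minority, n_new=4)))  # 4
print(smote_class_counts(1000))            # (2000, 2000)
```

Under that assumption, perc.over=100 with perc.under=200 yields equal class counts, which is consistent with the stated aim of building the model on completely balanced data.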
3.6 Classification using Random Forest
Random forest (Breiman, 2001) is an ensemble of unpruned
classification or regression trees, induced from bootstrap
samples of the training data, using random feature selection
in the tree induction process.
Prediction is made by aggregating (majority vote for
classification or averaging for regression) the predictions
of the ensemble. Random forest generally exhibits a
substantial performance improvement over single-tree
classifiers such as CART and C4.5 [6]. It yields a generalization
error rate that compares favourably to AdaBoost, yet is
more robust to noise. However, like most classifiers,
RF can also suffer from the curse of learning from an
extremely imbalanced training data set.
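The aggregation step described above (majority vote for classification, averaging for regression) can be sketched in a few lines of Python. The per-tree predictions below are invented; in a real forest each list would come from a tree grown on its own bootstrap sample.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine class predictions: one list per tree, one entry per sample."""
    n_samples = len(tree_predictions[0])
    combined = []
    for i in range(n_samples):
        votes = Counter(tree[i] for tree in tree_predictions)
        combined.append(votes.most_common(1)[0][0])  # most frequent class
    return combined


def average_prediction(tree_predictions):
    """Combine numeric predictions by averaging across trees."""
    n_trees = len(tree_predictions)
    n_samples = len(tree_predictions[0])
    return [sum(tree[i] for tree in tree_predictions) / n_trees
            for i in range(n_samples)]


# Three hypothetical trees predicting churn (1) or no churn (0) for 3 customers.
trees = [[1, 0, 1],
         [1, 1, 0],
         [0, 0, 1]]
print(majority_vote(trees))                            # [1, 0, 1]
print(average_prediction([[1.0, 2.0], [3.0, 4.0]]))    # [2.0, 3.0]
```

Using an odd number of trees avoids tied votes in a two-class problem such as churn prediction.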
The model was executed in RStudio. The output below is an
extract of its summary.
Fig-4: Summary for the performance of random forest
model for 35 principal components
The summary in Fig-4 shows the error rate when
random forest is used for classification with 35
principal components. However, around 80 percent of
the components have little or no impact on the
performance of the model. It is necessary to identify the
features that have the most impact, which in turn reduces
the computational complexity.
Fig -5: ROC for the random forest model with 35 principal
components
Inference:
Area Under Curve: 0.8324
(we chose mtry = 5, ntree = 50)
The area under the ROC curve is 0.83, which is lower
than that of the logistic regression model.
Fig -6: Random forest using top 4 important Attributes
Fig-7: Summary for the performance of random forest
model for 4 principal components
3.7 Random forest in H2O
1. We will use parameter tuning to find the values of
mtry and ntree that give us the best AUC
2. Import the dataset into H2O
3. Run a grid search and model selection with H2O,
defining the tuning parameter mtries
Fig -8: H2O grid details
Inference:
mtry = 8 gives us the best model.
The AUC is calculated on out-of-bag samples, so separate cross
validation is not needed.
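Out-of-bag samples are the rows that a tree's bootstrap sample happens to leave out, so they can serve as a built-in validation set for that tree. The Python sketch below shows how they arise; it is a conceptual illustration with a hypothetical function name, not H2O's implementation.

```python
import random


def bootstrap_with_oob(n_rows, seed=0):
    """Draw a bootstrap sample of row indices and list the rows left out."""
    rng = random.Random(seed)
    # Sample n_rows indices with replacement: the rows one tree is grown on.
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    drawn = set(in_bag)
    # Rows never drawn are out-of-bag for this tree.
    out_of_bag = [i for i in range(n_rows) if i not in drawn]
    return in_bag, out_of_bag


in_bag, out_of_bag = bootstrap_with_oob(1000)
print(len(in_bag))                  # 1000
print(len(out_of_bag) / 1000)       # close to 0.37 for a sample this large
```

For large data sets about 1/e (roughly 37 percent) of the rows are out-of-bag for any given tree, so every row is scored by a sizeable subset of trees that never saw it during training.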
4. CONCLUSION
The various models presented in the paper had good accuracy
rates, but the logistic regression model on the balanced data
has the highest area under the curve (0.861). By this measure it
is the best model: it delivers high accuracy and is
the more parsimonious model.
REFERENCES
[1] http://www.investopedia.com/terms/c/churnrate.asp
[2] https://en.wikipedia.org/wiki/Tripleplay_(telecommunications)
[3] http://www.happiestminds.com/whitepapers/how-to-reduce-churn-in-a-telco-industry
[4] http://www.academia.edu/4506159/Top_Challenges_of_Data_Processing
[5] https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/
[6] http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
[7] http://gim.unmc.edu/dxtests/roc3.htm
[8] http://www.statisticssolutions.com/what-is-logistic-regression/

More Related Content

What's hot (18)

PDF
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET Journal
 
PDF
IRJET- Analyzing Voting Results using Influence Matrix
IRJET Journal
 
PDF
A Defect Prediction Model for Software Product based on ANFIS
IJSRD
 
PDF
IRJET- The Machine Learning: The method of Artificial Intelligence
IRJET Journal
 
PDF
A02610104
theijes
 
PDF
Survey on semi supervised classification methods and feature selection
eSAT Journals
 
PDF
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
Venkata Karthik Gullapalli
 
PDF
Volume 2-issue-6-2165-2172
Editor IJARCET
 
PDF
IRJET - House Price Prediction using Machine Learning and RPA
IRJET Journal
 
PDF
IRJET - Employee Performance Prediction System using Data Mining
IRJET Journal
 
PDF
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
ijsc
 
PDF
Artificial Intelligence based Pattern Recognition
Dr. Amarjeet Singh
 
PDF
Survey on semi supervised classification methods and
eSAT Publishing House
 
PDF
IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...
IRJET Journal
 
PDF
IRJET- Disease Prediction System
IRJET Journal
 
PDF
IRJET- Machine Learning Classification Algorithms for Predictive Analysis in ...
IRJET Journal
 
PDF
Software Process Control on Ungrouped Data: Log-Power Model
Waqas Tariq
 
PDF
Survey on Feature Selection and Dimensionality Reduction Techniques
IRJET Journal
 
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET Journal
 
IRJET- Analyzing Voting Results using Influence Matrix
IRJET Journal
 
A Defect Prediction Model for Software Product based on ANFIS
IJSRD
 
IRJET- The Machine Learning: The method of Artificial Intelligence
IRJET Journal
 
A02610104
theijes
 
Survey on semi supervised classification methods and feature selection
eSAT Journals
 
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
Venkata Karthik Gullapalli
 
Volume 2-issue-6-2165-2172
Editor IJARCET
 
IRJET - House Price Prediction using Machine Learning and RPA
IRJET Journal
 
IRJET - Employee Performance Prediction System using Data Mining
IRJET Journal
 
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
ijsc
 
Artificial Intelligence based Pattern Recognition
Dr. Amarjeet Singh
 
Survey on semi supervised classification methods and
eSAT Publishing House
 
IRJET-Performance Enhancement in Machine Learning System using Hybrid Bee Col...
IRJET Journal
 
IRJET- Disease Prediction System
IRJET Journal
 
IRJET- Machine Learning Classification Algorithms for Predictive Analysis in ...
IRJET Journal
 
Software Process Control on Ungrouped Data: Log-Power Model
Waqas Tariq
 
Survey on Feature Selection and Dimensionality Reduction Techniques
IRJET Journal
 

Similar to Comparative Analysis of Machine Learning Algorithms for their Effectiveness in Churn Prediction in the Telecom Industry (20)

PDF
IRJET - Customer Churn Analysis in Telecom Industry
IRJET Journal
 
PDF
Machine Learning Approaches to Predict Customer Churn in Telecommunications I...
IRJET Journal
 
PPTX
Customer_Churn_prediction.pptx
patilaniket2418
 
PPTX
Customer_Churn_prediction.pptx
Aniket Patil
 
PDF
Customer choice probabilities
Allan D. Butler
 
PDF
CUSTOMER CHURN PREDICTION
IRJET Journal
 
PDF
Data Mining on Customer Churn Classification
Kaushik Rajan
 
PPTX
churn customer prediction model decision tree
drmohamadaboutaam
 
PDF
Customer churn classification using machine learning techniques
SindhujanDhayalan
 
PDF
IRJET- Improving Prediction of Potential Clients for Bank Term Deposits using...
IRJET Journal
 
PPTX
churn_detection.pptx
DhanuDhanu49
 
PDF
Df24693697
IJERA Editor
 
PDF
Automated Feature Selection and Churn Prediction using Deep Learning Models
IRJET Journal
 
PDF
A Proposed Churn Prediction Model
Mona Nasr
 
PPTX
Churn Analysis in Telecom Industry
Satyam Barsaiyan
 
PDF
IRJET- Ad-Click Prediction using Prediction Algorithm: Machine Learning Approach
IRJET Journal
 
PPTX
Telecom Churn Prediction Presentation
PinintiHarishReddy
 
PDF
ML_project_ppt.pdf
HetansheeShah2
 
PDF
Improving customer insight through prediction models
Alessandro Leona
 
PDF
Manuscript dss
rakeshkumarford1
 
IRJET - Customer Churn Analysis in Telecom Industry
IRJET Journal
 
Machine Learning Approaches to Predict Customer Churn in Telecommunications I...
IRJET Journal
 
Customer_Churn_prediction.pptx
patilaniket2418
 
Customer_Churn_prediction.pptx
Aniket Patil
 
Customer choice probabilities
Allan D. Butler
 
CUSTOMER CHURN PREDICTION
IRJET Journal
 
Data Mining on Customer Churn Classification
Kaushik Rajan
 
churn customer prediction model decision tree
drmohamadaboutaam
 
Customer churn classification using machine learning techniques
SindhujanDhayalan
 
IRJET- Improving Prediction of Potential Clients for Bank Term Deposits using...
IRJET Journal
 
churn_detection.pptx
DhanuDhanu49
 
Df24693697
IJERA Editor
 
Automated Feature Selection and Churn Prediction using Deep Learning Models
IRJET Journal
 
A Proposed Churn Prediction Model
Mona Nasr
 
Churn Analysis in Telecom Industry
Satyam Barsaiyan
 
IRJET- Ad-Click Prediction using Prediction Algorithm: Machine Learning Approach
IRJET Journal
 
Telecom Churn Prediction Presentation
PinintiHarishReddy
 
ML_project_ppt.pdf
HetansheeShah2
 
Improving customer insight through prediction models
Alessandro Leona
 
Manuscript dss
rakeshkumarford1
 
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
PDF
Kiona – A Smart Society Automation Project
IRJET Journal
 
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
IRJET Journal
 
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
IRJET Journal
 
PDF
Breast Cancer Detection using Computer Vision
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
Kiona – A Smart Society Automation Project
IRJET Journal
 
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
IRJET Journal
 
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
IRJET Journal
 
Breast Cancer Detection using Computer Vision
IRJET Journal
 
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
IRJET Journal
 
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Ad

Recently uploaded (20)

PDF
MAD Unit - 1 Introduction of Android IT Department
JappanMavani
 
PDF
PORTFOLIO Golam Kibria Khan — architect with a passion for thoughtful design...
MasumKhan59
 
PPTX
What is Shot Peening | Shot Peening is a Surface Treatment Process
Vibra Finish
 
PPTX
MATLAB : Introduction , Features , Display Windows, Syntax, Operators, Graph...
Amity University, Patna
 
PPTX
Big Data and Data Science hype .pptx
SUNEEL37
 
PPTX
Knowledge Representation : Semantic Networks
Amity University, Patna
 
PPTX
The Role of Information Technology in Environmental Protectio....pptx
nallamillisriram
 
PPTX
Lecture 1 Shell and Tube Heat exchanger-1.pptx
mailforillegalwork
 
PPTX
Damage of stability of a ship and how its change .pptx
ehamadulhaque
 
PPTX
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
PPTX
2025 CGI Congres - Surviving agile v05.pptx
Derk-Jan de Grood
 
PDF
smart lot access control system with eye
rasabzahra
 
PPTX
Shinkawa Proposal to meet Vibration API670.pptx
AchmadBashori2
 
PDF
Basic_Concepts_in_Clinical_Biochemistry_2018كيمياء_عملي.pdf
AdelLoin
 
PDF
Reasons for the succes of MENARD PRESSUREMETER.pdf
majdiamz
 
PPTX
Presentation 2.pptx AI-powered home security systems Secure-by-design IoT fr...
SoundaryaBC2
 
PPT
Electrical Safety Presentation for Basics Learning
AliJaved79382
 
PPTX
Thermal runway and thermal stability.pptx
godow93766
 
PPTX
Solar Thermal Energy System Seminar.pptx
Gpc Purapuza
 
PPTX
VITEEE 2026 Exam Details , Important Dates
SonaliSingh127098
 
MAD Unit - 1 Introduction of Android IT Department
JappanMavani
 
PORTFOLIO Golam Kibria Khan — architect with a passion for thoughtful design...
MasumKhan59
 
What is Shot Peening | Shot Peening is a Surface Treatment Process
Vibra Finish
 
MATLAB : Introduction , Features , Display Windows, Syntax, Operators, Graph...
Amity University, Patna
 
Big Data and Data Science hype .pptx
SUNEEL37
 
Knowledge Representation : Semantic Networks
Amity University, Patna
 
The Role of Information Technology in Environmental Protectio....pptx
nallamillisriram
 
Lecture 1 Shell and Tube Heat exchanger-1.pptx
mailforillegalwork
 
Damage of stability of a ship and how its change .pptx
ehamadulhaque
 
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
2025 CGI Congres - Surviving agile v05.pptx
Derk-Jan de Grood
 
smart lot access control system with eye
rasabzahra
 
Shinkawa Proposal to meet Vibration API670.pptx
AchmadBashori2
 
Basic_Concepts_in_Clinical_Biochemistry_2018كيمياء_عملي.pdf
AdelLoin
 
Reasons for the succes of MENARD PRESSUREMETER.pdf
majdiamz
 
Presentation 2.pptx AI-powered home security systems Secure-by-design IoT fr...
SoundaryaBC2
 
Electrical Safety Presentation for Basics Learning
AliJaved79382
 
Thermal runway and thermal stability.pptx
godow93766
 
Solar Thermal Energy System Seminar.pptx
Gpc Purapuza
 
VITEEE 2026 Exam Details , Important Dates
SonaliSingh127098
 

Comparative Analysis of Machine Learning Algorithms for their Effectiveness in Churn Prediction in the Telecom Industry

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 04 Issue: 08 | Aug -2017 www.irjet.net p-ISSN: 2395-0072 © 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 485 Comparative Analysis of Machine Learning Algorithms for their Effectiveness in Churn Prediction in the Telecom Industry Mr. Nand Kumar1, Mr. Chetankumar Naik2 ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - Churn is a term that gives insights of the attrition rate of the customer in any particular company. Telecom industry is highly dynamic in industry in nature; highly active in terms of customer relationship and management compared to any other industry but the customer base is tend to be fragile because of the luring offers offered by competitive companies to get strong hold the customer base which results into customer churn, thereby affecting the company’s customer’s assessments due to the liquidity. This paper aims at providing solution to this problem by identifying the potential customers who tend to switch to other companies by using different machine learning algorithms and evaluation each for their effectiveness. Data analysis is done on previously recorded data for extracting the potential features and their dependencies that impact on the churn. State of art classification algorithms like balanced logistic regression, random forest and balanced random forest are used. To optimize the solution, thorough analysis of the performance is done and based on the measure of goodness of the algorithm; the one that fits best is identified. The output obtained will help the company to control the attrition rate by handling their issues on time and retaining them with the company. 
Key Words: — Balanced Logistic Regression, Churn, Classification, Data pre-processing, key feature extraction, Principal Component Analysis, Random forest 1. INTRODUCTION The churn rate, also known as the rate of attrition, is the percentage of subscribers to a service who discontinue their subscriptions to that service within a given time period. For a company to expand its clientele, its growth rate, as measured by the number of new customers, must exceed its churn rate[1]. There are an ample number of reasons for a company to lose its customers. The primary reason is the cost of the product, the product quality and quality of service. The valuation of any company is affected if the customer outflow is veryhigh. Many companies lose its reputation amongst its stakeholders. However, for the customer to be retained, it is very important to also measure customer satisfaction. Evaluating the degree of customer satisfaction is a difficult task. It becomes more difficult when the customer base accounts up to millions. Value added service is another reason for Churn. Telecom companies have started a newoffering called Triple play [2], combining the TV, broadband and the phone offering as compared to the traditional model of just the phone services. This is seen as a value adds to retain customers. The Triple play not only helps retain customers but also increases the Average revenue per user (ARPU) directly contributing to the revenue of the company [3]. 1.1 How to Reduce Churning: 1 “After sales service” is of prime importance. Person consideration is very important when it comes to the service part so that the customer feels privileged. 2 Customized offers should be provided based on their expenditure. 3 Value added services should be enhanced 4 Trust building by handling each individual’s query and resoling queries on time will manifest returns. 2. 
LITERATURE SURVEY Telecom companies have used two approaches to address churn – a) Untargeted approach: Mass-marketing andcreating brand loyalty for the product b) Targeted approach: Involves focusing on the customer base likely to churn. Intervening with the personal care and taking steps to avoid their likeability to churn Targeted approach needs to derive some important insights which is difficult to be evaluated manually since the data is very dynamic and large. The machine learning techniques have evolved to a great extent assisting intelligent solutions to churn issues that are present in the telecom industry. The researchershave come upwith significant algorithms that are proven to give valuable insights over the given data. The problem that is being taken up is of the classification of customers on evaluation of their potential to churn based on several factors. The performance of each algorithm depends on the volume, and variety of thedata. Hencegeneric solution isuncalled for. The paper aims at comparative analysis of the various
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 04 Issue: 08 | Aug -2017 www.irjet.net p-ISSN: 2395-0072 © 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 486 1 Number of observations: 25000 2 3 No of variables: 111(Independent=110 target=1) Number of missing values: No missing values and algorithms used in machine learning for this particular problem which can further provide solution for the problem with similar inputs. The thorough analysis of the algorithms will give a clear picture about the accuracy of each. The final result of the analysis will decide on the best solution with minimal errors. 3. METHODOLOGY 1 Understanding t h e b u s i n e s s p r o b l e m i s h i g h l y important, evaluating which a final structure of the problem can be devised with confidence. 2 The next step inculcates understanding the data and its interdependency; correlation among different variables should be done. 3 PCA for Feature Selection, new dimensions, pre- processing 4 Build Logistic regression Model for Classification 5 Balance data using smote + Logistic Regression 6 Classification using Random Forest (35 PCA) 7 Classification using Random Forest (4 important PCA) 8 Random forest in H2O – Parameter Tuning 9 The final outcome is analyzed The accuracy of the test depends on how well the test separates the group being tested into those with and without the disease in question. Accuracy is measured by the area under the ROC curve. An area of 1 represents a perfect test; an area of .5 represents a worthless test[7]. 3.1 Data Exploration Complex structure of data implies new, sophisticated solutions including data transformation, semantic representation and new mathematical theories accommodation. The most important problem of data processing is to make sense of available data[4]. The data is provided by the telecom company which needs some preprocessing. 
3.2 Principal component analysis for Feature Selection

Principal component analysis (PCA) is a method of extracting important variables (in the form of components) from a large set of variables available in a data set. It extracts a low-dimensional set of features from a high-dimensional data set with the aim of capturing as much information as possible. With fewer variables, visualization also becomes much more meaningful. PCA is more useful when dealing with data of 3 or more dimensions [5].

Since the data to be worked on is of very high dimension, it becomes necessary to reduce the number of dimensions so that the model works effectively, without unnecessary and irrelevant computation that can lead to unwanted results. The first principal component captures the largest share of the variance in the data, followed by the second component, and so on. The data expert, who understands the problem best, decides on the right number of components to use in order to provide the right input to the model.

Fig -1: Principal Components

3.3 Stratified Sampling

Stratified sampling with train and validation sets in the ratio 7:3:

    library(caTools)  # provides sample.split()
    set.seed(1000)
    # Stratified split on the target: 70% train, 30% test
    split = sample.split(data_Pca_less_Comp$target, SplitRatio = 0.70)
    # Split up the data using subset
    train = subset(data_Pca_less_Comp, split == TRUE)
    test = subset(data_Pca_less_Comp, split == FALSE)

3.4 Build Logistic regression Model for Classification

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables [8].

Logistic regression using the train data set and 35 components from PCA

The results below are obtained by running the model in RStudio. The glm() function is used to apply logistic regression to the given data.

    LogReg <- glm(target ~ ., data = train, family = "binomial")

Inference:
Accuracy on Train Set: 0.79
Accuracy on Test Set: 0.80
Recall on the Test Set: 0.58

The graph below is obtained in RStudio.

Fig -2: ROC for Logistic regression Model

The area under the curve is 0.8547 for the logistic regression model.

3.5 Logistic Regression Using SMOTE

We witnessed a low recall. The data is not completely imbalanced, but building a model on completely balanced data could help. SMOTE is used to balance the data:

    library(DMwR)  # provides SMOTE()
    train_smote = SMOTE(target ~ ., data = train, perc.over = 100, perc.under = 200)
    # Check the class proportions after balancing
    prop.table(table(train_smote$target))

Fig -3: ROC for Logistic regression Model using SMOTE

Area Under Curve: 0.861
Accuracy on Train Set: 0.78
Accuracy on Test Set: 0.77
Recall on the Test Set: 0.76

3.6 Classification using Random Forest

Random forest (Breiman, 2001) is an ensemble of unpruned classification or regression trees, induced from bootstrap samples of the training data, using random feature selection in the tree induction process. Prediction is made by aggregating the predictions of the ensemble (majority vote for classification or averaging for regression). Random forest generally exhibits a substantial performance improvement over single tree classifiers such as CART and C4.5 [6]. It yields a generalization error rate that compares favourably to AdaBoost, yet is more robust to noise.
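The idea of growing unpruned trees on bootstrap samples and aggregating them by majority vote can be sketched in a few lines of R. This is a simplified illustration, not Breiman's algorithm and not the model fitted in this paper: the feature subset is drawn once per tree rather than at every split, the toy data and its column names are hypothetical, and the accuracy is measured on the training rows only. It assumes the rpart package, which ships with standard R installations.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(1000)

# Hypothetical toy data: x1 and x2 are informative, x3 is noise
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$target <- factor(ifelse(toy$x1 + toy$x2 > 0, "churn", "stay"))

n_trees <- 25  # odd, so a two-class vote cannot tie
predictors <- c("x1", "x2", "x3")

# Grow one unpruned tree per bootstrap sample
trees <- lapply(seq_len(n_trees), function(i) {
  boot_rows <- sample(nrow(toy), replace = TRUE)  # bootstrap sample of rows
  feats <- sample(predictors, 2)                  # random feature subset (per tree: a simplification)
  rpart(reformulate(feats, response = "target"),
        data = toy[boot_rows, ],
        method = "class",
        control = rpart.control(cp = 0, minsplit = 2))  # cp = 0 leaves the tree unpruned
})

# Every tree votes on every row; the ensemble prediction is the majority class
votes <- sapply(trees, function(tr) as.character(predict(tr, toy, type = "class")))
ensemble_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

# Accuracy on the training rows (optimistic; shown only to check the vote works)
ensemble_accuracy <- mean(ensemble_pred == toy$target)
cat("Training accuracy of the voted ensemble:", ensemble_accuracy, "\n")
```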
Like most classifiers, however, RF can suffer from the curse of learning from an extremely imbalanced training data set.

The model was executed in RStudio. The output below is an extract of its summary.

Fig-4: Summary for the performance of random forest model for 35 principal components

The summary in Figure 4 shows the error rate when random forest is used for classification with 35 principal components. However, around 80 percent of the components have little or no impact on the performance of the model. It is therefore necessary to identify the components that have the most impact, which in turn reduces the computational complexity.
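One common way to find the components with the most impact is to rank them by the variable importance that a random forest reports. The sketch below is a hedged illustration under stated assumptions: it requires the randomForest package, the data frame pcs is a hypothetical stand-in with ten simulated components (only PC1 and PC2 drive the target), and ntree = 50 and mtry = 5 simply reuse the values reported for the paper's 35-component model.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(1000)

# Hypothetical stand-in for the principal components: 10 simulated columns,
# of which only PC1 and PC2 influence the target
n <- 300
pcs <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))
names(pcs) <- paste0("PC", 1:10)
pcs$target <- factor(ifelse(pcs$PC1 - pcs$PC2 + rnorm(n, sd = 0.3) > 0, 1, 0))

# Fit the forest and ask it to record importance measures
rf <- randomForest(target ~ ., data = pcs, ntree = 50, mtry = 5, importance = TRUE)

# Rank components by mean decrease in Gini impurity (importance type 2)
imp <- importance(rf, type = 2)
ranked <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
top4 <- head(ranked, 4)

# Out-of-bag error rate after the last tree
oob_error <- rf$err.rate[rf$ntree, "OOB"]

cat("Top 4 components:", top4, "\n")
cat("OOB error rate:", oob_error, "\n")
```

A reduced model can then be refitted on the selected columns only, for example with pcs[, c(top4, "target")] as the data argument.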
Fig -5: ROC for the random forest model with 35 principal components

Inference:
Area Under Curve: 0.8324 (we chose mtry = 5, ntree = 50)

The area under the ROC curve is 0.83, which is lower than that of the logistic regression model.

Fig -6: Random forest using top 4 important Attributes

Fig-7: Summary for the performance of random forest model for 4 principal components

3.7 Random forest in H2O

1. We use parameter tuning to find the values of mtry and ntree that give the best AUC.
2. Import the dataset into H2O.
3. Grid search and model selection with H2O, defining the tuning parameter mtries.

Fig -8: H2O grid details

Inference: mtry = 8 gives us the best model. The AUC is calculated on out-of-bag samples; hence cross-validation is not needed.

4. CONCLUSION

The various models presented in the paper had good accuracy, but the logistic regression model on the balanced data has the highest area under the curve (0.861), which by convention makes it the best model. It delivers high accuracy and is also the more parsimonious model.

REFERENCES

[1] http://www.investopedia.com/terms/c/churnrate.asp
[2] https://en.wikipedia.org/wiki/Tripleplay_(telecommunications)
[3] http://www.happiestminds.com/whitepapers/how-to-reduce-churn-in-a-telco-industry
[4] http://www.academia.edu/4506159/Top_Challenges_of_Data_Processing
[5] https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/
[6] http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
[7] http://gim.unmc.edu/dxtests/roc3.htm
[8] http://www.statisticssolutions.com/what-is-logistic-regression/