VENKAT PROJECTS
Email:venkatjavaprojects@gmail.com Mobile No: +91 9966499110
Website: www.venkatjavaprojects.com What‘s app: +91 9966499110
OPTIMISED STACKED ENSEMBLE TECHNIQUES IN THE
PREDICTION OF CERVICAL CANCER USING SMOTE AND RFERF
ABSTRACT :
Cervical cancer is a frequently fatal disease common among females. However, early diagnosis of
cervical cancer can reduce mortality and other associated complications. Cervical cancer risk
factors can aid early diagnosis. To improve diagnostic accuracy, we propose a study for early
diagnosis of cervical cancer using a reduced risk-factor feature set and three ensemble-based
classification techniques, i.e., eXtreme Gradient Boosting (XGBoost), AdaBoost, and Random
Forest (RF), together with the Firefly algorithm for optimization. The Synthetic Minority
Oversampling Technique (SMOTE) was used to alleviate the data imbalance problem. The
Cervical Cancer Risk Factors data set, containing 32 risk factors and four targets (Hinselmann,
Schiller, Cytology, and Biopsy), is used in the study. The four targets are the widely used
diagnostic tests for cervical cancer. The effectiveness of the proposed study is evaluated in terms
of accuracy, sensitivity, specificity, positive predictive accuracy (PPA), and negative predictive
accuracy (NPA). Moreover, the Firefly feature selection technique was used to achieve better
results with a reduced number of features. Experimental results reveal the significance of the
proposed model, which achieved the highest outcome for the Hinselmann test when compared
with the other three diagnostic tests. Furthermore, reducing the number of features enhanced the
outcomes. Additionally, the proposed models show noticeable accuracy when compared with
other benchmark studies for cervical cancer diagnosis using the reduced risk factors data set.
EXISTING SYSTEM :
Dataset Description:
The cervical cancer risk factors data set used in the study was collected at “Hospital Universitario
de Caracas” in Caracas, Venezuela, and is available in the UCI Machine Learning Repository. It
consists of 858 records with some missing values, as several patients did not answer some of the
questions due to privacy concerns. The data set contains 32 risk factors and 4 targets, i.e., the
diagnostic tests used for cervical cancer. It contains different categories of features, such as habits,
demographic information, history, and genomic medical records. Features such as Age, Dx:
Cancer, Dx: CIN, Dx: HPV, and Dx contain no missing values. Dx: CIN denotes a change in the
walls of the cervix, commonly due to HPV infection; if not treated properly, it may sometimes
lead to cancer. The Dx: Cancer variable, in contrast, indicates whether the patient has other types
of cancer, and a patient may have more than one type of cancer. In the data set, some patients do
not have cervical cancer but still have a Dx: Cancer value of true; therefore, it is not used as a
target variable.
Table 1 presents a brief description of each feature with its type. Cervical cancer diagnosis
usually requires several tests; this data set contains the widely used diagnostic tests as the targets.
Hinselmann, Schiller, Cytology, and Biopsy are four widely used diagnostic tests for cervical
cancer. Hinselmann, or colposcopy, is a test that examines the inside of the vagina and cervix
using a tool that magnifies the tissues to detect any anomalies. Schiller is a test in which iodine is
applied to the cervix, where it stains healthy cells brown and leaves abnormal cells uncolored.
Cytology is a test that examines body cells from the uterine cervix for cancerous cells or other
diseases, and Biopsy refers to the test in which a small part of cervical tissue is examined under a
microscope. Most biopsy tests can provide a definitive diagnosis.
Dataset Preprocessing :
The data set suffers from a large number of missing values: 24 of the 32 features contain missing
values. Initially, the features with a huge percentage of missing values were removed. STDs:
Time since first diagnosis and STDs: Time since last diagnosis were removed since each has 787
missing values (see Table 2), which is more than half of the data. Data imputation was performed
for the features with fewer missing values, using the most-frequent-value technique to impute the
remaining missing values. Additionally, the data set suffers from a severe class imbalance: the
positive target labels number only 35 for Hinselmann, 74 for Schiller, 44 for Cytology, and 55
for Biopsy out of the 858 records, as shown in Figure 1. SMOTE was used to deal with this class
imbalance. SMOTE oversamples the minority class by generating new synthetic instances from
minority samples and their nearest neighbors, measured by the Euclidean distance between data
points. Figure 1 shows the number of records per class label in the data set.
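The interpolation step of SMOTE described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation (function name and parameters are ours for the sketch), not the imbalanced-learn library typically used in practice:

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples (minimal SMOTE sketch).

    For each synthetic point: pick a random minority sample, pick one of its
    k nearest minority neighbours (Euclidean distance), and interpolate at a
    random position on the line segment between the two points.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances among minority samples only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbours
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                # random base minority sample
        j = nn[i, rng.integers(k)]         # one of its k nearest neighbours
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

To balance a target such as Hinselmann (35 positives out of 858), one would call `smote` with `n_new` equal to the difference between the class counts and append the result to the minority class.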
Firefly Feature Selection:
Dimensionality reduction is one of the effective ways to select features that improve the
performance of a supervised learning model. In this study, we adopted the nature-inspired Firefly
algorithm to select the features that best formulate the problem. Firefly was proposed by Yang,
initially for optimization. The metaheuristic Firefly algorithm is inspired by the flashing behavior
of fireflies. It is a population-based optimization algorithm for finding the optimal value or
parameters of a target function. In this technique, each fly is attracted by the glow intensity of
nearby flies; if the intensity of the glow is extremely low at some point, the attraction declines.
Firefly uses three rules: (a) all flies are of the same gender; (b) attractiveness depends on the
intensity of the glow; (c) the target function generates the glow of each firefly. Flies with a
dimmer glow move towards flies with a brighter glow, and the brightness is adjusted using the
objective function. The same idea is applied here to search for the optimal features that best fit
the training model. Firefly is more computationally economical and produced better outcomes in
feature selection when compared with other metaheuristic techniques such as genetic algorithms
and particle swarm optimization. The time complexity of Firefly is O(n²t). It uses light intensity
to select features: highly relevant features are represented as features with high-intensity light.
For feature selection, some fireflies are generated initially, and each fly randomly assigns
weights to all features. In our study, we generated 50 flies (n = 50). The dimension of the data set
is 30. Furthermore, the lower bound was set to −50, the upper bound to 50, and the maximum
number of generations to 500. Additionally, α (alpha) was initially set to 0.5 and was updated in
every subsequent iteration, while gamma (γ) was set to 1. The number of features selected using
Firefly was 15 for Hinselmann, 13 for Schiller, 11 for Cytology, and 11 for Biopsy.
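The attraction-and-glow mechanics described above can be illustrated with a toy firefly optimiser on a generic objective. This is a generic sketch, not the study's feature-selection setup; the geometric decay of α is an assumed schedule, since the exact update rule is not given here, and all names are ours:

```python
import numpy as np

def firefly_minimise(obj, dim, n_flies=25, n_iter=200,
                     lb=-50.0, ub=50.0, alpha=0.5, beta0=1.0, gamma=1.0, seed=0):
    """Toy firefly optimiser: brightness = -obj(x), so a lower objective is brighter."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, (n_flies, dim))     # initial population of flies
    f = np.array([obj(x) for x in X])
    for _ in range(n_iter):
        for i in range(n_flies):
            for j in range(n_flies):
                if f[j] < f[i]:                 # fly j glows brighter: i moves toward j
                    r2 = float(np.sum((X[i] - X[j]) ** 2))
                    beta = beta0 * np.exp(-gamma * r2)   # attractiveness decays with distance
                    X[i] = np.clip(X[i] + beta * (X[j] - X[i])
                                   + alpha * (rng.random(dim) - 0.5), lb, ub)
                    f[i] = obj(X[i])
        alpha *= 0.97  # assumed schedule: shrink the random-walk term each generation
    best = int(np.argmin(f))
    return X[best], float(f[best])
```

For feature selection, `obj` would score a candidate feature-weight vector (e.g., by classifier accuracy on the features whose weights pass a threshold); here any numeric objective works. Note that the brightest fly never moves, so the best objective value found is non-increasing.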
Ensemble-Based Classification Methods :
Ensemble-based classification techniques, namely Random Forest, eXtreme Gradient Boosting,
and AdaBoost, were used to train the models. These techniques are described in the sections
below.
Random Forest :
Random Forest (RF) was first proposed by Breiman in 2001. Random Forest is an ensemble
model that uses decision trees as individual models and bagging as the ensemble method. It
improves on a single decision tree by combining many trees, which reduces the decision tree's
tendency to overfit. RF can be used for both classification and regression. RF builds a forest of
decision trees, obtains a prediction from each of them, and then selects the answer with the
maximum votes.
When training a tree, it is important to measure how much each feature decreases the impurity,
as the decrease in impurity indicates the significance of the feature. The tree's classification
result depends on the impurity measure used. For classification, the impurity measures are either
Gini impurity or information gain; for regression, the impurity measure is variance. Training a
decision tree consists of iteratively splitting the data. Gini impurity decides the best split of the
data using the formula
Gini = 1 − Σᵢ p(i)²,
where p(i) is the probability of selecting a data point of class i. Information gain (IG) is another
measure to decide the best split of the data, depending on the gain of each feature. It is computed
from the entropy H = −Σᵢ p(i) log₂ p(i) as the reduction in entropy from the parent node to the
weighted average of the child nodes:
IG = H(parent) − Σ_c (n_c / n) H(child c),
where n_c is the number of samples in child c and n the number of samples in the parent.
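Both impurity measures follow directly from the class probabilities; a small sketch (helper names are illustrative):

```python
import numpy as np

def gini(labels):
    """Gini impurity: G = 1 - sum_i p(i)^2 over the class probabilities p(i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy: H = -sum_i p(i) * log2(p(i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, left, right):
    """IG = H(parent) minus the size-weighted entropies of the two children."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)
```

For a balanced binary node such as `[0, 0, 1, 1]`, Gini impurity is 0.5 and entropy is 1 bit; a perfect split into `[0, 0]` and `[1, 1]` removes all uncertainty, so its information gain equals the parent entropy.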
Extreme Gradient Boosting :
eXtreme Gradient Boosting (XGBoost) is a tree-based ensemble technique. XGBoost can be
used for classification, regression, and ranking problems. XGBoost is a type of gradient
boosting. Gradient Boosting (GB) is a boosting ensemble technique that builds predictors
sequentially instead of independently. GB produces a strong classifier by combining weak
classifiers. The goal of GB is to build an iterative model that optimizes a loss function; it
pinpoints the failings of the weak learners by using gradients of the loss function. The loss
function measures how well the model fits the underlying data and depends on the optimization
goal: for regression it is a measure of the error between the true and predicted values, whereas
for classification it measures how well the model classifies cases correctly. This technique takes
less time and fewer iterations, since each predictor learns from the past mistakes of the previous
predictors.
GB works by training a model C to predict values of the form ŷ = C(x) by minimizing a loss
function, e.g., the mean squared error
MSE = (1/n) Σᵢ (ŷᵢ − yᵢ)²,
where i iterates over a training set of size n, ŷᵢ are the estimated values of C(x), and yᵢ are the
true values of the target variable. Consider a GB model with M stages, with m a single stage
(1 ≤ m ≤ M). To improve a deficient model Fₘ, a new estimator hₘ(x) is added:
Fₘ₊₁(x) = Fₘ(x) + hₘ(x).
The estimator hₘ is fitted to y − Fₘ(x), which is the difference between the true value and the
predicted value, i.e., the residual. Thus, we attempt to correct the errors of the previous
model Fₘ.
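The residual-fitting recursion above can be demonstrated with a from-scratch gradient-boosting sketch that uses one-feature regression stumps as weak learners. This illustrates the idea only, not the XGBoost implementation; all names are ours:

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold regression stump on 1-D input, minimising squared error."""
    order = np.argsort(x)
    xs, rs = x[order], residual[order]
    best = None
    for t in (xs[:-1] + xs[1:]) / 2.0:          # candidate thresholds between sorted points
        left, right = rs[xs <= t], rs[xs > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda z: np.where(z <= t, lv, rv)   # piecewise-constant weak learner

def gradient_boost(x, y, M=50, lr=0.1):
    """F_0 = mean(y); F_{m+1}(x) = F_m(x) + lr * h_m(x), with h_m fit to y - F_m(x)."""
    F = np.full(len(y), y.mean(), dtype=float)
    stumps = []
    for _ in range(M):
        h = fit_stump(x, y - F)                 # each stage fits the current residual
        F = F + lr * h(x)
        stumps.append(h)
    base = y.mean()
    return lambda z: base + lr * sum(h(z) for h in stumps)
```

Because each stage fits the residual of the previous ensemble, the training error shrinks geometrically for simple targets; the learning rate `lr` damps each correction.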
XGBoost improves on AdaBoost in terms of speed and performance. It is highly scalable and
runs about ten times faster than traditional single-machine learning algorithms. XGBoost handles
sparse data and implements several optimization and regularization techniques. Moreover, it also
uses the concepts of parallel and distributed computing.
DISADVANTAGES OF EXISTING SYSTEM :
1) Less accuracy
2) Low efficiency
PROPOSED SYSTEM:
The model was implemented in Python 3.8.0 using the Jupyter Notebook environment. The
scikit-learn library was used for the classifiers along with other needed built-in tools, while a
separate library (xgboost 1.2.0) was used for the XGBoost ensemble. K-fold cross-validation
with K = 10 was used to partition the data into training and testing sets. Five evaluation
measures were used: accuracy, sensitivity (recall), specificity, positive predictive accuracy
(PPA), and negative predictive accuracy (NPA). Sensitivity and specificity receive more focus in
the study because of the clinical application of the proposed model. Accuracy denotes the
percentage of correctly classified cases, sensitivity measures the percentage of positive cases that
were classified as positive, and specificity refers to the percentage of negative cases that were
classified as negative. Moreover, the criteria for selecting the performance evaluation measures
depend on the measures used in the benchmark studies. Two sets of experiments were conducted
for each target: one using the features selected by the Firefly feature selection algorithm and one
using all 30 features. The SMOTE technique was applied to generate synthetic data. The results
of the models are presented in the sections below.
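The five evaluation measures follow directly from the confusion-matrix counts; a minimal helper (name and signature are illustrative):

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity, PPA and NPA from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy":    (tp + tn) / total,   # fraction of correctly classified cases
        "sensitivity": tp / (tp + fn),      # positives classified as positive (recall)
        "specificity": tn / (tn + fp),      # negatives classified as negative
        "PPA":         tp / (tp + fp),      # positive predictive accuracy (precision)
        "NPA":         tn / (tn + fn),      # negative predictive accuracy
    }
```

For example, a fold with 40 true positives, 10 false negatives, 5 false positives, and 45 true negatives yields accuracy 0.85, sensitivity 0.80, and specificity 0.90.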
Hinselmann :
Table 6 presents the accuracy, sensitivity, specificity, PPA, and NPA for the RF, AdaBoost, and
XGBoost models, respectively, using SMOTE for the Hinselmann target class. The number of
selected features for Hinselmann was 15. XGBoost outperformed the other classifiers for both
feature sets; moreover, its performance with the selected features is better than with all 30
features. The model produces an accuracy of 98.83%, sensitivity of 97.5%, specificity of 99.2%,
PPA of 99.17%, and NPA of 97.63%.
Schiller :
Table 7 presents the outcomes for the Schiller test. As with the Hinselmann target, XGBoost
with selected features outperformed the other models for Schiller. However, the outcomes
achieved for Schiller are lower than those for the Hinselmann target class. The performance of
RF and XGBoost with selected features is similar for Schiller, with only a minor difference. The
number of features selected by Firefly for Schiller was 13.
ADVANTAGES OF PROPOSED SYSTEM :
1) High accuracy
2) High efficiency
VENKAT PROJECTS
Email:venkatjavaprojects@gmail.com Mobile No: +91 9966499110
Website: www.venkatjavaprojects.com What‘s app: +91 9966499110
SYSTEM ARCHITECTURE :
HARDWARE & SOFTWARE REQUIREMENTS:
HARDWARE REQUIREMENTS :
• System : i3 processor or above
• RAM : 4 GB
• Hard disk : 40 GB
SOFTWARE REQUIREMENTS :
• Operating system : Windows
• Coding language : Python
VENKAT PROJECTS
Email:venkatjavaprojects@gmail.com Mobile No: +91 9966499110
Website: www.venkatjavaprojects.com What‘s app: +91 9966499110

More Related Content

PDF
An approach of cervical cancer diagnosis using class weighting and oversampli...
TELKOMNIKA JOURNAL
 
PDF
Classification of Breast Cancer Tissues using Decision Tree Algorithms
Lovely Professional University
 
PDF
Enhancing breast cancer diagnosis: a comparative analysis of feature selectio...
IAESIJAI
 
PDF
IRJET - Survey on Analysis of Breast Cancer Prediction
IRJET Journal
 
PPTX
Cancer detection using data mining
RishabhKumar283
 
PDF
A machine learning based framework for breast cancer prediction using biomar...
IAESIJAI
 
PDF
Logistic Regression Model for Predicting the Malignancy of Breast Cancer
IRJET Journal
 
PDF
Diagnosis of Cancer using Fuzzy Rough Set Theory
IRJET Journal
 
An approach of cervical cancer diagnosis using class weighting and oversampli...
TELKOMNIKA JOURNAL
 
Classification of Breast Cancer Tissues using Decision Tree Algorithms
Lovely Professional University
 
Enhancing breast cancer diagnosis: a comparative analysis of feature selectio...
IAESIJAI
 
IRJET - Survey on Analysis of Breast Cancer Prediction
IRJET Journal
 
Cancer detection using data mining
RishabhKumar283
 
A machine learning based framework for breast cancer prediction using biomar...
IAESIJAI
 
Logistic Regression Model for Predicting the Malignancy of Breast Cancer
IRJET Journal
 
Diagnosis of Cancer using Fuzzy Rough Set Theory
IRJET Journal
 

Similar to OPTIMISED STACKED ENSEMBLE TECHNIQUES IN THE PREDICTION OF CERVICAL CANCER USING SMOTE AND RFERF.docx (20)

PDF
My own Machine Learning project - Breast Cancer Prediction
Gabriele Mineo
 
PDF
The effect of features combination on coloscopy images of cervical cancer usi...
IAESIJAI
 
PPS
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
Sunil Nair
 
PDF
ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTIO...
cscpconf
 
PDF
Breast Cancer Prediction using Machine Learning
IRJET Journal
 
PPTX
Machine Learning - Breast Cancer Diagnosis
Pramod Sharma
 
PDF
V5I3_IJERTV5IS031157
ahmad abdelhafeez
 
PDF
Cervical cancer diagnosis based on cytology pap smear image classification us...
TELKOMNIKA JOURNAL
 
PDF
Intelligent cervical cancer detection: empowering healthcare with machine lea...
IAESIJAI
 
PDF
Performance enhancement of machine learning algorithm for breast cancer diagn...
IJECEIAES
 
DOCX
A Pioneering Cervical Cancer Prediction Prototype in Medical Data Mining usin...
IIRindia
 
PPTX
DataMining Techniques in BreastCancer.pptx
MaligireddyTanujaRed1
 
PDF
IRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET Journal
 
PDF
Niakšu, Olegas ; Kurasova, Olga ; Gedminaitė, Jurgita „Duomenų tyryba BRCA1 g...
Lietuvos kompiuterininkų sąjunga
 
PDF
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
mlaij
 
PDF
BREAST TUMOR DETECTION USING EFFICIENT MACHINE LEARNING AND DEEP LEARNING TEC...
mlaij
 
PDF
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
mlaij
 
PDF
Cervical Cancer Detection: An Enhanced Approach through Transfer Learning and...
IRJET Journal
 
PDF
fnano-04-972421.pdf
EverestTechnomania
 
PDF
Predictive modeling for breast cancer based on machine learning algorithms an...
IJECEIAES
 
My own Machine Learning project - Breast Cancer Prediction
Gabriele Mineo
 
The effect of features combination on coloscopy images of cervical cancer usi...
IAESIJAI
 
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
Sunil Nair
 
ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTIO...
cscpconf
 
Breast Cancer Prediction using Machine Learning
IRJET Journal
 
Machine Learning - Breast Cancer Diagnosis
Pramod Sharma
 
V5I3_IJERTV5IS031157
ahmad abdelhafeez
 
Cervical cancer diagnosis based on cytology pap smear image classification us...
TELKOMNIKA JOURNAL
 
Intelligent cervical cancer detection: empowering healthcare with machine lea...
IAESIJAI
 
Performance enhancement of machine learning algorithm for breast cancer diagn...
IJECEIAES
 
A Pioneering Cervical Cancer Prediction Prototype in Medical Data Mining usin...
IIRindia
 
DataMining Techniques in BreastCancer.pptx
MaligireddyTanujaRed1
 
IRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET Journal
 
Niakšu, Olegas ; Kurasova, Olga ; Gedminaitė, Jurgita „Duomenų tyryba BRCA1 g...
Lietuvos kompiuterininkų sąjunga
 
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
mlaij
 
BREAST TUMOR DETECTION USING EFFICIENT MACHINE LEARNING AND DEEP LEARNING TEC...
mlaij
 
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
mlaij
 
Cervical Cancer Detection: An Enhanced Approach through Transfer Learning and...
IRJET Journal
 
fnano-04-972421.pdf
EverestTechnomania
 
Predictive modeling for breast cancer based on machine learning algorithms an...
IJECEIAES
 
Ad

More from Venkat Projects (20)

DOCX
1.AUTOMATIC DETECTION OF DIABETIC RETINOPATHY USING CNN.docx
Venkat Projects
 
DOCX
12.BLOCKCHAIN BASED MILK DELIVERY PLATFORM FOR STALLHOLDER DAIRY FARMERS IN K...
Venkat Projects
 
DOCX
10.ATTENDANCE CAPTURE SYSTEM USING FACE RECOGNITION.docx
Venkat Projects
 
DOCX
9.IMPLEMENTATION OF BLOCKCHAIN IN FINANCIAL SECTOR TO IMPROVE SCALABILITY.docx
Venkat Projects
 
DOCX
8.Geo Tracking Of Waste And Triggering Alerts And Mapping Areas With High Was...
Venkat Projects
 
DOCX
Image Forgery Detection Based on Fusion of Lightweight Deep Learning Models.docx
Venkat Projects
 
DOCX
6.A FOREST FIRE IDENTIFICATION METHOD FOR UNMANNED AERIAL VEHICLE MONITORING ...
Venkat Projects
 
DOCX
WATERMARKING IMAGES
Venkat Projects
 
DOCX
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...
Venkat Projects
 
DOCX
Application and evaluation of a K-Medoidsbased shape clustering method for an...
Venkat Projects
 
DOCX
1.AUTOMATIC DETECTION OF DIABETIC RETINOPATHY USING CNN.docx
Venkat Projects
 
DOCX
2022 PYTHON MAJOR PROJECTS LIST.docx
Venkat Projects
 
DOCX
2022 PYTHON PROJECTS LIST.docx
Venkat Projects
 
DOCX
2021 PYTHON PROJECTS LIST.docx
Venkat Projects
 
DOCX
2021 python projects list
Venkat Projects
 
DOCX
10.sentiment analysis of customer product reviews using machine learni
Venkat Projects
 
DOCX
9.data analysis for understanding the impact of covid–19 vaccinations on the ...
Venkat Projects
 
DOCX
6.iris recognition using machine learning technique
Venkat Projects
 
DOCX
5.local community detection algorithm based on minimal cluster
Venkat Projects
 
DOCX
4.detection of fake news through implementation of data science application
Venkat Projects
 
1.AUTOMATIC DETECTION OF DIABETIC RETINOPATHY USING CNN.docx
Venkat Projects
 
12.BLOCKCHAIN BASED MILK DELIVERY PLATFORM FOR STALLHOLDER DAIRY FARMERS IN K...
Venkat Projects
 
10.ATTENDANCE CAPTURE SYSTEM USING FACE RECOGNITION.docx
Venkat Projects
 
9.IMPLEMENTATION OF BLOCKCHAIN IN FINANCIAL SECTOR TO IMPROVE SCALABILITY.docx
Venkat Projects
 
8.Geo Tracking Of Waste And Triggering Alerts And Mapping Areas With High Was...
Venkat Projects
 
Image Forgery Detection Based on Fusion of Lightweight Deep Learning Models.docx
Venkat Projects
 
6.A FOREST FIRE IDENTIFICATION METHOD FOR UNMANNED AERIAL VEHICLE MONITORING ...
Venkat Projects
 
WATERMARKING IMAGES
Venkat Projects
 
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...
Venkat Projects
 
Application and evaluation of a K-Medoidsbased shape clustering method for an...
Venkat Projects
 
1.AUTOMATIC DETECTION OF DIABETIC RETINOPATHY USING CNN.docx
Venkat Projects
 
2022 PYTHON MAJOR PROJECTS LIST.docx
Venkat Projects
 
2022 PYTHON PROJECTS LIST.docx
Venkat Projects
 
2021 PYTHON PROJECTS LIST.docx
Venkat Projects
 
2021 python projects list
Venkat Projects
 
10.sentiment analysis of customer product reviews using machine learni
Venkat Projects
 
9.data analysis for understanding the impact of covid–19 vaccinations on the ...
Venkat Projects
 
6.iris recognition using machine learning technique
Venkat Projects
 
5.local community detection algorithm based on minimal cluster
Venkat Projects
 
4.detection of fake news through implementation of data science application
Venkat Projects
 
Ad

Recently uploaded (20)

PPTX
Victory Precisions_Supplier Profile.pptx
victoryprecisions199
 
PDF
Cryptography and Information :Security Fundamentals
Dr. Madhuri Jawale
 
PDF
All chapters of Strength of materials.ppt
girmabiniyam1234
 
PDF
top-5-use-cases-for-splunk-security-analytics.pdf
yaghutialireza
 
PDF
Biodegradable Plastics: Innovations and Market Potential (www.kiu.ac.ug)
publication11
 
PDF
2010_Book_EnvironmentalBioengineering (1).pdf
EmilianoRodriguezTll
 
PDF
FLEX-LNG-Company-Presentation-Nov-2017.pdf
jbloggzs
 
PDF
Advanced LangChain & RAG: Building a Financial AI Assistant with Real-Time Data
Soufiane Sejjari
 
PDF
2025 Laurence Sigler - Advancing Decision Support. Content Management Ecommer...
Francisco Javier Mora Serrano
 
PDF
LEAP-1B presedntation xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
hatem173148
 
PDF
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
PDF
EVS+PRESENTATIONS EVS+PRESENTATIONS like
saiyedaqib429
 
PPTX
MT Chapter 1.pptx- Magnetic particle testing
ABCAnyBodyCanRelax
 
PDF
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
PPTX
Civil Engineering Practices_BY Sh.JP Mishra 23.09.pptx
bineetmishra1990
 
PDF
Packaging Tips for Stainless Steel Tubes and Pipes
heavymetalsandtubes
 
PDF
CAD-CAM U-1 Combined Notes_57761226_2025_04_22_14_40.pdf
shailendrapratap2002
 
PDF
The Effect of Artifact Removal from EEG Signals on the Detection of Epileptic...
Partho Prosad
 
PPT
1. SYSTEMS, ROLES, AND DEVELOPMENT METHODOLOGIES.ppt
zilow058
 
PDF
AI-Driven IoT-Enabled UAV Inspection Framework for Predictive Maintenance and...
ijcncjournal019
 
Victory Precisions_Supplier Profile.pptx
victoryprecisions199
 
Cryptography and Information :Security Fundamentals
Dr. Madhuri Jawale
 
All chapters of Strength of materials.ppt
girmabiniyam1234
 
top-5-use-cases-for-splunk-security-analytics.pdf
yaghutialireza
 
Biodegradable Plastics: Innovations and Market Potential (www.kiu.ac.ug)
publication11
 
2010_Book_EnvironmentalBioengineering (1).pdf
EmilianoRodriguezTll
 
FLEX-LNG-Company-Presentation-Nov-2017.pdf
jbloggzs
 
Advanced LangChain & RAG: Building a Financial AI Assistant with Real-Time Data
Soufiane Sejjari
 
2025 Laurence Sigler - Advancing Decision Support. Content Management Ecommer...
Francisco Javier Mora Serrano
 
LEAP-1B presedntation xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
hatem173148
 
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
EVS+PRESENTATIONS EVS+PRESENTATIONS like
saiyedaqib429
 
MT Chapter 1.pptx- Magnetic particle testing
ABCAnyBodyCanRelax
 
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
Civil Engineering Practices_BY Sh.JP Mishra 23.09.pptx
bineetmishra1990
 
Packaging Tips for Stainless Steel Tubes and Pipes
heavymetalsandtubes
 
CAD-CAM U-1 Combined Notes_57761226_2025_04_22_14_40.pdf
shailendrapratap2002
 
The Effect of Artifact Removal from EEG Signals on the Detection of Epileptic...
Partho Prosad
 
1. SYSTEMS, ROLES, AND DEVELOPMENT METHODOLOGIES.ppt
zilow058
 
AI-Driven IoT-Enabled UAV Inspection Framework for Predictive Maintenance and...
ijcncjournal019
 

OPTIMISED STACKED ENSEMBLE TECHNIQUES IN THE PREDICTION OF CERVICAL CANCER USING SMOTE AND RFERF.docx

  • 1. VENKAT PROJECTS Email:[email protected] Mobile No: +91 9966499110 Website: www.venkatjavaprojects.com What‘s app: +91 9966499110 OPTIMISED STACKED ENSEMBLE TECHNIQUES IN THE PREDICTION OF CERVICAL CANCER USING SMOTE AND RFERF ABSTRACT : Cervical cancer is frequently a deadly disease, common in females. However, early diagnosis of cervical cancer can reduce the mortality rate and other associated complications. Cervical cancer risk factors can aid the early diagnosis. For better diagnosis accuracy, we proposed a study for early diagnosis of cervical cancer using reduced risk feature set and three ensemble-based classification techniques, i.e., extreme Gradient Boosting (XGBoost), AdaBoost, and Random Forest (RF) along with Firefly algorithm for optimization. Synthetic Minority Oversampling Technique (SMOTE) data sampling technique was used to alleviate the data imbalance problem. Cervical cancer Risk Factors data set, containing 32 risks factor and four targets (Hinselmann, Schiller, Cytology, and Biopsy), is used in the study. The four targets are the widely used diagnosis test for cervical cancer. The effectiveness of the proposed study is evaluated in terms of accuracy, sensitivity, specificity, positive predictive accuracy (PPA), and negative predictive accuracy (NPA). Moreover, Firefly features selection technique was used to achieve better results with the reduced number of features. Experimental results reveal the significance of the proposed model and achieved the highest outcome for Hinselmann test when compared with other three diagnostic tests. Furthermore, the reduction in the number of features has enhanced the outcomes. Additionally, the performance of the proposed models is noticeable in terms of accuracy when compared with other benchmark studies for cervical cancer diagnosis using reduced risk factors data set.
  • 2. VENKAT PROJECTS Email:[email protected] Mobile No: +91 9966499110 Website: www.venkatjavaprojects.com What‘s app: +91 9966499110 EXICITING SYSTEM : Dataset Description: cervical cancer risk factors data set used in the study was collected at “Hospital Universitario de Caracas” in Caracas, Venezuela and is available on the UCI Machine Learning repository . It consists of 858 records, with some missing values, as several patients did not answer some of the questions due to privacy concerns. the data set contains 32 risk factors and 4 targets, i.e., the diagnosis tests used for cervical cancer. It contains different categories of feature set such as habits, demographic information, history, and Genomic medical records. Features such as age, Dx: Cancer, Dx: CIN, Dx: HPV, and Dx features contains no missing values. Dx: CIN is a change in the walls of cervix and is commonly due to HPV infection; sometimes, it may lead to cancer if it is not treated properly. However, Dx: cancer variable is represented if the patient has other types of cancer or not. Sometimes, a patient may have more tha the cervical cancer risk factors data set used in the study was collected at “Hospital Universitario de Caracas” in Caracas, Venezuela and is available on the UCI Machine Learning repository . It consists of 858 records, with some missing values, as several patients did not answer some of the questions due to privacy concerns. the data set contains 32 risk factors and 4 targets, i.e., the diagnosis tests used for cervical cancer. It contains different categories of feature set such as habits, demographic information, history, and Genomic medical records. Features such as age, Dx: Cancer, Dx: CIN, Dx: HPV, and Dx features contains no missing values. Dx: CIN is a change in the walls of cervix and is commonly due to HPV infection; sometimes, it may lead to cancer if it is not treated properly. 
However, the Dx: Cancer variable indicates whether or not the patient has other types of cancer, and a patient may have more than one type of cancer. In the data set, some patients do not have cervical cancer but still have Dx: Cancer set to true; therefore, it is not used as a target variable. Table 1 presents a brief description of each feature with its type. Cervical cancer diagnosis usually requires several tests, and this data set contains the widely used diagnosis tests as the targets: Hinselmann, Schiller, Cytology, and Biopsy. Hinselmann, or colposcopy, is a test that examines the inside of the vagina and cervix using a tool that magnifies the tissues to detect any anomalies. Schiller is a test in which iodine is applied to the cervix, where it stains healthy cells a brown
color and leaves the abnormal cells uncolored. Cytology is a test that examines cells from the uterine cervix for cancerous cells or other diseases, and Biopsy refers to a test in which a small piece of cervical tissue is examined under a microscope; most biopsy tests can provide a definitive diagnosis.

Dataset Preprocessing: The data set suffers from a large number of missing values; 24 of the 32 features contained missing values. Initially, the features with a very high percentage of missing values were removed: STDs: Time since first diagnosis and STDs: Time since last diagnosis were removed since they each have 787 missing values (see Table 2), more than half of the data. Data imputation was performed for the features with fewer missing values, using the most frequent value (mode) technique to fill the remaining gaps. Additionally, the data set suffers from a severe class imbalance: out of the 858 records, the positive labels number 35 for Hinselmann, 74 for Schiller, 44 for Cytology, and 55 for Biopsy, as shown in Figure 1. SMOTE was used to deal with this class imbalance. SMOTE oversamples the minority class by generating new synthetic instances for minority points based on their nearest neighbors, measured by the Euclidean distance between data points. Figure 1 shows the number of records per class label in the data set.

Firefly Feature Selection: Dimensionality reduction is one of the effective ways to select the features that improve the performance of a supervised learning model. In this study, we adopted the nature-inspired Firefly algorithm for selecting the features that best formulate the problem. Firefly was proposed by Yang and was initially designed for optimization.
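The two preprocessing steps above (mode imputation and SMOTE) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names and parameters are ours, and a real pipeline would typically use imbalanced-learn's SMOTE and scikit-learn's SimpleImputer instead.

```python
import numpy as np

def mode_impute(col):
    """Replace NaNs in a 1-D float array with the most frequent non-NaN value."""
    vals = col[~np.isnan(col)]
    values, counts = np.unique(vals, return_counts=True)
    mode = values[np.argmax(counts)]
    out = col.copy()
    out[np.isnan(out)] = mode
    return out

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    each minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours per point
    synth = []
    for _ in range(n_new):
        i = rng.integers(n)                     # pick a random minority point
        j = nn[i, rng.integers(k)]              # pick one of its neighbours
        lam = rng.random()                      # interpolation factor in [0, 1)
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth)
```

Because each synthetic point lies on the segment between a real minority point and one of its neighbours, SMOTE stays inside the minority region rather than simply duplicating records.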
The metaheuristic Firefly algorithm is inspired by the flashing behaviour of fireflies. It is a population-based optimization algorithm that searches for the optimal value or parameters of a target function. In this technique, each fly is attracted by the glow intensity of nearby flies; if the intensity of the glow is very low at some point, the attraction declines. Firefly uses three rules: (a) all flies are of the same gender; (b) attractiveness depends on the intensity of the glow; (c) the target function generates the glow of the firefly. Flies with a weaker glow move towards flies with a brighter glow, and the brightness is adjusted using the objective function. The same idea is applied in the algorithm to search for the optimal features that best fit the training model. Firefly is computationally economical and has produced better outcomes in feature selection when compared with other
metaheuristic techniques such as genetic algorithms and particle swarm optimization. The time complexity of Firefly is O(n²t). It uses the light intensity to select features: highly relevant features are represented as features with high-intensity light. For feature selection, some fireflies are generated initially, and each fly randomly assigns weights to all features. In our study, we generated 50 flies (n = 50). The dimension of the data set is 30. Furthermore, the lower bound was set to −50 and the upper bound to 50, and the maximum number of generations was 500. Additionally, α (alpha) was initially set to 0.5 and was updated in every subsequent iteration, while gamma (γ) was set to 1. The number of features selected using Firefly was 15 for Hinselmann, 13 for Schiller, 11 for Cytology, and 11 for Biopsy, respectively.

Ensemble-Based Classification Methods: Ensemble-based classification techniques such as Random Forest, Extreme Gradient Boosting, and AdaBoost were used to train the model. These techniques are described in the sections below.
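Before turning to the classifiers, the Firefly feature-selection loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: the positive-weight selection rule, the β = β₀·exp(−γr²) attractiveness, and the α decay factor are our choices, not the paper's exact update rules (the paper's α-update formula is not given in the text).

```python
import numpy as np

def firefly_select(fitness, dim, n_flies=10, max_gen=50,
                   alpha=0.5, beta0=1.0, gamma=1.0, seed=0):
    """Simplified firefly search over real-valued feature weights.
    A feature counts as 'selected' when its weight is positive.
    fitness(mask) -> higher is better; mask is a boolean array of length dim."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, size=(n_flies, dim))   # each fly = feature weights
    score = np.array([fitness(p > 0) for p in pos], dtype=float)
    for _ in range(max_gen):
        for i in range(n_flies):
            for j in range(n_flies):
                if score[j] > score[i]:
                    # move dimmer fly i toward brighter fly j;
                    # attractiveness decays with squared distance
                    r2 = np.sum((pos[i] - pos[j]) ** 2)
                    beta = beta0 * np.exp(-gamma * r2)
                    pos[i] += (beta * (pos[j] - pos[i])
                               + alpha * (rng.random(dim) - 0.5))
                    score[i] = fitness(pos[i] > 0)
        alpha *= 0.97   # decay of randomness: an assumption, not the paper's rule
    best = pos[np.argmax(score)]
    return best > 0     # boolean mask of selected features
```

In the study's setting, `fitness` would wrap a classifier evaluated on the candidate feature subset; here any mask-scoring function will do.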
Random Forest: Random Forest (RF) was first proposed by Breiman in 2001. Random Forest is an ensemble model that uses decision trees as the individual models and bagging as the ensemble method. It improves on a single decision tree by combining many trees, which reduces overfitting. RF can be used for both classification and regression. RF builds a forest of decision trees, obtains a prediction from each one, and then selects the solution with the maximum votes. When training a tree, it is important to measure how much each feature decreases the impurity, as the decrease in impurity indicates the significance of the feature. The tree's classification result depends on the impurity measure used. For classification, the impurity measures are either Gini impurity or information gain; for regression, the measure of impurity is
variance. Training a decision tree consists of iteratively splitting the data. Gini impurity decides the best split of the data using the formula

Gini = 1 − Σᵢ p(i)²,

where p(i) is the probability of selecting a data point with class i. Information gain (IG) is another measure for deciding the best split of the data, based on the gain of each feature. It is computed from the entropy

E = − Σᵢ p(i) log₂ p(i)

as the entropy of the parent node minus the weighted average entropy of the child nodes produced by the split.

Extreme Gradient Boosting:
eXtreme Gradient Boosting (XGBoost) is a tree-based ensemble technique. XGBoost can be used for classification, regression, and ranking problems, and is a type of gradient boosting. Gradient Boosting (GB) is a boosting ensemble technique that builds predictors sequentially instead of independently. GB produces a strong classifier by combining weak classifiers; the goal of GB is to build an iterative model that optimizes a loss function, pinpointing the failings of the weak learners through the gradients of that loss. The loss function measures how well the model fits the underlying data and depends on the optimization goal: for regression it measures the error between the true and predicted values, whereas for classification it measures how well the model classifies cases correctly. This technique needs less time and fewer iterations, since each predictor learns from the past mistakes of the previous predictors. GB trains a model F to predict values ŷ = F(x) by minimizing a loss function such as the mean squared error

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²,

where ŷᵢ are the estimated values of F(x), yᵢ the true values, n the number of instances, and i iterates over the training set of size n. Considering a GB model with M stages, at each stage m (1 ≤ m ≤ M) a new estimator hₘ(x) is added to improve the deficient model Fₘ. The estimator hₘ is fitted to y − Fₘ(x), the difference between the true and predicted values, i.e., the residual; thus, we attempt to correct the errors of the previous model Fₘ. XGBoost is better than AdaBoost in terms of speed and performance. It is highly scalable and runs up to 10 times faster than traditional single-machine learning algorithms. XGBoost handles
sparse data and implements several optimization and regularization techniques. Moreover, it also uses the concepts of parallel and distributed computing.

DISADVANTAGES OF EXISTING SYSTEM:
1) Less accuracy
2) Low efficiency

PROPOSED SYSTEM:

The model was implemented in Python 3.8.0 using the Jupyter Notebook environment. The Scikit-learn library was used for the classifiers along with other needed built-in tools, while a separate library (xgboost 1.2.0) was used for the XGBoost ensemble. K-fold cross validation with K = 10 was used for partitioning the data into training and testing sets. Five evaluation measures were used: accuracy, sensitivity (recall), specificity, positive predictive accuracy (PPA), and negative predictive accuracy (NPA). Sensitivity and specificity received particular attention due to the clinical application of the proposed model. Accuracy denotes the percentage of correctly classified cases, sensitivity measures the percentage of positive cases that were classified as positive, and specificity refers to the percentage of negative cases that were classified as negative. Moreover, the criteria for selecting the performance measures depend on the measures used in the benchmark studies. Two sets of experiments were conducted for each target: one using the features selected by the Firefly feature-selection algorithm and one using all 30 features. The SMOTE technique was applied to generate synthetic data. The results of the model are presented in the sections below.
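All five measures above follow directly from the confusion-matrix counts. A minimal sketch (note that PPA and NPA as used here correspond to what is usually called positive/negative predictive value):

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute the study's five evaluation measures from confusion-matrix
    counts; all values are returned as percentages."""
    total = tp + fp + tn + fn
    return {
        "accuracy":    100 * (tp + tn) / total,  # correctly classified cases
        "sensitivity": 100 * tp / (tp + fn),     # recall: positives detected
        "specificity": 100 * tn / (tn + fp),     # negatives detected
        "PPA":         100 * tp / (tp + fp),     # positive predictive accuracy
        "NPA":         100 * tn / (tn + fn),     # negative predictive accuracy
    }

# Illustrative counts (not from the paper's experiments):
print(diagnostic_metrics(tp=9, fp=1, tn=89, fn=1))
```

In a 10-fold cross-validation run, these counts would be accumulated across folds before computing the percentages.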
Hinselmann: Table 6 presents the accuracy, sensitivity, specificity, PPA, and NPA for the RF, AdaBoost, and XGBoost models, respectively, using SMOTE for the Hinselmann test target class. The number of selected features for Hinselmann was 15. XGBoost outperformed the other classifiers for both feature sets; moreover, the performance of XGBoost with the selected features is better than with all 30 features. The model achieves an accuracy of 98.83%, sensitivity of 97.5%, specificity of 99.2%, PPA of 99.17%, and NPA of 97.63%, respectively.
Schiller: Table 7 presents the outcomes for the Schiller test. As with the Hinselmann target, XGBoost with the selected features outperformed the other classifiers. However, the outcomes achieved by the model for Schiller are lower than for the Hinselmann target class. The performance of RF and XGBoost with the selected features is similar for Schiller, with only a minor difference. The number of features selected by Firefly for Schiller was 13.

ADVANTAGES OF PROPOSED SYSTEM:
1) High accuracy
2) High efficiency
SYSTEM ARCHITECTURE:
HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENTS:
 System : i3 processor or above
 RAM : 4 GB
 Hard disk : 40 GB

SOFTWARE REQUIREMENTS:
 Operating system : Windows
 Coding language : Python