VENKAT PROJECTS
Email:venkatjavaprojects@gmail.com Mobile No: +91 9966499110
Website: www.venkatjavaprojects.com What‘s app: +91 9966499110
OPTIMISED STACKED ENSEMBLE TECHNIQUES IN THE
PREDICTION OF CERVICAL CANCER USING SMOTE AND RFERF
ABSTRACT :
Cervical cancer is a frequently fatal disease common among females. However, early diagnosis of
cervical cancer can reduce mortality and other associated complications. Cervical cancer risk
factors can aid early diagnosis. To improve diagnostic accuracy, we propose a study for early
diagnosis of cervical cancer using a reduced risk-factor feature set and three ensemble-based
classification techniques, i.e., eXtreme Gradient Boosting (XGBoost), AdaBoost, and Random
Forest (RF), together with the Firefly algorithm for optimization. The Synthetic Minority
Oversampling Technique (SMOTE) was used to alleviate the data imbalance problem. The
Cervical Cancer Risk Factors data set, containing 32 risk factors and four targets (Hinselmann,
Schiller, Cytology, and Biopsy), is used in the study. The four targets are the widely used
diagnostic tests for cervical cancer. The effectiveness of the proposed study is evaluated in terms
of accuracy, sensitivity, specificity, positive predictive accuracy (PPA), and negative predictive
accuracy (NPA). Moreover, the Firefly feature selection technique was used to achieve better
results with a reduced number of features. Experimental results reveal the significance of the
proposed model, which achieved the highest outcome for the Hinselmann test when compared
with the other three diagnostic tests. Furthermore, reducing the number of features enhanced the
outcomes. Additionally, the proposed models show noticeable accuracy when compared with
other benchmark studies for cervical cancer diagnosis using the reduced risk factors data set.
EXISTING SYSTEM :
Dataset Description:
The cervical cancer risk factors data set used in the study was collected at “Hospital Universitario
de Caracas” in Caracas, Venezuela, and is available in the UCI Machine Learning Repository. It
consists of 858 records with some missing values, as several patients did not answer some of the
questions due to privacy concerns. The data set contains 32 risk factors and 4 targets, i.e., the
diagnostic tests used for cervical cancer. It contains different categories of features, such as habits,
demographic information, history, and genomic medical records. Features such as Age, Dx:
Cancer, Dx: CIN, Dx: HPV, and Dx contain no missing values. Dx: CIN denotes a change in the
walls of the cervix, commonly due to HPV infection; if not treated properly, it may sometimes
lead to cancer. The Dx: Cancer variable, in contrast, indicates whether the patient has other types
of cancer, and a patient may have more than one type of cancer. In the data set, some patients do
not have cervical cancer but still have a Dx: Cancer value of true; therefore, it is not used as a
target variable.
Table 1 presents a brief description of each feature with its type. Cervical cancer diagnosis
usually requires several tests; this data set contains the widely used diagnostic tests as the targets.
Hinselmann, Schiller, Cytology, and Biopsy are four widely used diagnostic tests for cervical
cancer. Hinselmann, or colposcopy, is a test that examines the inside of the vagina and cervix
using a tool that magnifies the tissues to detect any anomalies. Schiller is a test in which iodine is
applied to the cervix, where it stains healthy cells brown and leaves abnormal cells uncolored.
Cytology is a test that examines body cells from the uterine cervix for cancerous cells or other
diseases, and Biopsy refers to the test in which a small part of cervical tissue is examined under a
microscope. Most biopsy tests can provide a definitive diagnosis.
Dataset Preprocessing :
The data set suffers from a large number of missing values: 24 of the 32 features contain missing
values. Initially, the features with a huge percentage of missing values were removed. STDs:
Time since first diagnosis and STDs: Time since last diagnosis were removed since each has 787
missing values (see Table 2), which is more than half of the data. Data imputation was performed
for the features with fewer missing values, using the most-frequent-value technique to impute the
remaining missing values. Additionally, the data set suffers from a severe class imbalance: the
positive target labels number only 35 for Hinselmann, 74 for Schiller, 44 for Cytology, and 55
for Biopsy out of the 858 records, as shown in Figure 1. SMOTE was used to deal with this class
imbalance. SMOTE oversamples the minority class by generating new synthetic instances from
minority samples and their nearest neighbors, measured by the Euclidean distance between data
points. Figure 1 shows the number of records per class label in the data set.
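The interpolation step of SMOTE described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation (function name and parameters are ours for the sketch), not the imbalanced-learn library typically used in practice:

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples (minimal SMOTE sketch).

    For each synthetic point: pick a random minority sample, pick one of its
    k nearest minority neighbours (Euclidean distance), and interpolate at a
    random position on the line segment between the two points.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances among minority samples only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbours
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                # random base minority sample
        j = nn[i, rng.integers(k)]         # one of its k nearest neighbours
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

To balance a target such as Hinselmann (35 positives out of 858), one would call `smote` with `n_new` equal to the difference between the class counts and append the result to the minority class.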
Firefly Feature Selection:
Dimensionality reduction is one of the effective ways to select features that improve the
performance of a supervised learning model. In this study, we adopted the nature-inspired Firefly
algorithm to select the features that best formulate the problem. Firefly was proposed by Yang,
initially for optimization. The metaheuristic Firefly algorithm is inspired by the flashing behavior
of fireflies. It is a population-based optimization algorithm for finding the optimal value or
parameters of a target function. In this technique, each fly is attracted by the glow intensity of
nearby flies; if the intensity of the glow is extremely low at some point, the attraction declines.
Firefly uses three rules: (a) all flies are of the same gender; (b) attractiveness depends on the
intensity of the glow; (c) the target function generates the glow of each firefly. Flies with a
dimmer glow move towards flies with a brighter glow, and the brightness is adjusted using the
objective function. The same idea is applied here to search for the optimal features that best fit
the training model. Firefly is more computationally economical and produced better outcomes in
feature selection when compared with other metaheuristic techniques such as genetic algorithms
and particle swarm optimization. The time complexity of Firefly is O(n²t). It uses light intensity
to select features: highly relevant features are represented as features with high-intensity light.
For feature selection, some fireflies are generated initially, and each fly randomly assigns
weights to all features. In our study, we generated 50 flies (n = 50). The dimension of the data set
is 30. Furthermore, the lower bound was set to −50, the upper bound to 50, and the maximum
number of generations to 500. Additionally, α (alpha) was initially set to 0.5 and was updated in
every subsequent iteration, while gamma (γ) was set to 1. The number of features selected using
Firefly was 15 for Hinselmann, 13 for Schiller, 11 for Cytology, and 11 for Biopsy.
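The attraction-and-glow mechanics described above can be illustrated with a toy firefly optimiser on a generic objective. This is a generic sketch, not the study's feature-selection setup; the geometric decay of α is an assumed schedule, since the exact update rule is not given here, and all names are ours:

```python
import numpy as np

def firefly_minimise(obj, dim, n_flies=25, n_iter=200,
                     lb=-50.0, ub=50.0, alpha=0.5, beta0=1.0, gamma=1.0, seed=0):
    """Toy firefly optimiser: brightness = -obj(x), so a lower objective is brighter."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, (n_flies, dim))     # initial population of flies
    f = np.array([obj(x) for x in X])
    for _ in range(n_iter):
        for i in range(n_flies):
            for j in range(n_flies):
                if f[j] < f[i]:                 # fly j glows brighter: i moves toward j
                    r2 = float(np.sum((X[i] - X[j]) ** 2))
                    beta = beta0 * np.exp(-gamma * r2)   # attractiveness decays with distance
                    X[i] = np.clip(X[i] + beta * (X[j] - X[i])
                                   + alpha * (rng.random(dim) - 0.5), lb, ub)
                    f[i] = obj(X[i])
        alpha *= 0.97  # assumed schedule: shrink the random-walk term each generation
    best = int(np.argmin(f))
    return X[best], float(f[best])
```

For feature selection, `obj` would score a candidate feature-weight vector (e.g., by classifier accuracy on the features whose weights pass a threshold); here any numeric objective works. Note that the brightest fly never moves, so the best objective value found is non-increasing.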
Ensemble-Based Classification Methods :
Ensemble-based classification techniques, namely Random Forest, eXtreme Gradient Boosting,
and AdaBoost, were used to train the models. These techniques are described in the sections
below.
Random Forest :
Random Forest (RF) was first proposed by Breiman in 2001. Random Forest is an ensemble
model that uses decision trees as individual models and bagging as the ensemble method. It
improves on a single decision tree by combining many trees, which reduces the decision tree's
tendency to overfit. RF can be used for both classification and regression. RF builds a forest of
decision trees, obtains a prediction from each of them, and then selects the answer with the
maximum votes.
When training a tree, it is important to measure how much each feature decreases the impurity,
as the decrease in impurity indicates the significance of the feature. The tree's classification
result depends on the impurity measure used. For classification, the impurity measures are either
Gini impurity or information gain; for regression, the impurity measure is variance. Training a
decision tree consists of iteratively splitting the data. Gini impurity decides the best split of the
data using the formula
Gini = 1 − Σᵢ p(i)²,
where p(i) is the probability of selecting a data point of class i. Information gain (IG) is another
measure to decide the best split of the data, depending on the gain of each feature. It is computed
from the entropy H = −Σᵢ p(i) log₂ p(i) as the reduction in entropy from the parent node to the
weighted average of the child nodes:
IG = H(parent) − Σ_c (n_c / n) H(child c),
where n_c is the number of samples in child c and n the number of samples in the parent.
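Both impurity measures follow directly from the class probabilities; a small sketch (helper names are illustrative):

```python
import numpy as np

def gini(labels):
    """Gini impurity: G = 1 - sum_i p(i)^2 over the class probabilities p(i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy: H = -sum_i p(i) * log2(p(i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, left, right):
    """IG = H(parent) minus the size-weighted entropies of the two children."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)
```

For a balanced binary node such as `[0, 0, 1, 1]`, Gini impurity is 0.5 and entropy is 1 bit; a perfect split into `[0, 0]` and `[1, 1]` removes all uncertainty, so its information gain equals the parent entropy.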
Extreme Gradient Boosting :
eXtreme Gradient Boosting (XGBoost) is a tree-based ensemble technique. XGBoost can be
used for classification, regression, and ranking problems. XGBoost is a type of gradient
boosting. Gradient Boosting (GB) is a boosting ensemble technique that builds predictors
sequentially instead of independently. GB produces a strong classifier by combining weak
classifiers. The goal of GB is to build an iterative model that optimizes a loss function; it
pinpoints the failings of the weak learners by using gradients of the loss function. The loss
function measures how well the model fits the underlying data and depends on the optimization
goal: for regression it is a measure of the error between the true and predicted values, whereas
for classification it measures how well the model classifies cases correctly. This technique takes
less time and fewer iterations, since each predictor learns from the past mistakes of the previous
predictors.
GB works by training a model C to predict values of the form ŷ = C(x) by minimizing a loss
function, e.g., the mean squared error
MSE = (1/n) Σᵢ (ŷᵢ − yᵢ)²,
where i iterates over a training set of size n, ŷᵢ are the estimated values of C(x), and yᵢ are the
true values of the target variable. Consider a GB model with M stages, with m a single stage
(1 ≤ m ≤ M). To improve a deficient model Fₘ, a new estimator hₘ(x) is added:
Fₘ₊₁(x) = Fₘ(x) + hₘ(x).
The estimator hₘ is fitted to y − Fₘ(x), which is the difference between the true value and the
predicted value, i.e., the residual. Thus, we attempt to correct the errors of the previous
model Fₘ.
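The residual-fitting recursion above can be demonstrated with a from-scratch gradient-boosting sketch that uses one-feature regression stumps as weak learners. This illustrates the idea only, not the XGBoost implementation; all names are ours:

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold regression stump on 1-D input, minimising squared error."""
    order = np.argsort(x)
    xs, rs = x[order], residual[order]
    best = None
    for t in (xs[:-1] + xs[1:]) / 2.0:          # candidate thresholds between sorted points
        left, right = rs[xs <= t], rs[xs > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda z: np.where(z <= t, lv, rv)   # piecewise-constant weak learner

def gradient_boost(x, y, M=50, lr=0.1):
    """F_0 = mean(y); F_{m+1}(x) = F_m(x) + lr * h_m(x), with h_m fit to y - F_m(x)."""
    F = np.full(len(y), y.mean(), dtype=float)
    stumps = []
    for _ in range(M):
        h = fit_stump(x, y - F)                 # each stage fits the current residual
        F = F + lr * h(x)
        stumps.append(h)
    base = y.mean()
    return lambda z: base + lr * sum(h(z) for h in stumps)
```

Because each stage fits the residual of the previous ensemble, the training error shrinks geometrically for simple targets; the learning rate `lr` damps each correction.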
XGBoost improves on AdaBoost in terms of speed and performance. It is highly scalable and
runs about ten times faster than traditional single-machine learning algorithms. XGBoost handles
sparse data and implements several optimization and regularization techniques. Moreover, it also
uses the concepts of parallel and distributed computing.
DISADVANTAGES OF EXISTING SYSTEM :
1) Less accuracy
2) Low efficiency
PROPOSED SYSTEM:
The model was implemented in Python 3.8.0 using the Jupyter Notebook environment. The
scikit-learn library was used for the classifiers along with other needed built-in tools, while a
separate library (xgboost 1.2.0) was used for the XGBoost ensemble. K-fold cross-validation
with K = 10 was used to partition the data into training and testing sets. Five evaluation
measures were used: accuracy, sensitivity (recall), specificity, positive predictive accuracy
(PPA), and negative predictive accuracy (NPA). Sensitivity and specificity receive more focus in
the study because of the clinical application of the proposed model. Accuracy denotes the
percentage of correctly classified cases, sensitivity measures the percentage of positive cases that
were classified as positive, and specificity refers to the percentage of negative cases that were
classified as negative. Moreover, the criteria for selecting the performance evaluation measures
depend on the measures used in the benchmark studies. Two sets of experiments were conducted
for each target: one using the features selected by the Firefly feature selection algorithm and one
using all 30 features. The SMOTE technique was applied to generate synthetic data. The results
of the models are presented in the sections below.
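The five evaluation measures follow directly from the confusion-matrix counts; a minimal helper (name and signature are illustrative):

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity, PPA and NPA from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy":    (tp + tn) / total,   # fraction of correctly classified cases
        "sensitivity": tp / (tp + fn),      # positives classified as positive (recall)
        "specificity": tn / (tn + fp),      # negatives classified as negative
        "PPA":         tp / (tp + fp),      # positive predictive accuracy (precision)
        "NPA":         tn / (tn + fn),      # negative predictive accuracy
    }
```

For example, a fold with 40 true positives, 10 false negatives, 5 false positives, and 45 true negatives yields accuracy 0.85, sensitivity 0.80, and specificity 0.90.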
Hinselmann :
Table 6 presents the accuracy, sensitivity, specificity, PPA, and NPA for the RF, AdaBoost, and
XGBoost models, respectively, using SMOTE for the Hinselmann target class. The number of
selected features for Hinselmann was 15. XGBoost outperformed the other classifiers for both
feature sets; moreover, its performance with the selected features is better than with all 30
features. The model produces an accuracy of 98.83%, sensitivity of 97.5%, specificity of 99.2%,
PPA of 99.17%, and NPA of 97.63%.
Schiller :
Table 7 presents the outcomes for the Schiller test. As with the Hinselmann target, XGBoost
with selected features outperformed the other models for Schiller. However, the outcomes
achieved for Schiller are lower than those for the Hinselmann target class. The performance of
RF and XGBoost with selected features is similar for Schiller, with only a minor difference. The
number of features selected by Firefly for Schiller was 13.
ADVANTAGES OF PROPOSED SYSTEM :
1) High accuracy
2) High efficiency
VENKAT PROJECTS
Email:venkatjavaprojects@gmail.com Mobile No: +91 9966499110
Website: www.venkatjavaprojects.com What‘s app: +91 9966499110
SYSTEM ARCHITECTURE :
HARDWARE & SOFTWARE REQUIREMENTS:
HARDWARE REQUIREMENTS :
• System : i3 processor or above
• RAM : 4 GB
• Hard disk : 40 GB
SOFTWARE REQUIREMENTS :
• Operating system : Windows
• Coding language : Python
VENKAT PROJECTS
Email:venkatjavaprojects@gmail.com Mobile No: +91 9966499110
Website: www.venkatjavaprojects.com What‘s app: +91 9966499110

More Related Content

PDF
An approach of cervical cancer diagnosis using class weighting and oversampli...
TELKOMNIKA JOURNAL
 
PDF
Classification of Breast Cancer Tissues using Decision Tree Algorithms
Lovely Professional University
 
PDF
Enhancing breast cancer diagnosis: a comparative analysis of feature selectio...
IAESIJAI
 
PDF
IRJET - Survey on Analysis of Breast Cancer Prediction
IRJET Journal
 
PPTX
Cancer detection using data mining
RishabhKumar283
 
PDF
A machine learning based framework for breast cancer prediction using biomar...
IAESIJAI
 
PDF
Logistic Regression Model for Predicting the Malignancy of Breast Cancer
IRJET Journal
 
PDF
Diagnosis of Cancer using Fuzzy Rough Set Theory
IRJET Journal
 
An approach of cervical cancer diagnosis using class weighting and oversampli...
TELKOMNIKA JOURNAL
 
Classification of Breast Cancer Tissues using Decision Tree Algorithms
Lovely Professional University
 
Enhancing breast cancer diagnosis: a comparative analysis of feature selectio...
IAESIJAI
 
IRJET - Survey on Analysis of Breast Cancer Prediction
IRJET Journal
 
Cancer detection using data mining
RishabhKumar283
 
A machine learning based framework for breast cancer prediction using biomar...
IAESIJAI
 
Logistic Regression Model for Predicting the Malignancy of Breast Cancer
IRJET Journal
 
Diagnosis of Cancer using Fuzzy Rough Set Theory
IRJET Journal
 

Similar to OPTIMISED STACKED ENSEMBLE TECHNIQUES IN THE PREDICTION OF CERVICAL CANCER USING SMOTE AND RFERF.docx (20)

PDF
My own Machine Learning project - Breast Cancer Prediction
Gabriele Mineo
 
PDF
The effect of features combination on coloscopy images of cervical cancer usi...
IAESIJAI
 
PPS
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
Sunil Nair
 
PDF
ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTIO...
cscpconf
 
PDF
Breast Cancer Prediction using Machine Learning
IRJET Journal
 
PPTX
Machine Learning - Breast Cancer Diagnosis
Pramod Sharma
 
PDF
V5I3_IJERTV5IS031157
ahmad abdelhafeez
 
PDF
Cervical cancer diagnosis based on cytology pap smear image classification us...
TELKOMNIKA JOURNAL
 
PDF
Intelligent cervical cancer detection: empowering healthcare with machine lea...
IAESIJAI
 
PDF
Performance enhancement of machine learning algorithm for breast cancer diagn...
IJECEIAES
 
DOCX
A Pioneering Cervical Cancer Prediction Prototype in Medical Data Mining usin...
IIRindia
 
PPTX
DataMining Techniques in BreastCancer.pptx
MaligireddyTanujaRed1
 
PDF
IRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET Journal
 
PDF
Niakšu, Olegas ; Kurasova, Olga ; Gedminaitė, Jurgita „Duomenų tyryba BRCA1 g...
Lietuvos kompiuterininkų sąjunga
 
PDF
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
mlaij
 
PDF
BREAST TUMOR DETECTION USING EFFICIENT MACHINE LEARNING AND DEEP LEARNING TEC...
mlaij
 
PDF
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
mlaij
 
PDF
Cervical Cancer Detection: An Enhanced Approach through Transfer Learning and...
IRJET Journal
 
PDF
fnano-04-972421.pdf
EverestTechnomania
 
PDF
Predictive modeling for breast cancer based on machine learning algorithms an...
IJECEIAES
 
My own Machine Learning project - Breast Cancer Prediction
Gabriele Mineo
 
The effect of features combination on coloscopy images of cervical cancer usi...
IAESIJAI
 
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
Sunil Nair
 
ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTIO...
cscpconf
 
Breast Cancer Prediction using Machine Learning
IRJET Journal
 
Machine Learning - Breast Cancer Diagnosis
Pramod Sharma
 
V5I3_IJERTV5IS031157
ahmad abdelhafeez
 
Cervical cancer diagnosis based on cytology pap smear image classification us...
TELKOMNIKA JOURNAL
 
Intelligent cervical cancer detection: empowering healthcare with machine lea...
IAESIJAI
 
Performance enhancement of machine learning algorithm for breast cancer diagn...
IJECEIAES
 
A Pioneering Cervical Cancer Prediction Prototype in Medical Data Mining usin...
IIRindia
 
DataMining Techniques in BreastCancer.pptx
MaligireddyTanujaRed1
 
IRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET Journal
 
Niakšu, Olegas ; Kurasova, Olga ; Gedminaitė, Jurgita „Duomenų tyryba BRCA1 g...
Lietuvos kompiuterininkų sąjunga
 
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
mlaij
 
BREAST TUMOR DETECTION USING EFFICIENT MACHINE LEARNING AND DEEP LEARNING TEC...
mlaij
 
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
mlaij
 
Cervical Cancer Detection: An Enhanced Approach through Transfer Learning and...
IRJET Journal
 
fnano-04-972421.pdf
EverestTechnomania
 
Predictive modeling for breast cancer based on machine learning algorithms an...
IJECEIAES
 
Ad

More from Venkat Projects (20)

DOCX
1.AUTOMATIC DETECTION OF DIABETIC RETINOPATHY USING CNN.docx
Venkat Projects
 
DOCX
12.BLOCKCHAIN BASED MILK DELIVERY PLATFORM FOR STALLHOLDER DAIRY FARMERS IN K...
Venkat Projects
 
DOCX
10.ATTENDANCE CAPTURE SYSTEM USING FACE RECOGNITION.docx
Venkat Projects
 
DOCX
9.IMPLEMENTATION OF BLOCKCHAIN IN FINANCIAL SECTOR TO IMPROVE SCALABILITY.docx
Venkat Projects
 
DOCX
8.Geo Tracking Of Waste And Triggering Alerts And Mapping Areas With High Was...
Venkat Projects
 
DOCX
Image Forgery Detection Based on Fusion of Lightweight Deep Learning Models.docx
Venkat Projects
 
DOCX
6.A FOREST FIRE IDENTIFICATION METHOD FOR UNMANNED AERIAL VEHICLE MONITORING ...
Venkat Projects
 
DOCX
WATERMARKING IMAGES
Venkat Projects
 
DOCX
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...
Venkat Projects
 
DOCX
Application and evaluation of a K-Medoidsbased shape clustering method for an...
Venkat Projects
 
DOCX
1.AUTOMATIC DETECTION OF DIABETIC RETINOPATHY USING CNN.docx
Venkat Projects
 
DOCX
2022 PYTHON MAJOR PROJECTS LIST.docx
Venkat Projects
 
DOCX
2022 PYTHON PROJECTS LIST.docx
Venkat Projects
 
DOCX
2021 PYTHON PROJECTS LIST.docx
Venkat Projects
 
DOCX
2021 python projects list
Venkat Projects
 
DOCX
10.sentiment analysis of customer product reviews using machine learni
Venkat Projects
 
DOCX
9.data analysis for understanding the impact of covid–19 vaccinations on the ...
Venkat Projects
 
DOCX
6.iris recognition using machine learning technique
Venkat Projects
 
DOCX
5.local community detection algorithm based on minimal cluster
Venkat Projects
 
DOCX
4.detection of fake news through implementation of data science application
Venkat Projects
 
1.AUTOMATIC DETECTION OF DIABETIC RETINOPATHY USING CNN.docx
Venkat Projects
 
12.BLOCKCHAIN BASED MILK DELIVERY PLATFORM FOR STALLHOLDER DAIRY FARMERS IN K...
Venkat Projects
 
10.ATTENDANCE CAPTURE SYSTEM USING FACE RECOGNITION.docx
Venkat Projects
 
9.IMPLEMENTATION OF BLOCKCHAIN IN FINANCIAL SECTOR TO IMPROVE SCALABILITY.docx
Venkat Projects
 
8.Geo Tracking Of Waste And Triggering Alerts And Mapping Areas With High Was...
Venkat Projects
 
Image Forgery Detection Based on Fusion of Lightweight Deep Learning Models.docx
Venkat Projects
 
6.A FOREST FIRE IDENTIFICATION METHOD FOR UNMANNED AERIAL VEHICLE MONITORING ...
Venkat Projects
 
WATERMARKING IMAGES
Venkat Projects
 
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...
Venkat Projects
 
Application and evaluation of a K-Medoidsbased shape clustering method for an...
Venkat Projects
 
1.AUTOMATIC DETECTION OF DIABETIC RETINOPATHY USING CNN.docx
Venkat Projects
 
2022 PYTHON MAJOR PROJECTS LIST.docx
Venkat Projects
 
2022 PYTHON PROJECTS LIST.docx
Venkat Projects
 
2021 PYTHON PROJECTS LIST.docx
Venkat Projects
 
2021 python projects list
Venkat Projects
 
10.sentiment analysis of customer product reviews using machine learni
Venkat Projects
 
9.data analysis for understanding the impact of covid–19 vaccinations on the ...
Venkat Projects
 
6.iris recognition using machine learning technique
Venkat Projects
 
5.local community detection algorithm based on minimal cluster
Venkat Projects
 
4.detection of fake news through implementation of data science application
Venkat Projects
 
Ad

Recently uploaded (20)

PPTX
Victory Precisions_Supplier Profile.pptx
victoryprecisions199
 
PDF
Cryptography and Information :Security Fundamentals
Dr. Madhuri Jawale
 
PDF
All chapters of Strength of materials.ppt
girmabiniyam1234
 
PDF
top-5-use-cases-for-splunk-security-analytics.pdf
yaghutialireza
 
PDF
Biodegradable Plastics: Innovations and Market Potential (www.kiu.ac.ug)
publication11
 
PDF
2010_Book_EnvironmentalBioengineering (1).pdf
EmilianoRodriguezTll
 
PDF
FLEX-LNG-Company-Presentation-Nov-2017.pdf
jbloggzs
 
PDF
Advanced LangChain & RAG: Building a Financial AI Assistant with Real-Time Data
Soufiane Sejjari
 
PDF
2025 Laurence Sigler - Advancing Decision Support. Content Management Ecommer...
Francisco Javier Mora Serrano
 
PDF
LEAP-1B presedntation xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
hatem173148
 
PDF
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
PDF
EVS+PRESENTATIONS EVS+PRESENTATIONS like
saiyedaqib429
 
PPTX
MT Chapter 1.pptx- Magnetic particle testing
ABCAnyBodyCanRelax
 
PDF
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
PPTX
Civil Engineering Practices_BY Sh.JP Mishra 23.09.pptx
bineetmishra1990
 
PDF
Packaging Tips for Stainless Steel Tubes and Pipes
heavymetalsandtubes
 
PDF
CAD-CAM U-1 Combined Notes_57761226_2025_04_22_14_40.pdf
shailendrapratap2002
 
PDF
The Effect of Artifact Removal from EEG Signals on the Detection of Epileptic...
Partho Prosad
 
PPT
1. SYSTEMS, ROLES, AND DEVELOPMENT METHODOLOGIES.ppt
zilow058
 
PDF
AI-Driven IoT-Enabled UAV Inspection Framework for Predictive Maintenance and...
ijcncjournal019
 
Victory Precisions_Supplier Profile.pptx
victoryprecisions199
 
Cryptography and Information :Security Fundamentals
Dr. Madhuri Jawale
 
All chapters of Strength of materials.ppt
girmabiniyam1234
 
top-5-use-cases-for-splunk-security-analytics.pdf
yaghutialireza
 
Biodegradable Plastics: Innovations and Market Potential (www.kiu.ac.ug)
publication11
 
2010_Book_EnvironmentalBioengineering (1).pdf
EmilianoRodriguezTll
 
FLEX-LNG-Company-Presentation-Nov-2017.pdf
jbloggzs
 
Advanced LangChain & RAG: Building a Financial AI Assistant with Real-Time Data
Soufiane Sejjari
 
2025 Laurence Sigler - Advancing Decision Support. Content Management Ecommer...
Francisco Javier Mora Serrano
 
LEAP-1B presedntation xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
hatem173148
 
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
EVS+PRESENTATIONS EVS+PRESENTATIONS like
saiyedaqib429
 
MT Chapter 1.pptx- Magnetic particle testing
ABCAnyBodyCanRelax
 
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
Civil Engineering Practices_BY Sh.JP Mishra 23.09.pptx
bineetmishra1990
 
Packaging Tips for Stainless Steel Tubes and Pipes
heavymetalsandtubes
 
CAD-CAM U-1 Combined Notes_57761226_2025_04_22_14_40.pdf
shailendrapratap2002
 
The Effect of Artifact Removal from EEG Signals on the Detection of Epileptic...
Partho Prosad
 
1. SYSTEMS, ROLES, AND DEVELOPMENT METHODOLOGIES.ppt
zilow058
 
AI-Driven IoT-Enabled UAV Inspection Framework for Predictive Maintenance and...
ijcncjournal019
 

OPTIMISED STACKED ENSEMBLE TECHNIQUES IN THE PREDICTION OF CERVICAL CANCER USING SMOTE AND RFERF.docx

  • 1. VENKAT PROJECTS Email:[email protected] Mobile No: +91 9966499110 Website: www.venkatjavaprojects.com What‘s app: +91 9966499110 OPTIMISED STACKED ENSEMBLE TECHNIQUES IN THE PREDICTION OF CERVICAL CANCER USING SMOTE AND RFERF ABSTRACT : Cervical cancer is frequently a deadly disease, common in females. However, early diagnosis of cervical cancer can reduce the mortality rate and other associated complications. Cervical cancer risk factors can aid the early diagnosis. For better diagnosis accuracy, we proposed a study for early diagnosis of cervical cancer using reduced risk feature set and three ensemble-based classification techniques, i.e., extreme Gradient Boosting (XGBoost), AdaBoost, and Random Forest (RF) along with Firefly algorithm for optimization. Synthetic Minority Oversampling Technique (SMOTE) data sampling technique was used to alleviate the data imbalance problem. Cervical cancer Risk Factors data set, containing 32 risks factor and four targets (Hinselmann, Schiller, Cytology, and Biopsy), is used in the study. The four targets are the widely used diagnosis test for cervical cancer. The effectiveness of the proposed study is evaluated in terms of accuracy, sensitivity, specificity, positive predictive accuracy (PPA), and negative predictive accuracy (NPA). Moreover, Firefly features selection technique was used to achieve better results with the reduced number of features. Experimental results reveal the significance of the proposed model and achieved the highest outcome for Hinselmann test when compared with other three diagnostic tests. Furthermore, the reduction in the number of features has enhanced the outcomes. Additionally, the performance of the proposed models is noticeable in terms of accuracy when compared with other benchmark studies for cervical cancer diagnosis using reduced risk factors data set.
  • 2. VENKAT PROJECTS Email:[email protected] Mobile No: +91 9966499110 Website: www.venkatjavaprojects.com What‘s app: +91 9966499110 EXICITING SYSTEM : Dataset Description: cervical cancer risk factors data set used in the study was collected at “Hospital Universitario de Caracas” in Caracas, Venezuela and is available on the UCI Machine Learning repository . It consists of 858 records, with some missing values, as several patients did not answer some of the questions due to privacy concerns. the data set contains 32 risk factors and 4 targets, i.e., the diagnosis tests used for cervical cancer. It contains different categories of feature set such as habits, demographic information, history, and Genomic medical records. Features such as age, Dx: Cancer, Dx: CIN, Dx: HPV, and Dx features contains no missing values. Dx: CIN is a change in the walls of cervix and is commonly due to HPV infection; sometimes, it may lead to cancer if it is not treated properly. However, Dx: cancer variable is represented if the patient has other types of cancer or not. Sometimes, a patient may have more tha the cervical cancer risk factors data set used in the study was collected at “Hospital Universitario de Caracas” in Caracas, Venezuela and is available on the UCI Machine Learning repository . It consists of 858 records, with some missing values, as several patients did not answer some of the questions due to privacy concerns. the data set contains 32 risk factors and 4 targets, i.e., the diagnosis tests used for cervical cancer. It contains different categories of feature set such as habits, demographic information, history, and Genomic medical records. Features such as age, Dx: Cancer, Dx: CIN, Dx: HPV, and Dx features contains no missing values. Dx: CIN is a change in the walls of cervix and is commonly due to HPV infection; sometimes, it may lead to cancer if it is not treated properly. 
However, the Dx: Cancer variable indicates whether or not the patient has other types of cancer, and a patient may have more than one type of cancer. In the data set, some patients do not have cervical cancer but still have Dx: Cancer set to true; therefore, it is not used as a target variable. Table 1 presents a brief description of each feature with its type. Cervical cancer diagnosis usually requires several tests, and this data set contains the widely used diagnosis tests as the targets: Hinselmann, Schiller, Cytology, and Biopsy. Hinselmann, or colposcopy, is a test that examines the inside of the vagina and cervix using a tool that magnifies the tissues to detect any anomalies. Schiller is a test in which iodine is applied to the cervix, where it stains healthy cells a brown
color and leaves the abnormal cells uncolored. Cytology is a test that examines cells from the uterine cervix for cancerous cells or other diseases, and Biopsy refers to a test in which a small piece of cervical tissue is examined under a microscope; most biopsy tests can provide a definitive diagnosis.

Dataset Preprocessing: The data set suffers from a large number of missing values; 24 of the 32 features contained missing values. Initially, the features with a very high percentage of missing values were removed: STDs: Time since first diagnosis and STDs: Time since last diagnosis were removed since they each have 787 missing values (see Table 2), more than half of the data. Data imputation was performed for the features with fewer missing values, using the most frequent value (mode) technique to fill the remaining gaps. Additionally, the data set suffers from a severe class imbalance: out of the 858 records, the positive labels number 35 for Hinselmann, 74 for Schiller, 44 for Cytology, and 55 for Biopsy, as shown in Figure 1. SMOTE was used to deal with this class imbalance. SMOTE oversamples the minority class by generating new synthetic instances for minority points based on their nearest neighbors, measured by the Euclidean distance between data points. Figure 1 shows the number of records per class label in the data set.

Firefly Feature Selection: Dimensionality reduction is one of the effective ways to select the features that improve the performance of a supervised learning model. In this study, we adopted the nature-inspired Firefly algorithm for selecting the features that best formulate the problem. Firefly was proposed by Yang and was initially designed for optimization.
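The two preprocessing steps above (mode imputation and SMOTE) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names and parameters are ours, and a real pipeline would typically use imbalanced-learn's SMOTE and scikit-learn's SimpleImputer instead.

```python
import numpy as np

def mode_impute(col):
    """Replace NaNs in a 1-D float array with the most frequent non-NaN value."""
    vals = col[~np.isnan(col)]
    values, counts = np.unique(vals, return_counts=True)
    mode = values[np.argmax(counts)]
    out = col.copy()
    out[np.isnan(out)] = mode
    return out

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    each minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours per point
    synth = []
    for _ in range(n_new):
        i = rng.integers(n)                     # pick a random minority point
        j = nn[i, rng.integers(k)]              # pick one of its neighbours
        lam = rng.random()                      # interpolation factor in [0, 1)
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth)
```

Because each synthetic point lies on the segment between a real minority point and one of its neighbours, SMOTE stays inside the minority region rather than simply duplicating records.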
The metaheuristic Firefly algorithm is inspired by the flashing behaviour of fireflies. It is a population-based optimization algorithm that searches for the optimal value or parameters of a target function. In this technique, each fly is attracted by the glow intensity of nearby flies; if the intensity of the glow is very low at some point, the attraction declines. Firefly uses three rules: (a) all flies are of the same gender; (b) attractiveness depends on the intensity of the glow; (c) the target function generates the glow of the firefly. Flies with a weaker glow move towards flies with a brighter glow, and the brightness is adjusted using the objective function. The same idea is applied in the algorithm to search for the optimal features that best fit the training model. Firefly is computationally economical and has produced better outcomes in feature selection when compared with other
metaheuristic techniques such as genetic algorithms and particle swarm optimization. The time complexity of Firefly is O(n²t). It uses the light intensity to select features: highly relevant features are represented as features with high-intensity light. For feature selection, some fireflies are generated initially, and each fly randomly assigns weights to all features. In our study, we generated 50 flies (n = 50). The dimension of the data set is 30. Furthermore, the lower bound was set to −50 and the upper bound to 50, and the maximum number of generations was 500. Additionally, α (alpha) was initially set to 0.5 and was updated in every subsequent iteration, while gamma (γ) was set to 1. The number of features selected using Firefly was 15 for Hinselmann, 13 for Schiller, 11 for Cytology, and 11 for Biopsy, respectively.

Ensemble-Based Classification Methods: Ensemble-based classification techniques such as Random Forest, Extreme Gradient Boosting, and AdaBoost were used to train the model. These techniques are described in the sections below.
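Before turning to the classifiers, the Firefly feature-selection loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: the positive-weight selection rule, the β = β₀·exp(−γr²) attractiveness, and the α decay factor are our choices, not the paper's exact update rules (the paper's α-update formula is not given in the text).

```python
import numpy as np

def firefly_select(fitness, dim, n_flies=10, max_gen=50,
                   alpha=0.5, beta0=1.0, gamma=1.0, seed=0):
    """Simplified firefly search over real-valued feature weights.
    A feature counts as 'selected' when its weight is positive.
    fitness(mask) -> higher is better; mask is a boolean array of length dim."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, size=(n_flies, dim))   # each fly = feature weights
    score = np.array([fitness(p > 0) for p in pos], dtype=float)
    for _ in range(max_gen):
        for i in range(n_flies):
            for j in range(n_flies):
                if score[j] > score[i]:
                    # move dimmer fly i toward brighter fly j;
                    # attractiveness decays with squared distance
                    r2 = np.sum((pos[i] - pos[j]) ** 2)
                    beta = beta0 * np.exp(-gamma * r2)
                    pos[i] += (beta * (pos[j] - pos[i])
                               + alpha * (rng.random(dim) - 0.5))
                    score[i] = fitness(pos[i] > 0)
        alpha *= 0.97   # decay of randomness: an assumption, not the paper's rule
    best = pos[np.argmax(score)]
    return best > 0     # boolean mask of selected features
```

In the study's setting, `fitness` would wrap a classifier evaluated on the candidate feature subset; here any mask-scoring function will do.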
Random Forest: Random Forest (RF) was first proposed by Breiman in 2001. Random Forest is an ensemble model that uses decision trees as the individual models and bagging as the ensemble method. It improves on a single decision tree by combining many trees, which reduces overfitting. RF can be used for both classification and regression. RF builds a forest of decision trees, obtains a prediction from each one, and then selects the solution with the maximum votes. When training a tree, it is important to measure how much each feature decreases the impurity, as the decrease in impurity indicates the significance of the feature. The tree's classification result depends on the impurity measure used. For classification, the impurity measures are either Gini impurity or information gain; for regression, the measure of impurity is
variance. Training a decision tree consists of iteratively splitting the data. Gini impurity decides the best split of the data using the formula

Gini = 1 − Σᵢ p(i)²,

where p(i) is the probability of selecting a data point with class i. Information gain (IG) is another measure for deciding the best split of the data, based on the gain of each feature. It is computed from the entropy

E = − Σᵢ p(i) log₂ p(i)

as the entropy of the parent node minus the weighted average entropy of the child nodes produced by the split.

Extreme Gradient Boosting:
eXtreme Gradient Boosting (XGBoost) is a tree-based ensemble technique. XGBoost can be used for classification, regression, and ranking problems, and is a type of gradient boosting. Gradient Boosting (GB) is a boosting ensemble technique that builds predictors sequentially instead of independently. GB produces a strong classifier by combining weak classifiers; the goal of GB is to build an iterative model that optimizes a loss function, pinpointing the failings of the weak learners through the gradients of that loss. The loss function measures how well the model fits the underlying data and depends on the optimization goal: for regression it measures the error between the true and predicted values, whereas for classification it measures how well the model classifies cases correctly. This technique needs less time and fewer iterations, since each predictor learns from the past mistakes of the previous predictors. GB trains a model F to predict values ŷ = F(x) by minimizing a loss function such as the mean squared error

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²,

where ŷᵢ are the estimated values of F(x), yᵢ the true values, n the number of instances, and i iterates over the training set of size n. Considering a GB model with M stages, at each stage m (1 ≤ m ≤ M) a new estimator hₘ(x) is added to improve the deficient model Fₘ. The estimator hₘ is fitted to y − Fₘ(x), the difference between the true and predicted values, i.e., the residual; thus, we attempt to correct the errors of the previous model Fₘ. XGBoost is better than AdaBoost in terms of speed and performance. It is highly scalable and runs up to 10 times faster than traditional single-machine learning algorithms. XGBoost handles
sparse data and implements several optimization and regularization techniques. Moreover, it also uses the concepts of parallel and distributed computing.

DISADVANTAGES OF EXISTING SYSTEM:
1) Less accuracy
2) Low efficiency

PROPOSED SYSTEM:

The model was implemented in Python 3.8.0 using the Jupyter Notebook environment. The Scikit-learn library was used for the classifiers along with other needed built-in tools, while a separate library (xgboost 1.2.0) was used for the XGBoost ensemble. K-fold cross validation with K = 10 was used for partitioning the data into training and testing sets. Five evaluation measures were used: accuracy, sensitivity (recall), specificity, positive predictive accuracy (PPA), and negative predictive accuracy (NPA). Sensitivity and specificity received particular attention due to the clinical application of the proposed model. Accuracy denotes the percentage of correctly classified cases, sensitivity measures the percentage of positive cases that were classified as positive, and specificity refers to the percentage of negative cases that were classified as negative. Moreover, the criteria for selecting the performance measures depend on the measures used in the benchmark studies. Two sets of experiments were conducted for each target: one using the features selected by the Firefly feature-selection algorithm and one using all 30 features. The SMOTE technique was applied to generate synthetic data. The results of the model are presented in the sections below.
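All five measures above follow directly from the confusion-matrix counts. A minimal sketch (note that PPA and NPA as used here correspond to what is usually called positive/negative predictive value):

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute the study's five evaluation measures from confusion-matrix
    counts; all values are returned as percentages."""
    total = tp + fp + tn + fn
    return {
        "accuracy":    100 * (tp + tn) / total,  # correctly classified cases
        "sensitivity": 100 * tp / (tp + fn),     # recall: positives detected
        "specificity": 100 * tn / (tn + fp),     # negatives detected
        "PPA":         100 * tp / (tp + fp),     # positive predictive accuracy
        "NPA":         100 * tn / (tn + fn),     # negative predictive accuracy
    }

# Illustrative counts (not from the paper's experiments):
print(diagnostic_metrics(tp=9, fp=1, tn=89, fn=1))
```

In a 10-fold cross-validation run, these counts would be accumulated across folds before computing the percentages.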
Hinselmann: Table 6 presents the accuracy, sensitivity, specificity, PPA, and NPA for the RF, AdaBoost, and XGBoost models, respectively, using SMOTE for the Hinselmann test target class. The number of selected features for Hinselmann was 15. XGBoost outperformed the other classifiers for both feature sets; moreover, the performance of XGBoost with the selected features is better than with all 30 features. The model achieves an accuracy of 98.83%, sensitivity of 97.5%, specificity of 99.2%, PPA of 99.17%, and NPA of 97.63%, respectively.
Schiller: Table 7 presents the outcomes for the Schiller test. As with the Hinselmann target, XGBoost with the selected features outperformed the other classifiers. However, the outcomes achieved by the model for Schiller are lower than for the Hinselmann target class. The performance of RF and XGBoost with the selected features is similar for Schiller, with only a minor difference. The number of features selected by Firefly for Schiller was 13.

ADVANTAGES OF PROPOSED SYSTEM:
1) High accuracy
2) High efficiency
SYSTEM ARCHITECTURE:
HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENTS:
 System : i3 processor or above
 RAM : 4 GB
 Hard disk : 40 GB

SOFTWARE REQUIREMENTS:
 Operating system : Windows
 Coding language : Python