IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 03 Issue: 06 | Jun-2014, Available @ http://www.ijret.org 454
A COMPARATIVE ANALYSIS OF CLASSIFICATION TECHNIQUES ON
MEDICAL DATA SETS
Pooja Mittal1, Nasib Singh Gill2
1 Department of Computer Science & Application, Maharshi Dayanand University, Rohtak 124001, Haryana, India
2 Department of Computer Science & Application, Maharshi Dayanand University, Rohtak 124001, Haryana, India
Abstract
Classification is a preliminary stage of data mining that categorizes a dataset into smaller groups, each containing similar data items. It depends on two main parameters: the number of classes and the criteria for deciding class membership. Many recognition algorithms also use classification as an initial stage to improve efficiency and accuracy, so the accuracy of a classification algorithm determines how useful it is in other mining applications. The present work analyzes the effectiveness of the most popular classification techniques. Five classification algorithms are compared in terms of accuracy, kappa statistic, execution time and mean absolute error on three datasets collected from the medical domain. The work has been implemented in the WEKA environment, and the results show that SVM is the most robust classification method and KNN is the least effective classifier for medical data sets.
Keywords: KNN, Neural Network, SVM, Decision Tree, Classification.
----------------------------------------------------------------------***--------------------------------------------------------------------
1. INTRODUCTION
Data mining on medical data is a challenging area. It is the process of deriving meaningful and important information hidden in large, comprehensive data. Classification is an integral component of clinical data mining, as it organizes bulky data into meaningful groups and assists effective mining. Classification [1] is an initial stage of data mining that divides the data into subgroups according to the features of the data and the dataset. These features can be application based or data based. A variety of classification algorithms is available, such as KNN, decision trees, Bayesian networks and SVM. Initially, the dataset is divided into two parts, called training data and testing data. The training data is the input from which the classification algorithm learns in the first stage. The learning process can be supervised or unsupervised. Once learning on the training data is complete, the classes are identified. The testing data consists of new entities whose class is unknown. Each testing instance is analyzed under different parameters and compared with the features of the available classes, and the best-matching class is taken as its class [2]. Classification algorithms such as decision tree, neural network, support vector machine and KNN [2,3] each form a generalization model to perform the classification. Some of these methods are statistical, and some are soft, rule-based methods that work on weighted values. These methods are described in detail in section 3.
2. LITERATURE SURVEY
Early detection and intervention play a significant role in controlling the clinical deterioration of ICU patients [11]. An integrated data mining approach was designed to give early warnings and alarms by synthesizing a large set of features such as DFA, time series and entropy, and by applying linear and non-linear classification, forward feature selection and exploratory undersampling. An improved hybrid prediction model was proposed by Ilango & Ramaraj [12], who implemented F-score feature selection to deduce an optimal feature set from high dimensional databases. They achieved a predictive accuracy of 98.9427% on a diabetes data set. Data mining techniques are important for deriving concise information from large, high dimensional clinical data [13]. Real data sets from various domains were analyzed to identify the performance of various techniques and to pick out the best performer. Quantum C4.5 and Random tree produced 100% accuracy on one data set and 91.36% on another. The accuracy of any technique depends on the algorithm and on the nature and behavior of the data set. The medical coding problem can be viewed as a multi-label classification problem in which a patient's information is translated into standard pre-defined codes. A multi-label large margin classifier is capable of learning the code structure using prior knowledge [14]. Medical data mining primarily focuses on hidden pattern extraction. An amalgam model was developed for classifying medical data sets; a combination of multistep pre-processing, k-means and k-nearest neighbor improves the performance of the process [15]. Missing values (MVs) in a medical data set significantly affect the pattern extraction process. Imputation is the most popular and common approach for solving the MV problem. A study on a heart failure data set concluded that no universal imputation tool is available that outperforms the others when applied to diverse data sets [16]. Temporal patterns can be mined by applying fuzzy
neural networks. Lower approximations can be derived by implementing hypothesis and fuzzy decision tables [17]. A novel approach was proposed for classification based on bijective soft sets. Bijective soft set theory analyzes the data and identifies the dependencies among data values. It also discovers redundant information and identifies the classes over the data values, which makes it a valuable asset for inductive learning. Precision, recall and F-measure were used to compare the performance of bijective soft set classification with decision table and naïve Bayes [18]. Neural networks applied to a data set of heart disease patients resulted in a sensitivity of 81.1%, a specificity of 78.7% and an accuracy of 80.2%. Decision trees yielded sensitivity, specificity and accuracy of 81.7%, 76.0% and 79.3%, whereas logistic regression achieved 81.2%, 73.1% and 71.1% respectively [19]. Various medical data sets from diverse domains were analyzed by applying SSI, KD, RB and L measures; SSI produced the maximum classification accuracy on different data sets [20],[16]. Maintenance of patient data sets is equally important, as it may assist in acquiring knowledge and identifying problems [21],[12]. Heart disease diagnosis is one of the prime applications of data mining in the medical field, and many researchers have contributed to the development of intelligent systems. My Chau Tu and Dongil Shin applied decision tree C4.5, bagging with C4.5 and bagging with naïve Bayes and analyzed their effectiveness and correction rate, using 10-fold cross validation for evaluation. Of these techniques, bagging with naïve Bayes performed best, with 82.5% accuracy [22]. Pre-processing improves the performance of mining techniques by removing noisy data and substituting missing values [15][16], whereas k-means algorithms are used to eliminate incorrect classifications. The value of k depends on the nature and behavior of the data; the larger the value of k, the smaller the effect of noise. An accuracy of 97.4% was achieved by combining KNN and k-means. When k-fold cross validation is applied, the behavior of the classifiers is decided mainly by the nature of the data set [23]. Dempster-Shafer's theory of evidence combination was implemented on two varied medical domains, breast cancer and skin lesions, by applying k-nearest neighbor, naïve Bayes and decision tree classifiers. In this scenario, beliefs were computed from the classifier outputs. The belief m(X) is a measure of trust or confidence with $\sum_{X \subseteq \Omega} m(X) = 1$ and $m(\emptyset) = 0$, where $\emptyset$ is the empty set [23]. Classification can be improved by using the evidence combination approach. To enhance performance, probability theory is modified and the mathematical theory of evidence is devised for handling uncertainty [24]. Binary-coded genetic algorithms (BGA) and real-coded genetic algorithms (RGA) are used for assigning weights to features, so that a set of optimal features can be deduced from high dimensional data. Different k-NN algorithms (crisp k-NN, fuzzy k-NN and weighted fuzzy k-NN) were evaluated and compared on the same data set. An accuracy of 98.77% was achieved with the RGA based model, whereas the GA approach turned out to be time consuming [25]. Knowledge can be extracted by clustering the given data set into soft clusters, which can be fused by serial and parallel fusion to perform better as a classifier [26]. Hybrid approaches can be used for classifying voluminous data [27,26,18]. The Ant Colony Optimization (ACO) algorithm is an efficient approach to classification [27]; quality can be improved if ACO is combined with mRAR, a feature selection algorithm.
3. CLASSIFICATION APPROACHES
Classification is a form of data analysis [10] that extracts a model describing and distinguishing data classes, with the objective of predicting the class of an object whose class label is not known. The classification process can be broadly divided into two phases: a learning step (on training data) and a classification step (on testing data). Classification is used to predict categorical labels, that is, discrete and unordered values. The derived model can be represented in many forms, such as IF-THEN rules, decision trees, mathematical formulae and neural networks. Some of the popular techniques are described below.
3.1 Decision Tree
A decision tree is a hierarchical decision-making approach used to partition the dataset. It is a technique for approximating discrete-valued functions, and the learned tree can be read as a set of decision expressions.
Fig. 1 A simple classification or decision tree (figure not reproduced; node labels: Age with branches <40 and >=40; OverWeight with branches Yes and No; BP with branches High and Normal; leaves: Heartattack, Good Health)
According to this approach, the dataset is divided into n mutually exclusive subsets, each defined by a label. Each data point is assigned to its class by applying a sequence of tests. A decision tree is a supervised learning technique that defines a transparent, tree-based structure for acting on the available dataset. The tree consists of a number of nodes connected by edges. Each internal node defines a condition, and each outgoing edge corresponds to one outcome of that condition; following the edges from the root leads to a leaf, which defines a class. The decision at each node is taken according to whether the condition is true or false: if the condition is true, one branch is chosen, otherwise the other. Figure 1 shows the simple classification process defined by a decision tree [3,4].
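The node-and-edge traversal described above can be sketched as nested conditions. This is only an illustration: the branch layout and the outcome attached to each leaf are assumptions loosely based on the node labels of Fig. 1, and the function name `classify` and the dictionary keys are hypothetical.

```python
# Hypothetical tree loosely modelled on the node labels of Fig. 1.
# The exact branch layout is an assumption, not the authors' tree.
def classify(patient):
    """Walk the tree from the root; each test selects one outgoing edge."""
    if patient["age"] < 40:
        # Younger branch: a single test on weight.
        return "Heartattack" if patient["overweight"] else "Good Health"
    # Older branch: weight is tested first, then blood pressure.
    if patient["overweight"]:
        return "Heartattack"
    return "Heartattack" if patient["bp"] == "High" else "Good Health"


print(classify({"age": 35, "overweight": False, "bp": "Normal"}))
```

Each `if` plays the role of an internal node, and each `return` plays the role of a leaf carrying a class label.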
3.2 Bayesian Classification
A Bayesian network is a probabilistic graphical model that identifies the relations between variables and performs classification based on this dependency analysis. It is a directed acyclic graph based model in which the dependencies between attributes are expressed as conditional probabilities. Each attribute is first defined independently, and at the next level the relations between the attributes are identified. Each attribute is assigned a weight in the form of a probability, and as the relations are identified, the conditional probability is evaluated for each factor. The classifier learns from the training data by estimating the conditional probabilities of each attribute, and it treats the class label of each instance as a random variable. Once the conditional probability based acyclic graph is generated, the next task is to predict the class with the highest probability. The goal of the classification process is to predict the discrete class value for any testing data. The structure of the Bayesian network is defined over the feature vector, with one node per feature. The arcs in the graph represent the conditional dependencies between features [6].
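As a minimal sketch of classification by highest conditional probability, the code below uses the simplest Bayesian network structure, in which every attribute depends only on the class (the naive Bayes special case). This structure, the Laplace smoothing, the toy data and all function names are assumptions made for illustration; they are not the configuration used in the experiments.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts


def predict(model, row, n_values=2):
    """Return the class with the highest (unnormalized) posterior score."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Laplace-smoothed estimate of P(attribute value | class)
            score *= (value_counts[(label, i)][value] + 1) / (count + n_values)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data (hypothetical): attributes are (blood pressure, overweight).
rows = [("high", "yes"), ("high", "no"), ("normal", "no"), ("normal", "no")]
labels = ["sick", "sick", "healthy", "healthy"]
model = train(rows, labels)
print(predict(model, ("high", "yes")))
```

A full Bayesian network would also allow arcs between attributes; the scoring step would then multiply the conditional probability of each attribute given its parents in the graph.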
3.3 Neural Network
A neural network is an effective soft computing based classification algorithm that uses the concept of neurons to model, logically, the working of the human brain. In this classification process, the data values are represented by neurons and the connectivity by synapses. It is a layered approach with two end layers, the input layer and the output layer. Other than these two layers, the model also has m intermediate layers, called hidden layers. Weights are assigned on each layer. The graphical representation of a neural network is given in figure 2. The middle layer of the network applies weights to the different input values so that effective classification can be performed. The neural network accepts the dataset at the input layer and represents it as network nodes. The predictor weights are applied to these nodes in the hidden layer, which defines the degree of connectivity between the nodes. After the weights are adjusted, the output layer delivers the final result [6,7].
Fig. 2: A Three Layer Neural Network
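The flow from input layer through a weighted hidden layer to the output layer can be sketched as a forward pass. The sigmoid activation, the layer sizes and the weight values below are assumptions chosen for illustration; training (weight adjustment) is not shown.

```python
import math


def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output neuron.
output = forward(
    [0.5, 1.0],
    hidden_w=[[0.4, -0.6], [0.3, 0.8]], hidden_b=[0.0, 0.1],
    output_w=[[1.2, -0.7]], output_b=[0.05],
)
print(output)
```

The single output value can be thresholded (for example at 0.5) to obtain a class label for a two-class problem.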
3.4 Support Vector Machine
SVM is another robust and successful classification algorithm. SVM works as a linear separator between two sets of data points, identifying two different classes in a multidimensional space. The prime objective of this approach is to maximize the margin between the classes, that is, the distance between the separating hyperplane and the training points closest to it. SVM also copes with interactions between features and with redundant features. It splits the dataset into two sets of vectors in an n-dimensional vector space. The algorithm constructs a hyperplane so that each element can be compared against this linear boundary, and the hyperplane with the largest distance to the nearest points of both classes is chosen to separate the data. To reduce the error ratio, this largest-margin classifier is used. The points that lie on the margin are the support vectors, and the analysis is based on them [8].
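To make the hyperplane and margin concrete, the sketch below evaluates a linear decision function f(x) = w·x + b and the margin width 2/||w|| for a given weight vector. It assumes the hyperplane has already been found and is in canonical form (|f(x)| = 1 at the support vectors); the values of `w` and `b` are hypothetical, and no training is performed.

```python
import math


def decision(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Class +1 on one side of the hyperplane, -1 on the other."""
    return 1 if decision(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance between the two margin boundaries, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical separating hyperplane x1 + x2 - 3 = 0.
w, b = (1.0, 1.0), -3.0
print(predict(w, b, (2.0, 2.0)), predict(w, b, (1.0, 1.0)))
print(margin_width(w))
```

Training an SVM amounts to choosing `w` and `b` so that this margin width is as large as possible while the training points stay on the correct side.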
3.5 k-Nearest Neighbor
KNN is an instance based statistical approach to data classification, known as a lazy learning algorithm. It is one of the simplest classification algorithms. According to this approach, an object is classified by a majority vote of its neighboring points. Once the distance metric is chosen, the k nearest neighboring objects are identified, and the class that receives the most votes among them is assigned to the object. If k = 1, the method is simply called the nearest neighbor algorithm. K-NN requires
1. An integer k
2. A training data set
3. A metric to measure closeness
Fig. 3 Example of KNN classification
An example of KNN classification is shown in figure 3. Here the input data object is represented by a green circle, and the object classes are shown as blue and red objects. For k = 3, the neighbors drawn as triangles belong to one class and the remaining neighbors belong to the other. For k = 5, three neighbors are squares and the other two are triangles, all lying inside the outer circle [9,10].
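The three requirements listed above (an integer k, a training set and a closeness metric) map directly onto a short majority-vote routine. The coordinates below are hypothetical and were chosen only so that k = 3 and k = 5 give different answers, in the spirit of the example in figure 3; the Euclidean metric is likewise an assumption.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, training, k=3, metric=euclidean):
    """training is a list of (feature_vector, label) pairs."""
    # Keep the k training points closest to the query.
    neighbours = sorted(training, key=lambda item: metric(query, item[0]))[:k]
    # Majority vote over the neighbours' labels.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


training = [
    ((0.5, 0.0), "triangle"), ((0.0, 0.6), "triangle"),
    ((1.0, 0.0), "square"), ((0.0, 1.1), "square"), ((-1.2, 0.0), "square"),
]
print(knn_classify((0.0, 0.0), training, k=3))
print(knn_classify((0.0, 0.0), training, k=5))
```

With these points the three nearest neighbours contain two triangles, while the five nearest contain three squares, so the predicted class changes with k.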
4. EXPERIMENTAL RESULTS AND ANALYSIS
The performance analysis was carried out on five different algorithms for three different datasets from the medical domain. The classification algorithms used in this work are described in the previous section. The work has been implemented in the WEKA (Waikato Environment for Knowledge Analysis) environment [4,9], and the results have been evaluated under four parameters: accuracy, kappa statistic [4], mean absolute error and execution time. The results obtained from the different models are presented as tables and graphs.
4.1 Datasets
In this work, three medical datasets collected from the UCI Repository are considered. These datasets are sufficient to demonstrate the classification process and are analyzed under different classification parameters. They are available in ARFF format. A detailed description of the datasets, in terms of attributes and instances, is given in table 5.
Table 5: Medical Datasets
Sr. No.   Dataset Name         Instances   Attributes
1         heart-statlog.arff   270         14
2         diabetes.arff        768         9
3         hepatitis.arff       155         20
The medical datasets used for the classification are taken from a diverse range of medical areas to ensure the quality of the analysis. The datasets contain different types of data, including numbers, text and other domain data points. Each dataset is explored separately because of its distinct attributes and the discrete or continuous nature of its data. The datasets are analyzed for the classification task using the WEKA tool under different classification approaches, with 10-fold cross validation adopted for evaluation. WEKA is an open source data mining tool written in Java. It contains a number of built-in data mining algorithms, so that different mining operations can be performed directly, and it is used by researchers to analyze the effectiveness of different machine learning algorithms. In the present work, WEKA is used to perform the analytical study of classification algorithms on medical datasets. The parameters considered are described next.
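The 10-fold evaluation scheme can be sketched as follows: the instances are shuffled, split into ten folds, and each fold serves once as the test set while the other nine form the training set. This is a plain, unstratified sketch with a hypothetical function name and seed; WEKA's own implementation may differ in detail, for example in how it balances classes across folds.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# 270 instances, as in heart-statlog.arff (Table 5).
splits = list(k_fold_indices(270, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

A classifier is trained and evaluated once per split, and the reported figure is the result aggregated over all ten test folds.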
4.2 Accuracy Analysis
The accuracy of a classification algorithm is defined in terms of the number of correctly classified instances. It is given by

$$\text{Accuracy} = \frac{\text{CCount}}{\text{TCount}} \times 100$$

Here,
CCount is the number of correctly identified objects, and
TCount is the total number of objects.
The higher the accuracy, the more effective the algorithm.
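As a worked instance of the formula, the snippet below computes the accuracy for 227 correct classifications out of 270 instances. The count 227 is not stated in the paper; it is an assumption chosen because it reproduces the 84.0741% reported for SVM on heart-statlog, which has 270 instances.

```python
def accuracy(ccount, tcount):
    """Percentage of correctly classified instances."""
    return ccount / tcount * 100


# 227 is an assumed count; 270 is the size of heart-statlog.arff.
print(round(accuracy(227, 270), 4))  # 84.0741
```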
Table 1: Accuracy Analysis of Different Classification Algorithms
                     Classification Techniques
Dataset name         Bayesian Networks   Neural Networks   SVM       KNN       Decision Tree
heart-statlog.arff   83.7037             77.4074           84.0741   75.1852   76.6667
diabetes.arff        76.3021             75.1302           77.474    69.7917   74.2188
hepatitis.arff       83.2258             80                85.1613   80.6452   81.2903
Fig 4: Accuracy Analysis of Different Classification
Algorithms
Figure 4 shows the accuracy based comparison of the different algorithms. SVM is the most robust, effective and consistent classifier across the datasets: it provided the highest accuracy of all algorithms on each dataset. KNN is the least effective algorithm in terms of accuracy on heart-statlog and diabetes, while on hepatitis the neural network (80%) is marginally below KNN (80.6452%).
4.3 Based on Execution Time
Execution time is one of the foremost parameters for analyzing any algorithmic approach. In the present work, the execution time is analyzed to identify the most efficient classification algorithm. Table 2 shows the execution times obtained for the different algorithms.
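Execution time of this kind can be measured by reading a clock before and after the step of interest. The helper below is a generic sketch with a hypothetical name; it is not how WEKA reports its timings, and the workload passed to it here is only a stand-in for building a classifier.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Stand-in workload; in practice fn would train a classifier.
result, elapsed = timed(sum, range(100000))
print(f"elapsed: {elapsed:.6f} s")
```

Measured times depend on the machine and on other running processes, so repeated runs are advisable before comparing algorithms.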
Table 2: Execution Time Analysis of Different Classification Algorithms
                     Classification Techniques
Dataset name         Bayesian Networks   Neural Networks   SVM    KNN     Decision Tree
heart-statlog.arff   0.02                1.3               0.08   0.003   0.03
diabetes.arff        0.02                2.03              0.11   0.005   0.06
hepatitis.arff       0.04                1.28              0.09   0.004   0.03
Fig 5: Execution Time Analysis of Different Algorithms
Figure 5 shows the execution time analysis of the different classification algorithms. KNN is the most efficient classification algorithm, whereas the neural network has the longest execution time on every dataset.
4.4 Based on Kappa Statistic
The kappa statistic is a statistical measure of inter-rater agreement for qualitative data. Here it measures the agreement between the predicted and the actual classes, corrected for the agreement expected by chance. A higher value of the kappa statistic is considered better. Table 3 shows the comparative analysis of the different algorithms under the kappa statistic.
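For reference, Cohen's kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The sketch below derives both from a confusion matrix; the matrix values are hypothetical and are not taken from the experiments.

```python
def cohen_kappa(confusion):
    """confusion[i][j] = instances of actual class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    n = len(confusion)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: product of row and column marginals per class.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n)
    ) / (total * total)
    return (observed - expected) / (1 - expected)


# Hypothetical two-class confusion matrix with 100 instances.
kappa = cohen_kappa([[40, 10], [5, 45]])
print(round(kappa, 4))
```

A value of 1 indicates perfect agreement and a value of 0 indicates agreement no better than chance, which is why kappa can rank classifiers differently from raw accuracy.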
Table 3: Kappa Statistic Analysis of Different Algorithms
                     Classification Techniques
Dataset name         Bayesian Networks   Neural Networks   SVM      KNN      Decision Tree
heart-statlog.arff   0.6683              0.5444            0.6762   0.4988   0.5271
diabetes.arff        0.4664              0.4445            0.4708   0.3223   0.4246
hepatitis.arff       0.5107              0.3825            0.5309   0.3953   0.394
Fig 6: Kappa Statistic based Analysis of Different
Classification Algorithms
Figure 6 shows the kappa statistic based analysis. SVM provides the highest kappa statistic on every dataset, which shows its effectiveness in terms of robustness. KNN obtains the lowest value on heart-statlog and diabetes, which marks it as the least effective algorithm under this parameter on those datasets; on hepatitis, the neural network (0.3825) and the decision tree (0.394) are slightly lower than KNN (0.3953).
4.5 Based on Mean Absolute Error (MAE)
MAE measures the average magnitude of the difference between predicted and actual values, and so indicates the predictive capability of an algorithm. The lower the MAE, the higher the capability of the algorithm to perform the prediction. Table 4 shows the results obtained for the different algorithms under the mean absolute error parameter.
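A minimal sketch of the MAE calculation is given below. It assumes that each prediction is a probability for the positive class and that each actual value is 0 or 1; the numbers are hypothetical, and the way WEKA aggregates errors over class probabilities may differ in detail.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predictions and true values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical predicted probabilities and true 0/1 labels.
mae = mean_absolute_error([0.9, 0.2, 0.6, 0.4], [1, 0, 1, 0])
print(round(mae, 4))
```

Under this reading, a classifier that is correct but unconfident still accumulates error, so MAE can separate algorithms that have similar accuracy.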
Table 4: Mean Absolute Error based Analysis for Different Classification Algorithms
                     Classification Techniques
Dataset name         Bayesian Networks   Neural Networks   SVM      KNN      Decision Tree
heart-statlog.arff   0.1835              0.2328            0.1593   0.2502   0.274
diabetes.arff        0.2841              0.294             0.2253   0.3027   0.3134
hepatitis.arff       0.1754              0.1928            0.1484   0.1979   0.2073
Fig 7: Mean Absolute Error based Analysis for Different
Algorithms
Figure 7 shows the comparison of the different algorithms under the mean absolute error. The MAE of SVM is the lowest on every dataset, which shows the accurate prediction capability of the algorithm. The highest values in Table 4 belong to the decision tree, closely followed by KNN, which shows that these two algorithms are the least effective for prediction under this parameter.
5. CONCLUSIONS
In this paper, different classification algorithms have been analyzed under four parameters, namely execution time, mean absolute error, kappa statistic and accuracy, on three medical datasets. The results show that SVM is the most robust, consistent and reliable classification algorithm, whereas KNN, although the fastest to execute, is the least effective algorithm for classification.
REFERENCES
[1] Gupta, M., and Aggarwal, N. , “Performance Analysis
of Classification Techniques on XML Dataset”,
International Journal of Computer Science and
Technology IJCST Vol. 1, Issue 1, pp. 76-79, 2010.
[2] Justin, T., Gajsek, R., Struc, V., and Dobrisek, S.,
“Comparison of Different Classification Methods for
Emotion Recognition”, MIPRO 2010, Opatija, Croatia,
pp. 700-703, 2010.
[3] Gupta, S., Kumar, D., and Sharma, A., “Data Mining
Classification Techniques applied for Breast Cancer
Diagnosis and Prognosis”, Indian Journal of Computer
Science and Engineering (IJCSE) Vol. 2 No. 2, pp. 188-
195, 2011
[4] Viera, A. J., and Garrett, J. M., “Understanding
Interobserver Agreement: The Kappa Statistic”,
Research Series Vol. 37, No. 5 , pp. 360-363, 2005.
[5] Desai, A., and Rai, S., “Analysis of Machine Learning
Algorithms using WEKA”, International Conference &
Workshop on Recent Trends in Technology, (TCET)
2012 Proceedings published in International Journal of
Computer Applications (IJCA) 27, pp.27-32, 2012.
[6] Kumari, Milan, and Godara, Sunila, “Comparative Study of Data Mining Classification Methods in Cardiovascular Disease Prediction”, International Journal of Computer Science and Technology Vol. 2, Issue 2, pp. 304-308, 2011.
[7] Ture, M., Kurt, I., Kurum, A. T., and Ozdamar, K., “Comparing classification techniques for predicting essential hypertension”, Expert Systems with Applications 29, pp. 583–588, 2011.
[8] Burges, C., “A Tutorial on Support Vector Machines for Pattern Recognition”, Data Mining and Knowledge Discovery, Vol. 2, pp. 121-167, 1998.
[9] Pushpa, “Comparision of Clustering Techniques using WEKA”, M. Tech. Thesis, Guru Jambheshwar University of Science and Technology, Hisar, India, 2010.
[10] Han, J., and Kamber, M., “Data Mining: Concepts and Techniques”, 2nd edition, Morgan Kaufmann Publishers, 2006.
[11] Yi Mao, Wenlin Chen, Yixin Chen, Lu Chenyang, Kollef Marin & Thomas C., “An Integrated Data Mining Approach to Real Time Clinical Monitoring and Deterioration Warning”, KDD’12 ACM 978-1-4503-1462-6/12/08.
[12] Ilango B.S, Ramaraj N., “A Hybrid Prediction Model with F-score Feature selection for TypeII Diabetes Databases”, A2CWiC 2010, September 16-17, 2010.
[13] Jacob S.G., Ramani R.G, “Mining of Classification Patterns in Clinical Data through Data Mining Algorithms”, ICACCI’12, ACM 978-1-4503-1196-0/12/08.
[14] Yan Yan, Fung G., DY J.G, “Medical Coding Classification by Leveraging Inter-Code Relationships”, KDD’10 July 25-28 2010 Washington, USA.
[15] NirmalaDevi M, Balamurugan S, Swathi U V, “An amalgam KNN to predict Diabetes Mellitus”, 2013 IEEE International Conference on Emerging Trends in Computing, Communication and Nanotechnology.
[16] Davis D N, Zhang Y, Kambhampati C, Goode K, Cleland J.G.F, “A Comparative study of missing value imputation with multi class classification for clinical heart failure data”, 2012 IEEE, 9th International Conference on Fuzzy Systems and Knowledge Discovery.
[17] Sethukkarasi R., Keerthika U., Kannan A., “A Self Learning Rough Fuzzy Neural Network Classifier for Mining Temporal Patterns”, ICACCI’12, ACM 978-1-4503-1196-0/12/08.
[18] Kumar S U, Inbarani H, Senthil Kumar, “Bijective Soft Set Based Classification of Medical Data”, 2013 IEEE, International Conference on Pattern Recognition, Informatics and Mobile Engineering.
[19] Khemphila A, Boonjing V, “Comparing performances of logistic regression, decision trees and neural networks for classifying heart disease patients”, 2010 IEEE, International Conference on Computer Information Systems and Industrial Management Applications.
[20] Saastamoinen K, Ketola J, “Medical Data Classification using Logical Similarity based Measures”, 1-4244-0023-6/06 2006 IEEE.
[21] Mutalib S., Razak Abd. R., Nordin S., Rahman S.A., Mohamed A., “Intelligent classification in Medical Data”, 2012 IEEE International Conference on Biomedical Engineering and Sciences.
[22] Tu M C, Shin D, Shin Dong, “A Comparative Study of Medical Data Classification Methods Based on Decision Tree and Bagging Algorithms”, 2009 8th IEEE International Conference on Dependable, Autonomic and Secure Computing.
[23] Aslandogan Y.A, Mahajani G. A, “Evidence Combination in Data Mining”, 2004 IEEE Proceedings of the International Conference on Information Technology: Coding and Computing.
[24] Shafer G, “A mathematical theory of evidence”, Princeton University Press, 1976.
[25] Tang P.H, Tseng M.H, “Medical data mining using BGA and RGA for weighting of features in Fuzzy k-NN classification”, 2009 8th IEEE International Conference on Machine Learning and Cybernetics.
[26] Hassan S. Z, Verma B, “A Hybrid Data Mining Approach for Knowledge Extraction and Classification in Medical Databases”, 2007 IEEE 7th International Conference on Intelligent Systems Design and Applications.
[27] Michelakos I, Papageorgiou E, Vasilakopoulos M, “A hybrid classification algorithm evaluated on medical data”, 2010 IEEE Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises.

 
PDF
Effect of free surface wave on free vibration of a
eSAT Publishing House
 
PDF
Process design features of a 5 tonnesday multi – stage, intermittent drainage...
eSAT Publishing House
 
PDF
Power house automation using wireless communication
eSAT Publishing House
 
PDF
Design and characterization of various shapes of microcantilever for human im...
eSAT Publishing House
 
PDF
A challenge for security and service level agreement in cloud computinge
eSAT Publishing House
 
PDF
Android malware
eSAT Publishing House
 
PDF
Study of properties of banana fiber reinforced composites
eSAT Publishing House
 
Hybrid Technique for Associative Classification of Heart Diseases
Jagdeep Singh Malhi
 
Survey on data mining techniques in heart disease prediction
Sivagowry Shathesh
 
Support Vector Machines for Classification
Prakash Pimpale
 
Automated water head controller for domestic application
eSAT Publishing House
 
Hydrogen production from glycerol using microbial electrolysis cell
eSAT Publishing House
 
A review of pre combustion co2 capture in igcc
eSAT Publishing House
 
Power system stability improvement under three
eSAT Publishing House
 
Hazard object reporting to respective authorities
eSAT Publishing House
 
An iterative unsymmetrical trimmed midpoint median filter for removal of high...
eSAT Publishing House
 
Effect of spikes integrated to airfoil at supersonic
eSAT Publishing House
 
Effects of different mole spacings on the yield of summer groundnut
eSAT Publishing House
 
Elevating forensic investigation system for file clustering
eSAT Publishing House
 
Lab view study of electrical power distribution system
eSAT Publishing House
 
Effect of free surface wave on free vibration of a
eSAT Publishing House
 
Process design features of a 5 tonnesday multi – stage, intermittent drainage...
eSAT Publishing House
 
Power house automation using wireless communication
eSAT Publishing House
 
Design and characterization of various shapes of microcantilever for human im...
eSAT Publishing House
 
A challenge for security and service level agreement in cloud computinge
eSAT Publishing House
 
Android malware
eSAT Publishing House
 
Study of properties of banana fiber reinforced composites
eSAT Publishing House
 
Ad

Similar to A comparative analysis of classification techniques on medical data sets (20)

PDF
IRJET- Machine Learning Classification Algorithms for Predictive Analysis in ...
IRJET Journal
 
PDF
IRJET- Medical Data Mining
IRJET Journal
 
PDF
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AIRCC Publishing Corporation
 
PDF
IRJET - Survey on Analysis of Breast Cancer Prediction
IRJET Journal
 
PDF
Ijarcet vol-2-issue-4-1393-1397
Editor IJARCET
 
PDF
Ez36937941
IJERA Editor
 
PDF
Heart disease classification using optimized Machine learning algorithms.pdf
abdulkareemmerhej
 
PDF
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
AIRCC Publishing Corporation
 
PDF
Efficiency of Prediction Algorithms for Mining Biological Databases
IOSR Journals
 
PDF
HEALTH PREDICTION ANALYSIS USING DATA MINING
Ashish Salve
 
PDF
ICU Patient Deterioration Prediction : A Data-Mining Approach
csandit
 
PDF
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
cscpconf
 
PDF
Health Care Application using Machine Learning and Deep Learning
IRJET Journal
 
PDF
EVOLVING EFFICIENT CLUSTERING AND CLASSIFICATION PATTERNS IN LYMPHOGRAPHY DAT...
ijsc
 
PDF
Evolving Efficient Clustering and Classification Patterns in Lymphography Dat...
ijsc
 
PDF
IRJET- Disease Prediction using Machine Learning
IRJET Journal
 
PDF
Correlation of artificial neural network classification and nfrs attribute fi...
eSAT Journals
 
PDF
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
Editor IJMTER
 
PDF
Optimizing classification models for medical image diagnosis: a comparative a...
CSITiaesprime
 
PDF
Comparing Data Mining Techniques used for Heart Disease Prediction
IRJET Journal
 
IRJET- Machine Learning Classification Algorithms for Predictive Analysis in ...
IRJET Journal
 
IRJET- Medical Data Mining
IRJET Journal
 
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AIRCC Publishing Corporation
 
IRJET - Survey on Analysis of Breast Cancer Prediction
IRJET Journal
 
Ijarcet vol-2-issue-4-1393-1397
Editor IJARCET
 
Ez36937941
IJERA Editor
 
Heart disease classification using optimized Machine learning algorithms.pdf
abdulkareemmerhej
 
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
AIRCC Publishing Corporation
 
Efficiency of Prediction Algorithms for Mining Biological Databases
IOSR Journals
 
HEALTH PREDICTION ANALYSIS USING DATA MINING
Ashish Salve
 
ICU Patient Deterioration Prediction : A Data-Mining Approach
csandit
 
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
cscpconf
 
Health Care Application using Machine Learning and Deep Learning
IRJET Journal
 
EVOLVING EFFICIENT CLUSTERING AND CLASSIFICATION PATTERNS IN LYMPHOGRAPHY DAT...
ijsc
 
Evolving Efficient Clustering and Classification Patterns in Lymphography Dat...
ijsc
 
IRJET- Disease Prediction using Machine Learning
IRJET Journal
 
Correlation of artificial neural network classification and nfrs attribute fi...
eSAT Journals
 
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
Editor IJMTER
 
Optimizing classification models for medical image diagnosis: a comparative a...
CSITiaesprime
 
Comparing Data Mining Techniques used for Heart Disease Prediction
IRJET Journal
 
Ad

More from eSAT Publishing House (20)

PDF
Likely impacts of hudhud on the environment of visakhapatnam
eSAT Publishing House
 
PDF
Impact of flood disaster in a drought prone area – case study of alampur vill...
eSAT Publishing House
 
PDF
Hudhud cyclone – a severe disaster in visakhapatnam
eSAT Publishing House
 
PDF
Groundwater investigation using geophysical methods a case study of pydibhim...
eSAT Publishing House
 
PDF
Flood related disasters concerned to urban flooding in bangalore, india
eSAT Publishing House
 
PDF
Enhancing post disaster recovery by optimal infrastructure capacity building
eSAT Publishing House
 
PDF
Effect of lintel and lintel band on the global performance of reinforced conc...
eSAT Publishing House
 
PDF
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
eSAT Publishing House
 
PDF
Wind damage to buildings, infrastrucuture and landscape elements along the be...
eSAT Publishing House
 
PDF
Shear strength of rc deep beam panels – a review
eSAT Publishing House
 
PDF
Role of voluntary teams of professional engineers in dissater management – ex...
eSAT Publishing House
 
PDF
Risk analysis and environmental hazard management
eSAT Publishing House
 
PDF
Review study on performance of seismically tested repaired shear walls
eSAT Publishing House
 
PDF
Monitoring and assessment of air quality with reference to dust particles (pm...
eSAT Publishing House
 
PDF
Low cost wireless sensor networks and smartphone applications for disaster ma...
eSAT Publishing House
 
PDF
Coastal zones – seismic vulnerability an analysis from east coast of india
eSAT Publishing House
 
PDF
Can fracture mechanics predict damage due disaster of structures
eSAT Publishing House
 
PDF
Assessment of seismic susceptibility of rc buildings
eSAT Publishing House
 
PDF
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
eSAT Publishing House
 
PDF
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...
eSAT Publishing House
 
Likely impacts of hudhud on the environment of visakhapatnam
eSAT Publishing House
 
Impact of flood disaster in a drought prone area – case study of alampur vill...
eSAT Publishing House
 
Hudhud cyclone – a severe disaster in visakhapatnam
eSAT Publishing House
 
Groundwater investigation using geophysical methods a case study of pydibhim...
eSAT Publishing House
 
Flood related disasters concerned to urban flooding in bangalore, india
eSAT Publishing House
 
Enhancing post disaster recovery by optimal infrastructure capacity building
eSAT Publishing House
 
Effect of lintel and lintel band on the global performance of reinforced conc...
eSAT Publishing House
 
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
eSAT Publishing House
 
Wind damage to buildings, infrastrucuture and landscape elements along the be...
eSAT Publishing House
 
Shear strength of rc deep beam panels – a review
eSAT Publishing House
 
Role of voluntary teams of professional engineers in dissater management – ex...
eSAT Publishing House
 
Risk analysis and environmental hazard management
eSAT Publishing House
 
Review study on performance of seismically tested repaired shear walls
eSAT Publishing House
 
Monitoring and assessment of air quality with reference to dust particles (pm...
eSAT Publishing House
 
Low cost wireless sensor networks and smartphone applications for disaster ma...
eSAT Publishing House
 
Coastal zones – seismic vulnerability an analysis from east coast of india
eSAT Publishing House
 
Can fracture mechanics predict damage due disaster of structures
eSAT Publishing House
 
Assessment of seismic susceptibility of rc buildings
eSAT Publishing House
 
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
eSAT Publishing House
 
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...
eSAT Publishing House
 

Recently uploaded (20)

PDF
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
PPTX
ETP Presentation(1000m3 Small ETP For Power Plant and industry
MD Azharul Islam
 
PPTX
MSME 4.0 Template idea hackathon pdf to understand
alaudeenaarish
 
PDF
AI-Driven IoT-Enabled UAV Inspection Framework for Predictive Maintenance and...
ijcncjournal019
 
PPTX
IoT_Smart_Agriculture_Presentations.pptx
poojakumari696707
 
PPTX
Inventory management chapter in automation and robotics.
atisht0104
 
PDF
CAD-CAM U-1 Combined Notes_57761226_2025_04_22_14_40.pdf
shailendrapratap2002
 
PDF
20ME702-Mechatronics-UNIT-1,UNIT-2,UNIT-3,UNIT-4,UNIT-5, 2025-2026
Mohanumar S
 
PDF
Introduction to Ship Engine Room Systems.pdf
Mahmoud Moghtaderi
 
PDF
Construction of a Thermal Vacuum Chamber for Environment Test of Triple CubeS...
2208441
 
PDF
settlement FOR FOUNDATION ENGINEERS.pdf
Endalkazene
 
PPTX
MT Chapter 1.pptx- Magnetic particle testing
ABCAnyBodyCanRelax
 
PDF
EVS+PRESENTATIONS EVS+PRESENTATIONS like
saiyedaqib429
 
PPTX
Basics of Auto Computer Aided Drafting .pptx
Krunal Thanki
 
PDF
Packaging Tips for Stainless Steel Tubes and Pipes
heavymetalsandtubes
 
PPTX
Module2 Data Base Design- ER and NF.pptx
gomathisankariv2
 
PPTX
Water resources Engineering GIS KRT.pptx
Krunal Thanki
 
PDF
STUDY OF NOVEL CHANNEL MATERIALS USING III-V COMPOUNDS WITH VARIOUS GATE DIEL...
ijoejnl
 
DOCX
SAR - EEEfdfdsdasdsdasdasdasdasdasdasdasda.docx
Kanimozhi676285
 
PDF
Machine Learning All topics Covers In This Single Slides
AmritTiwari19
 
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
ETP Presentation(1000m3 Small ETP For Power Plant and industry
MD Azharul Islam
 
MSME 4.0 Template idea hackathon pdf to understand
alaudeenaarish
 
AI-Driven IoT-Enabled UAV Inspection Framework for Predictive Maintenance and...
ijcncjournal019
 
IoT_Smart_Agriculture_Presentations.pptx
poojakumari696707
 
Inventory management chapter in automation and robotics.
atisht0104
 
CAD-CAM U-1 Combined Notes_57761226_2025_04_22_14_40.pdf
shailendrapratap2002
 
20ME702-Mechatronics-UNIT-1,UNIT-2,UNIT-3,UNIT-4,UNIT-5, 2025-2026
Mohanumar S
 
Introduction to Ship Engine Room Systems.pdf
Mahmoud Moghtaderi
 
Construction of a Thermal Vacuum Chamber for Environment Test of Triple CubeS...
2208441
 
settlement FOR FOUNDATION ENGINEERS.pdf
Endalkazene
 
MT Chapter 1.pptx- Magnetic particle testing
ABCAnyBodyCanRelax
 
EVS+PRESENTATIONS EVS+PRESENTATIONS like
saiyedaqib429
 
Basics of Auto Computer Aided Drafting .pptx
Krunal Thanki
 
Packaging Tips for Stainless Steel Tubes and Pipes
heavymetalsandtubes
 
Module2 Data Base Design- ER and NF.pptx
gomathisankariv2
 
Water resources Engineering GIS KRT.pptx
Krunal Thanki
 
STUDY OF NOVEL CHANNEL MATERIALS USING III-V COMPOUNDS WITH VARIOUS GATE DIEL...
ijoejnl
 
SAR - EEEfdfdsdasdsdasdasdasdasdasdasdasda.docx
Kanimozhi676285
 
Machine Learning All topics Covers In This Single Slides
AmritTiwari19
 

A comparative analysis of classification techniques on medical data sets

1. INTRODUCTION
Data mining on medical data is a challenging area. It is the process of deriving meaningful and important information hidden in large, comprehensive data. Classification is an integral component of clinical data mining, as it organizes bulky data into meaningful groups and assists in effective mining. Classification [1] is an initial stage of data mining that divides the data into subgroups according to the features of the data and the dataset. These features can be application based or data based. A variety of classification algorithms is available, such as KNN, decision trees, Bayesian networks and SVM. Initially, the dataset is divided into two parts, called training data and testing data. The training data is the input from which the classification algorithm learns in the first stage. The learning process can be supervised or unsupervised. Once learning on the training data is complete, the classes are identified. The testing data consists of new instances whose class is unknown. Each testing instance is analyzed under different parameters and compared with the features of the available classes. The best-matching class is taken as the class of the instance [2]. Classification algorithms such as decision tree, neural network, support vector machine and KNN [2,3] collectively form a generalization model to perform the classification. Some of these methods are statistical, and some are soft computing methods that rely on rules and weighted values. These methods are described in detail in Section 3.

2. LITERATURE SURVEY
Early detection and intervention play a significant role in controlling the clinical deterioration of ICU patients [11]. An integrated data mining approach was designed to give early warnings and alarms by synthesizing a large set of features such as DFA, time series and entropy, and by applying linear and non-linear classification, forward feature selection and exploratory undersampling.
An improved hybrid prediction model was proposed by ILango & Ramaraj [12], who implemented F-score feature selection to deduce an optimal feature set from high-dimensional databases. They achieved a predictive accuracy of 98.9427% on a diabetes data set. Data mining techniques are important for deriving concise information from large, high-dimensional clinical data [13]. Real data sets from various domains were analyzed to measure the performance of various techniques and to pick out the best performer. Quantum C4.5 and Random tree produced 100% accuracy on one data set and 91.36% on another. The accuracy of any technique depends on the algorithm and on the nature and behavior of the data set. The medical coding problem can be viewed as a multi-label classification problem, in which a patient's information is translated into standard pre-defined codes. A multi-label large margin classifier is capable of learning the code structure using prior knowledge [14]. Medical data mining primarily focuses on hidden pattern extraction. A hybrid model was developed for classifying medical data sets: a combination of multistep pre-processing, k-means and k-nearest neighbor improves the performance of the process [15]. Missing values (MVs) in a medical data set significantly affect the pattern extraction process. Imputation is the most popular and common approach for solving the MV problem. A study on a heart failure data set concluded that no universal imputation tool is available that outperforms the others across diverse data sets [16]. Temporal patterns can be mined by applying fuzzy
neural networks. Lower approximations can be derived by implementing hypothesis and fuzzy decision tables [17]. A novel approach was proposed for classification based on bijective soft sets. Bijective soft set theory analyzes the data and identifies the dependencies among data values. It also discovers redundant information and identifies the classes over the data values, which makes it a valuable asset for inductive learning. Precision, recall and F-measure were the factors used to compare the performance of bijective soft set classification with decision table and naïve Bayes [18]. Neural networks applied to a data set of heart disease patients resulted in a sensitivity of 81.1%, a specificity of 78.7% and an accuracy of 80.2%. Decision trees yielded sensitivity, specificity and accuracy of 81.7%, 76.0% and 79.3%, whereas logistic regression achieved 81.2%, 73.1% and 71.1% respectively [19]. Various medical data sets from diverse domains were analyzed by applying the SSI, KD, RB and L measures. SSI produced the maximum classification accuracy on the different data sets [20],[16]. Maintenance of patients' data sets is equally important, as it may assist in acquiring knowledge and identifying problems [21],[12]. Heart disease diagnosis is one of the prime applications of data mining in the medical field. Many researchers have contributed to developing intelligent systems. My Chau Tu and Dongil Shin applied decision tree C4.5, bagging with C4.5 and bagging with naïve Bayes, and analyzed their effectiveness and correction rate. 10-fold cross validation was used for evaluation.
Of these techniques, bagging with naïve Bayes performed best, with 82.5% accuracy [22]. Pre-processing improves the performance of mining techniques by removing noisy data and substituting missing values (MVs) [15][16], whereas k-means algorithms are used to eliminate incorrect classification. The value of k depends on the nature and behavior of the data. The larger the value of k, the lower the chance of noise. An accuracy of 97.4% was achieved by caching KNN and k-means. The state of the classifiers is mainly decided by the nature of the data set when k-fold cross validation is applied [23]. Dempster-Shafer's theory of evidence combination was implemented on two different medical domains, breast cancer and skin lesions, by applying k-nearest neighbor, naïve Bayesian and decision tree classifiers. In this scenario, beliefs were computed from the classifier outputs. The belief m(X) is a measure of trust or confidence with $\sum_{X \in \Omega} m(X) = 1$ and $m(\phi) = 0$, where $\phi$ is the empty set [23]. Classification can be improved by using the evidence combination approach. To enhance performance, probability theory is modified and the mathematical theory of evidence is devised for handling uncertainty [24]. Binary-coded genetic algorithms and real-coded genetic algorithms (RGA) are used for assigning weights to the features, so that a set of optimal features can be deduced from high-dimensional data. Different k-NN algorithms (crisp k-NN, fuzzy k-NN and weighting fuzzy k-NN) are evaluated and compared on the same data set. An accuracy of 98.77% was achieved with the RGA-based model, whereas the GA approach turned out to be time consuming [25]. Knowledge can be extracted by clustering the given data set into soft clusters, which can be fused using serial and parallel fusion to perform as a classifier [26]. Hybrid approaches can be used for classifying voluminous data [27,26,18]. The Ant Colony Optimization (ACO) algorithm is an efficient approach used for classification [27].
Quality can be improved if ACO is combined with mRAR, a feature selection algorithm.

3. CLASSIFICATION APPROACHES
Classification is a form of data analysis [10] used to extract a model that describes and differentiates the data classes of given data objects, with the objective of predicting the class of an object whose class label is not known. The classification process can be broadly divided into two phases: a learning step (training data) and a classification step (testing data). Classification is used to predict categorical labels, including discrete and unordered values. The derived model can be represented in many forms, such as IF-THEN rules, decision trees, mathematical formulae and neural networks. Some of the popular techniques are described below.

3.1 Decision Tree
It is a hierarchical decision-making approach used to partition the dataset. It is a technique for approximating discrete-valued functions and retrieving useful expressions.

Fig. 1 A simple classification or decision tree (nodes: Age, OverWeight, BP; branches: <40, >=40, Yes, No, High, Normal; leaves: Heartattack, Good Health)
According to this approach, the dataset is classified into n mutually exclusive subsets, and each subset is assigned a label. Each data point is then assigned to its data class by a sequence of tests. Decision tree is a supervised learning technique that defines a transparent tree-based structure to operate on the available dataset. The tree contains a number of nodes connected by edges. The nodes define the conditions, and the edges define the outcomes of those conditions. Each edge itself defines a separate class. The decision is taken according to whether a condition evaluates to true or false: if the condition is true, one category is selected, otherwise the other. Figure 1 shows the simple classification process defined by a decision tree [3,4].

3.2 Bayesian Classification
A Bayesian network is a probabilistic graphical model that identifies the relations between variables; classification is performed based on this dependency analysis. The Bayesian network is a directed acyclic graph based model. Dependencies between attributes are identified in terms of conditional analysis. Each attribute is first defined independently, and at the next level the relations between the attributes are identified. Each attribute is assigned a weightage given by the probabilistic analysis, and as the relations are identified, the conditional probability is evaluated for each factor.
This classifier learns from the training data through conditional probability analysis on each attribute, and also assigns a random class label to each instance. Once the conditional-probability-based acyclic graph is generated, the next task is to predict the class with the highest probability. The goal of the classification process is to predict the discrete class value for any testing data. The structure of the Bayesian network is defined over a one-to-one feature vector. The arcs in the graph represent the features on which the conditional dependencies are evaluated [6].

3.3 Neural Network
Neural network is an effective soft-computing-based classification algorithm that uses the concept of neurons, logically representing the working of the human brain. In this classification process, the data values are represented by neurons and the connectivity by synapses. It is a layered approach with two end layers, the input layer and the output layer. Other than these two layers, the model also has m intermediate layers called hidden layers. Weights are assigned on each layer. The graphical representation of a neural network is given in figure 2. The middle layer of the network assigns weights to the different input values so that effective classification is achieved. A neural network accepts the dataset at the input layer and represents it as network nodes. The predictor weights are applied to these nodes in the hidden layer. This layer defines the degree of connectivity between the nodes. After the weights are adjusted, the output layer is derived as the final result [6,7].

Fig. 2: A Three Layer Neural Network

3.4 Support Vector Machine
SVM is another robust and successful classification algorithm. SVM works as a linear separator between the data points of two different classes in a multidimensional space.
The prime objective of this approach is to maximize the margin between the classes and to minimize the distance between the hyperplane points. SVM deals with the interactions among features, including repetitive features. SVM splits the dataset into two vector sets in an n-dimensional vector space. The SVM algorithm constructs a hyperplane so that each element is compared against the separating linear boundary. The hyperplane performs the data separation based on the largest distance to identify the classes. To reduce the error ratio, the largest-margin classifier is defined. The work also includes analysis based on the margin vector along with support vector analysis [8].

3.5 k-Nearest Neighbor
KNN is an instance-based statistical approach to data classification, also called a lazy learning algorithm. It is the simplest among all algorithmic approaches. According to this approach, an object is classified by a majority vote among its neighboring points. The class that receives the highest number of votes is selected as the class of the object, with the neighbors determined by the distance measure. Once the classification rule is decided, the relevant neighboring objects are identified. If k=1, the method is simply called nearest neighbor. K-NN requires:
1. An integer k
2. A training data set
3. A metric to measure closeness
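
The three requirements listed above (an integer k, a training data set and a closeness metric) can be illustrated with a minimal sketch. This is not the WEKA implementation used in the paper; the function name `knn_predict`, the Euclidean metric and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs. Euclidean distance is an
    assumed choice of metric; the paper does not state which one was used.
    """
    # Sort the training points by their distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two numeric features per record, two classes.
train = [
    ((1.0, 1.0), "healthy"),
    ((1.2, 0.8), "healthy"),
    ((0.9, 1.1), "healthy"),
    ((3.0, 3.2), "sick"),
    ((3.1, 2.9), "sick"),
]
prediction = knn_predict(train, (1.1, 1.0), k=3)
```

With k=1 the same function reduces to the plain nearest neighbor rule mentioned above. Nothing is learned in advance: all the work happens at prediction time, which is why KNN is called a lazy learner.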
Fig. 3 Example of KNN classification

An example of KNN classification is shown in figure 3. Here the input data object is represented by a green circle, and the object classes are represented by blue and red objects. For k = 3, triangles form the majority among the neighbors and the remaining neighbor belongs to the other class. For k = 5, three neighbors are squares and the remaining two are triangles, all lying inside the outer circle [9,10].

4. EXPERIMENTAL RESULTS AND ANALYSIS
The performance analysis was carried out on five different algorithms for three different datasets. The datasets considered here are from the medical domain. The classification algorithms used in this work are described in the previous section. The present work has been implemented in the WEKA (Waikato Environment for Knowledge Analysis) environment [4,9], and the results have been recorded under four parameters: accuracy, Kappa statistic [4], mean absolute error and execution time. The results obtained from these models are presented in the form of tables as well as graphs.

4.1 Datasets
In this work, three medical datasets are considered, which are collected from the UCI Repository. These datasets are sufficient to demonstrate the classification process and are analyzed here under the different classification parameters. The datasets are available in ARFF format. The detailed description of these datasets, in terms of instances and attributes, is given in table 5.

Table 5: Medical Datasets
Sr. No.  Dataset Name        Instances  Attributes
1        heart-statlog.arff  270        14
2        diabetes.arff       768        9
3        hepatitis.arff      155        20

The medical datasets that have been utilized to conduct the classification are taken from a diverse range of medical areas, to ensure the quality of the analysis. Every dataset has different types of data, including numbers, text and other domain data points. Each dataset is explored separately because of its uniqueness in terms of varying attributes, discrete or continuous nature of data, etc. These datasets are analyzed for the classification task by using the WEKA tool under different classification approaches. To perform this classification, the 10-fold cross-validation method is adopted in the present work.

WEKA is an open source data mining software tool written in Java. WEKA itself contains a number of built-in data mining algorithms, so that different mining operations can be performed directly. WEKA is used by researchers to analyze the effectiveness of different machine learning algorithms. In the present work, we have used WEKA to perform the analytical study of classification algorithms on medical datasets. The parameters considered are described next.

4.2 Accuracy Analysis
The accuracy of a classification algorithm is defined in terms of the number of correctly classified instances. Accuracy Analysis is given by

$$\text{Accuracy Analysis} = \frac{\text{CCount}}{\text{TCount}} \times 100$$

Here, CCount is the number of correctly identified objects, and TCount is the total number of objects. The higher the accuracy level, the more effective the algorithm.

Table 1: Accuracy Analysis of Different Classification Algorithms (accuracy in %, one column per classification technique)
Dataset name        Bayesian Networks  Neural Networks  SVM      KNN      Decision Tree
heart-statlog.arff  83.7037            77.4074          84.0741  75.1852  76.6667
diabetes.arff       76.3021            75.1302          77.474   69.7917  74.2188
hepatitis.arff      83.2258            80               85.1613  80.6452  81.2903
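The accuracy formula can be checked against Table 1 with a short sketch. The counts of correctly classified instances are not reported in this paper; the values below are back-calculated from the percentages in Table 1 and the instance counts in Table 5, so they should be read as an assumption. The function name `accuracy_percent` is likewise only illustrative.

```python
def accuracy_percent(correct_count, total_count):
    """Accuracy Analysis = CCount / TCount x 100."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return correct_count / total_count * 100


# (dataset, technique, assumed CCount, TCount from Table 5)
examples = [
    ("heart-statlog.arff", "SVM", 227, 270),
    ("diabetes.arff", "KNN", 536, 768),
    ("hepatitis.arff", "Neural Networks", 124, 155),
]

for dataset, technique, c_count, t_count in examples:
    value = round(accuracy_percent(c_count, t_count), 4)
    print(f"{dataset:20s} {technique:16s} {value}")
```

Under these assumed counts the three results round to 84.0741, 69.7917 and 80.0, which agree with the corresponding entries of Table 1.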
Fig 4: Accuracy Analysis of Different Classification Algorithms

Figure 4 shows the accuracy based comparison of the different algorithms. The figure shows that SVM is the most robust, effective and consistent classifier across the datasets. SVM provided the highest accuracy on all three datasets, whereas KNN gave the lowest accuracy on the heart-statlog and diabetes datasets and is the least effective algorithm in terms of accuracy analysis.

4.3 Based on Execution Time
Execution time is one of the foremost parameters for analyzing any algorithmic approach. In the present work, we have analyzed the execution time to identify the most efficient classification algorithm. Table 2 shows the execution time results obtained from the different algorithms.

Table 2: Execution Time Analysis of Different Classification Algorithms (one column per classification technique)
Dataset name        Bayesian Networks  Neural Networks  SVM   KNN    Decision Tree
heart-statlog.arff  0.02               1.3              0.08  0.003  0.03
diabetes.arff       0.02               2.03             0.11  0.005  0.06
hepatitis.arff      0.04               1.28             0.09  0.004  0.03

Fig 5: Execution Time Analysis of Different Algorithms

Figure 5 shows the execution time analysis of the different classification algorithms. As we can see, KNN is the most efficient classification algorithm, whereas the neural network takes the longest on every dataset.

4.4 Based on Kappa Statistic
The Kappa statistic is a measure of inter-rater agreement for qualitative data. Here it measures the agreement between the predicted and the actual classes, corrected for the agreement expected by chance. A higher value of the kappa statistic is considered better. Table 3 shows the comparative analysis of the different algorithms under the kappa statistic.
Table 3: Kappa Statistic Analysis of Different Algorithms (one column per classification technique)
Dataset name        Bayesian Networks  Neural Networks  SVM     KNN     Decision Tree
heart-statlog.arff  0.6683             0.5444           0.6762  0.4988  0.5271
diabetes.arff       0.4664             0.4445           0.4708  0.3223  0.4246
hepatitis.arff      0.5107             0.3825           0.5309  0.3953  0.394
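To make the values in Table 3 easier to interpret, the following sketch computes a kappa statistic from a confusion matrix. It assumes that the reported statistic is Cohen's kappa, comparing the observed agreement between actual and predicted classes with the agreement expected by chance. The function name `cohen_kappa` and the confusion matrix are hypothetical and are not results from this study.

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix.

    confusion[i][j] is the number of objects of actual class i
    that were predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")

    # Observed agreement: fraction of objects on the diagonal.
    observed = sum(confusion[i][i] for i in range(n_classes)) / total

    # Chance agreement: product of row and column marginals, summed over classes.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / total ** 2

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical two-class confusion matrix (100 objects, 85 classified correctly).
example = [
    [40, 10],
    [5, 45],
]
print(round(cohen_kappa(example), 4))
```

For this example the observed agreement is 0.85 and the chance agreement is 0.5, so kappa is 0.7 even though the accuracy is 85%. This is why kappa values in Table 3 are lower than the corresponding accuracies in Table 1.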
Fig 6: Kappa Statistic based Analysis of Different Classification Algorithms

Figure 6 shows the Kappa statistic based analysis. As we can see, the SVM classification method has provided the highest Kappa statistic value for every dataset, which shows its effectiveness in terms of robustness. The values obtained for the KNN algorithm are the lowest on the heart-statlog and diabetes datasets, which shows that it is the least effective algorithm under this parameter.

4.5 Based on Mean Absolute Error (MAE)
MAE measures the average magnitude of the difference between the predicted and the actual values, and so indicates the predictive capability of an algorithm. The lower the MAE, the higher the capability of the algorithm to perform the prediction. Table 4 shows the results obtained from the different algorithms under the mean absolute error parameter.

Table 4: Mean Absolute Error based Analysis for Different Classification Algorithms (one column per classification technique)
Dataset name        Bayesian Networks  Neural Networks  SVM     KNN     Decision Tree
heart-statlog.arff  0.1835             0.2328           0.1593  0.2502  0.274
diabetes.arff       0.2841             0.294            0.2253  0.3027  0.3134
hepatitis.arff      0.1754             0.1928           0.1484  0.1979  0.2073

Fig 7: Mean Absolute Error based Analysis for Different Algorithms

Figure 7 shows the comparison of the different algorithms under the mean absolute error. As we can see, the MAE value is the lowest for the SVM algorithm on every dataset, which shows the accurate prediction capability of the algorithm. The KNN algorithm yields high MAE values on all three datasets, exceeded only by the decision tree in table 4, which shows that the algorithm is not effective for prediction.

5. CONCLUSIONS
In this paper, the analysis of different classification algorithms has been performed under four different parameters, namely execution time, mean absolute error, kappa statistic and accuracy, for three medical datasets. The obtained results show that SVM is the most robust, consistent and reliable classification algorithm, whereas KNN is the least effective algorithm for the classification.

REFERENCES
[1] Gupta, M., and Aggarwal, N., “Performance Analysis of Classification Techniques on XML Dataset”, International Journal of Computer Science and Technology (IJCST) Vol. 1, Issue 1, pp. 76-79, 2010.
[2] Justin, T., Gajsek, R., Struc, V., and Dobrisek, S., “Comparison of Different Classification Methods for Emotion Recognition”, MIPRO 2010, Opatija, Croatia, pp. 700-703, 2010.
[3] Gupta, S., Kumar, D., and Sharma, A., “Data Mining Classification Techniques applied for Breast Cancer Diagnosis and Prognosis”, Indian Journal of Computer Science and Engineering (IJCSE) Vol. 2, No. 2, pp. 188-195, 2011.
[4] Viera, A. J., and Garrett, J. M., “Understanding Interobserver Agreement: The Kappa Statistic”, Research Series Vol. 37, No. 5, pp. 360-363, 2005.
[5] Desai, A., and Rai, S., “Analysis of Machine Learning Algorithms using WEKA”, International Conference & Workshop on Recent Trends in Technology (TCET) 2012, Proceedings published in International Journal of Computer Applications (IJCA) 27, pp. 27-32, 2012.
[6] Kumari, Milan, and Godara, Sunila, “Comparative Study of Data Mining Classification Methods in Cardiovascular Disease Prediction”, International Journal of Computer Science and Technology Vol. 2, Issue 2, pp. 304-308, 2011.
[7] Ture, M., Kurt, I., Kurum, A. T., and Ozdamar, K., “Comparing classification techniques for predicting essential hypertension”, Expert Systems with Applications 29, pp. 583-588, 2011.
[8] Burges, C., “A Tutorial on Support Vector Machines for Pattern Recognition”, Data Mining and Knowledge Discovery, Vol. 2, pp. 121-167, 1998.
[9] Pushpa, “Comparision of Clustering Techniques using WEKA”, M. Tech. Thesis, Guru Jambheshwar University of Science and Technology, Hisar, India, 2010.
[10] Han, J., and Kamber, M., “Data Mining: Concepts and Techniques”, 2nd edition, Morgan Kaufmann Publishers, 2006.
[11] Yi Mao, Wenlin Chen, Yixin Chen, Lu Chenyang, Kollef Marin & Thomas C., “An Integrated Data Mining Approach to Real Time Clinical Monitoring and Deterioration Warning”, KDD’12, ACM 978-1-4503-1462-6/12/08.
[12] Ilango B.S, Ramaraj N., “A Hybrid Prediction Model with F-score Feature selection for TypeII Diabetes Databases”, A2CWiC 2010, September 16-17, 2010.
[13] Jacob S.G., Ramani R.G, “Mining of Classification Patterns in Clinical Data through Data Mining Algorithms”, ICACCI’12, ACM 978-1-4503-1196-0/12/08.
[14] Yan Yan, Fung G., DY J.G, “Medical Coding Classification by Leveraging Inter-Code Relationships”, KDD’10, July 25-28 2010, Washington, USA.
[15] NirmalaDevi M, Balamurugan S, Swathi U V, “An amalgam KNN to predict Diabetes Mellitus”, 2013 IEEE International Conference on Emerging Trends in Computing, Communication and Nanotechnology.
[16] Davis D N, Zhang Y, Kambhampati C, Goode K, Cleland J.G.F, “A Comparative study of missing value imputation with multi class classification for clinical heart failure data”, 2012 IEEE, 9th International Conference on Fuzzy Systems and Knowledge Discovery.
[17] Sethukkarasi R., Keerthika U., Kannan A., “A Self Learning Rough Fuzzy Neural Network Classifier for Mining Temporal Patterns”, ICACCI’12, ACM 978-1-4503-1196-0/12/08.
[18] Kumar S U, Inbarani H, Senthil Kumar, “Bijective Soft Set Based Classification of Medical Data”, 2013 IEEE, International Conference on Pattern Recognition, Informatics and Mobile Engineering.
[19] Khemphila A, Boonjing V, “Comparing performances of logistic regression, decision trees and neural networks for classifying heart disease patients”, 2010 IEEE, International Conference on Computer Information Systems and Industrial Management Applications.
[20] Saastamoinen K, Ketola J, “Medical Data Classification using Logical Similarity based Measures”, 1-4244-0023-6/06, 2006 IEEE.
[21] Mutalib S., Razak Abd. R., Nordin S., Rahman S.A., Mohamed A., “Intelligent classification in Medical Data”, 2012 IEEE International Conference on Biomedical Engineering and Sciences.
[22] Tu M C, Shin D, Shin Dong, “A Comparative Study of Medical Data Classification Methods Based on Decision Tree and Bagging Algorithms”, 2009 8th IEEE International Conference on Dependable, Autonomic and Secure Computing.
[23] Aslandogan Y.A, Mahajani G. A, “Evidence Combination in Data Mining”, 2004 IEEE Proceedings of the International Conference on Information Technology: Coding and Computing.
[24] Shafer G, “A mathematical theory of evidence”, Princeton University Press, 1976.
[25] Tang P.H, Tseng M.H, “Medical data mining using BGA and RGA for weighting of features in Fuzzy k-NN classification”, 2009 8th IEEE International Conference on Machine Learning and Cybernetics.
[26] Hassan S. Z, Verma B, “A Hybrid Data Mining Approach for Knowledge Extraction and Classification in Medical Databases”, 2007 IEEE 7th International Conference on Intelligent Systems Design and Applications.
[27] Michelakos I, Papageorgiou E, Vasilakopoulos M, “A hybrid classification algorithm evaluated on medical data”, 2010 IEEE Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises.