Research Article
Volume 1 Issue 2 - January 2017
Curr Trends Biomedical Eng & Biosci
Copyright © All rights are reserved by Damian R Mingle
Controlling Informative Features for Improved
Accuracy and Faster Predictions in Omentum Cancer
Models
Damian R Mingle*
WPC Healthcare, Nashville, USA
Submission: December 09, 2016; Published: January 03, 2017
*Corresponding author: Damian Mingle, Chief Data Scientist, WPC Healthcare, 1802 Williamson Ct, Brentwood, TN 37027, USA
Abstract
Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal of personalized medicine. However, current machine learning approaches are either too complex or perform poorly. Here, a novel feature detection and engineering machine-learning framework is presented to address this need. First, the Rip Curl process is applied, which generates a set of 10 additional features. Second, we rank all features, including the Rip Curl features; the top-ranked features are most likely to contain the most informative features for prediction of the underlying biological classes and are used in model building. This process creates more expressive features, which are captured in models while verifying that the model learns as the sample size increases and tracking the accuracy/time results. The performance of the proposed Rip Curl classification framework was tested on omentum cancer data. Rip Curl outperformed other, more sophisticated classification methods in terms of prediction accuracy, minimum number of classification markers, and computational time.
Keywords: Omentum, Cancer, Data science, Machine learning, Biomarkers, Phenotype, Personalized medicine
How to cite this article: Damian R M. Controlling Informative Features for Improved Accuracy and Faster Predictions in Omentum Cancer Models. Curr Trends Biomedical Eng & Biosci. 2017; 1(2): 555559.
Introduction
In recent years, the dawn of technologies like microarrays,
proteomics, and next-generation sequencing has transformed
life science. The data from these experimental approaches
deliver a comprehensive depiction of the complexity of biological
systems at different levels. A challenge within the “-omics” data
strata is in finding the small amount of information that is
relevant to a particular question, such as biomarkers that can
accurately classify phenotypic outcomes [1]. This is certainly
true in the fold of peritoneum connecting the stomach with
other abdominal organs known as the omentum. Numerous
machine learning techniques and methods have been proposed
to identify biomarkers that accurately classify these outcomes
by learning the elusive pattern latent in the data. To date, three
categories of methods have assisted in biomarker selection
and phenotypic classification:
A. Filters
B. Wrappers
C. Embedded methods
In practice, time-to-prediction and accuracy of prediction
matter a great deal.
Filtering methods are generally considered in an effort
to spend the least time-to-prediction and can be used to
decide which are the most informative features in relation
to the biological target [2]. Filtering produces the degrees of
correlation with a given phenotype and then ranks the markers
in a given dataset. Many researchers acknowledge the weakness
of such methods and take care to guard against the selection
of redundant biomarkers. In addition, filtering methods do not
allow for interactions between biomarkers. An example of a
popular filtering method is Student’s t-test [3].
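As a concrete sketch of a filter, the following ranks a handful of hypothetical expression features by the absolute Welch t-statistic between tumor and normal samples; all names and values here are illustrative, not taken from the GEMLeR data.

```python
import math

def t_statistic(a, b):
    # Absolute Welch t-statistic between two groups of expression values
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return abs(ma - mb) / math.sqrt(va / len(a) + vb / len(b))

# Hypothetical expression matrix: rows = samples, columns = features
tumor  = [[5.1, 0.9, 2.2], [4.8, 1.1, 2.0], [5.3, 0.8, 2.4]]
normal = [[1.2, 1.0, 2.1], [1.0, 0.9, 2.3], [1.4, 1.2, 2.2]]

scores = [t_statistic([s[j] for s in tumor], [s[j] for s in normal])
          for j in range(3)]
ranking = sorted(range(3), key=lambda j: scores[j], reverse=True)
```

Here feature 0 separates the two classes most strongly, so a filter would rank it first; note that the score is computed per feature, which is exactly why interactions between biomarkers are invisible to this method.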
In order to optimize the predictive power of a classification
model, wrapper methods iteratively perform combinatorial
biomarker search. Since this combinatorial optimization process
is computationally complex (an NP-hard problem), many heuristics
have been proposed, for example, to reduce the search space
and thus the computational burden of biomarker selection [4].
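A minimal sketch of the wrapper idea, using greedy forward selection as one such heuristic; the evaluation function and feature names below are toy stand-ins for cross-validated classifier accuracy.

```python
def wrapper_forward_select(features, evaluate, max_size=3):
    """Greedy forward selection: repeatedly add the candidate feature
    that most improves the evaluation score; stop when nothing helps."""
    selected, best_score = [], float("-inf")
    while len(selected) < max_size:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        score, best = max((evaluate(selected + [f]), f) for f in candidates)
        if score <= best_score:
            break
        selected.append(best)
        best_score = score
    return selected

# Toy evaluator standing in for cross-validated classifier accuracy;
# in this sketch "g1" and "g3" together carry the signal.
toy_scores = {("g1",): 0.70, ("g2",): 0.50, ("g3",): 0.65, ("g1", "g3"): 0.90}

def evaluate(subset):
    return toy_scores.get(tuple(sorted(subset)), 0.40)

chosen = wrapper_forward_select(["g1", "g2", "g3"], evaluate)
```

Greedy search evaluates only a linear number of subsets per round instead of all 2^d, which is the point of such heuristics.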
Embedded methods are similar to wrapper methods, except that
they perform feature selection and classification simultaneously.
wrapper methods. Recursive feature elimination support vector
machine (SVM-RFE) is a widely used technique for analysis of
microarray data [5,6]. The SVM-RFE procedure constructs a
classification model using all available features, with the least
informative features being pruned from the model. This process
continues iteratively until a model has learned the minimum
number of features that are useful. In the case of “-omics” data,
this process becomes impractical when considering a large
feature space.
Our research used a hybrid approach between user and
machine that dramatically reduced the computational time
required by similar approaches while increasing prediction
accuracy compared with other state-of-the-art machine
learning techniques. Our proposed framework includes:
A. Ranking and pruning attributes using information gain
to extract the most informative features and thus greatly
reducing the number of data dimensions;
B. A user to view histograms on attributes where the
information gain is 0.80 or higher and creating new binary
features from continuous values;
C. Re-ranking both the original features and the newly
constructed features; and
D. Using the number of instances to determine how many
top-n informative features should be used in modeling.
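Step A of the framework, ranking by information gain, can be sketched as follows; the two toy features and labels are illustrative.

```python
import math

def entropy(labels):
    # Shannon entropy (base 2) of a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def information_gain(feature, labels):
    # Reduction in label entropy after partitioning on a discrete feature
    n = len(labels)
    remainder = 0.0
    for v in set(feature):
        part = [l for f, l in zip(feature, labels) if f == v]
        remainder += len(part) / n * entropy(part)
    return entropy(labels) - remainder

y  = [1, 1, 1, 0, 0, 0]
f1 = [1, 1, 1, 0, 0, 0]   # perfectly informative toy feature
f2 = [1, 0, 1, 0, 1, 0]   # nearly uninformative toy feature
ranked = sorted([("f1", f1), ("f2", f2)],
                key=lambda kv: information_gain(kv[1], y), reverse=True)
```

Pruning then keeps only the head of `ranked`, which is what step D bounds by the instance count.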
The Rip Curl framework can be used to construct a high-
dimensional classification model that takes into account
dependencies among the attributes of complex biological
-omics datasets. The performance of the proposed four-step
classification framework was evaluated using microarray
datasets. The proposed framework was compared with
SVM-RFE in terms of area under the ROC curve (AUC) and the
number of informative biomarkers used for classification.
Results and Discussion
Using the omentum dataset, we conducted the Rip Curl
process: setting the target feature (in this case, one-
versus-all), characterizing the target variable, loading and
preparing the omentum data, saving the target and partitioning
information, analyzing the omentum features, creating cross-
validation and hold-out partitions, and conducting exploratory
data analysis.
Table 1: Different types of descriptive features.

| Type | Description |
|---|---|
| Predictive | A predictive descriptive feature provides information that is useful in estimating the correct value of a target feature. |
| Interacting | By itself, an interacting descriptive feature is not informative about the value of the target feature. In conjunction with one or more other features, however, it becomes informative. |
| Redundant | A descriptive feature is redundant if it has a strong correlation with another descriptive feature. |
| Irrelevant | An irrelevant descriptive feature does not provide information that is useful in estimating the value of the target feature. |
In an effort to increase performance and accuracy we opted
for an approach of feature selection to help reduce the number
of descriptive features in the omentum dataset to just the subset
that is most useful for prediction. Before we begin our discussion
of approaches to feature selection, it is useful to distinguish
between different types of descriptive features (Table 1):
The goal of any feature selection approach is to identify the
smallest subset of descriptive features that maintains overall
model performance. Ideally a feature selection approach will
return the subset of features that includes the predictive
and interacting features while excluding the irrelevant and
redundant features.
Using conventional methods, it is not efficient to find the
ideal subset of descriptive features used to train an omentum
model. Consider d features: there are 2^d different possible
feature subsets, which is far too many to evaluate unless d is
very small. For example, with the descriptive features represented
in the omentum dataset, there are 2^10,960 possible feature
subsets, a 3,300-digit integer.
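The size of that search space is easy to verify directly:

```python
# Number of possible feature subsets for d descriptive features is 2**d.
d = 10960
subset_count = 2 ** d
digits = len(str(subset_count))   # a 3,300-digit integer
```

Even evaluating a billion subsets per second would not make a dent in a number of this magnitude, which is why exhaustive search is ruled out.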
Material and Methods
Datasets
The datasets used in the experiments were provided by the
Gene Expression Machine Learning Repository (GEMLeR) [7].
GEMLeR contains microarray data from 9 different tissue types
including colon, breast, endometrium, kidney, lung, omentum,
ovary, prostate, and uterus. Each microarray sample is classified
as tumor or normal. The data from this repository were collated
into 9 one-tissue-types versus all-other-types (OVA) datasets
where the second class is labeled as “other.” All GEMLeR
microarray datasets have been analyzed by SVM-RFE, the results
of which are available from the same resource.
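The OVA relabeling can be sketched as follows; the sample tissue labels are illustrative, not the actual GEMLeR records.

```python
# Build a one-tissue-type versus all-other-types (OVA) target,
# labeling every non-target tissue as "other" as GEMLeR does.
samples = ["omentum", "colon", "breast", "omentum", "lung"]

def make_ova_labels(tissues, positive):
    return [t if t == positive else "other" for t in tissues]

ova = make_ova_labels(samples, "omentum")
```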
Methods
Figure 1: The Rip Curl framework and its stage dependencies.
Figure 1 demonstrates the Rip Curl framework and its
dependencies on the prior stage.
In applying the Rip Curl framework, we initially ran the
omentum microarray data through to gain informative feature
feedback and then rank those features from most to least
important. We confirmed the number of instances available in
the dataset and then applied 1% (1,545 instances × 0.01 ≈ 15
features) to discover how many top informative features we
would make use of in our framework. Where the features were
both in the top 1% and expressed informativeness at or above
80%, we created unique features that followed some meaningful
thresholds which grouped biomarker data into bins of “0” or “1”.
Finally, we sent the enhanced data back through the informative
feature test to reduce the feature space to the top 1% and then
removed all other features, modeling this subset using Random
Forest (Entropy).
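A minimal sketch of this pass (rank, binarize at a visual threshold, re-rank, keep the top n); the expression values are hypothetical, and the 1,800 cut-off echoes the 1800_206067_s_at-style feature names in Table 3.

```python
def binarize(values, threshold):
    # New binary feature: 1 where raw expression exceeds the threshold
    return [1 if v > threshold else 0 for v in values]

# Hypothetical raw expression values for a 206067_s_at-style biomarker
expression = [120.0, 1850.0, 95.0, 2400.0, 60.0]
binned = binarize(expression, 1800.0)

# Top-n informative features kept for modeling: 1% of the instance count
n_instances = 1545
top_n = round(n_instances * 0.01)   # 15 features
```

The pruned subset of `top_n` features is what the Random Forest (Entropy) model is then fit on.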
Our general approach to model selection was to run several
model types and to select the best performing based on the
highest AUC from the cross-validation results. Once those models
were selected, we confirmed that they were learning by
evaluating them on samples of 16%, 32%, and 64% of the data,
as reported in Figure 2. If performance did not increase with
each larger sample size, the model was discarded as a
non-learning model (Figure 2).
Figure 2: A demonstration of a proper learning model.
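The non-learning-model filter can be sketched as a simple monotonicity check over the per-fraction AUCs; the values below are illustrative.

```python
def is_learning_model(aucs_by_fraction):
    # AUC must improve at each larger training fraction (16% -> 32% -> 64%)
    return all(later > earlier
               for earlier, later in zip(aucs_by_fraction, aucs_by_fraction[1:]))

candidate = [0.88, 0.91, 0.93]   # AUC at 16%, 32%, 64% of the data
stalled   = [0.90, 0.90, 0.89]   # discarded as a non-learning model
```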
We chose area under the ROC curve for its immediate
understanding and calculated it as

ROC index = Σ_{i=2 to |T|} [FPR(T[i]) − FPR(T[i−1])] × [TPR(T[i]) + TPR(T[i−1])] / 2 (1)

where T is a set of thresholds, |T| is the number of thresholds
tested, and TPR(T[i]) and FPR(T[i]) are the true positive and
false positive rates at threshold i respectively. The ROC index is
quite robust in the presence of imbalanced data, which makes
it a common choice for practitioners, especially when multiple
modeling techniques are being compared to one another.
In addition, we ran a second evaluation measure, Gini Norm,
which is calculated as follows
Gini coefficient = (2 × ROC index) − 1 (2)
The Gini coefficient can take values in the range of (0,1), and
higher values indicate better model performance.
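The per-threshold trapezoidal sum and the Gini Norm of equation (2) can be sketched together; the two sweeps below are illustrative (a perfect ranking and a chance-level one).

```python
def roc_index(fpr, tpr):
    # Trapezoidal area under the ROC curve from per-threshold rates,
    # with thresholds ordered so that FPR is non-decreasing
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

def gini_norm(auc):
    # Equation (2): Gini coefficient = (2 x ROC index) - 1
    return 2 * auc - 1

perfect_auc = roc_index([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])   # 1.0
chance_auc  = roc_index([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])   # 0.5
```

A chance-level model thus scores a Gini Norm of 0, which is why the rescaling is convenient when comparing models on imbalanced data.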
In our experiment with the omentum microarray data, we
wanted to pay particular attention to reducing complexity and
thereby improving time-to-prediction. This is especially true
with an M × N dimensional dataset, where M is the number of
samples and N the number of features, particularly where N is
orders of magnitude greater than M, as is the case in
our experiment.
We selected Random Forest to represent the general
technique of random decision forests, an ensemble learning
method for classification, regression, and other tasks. Random
Forest operates by constructing a multitude of decision trees
at training time and outputting the class that is the mode of
the classes (classification) or mean prediction (regression)
of the individual trees. Random decision forests correct for
decision trees’ habit of overfitting to their training set. Figure
3 demonstrates visually the increase in predictive accuracy as it
relates to complexity of that prediction.
Figure 3: Comparison of Rip Curl vs other methods.
Table 2: Comparative analysis of model performance.

| Data | Model | Number of Variables | AUC (Validation) | AUC (Cross-Validation) | AUC (Hold-Out) | Gini Norm (Validation) | Gini Norm (Cross-Validation) | Gini Norm (Hold-Out) | Time (ms) |
|---|---|---|---|---|---|---|---|---|---|
| Raw Features | Random Forest (Entropy) | 10,935 | 0.9520 | 0.9427 | 0.9269 | 0.9040 | 0.8855 | 0.8537 | 10,859.25 |
| Univariate Features | Random Forest (Entropy) | 8,165 | 0.9592 | 0.9492 | 0.9232 | 0.9184 | 0.8984 | 0.8463 | 8,233.39 |
| Informative Features | Random Forest (Entropy) | 8,283 | 0.9520 | 0.9427 | 0.9269 | 0.9040 | 0.8855 | 0.8537 | 10,732.36 |
| Top 1% of Features | Random Forest (Entropy) | 15 | 0.9379 | 0.9201 | 0.9344 | 0.8757 | 0.8401 | 0.8689 | 3,374.21 |
Table 2 emphasizes the disparity in prediction results and
time while holding the model type, Random Forest (Entropy),
constant.
We observed that Rip Curl (Top 1% of Features) made use
of the best parameters, which we found to be max_depth: None,
max_features: 0.2, max_leaf_nodes: 50, min_samples_leaf: 5, and
min_samples_split: 10. Rip Curl improved time-to-prediction
by a range of 59.02% to 68.93%, increased the hold-out AUC
by 0.81% to 1.21%, and increased the hold-out Gini Norm by a
range of 1.78% to 2.67%.
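These reported ranges can be cross-checked directly against the hold-out and time columns of Table 2:

```python
# Rip Curl (Top 1% of Features) versus the other three Table 2 rows
ripcurl = {"auc": 0.9344, "gini": 0.8689, "time_ms": 3374.21}
others  = [
    {"auc": 0.9269, "gini": 0.8537, "time_ms": 10859.25},  # Raw Features
    {"auc": 0.9232, "gini": 0.8463, "time_ms": 8233.39},   # Univariate Features
    {"auc": 0.9269, "gini": 0.8537, "time_ms": 10732.36},  # Informative Features
]

time_gains = [100 * (1 - ripcurl["time_ms"] / o["time_ms"]) for o in others]
auc_gains  = [100 * (ripcurl["auc"] / o["auc"] - 1) for o in others]
gini_gains = [100 * (ripcurl["gini"] / o["gini"] - 1) for o in others]
```

The extremes of these lists reproduce the 59.02% to 68.93% time improvement, the 0.81% to 1.21% hold-out AUC gain, and the 1.78% to 2.67% hold-out Gini Norm gain quoted above.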
Our framework makes use of Claude Shannon’s entropy
formula which defines a computational measure of the impurity
of the elements in a set. Shannon’s idea of entropy is a weighted
sum of the logs of the probabilities of each possible outcome
when we make a random selection from a set. The weights used
in the sum are the probabilities of the outcomes themselves so
that outcomes with high probabilities contribute more to the
overall entropy of a set than outcomes with low probabilities.
Shannon’s entropy is defined as

H(t) = − Σ_{i=1 to l} P(t = i) × log_s(P(t = i)) (3)

where P(t=i) is the probability that the outcome of randomly
selecting an element t is of type i, l is the number of different
types of things in the set, and s is an arbitrary logarithmic base
(which we selected as 2) [12].
Once we established the informativeness of each feature, we
visually explored the histograms of each variable that expressed
an informative value of ≥80%. Below is an example of the
histogram for 206067_s_at, indicating where the target’s signal
was most concentrated within this biomarker (Figure 4).
Figure 4: Histogram of Gene 206067_s_at in raw expression.
In an effort to concentrate the omentum tissue signal, we
generated a threshold rule, which rendered a new feature and an
additional histogram (Figure 5).
Figure 5: Demonstration of Rip Curl feature engineering using
visual thresholds.
This allowed us to pass a different, possibly more understandable
context to our algorithm.
We repeated this process above, applying rules based on our
observation of the training data. Some additional features were
simple descriptive statistics, such as Min and Mode, while others
were a bit unconventional, defined over a gene feature X and an
index i representing the placement within the feature index.
Binsum was another engineered feature, computed simply as

Binsum = Σ_i bin_i

where bin_i is one of the generated binary features and i is the
index of the bin within the omentum training data. In an effort
to develop greater context for the omentum data and the new
features that were engineered, we analyzed key values and their
respective informativeness (or importance) [8] (Table 3):
Table 3: Rip Curl feature engineering statistics.

| Feature Name | Importance | Unique | Missing | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|---|---|
| Binsum | 92.88 | 9 | 0 | 1.46 | 2.07 | 1 | 0 | 8 |
| 1800_206067_s_at | 84.93 | 2 | 0 | 0.15 | 0.35 | 0 | 0 | 1 |
| 400_216953_s_at | 82.15 | 2 | 0 | 0.14 | 0.34 | 0 | 0 | 1 |
| 1100_219454_at | 82.09 | 2 | 0 | 0.22 | 0.42 | 0 | 0 | 1 |
| 1300_214844_s_at | 78.42 | 2 | 0 | 0.13 | 0.34 | 0 | 0 | 1 |
| 4000_227195_at | 77.04 | 2 | 0 | 0.21 | 0.41 | 0 | 0 | 1 |
| 1100_213518_at | 76.07 | 2 | 0 | 0.31 | 0.46 | 0 | 0 | 1 |
| 3500_204457_s_at | 75.71 | 2 | 0 | 0.21 | 0.40 | 0 | 0 | 1 |
| 900_219778_at | 71.29 | 2 | 0 | 0.09 | 0.29 | 0 | 0 | 1 |
| Lensum | 55.36 | 3 | 0 | 13.82 | 2.98 | 16 | 8 | 16 |
| Mode | 53.04 | 837 | 0 | 214 | 413 | 19.80 | 1.60 | 4152 |
| Min | 51.85 | 81 | 0 | 242 | 1.41 | 2.20 | 0.20 | 15.40 |
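Binsum can be sketched as a per-sample sum over the generated binary columns; the bin values below are illustrative.

```python
def binsum(bin_features):
    # bin_features: list of binary columns, one per engineered feature;
    # returns the per-sample sum across all binary features
    return [sum(col[i] for col in bin_features)
            for i in range(len(bin_features[0]))]

bins = [[1, 0, 0], [1, 1, 0], [0, 1, 0]]   # three binary features, three samples
totals = binsum(bins)
```

With eight generated binary features, as in Table 3, the resulting values range from 0 to 8, matching the Min and Max reported for Binsum.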
Figure 6: Variable importance rank generated from the Rip Curl
framework.
Figure 6 demonstrates visually the variable importance
of the final Rip Curl model. We observed that 20% of the top
informative features were generated through the Rip Curl
framework: Binsum, >1800_206067_s_at, and >400_216953_s_at,
with a range of importance between 27% and 93%.
GEMLeR provides an initial classification accuracy value for
the omentum dataset. In their experiments, designed to generate
the state-of-the-art benchmark, all measurements were
performed using the WEKA machine learning environment. They
opted to make use of one of the most popular machine learning
methods for gene expression analysis, the Support Vector Machines
– Recursive Feature Elimination (SVM-RFE) feature selection
algorithm. Feature selection and classification were evaluated
inside a 10-fold cross-validation loop on the omentum dataset
to avoid so-called selection bias [9]; Figure 7 demonstrates
their approach.
Figure 7: SVM-RFE Process.
Head-to-Head Comparison with SVM-RFE
Table 4: Comparison results of international benchmark and Rip Curl.

| Model | AUC |
|---|---|
| SVM-RFE (Benchmark) | 0.703 |
| Rip Curl | 0.934 |
Table 4 shows a comparison of the SVM-RFE benchmark
established in [10] with the Rip Curl framework. Rip Curl
represents a 32.92% gain in prediction accuracy over the
GEMLeR benchmark for the same omentum dataset [11].
Conclusion
The Rip Curl classification framework outperformed the
state-of-the-art benchmark (SVM-RFE) in the GEMLeR omentum
cancer experiment. Since the Rip Curl classification framework
utilizes entropy-based feature filtering and adds more contexts
through feature engineering, the complexity of this classification
framework is very low permitting analysis of data with many
features. Future research would suggest comparisons beyond
the omentum cancer data and exploration of other one-versus-
all experiments in the areas of breast, colon, endomentrium,
kidney, lung, ovary, prostate, and uterus [12-14].
Acknowledgment
We would like to acknowledge GEMLeR for making this
important dataset available to researchers and WPC Healthcare
for supporting our work. Finally, the authors would like to thank
the donors who participated in this study.
References
1.	 Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y (2010) Robust
biomarker identification for cancer diagnosis with ensemble feature
selection methods. Bioinformatics 26(3): 392-398.
2.	 Mingle D (2015) A Discriminative Feature Space for Detecting and
Recognizing Pathologies of the Vertebral Column. International Journal
of Biomedical Data Mining.
3.	 Leung Y, Hung Y (2010) A multiple-filter-multiple-wrapper approach
to gene selection and microarray data classification. IEEE/ACM Trans
Comput Biol Bioinform 7(1): 108-117.
How to cite this article: Damian R M. Controlling Informative Features for Improved Accuracy and Faster Predictions in Omentum Cancer Models. Curr
Trends Biomedical Eng & Biosci. 2017; 1(2): 555559.
006
Current Trends in Biomedical Engineering & Biosciences
4.	 Mohammadi A, Saraee MH, Salehi M (2011) Identification of disease-
causing genes using microarray data mining and Gene Ontology. BMC
medical genomics 4(1): 1.
5.	 Ding Y, Wilkins D (2006) Improving the performance of SVM-RFE to
select genes in microarray data. BMC bioinformatics 7(Suppl 2): S12.
6.	 Balakrishnan S, Narayanaswamy R, Savarimuthu N, Samikannu R
(2008) SVM ranking with backward search for feature selection in type
II diabetes databases. IEEE pp. 2628-2633.
7.	 Stiglic G, Kokol P (2010) Stability of ranked gene lists in large
microarray analysis studies. BioMed Research International 2010: ID
616358.
8.	 Breiman L (2001) Random forests. Machine learning 45(1): 5-32.
9.	 Ambroise C, McLachlan GJ (2002) Selection bias in gene extraction on
the basis of microarray gene-expression data. Proc Natl Acad Sci U S A
99(10): 6562-6566.
10.	Duan K, Rajapakse JC (2004) SVM-RFE peak selection for cancer
classification with mass spectrometry data. In APBC pp. 191-200.
11.	Hu ZZ, Huang H, Wu CH, Jung M, Dritschilo A, et al. (2011) Omics-based
molecular target and biomarker identification. Methods Mol Biol 547-
571.
12.	Shannon CE (1948) A mathematical theory of communication. Bell
System Tech J 27: 379-423.
13.	Stiglic G, Rodriguez JJ, Kokol P (2010) Finding optimal classifiers for
small feature sets in genomics and proteomics. Neurocomputing 73(13):
2346-2352.
14.	Zervakis M, Blazadonakis ME, Tsiliki G, Danilatou V, Tsiknakis M, et al.
(2009) Outcome prediction based on microarray analysis: a critical
perspective on methods. BMC bioinformatics 10: 53.
 
Sources, types and collection of data.pptx
drmadhulikakgmu
 
Chemical Burn, Etiology, Types and Management.pptx
Dr. Junaid Khurshid
 
12. Neurosurgery (part. 2) SURGERY OF VERTEBRAL COLUMN, SPINAL CORD AND PERIP...
Bolan University of Medical and Health Sciences ,Quetta
 

Controlling informative features for improved accuracy and faster predictions in omentum cancer models

In addition, filtering methods do not allow for interactions between biomarkers. An example of a popular filtering method is Student’s t-test [3]. In order to optimize the predictive power of a classification model, wrapper methods iteratively perform combinatorial biomarker search. Since this combinatorial optimization process is computationally complex, a NP-hard problem, many heuristics have been proposed, for example, to reduce the search space and thus reduce the computational burden of the biomarker selection [4]. With the exception of performing feature selection and classification simultaneously, embedded methods are similar to wrapper methods. Recursive feature elimination support vector machine (SVM-RFE) is a widely used technique for analysis of microarray data [5,6]. The SVM-RFE procedure constructs a Curr Trends Biomedical Eng & Biosci 1(2): CTBEB.MS.ID.555559 (2017) 001 Current Trends in Biomedical Engineering & Biosciences Abstract Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal for personalized medicine. However, current machine learning approaches are either too complex or perform poorly. Here, a novel feature detection and engineering machine-learning framework is presented to address this need. First, the Rip Curl process is applied which generates a set of 10 additional features. Second, we rank all features including the Rip Curl features from which the top-ranked will most likely contain the most informative features for prediction of the underlying biological classes. The top-ranked features are used in model building. This process creates for more expressive features which are captured in models with an eye towards the model learning from increasing sample amount and the accuracy/time results. The performance of the proposed Rip Curl classification framework was tested on omentum cancer data. 
Rip Curl outperformed other more sophisticated classification methods in terms of prediction accuracy, minimum number of classification markers, and computational time. Keywords: Omentum, Cancer, Data science, Machine learning, Biomarkers, Phenotype, Personalized medicine
  • 2. How to cite this article: Damian R M. Controlling Informative Features for Improved Accuracy and Faster Predictions in Omentum Cancer Models. Curr Trends Biomedical Eng & Biosci. 2017; 1(2): 555559. 002 Current Trends in Biomedical Engineering & Biosciences classification model using all available features, with the least informative features being pruned from the model.This process continues iteratively until a model has learned the minimum number of features that are useful. In the case of “-omics” data, this process becomes impractical when considering a large feature space. Our research used a hybrid approach between user and machine that dramatically reduced the computational time required by similar approaches while increasing prediction accuracy when comparing other state-of-the art machine learning techniques. Our proposed framework includes A. Ranking and pruning attributes using information gain to extract the most informative features and thus greatly reducing the number of data dimensions; B. A user to view histograms on attributes where the information gain is 0.80 or higher and creating new binary features from continuous values; C. Re-ranking both the original features and the newly constructed features; and D. Using the number of instances to determine how many top-n informative features should be used in modeling. The Rip Curl framework can be used to construct a high- dimensional classification model that takes into account dependencies among the attributes for analysis of complex biological -omics datasets containing dependencies of features. The performance of the proposed four-step classification framework was evaluated using datasets from microarray. The proposed framework was compared with SVM-RFE in terms of area under the ROC curve (AUC) and the number of informative biomarkers used for classification. 
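The four steps above (information-gain ranking, user-driven binning of highly informative features, re-ranking, and top-n selection) can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes scikit-learn, uses mutual information as a stand-in for the information-gain filter, and the synthetic data and 0.5 binning threshold are invented for the example.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # stand-in for a microarray matrix
y = (X[:, 3] > 0.5).astype(int)         # synthetic phenotype label

# Step A: rank features by an information-theoretic score and prune.
scores = mutual_info_classif(X, y, random_state=0)
ranked = np.argsort(scores)[::-1]

# Step B: for highly informative features, a user would inspect histograms
# and binarize at a visually chosen threshold (the cutoff here is illustrative).
top = ranked[0]
X_bin = np.hstack([X, (X[:, [top]] > 0.5).astype(float)])

# Step C: re-rank the original plus engineered features.
scores2 = mutual_info_classif(X_bin, y, random_state=0)
reranked = np.argsort(scores2)[::-1]

# Step D: keep the top-n features, with n tied to the instance count (1% rule).
n = max(1, int(X_bin.shape[0] * 0.01))
X_final = X_bin[:, reranked[:n]]
print(X_final.shape)
```

The reduced matrix X_final is what would then be passed to the classifier in the modeling stage.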
Results and Discussion

Using the omentum dataset, we conducted the Rip Curl process of setting the target feature (in this case one-versus-all), characterized the target variable, loaded and prepared the omentum data, saved the target and partitioning information, analyzed the omentum features, created cross-validation and hold-out partitions, and conducted exploratory data analysis.

In an effort to increase performance and accuracy, we opted for a feature selection approach to help reduce the number of descriptive features in the omentum dataset to just the subset that is most useful for prediction. Before discussing approaches to feature selection, it is useful to distinguish between different types of descriptive features (Table 1).

Table 1: Different types of descriptive features.
Predictive: A predictive descriptive feature provides information that is useful in estimating the correct value of a target feature.
Interacting: By itself, an interacting descriptive feature is not informative about the value of the target feature. In conjunction with one or more other features, however, it becomes informative.
Redundant: A descriptive feature is redundant if it has a strong correlation with another descriptive feature.
Irrelevant: An irrelevant descriptive feature does not provide information that is useful in estimating the value of the target feature.

The goal of any feature selection approach is to identify the smallest subset of descriptive features that maintains overall model performance. Ideally, a feature selection approach will return the subset of features that includes the predictive and interacting features while excluding the irrelevant and redundant ones. Using conventional methods, it is not efficient to find the ideal subset of descriptive features with which to train an omentum model. Consider d features: there are 2^d different possible feature subsets, which is far too many to evaluate unless d is very small.
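The combinatorial blow-up is easy to verify directly, since Python's integers have arbitrary precision:

```python
# The number of possible feature subsets for d features is 2^d.
d_small = 20
print(2 ** d_small)            # 1,048,576 subsets for just 20 features

# At microarray scale the count is astronomically large:
d_omics = 10_960
digits = len(str(2 ** d_omics))
print(digits)                  # the subset count has 3,300 decimal digits
```

Exhaustive subset evaluation is therefore hopeless at this scale, which is what motivates the ranking-and-pruning strategy.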
For example, with the descriptive features represented in the omentum dataset, there are 2^10,960 possible feature subsets, a number roughly 3,300 digits long.

Material and Methods

Datasets

The datasets used in the experiments were provided by the Gene Expression Machine Learning Repository (GEMLeR) [7]. GEMLeR contains microarray data from 9 different tissue types: colon, breast, endometrium, kidney, lung, omentum, ovary, prostate, and uterus. Each microarray sample is classified as tumor or normal. The data from this repository were collated into 9 one-tissue-type versus all-other-types (OVA) datasets, where the second class is labeled as "other." All GEMLeR microarray datasets have been analyzed by SVM-RFE, the results of which are available from the same resource.

Methods

Figure 1: The Rip Curl framework, with each stage depending on the prior stage.

Figure 1 demonstrates the Rip Curl framework and its dependencies on the prior stage. In applying the Rip Curl framework, we initially ran the omentum microarray data through to gain informative feature feedback and then ranked those features from most to least important. We confirmed the number of instances available in the dataset and then applied 1% (1,545 instances × 0.01 ≈ 15 features) to decide how many top informative features we would make use of in our framework. Where features were both in the top 1% and expressed informativeness at or above 80%, we created unique features that followed meaningful thresholds, grouping biomarker data into bins of "0" or "1". Finally, we sent the enhanced data back through the informative feature test to reduce the feature space to the top 1%, removed all other features, and modeled this subset using Random Forest (Entropy).

Our general approach to model selection was to run several model types and to select the best performer based on the highest AUC from the cross-validation results. Once those models were selected, we confirmed that they were learning models by training on sample sizes of 16%, 32%, and 64% of the data, as reported in Figure 2. If performance did not increase with each sample size, the model was discarded as a non-learning model (Figure 2).

Figure 2: A demonstration of a proper learning model.

We chose the area under the ROC curve for its immediate interpretability and calculated it as follows:

ROC index = (1/2) × Σ_{i=2..|T|} (FPR(T[i]) − FPR(T[i−1])) × (TPR(T[i]) + TPR(T[i−1]))   (1)

where T is the set of thresholds, |T| is the number of thresholds tested, and TPR(T[i]) and FPR(T[i]) are the true positive and false positive rates at threshold i, respectively. The ROC index is quite robust in the presence of imbalanced data, which makes it a common choice for practitioners, especially when multiple modeling techniques are being compared to one another. In addition, we ran a second evaluation measure, the Gini Norm, which is calculated as follows:

Gini coefficient = (2 × ROC index) − 1   (2)

The Gini coefficient can take values in the range (0, 1), and higher values indicate better model performance.
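The trapezoidal ROC index and the Gini coefficient of Eq. (2) can be checked numerically. The sketch below assumes scikit-learn for the threshold sweep and compares the hand-rolled trapezoidal sum against the library's own AUC; the toy labels and scores are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def roc_index(y_true, scores):
    """Trapezoidal ROC index over the threshold sweep."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    # sum of (FPR[i] - FPR[i-1]) * (TPR[i] + TPR[i-1]) / 2 over thresholds
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_index(y_true, scores)
gini = 2 * auc - 1                     # Eq. (2): Gini coefficient
print(auc, gini)                       # 0.75 0.5
```

Because the Gini coefficient is a linear rescaling of the ROC index, ranking models by either measure gives the same ordering.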
In our experiment with the omentum microarray data, we wanted to pay particular attention to reducing complexity and thereby improving time-to-prediction. This is especially true with an M × N dimensional dataset, where M is the number of samples and N the number of features, and more specifically where N is orders of magnitude greater than M, as is the case in our experiment.

We selected Random Forest to represent the general technique of random decision forests, an ensemble learning method for classification, regression, and other tasks. Random Forest operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set. Figure 3 demonstrates visually the increase in predictive accuracy as it relates to the complexity of that prediction.

Figure 3: Comparison of Rip Curl vs other methods.

Table 2: Comparative analysis of model performance. All models are Random Forest (Entropy).
Data                  Variables  AUC (Val)  AUC (CV)  AUC (Hold-Out)  Gini (Val)  Gini (CV)  Gini (Hold-Out)  Time (ms)
Raw Features          10,935     0.9520     0.9427    0.9269          0.9040      0.8855     0.8537           10,859.25
Univariate Features    8,165     0.9592     0.9492    0.9232          0.9184      0.8984     0.8463            8,233.39
Informative Features   8,283     0.9520     0.9427    0.9269          0.9040      0.8855     0.8537           10,732.36
Top 1% of Features        15     0.9379     0.9201    0.9344          0.8757      0.8401     0.8689            3,374.21

Table 2 emphasizes the disparity in prediction and time results while holding the model type, Random Forest (Entropy), constant. We observed that Rip Curl (Top 1% of Features) made use of the best parameters, which we found to be max_depth: None, max_features: 0.2, max_leaf_nodes: 50, min_samples_leaf: 5, and min_samples_split: 10. Rip Curl improved time-to-prediction by 59.02% to 68.93%, increased the hold-out AUC by 0.81% to 1.21%, and increased the hold-out Gini Norm by 1.78% to 2.67%.

Our framework makes use of Claude Shannon's entropy formula, which defines a computational measure of the impurity of the elements in a set. Shannon's idea of entropy is a weighted sum of the logs of the probabilities of each possible outcome when we make a random selection from a set. The weights used in the sum are the probabilities of the outcomes themselves, so that outcomes with high probabilities contribute more to the overall entropy of a set than outcomes with low probabilities. Shannon's entropy is defined as

H(t) = − Σ_{i=1..l} P(t=i) × log_s(P(t=i))   (3)

where P(t=i) is the probability that the outcome of randomly selecting an element t is of type i, l is the number of different types of things in the set, and s is an arbitrary logarithmic base (which we selected as 2) (Shannon, 1948).

Once we established the informativeness of each feature, we visually explored the histograms of each variable that expressed an informative value of ≥80%. Below is an example of the histogram for 206067_s_at, indicating where the target's signal was most concentrated within this biomarker (Figure 4).

Figure 4: Histogram of gene 206067_s_at in raw expression.

In an effort to concentrate the omentum tissue signal, we generated a thresholding rule on the raw expression values of 206067_s_at, which rendered a new binary feature and an additional histogram (Figure 5), allowing us to pass a different, possibly more understandable context to our algorithm.

Figure 5: Demonstration of Rip Curl feature engineering using visual thresholds.
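The entropy-based informativeness measure described in this section takes only a few lines to implement; the sample sets below are illustrative:

```python
import math
from collections import Counter

def shannon_entropy(elements, base=2):
    """H(t) = -sum over types i of P(t=i) * log_s(P(t=i))."""
    counts = Counter(elements)
    n = len(elements)
    # Each outcome's probability weights its own log-probability.
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

print(shannon_entropy(["a", "a", "b", "b"]))   # maximal two-class impurity: 1.0 bit
print(shannon_entropy(["a", "a", "a", "a"]))   # a pure set: 0.0 bits
```

With base 2 the result is measured in bits; a 50/50 split of two classes yields exactly one bit, while a pure set yields zero, which is what makes entropy a natural impurity measure for ranking features.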
We repeated this process, applying rules based on our observation of the training data. Some additional engineered features were simple descriptive statistics, such as Min and Mode, while others were a bit unconventional, defined in terms of a gene feature X and the position i of that feature within the feature index. Binsum was another engineered feature, computed simply as

Binsum = Σ_i bin_i   (4)

where bin_i is one of the generated binary features and i is the index of the bin within the omentum training data.

In an effort to develop greater context for the omentum data and the new features that were engineered, we analyzed key values and their respective informativeness (or importance) [8] (Table 3).

Table 3: Rip Curl feature engineering statistics.
Feature Name       Importance  Unique  Missing  Mean   SD    Median  Min   Max
Binsum             92.88       9       0        1.46   2.07   1      0       8
1800_206067_s_at   84.93       2       0        0.15   0.35   0      0       1
400_216953_s_at    82.15       2       0        0.14   0.34   0      0       1
1100_219454_at     82.09       2       0        0.22   0.42   0      0       1
1300_214844_s_at   78.42       2       0        0.13   0.34   0      0       1
4000_227195_at     77.04       2       0        0.21   0.41   0      0       1
1100_213518_at     76.07       2       0        0.31   0.46   0      0       1
3500_204457_s_at   75.71       2       0        0.21   0.40   0      0       1
900_219778_at      71.29       2       0        0.09   0.29   0      0       1
Lensum             55.36       3       0        13.82  2.98  16      8      16
Mode               53.04       837     0        214    413   19.80   1.60  4152
Min                51.85       81      0        242    1.41   2.20   0.20  15.40

Figure 6: Variable importance rank generated from the Rip Curl framework.

Figure 6 demonstrates visually the variable importance of the final Rip Curl model. We observed that 20% of the top informative features were generated through the Rip Curl framework: Binsum, >1800_206067_s_at, and >400_216953_s_at, with a range of importance between 27% and 93%.

GEMLeR provides an initial classification accuracy value for the omentum dataset. In their experiments designed to generate the state-of-the-art benchmark, all measurements were performed using the WEKA machine learning environment. They opted to make use of one of the most popular machine learning methods for gene expression analysis, the Support Vector Machines Recursive Feature Elimination (SVM-RFE) feature selection algorithm. Their evaluation (feature selection plus classification) was done inside a 10-fold cross-validation loop on the omentum dataset to avoid so-called selection bias [9]; Figure 7 demonstrates their approach.

Figure 7: SVM-RFE process.

Head-to-Head Comparison with SVM-RFE

Table 4: Comparison results of the international benchmark and Rip Curl.
Model                 AUC
SVM-RFE (Benchmark)   0.703
Rip Curl              0.934

Table 4 compares the SVM-RFE benchmark established in [10] with the Rip Curl framework. Rip Curl represents a 32.92% gain in prediction accuracy over the GEMLeR benchmark for the same omentum dataset [11].
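The binning and Binsum engineering described above can be sketched as follows. The toy expression matrix, the cutoffs, and the column ordering are illustrative stand-ins; in the paper the actual cutoffs came from visual inspection of the histograms:

```python
import numpy as np

# Toy raw expression matrix: rows are samples, columns are probes.
raw = np.array([
    [2100.0,  350.0],
    [1500.0,  900.0],
    [1900.0,  420.0],
], dtype=float)
thresholds = [1800.0, 400.0]   # e.g. ">1800" and ">400" visual cutoffs

# Each binary feature flags the samples whose raw expression exceeds a cutoff.
bins = np.column_stack([
    (raw[:, j] > t).astype(int) for j, t in enumerate(thresholds)
])

# Binsum: the per-sample sum over all generated binary features.
binsum = bins.sum(axis=1)
print(bins.tolist(), binsum.tolist())
```

Binsum thus counts, per sample, how many of the visual threshold rules fire, collapsing several binary signals into one ordinal feature.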
Conclusion

The Rip Curl classification framework outperformed the state-of-the-art benchmark (SVM-RFE) in the GEMLeR omentum cancer experiment. Because the Rip Curl classification framework utilizes entropy-based feature filtering and adds more context through feature engineering, its complexity is very low, permitting analysis of data with many features. Future research would suggest comparisons beyond the omentum cancer data and exploration of the other one-versus-all experiments in the areas of breast, colon, endometrium, kidney, lung, ovary, prostate, and uterus [12-14].

Acknowledgment

We would like to acknowledge GEMLeR for making this important dataset available to researchers and WPC Healthcare for supporting our work. Finally, the authors would like to thank the donors who participated in this study.

References

1. Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y (2010) Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics 26(3): 392-398.
2. Mingle D (2015) A discriminative feature space for detecting and recognizing pathologies of the vertebral column. International Journal of Biomedical Data Mining.
3. Leung Y, Hung Y (2010) A multiple-filter-multiple-wrapper approach to gene selection and microarray data classification. IEEE/ACM Trans Comput Biol Bioinform 7(1): 108-117.
4. Mohammadi A, Saraee MH, Salehi M (2011) Identification of disease-causing genes using microarray data mining and Gene Ontology. BMC Medical Genomics 4(1): 1.
5. Ding Y, Wilkins D (2006) Improving the performance of SVM-RFE to select genes in microarray data. BMC Bioinformatics 7(Suppl 2): S12.
6. Balakrishnan S, Narayanaswamy R, Savarimuthu N, Samikannu R (2008) SVM ranking with backward search for feature selection in type II diabetes databases. IEEE pp. 2628-2633.
7. Stiglic G, Kokol P (2010) Stability of ranked gene lists in large microarray analysis studies. BioMed Research International 2010: ID 616358.
8. Breiman L (2001) Random forests. Machine Learning 45(1): 5-32.
9. Ambroise C, McLachlan GJ (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci U S A 99(10): 6562-6566.
10. Duan K, Rajapakse JC (2004) SVM-RFE peak selection for cancer classification with mass spectrometry data. In APBC pp. 191-200.
11. Hu ZZ, Huang H, Wu CH, Jung M, Dritschilo A, et al. (2011) Omics-based molecular target and biomarker identification. Methods Mol Biol: 547-571.
12. Shannon CE (1948) A mathematical theory of communication. Bell System Tech J 27: 379-423.
13. Stiglic G, Rodriguez JJ, Kokol P (2010) Finding optimal classifiers for small feature sets in genomics and proteomics. Neurocomputing 73(13): 2346-2352.
14. Zervakis M, Blazadonakis ME, Tsiliki G, Danilatou V, Tsiknakis M, et al. (2009) Outcome prediction based on microarray analysis: a critical perspective on methods. BMC Bioinformatics 10: 53.