A comparative study of different measures used in filter method
to select relevant genes from microarray gene expression
presented by
Jayati Mitra
Content
Chapter 1. Introduction.
Chapter 2. Basic concepts of Bioinformatics and Molecular Biology
2.1 Introduction to Bioinformatics.
2.2 Goal of Bioinformatics.
2.3 Research area of Bioinformatics.
2.4 Real world application of Bioinformatics.
2.5 Introduction to Molecular Biology
2.6 Central dogma of Molecular Biology.
Chapter 3. DNA Microarray technology and Gene Expression Data.
Chapter 4. Literature Survey on feature selection techniques applied on Microarray Gene expression
data.
Chapter 5. Proposed Work.
5.1 Scoring function based feature selection.
5.2 Working Principle.
Chapter 6. Result Analysis.
Chapter 7. Conclusion and future scope.
References.
Chapter 1. Introduction
• Gene expression is the process of transcribing a gene's DNA sequence
into RNA. A gene's expression level indicates the approximate number of
copies of that gene's RNA produced in a cell, and it is correlated with the
amount of the corresponding protein made.
• A DNA microarray (also commonly known as a DNA chip or biochip) is a
collection of microscopic DNA spots attached to a solid surface. Scientists
use DNA microarrays to measure the expression levels of large numbers of
genes simultaneously or to genotype multiple regions of a genome.
• Classification is a form of data analysis that extracts models
describing important data classes. Such models are called classifiers. Data
classification consists of two steps:
1. The learning step, or training phase, in which a classifier is built
from the training dataset and its associated class labels.
2. The classification step, in which the classifier is applied
to classify unseen data.
• Feature selection is the process of selecting a subset of relevant, non-redundant features
from a dataset in order to improve the performance of classification algorithms in
terms of accuracy and time to build the model.
• Feature selection methods are classified into three categories: (1) filter, (2) wrapper
and (3) embedded. Filter methods evaluate a subset of genes by looking at the intrinsic
characteristics of the data with respect to the class labels.
• This thesis reviews the score functions that are used in filter methods.
Chapter 2. Basic concepts of Bioinformatics and Molecular Biology
2.1 Introduction to Bioinformatics:
Bioinformatics is the field of science in which biology, computer science, mathematics
and information technology merge into a single discipline.
The ultimate goal of the field is to enable the discovery of new biological insights and
to create a global perspective from which unifying principles in biology can be
discerned.
2.2 Goal of Bioinformatics:
• Bioinformatics has become more ambitious over time, aiming to revolutionize medicine by
making sequencing a diagnostic tool.
• The goal is to develop new approaches to eradicate diseases like cancer, and to pave
the way towards personalized medicine.
2.3 Research areas of bioinformatics:
• Sequence analysis: Sequence analysis is the most primitive operation in computational
biology. It consists of finding which parts of biological sequences are
alike and which parts differ during medical analysis and genome mapping.
• Genome annotation: In the context of genomics, annotation is the process of marking
the genes and other biological features in a DNA sequence.
• Analysis of gene expression: The expression of many genes can be determined by
measuring mRNA levels with techniques such as microarrays, expressed cDNA
sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag
sequencing, massively parallel signature sequencing (MPSS), or various applications of
multiplexed in-situ hybridization.
• Analysis of protein expression: Gene expression can be measured at both the
mRNA and the protein level; protein expression is one of the best clues to
actual gene activity, since proteins are usually the final catalysts of cell activity. Protein
microarrays and high-throughput (HT) mass spectrometry (MS) can provide a snapshot of
the proteins present in a biological sample.
• Analysis of mutations in cancer: In cancer, the genomes of affected cells are rearranged in
complex or even unpredictable ways.
Massive sequencing efforts are used to identify previously unknown point mutations in a
variety of genes in cancer.
Bioinformaticians continue to produce specialized automated systems to manage the sheer
volume of sequence data produced, and they create new algorithms and software to
compare the sequencing results to the growing collection of human genome sequences and
germ line polymorphisms.
New physical detection technologies are employed, such as oligonucleotide microarrays to
identify chromosomal gains and losses and single-nucleotide polymorphism arrays to detect
known point mutations.
• Protein structure prediction: The amino acid sequence of a protein (its primary
structure) can be easily determined from the sequence of the gene that codes for it. In most
cases, this primary structure uniquely determines the protein's structure in its native
environment.
2.4 Real world applications of bioinformatics:
Basic Research
• Functional genomics
• Evolutionary genomics
• Epigenomics
• Genome Wide Association Analysis (GWA)
• Genomics
• Proteomics
• Omic sciences
• Systems biology/ Systems genetics
• High Performance Computing (HPC)
Biomedicine
• Drug discovery
• Personalized medicine
• Preventive medicine
• Gene therapy
Microbiology
• Biotechnology
• Waste cleanup
• Climate change
• Alternative energy sources
• Antibiotic resistance
• Epidemiological studies
Agriculture
• Crops
• Insect resistance
• Improving nutritional quality
2.5 Introduction to Molecular Biology
Molecular Biology is the study of gene structure and function at molecular level to
understand the molecular basis of hereditary genetic variation and the expression patterns
of the genes.
Some common Molecular Biology techniques are:
• Electrophoresis – A process that separates molecules such as DNA or proteins
according to their size. Electrophoresis is a mainstay of molecular biology laboratories. It
can be used to identify molecules or fragments of molecules, and as a check that
the correct molecule is present.
• Cloning – The technique of introducing a new gene into a cell or organism. This can
be used to see what effect the expression of that gene has on the organism.
• Restriction digest – The process of cutting DNA into smaller fragments using
enzymes that act only at a particular genetic sequence.
• Ligation – The process of joining two pieces of DNA together. Ligation is useful when
introducing a new piece of DNA into another genome.
Cell:-The cell is the basic structural, functional, and biological unit of all known living
organisms. A cell is the smallest unit of life. Cells are often called the "building blocks of
life". The study of cells is called cell biology or cellular biology.
DNA:- Deoxyribonucleic acid or DNA is a molecule which contains all the hereditary
material and holds the instructions for building the proteins that are essential for our bodies
to function.
DNA contains four types of nitrogen-containing units called bases:
• Adenine (A)
• Cytosine (C)
• Guanine (G)
• Thymine (T)
The bases of the two strands of DNA are stuck together to create a ladder-like shape. Within
the ladder, A always sticks to T, and G always sticks to C to create the "rungs." The length of
the ladder is formed by the sugar and phosphate groups.
RNA: RNA stands for ribonucleic acid. RNA molecules are single-stranded nucleic acids
composed of nucleotides. RNA plays an important role in protein synthesis through the process
of translation.
There are basically three types of RNA involved in translation process:-
➢ Messenger RNA (mRNA)
➢ Transfer RNA (tRNA)
➢ Ribosomal RNA (rRNA)
Gene: A gene is the basic physical and functional unit of heredity. Genes are made up of
DNA. Some genes act as instructions to make molecules called proteins. In humans, genes
vary in size from a few hundred DNA bases to more than 2 million bases. The Human
Genome Project estimated that humans have between 20,000 and 25,000 genes.
Protein: Proteins are essential nutrients for the human body. They are one of the building
blocks of body tissue and can also serve as a fuel source. As a fuel, proteins provide as
much energy density as carbohydrates. The most important aspect of protein from a
nutritional standpoint is its amino acid composition.
2.6 Central Dogma of Molecular Biology:
Post-translational modification (PTM) refers to the covalent and
generally enzymatic modification of proteins following protein biosynthesis.
Proteins are synthesized by ribosomes translating mRNA into polypeptide chains,
which may then undergo PTM to form the mature protein product. PTMs are
important components in cell signaling, as for example when prohormones are
converted to hormones.
Splicing: The detailed splicing mechanism is quite complex. In short, it involves five
snRNAs and their associated proteins. These ribonucleoproteins form a large (60S) complex
called the spliceosome. Then, after a two-step enzymatic reaction, the intron is removed and
the two neighboring exons are joined together. The branch point A residue plays a critical
role in the enzymatic reaction.
Chapter 3. DNA Microarray Technology:
A DNA microarray is a collection of microscopic DNA spots attached to a solid
surface. Scientists use DNA microarrays to measure the expression levels of large
numbers of genes simultaneously or to genotype multiple regions of a genome.
DNA microarray technology provides biologists with the ability to measure the
expression levels of thousands of genes in a single experiment. Initial experiments
suggest that genes of similar function yield similar expression patterns in
microarray hybridization experiments.
DNA microarray technology has empowered the scientific community to
understand the fundamental aspects underlying the growth and development of
life, as well as to explore the genetic causes of anomalies occurring in the
functioning of the human body.
Design of DNA Microarray System
1. Sample preparation
2. Purification
3. Reverse transcription
4. Labelling: the cyanine dyes Cy3 and Cy5 are the predominant labels used in microarray analysis
5. Hybridization
6. Scanning
7. Normalization and analysis
Types of microarray
• DNA Microarray
– cDNA microarray
– Oligonucleotide arrays
• Protein microarray
– Analytical
– Functional
– Reverse phase
• Chemical compound arrays
– collection of organic chemical compounds spotted on a solid surface
• Carbohydrate arrays
– various oligosaccharides and/or polysaccharides immobilized on a solid
support in a spatially defined arrangement
• Cellular Microarrays
– spotted with varying materials, such as antibodies, proteins, or lipids,
which can interact with the cells, leading to their capture on specific
spots.
Chapter 4. Literature Survey on Feature selection techniques applied on
microarray Gene expression data:
Gene Expression: Gene expression is the process by which information from a gene is
used in the synthesis of a functional gene product. These products are often proteins,
but in non-protein coding genes such as transfer RNA (tRNA) or small nuclear RNA
(snRNA) genes, the product is a functional RNA.
The process of gene expression is used by all known life—eukaryotes (including
multicellular organisms), prokaryotes (bacteria and archaea), and utilized by viruses—to
generate the macromolecular machinery for life.
Several steps in the gene expression process may be modulated, including the
transcription, RNA splicing, translation, and post-translational modification of a protein.
Feature Selection for Classification
Feature selection mainly affects the training phase of classification. After
features are generated, instead of passing the data with all features directly to the
learning algorithm, feature selection for classification first selects a
subset of features and then passes the data with the selected features to the
learning algorithm.
The feature selection phase may be independent of the learning algorithm, as in
filter models, or it may iteratively use the performance of the learning
algorithm to evaluate the quality of the selected features, as in wrapper models.
Feature selection methods are grouped into the following categories:
a) Filter: Also called the open-loop method, it is the earliest approach. It examines
features based on their intrinsic characteristics prior to the learning task. A filter
algorithm principally measures feature characteristics using four types of
evaluation criteria: dependency, information, distance, and consistency. However,
filter methods ignore the interaction with the classifier and the possible interactions
among features (combined features may have a net effect that is not necessarily
reflected by the individual features in that group). This can also lead to varied prediction
performance when the selected features are applied to different learning algorithms.
Advantages and disadvantages of filter methods.
Filter method based works on microarray gene expression data: the table given below
shows some key references for the filter method of feature selection in
the microarray domain.
b) Wrapper: Also called the closed-loop method, it wraps the feature selection around the learning
algorithm and uses classification error rate or performance accuracy as the feature
evaluation criterion. It selects the most discriminative subset of features by minimizing the
prediction error of a particular classifier.
Advantages and disadvantages of wrapper methods.
Wrapper method based works on microarray gene expression data:
c) Embedded: This is a built-in feature selection mechanism that embeds the feature
selection in the learning algorithm and uses its properties to guide feature evaluation.
The embedded method is more efficient and computationally more tractable than the wrapper
method while maintaining similar performance. This is because the embedded method
avoids repeated execution of the classifier and examination of every feature subset.
Advantages and disadvantages of embedded methods.
Embedded method based works on microarray gene expression data:
d) Hybrid: Hybrid methods represent the latest developments in feature selection. A hybrid method can be
formed by combining two different methods (e.g. filter and wrapper), two methods
of the same criterion, or two feature selection approaches. It uses different evaluation
criteria in different search stages to improve efficiency and prediction performance
at a lower computational cost.
Advantages and disadvantages of hybrid methods:
Hybrid method based works on microarray gene expression data:
e) Ensemble: This method aims to construct a group of feature subsets
and then produce an aggregated result from the group. It is designed to tackle
the instability and perturbation issues in many feature selection algorithms. It
is based on different sub-sampling strategies, where a particular feature selection method
is run on a number of sub-samples and the obtained features are merged to form a more
stable subset.
Advantages and disadvantages of ensemble methods
Ensemble method based works on microarray gene expression data:
Chapter 5. Proposed work
5.1 Scoring Function Based Feature Selection
In this work the main objective is to study the scoring functions used in the filter
method.
• Filter methods are divided into two categories: univariate methods and multivariate
methods.
In the univariate scheme, each feature is ranked independently based on some
score function or measure, and then a given number of features are selected
according to their rank.
In the multivariate scheme, feature dependency is considered. Therefore, the
multivariate scheme is naturally capable of handling redundant features.
The score functions used to rank relevant genes in gene expression
data are discussed below.
Here, $K_{N \times M}$ is a gene expression data matrix containing $N$ objects (samples)
and $M$ features (genes).
• $X = \{X_1, X_2, \ldots, X_N\}$ is the set of samples.
• $f = \{f_1, f_2, \ldots, f_M\}$ is the set of features (genes).
• $C_{N \times 1}$ is a class vector containing the class value associated with every sample.
1. Mutual Information: The mutual information (MI) of two random variables is a measure
of the mutual dependence between them; here the two variables are features.
• $f_i$ and $f_s$ are individual features.
• The average normalized MI is used as a measure of redundancy between the i-th feature and the
subset of selected features $S = \{f_s\}$.
• $|S|$ is the cardinality of the set $S$.
To select the best features, MI is calculated using the equation
given below:
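The mutual information computation described above can be sketched as follows. This is a minimal illustration, assuming the standard definition I(X;Y) = sum over (x, y) of p(x,y) log2(p(x,y) / (p(x) p(y))) and assuming expression values have already been discretized (for example into "low" and "high"); the function name and the toy data are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information I(x; y) in bits between two discrete sequences."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), joint in count_xy.items():
        p_ab = joint / n
        # Compare the joint probability with the product of the marginals.
        mi += p_ab * log2(p_ab / ((count_x[a] / n) * (count_y[b] / n)))
    return mi


# Toy data (assumed): one gene that tracks the class and one that does not.
labels = [0, 0, 1, 1]
gene = ["low", "low", "high", "high"]
noise = ["low", "high", "low", "high"]

print(mutual_information(gene, labels))
print(mutual_information(noise, labels))
```

The same function can score a feature against the class vector (relevance) or against another feature (redundancy), which is how the two quantities mentioned in the bullets above would be obtained.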
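2. Symmetric Uncertainty:
This is a normalized form of mutual information, introduced by Witten and
Frank (2005). It is defined as below:
$$SU(X, Y) = \frac{2\, I(X; Y)}{H(X) + H(Y)}$$
where $I(X; Y)$ is the mutual information of $X$ and $Y$, and $H(\cdot)$ is the entropy.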
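A short sketch of symmetric uncertainty, assuming the usual normalisation of mutual information by the sum of the two entropies and discretized inputs. The helper names and the toy values are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU in [0, 1]: 0 for independent variables, 1 for fully dependent ones."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant, nothing to share
    # I(x; y) = H(x) + H(y) - H(x, y)
    mi = h_x + h_y - entropy(list(zip(x, y)))
    return 2.0 * mi / (h_x + h_y)


# Toy data (assumed).
labels = [0, 0, 1, 1]
gene = ["low", "low", "high", "high"]
noise = ["low", "high", "low", "high"]

print(symmetric_uncertainty(gene, labels))
print(symmetric_uncertainty(noise, labels))
```

Because the score is bounded between 0 and 1, it is easier to compare across genes with different numbers of discretization levels than raw mutual information.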
3. Information Gain:
• The information gain measure is based on the concept of entropy.
• It is commonly used as a measure of feature relevance in filter strategies that evaluate
features individually.
• Information gain (IG) measures the amount of information, in bits, about the class
prediction when the only information available is the presence of a feature and the
corresponding class distribution.
The entropy of a dataset $D$ is
$$H(D) = -\sum_{i} P_i \log_2 P_i$$
where $P_i$ is the marginal probability of class $c_i$.
Here, the dataset is partitioned with respect to a feature $f$ into parts $D_k^f$.
The information gain with respect to feature $f$ is given below:
$$IG(f) = H(D) - \sum_{k} \frac{|D_k^f|}{|D|}\, H(D_k^f)$$
where $|D_k^f|$ and $|D|$ represent the number of objects
present in $D_k^f$ and $D$ respectively.
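The partition-based computation can be illustrated with the sketch below. It assumes the feature has been discretized so that each distinct value defines one part of the dataset, and it uses the standard entropy-minus-weighted-entropy form of information gain; the names and data are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting on feature."""
    n = len(labels)
    parts = defaultdict(list)
    for value, label in zip(feature, labels):
        parts[value].append(label)  # one part per distinct feature value
    remainder = sum(len(part) / n * entropy(part) for part in parts.values())
    return entropy(labels) - remainder


# Toy data (assumed).
labels = [0, 0, 1, 1]
gene = ["low", "low", "high", "high"]
noise = ["low", "high", "low", "high"]

print(information_gain(gene, labels))
print(information_gain(noise, labels))
```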
4. Chi-square test:
• Chi-square is a statistical test commonly used to compare observed data with the data we
would expect to obtain according to a specific hypothesis.
• The chi-square test evaluates what scientists call the null hypothesis:
there is no significant difference between the expected and observed results.
• The formula for calculating chi-square ($\chi^2$) is:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where $O$ is an observed count and $E$ is the corresponding expected count.
• Here chi-square is calculated for every feature variable against the target variable
to test whether a relationship exists between the feature variable and the
target variable.
• In feature selection, the two events are the occurrence of a feature value and the occurrence of the
class.
• $\chi^2$ is a measure of how much the expected counts and observed counts diverge from each
other.
• A high value of $\chi^2$ indicates that the hypothesis of independence, which implies that
expected and observed counts are similar, is incorrect.
• If the two events are dependent, then the occurrence of the feature value makes the occurrence of
the class more likely (or less likely), so it should be helpful as a feature.
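As a rough illustration of the chi-square score for one feature, the sketch below builds the contingency table of feature value against class and sums (observed - expected)^2 / expected over its cells, with expected counts taken under independence. Discretized feature values are assumed, and the function name and data are hypothetical.

```python
from collections import Counter


def chi_square(feature, labels):
    """Chi-square statistic of a discrete feature against the class labels."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for value, value_count in feature_totals.items():
        for label, label_count in label_totals.items():
            # Expected count of this cell if feature and class were independent.
            expected = value_count * label_count / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


# Toy data (assumed).
labels = [0, 0, 1, 1]
gene = ["low", "low", "high", "high"]
noise = ["low", "high", "low", "high"]

print(chi_square(gene, labels))
print(chi_square(noise, labels))
```

Genes would then be ranked by decreasing statistic; turning the statistic into a p-value would additionally need the chi-square distribution, which is left out of this sketch.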
5. Gini Index
• It is a univariate and supervised feature weighting method.
• It is a measure for quantifying a feature's ability to distinguish between classes.
• Given c classes, GI of a feature f can be calculated as:
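One common way to compute a Gini-based score is sketched below. It is an assumption, not necessarily the exact formula used in the thesis: the score is the weighted Gini impurity (1 minus the sum of squared class proportions) of the class labels after splitting the samples by discretized feature value, so a lower score means the feature separates the classes better. All names and data are hypothetical.

```python
from collections import Counter, defaultdict


def gini_impurity(labels):
    """1 - sum of squared class proportions; 0 means a pure group."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_index(feature, labels):
    """Weighted Gini impurity of the labels after splitting on the feature."""
    n = len(labels)
    parts = defaultdict(list)
    for value, label in zip(feature, labels):
        parts[value].append(label)
    return sum(len(part) / n * gini_impurity(part) for part in parts.values())


# Toy data (assumed).
labels = [0, 0, 1, 1]
gene = ["low", "low", "high", "high"]
noise = ["low", "high", "low", "high"]

print(gini_index(gene, labels))   # lower is better under this convention
print(gini_index(noise, labels))
```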
6. Relief
• Relief is a filter-method approach to feature selection that is notably sensitive to feature interactions.
• Relief calculates a score for each feature, which can then be used to rank features and
select the top-scoring ones.
• Relief feature scoring is based on the identification of feature value differences between
nearest-neighbor instance pairs.
• If a feature value difference is observed in a neighboring instance pair with the same class
(a 'hit'), the feature score decreases.
• If a feature value difference is observed in a neighboring instance pair with different class
values (a 'miss'), the feature score increases.
For a sampled instance $x$, Relief finds its nearest instance from the same class (the nearest hit, $x_{hit}$) and its
nearest instance from a different class (the nearest miss, $x_{miss}$), and $W(f_t)$ is increased or decreased accordingly:
$$W(f_t) \leftarrow W(f_t) - \frac{\mathrm{diff}(f_t, x, x_{hit})}{m} + \frac{\mathrm{diff}(f_t, x, x_{miss})}{m} \qquad (11)$$
where $m$ is the number of sampled instances and $\mathrm{diff}$ is the normalized difference in the value of feature $f_t$ between two instances.
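The hit and miss update can be sketched as below. This is a simplified illustration of basic Relief under stated assumptions: numeric features, a Manhattan distance on range-normalised values, and a deterministic pass over every sample instead of random sampling. The function name and data are hypothetical.

```python
def relief_scores(samples, labels):
    """Basic Relief weights for numeric features (one weight per feature)."""
    n = len(samples)
    n_features = len(samples[0])
    # Feature ranges, used to normalise value differences to [0, 1].
    ranges = []
    for j in range(n_features):
        column = [row[j] for row in samples]
        ranges.append((max(column) - min(column)) or 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / ranges[j]

    def distance(a, b):
        return sum(diff(j, a, b) for j in range(n_features))

    weights = [0.0] * n_features
    for i in range(n):
        hits = [k for k in range(n) if k != i and labels[k] == labels[i]]
        misses = [k for k in range(n) if labels[k] != labels[i]]
        if not hits or not misses:
            continue  # no same-class or other-class neighbour available
        hit = samples[min(hits, key=lambda k: distance(samples[i], samples[k]))]
        miss = samples[min(misses, key=lambda k: distance(samples[i], samples[k]))]
        for j in range(n_features):
            # A difference to the nearest miss raises the weight,
            # a difference to the nearest hit lowers it.
            weights[j] += (diff(j, samples[i], miss) - diff(j, samples[i], hit)) / n
    return weights


# Toy data (assumed): feature 0 separates the classes, feature 1 does not.
samples = [[1.0, 5.0], [1.2, 1.0], [3.0, 4.8], [3.1, 1.2]]
labels = [0, 0, 1, 1]
weights = relief_scores(samples, labels)
print(weights)
```

Visiting every sample keeps the example reproducible; the original algorithm draws a random subset of instances, which matters mainly for large datasets.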
7. Fisher Score:
• It is a supervised and univariate feature weighting method.
• It favours features that take similar values for samples from the same class
and different values for samples from different classes.
The Fisher score can be expressed as:
$$F(f_i) = \frac{\sum_{j} n_{c_j} \left(\mu_{f_i c_j} - \mu_{f_i}\right)^2}{\sum_{j} n_{c_j}\, \rho_{f_i c_j}}$$
• where $\mu_{f_i c_j}$ and $\rho_{f_i c_j}$ are the mean and variance of the i-th feature in class $c_j$,
• $\mu_{f_i}$ is the mean of the i-th feature,
• $n_{c_j}$ is the number of samples of class $c_j$.
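A minimal sketch of the Fisher score for a single gene, assuming the common form in which the class-size-weighted squared distance of each class mean from the overall mean is divided by the class-size-weighted sum of within-class variances. The function name and the toy expression values are hypothetical.

```python
from collections import defaultdict


def fisher_score(values, labels):
    """Between-class scatter divided by within-class scatter for one feature."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    overall_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for members in groups.values():
        n_c = len(members)
        mean_c = sum(members) / n_c
        var_c = sum((v - mean_c) ** 2 for v in members) / n_c  # population variance
        between += n_c * (mean_c - overall_mean) ** 2
        within += n_c * var_c
    # A feature with no within-class spread separates the classes perfectly.
    return between / within if within > 0 else float("inf")


# Toy data (assumed).
labels = [0, 0, 1, 1]
informative = [1.0, 2.0, 5.0, 6.0]
uninformative = [1.0, 5.0, 2.0, 6.0]

print(fisher_score(informative, labels))
print(fisher_score(uninformative, labels))
```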
8. T-test:
• It measures the relationship between two samples statistically by comparing their mean
values.
• It calculates the ratio between the difference of the two class means and the variability of the two classes:
$$t(f_t) = \frac{\bar{f}_{t1} - \bar{f}_{t2}}{\sqrt{\dfrac{S_{t1}^2}{n_1} + \dfrac{S_{t2}^2}{n_2}}} \qquad (13)$$
$\bar{f}_{t1}$ is the mean of the sample values of feature $f_t$ for class 1.
$\bar{f}_{t2}$ is the mean of the sample values of feature $f_t$ for class 2.
$S_{t1}$ and $S_{t2}$ are the standard deviations of the sample values of feature $f_t$ for class 1
and class 2 respectively.
$n_1$ represents the number of samples of class 1 and $n_2$ represents the number of
samples of class 2.
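The t-score for one gene in a two-class problem might be computed as in the sketch below. It assumes the unequal-variance (Welch) form, with sample variances divided by the class sizes under the square root; the function name and data are hypothetical, and each class needs at least two samples with non-zero overall variability.

```python
from math import sqrt
from statistics import mean, variance


def t_score(values, labels, class_a, class_b):
    """Two-sample t statistic of one feature between two classes."""
    a = [v for v, label in zip(values, labels) if label == class_a]
    b = [v for v, label in zip(values, labels) if label == class_b]
    # statistics.variance is the sample variance (divides by n - 1).
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Toy data (assumed).
labels = [0, 0, 0, 1, 1, 1]
expression = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]

t_value = t_score(expression, labels, 0, 1)
print(t_value)
```

For ranking genes, the absolute value of the statistic is normally used, since a gene that is strongly under-expressed in one class is as informative as one that is over-expressed.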
5.2 Working Principle:
Algorithm: Filter
1. Choose k, the number of genes to be selected by each filter.
2. For each filter FTi (i = 1, 2, 3):
a) Calculate the statistical scores for all genes and rank the scores from the highest
to the lowest.
b) Select the k genes with the top-ranking scores in each list.
3. Take the union of the lists of genes obtained by FTi (i = 1, 2, 3) to produce a set of p features.
Flowchart of filter-based feature selection
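The three steps of the algorithm can be sketched as follows. The score tables stand in for the output of three filters; their values, the gene names and both function names are made up for illustration, and higher scores are assumed to be better for every filter.

```python
def top_k_genes(scores, k):
    """Return the k gene names with the highest scores (step 2)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def union_of_filters(score_tables, k):
    """Union of the top-k lists of all filters, in first-seen order (step 3)."""
    selected = []
    for scores in score_tables:
        for gene in top_k_genes(scores, k):
            if gene not in selected:
                selected.append(gene)
    return selected


# Hypothetical scores produced by three filters FT1, FT2 and FT3.
filter_scores = [
    {"g1": 0.9, "g2": 0.7, "g3": 0.1, "g4": 0.3},
    {"g1": 0.2, "g2": 0.8, "g3": 0.9, "g4": 0.1},
    {"g1": 0.6, "g2": 0.1, "g3": 0.2, "g4": 0.7},
]

selected_genes = union_of_filters(filter_scores, k=2)
print(selected_genes)  # p = len(selected_genes) features
```

Since the lists overlap, p lies between k and 3k. Filters where a lower score is better (for example an impurity measure) would need their scores negated before ranking.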
Chapter 6 Result Analysis
The score functions are applied to different microarray gene expression
datasets. The top 100 genes are selected by each measure, and the
classification accuracy of the samples is checked using those genes for every measure. A
KNN classifier is used to check classification accuracy with the leave-one-out cross
validation method.
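The evaluation procedure can be illustrated with a small pure-Python sketch of k-nearest-neighbour classification under leave-one-out cross validation. The value of k, the Euclidean distance and the toy data are assumptions; the slides do not state which settings were used.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    neighbours = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def loocv_accuracy(samples, labels, k=3):
    """Hold out each sample once, train on the rest, return accuracy in percent."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if knn_predict(train_x, train_y, samples[i], k) == labels[i]:
            correct += 1
    return 100.0 * correct / len(samples)


# Toy data (assumed): two selected genes, two well-separated classes.
samples = [[1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]
labels = [0, 0, 0, 1, 1, 1]

accuracy = loocv_accuracy(samples, labels, k=3)
print(accuracy)
```

In a study like this one, the feature ranking should ideally be recomputed inside each leave-one-out fold so that the held-out sample does not influence which genes are selected.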
Table 1. Dataset Description
Table 2. PERFORMANCE OF DIFFERENT MEASURES ON BREAST CANCER DATASET USING KNN
Measure                   Classification accuracy (%)
Fisher Score              81.6
T-test                    81.6
Chi-square                83.7
Symmetric Uncertainty     85.7
Information Gain          89.8
Gini Index                85.7
Mutual Information        89.8
Relief                    89.8
Table 3. PERFORMANCE OF DIFFERENT MEASURES ON COLON CANCER DATASET USING KNN
Measure                   Classification accuracy (%)
Fisher Score              72.6
T-test                    72.6
Chi-square                83.9
Symmetric Uncertainty     83.9
Information Gain          90.3
Gini Index                85.5
Mutual Information        90.3
Relief                    85.5
Chapter 7 Conclusion and Future Scope:
This project focuses on the filter approach to selecting the most relevant features,
based on a study of existing univariate scoring functions.
Analysing the results obtained from these scoring functions makes a
comparative study possible.
It also helps us to analyse the methodology for selecting the more relevant features and
removing irrelevant ones.
In future, this work can be extended to the multivariate scheme using filter
based scoring functions.
References
[1] E. Klipp, R. Herwig, A. Kowald, C. Wierling, H. Lehrach, "Analysis of Gene Expression
Data", ISBN: 3-527-31078-9.
[2] C. Lavanya, M. Nandihini, R. Niranjana, C. Gunavathi (2014), "Classification of
Microarray Data Based On Feature Selection Method", An ISO 3297: 2007 Certified
Organization, Volume 3, Special Issue 1, page 126.
[3] Rabia Aziz, C.K. Verma, and Namita Srivastava, "Dimension reduction methods for
microarray data: a review", DOI: 10.3934/bioeng.2017.1.179.
[4] Ang Jun Chin, Andri Mirzal, Habibollah Haron, Haza Nuzly
Abdull Hamed, "Supervised, Unsupervised and Semi-supervised Feature Selection: A
Review on Gene Selection", DOI: 10.1109/TCBB.2015.2478454, pages 6-9.
[5] Alejandra J. Magana, Manaz Taleyarkhan, Daniela Rivera Alvarado, Michael Kane, John
Springer, and Kari Clase, "A Survey of Scholarly Literature Describing the Field of
Bioinformatics Education and Bioinformatics Educational Research", DOI: 10.1187/cbe.13-10-0193.
[6] P. D. Karp, "An ontology for biological function based on molecular interactions",
Bioinformatics, 16(3):269–285, 2000.
[7] R. Bals, B. Jany, "Identification of disease genes by expression profiling", Eur Respir
J. 2001 Nov;18(5):8829.
[8] Rajeshwar Govindarajan, Jeyapradha Duraiyan, Karunakaran Kaliyappan, Murugesan
Palanisamy, "Microarray and its applications", DOI: 10.4103/0975-7406.100283, Jan 2012,
page S311.
[9] Yvan Saeys, Iñaki Inza and Pedro Larrañaga, "A review of feature selection
techniques in bioinformatics", DOI: 10.1093/bioinformatics/btm344, June 25, 2007, pages
2508-2514.
[10] Carmen Lai, Marcel JT Reinders, Laura J van't Veer and Lodewyk FA Wessels, "A
comparison of univariate and multivariate gene selection techniques for classification of
cancer datasets", doi.org/10.1186/1471-2105-7-235, May 2006.
[11] Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach
Learn Res 3: 1157–1182.
[12] Xiong, M., Fang, Z., & Zhao, J. (2001). Biomarker identification by feature
wrappers, Genome Research, 11, 1878-1887.
[13] Jingwen Hu, Clifford C. Chou, King H. Yang, Albert I. King, "A weighted logistic
regression analysis for predicting the odds of head/face and
neck injuries during rollover crashes", 2007; 51: 363–379.
[14] Wei Pan, "A comparative review of statistical methods for discovering differentially expressed
genes in replicated microarray experiments", Bioinformatics, Volume 18, Issue
4, April 2002, Pages 546–554, https://doi.org/10.1093/bioinformatics/18.4.546.
[15] Somol P, Pudil P, Novovičová J, et al. (1999) Adaptive floating search methods in
feature selection. Pattern Recogn Lett 20: 1157–1163.
[16] Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput
Electr Eng 40:16–28.
[17] F. Nie, H. Huang, X. Cai, and C. H. Ding, “Efficient and Robust Feature Selection
via Joint ℓ2,1-Norms Minimization,” in Advances in Neural Information
Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel,
and A. Culotta, Eds. Curran Associates, Inc., 2010, pp. 1813–1821.
[18] S. Xiang, F. Nie, G. Meng, C. Pan, and C. Zhang, “Discriminative Least Squares
Regression for Multiclass Classification and Feature Selection,” IEEE Trans. Neural
Netw. Learn. Syst., vol. 23, no. 11, pp. 1738–1754, Nov. 2012.
[19] H. Pang, S. L. George, K. Hui, and T. Tong, “Gene Selection Using Iterative
Feature Elimination Random Forests for Survival Outcomes,” IEEE/ACM Trans.
Comput. Biol. Bioinforma., vol. 9, no. 5, pp. 1422–1431, Sep. 2012.
[20] Q. Hu, W. Pan, S. An, P. Ma, and J. Wei, “An efficient gene selection technique
for cancer recognition based on neighborhood mutual information,” Int. J. Mach.
Learn. Cybern., vol. 1, no. 1–4, pp. 63–74, Dec. 2010.
[21] P. Saengsiri, S. N. Wichian, P. Meesad, and U. Herwig, “Comparison of hybrid
feature selection models on gene expression data,” in Knowledge Engineering, 2010
8th International Conference on ICT and, 2010, pp. 13–18.
[22] Y. Leung and Y. Hung, “A Multiple-Filter-Multiple-Wrapper Approach to Gene
Selection and Microarray Data Classification,” IEEE/ACM Trans. Comput. Biol.
Bioinforma., vol. 7, no. 1, pp. 108–117, Jan. 2010.
[23] H. Liu, L. Liu, and H. Zhang, “Ensemble gene selection by grouping for
microarray data classification,” J. Biomed. Inform., vol. 43, no. 1, pp. 81–87, Feb.
2010.
[24] P. Yang, B. B. Zhou, Z. Zhang, and A. Y. Zomaya, “A multi-filter enhanced
genetic ensemble system for gene selection and sample classification of microarray
data,” BMC Bioinformatics, vol. 11, no. Suppl 1, p. S5, Jan. 2010.
[25] M. A. Gaafar, N. A. Yousri, and M. A. Ismail, “A novel ensemble selection
method for cancer diagnosis using microarray datasets,” in IEEE 12th International
Conference on BioInformatics and BioEngineering, BIBE 2012, 2012, pp. 368–373.
[26] K. Mani, P. Kalpana, "A Review on Filter Based Feature Selection", Vol. 4, Issue 5,
May 2016, DOI: 10.15680/IJIRCCE.2015.0405094, pages 9149-9151.
[27] Shilpi Bose, Chandra Das, Matangini Chattopadhyay, Kuntal Ghosh,
Samiran Chattopadhyay, "An Ensemble Filtering Approach based Supervised Gene
Clustering Algorithm to Identify Informative Genes to Improve Sample Classification
Accuracy in Microarray Gene Expression Data".
[28] Ryan J. Urbanowicz, Melissa Meeker, William La Cava, Randal S. Olson,
Jason H. Moore, "Relief-Based Feature Selection: Introduction and
Review", arXiv:1711.08421v2, 2 Apr 2018.

Nguyen Thanh Tu Collection
 
How to Manage Access Rights & User Types in Odoo 18
Celine George
 
Ad

A comparative study using different measures of filtration

  • 1. A comparative study of different measures used in filter method to select relevant genes from microarray gene expression presented by Jayati Mitra
  • 2. Content Chapter 1. Introduction. Chapter 2. Basic concepts of Bioinformatics and Molecular Biology 2.1 Introduction to Bioinformatics. 2.2 Goal of Bioinformatics. 2.3 Research areas of Bioinformatics. 2.4 Real world applications of Bioinformatics. 2.5 Introduction to Molecular Biology. 2.6 Central dogma of Molecular Biology. Chapter 3. DNA Microarray technology and Gene Expression Data. Chapter 4. Literature Survey on feature selection techniques applied on Microarray Gene expression data. Chapter 5. Proposed Work. 5.1 Scoring function based feature selection. 5.2 Working Principle. Chapter 6. Result Analysis. Chapter 7. Conclusion and future scope. References.
  • 3. Chapter 1. Introduction • Gene expression is the process of transcribing a gene’s DNA sequence into RNA. A gene’s expression level indicates the approximate number of copies of that gene’s RNA produced in a cell, and it is correlated with the amount of the corresponding protein made. • A DNA microarray (also commonly known as a DNA chip or biochip) is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome. • Classification is a form of data analysis that extracts models describing important data classes. Such models are called classifiers. Data classification consists of two steps: 1. The learning step (training phase), in which a classifier is built from the training dataset and its associated class labels. 2. The classification step, in which the classifier is applied to classify unseen data.
  • 4. • Feature selection is the process of selecting a subset of relevant and non-redundant features from a dataset in order to improve the performance of classification algorithms in terms of accuracy and the time needed to build the model. • The process of feature selection is classified into three categories: (1) filter, (2) wrapper and (3) embedded. Filter methods evaluate a subset of genes by looking at the intrinsic characteristics of the data with respect to the class labels. • This project thesis reviews the score functions that are used in filter methods.
  • 5. Chapter 2. Basic concepts of Bioinformatics and Molecular Biology 2.1 Introduction to Bioinformatics: Bioinformatics is the field of science in which biology, computer science, mathematics and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. 2.2 Goal of Bioinformatics: • Bioinformatics then became more ambitious, aiming to revolutionize medicine by making sequencing a diagnostic tool. • The goal was to develop new approaches to eradicate diseases like cancer, and to pave the way towards personalized medicine.
  • 6. 2.3 Research areas of bioinformatics: • Sequence analysis: Sequence analysis is the most primitive operation in computational biology. It consists of finding which parts of biological sequences are alike and which parts differ during medical analysis and genome mapping. • Genome annotation: In the context of genomics, annotation is the process of marking the genes and other biological features in a DNA sequence. • Analysis of gene expression: The expression of many genes can be determined by measuring mRNA levels with techniques such as microarrays, expressed cDNA sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), or various applications of multiplexed in-situ hybridization. • Analysis of protein expression: Gene expression can be measured in many ways, including mRNA and protein expression; however, protein expression is one of the best clues to actual gene activity, since proteins are usually the final catalysts of cell activity. Protein microarrays and high-throughput (HT) mass spectrometry (MS) can provide a snapshot of the proteins present in a biological sample.
  • 7. Analysis of mutations in cancer: In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways. Massive sequencing efforts are used to identify previously unknown point mutations in a variety of genes in cancer. Bioinformaticians continue to produce specialized automated systems to manage the sheer volume of sequence data produced, and they create new algorithms and software to compare the sequencing results to the growing collection of human genome sequences and germ line polymorphisms. New physical detection technologies are employed, such as oligonucleotide microarrays to identify chromosomal gains and losses and single-nucleotide polymorphism arrays to detect known point mutations. Protein structure prediction: The amino acid sequence of a protein (the so-called primary structure) can be easily determined from the sequence of the gene that codes for it. In most cases, this primary structure uniquely determines a structure in its native environment.
  • 8. 2.4 Real world applications of bioinformatics: • Basic Research • Functional genomics • Evolutionary genomics • Epigenomics • Genome Wide Association Analysis (GWA) • Genomics • Proteomics • Omic sciences • Systems biology/ Systems genetics • High Performance Computing (HPC) Biomedicine • Drug discovery • Personalized medicine • Preventive medicine • Gene therapy
  • 9. Microbiology • Biotechnology • Waste cleanup • Climate change • Alternative energy sources • Antibiotic resistance • Epidemiological studies Agriculture • Crops • Insect resistance • Improvement of nutritional quality
  • 10. 2.5 Introduction to Molecular Biology Molecular Biology is the study of gene structure and function at the molecular level, in order to understand the molecular basis of hereditary genetic variation and the expression patterns of genes. Some common Molecular Biology techniques are: • Electrophoresis – A process which separates molecules such as DNA or proteins according to their size; electrophoresis is a mainstay of molecular biology laboratories. It can be used to identify molecules or fragments of molecules, and as a check that the correct molecule is present. • Cloning – The technique of introducing a new gene into a cell or organism. This can be used to see what effect the expression of that gene has on the organism. • Restriction Digest – The process of cutting DNA into smaller fragments using enzymes which only act at a particular genetic sequence. • Ligation – The process of joining two pieces of DNA together. Ligation is useful when introducing a new piece of DNA into another genome.
  • 11. Cell: The cell is the basic structural, functional, and biological unit of all known living organisms. A cell is the smallest unit of life. Cells are often called the "building blocks of life". The study of cells is called cell biology or cellular biology. DNA: Deoxyribonucleic acid, or DNA, is a molecule which contains all the hereditary material and holds the instructions for building the proteins that are essential for our bodies to function.
  • 12. DNA contains four types of nitrogen-containing units called bases: • Adenine (A) • Cytosine (C) • Guanine (G) • Thymine (T) The bases of the two strands of DNA are stuck together to create a ladder-like shape. Within the ladder, A always sticks to T, and G always sticks to C, to create the "rungs." The length of the ladder is formed by the sugar and phosphate groups.
  • 13. RNA: RNA stands for ribonucleic acid. RNA molecules are single-stranded nucleic acids composed of nucleotides. RNA plays an important role in protein synthesis through the process of translation.
  • 14. There are three main types of RNA involved in the translation process: ➢ Messenger RNA (mRNA) ➢ Transfer RNA (tRNA) ➢ Ribosomal RNA (rRNA) Gene: A gene is the basic physical and functional unit of heredity. Genes are made up of DNA. Some genes act as instructions to make molecules called proteins. In humans, genes vary in size from a few hundred DNA bases to more than 2 million bases. The Human Genome Project estimated that humans have between 20,000 and 25,000 genes.
  • 15. Protein: Proteins are essential nutrients for the human body. They are one of the building blocks of body tissue and can also serve as a fuel source. As a fuel, proteins provide as much energy density as carbohydrates. The defining characteristic of protein from a nutritional standpoint is its amino acid composition.
  • 16. 2.6 Central Dogma of Molecular Biology:
  • 18. Post-translational modification (PTM) refers to the covalent and generally enzymatic modification of proteins following protein biosynthesis. Proteins are synthesized by ribosomes translating mRNA into polypeptide chains, which may then undergo PTM to form the mature protein product. PTMs are important components in cell signaling, for example when prohormones are converted to hormones. Splicing: The detailed splicing mechanism is quite complex. In short, it involves five snRNAs and their associated proteins. These ribonucleoproteins form a large (60S) complex called the spliceosome. Then, after a two-step enzymatic reaction, the intron is removed and the two neighboring exons are joined together. The branch point A residue plays a critical role in the enzymatic reaction.
  • 19. Chapter 3. DNA Microarray Technology: A DNA microarray is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome. DNA microarray technology provides biologists with the ability to measure the expression levels of thousands of genes in a single experiment. Initial experiments suggest that genes of similar function yield similar expression patterns in microarray hybridization experiments. DNA microarray technology has empowered the scientific community to understand the fundamental aspects underlying the growth and development of life, as well as to explore the genetic causes of anomalies occurring in the functioning of the human body.
  • 22. 3. Reverse Transcription 4. Labelling: The cyanine dyes Cy3 and Cy5 are the predominant labels used in microarray analysis.
  • 24. Types of microarray • DNA Microarray – cDNA microarray – Oligonucleotide arrays • Protein microarray – Analytical – Functional – Reverse phase • Chemical compound arrays – collection of organic chemical compounds spotted on a solid surface • Carbohydrate arrays – various oligosaccharides and/or polysaccharides immobilized on a solid support in a spatially defined arrangement • Cellular Microarrays – spotted with varying materials, such as antibodies, proteins, or lipids, which can interact with the cells, leading to their capture on specific spots.
  • 25. Chapter 4. Literature Survey on Feature selection techniques applied on microarray Gene expression data: Gene Expression: Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins, but in non-protein coding genes such as transfer RNA (tRNA) or small nuclear RNA (snRNA) genes, the product is a functional RNA. The process of gene expression is used by all known life—eukaryotes (including multicellular organisms), prokaryotes (bacteria and archaea), and utilized by viruses—to generate the macromolecular machinery for life. Several steps in the gene expression process may be modulated, including the transcription, RNA splicing, translation, and post-translational modification of a protein.
  • 26. Feature Selection for Classification Feature selection mainly affects the training phase of classification. After generating features, instead of passing the data with all of its features directly to the learning algorithm, feature selection for classification first selects a subset of features and then passes the data with only the selected features to the learning algorithm. The feature selection phase may be independent of the learning algorithm, as in filter models, or it may iteratively use the performance of the learning algorithm to evaluate the quality of the selected features, as in wrapper models. The process of feature selection is classified into the following categories:
  • 27. a) Filter: It is also called the open-loop method, and it is the earliest method. It examines the features based on their intrinsic characteristics prior to the learning task. A filter algorithm principally measures feature characteristics using four types of evaluation criteria: dependency, information, distance, and consistency. However, filter methods ignore the interaction with the classifier and the possible interactions among features (combined features may have a net effect that is not necessarily reflected by the individual features in that group). This also leads to varied prediction performance when the selected features are applied to different learning algorithms.
  • 28. Advantages and disadvantages of filters methods.
  • 29. Filter method based works in Microarray gene expression data: The table given below shows some key references for filter-based feature selection techniques in the microarray domain.
  • 30. b) Wrapper or closed-loop method: It wraps the feature selection around the learning algorithm and uses the classification error rate or accuracy as the feature evaluation criterion. It selects the most discriminative subset of features by minimizing the prediction error of a particular classifier.
  • 31. Advantages and disadvantages of Wrapper methods.
  • 32. Wrapper method based works in Microarray gene expression data:
  • 33. c) Embedded method: A built-in feature selection mechanism that embeds the feature selection in the learning algorithm and uses the algorithm's properties to guide feature evaluation. The embedded method is more efficient and computationally more tractable than the wrapper method while maintaining similar performance. This is because the embedded method avoids the repeated execution of the classifier and the examination of every feature subset.
  • 34. Advantages and disadvantages of Embedded methods.
  • 35. Embedded method based works in Microarray gene expression data:
  • 36. d) Hybrid methods: These represent the latest developments in feature selection. A hybrid method can be formed by combining two different methods (e.g. filter and wrapper), two methods of the same criterion, or two feature selection approaches. It uses different evaluation criteria in different search stages to improve efficiency and prediction performance with better computational performance.
  • 37. Advantages and disadvantages of Hybrid methods:
  • 38. Hybrid method based works in Microarray gene expression data:
  • 39. e) Ensemble: This method aims to construct a group of feature subsets and then produce an aggregated result from the group. It is purposely designed to tackle the instability and perturbation issues found in many feature selection algorithms. The method is based on sub-sampling strategies: a particular feature selection method is run on a number of sub-samples, and the features obtained are merged to form a more stable subset.
  • 40. Advantages and disadvantages of Ensemble methods
  • 41. Ensemble method based works in Microarray gene expression data:
  • 42. Chapter 5. Proposed work 5.1 Scoring Function Based Feature Selection The main objective of this work is to study the scoring functions used in filter methods. • Filter methods are divided into two categories: univariate methods and multivariate methods. In the univariate scheme, each feature is ranked independently based on some score function or measure, and then a given number of features are selected according to their rank. In the multivariate scheme, feature dependency is considered; therefore, the multivariate scheme is naturally capable of handling redundant features. The score functions used to rank relevant genes in gene expression data are discussed below.
  • 43. Here, $K_{N \times M}$ is a gene expression data matrix containing $N$ objects (samples) and $M$ features (genes). • $X = \{X_1, X_2, \ldots, X_N\}$ is the set of samples. • $f = \{f_1, f_2, \ldots, f_M\}$ is the set of features (genes). • $C_{N \times 1}$ is a class vector that holds the class value associated with every sample. 1. Mutual Information: The mutual information (MI) of two random variables is a measure of the mutual dependence between them; here the two variables are features. • $f_i$ and $f_s$ are individual features.
  • 44. The average normalized MI is used as a measure of redundancy between the $i$-th feature and the subset of selected features $S = \{f_s\}$, where $|S|$ is the cardinality of the set $S$. To select the best features, MI is calculated using the equation given below:
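As a rough illustration of the mutual information measure described above, the following minimal sketch computes MI (in bits) between two discrete variables, for example a discretized gene and the class labels. It assumes expression values have already been discretized; the function name `mutual_information` and the toy data are hypothetical and are not taken from the slides.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information I(X;Y) in bits for two equal-length discrete sequences."""
    n = len(x)
    px, py = Counter(x), Counter(y)   # marginal counts
    pxy = Counter(zip(x, y))          # joint counts
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi


# Hypothetical toy data: a discretized gene that separates the classes perfectly.
gene = ["low", "low", "high", "high"]
labels = ["normal", "normal", "tumor", "tumor"]
print(mutual_information(gene, labels))
```

The same function can be applied to a pair of discretized genes to estimate redundancy between features, which is the role MI plays in the multivariate setting mentioned above.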
  • 45. 2. Symmetric Uncertainty: This is a normalized form of mutual information, introduced by Witten and Frank (2005). It is defined as below: 3. Information Gain: • The information gain measure is based on the concept of entropy. • It is commonly used as a measure of feature relevance in filter strategies that evaluate features individually. • Information gain (IG) measures the amount of information, in bits, about the class prediction when the only information available is the presence of a feature and the corresponding class distribution.
  • 46. where $P_d$ is the marginal probability of the $i$-th class $c_i$. Here, the dataset $D$ is partitioned with respect to a feature $f$ into $k$ parts. The information gain with respect to feature $f$ is given below, where $|D_k^f|$ and $|D|$ represent the number of objects present in $D_k^f$ and $D$, respectively.
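The entropy-based measures above can be sketched as follows. This is a minimal illustration that assumes the standard textbook definitions: information gain as class entropy minus the size-weighted entropy of the partitions induced by a discretized feature, and symmetric uncertainty as twice the information gain divided by the sum of the feature and class entropies. The function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy in bits of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(feature, labels):
    """Class entropy minus the weighted class entropy after partitioning on the feature."""
    parts = defaultdict(list)
    for value, label in zip(feature, labels):
        parts[value].append(label)
    n = len(labels)
    conditional = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - conditional


def symmetric_uncertainty(feature, labels):
    """Information gain normalized to [0, 1] by the feature and class entropies."""
    denom = entropy(feature) + entropy(labels)
    return 0.0 if denom == 0 else 2.0 * information_gain(feature, labels) / denom


# Hypothetical toy data: one informative and one uninformative discretized gene.
labels = ["normal", "normal", "tumor", "tumor"]
print(information_gain(["low", "low", "high", "high"], labels))
print(symmetric_uncertainty(["low", "high", "low", "high"], labels))
```

Genes with higher information gain or symmetric uncertainty would be ranked as more relevant under these definitions.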
  • 47. 4. Chi-square test: • Chi-square is a statistical test commonly used to compare observed data with the data we would expect to obtain according to a specific hypothesis. The chi-square test always tests what scientists call the null hypothesis: that there is no significant difference between the expected and observed results. The formula for calculating chi-square ($\chi^2$) is: $\chi^2 = \sum (O - E)^2 / E$, where $O$ and $E$ are the observed and expected counts. Here we calculate chi-square for every pair of feature variable and target variable, and observe whether a relationship exists between them.
  • 48. • In feature selection, the two events are the occurrence of the term and the occurrence of the class. • $\chi^2$ is a measure of how much the expected counts and the observed counts diverge from each other. • A high value of $\chi^2$ indicates that the hypothesis of independence, which implies that expected and observed counts are similar, is incorrect. • If the two events are dependent, then the occurrence of the term makes the occurrence of the class more likely (or less likely), so it should be helpful as a feature.
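The chi-square statistic described above can be sketched for a single discretized feature as follows. This is a minimal illustration assuming the usual contingency-table form, with expected counts computed under independence of feature value and class; the function name `chi_square` and the toy data are hypothetical.

```python
from collections import Counter


def chi_square(feature, labels):
    """Chi-square statistic between a discrete feature and the class labels."""
    n = len(labels)
    feature_counts = Counter(feature)
    class_counts = Counter(labels)
    observed = Counter(zip(feature, labels))  # missing cells count as 0
    stat = 0.0
    for f, nf in feature_counts.items():
        for c, nc in class_counts.items():
            expected = nf * nc / n  # expected count if feature and class are independent
            stat += (observed[(f, c)] - expected) ** 2 / expected
    return stat


# Hypothetical toy data: a gene whose discretized value matches the class exactly.
labels = ["normal", "normal", "tumor", "tumor"]
print(chi_square(["low", "low", "high", "high"], labels))
```

A larger statistic suggests stronger dependence between the feature and the class, so genes would be ranked from highest to lowest score.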
  • 49. 5. Gini Index • It is a univariate and supervised feature weighting method. • It is a measure for quantifying a feature's ability to distinguish between classes. • Given c classes, the GI of a feature f can be calculated as:
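One common way to turn the Gini idea into a feature score is sketched below. It is an assumption that this matches the slide's formula: the sketch uses the size-weighted Gini impurity of the class labels after partitioning the samples by a discretized feature, so that lower values indicate a more discriminative feature. The function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def gini_impurity(labels):
    """1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_index(feature, labels):
    """Weighted Gini impurity of the partitions induced by a discrete feature."""
    parts = defaultdict(list)
    for value, label in zip(feature, labels):
        parts[value].append(label)
    n = len(labels)
    return sum(len(p) / n * gini_impurity(p) for p in parts.values())


# Hypothetical toy data: lower scores mean purer partitions.
labels = ["normal", "normal", "tumor", "tumor"]
print(gini_index(["low", "low", "high", "high"], labels))  # informative gene
print(gini_index(["low", "high", "low", "high"], labels))  # uninformative gene
```

Under this convention genes would be ranked in ascending order of score, unlike the other measures in this chapter.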
  • 50. 6. Relief • A filter-method approach to feature selection that is notably sensitive to feature interactions. • Relief calculates a score for each feature, which can then be used to rank features and select the top-scoring ones. • Relief feature scoring is based on the identification of feature value differences between nearest-neighbor instance pairs. • If a feature value difference is observed in a neighboring instance pair with the same class (a 'hit'), the feature score decreases. • If a feature value difference is observed in a neighboring instance pair with different class values (a 'miss'), the feature score increases. Relief finds the nearest instance of the same class (the hit) and the nearest instance of a different class (the miss), and $W(f_t)$ is decreased or increased accordingly (Eq. 11).
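The hit/miss update described above can be sketched as follows. This is a simplified variant under stated assumptions: feature values are numeric and already scaled to [0, 1], Manhattan distance is used to find neighbors, and every instance is visited once instead of drawing a random sample (which keeps the result deterministic). The function name `relief` and the toy data are hypothetical.

```python
def relief(X, y):
    """Simplified Relief weights; each class must contain at least two instances."""
    n, m = len(X), len(X[0])
    weights = [0.0] * m

    def dist(a, b):
        # Manhattan distance between two instances
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n):
        same = [j for j in range(n) if j != i and y[j] == y[i]]
        other = [j for j in range(n) if y[j] != y[i]]
        hit = min(same, key=lambda j: dist(X[i], X[j]))    # nearest same-class instance
        miss = min(other, key=lambda j: dist(X[i], X[j]))  # nearest other-class instance
        for f in range(m):
            # A difference to the miss raises the score, a difference to the hit lowers it.
            weights[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n
    return weights


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.1], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
y = [0, 0, 1, 1]
print(relief(X, y))
```

Features with the largest weights would be kept; on this toy data the separating feature receives a positive weight and the noisy one a negative weight.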
  • 51. 7. Fisher Score: • It is a supervised and univariate feature weighting method. • It favors features that take similar values for samples from the same class and different values for samples from different classes. The Fisher score can be expressed as: • where $\mu_{f_i c_i}$ and $\rho_{f_i c_i}$ are the mean and variance of the $i$-th feature in class $c_i$, • $\mu_{f_i}$ is the mean of the $i$-th feature, • $n_c$ is the number of samples in class $c_i$.
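A minimal sketch of a Fisher score for one gene is given below. It assumes the commonly used form, the between-class scatter (class size times squared deviation of the class mean from the overall mean) divided by the within-class scatter (class size times class variance), with population variances; whether this matches the slide's exact equation is an assumption. The function name and toy data are hypothetical.

```python
def fisher_score(values, labels):
    """Between-class scatter divided by within-class scatter for one feature."""
    overall_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for c in set(labels):
        class_values = [v for v, l in zip(values, labels) if l == c]
        n_c = len(class_values)
        mean_c = sum(class_values) / n_c
        var_c = sum((v - mean_c) ** 2 for v in class_values) / n_c  # population variance
        between += n_c * (mean_c - overall_mean) ** 2
        within += n_c * var_c
    return between / within if within > 0 else float("inf")


# Hypothetical toy data: expression of one gene across four samples.
labels = [0, 0, 1, 1]
print(fisher_score([1.0, 3.0, 7.0, 9.0], labels))  # class means far apart
print(fisher_score([1.0, 9.0, 3.0, 7.0], labels))  # class means identical
```

Higher scores indicate genes whose class means are well separated relative to the spread within each class.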
  • 52. 8. T-test: • It measures the relationship between two samples statistically by comparing their mean values. • It calculates the ratio between the difference of the two class means and the variability of the two classes (Eq. 13). $\bar{f}_{t1}$ is the mean of the sample values of feature $f_t$ for class 1, and $\bar{f}_{t2}$ is the mean of the sample values of feature $f_t$ for class 2. $S_{t1}$ and $S_{t2}$ are the standard deviations of the sample values of feature $f_t$ for class 1 and class 2, respectively. $n_1$ represents the number of samples of class 1 and $n_2$ the number of samples of class 2.
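The quantities defined above are enough to sketch a two-sample t-statistic for one gene. The version below assumes the unequal-variance (Welch) form, the difference of class means divided by the square root of the sum of each class's sample variance over its size; the slide's own equation may use a different variant. The function name `t_score` and the toy data are hypothetical.

```python
import math
import statistics


def t_score(class1_values, class2_values):
    """Welch-style t-statistic for one feature; needs at least two samples per class."""
    n1, n2 = len(class1_values), len(class2_values)
    mean1 = statistics.mean(class1_values)
    mean2 = statistics.mean(class2_values)
    var1 = statistics.variance(class1_values)  # sample variance, i.e. S_t1 squared
    var2 = statistics.variance(class2_values)  # sample variance, i.e. S_t2 squared
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)


# Hypothetical toy data: expression of one gene in two classes.
t = t_score([1.0, 3.0], [7.0, 9.0])
print(abs(t))
```

For ranking, the absolute value of the statistic is typically used, so that genes that are strongly up-regulated or down-regulated in one class both score highly.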
  • 53. 5.2 Working Principle: Algorithm: Filter 1. Choose k, the number of genes to be selected by each filter. 2. For each filter FTi (i = 1, 2, 3): a) Calculate the statistical scores for all genes and rank the scores from highest to lowest. b) Select the k genes with the top-ranking scores in each list. 3. Take the union of the lists of genes obtained by FTi (i = 1, 2, 3) to produce a set of p features.
  • 54. Flowchart of filter-based feature selection
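The steps of the filter algorithm above can be sketched as follows, assuming each filter has already produced a mapping from gene name to score in which higher scores are better. The helper names `top_k` and `union_of_filters`, the gene names, and the scores are all hypothetical.

```python
def top_k(scores, k):
    """Return the k gene names with the highest scores (step 2)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def union_of_filters(score_tables, k):
    """Union of the top-k genes chosen by each filter (step 3)."""
    selected = set()
    for scores in score_tables:
        selected.update(top_k(scores, k))
    return selected


# Hypothetical scores from three filters FT1, FT2, FT3 over three genes.
ft1 = {"g1": 0.9, "g2": 0.5, "g3": 0.1}
ft2 = {"g1": 0.2, "g2": 0.8, "g3": 0.7}
ft3 = {"g1": 0.6, "g2": 0.1, "g3": 0.4}
print(sorted(union_of_filters([ft1, ft2, ft3], k=1)))
```

Because the filters may agree on some genes, the resulting set has p features with k ≤ p ≤ 3k.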
  • 55. Chapter 6 Result Analysis The score functions (measures) are applied to different microarray gene expression datasets. The best 100 genes are selected by each measure, and the classification accuracy of the samples is checked using those genes for every measure. Here a KNN classifier is used to check classification accuracy with leave-one-out cross-validation. Table 1. Dataset Description
  • 56. Table 2. Performance of different measures on the breast cancer dataset using KNN
      Measure | Classification accuracy (%)
      Fisher Score | 81.6
      T-test | 81.6
      Chi-square | 83.7
      Symmetric Uncertainty | 85.7
      Information Gain | 89.8
      Gini Index | 85.7
      Mutual Information | 89.8
      Relief | 89.8
  • 57. Table 3. Performance of different measures on the colon cancer dataset using KNN
      Measure | Classification accuracy (%)
      Fisher Score | 72.6
      T-test | 72.6
      Chi-square | 83.9
      Symmetric Uncertainty | 83.9
      Information Gain | 90.3
      Gini Index | 85.5
      Mutual Information | 90.3
      Relief | 85.5
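The evaluation protocol used in this chapter, a KNN classifier scored with leave-one-out cross-validation, can be sketched as follows. This is a minimal illustration assuming squared Euclidean distance and majority voting among the k nearest neighbors; the value of k, the function name, and the toy data are hypothetical and do not reproduce the reported results.

```python
from collections import Counter


def loocv_knn_accuracy(X, y, k=3):
    """Leave-one-out accuracy (in percent) of a k-nearest-neighbor classifier."""
    correct = 0
    for i in range(len(X)):
        # Distances from the held-out sample i to every other sample.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
            for j in range(len(X))
            if j != i
        )
        votes = Counter(y[j] for _, j in dists[:k])
        if votes.most_common(1)[0][0] == y[i]:
            correct += 1
    return 100.0 * correct / len(X)


# Hypothetical toy data: one selected gene, two well-separated classes.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = [0, 0, 0, 1, 1, 1]
print(loocv_knn_accuracy(X, y, k=3))
```

In the setting described above, `X` would hold only the genes chosen by one measure, so that the accuracy reflects the quality of that measure's selection.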
  • 58. Chapter 7 Conclusion and Future Scope: This project focuses on the filter approach to selecting the most relevant features, based on a study of existing univariate scoring functions. Analyzing the outcomes obtained from these scoring functions allows a comparative study to be established. It also helps us analyze the methodology for selecting the more relevant features and removing irrelevant ones. In future, this work can be extended to the multivariate scheme using filter-based scoring functions.