A Fast Clustering-Based Feature Subset Selection Algorithm
for High-Dimensional Data
ABSTRACT:
Feature selection involves identifying a subset of the most useful features that
produces results compatible with those of the original, entire set of features. A feature
selection algorithm may be evaluated from both the efficiency and effectiveness
points of view. While the efficiency concerns the time required to find a subset of
features, the effectiveness is related to the quality of the subset of features. Based
on these criteria, a fast clustering-based feature selection algorithm (FAST) is
proposed and experimentally evaluated in this paper. The FAST algorithm works
in two steps. In the first step, features are divided into clusters by using graph-
theoretic clustering methods. In the second step, the most representative feature
that is strongly related to target classes is selected from each cluster to form a
subset of features. Because features in different clusters are relatively independent, the
clustering-based strategy of FAST has a high probability of producing a subset of
useful and independent features. To ensure the efficiency of FAST, we adopt the
efficient minimum-spanning tree (MST) clustering method. The efficiency and
effectiveness of the FAST algorithm are evaluated through an empirical study.
Extensive experiments are carried out to compare FAST and several representative
feature selection algorithms, namely, FCBF, ReliefF, CFS, Consist, and FOCUS-SF,
with respect to four types of well-known classifiers, namely, the
probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the
rule-based RIPPER before and after feature selection. The results, on 35 publicly
available real-world high-dimensional image, microarray, and text data,
demonstrate that FAST not only produces smaller subsets of features but also
improves the performance of the four types of classifiers.
EXISTING SYSTEM:
Feature subset selection methods fall into four categories: embedded, wrapper,
filter, and hybrid. The embedded methods incorporate feature selection as a part of
the training process and are usually specific to given learning algorithms, and
therefore may be more efficient than the other three categories. Traditional
machine learning algorithms like decision trees or artificial neural networks are
examples of embedded approaches. The wrapper methods use the predictive accuracy
of a predetermined learning algorithm to determine the goodness of the selected
subsets, so the accuracy of the learning algorithms is usually high. However, the
generality of the selected features is limited and the computational complexity is
large. The filter methods are independent of learning algorithms and have good
generality. Their computational complexity is low, but the accuracy of the learning
algorithms is not guaranteed. The hybrid methods combine filter and wrapper
methods, using a filter method to reduce the search space that will be considered
by the subsequent wrapper. They aim to achieve the best possible performance with
a particular learning algorithm at a time complexity similar to that of the filter
methods.
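The hybrid idea described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular published method: the class name HybridSelectionSketch, the method names, and the caller-supplied accuracy function are all assumptions made for the example. The filter stage ranks features by a precomputed score and keeps the top k; the wrapper stage then runs greedy forward selection over that reduced search space only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch of a hybrid (filter, then wrapper) feature selector. */
class HybridSelectionSketch {

    /** Filter stage: indices of the k highest-scoring features, best first. */
    static List<Integer> filterTopK(double[] scores, int k) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) idx.add(i);
        idx.sort((a, b) -> Double.compare(scores[b], scores[a]));
        return new ArrayList<>(idx.subList(0, Math.min(k, idx.size())));
    }

    /**
     * Wrapper stage: greedy forward selection. The accuracy function stands in
     * for training and evaluating a predetermined learning algorithm on a subset.
     */
    static Set<Integer> wrapperForward(List<Integer> candidates,
                                       ToDoubleFunction<Set<Integer>> accuracy) {
        Set<Integer> chosen = new LinkedHashSet<>();
        double bestAcc = accuracy.applyAsDouble(chosen);
        boolean improved = true;
        while (improved) {
            improved = false;
            Integer bestFeature = null;
            for (Integer f : candidates) {
                if (chosen.contains(f)) continue;
                Set<Integer> trial = new LinkedHashSet<>(chosen);
                trial.add(f);
                double acc = accuracy.applyAsDouble(trial);
                if (acc > bestAcc) {       // keep the single best addition this round
                    bestAcc = acc;
                    bestFeature = f;
                }
            }
            if (bestFeature != null) {
                chosen.add(bestFeature);
                improved = true;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.1, 0.6, 0.5};          // made-up filter scores
        List<Integer> reduced = filterTopK(scores, 3);
        // Toy accuracy: features 0 and 2 help, every extra feature costs a little.
        Set<Integer> chosen = wrapperForward(reduced, s ->
                0.5 + (s.contains(0) ? 0.3 : 0.0) + (s.contains(2) ? 0.1 : 0.0) - 0.01 * s.size());
        System.out.println(reduced + " -> " + chosen);
    }
}
```

The wrapper calls the accuracy function once per candidate per round, which is why shrinking the candidate list in the filter stage matters for cost.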
DISADVANTAGES OF EXISTING SYSTEM:
Wrapper methods: the generality of the selected features is limited and the
computational complexity is large.
Filter methods: the computational complexity is low, but the accuracy of the
learning algorithms is not guaranteed.
Hybrid methods: a filter method only reduces the search space that will be
considered by the subsequent wrapper, so the result remains tied to a
particular learning algorithm.
PROPOSED SYSTEM
Feature subset selection can be viewed as the process of identifying and removing
as many irrelevant and redundant features as possible. This is because irrelevant
features do not contribute to the predictive accuracy, and redundant features do not
help in getting a better predictor because they provide mostly information
that is already present in other feature(s). Of the many feature subset selection
algorithms, some can effectively eliminate irrelevant features but fail to handle
redundant features, while others can eliminate the irrelevant features while also
taking care of the redundant features. Our proposed FAST algorithm falls into the
second group. Traditionally, feature subset selection research has focused on
searching for relevant features. A well-known example is Relief, which weights each
feature according to its ability to discriminate instances under different targets
based on a distance-based criterion function. However, Relief is ineffective at
removing redundant features, as two predictive but highly correlated features are
likely both to be highly weighted. Relief-F extends Relief, enabling this method to
work with noisy and incomplete data sets and to deal with multiclass problems, but
it still cannot identify redundant features.
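The redundancy problem described above can be seen in a minimal Relief-style sketch. The code below is an illustration under stated assumptions, not the original Relief implementation: it handles two classes only, expects numeric features already scaled to [0, 1], and uses every instance once as the probe instead of random sampling. The class and method names are hypothetical. In the toy data, features 0 and 1 are exact copies of each other, and both receive the same high weight, so the weighting alone cannot tell that one of them is redundant.

```java
import java.util.Arrays;

/** Hypothetical Relief-style weighting for two-class data with features in [0, 1]. */
class ReliefSketch {

    /** Manhattan distance between two instances. */
    static double distance(double[] a, double[] b) {
        double d = 0.0;
        for (int f = 0; f < a.length; f++) {
            d += Math.abs(a[f] - b[f]);
        }
        return d;
    }

    /** Returns one weight per feature; each instance serves once as the probe. */
    static double[] weights(double[][] x, int[] y) {
        int n = x.length;
        int m = x[0].length;
        double[] w = new double[m];
        for (int i = 0; i < n; i++) {
            int hit = -1;   // nearest instance with the same class
            int miss = -1;  // nearest instance with a different class
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                double d = distance(x[i], x[j]);
                if (y[j] == y[i]) {
                    if (hit < 0 || d < distance(x[i], x[hit])) hit = j;
                } else {
                    if (miss < 0 || d < distance(x[i], x[miss])) miss = j;
                }
            }
            if (hit < 0 || miss < 0) continue; // no same-class or other-class neighbour
            for (int f = 0; f < m; f++) {
                // Reward separation from the miss, penalise separation from the hit.
                w[f] += (Math.abs(x[i][f] - x[miss][f]) - Math.abs(x[i][f] - x[hit][f])) / n;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // Features 0 and 1 are identical copies (redundant); feature 2 is noise.
        double[][] x = {
            {0.0, 0.0, 0.3}, {0.1, 0.1, 0.9}, {0.9, 0.9, 0.2}, {1.0, 1.0, 0.8}
        };
        int[] y = {0, 0, 1, 1};
        System.out.println(Arrays.toString(weights(x, y)));
    }
}
```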
ADVANTAGES OF PROPOSED SYSTEM:
Good feature subsets contain features highly correlated with (predictive of)
the class, yet uncorrelated with (not predictive of) each other.
FAST deals efficiently and effectively with both irrelevant and redundant
features, and obtains a good feature subset.
In general, all six compared algorithms achieve a significant reduction of
dimensionality by selecting only a small portion of the original features.
Runtime is compared with the Friedman test, whose null hypothesis is that all
the feature selection algorithms are equivalent in terms of runtime.
MODULES:
• Distributional clustering
• Subset Selection Algorithm
• Time complexity
• Microarray data
• Data Resource
• Irrelevant feature
MODULE DESCRIPTION
1. Distributional clustering
Distributional clustering has been used to cluster words into groups based
either on their participation in particular grammatical relations with other words,
by Pereira et al., or on the distribution of class labels associated with each word,
by Baker and McCallum. Because distributional clustering of words is agglomerative
in nature, and results in suboptimal word clusters and high computational cost,
a new information-theoretic divisive algorithm for word clustering was proposed and
applied to text classification. Another approach clusters features using a special
distance metric and then uses the resulting cluster hierarchy to choose
the most relevant attributes. Unfortunately, the cluster evaluation measure based on
distance does not identify a feature subset that allows the classifiers to improve
their original performance accuracy. Furthermore, even compared with other
feature selection methods, the obtained accuracy is lower.
2. Subset Selection Algorithm
Irrelevant features, along with redundant features, severely affect the accuracy
of the learning machines. Thus, feature subset selection should be able to identify
and remove as much of the irrelevant and redundant information as possible.
Moreover, “good feature subsets contain features highly correlated with (predictive
of) the class, yet uncorrelated with (not predictive of) each other.” Keeping these in
mind, we develop a novel algorithm which can efficiently and effectively deal with
both irrelevant and redundant features, and obtain a good feature subset.
3. Time complexity
Most of the work in Algorithm 1 is the computation of symmetric uncertainty (SU)
values for T-Relevance (the relevance of a feature to the target class) and
F-Correlation (the correlation between a pair of features), which has linear
complexity in terms of the number of instances in a given data set. The first part
of the algorithm has a linear time complexity in terms of the number of features m.
Assuming k features are selected as relevant ones in the first part, when k = 1
only one feature is selected.
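To make the linear cost in the number of instances concrete, here is a minimal sketch of an SU computation for two discrete variables. It assumes the standard definition SU(X, Y) = 2 * Gain(X|Y) / (H(X) + H(Y)), which this summary does not spell out, and it assumes the values have already been discretized into integer codes. The class and method names are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: symmetric uncertainty between two discrete variables. */
class SymmetricUncertainty {

    /** Shannon entropy (base 2) of the empirical distribution given by the counts. */
    static double entropy(Map<?, Integer> counts, int n) {
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / n;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    /** SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y)), a value in [0, 1]. */
    static double su(int[] x, int[] y) {
        int n = x.length;
        Map<Integer, Integer> cx = new HashMap<>();
        Map<Integer, Integer> cy = new HashMap<>();
        Map<Long, Integer> cxy = new HashMap<>();
        for (int i = 0; i < n; i++) {            // single pass over the instances
            cx.merge(x[i], 1, Integer::sum);
            cy.merge(y[i], 1, Integer::sum);
            long pair = ((long) x[i] << 32) | (y[i] & 0xffffffffL); // joint key
            cxy.merge(pair, 1, Integer::sum);
        }
        double hx = entropy(cx, n);
        double hy = entropy(cy, n);
        double denom = hx + hy;
        if (denom == 0.0) return 0.0;             // both variables are constant
        double gain = hx + hy - entropy(cxy, n);  // information gain H(X) - H(X|Y)
        return 2.0 * gain / denom;
    }

    public static void main(String[] args) {
        int[] feature = {0, 0, 1, 1};
        int[] sameAsFeature = {0, 0, 1, 1};
        int[] unrelated = {0, 1, 0, 1};
        System.out.println(su(feature, sameAsFeature));
        System.out.println(su(feature, unrelated));
    }
}
```

T-Relevance would call su(feature, classLabels) once per feature, and F-Correlation would call it once per pair of relevant features; each call touches every instance exactly once.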
4. Microarray data
For microarray data, each of the six algorithms improves on the proportion of
selected features it achieves over all the given data sets. This indicates that the
six algorithms work well with microarray data. FAST again ranks first, with a
proportion of selected features of 0.71 percent. Of the six algorithms, only CFS
cannot choose features for two data sets, whose dimensionalities are 19,994 and
49,152, respectively.
5. Data Resource
To evaluate the performance and effectiveness of our proposed
FAST algorithm, to verify whether or not the method is potentially useful in
practice, and to allow other researchers to confirm our results, 35 publicly
available data sets were used. The numbers of features of the 35 data sets vary
from 37 to 49,152, with a mean of 7,874. The dimensionalities of 54.3 percent of
the data sets exceed 5,000, and 28.6 percent of the data sets have more than 10,000
features. The 35 data sets cover a range of application domains such as text, image,
and bio microarray data classification. The corresponding statistical information
is given in the referenced paper. Note that for the data sets with continuous-valued
features, the well-known off-the-shelf MDL method was used to discretize the
continuous values.
6. Irrelevant feature
Irrelevant feature removal is straightforward once the right relevance measure
is defined or selected, while redundant feature elimination is somewhat more
sophisticated. In our proposed FAST algorithm, it involves (1) the construction of
the minimum spanning tree from a weighted complete graph; (2) the partitioning of
the MST into a forest with each tree representing a cluster; and (3) the selection of
representative features from the clusters.
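The three steps above can be sketched in Java as follows. This is a simplified illustration under several assumptions that this summary does not confirm: relevance and correlation values are passed in precomputed, a feature counts as irrelevant when its relevance does not exceed a threshold theta, the MST is built with Prim's algorithm, and an MST edge is cut when its weight is smaller than the relevance of both of its endpoints. The class name, method names, and example numbers are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of MST-based feature clustering and selection. */
class MstFeatureClustering {

    /** Union-find root lookup with path halving. */
    static int find(int[] parent, int v) {
        while (parent[v] != v) {
            parent[v] = parent[parent[v]];
            v = parent[v];
        }
        return v;
    }

    /**
     * tRel[i] is the relevance of feature i to the class; fCorr[i][j] is the
     * (symmetric) correlation between features i and j. Returns the index of
     * one representative feature per cluster.
     */
    static List<Integer> select(double[] tRel, double[][] fCorr, double theta) {
        // Irrelevant feature removal (assumed rule: relevance must exceed theta).
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < tRel.length; i++) {
            if (tRel[i] > theta) kept.add(i);
        }
        int k = kept.size();
        List<Integer> selected = new ArrayList<>();
        if (k == 0) return selected;

        // Step 1: Prim's algorithm on the complete graph over the kept features.
        boolean[] inTree = new boolean[k];
        double[] best = new double[k];  // cheapest edge linking each vertex to the tree
        int[] link = new int[k];        // tree endpoint of that edge
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        best[0] = 0.0;
        link[0] = -1;
        int[] parent = new int[k];      // union-find forest built from surviving edges
        for (int v = 0; v < k; v++) parent[v] = v;

        for (int iter = 0; iter < k; iter++) {
            int u = -1;
            for (int v = 0; v < k; v++) {
                if (!inTree[v] && (u < 0 || best[v] < best[u])) u = v;
            }
            inTree[u] = true;
            // Step 2: cut the MST edge (link[u], u) if it is weaker than the
            // relevance of both endpoints (assumed rule); otherwise merge the trees.
            if (link[u] >= 0) {
                double w = best[u];
                boolean cut = w < tRel[kept.get(u)] && w < tRel[kept.get(link[u])];
                if (!cut) parent[find(parent, u)] = find(parent, link[u]);
            }
            for (int v = 0; v < k; v++) {
                double w = fCorr[kept.get(u)][kept.get(v)];
                if (!inTree[v] && w < best[v]) {
                    best[v] = w;
                    link[v] = u;
                }
            }
        }

        // Step 3: from each tree in the forest, pick the most relevant feature.
        Map<Integer, Integer> repr = new TreeMap<>();
        for (int v = 0; v < k; v++) {
            int root = find(parent, v);
            Integer cur = repr.get(root);
            if (cur == null || tRel[kept.get(v)] > tRel[kept.get(cur)]) repr.put(root, v);
        }
        for (int v : repr.values()) selected.add(kept.get(v));
        return selected;
    }

    public static void main(String[] args) {
        double[] tRel = {0.8, 0.7, 0.05, 0.3};   // made-up relevance values
        double[][] fCorr = {
            {1.0, 0.5, 0.1, 0.2},
            {0.5, 1.0, 0.1, 0.45},
            {0.1, 0.1, 1.0, 0.1},
            {0.2, 0.45, 0.1, 1.0}
        };
        System.out.println(select(tRel, fCorr, 0.1)); // prints [0, 1]
    }
}
```

In the example, feature 2 is dropped as irrelevant, the MST edge between features 0 and 3 is cut, and the edge between features 1 and 3 survives, so the forest has the clusters {0} and {1, 3} and the representatives are features 0 and 1.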
SYSTEM FLOW:
Data set
Irrelevant feature removal
Minimum spanning tree construction
Tree partition and representative feature selection
Selected features
SYSTEM CONFIGURATION:-
HARDWARE CONFIGURATION:-
• Processor - Pentium IV
• Speed - 1.1 GHz
• RAM - 256 MB (min)
• Hard Disk - 20 GB
• Keyboard - Standard Windows Keyboard
• Mouse - Two or Three Button Mouse
• Monitor - SVGA
SOFTWARE CONFIGURATION:-
• Operating System : Windows XP
• Programming Language : Java
• Java Version : JDK 1.6 and above
REFERENCE:
Qinbao Song, Jingjie Ni, and Guangtao Wang, “A Fast Clustering-Based Feature
Subset Selection Algorithm for High-Dimensional Data”, IEEE
TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25,
NO. 1, JANUARY 2013.
