Feature Selection in Hierarchical Feature Spaces
10/12/2014 Petar Ristoski, Heiko Paulheim
Motivation: Linked Open Data as Background Knowledge
• Linked Open Data is a method for publishing interlinked
datasets using machine-interpretable semantics
• Started in 2007
• A collection of ~1,000 datasets 
– Various domains, e.g. general knowledge, government data, … 
– Using semantic web standards (HTTP, RDF, SPARQL) 
• Free of charge 
• Machine processable 
• Sophisticated tool stacks 
Example: the Auto MPG Dataset 
• A well-known UCI dataset 
– Goal: predict fuel consumption of cars 
• Hypothesis: background knowledge → more accurate predictions 
• Used background knowledge: 
– Entity types and categories from DBpedia (a structured version of Wikipedia)
• Results: M5Rules cuts the prediction error almost in half
– i.e. on average, we are wrong by 1.6 instead of 2.9 MPG 
Attribute set                          Linear Regression    M5Rules
                                       RMSE     RE          RMSE     RE
original                               3.359    0.118       2.859    0.088
original + direct types                3.334    0.117       2.835    0.091
original + categories                  4.474    0.144       2.926    0.090
original + direct types + categories   2.551    0.088       1.574    0.042
Drawbacks 
• The generated feature sets are rather large
– e.g., for a dataset of 300 instances, up to 5,000 features may be
generated from a single source
• Increased complexity and runtime
• Overfitting on overly specific features
Linked Open Data is Backed by Ontologies 
[Figure: LOD graph excerpt (left) and the corresponding ontology excerpt (right)]
HIERARCHICAL FEATURE SPACE
Problem Statement 
• Each instance is an n-dimensional binary feature vector (v1,v2,…,vn),
where vi ∈ {0,1} for all 1 ≤ i ≤ n
• Feature space: V={v1,v2,…, vn} 
• A hierarchical relation between two features vi and vj is denoted as
vi < vj, where vi is more specific than vj
• For all hierarchical features, the following implication holds: 
vi < vj → (vi = 1 → vj = 1) 
• Transitivity between hierarchical features exists: 
vi < vj ˄ vj < vk → vi < vk 
• The problem of feature selection can be defined as finding a
projection of V to V′, where V′ ⊆ V and p(V′) ≥ p(V), where p is a
performance function p: P(V) → [0,1]
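To make the definitions concrete, here is a minimal sketch (not the authors' implementation) of such a hierarchical binary feature space; the feature names and the toy hierarchy are illustrative assumptions:

```python
# Binary features with a "more specific than" relation: setting a specific
# feature forces all of its more general ancestors to 1.
PARENTS = {
    "Basketball_Player": "Athlete",
    "Baseball_Player": "Athlete",
    "Athlete": None,
}

def ancestors(feature):
    """Yield every more general feature, using transitivity of '<'."""
    parent = PARENTS.get(feature)
    while parent is not None:
        yield parent
        parent = PARENTS.get(parent)

def close_upwards(active_features):
    """Enforce the implication vi < vj -> (vi = 1 -> vj = 1)."""
    closed = set(active_features)
    for feature in active_features:
        closed.update(ancestors(feature))
    return closed

print(close_upwards({"Basketball_Player"}))
# {'Basketball_Player', 'Athlete'} (set order may vary)
```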
Hierarchical Feature Space: Example 
Josh Donaldson is the best 3rd 
baseman in the American League. 
LeBron James NOT ranked #1 after 
newly released list of Top NBA players 
“Two things are infinite: the universe 
and human stupidity; and I'm not sure 
about the universe.”―Albert Einstein 
Nineteen-year-old figure skater Yuzuru 
Hanyu, who won a gold medal in the 
Sochi Olympics, is among the 684 
peo... http://bit.ly/1kb6W5y 
In his weekly address, President 
Barack Obama discusses expanding 
opportunity for hard-working 
Americans: http://ofa.bo/ccH 
Barack Obama cracks jokes at Vladimir 
Putin's expense http://dlvr.it/5Z7JCR 
I spotted the Lance Armstrong case in 
2006 when everyone thought he was 
God, and now this case catches my 
attention. 
Hierarchical Feature Space: Example 
[Figure: excerpt of the DBpedia type hierarchy: dbpedia:LeBron_James is a dbpedia-owl:Basketball_Player, dbpedia:Josh_Donaldson is a dbpedia-owl:Baseball_Player, and both types are subclasses of dbpedia-owl:Athlete]
Josh Donaldson is the best 3rd 
baseman in the American League. 
LeBron James NOT ranked #1 after 
newly released list of Top NBA players 
Hierarchical Feature Space 
• Linked Open Data 
– DBpedia, YAGO, Biperpedia, Google Knowledge Graph 
• Lexical Databases
– WordNet, DANTE 
• Domain specific ontologies, taxonomies and vocabularies 
– Bioinformatics: Gene Ontology (GO), Entrez 
– Drugs: the Drug Ontology 
– E-commerce: GoodRelations 
RELATED APPROACHES 
Standard Feature Selection 
• Wrapper methods 
– Computationally expensive 
• Filter methods 
– Several techniques for scoring the relevance of the features 
• Information Gain 
• χ²
• Information Gain Ratio 
• Gini Index 
– Often similar results 
Optimal Feature Selection 
Standard Feature Selection: Information Gain 
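Information gain is the filter score used throughout the deck; the following is a hedged sketch of its computation for a single binary feature, following the standard definition IG(v) = H(C) − Σx P(v = x) · H(C | v = x), with toy data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(C) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG of one binary feature against the class labels."""
    n = len(labels)
    gain = entropy(labels)
    for x in (0, 1):
        subset = [c for v, c in zip(feature_values, labels) if v == x]
        if subset:
            gain -= (len(subset) / n) * entropy(subset)
    return gain

print(information_gain([1, 1, 0, 0], ["pos", "pos", "neg", "neg"]))  # 1.0
```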
TSEL Feature Selection 
• Tree-based feature selection (Jeong et al.) 
– Select most representative and most effective feature from each branch 
of the hierarchy 
• lift = P(f | C) / P(C)
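A hedged sketch of this lift score, estimated from co-occurrence counts, follows; the function and toy data are illustrative, not Jeong et al.'s code:

```python
def lift(feature_values, labels, target_class):
    """lift = P(f | C) / P(C), estimated from binary feature values."""
    n = len(labels)
    p_c = sum(1 for c in labels if c == target_class) / n
    in_class = [v for v, c in zip(feature_values, labels) if c == target_class]
    p_f_given_c = sum(in_class) / len(in_class)
    return p_f_given_c / p_c

print(lift([1, 1, 0, 0], ["pos", "pos", "pos", "neg"], "pos"))  # ≈ 0.89
```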
Bottom-Up Hill-Climbing Feature Selection 
• Bottom-up hill climbing search algorithm to find an optimal subset of 
concepts for document representation (Wang et al.) 
f = 1 + ((α − n) / α) ∗ β ∗ Σi∈D |Dci|, where Dci ⊆ DKNNi and β > 0
Greedy Top-Down Feature Selection 
• A greedy top-down search strategy for feature selection (Lu et al.)
– Select the most effective nodes from different levels of the hierarchy 
PROPOSED APPROACH 
Hierarchical Feature Selection Approach (SHSEL)
• Exploit the hierarchical structure of the feature space 
• Hierarchical relation: vi < vj → (vi = 1 → vj = 1)
• Relevance similarity: 
– Relevance (Blum et al.) : A feature vi is relevant to a target class C if 
there exists a pair of examples A and B in the instance space such that 
A and B differ only in their assignment to vi and C(A) ≠ C(B) 
• Two features vi and vj have similar relevance if:
1 − |R(vi) − R(vj)| ≥ t, where t ∈ [0,1]
• Goal: Identify features with similar relevance, and select the most 
valuable abstract features, without losing predictive power 
Hierarchical Feature Selection Approach (SHSEL)
• Initial Selection 
– Identify and filter out ranges of nodes with similar relevance in each 
branch of the hierarchy 
• Pruning 
– Select only the most relevant features from the previously reduced set 
Initial SHSEL Feature Selection 
1. Identify ranges of nodes with similar relevance in each branch, using either:
– Information gain: s(vi, vj) = 1 − |IG(vi) − IG(vj)|
– Correlation: s(vi, vj) = Correlation(vi, vj)
(e.g., s(vi, vj) = 1 − |0.45 − 0.5| = 0.95 > t = 0.9)
2. If the similarity s is greater than a user specified threshold t, remove
the more specific feature, based on the hierarchical relation
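A hedged sketch of this initial step, assuming an information-gain-based relevance score and a child-to-parent map like the one sketched earlier; the names and the single-pass traversal are illustrative simplifications:

```python
def initial_shsel(features, parents, relevance, t=0.9):
    """features: iterable of feature names; parents: child -> parent map;
    relevance: feature -> score (e.g., information gain); t: threshold."""
    selected = set(features)
    for child, parent in parents.items():
        if parent is None or child not in selected or parent not in selected:
            continue
        similarity = 1 - abs(relevance[child] - relevance[parent])
        if similarity > t:
            selected.discard(child)  # drop the more specific feature
    return selected

parents = {"Basketball_Player": "Athlete", "Baseball_Player": "Athlete",
           "Athlete": None}
relevance = {"Athlete": 0.5, "Basketball_Player": 0.45, "Baseball_Player": 0.1}
print(initial_shsel(["Athlete", "Basketball_Player", "Baseball_Player"],
                    parents, relevance))
# {'Athlete', 'Baseball_Player'}  (s = 1 - |0.45 - 0.5| = 0.95 > 0.9)
```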
Post SHSEL Feature Selection 
• Select the features with the highest relevance on each path 
– user specified threshold 
– select features with relevance above path average relevance 
(e.g., a feature with IG(vi) = 0.2 is removed if the path average relevance is AVG(Sp) = 0.25)
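A hedged sketch of this pruning step, assuming relevance scores are information gain and that paths are given as root-to-leaf lists; it reproduces the slide's 0.2 vs. 0.25 example:

```python
def prune_shsel(paths, relevance):
    """Keep, on each root-to-leaf path, only features whose relevance
    lies above the average relevance of that path."""
    selected = set()
    for path in paths:
        avg = sum(relevance[f] for f in path) / len(path)
        selected.update(f for f in path if relevance[f] > avg)
    return selected

print(prune_shsel([["Athlete", "Baseball_Player"]],
                  {"Athlete": 0.3, "Baseball_Player": 0.2}))
# {'Athlete'}  (0.2 is below the path average of 0.25)
```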
EVALUATION 
Evaluation 
• We use 5 real-world datasets and 6 synthetically generated datasets 
• Classification methods: 
– Naïve Bayes 
– k-Nearest Neighbors (k=3) 
– Support Vector Machine (polynomial kernel function) 
– No parameter optimization
Evaluation: Real World Datasets 
Name                Features               #Instances   Class Labels                       #Features
Sports Tweets T     DBpedia Direct Types   1,179        positive(523); negative(656)         4,082
Sports Tweets C     DBpedia Categories     1,179        positive(523); negative(656)        10,883
Cities              DBpedia Direct Types     212        high(67); medium(106); low(39)         727
NY Daily Headings   DBpedia Direct Types   1,016        positive(580); negative(436)         5,145
StumbleUpon         DMOZ Categories        3,020        positive(1,370); negative(1,650)     3,976
• Hierarchical features are generated from DBpedia (structured version of 
Wikipedia) 
– The text is annotated with concepts using DBpedia Spotlight 
• The feature generation is independent of the class labels, and it is unbiased 
towards any of the feature selection approaches 
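For illustration, here is a hedged sketch of annotating text with DBpedia concepts via the public DBpedia Spotlight REST service; the endpoint URL and parameters reflect the current public service and are an assumption, not necessarily the setup used in this work:

```python
import json
import urllib.parse
import urllib.request

def annotate(text, confidence=0.5):
    """Return the DBpedia resource URIs Spotlight finds in the text."""
    params = urllib.parse.urlencode({"text": text, "confidence": confidence})
    req = urllib.request.Request(
        "https://api.dbpedia-spotlight.org/en/annotate?" + params,
        headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [r["@URI"] for r in data.get("Resources", [])]

print(annotate("LeBron James NOT ranked #1 after newly released list"))
# e.g. ['http://dbpedia.org/resource/LeBron_James', ...]
```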
Evaluation: Synthetic Datasets 
Name       #Instances   Class Labels                    #Features
S-D2-B2    1,000        positive(500); negative(500)    1,201
S-D2-B5    1,000        positive(500); negative(500)    1,021
S-D2-B10   1,000        positive(500); negative(500)      961
S-D4-B2    1,000        positive(500); negative(500)    2,101
S-D4-B4    1,000        positive(500); negative(500)    1,741
S-D4-B10   1,000        positive(500); negative(500)    1,621
• The middle layer is generated using a polynomial function (see the
generation sketch below)
• The hierarchy is generated upwards and downwards following the
hierarchical feature implication and the transitivity rule
• The depth and branching factor are controlled by parameters D and B
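A hedged sketch of this generation scheme: the deck does not spell out the polynomial function, so random binary draws stand in for the middle layer here; the propagation enforces the implication (child = 1 → parent = 1) in both directions, and a uniform branching factor B is assumed:

```python
import random

def generate_instance(num_mid=4, branching=2):
    """One synthetic instance: parent layer + middle layer + child layer."""
    mid = [random.randint(0, 1) for _ in range(num_mid)]
    # upwards: a parent is 1 iff at least one of its B children is 1
    parents = [max(mid[i:i + branching]) for i in range(0, num_mid, branching)]
    # downwards: a child may only be 1 if its parent is 1
    children = [0 if v == 0 else random.randint(0, 1)
                for v in mid for _ in range(branching)]
    return parents + mid + children

print(generate_instance())
```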
Evaluation: Synthetic Datasets 
• Depth = 1 & Branching = 2
[Figure: an example tree of binary feature values, propagated upwards and downwards from the middle layer]
Evaluation: Approach 
• Testing all approaches using three classification methods
– Naïve Bayes, KNN and SVM
• Metrics for performance evaluation
– Accuracy: Acc(V′) = Correctly Classified Instances(V′) / Total Number of Instances
– Feature Space Compression: c(V′) = 1 − |V′| / |V|
– Harmonic Mean: H = 2 ∗ Acc(V′) ∗ c(V′) / (Acc(V′) + c(V′))
• Results calculated using stratified 10-fold cross validation 
– Feature selection is performed inside each fold 
• Parameter optimization for each feature selection strategy 
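A small sketch of the three metrics as plain functions; the input numbers are hypothetical, not taken from the results:

```python
def accuracy(correct, total):
    return correct / total

def compression(selected_size, original_size):
    return 1 - selected_size / original_size

def harmonic_mean(acc, comp):
    return 2 * acc * comp / (acc + comp)

acc = accuracy(180, 212)        # hypothetical fold on the Cities dataset
comp = compression(90, 727)     # 90 of 727 features kept
print(round(harmonic_mean(acc, comp), 3))
```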
Evaluation: SHSEL IG
• Classification accuracy when using different relevance similarity thresholds
on the Cities dataset
[Figure: accuracy, compression, and harmonic mean (y-axis, 0–100%) plotted against the relevance similarity threshold (x-axis)]
Evaluation: Classification Accuracy (NB)
[Figure: classification accuracy (0–100%) of original, initialSHSEL IG/C, pruneSHSEL IG/C, SIG, SC, TSEL Lift, TSEL IG, HillClimbing, and GreedyTopDown on the five real-world datasets (top) and the six synthetic datasets (bottom)]
Evaluation: Feature Space Compression (NB)
[Figure: feature space compression (0–100%) of initialSHSEL IG/C, pruneSHSEL IG/C, SIG, SC, TSEL Lift, TSEL IG, HillClimbing, and GreedyTopDown on the five real-world datasets (top) and the six synthetic datasets (bottom)]
Evaluation: Harmonic Mean (NB)
[Figure: harmonic mean of accuracy and compression (0–100%) for initialSHSEL IG/C, pruneSHSEL IG/C, SIG, SC, TSEL Lift, TSEL IG, HillClimbing, and GreedyTopDown on the five real-world datasets (top) and the six synthetic datasets (bottom)]
Conclusion & Outlook 
• Contribution 
– An approach that exploits hierarchies for feature selection in 
combination with standard metrics 
– The evaluation shows that the approach outperforms standard feature
selection techniques, as well as other approaches that use hierarchies
• Future Work 
– Conduct further experiments 
• E.g. text mining, bioinformatics 
– Feature Selection in unsupervised learning 
• E.g. clustering, outlier detection 
• Laplacian Score 