Efficient Implementation of Self-Organizing Map
for Sparse Input Data
Josué Melka and Jean-Jacques Mariage
1 Introduction
SOM Description
Standard vs. Batch Versions
Motivation for Sparse SOM
2 Contributions
3 Experiments
Presentation: The SOM Network
Self-Organizing Map (Kohonen 1982):
an artificial neural network
trained by unsupervised competitive learning
produces a low-dimensional map of the input space
Many applications
Commonly used for data projection, clustering, etc.
The SOM Map
The map: units (= nodes/neurons) on a lattice.
Associated with each node:
a weight vector
a position in the map
http://www.lohninger.com
Example: MNIST Handwritten Dataset
[Figures: a random sampling of MNIST digits, and the resulting SOM map (12×16 units).]
The SOM Training Algorithm
Within the main loop:
1 compute the distance between input and weight vectors
$d_k(t) = \| x(t) - w_k(t) \|^2$  (1)
2 find the node weight closest to the input (BMU)
$d_c(t) = \min_k d_k(t)$  (2)
3 update the BMU and its neighbors to be closer to the input
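To make these steps concrete, here is a minimal C++ sketch of steps 1 and 2 for dense vectors (not taken from the authors' implementation; the names `sq_distance` and `find_bmu` are illustrative):

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Step 1: squared Euclidean distance between input x and weight vector w (eq. 1).
double sq_distance(const std::vector<double>& x, const std::vector<double>& w) {
    double d = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double diff = x[i] - w[i];
        d += diff * diff;
    }
    return d;
}

// Step 2: best-matching unit (BMU), the node whose weights are closest to x (eq. 2).
std::size_t find_bmu(const std::vector<double>& x,
                     const std::vector<std::vector<double>>& weights) {
    std::size_t best = 0;
    double best_d = std::numeric_limits<double>::max();
    for (std::size_t k = 0; k < weights.size(); ++k) {
        double d = sq_distance(x, weights[k]);
        if (d < best_d) { best_d = d; best = k; }
    }
    return best;
}
```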
Standard Algorithm: The Learning Rule
Update weight vectors at each time step, for a random sample x:
$w_k(t+1) = w_k(t) + \alpha(t)\, h_{ck}(t)\, [x(t) - w_k(t)]$  (3)
$\alpha(t)$ is the decreasing learning rate
$h_{ck}(t)$ is the neighborhood function
For example, the Gaussian:
$h_{ck}(t) = \exp\!\left( - \dfrac{\| r_k - r_c \|^2}{2 \sigma(t)^2} \right)$
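A hedged C++ sketch of one online update step (3) with this Gaussian neighborhood, assuming a 2-D grid of node positions and dense weight vectors (the function name and signature are hypothetical):

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// One step of the standard (online) learning rule (eq. 3).
// 'positions' holds the grid coordinates r_k of each node; 'c' is the BMU index.
void update_standard(std::vector<std::vector<double>>& weights,
                     const std::vector<std::pair<double, double>>& positions,
                     const std::vector<double>& x,
                     std::size_t c, double alpha, double sigma) {
    for (std::size_t k = 0; k < weights.size(); ++k) {
        double dx = positions[k].first  - positions[c].first;
        double dy = positions[k].second - positions[c].second;
        // Gaussian neighborhood h_ck(t)
        double h = std::exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma));
        double coef = alpha * h;
        for (std::size_t i = 0; i < x.size(); ++i)
            weights[k][i] += coef * (x[i] - weights[k][i]);
    }
}
```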
The Batch Algorithm
The BMUs and their neighbors are updated once at the end of
each epoch, with the average of all the samples that trigger them:
$w_k(t_f) = \dfrac{\sum_{t'=t_0}^{t_f} h_{ck}(t')\, x(t')}{\sum_{t'=t_0}^{t_f} h_{ck}(t')}$  (4)
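A minimal sketch of how the batch rule (4) can be implemented: accumulate the weighted sums over one epoch, then apply them at its end. The `BatchAccumulator` type is illustrative, not part of the software described here.

```cpp
#include <cstddef>
#include <vector>

// Batch update (eq. 4): accumulate h_ck * x and h_ck over one epoch,
// then set each weight vector to the weighted mean of the samples.
struct BatchAccumulator {
    std::vector<std::vector<double>> numerator;   // sum of h_ck(t') * x(t')
    std::vector<double>              denominator; // sum of h_ck(t')

    BatchAccumulator(std::size_t nodes, std::size_t dim)
        : numerator(nodes, std::vector<double>(dim, 0.0)), denominator(nodes, 0.0) {}

    void accumulate(std::size_t k, double h, const std::vector<double>& x) {
        for (std::size_t i = 0; i < x.size(); ++i)
            numerator[k][i] += h * x[i];
        denominator[k] += h;
    }

    void apply(std::vector<std::vector<double>>& weights) const {
        for (std::size_t k = 0; k < weights.size(); ++k)
            if (denominator[k] > 0.0)
                for (std::size_t i = 0; i < weights[k].size(); ++i)
                    weights[k][i] = numerator[k][i] / denominator[k];
    }
};
```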
Motivation: Reduce the Computing Time
Computing time depends on:
1 the T training iterations
2 the M nodes in the network
3 the D dimensions of the vectors
The issue: large sparse datasets
Many real-world datasets are sparse and high-dimensional, but existing SOM implementations cannot exploit this sparseness to save time.
Dataset Examples: Dense vs. Sparse
MNIST dataset (dim: 780, density: 19.22%)
News-20 dataset (dim: 62061, density: 0.13%)
Overcoming the Dimensionality Problem
A popular option: reduce the dimensionality of the input space
Use space-reduction techniques that are less sensitive to dimensionality (such as SVD, random mapping, etc.).
But is this the only way?
Let f be the fraction of non-zero values, and d = D × f.
Can we reduce the SOM complexity from O(TMD) to O(TMd)?
1 Introduction
2 Contributions
Sparse Algorithms
Parallelism
3 Experiments
Compressed Sparse Row (CSR) Format
https://op2.github.io/PyOP2/linear_algebra.html
Is this sufficient to reduce the computing time?
CSR supports efficient linear algebra operations. But:
node weights must be dense for training
not all operations produce sparse output
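For illustration, a bare-bones CSR representation and a sparse dot product against a dense weight vector might look as follows (a sketch, not the actual data structures of the implementation):

```cpp
#include <cstddef>
#include <vector>

// Compressed Sparse Row: only the non-zero values of each sample are stored.
struct CSRMatrix {
    std::vector<double>      values;   // non-zero values, row by row
    std::vector<std::size_t> colidx;   // column index of each stored value
    std::vector<std::size_t> rowptr;   // rowptr[r] .. rowptr[r+1] delimit row r
};

// Sparse dot product between row r of the CSR matrix and a dense vector w.
double sparse_dot(const CSRMatrix& m, std::size_t r, const std::vector<double>& w) {
    double s = 0.0;
    for (std::size_t p = m.rowptr[r]; p < m.rowptr[r + 1]; ++p)
        s += m.values[p] * w[m.colidx[p]];
    return s;
}
```

The dot product visits only the stored non-zeros of the row, which is what makes the per-sample cost proportional to d rather than D.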
Speeding Up the BMU Search
The squared Euclidean distance (1) can be expanded as:
$d_k(t) = \| x(t) \|^2 - 2\, (w_k(t) \cdot x(t)) + \| w_k(t) \|^2$  (5)
Batch version
This change suffices to make the sparse batch SOM efficient, since $\| w_k \|^2$ can be computed once per epoch.
Standard version
By storing $\| w_k(t) \|^2$, we can compute $\| w_k(t+1) \|^2$ efficiently:
$\| w_k(t+1) \|^2 = (1 - \beta(t))^2 \| w_k(t) \|^2 + \beta(t)^2 \| x(t) \|^2 + 2 \beta(t) (1 - \beta(t)) (w_k(t) \cdot x(t))$  (6)
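A sketch of the BMU search using this decomposition, with the sparse sample given as parallel index/value arrays and $\|w_k\|^2$ cached in `w_sqnorm` (names and signature are illustrative):

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// BMU search with the expanded distance (eq. 5):
//   d_k(t) = ||x||^2 - 2 (w_k . x) + ||w_k||^2
// The sparse sample x is given as parallel arrays (x_idx, x_val); ||w_k||^2 is
// cached in w_sqnorm, so the per-node cost depends only on the non-zeros of x.
std::size_t find_bmu_sparse(const std::vector<std::size_t>& x_idx,
                            const std::vector<double>& x_val,
                            const std::vector<std::vector<double>>& weights,
                            const std::vector<double>& w_sqnorm) {
    double x_sqnorm = 0.0;
    for (double v : x_val) x_sqnorm += v * v;

    std::size_t best = 0;
    double best_d = std::numeric_limits<double>::max();
    for (std::size_t k = 0; k < weights.size(); ++k) {
        double dot = 0.0;
        for (std::size_t p = 0; p < x_idx.size(); ++p)
            dot += x_val[p] * weights[k][x_idx[p]];
        double d = x_sqnorm - 2.0 * dot + w_sqnorm[k];
        if (d < best_d) { best_d = d; best = k; }
    }
    return best;
}
```

In the batch version `w_sqnorm` only needs to be recomputed once per epoch; in the standard version it can be maintained incrementally with (6).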
Sparse-Som: Speeding Up the Update Phase
We express the learning rule (3) as (Natarajan 1997):
$w_k(t+1) = (1 - \beta(t)) \left[ w_k(t) + \dfrac{\beta(t)}{1 - \beta(t)}\, x(t) \right]$  (7)
Don’t update entire weight vectors
We keep the scalar coefficient separately, so we update only the
values affected by x(t).
Numerical stability
To avoid numerical stability issues, we use double-precision
floating-point, and rescale the weights when needed.
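A possible C++ sketch of this trick: the weight vector is stored as `scale * v`, so the multiplication by (1 − β) costs O(1) and only the non-zero components of x(t) are touched. The `ScaledVector` type and the rescaling threshold are illustrative, not the authors' code.

```cpp
#include <cstddef>
#include <vector>

// Sparse online update (eq. 7): the weight vector w_k is stored as scale * v.
// Here beta = alpha(t) * h_ck(t), as in the learning rule (3).
struct ScaledVector {
    std::vector<double> v;   // unscaled weight values
    double scale = 1.0;      // the actual weights are scale * v

    void update(const std::vector<std::size_t>& x_idx,
                const std::vector<double>& x_val, double beta) {
        scale *= (1.0 - beta);                          // shrink the whole vector in O(1)
        double coef = beta / scale;                     // beta/(1-beta), adjusted by the accumulated scale
        for (std::size_t p = 0; p < x_idx.size(); ++p)  // touch only the non-zeros of x(t)
            v[x_idx[p]] += coef * x_val[p];
        if (scale < 1e-100)                             // illustrative threshold: rescale before underflow
            rescale();
    }

    void rescale() {                                    // fold the accumulated scale back into v
        for (double& value : v) value *= scale;
        scale = 1.0;
    }
};
```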
Parallel Approaches for SOM
Specialized hardware (historical)
Massively parallel computing
Shared-memory multiprocessing
http://www.nersc.gov/users/computational-systems/cori
How to Split the Computation
Network partitioning divides the neurons
Data partitioning dispatches the input data
What We Did
Sparse-Som: hard to parallelize
not suited to data partitioning
too much latency with network partitioning
Sparse-BSom: much simpler
suited to both data and network partitioning (see the sketch below)
less synchronization needed
Another issue specific to sparseness
Memory-access latency, caused by the non-linear access pattern to the weight vectors.
Mitigation: improve processor cache locality for the weight-vector accesses in the inner loop.
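As an illustration of data partitioning for the batch version, a skeleton of one OpenMP-parallel epoch might look like this (a sketch assuming per-thread partial sums merged once per epoch; not the actual Sparse-BSom code):

```cpp
#include <cstddef>
#include <vector>
#include <omp.h>

// Data partitioning for the batch version: each thread accumulates the batch
// sums (eq. 4) over its own slice of the samples, then the per-thread partial
// sums are reduced into the shared accumulators once per epoch.
void batch_epoch(std::size_t n_samples, std::size_t n_nodes, std::size_t dim,
                 std::vector<std::vector<double>>& numerator,
                 std::vector<double>& denominator) {
    #pragma omp parallel
    {
        // Thread-private partial sums: no synchronization in the sample loop.
        std::vector<std::vector<double>> num(n_nodes, std::vector<double>(dim, 0.0));
        std::vector<double> den(n_nodes, 0.0);

        #pragma omp for schedule(static)
        for (long t = 0; t < static_cast<long>(n_samples); ++t) {
            // ... find the BMU of sample t, then accumulate h_ck * x(t) into
            //     num and h_ck into den for the BMU and its neighbors ...
        }

        // Merge the partial sums into the shared accumulators, once per thread.
        #pragma omp critical
        {
            for (std::size_t k = 0; k < n_nodes; ++k) {
                denominator[k] += den[k];
                for (std::size_t i = 0; i < dim; ++i)
                    numerator[k][i] += num[k][i];
            }
        }
    }
}
```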
1 Introduction
2 Contributions
3 Experiments
Evaluation
Speed benchmark
Quality test
The Evaluation
We trained SOM networks
on various datasets
with the same parameters
repeating each test 5 times
then measured their performance (speed and quality).
Our speed baseline
Somoclu (Wittek et al. 2017) is a massively parallel batch SOM implementation that uses the classical algorithm.
Datasets Characteristics
dataset    classes  features  samples (train / test)  density % (train / test)
rcv1            53     47236       15564 / 518571           0.14 / 0.14
news20          20     62061        15933 / 3993            0.13 / 0.13
sector         105     55197         6412 / 3207            0.29 / 0.30
mnist           10       780        60000 / 10000          19.22 / 19.37
usps            10       256         7291 / 2007          100.00 / 100.00
letter          26        16        15000 / 5000          100.00 / 100.00
protein          3       357        17766 / 6621           29.00 / 26.06
dna              3       180         2000 / 1186           25.34 / 25.14
satimage         6        36         4435 / 2000           98.99 / 98.96
Training Parameters
Each network:
30 × 40 units grid
Rectangular lattice
t_max = 10 × N_samples (i.e. K_epochs = 10)
α(t) = 1 − t/t_max
Gaussian neighborhood, with a radius decreasing linearly from 15 to 0.5
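For reference, these two decay schedules can be written directly as follows (a trivial sketch with the values from this slide; the function names are illustrative):

```cpp
#include <cstddef>

// Linear learning-rate decay: alpha(t) = 1 - t / t_max.
double learning_rate(std::size_t t, std::size_t t_max) {
    return 1.0 - static_cast<double>(t) / t_max;
}

// Neighborhood radius decreasing linearly from 15 to 0.5 over the training.
double radius(std::size_t t, std::size_t t_max) {
    const double r0 = 15.0, r1 = 0.5;
    return r0 + (r1 - r0) * static_cast<double>(t) / t_max;
}
```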
Speed Benchmark
4 datasets used (2 sparse and 2 dense), to test both:
Serial mode (Sparse-Som vs. Sparse-BSom)
Parallel mode (Sparse-BSom vs. Somoclu)
Hardware and system specifications:
Intel Xeon E5-4610
4 sockets of 6 cores each
clocked at 2.4 GHz
2 threads / core
Linux Ubuntu 16.04 (64-bit)
GCC 5.4
Results: Serial Performance
[Bar chart: elapsed time in seconds, serial mode]
             usps   mnist   sector   news20   rcv1
Sparse-Som     30     203       94      144    123
Sparse-BSom    50     302       34       44     36
Results: Parallel Performance
[Two log-scale plots: elapsed time (seconds) vs. number of CPU cores (1 to 32), for Somoclu and for Sparse-BSom, each on the sector, news20, mnist and usps datasets.]
Quality Evaluation: Methodology
Metrics used:
1 Average Quantization Error
$Q = \dfrac{\sum_{i=1}^{N} \| x_i - w_c \|}{N}$  (8), where $w_c$ is the weight vector of the BMU of $x_i$
2 Topological Error
3 Precision and Recall in classification
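A hedged sketch of the first metric, the average quantization error (8), computed here over dense vectors for simplicity (an illustrative function; the actual implementation works on sparse data):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Average quantization error (eq. 8): the mean distance between each sample
// and the weight vector of its best-matching unit.
double avg_quantization_error(const std::vector<std::vector<double>>& samples,
                              const std::vector<std::vector<double>>& weights) {
    double total = 0.0;
    for (const auto& x : samples) {
        double best = std::numeric_limits<double>::max();
        for (const auto& w : weights) {
            double d = 0.0;
            for (std::size_t i = 0; i < x.size(); ++i) {
                double diff = x[i] - w[i];
                d += diff * diff;
            }
            best = std::min(best, d);
        }
        total += std::sqrt(best);   // eq. (8) uses the non-squared norm
    }
    return total / samples.size();
}
```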
Results: Average Quantization Error
Sparse-Som Sparse-BSom
rcv1 0.825 ± 0.001 0.816 ± 0.001
news20 0.905 ± 0.000 0.901 ± 0.001
sector 0.814 ± 0.001 0.772 ± 0.003
mnist 4.400 ± 0.001 4.500 ± 0.008
usps 3.333 ± 0.002 3.086 ± 0.006
protein 2.451 ± 0.000 2.450 ± 0.001
dna 4.452 ± 0.006 3.267 ± 0.042
satimage 0.439 ± 0.001 0.377 ± 0.001
letter 0.357 ± 0.001 0.345 ± 0.002
Results: Topological Error
Sparse-Som Sparse-BSom
rcv1 0.248 ± 0.007 0.353 ± 0.010
news20 0.456 ± 0.020 0.604 ± 0.014
sector 0.212 ± 0.012 0.514 ± 0.017
mnist 0.369 ± 0.005 0.268 ± 0.003
usps 0.150 ± 0.006 0.281 ± 0.011
protein 0.505 ± 0.006 0.448 ± 0.007
dna 0.099 ± 0.004 0.278 ± 0.023
satimage 0.103 ± 0.005 0.239 ± 0.015
letter 0.160 ± 0.003 0.269 ± 0.008
Results: Prediction Evaluation
            Sparse-Som                    Sparse-BSom
            precision    recall           precision    recall
rcv1        79.2 ± 0.5   79.3 ± 0.6       81.3 ± 0.4   82.1 ± 0.3
            73.7 ± 0.4   70.6 ± 0.5       76.6 ± 0.4   72.6 ± 0.5
news20      64.2 ± 0.5   62.8 ± 0.5       50.3 ± 0.9   49.6 ± 0.8
            60.0 ± 1.7   55.4 ± 1.3       47.8 ± 1.2   43.6 ± 1.2
sector      77.2 ± 0.9   73.2 ± 0.9       58.4 ± 0.5   56.0 ± 1.0
            73.3 ± 0.8   61.3 ± 1.8       60.9 ± 1.3   44.8 ± 1.0
mnist       93.5 ± 0.2   93.5 ± 0.2       91.5 ± 0.2   91.5 ± 0.2
            93.4 ± 0.2   93.4 ± 0.2       91.7 ± 0.2   91.7 ± 0.2
usps        95.9 ± 0.2   95.9 ± 0.2       95.6 ± 0.2   95.6 ± 0.2
            91.4 ± 0.3   90.7 ± 0.3       92.4 ± 0.5   91.5 ± 0.4
protein     56.7 ± 0.2   57.5 ± 0.2       56.7 ± 0.4   57.6 ± 0.3
            49.8 ± 0.7   51.2 ± 0.6       50.7 ± 0.7   52.1 ± 0.6
dna         90.9 ± 0.6   90.8 ± 0.5       88.5 ± 0.6   88.5 ± 0.5
            77.7 ± 1.5   69.6 ± 2.1       81.9 ± 2.9   30.3 ± 1.7
satimage    92.3 ± 0.4   92.4 ± 0.3       92.5 ± 0.4   92.6 ± 0.4
            87.6 ± 0.3   85.4 ± 0.4       88.7 ± 0.5   86.3 ± 0.5
letter      83.8 ± 0.3   83.7 ± 0.3       81.9 ± 0.3   81.7 ± 0.4
            81.5 ± 0.5   81.1 ± 0.5       80.2 ± 0.3   79.8 ± 0.5
Summary: Main Benefits
Sparse-Som and Sparse-BSom run much faster than their classical “dense” counterparts on sparse data.
Advantages of each version:
Sparse-Som
maps seem to have a better organization
Sparse-BSom
highly parallelizable
more memory efficient (single-precision)
Thank you!