Paper on experimental setup for verifying - "Slow Learners are Fast"

Machine Learning on Cell Processor

Submitted By: Robin Srivastava (Uni ID: U4700252)
Supervisor: Dr. Eric McCreath

Course: COMP8740

Abstract
The delayed stochastic gradient technique presented in the paper "Slow Learners are Fast"
shows theoretically how the online learning process can be parallelized. However, with the
real experimental setup given in that paper, parallelization does not improve performance.
In this project we implement and evaluate the algorithm on a Cell processor and on an Intel
dual-core processor, with the aim of obtaining a speedup for the outlined real experimental
setup. We also discuss the limitations of the Cell processor with respect to this algorithm
and suggest CPU architectures for which it is better suited.


1. INTRODUCTION

2. BACKGROUND

2.1 MACHINE LEARNING
2.2 ALGORITHM (REFERENCED FROM [LANGFORD, SMOLA AND ZINKEVICH, 2009])
2.3 POSSIBLE TEMPLATES FOR IMPLEMENTATION
A) ASYNCHRONOUS OPTIMIZATION
B) PIPELINED OPTIMIZATION
C) RANDOMIZATION
2.4 CELL PROCESSOR
2.5 EXPERIMENTAL SETUP

3. DESIGN AND IMPLEMENTATION

3.1 PRE-PROCESSING TREC DATASET
3.1.1 INTEL DUAL CORE
3.1.2 CELL PROCESSOR
3.1.3 REPRESENTATION OF EMAILS AND LABELS
3.2 IMPLEMENTATION OF LOGISTIC REGRESSION
3.3 IMPLEMENTATION OF LOGISTIC REGRESSION WITH DELAYED UPDATE
3.3.1 IMPLEMENTATION ON A DUAL CORE INTEL PENTIUM PROCESSOR
3.3.2 IMPLEMENTATION ON CELL BROADBAND ENGINE

4. RESULTS

5. CONCLUSION AND FUTURE WORK

APPENDIX I

BAG OF WORDS REPRESENTATION

APPENDIX II

HASHING

REFERENCES


1. Introduction
The inherent properties of online learning algorithms make them an excellent way of making
machines learn. This type of learning uses observations either one at a time or in small
batches and discards them before the next set of observations is considered. Online
algorithms are a suitable candidate for real-time learning, where data arrives as a stream
and predictions must be made before the whole dataset has been seen. They are also useful
for large datasets because they do not require the whole dataset to be loaded into memory
at once.

On the flip side, this very property of sequentiality turns out to be a curse for
performance. The algorithm is inherently sequential, and with the advent of multi-core
processors it leads to severe under-utilization of the resources these high-end machines
offer.

In Langford et al. [1], the authors gave a parallel version of the online learning algorithm
along with its performance data when run on a machine with eight cores and 32 GB of
memory. They implemented the algorithm in Java. The simulation results were promising:
they obtained a speedup as the number of threads increased, as shown in Figure 1. However,
their efforts to parallelize the exact experiments failed because the serial implementation
was already fast, handling over 150,000 examples/second. Because the mathematical
calculations involved in this algorithm can be accelerated by SIMD operations, and Java
offers no programming support for SIMD, we implemented and evaluated the algorithm on a
Cell processor to exploit the SIMD capabilities of its specialized co-processors, with a
view to obtaining a speedup for the real experimental setup. The algorithm was also
implemented for a machine with an Intel dual-core processor and 1.86 GB of RAM.

Figure 1 From Langford et al. [1]
The Cell processor is the first implementation of the Cell Broadband Engine Architecture
(CBEA). It has a primary processor based on the 64-bit IBM PowerPC architecture and eight
specialized co-processors with SIMD support. Communication among these processors, their
dedicated local stores and main memory takes place over a very high-speed channel with a
theoretical peak transfer rate of 96 B/cycle. Data communication plays a crucial role in
implementing this algorithm on Cell, primarily because of the large gap between the amount
of data to be processed (approx. 76 MB) and the memory available to each co-processor
(256 KB). An efficient approach to bridging this gap is discussed in Section 3 (Design and
Implementation), which also describes how the data was pre-processed for the Intel
dual-core and Cell implementations. Section 2 (Background) covers gradient descent and
delayed stochastic gradient descent, the possible templates for implementing the latter, an
overview of the Cell processor and the real experimental setup suggested by the designers
of the algorithm. Section 4 (Results) compares the algorithm on both machines, and Section
5 (Conclusion and Future Work) concludes and suggests a CPU architecture for which the
algorithm would be better suited, where we might expect better speedup and reduced coding
complexity.


2. Background
2.1 Machine Learning
Machine learning is a technique by which a machine modifies its own behaviour on the
basis of past experience and performance. The collected data on past experience and
performance is called the training set. One way to make a machine learn is to pass it the
entire training set in one go. This method is known as batch learning. The generic steps
for batch learning are as follows:

Step 1: Initialize the weights.
Step 2: For each batch of training data
Step 2a: Process all the training data
Step 2b: Update the weight

A well-known batch learning algorithm is gradient descent, in which after every step the
weight vector moves in the direction of greatest decrease of the error function. This works
because, if a real-valued function F(x) is defined and differentiable in a neighbourhood of
a point a, then F(x) decreases fastest from a in the direction of the negative gradient of
F at a, namely −∇F(a). Therefore, if b = a − η∇F(a) for a sufficiently small η > 0, then
F(a) ≥ F(b). The algorithm proceeds as follows:

Step 1: Initialize the weight vector $w^{(0)}$ with some arbitrary values
Step 2: Update the weight vector as follows
$$ w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E\left(w^{(\tau)}\right) $$
where ∇E is the gradient of the error function and η is the learning
rate.
Step 3: Repeat step 2 for all the batches of data

This algorithm, however, does not prove to be a very efficient one (discussed in Bishop
and Nabney, 2008). Two major weaknesses of gradient descent are:

1. The algorithm can take many iterations to converge towards a local minimum, if the
curvature in different directions is very different.
2. Finding the optimal η per step can be time-consuming. Conversely, using a
fixed η can yield poor results.


Other, more robust and faster batch learning algorithms include conjugate gradients and
quasi-Newton methods. Gradient-based batch methods must be run many times to obtain an
optimal solution, which is computationally very costly for large datasets. There is another
way to make machines learn: passing records from the training set one at a time (online
learning). To overcome the aforementioned weaknesses of gradient-based methods, there is
an online gradient descent algorithm that has proved useful in practice for training neural
networks on large data sets (Le Cun et al. 1989). It is also called sequential or stochastic
gradient descent, and it updates the weight vector based on one record at a time. The
update is done for each record, either in consecutive order or at random. The steps of
stochastic gradient descent are the same as those outlined above for batch gradient
descent, except that only one data point is considered per iteration.
The algorithm given in Section 2.2 is a parallel version of stochastic gradient descent
based on the concept of delayed update.

2.2 Algorithm (Referenced from [Langford, Smola and Zinkevich, 2009])

Input: Feasible space $W \subseteq \mathbb{R}^n$, annealing schedule $\eta_t$ and delay $\tau \in \mathbb{N}$
Initialization: Set $w_1, \ldots, w_\tau = 0$ and compute corresponding $g_t = \nabla f_t(w_t)$
For $t = \tau + 1$ to $T + \tau$ do
    Obtain $f_t$ and incur loss $f_t(w_t)$
    Compute $g_t = \nabla f_t(w_t)$
    Update $w_{t+1} = \arg\min_{w \in W} \lVert w - (w_t - \eta_t g_{t-\tau}) \rVert$
End for
where $f_i : \chi \to \mathbb{R}$ is a convex function and $\chi$ is a Banach space.

The goal here is to find some parameter vector w such that the sum over functions f i
takes the smallest possible value. In the algorithm if τ = 0, it becomes the standard
stochastic gradient descent algorithm. Here, instead of updating the parameter vector wt
by the current gradient g t , it is updated by a delayed gradient g t −τ .

2.3 Possible templates for implementation
There are three suggested implementation models for delayed stochastic gradient
descent. Following any of the three would lead to an effective implementation of the
algorithm. Each model makes some assumptions about the dataset being used, so a model
can be chosen by matching the constraints of the problem against the assumptions of that
model.

a) Asynchronous Optimization
Assume a machine with n cores. We further assume that the time taken to compute the
gradient of f t is at least n times higher than the time taken to update the weight
vector. We run stochastic gradient descent on all n cores of the machine on different
instances of f t while sharing a common instance of the weight vector. Each core is
allowed to update the shared copy of the weight vector in a round-robin fashion. This
results in a delay of τ = n – 1 between when a core sees f t and when it gets to update
the shared copy of the weight vector. This template is primarily suitable when the
computation of f t takes a long time. The implementation requires explicit
synchronization because the update of the weight vector must be atomic. Depending on the
CPU architecture, a significant amount of bandwidth could be consumed exclusively by
synchronization.

b) Pipelined Optimization
In this form of optimization we parallelize the computation of f t instead of running
the same instance on different cores. In this case the delay occurs in the second
stage of processing of results. While the second stage is still busy processing the
result of the first, the latter has already moved on to the processing of f t +1 . Even in
this case the weight vector is computed with a delay of τ .

c) Randomization
This form of optimization is used when there is a high correlation between τ and f t
such that we cannot treat data as i.i.d. The observations are de-correlated by doing
random permutations of the instances. The delay in this case occurs during the
update of model parameters because range of de-correlation needs to exceed τ
considerably.

2.4 Cell Processor

The Cell processor is the first implementation of the Cell Broadband Engine Architecture
(CBEA) (Figure 2), which emerged from a joint venture of IBM, Sony and Toshiba. It is a
fully compatible extension of the 64-bit PowerPC Architecture. The design of the CBEA was
based on the analysis of workloads in a wide variety of areas such as cryptography,
graphics transform and lighting, physics, fast Fourier transforms (FFT), matrix operations,
and scientific workloads.

The Cell processor is a multicore, heterogeneous chip carrying one 64-bit power
processor element (PPE), eight specialized single-instruction multiple-data (SIMD)
co-processors called synergistic processing elements (SPEs) and a high-bandwidth bus
interface (Element Interconnect Bus), all integrated on-chip.

The PPE consists of a power processing unit (PPU) connected to 512 KB of L2 cache. It
is the main processor of Cell and is responsible for running the OS as well as managing
the workload among the SPEs. The PPU is a dual-issue, in-order processor with dual-thread
support. It can fetch four instructions at a time and issue two. To improve the
performance of in-order issue, the PPE uses delayed-execution pipelines and allows
limited out-of-order execution.

An SPE (Figure 4) consists of a synergistic processing unit (SPU) and a synergistic
memory flow controller (SMF). It is used for data-intensive applications found readily in
cryptography, media and high-performance scientific applications. Each SPE runs an
independent application thread. The SPE design is optimized for computation-intensive
applications. It has SIMD support, as mentioned above, and 256 KB of local store. The
memory flow controller consists of a DMA controller along with a memory management
unit (MMU) and an atomic unit to handle synchronization with other SPEs and with
the PPE. The SPU is also a dual-issue, in-order processor, like the PPU.

Figure 2 Cell Broadband Engine Architecture

The SPU works on data that resides in its dedicated local store, which in turn depends on
the channel interface for accessing main memory and the local stores of other SPEs. The
channel interface runs independently of the SPU and resides in the MFC. In parallel, an SPU
can perform operations on sixteen 8-bit integers, eight 16-bit integers, four 32-bit
integers, or four single-precision floating-point numbers per cycle. At 3.2 GHz, each SPU
is capable of performing up to 51.2 billion 8-bit integer operations per second, or
25.6 GFLOPS in single precision.
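The per-SPU peak figures follow from the lane counts and the clock rate. The single-precision figure works out if each lane's fused multiply-add is counted as two floating-point operations, which is an assumption added here rather than something stated in the text:

```latex
\begin{align*}
16\ \text{lanes} \times 3.2 \times 10^{9}\ \text{cycles/s}
  &= 51.2 \times 10^{9}\ \text{8-bit integer operations/s} \\
4\ \text{lanes} \times 2\ \text{flops per fused multiply-add} \times 3.2 \times 10^{9}\ \text{cycles/s}
  &= 25.6\ \text{GFLOPS}
\end{align*}
```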

The PPE and SPEs communicate through an internal high-speed element interconnect
bus (EIB) [2] (Figure 3). Apart from these processors EIB also allows communication
among off-chip memory and external IO.

Figure 3 Element Interconnect Bus, from [3]

The EIB is implemented as a circular ring consisting of four 16B-wide unidirectional
channels. Two of them rotate clockwise and two anti-clockwise. Each channel can carry up
to three concurrent transactions. The EIB runs at half the system clock rate and thus has
an effective channel rate of 16 bytes every two system clocks. At maximum concurrency,
with three active transactions on each of the four rings, the peak instantaneous EIB
bandwidth is 96B per clock (12 concurrent transactions * 16 bytes wide / 2 system clocks
per transfer). The theoretical peak of the EIB at 3.2 GHz is 204.8GB/s.
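The two EIB figures are derived differently, which is why 204.8 GB/s is lower than 96 B per clock multiplied by 3.2 GHz (307.2 GB/s). A reading consistent with the description in [2], offered here as an assumption since the text does not spell it out, is that the peak is bounded by address snooping: one request of up to 128 bytes per bus cycle, with the bus running at half the 3.2 GHz system clock.

```latex
\begin{align*}
\text{instantaneous peak} &= \frac{12\ \text{transactions} \times 16\ \text{B}}{2\ \text{system clocks}}
  = 96\ \text{B per system clock} \\
\text{theoretical peak} &= 128\ \text{B per bus cycle} \times 1.6 \times 10^{9}\ \text{bus cycles/s}
  = 204.8\ \text{GB/s}
\end{align*}
```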

Figure 4 SPE, from [4]
The memory interface controller (MIC) in the Cell BE chip is connected to the external
RAMBUS XDR memory through two XIO channels operating at a maximum effective
frequency of 3.2GHz. The MIC has separate read and write request queues for each XIO
channel operating independently. For each channel, the MIC arbiter alternates the
dispatch between read and write queues after a minimum of every eight dispatches from
each queue or until the queue becomes empty, whichever is shorter. High-priority read
requests are given priority over reads and writes. With both XIO channels operating at
3.2GHz, the peak raw memory bandwidth is 25.6GB/s. However, normal memory
operations such as refresh, scrubbing, and so on, typically reduce the bandwidth by about
1GB/s.

2.5 Experimental Setup
The experiment uses asynchronous optimization (Section 2.3), which Figure 5 describes
schematically. Each core computes its own error gradient and updates a copy of the weight
vector that is shared among all the cores. This update is carried out in a round-robin
fashion. The delay between the computation of a gradient and the update of the weight
vector is τ = n − 1. Explicit synchronization is required for the atomic update of the
weight vector.
The experiment is run on the complete dataset using all the available cores.

[Diagram: four cores working in parallel each read data and compute an error gradient; all
of them update one shared weight vector.]
Figure 5 Asynchronous Optimization


3. Design and Implementation
There were three stages in the implementation of the project:
1. Pre-processing of the TREC dataset
2. Implementation of the logistic regression algorithm
3. Implementation of logistic regression in accordance with the methodologies
suggested in the delayed stochastic gradient technique

3.1 Pre-processing TREC Dataset
3.1.1 Intel Dual Core
The dataset contains 75,419 emails. These emails were tokenized by a list of symbols
(white space( ); comma(,); back slash(\); period(.); semi-colon(;); colon(:); single(‘) and
double inverted comma(“); open and close parenthesis(( )), brace({ }) and bracket([ ]);
greater(>) and less(<) than sign; hyphen(-); at symbol(@); equals(=); new line(\n);
carriage return(\r); and tab(\t)). Tokenization with this symbol list yielded 2,218,878
different tokens. A dictionary of tokens, containing the token name along with a unique
index for each token, was created and stored in a file (dictionary).

[Diagram: the raw dataset is processed twice. With the complete dictionary, mails are
converted to vectors and a file is created for each mail vector (Set F1). With the
condensed dictionary, the same steps produce Set F2. Both sets are saved to disk.]
Figure 6 Pre-processing TREC dataset

3.1.2 Cell Processor
Due to memory limitations on the Cell processor, a condensed form of the dictionary was
used. This condensed form contained the first hundred features from the complete
dictionary. On one hand the reduced size hurt the accuracy of the algorithm; on the other,
it made the data more suitable for implementation on Cell. With the condensed form we
transferred 32 mail vectors per MFC operation (the vector form of mail representation is
discussed in the next section), whereas with the complete dictionary the transfer of a
single mail vector would need on the order of tens of MFC operations.

3.1.3 Representation of Emails and Labels
The emails were represented as linear vectors using a simple bag of words representation
(Appendix I). Each entry of a mail vector was stored in a struct with an unsigned int for
the index value and a short for the weight of the respective index. Since the
dimensionality of the complete dataset is very high, hashing (Appendix II) was used with
2^18 bins. While constructing the dictionary without hashing, it took approximately 3 hours
to process ~6000 emails. Hashing reduced this drastically, and it finally took
approximately half an hour to process all the emails in the dataset. Once the dictionary
was in place along with a working framework for hashing, a second pass over the entire
dataset was carried out. In this pass each email was converted to a bag of words
representation and stored in a separate file. The format of the file was in the following
pattern:

Figure 7 Email Files after pre-processing

The labels were provided separately in an array of type short. A label ‘1’ signifies that
the email is ‘ham’ and a label ‘-1’ signifies that it is ‘spam’.
Since each mail was stored in vector form in a file, it took on average only 0.03 ms (on
the Intel dual core, 2 GHz) to parse an email and load it into memory for logistic
regression.

3.2 Implementation of logistic regression
For a two class problem (C1 and C2), the posterior probability of class C1 given the input
data x and a set of fixed basis functions φ = φ(x) is defined by the softmax transformation
$$ p(C_1 \mid \phi) = y(\phi) = \frac{\exp(a_1)}{\exp(a_1) + \exp(a_2)} \tag{3.1} $$
where the activations $a_k$ are given by
$$ a_k = w_k^T \phi \tag{3.2} $$
with $p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi)$, $w_k$ being the weight vector of class $C_k$.

The likelihood function for input data x and target data T (coded in the 1-of-K coding
scheme) is then
$$ p(T \mid w_1, w_2) = \prod_{n=1}^{N} p(C_1 \mid \phi_n)^{t_{n1}} \, p(C_2 \mid \phi_n)^{t_{n2}} = \prod_{n=1}^{N} y_{n1}^{t_{n1}} \, y_{n2}^{t_{n2}} \tag{3.3} $$
where $y_{nk} = y_k(\phi(x_n))$, and T is the N x 2 matrix of target variables with elements $t_{nk}$.
The error function is determined by taking the negative logarithm of the likelihood, and
its gradient can be written as
$$ \nabla_{w_j} E(w_1, w_2) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) \, \phi_n \tag{3.4} $$

The weight vector $w_k$ for the given class $C_k$ is updated as follows:
$$ w_k^{(\tau+1)} = w_k^{(\tau)} - \eta \nabla_{w_k} E(w_1, w_2) \tag{3.5} $$
where η is the learning rate.
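The step from the likelihood (3.3) to the gradient (3.4) can be spelled out as follows. This is the standard softmax derivation (see, for example, [6]); the per-example activation $a_{nj} = w_j^T \phi_n$ and the Kronecker delta $\delta_{kj}$ are notation introduced here for the purpose.

```latex
\begin{align*}
E(w_1, w_2) &= -\ln p(T \mid w_1, w_2)
             = -\sum_{n=1}^{N} \sum_{k=1}^{2} t_{nk} \ln y_{nk} \\
\frac{\partial y_{nk}}{\partial a_{nj}} &= y_{nk}\,(\delta_{kj} - y_{nj}) \\
\frac{\partial E}{\partial a_{nj}}
  &= -\sum_{k=1}^{2} \frac{t_{nk}}{y_{nk}}\, y_{nk}\,(\delta_{kj} - y_{nj})
   = y_{nj} \sum_{k=1}^{2} t_{nk} - t_{nj}
   = y_{nj} - t_{nj} \\
\nabla_{w_j} E(w_1, w_2)
  &= \sum_{n=1}^{N} \frac{\partial E}{\partial a_{nj}}\, \phi_n
   = \sum_{n=1}^{N} (y_{nj} - t_{nj})\, \phi_n
\end{align*}
```

The third line uses the fact that the 1-of-K targets of each example sum to one.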

In this project the first class is an email being ‘Ham’ and the second class is an email
being ‘Spam’. The feature map φ is the identity function, φ(x) = x. The weight vectors are
initialized to zero.
For the purpose of comparison, two versions of logistic regression were implemented. The
first version was sequential and the second was parallel. In line with the claim made by
the authors of the delayed stochastic gradient technique, the parallel version performed
better than the sequential version without affecting the correctness of the result. The
comparison of performance is given in Section 4.

3.3 Implementation of Logistic Regression with delayed update

To incorporate the concept of delayed update, equation (3.5) above was changed according
to the algorithm described in Section 2.2. This required computing the error gradient
separately on each part of a divided input set. The division of input was carried out
differently for the Intel dual core and the Cell processor. For the former the division
was direct and involved little programming complexity; for the latter it had to be carried
out explicitly and involved significant programming complexity. The division of data is
explained in detail in the following discussion.

The representation chosen for the mails helps to improve the time performance of the
algorithm. Since we store the indices of the non-zero entries, updating a weight vector
with the contribution of a specific mail vector does not require iterating over the
complete dimension of the weight vector and error gradient. This is because a particular
mail vector affects only the indices that are present in it. Figure 8 illustrates this
concept.

[Diagram: a mail vector stored as (count, index) pairs (1, 6), (3, 13), (2, 73) and (5, 88)
touches only positions 6, 13, 73 and 88 of the error gradient and of the weight vector,
each of dimension D.]
Figure 8

3.3.1 Implementation on a Dual Core Intel Pentium processor
The implementation on the Intel dual-core machine (2 GHz with 1.86 GB of main memory)
used the emails processed with the complete dictionary. The mail vectors were created as
and when they were required. The first core processed all the odd-numbered emails and the
second core all the even-numbered ones. Each core computed the error gradient separately
and updated its own private copy of the weight vectors. The shared copy of the weight
vectors was updated atomically by both cores.
This implementation used OpenMP constructs to parallelize the algorithm. OpenMP helped
with the division of the emails: the thread number was combined with a counter to
determine the mail number, which ensured that no two threads would access the same data.

[Diagram: mails 1 to N of Set F1 are divided between Core 1 and Core 2. Each core computes
an error gradient, and the weight vector is updated atomically. Note: N is odd.]
Figure 9 Implementation on Intel Dual Core


3.3.2 Implementation on Cell Broadband Engine

The implementation of the algorithm on the Cell processor used the processed mails
generated from the condensed dictionary. The data was divided sequentially into chunks for
each SPE. The PPE was responsible for constructing the labels and the array of mail
vectors. The data was made available to the SPEs through MFC operations. Each MFC
operation transferred data for 32 mails. This value was chosen because of the limited
capacity (256 KB) of the local store of the SPEs.

The SIMD implementation on Cell could not benefit from the implementation model shown
in Figure 10. A full-scale SIMD implementation involves operations for converting the data
into the __vector form specialized for SPEs. Since we store the indices separately,
converting the data to an appropriate __vector would require rearranging it according to
the indices. This rearrangement would require a large number of load operations and would
erode the overall benefit of SIMD operation. The time complexity of converting the data to
__vector would be O(N^2), where N is the dimension of the mail vector.

[Diagram: the PPE reads mails 1 to N of Set F2 from main memory and distributes them to
SPE-1 through SPE-6. Each SPE computes an error gradient and updates the weight vector.
Note: N is odd.]
Figure 10 Implementation on Cell


For the parallel version of the algorithm each SPE needed to keep a maximum of four weight
vectors in its local store. Two of them were owned privately by the SPE and the remaining
two were shared among all the SPEs. Along with the weight vectors, each SPE also needed to
store two error gradients. The data type of each of these quantities is float. With a
dictionary of 2,218,878 features, the memory requirement is in the order of MBs. The
following two data structures were considered for storing these quantities:
a) Storing the complete data as an array of the required dimension. This data structure is
straightforward and easy to implement, but it can waste memory. For the original
dimension of 2,218,878 it would require approx. 50 MB of memory for each SPE. This is
obviously not feasible, as the local store of an SPE is only 256 KB.
b) Using a struct with an index and a count value for each entry. Since most of the values
in the weight vector and error gradient are not required (refer to the discussion
pertaining to Figure 8), this data structure reduced the required memory significantly,
in theory to the order of a few MBs (approx. 3). This is also not feasible because of
the limited size of the local store in an SPE.

With the data generated by the condensed dictionary and the latter data structure
proposed above, the requirement was reduced to 2400 bytes. The rest of the memory available
in the local store was used for storing the mail vectors and the target labels.
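The memory figures quoted above can be checked with a short calculation. It assumes a 4-byte float and six arrays per SPE (four weight vectors plus two error gradients). The 2400-byte figure matches six arrays of 100 plain floats; that is one reading that reproduces the number, and the exact layout used with the struct is not stated in the text.

```latex
\begin{align*}
\text{full arrays:}\quad & 6 \times 2{,}218{,}878 \times 4\ \text{B} = 53{,}253{,}072\ \text{B} \approx 50.8\ \text{MiB} \\
\text{condensed dictionary:}\quad & 6 \times 100 \times 4\ \text{B} = 2400\ \text{B} \\
\text{local store:}\quad & 256\ \text{KB} = 262{,}144\ \text{B}
\end{align*}
```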

To hide the latency of transferring data from main memory to the local store of the SPE,
the technique of double buffering could be used. While the SPU is computing on one buffer,
the MFC can bring more data from the main memory of the system into the local store of
the SPE. The wait for data transfer is thereby reduced and the transfer latency is hidden
(either partly or completely). The algorithm for processing with double buffering is as
follows:

1. The SPU queues a DMA GET to pull a portion of the problem data set from main
memory into buffer #1.
2. The SPU queues a DMA GET to pull a portion of the problem data set from main
memory into buffer #2.
3. The SPU waits for buffer #1 to finish filling.
4. The SPU processes buffer #1.
5. The SPU (a) queues a DMA PUT to transmit the contents of buffer #1 and then (b)
queues a DMA GETB to execute after the PUT to refill the buffer with the next
portion of data from main memory.
6. The SPU waits for buffer #2 to finish filling.
7. The SPU processes buffer #2.
8. The SPU (a) queues a DMA PUT to transmit the contents of buffer #2 and then (b)
queues a DMA GETB to execute after the PUT to refill the buffer with the next
portion of data from main memory.
9. Repeat starting at step 3 until all data has been processed.
10. Wait for all buffers to finish.


4. Results
The experiments on the Intel dual-core machine were run using the mails processed with
the complete dictionary. The time taken on this machine is significantly higher than that
on Cell. The serial implementation of logistic regression on the Intel dual core takes
36.93 sec and 36.45 sec for two simultaneous runs.

The time taken by the parallel implementation using the delayed stochastic gradient method
is given in Table 1.

Number of Threads   Time in seconds (run 1)   Time in seconds (run 2)
1                   113.09                    47.09
2                   20.85                     20.92

Table 1

For a single thread, the time taken in the first run is very large compared with any other
time. This is because most memory load operations in that run result in a cache miss.
Since all these runs were performed consecutively, the time drops drastically in later
runs because of the reduced cache miss rate. It is also observed that the algorithm
performs worse with a single thread than the serial implementation. In theory these times
should be the same; however, the delayed stochastic gradient implementation spends extra
time dividing the data, and with a single thread that division ends up not being used
anywhere.

Table 2 below shows the performance of the algorithm on multiple SPEs. The time
performance improves as the number of SPEs increases. These values are plotted in the
graph given below. Although the SPEs show better timings, the accuracy suffers to a great
extent (results not shown here).

Number of SPE   Time in microseconds
1               47398
2               44419
3               42407
4               42384
5               42144
6               41966

Table 2

[Graph: Performance. Time in microseconds (41000 to 48000) plotted against Number of SPE
(0 to 7).]

The use of condensed dictionary comes with a severe penalty of accuracy.

The issue of accuracy could be solved by the use of complete dictionary. However, the
memory limitation on Cell processor constrained the use of complete dictionary.


5. Conclusion and Future Work
The delayed update approach shows better time performance as parallelization increases.
The improvement was observed on the Intel dual core as well as on the Cell processor. The
former, being SMP capable, had less data-division overhead than the latter. The Cell
processor posed several limitations on the implementation of this algorithm, the primary
one being memory. The memory limitation caused extra communication overhead. A dataset
with smaller feature vectors might be expected to achieve a better speedup on this machine.
For a dataset with large feature vectors, the algorithm might perform better on a symmetric
multiprocessing (SMP) machine. The algorithm could be studied on a more powerful
SMP-capable machine with a large amount of main memory, since the amount of memory
required to store the data doubles with a unit increase in the level of parallelization.


Appendix I
Bag of Words Representation

A bag-of-words representation is a model for representing a sentence in the form of a
vector. It is frequently used in natural language processing and information retrieval.
The model represents a sentence as an unordered collection of words, without any regard
for grammar.

To form a vector for a sentence, firstly, all the distinct words are identified in it. Each
distinct word is given a unique identifier called index. Each index serves as a dimension in
a D-dimensional vector space, where D is total number of unique words. The magnitude of
the vector in a particular dimension is determined by the count of words having that index.
This process requires two passes through the entire dataset. In the first pass a dictionary
containing the unique words along with their unique indices is created. In the second pass
the vectors are formed by referencing to the dictionary.

For example:

Consider the following sentence

What do you think you are doing?

Word Index
what 0
do 1
you 2
think 3
are 4
doing 5

The resulting vector for the above sentence would be as follows:

1(0) + 1(1) + 2(2) + 1(3) + 1(4) + 1(5)

The vector dimension is given in parentheses and the respective magnitude is given
alongside. The magnitude of dimension 2 is 2 because the word you occurs twice in the
sentence. The other magnitudes are 1 because each of those words occurs once.


Appendix II
Hashing

Hashing is the transformation of a string of characters into a usually shorter fixed-length
value or key that represents the original string. Hashing is used to index and retrieve items
in a database because it is faster to find the item using the shorter hashed key than to find
it using the original value. It is also used in many encryption algorithms.

The hashing function used in this project is same as the one used by Oracle’s JVM. The
code snippet performing hashing is pasted below

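#define SIZE (1u << 18)  /* number of bins; assumed value of 2^18 */

/* Same polynomial as Java's String.hashCode, sum of word[i] * 31^(n-i-1),
   evaluated with Horner's rule in unsigned arithmetic, which wraps on
   overflow. Calling pow() here would compute in double and could exceed
   the range of unsigned int for long words. */
unsigned int hashCode(const char *word, int n) {
    unsigned int h = 0;
    int i;

    for (i = 0; i < n; i++)
        h = 31u * h + (unsigned char)word[i];

    return h % SIZE;
}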


References
[1] John Langford, Alexander J. Smola and Martin Zinkevich. Slow Learners are Fast.
Journal of Machine Learning Research 1 (2009).
[2] Michael Kistler, Michael Perrone and Fabrizio Petrini. Cell Multiprocessor
Communication Network: Built for Speed.
[3] Thomas Chen, Ram Raghavan, Jason Dale and Eiji Iwata. Cell Broadband Engine
Architecture and its first implementation.
[4] Jonathan Bartlett. Programming high-performance applications on the Cell/B.E.
processor, Part 6: Smart buffer management with DMA transfers.
[5] Introduction to Statistical Machine Learning, 2010 course assignment 1.
[6] Christopher Bishop. Pattern Recognition and Machine Learning.
