It's The Memory, Stupid!
or:
How I Learned to Stop Worrying about CPU Speed
and Love Memory Access
                Francesc Alted
               Software Architect

     Big Data Spain 2012, Madrid (Spain)
              November 16, 2012
About Continuum Analytics
• Develop new ways of storing, computing, and visualizing data.
• Provide open technologies for Data
  Integration on a massive scale.
• Provide software tools, training, and
  integration/consulting services to
  corporate, government, and educational
  clients worldwide.
Overview

• The Era of ‘Big Data’
• A few words about Python and NumPy
• The Starving CPU problem
• Choosing optimal containers for Big Data
“A wind of streaming data, social data
        and unstructured data is knocking at
      the door, and we're starting to let it in.
           It's a scary place at the moment.”

          -- Unidentified bank IT executive, as
            quoted by “The American Banker”




The Dawn of ‘Big Data’
Challenges

• We have to deal with as much data as
  possible by using limited resources


• So, we must use our computational
  resources optimally to be able to get the
  most out of Big Data
Interactivity and Big Data

• Interactivity is crucial for handling data

• Interactivity and performance are crucial
  for handling Big Data
Python and ‘Big Data’
• Python is an interpreted language and hence it offers interactivity
• Myth: “Python is slow, so why on earth would you use it for Big Data?”
• Answer: Python has access to an incredibly powerful range of libraries that boost its performance far beyond your expectations
• ...and during this talk I will prove it!
NumPy: A De Facto Standard Container
Operating with NumPy
• array[2]; array[1,1:5, :]; array[[3,6,10]]
• (array1**3 / array2) - sin(array3)
• numpy.dot(array1, array2): access to
  optimized BLAS (*GEMM) functions
• and much more...
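The operations listed above can be sketched in a minimal NumPy session (array names and values are illustrative):

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)

row = a[2]            # basic indexing: third row
sub = a[1, 1:4]       # slicing returns a view, not a copy
fancy = a[[0, 2]]     # fancy indexing: rows 0 and 2 (makes a copy)

b = np.ones((4, 3))
c = np.dot(a, b)      # dispatches to optimized BLAS (*GEMM) routines when available
```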
Nothing Is Perfect

• NumPy is just great for many use cases
• However, it also has its own deficiencies:
  •   Follows the Python evaluation order in complex expressions like: (a * b) + c

  •   Does not have support for multiprocessors
      (except for BLAS computations)
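The cost of that evaluation order can be seen with plain NumPy (a minimal sketch; array names and sizes are illustrative):

```python
import numpy as np

N = 1_000_000
rng = np.random.default_rng(0)
a, b, c = rng.random(N), rng.random(N), rng.random(N)

# Python evaluation order forces NumPy to materialize a*b as a
# full-size temporary array before adding c: extra memory traffic.
r1 = (a * b) + c

# The temporary can be avoided by reusing a preallocated buffer:
out = np.empty(N)
np.multiply(a, b, out=out)
np.add(out, c, out=out)
```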
Numexpr: Dealing with Complex Expressions
• It comes with a specialized virtual machine
  for evaluating expressions
• It accelerates computations mainly by making more efficient use of memory
• It supports extremely easy-to-use multithreading (active by default)
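A minimal numexpr session (illustrative names; the expression is one of the NumPy examples from before):

```python
import numpy as np
import numexpr as ne

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# numexpr compiles the whole expression for its virtual machine and
# evaluates it block by block, so operands stay in cache and no
# full-size temporaries are created.
r = ne.evaluate("(a**3 / b) - sin(a)")
```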
Exercise (I)
Evaluate the next polynomial:

      0.25x³ + 0.75x² + 1.5x - 2

in the range [-1, 1] with a step size of 2·10⁻⁷, using both NumPy and numexpr.

Note: use a single processor for numexpr:
numexpr.set_num_threads(1)
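One possible harness for the exercise (a sketch; wrap each computation with your favorite timer):

```python
import numpy as np
import numexpr as ne

ne.set_num_threads(1)            # single thread, as the exercise requires

# [-1, 1] with a step of 2e-7 -> 10 million points
x = np.linspace(-1, 1, 10_000_000)

y_np = 0.25*x**3 + 0.75*x**2 + 1.5*x - 2
y_ne = ne.evaluate("0.25*x**3 + 0.75*x**2 + 1.5*x - 2")
```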
Exercise (II)
Rewrite the polynomial in this notation:

    ((0.25x + 0.75)x + 1.5)x - 2

and redo the computations.

What happens?
Time to evaluate the polynomial (1 thread), in seconds:

    Expression                            NumPy    Numexpr
    .25*x**3 + .75*x**2 - 1.5*x - 2       1.613    0.138
    ((.25*x + .75)*x - 1.5)*x - 2         0.301    0.110
    x                                     0.052    0.045
    sin(x)**2+cos(x)**2                   0.715    0.559

[Bar chart: time to evaluate both polynomial forms, NumPy vs Numexpr, 1 thread]
Power Expansion
Numexpr expands the expression:

0.25x³ + 0.75x² + 1.5x - 2

to:

0.25*x*x*x + 0.75*x*x + 1.5*x - 2

so there is no need to call the transcendental pow()
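The two forms can be checked for equivalence with a quick NumPy sketch (illustrative data; in many implementations x**3 dispatches to a general power routine, while repeated multiplies avoid it):

```python
import numpy as np

x = np.random.rand(1000)

# Same polynomial, written with pow() and with explicit multiplies.
via_pow = 0.25*x**3 + 0.75*x**2 + 1.5*x - 2
expanded = 0.25*x*x*x + 0.75*x*x + 1.5*x - 2
```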
Pending question

• Why does numexpr continue to be 3x faster than NumPy, even when both execute exactly the *same* number of operations?
“Across the industry, today’s chips are largely
    able to execute code faster than we can feed
                them with instructions and data.”

               – Richard Sites, after his article
                    “It’s The Memory, Stupid!”,
          Microprocessor Report, 10(10), 1996



The Starving CPU Problem
Memory Access Time vs CPU Cycle Time

[Figure from a book published in 2009]
The Status of CPU Starvation in 2012
• Memory latency is much higher than processor cycle time (between 250x and 500x).
• Memory bandwidth is improving at a better rate than memory latency, but it still lags processors (by 30x to 100x).
CPU Caches to the Rescue

• CPU cache latency and throughput are much better than main memory's
• However: the faster they run, the smaller they must be
CPU Cache Evolution
[Figure: three-panel diagram of the memory hierarchy, ordered by capacity (top) and speed (bottom). (a) Up to the end of the 80's: mechanical disk → main memory → CPU. (b) 90's and 2000's: mechanical disk → main memory → level 2 cache → level 1 cache → CPU. (c) 2010's: mechanical disk → solid state disk → main memory → level 3 cache → level 2 cache → level 1 cache → CPU]

Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model; (b) the most common current implementation, which includes additional cache levels; and (c) a sensible guess at what's coming over the next decade: three levels of cache in the CPU and solid state disks lying between main memory and classical mechanical disks.
When Are CPU Caches Effective?

Mainly in a couple of scenarios:
 • Temporal locality: when the dataset is reused
 • Spatial locality: when the dataset is accessed sequentially
The Blocking Technique
When accessing disk or memory, get a contiguous block that fits in CPU cache, operate upon it, and reuse it as much as possible.

Use this extensively to leverage spatial and temporal localities.
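A sketch of the blocking idea in NumPy (the block size and the reduction are illustrative, not the library's internals):

```python
import numpy as np

def blocked_sum_of_squares(a, block_size=64_000):
    """Walk the array in contiguous, cache-sized blocks, reusing each
    block while it is hot in cache instead of materializing a
    full-size temporary for a*a."""
    total = 0.0
    for start in range(0, len(a), block_size):
        block = a[start:start + block_size]   # contiguous slice (a view)
        total += np.dot(block, block)         # operate on it while cached
    return total

a = np.random.rand(1_000_000)
```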
Time To Answer Pending Questions

Time to evaluate the polynomial (1 thread), in seconds:

    Expression                            NumPy    Numexpr
    .25*x**3 + .75*x**2 - 1.5*x - 2       1.613    0.138
    ((.25*x + .75)*x - 1.5)*x - 2         0.301    0.110
    x                                     0.052    0.045
    sin(x)**2+cos(x)**2                   0.715    0.559

[Bar chart: time to evaluate both polynomial forms, NumPy vs Numexpr, 1 thread]
Beyond numexpr: Numba
Numexpr Limitations
• Numexpr only implements element-wise operations, i.e. ‘a*b’ is evaluated as:

  for i in range(N):
      c[i] = a[i] * b[i]

• In particular, it cannot deal with things like:

  for i in range(N):
      c[i] = a[i-1] + a[i] * b[i]
Numba: Overcoming numexpr Limitations
• Numba is a JIT compiler that can translate a subset of the Python language into machine code
• It uses the LLVM infrastructure behind the scenes
• Can achieve similar or better performance
  than numexpr, but with more flexibility
How Numba Works
[Diagram: a Python function is translated by llvm-py into LLVM 3.1 IR, which is lowered to machine code; LLVM backends (ISPC, OpenCL, OpenMP, CUDA, CLANG) target Intel, AMD, Nvidia, and Apple platforms]
Numba Example: Computing the Polynomial
import numpy as np
import numba as nb

N = 10*1000*1000

x = np.linspace(-1, 1, N)
y = np.empty(N, dtype=np.float64)

@nb.jit(arg_types=[nb.f8[:], nb.f8[:]])
def poly(x, y):
    for i in range(N):
        # y[i] = 0.25*x[i]**3 + 0.75*x[i]**2 + 1.5*x[i] - 2
        y[i] = ((0.25*x[i] + 0.75)*x[i] + 1.5)*x[i] - 2

poly(x, y)   # run through Numba!
Times for Computing the Polynomial (In Seconds)

  Poly version        (I)      (II)
  NumPy              1.086    0.505
  numexpr            0.108    0.096
  Numba              0.055    0.054
  Pure C, OpenMP     0.215    0.054

• Compilation time for Numba: 0.019 sec
• Run on Mac OS X, Core 2 Duo @ 2.13 GHz
Numba: LLVM for Python
Python code can reach C
 speed without having to
   program in C itself
  (and without losing interactivity!)
Numba in SC2012
 Awesome Python!
“If a datastore requires all data to fit in memory, it isn't big data”

                   -- Alex Gaynor (on Twitter)




Optimal Containers for Big Data
The Need for a Good Data Container
• Too often we are focused only on computing as fast as possible
• But we have seen how important data access is
• Hence, having an optimal data structure is
  critical for getting good performance when
  processing very large datasets
Appending Data in Large NumPy Objects

[Diagram: the array to be enlarged and the new data to append are both copied into a new memory allocation holding the final array object]

• Normally a realloc() call will not succeed in growing the block in place
• Both memory areas have to exist simultaneously
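This is easy to see with np.append, which always allocates a new array and copies (a small sketch):

```python
import numpy as np

a = np.arange(5)
b = np.append(a, [5, 6])   # allocates a brand-new array and copies everything

# The original array is untouched; both arrays coexist in memory during
# the copy, which is what makes growing very large arrays expensive.
```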
Contiguous vs Chunked
 NumPy container       Blaze container

                          chunk 1

                          chunk 2
                             .
                             .
                             .
                          chunk N

Contiguous memory   Discontiguous memory
Appending data in Blaze
[Diagram: to enlarge a Blaze array, the existing chunks are left untouched; the new data is compressed into a new chunk that is appended to the final array object]
Only a small amount of data has to be compressed
Blosc: (De)compressing Faster Than memcpy()




Transmission + decompression faster than direct transfer?
Example of How Blosc Accelerates Genomics I/O: SeqPack (backed by Blosc)

TABLE 1: Test Data Sets

  #   Source         Identifier   Sequencer             Read Count    Read Length   ID Lengths   FASTQ Size
  1   1000 Genomes   ERR000018    Illumina GA             9,280,498   36 bp         40–50          1,105 MB
  2   1000 Genomes   SRR493233    Illumina HiSeq 2000    43,225,060   100 bp        51–61         10,916 MB
  3   1000 Genomes   SRR497004    AB SOLiD 4            122,924,963   51 bp         78–91         22,990 MB

[Fig. 1: in-memory throughputs for several compression schemes applied to increasing block sizes (where each sequence is 256 bytes long)]

Source: Howison, M. (in press). High-throughput compression of FASTQ data with SeqDB. IEEE Transactions on Computational Biology and Bioinformatics.
How Blaze Does Out-Of-Core Computations

[Diagram: the computation proceeds chunk by chunk over the compressed, chunked container, feeding each decompressed block to the virtual machine]

Virtual machine: Python, numexpr, Numba
Last Message for Today
Big data is tricky to manage:

Look for the optimal containers for
your data


Spending some time choosing your
appropriate data container can be a big time
saver in the long run
Summary
• Python is a perfect language for Big Data
• Nowadays you have to be aware of the memory subsystem to get good performance
• Choosing appropriate data containers is of
  the utmost importance when dealing with
  Big Data
“Success in Big Data will come to those developers who are able to look beyond the standard, and who understand the underlying hardware resources and the variety of available algorithms.”

-- Oscar de Bustos, HPC Line of Business Manager at BULL

Thank you!

More Related Content

What's hot (20)

PPTX
MapReduce: A useful parallel tool that still has room for improvement
Kyong-Ha Lee
 
PPTX
Webinar: Understanding Storage for Performance and Data Safety
MongoDB
 
PDF
Advances in GPU Computing
Frédéric Parienté
 
PPTX
Real-Time Big Data with Storm, Kafka and GigaSpaces
Oleksii Diagiliev
 
PDF
CuPy v4 and v5 roadmap
Preferred Networks
 
PPTX
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Cloudera, Inc.
 
PDF
Comparison of deep learning frameworks from a viewpoint of double backpropaga...
Kenta Oono
 
PDF
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
PeterAndreasEntschev
 
PPTX
Linux MMAP & Ioremap introduction
Gene Chang
 
PPTX
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
mjfrankli
 
PDF
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
Chris Richardson
 
PDF
Chainer Update v1.8.0 -> v1.10.0+
Seiya Tokui
 
PDF
lec4_ref.pdf
vishal choudhary
 
PDF
Apache Hadoop & Friends at Utah Java User's Group
Cloudera, Inc.
 
PPTX
Caffe framework tutorial2
Park Chunduck
 
PDF
クラウド時代の半導体メモリー技術
Ryousei Takano
 
PDF
GIST AI-X Computing Cluster
Jax Jargalsaikhan
 
PDF
Understanding DLmalloc
Haifeng Li
 
PPTX
GPU-Accelerated Parallel Computing
Jun Young Park
 
PDF
On heap cache vs off-heap cache
rgrebski
 
MapReduce: A useful parallel tool that still has room for improvement
Kyong-Ha Lee
 
Webinar: Understanding Storage for Performance and Data Safety
MongoDB
 
Advances in GPU Computing
Frédéric Parienté
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Oleksii Diagiliev
 
CuPy v4 and v5 roadmap
Preferred Networks
 
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Cloudera, Inc.
 
Comparison of deep learning frameworks from a viewpoint of double backpropaga...
Kenta Oono
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
PeterAndreasEntschev
 
Linux MMAP & Ioremap introduction
Gene Chang
 
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
mjfrankli
 
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
Chris Richardson
 
Chainer Update v1.8.0 -> v1.10.0+
Seiya Tokui
 
lec4_ref.pdf
vishal choudhary
 
Apache Hadoop & Friends at Utah Java User's Group
Cloudera, Inc.
 
Caffe framework tutorial2
Park Chunduck
 
クラウド時代の半導体メモリー技術
Ryousei Takano
 
GIST AI-X Computing Cluster
Jax Jargalsaikhan
 
Understanding DLmalloc
Haifeng Li
 
GPU-Accelerated Parallel Computing
Jun Young Park
 
On heap cache vs off-heap cache
rgrebski
 

Similar to Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012 (20)

PDF
How shit works: the CPU
Tomer Gabel
 
PDF
STORMPresentation and all about storm_FINAL.pdf
ajajkhan16
 
PDF
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
huguk
 
PPTX
Jvm memory model
Yoav Avrahami
 
PPS
Storm presentation
Shyam Raj
 
PDF
Lecture 25
Berkay TURAN
 
PPTX
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
Tech in Asia ID
 
PDF
Learn How to Master Solr1 4
Lucidworks (Archived)
 
PDF
HES2011 - Aaron Portnoy and Logan Brown - Black Box Auditing Adobe Shockwave
Hackito Ergo Sum
 
PDF
Python高级编程(二)
Qiangning Hong
 
PDF
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
npinto
 
PDF
Migrating from matlab to python
ActiveState
 
PDF
Apache con 2020 use cases and optimizations of iotdb
ZhangZhengming
 
PDF
NAS EP Algorithm
Jongsu "Liam" Kim
 
PDF
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA Taiwan
 
PPTX
Ca บทที่สี่
atit604
 
PPTX
Hanborq optimizations on hadoop map reduce 20120221a
Schubert Zhang
 
PDF
Kaggle tokyo 2018
Cournapeau David
 
PDF
Spark Summit EU talk by Qifan Pu
Spark Summit
 
PPTX
Ops Jumpstart: MongoDB Administration 101
MongoDB
 
How shit works: the CPU
Tomer Gabel
 
STORMPresentation and all about storm_FINAL.pdf
ajajkhan16
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
huguk
 
Jvm memory model
Yoav Avrahami
 
Storm presentation
Shyam Raj
 
Lecture 25
Berkay TURAN
 
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
Tech in Asia ID
 
Learn How to Master Solr1 4
Lucidworks (Archived)
 
HES2011 - Aaron Portnoy and Logan Brown - Black Box Auditing Adobe Shockwave
Hackito Ergo Sum
 
Python高级编程(二)
Qiangning Hong
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
npinto
 
Migrating from matlab to python
ActiveState
 
Apache con 2020 use cases and optimizations of iotdb
ZhangZhengming
 
NAS EP Algorithm
Jongsu "Liam" Kim
 
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA Taiwan
 
Ca บทที่สี่
atit604
 
Hanborq optimizations on hadoop map reduce 20120221a
Schubert Zhang
 
Kaggle tokyo 2018
Cournapeau David
 
Spark Summit EU talk by Qifan Pu
Spark Summit
 
Ops Jumpstart: MongoDB Administration 101
MongoDB
 
Ad

More from Big Data Spain (20)

PDF
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data Spain
 
PDF
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Big Data Spain
 
PDF
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
Big Data Spain
 
PDF
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Big Data Spain
 
PDF
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Big Data Spain
 
PDF
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Big Data Spain
 
PDF
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Big Data Spain
 
PDF
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Big Data Spain
 
PDF
State of the art time-series analysis with deep learning by Javier Ordóñez at...
Big Data Spain
 
PDF
Trading at market speed with the latest Kafka features by Iñigo González at B...
Big Data Spain
 
PDF
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Big Data Spain
 
PDF
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
Big Data Spain
 
PDF
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Big Data Spain
 
PDF
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Big Data Spain
 
PDF
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Big Data Spain
 
PDF
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Big Data Spain
 
PDF
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
Big Data Spain
 
PDF
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Big Data Spain
 
PDF
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
Big Data Spain
 
PDF
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Big Data Spain
 
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data Spain
 
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Big Data Spain
 
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
Big Data Spain
 
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Big Data Spain
 
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Big Data Spain
 
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Big Data Spain
 
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Big Data Spain
 
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Big Data Spain
 
State of the art time-series analysis with deep learning by Javier Ordóñez at...
Big Data Spain
 
Trading at market speed with the latest Kafka features by Iñigo González at B...
Big Data Spain
 
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Big Data Spain
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
Big Data Spain
 
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Big Data Spain
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Big Data Spain
 
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Big Data Spain
 
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Big Data Spain
 
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
Big Data Spain
 
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Big Data Spain
 
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
Big Data Spain
 
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Big Data Spain
 
Ad

Recently uploaded (20)

PDF
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PDF
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
PDF
CIFDAQ Market Insights for July 7th 2025
CIFDAQ
 
PDF
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 
PDF
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
PPTX
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
PDF
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
PDF
HCIP-Data Center Facility Deployment V2.0 Training Material (Without Remarks ...
mcastillo49
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PPTX
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
PDF
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
PDF
What Makes Contify’s News API Stand Out: Key Features at a Glance
Contify
 
PDF
From Code to Challenge: Crafting Skill-Based Games That Engage and Reward
aiyshauae
 
PDF
Presentation - Vibe Coding The Future of Tech
yanuarsinggih1
 
PDF
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
PDF
Biography of Daniel Podor.pdf
Daniel Podor
 
PDF
POV_ Why Enterprises Need to Find Value in ZERO.pdf
darshakparmar
 
PDF
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
PPTX
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
CIFDAQ Market Insights for July 7th 2025
CIFDAQ
 
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
HCIP-Data Center Facility Deployment V2.0 Training Material (Without Remarks ...
mcastillo49
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
What Makes Contify’s News API Stand Out: Key Features at a Glance
Contify
 
From Code to Challenge: Crafting Skill-Based Games That Engage and Reward
aiyshauae
 
Presentation - Vibe Coding The Future of Tech
yanuarsinggih1
 
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
Biography of Daniel Podor.pdf
Daniel Podor
 
POV_ Why Enterprises Need to Find Value in ZERO.pdf
darshakparmar
 
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 

Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012

  • 1. It´s The Memory, Stupid! or: How I Learned to Stop Worrying about CPU Speed and Love Memory Access Francesc Alted Software Architect Big Data Spain 2012, Madrid (Spain) November 16, 2012
  • 2. About Continuum Analytics • Develop new ways on how data is stored, computed, and visualized. • Provide open technologies for Data Integration on a massive scale. • Provide software tools, training, and integration/consulting services to corporate, government, and educational clients worldwide.
  • 3. Overview • The Era of ‘Big Data’ • A few words about Python and NumPy • The Starving CPU problem • Choosing optimal containers for Big Data
  • 4. “A wind of streaming data, social data and unstructured data is knocking at the door, and we're starting to let it in. It's a scary place at the moment.” -- Unidentified bank IT executive, as quoted by “The American Banker” The Dawn of ‘Big Data’
  • 5. Challenges • We have to deal with as much data as possible by using limited resources • So, we must use our computational resources optimally to be able to get the most out of Big Data
  • 6. Interactivity and Big Data • Interactivity is crucial for handling data • Interactivity and performance are crucial for handling Big Data
  • 7. Python and ‘Big Data’ • Python is an interpreted language and hence, it offers interactivity • Myth: “Python is slow, so why on the hell are you going to use it for Big Data?” • Answer: Python has access to an incredibly powerful range of libraries that boost its performance far beyond your expectations • ...and during this talk I will prove it!
  • 8. NumPy: A Standard ‘De Facto’ Container
  • 9.     
  • 10. Operating with NumPy • array[2]; array[1,1:5, :]; array[[3,6,10]] • (array1**3 / array2) - sin(array3) • numpy.dot(array1, array2): access to optimized BLAS (*GEMM) functions • and much more...
  • 11. Nothing Is Perfect • NumPy is just great for many use cases • However, it also has its own deficiencies: • Follows the Python evaluation order in complex expressions like : (a * b) + c • Does not have support for multiprocessors (except for BLAS computations)
  • 12. Numexpr: Dealing with Complex Expressions • It comes with a specialized virtual machine for evaluating expressions • It accelerates computations mainly by making a more efficient memory usage • It supports extremely easy to use multithreading (active by default)
  • 13. Exercise (I) Evaluate the next polynomial: 0.25x3 + 0.75x2 + 1.5x - 2 in the range [-1, 1] with a step size of 2*10-7, using both NumPy and numexpr. Note: use a single processor for numexpr numexpr.set_num_threads(1)
  • 14. Exercise (II) Rewrite the polynomial in this notation: ((0.25x + 0.75)x + 1.5)x - 2 and redo the computations. What happens?
  • 15. ((.25*x + .75)*x - 1.5)*x – 2 0,301 0,11 x 0,052 0,045 sin(x)**2+cos(x)**2 0,715 0,559 Time to evaluate polynomial (1 thread) 1,8 1,6 1,4 1,2 NumPy 1 Time (s) Numexpr 0,8 0,6 0,4 0,2 0 .25*x**3 + .75*x**2 - 1.5*x – 2 ((.25*x + .75)*x - 1.5)*x – 2 NumPy vs Numexpr (1 thread) 1,8
• 16. Power Expansion Numexpr expands the expression: 0.25x³ + 0.75x² + 1.5x - 2 to: 0.25*x*x*x + 0.75*x*x + 1.5*x - 2 so there is no need to call the transcendental pow()
• 17. Pending Question • Why does numexpr continue to be 3x faster than NumPy, even when both execute exactly the *same* number of operations?
• 18. “Across the industry, today’s chips are largely able to execute code faster than we can feed them with instructions and data.” – Richard Sites, after his article “It’s The Memory, Stupid!”, Microprocessor Report, 10(10), 1996 The Starving CPU Problem
  • 19. Memory Access Time vs CPU Cycle Time
• 21. The Status of CPU Starvation in 2012 • Memory latency is much higher than the CPU cycle time (between 250x and 500x). • Memory bandwidth is improving at a better rate than memory latency, but it still lags processors (between 30x and 100x).
• 22. CPU Caches to the Rescue • CPU cache latency and throughput are much better than main memory’s • However: the faster caches run, the smaller they must be
• 23. CPU Cache Evolution [Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model: CPU, main memory, mechanical disk; (b) the most common current implementation, which includes additional cache levels; and (c) a sensible guess at what’s coming over the next decade: three levels of cache in the CPU and solid state disks lying between main memory and classical mechanical disks.]
• 24. When Are CPU Caches Effective? Mainly in two scenarios: • Temporal locality: when the dataset is reused • Spatial locality: when the dataset is accessed sequentially
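A small illustration of spatial locality with NumPy (the array size is an arbitrary choice, and the actual timings are machine-dependent, so none are claimed here):

```python
import timeit
import numpy as np

n = 2000
a = np.zeros((n, n))   # C order: each row is contiguous in memory

# Walking a row touches adjacent addresses (spatial locality);
# walking a column jumps n * 8 bytes between consecutive elements.
t_row = timeit.timeit(lambda: a[0].sum(), number=1000)
t_col = timeit.timeit(lambda: a[:, 0].sum(), number=1000)
print(t_row, t_col)    # the strided column walk is typically the slower one
```

Both sums are computed correctly either way; only the memory access pattern, and hence the cache behavior, differs.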
• 25. The Blocking Technique When accessing disk or memory, fetch a contiguous block that fits in the CPU cache, operate on it, and reuse it as much as possible. Use this extensively to leverage spatial and temporal locality.
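The blocking idea can be sketched in a few lines; the block size below is a hypothetical choice (32768 float64s = 256 KB, small enough for a typical L2 cache), not a value from the talk:

```python
import numpy as np

def blocked_sum(a, block_len=32_768):
    """Sum `a` by walking it in contiguous, cache-sized blocks."""
    total = 0.0
    for start in range(0, len(a), block_len):
        block = a[start:start + block_len]   # contiguous block
        total += block.sum()                 # operate on it while it is cached
    return total

a = np.arange(1_000_000, dtype=np.float64)
assert np.isclose(blocked_sum(a), a.sum())
```

This is essentially what numexpr's virtual machine does internally: it evaluates the whole expression block by block, so intermediate results stay in cache instead of being written out as full-size temporaries.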
• 26. Time To Answer NumPy Pending Questions: time to evaluate each expression with a single thread (seconds):

    Expression                         NumPy    Numexpr
    .25*x**3 + .75*x**2 - 1.5*x - 2    1.613    0.138
    ((.25*x + .75)*x - 1.5)*x - 2      0.301    0.110
    x                                  0.052    0.045
    sin(x)**2 + cos(x)**2              0.715    0.559
  • 30. Numexpr Limitations • Numexpr only implements element-wise operations, i.e. ‘a*b’ is evaluated as: for i in range(N): c[i] = a[i] * b[i] • In particular, it cannot deal with things like: for i in range(N): c[i] = a[i-1] + a[i] * b[i]
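For this particular index-shifted pattern, plain NumPy slicing still works (a sketch; the convention chosen for the boundary element c[0] is an assumption made here, not part of the slide):

```python
import numpy as np

N = 10
a = np.arange(1.0, N + 1)
b = np.full(N, 2.0)

# The loop numexpr cannot express:
#     for i in range(1, N): c[i] = a[i-1] + a[i] * b[i]
# vectorized with shifted slices:
c = np.empty_like(a)
c[1:] = a[:-1] + a[1:] * b[1:]
c[0] = a[0] * b[0]   # boundary element: one possible convention
```

Slicing tricks only go so far, though; arbitrary loops are exactly what Numba, introduced next, is for.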
  • 31. Numba: Overcoming numexpr Limitations • Numba is a JIT that can translate a subset of the Python language into machine code • It uses LLVM infrastructure behind the scenes • Can achieve similar or better performance than numexpr, but with more flexibility
• 32. How Numba Works [Diagram: a Python function is translated via LLVM-PY into LLVM 3.1 IR and compiled to machine code; backends include ISPC, OpenCL, OpenMP, CUDA and CLANG, targeting Intel, AMD, Nvidia and Apple hardware.]
• 33. Numba Example: Computing the Polynomial

    import numpy as np
    import numba as nb

    N = 10*1000*1000
    x = np.linspace(-1, 1, N)
    y = np.empty(N, dtype=np.float64)

    @nb.jit(arg_types=[nb.f8[:], nb.f8[:]])
    def poly(x, y):
        for i in range(N):
            # y[i] = 0.25*x[i]**3 + 0.75*x[i]**2 + 1.5*x[i] - 2
            y[i] = ((0.25*x[i] + 0.75)*x[i] + 1.5)*x[i] - 2

    poly(x, y)  # run through Numba!
• 34. Times for Computing the Polynomial (in seconds)

    Poly version      (I)      (II)
    NumPy             1.086    0.505
    numexpr           0.108    0.096
    Numba             0.055    0.054
    Pure C, OpenMP    0.215    0.054

• Compilation time for Numba: 0.019 sec • Run on Mac OSX, Core2 Duo @ 2.13 GHz
  • 35. Numba: LLVM for Python Python code can reach C speed without having to program in C itself (and without losing interactivity!)
  • 36. Numba in SC 2012
  • 37. Numba in SC2012 Awesome Python!
• 38. If a datastore requires all data to fit in memory, it isn’t big data -- Alex Gaynor (on Twitter) Optimal Containers for Big Data
• 39. The Need for a Good Data Container • Too often we focus only on computing as fast as possible • But we have seen how important data access is • Hence, an optimal data structure is critical for getting good performance when processing very large datasets
• 40. Appending Data in Large NumPy Objects [Diagram: appending to a NumPy array allocates new memory for the final array object and copies both the array to be enlarged and the new data into it.] • Normally a realloc() call cannot extend the allocation in place • Both memory areas have to exist simultaneously
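The copy-on-append behavior, and the chunked alternative the next slides describe, can be sketched like this (the list of chunks is a toy stand-in for a Blaze-style container, not its actual API):

```python
import numpy as np

a = np.arange(1000)
b = np.arange(100)

# Appending to a NumPy array allocates a NEW buffer and copies both
# operands into it -- old and new arrays coexist in memory for a moment.
c = np.concatenate([a, b])
assert c is not a and len(c) == 1100

# A chunked container avoids the copy: appending just records a new chunk.
chunks = [a]
chunks.append(b)   # O(1): no existing data is moved
assert sum(len(ch) for ch in chunks) == 1100
```

For an array of many gigabytes, the difference between "copy everything" and "link one new chunk" is the difference between seconds and microseconds per append.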
• 41. Contiguous vs Chunked [Diagram: a NumPy container is a single block of contiguous memory; a Blaze container is a sequence of discontiguous chunks: chunk 1, chunk 2, ..., chunk N.]
• 42. Appending Data in Blaze [Diagram: the new data is compressed into a new chunk, which is simply linked after the existing chunks of the array being enlarged.] Only a small amount of data has to be compressed
  • 43. Blosc: (de)compressing faster than memcpy() Transmission + decompression faster than direct transfer?
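A minimal round-trip with Blosc, assuming the python-blosc package is available (the code falls back gracefully if it is not; the data and parameters are illustrative):

```python
import numpy as np

a = np.linspace(0, 100, 1_000_000)   # smooth, highly compressible data

try:
    import blosc   # python-blosc bindings (an assumed dependency)
    packed = blosc.compress(a.tobytes(), typesize=a.itemsize)
    restored = np.frombuffer(blosc.decompress(packed), dtype=a.dtype)
    ratio = a.nbytes / len(packed)
except ImportError:
    restored, ratio = a, None        # blosc not installed

assert np.array_equal(a, restored)
print(ratio)   # compression ratio, > 1 when blosc is available
```

Passing the typesize lets Blosc's shuffle filter rearrange the bytes of each float64, which is what makes numerical data compress so well, and so fast.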
• 44. Example of How Blosc Accelerates Genomics I/O: SeqPack (backed by Blosc)

    TABLE 1. Test Data Sets
    #   Source         Identifier    Sequencer            Read Count    Read Length   ID Lengths   FASTQ Size
    1   1000 Genomes   ERR000018     Illumina GA          9,280,498     36 bp         40–50        1,105 MB
    2   1000 Genomes   SRR493233_1   Illumina HiSeq 2000  43,225,060    100 bp        51–61        10,916 MB
    3   1000 Genomes   SRR497004_1   AB SOLiD 4           122,924,963   51 bp         78–91        22,990 MB

[Figure: in-memory throughputs for several compression schemes applied to increasing block sizes (where each sequence is 256 bytes long), showing consistent throughput across both compression and decompression.] Source: Howison, M. (in press). High-throughput compression of FASTQ data with SeqDB. IEEE Transactions on Computational Biology and Bioinformatics.
• 45. How Blaze Does Out-Of-Core Computations [Diagram: chunked operands are streamed through an evaluation engine block by block.] Virtual Machine: Python, numexpr, Numba
• 46. Last Message for Today Big data is tricky to manage: look for the optimal containers for your data. Spending some time choosing an appropriate data container can be a big time saver in the long run.
• 47. Summary • Python is a perfect language for Big Data • Nowadays you should be aware of the memory system to get good performance • Choosing appropriate data containers is of the utmost importance when dealing with Big Data
• 48. “Success with Big Data will belong to those developers who are able to look beyond the standard and who are capable of understanding the underlying hardware resources and the variety of available algorithms.” -- Oscar de Bustos, HPC Line of Business Manager at BULL