It's The Memory, Stupid!
or:
How I Learned to Stop Worrying about CPU Speed
and Love Memory Access
                Francesc Alted
               Software Architect

     Big Data Spain 2012, Madrid (Spain)
              November 16, 2012
About Continuum Analytics
• Develop new ways of storing, computing, and visualizing data.
• Provide open technologies for Data
  Integration on a massive scale.
• Provide software tools, training, and
  integration/consulting services to
  corporate, government, and educational
  clients worldwide.
Overview

• The Era of ‘Big Data’
• A few words about Python and NumPy
• The Starving CPU problem
• Choosing optimal containers for Big Data
“A wind of streaming data, social data
        and unstructured data is knocking at
      the door, and we're starting to let it in.
           It's a scary place at the moment.”

          -- Unidentified bank IT executive, as
            quoted by “The American Banker”




The Dawn of ‘Big Data’
Challenges

• We have to deal with as much data as
  possible by using limited resources


• So, we must use our computational
  resources optimally to be able to get the
  most out of Big Data
Interactivity and Big Data

• Interactivity is crucial for handling data

• Interactivity and performance are crucial
  for handling Big Data
Python and ‘Big Data’
• Python is an interpreted language and hence it offers interactivity
• Myth: “Python is slow, so why on earth would you use it for Big Data?”
• Answer: Python has access to an incredibly powerful range of libraries that boost its performance far beyond your expectations
• ...and during this talk I will prove it!
NumPy: A De Facto Standard Container
Operating with NumPy
• array[2]; array[1,1:5, :]; array[[3,6,10]]
• (array1**3 / array2) - sin(array3)
• numpy.dot(array1, array2): access to
  optimized BLAS (*GEMM) functions
• and much more...
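The operations listed above can be sketched in a minimal NumPy session (array names and values are illustrative):

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)

row = a[2]            # basic indexing: third row
sub = a[1, 1:4]       # slicing returns a view, not a copy
fancy = a[[0, 2]]     # fancy indexing: rows 0 and 2 (makes a copy)

b = np.ones((4, 3))
c = np.dot(a, b)      # dispatches to optimized BLAS (*GEMM) routines when available
```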
Nothing Is Perfect

• NumPy is just great for many use cases
• However, it also has its own deficiencies:
  •   Follows the Python evaluation order in complex expressions like: (a * b) + c

  •   Does not have support for multiprocessors
      (except for BLAS computations)
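The cost of that evaluation order can be seen with plain NumPy (a minimal sketch; array names and sizes are illustrative):

```python
import numpy as np

N = 1_000_000
rng = np.random.default_rng(0)
a, b, c = rng.random(N), rng.random(N), rng.random(N)

# Python evaluation order forces NumPy to materialize a*b as a
# full-size temporary array before adding c: extra memory traffic.
r1 = (a * b) + c

# The temporary can be avoided by reusing a preallocated buffer:
out = np.empty(N)
np.multiply(a, b, out=out)
np.add(out, c, out=out)
```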
Numexpr: Dealing with Complex Expressions
• It comes with a specialized virtual machine
  for evaluating expressions
• It accelerates computations mainly by making more efficient use of memory
• It supports extremely easy-to-use multithreading (active by default)
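A minimal numexpr session (illustrative names; the expression is one of the NumPy examples from before):

```python
import numpy as np
import numexpr as ne

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# numexpr compiles the whole expression for its virtual machine and
# evaluates it block by block, so operands stay in cache and no
# full-size temporaries are created.
r = ne.evaluate("(a**3 / b) - sin(a)")
```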
Exercise (I)
Evaluate the next polynomial:

      0.25x³ + 0.75x² + 1.5x - 2

in the range [-1, 1] with a step size of 2·10⁻⁷, using both NumPy and numexpr.

Note: use a single processor for numexpr:
numexpr.set_num_threads(1)
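One possible harness for the exercise (a sketch; wrap each computation with your favorite timer):

```python
import numpy as np
import numexpr as ne

ne.set_num_threads(1)            # single thread, as the exercise requires

# [-1, 1] with a step of 2e-7 -> 10 million points
x = np.linspace(-1, 1, 10_000_000)

y_np = 0.25*x**3 + 0.75*x**2 + 1.5*x - 2
y_ne = ne.evaluate("0.25*x**3 + 0.75*x**2 + 1.5*x - 2")
```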
Exercise (II)
Rewrite the polynomial in this notation:

    ((0.25x + 0.75)x + 1.5)x - 2

and redo the computations.

What happens?
Time to evaluate the polynomial (1 thread), in seconds:

    Expression                            NumPy    Numexpr
    .25*x**3 + .75*x**2 - 1.5*x - 2       1.613    0.138
    ((.25*x + .75)*x - 1.5)*x - 2         0.301    0.110
    x                                     0.052    0.045
    sin(x)**2+cos(x)**2                   0.715    0.559

[Bar chart: time to evaluate both polynomial forms, NumPy vs Numexpr, 1 thread]
Power Expansion
Numexpr expands the expression:

0.25x³ + 0.75x² + 1.5x - 2

to:

0.25*x*x*x + 0.75*x*x + 1.5*x - 2

so there is no need to call the transcendental pow()
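The two forms can be checked for equivalence with a quick NumPy sketch (illustrative data; in many implementations x**3 dispatches to a general power routine, while repeated multiplies avoid it):

```python
import numpy as np

x = np.random.rand(1000)

# Same polynomial, written with pow() and with explicit multiplies.
via_pow = 0.25*x**3 + 0.75*x**2 + 1.5*x - 2
expanded = 0.25*x*x*x + 0.75*x*x + 1.5*x - 2
```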
Pending question

• Why does numexpr continue to be 3x faster than NumPy, even when both execute exactly the *same* number of operations?
“Across the industry, today’s chips are largely
    able to execute code faster than we can feed
                them with instructions and data.”

               – Richard Sites, after his article
                    “It’s The Memory, Stupid!”,
          Microprocessor Report, 10(10), 1996



The Starving CPU Problem
Memory Access Time vs CPU Cycle Time

[Figure from a book published in 2009]
The Status of CPU Starvation in 2012
• Memory latency is much higher than processor cycle time (between 250x and 500x).
• Memory bandwidth is improving at a better rate than memory latency, but it still lags processors (by 30x to 100x).
CPU Caches to the Rescue

• CPU cache latency and throughput are much better than main memory's
• However: the faster they run, the smaller they must be
CPU Cache Evolution
[Figure: three-panel diagram of the memory hierarchy, ordered by capacity (top) and speed (bottom). (a) Up to the end of the 80's: mechanical disk → main memory → CPU. (b) 90's and 2000's: mechanical disk → main memory → level 2 cache → level 1 cache → CPU. (c) 2010's: mechanical disk → solid state disk → main memory → level 3 cache → level 2 cache → level 1 cache → CPU]

Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model; (b) the most common current implementation, which includes additional cache levels; and (c) a sensible guess at what's coming over the next decade: three levels of cache in the CPU and solid state disks lying between main memory and classical mechanical disks.
When Are CPU Caches Effective?

Mainly in a couple of scenarios:
 • Temporal locality: when the dataset is reused
 • Spatial locality: when the dataset is accessed sequentially
The Blocking Technique
When accessing disk or memory, get a contiguous block that fits in CPU cache, operate upon it, and reuse it as much as possible.

Use this extensively to leverage spatial and temporal localities.
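A sketch of the blocking idea in NumPy (the block size and the reduction are illustrative, not the library's internals):

```python
import numpy as np

def blocked_sum_of_squares(a, block_size=64_000):
    """Walk the array in contiguous, cache-sized blocks, reusing each
    block while it is hot in cache instead of materializing a
    full-size temporary for a*a."""
    total = 0.0
    for start in range(0, len(a), block_size):
        block = a[start:start + block_size]   # contiguous slice (a view)
        total += np.dot(block, block)         # operate on it while cached
    return total

a = np.random.rand(1_000_000)
```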
Time To Answer Pending Questions

Time to evaluate the polynomial (1 thread), in seconds:

    Expression                            NumPy    Numexpr
    .25*x**3 + .75*x**2 - 1.5*x - 2       1.613    0.138
    ((.25*x + .75)*x - 1.5)*x - 2         0.301    0.110
    x                                     0.052    0.045
    sin(x)**2+cos(x)**2                   0.715    0.559

[Bar chart: time to evaluate both polynomial forms, NumPy vs Numexpr, 1 thread]
Beyond numexpr: Numba
Numexpr Limitations
• Numexpr only implements element-wise operations, i.e. ‘a*b’ is evaluated as:

  for i in range(N):
      c[i] = a[i] * b[i]

• In particular, it cannot deal with things like:

  for i in range(N):
      c[i] = a[i-1] + a[i] * b[i]
Numba: Overcoming numexpr Limitations
• Numba is a JIT compiler that can translate a subset of the Python language into machine code
• It uses the LLVM infrastructure behind the scenes
• Can achieve similar or better performance
  than numexpr, but with more flexibility
How Numba Works
[Diagram: a Python function is translated by llvm-py into LLVM 3.1 IR, which is lowered to machine code; LLVM backends (ISPC, OpenCL, OpenMP, CUDA, CLANG) target Intel, AMD, Nvidia, and Apple platforms]
Numba Example: Computing the Polynomial
import numpy as np
import numba as nb

N = 10*1000*1000

x = np.linspace(-1, 1, N)
y = np.empty(N, dtype=np.float64)

@nb.jit(arg_types=[nb.f8[:], nb.f8[:]])
def poly(x, y):
    for i in range(N):
        # y[i] = 0.25*x[i]**3 + 0.75*x[i]**2 + 1.5*x[i] - 2
        y[i] = ((0.25*x[i] + 0.75)*x[i] + 1.5)*x[i] - 2

poly(x, y)   # run through Numba!
Times for Computing the Polynomial (In Seconds)

  Poly version        (I)      (II)
  NumPy              1.086    0.505
  numexpr            0.108    0.096
  Numba              0.055    0.054
  Pure C, OpenMP     0.215    0.054

• Compilation time for Numba: 0.019 sec
• Run on Mac OS X, Core 2 Duo @ 2.13 GHz
Numba: LLVM for Python
Python code can reach C
 speed without having to
   program in C itself
  (and without losing interactivity!)
Numba in SC2012
 Awesome Python!
“If a datastore requires all data to fit in memory, it isn't big data”

                   -- Alex Gaynor (on Twitter)




Optimal Containers for Big Data
The Need for a Good Data Container
• Too often we are focused only on computing as fast as possible
• But we have seen how important data access is
• Hence, having an optimal data structure is
  critical for getting good performance when
  processing very large datasets
Appending Data in Large NumPy Objects

[Diagram: the array to be enlarged and the new data to append are both copied into a new memory allocation holding the final array object]

• Normally a realloc() call will not succeed in growing the block in place
• Both memory areas have to exist simultaneously
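This is easy to see with np.append, which always allocates a new array and copies (a small sketch):

```python
import numpy as np

a = np.arange(5)
b = np.append(a, [5, 6])   # allocates a brand-new array and copies everything

# The original array is untouched; both arrays coexist in memory during
# the copy, which is what makes growing very large arrays expensive.
```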
Contiguous vs Chunked
 NumPy container       Blaze container

                          chunk 1

                          chunk 2
                             .
                             .
                             .
                          chunk N

Contiguous memory   Discontiguous memory
Appending data in Blaze
[Diagram: to enlarge a Blaze array, the existing chunks are left untouched; the new data is compressed into a new chunk that is appended to the final array object]
Only a small amount of data has to be compressed
Blosc: (De)compressing Faster Than memcpy()




Transmission + decompression faster than direct transfer?
Example of How Blosc Accelerates Genomics I/O: SeqPack (backed by Blosc)

TABLE 1: Test Data Sets

  #   Source         Identifier   Sequencer             Read Count    Read Length   ID Lengths   FASTQ Size
  1   1000 Genomes   ERR000018    Illumina GA             9,280,498   36 bp         40–50          1,105 MB
  2   1000 Genomes   SRR493233    Illumina HiSeq 2000    43,225,060   100 bp        51–61         10,916 MB
  3   1000 Genomes   SRR497004    AB SOLiD 4            122,924,963   51 bp         78–91         22,990 MB

[Fig. 1: in-memory throughputs for several compression schemes applied to increasing block sizes (where each sequence is 256 bytes long)]

Source: Howison, M. (in press). High-throughput compression of FASTQ data with SeqDB. IEEE Transactions on Computational Biology and Bioinformatics.
How Blaze Does Out-Of-Core Computations

[Diagram: the computation proceeds chunk by chunk over the compressed, chunked container, feeding each decompressed block to the virtual machine]

Virtual machine: Python, numexpr, Numba
Last Message for Today
Big data is tricky to manage:

Look for the optimal containers for
your data


Spending some time choosing your
appropriate data container can be a big time
saver in the long run
Summary
• Python is a perfect language for Big Data
• Nowadays you have to be aware of the memory subsystem to get good performance
• Choosing appropriate data containers is of
  the utmost importance when dealing with
  Big Data
“Success in Big Data will come to those developers who are able to look beyond the standard, and who understand the underlying hardware resources and the variety of available algorithms.”

-- Oscar de Bustos, HPC Line of Business Manager at BULL

Thank you!

More Related Content

What's hot (20)

PPTX
MapReduce: A useful parallel tool that still has room for improvement
Kyong-Ha Lee
 
PPTX
Webinar: Understanding Storage for Performance and Data Safety
MongoDB
 
PDF
Advances in GPU Computing
Frédéric Parienté
 
PPTX
Real-Time Big Data with Storm, Kafka and GigaSpaces
Oleksii Diagiliev
 
PDF
CuPy v4 and v5 roadmap
Preferred Networks
 
PPTX
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Cloudera, Inc.
 
PDF
Comparison of deep learning frameworks from a viewpoint of double backpropaga...
Kenta Oono
 
PDF
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
PeterAndreasEntschev
 
PPTX
Linux MMAP & Ioremap introduction
Gene Chang
 
PPTX
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
mjfrankli
 
PDF
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
Chris Richardson
 
PDF
Chainer Update v1.8.0 -> v1.10.0+
Seiya Tokui
 
PDF
lec4_ref.pdf
vishal choudhary
 
PDF
Apache Hadoop & Friends at Utah Java User's Group
Cloudera, Inc.
 
PPTX
Caffe framework tutorial2
Park Chunduck
 
PDF
クラウド時代の半導体メモリー技術
Ryousei Takano
 
PDF
GIST AI-X Computing Cluster
Jax Jargalsaikhan
 
PDF
Understanding DLmalloc
Haifeng Li
 
PPTX
GPU-Accelerated Parallel Computing
Jun Young Park
 
PDF
On heap cache vs off-heap cache
rgrebski
 
MapReduce: A useful parallel tool that still has room for improvement
Kyong-Ha Lee
 
Webinar: Understanding Storage for Performance and Data Safety
MongoDB
 
Advances in GPU Computing
Frédéric Parienté
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Oleksii Diagiliev
 
CuPy v4 and v5 roadmap
Preferred Networks
 
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Cloudera, Inc.
 
Comparison of deep learning frameworks from a viewpoint of double backpropaga...
Kenta Oono
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
PeterAndreasEntschev
 
Linux MMAP & Ioremap introduction
Gene Chang
 
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
mjfrankli
 
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
Chris Richardson
 
Chainer Update v1.8.0 -> v1.10.0+
Seiya Tokui
 
lec4_ref.pdf
vishal choudhary
 
Apache Hadoop & Friends at Utah Java User's Group
Cloudera, Inc.
 
Caffe framework tutorial2
Park Chunduck
 
クラウド時代の半導体メモリー技術
Ryousei Takano
 
GIST AI-X Computing Cluster
Jax Jargalsaikhan
 
Understanding DLmalloc
Haifeng Li
 
GPU-Accelerated Parallel Computing
Jun Young Park
 
On heap cache vs off-heap cache
rgrebski
 

Similar to Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012 (20)

PDF
How shit works: the CPU
Tomer Gabel
 
PDF
STORMPresentation and all about storm_FINAL.pdf
ajajkhan16
 
PDF
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
huguk
 
PPTX
Jvm memory model
Yoav Avrahami
 
PPS
Storm presentation
Shyam Raj
 
PDF
Lecture 25
Berkay TURAN
 
PPTX
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
Tech in Asia ID
 
PDF
Learn How to Master Solr1 4
Lucidworks (Archived)
 
PDF
HES2011 - Aaron Portnoy and Logan Brown - Black Box Auditing Adobe Shockwave
Hackito Ergo Sum
 
PDF
Python高级编程(二)
Qiangning Hong
 
PDF
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
npinto
 
PDF
Migrating from matlab to python
ActiveState
 
PDF
Apache con 2020 use cases and optimizations of iotdb
ZhangZhengming
 
PDF
NAS EP Algorithm
Jongsu "Liam" Kim
 
PDF
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA Taiwan
 
PPTX
Ca บทที่สี่
atit604
 
PPTX
Hanborq optimizations on hadoop map reduce 20120221a
Schubert Zhang
 
PDF
Kaggle tokyo 2018
Cournapeau David
 
PDF
Spark Summit EU talk by Qifan Pu
Spark Summit
 
PPTX
Ops Jumpstart: MongoDB Administration 101
MongoDB
 
How shit works: the CPU
Tomer Gabel
 
STORMPresentation and all about storm_FINAL.pdf
ajajkhan16
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
huguk
 
Jvm memory model
Yoav Avrahami
 
Storm presentation
Shyam Raj
 
Lecture 25
Berkay TURAN
 
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)
Tech in Asia ID
 
Learn How to Master Solr1 4
Lucidworks (Archived)
 
HES2011 - Aaron Portnoy and Logan Brown - Black Box Auditing Adobe Shockwave
Hackito Ergo Sum
 
Python高级编程(二)
Qiangning Hong
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
npinto
 
Migrating from matlab to python
ActiveState
 
Apache con 2020 use cases and optimizations of iotdb
ZhangZhengming
 
NAS EP Algorithm
Jongsu "Liam" Kim
 
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA Taiwan
 
Ca บทที่สี่
atit604
 
Hanborq optimizations on hadoop map reduce 20120221a
Schubert Zhang
 
Kaggle tokyo 2018
Cournapeau David
 
Spark Summit EU talk by Qifan Pu
Spark Summit
 
Ops Jumpstart: MongoDB Administration 101
MongoDB
 
Ad

More from Big Data Spain (20)

PDF
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data Spain
 
PDF
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Big Data Spain
 
PDF
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
Big Data Spain
 
PDF
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Big Data Spain
 
PDF
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Big Data Spain
 
PDF
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Big Data Spain
 
PDF
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Big Data Spain
 
PDF
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Big Data Spain
 
PDF
State of the art time-series analysis with deep learning by Javier Ordóñez at...
Big Data Spain
 
PDF
Trading at market speed with the latest Kafka features by Iñigo González at B...
Big Data Spain
 
PDF
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Big Data Spain
 
PDF
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
Big Data Spain
 
PDF
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Big Data Spain
 
PDF
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Big Data Spain
 
PDF
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Big Data Spain
 
PDF
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Big Data Spain
 
PDF
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
Big Data Spain
 
PDF
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Big Data Spain
 
PDF
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
Big Data Spain
 
PDF
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Big Data Spain
 
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data Spain
 
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Big Data Spain
 
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
Big Data Spain
 
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Big Data Spain
 
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Big Data Spain
 
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Big Data Spain
 
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Big Data Spain
 
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Big Data Spain
 
State of the art time-series analysis with deep learning by Javier Ordóñez at...
Big Data Spain
 
Trading at market speed with the latest Kafka features by Iñigo González at B...
Big Data Spain
 
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Big Data Spain
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
Big Data Spain
 
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Big Data Spain
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Big Data Spain
 
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Big Data Spain
 
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Big Data Spain
 
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
Big Data Spain
 
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Big Data Spain
 
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
Big Data Spain
 
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Big Data Spain
 
Ad

Recently uploaded (20)

PDF
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PDF
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
PDF
CIFDAQ Market Insights for July 7th 2025
CIFDAQ
 
PDF
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 
PDF
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
PPTX
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
PDF
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
PDF
HCIP-Data Center Facility Deployment V2.0 Training Material (Without Remarks ...
mcastillo49
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PPTX
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
PDF
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
PDF
What Makes Contify’s News API Stand Out: Key Features at a Glance
Contify
 
PDF
From Code to Challenge: Crafting Skill-Based Games That Engage and Reward
aiyshauae
 
PDF
Presentation - Vibe Coding The Future of Tech
yanuarsinggih1
 
PDF
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
PDF
Biography of Daniel Podor.pdf
Daniel Podor
 
PDF
POV_ Why Enterprises Need to Find Value in ZERO.pdf
darshakparmar
 
PDF
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
PPTX
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
CIFDAQ Market Insights for July 7th 2025
CIFDAQ
 
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
HCIP-Data Center Facility Deployment V2.0 Training Material (Without Remarks ...
mcastillo49
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
What Makes Contify’s News API Stand Out: Key Features at a Glance
Contify
 
From Code to Challenge: Crafting Skill-Based Games That Engage and Reward
aiyshauae
 
Presentation - Vibe Coding The Future of Tech
yanuarsinggih1
 
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
Biography of Daniel Podor.pdf
Daniel Podor
 
POV_ Why Enterprises Need to Find Value in ZERO.pdf
darshakparmar
 
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 

Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012

  • 1. It´s The Memory, Stupid! or: How I Learned to Stop Worrying about CPU Speed and Love Memory Access Francesc Alted Software Architect Big Data Spain 2012, Madrid (Spain) November 16, 2012
  • 2. About Continuum Analytics • Develop new ways on how data is stored, computed, and visualized. • Provide open technologies for Data Integration on a massive scale. • Provide software tools, training, and integration/consulting services to corporate, government, and educational clients worldwide.
  • 3. Overview • The Era of ‘Big Data’ • A few words about Python and NumPy • The Starving CPU problem • Choosing optimal containers for Big Data
  • 4. “A wind of streaming data, social data and unstructured data is knocking at the door, and we're starting to let it in. It's a scary place at the moment.” -- Unidentified bank IT executive, as quoted by “The American Banker” The Dawn of ‘Big Data’
  • 5. Challenges • We have to deal with as much data as possible by using limited resources • So, we must use our computational resources optimally to be able to get the most out of Big Data
  • 6. Interactivity and Big Data • Interactivity is crucial for handling data • Interactivity and performance are crucial for handling Big Data
  • 7. Python and ‘Big Data’ • Python is an interpreted language and hence, it offers interactivity • Myth: “Python is slow, so why on the hell are you going to use it for Big Data?” • Answer: Python has access to an incredibly powerful range of libraries that boost its performance far beyond your expectations • ...and during this talk I will prove it!
  • 8. NumPy: A Standard ‘De Facto’ Container
  • 9.     
  • 10. Operating with NumPy • array[2]; array[1,1:5, :]; array[[3,6,10]] • (array1**3 / array2) - sin(array3) • numpy.dot(array1, array2): access to optimized BLAS (*GEMM) functions • and much more...
  • 11. Nothing Is Perfect • NumPy is just great for many use cases • However, it also has its own deficiencies: • Follows the Python evaluation order in complex expressions like : (a * b) + c • Does not have support for multiprocessors (except for BLAS computations)
  • 12. Numexpr: Dealing with Complex Expressions • It comes with a specialized virtual machine for evaluating expressions • It accelerates computations mainly by making a more efficient memory usage • It supports extremely easy to use multithreading (active by default)
  • 13. Exercise (I) Evaluate the next polynomial: 0.25x3 + 0.75x2 + 1.5x - 2 in the range [-1, 1] with a step size of 2*10-7, using both NumPy and numexpr. Note: use a single processor for numexpr numexpr.set_num_threads(1)
  • 14. Exercise (II) Rewrite the polynomial in this notation: ((0.25x + 0.75)x + 1.5)x - 2 and redo the computations. What happens?
  • 15. ((.25*x + .75)*x - 1.5)*x – 2 0,301 0,11 x 0,052 0,045 sin(x)**2+cos(x)**2 0,715 0,559 Time to evaluate polynomial (1 thread) 1,8 1,6 1,4 1,2 NumPy 1 Time (s) Numexpr 0,8 0,6 0,4 0,2 0 .25*x**3 + .75*x**2 - 1.5*x – 2 ((.25*x + .75)*x - 1.5)*x – 2 NumPy vs Numexpr (1 thread) 1,8
• 16. Power Expansion Numexpr expands the expression: 0.25x³ + 0.75x² + 1.5x - 2 to: 0.25*x*x*x + 0.75*x*x + 1.5*x - 2 so there is no need to call the transcendental pow()
• 17. Pending Question • Why does numexpr continue to be 3x faster than NumPy, even when both execute exactly the *same* number of operations?
• 18. “Across the industry, today’s chips are largely able to execute code faster than we can feed them with instructions and data.” – Richard Sites, after his article “It’s The Memory, Stupid!”, Microprocessor Report, 10(10), 1996 The Starving CPU Problem
  • 19. Memory Access Time vs CPU Cycle Time
• 21. The Status of CPU Starvation in 2012 • Memory latency is much higher than the CPU cycle time (between 250x and 500x). • Memory bandwidth is improving at a better rate than memory latency, but it still lags processors (between 30x and 100x).
• 22. CPU Caches to the Rescue • CPU cache latency and throughput are much better than main memory’s • However: the faster caches run, the smaller they must be
• 23. CPU Cache Evolution [Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model: CPU, main memory, mechanical disk; (b) the most common current implementation, which includes additional cache levels; and (c) a sensible guess at what’s coming over the next decade: three levels of cache in the CPU and solid state disks lying between main memory and classical mechanical disks.]
• 24. When Are CPU Caches Effective? Mainly in two scenarios: • Temporal locality: when the dataset is reused • Spatial locality: when the dataset is accessed sequentially
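A small illustration of spatial locality with NumPy (the array size is an arbitrary choice, and the actual timings are machine-dependent, so none are claimed here):

```python
import timeit
import numpy as np

n = 2000
a = np.zeros((n, n))   # C order: each row is contiguous in memory

# Walking a row touches adjacent addresses (spatial locality);
# walking a column jumps n * 8 bytes between consecutive elements.
t_row = timeit.timeit(lambda: a[0].sum(), number=1000)
t_col = timeit.timeit(lambda: a[:, 0].sum(), number=1000)
print(t_row, t_col)    # the strided column walk is typically the slower one
```

Both sums are computed correctly either way; only the memory access pattern, and hence the cache behavior, differs.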
• 25. The Blocking Technique When accessing disk or memory, fetch a contiguous block that fits in the CPU cache, operate on it, and reuse it as much as possible. Use this extensively to leverage spatial and temporal locality.
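The blocking idea can be sketched in a few lines; the block size below is a hypothetical choice (32768 float64s = 256 KB, small enough for a typical L2 cache), not a value from the talk:

```python
import numpy as np

def blocked_sum(a, block_len=32_768):
    """Sum `a` by walking it in contiguous, cache-sized blocks."""
    total = 0.0
    for start in range(0, len(a), block_len):
        block = a[start:start + block_len]   # contiguous block
        total += block.sum()                 # operate on it while it is cached
    return total

a = np.arange(1_000_000, dtype=np.float64)
assert np.isclose(blocked_sum(a), a.sum())
```

This is essentially what numexpr's virtual machine does internally: it evaluates the whole expression block by block, so intermediate results stay in cache instead of being written out as full-size temporaries.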
• 26. Time To Answer NumPy Pending Questions: time to evaluate each expression with a single thread (seconds):

    Expression                         NumPy    Numexpr
    .25*x**3 + .75*x**2 - 1.5*x - 2    1.613    0.138
    ((.25*x + .75)*x - 1.5)*x - 2      0.301    0.110
    x                                  0.052    0.045
    sin(x)**2 + cos(x)**2              0.715    0.559
  • 30. Numexpr Limitations • Numexpr only implements element-wise operations, i.e. ‘a*b’ is evaluated as: for i in range(N): c[i] = a[i] * b[i] • In particular, it cannot deal with things like: for i in range(N): c[i] = a[i-1] + a[i] * b[i]
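For this particular index-shifted pattern, plain NumPy slicing still works (a sketch; the convention chosen for the boundary element c[0] is an assumption made here, not part of the slide):

```python
import numpy as np

N = 10
a = np.arange(1.0, N + 1)
b = np.full(N, 2.0)

# The loop numexpr cannot express:
#     for i in range(1, N): c[i] = a[i-1] + a[i] * b[i]
# vectorized with shifted slices:
c = np.empty_like(a)
c[1:] = a[:-1] + a[1:] * b[1:]
c[0] = a[0] * b[0]   # boundary element: one possible convention
```

Slicing tricks only go so far, though; arbitrary loops are exactly what Numba, introduced next, is for.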
  • 31. Numba: Overcoming numexpr Limitations • Numba is a JIT that can translate a subset of the Python language into machine code • It uses LLVM infrastructure behind the scenes • Can achieve similar or better performance than numexpr, but with more flexibility
• 32. How Numba Works [Diagram: a Python function is translated via LLVM-PY into LLVM 3.1 IR and compiled to machine code; backends include ISPC, OpenCL, OpenMP, CUDA and CLANG, targeting Intel, AMD, Nvidia and Apple hardware.]
• 33. Numba Example: Computing the Polynomial

    import numpy as np
    import numba as nb

    N = 10*1000*1000
    x = np.linspace(-1, 1, N)
    y = np.empty(N, dtype=np.float64)

    @nb.jit(arg_types=[nb.f8[:], nb.f8[:]])
    def poly(x, y):
        for i in range(N):
            # y[i] = 0.25*x[i]**3 + 0.75*x[i]**2 + 1.5*x[i] - 2
            y[i] = ((0.25*x[i] + 0.75)*x[i] + 1.5)*x[i] - 2

    poly(x, y)  # run through Numba!
• 34. Times for Computing the Polynomial (in seconds)

    Poly version      (I)      (II)
    NumPy             1.086    0.505
    numexpr           0.108    0.096
    Numba             0.055    0.054
    Pure C, OpenMP    0.215    0.054

• Compilation time for Numba: 0.019 sec • Run on Mac OSX, Core2 Duo @ 2.13 GHz
  • 35. Numba: LLVM for Python Python code can reach C speed without having to program in C itself (and without losing interactivity!)
  • 36. Numba in SC 2012
  • 37. Numba in SC2012 Awesome Python!
• 38. If a datastore requires all data to fit in memory, it isn’t big data -- Alex Gaynor (on Twitter) Optimal Containers for Big Data
• 39. The Need for a Good Data Container • Too often we focus only on computing as fast as possible • But we have seen how important data access is • Hence, an optimal data structure is critical for getting good performance when processing very large datasets
• 40. Appending Data in Large NumPy Objects [Diagram: appending to a NumPy array allocates new memory for the final array object and copies both the array to be enlarged and the new data into it.] • Normally a realloc() call cannot extend the allocation in place • Both memory areas have to exist simultaneously
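The copy-on-append behavior, and the chunked alternative the next slides describe, can be sketched like this (the list of chunks is a toy stand-in for a Blaze-style container, not its actual API):

```python
import numpy as np

a = np.arange(1000)
b = np.arange(100)

# Appending to a NumPy array allocates a NEW buffer and copies both
# operands into it -- old and new arrays coexist in memory for a moment.
c = np.concatenate([a, b])
assert c is not a and len(c) == 1100

# A chunked container avoids the copy: appending just records a new chunk.
chunks = [a]
chunks.append(b)   # O(1): no existing data is moved
assert sum(len(ch) for ch in chunks) == 1100
```

For an array of many gigabytes, the difference between "copy everything" and "link one new chunk" is the difference between seconds and microseconds per append.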
• 41. Contiguous vs Chunked [Diagram: a NumPy container is a single block of contiguous memory; a Blaze container is a sequence of discontiguous chunks: chunk 1, chunk 2, ..., chunk N.]
• 42. Appending Data in Blaze [Diagram: the new data is compressed into a new chunk, which is simply linked after the existing chunks of the array being enlarged.] Only a small amount of data has to be compressed
  • 43. Blosc: (de)compressing faster than memcpy() Transmission + decompression faster than direct transfer?
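A minimal round-trip with Blosc, assuming the python-blosc package is available (the code falls back gracefully if it is not; the data and parameters are illustrative):

```python
import numpy as np

a = np.linspace(0, 100, 1_000_000)   # smooth, highly compressible data

try:
    import blosc   # python-blosc bindings (an assumed dependency)
    packed = blosc.compress(a.tobytes(), typesize=a.itemsize)
    restored = np.frombuffer(blosc.decompress(packed), dtype=a.dtype)
    ratio = a.nbytes / len(packed)
except ImportError:
    restored, ratio = a, None        # blosc not installed

assert np.array_equal(a, restored)
print(ratio)   # compression ratio, > 1 when blosc is available
```

Passing the typesize lets Blosc's shuffle filter rearrange the bytes of each float64, which is what makes numerical data compress so well, and so fast.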
• 44. Example of How Blosc Accelerates Genomics I/O: SeqPack (backed by Blosc)

    TABLE 1. Test Data Sets
    #   Source         Identifier    Sequencer            Read Count    Read Length   ID Lengths   FASTQ Size
    1   1000 Genomes   ERR000018     Illumina GA          9,280,498     36 bp         40–50        1,105 MB
    2   1000 Genomes   SRR493233_1   Illumina HiSeq 2000  43,225,060    100 bp        51–61        10,916 MB
    3   1000 Genomes   SRR497004_1   AB SOLiD 4           122,924,963   51 bp         78–91        22,990 MB

[Figure: in-memory throughputs for several compression schemes applied to increasing block sizes (where each sequence is 256 bytes long), showing consistent throughput across both compression and decompression.] Source: Howison, M. (in press). High-throughput compression of FASTQ data with SeqDB. IEEE Transactions on Computational Biology and Bioinformatics.
• 45. How Blaze Does Out-Of-Core Computations [Diagram: chunked operands are streamed through an evaluation engine block by block.] Virtual Machine: Python, numexpr, Numba
• 46. Last Message for Today Big data is tricky to manage: look for the optimal containers for your data. Spending some time choosing an appropriate data container can be a big time saver in the long run.
• 47. Summary • Python is a perfect language for Big Data • Nowadays you should be aware of the memory system to get good performance • Choosing appropriate data containers is of the utmost importance when dealing with Big Data
• 48. “Success with Big Data will belong to those developers who are able to look beyond the standard and who are capable of understanding the underlying hardware resources and the variety of available algorithms.” -- Oscar de Bustos, HPC Line of Business Manager at BULL