SciPy 2019: How to Accelerate an Existing Codebase with Numba

Stan Seibert
© 2019 Anaconda

The Four Step Process

Step 1: Make an Honest Self-Inventory
• Why do you want to speed up your code?
• Tired of waiting for jobs to finish
• Make it practical to scale up to larger workloads
• Entertainment / drag racing (be honest!)
• First express your ultimate goal in absolute terms, not relative:
• "I wish this job finished in 20 minutes."
• "I wish this job ran 50% faster."
• "I want to reach 90% of the theoretical hardware maximum"

Maslow's Hierarchy of Software Project Needs
Does the code work?
Are there automated tests?
Is there user documentation?
Is it easy to install?
Is it fast enough?

A Benchmarking Test Subject: pymcmcstat
• Note:
• I have no connection with this project
• Any issues here are my fault
• Wanted unfamiliar, real-world code base for examples
• Comes with good docs and examples that can be converted into performance tests
• Check out their talk @ 3:10 after lunch!
https://github.com/prmiles/pymcmcstat

How Numba works
[Diagram: Numba's compilation pipeline for a function decorated with @jit, e.g. def do_math(a, b), when it is called as do_math(x, y). Boxes shown: Python Function (bytecode), Unbox Function Arguments, Bytecode Analysis, Numba IR, Rewrite IR, Type Inference, Lowering, LLVM IR, LLVM/NVVM JIT, Machine Code, Cache, Execute!]

Numba Internals in a Nutshell
• Translate Python objects of supported types into representations with no CPython dependencies ("unboxing")
• Compile Python bytecode from your decorated function into machine code.
• Swap calls to builtins and NumPy functions for implementations provided by Numba (or 3rd party Numba extensions)
• Allow LLVM to inline functions, autovectorize loops, and do other optimizations you would expect from a C compiler
• When calling the function, release the GIL if requested
• Convert return values back to Python objects ("boxing")

What Numba does not do
• Automated translation of CPython or NumPy implementations
• Automatic compilation of 3rd party libraries
• Partial compilation
• Automatic conversion of arbitrary Python types
• Change the layout of data allocated in the interpreter
• Translate entire programs
• Magically make individual NumPy functions faster

When is Numba unlikely to help?
• Whole program compilation
• Critical functions have already been converted to C or optimized Cython
• Need to interface directly to C++
• Need to generate C/C++ for separate compilation
• Algorithms are not primarily numerical
• Exception: Numba can do pretty well at bit manipulation

Step 2: Measurement
Unit Tests ("Did I break it?") != Performance Tests ("Did I make it faster?")

Unit testing scientific code
• If you don't have a test suite, start with one test:
• a whole program "smoke test" that runs quickly
• take a run that you trust and make its output your "expected value" for the test
• Move on to testing individual functions once you have some smoke test coverage
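One way such a smoke test could look, as a sketch only. run_model is a hypothetical stand-in for a whole-program entry point; in a real project the trusted output would be saved to a file kept alongside the tests rather than recomputed.

```python
import numpy as np


def run_model(n_steps, seed=0):
    """Hypothetical stand-in for a whole-program entry point."""
    rng = np.random.RandomState(seed)
    return np.cumsum(rng.normal(size=n_steps))


# One-time step: record the output of a run you trust.
EXPECTED = run_model(1000)


def test_smoke():
    # Re-run the whole program on the same inputs and compare the result
    # with the trusted output.
    result = run_model(1000)
    np.testing.assert_allclose(result, EXPECTED)


test_smoke()
```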

Be Realistic About Expected Accuracy
Floating point numbers are not real numbers!
Tolerance is adjustable
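A short illustration of why exact comparisons fail and how an adjustable tolerance helps. numpy.testing.assert_allclose is used here as one common choice; the tolerances are illustrative, not recommendations.

```python
import numpy as np

a = 0.1 + 0.2
b = 0.3
print(a == b)  # False: the two values differ in the last bits

# Compare with an explicit relative tolerance instead of exact equality.
np.testing.assert_allclose(a, b, rtol=1e-12)

# Reordering a sum (as a compiler or a parallel loop may do) can also change
# the last digits, so choose rtol/atol to match the accuracy you need.
x = np.linspace(0.0, 1.0, 1001)
forward = sum(x)         # left-to-right Python sum
reordered = np.sum(x)    # NumPy may add the terms in a different order
np.testing.assert_allclose(forward, reordered, rtol=1e-10)
```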

Performance testing scientific code
• A unit test suite is not a performance test suite
• Unit tests overemphasize setup/IO/teardown steps
• Perf tests need to have realistic complexity and input sizes
• If your perf tests are < 0.1 sec, use %timeit in Jupyter or time module.
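Outside a notebook, the standard library timeit module does the same job as %timeit. A sketch, with workload as a hypothetical stand-in for the code under test:

```python
import timeit

import numpy as np


def workload(n):
    """Hypothetical stand-in for the code under test."""
    x = np.linspace(0.0, 1.0, n)
    return np.sqrt(x).sum()


# Run the workload several times per measurement and repeat the measurement.
# The minimum is the value least disturbed by other activity on the machine.
number = 20
timings = timeit.repeat(lambda: workload(100_000), number=number, repeat=5)
best_per_call = min(timings) / number
print(f"best of 5: {best_per_call * 1e3:.3f} ms per call")
```

When timing a Numba-compiled function, call it once before measuring so that compilation time is not counted.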

Profiling Tools
• Collecting results:
• Command line: python -m cProfile -o step0.prof myscript.py
• Notebook cell: %%prun -D step0.prof
• Looking at results:
• Command line: python -m pstats step0.prof
• Web Browser: snakeviz step0.prof
• Also useful: line_profiler!
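The same collection and inspection can be done from inside a script with the standard library. A sketch with hypothetical functions slow_part, fast_part and main:

```python
import cProfile
import io
import pstats


def slow_part(n):
    return sum(i * i for i in range(n))


def fast_part(n):
    return n * 2


def main():
    return slow_part(200_000) + fast_part(10)


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Sort by cumulative time and print only the top entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)

# profiler.dump_stats("step0.prof") would write a file that snakeviz can open.
```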

SnakeViz: pymcmcstat Algae example
Nearly all the time is spent in one function

SnakeViz: pymcmcstat estimating_error_variance_for_mutliple_data_sets
More diffuse spread of execution time
Focus on the biggest thing first

Step 3: Refactoring the Code
• Options for introducing Numba into a code base:
1. Replace code with a Numba implementation
• Numba is now a required dependency
2. Compile functions only when Numba is present
• Numba is optional dependency
• Sometimes hard to write one function that maximizes performance both with and without Numba
3. Pick between different implementations of same function at runtime
• Numba is optional dependency
• Can tailor each implementation to maximize performance
• Also good strategy for exploring distributed or GPU-accelerated computing
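Options 2 and 3 could be sketched as follows. The fallback decorator and the names HAVE_NUMBA, sum_sq_loop, sum_sq_numpy and sum_sq are hypothetical; this is one possible pattern, not the only one.

```python
import numpy as np

try:
    from numba import njit
    HAVE_NUMBA = True
except ImportError:
    HAVE_NUMBA = False

    def njit(*args, **kwargs):
        """Fallback no-op decorator that accepts the same call styles."""
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]           # used as @njit
        return lambda func: func     # used as @njit(...)


# Option 2: the same source is compiled only when Numba is importable.
@njit
def sum_sq_loop(x):
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * x[i]
    return total


# Option 3: a separate implementation tuned for plain NumPy.
def sum_sq_numpy(x):
    return float(np.dot(x, x))


# Pick the implementation once, at import time.
sum_sq = sum_sq_loop if HAVE_NUMBA else sum_sq_numpy

print(sum_sq(np.arange(4.0)))
```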

Become Familiar With Numba's Limitations

Rule 1: Always use @jit(nopython=True)
• If you compile this function with just @jit, it will fall back to object mode.
• Can you spot why?

Rule 1: Always use @jit(nopython=True)
• Trick Question!
• You can't tell because you don't know what types are going into this function
• nopython=True will raise an error and give you a chance to figure out what the problem is

Rule 1b: ... and object mode blocks if you must
• Object mode blocks are good for:
• I/O
• Callbacks and progress bars
• Not wasting time implementing Numba-friendly versions of operations that are not a bottleneck
• Always try to reorg your code first, and use object mode blocks as a last resort.

Rule 2: Pay attention to data types
• Best for Numba:
• NumPy arrays
• NumPy views on other containers
• OK:
• Tuples, strings, enums, simple scalar types (int, float, bools)
• Globals are fine for constants. Pass the rest of your data as arguments
• Not good:
• General objects, Python lists, Python dicts

Data Types: Algae Example
[Code figure: Original (heterogeneous list 😞) vs. Fixed]
Tuples are like C structs in Numba: every element can have a different data type
With this change, can compile algaesys. Benchmark: 63 sec → 14.4 sec!

Rule 2b: ...and typed containers for nested data
• But what if I need something more complex?
• Use Numba typed containers:
• numba.typed.Dict (version 0.43)
• numba.typed.List (coming in version 0.45)
• Can nest any types that Numba knows about:
• List[List[int]]
• Dict[int, float32[:,:]]
• Dict[str, int]

Rule 2b: ...and typed containers for nested data
[Code figure: list of ParameterSet classes; uses slicing and recursion]
Can't port this today (need typed list + @jitclass), but should be able to after Numba 0.45 (RC this morning!)

Rule 3: Write it like FORTRAN
• Numba frees you from some of the constraints of Python, so make sure you take advantage of that:
• Calling small functions is cheap / free (thanks to inlining)
• Break up big chunky functions
• Manual loops perform just as well as array functions.
• Use them when you want to avoid making temporary arrays and to improve readability

Rule 3a: Prefer functions over classes
No need for self, except to call sub-functions

Rule 3b: ...or array exprs and ufuncs
• Numba automatically compiles array expressions into fused loops
• Make a new ufunc with @numba.vectorize when you need control flow to compute an element
• Beware of treating 1 element arrays like scalars
[Code figure annotation: These are arrays, not scalars]

Rule 4: Target serial execution first
• Threads make everything harder to reason about
• Your algorithms may not be in a parallelizable form
• Even if you want to go parallel, start with working serial version.
• If serial execution meets your performance goals, stop!

Rule 4b: ...but think about parallel
• Think about what loops in your code could run in parallel.
• parallel=True & numba.prange() make parallel loops easier
• Know your race conditions:
• Read-after-write: One loop iteration reads data that another loop iteration writes
• Write-after-write: Two loop iterations write data to the same place
• Can sometimes avoid race conditions if you reorg your loop so:
• Input and output arrays are separate
• Each iteration is responsible for one output value, not one input value

Step 4: Share with Others
• Packaging with Numba as a dependency:
• Add it to your requirements.txt / conda recipe
• Wheels for (Python 2.7, 3.5-3.7) * (win-32, win-64, osx, linux-32, linux-64) available
• Conda packages for same combinations (some repos don't post Python 3.5 packages anymore)
• Numba does not require that end users have a compiler or LLVM present on their system if installed from binary packages.
• If all of your machine code comes via Numba, you can ship your package as generic for all platforms ("noarch" in conda, sdist for PyPI).

Looking Forward
• Numba is far from finished. Many things left to do:
• Profiling support for compiled code
• Other tools to introspect compiler pipeline
• Revamp of @jitclass
• Continue to improve error messages
• Expand the subset of Python we can make fast

Conclusion
• Steps for success:
1. Evaluate your project: Do you need optimization?
2. Measure: Have tests, use profilers
3. Refactor the code: Plan, follow the rules, and debug
4. Share with others: Packaging
• Start small, work incrementally, be willing to abandon your approach if it isn't working.

Resources
• Documentation: http://numba.pydata.org/numba-doc/latest/index.html
• Mailing list: http://numba.pydata.org/
• Github: https://github.com/numba/numba
• Gitter: https://gitter.im/numba/numba
• Feel free to ask general questions on mailing list or Gitter, and open Github issues on specific problems.

Thanks!

Bonus Material

When Things Go Wrong
• Turn off the JIT:
• export NUMBA_DISABLE_JIT=1
• Print debugging:
• print() of constant strings and scalars works in nopython mode
• Use GDB from Numba functions: https://numba.pydata.org/numba-doc/dev/user/troubleshoot.html#debugging-jit-compiled-code-with-gdb
• Test functions in isolation

How Numba Is Packaged
• numba source is mostly Python + tiny bit of C/C++
• llvmlite is Python + C wrapper around LLVM
• Requires specific versions of LLVM (system LLVM is usually wrong version)
• Statically links LLVM to C wrapper that is part of llvmlite package
• Once built, does not depend on external LLVM
• Building LLVM is challenging, steer users toward our binary wheels / conda packages if possible.

Packaging Limitations on Different Platforms
• x86, x86_64 wheels + conda packages are in usual places
• Linux-ARMv7 (RaspberryPi) conda packages in numba channel
• Tested with Berryconda environment
• Can ARMv7 wheels go on PyPI? piwheels.net?
• Linux-ARMv8 (64-bit ARM) conda package for only one test environment in numba channel
• No conda distribution to target yet (conda-forge working on it)
• Can ARMv8 wheels go on PyPI?
• Linux-ppc64le (POWER8, 9) conda packages in numba channel
• Can ppc64le wheels go on PyPI?

Advanced Techniques: SIMD Autovectorization
• All CPUs now have vector instructions:
• Apply math operation to multiple (sometimes up to 16!) values at once.
• LLVM can automatically translate some loops into SIMD versions, but:
• Need fastmath=True (SIMD changes order of ops)
• Need error_model='numpy' (ZeroDivisionError breaks SIMD)
• Make sure you have ICC runtime installed for SIMD special math functions:
• conda install -c numba icc_rt
• See https://github.com/numba/numba-examples/blob/master/notebooks/simd.ipynb for details.

Advanced Techniques: @generated_jit
• Pick entirely different implementations depending on
input types
• Can specialize based on type, or literal value
• Need to understand how Numba types work
