© 2019 Anaconda
How to Accelerate an Existing Codebase with Numba
Stan Seibert
The Four Step Process
Step 1: Make an Honest Self-Inventory
• Why do you want to speed up your code?
• Tired of waiting for jobs to finish
• Make it practical to scale up to larger workloads
• Entertainment / drag racing (be honest!)
• First express your ultimate goal in absolute terms, not relative:
• "I wish this job finished in 20 minutes."
• "I wish this job ran 50% faster."
• "I want to reach 90% of the theoretical hardware maximum"
Maslow's Hierarchy of Software Project Needs
Does the code work?
Are there automated tests?
Is there user documentation?
Is it easy to install?
Is it fast enough?
A Benchmarking Test Subject: pymcmcstat
• Note:
• I have no connection with this project
• Any issues here are my fault
• Wanted unfamiliar, real-world code base for examples
• Comes with good docs and examples that can be converted into performance tests
• Check out their talk @ 3:10 after lunch!
https://github.com/prmiles/pymcmcstat
How Numba works
[Diagram: Python Function (bytecode) → Bytecode Analysis → Numba IR → Rewrite IR → Type Inference (also fed by Unbox Function Arguments) → Lowering → LLVM IR → LLVM/NVVM JIT → Machine Code → Cache → Execute!]
@jit
def do_math(a, b):
    …
>>> do_math(x, y)
Numba Internals in a Nutshell
• Translate Python objects of supported types into representations with
no CPython dependencies ("unboxing")
• Compile Python bytecode from your decorated function into machine
code.
• Swap calls to builtins and NumPy functions for implementations
provided by Numba (or 3rd party Numba extensions)
• Allow LLVM to inline functions, autovectorize loops, and do other
optimizations you would expect from a C compiler
• When calling the function, release the GIL if requested
• Convert return values back to Python objects ("boxing")
What Numba does not do
• Automated translation of CPython or NumPy implementations
• Automatic compilation of 3rd party libraries
• Partial compilation
• Automatic conversion of arbitrary Python types
• Change the layout of data allocated in the interpreter
• Translate entire programs
• Magically make individual NumPy functions faster
When is Numba unlikely to help?
• Whole program compilation
• Critical functions have already been converted to C or
optimized Cython
• Need to interface directly to C++
• Need to generate C/C++ for separate compilation
• Algorithms are not primarily numerical
• Exception: Numba can do pretty well at bit manipulation
Step 2: Measurement
Unit Tests ("Did I break it?") != Performance Tests ("Did I make it faster?")
Unit testing scientific code
• If you don't have a test suite, start with one test:
• a whole program "smoke test" that runs quickly
• take a run that you trust and make its output your
"expected value" for the test
• Move on to testing individual functions once you have
some smoke test coverage
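A minimal sketch of such a smoke test, using only the standard library. The function run_pipeline is a hypothetical stand-in for a whole-program run, and the file name expected.json is likewise an assumption; the point is the record-once, compare-later pattern.

```python
import json
import math
import random
import tempfile
from pathlib import Path


def run_pipeline(seed):
    """Hypothetical stand-in for a whole-program run; returns one summary number."""
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    return sum(x * x for x in samples) / len(samples)


def record_reference(path, seed=0):
    """Run once with a version you trust and store its output as the expected value."""
    path.write_text(json.dumps({"seed": seed, "value": run_pipeline(seed)}))


def smoke_test(path, rel_tol=1e-9):
    """Re-run with the same seed and compare against the stored expected value."""
    ref = json.loads(path.read_text())
    return math.isclose(run_pipeline(ref["seed"]), ref["value"], rel_tol=rel_tol)


reference_file = Path(tempfile.mkdtemp()) / "expected.json"
record_reference(reference_file)
print("smoke test passed:", smoke_test(reference_file))
```

In a real project the reference file would be committed next to the tests, and smoke_test would be called from the test runner after each refactoring step.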
Be Realistic About Expected Accuracy
Floating point numbers are not real numbers!
Tolerance is adjustable
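The code on the original slide is not recoverable, so the following is only a sketch of the idea using the standard library's math.isclose. For arrays, numpy.testing.assert_allclose plays the same role with its rtol and atol parameters.

```python
import math

computed = 0.1 + 0.2
expected = 0.3

# Exact comparison fails because neither value is exactly representable in binary.
print(computed == expected)                              # False

# A relative tolerance states how much disagreement is acceptable.
print(math.isclose(computed, expected, rel_tol=1e-9))    # True

# Tighten the tolerance too far and the same comparison fails again.
print(math.isclose(computed, expected, rel_tol=1e-17))   # False
```

Compiled code may reorder floating point operations, so the tolerance in a test should reflect the accuracy the application actually needs rather than bit-for-bit equality.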
Performance testing scientific code
• A unit test suite is not a performance test suite
• Unit tests overemphasize setup/IO/teardown steps
• Perf tests need to have realistic complexity and input
sizes
• If your perf tests are < 0.1 sec, use %timeit in Jupyter
or time module.
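Outside a notebook, the standard-library timeit module gives a similar repeat-and-take-the-best measurement for short-running code. In this sketch small_kernel is a hypothetical stand-in for the function under test.

```python
import timeit


def small_kernel(n=1000):
    """Hypothetical fast function standing in for a sub-0.1-second perf test."""
    return sum(i * i for i in range(n))


# Five independent timings, each running the function 200 times.
n_calls = 200
timings = timeit.repeat(small_kernel, number=n_calls, repeat=5)

# The minimum is the least disturbed by other activity on the machine.
best_per_call = min(timings) / n_calls
print(f"best of 5: {best_per_call * 1e6:.1f} microseconds per call")
```

When timing a Numba-compiled function, call it once before measuring so that compilation time is not counted as execution time.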
Profiling Tools
• Collecting results:
• Command line: python -m cProfile -o step0.prof myscript.py
• Notebook cell: %%prun -D step0.prof
• Looking at results:
• Command line: python -m pstats step0.prof
• Web Browser: snakeviz step0.prof
• Also useful: line_profiler!
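The same collect-then-inspect workflow can also be driven from inside a script with the standard library. This is a sketch; slow_part, fast_part, and main are hypothetical placeholders for real workload code.

```python
import cProfile
import io
import pstats


def slow_part(n):
    return sum(i * i for i in range(n))


def fast_part(n):
    return n * 2


def main():
    slow_part(200_000)
    fast_part(10)


# Collect: profile only the region of interest.
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Inspect: sort by cumulative time and show the top entries.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Sorting by cumulative time surfaces the functions whose call trees dominate the run, which is the same view SnakeViz presents graphically.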
SnakeViz: pymcmcstat Algae example
Nearly all the time is spent in one function
SnakeViz: pymcmcstat estimating_error_variance_for_mutliple_data_sets
More diffuse spread of execution time
Focus on the biggest thing first
Step 3: Refactoring the Code
• Options for introducing Numba into a code base:
1. Replace code with a Numba implementation
• Numba is now a required dependency
2. Compile functions only when Numba is present
• Numba is optional dependency
• Sometimes hard to write one function that maximizes performance both
with and without Numba
3. Pick between different implementations of same function at runtime
• Numba is optional dependency
• Can tailor each implementation to maximize performance
• Also good strategy for exploring distributed or GPU-accelerated
computing
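Option 2 can be sketched as follows. The fallback decorator and the HAVE_NUMBA flag are illustrative names, not part of Numba; the function runs as compiled code when Numba is importable and as plain Python otherwise.

```python
try:
    from numba import njit  # compiled path when Numba is installed
    HAVE_NUMBA = True
except ImportError:
    HAVE_NUMBA = False

    def njit(func):
        """Fallback decorator: leave the function as plain Python."""
        return func


@njit
def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


print("Numba available:", HAVE_NUMBA)
print(sum_of_squares(10))  # 285 either way
```

For option 3, the same flag could instead select between two separately written implementations, each tuned for its own execution path.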
Become Familiar With Numba's Limitations
© 2019 Anaconda - Confidential & Proprietary
Rule 1: Always use @jit(nopython=True)
• If you compile this function with just @jit, it will fall back to object mode.
• Can you spot why?
• Trick Question!
• You can't tell because you don't know what types are going into this function
• nopython=True will raise an error and give you a chance to figure out what the problem is
Rule 1b: ... and object mode blocks if you must
• Object mode blocks are good for:
• I/O
• Callbacks and progress bars
• Not wasting time implementing Numba-friendly
versions of operations that are not a bottleneck
• Always try to reorg your code first, and use object mode
blocks as a last resort.
Rule 2: Pay attention to data types
• Best for Numba:
• NumPy arrays
• NumPy views on other containers
• OK:
• Tuples, strings, enums, simple scalar types (int, float, bools)
• Globals are fine for constants. Pass the rest of your data as arguments
• Not good:
• General objects, Python lists, Python dicts
Data Types: Algae Example
[Code comparison on slide: "Original" (a heterogeneous list 😞) vs. "Fixed"]
Tuples are like C structs in Numba: every element can have a different data type.
With this change, algaesys can be compiled. Benchmark: 63 sec → 14.4 sec!
Rule 2b: ...and typed containers for nested data
• But what if I need something more complex?
• Use Numba typed containers:
• numba.typed.Dict (version 0.43)
• numba.typed.List (coming in version 0.45)
• Can nest any types that Numba knows about:
• List[List[int]]
• Dict[int, float32[:,:]]
• Dict[str, int]
Rule 2b: ...and typed containers for nested data
List of ParameterSet classes
Uses slicing and recursion
Can't port this today (need typed list + @jitclass), but should be able to after Numba 0.45 (RC this morning!)
Rule 3: Write it like FORTRAN
• Numba frees you from some of the constraints of Python,
so make sure you take advantage of them:
• Calling small functions is cheap / free (thanks to inlining)
• Break up big chunky functions
• Manual loops perform just as well as array functions.
• Use them when you want to avoid making temporary
arrays and to improve readability
Rule 3a: Prefer functions over classes
No need for self, except to call sub-functions
Rule 3b: ...or array exprs and ufuncs
• Numba automatically compiles array expressions into fused loops
• Make a new ufunc with @numba.vectorize when you need control flow to compute an element
• Beware of treating 1 element arrays like scalars
These are arrays, not scalars
Rule 4: Target serial execution first
• Threads make everything harder to reason about
• Your algorithms may not be in a parallelizable form
• Even if you want to go parallel, start with working serial
version.
• If serial execution meets your performance goals, stop!
Rule 4b: ...but think about parallel
• Think about what loops in your code could run in parallel.
• parallel=True & numba.prange() make parallel loops easier
• Know your race conditions:
• Read-after-write: One loop iteration reads data that another loop iteration
writes
• Write-after-write: Two loop iterations write data to the same place
• Can sometimes avoid race conditions if you reorg your loop so:
• Input and output arrays are separate
• Each iteration is responsible for one output value, not one input value
Step 4: Share with Others
• Packaging with Numba as a dependency:
• Add it to your requirements.txt / conda recipe
• Wheels for (Python 2.7, 3.5-3.7) * (win-32, win-64, osx, linux-32, linux-64)
available
• Conda packages for same combinations (some repos don't post Python 3.5
packages anymore)
• Numba does not require that end users have a compiler or LLVM present on
their system if installed from binary packages.
• If all of your machine code comes via Numba, you can ship your package as
generic for all platforms ("noarch" in conda, sdist for PyPI).
Looking Forward
• Numba is far from finished. Many things left to do:
• Profiling support for compiled code
• Other tools to introspect compiler pipeline
• Revamp of @jitclass
• Continue to improve error messages
• Expand the subset of Python we can make fast
Conclusion
• Steps for success:
1. Evaluate your project: Do you need optimization?
2. Measure: Have tests, use profilers
3. Refactor the code: Plan, follow the rules, and debug
4. Share with others: Packaging
• Start small, work incrementally, be willing to abandon your
approach if it isn't working.
Resources
• Documentation: http://numba.pydata.org/numba-doc/latest/index.html
• Mailing list: http://numba.pydata.org/
• GitHub: https://github.com/numba/numba
• Gitter: https://gitter.im/numba/numba
• Feel free to ask general questions on mailing list or Gitter, and open GitHub issues on specific problems.
Thanks!
Bonus Material
When Things Go Wrong
• Turn off the JIT:
• export NUMBA_DISABLE_JIT=1
• Print debugging:
• print() of constant strings and scalars works in nopython mode
• Use GDB from Numba functions: https://numba.pydata.org/numba-doc/dev/user/troubleshoot.html#debugging-jit-compiled-code-with-gdb
• Test functions in isolation
How Numba Is Packaged
• numba source is mostly Python + tiny bit of C/C++
• llvmlite is Python + C wrapper around LLVM
• Requires specific versions of LLVM (system LLVM is usually wrong version)
• Statically links LLVM to C wrapper that is part of llvmlite package
• Once built, does not depend on external LLVM
• Building LLVM is challenging, steer users toward our binary wheels /
conda packages if possible.
Packaging Limitations on Different Platforms
• x86, x86_64 wheels + conda packages are in usual places
• Linux-ARMv7 (RaspberryPi) conda packages in numba channel
• Tested with Berryconda environment
• Can ARMv7 wheels go on PyPI? piwheels.net?
• Linux-ARMv8 (64-bit ARM) conda package for only one test
environment in numba channel
• No conda distribution to target yet (conda-forge working on it)
• Can ARMv8 wheels go on PyPI?
• Linux-ppc64le (POWER8, 9) conda packages in numba channel
• Can ppc64le wheels go on PyPI?
Advanced Techniques: SIMD Autovectorization
• All CPUs now have vector instructions:
• Apply math operation to multiple (sometimes up to 16!) values at once.
• LLVM can automatically translate some loops into SIMD versions, but:
• Need fastmath=True (SIMD changes order of ops)
• Need error_model='numpy' (ZeroDivisionError breaks SIMD)
• Make sure you have ICC runtime installed for SIMD special math functions:
• conda install -c numba icc_rt
• See https://github.com/numba/numba-examples/blob/master/notebooks/simd.ipynb for details.
Advanced Techniques: @generated_jit
• Pick entirely different implementations depending on
input types
• Can specialize based on type, or literal value
• Need to understand how Numba types work

SciPy 2019: How to Accelerate an Existing Codebase with Numba
