Achieving Parallelism in Intel Distribution for Python with Numba


This sponsored post from Intel highlights how today’s enterprises can achieve high levels of parallelism in large-scale Python applications using the Intel Distribution for Python with Numba.


Achieving high levels of parallelism in large-scale Python applications is still a challenge. (Photo: Shutterstock/By Sashkin)

Python has grown rapidly in popularity as a programming language for mathematics, science, and engineering applications. Not only is it easy to learn, but there is also a vast treasure trove of packaged open source libraries targeted at just about every computational domain imaginable, from astrophysics and geophysics to data science, statistics, financial analysis, machine learning, and nearly everything in between. That makes it easy to build applications on top of the work of others.

At the core are NumPy and SciPy, native runtime libraries that provide the standard math functions found in most science and engineering applications, including numerical integration, interpolation, optimization, linear algebra, and statistics. As mentioned in previous posts, Intel Distribution for Python includes many library packages that have been highly optimized for the latest Intel Xeon processors.

But achieving high levels of parallelism in large-scale Python applications is still a challenge. A recent issue of the Parallel Universe magazine describes one approach that uses just-in-time (JIT) and low-level virtual machine (LLVM) compilation engines to create native-speed code.

This approach uses Numba, an open-source, NumPy-aware optimizing compiler for Python. Numba translates a subset of Python and NumPy functions into fast machine code using LLVM through the llvmlite Python package. It provides an easy way to parallelize Python, often with only minor code changes.


To target code for JIT compilation, the code must first be enclosed in a function, and that function must be marked with the @jit decorator. The decorator hands the function to Numba, which generates an intermediate representation (IR) of it, along with a context for the target hardware, for later compilation through LLVM, subject to a range of options and parallelism directives.

This works nicely with general Python code. But using scientific or numerical packages like NumPy and SciPy adds a few complications. Because some NumPy primitives are already highly optimized, not every application using NumPy or SciPy functions will optimize well with Numba. For example, Numba might produce slower code for programs that use certain Basic Linear Algebra Subprograms (BLAS) functions, because the BLAS functions have already been highly optimized, and Numba cannot optimize them any further.

However, the Parallel Universe magazine article does identify situations where Numba optimizations work well, such as situations where multiple NumPy references are stacked together in expressions. Here, Numba can analyze and determine the best vectorization and alignment strategy better than NumPy can.

Numba provides a set of options for the @jit decorator that you can use to tweak the compiler’s analysis and code generation strategies, based on what you know about the code.

Admittedly, as outlined in the article, achieving parallelism in Python with Numba takes some practice and an understanding of the fundamentals. Still, Numba is one of the best approaches to exploiting parallelism, and anyone developing scientific computing software in Python should become aware of its capabilities.

Intel Distribution for Python includes an optimized Numba that uses the latest SIMD features and multi-core execution to take full advantage of the latest Intel platform architectures. For a free download, go here.