Parallel Pros and Cons
Python Parallelization Frameworks¶
We'll be exploring four different methods for horizontally scaling Python in this workshop:
- Self-authored multiprocessing

For when you just need one or two functions to scale (a minimal `multiprocessing.Pool` sketch follows the pros/cons below). Chances are good that if you try to make anything non-trivial, you're going to invest weeks/months of effort to discover you've made a junk version of what Dask or Celery already released 10 years ago.
Process #1: Knock, Knock
Process #2: Whose thRACE CONDITION FROM PROCESS #1 BECAUSE YOU DIDN'T USE MULTIPROCESSING CORRECTLY
Pros:
Works "out of the box"
Potential for batch & streaming (good luck...)
Cons:
Race conditions are on you
Memory management is on you
Synchronization is on you
Inter-process dataflow is on you
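For the happy "one or two functions" path, the standard library's `multiprocessing.Pool` is usually enough. A minimal sketch, assuming a pure, picklable function (the function and worker count are illustrative):

```python
from multiprocessing import Pool

def square(x):
    # Must be defined at module top level so worker processes can import (pickle) it.
    return x * x

if __name__ == "__main__":  # guard required with the "spawn" start method (Windows/macOS default)
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Anything beyond an embarrassingly parallel map (shared counters, queues, locks) is exactly where the race conditions, synchronization, and inter-process dataflow listed above land on you.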
- Dask

Use this if you're already planning on using Pandas. If you aren't using Pandas, with all the benefits of optimized C running outside the GIL that it brings, it's worth pausing to double-check you CAN'T use it before reaching for something else. A minimal sketch follows the pros/cons below.
Pros:
Nearly 1-for-1 API parity with standard Pandas
Effortless scaling for Dataframe-based workflows
Support for non-Dataframe tasks
Painless infrastructure integration
1st class support for GPUs (via RAPIDS)
Cons:
Centered around Pandas (columnar data, sorry JSON)
Complex workflows aren't a strong suit
Getting custom code & dependencies onto workers is a learning curve
While streaming may be theoretically possible, it's built for batch workflows
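A minimal sketch of the Pandas-parity point, assuming Dask is installed and a hypothetical set of CSVs at `data/*.csv` with `category` and `value` columns:

```python
import dask.dataframe as dd

# Reads lazily into partitions instead of one in-memory DataFrame.
df = dd.read_csv("data/*.csv")

# Identical to the pandas API until .compute(), which actually runs the task graph
# (locally by default, or on a cluster via a dask.distributed Client).
result = df.groupby("category")["value"].mean().compute()
print(result)
```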
- Celery

The Python parallelization Swiss Army knife: this can do whatever you're trying to do. A minimal sketch follows the pros/cons below.
Pros:
Complex workflows are a specialty
Integrating your project code & dependencies is the default
Quirky but relatively painless infrastructure
Probably supports where you store your data
Cons:
Canvas (workflow) API has a learning curve
Poor support for arbitrarily long tasks
Inter-process JSON messages can be difficult to predict
Flower doesn't have a dark mode
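A minimal sketch of a Celery task, assuming a Redis broker at a placeholder URL (the module name `tasks` and the `add` task are illustrative):

```python
from celery import Celery

# Placeholder broker/result-backend URLs; point them at your own Redis or RabbitMQ.
app = Celery("tasks", broker="redis://localhost:6379/0", backend="redis://localhost:6379/0")

@app.task
def add(x, y):
    return x + y

# After starting a worker (`celery -A tasks worker`), any process can enqueue work:
#   result = add.delay(2, 3)
#   result.get(timeout=10)  # -> 5
# The Canvas primitives (chain, group, chord) are how tasks compose into complex workflows.
```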
- Apache Beam

This is the endgame. If you're truly starting to scale but don't want to ditch Python, your journey will probably lead here. A minimal sketch follows the pros/cons below.
Pros:
Forces effective map-shuffle-reduce patterns
Potentially fastest (with Apache Flink) and scales harder than Chuck Norris can kick
1st class support for streaming dataflows and all the complexity that goes along with that (windowing, late arrivals, exactly-once/at-least-once delivery, etc.)
Leverage existing infra (GCP, Spark, etc.)
Create effective Spark/Flink jobs with Python
Cons:
Just an abstraction layer (less the dev-only Direct Runner)
Complex infra setup for self-hosted prod deployments
Semi-linked to GCP's Dataflow implementation
Chained dependencies can leave projects stuck on months-old libraries
Semi-locked in options for sources and sinks
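A minimal sketch of a Beam pipeline on the dev-only Direct Runner mentioned in the cons above (element values and step labels are illustrative):

```python
import apache_beam as beam

# With no options given, this runs on the local DirectRunner; pointing it at
# Dataflow, Flink, or Spark is a pipeline-options change rather than a rewrite.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([1, 2, 3, 4, 5])
        | "Square" >> beam.Map(lambda x: x * x)
        | "Sum" >> beam.CombineGlobally(sum)
        | "Print" >> beam.Map(print)  # -> 55
    )
```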
When are all of these options a bad idea?¶
Spending weeks learning a new language is likely going to be slower than writing something today in a language you already know (CPython) and running it. That said, CPython is very upfront about its inability to run threads in parallel: the Global Interpreter Lock (GIL) only lets one thread execute Python bytecode at a time. Python 3.12+ is beginning the slow process of making the GIL optional; details are in PEP 703.
Until the GIL no longer blocks parallel threads and the Python ecosystem (SciPy, Dask, FastAPI, etc.) adapts to that change, the best-case scenario with Python is multiprocessing: orders of magnitude more memory, layers of complexity, and slightly more time to accomplish a task compared to what compiled languages with threading can do with basic functions.
Vertically scaling "Python"¶
While this workshop is focused on horizontally scaling Python, it's worth making some honorable mentions for vertically scaling individual Python interpreters to be more performant. The theme here is: speed up Python by minimizing the use of Python.
- Wrap C & C++ with Python

This likely isn't new information, but directly extending Python with C or C++ is how NumPy, Pandas, and much of the CPython standard library are built. A minimal `ctypes` sketch follows the pros/cons below.
Pros:
You're that dev who can optimize Python with C
Cons:
You're the dev trying to optimize Python with C
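The gentlest on-ramp is `ctypes` from the standard library against a hand-compiled shared library. A minimal sketch; `square.c` and `libsquare.so` are hypothetical names you'd create yourself:

```python
import ctypes

# Assumes a one-line C file compiled ahead of time, e.g.:
#   /* square.c */  long square(long x) { return x * x; }
#   cc -shared -fPIC -o libsquare.so square.c
lib = ctypes.CDLL("./libsquare.so")
lib.square.argtypes = [ctypes.c_long]
lib.square.restype = ctypes.c_long

print(lib.square(12))  # 144, computed in C (ctypes releases the GIL during the call)
```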
- Compile and Cache Python

`py_compile` and `functools.lru_cache` are "out of the box" and relatively painless ways to speed up your critical path. Chances are pretty good that the Python interpreter is already compiling your code to `.pyc` files. A minimal `lru_cache` sketch follows the pros/cons below.
Pros:
1,000x performance increase with one line of code and no added dependencies
Cons:
Only if it works for your case (repeated, hashable inputs) and memory holds out
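A minimal sketch of the one-line win, using the same Fibonacci shape as the benchmark further down:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # the one added line; results are memoized in memory
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(200))  # instant; the uncached recursive version would effectively never finish
```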
- Numba

Step 1: Put `@jit` above `def my_function()`.
Step 2: Magic. (A minimal sketch follows the pros/cons below.)
Pros:
Possibility of quick-win 1,000x or more performance increases for your project
Junior devs will think you're an all-knowing Python god for greatly speeding up the Python codebase
Cons:
If it works...
About as likely as a used mattress to introduce bugs
Your time is probably better spent learning a threaded language rather than a bolt-on solution for Python
Senior devs will probably be annoyed you've increased the complexity/fragility of the codebase and bloated prod images
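A minimal sketch of Steps 1 and 2, assuming an iterative Fibonacci as the function being decorated:

```python
from numba import jit

@jit(nopython=True)  # Step 1: the decorator; compilation happens on the first call
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(40))  # Step 2: magic; later calls skip compilation and run as machine code
```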
- Taichi

You can install it with pip, write it in your `.py` files, and it looks like Python. BUT... you're not really using Python anymore. Similar situation as Numba:

Step 1: Hop on a magic carpet with `ti.init(arch=ti.cpu)`.
Step 2: Put `@ti.kernel` above your function.
Step 3: Magic. (A minimal sketch follows the pros/cons below.)
Pros/Cons:
Similar tradeoffs as Numba
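A minimal sketch of Steps 1-3, assuming CPU execution and an illustrative field size:

```python
import taichi as ti

ti.init(arch=ti.cpu)  # Step 1 (swap in ti.gpu if one is available)

n = 10_000
squares = ti.field(dtype=ti.i64, shape=n)

@ti.kernel  # Step 2
def fill_squares():
    for i in range(n):  # Taichi auto-parallelizes the outermost loop
        squares[i] = i * i

fill_squares()  # Step 3: magic
print(squares[100])  # 10000
```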
Taichi advertised benchmark¶
Plain Python vs lru_cache vs Taichi vs Numba¶
Fibonacci Number Benchmarks
The benchmark below is reproducible by running the standalone tests included in the workshop's tests directory: `pytest tests/vertical_scale_test.py --benchmark-histogram`
Key observations from setting up a benchmark for a somewhat "normal" Python function:
- For idempotent functions called "a lot" with similar input, adding `@lru_cache` above the function definition is almost certainly the best option.
- Just-in-time (JIT) compiled solutions (e.g. Taichi & Numba) implicitly fail when reaching C max/min scalar sizes. Both silently failed when trying to compute numbers larger than the underlying C types can hold (a sketch of this silent overflow follows the list).
- Taichi is more "honest" about its limitations. Numba will implicitly fall back to Python without warning (e.g. the `fib_numba()` test function) when its assumptions (which in general are the same as Taichi's) aren't met.
- Added complexity to debug. For example, Taichi requires explicitly turning on debugging in its setup: `taichi.init(debug=True)`.
- If you're writing custom mathematical computation functions, AND those functions are a clear bottleneck for the project goals, AND function input isn't expected to be repetitive (so cache hits won't help), AND the math can't be done using native NumPy/Pandas or machine learning library functions, THEN it MAY make sense to look at optimization solutions like Numba or Taichi.
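A minimal sketch of the silent-overflow observation above, using an iterative Fibonacci rather than the workshop's own `fib_numba()` test: the JIT-compiled version wraps around its 64-bit integers without any warning, while plain Python keeps the exact value.

```python
from numba import njit

def fib_py(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

fib_jit = njit(fib_py)  # nopython mode: fails loudly at compile time instead of silently falling back

print(fib_py(100))   # 354224848179261915075; Python ints grow without bound
print(fib_jit(100))  # a wrapped-around 64-bit value, returned with no error or warning
```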