
Parallel Pros and Cons

Python Parallelization Frameworks

We'll be exploring four different methods for horizontally scaling Python in this workshop:

  • Python multiprocessing


    For when you just need one or two functions to scale. Chances are good that if you try to build anything non-trivial on it, you'll invest weeks or months of effort only to discover you've made a junk version of what Dask or Celery shipped 10 years ago. (A minimal Pool sketch appears after this list.)

    Self-authored multiprocessing

    Process #1: Knock, Knock

    Process #2: Who's thRACE CONDITION FROM PROCESS #1 BECAUSE YOU DIDN'T USE MULTIPROCESSING CORRECTLY

    Pros:

    Works "out of the box"

    Potential for batch & streaming (good luck...)

    Cons:

    Race conditions are on you

    Memory management is on you

    Synchronization is on you

    Inter-process dataflow is on you

  • Dask


    Use this if you're already planning on using Pandas. If you aren't using Pandas, with all the benefits of optimized C outside of the GIL that it brings, it's worth pausing to double-check that you really CAN'T use it before reaching for something else. (A minimal Dask sketch appears after this list.)

    Pros:

    Nearly 1-for-1 API parity with standard Pandas

    Effortless scaling for Dataframe-based workflows

    Support for non-Dataframe tasks

    Painless infrastructure integration

    1st class support for GPUs (via RAPIDS)

    Cons:

    Centered around Pandas (columnar data, sorry JSON)

    Complex workflows aren't a strong suit

    Getting custom code & dependencies onto workers has a learning curve

    While streaming may be theoretically possible, it's built for batch workflows

  • Celery


    The Python parallelization Swiss Army knife. This can do whatever you're trying to do. (A minimal Celery sketch appears after this list.)

    Pros:

    Complex workflows are a specialty

    Integrating your project code & dependencies is the default

    Quirky but relatively painless infrastructure

    Probably supports where you store your data

    Cons:

    Canvas (workflow) API has a learning curve

    Poor support for arbitrarily long-running tasks

    Inter-process JSON messages can be difficult to predict

    Flower doesn't have a dark mode

  • Apache Beam


    This is the endgame 💪 If you're truly starting to scale but don't want to ditch Python, then your journey will probably lead here. (A minimal Beam sketch appears after this list.)

    Pros:

    Forces effective map-shuffle-reduce patterns

    Potentially fastest (with Apache Flink) and scales harder than Chuck Norris can kick

    1st class support for streaming dataflows and all the complexity that goes along with that (windowing, late arrivals, exactly-once/at-least-once delivery, etc.)

    Leverage existing infrastructure (GCP, Spark, etc.)

    Create effective Spark/Flink jobs with Python

    Cons:

    Just an abstraction layer (less the dev-only Direct Runner)

    Complex infrastructure setup for self-hosted prod deployments

    Semi-linked to GCP's Dataflow implementation

    Chained dependencies can leave projects stuck on months-old libraries

    Semi-locked in options for sources and sinks
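
To make the comparisons above concrete, below are minimal, illustrative sketches of each option; the function names, inputs, and connection strings are invented for the examples, not taken from the workshop code. First, self-authored multiprocessing with a Pool, the "one or two functions to scale" case:

```python
# Minimal multiprocessing sketch: fan a CPU-bound function out across cores.
from multiprocessing import Pool

def sum_squares(n: int) -> int:
    # Pure-Python loop; each call is independent, so it parallelizes cleanly.
    return sum(i * i for i in range(n))

if __name__ == "__main__":  # required guard on platforms that spawn workers
    with Pool() as pool:    # defaults to os.cpu_count() worker processes
        results = pool.map(sum_squares, [1_000_000] * 8)
    print(results)
```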
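
Next, a Dask sketch of the near 1-for-1 pandas API parity: the groupby reads like pandas, but the frame is split into partitions and nothing runs until .compute() is called.

```python
# Minimal Dask sketch: dask.dataframe mirrors the pandas API, lazily.
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)  # same data, now in 2 partitions

# Looks exactly like pandas; nothing executes until .compute().
result = ddf.groupby("group")["value"].mean().compute()
print(result)
```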
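
A Celery sketch, assuming a hypothetical tasks.py module and a local Redis broker (any supported broker would do):

```python
# Minimal Celery sketch (tasks.py) with a hypothetical local Redis broker.
from celery import Celery

app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)

@app.task
def add(x: int, y: int) -> int:
    return x + y

# Start a worker:   celery -A tasks worker --loglevel=info
# Queue work from any other process:
#   result = add.delay(2, 3)
#   print(result.get(timeout=10))  # -> 5
```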
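
And an Apache Beam sketch: a tiny map/group/reduce word count on the dev-only DirectRunner; pointing the same pipeline at Dataflow, Flink, or Spark is a matter of pipeline options rather than code changes.

```python
# Minimal Apache Beam sketch on the dev-only DirectRunner (the default).
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create(["a b", "a c", "b"])         # toy in-memory source
        | beam.FlatMap(lambda line: line.split())  # map: lines -> words
        | beam.Map(lambda word: (word, 1))
        | beam.CombinePerKey(sum)                  # shuffle + reduce
        | beam.Map(print)
    )
```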

When are all of these options a bad idea?

Spending weeks learning a new language is likely going to be slower than writing something today in a language you already know (CPython) and running it. That said, CPython is very upfront about its inability to run threads in parallel. Python 3.12+ is beginning the slow process of overcoming the Global Interpreter Lock (GIL), which prevents multiple threads from executing Python bytecode at the same time. Details are in PEP 703.

Until the GIL is out of the way and the Python ecosystem (SciPy, Dask, FastAPI, etc.) adapts to the change, the best-case scenario with Python is multiprocessing: orders of magnitude more memory, extra layers of complexity, and somewhat more time to accomplish a task that compiled languages with threading can handle with basic functions.

Vertically scaling "Python"

While this workshop is focused on horizontally scaling Python, it's worth making some honorable mentions for vertically scaling individual Python interpreters to be more performant. The theme here is: speed up Python by minimizing the use of Python.

  • Wrap C & C++ with Python


    This likely isn't new information, but directly extending Python with C or C++ is how NumPy, Pandas, and much of the CPython standard library are built. (A lighter-weight ctypes sketch appears below.)


    Pros:

    You're that dev who can optimize Python with C

    Cons:

    You're the dev trying to optimize Python with C

  • Compile and Cache Python


    py_compile and functools.lru_cache are "out of the box" and relatively painless ways to speed up your critical path (a minimal lru_cache sketch appears below).

    Chances are pretty good that the Python interpreter is already compiling your code to .pyc files.


    Pros:

    1,000x performance increase with one line of code and no added dependencies

    Cons:

    If it works and memory holds out 🙏🏽

  • Numba


    Step 1. Put @jit above def my_function()

    Step 2. Magic (a minimal sketch appears below)


    Pros:

    Possibility of quick-win 1,000x or more performance increases for your project

    Junior devs will think you're an all-knowing Python god for greatly speeding up the Python codebase

    Cons:

    If it works...

    About as likely as a used mattress to introduce bugs

    Your time is probably better spent learning a threaded language rather than a bolt-on solution for Python

    Senior devs will probably be annoyed you've increased the complexity/fragility of the codebase and bloated prod images

  • Taichi


    You can install it with pip, write it in your .py files, and it looks like Python. BUT... you're not really using Python anymore. It's a similar situation to Numba:

    Step 1: Hop on a magic carpet with ti.init(arch=ti.cpu)

    Step 2: Put @ti.kernel above your function.

    Step 3: Magic


    Pros/Cons:

    Similar tradeoffs to Numba (a minimal sketch appears below)

Taichi advertised benchmark
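
As with the frameworks above, here are minimal, illustrative sketches of the vertical-scaling options. A full CPython extension module is more than a snippet, so this first sketch cheats and uses ctypes to call already-compiled C; the shared library name and sum_squares signature are hypothetical.

```python
# Sketch: call a compiled C function from Python via ctypes.
# Assumes libfastmath.so was built from C code exposing:
#     long long sum_squares(long long n);
import ctypes

lib = ctypes.CDLL("./libfastmath.so")            # hypothetical shared library
lib.sum_squares.argtypes = [ctypes.c_longlong]
lib.sum_squares.restype = ctypes.c_longlong

print(lib.sum_squares(1_000_000))                # the loop runs in C, not Python
```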
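
The caching option needs almost no setup. This sketch memoizes a recursive Fibonacci with functools.lru_cache so repeated sub-calls hit the cache instead of re-running Python bytecode:

```python
# Minimal functools.lru_cache sketch: one decorator, no added dependencies.
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(200))          # near-instant; uncached recursion would take effectively forever
print(fib.cache_info())  # hits/misses show how much work the cache absorbed
```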
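
A Numba sketch of the "Step 1 / Step 2" recipe above; nopython=True makes compilation fail loudly instead of silently falling back to interpreted Python:

```python
# Minimal Numba sketch: the decorated loop is compiled to machine code
# on the first call; later calls skip the interpreter entirely.
from numba import jit

@jit(nopython=True)  # fail loudly rather than quietly falling back to Python
def sum_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

print(sum_squares(1_000_000))
```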
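
And the matching Taichi sketch. The input is kept small because kernel integers default to 32 bits, which ties into observation 2 in the benchmarks below:

```python
# Minimal Taichi sketch: ti.init picks a backend, @ti.kernel compiles the loop.
import taichi as ti

ti.init(arch=ti.cpu)

@ti.kernel
def sum_squares(n: int) -> int:
    total = 0
    for i in range(n):  # outermost kernel loop is parallelized by Taichi
        total += i * i
    return total

print(sum_squares(1_000))  # kept small: kernel ints default to 32-bit
```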

Benchmarks

Plain Python vs lru_cache vs Taichi vs Numba

Fibonacci Number Benchmarks

The benchmark below is reproducible by running the standalone tests included in the workshop's tests directory:

pytest tests/vertical_scale_test.py --benchmark-histogram

Key observations from setting up a benchmark for a somewhat "normal" Python function:

  1. For idempotent functions called "a lot" with repeated inputs, adding @lru_cache above the function definition is almost certainly the best option.

  2. Just-in-time (JIT) compiled solutions (e.g. Taichi & Numba) implicitly fail when reaching C max/min scalar sizes. Both silently failed when trying to compute numbers larger than the underlying C types can support.

  3. Taichi is more "honest" about its limitations. Numba will implicitly fall back to Python without warning (e.g. the fib_numba() test function) when its assumptions (which in general are the same as Taichi's) aren't met.

  4. Debugging gets more complex. For example, Taichi requires explicitly turning on debug mode in its setup: taichi.init(debug=True)

  5. If you're writing custom mathematical computation functions AND those functions are a clear bottleneck for the project goals AND function input isn't expected to be repetitive (so cache hits won't help) AND the math can't be done using native numpy/pandas or machine learning library functions THEN it MAY make sense to look at optimization solutions like Numba or Taichi.
