Parallel Pros and Cons
Python Parallelization Frameworks¶
We'll be exploring four different methods for horizontally scaling Python in this workshop:
- Self-authored multiprocessing

For when you just need one or two functions to scale (a minimal `multiprocessing.Pool` sketch follows the pros/cons below). Chances are good that if you try to make anything non-trivial, you're going to invest weeks/months of effort to discover you've made a junk version of what Dask or Celery already released 10 years ago.
Process #1: Knock, Knock
Process #2: Whose thRACE CONDITION FROM PROCESS #1 BECAUSE YOU DIDN'T USE MULTIPROCESSING CORRECTLY
Pros:
Works "out of the box"
Potential for batch & streaming (good luck...)
Cons:
Race conditions are on you
Memory management is on you
Synchronization is on you
Inter-process dataflow is on you
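For the happy "one or two functions" path, the standard library's `multiprocessing.Pool` is usually enough. A minimal sketch, assuming a pure, picklable function (the function and worker count are illustrative):

```python
from multiprocessing import Pool

def square(x):
    # Must be defined at module top level so worker processes can import (pickle) it.
    return x * x

if __name__ == "__main__":  # guard required with the "spawn" start method (Windows/macOS default)
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Anything beyond an embarrassingly parallel map (shared counters, queues, locks) is exactly where the race conditions, synchronization, and inter-process dataflow listed above land on you.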
- Dask

Use this if you're already planning on using Pandas. If you aren't using Pandas, with all the benefits of optimized C running outside the GIL that it brings, it's worth pausing to double-check you CAN'T use it before reaching for something else. A minimal sketch follows the pros/cons below.
Pros:
Nearly 1-for-1 API parity with standard Pandas
Effortless scaling for Dataframe-based workflows
Support for non-Dataframe tasks
Painless infrastructure integration
1st class support for GPUs (via RAPIDS)
Cons:
Centered around Pandas (columnar data, sorry JSON)
Complex workflows aren't a strong suit
Getting custom code & dependencies onto workers is a learning curve
While streaming may be theoretically possible, it's built for batch workflows
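A minimal sketch of the Pandas-parity point, assuming Dask is installed and a hypothetical set of CSVs at `data/*.csv` with `category` and `value` columns:

```python
import dask.dataframe as dd

# Reads lazily into partitions instead of one in-memory DataFrame.
df = dd.read_csv("data/*.csv")

# Identical to the pandas API until .compute(), which actually runs the task graph
# (locally by default, or on a cluster via a dask.distributed Client).
result = df.groupby("category")["value"].mean().compute()
print(result)
```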
- Celery

The Python parallelization Swiss Army knife: this can do whatever you're trying to do. A minimal sketch follows the pros/cons below.
Pros:
Complex workflows are a specialty
Integrating your project code & dependencies is the default
Quirky but relatively painless infrastructure
Probably supports where you store your data
Cons:
Canvas (workflow) API has a learning curve
Poor support for arbitrarily long tasks
Inter-process JSON messages can be difficult to predict
Flower doesn't have a dark mode
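A minimal sketch of a Celery task, assuming a Redis broker at a placeholder URL (the module name `tasks` and the `add` task are illustrative):

```python
from celery import Celery

# Placeholder broker/result-backend URLs; point them at your own Redis or RabbitMQ.
app = Celery("tasks", broker="redis://localhost:6379/0", backend="redis://localhost:6379/0")

@app.task
def add(x, y):
    return x + y

# After starting a worker (`celery -A tasks worker`), any process can enqueue work:
#   result = add.delay(2, 3)
#   result.get(timeout=10)  # -> 5
# The Canvas primitives (chain, group, chord) are how tasks compose into complex workflows.
```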
- Apache Beam

This is the endgame. If you're truly starting to scale but don't want to ditch Python, your journey will probably lead here. A minimal sketch follows the pros/cons below.
Pros:
Forces effective map-shuffle-reduce patterns
Potentially fastest (with Apache Flink) and scales harder than Chuck Norris can kick
1st class support for streaming dataflows and all the complexity that goes along with that (windowing, late arrivals, exactly-once/at-least-once delivery, etc.)
Leverage existing infra (GCP, Spark, etc.)
Create effective Spark/Flink jobs with Python
Cons:
Just an abstraction layer (less the dev-only Direct Runner)
Complex infra setup for self-hosted prod deployments
Semi-linked to GCP's Dataflow implementation
Chained dependencies can leave projects stuck on months-old libraries
Semi-locked in options for sources and sinks
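A minimal sketch of a Beam pipeline on the dev-only Direct Runner mentioned in the cons above (element values and step labels are illustrative):

```python
import apache_beam as beam

# With no options given, this runs on the local DirectRunner; pointing it at
# Dataflow, Flink, or Spark is a pipeline-options change rather than a rewrite.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([1, 2, 3, 4, 5])
        | "Square" >> beam.Map(lambda x: x * x)
        | "Sum" >> beam.CombineGlobally(sum)
        | "Print" >> beam.Map(print)  # -> 55
    )
```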
When are all of these options a bad idea?¶
Spending weeks learning a new language is likely going to be slower than writing something today in a language you already know (CPython) and running it. That said, CPython is very upfront about its inability to run threads in parallel: the Global Interpreter Lock (GIL) only lets one thread execute Python bytecode at a time. Python 3.12+ is beginning the slow process of making the GIL optional; details are in PEP 703.
Until the GIL no longer blocks parallel threads and the Python ecosystem (SciPy, Dask, FastAPI, etc.) adapts to that change, the best-case scenario with Python is multiprocessing: orders of magnitude more memory, layers of complexity, and slightly more time to accomplish a task compared to what compiled languages with threading can do with basic functions.
Vertically scaling "Python"¶
While this workshop is focused on horizontally scaling Python, it's worth making some honorable mentions for vertically scaling individual Python interpreters to be more performant. The theme here is: speed up Python by minimizing the use of Python.
- Wrap C & C++ with Python

This likely isn't new information, but directly extending Python with C or C++ is how NumPy, Pandas, and much of the CPython standard library are built. A minimal `ctypes` sketch follows the pros/cons below.
Pros:
You're that dev who can optimize Python with C
Cons:
You're the dev trying to optimize Python with C
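The gentlest on-ramp is `ctypes` from the standard library against a hand-compiled shared library. A minimal sketch; `square.c` and `libsquare.so` are hypothetical names you'd create yourself:

```python
import ctypes

# Assumes a one-line C file compiled ahead of time, e.g.:
#   /* square.c */  long square(long x) { return x * x; }
#   cc -shared -fPIC -o libsquare.so square.c
lib = ctypes.CDLL("./libsquare.so")
lib.square.argtypes = [ctypes.c_long]
lib.square.restype = ctypes.c_long

print(lib.square(12))  # 144, computed in C (ctypes releases the GIL during the call)
```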
- Compile and Cache Python

`py_compile` and `functools.lru_cache` are "out of the box" and relatively painless ways to speed up your critical path. Chances are pretty good that the Python interpreter is already compiling your code to `.pyc` files. A minimal `lru_cache` sketch follows the pros/cons below.
Pros:
1,000x performance increase with one line of code and no added dependencies
Cons:
Only if it works for your case (repeated, hashable inputs) and memory holds out
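A minimal sketch of the one-line win, using the same Fibonacci shape as the benchmark further down:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # the one added line; results are memoized in memory
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(200))  # instant; the uncached recursive version would effectively never finish
```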
- Numba

Step 1: Put `@jit` above `def my_function()`.
Step 2: Magic. (A minimal sketch follows the pros/cons below.)
Pros:
Possibility of quick-win 1,000x or more performance increases for your project
Junior devs will think you're an all-knowing Python god for greatly speeding up the Python codebase
Cons:
If it works...
About as likely as a used mattress to introduce bugs
Your time is probably better spent learning a threaded language rather than a bolt-on solution for Python
Senior devs will probably be annoyed you've increased the complexity/fragility of the codebase and bloated prod images
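A minimal sketch of Steps 1 and 2, assuming an iterative Fibonacci as the function being decorated:

```python
from numba import jit

@jit(nopython=True)  # Step 1: the decorator; compilation happens on the first call
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(40))  # Step 2: magic; later calls skip compilation and run as machine code
```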
- Taichi

You can install it with pip, write it in your `.py` files, and it looks like Python. BUT... you're not really using Python anymore. Similar situation as Numba:

Step 1: Hop on a magic carpet with `ti.init(arch=ti.cpu)`.
Step 2: Put `@ti.kernel` above your function.
Step 3: Magic. (A minimal sketch follows the pros/cons below.)
Pros/Cons:
Similar tradeoffs as Numba
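A minimal sketch of Steps 1-3, assuming CPU execution and an illustrative field size:

```python
import taichi as ti

ti.init(arch=ti.cpu)  # Step 1 (swap in ti.gpu if one is available)

n = 10_000
squares = ti.field(dtype=ti.i64, shape=n)

@ti.kernel  # Step 2
def fill_squares():
    for i in range(n):  # Taichi auto-parallelizes the outermost loop
        squares[i] = i * i

fill_squares()  # Step 3: magic
print(squares[100])  # 10000
```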
Taichi advertised benchmark¶
Plain Python vs lru_cache vs Taichi vs Numba¶
Fibonacci Number Benchmarks
The benchmark below is reproducible by running the standalone tests included in the workshop's tests directory: `pytest tests/vertical_scale_test.py --benchmark-histogram`
Key observations from setting up a benchmark for a somewhat "normal" Python function:
- For idempotent functions called "a lot" with similar input, adding `@lru_cache` above the function definition is almost certainly the best option.
- Just-in-time (JIT) compiled solutions (e.g. Taichi & Numba) implicitly fail when reaching C max/min scalar sizes. Both silently failed when trying to compute numbers larger than the underlying C types can hold (a sketch of this silent overflow follows the list).
- Taichi is more "honest" about its limitations. Numba will implicitly fall back to Python without warning (e.g. the `fib_numba()` test function) when its assumptions (which in general are the same as Taichi's) aren't met.
- Added complexity to debug. For example, Taichi requires explicitly turning on debugging in its setup: `taichi.init(debug=True)`.
- If you're writing custom mathematical computation functions, AND those functions are a clear bottleneck for the project goals, AND function input isn't expected to be repetitive (so cache hits won't help), AND the math can't be done using native NumPy/Pandas or machine learning library functions, THEN it MAY make sense to look at optimization solutions like Numba or Taichi.
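A minimal sketch of the silent-overflow observation above, using an iterative Fibonacci rather than the workshop's own `fib_numba()` test: the JIT-compiled version wraps around its 64-bit integers without any warning, while plain Python keeps the exact value.

```python
from numba import njit

def fib_py(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

fib_jit = njit(fib_py)  # nopython mode: fails loudly at compile time instead of silently falling back

print(fib_py(100))   # 354224848179261915075; Python ints grow without bound
print(fib_jit(100))  # a wrapped-around 64-bit value, returned with no error or warning
```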