Parallel Pros and Cons
Python Parallelization Frameworks
We'll be exploring four different methods for horizontally scaling Python in this workshop:
- 
multiprocessing 
 For when you just need one or two functions to scale. If you try to build anything non-trivial, chances are good you'll invest weeks or months of effort only to discover you've made a junk version of Dask or Celery released 10 years ago. 
 Self-authored multiprocessing: 
 Process #1: Knock, knock 
 Process #2: Whose thRACE CONDITION FROM PROCESS #1 BECAUSE YOU DIDN'T USE MULTIPROCESSING CORRECTLY 
 Pros: 
  - Works "out of the box" 
  - Potential for batch & streaming (good luck...) 
 Cons: 
  - Race conditions are on you 
  - Memory management is on you 
  - Synchronization is on you 
  - Inter-process dataflow is on you 
- 
Dask 
 Use this if you're already planning on using Pandas. If you aren't using Pandas, with all the benefits of optimized C running outside of the GIL that it brings, it's worth pausing to double-check that you CAN'T use it before reaching for something else. 
 Pros: 
  - Nearly 1-for-1 API parity with standard Pandas 
  - Effortless scaling for Dataframe-based workflows 
  - Support for non-Dataframe tasks 
  - Painless infrastructure integration 
  - First-class support for GPUs (via RAPIDS) 
 Cons: 
  - Centered around Pandas (columnar data, sorry JSON) 
  - Complex workflows aren't a strong suit 
  - Getting custom code & dependencies onto workers has a learning curve 
  - While streaming may be theoretically possible, it's built for batch workflows 
- 
Celery 
 The Python parallelization Swiss Army knife. This can do whatever you're trying to do. 
 Pros: 
  - Complex workflows are a specialty 
  - Integrating your project code & dependencies is the default 
  - Quirky but relatively painless infrastructure 
  - Probably supports wherever you store your data 
 Cons: 
  - Canvas (workflow) API has a learning curve 
  - Poor support for arbitrarily long tasks 
  - Inter-process JSON messages can be difficult to predict 
  - Flower doesn't have a dark mode 
- 
Apache Beam 
 This is the endgame. If you're truly starting to scale but don't want to ditch Python, then your journey will probably lead here. 
 Pros: 
  - Forces effective map-shuffle-reduce patterns 
  - Potentially fastest (with Apache Flink) and scales harder than Chuck Norris can kick 
  - First-class support for streaming dataflows and all the complexity that goes along with that (windowing, late arrivals, exactly once, at least once, etc.) 
  - Leverage existing infrastructure (GCP, Spark, etc.) 
  - Create effective Spark/Flink jobs with Python 
 Cons: 
  - Just an abstraction layer (apart from the dev-only Direct Runner) 
  - Complex infrastructure setup for self-hosted prod deployment 
  - Semi-linked to GCP's Dataflow implementation 
  - Chained dependencies cause projects to be stuck with months-old libraries 
  - Semi-locked-in options for sources and sinks 
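The "map-shuffle-reduce" pattern mentioned above can be sketched in plain Python. This is only an illustration of the pattern, not Apache Beam's API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this example, and a real framework would spread each phase across many workers.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group values by key so each key can be reduced independently."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases; hypothetical helper for this sketch only."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


counts = word_count(["knock knock", "who is there", "knock"])
print(counts)
```

Because the reduce step only ever sees one key's values at a time, the work for different keys never needs to share state, which is what lets a framework distribute it safely.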
When are all of these options a bad idea?
Spending weeks learning a new language is likely going to be slower than writing something today in a language you already know (Python) and running it. That said, CPython is very upfront about its threading limitation: the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time, so threads don't speed up CPU-bound work. The slow process of overcoming the GIL has begun: Python 3.13 introduced an experimental free-threaded build that can run without it. Details are in PEP 703.
Until GIL-free Python is the default and the Python ecosystem (SciPy, Dask, FastAPI, etc.) adapts to the change, the best-case scenario for CPU-bound Python is multiprocessing, which uses far more memory, adds layers of complexity, and takes slightly more time to accomplish a task than compiled languages with threading need when using basic functions.
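To see which situation a given interpreter is in, a minimal check might look like the sketch below. It assumes the `Py_GIL_DISABLED` build variable and the `sys._is_gil_enabled()` function that arrived alongside the free-threaded build in Python 3.13; on older versions the function doesn't exist, so the sketch assumes the GIL is on. The `gil_status` name is made up for this example.

```python
import sys
import sysconfig


def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on free-threaded builds; 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

    # sys._is_gil_enabled() only exists on Python 3.13+.
    # Assumption: any interpreter without it always runs with the GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True

    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


status = gil_status()
print(status)
```

A free-threaded build can still report the GIL as enabled, for example when an imported extension module hasn't declared itself safe to run without it.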
Vertically scaling "Python"
While this workshop is focused on horizontally scaling Python, it's worth making some honorable mentions for vertically scaling individual Python interpreters to be more performant. The theme here is: speed up Python by minimizing the use of Python.
- 
Wrap C & C++ with Python 
 This likely isn't new information, but directly extending Python with C or C++ is how NumPy, Pandas, and much of the CPython standard library are built. 
 Pros: You're that dev who can optimize Python with C 
 Cons: You're the dev trying to optimize Python with C 
- 
Compile and Cache Python 
 py_compile and functools.lru_cache are "out of the box" and relatively painless ways to speed up your critical path. Chances are pretty good that the Python interpreter is already compiling your code to .pyc files, so precompiling mostly saves startup time; caching is where the big wins are.
 Pros: Up to a 1,000x performance increase with one line of code and no added dependencies 
 Cons: Only if it works and memory holds out 
- 
Numba 
 Step 1. Put @jit above def my_function() 
 Step 2. Magic 
 Pros: 
  - Possibility of quick-win 1,000x or more performance increases for your project 
  - Junior devs will think you're an all-knowing Python god for greatly speeding up the Python codebase 
 Cons: 
  - If it works... 
  - About as likely as a used mattress to introduce bugs 
  - Your time is probably better spent learning a threaded language rather than a bolt-on solution for Python 
  - Senior devs will probably be annoyed you've increased the complexity/fragility of the codebase and bloated prod images 
- 
Taichi 
 You can install it with pip, write it in your .py files, and it looks like Python. BUT... you're not really using Python anymore. Similar situation as Numba: 
 Step 1: Hop on a magic carpet with ti.init(arch=ti.cpu) 
 Step 2: Put @ti.kernel above your function. 
 Step 3: Magic 
 Pros/Cons: Similar tradeoffs to Numba 
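As a rough sketch of the "1 line of code" caching option from the list above, here is a Fibonacci function with and without `functools.lru_cache`. The names `fib_plain` and `fib_cached` are made up for this example and aren't necessarily what the workshop's tests use.

```python
from functools import lru_cache


def fib_plain(n):
    """Naive recursion: the number of calls grows exponentially with n."""
    return n if n < 2 else fib_plain(n - 1) + fib_plain(n - 2)


@lru_cache(maxsize=None)  # the one added line; unbounded cache, so memory grows
def fib_cached(n):
    """Same recursion, but each distinct n is computed only once."""
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)


result = fib_cached(90)          # returns almost instantly
info = fib_cached.cache_info()   # hits, misses, and current cache size
print(result, info)
```

Calling `fib_plain(90)` would not finish in any reasonable time, which is where the dramatic speedup claims for caching come from. The "memory holds out" caveat applies because `maxsize=None` never evicts entries.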
Taichi advertised benchmark
Plain Python vs lru_cache vs Taichi vs Numba
Fibonacci Number Benchmarks
The benchmark below is reproducible by running the standalone tests included in the
workshop's tests directory:
pytest tests/vertical_scale_test.py --benchmark-histogram
Key observations from setting up a benchmark for a somewhat "normal" Python function:
- 
For pure functions (same input, same output, no side effects) called "a lot" with similar input, adding @lru_cache above the function definition is almost certainly the best option.
- 
Just-in-time (JIT) compiled solutions (e.g. Taichi & Numba) fail implicitly at the limits of C scalar types: both silently returned wrong results when asked to compute numbers larger than the underlying C integer type can hold. 
- 
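Taichi is more "honest" about its limitations. In the benchmark, Numba implicitly fell back to Python without warning (e.g. the fib_numba() test function) when its assumptions (which in general are the same as Taichi's) weren't met. This depends on the Numba version: since Numba 0.59, a bare @jit defaults to nopython mode and raises an error instead of falling back.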
- 
Debugging gets more complex. For example, Taichi requires explicitly turning on debug mode in its setup: taichi.init(debug=True)
- 
If you're writing custom mathematical computation functions AND those functions are a clear bottleneck for the project goals AND function input isn't expected to be repetitive (so cache hits won't help) AND the math can't be done using native numpy/pandas or machine learning library functions THEN it MAY make sense to look at optimization solutions like Numba or Taichi. 
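The silent overflow described in the observations above can be illustrated without installing either library. The sketch below emulates fixed-width arithmetic with `ctypes.c_int64`, which truncates without any overflow check. It assumes signed 64-bit integers; the width a JIT framework actually uses depends on its configuration. The function names are made up for this example, and neither Numba nor Taichi is called.

```python
import ctypes

INT64_MAX = 2**63 - 1


def fib_python(n):
    """Iterative Fibonacci using Python's arbitrary-precision integers."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_int64(n):
    """Same loop, but every sum is truncated to a signed 64-bit value,
    mimicking the fixed-width scalars a compiled kernel works with."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, ctypes.c_int64(a + b).value  # wraps silently, no exception
    return a


for n in (92, 93):
    print(n, fib_python(n), fib_int64(n))
```

The two functions agree up to n = 92, the largest Fibonacci number that fits in 64 signed bits. At n = 93 the fixed-width version comes back negative with no error, the kind of quietly wrong answer a benchmark should assert against.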
