
Dask

What is Dask?

The Dask Tutorial and this article by NVIDIA have decent infographics and explanations of what Dask is. The VERY summarized explanation is that it's a library combining Tornado and Pandas so that an arbitrary number of Python interpreters and Pandas DataFrames can be used as if they were a single interpreter and a single DataFrame.

The Journey of a Task explanation by the Dask authors provides a nice end-to-end primer on how the framework operates.
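To make the "many DataFrames as one" idea concrete, here is a minimal sketch of the pandas-like API Dask exposes over a partitioned dataset. The file pattern and column names are placeholders for illustration, not part of the workshop materials:

```python
# Minimal sketch: dask.dataframe splits a dataset into many pandas partitions
# but exposes a (mostly) pandas-compatible API over all of them at once.
import dask.dataframe as dd

# Lazily build a DataFrame from many files (placeholder path)
df = dd.read_csv("some_large_dataset_*.csv")

# Looks like a single pandas groupby, but spans every partition
result = df.groupby("category")["value"].mean()

# Nothing runs until compute() hands the task graph to the scheduler
print(result.compute())
```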

What are Coiled and Prefect?

Dask fits into a growing segment of the data/tech industry where Free and Open Source Software (FOSS) is provided alongside fully-managed, extended offerings that the primary contributors sell to generate income.

Two of the more prominent companies aligned with Dask are Coiled.io and Prefect. Coiled is essentially a fully-managed Dask cluster, while Prefect is an expanded offering geared more towards ETL pipelines.
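To give a feel for the Prefect side before we use it below, here is a minimal, hypothetical flow built with Prefect's @task and @flow decorators. The function names are illustrative only; the workshop's own flows live in the avengercon package instead:

```python
# A tiny, hypothetical ETL-style flow; Prefect records each task run and
# the flow run so they show up in the Prefect UI.
from prefect import flow, task

@task
def extract() -> list[int]:
    # Stand-in for pulling data from a real source
    return [1, 2, 3]

@task
def transform(values: list[int]) -> list[int]:
    return [v * 2 for v in values]

@flow
def tiny_etl_flow() -> list[int]:
    return transform(extract())

if __name__ == "__main__":
    print(tiny_etl_flow())
```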

Dask-created hands-on crash course

Jupyter Notebook

Transition to the official crash course, running it on your own computer to get comfortable with the framework.
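If you want to poke at Dask outside the notebooks, a local cluster is enough to watch the scheduler at work. This is a minimal sketch, not part of the crash course itself, and the worker counts are arbitrary:

```python
# Spin up a local Dask cluster and run a trivial computation across it.
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2, threads_per_worker=2)
    client = Client(cluster)

    # Handy link for watching tasks execute in the dashboard
    print(client.dashboard_link)

    futures = client.map(lambda x: x ** 2, range(10))
    print(client.gather(futures))

    client.close()
    cluster.close()
```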

Preparing to use Dask in your own projects

Since we've already seen some basics of using Dask in the Jupyter notebooks, let's transition to a couple of tasks using Prefect.

Open the Prefect User Interface

Right-click and "open in new tab"

Prefect

Integrating Prefect with AWS S3/MinIO

In your IDE, with your virtual environment activated (as described earlier), try creating and running a new Python script in the avengercon_2024 directory.

testing.py
from avengercon.prefect.flows import hello_prefect_flow
from avengercon.prefect.storage import (
    create_default_prefect_blocks,
    create_default_prefect_buckets,
)

# Run the workshop's hello-world flow and print its result
print(hello_prefect_flow())

# Create the default MinIO buckets, then register the Prefect blocks that expose them
create_default_prefect_buckets()
create_default_prefect_blocks()

Take a look at the "Blocks" portion of the Prefect UI. You should see prefect-artifacts and prefect-flows registered as S3-like buckets. Clicking either one shows instructions on how to use these buckets later to cache both the files your team is working on and the code you're using to do so. This is particularly helpful when operators want to run a pre-defined series of steps on new data by triggering a deployment that uses a flow the dev team stored in a block.
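As a rough sketch of what reusing one of those buckets might look like later, the snippet below assumes the workshop helpers registered prefect-aws S3Bucket blocks under the names shown in the UI; the block's own page in the Prefect UI gives the exact loading code, so prefer that if it differs:

```python
# Hypothetical sketch: load a registered bucket block and round-trip a file.
# Assumes the block is a prefect-aws S3Bucket named "prefect-artifacts".
from prefect_aws.s3 import S3Bucket

artifacts = S3Bucket.load("prefect-artifacts")

# Cache a local file into the shared bucket...
artifacts.upload_from_path("results.csv", to_path="team/results.csv")

# ...and pull it back down somewhere else
artifacts.download_object_to_path("team/results.csv", "results_copy.csv")
```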