Pandas with Extras
Pandas probably does more than you realize¶
Pandas has many optional dependencies
that greatly extend what the DataFrame API is capable of. A few examples:
- `excel`: Directly read and write `.xlsx` Excel files
- `xml`: Directly read and write `.xml` semi-structured files
- `postgresql`, `mysql`, etc.: Directly read and write to a database
- `parquet`: Directly read and write to a common data lakehouse format
- `aws`: Directly read and write to an Amazon Web Services (AWS) S3 bucket (including MinIO)
- `gcp`: Read and write to Google Cloud Platform (GCP) Cloud Storage

Each of these can be installed as a pip extra, e.g. `pip install "pandas[excel]"`.
One caution: use the right tool for the job. Excel, XML, and other semi-structured
data formats can be difficult to wrangle into the DataFrame's structured, typed, and
columnar data model. Pydantic will often be a much better
intermediate tool, producing self-documenting code that captures the business rules applied
to convert semi-structured JSON or XML data into structured data.
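As a sketch of that intermediate step, assuming Pydantic v2 and pandas are installed: a hypothetical `RainfallReading` model encodes the business rules (required fields, defaults, range checks) before the records ever become a DataFrame.

```python
import pandas as pd
from pydantic import BaseModel, Field


# Hypothetical business rules: every reading needs a city and year, and
# rainfall must be non-negative; a missing rainfall value defaults to 0.0.
class RainfallReading(BaseModel):
    city: str
    year: int
    rainfall_mm: float = Field(default=0.0, ge=0.0)


# Semi-structured records, e.g. parsed from JSON or XML
raw = [
    {"city": "Tehran", "year": 2020, "rainfall_mm": 231.4},
    {"city": "Shiraz", "year": 2020},  # missing value -> default applied
]

# Validate each record, then build a typed, columnar DataFrame
records = [RainfallReading(**r).model_dump() for r in raw]
df = pd.DataFrame.from_records(records)
print(df)
```

The model doubles as documentation: anyone reading the code can see exactly which rules shaped the structured output.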
Plain ol' Pandas Workflow¶
We'll download the Rainfall of Iranian Cities
dataset into /data/IranRainfall to level-set on basic use of the pandas read_csv() and
read_excel()
functions. The data provides monthly precipitation for 31 cities in Iran from 1901
to 2022 in Comma-Separated Values (CSV) and Excel Spreadsheet (XLSX) formats.
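The basic workflow looks like the sketch below; the inline CSV text stands in for one of the downloaded files, and the file names are illustrative rather than the dataset's actual names.

```python
import io

import pandas as pd

# Stand-in for one of the files under /data/IranRainfall
# (column names here are illustrative, not the dataset's actual schema)
csv_text = "City,Year,Jan,Feb\nTehran,1901,45.2,38.1\nTabriz,1901,21.0,30.5\n"

# read_csv() accepts a path, URL, or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# The Excel copy reads the same way (requires the `excel` extra):
# df_xlsx = pd.read_excel("/data/IranRainfall/rainfall.xlsx")
```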
Now with an S3 bucket¶
The provided avengercon Python package provides an authenticated MinIO Client as follows:
from typing import Optional

from minio import Minio

from avengercon.minio import get_minio_client

my_client: Optional[Minio] = get_minio_client()
The avengercon package also includes a Pydantic model with all your MinIO credentials and endpoint information:
from avengercon.minio.config import minio_config
print(minio_config.endpoint)
print(minio_config.access_key)
print(minio_config.secret_key.get_secret_value())
Likewise, the "Hello, Workshop!" page described how to access our local MinIO server which supports creating buckets and drag-drop data uploads.
The MinIO Python SDK documentation
provides details and example code for the MinIO Client list_buckets(),
list_objects(), and get_object()
functions.
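Putting those SDK functions together might look like the sketch below. The helpers are duck-typed so they work with any object exposing the MinIO SDK's `list_objects()`/`get_object()` methods, such as the client returned by `get_minio_client()`; bucket and object names are illustrative.

```python
def list_csv_objects(client, bucket: str) -> list[str]:
    """Return the names of every .csv object in the bucket."""
    return [
        obj.object_name
        for obj in client.list_objects(bucket, recursive=True)
        if obj.object_name.endswith(".csv")
    ]


def read_object_bytes(client, bucket: str, object_name: str) -> bytes:
    """Download a single object and return its raw bytes."""
    # get_object() returns a urllib3 response that must be closed
    response = client.get_object(bucket, object_name)
    try:
        return response.read()
    finally:
        response.close()
        response.release_conn()
```

A caller could then do `read_object_bytes(my_client, "rainfall", "tehran.csv")` to pull a file's contents into memory.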
Finally, the Pandas documentation provides an example of how to directly read and write remote files in an S3 bucket. It points to the S3FS Python package, which pandas uses under the hood, for authentication details.
Can you successfully read a remote file in S3?
- Are you able to adjust the `file_uploader.py` example from the MinIO Python SDK to read a file using the provided `get_minio_client()` function?
- Are you able to use the information in your `.env` file and the Pandas + S3FS documentation to use the pandas `read_csv()` function to directly read from a MinIO S3 bucket?
    - Hint #1: The `storage_options` argument with a `"client_kwargs"`-keyed nested dictionary may be an easy approach for authenticating your credentials.
    - Hint #2: Try a search on Google or ChatGPT for "Pandas read_csv MinIO" or "pandas s3fs read from minio"
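One possible shape for that `storage_options` dictionary is sketched below. The credentials and endpoint are placeholders; in the workshop they would come from your `.env` file (e.g. via `minio_config`), and the bucket/object names are illustrative.

```python
import pandas as pd

# Placeholder credentials and endpoint (assumptions, not the workshop's
# real values) -- substitute the fields from minio_config / your .env file
storage_options = {
    "key": "minio-access-key",       # MinIO access key
    "secret": "minio-secret-key",    # MinIO secret key
    "client_kwargs": {
        # Point s3fs/botocore at the local MinIO server instead of AWS
        "endpoint_url": "http://localhost:9000",
    },
}


def read_rainfall_csv(bucket: str, object_name: str) -> pd.DataFrame:
    # pandas hands s3:// URLs to s3fs, forwarding storage_options to it
    return pd.read_csv(
        f"s3://{bucket}/{object_name}",
        storage_options=storage_options,
    )
```

Calling `read_rainfall_csv("rainfall", "tehran.csv")` would then fetch the object straight into a DataFrame, no explicit MinIO client required.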