Parquet in Python
Warning
The examples on this page are provided without support.
Python supports Parquet through the pyarrow project. For more information, see: https://arrow.apache.org/cookbook/py/.
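If pyarrow is not already installed, it can typically be added with pip. The minimal sketch below only assumes a standard pyarrow installation and verifies that the library can be imported:
# Install (if needed): pip install pyarrow
import pyarrow as pa
# Print the installed pyarrow version
print(pa.__version__)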
Examples
Below follows a range of illustrative examples, tested with Python 3.11.
All examples use the same data set, which can be downloaded here:
Warning
When loading multiple files, all files must share a common data schema; see Information on output-file organization
Read single file (in-memory)
import pyarrow.parquet as pq
# Path to local file
file_path = "F:/datalake/AABBCCDD/CAN2_GnssSpeed/2022/04/22/00000975_00000001.parquet"
# Read single file into memory
table = pq.read_table(file_path)
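Once loaded, the Arrow Table can be inspected directly. A brief sketch using standard pyarrow.Table attributes:
# Inspect the loaded table
print(table.num_rows)      # Number of rows
print(table.column_names)  # Column names
print(table.schema)        # Full schema incl. data types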
Read multiple files (in-memory)
import pyarrow.dataset as ds
# Path to local datalake
datalake_path = "F:/datalake/AABBCCDD/CAN2_GnssSpeed"
# Create dataset containing multiple files
ds_gnss_speed = ds.dataset(datalake_path, format="parquet")
# Read all files into memory
gnss_speed = ds_gnss_speed.to_table()
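A dataset can also be read partially by selecting columns and filtering rows at scan time, which avoids loading unneeded data into memory. Below is a sketch; the column names t and Speed are assumptions and should be adjusted to the schema of your own data:
# Read only selected columns, filtering rows while scanning
gnss_speed_subset = ds_gnss_speed.to_table(
    columns=["t", "Speed"],
    filter=ds.field("Speed") > 0,
)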
Read multiple files (in-memory in batches)
import pyarrow.dataset as ds
# Path to local datalake
datalake_path = "F:/datalake/AABBCCDD/CAN2_GnssSpeed"
# Create dataset containing multiple files
ds_gnss_speed = ds.dataset(datalake_path, format="parquet")
# Read into memory in batches
for record_batch in ds_gnss_speed.to_batches():
    # Work on batch data here
    pass
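As an illustration of working on the data batch-by-batch, the sketch below counts the total number of rows without materializing the full table:
# Example: count rows across all batches
total_rows = 0
for record_batch in ds_gnss_speed.to_batches():
    total_rows += record_batch.num_rows
print(total_rows)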
Read multiple files from S3 (in-memory)
The examples below demonstrate how to load Parquet data from Amazon AWS S3 and MinIO S3, respectively. In both examples, the bucket name is css-parquet-test.
import pyarrow.dataset as ds
import pyarrow.fs as fs
# Create s3 filesystem
s3 = fs.S3FileSystem(access_key="<access_key>",
                     secret_key="<secret_key>",
                     region="<region>")
# Path to remote datalake
datalake_path = "css-parquet-test/AABBCCDD/CAN2_GnssSpeed"
# Create dataset containing multiple files using remote filesystem
ds_gnss_speed = ds.dataset(datalake_path, filesystem=s3, format="parquet")
# Read all files into memory
gnss_speed = ds_gnss_speed.to_table()
import pyarrow.dataset as ds
import pyarrow.fs as fs
# Create s3 filesystem
s3 = fs.S3FileSystem(access_key="<access_key>",
                     secret_key="<secret_key>",
                     endpoint_override="http://192.168.1.146:9000")
# Path to remote datalake
datalake_path = "css-parquet-test/AABBCCDD/CAN2_GnssSpeed"
# Create dataset containing multiple files using remote filesystem
ds_gnss_speed = ds.dataset(datalake_path, filesystem=s3, format="parquet")
# Read all files into memory
gnss_speed = ds_gnss_speed.to_table()
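To verify connectivity and see which Parquet files a dataset will include, the S3 filesystem object can be used to list the objects below the datalake prefix. A sketch using pyarrow's FileSelector with the same bucket/prefix as above:
# List all objects below the datalake prefix
file_infos = s3.get_file_info(fs.FileSelector(datalake_path, recursive=True))
for info in file_infos:
    print(info.path, info.size)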
Read multiple files (out-of-memory)
The duckdb project (https://duckdb.org/) supports the Parquet format and is built for large datasets, as described on the project webpage:
DuckDB will read the Parquet files in a streaming fashion, which means you can perform queries on large Parquet files that do not fit in your main memory.
Documentation of the duckdb Python API: https://duckdb.org/docs/api/python/overview
Documentation of duckdb SQL support: https://duckdb.org/docs/sql/introduction
The example below demonstrates how Parquet data can be queried using SQL in Python.
Note
DuckDB generally does not load data into memory until specifically requested to do so using the fetch* methods.
import duckdb
# Path to local datalake
datalake_path = "F:/datalake/AABBCCDD/"
# Create table from datalake
table_gnss_speed = duckdb.read_parquet(f"{datalake_path}/CAN2_GnssSpeed/**/*.parquet")
# Query table using standard SQL
result_gnss_speed = duckdb.sql("SELECT * FROM table_gnss_speed")
# Display a preview of the query result
result_gnss_speed.show()
# Aggregate functions
result_gnss_speed_agg = duckdb.sql("SELECT COUNT(Speed), AVG(Speed), MAX(Speed), MIN(Speed) FROM table_gnss_speed")
# Subset by date range
result_gnss_speed_time_range = duckdb.sql("SELECT * FROM table_gnss_speed WHERE t BETWEEN '2022-04-24 08:00:00' AND '2022-04-24 08:50:00'")
# Subset by "sample" (1000 points)
result_gnss_speed_sample = duckdb.sql("SELECT * FROM table_gnss_speed USING SAMPLE 1000 ORDER BY t")
# Fetch sample subset as Python list (loads sample subset into memory)
result_gnss_speed_sample_list = result_gnss_speed_sample.fetchall()
# Fetch sample subset as Numpy array (loads sample subset into memory)
result_gnss_speed_sample_numpy = result_gnss_speed_sample.fetchnumpy()
# Fetch sample subset as Pandas Dataframe (loads sample subset into memory)
result_gnss_speed_sample_df = result_gnss_speed_sample.fetchdf()
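The query result can also be fetched as an Arrow Table, which keeps the workflow consistent with the pyarrow examples above (like the fetch* methods, this loads the subset into memory):
# Fetch sample subset as Arrow Table (loads sample subset into memory)
result_gnss_speed_sample_arrow = result_gnss_speed_sample.arrow()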
Arrow Table to Pandas Dataframe
The example below demonstrates how an existing Arrow Table can be converted to a Pandas DataFrame.
import pyarrow as pa
import pandas as pd
# Convert an existing Arrow Table (e.g. 'table' from the first example) to a Pandas DataFrame
df = table.to_pandas()
Note
Pandas DataFrames take up a lot of memory and are typically only suitable for small datasets.
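If in doubt, the approximate memory footprint of a DataFrame can be checked before working with it further. A minimal sketch using pandas' memory_usage method:
# Approximate memory usage of the DataFrame in bytes
print(df.memory_usage(deep=True).sum())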