Parquet in Python
Warning
The examples on this page are provided without support.
Python supports Parquet through the pyarrow project. For more information, see: https://arrow.apache.org/cookbook/py/.
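If pyarrow is not already installed, it can typically be added with pip. The minimal sketch below only assumes a standard pyarrow installation and verifies that the library can be imported:
# Install (if needed): pip install pyarrow
import pyarrow as pa
# Print the installed pyarrow version
print(pa.__version__)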
Examples
Below follows a range of illustrative examples, tested with Python 3.11.
All examples use the same data set, which can be downloaded here:
Warning
When loading multiple files, all files must share a common data schema; see Information on output-file organization
Read single file (in-memory)
import pyarrow.parquet as pq
# Path to local file
file_path = "F:/datalake/AABBCCDD/CAN2_GnssSpeed/2022/04/22/00000975_00000001.parquet"
# Read single file into memory
table = pq.read_table(file_path)
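Once loaded, the Arrow Table can be inspected directly. A brief sketch using standard pyarrow.Table attributes:
# Inspect the loaded table
print(table.num_rows)      # Number of rows
print(table.column_names)  # Column names
print(table.schema)        # Full schema incl. data types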
Read multiple files (in-memory)
import pyarrow.dataset as ds
# Path to local datalake
datalake_path = "F:/datalake/AABBCCDD/CAN2_GnssSpeed"
# Create dataset containing multiple files
ds_gnss_speed = ds.dataset(datalake_path, format="parquet")
# Read all files into memory
gnss_speed = ds_gnss_speed.to_table()
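A dataset can also be read partially by selecting columns and filtering rows at scan time, which avoids loading unneeded data into memory. Below is a sketch; the column names t and Speed are assumptions and should be adjusted to the schema of your own data:
# Read only selected columns, filtering rows while scanning
gnss_speed_subset = ds_gnss_speed.to_table(
    columns=["t", "Speed"],
    filter=ds.field("Speed") > 0,
)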
Read multiple files (in-memory in batches)
import pyarrow.dataset as ds
# Path to local datalake
datalake_path = "F:/datalake/AABBCCDD/CAN2_GnssSpeed"
# Create dataset containing multiple files
ds_gnss_speed = ds.dataset(datalake_path, format="parquet")
# Read into memory in batches
for record_batch in ds_gnss_speed.to_batches():
    # Work on batch data here
    pass
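As an illustration of working on the data batch-by-batch, the sketch below counts the total number of rows without materializing the full table:
# Example: count rows across all batches
total_rows = 0
for record_batch in ds_gnss_speed.to_batches():
    total_rows += record_batch.num_rows
print(total_rows)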
Read multiple files from S3 (in-memory)
The examples below demonstrate how to load Parquet data from Amazon AWS S3 and MinIO S3, respectively. In both examples, the bucket name is css-parquet-test.
import pyarrow.dataset as ds
import pyarrow.fs as fs
# Create s3 filesystem
s3 = fs.S3FileSystem(access_key="<access_key>",
                     secret_key="<secret_key>",
                     region="<region>")
# Path to remote datalake
datalake_path = "css-parquet-test/AABBCCDD/CAN2_GnssSpeed"
# Create dataset containing multiple files using remote filesystem
ds_gnss_speed = ds.dataset(datalake_path, filesystem=s3, format="parquet")
# Read all files into memory
gnss_speed = ds_gnss_speed.to_table()
import pyarrow.dataset as ds
import pyarrow.fs as fs
# Create s3 filesystem
s3 = fs.S3FileSystem(access_key="<access_key>",
                     secret_key="<secret_key>",
                     endpoint_override="http://192.168.1.146:9000")
# Path to remote datalake
datalake_path = "css-parquet-test/AABBCCDD/CAN2_GnssSpeed"
# Create dataset containing multiple files using remote filesystem
ds_gnss_speed = ds.dataset(datalake_path, filesystem=s3, format="parquet")
# Read all files into memory
gnss_speed = ds_gnss_speed.to_table()
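To verify connectivity and see which Parquet files a dataset will include, the S3 filesystem object can be used to list the objects below the datalake prefix. A sketch using pyarrow's FileSelector with the same bucket/prefix as above:
# List all objects below the datalake prefix
file_infos = s3.get_file_info(fs.FileSelector(datalake_path, recursive=True))
for info in file_infos:
    print(info.path, info.size)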
Read multiple files (out-of-memory)
The duckdb project (https://duckdb.org/) supports the Parquet format and is built for large datasets, as described on the project webpage:
DuckDB will read the Parquet files in a streaming fashion, which means you can perform queries on large Parquet files that do not fit in your main memory.
Documentation of the duckdb Python API: https://duckdb.org/docs/api/python/overview
Documentation of duckdb SQL support: https://duckdb.org/docs/sql/introduction
The example below demonstrates how Parquet data can be queried using SQL in Python.
Note
DuckDB generally does not load data into memory until specifically requested to do so using the fetch* methods.
import duckdb
# Path to local datalake
datalake_path = "F:/datalake/AABBCCDD/"
# Create table from datalake
table_gnss_speed = duckdb.read_parquet(f"{datalake_path}/CAN2_GnssSpeed/**/*.parquet")
# Query table using standard SQL
result_gnss_speed = duckdb.sql("SELECT * FROM table_gnss_speed")
# Display a preview of the query result
result_gnss_speed.show()
# Aggregate functions
result_gnss_speed_agg = duckdb.sql("SELECT COUNT(Speed), AVG(Speed), MAX(Speed), MIN(Speed) FROM table_gnss_speed")
# Subset by date range
result_gnss_speed_time_range = duckdb.sql("SELECT * FROM table_gnss_speed WHERE t BETWEEN '2022-04-24 08:00:00' AND '2022-04-24 08:50:00'")
# Subset by "sample" (1000 points)
result_gnss_speed_sample = duckdb.sql("SELECT * FROM table_gnss_speed USING SAMPLE 1000 ORDER BY t")
# Fetch sample subset as Python list (loads sample subset into memory)
result_gnss_speed_sample_list = result_gnss_speed_sample.fetchall()
# Fetch sample subset as Numpy array (loads sample subset into memory)
result_gnss_speed_sample_numpy = result_gnss_speed_sample.fetchnumpy()
# Fetch sample subset as Pandas Dataframe (loads sample subset into memory)
result_gnss_speed_sample_df = result_gnss_speed_sample.fetchdf()
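The query result can also be fetched as an Arrow Table, which keeps the workflow consistent with the pyarrow examples above (like the fetch* methods, this loads the subset into memory):
# Fetch sample subset as Arrow Table (loads sample subset into memory)
result_gnss_speed_sample_arrow = result_gnss_speed_sample.arrow()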
Arrow Table to Pandas Dataframe
The example below demonstrates how an existing Arrow Table can be converted to a Pandas DataFrame.
import pyarrow as pa
import pandas as pd
# Convert an existing Arrow Table (e.g. 'table' from the first example) to a Pandas DataFrame
df = table.to_pandas()
Note
Pandas DataFrames take up a lot of memory and are typically only suitable for small datasets.
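If in doubt, the approximate memory footprint of a DataFrame can be checked before working with it further. A minimal sketch using pandas' memory_usage method:
# Approximate memory usage of the DataFrame in bytes
print(df.memory_usage(deep=True).sum())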