Parquet in Matlab

Warning

The examples on this page are provided without support.

Matlab has excellent Parquet support, including:

  • Reading a single file

  • Reading a set of files (in-memory or out-of-memory)

  • Reading directly from local disk, S3, and more

  • Possibility to parallelize calculations


Examples

The illustrative examples below were tested with MATLAB R2023b.

All examples use the same data set, which can be downloaded here:

Dataset (.zip)

Warning

When loading multiple files, all files must share a common data schema. See Information on output-file organization.
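
One way to check the common-schema requirement before building a datastore is to compare the metadata returned by parquetinfo. The following is a minimal sketch, not part of the original examples: the temporary folder, the file names a.parquet and b.parquet, and the sample values are hypothetical and only serve to make the snippet self-contained.

```matlab
% Hypothetical demo folder with two small Parquet files (same schema)
folder = tempname;
mkdir(folder);
t = datetime(2022, 4, 22) + seconds(0:2)';
parquetwrite(fullfile(folder, 'a.parquet'), table(t, [1; 2; 3], 'VariableNames', {'t', 'Speed'}));
parquetwrite(fullfile(folder, 'b.parquet'), table(t, [4; 5; 6], 'VariableNames', {'t', 'Speed'}));

% Use the first file as the reference schema
files = dir(fullfile(folder, '*.parquet'));
ref = parquetinfo(fullfile(files(1).folder, files(1).name));

% Compare variable names and MATLAB data types of every other file
same_schema = true;
for k = 2:numel(files)
    info = parquetinfo(fullfile(files(k).folder, files(k).name));
    if ~isequal(info.VariableNames, ref.VariableNames) || ...
            ~isequal(info.VariableTypes, ref.VariableTypes)
        same_schema = false;
        warning('Schema mismatch in file %s', files(k).name);
    end
end
disp(same_schema)
```

For a real datalake, the folder would be replaced by the path of the message folder, and the two parquetwrite calls would be dropped.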


Read single file (in-memory)

clc, clear

% Path to local file
file_path = "F:\datalake\AABBCCDD\CAN2_GnssSpeed\2022\04\22\00000975_00000001.parquet";

% Read single file into memory
gnss_speed = parquetread(file_path, "RowTimes", "t");

% Plot speed over time
figure(1)
plot(gnss_speed.t, gnss_speed.Speed)

parquet_read.m


Read multiple files (in-memory)

clc, clear

% Path to local datalake
datalake_path = "F:\datalake\AABBCCDD\CAN2_GnssSpeed";

% Create datastore containing multiple files
ds_gnss_speed = parquetDatastore(datalake_path, "RowTimes", "t", "IncludeSubfolders", true );

% Read all files into memory
gnss_speed = ds_gnss_speed.readall();

% Plot speed over time
figure(1)
plot(gnss_speed.t, gnss_speed.Speed)

parquet_datastore_local_readall.m


Read multiple files (out-of-memory)

The MATLAB documentation describes tall arrays as:

“Tall arrays are used to work with out-of-memory data that is backed by a datastore.”

Tall arrays make it possible to work efficiently with data sets that are much larger than the available memory. In addition, operations on tall arrays are supported by the MATLAB Parallel Computing Toolbox, so calculations on the data can be parallelized.

clc, clear

% Path to local datalake
datalake_path = "F:\datalake\AABBCCDD\CAN2_GnssSpeed";

% Create datastore containing multiple files
ds_gnss_speed = parquetDatastore(datalake_path, "RowTimes", "t", "IncludeSubfolders", true );

% Create tall array (files not loaded into memory)
gnss_speed = tall(ds_gnss_speed);

% Plot speed over time
figure(1)
plot(gnss_speed.t, gnss_speed.Speed)

parquet_datastore_local_tallarray.m
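
Calculations on tall arrays are deferred: MATLAB only evaluates them when the result is requested, for example with gather. The following is a minimal sketch of that behavior. It uses a small in-memory table with hypothetical values instead of the Parquet datastore, so that it runs without the data set.

```matlab
% Hypothetical sample data standing in for a large datastore
T = table((1:6)', [10; 20; 30; 40; 50; 60], 'VariableNames', {'Sample', 'Speed'});

% Convert to a tall table (same interface as tall(ds_gnss_speed))
tt = tall(T);

% This only queues the calculation; nothing is evaluated yet
mean_speed_tall = mean(tt.Speed);

% gather triggers the evaluation and returns an in-memory result
mean_speed = gather(mean_speed_tall);
disp(mean_speed)
```

With the datastore-backed tall array from the example above, the same pattern applies: reduce the data first (e.g. with mean, max or retime) and call gather on the reduced result, so that only the small result has to fit in memory.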


Read multiple files from S3 (out-of-memory)

This example is the same as Read multiple files (out-of-memory), except that it reads directly from an AWS S3 bucket.

The AWS S3 credentials should first be set using environment variables in the MATLAB command window (refer to the Matlab documentation):

setenv("AWS_ACCESS_KEY_ID","YOUR_AWS_ACCESS_KEY_ID")
setenv("AWS_SECRET_ACCESS_KEY","YOUR_AWS_SECRET_ACCESS_KEY")
setenv("AWS_DEFAULT_REGION","YOUR_AWS_DEFAULT_REGION")

Note

The region must be set correctly (e.g. eu-north-1 or us-east-1).
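
Before creating the datastore, it can help to confirm that the three variables are visible to the current MATLAB session. This is a minimal sketch using getenv; it only checks that the variables are non-empty, not that the credentials or the region are valid.

```matlab
% Environment variables used in the setenv calls above
required = {'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'AWS_DEFAULT_REGION'};

% getenv returns an empty value for variables that are not set
is_set = ~cellfun(@(name) isempty(getenv(name)), required);

% Warn about each missing variable
for k = find(~is_set)
    warning('Environment variable %s is not set.', required{k});
end
```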

clc, clear

% Path to remote S3 datalake
datalake_path = "s3://css-parquet-test/AABBCCDD/CAN2_GnssSpeed";

% Create datastore containing multiple files
ds_gnss_speed = parquetDatastore(datalake_path, "RowTimes", "t", "IncludeSubfolders", true );

% Create tall array (files not loaded into memory)
gnss_speed = tall(ds_gnss_speed);

% Plot speed over time
figure(1)
plot(gnss_speed.t, gnss_speed.Speed)

parquet_datastore_s3_tallarray.m


Time resampling

The MATLAB function retime resamples or aggregates time-series (timetable) data onto a new time vector, with a wide range of configuration options.

Note

Refer to the Matlab documentation of retime

clc, clear

% Path to local datalake
datalake_path = "F:\datalake\AABBCCDD\CAN2_GnssSpeed";

% Create datastore containing multiple files
ds_gnss_speed = parquetDatastore(datalake_path, "RowTimes", "t", "IncludeSubfolders", true );

% Create tall array (files not loaded into memory)
gnss_speed = tall(ds_gnss_speed);

% Keep the last value within each one-minute bin
gnss_speed_resample = retime(gnss_speed, 'minutely', 'lastvalue');

% Plot resampled speed over time
figure(1)
plot(gnss_speed_resample.t, gnss_speed_resample.Speed)

parquet_resample.m
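
To see what the 'minutely' and 'lastvalue' options do, the following minimal sketch applies the same retime call to a small in-memory timetable. The timestamps and speed values are hypothetical and chosen so that each one-minute bin contains a known set of samples.

```matlab
% Hypothetical sample data: six speed samples spread over three minutes
t = datetime(2022, 4, 22, 0, 0, 0) + seconds([0; 20; 40; 70; 80; 130]);
Speed = [1; 2; 3; 4; 5; 6];
tt = timetable(t, Speed);

% Keep the last sample within each one-minute bin
tt_minutely = retime(tt, 'minutely', 'lastvalue');
disp(tt_minutely)
```

The first minute contains the samples 1, 2 and 3, the second minute 4 and 5, and the third minute only 6, so the resampled Speed column contains 3, 5 and 6. Other aggregation methods such as 'mean' or 'max' can be passed in place of 'lastvalue'.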


Time synchronization

The MATLAB function synchronize combines multiple time-series data sets (timetables) onto a common time vector.

The function is an extension of retime that supports multiple time-series data sets.

Note

Refer to the Matlab documentation of synchronize

clc, clear

% Path to local datalake
datalake_path = "F:\datalake\AABBCCDD\";

% Paths to time-series data sets
gnss_speed_path = fullfile(datalake_path, "CAN2_GnssSpeed");
obd_rpm_path = fullfile(datalake_path, "CAN1_OBD2_EngineSpeed");

% Create datastore for time-series data set "GnssSpeed"
ds_gnss_speed = parquetDatastore(gnss_speed_path, "RowTimes", "t", "IncludeSubfolders", true );

% Create datastore for time-series data set "EngineSpeed"
ds_obd_rpm = parquetDatastore(obd_rpm_path, "RowTimes", "t", "IncludeSubfolders", true );

% Create tall arrays (files not loaded into memory)
gnss_speed = tall(ds_gnss_speed);
obd_rpm = tall(ds_obd_rpm);

% Create a common timebase using the timestamps from gnss_speed and linear interpolation for obd_rpm
tt_common = synchronize(gnss_speed, obd_rpm, 'first', 'linear');

% Now with a common timebase, we can do calculations across the two time series
gear = tt_common.Speed ./ tt_common.S1_PID_0C_EngineRPM;

% Plot gear ratio calculation over time
figure(1)
plot(tt_common.t, gear)

parquet_syncronize.m
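
The effect of the 'first' and 'linear' options can be illustrated on two small in-memory timetables. This is a minimal sketch with hypothetical values; the variable name EngineRPM is a shortened stand-in for the signal name used above.

```matlab
% Hypothetical sample data: speed sampled every second, RPM every two seconds
gnss = timetable(seconds([0; 1; 2]), [10; 20; 30], 'VariableNames', {'Speed'});
obd = timetable(seconds([0; 2]), [1000; 2000], 'VariableNames', {'EngineRPM'});

% Use the row times of the first timetable and interpolate linearly
tt_common = synchronize(gnss, obd, 'first', 'linear');

% Both signals now share the same timestamps
ratio = tt_common.Speed ./ tt_common.EngineRPM;
disp(tt_common)
```

The output keeps the three timestamps of gnss. The RPM value at 1 s does not exist in obd and is interpolated between 1000 and 2000, giving 1500.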