Parquet in Matlab
Warning
The examples on this page are provided without support
Matlab has excellent Parquet support, including:
Reading a single file
Reading a set of files (in-memory or out-of-memory)
Reading directly from local disk, S3, and more
Parallelizing calculations (with the Parallel Computing Toolbox)
Examples
The illustrative examples below were tested with Matlab R2023b.
All examples use the same data set which can be downloaded here:
Warning
When loading multiple files, all files must share a common data schema (the same column names and types). See Information on output-file organization.
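One way to check the shared-schema requirement before building a datastore is to compare the metadata that parquetinfo returns for each file. The sketch below is only an illustration: it writes two small stand-in files to a temporary folder (the folder, file names and values are hypothetical) so that it runs without the example data set.

```matlab
% Hypothetical stand-in files written to a temporary folder
demo_dir = tempname;
mkdir(demo_dir);
t = datetime(2022, 4, 22) + seconds(0:2)';
Speed = [1.0; 2.0; 3.0];
T = table(t, Speed);
file_a = fullfile(demo_dir, "a.parquet");
file_b = fullfile(demo_dir, "b.parquet");
parquetwrite(file_a, T);
parquetwrite(file_b, T);

% Read only the file metadata (no data is loaded)
info_a = parquetinfo(file_a);
info_b = parquetinfo(file_b);

% The files share a schema if column names and column types match
same_schema = isequal(info_a.VariableNames, info_b.VariableNames) && ...
              isequal(info_a.VariableTypes, info_b.VariableTypes);
disp(same_schema)
```

With real data, file_a and file_b would instead point at two files from the same data lake folder.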
Read single file (in-memory)
clc, clear
% Path to local file
file_path = "F:\datalake\AABBCCDD\CAN2_GnssSpeed\2022\04\22\00000975_00000001.parquet";
% Read single file into memory
gnss_speed = parquetread(file_path, "RowTimes", "t");
% Plot speed over time
figure(1)
plot(gnss_speed.t, gnss_speed.Speed)
Read multiple files (in-memory)
clc, clear
% Path to local datalake
datalake_path = "F:\datalake\AABBCCDD\CAN2_GnssSpeed";
% Create datastore containing multiple files
ds_gnss_speed = parquetDatastore(datalake_path, "RowTimes", "t", "IncludeSubfolders", true );
% Read all files into memory
gnss_speed = ds_gnss_speed.readall();
% Plot speed over time
figure(1)
plot(gnss_speed.t, gnss_speed.Speed)
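Before calling readall on a large folder, it can help to check which files the datastore picked up and to preview a few rows. The following is a minimal sketch, assuming a hypothetical temporary folder with two day subfolders in place of the real data lake.

```matlab
% Hypothetical stand-in data lake: one small Parquet file per day subfolder
demo_root = tempname;
day_names = ["21", "22"];
for k = 1:numel(day_names)
    day_dir = fullfile(demo_root, day_names(k));
    mkdir(day_dir);
    t = datetime(2022, 4, 20 + k) + seconds(0:2)';
    Speed = k * [1; 2; 3];
    parquetwrite(fullfile(day_dir, "part.parquet"), table(t, Speed));
end

% Create the datastore in the same way as in the example above
ds = parquetDatastore(demo_root, "RowTimes", "t", "IncludeSubfolders", true);

% List the files found and preview the first rows without reading everything
disp(ds.Files)
disp(preview(ds))

% Read all files into one in-memory timetable
gnss_speed = readall(ds);
disp(height(gnss_speed))
```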
Read multiple files (out-of-memory)
The Matlab documentation describes tall arrays as:
“Tall arrays are used to work with out-of-memory data that is backed by a datastore.”
Tall arrays make it possible to work efficiently with data sets much larger than the available memory. Operations on a tall array are deferred until a result is requested (for example with gather). Tall arrays are also supported by the Matlab Parallel Computing Toolbox, which makes it possible to parallelize calculations on the data.
clc, clear
% Path to local datalake
datalake_path = "F:\datalake\AABBCCDD\CAN2_GnssSpeed";
% Create datastore containing multiple files
ds_gnss_speed = parquetDatastore(datalake_path, "RowTimes", "t", "IncludeSubfolders", true );
% Create tall array (files not loaded into memory)
gnss_speed = tall(ds_gnss_speed);
% Plot speed over time
figure(1)
plot(gnss_speed.t, gnss_speed.Speed)
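Calculations on tall arrays are not evaluated immediately. They are queued until the result is requested with gather. The sketch below illustrates this on a small in-memory table converted with tall (hypothetical stand-in values); with real data the tall array would come from the datastore as shown above. The call to mapreducer(0) is optional and keeps the evaluation in the local Matlab session instead of a parallel pool.

```matlab
% Optional: evaluate tall arrays in the local session (no parallel pool)
mapreducer(0);

% Hypothetical stand-in data; normally created with tall(ds_gnss_speed)
Speed = [10; 20; 30; 40];
gnss_speed = tall(table(Speed));

% These lines only queue the calculations (deferred evaluation)
mean_speed = mean(gnss_speed.Speed);
max_speed = max(gnss_speed.Speed);

% gather evaluates both queued results and returns them as in-memory values
[mean_speed, max_speed] = gather(mean_speed, max_speed);
disp([mean_speed, max_speed])
```

Gathering several results in one call lets Matlab combine the passes over the data, which matters when the files are large or remote.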
Read multiple files from S3 (out-of-memory)
This example is the same as Read multiple files (out-of-memory), except that the data is read directly from an AWS S3 bucket.
The AWS S3 credentials should first be set using environment variables in the MATLAB command window (refer to the Matlab documentation):
setenv("AWS_ACCESS_KEY_ID","YOUR_AWS_ACCESS_KEY_ID")
setenv("AWS_SECRET_ACCESS_KEY","YOUR_AWS_SECRET_ACCESS_KEY")
setenv("AWS_DEFAULT_REGION","YOUR_AWS_DEFAULT_REGION")
Note
It is important that the region is correct (e.g. eu-north-1 or us-east-1)
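A missing or empty credential variable is a common reason for S3 access to fail, so it can be useful to check the environment before creating the datastore. This is a small sketch using getenv. The variable unset_vars is a hypothetical helper name, and the check only verifies that the variables are set, not that the values are valid.

```matlab
% Environment variables expected for S3 access (same names as above)
required = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"];

% Collect the names of variables that are unset or empty
unset_vars = strings(1, 0);
for name = required
    if strlength(getenv(name)) == 0
        unset_vars(end + 1) = name;
    end
end

if isempty(unset_vars)
    disp("All AWS environment variables are set.")
else
    warning("Missing AWS environment variables: %s", join(unset_vars, ", "));
end
```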
clc, clear
% Path to remote S3 datalake
datalake_path = "s3://css-parquet-test/AABBCCDD/CAN2_GnssSpeed";
% Create datastore containing multiple files
ds_gnss_speed = parquetDatastore(datalake_path, "RowTimes", "t", "IncludeSubfolders", true );
% Create tall array (files not loaded into memory)
gnss_speed = tall(ds_gnss_speed);
% Plot speed over time
figure(1)
plot(gnss_speed.t, gnss_speed.Speed)
Time resampling
The Matlab function retime resamples or aggregates time-series (timetable) data onto a new time vector.
It offers a wide range of configuration options, such as the time step and the interpolation or aggregation method.
Note
Refer to the Matlab documentation of retime
clc, clear
% Path to local datalake
datalake_path = "F:\datalake\AABBCCDD\CAN2_GnssSpeed";
% Create datastore containing multiple files
ds_gnss_speed = parquetDatastore(datalake_path, "RowTimes", "t", "IncludeSubfolders", true );
% Create tall array (files not loaded into memory)
gnss_speed = tall(ds_gnss_speed);
% Keep the last value within each one-minute bin
gnss_speed_resample = retime(gnss_speed, 'minutely', 'lastvalue');
% Plot resampled speed over time
figure(1)
plot(gnss_speed_resample.t, gnss_speed_resample.Speed)
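To see what the aggregation does, it can help to run retime on a few hand-made samples. The sketch below uses a small in-memory timetable with hypothetical values (not the example data set) and compares the 'lastvalue' and 'mean' methods on one-minute bins.

```matlab
% Hypothetical samples: two in the first minute, two in the second
t = datetime(2022, 4, 22, 0, 0, 0) + seconds([10; 50; 80; 100]);
Speed = [1; 2; 3; 4];
demo = timetable(t, Speed);

% Aggregate into one-minute bins, keeping the last sample of each bin
last_per_minute = retime(demo, 'minutely', 'lastvalue');

% Same bins, but averaging the samples in each bin
mean_per_minute = retime(demo, 'minutely', 'mean');

disp(last_per_minute)
disp(mean_per_minute)
```

Here the first bin contains the samples 1 and 2, and the second bin contains 3 and 4, so 'lastvalue' keeps 2 and 4 while 'mean' gives 1.5 and 3.5.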
Time synchronization
The Matlab function synchronize combines multiple time-series data sets (timetables) onto a common time vector.
The function extends retime to support multiple time-series data sets.
Note
Refer to the Matlab documentation of synchronize
clc, clear
% Path to local datalake
datalake_path = "F:\datalake\AABBCCDD\";
% Paths to time-series data sets
gnss_speed_path = fullfile(datalake_path, "CAN2_GnssSpeed");
obd_rpm_path = fullfile(datalake_path, "CAN1_OBD2_EngineSpeed");
% Create datastore for time-series data set "GnssSpeed"
ds_gnss_speed = parquetDatastore(gnss_speed_path, "RowTimes", "t", "IncludeSubfolders", true );
% Create datastore for time-series data set "EngineSpeed"
ds_obd_rpm = parquetDatastore(obd_rpm_path, "RowTimes", "t", "IncludeSubfolders", true );
% Create tall arrays (files not loaded into memory)
gnss_speed = tall(ds_gnss_speed);
obd_rpm = tall(ds_obd_rpm);
% Create a common timebase using the timestamps from gnss_speed and linear interpolation for obd_rpm
tt_common = synchronize(gnss_speed, obd_rpm, 'first', 'linear');
% With a common timebase, we can do calculations across the two time series
gear = tt_common.Speed ./ tt_common.S1_PID_0C_EngineRPM;
% Plot gear ratio calculation over time
figure(1)
plot(tt_common.t, gear)
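The effect of the 'first' and 'linear' options can be checked on two small hand-made timetables. This is a minimal sketch with hypothetical values and the hypothetical variable names Speed and RPM (the real data uses S1_PID_0C_EngineRPM). It also masks rows where the engine speed is not positive, since the division would otherwise produce Inf or negative ratios there.

```matlab
% Hypothetical samples on two different timebases (offset by 5 seconds)
t0 = datetime(2022, 4, 22, 0, 0, 0);
Speed = [10; 20; 30];
RPM = [1000; 2000; 3000];
speed_tt = timetable(t0 + seconds([0; 10; 20]), Speed);
rpm_tt = timetable(t0 + seconds([5; 15; 25]), RPM);

% Use the row times of the first timetable and interpolate linearly
tt_common = synchronize(speed_tt, rpm_tt, 'first', 'linear');
disp(tt_common)

% Ratio between speed and engine speed, ignoring rows without positive RPM
ratio = tt_common.Speed ./ tt_common.RPM;
ratio(tt_common.RPM <= 0) = NaN;
disp(ratio)
```

The RPM values at 10 s and 20 s are interpolated between the neighbouring samples (1500 and 2500). The value at 0 s lies before the first RPM sample and is therefore extrapolated, so it should be treated with care.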