Parquet data lake - advanced topics
In this section, we cover some advanced topics for Parquet data lakes.
Process backlog of MDF files
Below we describe how you can add historical MDF log files (that have already been uploaded) to your Parquet data lake.
Note
Optionally, you can periodically delete your Parquet data lake and re-create it as described below. This concatenates smaller files, which can improve processing speed[6]
Warning
Processing your entire input bucket backlog can be time-consuming and costly if the backlog is large
Amazon Parquet data lake
- Download the below zip to your local disk and unzip it[2]
- Add the `mdf2parquet_decode.exe` (Windows build) and your prefixed DBC files into the folder
- Start `1-setup-mc.bat` and enter the outputs from your stack[4]
- Start `2-download-mdf.bat` to download some/all data from your input bucket[5]
- Start `3-decode-mdf.bat` to decode the data, then review the output in `input_out/`
- Start `4-upload-parquet.bat` to upload the data from `input_out/` to your output bucket

Backlog processing zip | changelog
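If you prefer to run the four batch steps unattended, they can be wrapped in a small driver script. This is an illustrative sketch, not part of the official zip: the `.bat` filenames match the list above, everything else is hypothetical.

```python
import subprocess

# The four batch scripts from the backlog processing zip, in required order
BACKLOG_STEPS = [
    "1-setup-mc.bat",        # configure the MinIO client with your stack outputs
    "2-download-mdf.bat",    # download MDF data from the input bucket
    "3-decode-mdf.bat",      # decode MDF to Parquet in input_out/
    "4-upload-parquet.bat",  # upload input_out/ to the output bucket
]

def run_backlog(dry_run=True):
    """Run each step in order; with dry_run=True, only return the planned steps."""
    if not dry_run:
        for step in BACKLOG_STEPS:
            # check=True aborts the chain if any step fails
            subprocess.run(["cmd", "/c", step], check=True)
    return list(BACKLOG_STEPS)
```

Running the steps in this fixed order matters: the MinIO client must be configured before any download, and decoding must finish before the upload step reads `input_out/`.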
Google Parquet data lake
- Download the below zip to your local disk and unzip it
- Add the `mdf2parquet_decode.exe` (Windows build) and your prefixed DBC files into the folder
- Add your `bigquery-storage-admin-account.json` to the folder (from the BigQuery setup guide)
- Update the Python script with your input bucket name
- Run the script to download all your MDF log files, decode them and upload the Parquet files[3]

Backlog processing zip | changelog
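The provided Python script handles the listing, decoding and uploading for you. Conceptually, its per-file flow looks like the sketch below — all function names are hypothetical, with the storage and decode operations injected as callables:

```python
from pathlib import PurePosixPath

def process_backlog(object_keys, download, decode, upload):
    """Download -> decode -> upload for every MDF object in a bucket listing.

    object_keys: object names listed from the input bucket
    download(key) -> local file path
    decode(path) -> list of Parquet file paths
    upload(path) -> None
    All three callables are supplied by the caller (e.g. cloud SDK wrappers).
    """
    uploaded = []
    for key in object_keys:
        # Skip non-MDF objects, e.g. DBC files stored in the bucket root
        if PurePosixPath(key).suffix.upper() != ".MF4":
            continue
        local_path = download(key)
        for parquet_path in decode(local_path):
            upload(parquet_path)
            uploaded.append(parquet_path)
    return uploaded
```

The same flow applies per device/session folder in the CANedge structure; only the storage callables differ between cloud providers.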
Azure Parquet data lake
- Download the below zip to your local disk and unzip it
- Add the `mdf2parquet_decode.exe` (Windows build) and your prefixed DBC files into the folder
- Update the Python script with your connection strings and container names
- Run the script to download all your MDF log files, decode them and upload the Parquet files[3]
Use device specific DBC files
You may have a bucket/container that contains multiple device groups, each of which requires separate DBC files (e.g. different models/types/…). Here you can use the below method[1]:
- Download the below `dbc-groups.json` and update it with your device/DBC lists
- Upload the JSON file to your input bucket/container root
- Ensure that your log files are stored in the CANedge folder structure
- Verify that your automation now only applies the device specific DBC files
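The actual schema is defined by the downloadable `dbc-groups.json` template. The sketch below only illustrates the idea — device serials mapped to DBC file lists, plus the per-device lookup the automation effectively performs; all serials, file names and field names here are hypothetical:

```python
import json

# Hypothetical structure - consult the downloaded dbc-groups.json for the real schema
DBC_GROUPS_JSON = """
[
  {"devices": ["5BC57FEE", "0D12AB34"], "dbc_files": ["model_a_can1.dbc"]},
  {"devices": ["AABBCCDD"], "dbc_files": ["model_b_can1.dbc", "model_b_can2.dbc"]}
]
"""

def dbc_files_for(device_id, groups=None):
    """Return the DBC files to apply for a device serial (empty list if unlisted)."""
    if groups is None:
        groups = json.loads(DBC_GROUPS_JSON)
    for group in groups:
        if device_id in group["devices"]:
            return group["dbc_files"]
    return []
```

A device that appears in no group gets no device specific DBC files applied.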
Optimize performance
- Filter DBC: Filter unused messages/signals via the `MessageIgnore`/`SignalIgnore` properties
- Create light DBC: For large DBC files, consider removing message/signal descriptions[7]
- Use larger MDF splits: Configure your CANedge to split MDF files into e.g. 30-60 min chunks
- Reprocess backlog: Periodically re-process your backlog of MDF files to concatenate them
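As an illustration of the "light DBC" idea: in the DBC format, message/signal descriptions are stored in `CM_` entries, so a light version can be produced by stripping those lines without touching the decoding rules. The minimal sketch below assumes each `CM_` entry fits on a single line; multi-line descriptions would need proper DBC parsing:

```python
def lighten_dbc(dbc_text):
    """Strip single-line CM_ (comment/description) entries from DBC file text."""
    kept = [
        line for line in dbc_text.splitlines()
        if not line.lstrip().startswith("CM_")  # drop description entries
    ]
    return "\n".join(kept)
```

The `BO_` (message) and `SG_` (signal) definitions are left untouched, so decoding output is identical — only the file size and parse time shrink.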
[1] If all device groups have globally unique CAN IDs across all DBC files, you can simply add all the DBC files to the S3 input bucket root. In this case, the MF4 decoder will attempt to use all of them across all devices.
[2] The zip includes Windows batch scripts and the Windows MinIO client. For Linux, you can download the MinIO client separately and adjust the batch scripts.
[3] This assumes that you have installed Python 3.11 and the dependencies in requirements.txt (we recommend using a virtual environment). You can set this up by running `python -m venv env & env\Scripts\activate & pip install -r requirements.txt`
[4] Select the CloudFormation stack to display ‘Outputs’, where you will find the credentials/info for your input/output S3 buckets. The relevant S3 credentials have keys `07` and `08`.
[5] If your MDF log files are stored locally (e.g. on an SD), you can of course skip this step and simply copy/paste your files manually as described in the local data lake section.
[6] MDF files uploaded to the input bucket are processed one-by-one by the Lambda function, which can result in many small Parquet files if you have frequent MDF splits. In contrast, when you process a backlog of multiple MDF files, the decoder combines files to the extent possible. By deleting the data lake and re-processing your MDF backlog, you can thus optimize performance at the cost of AWS S3 transfer costs.
[7] This will not affect the speed of the data lake, but it can reduce the Lambda execution time, thus reducing costs. It is mainly relevant for large DBC files (for example, our J1939 DBC is provided in a ‘light’ version for this purpose).