Parquet data lake - advanced topics
In this section, we cover some advanced topics for Parquet data lakes.
Process backlog of MDF files
Below we describe how you can add historical MDF log files (that have already been uploaded) to your Parquet data lake.
Note
Optionally, you can periodically delete your Parquet data lake and re-create it as described below. This concatenates smaller files, which can improve processing speed[6]
Warning
Processing your entire input bucket backlog can be time-consuming and costly if the backlog is large
Amazon Parquet data lake
- Download the below zip to your local disk and unzip it[2]
- Add the `mdf2parquet_decode.exe` (Windows build) and your prefixed DBC files into the folder
- Start `1-setup-mc.bat` and enter the outputs from your stack[4]
- Start `2-download-mdf.bat` to download some/all data from your input bucket[5]
- Start `3-decode-mdf.bat` to decode the data, then review the output in `input_out/`
- Start `4-upload-parquet.bat` to upload the data from `input_out/` to your output bucket

Backlog processing zip | changelog
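If you prefer to run the four batch steps unattended, they can be wrapped in a small driver script. This is an illustrative sketch, not part of the official zip: the `.bat` filenames match the list above, everything else is hypothetical.

```python
import subprocess

# The four batch scripts from the backlog processing zip, in required order
BACKLOG_STEPS = [
    "1-setup-mc.bat",        # configure the MinIO client with your stack outputs
    "2-download-mdf.bat",    # download MDF data from the input bucket
    "3-decode-mdf.bat",      # decode MDF to Parquet in input_out/
    "4-upload-parquet.bat",  # upload input_out/ to the output bucket
]

def run_backlog(dry_run=True):
    """Run each step in order; with dry_run=True, only return the planned steps."""
    if not dry_run:
        for step in BACKLOG_STEPS:
            # check=True aborts the chain if any step fails
            subprocess.run(["cmd", "/c", step], check=True)
    return list(BACKLOG_STEPS)
```

Running the steps in this fixed order matters: the MinIO client must be configured before any download, and decoding must finish before the upload step reads `input_out/`.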
Google Parquet data lake
- Download the below zip to your local disk and unzip it
- Add the `mdf2parquet_decode.exe` (Windows build) and your prefixed DBC files into the folder
- Add your `bigquery-storage-admin-account.json` to the folder (from the BigQuery setup guide)
- Update the Python script with your input bucket name
- Run the script to download all your MDF log files, decode them and upload the Parquet files[3]

Backlog processing zip | changelog
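The provided Python script handles the listing, decoding and uploading for you. Conceptually, its per-file flow looks like the sketch below — all function names are hypothetical, with the storage and decode operations injected as callables:

```python
from pathlib import PurePosixPath

def process_backlog(object_keys, download, decode, upload):
    """Download -> decode -> upload for every MDF object in a bucket listing.

    object_keys: object names listed from the input bucket
    download(key) -> local file path
    decode(path) -> list of Parquet file paths
    upload(path) -> None
    All three callables are supplied by the caller (e.g. cloud SDK wrappers).
    """
    uploaded = []
    for key in object_keys:
        # Skip non-MDF objects, e.g. DBC files stored in the bucket root
        if PurePosixPath(key).suffix.upper() != ".MF4":
            continue
        local_path = download(key)
        for parquet_path in decode(local_path):
            upload(parquet_path)
            uploaded.append(parquet_path)
    return uploaded
```

The same flow applies per device/session folder in the CANedge structure; only the storage callables differ between cloud providers.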
Azure Parquet data lake
- Download the below zip to your local disk and unzip it
- Add the `mdf2parquet_decode.exe` (Windows build) and your prefixed DBC files into the folder
- Update the Python script with your connection strings and container names
- Run the script to download all your MDF log files, decode them and upload the Parquet files[3]
Use device specific DBC files
You may have a bucket/container that contains multiple device groups, each of which requires separate DBC files (e.g. different models/types/…). Here you can use the below method[1]:
- Download the below `dbc-groups.json` and update it with your device/DBC lists
- Upload the JSON file to your input bucket/container root
- Ensure that your log files are stored in the CANedge folder structure
- Verify that your automation now only applies the device specific DBC files
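The actual schema is defined by the downloadable `dbc-groups.json` template. The sketch below only illustrates the idea — device serials mapped to DBC file lists, plus the per-device lookup the automation effectively performs; all serials, file names and field names here are hypothetical:

```python
import json

# Hypothetical structure - consult the downloaded dbc-groups.json for the real schema
DBC_GROUPS_JSON = """
[
  {"devices": ["5BC57FEE", "0D12AB34"], "dbc_files": ["model_a_can1.dbc"]},
  {"devices": ["AABBCCDD"], "dbc_files": ["model_b_can1.dbc", "model_b_can2.dbc"]}
]
"""

def dbc_files_for(device_id, groups=None):
    """Return the DBC files to apply for a device serial (empty list if unlisted)."""
    if groups is None:
        groups = json.loads(DBC_GROUPS_JSON)
    for group in groups:
        if device_id in group["devices"]:
            return group["dbc_files"]
    return []
```

A device that appears in no group gets no device specific DBC files applied.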
Optimize performance
- Filter DBC: Filter unused messages/signals via the `MessageIgnore`/`SignalIgnore` properties
- Create light DBC: For large DBC files, consider removing message/signal descriptions[7]
- Use larger MDF splits: Configure your CANedge to split MDF files into e.g. 30-60 min chunks
- Reprocess backlog: Periodically re-process your backlog of MDF files to concatenate them
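As an illustration of the "light DBC" idea: in the DBC format, message/signal descriptions are stored in `CM_` entries, so a light version can be produced by stripping those lines without touching the decoding rules. The minimal sketch below assumes each `CM_` entry fits on a single line; multi-line descriptions would need proper DBC parsing:

```python
def lighten_dbc(dbc_text):
    """Strip single-line CM_ (comment/description) entries from DBC file text."""
    kept = [
        line for line in dbc_text.splitlines()
        if not line.lstrip().startswith("CM_")  # drop description entries
    ]
    return "\n".join(kept)
```

The `BO_` (message) and `SG_` (signal) definitions are left untouched, so decoding output is identical — only the file size and parse time shrink.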
[1] If all device groups have globally unique CAN IDs across all DBC files, you can simply add all the DBC files to the S3 input bucket root. In this case, the MF4 decoder will attempt to use all of them across all devices.
[2] The zip includes Windows batch scripts and the Windows MinIO client. For Linux, you can download the MinIO client separately and adjust the batch scripts.
[3] This assumes that you have installed Python 3.11 and the dependencies in requirements.txt (we recommend using a virtual environment). You can set this up by running `python -m venv env & env\Scripts\activate & pip install -r requirements.txt`
[4] Select the CloudFormation stack to display ‘Outputs’, where you will find the credentials/info for your input/output S3 buckets. The relevant S3 credentials have keys `07` and `08`.
[5] If your MDF log files are stored locally (e.g. on an SD), you can of course skip this step and simply copy/paste your files manually as described in the local data lake section.
[6] MDF files uploaded to the input bucket are processed one-by-one by the Lambda function, which can result in many small Parquet files if you have frequent MDF splits. In contrast, when you process a backlog of multiple MDF files, the decoder combines files to the extent possible. By deleting the data lake and re-processing your MDF backlog, you can thus optimize performance at the cost of AWS S3 transfer costs.
[7] This will not affect the speed of the data lake, but it can reduce the Lambda execution time, thus reducing costs. It is mainly relevant for large DBC files (for example, our J1939 DBC is provided in a ‘light’ version for this purpose).