Parquet data lake

In this section we outline how to set up a Parquet data lake.

Parquet data lakes offer an efficient, low-cost, scalable and interoperable way of storing DBC-decoded CAN/LIN data. The data lake can be analyzed directly via e.g. Python/MATLAB - or through an interface such as a dashboard. The data lake can be stored locally or in the cloud.
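As a minimal sketch of the "analyze directly via Python" workflow, the snippet below loads a single decoded Parquet file with pandas. The file path (device ID, message folder, file name) is a hypothetical placeholder - replace it with a path from your own data lake output.

    # Minimal sketch: load one decoded Parquet file into a pandas DataFrame
    # (requires pandas plus a Parquet engine such as pyarrow)
    import pandas as pd

    # Hypothetical path - substitute a file from your own decoder output
    df = pd.read_parquet("datalake/AABBCCDD/CAN2_GnssSpeed/2024/01/01/00000001.parquet")

    print(df.head())      # first decoded signal samples
    print(df.describe())  # summary statistics per signal column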

This is a prerequisite for the dashboards and for some of the MATLAB/Python script examples.


Prepare & test DBC files

  1. Download the MF4 decoders and review the related documentation
  2. Rename your DBC files to add a <channel> prefix (can1-<name>.dbc, can9-<name>.dbc, …)[1]
  3. Verify that you can decode your log file by dragging & dropping it onto mdf2csv_decode.exe[2]
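If you manage many DBC files, a small script can help verify the naming convention up front. The sketch below is a hypothetical helper (not part of the MF4 decoder tools); it assumes the rules described in the footnotes: a can/lin channel prefix followed by a dash, and only letters, numbers and dashes in the file name.

    # Minimal sketch: flag DBC file names that do not match the expected
    # <channel>-<name>.dbc convention (e.g. can1-mydbc.dbc, lin2-other-dbc.dbc)
    import re
    from pathlib import Path

    # Assumed pattern: can/lin prefix, single-digit channel, then letters/numbers/dashes
    DBC_NAME_PATTERN = re.compile(r"^(can|lin)\d-[A-Za-z0-9-]+\.dbc$")

    def check_dbc_names(folder: str) -> None:
        """Print whether each .dbc file in 'folder' follows the naming convention."""
        for path in sorted(Path(folder).glob("*.dbc")):
            status = "OK" if DBC_NAME_PATTERN.match(path.name) else "RENAME REQUIRED"
            print(f"{status}: {path.name}")

    # Hypothetical usage: run in the folder containing your DBC files
    check_dbc_names(".")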

Note

If you’re having issues decoding your data, see our MF4 decoder troubleshooting guide


Create a Parquet data lake

Once you have tested your setup locally, you can set up your Parquet data lake and automation.

You can set this up in multiple ways, depending on your existing environment[3]:

  1. Amazon - create a Parquet data lake stored in an AWS S3 bucket (incl. automation)
  2. Azure - create a Parquet data lake stored in an Azure container (incl. automation)
  3. Local - create a Parquet data lake stored locally with manual processing
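Whichever option you choose, the analysis side looks much the same. Below is a minimal sketch using pyarrow to open a folder of decoded Parquet files, locally or on S3; the folder name, bucket name and region are hypothetical placeholders, and the exact layout depends on your decoder/automation setup.

    # Minimal sketch: open a decoded Parquet data lake with pyarrow datasets
    import pyarrow.dataset as ds

    # Option 3 (local): read all Parquet files under a local data lake folder
    local_lake = ds.dataset("datalake", format="parquet")  # hypothetical folder
    print(local_lake.head(5).to_pandas())

    # Option 1 (Amazon): the same call against an S3 bucket
    # from pyarrow import fs
    # s3 = fs.S3FileSystem(region="eu-central-1")          # hypothetical region
    # s3_lake = ds.dataset("my-bucket/datalake", filesystem=s3, format="parquet")
    # print(s3_lake.head(5).to_pandas())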


[1]As per the MF4 decoder docs, the prefix specifies whether a DBC is applied to CAN CH1, LIN CH2 etc. You can have multiple DBC files with the same type and channel prefix. Ensure that your DBC file names use only letters, numbers and dashes.
[2]While the end goal is to create a Parquet data lake (rather than CSV), we recommend using the mdf2csv_decode.exe for testing as you can easily review the output in a text editor. However, you can also use the mdf2parquet_decode.exe and then open the resulting Parquet files via the simple Parquet viewer Tad.
[3]For example, if you are using a CANedge2/3 to upload data to Amazon or Azure, we recommend setting up a Parquet data lake in Amazon or Azure, respectively.