Parquet data lake

In this section we outline how to set up a Parquet data lake.

Parquet data lakes offer an efficient, low-cost, scalable and interoperable way of storing DBC-decoded CAN/LIN data. The data lake can be analyzed directly via e.g. Python/MATLAB - or through interfaces such as dashboards. The data lake can be stored locally or in the cloud.
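
As a minimal illustration of direct analysis in Python, the sketch below loads one decoded Parquet file into a pandas DataFrame. The file path is hypothetical - the actual folder layout of your data lake depends on your decoder and automation setup.

  import pandas as pd

  # Hypothetical path to one decoded Parquet file in the data lake -
  # adjust to match your own device / message / date folder structure
  path = "datalake/AABBCCDD/CAN1_GnssSpeed/2023/01/01.parquet"

  # Load the DBC-decoded signals into a pandas DataFrame
  df = pd.read_parquet(path)

  # Basic analysis: print the first rows and per-signal statistics
  print(df.head())
  print(df.describe())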

Setting up a Parquet data lake is a prerequisite for the dashboards and for some of the MATLAB/Python script examples.


Prepare & test DBC files

  1. Download the MF4 decoder mdf2parquet_decode.exe and review the documentation here
  2. Rename your DBC files to add the <channel> prefix (e.g. can1-<name>.dbc, lin2-<name>.dbc, …)[1]
  3. Verify that you can decode your log file by drag & dropping it onto mdf2parquet_decode.exe (see the verification sketch after this list)
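
As an optional scripted alternative to the drag & drop check in step 3, the sketch below uses pyarrow to inspect one of the decoded output files - e.g. to confirm that the signals from your DBC files appear in the schema. The file path is only an example; adjust it to match the output created by mdf2parquet_decode.

  import pyarrow.parquet as pq

  # Hypothetical path to one of the Parquet files produced by
  # mdf2parquet_decode - adjust to your actual output location
  decoded_file = "output/AABBCCDD/CAN1_EngineData/2023/01/01.parquet"

  # Print the schema (signal names and data types) to confirm that
  # your DBC files were applied as expected
  print(pq.read_schema(decoded_file))

  # Optionally load the data and check the number of decoded rows
  table = pq.read_table(decoded_file)
  print(f"{table.num_rows} rows, {table.num_columns} columns")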

Note

You can easily open Parquet files on your PC via the free Parquet viewer Tad

Note

If you have issues decoding your data, see our MF4 decoder troubleshooting guide


Create a Parquet data lake

Once you have tested your setup locally, you can set up your Parquet data lake and automation.

You can set this up in multiple ways, depending on your existing environment[2]:

  1. Amazon - create a Parquet data lake stored in an AWS S3 bucket (incl. automation)
  2. Google - create a Parquet data lake stored in a Google bucket (incl. automation)
  3. Azure - create a Parquet data lake stored in an Azure container (incl. automation)
  4. Local - create a Parquet data lake stored locally with manual processing (see the query sketch after this list)
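
To give an idea of how a finished data lake is consumed, the sketch below scans a folder of decoded Parquet files via pyarrow's dataset API. The folder path, partition layout and signal names are assumptions - replace them with your own device, message and signal names.

  import pyarrow.dataset as ds

  # A local Parquet data lake is simply a folder tree of Parquet files.
  # The layout below is an assumption - adjust "datalake/" and the
  # device / message levels to your own structure
  dataset = ds.dataset("datalake/AABBCCDD/CAN1_EngineData", format="parquet")

  # Scan only the columns you need across all files in the folder tree
  # ("t" and "EngineSpeed" are hypothetical column/signal names)
  table = dataset.to_table(columns=["t", "EngineSpeed"])
  df = table.to_pandas()
  print(df.head())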


[1]As per the MF4 decoder docs, the prefix specifies whether a DBC is applied to CAN CH1, LIN CH2, etc. You can have multiple DBC files with the same type and channel prefix. Ensure that your DBC file names use only letters, numbers and dashes.
[2]For example, if you are using a CANedge2/3 to upload data to Amazon or Azure, we recommend setting up a Parquet data lake in Amazon or Azure, respectively.