Process backlog of MDF files

Parquet Data Lake Backlog Processing

Below we describe how you can add historical MDF log files to your Parquet data lake.

Note

Optionally, you can periodically delete your Parquet data lake and re-create it as described below. This concatenates smaller Parquet files, which can improve processing speed[5]

Warning

Processing a large input bucket/container backlog may be time-consuming and costly


Amazon Parquet data lake

  1. Download the zip below to your local disk and unzip it[1]
  2. Add the mdf2parquet_decode.exe (Windows build) and your prefixed DBC files to the folder
  3. Start 1-setup-mc.bat and enter the outputs from your stack[3]
  4. Start 2-download-mdf.bat to download some or all of the data from your input bucket[4]
  5. Start 3-decode-mdf.bat to decode the data, then review the output in input_out/
  6. Start 4-upload-parquet.bat to upload the data from input_out/ to your output bucket (see the transfer sketch below)

Backlog processing zip | changelog
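
For reference, the transfer steps that the batch scripts perform (downloading MDF files from the input bucket, uploading decoded Parquet files to the output bucket) can also be scripted directly. Below is a minimal Python sketch using boto3. It is not the bundled scripts: the region, credentials and bucket names are placeholders to be replaced with the values from your CloudFormation stack 'Outputs', and the MDF extension filter may need adjusting to your log files. The decoding itself is still done with mdf2parquet_decode.exe.

    # Minimal sketch (not the bundled batch scripts): download the MDF backlog
    # from the input S3 bucket and, after decoding, upload the Parquet files
    # from input_out/ to the output S3 bucket.
    from pathlib import Path
    import boto3

    s3 = boto3.client(
        "s3",
        region_name="eu-central-1",               # placeholder
        aws_access_key_id="YOUR_ACCESS_KEY",      # placeholder (stack 'Outputs')
        aws_secret_access_key="YOUR_SECRET_KEY",  # placeholder (stack 'Outputs')
    )
    input_bucket = "your-input-bucket"    # placeholder
    output_bucket = "your-output-bucket"  # placeholder

    # 1) Download all MDF log files from the input bucket to ./input/
    local_in = Path("input")
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=input_bucket):
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith((".mf4", ".mfc", ".mfe", ".mfm")):
                target = local_in / obj["Key"]
                target.parent.mkdir(parents=True, exist_ok=True)
                s3.download_file(input_bucket, obj["Key"], str(target))

    # 2) Decode with mdf2parquet_decode.exe (e.g. via 3-decode-mdf.bat)

    # 3) Upload the decoded Parquet files from ./input_out/ to the output bucket
    local_out = Path("input_out")
    for parquet in local_out.rglob("*.parquet"):
        key = parquet.relative_to(local_out).as_posix()
        s3.upload_file(str(parquet), output_bucket, key)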


Google Parquet data lake

  1. Download the zip below to your local disk and unzip it
  2. Add the mdf2parquet_decode.exe (Windows build) and your prefixed DBC files to the folder
  3. Add your bigquery-storage-admin-account.json to the folder (from the BigQuery setup guide)
  4. Update the Python script with your input bucket name
  5. Run the script to download all your MDF log files, decode them and upload the Parquet files[2] (see the sketch below)

Backlog processing zip | changelog
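
For context, the bundled Python script essentially runs a download, decode and upload loop. The sketch below illustrates the storage transfer part of such a loop using the google-cloud-storage client. It is not the bundled script: the bucket names are placeholders, the MDF extension filter may need adjusting to your log files, and the decoding itself is still done with mdf2parquet_decode.exe.

    # Minimal sketch (not the bundled script): download the MDF backlog from the
    # input bucket and, after decoding, upload the Parquet files to the output bucket.
    from pathlib import Path
    from google.cloud import storage

    # Service account JSON from the BigQuery setup guide
    client = storage.Client.from_service_account_json("bigquery-storage-admin-account.json")
    input_bucket = client.bucket("your-input-bucket")    # placeholder
    output_bucket = client.bucket("your-output-bucket")  # placeholder

    # 1) Download all MDF log files from the input bucket to ./input/
    local_in = Path("input")
    for blob in client.list_blobs(input_bucket):
        if blob.name.lower().endswith((".mf4", ".mfc", ".mfe", ".mfm")):
            target = local_in / blob.name
            target.parent.mkdir(parents=True, exist_ok=True)
            blob.download_to_filename(str(target))

    # 2) Decode with mdf2parquet_decode.exe

    # 3) Upload the decoded Parquet files from ./input_out/ to the output bucket
    local_out = Path("input_out")
    for parquet in local_out.rglob("*.parquet"):
        key = parquet.relative_to(local_out).as_posix()
        output_bucket.blob(key).upload_from_filename(str(parquet))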


Azure Parquet data lake

  1. Download the zip below to your local disk and unzip it
  2. Add the mdf2parquet_decode.exe (Windows build) and your prefixed DBC files to the folder
  3. Update the Python script with your connection strings and container names
  4. Run the script to download all your MDF log files, decode them and upload the Parquet files[2] (see the sketch below)

Backlog processing zip | changelog
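
Similarly, the transfer steps of the Azure script can be illustrated with the azure-storage-blob client. The sketch below is not the bundled script: the connection string and container names are placeholders, the MDF extension filter may need adjusting to your log files, and the decoding itself is still done with mdf2parquet_decode.exe.

    # Minimal sketch (not the bundled script): download the MDF backlog from the
    # input container and, after decoding, upload the Parquet files to the
    # output container.
    from pathlib import Path
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("YOUR_CONNECTION_STRING")  # placeholder
    input_container = service.get_container_client("your-input-container")        # placeholder
    output_container = service.get_container_client("your-output-container")      # placeholder

    # 1) Download all MDF log files from the input container to ./input/
    local_in = Path("input")
    for blob in input_container.list_blobs():
        if blob.name.lower().endswith((".mf4", ".mfc", ".mfe", ".mfm")):
            target = local_in / blob.name
            target.parent.mkdir(parents=True, exist_ok=True)
            with open(target, "wb") as f:
                f.write(input_container.download_blob(blob.name).readall())

    # 2) Decode with mdf2parquet_decode.exe

    # 3) Upload the decoded Parquet files from ./input_out/ to the output container
    local_out = Path("input_out")
    for parquet in local_out.rglob("*.parquet"):
        key = parquet.relative_to(local_out).as_posix()
        with open(parquet, "rb") as f:
            output_container.upload_blob(name=key, data=f, overwrite=True)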


[1]The zip includes Windows batch scripts and the Windows MinIO client. For Linux, you can download the MinIO client separately and adjust the batch scripts
[2]This assumes that you have installed Python 3.11 and the dependencies from requirements.txt (we recommend using a virtual environment). You can set this up by running python -m venv env & env\Scripts\activate & pip install -r requirements.txt
[3]Select the CloudFormation stack to display the 'Outputs' tab, where you will find the credentials/info for your input/output S3 buckets
[4]If your MDF log files are stored locally (e.g. on an SD card), you can of course skip this step and simply copy/paste your files manually as described in the local data lake section
[5]MDF files uploaded to the input bucket are processed one-by-one by the Lambda function, which can result in many small Parquet files if your MDF files are split frequently. In contrast, when you process a backlog of multiple MDF files, the decoder combines files to the extent possible. By deleting the data lake and re-processing your MDF backlog, you can thus improve performance, at the expense of additional AWS S3 transfer costs.