Process backlog of MDF files

Parquet Data Lake Backlog Processing

Below we describe how you can add historical MDF log files to your Parquet data lake.

Note

Optionally, you can periodically delete your Parquet data lake and re-create it as described below. This concatenates smaller files, which can improve processing speed

Warning

Processing a large input bucket/container backlog may be time-consuming and costly - start with a small backlog during initial testing


Process backlog via JSON file

To process a backlog of MDF files to Parquet in your cloud, follow the steps below:

  1. Ensure you are using the latest data lake integration
  2. Download our example backlog.json below and modify it to fit your needs
  3. Once ready, upload it to your input bucket root
  4. Trigger the backlog processing as outlined for your specific cloud below

backlog.json
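If preferred, the backlog.json file can also be generated programmatically before upload. Below is a minimal Python sketch; the device/session entries are placeholders taken from the JSON syntax example further down:

```python
import json

# Sketch: build a backlog.json matching the documented structure.
# The entries in "files" are illustrative placeholders.
backlog = {
    "config": {"batch_size": {"min": 10, "max": 256}},
    "files": [
        "2F6913DB/",                                # all sessions for a device
        "2F6913DB/00000086/00000001-62961868.MF4",  # a single log file object
    ],
}

# Write the file locally; upload it to your input bucket root afterwards
with open("backlog.json", "w") as f:
    json.dump(backlog, f, indent=2)
```

Once generated, upload the file to your input bucket root via your cloud console or CLI.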

How it works

When you run the backlog processing job, it does the following:

  1. The script downloads and validates your backlog.json file
  2. It groups the log file objects you’ve specified into suitable batches
  3. It runs the same processing as your cloud automation function on each batch[1]
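The validation in step 1 is internal to the integration, but a minimal Python sketch of the kind of checks it implies (illustrative only, not the integration's actual code) could look like this:

```python
import json

def validate_backlog(raw: str) -> dict:
    """Illustrative sketch of backlog.json validation - checks the
    documented structure, not the integration's actual logic."""
    backlog = json.loads(raw)
    size = backlog["config"]["batch_size"]
    if not (0 < size["min"] <= size["max"]):
        raise ValueError("batch_size must satisfy 0 < min <= max")
    for entry in backlog["files"]:
        if not isinstance(entry, str) or not entry:
            raise ValueError(f"invalid files entry: {entry!r}")
    return backlog

# Example: validate a minimal backlog definition
raw = '{"config": {"batch_size": {"min": 10, "max": 256}}, "files": ["2F6913DB/"]}'
backlog = validate_backlog(raw)
```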

JSON syntax

To illustrate the JSON syntax, consider the example below:

{
  "config": {
    "batch_size": {
      "min": 10,
      "max": 256
    }
  },
  "files": [
    "2F6913DB/",
    "ABCDEF12/00000088/",
    "2F6913DB/00000086/00000001-62961868.MF4",
    "2F6913DB/00000086/00000003-62977DFB.MF4"
  ]
}

The batch_size parameters control how files are grouped. For example, if you have many sessions containing only 1-2 files each, a minimum batch size of 10 allows the script to group up to 10 log files across sessions into a single batch. Files are never grouped across devices. If you have many small files, consider increasing the minimum batch size.

The files list contains the objects you wish to process in a specific run. You can specify these at the device ID level, the session level or the object level - or a mix of these.
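To make the grouping rules concrete, below is an illustrative Python sketch (not the integration's actual algorithm) that batches object paths per device, merging small sessions up to the minimum batch size:

```python
from collections import defaultdict

def batch_files(objects, min_size=10, max_size=256):
    """Illustrative sketch of the documented grouping rules: files are
    never grouped across devices; small sessions are merged into shared
    batches of roughly min_size files; no batch exceeds max_size files.
    Assumed object layout: <device>/<session>/<file>.MF4"""
    sessions = defaultdict(list)
    for path in sorted(objects):
        device, session = path.split("/")[:2]
        sessions[(device, session)].append(path)

    batches, pending = [], defaultdict(list)  # pending small files per device
    for (device, _), files in sessions.items():
        if len(files) >= min_size:
            # Large session: split into batches of up to max_size files
            batches += [files[i:i + max_size] for i in range(0, len(files), max_size)]
        else:
            # Small session: merge across sessions of the same device
            pending[device] += files
            if len(pending[device]) >= min_size:
                batches.append(pending[device])
                pending[device] = []
    batches += [files for files in pending.values() if files]
    return batches

# Example: two devices, several small sessions (paths are placeholders)
objects = (
    [f"2F6913DB/{s:08d}/{i:08d}-AAAA.MF4" for s in range(3) for i in range(2)]
    + ["ABCDEF12/00000001/00000001-BBBB.MF4"]
)
batches = batch_files(objects, min_size=4)
```

With min_size=4, the three 2-file sessions of the first device are merged until a batch reaches 4 files, and the second device's single file always lands in its own batch.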


How to run

Amazon

  1. Open AWS Glue Triggers in a new tab
  2. Select the ‘process-backlog-on-demand’ trigger and click ‘Action/Start trigger’
  3. View output logs by clicking the target job and the ‘Runs’ tab

Google

  1. Open Cloud Scheduler in your browser
  2. Select ‘mdf-to-parquet-backlog-scheduler’, click RESUME and FORCE RUN
  3. After it completes, click PAUSE
  4. View output logs via the ‘mdf-to-parquet-backlog’ function’s ‘Logs’ tab in the console (runs)

Azure

  1. Open your ‘backlog’ Container App Job via the console (jobs)
  2. Click ‘Run now’ (at the top) and then click the execution history
  3. View output logs via the execution console logs (may be delayed up to 15 min)

Note

If the job results in new devices/messages in your data lake, re-run your table mapping job

Note

The max run-time for a job is 60 minutes. For large backlogs, process a subset in each run


Tips & tricks

If you are processing files that have already been processed individually via your cloud automation function, it is recommended to first delete the related Parquet files from your output bucket. They will not be overwritten automatically when the backlog processing runs, because the backlog process combines multiple Parquet files where possible for efficiency.
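As a sketch of this cleanup step (the output key layout below is an assumption - verify the actual prefixes in your output bucket first), you could select the Parquet objects to delete like this:

```python
def keys_to_delete(output_keys, reprocessed_devices):
    """Illustrative sketch: select existing Parquet objects to remove
    before re-running the backlog for the given devices. Assumes output
    keys start with the device ID - verify your actual output layout."""
    devices = set(reprocessed_devices)
    return [
        k for k in output_keys
        if k.split("/")[0] in devices and k.endswith(".parquet")
    ]

# Example with placeholder output keys for two devices
existing = [
    "2F6913DB/MessageA/part-0.parquet",
    "ABCDEF12/MessageB/part-0.parquet",
]
to_delete = keys_to_delete(existing, ["2F6913DB"])
```

The returned keys could then be passed to your cloud SDK's delete call for the output bucket.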


[1]The backlog processing essentially re-uses the same code as the single-file automation functions (e.g. the Amazon Lambda function). This means that if you have enabled e.g. event detection, custom signals or device-specific DBC files, the backlog processing will incorporate these as well. For event detection specifically, however, the backlog processing script disables notifications (to avoid excessive notifications about historical events).