Process backlog of MDF files
Below we describe how you can add historical MDF log files to your Parquet data lake.
Note
Optionally, you can periodically delete your Parquet data lake and re-create it as described below. This will concatenate smaller files, which can improve processing speed.
Warning
Processing a large input bucket/container backlog may be time-consuming and costly - start with a small backlog during initial testing.
Process backlog via JSON file
To process a backlog of MDF files to Parquet in your cloud, follow the steps below:
- Ensure you are using the latest data lake integration
- Download our example `backlog.json` below and modify it as per your needs
- Once ready, upload it to your input bucket root
- Trigger the backlog processing as outlined for your specific cloud below
How it works
When you run the backlog processing job, it does the following:
- It downloads and validates your `backlog.json` file
- It groups the log file objects you've specified into suitable batches
- It runs the same processing as your cloud automation function on each batch[1]
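The grouping step above can be illustrated with a minimal Python sketch. The helper below is hypothetical (the actual script's batching logic may differ), and handling of the minimum batch size is omitted for brevity - here, all of a device's files are simply pooled and split into chunks of at most `max_size`:

```python
from itertools import islice

def batch_files(object_keys, max_size=256):
    """Group MDF object keys into batches, never mixing devices.

    Hypothetical sketch: the device ID is assumed to be the first
    path segment of each key (e.g. "2F6913DB/00000086/....MF4").
    """
    # Group keys by device ID (first path segment)
    by_device = {}
    for key in object_keys:
        device = key.split("/", 1)[0]
        by_device.setdefault(device, []).append(key)

    # Within each device, emit batches of at most max_size keys;
    # pooling per device implicitly merges small sessions together
    batches = []
    for device, keys in by_device.items():
        it = iter(sorted(keys))
        while chunk := list(islice(it, max_size)):
            batches.append(chunk)
    return batches

keys = [
    "2F6913DB/00000086/00000001-62961868.MF4",
    "2F6913DB/00000086/00000003-62977DFB.MF4",
    "ABCDEF12/00000088/00000001-AAAA0001.MF4",
]
for batch in batch_files(keys):
    print(batch)
```

Note how the two files from device `2F6913DB` end up in one batch, while the `ABCDEF12` file is batched separately - files are never grouped across devices.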
JSON syntax
To illustrate the JSON syntax, consider the example below:
```json
{
  "config": {
    "batch_size": {
      "min": 10,
      "max": 256
    }
  },
  "files": [
    "2F6913DB/",
    "ABCDEF12/00000088/",
    "2F6913DB/00000086/00000001-62961868.MF4",
    "2F6913DB/00000086/00000003-62977DFB.MF4"
  ]
}
```
The `batch_size` parameters let you control the grouping of files. For example, if you have a lot of sessions with 1-2 files, setting a minimum batch size of 10 allows the script to group up to 10 log files across sessions into a single batch. Files are never grouped across devices. If you have many small files, consider increasing the minimum batch size.
The `files` list contains the files you wish to process in a specific run. You can specify these at the device ID level, session level or object level - or a mix of these.
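Before uploading, you can sanity-check your `backlog.json` locally. Below is a minimal sketch: the key names follow the example above, but the specific checks are illustrative, not a definition of what the backlog job itself validates:

```python
import json

def validate_backlog(text):
    """Check the basic structure of a backlog.json document (illustrative)."""
    doc = json.loads(text)
    cfg = doc["config"]["batch_size"]
    assert isinstance(cfg["min"], int) and isinstance(cfg["max"], int), \
        "batch sizes must be integers"
    assert 0 < cfg["min"] <= cfg["max"], "min must be positive and <= max"
    files = doc["files"]
    assert isinstance(files, list) and files, "files must be a non-empty list"
    for entry in files:
        # Entries are device prefixes ("<device>/"), session prefixes
        # ("<device>/<session>/") or full object keys ("....MF4")
        assert isinstance(entry, str) and not entry.startswith("/"), \
            f"invalid entry: {entry!r}"
    return doc

example = """
{
  "config": {"batch_size": {"min": 10, "max": 256}},
  "files": ["2F6913DB/", "2F6913DB/00000086/00000001-62961868.MF4"]
}
"""
validate_backlog(example)
print("backlog.json looks valid")
```

Running this before upload catches malformed JSON and obvious structural mistakes early, rather than after the job has started.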
How to run
Amazon
- Open AWS Glue Triggers in a new tab
- Select the ‘process-backlog-on-demand’ trigger and click ‘Action/Start trigger’
- View output logs by clicking the target job and the ‘Runs’ tab
Google
- Open Cloud Scheduler in your browser
- Select ‘mdf-to-parquet-backlog-scheduler’, click RESUME and FORCE RUN
- After it completes, click PAUSE
- View output logs via the ‘mdf-to-parquet-backlog’ function Logs tab in the console (runs)
Azure
- Open your ‘backlog’ Container App Job via the console (jobs)
- Click ‘Run now’ (at the top) and then click the execution history
- View output logs via the execution console logs (may be delayed up to 15 min)
Note
If the job results in new devices/messages in your data lake, re-run your table mapping job
Note
The max run-time for a job is 60 minutes. For large backlogs, process a subset in each run
Tips & tricks
If you are processing files that were already processed individually via your cloud automation function, we recommend first deleting the related Parquet files from your output bucket. They are not automatically overwritten when the backlog processing runs, because the backlog process combines multiple Parquet files where possible for efficiency.
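As an example of such a cleanup, you could list your output bucket and remove Parquet objects under the affected device prefixes. The helper below is a hypothetical sketch - the output-bucket key layout shown is an assumption, and the actual deletion (e.g. via boto3 or your cloud's CLI) is left commented out:

```python
def stale_parquet_keys(output_keys, device_prefixes):
    """Return output-bucket keys that look like Parquet files and fall
    under any of the given device/session prefixes (illustrative)."""
    return [
        key for key in output_keys
        if key.endswith(".parquet")
        and any(key.startswith(p) for p in device_prefixes)
    ]

# Example keys - the actual layout of your output bucket may differ
keys = [
    "2F6913DB/CAN1_speed/part-0001.parquet",
    "ABCDEF12/CAN1_rpm/part-0001.parquet",
]
to_delete = stale_parquet_keys(keys, ["2F6913DB/"])
print(to_delete)

# Deletion itself could then be done with e.g. boto3 (AWS):
# import boto3
# s3 = boto3.client("s3")
# for key in to_delete:
#     s3.delete_object(Bucket="your-output-bucket", Key=key)
```

Always double-check the resulting key list before deleting anything from your output bucket.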
[1] The backlog processing essentially re-uses the same code as the single-file automation functions (e.g. the Amazon Lambda). This means that if you have e.g. enabled event detection, custom signals or device-specific DBC files, the backlog processing will also incorporate these. However, for event detection specifically, the backlog processing script disables notifications (to avoid excessive notifications about historical events).