Azure Parquet data lake

Azure Synapse interface

Here we explain how to deploy an Azure Parquet data lake with automation and an interface.

The resulting data lake can then be queried e.g. from Grafana-Synapse dashboards, PowerBI-Synapse dashboards or Python scripts.


Overview

This guide lets you set up an automated data pre-processing workflow, including:

  • An ‘input container’ (for MDF/DBC files) and ‘output container’ (for Parquet files)
  • An ‘Azure function’ (DBC-decodes new MDF files and outputs them as Parquet files)
  • A ‘Synapse’ SQL interface for querying the data lake (e.g. from Grafana)
  • Three ‘support jobs’ (map data lake, process MDF backlogs, summarize trips)
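Conceptually, the Azure function routes each new object in the input container based on its file type. A minimal sketch of that routing logic, assuming the file conventions described in this guide (the exact behavior of the deployed function may differ):

```python
from pathlib import PurePosixPath

def route_upload(key: str) -> str:
    """Classify a new input-container object the way the automation would.

    Assumed conventions (illustrative, not authoritative):
    - MDF log files end in .MF4/.MFE/.MFM
    - DBC files (e.g. can1-xyz.dbc) sit in the container root
    """
    suffix = PurePosixPath(key).suffix.upper()
    if suffix in (".MF4", ".MFE", ".MFM"):
        return "decode"   # triggers DBC decoding, outputs Parquet files
    if suffix == ".DBC":
        return "config"   # decoding rules used by the function
    return "ignore"       # other objects are left untouched

print(route_upload("2F6913DB/00000123/00000001.MF4"))  # decode
print(route_upload("can1-powertrain.dbc"))             # config
```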

Note

The below assumes that you have an Azure account and that your input container was deployed via our Terraform script (not manually). If not, see this guide[1].

Note

Ensure you test the MF4 decoders with your log files & DBC files locally before proceeding.


1: Upload to input container

  1. Upload the prefixed DBC files (e.g. can1-xyz.dbc) to your input container root via the console[2]
  2. Upload the below zip[3] to your input container root

Cloud function zip

Note

If you later need to update the function zip, upload the new version and repeat the steps below.
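The channel prefix (can1-, can2-, ...) tells the decoder which CAN channel a DBC file applies to. Before uploading, you can sanity-check your file names against this convention; the accepted pattern below is an assumption for illustration:

```python
import re

# Assumed naming convention: can<channel>-<name>.dbc (e.g. can1-xyz.dbc)
DBC_PREFIX = re.compile(r"^can\d-[\w.-]+\.dbc$", re.IGNORECASE)

def valid_dbc_name(name: str) -> bool:
    """Return True if the file name carries a CAN channel prefix."""
    return bool(DBC_PREFIX.match(name))

print(valid_dbc_name("can1-xyz.dbc"))  # True
print(valid_dbc_name("xyz.dbc"))       # False - missing channel prefix
```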


2: Deploy integration

  1. Open the canedge-azure-cloud-terraform repository
  2. Go through the ‘setup instructions’ to open your Cloud Shell and clone the repository
  3. Go through step 2 (MF4-to-Parquet) with --github-token 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
  4. Go through step 3 (Synapse) with the above token
  5. Note down the Synapse connection output (for use in e.g. connecting Grafana/PowerBI)

3: Test your cloud function

  1. Upload a test MDF from your CANedge into your input container via CANcloud or the console
  2. Verify that the decoded Parquet files are created in your output container[4]

Your data lake will now get auto-filled when new MDF files are uploaded to the input container.
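When verifying the result, it helps to know where to look: per the note below[4], the output container name is derived from the input container name. A small helper illustrating that convention:

```python
def output_container(input_container: str) -> str:
    # Per the deployment convention, the output container is the input
    # container name with a "-parquet" suffix (same storage account)
    return f"{input_container}-parquet"

print(output_container("canedge-input"))  # canedge-input-parquet
```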


4: Map your Parquet data lake to tables

  1. Verify that your output container contains Parquet files[5]
  2. Open your ‘synapse-map-tables’ Container App Job via the console
  3. Click ‘Run now’ (at the top) and then click the execution history
  4. Verify that the job succeeds[7] (the job may fail the first time; if so, re-run it)

Note

The mapping script adds ‘metadata’ about your output container. If new devices/messages are added to your Parquet data lake, the script should be run again (manually or on a schedule)[6].

Next, you can set up Grafana-Synapse dashboards - or check the advanced topics to process your historical backlog of MDF files, add custom event triggers and more.
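To see why the mapping job only needs re-running when new tables appear, consider how tables could be derived from the data lake layout. The sketch below assumes a <device>/<message>/... partition layout and uses a hypothetical tbl_ naming scheme for illustration:

```python
def derive_tables(blob_paths):
    """Derive distinct table identifiers from Parquet blob paths.

    Assumes (hypothetically) a <device>/<message>/... partition layout:
    adding more files under an existing <device>/<message> prefix creates
    no new table - only a new device or message does.
    """
    tables = set()
    for path in blob_paths:
        parts = path.split("/")
        if len(parts) >= 3 and parts[-1].endswith(".parquet"):
            device, message = parts[0], parts[1]
            tables.add(f"tbl_{device}_{message}")
    return sorted(tables)

paths = [
    "2F6913DB/CAN1_EngineData/2024/01/05/a.parquet",
    "2F6913DB/CAN1_EngineData/2024/01/06/b.parquet",  # same table
    "2F6913DB/CAN1_GnssSpeed/2024/01/05/c.parquet",   # new table
]
print(derive_tables(paths))
```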


[1]The Terraform deployment requires that the storage account containing the input/output container has hierarchical namespace (HNS) enabled. If you previously created your storage account manually, we recommend using our Terraform deployment to create a new storage account and input container, then migrating your devices/data to the new setup
[2]If your MDF files are encrypted (MFE, MFM), also upload your passwords.json file
[3]Changelogs: changelog
[4]The output container will be named as <input-container-name>-parquet in the same storage account as your input container
[5]If your output container is empty, you can upload a test MDF file to your input container to create some Parquet data
[6]You only need to re-run the script if new tables are to be created, not if you simply add more data to an existing table
[7]You can review the console output of the script by clicking ‘Console Logs’ under the execution history. Note that logs may be delayed by up to 15 min and may not be sorted in ascending time order by default.