Azure Parquet data lake

Here we explain how to create an Azure Parquet data lake (with Azure Function automation).

The result can e.g. be used in PowerBI-Synapse dashboards or queried via Python.

Note

This guide (and technical support on it) is intended for advanced users.


Overview

This guide enables you to set up an automated data pre-processing workflow. This includes an ‘input container’ (for MDF/DBC files) and an ‘output container’ (for Parquet files). It also includes an ‘Azure Function’, which auto-processes new MDF files uploaded to the input container - and outputs them as decoded Parquet files in the output container.

Note

The steps below assume that you have an Azure account and an input container[1] [2]. If not, see this.

Note

Ensure you test the MF4 decoders with your log files & DBC files locally before proceeding.


1: Upload DBC files to input container

  1. Upload your prepared DBC files to your input container root via the Azure console
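
If you prefer to script the upload, below is a minimal sketch using the azure-storage-blob Python package. The connection string placeholder, container name and DBC file name (can1.dbc) are assumptions - replace them with your own values.

    from azure.storage.blob import BlobServiceClient

    # Assumption: use your input Storage Account connection string
    service = BlobServiceClient.from_connection_string("<input-connection-string>")
    container = service.get_container_client("<your-input-container>")

    # Upload a prepared DBC file to the container root
    with open("can1.dbc", "rb") as f:
        container.upload_blob(name="can1.dbc", data=f, overwrite=True)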

2: Create new storage account and output container

  1. In the Azure console, go to Storage Accounts and create new
  2. Specify a name (e.g. parquetstorageaccount)
  3. Select the same resource group and region as your input container
  4. Click ‘Next’ and check ‘Enable hierarchical namespace’, then ‘Review + create’
  5. Go to the Storage Account / Data storage / Containers and add a new container
  6. Specify a name (e.g. parquet-container-output)
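
Alternatively, this step can be scripted. The sketch below uses the azure-mgmt-storage Python package to create a storage account with hierarchical namespace enabled and add the output container - the subscription ID, resource group and region are placeholder assumptions.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient

    client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Create the storage account with 'Enable hierarchical namespace' on
    poller = client.storage_accounts.begin_create(
        "<your-resource-group>",
        "parquetstorageaccount",
        {
            "location": "<your-region>",
            "kind": "StorageV2",
            "sku": {"name": "Standard_LRS"},
            "is_hns_enabled": True,
        },
    )
    poller.result()

    # Add the output container to the new account
    client.blob_containers.create(
        "<your-resource-group>", "parquetstorageaccount", "parquet-container-output", {}
    )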

3: Create Azure Function App

  1. In the Azure console, go to Function App and create new
  2. Specify a name (e.g. mdf-to-parquet)
  3. Select the same resource group and region as before
  4. Select Python 3.11 as the runtime stack and choose ‘Consumption (serverless)’
  5. Click ‘Review + create’

4: Set up environment variables

  1. Note down the ‘Connection string’ from your input + output storage accounts[3]
  2. Go to your Azure Function App ‘Settings/Environment variables’
  3. Add the following environment variables (name/value pairs):
     StorageAccountConnectionStringInput / <input Storage Account connection string>
     StorageAccountConnectionStringOutput / <output Storage Account connection string>
  4. Make sure to click ‘Apply’ after adding the above (this will restart the app)
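
For reference, a Python Azure Function reads these settings at runtime as environment variables. A minimal sketch, assuming the variable names above:

    import os

    # Set under the Function App's 'Settings/Environment variables'
    input_conn = os.environ["StorageAccountConnectionStringInput"]
    output_conn = os.environ["StorageAccountConnectionStringOutput"]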

5: Modify and deploy Azure Function

  1. Install the Azure CLI (to enable authentication)
  2. Install Azure Functions Core Tools (for publishing Azure Functions)
  3. Create a local folder called azure-function-deployment
  4. Download our ready-to-use Azure Function below and unzip it into the folder
  5. Open the function_app.py file with a text editor
  6. Update the input-container-name and output-container-name to match your containers
  7. If your log files are compressed/encrypted, change the MF4 suffix[4]
  8. In the azure-function-deployment folder, open your command line
  9. Run az login to authenticate
  10. Run func azure functionapp publish <your-function-app-name> --python[5]
  11. Verify that the MdfToParquet function is in your Azure Function App overview

Function zip | changelog
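
For orientation, the snippet below is a minimal sketch of the blob-trigger pattern from the Azure Functions Python v2 programming model. It is not the shipped function_app.py, but it illustrates where the container name and MF4 suffix from steps 6-7 appear; the container name is a placeholder.

    import azure.functions as func

    app = func.FunctionApp()

    # Fires when a new .MF4 file lands in the input container
    # (change the suffix to e.g. .MFC if your log files are compressed)
    @app.blob_trigger(
        arg_name="blob",
        path="<your-input-container>/{name}.MF4",
        connection="StorageAccountConnectionStringInput",
    )
    def MdfToParquet(blob: func.InputStream):
        # The shipped function decodes the MDF file via the DBC file(s)
        # and writes Parquet files to the output container
        ...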


6: Test Azure Function

  1. Upload a test MDF file from your CANedge into your input container via the Azure console
  2. Verify that the decoded Parquet files are created in your output container
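
You can also verify the output programmatically. A minimal sketch using azure-storage-blob (the connection string and container name are placeholders):

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<output-connection-string>")
    container = service.get_container_client("parquet-container-output")

    # List the decoded Parquet files written by the Azure Function
    for blob in container.list_blobs():
        if blob.name.endswith(".parquet"):
            print(blob.name, blob.size)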

Your data lake will now get auto-filled when new MDF files are uploaded to the input container[6].


Next, you can e.g. set up Azure Synapse to enable PowerBI-Synapse dashboards.
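
Alternatively, you can query the Parquet data lake directly from Python. A minimal sketch using pandas with the adlfs package installed - note that the abfs path below is an assumed example layout, not necessarily how the function structures its output:

    import pandas as pd

    # Read Parquet files from the output container into a DataFrame
    df = pd.read_parquet(
        "abfs://parquet-container-output/<device>/<message>",
        storage_options={"connection_string": "<output-connection-string>"},
    )
    print(df.head())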


[1]If you have connected a CANedge2/CANedge3 to an Azure Blob Storage Container, then this is your input container
[2]If this is your first time deploying this integration, consider creating a ‘playground’ input container that is separate from your ‘production’ input container (where your CANedge units are uploading data to). This allows you to test the full integration with sample MDF files - after which you can deploy the setup on your production input container.
[3]You can find the connection string in your Storage Account under ‘Security + networking/Access keys/Connection string’
[4]If you e.g. use compression, your MDF files will use the suffix MFC - and the trigger function in function_app.py will need to be adjusted accordingly
[5]Replace <your-function-app-name> with your Function App name, e.g. mdf-to-parquet
[6]If you are not seeing the expected results, double check that you’ve edited and published the Azure Function correctly. In your Azure Function App, you can click the MdfToParquet Function and open the ‘Monitor’ section to see recent invocations incl. console details that may help in troubleshooting