Google Parquet data lake

Google BigQuery interface

Here we explain how to deploy a Google Parquet data lake with automation and an interface.

The data lake can then be used in e.g. Grafana-BigQuery dashboards or via Python/MATLAB.


Overview

This guide lets you set up an automated data pre-processing workflow, including:

  • An ‘input bucket’ (for MDF/DBC files) and ‘output bucket’ (for Parquet files)
  • A ‘cloud function’ (DBC-decodes new MDF files and writes them as Parquet files)
  • A ‘BigQuery’ SQL interface for querying the data lake (e.g. from Grafana)
  • Three ‘support cloud functions’ (map data lake, process MDF backlogs, summarize trips)
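To illustrate how the two buckets relate: as noted in footnote [3], the integration names the output bucket after the input bucket. The helper below is purely illustrative and not part of the integration code:

```python
# Illustrative sketch (not part of the integration): the output bucket
# created by the integration is named "<input-bucket-name>-parquet"
def output_bucket_name(input_bucket: str) -> str:
    """Return the Parquet output bucket name for a given input bucket."""
    return f"{input_bucket}-parquet"

print(output_bucket_name("my-canedge-bucket"))  # my-canedge-bucket-parquet
```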

Note

The steps below assume that you have a GCP account and an input bucket. If not, see this guide.

Note

Ensure that you test the MF4 decoders locally with your log files and DBC files before proceeding.


1: Upload files to input bucket

  1. Upload your prefixed DBC files (e.g. can1-xyz.dbc) to your bucket root via the console[1]
  2. Upload the 4 files below[2] to your bucket root

Cloud function zip | Mapping function zip | Backlog zip | Aggregation zip
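As a sanity check before uploading, you can verify that your DBC filenames follow the channel-prefix convention shown above (e.g. can1-xyz.dbc). The helper below is a hypothetical illustration only; the exact matching rules used by the decoder may differ:

```python
import re

def dbc_channel(filename: str):
    """Return the CAN channel prefix (e.g. 'can1') of a DBC filename,
    or None if the name does not match the assumed 'canX-<name>.dbc' pattern."""
    m = re.match(r"^(can\d)-.+\.dbc$", filename, re.IGNORECASE)
    return m.group(1).lower() if m else None

print(dbc_channel("can1-xyz.dbc"))    # can1
print(dbc_channel("passwords.json"))  # None
```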

Note

If you later need to update the integration, upload the new files and repeat the steps below.


2: Deploy integration

  1. Open the canedge-google-cloud-terraform repository
  2. Go through the ‘setup instructions’ to open your Cloud Shell and clone the repository
  3. Go through step 2 (MF4-to-Parquet) (see the video above)
  4. Go through step 3 (BigQuery)

3: Test your cloud function

  1. Upload a test MDF file from your CANedge to your input bucket via CANcloud or the console
  2. Verify that the decoded Parquet files are created in your output bucket[3]

Your data lake will now be populated automatically when new MDF files are uploaded to the input bucket.
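If you prefer a scripted check over browsing the console, the sketch below counts Parquet objects after you have synced (or downloaded) the output bucket to a local folder; the sync step itself (e.g. via the gcloud CLI) is not shown:

```python
from pathlib import Path

def count_parquet_files(root: str) -> int:
    """Count *.parquet files anywhere under a local copy of the output bucket."""
    return sum(1 for _ in Path(root).rglob("*.parquet"))
```

If the count is zero after uploading a test MDF file, review the cloud function logs via the console.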


4: Map your Parquet data lake to tables

  1. Verify that your output bucket contains Parquet files[4]
  2. Open Cloud Scheduler in your browser
  3. Select ‘map-tables-scheduler’, click RESUME and FORCE RUN
  4. After it completes, click PAUSE

Optionally, review the ‘Logs’ tab of your <id>-bq-map-tables function via the console (function overview)

Note

The mapping script adds ‘metadata’ about your output bucket. If new devices/messages are added to your Parquet data lake, the script should be run again (manually or on a schedule)[5]
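Conceptually, the mapping step creates one BigQuery table per device/message folder in the data lake. The sketch below illustrates that idea only; the actual folder layout and table naming are determined by the mapping function, and the '<device>/<message>/...' path format used here is an assumption:

```python
def table_id(object_path: str):
    """Derive an illustrative table name from an assumed
    '<device>/<message>/.../<file>.parquet' object path (hypothetical layout)."""
    parts = object_path.strip("/").split("/")
    if len(parts) < 3 or not parts[-1].endswith(".parquet"):
        return None
    device, message = parts[0], parts[1]
    return f"{device}_{message}"

print(table_id("AABBCCDD/CAN1_speed/2024/file.parquet"))  # AABBCCDD_CAN1_speed
```

This also shows why re-running the script is only needed when new devices or messages appear: adding more files under an existing device/message folder maps to the same table.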

Next, set up Grafana-BigQuery dashboards, or check the advanced topics to process your historical backlog of MDF files, add custom event triggers and more.


[1]If your MDF files are encrypted (MFE, MFM), also upload your passwords.json file
[2]Changelogs: changelog | changelog
[3]The output bucket will be named <input-bucket-name>-parquet and created in the same project as your input bucket
[4]If your output bucket is empty, you can upload a test MDF file to your input bucket to create some Parquet data
[5]You only need to re-run the script if new tables are to be created, not if you simply add more data to an existing table