Google Parquet data lake
Here we explain how to deploy a Parquet data lake on Google Cloud with automated pre-processing and a query interface.
The data lake can then be used e.g. in Grafana-BigQuery dashboards or via Python/MATLAB.
Overview
This guide lets you set up an automated data pre-processing workflow, including:
- An ‘input bucket’ (for MDF/DBC files) and ‘output bucket’ (for Parquet files)
- A ‘cloud function’ (DBC decodes new MDF files and outputs them as Parquet files)
- A ‘BigQuery’ SQL interface for querying the data lake (e.g. from Grafana)
- Three ‘support cloud functions’ (map data lake, process MDF backlogs, summarize trips)
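Once deployed, the BigQuery interface can also be queried directly from Python. Below is a minimal sketch using the `google-cloud-bigquery` client; the dataset, table, and signal names are placeholders - use the tables created by the mapping function in your own project:

```python
# Sketch: query a decoded CAN signal from the data lake via BigQuery.
# Dataset/table/signal names are placeholders, not actual names from
# this guide - substitute the tables mapped in your own project.

def build_signal_query(dataset: str, table: str, signal: str, limit: int = 100) -> str:
    """Construct a simple SQL query for one decoded signal column."""
    return (
        f"SELECT {signal} FROM `{dataset}.{table}` "
        f"WHERE {signal} IS NOT NULL LIMIT {limit}"
    )

def run_query(sql: str):
    """Execute the query via BigQuery (requires GCP credentials)."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client()
    return client.query(sql).to_dataframe()

if __name__ == "__main__":
    sql = build_signal_query("lakedb", "device_can1_xyz", "Speed")
    print(sql)
    # df = run_query(sql)  # uncomment once credentials are configured
```

The SQL construction is separated from the client call so the query can be inspected before it is run against your project.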
Note
The below assumes that you have a GCP account and input bucket. If not, see this guide.
Note
Ensure you test the MF4 decoders with your log files & DBC files locally before proceeding.
1: Upload files to input bucket
- Upload your prefixed DBC files (e.g. can1-xyz.dbc) to your bucket root via the console[1]
- Upload the below 4 files[2] to your bucket root
Cloud function zip | Mapping function zip | Backlog zip | Aggregation zip
Note
If you later need to update the integration, upload the new files and repeat the steps below.
2: Deploy integration
- Open the canedge-google-cloud-terraform repository
- Go through the ‘setup instructions’ to open your Cloud Shell and clone the repository
- Go through step 2 (MF4-to-Parquet) (see above video)
- Go through step 3 (BigQuery)
3: Test your cloud function
- Upload a test MDF from your CANedge into your input bucket via CANcloud or the console
- Verify that the decoded Parquet files are created in your output bucket[3]
Your data lake will now get auto-filled when new MDF files are uploaded to the input bucket.
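The verification step can also be scripted: list the output bucket and check for Parquet objects. This is a sketch assuming the `<input-bucket-name>-parquet` naming from footnote [3]; the bucket name is a placeholder:

```python
# Sketch: verify that decoded Parquet files exist in the output
# bucket. Assumes the "<input-bucket-name>-parquet" naming convention;
# the input bucket name is a placeholder (requires GCP credentials).

def parquet_blobs(blob_names: list) -> list:
    """Filter a list of object names down to Parquet files."""
    return [n for n in blob_names if n.endswith(".parquet")]

def list_output_parquet(input_bucket: str) -> list:
    """List Parquet objects in the paired output bucket."""
    from google.cloud import storage  # pip install google-cloud-storage
    client = storage.Client()
    names = [b.name for b in client.list_blobs(f"{input_bucket}-parquet")]
    return parquet_blobs(names)

if __name__ == "__main__":
    for name in list_output_parquet("your-input-bucket"):
        print(name)
```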
4: Map your Parquet data lake to tables
- Verify that your output bucket contains Parquet files[4]
- Open Cloud Scheduler in your browser
- Select ‘map-tables-scheduler’, then click RESUME and FORCE RUN
- After the run completes, click PAUSE
Optionally review the Logs tab of your <id>-bq-map-tables function via the console (function overview)
Note
The mapping script adds ‘metadata’ about your output bucket. If new devices/messages are added to your Parquet data lake, the script should be run again (manually or on a schedule)[5]
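If you need to re-run the mapping regularly, the FORCE RUN step can also be triggered programmatically via the Cloud Scheduler client. This is a sketch; the project ID and region are placeholders, and it assumes the `google-cloud-scheduler` package:

```python
# Sketch: trigger the 'map-tables-scheduler' job programmatically,
# as an alternative to RESUME/FORCE RUN in the console. Project and
# location are placeholders (requires google-cloud-scheduler and
# GCP credentials).

def job_path(project: str, location: str, job: str) -> str:
    """Build the fully qualified Cloud Scheduler job name."""
    return f"projects/{project}/locations/{location}/jobs/{job}"

def force_run(project: str, location: str, job: str) -> None:
    """Run the scheduler job once, immediately."""
    from google.cloud import scheduler_v1  # pip install google-cloud-scheduler
    client = scheduler_v1.CloudSchedulerClient()
    client.run_job(name=job_path(project, location, job))

if __name__ == "__main__":
    force_run("your-project-id", "europe-west1", "map-tables-scheduler")
```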
Next, set up Grafana-BigQuery dashboards - or check the advanced topics to process your historical backlog of MDF files, add custom event triggers and more.
| [1] | If your MDF files are encrypted (MFE, MFM), also upload your passwords.json file |
| [2] | Changelogs: changelog | changelog |
| [3] | The output bucket will be named <input-bucket-name>-parquet and is created in the same project as your input bucket |
| [4] | If your output bucket is empty, you can upload a test MDF file to your input bucket to create some Parquet data |
| [5] | You only need to re-run the script if new tables are to be created, not if you simply add more data to an existing table |