AWS S3 Lambda Parquet Data Lake

Here we explain how to deploy an Amazon Parquet data lake with automated data pre-processing and a query interface.

This can e.g. be used in Grafana-Athena dashboards or Python/MATLAB scripts.


Overview

This guide lets you set up an automated data pre-processing workflow, including:

  • An ‘input bucket’ (for MDF/DBC files) and ‘output bucket’ (for Parquet files)
  • A ‘Lambda function’ (DBC decodes new MDF files and outputs them as Parquet files)
  • An ‘Athena’ SQL interface for querying the data lake (e.g. from Grafana)
  • Three ‘support Glue jobs’ (map data lake, process MDF backlogs, summarize trips)

Note

The steps below require a paid tier AWS account[1] and an S3 bucket[2] - if you do not have these, see our guide.

Note

Ensure you test the MF4 decoders with your log files & DBC files locally before proceeding.


1: Upload files to input bucket

  1. Upload your prefixed DBC files (e.g. can1-xyz.dbc) to your bucket root via the S3 console[3]
  2. Upload the 4 files below[4] (zip file and Python scripts) to your bucket root

Lambda zip | Mapping script | Backlog script | Aggregation script
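The uploads above can also be scripted with Python/boto3. A minimal sketch, with assumptions: the bucket name and local file paths are placeholders, and the prefix check simply mirrors the `can1-xyz.dbc` naming example from step 1:

```python
import re


def is_prefixed_dbc(filename):
    """Check that a DBC filename follows the 'can<channel>-<name>.dbc'
    prefix convention shown in the guide (e.g. can1-xyz.dbc)."""
    return re.fullmatch(r"can\d+-.+\.dbc", filename) is not None


def upload_inputs(bucket, local_paths):
    """Upload DBC files, the Lambda zip and the Glue scripts to the bucket root.

    Assumes AWS credentials are configured locally (e.g. via 'aws configure').
    """
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3")
    for path in local_paths:
        # use the file name as the S3 key so the file lands in the bucket root
        s3.upload_file(path, bucket, path)
```

For example, `upload_inputs("your-input-bucket", ["can1-xyz.dbc"])` would place the DBC file in the bucket root, matching what the S3 console upload does.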


2: Import input bucket

  1. Open AWS CloudFormation and ensure you are in the same region as your input bucket

  2. Verify that there are no existing stacks[5]

  3. Click the upper-right ‘Create stack/With existing resources (import resources)’

  4. Enter the below URL in the ‘Amazon S3 URL’ field[6]:

    https://css-electronics-resources.s3.eu-central-1.amazonaws.com/stacks/import-s3-bucket-v1.3.0.json

  5. Enter the name of your existing AWS S3 input bucket

  6. Enter a unique ‘stack’ name (e.g. datalake-stack) and your input bucket name again

  7. Click ‘Import resources’ once the ‘Changes’ have loaded, wait ~1 min and hit F5
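As noted in footnote [6], the stack template can be downloaded for local review before importing. A small Python sketch using only the standard library - the URL is the one given above, and the helper lists the resource types declared in the template:

```python
import json
from urllib.request import urlopen

TEMPLATE_URL = (
    "https://css-electronics-resources.s3.eu-central-1"
    ".amazonaws.com/stacks/import-s3-bucket-v1.3.0.json"
)


def resource_types(template):
    """Return the sorted CloudFormation resource types in a template dict."""
    return sorted({r["Type"] for r in template.get("Resources", {}).values()})


def fetch_template(url=TEMPLATE_URL):
    """Download the stack template JSON for local review."""
    with urlopen(url) as resp:
        return json.load(resp)
```

Running `resource_types(fetch_template())` shows at a glance which AWS resources the import would touch.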

Note

This only works if you use the ‘import resources’ flow - not if you create a new stack from scratch


3: Deploy integration

  1. Click ‘Stack actions/Create change set for current stack’
  2. Click ‘Replace current template’ and enter the below URL:
https://css-electronics-resources.s3.eu-central-1.amazonaws.com/stacks/glue-athena-v4.1.0-vG.5.0.json
  3. Enter a ‘UniqueID’ (e.g. datalake05)[7]
  4. Enter a valid email for notifications on Lambda errors/events[8]
  5. Click ‘Acknowledge’, ‘Submit’, wait ~1 min and click the upper-right refresh
  6. Click ‘Execute change set’ (and confirm in the popup), then wait ~1 min
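The change set flow can also be scripted via boto3. A hedged sketch only: the parameter keys (`UniqueID`, `NotificationEmail`), the change set name and the IAM capability are illustrative assumptions and may differ from what the actual template expects - verify against the template before use:

```python
def change_set_parameters(unique_id, email):
    """Build the CloudFormation parameter list for the integration template.

    NOTE: the parameter key names here are assumptions for illustration;
    check the template's 'Parameters' section for the real keys.
    """
    return [
        {"ParameterKey": "UniqueID", "ParameterValue": unique_id},
        {"ParameterKey": "NotificationEmail", "ParameterValue": email},
    ]


def deploy_change_set(stack_name, template_url, parameters):
    """Create and execute a change set, mirroring the console steps above."""
    import boto3  # lazy import; requires configured AWS credentials

    cf = boto3.client("cloudformation")
    cs = cf.create_change_set(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=parameters,
        Capabilities=["CAPABILITY_NAMED_IAM"],  # the 'Acknowledge' step
        ChangeSetName="integration-update",
    )
    # wait for the 'Changes' to load, then execute (steps 5-6 above)
    cf.get_waiter("change_set_create_complete").wait(ChangeSetName=cs["Id"])
    cf.execute_change_set(ChangeSetName=cs["Id"])
```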

Note

If the deployment fails, double-check that you uploaded the zip/scripts to your S3 input bucket

Note

If you later need to update the integration with a new revision, see this guide


4: Test your cloud function

  1. Upload a test MDF file from your CANedge into your input bucket via the S3 console
  2. Verify that the decoded Parquet files are created in your output bucket[9]

Your data lake will now get auto-filled when new MDF files are uploaded to the input bucket.
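Step 2 can also be verified programmatically by listing objects in the output bucket, whose name follows the `<input-bucket-name>-parquet` convention from footnote [9]. A Python/boto3 sketch:

```python
def output_bucket_name(input_bucket):
    """Derive the output bucket name: <input-bucket-name>-parquet."""
    return f"{input_bucket}-parquet"


def list_parquet_objects(input_bucket, max_keys=100):
    """Return up to max_keys Parquet object keys from the output bucket."""
    import boto3  # lazy import; requires configured AWS credentials

    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(
        Bucket=output_bucket_name(input_bucket), MaxKeys=max_keys
    )
    return [
        obj["Key"]
        for obj in resp.get("Contents", [])
        if obj["Key"].endswith(".parquet")
    ]
```

If `list_parquet_objects("your-input-bucket")` returns a non-empty list shortly after you upload a test MDF file, the Lambda function is working.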

Note

If you wish to roll back your changes, simply delete the stack (your input bucket is not deleted)


5: Map your Parquet data lake to tables

  1. Verify that your S3 output bucket contains Parquet files
  2. Open AWS Glue Triggers in a new tab
  3. Select the ‘map-tables-on-demand’ trigger and click ‘Action/Start trigger’
  4. Open the trigger target Glue job, click ‘Runs’ and verify that it succeeds
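The trigger can also be started from Python instead of the console. A sketch - the trigger and job names are placeholders, and it is assumed that `get_job_runs` returns the most recent run first:

```python
def latest_run_state(runs_response):
    """Extract the state of the newest Glue job run from a get_job_runs
    response dict (e.g. 'SUCCEEDED', 'RUNNING', 'FAILED')."""
    job_runs = runs_response.get("JobRuns", [])
    return job_runs[0]["JobRunState"] if job_runs else None


def start_mapping_trigger(trigger_name, job_name):
    """Start the mapping trigger and report the latest run state."""
    import boto3  # lazy import; requires configured AWS credentials

    glue = boto3.client("glue")
    glue.start_trigger(Name=trigger_name)  # same as 'Action/Start trigger'
    return latest_run_state(glue.get_job_runs(JobName=job_name, MaxResults=1))
```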

Note

Glue stores metadata about your S3 output bucket. If new devices/messages are added to your Parquet data lake, the Glue job should be triggered again (manually or by schedule)[10]


You can now use Athena as a data source in e.g. Grafana-Athena dashboards. You can also check out the advanced topics to learn about event detection, trip summaries and more.
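As an example of querying the data lake outside Grafana, below is a Python/boto3 sketch that runs a simple Athena query and polls for the result. The database name, table name and S3 result location are placeholders - substitute your own:

```python
def preview_query(table, limit=10):
    """Build a simple preview query for an Athena table."""
    return f'SELECT * FROM "{table}" LIMIT {limit}'


def run_athena_query(sql, database, result_s3_uri):
    """Run an Athena query and return the results once it succeeds."""
    import time

    import boto3  # lazy import; requires configured AWS credentials

    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": result_s3_uri},
    )["QueryExecutionId"]
    while True:  # poll until the query reaches a terminal state
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"
        ]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=qid)
```

The same query string works unchanged in the Athena console or as a Grafana-Athena panel query.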


[1]You can upgrade your account from the free tier to paid in the ‘Cost and Usage’ section by clicking ‘Upgrade Plan’
[2]If you have connected a CANedge2/CANedge3 to an AWS S3 bucket then this is your input bucket
[3]If your MDF files are encrypted (MFE, MFM), also upload your passwords.json file
[4]Changelogs: automation scripts | mapping script
[5]If you have created your S3 input bucket via CloudFormation, you should ensure that this stack is deleted before proceeding (the default name is s3-input-bucket-stack). Further, to import your input bucket via CloudFormation, you must select the same region as your bucket via the menu in the upper-right corner in the AWS console.
[6]The CloudFormation JSON stack templates can be downloaded to your local disk via the URL if you wish to review the contents
[7]The unique ID can be useful if you e.g. need to deploy multiple separate Parquet data lakes
[8]We recommend confirming the subscription to error/event emails, as it helps you monitor your workflow - but you can of course disable these emails if preferred. You can edit your SNS subscriptions within AWS.
[9]The output bucket will be named as <input-bucket-name>-parquet - you can find it via the AWS S3 console
[10]New Parquet files added for existing devices/messages will automatically be available for queries by Athena. A new Glue job run is only required if the new Parquet data reflects a previously ‘unmapped’ device or table. For most use cases, the manual trigger will therefore suffice. However, a scheduled trigger is recommended if you expect new devices/messages to be added frequently over time. To activate the scheduled trigger, select it and click ‘Action/Activate trigger’. A Glue job will normally cost ~0.03$/run (depending on data lake size), in which case a scheduled daily trigger would cost ~10$/year