Amazon Parquet data lake
Here we explain how to deploy an Amazon Parquet data lake with automation and an interface.
The resulting data lake can be used, e.g., in Grafana-Athena dashboards or in Python/MATLAB scripts.
Overview
This guide lets you set up an automated data pre-processing workflow, including:
- An ‘input bucket’ (for MDF/DBC files) and ‘output bucket’ (for Parquet files)
- A ‘Lambda function’ (DBC decodes new MDF files and outputs them as Parquet files)
- An ‘Athena’ SQL interface for querying the data lake (e.g. from Grafana)
- Three ‘support Glue jobs’ (map data lake, process MDF backlogs, summarize trips)
Note
Ensure you test the MF4 decoders with your log files & DBC files locally before proceeding.
1: Upload files to input bucket
- Upload your prefixed DBC files (e.g. can1-xyz.dbc) to your bucket root via the S3 console[3]
- Upload the below 4 files[4] (zip file and Python scripts) to your bucket root:
Lambda zip | Mapping script | Backlog script | Aggregation script
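If you prefer the command line, the uploads above can also be done via the AWS CLI. Below is a minimal sketch that only prints the commands to run; the bucket name and the zip/script file names are placeholders (use your own input bucket name and the actual file names from the download links):

```shell
# Sketch only (prints commands, does not run them): upload the prefixed DBC
# files and the 4 support files to the input bucket root via the AWS CLI.
# BUCKET and the file names are placeholders - replace with your own.
BUCKET="my-input-bucket"
FILES="can1-xyz.dbc lambda-function.zip mapping-script.py backlog-script.py aggregation-script.py"

for f in $FILES; do
  echo "aws s3 cp ${f} s3://${BUCKET}/"
done
```

Running the printed commands requires the AWS CLI to be configured with credentials that can write to your input bucket.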
2: Import input bucket
Open AWS CloudFormation and ensure you are in the same region as your input bucket
Verify that there are no existing stacks[5]
Click the upper-right ‘Create stack/With existing resources (import resources)’
Enter the below URL in the ‘Amazon S3 URL’ field[6]:
https://css-electronics-resources.s3.eu-central-1.amazonaws.com/stacks/import-s3-bucket-v1.3.0.json
Enter the name of your existing AWS S3 input bucket
Enter a unique ‘stack’ name (e.g. datalake-stack) and your input bucket name again
Click ‘Import resources’ once the ‘Changes’ have loaded, wait ~1 min and hit F5
Note
This only works if you use the ‘import resources’ flow - not if you create a new stack from scratch
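The console import flow above is the documented route, but CloudFormation also supports resource imports via the CLI. Below is a hedged sketch that only prints the equivalent command; the stack name, bucket name and the logical resource ID are placeholders (the logical ID must match the one defined in the stack template):

```shell
# Sketch only (prints the command, does not run it): CLI equivalent of the
# console 'import resources' flow. STACK, BUCKET and the LogicalResourceId
# are placeholders - the logical ID must match the stack template.
STACK="datalake-stack"
BUCKET="my-input-bucket"
TEMPLATE_URL="https://css-electronics-resources.s3.eu-central-1.amazonaws.com/stacks/import-s3-bucket-v1.3.0.json"

CMD="aws cloudformation create-change-set --stack-name ${STACK} --change-set-name import-input-bucket --change-set-type IMPORT --template-url ${TEMPLATE_URL} --resources-to-import '[{\"ResourceType\":\"AWS::S3::Bucket\",\"LogicalResourceId\":\"InputBucket\",\"ResourceIdentifier\":{\"BucketName\":\"${BUCKET}\"}}]'"
echo "$CMD"
```

After creating the change set you would review and execute it (e.g. via `aws cloudformation execute-change-set`), mirroring the ‘Import resources’ click in the console.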
3: Deploy integration
- Click ‘Stack actions/Create change set for current stack’
- Click ‘Replace current template’ and enter the below URL:
https://css-electronics-resources.s3.eu-central-1.amazonaws.com/stacks/glue-athena-v4.1.0-vG.5.0.json
- Enter a ‘UniqueID’ (e.g. datalake05)[7]
- Enter a valid email for notifications on Lambda errors/events[8]
- Click ‘Acknowledge’, ‘Submit’, wait ~1 min and click the upper-right refresh
- Click ‘Execute change set’ (and click it again in the popup), then wait ~1 min
Note
If the deployment fails, double-check that you uploaded the zip/scripts to your S3 input bucket
Note
If you later need to update the integration with a new revision, see this guide
4: Test your cloud function
- Upload a test MDF file from your CANedge into your input bucket via the S3 console
- Verify that the decoded Parquet files are created in your output bucket[9]
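The test upload and verification can also be sketched via the AWS CLI. The snippet below only prints the commands; the bucket name, device ID and object key are placeholders - match the folder structure your CANedge uses. The output bucket name follows the `<input-bucket-name>-parquet` convention noted below[9]:

```shell
# Sketch only (prints commands): upload a test MF4 file and list the decoded
# Parquet output. BUCKET and the object key (device/session/file) are placeholders.
BUCKET="my-input-bucket"
OUTPUT_BUCKET="${BUCKET}-parquet"   # output bucket naming convention per footnote [9]

echo "aws s3 cp 00000001.MF4 s3://${BUCKET}/AABBCCDD/00000001/00000001.MF4"
echo "aws s3 ls s3://${OUTPUT_BUCKET}/ --recursive"
```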
Your data lake will now get auto-filled when new MDF files are uploaded to the input bucket.
Note
If you wish to roll back your changes, simply delete the stack (your input bucket is not deleted)
5: Map your Parquet data lake to tables
- Verify that your S3 output bucket contains Parquet files
- Open AWS Glue Triggers in a new tab
- Select the ‘map-tables-on-demand’ trigger and click ‘Action/Start trigger’
- Open the trigger target Glue job, click ‘Runs’ and verify that it succeeds
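The trigger can likewise be started from the AWS CLI. The sketch below only prints the commands; the Glue job name is a placeholder - you can find the real name as the trigger's target in the Glue console:

```shell
# Sketch only (prints commands): start the mapping trigger and check the
# latest run of its target Glue job. JOB is a placeholder name.
TRIGGER="map-tables-on-demand"
JOB="my-mapping-job"

echo "aws glue start-trigger --name ${TRIGGER}"
echo "aws glue get-job-runs --job-name ${JOB} --max-results 1"
```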
Note
Glue adds metadata about your S3 output bucket. If new devices/messages are added to your Parquet data lake, the Glue job should be triggered again (manually or by schedule)[10]
You can now use Athena as a data source in e.g. Grafana-Athena dashboards. You can also check out the advanced topics to learn about event detection, trip summaries and more.
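Once the tables are mapped, you can also query them directly via the Athena CLI. Below is a hedged sketch that only prints the command; the database, table and result location are placeholders (your actual table names depend on your devices and DBC messages):

```shell
# Sketch only (prints the command): run an Athena SQL query against a mapped
# table. DB, the table name and OUT (query result location) are placeholders.
DB="my_datalake_db"
SQL="SELECT * FROM device_aabbccdd_can1 LIMIT 10"
OUT="s3://my-athena-results/"

CMD="aws athena start-query-execution --query-string \"${SQL}\" --query-execution-context Database=${DB} --result-configuration OutputLocation=${OUT}"
echo "$CMD"
```

The command returns a query execution ID, which you can pass to `aws athena get-query-results` once the query has completed.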
| [1] | You can upgrade your account from the free tier to paid in the ‘Cost and Usage’ section by clicking ‘Upgrade Plan’ |
| [2] | If you have connected a CANedge2/CANedge3 to an AWS S3 bucket then this is your input bucket |
| [3] | If your MDF files are encrypted (MFE, MFM), also upload your passwords.json file |
| [4] | Changelogs: automation scripts | mapping script |
| [5] | If you have created your S3 input bucket via CloudFormation, you should ensure that this stack is deleted before proceeding (the default name is s3-input-bucket-stack). Further, to import your input bucket via CloudFormation, you must select the same region as your bucket via the menu in the upper-right corner in the AWS console. |
| [6] | The CloudFormation JSON stack templates can be downloaded to your local disk via the URL if you wish to review the contents |
| [7] | The unique ID can be useful if you e.g. need to deploy multiple separate Parquet data lakes |
| [8] | We recommend confirming the subscription to error/event emails, as it helps you monitor your workflow - but you can of course disable these emails if preferred. You can edit your SNS subscriptions within AWS. |
| [9] | The output bucket will be named <input-bucket-name>-parquet - you can find it via the AWS S3 console |
| [10] | New Parquet files added for existing devices/messages will automatically be available for queries by Athena. A new Glue job run is only required if the new Parquet data reflects a previously ‘unmapped’ device or table. For most use cases, the manual trigger will therefore suffice. However, a scheduled trigger is recommended if you expect new devices/messages to be added frequently over time. To activate the scheduled trigger, select it and click ‘Action/Activate trigger’. A Glue job will normally cost ~$0.03/run (depending on data lake size), in which case a scheduled daily trigger would cost ~$10/year |