AWS S3 Lambda Parquet Data Lake

Here we explain how to create an Amazon Parquet data lake (with Lambda function automation).

The data lake can e.g. be used in Grafana-Athena dashboards, Excel, or Python/MATLAB scripts.


Overview

This guide enables you to set up an automated data pre-processing workflow. The setup consists of an ‘input bucket’ (for MDF/DBC files), an ‘output bucket’ (for Parquet files) and a ‘Lambda function’. The Lambda function auto-processes new MDF files uploaded to the input bucket and outputs them as decoded Parquet files in the output bucket.
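As an illustration of the end result (cf. the Python use case above), the sketch below reads one decoded Parquet file from the output bucket via boto3 and pandas. The bucket name and object key are placeholders; the actual folder structure in your output bucket may differ.

    # Minimal sketch: read a decoded Parquet file from the output bucket (placeholder names)
    from io import BytesIO

    import boto3
    import pandas as pd

    s3 = boto3.client("s3")

    # hypothetical bucket name and object key (adjust to your own data lake)
    obj = s3.get_object(
        Bucket="your-input-bucket-parquet",
        Key="AABBCCDD/CAN1_EngineSpeed/2024/01/01/00000001.parquet",
    )
    df = pd.read_parquet(BytesIO(obj["Body"].read()))
    print(df.head())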

Note

The below assumes that you have an AWS account and an S3 input bucket[1][3]. If not, see this guide.

Note

Ensure you test the MF4 decoders with your log files & DBC files locally before proceeding.


1: Upload Lambda zip and DBC files to input bucket

  1. Upload the below zip and your prepared DBC files to the root of your input bucket via the S3 console[2]

Lambda zip | changelog
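If you prefer to script the upload rather than use the S3 console, a minimal boto3 sketch could look as follows. The bucket and file names are placeholders; use your actual Lambda zip, DBC file(s) and input bucket name.

    # Minimal sketch: upload the Lambda zip and DBC files to the input bucket root (placeholder names)
    import boto3

    s3 = boto3.client("s3")
    input_bucket = "your-input-bucket"  # replace with your input bucket name

    for filename in ["lambda-function.zip", "my-vehicle.dbc"]:  # placeholder file names
        # use the file name as the object key, placing the files in the bucket root
        s3.upload_file(filename, input_bucket, filename)
        print(f"uploaded {filename} to s3://{input_bucket}/{filename}")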

Note

If you later need to update the Lambda zip, see this guide


2: Import input bucket

  1. Open AWS CloudFormation and ensure you are in the same region as your input bucket

  2. Verify that there are no existing stacks[4]

  3. Click the upper-right ‘Create stack/With existing resources (import resources)’

  4. Enter the below URL in the ‘Amazon S3 URL’ field[5]:

    https://css-electronics-resources.s3.eu-central-1.amazonaws.com/stacks/import-s3-bucket-v1.3.0.json

  5. Enter the name of your existing AWS S3 input bucket

  6. Enter a ‘stack’ name (e.g. datalake-stack) and your input bucket name again

  7. Click ‘Import resources’ once the ‘Changes’ have loaded, wait ~1 min and hit F5
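As noted in [5], the stack template can be reviewed before importing. The sketch below fetches the template via the Python standard library and lists the resources it defines.

    # Minimal sketch: download the import-stack template for local review
    import json
    import urllib.request

    url = (
        "https://css-electronics-resources.s3.eu-central-1.amazonaws.com"
        "/stacks/import-s3-bucket-v1.3.0.json"
    )
    with urllib.request.urlopen(url) as response:
        template = json.loads(response.read())

    # list the resources defined by the template
    print(list(template.get("Resources", {}).keys()))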


3: Create output bucket and Lambda

  1. Click ‘Stack actions/Create change set for current stack’

  2. Click ‘Replace current template’ and enter below:

    https://css-electronics-resources.s3.eu-central-1.amazonaws.com/stacks/output-lambda-v2.0.5.json

  3. Enter a ‘UniqueID’ (e.g. datalake01)[6]

  4. Click ‘Acknowledge’, ‘Submit’, wait ~1 min and click the upper-right refresh

  5. Click ‘Execute change set’ (and click it again in the popup), then wait ~1 min
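If you prefer to monitor the deployment from a script instead of refreshing the console, the stack status can be polled via boto3. The stack name and region below are placeholders; use the values from your own setup.

    # Minimal sketch: poll the CloudFormation stack status (placeholder stack name/region)
    import time

    import boto3

    cf = boto3.client("cloudformation", region_name="eu-central-1")  # your input bucket's region

    while True:
        stack = cf.describe_stacks(StackName="datalake-stack")["Stacks"][0]
        print(stack["StackStatus"])
        if not stack["StackStatus"].endswith("_IN_PROGRESS"):
            break
        time.sleep(10)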

Note

If the deployment fails, double check that you uploaded the Lambda zip to your S3 input bucket


4: Test Lambda

  1. Upload a test MDF file from your CANedge into your input bucket via the S3 console
  2. Verify that the decoded Parquet files are created in your output bucket[7]
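The verification in step 2 can also be done programmatically by listing the output bucket, e.g. via the boto3 sketch below (bucket name per [7], placeholder shown).

    # Minimal sketch: list decoded Parquet objects in the output bucket (placeholder name)
    import boto3

    s3 = boto3.client("s3")
    output_bucket = "your-input-bucket-parquet"  # <input-bucket-name>-parquet, cf. [7]

    response = s3.list_objects_v2(Bucket=output_bucket)
    for obj in response.get("Contents", []):
        if obj["Key"].endswith(".parquet"):
            print(obj["Key"])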

Your data lake will now be auto-filled whenever new MDF files are uploaded to the input bucket.

Note

If you wish to roll back your changes, simply delete the stack (your input bucket is not deleted)


Next, you can e.g. set up Amazon Athena to enable Grafana-Athena dashboards, or check the advanced topics to learn how to process your historical backlog of MDF files.
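Once Athena is set up, the data lake can also be queried directly from Python, e.g. via the awswrangler package. The database and table names below are hypothetical placeholders; they depend on how your Athena/Glue resources are configured.

    # Minimal sketch: query the Parquet data lake via Athena from Python (placeholder names)
    import awswrangler as wr

    df = wr.athena.read_sql_query(
        sql="SELECT * FROM can1_enginespeed LIMIT 100",  # hypothetical table name
        database="your_datalake_db",  # hypothetical Glue/Athena database
    )
    print(df.head())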


[1]If you have connected a CANedge2/CANedge3 to an AWS S3 bucket, this is your input bucket
[2]If your MDF files are encrypted (MFE, MFM), also upload your passwords.json file
[3]If this is your first time deploying this integration, consider creating a ‘playground’ input bucket that is separate from your ‘production’ input bucket (where your CANedge units are uploading data to). This allows you to test the full integration with sample MDF files - after which you can deploy the setup on your production input bucket.
[4]If you have created your S3 input bucket via CloudFormation, you should ensure that this stack is deleted before proceeding (the default name is s3-input-bucket-stack). Further, to import your input bucket via CloudFormation, you must select the same region as your bucket via the menu in the upper-right corner in the AWS console.
[5]The CloudFormation JSON stack templates can be downloaded to your local disk via the URL if you wish to review the contents
[6]The unique ID can be useful if you e.g. need to deploy multiple separate Parquet data lakes
[7]The output bucket will be named <input-bucket-name>-parquet; you can find it via the AWS S3 console