Parquet data lake aggregation - Amazon

For many use cases you will need to analyze data across devices, messages and trips. To do this at scale, it can be useful to perform periodic data aggregation and add ‘trip summary tables’ to your data lake. These can e.g. be used to create dashboard overviews across all your trips.
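To illustrate the idea, the sketch below aggregates message-level records into one summary row per trip in pure Python. All field names (trip_id, speed_kmh, distance_km) and statistics are assumptions for illustration only; the actual aggregation script defines its own schema via aggregations.json.

```python
from collections import defaultdict

# Hypothetical message-level records; field names are illustrative only
records = [
    {"trip_id": "trip_1", "speed_kmh": 40.0, "distance_km": 1.2},
    {"trip_id": "trip_1", "speed_kmh": 60.0, "distance_km": 2.5},
    {"trip_id": "trip_2", "speed_kmh": 80.0, "distance_km": 4.0},
]

def summarize_trips(rows):
    """Aggregate message-level rows into one summary row per trip."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["trip_id"]].append(row)
    summaries = []
    for trip_id, trip_rows in sorted(grouped.items()):
        speeds = [r["speed_kmh"] for r in trip_rows]
        summaries.append({
            "trip_id": trip_id,
            "message_count": len(trip_rows),
            "avg_speed_kmh": sum(speeds) / len(speeds),
            "total_distance_km": sum(r["distance_km"] for r in trip_rows),
        })
    return summaries

print(summarize_trips(records))
```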

Note

This guide (and technical support on it) is intended for advanced users.


Deploy aggregation script as AWS Glue job

If you have set up an AWS S3 Parquet data lake and Athena, you can automate this[2]:

  1. Verify that your integration uses the latest Lambda zip and Glue script[1]
  2. Download and unzip the aggregation script below
  3. Update the aggregations.json file with your preferred settings as per the README.md
  4. Upload the Python script to your S3 input bucket root via the AWS S3 console
  5. Upload the JSON file to your S3 output bucket root via the AWS S3 console
  6. Log into your AWS account, go to CloudFormation and select your data lake stack
  7. Click ‘Stack actions/Create change set for current stack’
  8. Click ‘Replace current template’ and enter below:
https://css-electronics-resources.s3.eu-central-1.amazonaws.com/stacks/data-aggregation-v2.0.5-vG.1.0-vA.0.1.json
  9. Click ‘Acknowledge’, ‘Submit’, wait ~1 min and click the upper-right refresh
  10. Click ‘Execute change set’ (and click it again in the popup), then wait ~1 min

The script will now run as a daily scheduled AWS Glue job, automatically adding new entries in S3 as aggregations/tripsummary/yyyy/mm/dd/yyyymmdd.parquet.
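The daily output key follows the date-based pattern above. As a small sketch (an assumption about the exact naming, derived from the pattern shown), the S3 key for a given run date could be constructed like this:

```python
from datetime import date

def trip_summary_key(d: date) -> str:
    """Build the daily output key following the
    aggregations/tripsummary/yyyy/mm/dd/yyyymmdd.parquet pattern."""
    return (
        f"aggregations/tripsummary/{d.year:04d}/{d.month:02d}/{d.day:02d}/"
        f"{d.year:04d}{d.month:02d}{d.day:02d}.parquet"
    )

print(trip_summary_key(date(2023, 5, 7)))
# → aggregations/tripsummary/2023/05/07/20230507.parquet
```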

aggregation script | changelog


Test script + aggregate backlog of historical data

We recommend testing the script by aggregating your historical data:

  1. Open the ‘aggregate-data-from-s3’ AWS Glue job ‘Script’ tab
  2. Find the commented out start_date and end_date variables and uncomment them
  3. Specify the date range you wish to process, click ‘Save’ and then ‘Run’
  4. Verify that the run succeeds via the ‘Runs’ tab
  5. Once completed, go to the ‘Script’ tab, comment out the two variables and click ‘Save’

If successful, you should now see a new folder in your S3 output bucket called aggregations/.
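The backlog run above processes a range of dates. As a minimal sketch, such a range can be expanded into individual days as shown below; the start_date/end_date names mirror the variables in the Glue script, but the expansion logic here is an assumption for illustration.

```python
from datetime import date, timedelta

def dates_in_range(start_date: date, end_date: date):
    """Yield every date from start_date through end_date, inclusive."""
    current = start_date
    while current <= end_date:
        yield current
        current += timedelta(days=1)

# Example: process a short historical backlog one day at a time
for d in dates_in_range(date(2023, 5, 1), date(2023, 5, 3)):
    print(d.isoformat())
```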

Note

We recommend testing a single date first and verifying that everything works as expected.


Run Glue job to map trip summary table

To query the data via Athena you need to perform a one-time mapping of the table:

  1. Open AWS Glue Triggers in a new tab
  2. Select the original ‘on-demand’ trigger and click ‘Action/Start trigger’
  3. Open your database under AWS Glue Databases
  4. Verify that tbl_aggregations_tripsummary shows up (this may take a few minutes)

You can now e.g. visualize this data via our Grafana-Athena trip summary dashboard template.
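Once the table is mapped, it can be queried via Athena SQL. The sketch below builds an example query string against tbl_aggregations_tripsummary for a single day; the year/month/day partition column names are assumptions based on the output path layout, so adjust them to match your mapped table.

```python
from datetime import date

def trip_summary_query(d: date, table: str = "tbl_aggregations_tripsummary") -> str:
    """Build an example Athena query for one day's trip summaries.
    The partition column names (year/month/day) are assumptions."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE year = '{d.year:04d}' AND month = '{d.month:02d}' AND day = '{d.day:02d}'"
    )

print(trip_summary_query(date(2023, 5, 7)))
```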


[1]If in doubt, update by following these guides: 1) Update your Lambda zip and 2) update your Glue script
[2]Instead of deploying the aggregation script via AWS Glue, you can also deploy the script locally and use it on a Parquet data lake stored on S3 or locally. Generally, the script serves as a starting point and can be modified as you see fit - but this is of course beyond the scope of our support.