Parquet data lake aggregation - Amazon

For many use cases you will need to analyze data across devices, messages and trips. To do this at scale, it can be useful to perform periodic data aggregation and add ‘trip summary tables’ to your data lake. These can e.g. be used to create dashboard overviews across all your trips.
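To illustrate the idea, the sketch below aggregates message-level records into one summary row per trip in pure Python. All field names (trip_id, speed_kmh, distance_km) and statistics are assumptions for illustration only; the actual aggregation script defines its own schema via aggregations.json.

```python
from collections import defaultdict

# Hypothetical message-level records; field names are illustrative only
records = [
    {"trip_id": "trip_1", "speed_kmh": 40.0, "distance_km": 1.2},
    {"trip_id": "trip_1", "speed_kmh": 60.0, "distance_km": 2.5},
    {"trip_id": "trip_2", "speed_kmh": 80.0, "distance_km": 4.0},
]

def summarize_trips(rows):
    """Aggregate message-level rows into one summary row per trip."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["trip_id"]].append(row)
    summaries = []
    for trip_id, trip_rows in sorted(grouped.items()):
        speeds = [r["speed_kmh"] for r in trip_rows]
        summaries.append({
            "trip_id": trip_id,
            "message_count": len(trip_rows),
            "avg_speed_kmh": sum(speeds) / len(speeds),
            "total_distance_km": sum(r["distance_km"] for r in trip_rows),
        })
    return summaries

print(summarize_trips(records))
```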

Note

This guide (and technical support on it) is intended for advanced users.


Deploy aggregation script as AWS Glue job

If you have set up an AWS S3 Parquet data lake and Athena, you can automate this[2]:

  1. Verify that your integration uses the latest Lambda zip and Glue script[1]
  2. Download and unzip the aggregation script below
  3. Update the aggregations.json file with your preferred settings as per the README.md
  4. Upload the Python script to your S3 input bucket root via the AWS S3 console
  5. Upload the JSON file to your S3 output bucket root via the AWS S3 console
  6. Log into your AWS account, go to CloudFormation and select your data lake stack
  7. Click ‘Stack actions/Create change set for current stack’
  8. Click ‘Replace current template’ and enter below:
https://css-electronics-resources.s3.eu-central-1.amazonaws.com/stacks/data-aggregation-v2.0.5-vG.1.0-vA.0.1.json
  9. Click ‘Acknowledge’, ‘Submit’, wait ~1 min and click the upper-right refresh
  10. Click ‘Execute change set’ (and click it again in the popup), then wait ~1 min

The script will now run as a daily scheduled AWS Glue job, automatically adding new entries in S3 as aggregations/tripsummary/yyyy/mm/dd/yyyymmdd.parquet.
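The daily output key follows the date-based pattern above. As a small sketch (an assumption about the exact naming, derived from the pattern shown), the S3 key for a given run date could be constructed like this:

```python
from datetime import date

def trip_summary_key(d: date) -> str:
    """Build the daily output key following the
    aggregations/tripsummary/yyyy/mm/dd/yyyymmdd.parquet pattern."""
    return (
        f"aggregations/tripsummary/{d.year:04d}/{d.month:02d}/{d.day:02d}/"
        f"{d.year:04d}{d.month:02d}{d.day:02d}.parquet"
    )

print(trip_summary_key(date(2023, 5, 7)))
# → aggregations/tripsummary/2023/05/07/20230507.parquet
```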

aggregation script | changelog


Test script + aggregate backlog of historical data

We recommend testing the script by aggregating your historical data:

  1. Open the ‘aggregate-data-from-s3’ AWS Glue job ‘Script’ tab
  2. Find the commented out start_date and end_date variables and uncomment them
  3. Specify the date range you wish to process, click ‘Save’ and then ‘Run’
  4. Verify that the run succeeds via the ‘Runs’ tab
  5. Once completed, go to the ‘Script’ tab, comment out the two variables and click ‘Save’

If successful, you should now see a new folder in your S3 output bucket called aggregations/.
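The backlog run above processes a range of dates. As a minimal sketch, such a range can be expanded into individual days as shown below; the start_date/end_date names mirror the variables in the Glue script, but the expansion logic here is an assumption for illustration.

```python
from datetime import date, timedelta

def dates_in_range(start_date: date, end_date: date):
    """Yield every date from start_date through end_date, inclusive."""
    current = start_date
    while current <= end_date:
        yield current
        current += timedelta(days=1)

# Example: process a short historical backlog one day at a time
for d in dates_in_range(date(2023, 5, 1), date(2023, 5, 3)):
    print(d.isoformat())
```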

Note

We recommend testing a single date first and verifying that everything works as expected.


Run Glue job to map trip summary table

To query the data via Athena you need to perform a one-time mapping of the table:

  1. Open AWS Glue Triggers in a new tab
  2. Select the original ‘on-demand’ trigger and click ‘Action/Start trigger’
  3. Open your database under AWS Glue Databases
  4. Verify that tbl_aggregations_tripsummary shows up (this may take a few minutes)

You can now e.g. visualize this data via our Grafana-Athena trip summary dashboard template.
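Once the table is mapped, it can be queried via Athena SQL. The sketch below builds an example query string against tbl_aggregations_tripsummary for a single day; the year/month/day partition column names are assumptions based on the output path layout, so adjust them to match your mapped table.

```python
from datetime import date

def trip_summary_query(d: date, table: str = "tbl_aggregations_tripsummary") -> str:
    """Build an example Athena query for one day's trip summaries.
    The partition column names (year/month/day) are assumptions."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE year = '{d.year:04d}' AND month = '{d.month:02d}' AND day = '{d.day:02d}'"
    )

print(trip_summary_query(date(2023, 5, 7)))
```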


[1]If in doubt, update by following these guides: 1) Update your Lambda zip and 2) update your Glue script
[2]Instead of deploying the aggregation script via AWS Glue, you can also deploy the script locally and use it on a Parquet data lake stored on S3 or locally. Generally, the script serves as a starting point and can be modified as you see fit - but this is of course beyond the scope of our support.