Add trip summaries to your data lake
For many use cases you will need to analyze data across devices, messages and trips. To do this at scale, it can be useful to perform periodic data aggregation and add ‘trip summary tables’ to your data lake. These can e.g. be used in creating dashboard overviews across all your trips.
Deploy aggregation script as AWS Glue job
If you have set up an AWS S3 Parquet data lake and Athena, you can automate this[2]:
- Verify that your integration uses the latest Lambda zip and Glue script[1]
- Download and unzip the aggregation script below
- Update the aggregations.json file with your preferred settings as per the README.md
- Upload the Python script to your S3 input bucket root via the AWS S3 console
- Upload the JSON file to your S3 output bucket root via the AWS S3 console
- Log into your AWS account, go to CloudFormation and select your data lake stack
- Click ‘Stack actions/Create change set for current stack’
- Click ‘Replace current template’ and enter the URL below:
https://css-electronics-resources.s3.eu-central-1.amazonaws.com/stacks/data-aggregation-v2.0.7-vG.3.0-vA.0.2.json
- Click ‘Acknowledge’, ‘Submit’, wait ~1 min and click the upper-right refresh
- Click ‘Execute change set’ (and click it again in the popup), then wait ~1 min
The script will now run as a daily scheduled AWS Glue job, automatically adding new entries in S3 as aggregations/tripsummary/yyyy/mm/dd/yyyymmdd.parquet.
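To make the naming pattern above concrete, here is a minimal Python sketch that builds the S3 object key for a given day. The function name trip_summary_key and the prefix parameter are hypothetical and used for illustration only; they are not taken from the aggregation script.

```python
from datetime import date


def trip_summary_key(day: date, prefix: str = "aggregations/tripsummary") -> str:
    """Build the object key for one day's trip summary file.

    Follows the documented pattern <prefix>/yyyy/mm/dd/yyyymmdd.parquet.
    The function and its parameters are illustrative assumptions.
    """
    return f"{prefix}/{day:%Y/%m/%d}/{day:%Y%m%d}.parquet"


print(trip_summary_key(date(2024, 3, 5)))
# aggregations/tripsummary/2024/03/05/20240305.parquet
```

A helper like this can be handy when you want to locate or download the summary file for a specific day.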
Test script + aggregate backlog of historical data
We recommend testing the script by aggregating your historical data:
- Open the ‘aggregate-data-from-s3’ AWS Glue job ‘Script’ tab
- Find the commented-out start_date and end_date variables and uncomment them
- Specify the date range you wish to process, click ‘Save’ and then ‘Run’
- Verify that the run succeeds via the ‘Runs’ tab
- Once completed, go to the ‘Script’ tab, comment out the two variables and click ‘Save’
If successful, you should now see a new folder in your S3 output bucket called aggregations/.
Note
We recommend testing a single date first to verify that everything works as expected.
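After a backlog run, it can help to compare the files in your output bucket against the files you would expect for the chosen date range. The sketch below lists the expected keys and reports which ones are missing. It assumes one file per day and an inclusive end date; both are assumptions, not confirmed behavior of the script, and days without logged data may not produce a file. The names expected_keys and missing_keys are hypothetical.

```python
from datetime import date, timedelta


def expected_keys(start: date, end: date, prefix: str = "aggregations/tripsummary"):
    """List the keys a backlog run might write for start..end.

    Assumes one file per day and an inclusive end date (illustrative only).
    """
    n_days = (end - start).days + 1
    days = (start + timedelta(days=i) for i in range(n_days))
    return [f"{prefix}/{d:%Y/%m/%d}/{d:%Y%m%d}.parquet" for d in days]


def missing_keys(expected, found):
    """Return the expected keys that are absent from the keys actually found."""
    found_set = set(found)
    return [key for key in expected if key not in found_set]


# Example: keys copied from the S3 console (or any S3 listing) for a 3-day test
found = [
    "aggregations/tripsummary/2024/01/01/20240101.parquet",
    "aggregations/tripsummary/2024/01/03/20240103.parquet",
]
print(missing_keys(expected_keys(date(2024, 1, 1), date(2024, 1, 3)), found))
# ['aggregations/tripsummary/2024/01/02/20240102.parquet']
```

A missing key is not necessarily an error. It may simply mean that no trips were recorded on that day.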
Run Glue job to map trip summary table
To query the data via Athena you need to perform a one-time mapping of the table:
- Open AWS Glue Triggers in a new tab
- Select the original ‘on-demand’ trigger and click ‘Action/Start trigger’
- Open your database under AWS Glue Databases
- Verify that tbl_aggregations_tripsummary shows up (this may take a few minutes)
You can now e.g. visualize this data via our Grafana-Athena trip summary dashboard template.
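As a quick check that the mapped table can be queried, you could also submit a small preview query to Athena from Python. The sketch below is one possible approach using boto3 and is not part of the aggregation script. The function names, the database name and the results location are placeholders you would replace with your own values, and the query selects all columns because the table's column names are not listed here.

```python
def build_preview_query(table: str = "tbl_aggregations_tripsummary", limit: int = 10) -> str:
    """Return a simple preview query for the trip summary table."""
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM {table} LIMIT {limit}"


def start_preview_query(database: str, output_location: str) -> str:
    """Submit the preview query to Athena and return the query execution ID.

    Requires boto3 and AWS credentials with Athena access.
    'database' and 'output_location' are placeholders for your own setup.
    """
    import boto3  # imported here so the query builder works without boto3 installed

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=build_preview_query(),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


print(build_preview_query())
# SELECT * FROM tbl_aggregations_tripsummary LIMIT 10
```

The query runs asynchronously, so the returned ID only confirms that it was submitted. You can follow the status and view the results in the Athena console.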
[1] If in doubt, update by following these guides: 1) Update your Lambda zip and 2) update your Glue script
[2] Instead of deploying the aggregation script via AWS Glue, you can also deploy the script locally and use it on a Parquet data lake stored on S3 or locally. Generally, the script serves as a starting point and can be modified as you see fit, but this is of course beyond the scope of our support.