Add trip summaries to your data lake
For many use cases you will need to analyze data across devices, messages and trips. To do this at scale, it can be useful to perform periodic data aggregation and add ‘trip summary tables’ to your Amazon/Google/Azure Parquet data lake. These can e.g. be used to create dashboard overviews across all your trips.
Setup aggregation via JSON file
To set up trip summary aggregation of Parquet files in your cloud, follow the steps below:
- Ensure you are using the latest data lake integration
- Download our example aggregations.json below and modify it as per your needs
- Once ready, upload it to your input bucket root (see the upload sketch below)
- Trigger the aggregation processing as outlined for your specific cloud below
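For the upload step, a minimal sketch on AWS could look as follows (the bucket name is an assumption - replace it with your actual input bucket, or use your cloud console / the equivalent upload method on Google/Azure):

# Hedged sketch: upload your edited aggregations.json to the root of your AWS S3 input bucket
# 'my-canedge-input-bucket' is a placeholder bucket name
import boto3

s3 = boto3.client("s3")
s3.upload_file(Filename="aggregations.json", Bucket="my-canedge-input-bucket", Key="aggregations.json")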
How it works
When you run the aggregation processing job, it will do the following:
- The script will download and validate your aggregations.json file
- It will go through each day of the period specified for each cluster/device
- It will identify trip windows based on the trip identifier signal (see the sketch after this list)
- For each trip window, it will calculate signal aggregations
- The aggregations are written to your data lake in the aggregations/tripsummary folder
- The resulting tables can be queried via e.g. our trip summary Grafana dashboards
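To make the trip logic concrete, below is a minimal, hedged pandas sketch of how trip windows and per-trip aggregations could be derived for a single message. The file path, column names and thresholds are illustrative assumptions, not necessarily identical to what the integration's script does.

# Hedged sketch: trip-window detection and per-trip aggregation with pandas
# Assumes a Parquet file with a timestamp column 't' and a 'Speed' signal (illustrative names)
import pandas as pd

TRIP_GAP_MIN = 10        # gap (minutes) that starts a new trip
TRIP_MIN_LENGTH_MIN = 1  # minimum trip length (minutes) to keep

df = pd.read_parquet("CAN9_GnssSpeed.parquet")
df["t"] = pd.to_datetime(df["t"])
df = df.sort_values("t")

# A new trip starts whenever the gap to the previous sample exceeds the threshold
df["trip_id"] = (df["t"].diff() > pd.Timedelta(minutes=TRIP_GAP_MIN)).cumsum()

# Aggregate per trip, then drop trips shorter than the minimum length
trips = df.groupby("trip_id").agg(
    trip_start=("t", "first"),
    trip_end=("t", "last"),
    speed_avg=("Speed", "mean"),
    speed_max=("Speed", "max"),
)
trips = trips[trips["trip_end"] - trips["trip_start"] >= pd.Timedelta(minutes=TRIP_MIN_LENGTH_MIN)]
print(trips)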
JSON syntax
To illustrate the JSON syntax, consider the example below:
{
"config": {
"date": {
"mode": "specific_period",
"start_date": "2025-01-01",
"end_date": "2025-01-15"
},
"trip": {
"trip_gap_min": 10,
"trip_min_length_min": 1
}
},
"device_clusters": [
{
"devices": ["2F6913DB", "ABCDEF12"],
"cluster": "cluster1"
}
],
"cluster_details": [
{
"clusters": ["cluster1"],
"details": {
"trip_identifier": {"message": "CAN9_GnssSpeed"},
"aggregations": [
{
"message": "CAN9_GnssSpeed",
"signal": ["Speed"],
"aggregation": ["avg", "max"]
},
{
"message": "CAN9_GnssPos",
"signal": ["Latitude", "Longitude"],
"aggregation": ["first", "last"]
}
]
}
}
]
}
- config: Top-level configuration section
  - date: Date range configuration
    - mode: Either specific_period (use explicit dates) or previous_day (automatic)
    - start_date/end_date: Required for specific_period mode (format: YYYY-MM-DD)
  - trip: Trip detection parameters
    - trip_gap_min: Minutes of inactivity after which a new trip is considered to have started
    - trip_min_length_min: Minimum trip length in minutes to be considered valid
- device_clusters: Group devices into logical clusters (e.g. by business logic)
  - devices: List of device IDs (serial numbers) to process (if empty, all devices are processed)
  - cluster: Name assigned to this group of devices
- cluster_details: Processing configuration for each cluster
  - clusters: List of cluster names to apply these settings to
  - details: Processing configuration, incl. the trip_identifier message [1] and the list of signal aggregations [2]
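As a quick sanity check before uploading, the structure above could be validated along these lines (a sketch only; the integration's own validation may be stricter):

# Hedged sketch: load aggregations.json and check for the expected top-level keys
import json

with open("aggregations.json") as f:
    cfg = json.load(f)

for key in ("config", "device_clusters", "cluster_details"):
    if key not in cfg:
        raise ValueError(f"Missing required section: {key}")

date_cfg = cfg["config"]["date"]
if date_cfg["mode"] == "specific_period" and not ("start_date" in date_cfg and "end_date" in date_cfg):
    raise ValueError("specific_period mode requires start_date and end_date")
print("aggregations.json looks structurally valid")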
Note
Using specific_period is ideal for testing/backlog processing, while previous_day is useful for daily scheduled automation (more below)
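To illustrate the difference between the two modes, the processed date range could be resolved roughly like this (a sketch, not the script's exact logic):

# Hedged sketch: resolve the date range from the 'date' section of aggregations.json
from datetime import date, timedelta

def resolve_dates(date_cfg: dict):
    if date_cfg["mode"] == "previous_day":
        d = date.today() - timedelta(days=1)  # yesterday, for daily scheduled runs
        return d, d
    # specific_period: explicit YYYY-MM-DD dates, useful for testing/backlog processing
    return date.fromisoformat(date_cfg["start_date"]), date.fromisoformat(date_cfg["end_date"])

print(resolve_dates({"mode": "specific_period", "start_date": "2025-01-01", "end_date": "2025-01-15"}))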
How to run
Amazon
- Open AWS Glue Triggers in a new tab
- Select the ‘process-aggregation-on-demand’ trigger and click ‘Action/Start trigger’
- View output logs by clicking the target job and the ‘Runs’ tab
- To schedule the job daily, activate the ‘process-aggregation-scheduled’ trigger
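If you prefer to start the run programmatically rather than via the console, below is a hedged sketch using boto3 (the trigger name follows the console steps above, but verify it matches your deployment):

# Hedged sketch: start the on-demand aggregation trigger via the AWS Glue API
import boto3

glue = boto3.client("glue")
glue.start_trigger(Name="process-aggregation-on-demand")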
Google
- Open Cloud Scheduler in your browser
- Select ‘aggregation-scheduler’, click RESUME and FORCE RUN
- After it completes, click PAUSE
- View output logs via the ‘aggregation’ function Logs tab in the console (runs)
- To schedule the job daily, click RESUME on the scheduler
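The same steps can be scripted via the Cloud Scheduler API; below is a hedged sketch using the google-cloud-scheduler client (the project ID, region and job path are assumptions to replace):

# Hedged sketch: RESUME, FORCE RUN and PAUSE the aggregation scheduler via the API
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
job_name = "projects/my-project/locations/europe-west1/jobs/aggregation-scheduler"  # assumed resource path

client.resume_job(name=job_name)  # RESUME
client.run_job(name=job_name)     # FORCE RUN (dispatches the job)
# Once the aggregation function has finished, pause the scheduler again:
client.pause_job(name=job_name)   # PAUSE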
Azure
- Open your ‘aggregation’ Container App Job via the console (jobs)
- Click ‘Run now’ (at the top) and then click the execution history
- View output logs via the execution console logs (may be delayed up to 15 min)
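If you want to trigger the job outside the portal, below is a hedged sketch using the Azure Python SDK (the subscription ID, resource group and SDK method are assumptions to verify against your environment):

# Hedged sketch: start the 'aggregation' Container App Job via azure-identity + azure-mgmt-appcontainers
from azure.identity import DefaultAzureCredential
from azure.mgmt.appcontainers import ContainerAppsAPIClient

client = ContainerAppsAPIClient(DefaultAzureCredential(), "<your-subscription-id>")
poller = client.jobs.begin_start(resource_group_name="<your-resource-group>", job_name="aggregation")
poller.wait()  # comparable to clicking 'Run now' and waiting for the execution to begin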
Note
When scheduling a daily job, make sure to set the mode to previous_day in your JSON
Note
If the job results in new devices/messages in your data lake, re-run your table mapping job
Note
The max run-time for a job is 60 min. For large backlogs, process a subset in each run
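For large backlogs, one option is to split the period into smaller windows and run the job once per window, e.g. as sketched below (the window size is an arbitrary assumption):

# Hedged sketch: split a large backlog into smaller date windows (one job run per window)
from datetime import date, timedelta

def date_windows(start: date, end: date, days_per_run: int = 7):
    cur = start
    while cur <= end:
        window_end = min(cur + timedelta(days=days_per_run - 1), end)
        yield cur.isoformat(), window_end.isoformat()
        cur = window_end + timedelta(days=1)

# Each (start_date, end_date) pair can be used in a separate aggregations.json run
for start_date, end_date in date_windows(date(2024, 1, 1), date(2024, 3, 31)):
    print(start_date, end_date)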
You can now e.g. visualize this data via our Grafana trip summary dashboard templates.
[1] Message and signal names are case sensitive and must be entered correctly. We recommend copy/pasting message names directly from the relevant folders of your data lake, as they must match the folder names exactly (do not copy them from e.g. the DBC file). Signal names can be copied from a Parquet file within the relevant message folder (the free Tad Parquet file viewer works for this) or, alternatively, from your DBC file.
[2] The default script supports the following aggregation methods: min, max, avg, median, sum, first, last, delta_sum, delta_sum_pos, delta_sum_neg. The delta sums are calculated by summing the sample-to-sample deltas of the signal values, optionally filtering for only positive or negative deltas. You can easily expand the list by modifying the script accordingly.
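To clarify the delta sum methods, below is a hedged pandas sketch of how they could be computed for a series of signal samples within one trip (the default script's exact implementation may differ):

# Hedged sketch: the delta sum aggregations on a pandas Series of signal values
import pandas as pd

signal = pd.Series([0.0, 2.0, 5.0, 3.0, 7.0])  # example signal samples within one trip
deltas = signal.diff().dropna()

delta_sum = deltas.sum()                  # sum of all sample-to-sample deltas
delta_sum_pos = deltas[deltas > 0].sum()  # only increases (e.g. counter-like signals)
delta_sum_neg = deltas[deltas < 0].sum()  # only decreases
print(delta_sum, delta_sum_pos, delta_sum_neg)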