Add trip summaries to your data lake


For many use cases, you will need to analyze data across devices, messages and trips. To do this at scale, it can be useful to perform periodic data aggregation and add ‘trip summary tables’ to your Amazon/Google/Azure Parquet data lake. These can e.g. be used to create dashboard overviews across all your trips.


Set up aggregation via JSON file

To set up trip summary aggregation of Parquet files in your cloud, follow the steps below:

  1. Ensure you are using the latest data lake integration
  2. Download our example aggregations.json below and modify it to suit your needs
  3. Once ready, upload it to your input bucket root (see the upload sketch below)
  4. Trigger the aggregation processing as outlined for your specific cloud below

aggregations.json
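If you prefer to upload the file programmatically rather than via your cloud console, a minimal sketch for an AWS S3 input bucket is shown below (the bucket name is a placeholder; for Google/Azure, use the equivalent storage client):

# Minimal sketch: upload aggregations.json to the root of an AWS S3 input bucket
# Assumes boto3 is installed and AWS credentials are configured
# "your-input-bucket" is a placeholder - replace with your own input bucket name
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="aggregations.json",  # local path to your modified file
    Bucket="your-input-bucket",    # your input bucket
    Key="aggregations.json",       # must be placed in the bucket root
)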

How it works

When you run the aggregation processing job, it will do the following:

  1. The script will download and validate your aggregations.json file
  2. It will go through each day of the period specified for each cluster/device
  3. It will identify trip windows based on the trip identifier signal
  4. For each trip window, it will calculate signal aggregations (see the sketch below this list)
  5. The aggregations are written to your data lake in the aggregations/tripsummary folder
  6. The resulting tables can be queried via e.g. our trip summary Grafana dashboards
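The trip-window detection and per-trip aggregation can be illustrated with a minimal pandas sketch (a simplified illustration only, not the actual aggregation script; the DataFrame and signal names are examples):

# Simplified sketch of the trip-window and aggregation logic (illustrative only)
# df is assumed to be a decoded message DataFrame with a DatetimeIndex and a "Speed" column
import pandas as pd

def find_trip_windows(timestamps, trip_gap_min, trip_min_length_min):
    # Start a new trip whenever the gap between samples exceeds trip_gap_min
    timestamps = timestamps.sort_values()
    new_trip = timestamps.diff() > pd.Timedelta(minutes=trip_gap_min)
    trip_ids = new_trip.cumsum()
    windows = []
    for _, ts in timestamps.groupby(trip_ids):
        start, end = ts.iloc[0], ts.iloc[-1]
        # Discard trips shorter than trip_min_length_min
        if end - start >= pd.Timedelta(minutes=trip_min_length_min):
            windows.append((start, end))
    return windows

def summarize_trips(df, trip_gap_min=10, trip_min_length_min=1):
    # Calculate per-trip aggregations (here: avg/max of "Speed") for each trip window
    rows = []
    for start, end in find_trip_windows(df.index.to_series(), trip_gap_min, trip_min_length_min):
        trip = df.loc[start:end]
        rows.append({
            "trip_start": start,
            "trip_end": end,
            "Speed_avg": trip["Speed"].mean(),
            "Speed_max": trip["Speed"].max(),
        })
    return pd.DataFrame(rows)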

JSON syntax

To illustrate the JSON syntax, consider the example below:

{
  "config": {
    "date": {
      "mode": "specific_period",
      "start_date": "2025-01-01",
      "end_date": "2025-01-15"
    },
    "trip": {
      "trip_gap_min": 10,
      "trip_min_length_min": 1
    }
  },
  "device_clusters": [
    {
      "devices": ["2F6913DB", "ABCDEF12"],
      "cluster": "cluster1"
    }
  ],
  "cluster_details": [
    {
      "clusters": ["cluster1"],
      "details": {
        "trip_identifier": {"message": "CAN9_GnssSpeed"},
        "aggregations": [
          {
            "message": "CAN9_GnssSpeed",
            "signal": ["Speed"],
            "aggregation": ["avg", "max"]
          },
          {
            "message": "CAN9_GnssPos",
            "signal": ["Latitude", "Longitude"],
            "aggregation": ["first", "last"]
          }
        ]
      }
    }
  ]
}
  • config: Top-level configuration section
    • date: Date range configuration
      • mode: Either specific_period (use explicit dates) or previous_day (automatic)
      • start_date/end_date: Required for specific_period mode (format: YYYY-MM-DD)
    • trip: Trip detection parameters
      • trip_gap_min: Minutes of inactivity after which a new trip is considered started
      • trip_min_length_min: Minimum trip length in minutes to be considered valid
  • device_clusters: Group devices into logical clusters (e.g. by business logic)
    • devices: List of device IDs (serial numbers) to process (if empty, all devices are processed)
    • cluster: Name assigned to this group of devices
  • cluster_details: Processing configuration for each cluster
    • clusters: List of cluster names to apply these settings to
    • details: Processing configuration
      • trip_identifier: Message used to identify trips
      • aggregations: List of signals to aggregate
        • message: Parquet data lake message folder name[1]
        • signal: List of signal names to aggregate
        • aggregation: List of aggregation functions[2]

Note

Using specific_period is ideal for testing/backlog processing, while previous_day is useful for daily scheduled automation (more below)
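For daily automation, the same structure is used with the mode set to previous_day. Below is a minimal sketch generating such a variant of the example file (illustrative only; note that start_date/end_date are only required in specific_period mode, and an empty devices list means all devices are processed):

# Sketch: generate an aggregations.json variant for daily scheduled automation
import json

config = {
    "config": {
        "date": {"mode": "previous_day"},  # start_date/end_date only required for specific_period
        "trip": {"trip_gap_min": 10, "trip_min_length_min": 1},
    },
    "device_clusters": [
        {"devices": [], "cluster": "cluster1"}  # empty list: all devices are processed
    ],
    "cluster_details": [
        {
            "clusters": ["cluster1"],
            "details": {
                "trip_identifier": {"message": "CAN9_GnssSpeed"},
                "aggregations": [
                    {
                        "message": "CAN9_GnssSpeed",
                        "signal": ["Speed"],
                        "aggregation": ["avg", "max"],
                    }
                ],
            },
        }
    ],
}

with open("aggregations.json", "w") as f:
    json.dump(config, f, indent=2)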


How to run

Amazon

  1. Open AWS Glue Triggers in a new tab
  2. Select the ‘process-aggregation-on-demand’ trigger and click ‘Action/Start trigger’ (or start it via the API, as sketched below)
  3. View output logs by clicking the target job and the ‘Runs’ tab
  4. To schedule the job daily, activate the ‘process-aggregation-scheduled’ trigger
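If you prefer to start the on-demand trigger programmatically instead of via the console, a minimal boto3 sketch is shown below (assumes AWS credentials with permission to start Glue triggers):

# Sketch: start the on-demand aggregation trigger via the AWS Glue API
import boto3

glue = boto3.client("glue")
response = glue.start_trigger(Name="process-aggregation-on-demand")
print("Started trigger:", response["Name"])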

Google

  1. Open Cloud Scheduler in your browser
  2. Select ‘aggregation-scheduler’, click RESUME and FORCE RUN (or script this, as sketched below)
  3. After it completes, click PAUSE
  4. View output logs via the ‘aggregation’ function Logs tab in the console (runs)
  5. To schedule the job daily, click RESUME on the scheduler
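The resume / force run / pause sequence can also be scripted via the google-cloud-scheduler client; a minimal sketch is shown below (the project ID and region are placeholders, and suitable credentials are assumed):

# Sketch: resume, force-run and pause the 'aggregation-scheduler' job programmatically
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
job_name = client.job_path("your-project-id", "europe-west1", "aggregation-scheduler")

client.resume_job(name=job_name)  # equivalent to RESUME in the console
client.run_job(name=job_name)     # equivalent to FORCE RUN
client.pause_job(name=job_name)   # equivalent to PAUSE (skip this to keep daily scheduling active)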

Azure

  1. Open your ‘aggregation’ Container App Job via the console (jobs)
  2. Click ‘Run now’ (at the top) and then click the execution history
  3. View output logs via the execution console logs (may be delayed up to 15 min)

Note

When scheduling a daily job, make sure to set the mode to previous_day in your JSON

Note

If the job results in new devices/messages in your data lake, re-run your table mapping job

Note

The max run-time for a job is 60 min. For large backlogs, process a subset in each run


You can now e.g. visualize this data via our Grafana trip summary dashboard templates.
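If you wish to inspect the aggregated output directly, a minimal sketch is shown below; it assumes you have downloaded the aggregations/tripsummary folder from your data lake to a local path (the resulting columns depend on the aggregations defined in your aggregations.json):

# Sketch: inspect the trip summary Parquet output with pandas
import pandas as pd

df = pd.read_parquet("tripsummary/")  # reads all Parquet files in the downloaded folder
print(df.columns.tolist())
print(df.head())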


[1]Message and signal names are case sensitive and must be entered exactly. We recommend copy/pasting the message names directly from the relevant folders of your data lake, as they must be identical to the folder names (do not copy them from e.g. the DBC file). Signal names are best copied from a Parquet file within the relevant message folder (you can use the free Tad Parquet file viewer for this); alternatively, you can copy the signal names from your DBC file.
[2]The default script supports the following aggregation methods: min, max, avg, median, sum, first, last, delta_sum, delta_sum_pos, delta_sum_neg. The delta sums are calculated by summing the deltas of the signal values, optionally including only the positive or negative deltas. You can easily expand the list by modifying the script accordingly.
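To illustrate the delta-based methods, a minimal pandas sketch with example values is shown below (illustrative only, not the actual script):

# Sketch: delta-based aggregations on an example signal series
import pandas as pd

speed = pd.Series([10.0, 12.5, 11.0, 15.0, 14.0])  # example signal values within one trip
deltas = speed.diff().dropna()                      # sample-to-sample deltas

delta_sum = deltas.sum()                  # sum of all deltas
delta_sum_pos = deltas[deltas > 0].sum()  # sum of positive deltas only
delta_sum_neg = deltas[deltas < 0].sum()  # sum of negative deltas only

print(delta_sum, delta_sum_pos, delta_sum_neg)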