Set up Amazon Athena

Athena makes it simple and fast to query data from an AWS S3 Parquet data lake. It can e.g. be used in Grafana-Athena dashboards, Excel or beyond-memory Python scripts.

In this section, we explain how you can set up Athena.


Prerequisites: AWS S3 data lake

  1. Set up AWS S3 Parquet data lake [~10 min]

Note

The above steps are required before proceeding


Set up Athena

  1. Ensure that you have completed the above prerequisites
  2. Download below Glue script and upload it to your input bucket root via the AWS S3 console
  3. Verify that your input bucket contains the latest Lambda zip
  4. Log into your AWS account, go to CloudFormation and select your data lake stack
  5. Click ‘Stack actions/Create change set for current stack’
  6. Click ‘Replace current template’ and enter below:
https://css-electronics-resources.s3.eu-central-1.amazonaws.com/stacks/glue-athena-v2.0.5-vG.1.0.json
  1. Click ‘Acknowledge’, ‘Submit’, wait ~1 min and click the upper-right refresh
  2. Click ‘Execute change set’ (and click it again in the popup), then wait ~1 min
  3. Verify that the deployment succeeds, then go to the stack ‘Outputs’ tab

The ‘Outputs’ tab contains the details required in using Athena as a data source.

Glue script | changelog


Trigger Glue job

  1. Open AWS Glue Triggers in a new tab
  2. Select the ‘on-demand’ trigger and click ‘Action/Start trigger’
  3. Open your database under AWS Glue Databases
  4. Verify that your database tables show up (this may take a few minutes)

Note

Glue adds ‘meta data’ about your S3 output bucket. If new devices/messages are added to your Parquet data lake, the Glue job should be triggered again (manually or by schedule)[1]


You are now ready to use Athena as a data source in e.g. Grafana-Athena dashboards.


[1]New Parquet files added for existing devices/messages will automatically be available for queries by Athena. A new Glue job run is only required if the new Parquet data reflects a previously ‘unmapped’ device or table. For most use cases, the manual trigger will therefore suffice. However, a scheduled trigger is recommended if you expect new devices/messages to be added frequently over time. To activate the scheduled trigger, select it and click ‘Action/Activate trigger’. A Glue job will normally cost ~0.03$/run (depending on data lake size), in which case a scheduled daily trigger would cost cost ~10$/year