Set up Amazon Athena

Amazon Athena interface

Athena makes it simple and fast to query data from your Amazon Parquet data lake via SQL. It can e.g. be used in Grafana-Athena dashboards, Excel or Python scripts.

In this section we explain how you can set up Athena.


Prerequisites

  1. Set up Azure Parquet data lake [~10 min]

Note

The above steps are required before proceeding


Deploy Athena and Glue

  1. Ensure that you have completed the prerequisites
  2. Download below Glue script and upload it to your input bucket root via the AWS S3 console
  3. Verify that your input bucket contains the latest Lambda zip
  4. Log into your AWS account, go to CloudFormation and select your data lake stack
  5. Click ‘Stack actions/Create change set for current stack’
  6. Click ‘Replace current template’ and enter below:
https://css-electronics-resources.s3.eu-central-1.amazonaws.com/stacks/glue-athena-v2.0.5-vG.1.0.json
  1. Click ‘Acknowledge’, ‘Submit’, wait ~1 min and click the upper-right refresh
  2. Click ‘Execute change set’ (and click it again in the popup), then wait ~1 min
  3. Verify that the deployment succeeds, then go to the stack ‘Outputs’ tab

The ‘Outputs’ tab contains the details required in using Athena as a data source.

Glue script | changelog

Note

If you later need to update the Glue script, see this guide


Trigger Glue job

  1. Verify that your S3 output bucket contains Parquet files
  2. Open AWS Glue Triggers in a new tab
  3. Select the ‘on-demand’ trigger and click ‘Action/Start trigger’
  4. Open your database under AWS Glue Databases
  5. Verify that your database tables show up (this may take a few minutes)

Note

Glue adds ‘meta data’ about your S3 output bucket. If new devices/messages are added to your Parquet data lake, the Glue job should be triggered again (manually or by schedule)[1]


You can now use Athena as a data source in e.g. Grafana-Athena dashboards. You can also check out the advanced topics to learn how to set up periodic data aggregation in your data lake.


[1]New Parquet files added for existing devices/messages will automatically be available for queries by Athena. A new Glue job run is only required if the new Parquet data reflects a previously ‘unmapped’ device or table. For most use cases, the manual trigger will therefore suffice. However, a scheduled trigger is recommended if you expect new devices/messages to be added frequently over time. To activate the scheduled trigger, select it and click ‘Action/Activate trigger’. A Glue job will normally cost ~0.03$/run (depending on data lake size), in which case a scheduled daily trigger would cost cost ~10$/year