Set up Amazon Athena
Athena makes it simple and fast to query data from an AWS S3 Parquet data lake. It can be used, for example, in Grafana-Athena dashboards, in Excel or in beyond-memory Python scripts.
In this section, we explain how you can set up Athena.
Prerequisites: AWS S3 data lake
- Set up AWS S3 Parquet data lake [~10 min]
Note
The above steps are required before proceeding.
Set up Athena
- Ensure that you have completed the above prerequisites
- Download the Glue script below and upload it to the root of your input bucket via the AWS S3 console
- Verify that your input bucket contains the latest Lambda zip
- Log into your AWS account, go to CloudFormation and select your data lake stack
- Click ‘Stack actions/Create change set for current stack’
- Click ‘Replace current template’ and enter the URL below:
https://css-electronics-resources.s3.eu-central-1.amazonaws.com/stacks/glue-athena-v2.0.5-vG.1.0.json
- Click ‘Acknowledge’ and ‘Submit’, wait ~1 min, then click the refresh button in the upper-right corner
- Click ‘Execute change set’ (and click it again in the popup), then wait ~1 min
- Verify that the deployment succeeds, then go to the stack ‘Outputs’ tab
The ‘Outputs’ tab contains the details required to use Athena as a data source.
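For repeated deployments, the console steps above can also be scripted. The following is a minimal sketch using boto3, not an official procedure: it assumes the new template keeps the parameters of the deployed stack (their current values are reused), and that the console's ‘Acknowledge’ checkbox corresponds to the IAM capabilities listed in the code. The function name and the generated change set name are hypothetical.

```python
import time

TEMPLATE_URL = (
    "https://css-electronics-resources.s3.eu-central-1.amazonaws.com"
    "/stacks/glue-athena-v2.0.5-vG.1.0.json"
)


def update_stack_with_template(cfn, stack_name, template_url=TEMPLATE_URL):
    """Replace the stack template via a change set and return the stack outputs.

    `cfn` is a boto3 CloudFormation client, e.g. boto3.client("cloudformation").
    """
    # Assumption: every existing stack parameter is kept with its current value
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    parameters = [
        {"ParameterKey": p["ParameterKey"], "UsePreviousValue": True}
        for p in stack.get("Parameters", [])
    ]

    # Hypothetical change set name; any unique name starting with a letter works
    change_set_name = f"glue-athena-{int(time.time())}"
    cfn.create_change_set(
        StackName=stack_name,
        ChangeSetName=change_set_name,
        ChangeSetType="UPDATE",
        TemplateURL=template_url,
        Parameters=parameters,
        # Assumption: these match what 'Acknowledge' confirms in the console
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    )
    # Wait until the change set is created, then execute it
    cfn.get_waiter("change_set_create_complete").wait(
        StackName=stack_name, ChangeSetName=change_set_name
    )
    cfn.execute_change_set(StackName=stack_name, ChangeSetName=change_set_name)
    cfn.get_waiter("stack_update_complete").wait(StackName=stack_name)

    # Same information as the stack 'Outputs' tab
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}
```

To use it, create the client with `boto3.client("cloudformation")` in the region of your data lake stack and pass your own stack name. The returned dictionary holds the same key/value pairs as the ‘Outputs’ tab.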
Trigger Glue job
- Open AWS Glue Triggers in a new tab
- Select the ‘on-demand’ trigger and click ‘Action/Start trigger’
- Open your database under AWS Glue Databases
- Verify that your database tables show up (this may take a few minutes)
Note
Glue adds metadata about your S3 output bucket. If new devices/messages are added to your Parquet data lake, the Glue job should be triggered again (manually or by schedule).[1]
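If you prefer to start the trigger from a script, the steps above can be approximated with boto3. This is a hedged sketch: the trigger and database names are placeholders you look up in the Glue console, and the function names are hypothetical. It simply polls until at least one table is listed, so it returns immediately if tables from an earlier run already exist.

```python
import time


def list_table_names(glue, database_name):
    """Return all table names in a Glue database, following pagination."""
    names, token = [], None
    while True:
        kwargs = {"DatabaseName": database_name}
        if token:
            kwargs["NextToken"] = token
        page = glue.get_tables(**kwargs)
        names += [t["Name"] for t in page.get("TableList", [])]
        token = page.get("NextToken")
        if not token:
            return names


def start_trigger_and_wait_for_tables(
    glue, trigger_name, database_name, timeout_s=600, poll_s=15, sleep=time.sleep
):
    """Start a Glue trigger, then poll until the database lists tables.

    `glue` is a boto3 Glue client, e.g. boto3.client("glue").
    Returns the table names found (possibly empty if the timeout is reached).
    """
    glue.start_trigger(Name=trigger_name)
    waited = 0
    while True:
        names = list_table_names(glue, database_name)
        if names or waited >= timeout_s:
            return names
        sleep(poll_s)
        waited += poll_s
```

Pass the name of the ‘on-demand’ trigger and the name of your Glue database. The 10 minute default timeout is an arbitrary choice that reflects the "few minutes" the tables may take to appear.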
You are now ready to use Athena as a data source in e.g. Grafana-Athena dashboards.
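As a quick check outside Grafana, Athena can also be queried from Python. The sketch below uses boto3 and is illustrative only: the function name is hypothetical, and the database name and the S3 location for query results are assumed to be values you take from the stack ‘Outputs’ tab. It reads only the first page of results.

```python
import time


def run_athena_query(athena, sql, database, output_location,
                     poll_s=1.0, sleep=time.sleep):
    """Run a query in Athena and return the first page of rows as lists of strings.

    `athena` is a boto3 Athena client, e.g. boto3.client("athena").
    `output_location` is an S3 URI where Athena stores the query results.
    """
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a final state
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        sleep(poll_s)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Athena query {status['State']}: {reason}")

    result = athena.get_query_results(QueryExecutionId=query_id)
    # Empty cells have no 'VarCharValue' key, so they become None
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

For example, passing `"SHOW TABLES"` as `sql` should list the tables that the Glue job mapped. For `SELECT` queries, the first returned row contains the column names.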
[1] New Parquet files added for existing devices/messages will automatically be available for queries by Athena. A new Glue job run is only required if the new Parquet data reflects a previously ‘unmapped’ device or table. For most use cases, the manual trigger will therefore suffice. However, a scheduled trigger is recommended if you expect new devices/messages to be added frequently over time. To activate the scheduled trigger, select it and click ‘Action/Activate trigger’. A Glue job will normally cost ~$0.03/run (depending on data lake size), in which case a scheduled daily trigger would cost ~$10/year.