Set up Amazon Athena
Amazon Athena makes it simple and fast to query the data in your Amazon Parquet data lake via SQL. It can be used in e.g. Grafana-Athena dashboards, Excel or Python scripts.
In this section we explain how you can set up Athena.
Prerequisites
- Set up Amazon Parquet data lake [~10 min]
Note
The above step is required before proceeding.
Deploy Athena and Glue
- Ensure that you have completed the prerequisites
- Download the Glue script below and upload it to the root of your input bucket via the AWS S3 console
- Verify that your input bucket contains the latest Lambda zip
- Log into your AWS account, go to CloudFormation and select your data lake stack
- Click ‘Stack actions/Create change set for current stack’
- Click ‘Replace current template’ and enter the URL below:
https://css-electronics-resources.s3.eu-central-1.amazonaws.com/stacks/glue-athena-v2.0.5-vG.1.0.json
- Click ‘Acknowledge’, ‘Submit’, wait ~1 min and click the upper-right refresh
- Click ‘Execute change set’ (and click it again in the popup), then wait ~1 min
- Verify that the deployment succeeds, then go to the stack ‘Outputs’ tab
The ‘Outputs’ tab contains the details you need in order to use Athena as a data source.
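If you prefer to read the stack outputs from a script rather than the console, the sketch below shows one possible way using boto3. It assumes boto3 is installed and AWS credentials are configured; the function names, the stack name and the region used in the example call are hypothetical placeholders, and the actual output keys depend on your stack.

```python
def outputs_to_dict(outputs):
    """Flatten the CloudFormation 'Outputs' list into a {key: value} dict."""
    return {item["OutputKey"]: item["OutputValue"] for item in outputs}


def fetch_stack_outputs(stack_name, region_name):
    """Return the outputs of a CloudFormation stack as a dict.

    boto3 is imported lazily so that outputs_to_dict() can be used
    (and tested) without AWS access.
    """
    import boto3

    client = boto3.client("cloudformation", region_name=region_name)
    stack = client.describe_stacks(StackName=stack_name)["Stacks"][0]
    # A stack without outputs has no 'Outputs' key in the response
    return outputs_to_dict(stack.get("Outputs", []))
```

For example, `fetch_stack_outputs("my-datalake-stack", "eu-central-1")` would return the same key/value pairs that are listed in the ‘Outputs’ tab (both arguments are placeholders for your own stack name and region).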
Note
If you later need to update the Glue script, see this guide
Trigger Glue job
- Verify that your S3 output bucket contains Parquet files
- Open AWS Glue Triggers in a new tab
- Select the ‘on-demand’ trigger and click ‘Action/Start trigger’
- Open your database under AWS Glue Databases
- Verify that your database tables show up (this may take a few minutes)
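The trigger and verification steps above can also be scripted. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the function names and the default wait time are illustrative, and the trigger and database names must be taken from your own AWS Glue console.

```python
import time


def table_names(get_tables_response):
    """Extract a sorted list of table names from a Glue get_tables response."""
    return sorted(t["Name"] for t in get_tables_response.get("TableList", []))


def start_trigger_and_list_tables(trigger_name, database_name, region_name,
                                  wait_seconds=120):
    """Start a Glue trigger, wait, then list the tables in the database.

    The fixed wait is a simplification: the job may need more or less time
    depending on the size of your data lake.
    """
    import boto3  # lazy import so table_names() works without AWS access

    glue = boto3.client("glue", region_name=region_name)
    glue.start_trigger(Name=trigger_name)
    time.sleep(wait_seconds)
    # Only the first page of results is read here; large databases may
    # require following 'NextToken' to get all tables.
    response = glue.get_tables(DatabaseName=database_name)
    return table_names(response)
```

If the returned list is empty, the job may still be running; waiting a few minutes and listing the tables again mirrors the manual check in the console.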
Note
Glue adds metadata about your S3 output bucket. If new devices/messages are added to your Parquet data lake, the Glue job should be triggered again (manually or by schedule)[1]
You can now use Athena as a data source in e.g. Grafana-Athena dashboards. You can also check out the advanced topics to learn how to set up periodic data aggregation in your data lake.
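As an illustration of the Python script use case, the sketch below runs an SQL query through Athena with boto3 and returns the rows as dictionaries. It is an assumption-based example, not part of the deployed stack: the function names are hypothetical, and the database name, S3 location for query results and region must be taken from your own stack ‘Outputs’.

```python
import time


def rows_to_dicts(result_set):
    """Convert an Athena 'ResultSet' into a list of {column: value} dicts.

    For SELECT queries the first row holds the column names. Values are
    returned as strings; NULL cells have no 'VarCharValue' and become None.
    """
    rows = result_set.get("Rows", [])
    if not rows:
        return []
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]


def run_query(sql, database, output_location, region_name, poll_seconds=1.0):
    """Run a query in Athena, wait for it to finish and return the first page of rows."""
    import boto3  # lazy import so rows_to_dicts() works without AWS access

    athena = boto3.client("athena", region_name=region_name)
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a final state
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    results = athena.get_query_results(QueryExecutionId=query_id)
    return rows_to_dicts(results["ResultSet"])
```

Only the first page of results is returned here, so adding a `LIMIT` clause to exploratory queries keeps both the result size and the amount of data handled by the script small.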
[1] New Parquet files added for existing devices/messages will automatically be available for queries by Athena. A new Glue job run is only required if the new Parquet data reflects a previously ‘unmapped’ device or table. For most use cases, the manual trigger will therefore suffice. However, a scheduled trigger is recommended if you expect new devices/messages to be added frequently over time. To activate the scheduled trigger, select it and click ‘Action/Activate trigger’. A Glue job will normally cost ~0.03$/run (depending on data lake size), in which case a scheduled daily trigger would cost ~10$/year.
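The yearly cost estimate in the footnote is simply the per-run cost multiplied by the number of runs per year. The short sketch below makes this arithmetic explicit so you can plug in your own numbers; the function name and the default of one run per day are illustrative assumptions, and the actual per-run cost depends on your data lake size.

```python
def yearly_glue_cost(cost_per_run_usd, runs_per_day=1, days_per_year=365):
    """Estimate the yearly cost in USD of a scheduled Glue trigger."""
    return round(cost_per_run_usd * runs_per_day * days_per_year, 2)


# Daily trigger at the ~0.03$/run figure mentioned in the footnote
daily_estimate = yearly_glue_cost(0.03)
print(f"Estimated yearly cost: ~{daily_estimate}$")
```

With one run per day this lands in the region of the yearly figure quoted in the footnote; an hourly schedule would cost roughly 24 times as much.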