Set up Google BigQuery
BigQuery makes it simple and fast to query data from your Google Parquet data lake via SQL. It can e.g. be used in Grafana-BigQuery dashboards or Python scripts.
In this section we explain how you can set up BigQuery.
Prerequisites
- Set up Google Parquet data lake [~10 min]
Note
The above steps are required before proceeding
Create ‘admin’ service account (for BigQuery + Storage)
- Go to ‘IAM and admin/Service accounts/Create’
- Specify the name as bigquery-storage-admin
- Add the below roles and click ‘Done’[1]
  - BigQuery Admin
  - Storage Admin
- Open the service account and go to ‘Keys/Add key/Create new key/JSON’
- Download the key and name it bigquery-storage-admin-account.json
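Optionally, you can verify that the key works before proceeding. The sketch below (assuming the google-cloud-bigquery and google-cloud-storage Python packages are installed) loads the key and lists the project’s datasets and buckets:

```python
# Optional sanity check of the admin key (not part of the required steps)
# Assumes: pip install google-cloud-bigquery google-cloud-storage
from google.oauth2 import service_account
from google.cloud import bigquery, storage

# key file name as used in this guide
credentials = service_account.Credentials.from_service_account_file(
    "bigquery-storage-admin-account.json"
)

bq_client = bigquery.Client(credentials=credentials, project=credentials.project_id)
gcs_client = storage.Client(credentials=credentials, project=credentials.project_id)

print("Datasets:", [d.dataset_id for d in bq_client.list_datasets()])
print("Buckets:", [b.name for b in gcs_client.list_buckets()])
```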
Create ‘user’ service account (for BigQuery)
- Go to ‘IAM and admin/Service accounts/Create’
- Specify the name as bigquery-user
- Add the below roles and click ‘Done’[1]
  - BigQuery Data Viewer
  - BigQuery Job User
  - Storage Object Viewer
- Open the service account and go to ‘Keys/Add key/Create new key/JSON’
- Download the key and name it bigquery-user-account.json
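As with the admin key, you can optionally confirm that this read-only key is able to submit BigQuery jobs. A dry-run query (sketch below, assuming google-cloud-bigquery is installed) validates the permissions without scanning any data:

```python
# Optional sanity check of the read-only key via a dry-run query
from google.oauth2 import service_account
from google.cloud import bigquery

credentials = service_account.Credentials.from_service_account_file(
    "bigquery-user-account.json"
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

# dry_run=True means the query is validated and priced, but never executed
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query("SELECT 1", job_config=job_config)
print("Dry run OK, bytes processed:", job.total_bytes_processed)
```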
Create BigQuery data set
- In the GCP console go to BigQuery
- In the Explorer view, click the ‘…’ next to your project ID and select ‘Create data set’
- Name it lakedataset1 and use the same region as your buckets (e.g. europe-west3)
Note
The dataset must be named as above for scripts/templates to work
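If you prefer scripting over the console, the data set can also be created with the admin key via the BigQuery Python client. Below is a minimal sketch, assuming the dataset name and region from this guide (lakedataset1, europe-west3):

```python
# Create the BigQuery data set programmatically (alternative to the console steps)
from google.oauth2 import service_account
from google.cloud import bigquery

credentials = service_account.Credentials.from_service_account_file(
    "bigquery-storage-admin-account.json"
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

dataset = bigquery.Dataset(f"{credentials.project_id}.lakedataset1")
dataset.location = "europe-west3"  # must match the region of your buckets
client.create_dataset(dataset, exists_ok=True)
print("Data set lakedataset1 ready")
```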
Map your Parquet data lake to tables
- Verify that your output bucket contains Parquet files[2]
- Download and unzip the below script
- Place the bigquery-storage-admin-account.json next to the Python script
- Open the bigquery-map-tables.py via a text editor
- Add your details (project ID, data set ID, output bucket name) and save the script
- Run the script to map your current Parquet data lake to tables[3]
BigQuery map tables script | changelog
Note
The script adds metadata about your output bucket to BigQuery. If new devices/messages are added to your Parquet data lake, the script should be run again (manually or on a schedule)[4]
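For reference, the mapping itself boils down to creating BigQuery external tables that point at the Parquet files in your output bucket. The bundled script is the reference implementation; the snippet below is only an illustrative sketch, with the project ID, bucket name and device/message prefix as placeholders:

```python
# Illustrative sketch only - use the bundled bigquery-map-tables.py for the real mapping
from google.oauth2 import service_account
from google.cloud import bigquery

PROJECT_ID = "your-project-id"               # placeholder
DATASET_ID = "lakedataset1"
OUTPUT_BUCKET = "your-output-bucket"         # placeholder
DEVICE_MESSAGE = "AABBCCDD/CAN2_gnssspeed"   # hypothetical device/message prefix

credentials = service_account.Credentials.from_service_account_file(
    "bigquery-storage-admin-account.json"
)
client = bigquery.Client(credentials=credentials, project=PROJECT_ID)

# external table that reads the Parquet files directly from the output bucket
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = [f"gs://{OUTPUT_BUCKET}/{DEVICE_MESSAGE}/*"]

table_id = f"{PROJECT_ID}.{DATASET_ID}.{DEVICE_MESSAGE.replace('/', '_')}"
table = bigquery.Table(table_id)
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
print(f"Mapped gs://{OUTPUT_BUCKET}/{DEVICE_MESSAGE} -> {table_id}")
```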
You are now ready to use BigQuery as a data source in e.g. Grafana-BigQuery dashboards.
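For example, you can query one of the mapped tables from Python with the read-only key (sketch; the project and table names below are placeholders, replace them with your own):

```python
from google.oauth2 import service_account
from google.cloud import bigquery

credentials = service_account.Credentials.from_service_account_file(
    "bigquery-user-account.json"
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

# placeholder table - use one of the tables created by the mapping script
query = "SELECT * FROM `your-project-id.lakedataset1.AABBCCDD_CAN2_gnssspeed` LIMIT 10"
for row in client.query(query).result():
    print(dict(row))
```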
[1] | Once you are done testing, you can optionally restrict the service account further if needed via e.g. IAM conditions
[2] | If your output bucket is empty, you can upload a test MDF file to your input bucket to create some Parquet data |
[3] | This assumes that you have installed Python 3.11 and the requirements.txt dependencies (we recommend using a virtual environment). You can set this up by running python -m venv env & env\Scripts\activate & pip install -r requirements.txt
[4] | You only need to re-run the script if new tables are to be created, not if you simply add more data to an existing table |