Parquet as database with ClickHouse
Warning
The examples on this page are for advanced users and provided without support. Refer to the ClickHouse documentation for production deployment recommendations.
ClickHouse is a SQL Database Management System (DBMS) with support for a long range of data sources. With ClickHouse, it is possible to query Parquet data-lakes using SQL. The SQL interface opens up for a wide range of additional tools (not directly supporting Parquet).
See the ClickHouse documentation for an list of compatible tools (e.g. Grafana, Superset, Tableau and many more).
Some useful ClickHouse documentation links:
Installation: https://clickhouse.com/docs/en/install
Parquet: https://clickhouse.com/docs/en/integrations/data-formats/parquet
Example
The following provides the most basic example of how to query a Parquet data-lake through a SQL interface using ClickHouse.
Below shows the organization of files and directories used in the example (refer to this throughout the example).
clickhouse
: The ClickHouse executableserver
: Server working directory (manually created)config.d
: Server configuration directory (manually created, optional)user_files
: Directory containing the Parquet data-lake files (empty directory automatically created by the ClickHouse server on first start)
~/clickhouse/
├─ clickhouse
└─ server
├─ config.d
│ └─ config_custom.xml
└─ user_files
└─ AABBCCDD
└─ CAN2_GnssSpeed
└─ 2022
└─ 04
├─ 22
│ ├─ 00000975_00000001.parquet
│ └─ 00000976_00000001.parquet
├─ 23
│ ├─ 00000977_00000001.parquet
│ ├─ 00000978_00000001.parquet
│ ├─ 00000979_00000001.parquet
│ └─ 00000980_00000001.parquet
├─ 24
│ ├─ 00000981_00000001.parquet
│ ├─ 00000982_00000001.parquet
│ ├─ 00000983_00000001.parquet
│ ├─ 00000984_00000001.parquet
│ └─ 00000985_00000001.parquet
└─ 26
├─ 00000986_00000001.parquet
└─ 00000987_00000001.parquet
ClickHouse installation (Linux)
ClickHouse comes as a single binary (containing several tools). In below, the ClickHouse binary is downloaded and the server started.
$ mkdir ~/clickhouse # Create directory for the ClickHouse binary
$ cd ~/clickhouse # Enter directory
$ curl https://clickhouse.com/ | sh # Download the ClickHouse binary
$ mkdir server # Create a server working directory
$ cd server/ # Enter server working directory
$ ~/clickhouse/clickhouse server # Start server in working directory
Note
The ClickHouse server creates a range of files and directories in the working directory when started. It is recommended to start the server in a dedicated directory (~/clickhouse/server/
in above).
Parquet data-lake files
ClickHouse data-lake files can be stored locally or on S3 (locally or remotely, see S3 data-lake).
When started, the ClickHouse server creates a directory user_files
(in working directory). Local files stored in user_files
can be accessed through the SQL interface.
Note
The location of user_files
can be changed, see Custom user files path.
An example Parquet data-lake is copied to user_files
, as shown below.
$ cp AABBCCDD/ ~/clickhouse/server/user_files/ -r # Copy an example Parquet data-lake to user_files
SQL interface
With the ClickHouse server installed and the Parquet data-lake files placed in user_files
it is now possible to query the data-lake using SQL.
The SQL queries can be issued by any compatible tool. In this example, we use the ClickHouse client included in the ClickHouse binary (downloaded above).
The ClickHouse client is started with: ~/clickhouse/clickhouse client
. In the following, it is demonstrated how to create SQL tables referencing the Parquet data-lake. When created, the tables can be queried by any SQL compatible tool.
Show databases
ClickHouse comes with a default empty database named default
. We use this throughout the following examples for simplicity.
:) SHOW DATABASES
┌─name───────────────┐
│ INFORMATION_SCHEMA │
│ default │
│ information_schema │
│ system │
└────────────────────┘
Create table referencing the local Parquet data-lake
It is recommended to use the following table naming (all lower case): tbl_<DEVICE_ID>_<RULE>
(see Information on output-file organization).
Below a table named tbl_aabbccdd_can2_gnssspeed
referencing AABBCCDD/CAN2_GnssSpeed
is created.
:) CREATE TABLE tbl_aabbccdd_can2_gnssspeed ENGINE=File(Parquet, 'AABBCCDD/CAN2_GnssSpeed/**/*')
Ok.
The server replies Ok
, confirming that the reference is valid.
For more information on creating table referencing files see https://clickhouse.com/docs/en/engines/table-engines/special/file
Warning
All files referenced by one table must share the same data schema (as is the case for all files in CAN2_GnssSpeed
)
Query tables in database
Names of all tables in a database (here default
) can be queried.
:) SELECT table_name FROM information_schema.tables WHERE table_catalog='default'
┌─table_name──────────────────┐
│ tbl_aabbccdd_can2_gnssspeed │
└─────────────────────────────┘
Query columns in table
Columns in a table can be queried.
:) SELECT column_name FROM information_schema.columns WHERE table_catalog='default' and table_name='tbl_aabbccdd_can2_gnssspeed'
┌─column_name───┐
│ t │
│ Speed │
│ SpeedAccuracy │
│ SpeedValid │
└───────────────┘
Query data in table
Data in a table can be queried.
:) SELECT * FROM default.tbl_aabbccdd_can2_gnssspeed LIMIT 3
┌──────────────────────────t─┬─Speed─┬─SpeedAccuracy─┬─SpeedValid─┐
│ 2022-04-26 06:54:52.573150 │ 0.799 │ 0.438 │ 1 │
│ 2022-04-26 06:54:53.714250 │ 0.977 │ 0.514 │ 1 │
│ 2022-04-26 06:54:54.279700 │ 0.6 │ 0.434 │ 1 │
└────────────────────────────┴───────┴───────────────┴────────────┘
Drop table referencing data-lake
Table referencing Parquet files can be dropped.
:) DROP TABLE default.tbl_aabbccdd_can2_gnssspeed
Ok.
S3 data-lake
In above example, Parquet data-lake files were stored locally. ClickHouse also supports files stored on a S3 server. If so, the tables referencing the data should be created using the S3
engine, as demonstrated below:
:) CREATE TABLE tbl_aabbccdd_can2_gnssspeed ENGINE=S3('http://localhost:9050/datalake/AABBCCDD/CAN2_GnssSpeed/**/*', '<access_key>', '<secret_key>', 'Parquet')
For more information see https://clickhouse.com/docs/en/engines/table-engines/integrations/s3
Configuration file
ClickHouse has a wide range of configuration possibilities. Custom configuration files placed in the directory config.d
(created manually) are automatically merged with the default configuration.
Below demonstrates how a custom configuration file can be added to ClickHouse (here arbitrarily named config_custom.xml
).
~/clickhouse/
├─ clickhouse
└─ server
└─ config.d
└─ config_custom.xml
For more information see https://clickhouse.com/docs/en/operations/configuration-files
Custom user files path
As default, ClickHouse searches for user_files
(where the Parquet files are stored), in the server working directory. This is often an awkward location to store a data-lake.
A custom user_files
directory can be defined in a custom configuration file with:
<clickhouse>
<!-- Path to user files -->
<user_files_path>/mnt/datalake/</user_files_path>
</clickhouse>
Allow remote access
As default, ClickHouse does not allow external connections (localhost only). To allow external connections, use the listen_host
configuration field. Refer to the ClickHouse documentation for additional details.
<clickhouse>
<!-- Allow remote access -->
<listen_host>::</listen_host>
</clickhouse>