Parquet as database with ClickHouse

Warning

The examples on this page are for advanced users and provided without support. Refer to the ClickHouse documentation for production deployment recommendations.

ClickHouse is a SQL Database Management System (DBMS) with support for a long range of data sources. With ClickHouse, it is possible to query Parquet data-lakes using SQL. The SQL interface opens up for a wide range of additional tools (not directly supporting Parquet).

See the ClickHouse documentation for an list of compatible tools (e.g. Grafana, Superset, Tableau and many more).

Some useful ClickHouse documentation links:


Example

The following provides the most basic example of how to query a Parquet data-lake through a SQL interface using ClickHouse.

Below shows the organization of files and directories used in the example (refer to this throughout the example).

  • clickhouse: The ClickHouse executable

  • server: Server working directory (manually created)

  • config.d: Server configuration directory (manually created, optional)

  • user_files: Directory containing the Parquet data-lake files (empty directory automatically created by the ClickHouse server on first start)

Organization overview
 ~/clickhouse/
 ├─ clickhouse
 └─ server
    ├─ config.d
    │  └─ config_custom.xml
    └─ user_files
       └─ AABBCCDD
          └─ CAN2_GnssSpeed
             └─ 2022
                └─ 04
                   ├─ 22
                   │  ├─ 00000975_00000001.parquet
                   │  └─ 00000976_00000001.parquet
                   ├─ 23
                   │  ├─ 00000977_00000001.parquet
                   │  ├─ 00000978_00000001.parquet
                   │  ├─ 00000979_00000001.parquet
                   │  └─ 00000980_00000001.parquet
                   ├─ 24
                   │  ├─ 00000981_00000001.parquet
                   │  ├─ 00000982_00000001.parquet
                   │  ├─ 00000983_00000001.parquet
                   │  ├─ 00000984_00000001.parquet
                   │  └─ 00000985_00000001.parquet
                   └─ 26
                      ├─ 00000986_00000001.parquet
                      └─ 00000987_00000001.parquet

ClickHouse installation (Linux)

ClickHouse comes as a single binary (containing several tools). In below, the ClickHouse binary is downloaded and the server started.

$ mkdir ~/clickhouse                    # Create directory for the ClickHouse binary
$ cd ~/clickhouse                       # Enter directory
$ curl https://clickhouse.com/ | sh     # Download the ClickHouse binary
$ mkdir server                          # Create a server working directory
$ cd server/                            # Enter server working directory
$ ~/clickhouse/clickhouse server        # Start server in working directory

Note

The ClickHouse server creates a range of files and directories in the working directory when started. It is recommended to start the server in a dedicated directory (~/clickhouse/server/ in above).


Parquet data-lake files

ClickHouse data-lake files can be stored locally or on S3 (locally or remotely, see S3 data-lake).

When started, the ClickHouse server creates a directory user_files (in working directory). Local files stored in user_files can be accessed through the SQL interface.

Note

The location of user_files can be changed, see Custom user files path.

An example Parquet data-lake is copied to user_files, as shown below.

$ cp AABBCCDD/ ~/clickhouse/server/user_files/ -r   # Copy an example Parquet data-lake to user_files

SQL interface

With the ClickHouse server installed and the Parquet data-lake files placed in user_files it is now possible to query the data-lake using SQL.

The SQL queries can be issued by any compatible tool. In this example, we use the ClickHouse client included in the ClickHouse binary (downloaded above).

The ClickHouse client is started with: ~/clickhouse/clickhouse client. In the following, it is demonstrated how to create SQL tables referencing the Parquet data-lake. When created, the tables can be queried by any SQL compatible tool.


Show databases

ClickHouse comes with a default empty database named default. We use this throughout the following examples for simplicity.

:) SHOW DATABASES

┌─name───────────────┐
│ INFORMATION_SCHEMA │
│ default            │
│ information_schema │
│ system             │
└────────────────────┘

Create table referencing the local Parquet data-lake

It is recommended to use the following table naming (all lower case): tbl_<DEVICE_ID>_<RULE> (see Information on output-file organization).

Below a table named tbl_aabbccdd_can2_gnssspeed referencing AABBCCDD/CAN2_GnssSpeed is created.

:) CREATE TABLE tbl_aabbccdd_can2_gnssspeed ENGINE=File(Parquet, 'AABBCCDD/CAN2_GnssSpeed/**/*')

Ok.

The server replies Ok, confirming that the reference is valid.

For more information on creating table referencing files see https://clickhouse.com/docs/en/engines/table-engines/special/file

Warning

All files referenced by one table must share the same data schema (as is the case for all files in CAN2_GnssSpeed)


Query tables in database

Names of all tables in a database (here default) can be queried.

:) SELECT table_name FROM information_schema.tables WHERE table_catalog='default'

┌─table_name──────────────────┐
│ tbl_aabbccdd_can2_gnssspeed │
└─────────────────────────────┘

Query columns in table

Columns in a table can be queried.

:) SELECT column_name FROM information_schema.columns WHERE table_catalog='default' and table_name='tbl_aabbccdd_can2_gnssspeed'

┌─column_name───┐
│ t             │
│ Speed         │
│ SpeedAccuracy │
│ SpeedValid    │
└───────────────┘

Query data in table

Data in a table can be queried.

:) SELECT * FROM default.tbl_aabbccdd_can2_gnssspeed LIMIT 3

┌──────────────────────────t─┬─Speed─┬─SpeedAccuracy─┬─SpeedValid─┐
│ 2022-04-26 06:54:52.573150 │ 0.799 │         0.438 │          1 │
│ 2022-04-26 06:54:53.714250 │ 0.977 │         0.514 │          1 │
│ 2022-04-26 06:54:54.279700 │   0.6 │         0.434 │          1 │
└────────────────────────────┴───────┴───────────────┴────────────┘

Drop table referencing data-lake

Table referencing Parquet files can be dropped.

:) DROP TABLE default.tbl_aabbccdd_can2_gnssspeed

Ok.

S3 data-lake

In above example, Parquet data-lake files were stored locally. ClickHouse also supports files stored on a S3 server. If so, the tables referencing the data should be created using the S3 engine, as demonstrated below:

:) CREATE TABLE tbl_aabbccdd_can2_gnssspeed ENGINE=S3('http://localhost:9050/datalake/AABBCCDD/CAN2_GnssSpeed/**/*', '<access_key>', '<secret_key>', 'Parquet')

For more information see https://clickhouse.com/docs/en/engines/table-engines/integrations/s3


Configuration file

ClickHouse has a wide range of configuration possibilities. Custom configuration files placed in the directory config.d (created manually) are automatically merged with the default configuration.

Below demonstrates how a custom configuration file can be added to ClickHouse (here arbitrarily named config_custom.xml).

Custom configuration
 ~/clickhouse/
 ├─ clickhouse
 └─ server
    └─ config.d
       └─ config_custom.xml

For more information see https://clickhouse.com/docs/en/operations/configuration-files


Custom user files path

As default, ClickHouse searches for user_files (where the Parquet files are stored), in the server working directory. This is often an awkward location to store a data-lake.

A custom user_files directory can be defined in a custom configuration file with:

config_custom.xml
 <clickhouse>
     <!-- Path to user files -->
     <user_files_path>/mnt/datalake/</user_files_path>
 </clickhouse>

Allow remote access

As default, ClickHouse does not allow external connections (localhost only). To allow external connections, use the listen_host configuration field. Refer to the ClickHouse documentation for additional details.

config_custom.xml
 <clickhouse>
     <!-- Allow remote access -->
     <listen_host>::</listen_host>
 </clickhouse>