ClickHouse and FlashBlade S3

jboothomas
2 min read · Aug 22, 2023


In this quick how-to I will cover the steps to leverage Pure Storage FlashBlade S3 storage with a ClickHouse installation.

Disable SSL certificate validation

My lab environment does not have SSL certificates in place, so to disable SSL verification the following line must be set within the /etc/clickhouse-server/config.xml file:

<openSSL>
  ...
  <client>
    ...
    <verificationMode>none</verificationMode>
    ...
  </client>
</openSSL>
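Rather than editing config.xml directly, the same override can live in its own drop-in file, since ClickHouse merges everything under /etc/clickhouse-server/config.d/ into the main configuration. A minimal sketch of such a fragment, held as a string so it can be checked for well-formedness before deploying (the fragment itself is my own example, not taken from the lab setup above):

```python
import xml.etree.ElementTree as ET

# Hypothetical drop-in override for /etc/clickhouse-server/config.d/;
# ClickHouse accepts <clickhouse> (or the older <yandex>) as the root tag.
OVERRIDE = """\
<clickhouse>
  <openSSL>
    <client>
      <verificationMode>none</verificationMode>
    </client>
  </openSSL>
</clickhouse>
"""

root = ET.fromstring(OVERRIDE)  # raises ParseError if the XML is malformed
mode = root.findtext("openSSL/client/verificationMode")
print(mode)  # -> none
```

If the parse succeeds and the value reads back as `none`, the fragment is safe to drop into place before restarting the server.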

External datasets

After restarting the clickhouse-server service I connect using clickhouse-client and can run queries directly on datasets residing on my S3 storage:

:) SELECT count() AS count
FROM s3('https://192.168.2.2/nyctaxi/*.parquet', 'PSF....EIA', 'A12....OEN');

Query id: f4d3e05f-5bc9-461a-806f-be675d4b60ec

┌────count─┐
│ 39656098 │
└──────────┘

In the above example I provide the endpoint path to the bucket and dataset files, followed by the S3 access key and the S3 secret key.
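The same s3 table function can also be used to inspect the schema ClickHouse infers from the Parquet files before querying them. A sketch, with the truncated credentials from above replaced by placeholders:

```sql
DESCRIBE TABLE s3('https://192.168.2.2/nyctaxi/*.parquet', '<access_key>', '<secret_key>');
```

This returns the column names and ClickHouse types derived from the Parquet metadata, which is handy before loading the data into a native table.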

Clickhouse disks

In order to leverage FlashBlade S3 storage as a disk within ClickHouse, I create a configuration file at /etc/clickhouse-server/conf.d/fb03.xml with the following contents:

<yandex>
  <storage_configuration>
    <disks>
      <fb03>
        <type>s3</type>
        <endpoint>https://192.168.2.2/clickhouse//</endpoint>
        <access_key_id>PSFB....JEIA</access_key_id>
        <secret_access_key>A1211....JOEN</secret_access_key>
        <region></region>
        <use_path_style_addressing>true</use_path_style_addressing>
      </fb03>
      <s3_cache>
        <type>cache</type>
        <disk>fb03</disk>
        <path>/var/lib/clickhouse/disks/s3_cache/</path>
        <max_size>10Gi</max_size>
      </s3_cache>
    </disks>
    <policies>
      <external>
        <volumes>
          <s3>
            <disk>fb03</disk>
          </s3>
        </volumes>
      </external>
    </policies>
  </storage_configuration>
</yandex>
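One note on the use_path_style_addressing setting: with virtual-hosted-style addressing the bucket name becomes part of the hostname, which cannot work when the endpoint is a bare IP such as a FlashBlade data VIP, so path-style (bucket in the URL path) is the natural fit. A small illustration of the two URL forms:

```python
# Path-style vs virtual-hosted-style S3 addressing, using the endpoint
# and bucket from the config above.
endpoint = "192.168.2.2"   # FlashBlade data VIP
bucket = "clickhouse"

path_style = f"https://{endpoint}/{bucket}/some/key"
virtual_hosted = f"https://{bucket}.{endpoint}/some/key"

print(path_style)      # -> https://192.168.2.2/clickhouse/some/key
print(virtual_hosted)  # -> https://clickhouse.192.168.2.2/some/key
```

The second form would require DNS to resolve `clickhouse.192.168.2.2`, which is why path-style is enabled here.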

I restart the clickhouse-server service and reconnect with the client, then validate that the S3 storage disk is available within the ClickHouse environment:

:) SELECT * FROM system.disks;

SELECT *
FROM system.disks

Query id: a18918e0-5755-4556-973e-dcceca69b354

┌─name────┬─path────────────────────────────┬───────────free_space─┬──────────total_space─┬─────unreserved_space─┬─keep_free_space─┬─type──┬─is_encrypted─┬─is_read_only─┬─is_write_once─┬─is_remote─┬─is_broken─┬─cache_path─┐
│ default │ /var/lib/clickhouse/ │ 9573216256 │ 48420556800 │ 9573216256 │ 0 │ local │ 0 │ 0 │ 0 │ 0 │ 0 │ │
│ fb03 │ /var/lib/clickhouse/disks/fb03/ │ 18446744073709551615 │ 18446744073709551615 │ 18446744073709551615 │ 0 │ s3 │ 0 │ 0 │ 0 │ 1 │ 0 │ │
└─────────┴─────────────────────────────────┴──────────────────────┴──────────────────────┴──────────────────────┴─────────────────┴───────┴──────────────┴──────────────┴───────────────┴───────────┴───────────┴────────────┘

2 rows in set. Elapsed: 0.001 sec.
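The fb03 disk reports 18446744073709551615 bytes of free and total space; that is simply the maximum value of an unsigned 64-bit integer, which ClickHouse uses to mean "effectively unlimited", since object storage has no fixed capacity to report:

```python
# The free_space/total_space shown for the s3 disk is UInt64 max,
# ClickHouse's way of reporting "no real limit" for object storage.
uint64_max = 2**64 - 1
print(uint64_max)  # -> 18446744073709551615
```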

With this configured I can create a table on the FlashBlade S3 storage as follows:

CREATE TABLE my_first_table
(
    `user_id` UInt32,
    `message` String,
    `timestamp` DateTime,
    `metric` Float32
)
ENGINE = MergeTree
PRIMARY KEY (user_id, timestamp)
SETTINGS storage_policy = 'external'

In the above SQL statement the last line sets the storage policy to use for the table; in my case I named the policy "external" for the FlashBlade S3 disk.
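To confirm the table is usable, a few rows can be inserted and read back; a quick sketch with made-up values:

```sql
INSERT INTO my_first_table VALUES
    (101, 'Hello, ClickHouse!', now(), -1.0),
    (102, 'Stored on FlashBlade S3', now(), 2.718);

SELECT * FROM my_first_table ORDER BY timestamp;
```

The insert writes MergeTree parts through the fb03 disk, so the data lands as objects in the S3 bucket rather than on local storage.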

After creating this simple table I check the S3 bucket's contents:

$ aws s3api list-objects-v2  --profile fbstaines03 --no-verify-ssl --endpoint-url=https://192.168.2.2 --bucket clickhouse | grep Key
"Key": "clickhouse_remove_objects_capability_752e5429-bcba-452c-bcda-b6214d76ddf0",
"Key": "san/szimkjmkmjeidpvemqduhzbjidlst",
"Key": "ujj/gwjmmjasuhodjiliwmxvzkbwzozxn",

I ran a quick test and created a second table; a new object appeared in the FlashBlade bucket: "Key": "gat/ywggwlxvdvwivllvqrljdkdkekeag".
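Which disk a table's parts actually landed on can also be verified from inside ClickHouse, via the system.parts table:

```sql
SELECT table, name, disk_name
FROM system.parts
WHERE table = 'my_first_table' AND active;
```

For tables created with the "external" policy, disk_name should read fb03.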

Conclusion

ClickHouse combined with FlashBlade's speed and simplicity at scale offers a powerful platform to extract value from your growing datasets while keeping things simple. To learn more about Pure Storage FlashBlade, the only platform to provide consistent time to result at scale, see the Pure Storage website.


jboothomas

Infrastructure engineering for modern data applications