Trino S3 via hive-metastore integration
In this blog I will go over how to use S3 storage on a Pure Storage Flashblade with Trino the fast distributed SQL query engine for big data.
I deploy Trino using the helm chart and provide a values.yaml file with the following configuration:
additionalCatalogs:
lakehouse: |-
connector.name=hive
hive.metastore.uri=thrift://hivems-hive-metastore.analytics.svc.local:9083
hive.s3.aws-access-key=PSFB....BJEIA
hive.s3.aws-secret-key=A121....eJOEN
hive.s3.endpoint=192.168.2.2
hive.s3.ssl.enabled=false
This is pointing to my hive-metastore server see this blog post for more information on setting that up. I then edit the Trino service to switch from ClusterIP to NodePort so as to facilitate external access.
As usual I use the helm install command:
helm -n analytics install trino trino/trino -f values.yaml
On a linux client with the trino-cli installed, I use the following command to connect to my in Kubernetes running instance, and list the current catalogs available:
$ ./trino-cli-422-executable.jar --server http://10.225.114.50:32140
trino> show catalogs;
Catalog
-----------
hive
system
tpcds
tpch
(4 rows)
Query 20230810_122520_00000_gb9kf, FINISHED, 2 nodes
Splits: 68 total, 68 done (100.00%)
1.15 [0 rows, 0B] [0 rows/s, 0B/s]
trino>
I can then select my hive source and check the available tables:
trino> use hive.default;
USE
trino:default> show tables;
Table
---------------
nyctaxi_table
(1 row)
Query 20230810_140256_00004_uapkp, FINISHED, 3 nodes
Splits: 68 total, 68 done (100.00%)
0.45 [1 rows, 30B] [2 rows/s, 67B/s]
trino:default>
Note that to see the available schemas you can use:
trino> show schemas from hive;
Schema
--------------------
default
information_schema
(2 rows)
Query 20230810_152954_00003_a45jd, FINISHED, 3 nodes
Splits: 68 total, 68 done (100.00%)
0.61 [2 rows, 35B] [3 rows/s, 57B/s]
To see the table schema you can use:
trino:default> describe nyctaxi;
Column | Type | Extra | Comment
-----------------------+--------------+-------+---------
vendorid | bigint | |
tpep_pickup_datetime | timestamp(3) | |
tpep_dropoff_datetime | timestamp(3) | |
passenger_count | double | |
trip_distance | double | |
ratecodeid | double | |
store_and_fwd_flag | varchar | |
pulocationid | bigint | |
dolocationid | bigint | |
payment_type | bigint | |
fare_amount | double | |
extra | double | |
mta_tax | double | |
tip_amount | double | |
tolls_amount | double | |
improvement_surcharge | double | |
total_amount | double | |
(17 rows)
Query 20230810_163505_00104_a45jd, FINISHED, 3 nodes
Splits: 68 total, 68 done (100.00%)
0.28 [17 rows, 1.05KB] [60 rows/s, 3.72KB/s]
I can now run various queries on the dataset, please note this is from a very limited lab k8s cluster with low resources and network connectivity, so performance was not the aim:
trino:default> select count(*) from nyctaxi;
_col0
-----------
899294289
(1 row)
Query 20230815_132013_00029_4suvq, FINISHED, 1 node
Splits: 343 total, 343 done (100.00%)
1.04 [899M rows, 3.56MB] [865M rows/s, 3.43MB/s]
trino:default> select * from nyctaxi WHERE tolls_amount > 100 order by tolls_amount desc limit 10;
vendorid | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | ratecodeid | store_and_fwd_flag | pulocationid | dolocationid | payment_type | fare_amount | extra | mta
----------+-------------------------+-------------------------+-----------------+---------------+------------+--------------------+--------------+--------------+--------------+-------------+-------+----
2 | 2017-07-03 17:15:15.000 | 2017-07-03 17:35:57.000 | 1 | 9.74 | 1 | N | 138 | 107 | 2 | 27.5 | 1.0 |
2 | 2015-12-12 00:17:39.000 | 2015-12-12 01:14:18.000 | 1 | 38.48 | 5 | N | 68 | 265 | 2 | 200.0 | 0.0 |
1 | 2016-12-29 07:46:33.000 | 2016-12-29 08:03:36.000 | 1 | 4.7 | 1 | N | 151 | 223 | 2 | 16.5 | 0.0 |
1 | 2014-08-15 10:37:42.000 | 2014-08-15 11:07:21.000 | 1 | 10.0 | 1 | N | 230 | 138 | 3 | 32.0 | 0.0 |
1 | 2015-01-06 18:27:38.000 | 2015-01-06 19:04:04.000 | 2 | 18.5 | 2 | N | 264 | 264 | 2 | 52.0 | 0.0 |
1 | 2016-03-14 15:16:32.000 | 2016-03-14 16:27:11.000 | 1 | 18.5 | 2 | Y | 132 | 230 | 3 | 52.0 | 0.0 |
1 | 2016-05-02 09:10:21.000 | 2016-05-02 09:58:00.000 | 1 | 18.0 | 2 | N | 229 | 132 | 3 | 52.0 | 0.0 |
1 | 2017-11-28 23:24:24.000 | 2017-11-29 00:09:00.000 | 1 | 6.2 | 1 | N | 246 | 112 | 3 | 26.5 | 0.5 |
1 | 2014-09-14 12:37:44.000 | 2014-09-14 12:49:49.000 | 2 | 1.4 | 1 | N | 186 | 170 | 3 | 9.5 | 0.0 |
1 | 2015-11-20 21:35:41.000 | 2015-11-20 22:08:29.000 | 1 | 8.5 | 1 | N | 264 | 264 | 3 | 30.5 | 0.5 |
(10 rows)
Query 20230815_131646_00028_4suvq, FINISHED, 1 node
Splits: 353 total, 353 done (100.00%)
2.77 [487M rows, 6.36GB] [176M rows/s, 2.3GB/s]