Trino S3 via hive-metastore integration

4 min readAug 10, 2023

In this blog I will go over how to use S3 storage on a Pure Storage Flashblade with Trino the fast distributed SQL query engine for big data.

I deploy Trino using the hive chart and provide a values.yaml file with the following configuration:

additionalCatalogs:
  lakehouse: |-
    connector.name=hive
    hive.metastore.uri=thrift://hivems-hive-metastore.analytics.svc.local:9083    
    hive.s3.aws-access-key=PSFB....BJEIA
    hive.s3.aws-secret-key=A121....eJOEN
    hive.s3.endpoint=192.168.2.2
    hive.s3.ssl.enabled=false

This is pointing to my hive-metastore server see this blog post for more information on setting that up. I then edit the Trino service to switch from ClusterIP to NodePort so as to facilitate external access.

As usual I use the helm install command:

helm -n analytics install trino trino/trino  -f values.yaml

On a linux client with the trino-cli installed, I use the following command to connect to my in Kubernetes running instance, and list the current catalogs available:

$ ./trino-cli-422-executable.jar --server http://10.225.114.50:32140
trino> show catalogs;
  Catalog  
-----------
 hive 
 system    
 tpcds     
 tpch      
(4 rows)

Query 20230810_122520_00000_gb9kf, FINISHED, 2 nodes
Splits: 68 total, 68 done (100.00%)
1.15 [0 rows, 0B] [0 rows/s, 0B/s]

trino>

I can then select my hive source and check the available tables:

trino> use hive.default;
USE
trino:default>  show tables;
     Table     
---------------
 nyctaxi_table 
(1 row)

Query 20230810_140256_00004_uapkp, FINISHED, 3 nodes
Splits: 68 total, 68 done (100.00%)
0.45 [1 rows, 30B] [2 rows/s, 67B/s]

trino:default>

Note that to see the available schemas you can use:

trino> show schemas from hive;
       Schema       
--------------------
 default            
 information_schema 
(2 rows)

Query 20230810_152954_00003_a45jd, FINISHED, 3 nodes
Splits: 68 total, 68 done (100.00%)
0.61 [2 rows, 35B] [3 rows/s, 57B/s]

To see the table schema you can use:

trino:default> describe nyctaxi;
        Column         |     Type     | Extra | Comment 
-----------------------+--------------+-------+---------
 vendorid              | bigint       |       |         
 tpep_pickup_datetime  | timestamp(3) |       |         
 tpep_dropoff_datetime | timestamp(3) |       |         
 passenger_count       | double       |       |         
 trip_distance         | double       |       |         
 ratecodeid            | double       |       |         
 store_and_fwd_flag    | varchar      |       |         
 pulocationid          | bigint       |       |         
 dolocationid          | bigint       |       |         
 payment_type          | bigint       |       |         
 fare_amount           | double       |       |         
 extra                 | double       |       |         
 mta_tax               | double       |       |         
 tip_amount            | double       |       |         
 tolls_amount          | double       |       |         
 improvement_surcharge | double       |       |         
 total_amount          | double       |       |         
(17 rows)

Query 20230810_163505_00104_a45jd, FINISHED, 3 nodes
Splits: 68 total, 68 done (100.00%)
0.28 [17 rows, 1.05KB] [60 rows/s, 3.72KB/s]

I can now run various queries on the dataset, please note this is from a very limited lab k8s cluster with low resources and network connectivity, so performance was not the aim:

trino:default> select count(*) from nyctaxi;
   _col0   
-----------
 899294289 
(1 row)

Query 20230815_132013_00029_4suvq, FINISHED, 1 node
Splits: 343 total, 343 done (100.00%)
1.04 [899M rows, 3.56MB] [865M rows/s, 3.43MB/s]

trino:default> select * from nyctaxi WHERE tolls_amount > 100 order by tolls_amount desc limit 10;
 vendorid |  tpep_pickup_datetime   |  tpep_dropoff_datetime  | passenger_count | trip_distance | ratecodeid | store_and_fwd_flag | pulocationid | dolocationid | payment_type | fare_amount | extra | mta
----------+-------------------------+-------------------------+-----------------+---------------+------------+--------------------+--------------+--------------+--------------+-------------+-------+----
        2 | 2017-07-03 17:15:15.000 | 2017-07-03 17:35:57.000 |               1 |          9.74 |          1 | N                  |          138 |          107 |            2 |        27.5 |   1.0 |    
        2 | 2015-12-12 00:17:39.000 | 2015-12-12 01:14:18.000 |               1 |         38.48 |          5 | N                  |           68 |          265 |            2 |       200.0 |   0.0 |    
        1 | 2016-12-29 07:46:33.000 | 2016-12-29 08:03:36.000 |               1 |           4.7 |          1 | N                  |          151 |          223 |            2 |        16.5 |   0.0 |    
        1 | 2014-08-15 10:37:42.000 | 2014-08-15 11:07:21.000 |               1 |          10.0 |          1 | N                  |          230 |          138 |            3 |        32.0 |   0.0 |    
        1 | 2015-01-06 18:27:38.000 | 2015-01-06 19:04:04.000 |               2 |          18.5 |          2 | N                  |          264 |          264 |            2 |        52.0 |   0.0 |    
        1 | 2016-03-14 15:16:32.000 | 2016-03-14 16:27:11.000 |               1 |          18.5 |          2 | Y                  |          132 |          230 |            3 |        52.0 |   0.0 |    
        1 | 2016-05-02 09:10:21.000 | 2016-05-02 09:58:00.000 |               1 |          18.0 |          2 | N                  |          229 |          132 |            3 |        52.0 |   0.0 |    
        1 | 2017-11-28 23:24:24.000 | 2017-11-29 00:09:00.000 |               1 |           6.2 |          1 | N                  |          246 |          112 |            3 |        26.5 |   0.5 |    
        1 | 2014-09-14 12:37:44.000 | 2014-09-14 12:49:49.000 |               2 |           1.4 |          1 | N                  |          186 |          170 |            3 |         9.5 |   0.0 |    
        1 | 2015-11-20 21:35:41.000 | 2015-11-20 22:08:29.000 |               1 |           8.5 |          1 | N                  |          264 |          264 |            3 |        30.5 |   0.5 |    
(10 rows)

Query 20230815_131646_00028_4suvq, FINISHED, 1 node
Splits: 353 total, 353 done (100.00%)
2.77 [487M rows, 6.36GB] [176M rows/s, 2.3GB/s]

Trino S3 via hive-metastore integration

Written by jboothomas