Hive-metastore on K8S with S3 external table

jboothomas
3 min read · Aug 10, 2023


In this blog I will cover how to set up a Hive metastore on Kubernetes and then leverage external S3 datasets.

Installation

To deploy Hive components on the Kubernetes cluster, I first add the required Helm chart repository:

$ helm repo add bigdata-gradiant https://gradiant.github.io/bigdata-charts/
$ helm repo update

I can then search the new repo for the available Helm charts; at the time of writing these are:

$ helm search repo bigdata-gradiant
NAME CHART VERSION APP VERSION DESCRIPTION
bigdata-gradiant/hbase 0.1.6 2.0.1 HBase is an open-source non-relational distribu...
bigdata-gradiant/hdfs 0.1.10 2.7.7 The Apache Hadoop software library is a framewo...
bigdata-gradiant/hive 0.1.6 2.3.6 The Apache Hive ™ data warehouse software facil...
bigdata-gradiant/hive-metastore 0.1.3 2.3.6 The Apache Hive ™ data warehouse software facil...
bigdata-gradiant/jupyter 0.1.11 6.0.3 Helm for jupyter single server with pyspark sup...
bigdata-gradiant/kafka-connect-ui 0.1.0 0.9.7 Helm for Landoop/kafka-connect-ui
bigdata-gradiant/opentsdb 0.1.7 2.4.0 Store and serve massive amounts of time series ...
bigdata-gradiant/spark-standalone 0.1.0 2.4.4 Apache Spark™ is a unified analytics engine for...

To link Trino to S3 datasets, I will deploy a Hive metastore:

$ helm install hivems bigdata-gradiant/hive-metastore -n analytics
NAME: hivems
LAST DEPLOYED: Thu Aug 10 10:21:47 2023
NAMESPACE: analytics
STATUS: deployed
REVISION: 1
TEST SUITE: None

I check that my pods have started correctly within Kubernetes:

$ kubectl -n analytics get pods
NAME                      READY   STATUS    RESTARTS   AGE
hivems-hive-metastore-0   1/1     Running   0          42m
hivems-postgresql-0       1/1     Running   0          42m
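If the pods are still starting, a `kubectl wait` can block until they are Ready instead of polling by hand (pod names as above; the timeout value is my choice):

```shell
# Wait for the metastore and its PostgreSQL backend to become Ready
kubectl -n analytics wait --for=condition=ready pod \
  hivems-hive-metastore-0 hivems-postgresql-0 --timeout=300s
```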

I then export the generated ConfigMap and append the following properties to the hive-site.xml section under data (replace with your own S3 endpoint and credentials):

$ kubectl -n analytics get configmap hivems-hive-metastore -o yaml  > hivems-hive-metastore.yaml
$ vi hivems-hive-metastore.yaml

### I MAKE THE FOLLOWING CHANGE/ADDITION ###
data:
  hive-site.xml: |
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>fs.s3a.endpoint</name>
        <value>192.168.2.2</value>
      </property>
      <property>
        <name>fs.s3a.access.key</name>
        <value>PSFB....JEIA</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>A121....eJOEN</value>
      </property>
      <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <value>false</value>
      </property>
      <!-- remaining properties from the original hive-site.xml unchanged -->
    </configuration>
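After editing, the modified ConfigMap has to be applied back to the cluster and the metastore pod restarted so it picks up the new hive-site.xml. A sketch, assuming the chart's default resource names from above:

```shell
# Re-apply the edited ConfigMap
kubectl -n analytics apply -f hivems-hive-metastore.yaml
# Delete the pod; the StatefulSet recreates it with the updated config
kubectl -n analytics delete pod hivems-hive-metastore-0
```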

Note: I had to create a new container image to include the missing hadoop-aws-2.7.4.jar file; the resulting image can be obtained from my Docker repo as jboothomas/hive-metastore-s3:v6. I then simply provided this as the image to use for the hive-metastore deployment.
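For reference, such an image can be built along these lines. This is only a sketch, not the actual Dockerfile behind jboothomas/hive-metastore-s3:v6 — the base image tag is an assumption, and hadoop-aws must match the Hadoop version bundled in the image (2.7.4 here), together with the aws-java-sdk version it was built against:

```dockerfile
# Sketch: add the S3A jars missing from the stock hive-metastore image
# Base image tag is an assumption
FROM gradiant/hive-metastore:2.3.6
# hadoop-aws must match the bundled Hadoop version (2.7.4)
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar \
    /opt/hadoop-2.7.4/share/hadoop/common/lib/
# hadoop-aws 2.7.x was built against aws-java-sdk 1.7.4
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar \
    /opt/hadoop-2.7.4/share/hadoop/common/lib/
```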

Table creation

To create an external table I exec into the hive metastore pod and connect to hive:

$ kubectl -n analytics exec -it hivems-hive-metastore-0 -- /bin/sh
# hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>

Now I can create my external table:

create external table if not exists nyctaxi(
VendorID bigint,
tpep_pickup_datetime timestamp,
tpep_dropoff_datetime timestamp,
passenger_count double,
trip_distance double,
RatecodeID double,
store_and_fwd_flag string,
PULocationID bigint,
DOLocationID bigint,
payment_type bigint,
fare_amount double,
extra double,
mta_tax double,
tip_amount double,
tolls_amount double,
improvement_surcharge double,
total_amount double
)
STORED AS PARQUET
LOCATION 's3a://nyctaxi/';
Time taken: 2.756 seconds
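A quick sanity check from the same hive prompt confirms the table is wired to the bucket (the row count obviously depends on what is sitting in s3a://nyctaxi/, so no output is shown here):

```sql
-- Inspect the table definition, including its S3A location
DESCRIBE FORMATTED nyctaxi;
-- Touch the data itself to verify S3 access works end to end
SELECT count(*) FROM nyctaxi;
```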

From this point onwards I can point various analytics tools (for example Trino) at my hive-metastore service and run queries against this table.
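For Trino this means pointing a hive catalog at the metastore's Thrift service. A minimal catalog file could look like the following sketch — the in-cluster service name, the default Thrift port 9083, and the S3 settings are assumptions to adapt to your own cluster:

```properties
# etc/catalog/hive.properties — minimal sketch
connector.name=hive
hive.metastore.uri=thrift://hivems-hive-metastore.analytics.svc.cluster.local:9083
hive.s3.endpoint=http://192.168.2.2
hive.s3.aws-access-key=...
hive.s3.aws-secret-key=...
hive.s3.ssl.enabled=false
hive.s3.path-style-access=true
```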
