SQL query S3 objects with Apache drill

jboothomas
3 min readMay 10, 2024

--

In this article I will cover how to setup and configure apache drill to run SQL queries against parquet files stored in an S3 bucket.

For this setup I will be using an ubuntu 20.04 virtual machine and a Pure Storage FlashBlade S//200 for S3 storage.

I start by updating and installing the requirements and apache base package:

apt update
apt install openjdk-8-jdk
wget https://dlcdn.apache.org/drill/1.21.1/apache-drill-1.21.1.tar.gz
tar -xvf apache-drill-1.21.1.tar.gz
cd apache-drill-1.21.1

~/apache-drill-1.21.1$ ls
KEYS LICENSE NOTICE README.md bin conf git.properties jars sample-data winutils

To access S3, I will configure the S3 storage plugin in Apache Drill, to do so start apache-drill bin/drill-embedded then from a browser navigate to the web service http://your_ip:8047:

Select create and paste in the following json amended to suit your environment:

{
"type": "file",
"connection": "s3a://nyctaxi",
"config": {
"fs.s3a.endpoint": "<your_s3_endpoint>",
"fs.s3a.connection.ssl.enabled": false,
"fs.s3a.access.key": "<your_s3_accesskey>",
"fs.s3a.secret.key": "<your_s3_secretkey>"
},
"workspaces": {
"root": {
"location": "/",
"writable": false,
"defaultInputFormat": null,
"allowAccessOutsideWorkspace": false
}
},
"formats": {
"parquet": {
"type": "parquet"
}
},
"authMode": "SHARED_USER",
"enabled": true
}

With that applied you should see the new storage plugin listed alongside the defaults:

Listing all databases visible from apache drill, now shows our newly defined s3 storage plugin:

apache drill> show databases;
+--------------------------------+
| SCHEMA_NAME |
+--------------------------------+
| dfs.default |
| dfs.root |
| dfs.tmp |
| information_schema |
| sys |
| cp.default |
| nyctaxis3parquetbucket.default |
| nyctaxis3parquetbucket.root |
+--------------------------------+
8 rows selected (0.184 seconds)

Leveraging this and a known parquet file on the s3 backend I can run some basic queries:

apache drill (nyctaxis3parquetbucket)> select COUNT(*) FROM `s3a://nyctaxi/yellow_tripdata_2022-01.parquet` WHERE trip_distance > 100;
+--------+
| EXPR$0 |
+--------+
| 135 |
+--------+
1 row selected (0.289 seconds)

apache drill (nyctaxis3parquetbucket)> select COUNT(*) FROM `s3a://nyctaxi/yellow_tripdata_2022-01.parquet` WHERE trip_distance < 100;
+---------+
| EXPR$0 |
+---------+
| 2463796 |
+---------+
1 row selected (0.319 seconds)

It is also possible to wildcard the parquet files that make up the dataset to issue the query across all matching files `yellow_tripdata_2022-??.parquet`:

apache drill (nyctaxis3parquetbucket)> select COUNT(*) FROM `s3a://nyctaxi/yellow_tripdata_2022-??.parquet` WHERE trip_distance > 100;
+--------+
| EXPR$0 |
+--------+
| 1529 |
+--------+
1 row selected (0.782 seconds)
apache drill (nyctaxis3parquetbucket)> select COUNT(*) FROM `s3a://nyctaxi/yellow_tripdata_2022-??.parquet` WHERE trip_distance < 100;
+----------+
| EXPR$0 |
+----------+
| 39654569 |
+----------+
1 row selected (1.039 seconds)

On the Pure Storage Flashblade S//200 we can see the S3 IO from the query:

As well as the specific S3 protocol operation latencies:

Apache drill and Pure Storage FlashBlade are yet another means to run SQL queries across S3 parquet objects stored on S3 effortless.

Thanks for reading.

--

--

jboothomas

Infrastructure engineering for modern data applications