SQL query S3 objects with Apache drill

3 min readMay 10, 2024

In this article I will cover how to setup and configure apache drill to run SQL queries against parquet files stored in an S3 bucket.

For this setup I will be using an ubuntu 20.04 virtual machine and a Pure Storage FlashBlade S//200 for S3 storage.

I start by updating and installing the requirements and apache base package:

apt update
apt install openjdk-8-jdk
wget https://dlcdn.apache.org/drill/1.21.1/apache-drill-1.21.1.tar.gz
tar -xvf apache-drill-1.21.1.tar.gz
cd apache-drill-1.21.1

~/apache-drill-1.21.1$ ls
KEYS  LICENSE  NOTICE  README.md  bin  conf  git.properties  jars  sample-data  winutils

To access S3, I will configure the S3 storage plugin in Apache Drill, to do so start apache-drill bin/drill-embedded then from a browser navigate to the web service http://your_ip:8047:

Select create and paste in the following json amended to suit your environment:

{
  "type": "file",
  "connection": "s3a://nyctaxi",
  "config": {
    "fs.s3a.endpoint": "<your_s3_endpoint>",
    "fs.s3a.connection.ssl.enabled": false,
    "fs.s3a.access.key": "<your_s3_accesskey>",
    "fs.s3a.secret.key": "<your_s3_secretkey>"
  },
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    }
  },
  "formats": {
    "parquet": {
      "type": "parquet"
    }
  },
  "authMode": "SHARED_USER",
  "enabled": true
}

With that applied you should see the new storage plugin listed alongside the defaults:

Listing all databases visible from apache drill, now shows our newly defined s3 storage plugin:

apache drill> show databases;
+--------------------------------+
|          SCHEMA_NAME           |
+--------------------------------+
| dfs.default                    |
| dfs.root                       |
| dfs.tmp                        |
| information_schema             |
| sys                            |
| cp.default                     |
| nyctaxis3parquetbucket.default |
| nyctaxis3parquetbucket.root    |
+--------------------------------+
8 rows selected (0.184 seconds)

Leveraging this and a known parquet file on the s3 backend I can run some basic queries:

apache drill (nyctaxis3parquetbucket)> select COUNT(*) FROM `s3a://nyctaxi/yellow_tripdata_2022-01.parquet` WHERE trip_distance > 100;
+--------+
| EXPR$0 |
+--------+
| 135    |
+--------+
1 row selected (0.289 seconds)

apache drill (nyctaxis3parquetbucket)> select COUNT(*) FROM `s3a://nyctaxi/yellow_tripdata_2022-01.parquet` WHERE trip_distance < 100;
+---------+
| EXPR$0  |
+---------+
| 2463796 |
+---------+
1 row selected (0.319 seconds)

It is also possible to wildcard the parquet files that make up the dataset to issue the query across all matching files `yellow_tripdata_2022-??.parquet`:

apache drill (nyctaxis3parquetbucket)> select COUNT(*) FROM `s3a://nyctaxi/yellow_tripdata_2022-??.parquet` WHERE trip_distance > 100;
+--------+
| EXPR$0 |
+--------+
| 1529   |
+--------+
1 row selected (0.782 seconds)
apache drill (nyctaxis3parquetbucket)> select COUNT(*) FROM `s3a://nyctaxi/yellow_tripdata_2022-??.parquet` WHERE trip_distance < 100;
+----------+
|  EXPR$0  |
+----------+
| 39654569 |
+----------+
1 row selected (1.039 seconds)

On the Pure Storage Flashblade S//200 we can see the S3 IO from the query:

As well as the specific S3 protocol operation latencies:

Apache drill and Pure Storage FlashBlade are yet another means to run SQL queries across S3 parquet objects stored on S3 effortless.

Thanks for reading.

SQL query S3 objects with Apache drill

Written by jboothomas