Airbyte S3 connector on k8s

jboothomas
Aug 8, 2023


In this blog post I show a simple deployment of Airbyte on Kubernetes with S3 integration on Pure Storage FlashBlade.

On our Kubernetes server, with Helm installed, I first add the Helm repo for Airbyte:

helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update

Then I deploy to the desired namespace, passing a values.yaml file if required:

helm install s3airbyte airbyte/airbyte -n airbyte
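The install step can be wrapped in a small helper. This is only a sketch: the function name deploy_airbyte is hypothetical, values.yaml is a placeholder for an optional overrides file, and --create-namespace is added on the assumption that the airbyte namespace may not exist yet.

```shell
# Release, chart and namespace names follow the commands in this post.
RELEASE="s3airbyte"
CHART="airbyte/airbyte"
NAMESPACE="airbyte"

# Hypothetical helper: install the chart, using values.yaml only if it exists.
deploy_airbyte() {
  if [ -f values.yaml ]; then
    helm install "$RELEASE" "$CHART" -n "$NAMESPACE" --create-namespace -f values.yaml || return 1
  else
    helm install "$RELEASE" "$CHART" -n "$NAMESPACE" --create-namespace || return 1
  fi
  # List the pods so the rollout can be followed until they are all Running.
  kubectl get pods -n "$NAMESPACE"
}
```

Running deploy_airbyte from a shell that has helm and kubectl configured for the cluster performs the install and prints the pod list.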

I edit service/s3airbyte-airbyte-webapp-svc to change its type from ClusterIP to NodePort, which gives me quick access to the web interface from outside the cluster.
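The same change can be made non-interactively with kubectl patch. The sketch below assumes the service name and namespace used in this post; the helper name expose_webapp is hypothetical.

```shell
NAMESPACE="airbyte"
WEBAPP_SVC="s3airbyte-airbyte-webapp-svc"

# Hypothetical helper: switch the webapp service to NodePort and show the port.
expose_webapp() {
  # Patch the service type in place instead of editing it interactively.
  kubectl patch svc "$WEBAPP_SVC" -n "$NAMESPACE" \
    -p '{"spec":{"type":"NodePort"}}' || return 1
  # Print the node port that Kubernetes allocated for the first service port.
  kubectl get svc "$WEBAPP_SVC" -n "$NAMESPACE" \
    -o jsonpath='{.spec.ports[0].nodePort}'
}
```

After calling expose_webapp, the web interface should be reachable on any node IP at the printed port.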

airbyte interface

Let’s now create a simple connector to pull data from an S3 bucket. I select create connector and choose S3 as the type. Then I fill in the optional fields for the AWS access key, the secret key and the endpoint. As I have no SSL certificate on my demo environment, I specify an http path for the endpoint.

Airbyte will test the source and then prompt for a destination to be created.

For the sake of this blog post I will simply use a second, newly created S3 bucket on our FlashBlade as the destination and test the transformation of the data from Parquet to Avro.
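Before Airbyte runs its own tests, both buckets can be checked from the command line. This is a minimal sketch that assumes the AWS CLI profile and endpoint used in the listings later in this post; the helper name check_bucket is hypothetical.

```shell
PROFILE="fbstaines03"
ENDPOINT="https://192.168.40.165"

# Hypothetical helper: head-bucket exits non-zero if the bucket is missing
# or the credentials in the profile are not allowed to access it.
check_bucket() {
  aws s3api head-bucket --bucket "$1" --profile "$PROFILE" \
    --no-verify-ssl --endpoint-url "$ENDPOINT"
}
```

For example, check_bucket nyctaxi && check_bucket airbytedest succeeds only if both the source and the destination bucket are reachable with the configured keys.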

Again Airbyte tests the destination, and after validation I am presented with the connection configuration page. Change the settings to suit; I leave everything at its default:

After setup I am taken to the connection management pages, where I can see the connection's Status, Job history, Replication, Transformation and Settings:

While the sync is in progress I quickly check the objects in the source and destination buckets:

$ aws s3api list-objects-v2 --bucket nyctaxi --profile fbstaines03 --no-verify-ssl --endpoint-url=https://192.168.40.165
InsecureRequestWarning,
{
    "Contents": [
        {
            "LastModified": "2023-08-08T11:41:20.000Z",
            "ETag": "d2de0ffc4f9112b91c5fe3a407c07435",
            "StorageClass": "STANDARD",
            "Key": "yellow_tripdata_2023-01.parquet",
            "Size": 47673370
        }
    ]
}

$ aws s3api list-objects-v2 --bucket airbytedest --profile fbstaines03 --no-verify-ssl --endpoint-url=https://192.168.40.165
InsecureRequestWarning,
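For repeated checks while a sync runs, the listing can be trimmed to keys and sizes with the AWS CLI's --query option. A sketch, assuming the same profile and endpoint as above; the helper name list_keys is hypothetical.

```shell
PROFILE="fbstaines03"
ENDPOINT="https://192.168.40.165"

# Hypothetical helper: print one "key <tab> size" line per object in a bucket.
list_keys() {
  aws s3api list-objects-v2 --bucket "$1" --profile "$PROFILE" \
    --no-verify-ssl --endpoint-url "$ENDPOINT" \
    --query 'Contents[].[Key,Size]' --output text
}
```

Calling list_keys airbytedest on a bucket that is still empty typically prints None, since the response has no Contents element yet.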

To see the progress of the job, select Job History > View Logs.

The log shows the current count of records processed. For my example S3 connector job:

Once the job finishes, the sync history under Job history displays the details of the successful sync:

Let's check the destination bucket contents. I now have 4 objects from our Parquet to Avro conversion:

$ aws s3api list-objects-v2 --bucket airbytedest --profile fbstaines03 --no-verify-ssl --endpoint-url=https://192.168.40.165
InsecureRequestWarning,
{
    "Contents": [
        {
            "LastModified": "2023-08-08T11:56:47.000Z",
            "ETag": "0a08d7a59360c0ec2f53cd572ca86127-20",
            "StorageClass": "STANDARD",
            "Key": "/nyctaxi/fbS3nyctaxi/2023_08_08_1691495559589_0.avro",
            "Size": 209738278
        },
        {
            "LastModified": "2023-08-08T12:00:51.000Z",
            "ETag": "3c81c74aa12620ada225390959677eb7-20",
            "StorageClass": "STANDARD",
            "Key": "/nyctaxi/fbS3nyctaxi/2023_08_08_1691495559589_1.avro",
            "Size": 209747517
        },
        {
            "LastModified": "2023-08-08T12:04:56.000Z",
            "ETag": "f62110ec469abac38d1b5b12a1ccf4f4-20",
            "StorageClass": "STANDARD",
            "Key": "/nyctaxi/fbS3nyctaxi/2023_08_08_1691495559589_2.avro",
            "Size": 209748737
        },
        {
            "LastModified": "2023-08-08T12:07:30.000Z",
            "ETag": "fc6eae990030e9d6684c7035e65bfdca-13",
            "StorageClass": "STANDARD",
            "Key": "/nyctaxi/fbS3nyctaxi/2023_08_08_1691495559589_3.avro",
            "Size": 131695016
        }
    ]
}
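To put the two listings side by side, the sizes can be totalled with plain shell arithmetic. The numbers below are copied from the listings in this post; the variable names are only illustrative.

```shell
# Sizes in bytes, copied from the source and destination listings.
src_bytes=47673370
dest_sizes="209738278 209747517 209748737 131695016"

dest_bytes=0
dest_count=0
for size in $dest_sizes; do
  dest_bytes=$((dest_bytes + size))
  dest_count=$((dest_count + 1))
done

echo "destination objects: $dest_count"
echo "destination bytes:   $dest_bytes"

# Integer ratio scaled by 100, so two decimals are kept without needing bc.
ratio_x100=$((dest_bytes * 100 / src_bytes))
printf 'avro/parquet size ratio: %d.%02d\n' $((ratio_x100 / 100)) $((ratio_x100 % 100))
```

The four Avro objects add up to 760929548 bytes, roughly 16 times the size of the single Parquet source file. That is plausible given Parquet's columnar compression, though the exact factor depends on the compression setting chosen for the destination.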

Airbyte provides a simple platform to extract, transform and load data between multiple sources and destinations thanks to its 300+ connectors.

Pure Storage FlashBlade S3 storage is simple to integrate and provides a fast, scalable S3 layer that Airbyte and analytics applications can use within the larger data pipeline.
