lakeFS on Kubernetes with Pure Storage FlashBlade

jboothomas
4 min read · Mar 1, 2024


As per their website, lakeFS brings software engineering best practices and applies them to data engineering. Let’s go over how to get up and running with lakeFS on Kubernetes and Pure Storage FlashBlade.

All of the following steps are done on a Kubernetes cluster in the lakefs namespace.

Deploying the database

As per the lakeFS requirements, we first need a database to synchronize actions on our repositories.

For our PostgreSQL StatefulSet we need a secret holding the database password:

echo -n 'your_password_here' | base64
eW91cl9wYXNzd29yZF9oZXJl
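As an alternative to hand-encoding the password, kubectl can base64-encode the value for you when creating the Secret (the namespace and secret name match the manifest used in this walkthrough):

```shell
# Let kubectl handle the base64 encoding of the literal value
kubectl -n lakefs create secret generic postgres-secret \
  --from-literal=postgres-password='your_password_here'

# Sanity check: the encoded value decodes back to the original
echo -n 'eW91cl9wYXNzd29yZF9oZXJl' | base64 -d
```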

Our PostgreSQL instance is defined in the following YAML file; change the storageClassName to suit your environment.

apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
type: Opaque
data:
  postgres-password: eW91cl9wYXNzd29yZF9oZXJl
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  ports:
  - port: 5432
  clusterIP: None
  selector:
    app: postgres
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  selector:
    matchLabels:
      app: postgres
  serviceName: "postgres"
  replicas: 1
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:latest
        env:
        - name: POSTGRES_DB
          value: "lakefsdb"
        - name: POSTGRES_USER
          value: "exampleuser"
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: postgres-password
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: postgres-storage
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: postgres-storage
    spec:
      storageClassName: "pwxdb-storage-class"
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

and deployed simply via:

kubectl -n lakefs apply -f postgres.yaml

Upon completion you should have a running PostgreSQL server within the desired namespace on the Kubernetes cluster:

kubectl -n lakefs get all
NAME             READY   STATUS    RESTARTS   AGE
pod/postgres-0   1/1     Running   0          8s

NAME               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/postgres   ClusterIP   None         <none>        5432/TCP   57s

NAME                        READY   AGE
statefulset.apps/postgres   1/1     8s
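Before moving on, it is worth confirming the database actually accepts connections. A quick sketch, running psql inside the pod with the user and database names from the StatefulSet environment variables:

```shell
# Optional sanity check: open a psql session inside the postgres pod
# (user and database names match the StatefulSet env values)
kubectl -n lakefs exec -it postgres-0 -- \
  psql -U exampleuser -d lakefsdb -c 'SELECT version();'
```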

Deploying LakeFS

To deploy lakeFS we will use the Helm method, which requires providing certain values to the chart. These are specific to the storage platform and PostgreSQL database used, so adjust them to suit your environment. In our case I have provided values for a Pure Storage FlashBlade, as it delivers the native S3 capabilities required by lakeFS.

FlashBlade S3 is NOT built on top of a filesystem, as is the case with many other “S3 storage systems”, and so avoids the hidden limitations that approach incurs (see the versioning and object-lock issues when running S3 backed by a file system).

The helm chart values:

secrets:
  # replace this with the connection string of the database you created in a previous step:
  databaseConnectionString: postgres://exampleuser:your_password_here@postgres:5432/lakefsdb
  # replace this with a randomly-generated string
  authEncryptSecretKey: my_randomly_generated_string
lakefsConfig: |
  database:
    type: postgres
  blockstore:
    type: s3
    s3:
      force_path_style: true
      endpoint: http://192.168.4.104
      discover_bucket_region: false
      credentials:
        access_key_id: PSFBSA*****************GCBJEIA
        secret_access_key: A1211269*****************91c63eJOEN
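The databaseConnectionString is a standard PostgreSQL URI assembled from the values we used in the StatefulSet, and authEncryptSecretKey can be any sufficiently random string. A sketch of both:

```shell
# Assemble the connection string from the StatefulSet values;
# "postgres" is the headless Service name, resolvable inside the cluster
DB_USER=exampleuser
DB_PASS=your_password_here
DB_HOST=postgres
DB_NAME=lakefsdb
echo "postgres://${DB_USER}:${DB_PASS}@${DB_HOST}:5432/${DB_NAME}"

# One way to generate a random value for authEncryptSecretKey
openssl rand -hex 32
```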

Then install using helm cli:

helm install fb-lakefs lakefs/lakefs -n lakefs -f lakefs_helm_values.yml 
NAME: fb-lakefs
LAST DEPLOYED: Fri Mar 1 08:42:43 2024
NAMESPACE: lakefs
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing lakeFS!

1. Run the following to get a url to start setting up lakeFS:
export POD_NAME=$(kubectl get pods --namespace lakefs -l "app.kubernetes.io/instance=fb-lakefs" -o jsonpath="{.items[0].metadata.name}")
kubectl wait --for=condition=ready pod $POD_NAME
echo "Visit http://127.0.0.1:8000/setup to use your application"
kubectl port-forward $POD_NAME 8000:8000 --namespace lakefs

2. See the docs on how to create your first repository: https://docs.lakefs.io/quickstart/repository.html

I edit the services from ClusterIP to NodePort to have a simple means of accessing the GUIs from outside the k8s cluster in a lab environment.
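That change can also be applied with a one-line patch; a sketch, assuming the service is named after the Helm release (in production a LoadBalancer or Ingress would be more typical):

```shell
# Switch the lakeFS service to NodePort for lab access from outside the cluster
kubectl -n lakefs patch svc fb-lakefs -p '{"spec": {"type": "NodePort"}}'

# Confirm the assigned node port
kubectl -n lakefs get svc fb-lakefs
```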

Repositories, data import and visibility

After connecting to the lakefs interface and creating the initial admin user we are presented with our main page where we can create a repository:

Using the import functionality I will point lakeFS at a bucket containing part of the New York City taxi dataset:

You should get a ‘success’ message upon completion of the import task:

Within our repository we can browse and check the imported objects:
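The same browsing can be done from the command line with lakectl (the repo name below is a placeholder, and lakectl must first be pointed at your lakeFS endpoint via lakectl config):

```shell
# List the imported objects on the main branch (hypothetical repo name)
lakectl fs ls lakefs://my-repo/main/
```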

We can also run some queries against them:

And of course all of this is done while maintaining the familiar git mindset of commits, branches, tags and repositories:
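A sketch of that git-like workflow with lakectl, using illustrative repo and branch names:

```shell
# Create an isolated, zero-copy branch from main
lakectl branch create lakefs://my-repo/experiment --source lakefs://my-repo/main

# ...run a transformation against the branch, then commit the result
lakectl commit lakefs://my-repo/experiment -m "test transformation"

# Merge back into main once validated
lakectl merge lakefs://my-repo/experiment lakefs://my-repo/main

# Tag a known-good state of the data
lakectl tag create lakefs://my-repo/v1.0 lakefs://my-repo/main
```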

Conclusion

lakeFS provides data management as code using git-like operations, and when combined with the speed and simplicity at scale of Pure Storage FlashBlade it empowers the most critical aspect of an AI or analytics pipeline: the management of the datasets.

