lakeFS on Kubernetes with Pure Storage FlashBlade

jboothomas
4 min read · Mar 1, 2024


As per their website, lakeFS brings software engineering best practices and applies them to data engineering. Let’s go over how to get up and running with lakeFS on Kubernetes and Pure Storage FlashBlade.

All of the following steps are done on a Kubernetes cluster in the lakefs namespace.

Deploying the database

As per the lakeFS requirements, we first need a database to synchronize actions on our repositories.

For our PostgreSQL StatefulSet we need a secret holding the database password:

echo -n 'your_password_here' | base64
eW91cl9wYXNzd29yZF9oZXJl
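As an alternative to hand-encoding the password, kubectl can base64-encode the value for you when creating the Secret (the namespace and secret name match the manifest used in this walkthrough):

```shell
# Let kubectl handle the base64 encoding of the literal value
kubectl -n lakefs create secret generic postgres-secret \
  --from-literal=postgres-password='your_password_here'

# Sanity check: the encoded value decodes back to the original
echo -n 'eW91cl9wYXNzd29yZF9oZXJl' | base64 -d
```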

Our PostgreSQL instance is defined in the following YAML file; change the storageClassName to suit your environment.

apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
type: Opaque
data:
  postgres-password: eW91cl9wYXNzd29yZF9oZXJl
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  ports:
  - port: 5432
  clusterIP: None
  selector:
    app: postgres
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  selector:
    matchLabels:
      app: postgres
  serviceName: "postgres"
  replicas: 1
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:latest
        env:
        - name: POSTGRES_DB
          value: "lakefsdb"
        - name: POSTGRES_USER
          value: "exampleuser"
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: postgres-password
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: postgres-storage
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: postgres-storage
    spec:
      storageClassName: "pwxdb-storage-class"
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

and deployed simply via:

kubectl -n lakefs apply -f postgres.yaml

Upon completion you should have a running PostgreSQL server within the desired namespace on the Kubernetes cluster:

kubectl -n lakefs get all
NAME             READY   STATUS    RESTARTS   AGE
pod/postgres-0   1/1     Running   0          8s

NAME               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/postgres   ClusterIP   None         <none>        5432/TCP   57s

NAME                        READY   AGE
statefulset.apps/postgres   1/1     8s
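Before moving on, it is worth confirming the database actually accepts connections. A quick sketch, running psql inside the pod with the user and database names from the StatefulSet environment variables:

```shell
# Optional sanity check: open a psql session inside the postgres pod
# (user and database names match the StatefulSet env values)
kubectl -n lakefs exec -it postgres-0 -- \
  psql -U exampleuser -d lakefsdb -c 'SELECT version();'
```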

Deploying LakeFS

To deploy lakeFS we will use the Helm method, which requires providing certain values to the chart. These are specific to the storage platform and PostgreSQL database used, so adjust them to suit your environment. In our case I have provided values for a Pure Storage FlashBlade, as it delivers the native S3 capabilities required by lakeFS.

FlashBlade S3 is NOT built on top of a filesystem, as is the case with many other “S3 storage systems”, and so avoids the hidden limitations that approach incurs (see the versioning and object-lock issues when running S3 backed by a file system).

The helm chart values:

secrets:
  # replace this with the connection string of the database you created in a previous step:
  databaseConnectionString: postgres://exampleuser:your_password_here@postgres:5432/lakefsdb
  # replace this with a randomly-generated string
  authEncryptSecretKey: my_randomly_generated_string
lakefsConfig: |
  database:
    type: postgres
  blockstore:
    type: s3
    s3:
      force_path_style: true
      endpoint: http://192.168.4.104
      discover_bucket_region: false
      credentials:
        access_key_id: PSFBSA*****************GCBJEIA
        secret_access_key: A1211269*****************91c63eJOEN
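The databaseConnectionString is a standard PostgreSQL URI assembled from the values we used in the StatefulSet, and authEncryptSecretKey can be any sufficiently random string. A sketch of both:

```shell
# Assemble the connection string from the StatefulSet values;
# "postgres" is the headless Service name, resolvable inside the cluster
DB_USER=exampleuser
DB_PASS=your_password_here
DB_HOST=postgres
DB_NAME=lakefsdb
echo "postgres://${DB_USER}:${DB_PASS}@${DB_HOST}:5432/${DB_NAME}"

# One way to generate a random value for authEncryptSecretKey
openssl rand -hex 32
```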

Then install using helm cli:

helm install fb-lakefs lakefs/lakefs -n lakefs -f lakefs_helm_values.yml 
NAME: fb-lakefs
LAST DEPLOYED: Fri Mar 1 08:42:43 2024
NAMESPACE: lakefs
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing lakeFS!

1. Run the following to get a url to start setting up lakeFS:
export POD_NAME=$(kubectl get pods --namespace lakefs -l "app.kubernetes.io/instance=fb-lakefs" -o jsonpath="{.items[0].metadata.name}")
kubectl wait --for=condition=ready pod $POD_NAME
echo "Visit http://127.0.0.1:8000/setup to use your application"
kubectl port-forward $POD_NAME 8000:8000 --namespace lakefs

2. See the docs on how to create your first repository: https://docs.lakefs.io/quickstart/repository.html

I edit the services from ClusterIP to NodePort to have a simple means of accessing the GUIs from outside the k8s cluster in a lab environment.
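That change can also be applied with a one-line patch; a sketch, assuming the service is named after the Helm release (in production a LoadBalancer or Ingress would be more typical):

```shell
# Switch the lakeFS service to NodePort for lab access from outside the cluster
kubectl -n lakefs patch svc fb-lakefs -p '{"spec": {"type": "NodePort"}}'

# Confirm the assigned node port
kubectl -n lakefs get svc fb-lakefs
```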

Repositories, data import and visibility

After connecting to the lakefs interface and creating the initial admin user we are presented with our main page where we can create a repository:

Using the import functionality I will point lakeFS at a bucket containing part of the New York City taxi dataset:

You should get a ‘success’ message upon completion of the import task:

Within our repository we can browse and check the imported objects:
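The same browsing can be done from the command line with lakectl (the repo name below is a placeholder, and lakectl must first be pointed at your lakeFS endpoint via lakectl config):

```shell
# List the imported objects on the main branch (hypothetical repo name)
lakectl fs ls lakefs://my-repo/main/
```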

We can also run some queries against them:

And of course all of this is done while maintaining the familiar git mindset of commits, branches, tags and repositories:
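A sketch of that git-like workflow with lakectl, using illustrative repo and branch names:

```shell
# Create an isolated, zero-copy branch from main
lakectl branch create lakefs://my-repo/experiment --source lakefs://my-repo/main

# ...run a transformation against the branch, then commit the result
lakectl commit lakefs://my-repo/experiment -m "test transformation"

# Merge back into main once validated
lakectl merge lakefs://my-repo/experiment lakefs://my-repo/main

# Tag a known-good state of the data
lakectl tag create lakefs://my-repo/v1.0 lakefs://my-repo/main
```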

Conclusion

lakeFS provides data management as code using git-like operations, and when combined with the speed and simplicity at scale of Pure Storage FlashBlade it empowers the most critical aspect of an AI or analytics pipeline: the management of the datasets.

