JupyterHub, Portworx, and shared storage

jboothomas

Jun 11, 2021 · 5 min read

In this blog I’ll cover the steps to implement a JupyterHub environment with Portworx shared storage, and also how to move your shared data to a Portworx proxy volume presented from a Pure Storage FlashBlade.

The benefit is the ability to leverage data locality for iterative work and, when needed, move to a scale-out platform for large-dataset workloads that require high throughput and consistently low latency.

The environment used is a Kubernetes cluster running Portworx v2.7 virtualizing storage from a Pure Storage FlashArray.

First up, I create the Portworx storage classes for the sharedv4 and proxy volumes. Here is the sharedv4 Portworx storage class definition:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: px-sharedv4-sc
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "2"
  sharedv4: "true"
  allow_all_ips: "true"

Here is the Portworx proxy storage class definition for my FlashBlade share:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: px-proxy-fbnfs
provisioner: kubernetes.io/portworx-volume
parameters:
  proxy_endpoint: "nfs://192.168.5.100"   # FlashBlade data IP
  proxy_nfs_exportpath: "/z-da-pxproxy"   # NFS share name
  mount_options: vers=4.1
allowVolumeExpansion: true

After applying the above two definitions to my Kubernetes cluster, I create two persistent volume claims using the storage classes just defined.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jupyterhub-pxshared-volume
spec:
  resources:
    requests:
      storage: 20Gi
  accessModes:
    - ReadWriteMany
  storageClassName: px-sharedv4-sc

and

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jupyterhub-pxproxyfb-volume
spec:
  resources:
    requests:
      storage: 20Gi
  accessModes:
    - ReadWriteMany
  storageClassName: px-proxy-fbnfs
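To create these resources I apply the four manifests and check that both claims bind (the file names here are simply how I happened to save them):

kubectl apply -f px-sharedv4-sc.yaml -f px-proxy-fbnfs-sc.yaml
kubectl apply -f jupyterhub-pxshared-pvc.yaml -f jupyterhub-pxproxyfb-pvc.yaml
kubectl get pvc jupyterhub-pxshared-volume jupyterhub-pxproxyfb-volume   # both should show STATUS Bound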

OK, I now have my two shared storage volumes ready to be leveraged from within my Jupyter-as-a-Service environment.

I create a config.yaml file with additional Helm settings for the JupyterHub Helm chart. I first add the following section to mount my two shared persistent volume claims as extra volumes at /home/pxshared and /home/pxproxyfb respectively:

storage:
  dynamic:
    storageClass: px-db-local-snapshot
  capacity: 5Gi
  extraVolumes:
    - name: jupyterhub-pxshared
      persistentVolumeClaim:
        claimName: jupyterhub-pxshared-volume
    - name: jupyterhub-pxproxyfb
      persistentVolumeClaim:
        claimName: jupyterhub-pxproxyfb-volume
  extraVolumeMounts:
    - name: jupyterhub-pxshared
      mountPath: /home/pxshared
    - name: jupyterhub-pxproxyfb
      mountPath: /home/pxproxyfb
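For reference, in the Zero to JupyterHub chart these storage settings live under the singleuser key, so the surrounding structure of config.yaml looks roughly like this (a nesting sketch only, values as above):

singleuser:
  storage:
    dynamic:
      storageClass: px-db-local-snapshot
    capacity: 5Gi
    # extraVolumes and extraVolumeMounts as shown above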

With that in place, along with the required config settings, I deploy JupyterHub to the Kubernetes cluster as per the documentation:

helm upgrade --cleanup-on-fail --install jupyterhub jupyterhub/jupyterhub --values config.yaml
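To find the address to browse to, I can look up the chart's public-facing proxy service (assuming the default Zero to JupyterHub service names):

kubectl get service proxy-public   # shows the LoadBalancer EXTERNAL-IP, or the NodePort to use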

I can browse to my ingress address, load balancer IP, or clusterIP:nodeport to access the interface and log in (authentication is not covered here; see the official documentation). After login, the Server Options screen presents a list of notebooks for our users to choose from. I added quite a few to test various things out, as sketched below:
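The Server Options list is driven by the chart's singleuser.profileList setting; a minimal sketch with illustrative display names and images:

singleuser:
  profileList:
    - display_name: "Minimal notebook"
      description: "Default image with the shared volumes mounted"
      default: true
    - display_name: "PySpark notebook"
      description: "Adds Spark client libraries for job submission"
      kubespawner_override:
        image: jupyter/pyspark-notebook:latest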

My notebook launches as a pod that I can see in Kubernetes; it is named jupyter-<username>:

$ kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
continuous-image-puller-4d9qb     1/1     Running   0          9h
continuous-image-puller-5bzr2     1/1     Running   0          9h
continuous-image-puller-vrx82     1/1     Running   0          9h
hub-794977c58d-7mrjp              1/1     Running   0          9h
jupyter-jb                        1/1     Running   0          90s
proxy-6d88df6d44-rl8c5            1/1     Running   0          9h
user-scheduler-7977fbb8ff-2z5nt   1/1     Running   0          41h
user-scheduler-7977fbb8ff-6kc4v   1/1     Running   0          41h

Within my notebook I can access the mounted shared volumes: Portworx on the underlying FlashArray storage at /home/pxshared, or the Portworx proxy to the FlashBlade NFS share at /home/pxproxyfb:
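From a terminal inside the notebook pod, a quick check that both mounts are in place:

df -h /home/pxshared /home/pxproxyfb   # one Portworx-backed mount, one NFS mount from the FlashBlade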

I can now develop and test within my personal folder, which leverages the persistent volume claim created for the notebook on the Portworx storage layer (not shared), and when needed I can share my notebooks, code, and datasets.

As an example, I defined the following code to write and read some data on my Portworx shared volume and on an S3 bucket on my FlashBlade:

#!/usr/bin/env python3
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.mllib.random import RandomRDDs

# Create the Spark session and grab its context
sparkConf = SparkConf()
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
sc = spark.sparkContext

# Generate a 1000-row, 8-column RDD of uniform random values and name the columns
rdd = RandomRDDs.uniformVectorRDD(sc, 1000, 8, 1).map(lambda a: a.tolist())
df = rdd.toDF() \
    .withColumnRenamed("_1", "c1") \
    .withColumnRenamed("_2", "c2") \
    .withColumnRenamed("_3", "c3") \
    .withColumnRenamed("_4", "c4") \
    .withColumnRenamed("_5", "c5") \
    .withColumnRenamed("_6", "c6") \
    .withColumnRenamed("_7", "c7") \
    .withColumnRenamed("_8", "c8")

# Write the same dataset to the Portworx shared volume and to a FlashBlade S3 bucket
path = "file:///home/pxshared/hsbcsubmit"
paths3 = "s3a://z-da-datasets/hsbcsubmit"
df.write.format("com.databricks.spark.csv").option("header", "true").mode("overwrite").save(path)
df.write.format("com.databricks.spark.csv").option("header", "true").mode("overwrite").save(paths3)

# Read both copies back and print a quick row count and sample
print("Reading pxshared data")
csvPX = spark.read.csv(path + "/*.csv", header=True)
print(csvPX.count())
print(csvPX.head())

print("Reading S3 flashblade data")
csvS3 = spark.read.csv(paths3 + "/*.csv", header=True)
print(csvS3.count())
print(csvS3.head())

I copied this code to /home/pxshared so that a Spark job, submitted to the Kubernetes master, can leverage the shared location:

/usr/local/bin/spark-submit --master k8s://https://kubernetes.master:6443 \
--deploy-mode cluster \
--name pydata \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=sparkdefault \
--conf spark.kubernetes.namespace=default \
--conf spark.kubernetes.container.image=jboothomas/spark-py:k8smls3v3.1.2a \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=/home/pxshared \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.rwxpvc.options.claimName=jupyterhub-pxshared-volume \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.rwxpvc.mount.path=/home/pxshared \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.rwxpvc.options.claimName=jupyterhub-pxshared-volume \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.rwxpvc.mount.path=/home/pxshared \
--conf spark.hadoop.fs.s3a.endpoint=192.168.5.100 \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
--conf spark.hadoop.fs.s3a.fast.upload=true \
--conf spark.hadoop.fs.s3a.access.key=PSFBS------BGOHN \
--conf spark.hadoop.fs.s3a.secret.key=87681------0MKOE \
local:///home/pxshared/pydata-submit.py
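Note that the service account referenced in spark.kubernetes.authenticate.driver.serviceAccountName needs permission to create executor pods. If it does not already exist, something along these lines sets it up (a sketch, reusing the account name from the command above):

kubectl create serviceaccount sparkdefault --namespace default
kubectl create clusterrolebinding sparkdefault-edit --clusterrole=edit --serviceaccount=default:sparkdefault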

This will create a driver and an executor pod (as I only requested one):

pydata-112fd979f61fb982-driver    1/1     Running     0          31s
pydata-331d6379f61fdb8d-exec-1    1/1     Running     0          22s
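While the job runs I can follow the driver output directly, which is where the row counts printed by the script appear:

kubectl logs -f pydata-112fd979f61fb982-driver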

The job finishes and outputs the row count for the created datasets on both storage locations, Portworx sharedv4 and FlashBlade S3. As I also defined event logging to my shared folder, the job logs are present there for review if needed.

But perhaps I need to move to a big data fast file and object platform, so within the JupyterHub config.yaml I can define a notebook that copies the data between my share locations at startup via lifecycle_hooks:

lifecycle_hooks:
  postStart:
    exec:
      command:
        - "sh"
        - "-c"
        - >
          cp -a /home/pxshared/ /home/pxproxyfb
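One way to scope this hook to a single image is to place the block under the kubespawner_override of a dedicated profileList entry, so only that profile runs the copy at startup; a sketch of the nesting (display name illustrative):

singleuser:
  profileList:
    - display_name: "copy data"
      kubespawner_override:
        lifecycle_hooks:
          postStart:
            exec:
              command: ["sh", "-c", "cp -a /home/pxshared/ /home/pxproxyfb"]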

I can stop my prior notebook and select the “copy data” notebook image.

My notebook starts, the copy command executes, and I now have the datasets, code, and notebooks on my FlashBlade NFS share. These can be accessed via the Portworx proxy volume, but also as a direct NFS share from, say, a larger GPU cluster.

The combination of JupyterHub and simple-to-implement shared storage, from either Portworx or the unified fast file and object platform that is FlashBlade, enables individual contributors to develop and iterate on workloads and then share them seamlessly with the wider team prior to moving into production.
