Kubeflow installation guide

jboothomas
5 min read · Apr 15, 2021

A quick step-by-step guide to getting Kubeflow up and running with GPU support. In this example I use a single node acting as both Kubernetes control plane and worker, with the Docker runtime (NVIDIA support), Kubernetes 1.19.8, the Calico CNI, and the NFS client provisioner for storage claims. Here are the steps:

Disable swap

{
sudo swapoff --all
sudo sed -ri '/\sswap\s/s/^#?/#/' /etc/fstab
}
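
A quick way to confirm swap is fully disabled (my own sanity check, using standard util-linux tools):

{
swapon --show
### no output means no active swap devices ###
free -h | grep -i swap
### the Swap line should report 0B total ###
}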

Prepare for docker

{
cat <<EOF | sudo tee /etc/modules-load.d/containerd.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter
cat <<EOF | sudo tee /etc/sysctl.d/99-kubernetes-cri.conf
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF
sudo sysctl --system
}

Install docker

{
sudo apt-get install apt-transport-https ca-certificates curl gnupg lsb-release
sudo apt-get update
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
sudo mkdir -p /etc/docker
cat <<EOF | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOF
sudo systemctl enable docker
sudo systemctl daemon-reload
sudo systemctl restart docker
}

Install kubernetes binaries

{
sudo apt-get update && sudo apt-get install -y apt-transport-https
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
sudo apt-get update
sudo apt-get install -y kubelet=1.19.8-00 kubeadm=1.19.8-00 kubectl=1.19.8-00
sudo apt-mark hold kubelet kubeadm kubectl
}

Install NVidia drivers

{
ubuntu-drivers devices
sudo apt install nvidia-driver-460
### above was the recommended driver for my home lab rig ###
}
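
A reboot is usually needed for the new driver to load; once the system is back up, a quick nvidia-smi should report the GPU:

nvidia-smi
### should print the driver version, CUDA version and the detected GPU ###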

Prepare the GPU nodes

{
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
}

Edit the Docker daemon config file, usually found at /etc/docker/daemon.json, and add the NVIDIA runtime:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

The resulting Docker config file should be as follows:

$ cat /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}

Then reload and restart docker:

{
sudo systemctl daemon-reload
sudo systemctl restart docker
}

To check that everything works, run:

sudo docker run --rm nvcr.io/nvidia/cuda nvidia-smi

You should see the same output as when running nvidia-smi locally, with your GPU details.
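
Another quick check is that Docker has picked up the NVIDIA runtime as its default:

sudo docker info | grep -i runtime
### expect the nvidia runtime in the Runtimes list and "Default Runtime: nvidia" ###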

Initialise the k8s cluster

{
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --control-plane-endpoint k8s.my.domain.com
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl taint nodes --all node-role.kubernetes.io/master-
}

Deploy CNI

{
curl https://docs.projectcalico.org/manifests/calico.yaml -O
kubectl apply -f calico.yaml
watch kubectl get pods -l=k8s-app=calico-node -A
}
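
Once the calico-node pod is Running, the node itself should report Ready:

kubectl get nodes
### the single controller + worker node should show STATUS Ready ###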

Get Helm

{
wget https://get.helm.sh/helm-v3.5.3-linux-amd64.tar.gz
tar -xzvf helm-v3.5.3-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
}

Install MetalLB load balancer

{
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.9.5/manifests/namespace.yaml
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.9.5/manifests/metallb.yaml
kubectl create secret generic -n metallb-system memberlist --from-literal=secretkey="$(openssl rand -base64 128)"
}
### create the metallb config file, changing the addresses to your own IP pool ###
cat <<EOF | sudo tee metallb-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 192.168.1.14-192.168.1.19
EOF
kubectl -n metallb-system apply -f metallb-cm.yaml
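
A quick sanity check that MetalLB is up before relying on it later:

kubectl -n metallb-system get pods
### expect the controller pod and one speaker pod per node, all Running ###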

Install NFS client storage class

{
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner --set nfs.server=192.168.1.2 --set nfs.path=/myk8sshare
kubectl patch storageclass nfs-client -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
}
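
To verify that nfs-client really is the default storage class, a throwaway claim can be created and checked for a Bound status (nfs-test-claim is just an example name):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-test-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
kubectl get pvc nfs-test-claim
### STATUS should show Bound, backed by the nfs-client storage class ###
kubectl delete pvc nfs-test-claim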

Install NVIDIA K8S device plugin

{
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install \
  --generate-name \
  --version=0.9.0 \
  nvdp/nvidia-device-plugin
}
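
Once the device plugin pod is running, the node should advertise the nvidia.com/gpu resource:

kubectl describe node | grep -i nvidia.com/gpu
### the resource should appear under both Capacity and Allocatable with a count of 1 ###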

Get Kubeflow kfctl binary

{
wget https://github.com/kubeflow/kfctl/releases/download/v1.2.0/kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
tar -xzvf kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
sudo mv kfctl /usr/local/bin/
}

Modify the kube-apiserver manifest

sudo vi /etc/kubernetes/manifests/kube-apiserver.yaml

### add the following to the container command spec ###
spec:
  containers:
  - command:
    - kube-apiserver
    - --service-account-signing-key-file=/etc/kubernetes/pki/sa.key
    - --service-account-issuer=kubernetes.default.svc
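
Since kube-apiserver runs as a static pod, the kubelet restarts it automatically once the manifest is saved. A quick way to confirm the new flags are active:

kubectl -n kube-system get pod -l component=kube-apiserver -o yaml | grep service-account-issuer
### the flag should be listed in the pod command once the apiserver has restarted ###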

Install Kubeflow

{
mkdir -p kubeflow
export KF_NAME=kubeflow
export BASE_DIR=/home/jboothomas/
export KF_DIR=${BASE_DIR}/${KF_NAME}
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml"
cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_URI}
}
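
The deployment pulls a large number of images and can take quite a while; I simply watch the kubeflow namespace until every pod is up:

watch kubectl -n kubeflow get pods
### wait for all pods to reach Running (or Completed) before continuing ###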

I personally remove the telemetry service:

kubectl -n kubeflow delete deploy -l app=spartakus

Once the installation completes, since I have deployed MetalLB, I change the istio-ingressgateway service to the LoadBalancer type so that the web interfaces are reachable via a dedicated IP.

{
kubectl -n istio-system edit svc istio-ingressgateway
### edit the service type from NodePort to LoadBalancer ###
}
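
If you prefer a non-interactive change, the same result can be achieved with a patch:

kubectl -n istio-system patch svc istio-ingressgateway -p '{"spec": {"type": "LoadBalancer"}}'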

Connect to Kubeflow

If all is up and running, you can now browse to the LoadBalancer IP assigned to the istio-ingressgateway service:
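
The assigned address can be read straight from the service:

kubectl -n istio-system get svc istio-ingressgateway
### the EXTERNAL-IP column shows the MetalLB address to browse to ###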

Then create a namespace for yourself when prompted on first access:

I am then greeted with the main Kubeflow dashboard:

You can see that my user namespace has been created in Kubernetes and two pods are now running:
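
The pods can be listed with kubectl (jbt is the short name I gave my namespace):

kubectl get pods --all-namespaces | grep jbt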

jbt   ml-pipeline-ui-artifact-c8fbcb8f9-f6qhv           2/2   Running   0   8h
jbt   ml-pipeline-visualizationserver-6b78c9646-w699t   2/2   Running   0   8h

Using Kubeflow notebooks (Jupyter)

First I create a notebook; as my lab system has a GPU graphics card, I select a TensorFlow GPU-enabled image to deploy.

Kubeflow proceeds to deploy the image, and we can use the provided 'CONNECT' button to access the Jupyter interface:

Within a new Python 3 notebook I can validate the resources that are available to me.

List the devices available to the notebook (CPU and GPU):

import tensorflow as tf

for device in tf.config.get_visible_devices():
    print(device)

PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')

Success! The GPU is listed, so I can now run CPU- or GPU-powered workloads within Kubeflow.
