JHub Set-up on Azure

Second time setting up a multi-user JupyterHub with Dask enabled

Requirements

See the full config.yaml files in the config-template directory of the nmfs-opensci/nmfs-jhub GitHub repo. The config file used here is fish-config.yaml.

  • Domain Name: You will need one. I use GoDaddy; domain names are cheap.

Steps

  • Open the Azure Cloud Shell
  • Create a Kubernetes cluster
  • Create the node pools (what types of VMs)
  • Install JupyterHub with Dask
  • Edit the config file
  • Set up https
  • Set up authentication
  • Once it is working, you can create a shared drive for users

Open the Azure Cloud Shell

I am using a simpler version of the instructions here: Z2JH Set up Kubernetes on Azure. To see how to do these steps using the Azure Dashboard go to the DaskHub Set-up on Azure. If this is your first time setting up a hub on Azure, it would be good to skim that post to get an overview of the process.

  1. Log into https://portal.azure.com

  2. Look for the Cloud Shell icon in the top nav bar (a box with a right-pointing arrow in it). Click on that and use the Bash shell, not PowerShell.

  3. Make sure to set up a storage account. You can just have Azure set one up for you. If you forget and later get warnings that your shell is ephemeral, go into the Cloud Shell settings and reset the user settings. A quick check that the shell is pointed at the right subscription is sketched below.
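
It is worth confirming which subscription the Cloud Shell is using before you create anything (a quick sanity check; the subscription name is a placeholder):

# Show the subscription the shell is currently using
az account show --output table

# If it is the wrong one, list subscriptions and switch
az account list --output table
az account set --subscription "<your-subscription-name-or-id>"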

Create your Kubernetes cluster

We have to create a resource group.

# Create names
RES_GP="SAFSJHub"
CLUSTER_NAME="jhub"

# Create a resource group
az group create \
   --name=$RES_GP \
   --location=westus2 \
   --output table

Then we create a cluster with az aks create. I am going to accept most of the defaults. This creates our cluster and its first node pool, which runs just the hub (not the user VMs). This node (VM) must always be running and there is only ever one (so no autoscaling).

# Create the cluster
az aks create \
   --name $CLUSTER_NAME \
   --resource-group $RES_GP \
   --node-vm-size Standard_D2s_v3 \
   --generate-ssh-keys \
   --node-count 1 \
   --nodepool-name core \
   --nodepool-labels hub.jupyter.org/node-purpose=core \
   --zones 2 \
   --location westus2
  • --name name of our cluster, stored in the $CLUSTER_NAME variable above.
  • --node-vm-size Standard_D2s_v3 This is a 2 CPU, 8 GB RAM VM to run the hub itself (so not the user VMs). Costs about $80 a month.
  • --generate-ssh-keys Just use the default generator.
  • --node-count 1 There is only ever one core node; the hub server runs on it alone.
  • --nodepool-name core Required. This is the name of the node pool for the hub.
  • --nodepool-labels hub.jupyter.org/node-purpose=core So that when you query the nodes and pods, you can make sense of the output.
  • --zones 2 Probably not needed here. Only needed for user pods.
  • --location westus2 Required. Pick something close. Use az account list-locations --output table to see a list of the options.
  • --output table Optional; makes the output print as a table (not included in the command above, but you can add it).
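
To confirm the cluster and its core node pool were created as expected, something like the following should work (same variables as above):

# Show basic information about the cluster
az aks show --resource-group $RES_GP --name $CLUSTER_NAME --output table

# List the node pools; you should see the single "core" pool
az aks nodepool list --cluster-name $CLUSTER_NAME --resource-group $RES_GP --output table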

Create user node pools

This will define the types of VMs that will spin up for our users. In the JHub I am creating, users can request from 2 to 128 GB of RAM. For normal use, users will select 2 or 4 GB, so 8 to 16 users will be sharing each VM. If a VM runs out of RAM, another will be spun up. I want to create a separate node pool for 128 GB requests because that is an expensive VM and I only want it if a user requests it (it will go away after the user shuts down their server).

Before you get started, take a look at your quota. The total vCPUs (CPUs per node times max-count) across all your node pools should not exceed it.
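
You can check the quota from the Cloud Shell; look at the rows for the vCPU families your node pools will use (a quick check, not required):

# Show current vCPU usage and limits for the region
az vm list-usage --location westus2 --output table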

What sort of machines should I put in my node pool?

  • Lots of memory intensive work? Lower CPU and higher RAM (memory optimized).
  • Lots of parallel processing or batch processing? Lots of CPU and lower RAM (compute optimized).
  • Make sure the nodes have enough OS disk space to pull in your Docker images. This mostly matters if there are many different images that users can choose from on your hub. The images are about 7-8 GB each.

Create the 32 GB RAM, 4 CPU node pool. This is a standard base user node for JupyterHubs. We will set it to autoscale between 1 and 4 nodes.

az aks nodepool add \
    --cluster-name $CLUSTER_NAME \
    --resource-group $RES_GP \
    --name usere4sv6 \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 4 \
    --node-vm-size Standard_E4s_v5 \
    --labels hub.jupyter.org/node-purpose=usere4sv6 \
    --zones 2
  • --name usere4sv6 Give this node pool a unique name. We will need this in our JHub config file.
  • --enable-cluster-autoscaler Required. We want more VMs to be spun up if more users sign into the hub.
  • --min-count 1 We want the minimum to be 1 so that one node is always running and users don’t have to wait for a new node to start up. If this hub is rarely used, we could set this to 0, but then start-up would take longer.
  • --max-count 4 This should accommodate roughly 32-64 users if most users are requesting 2-4 GB of RAM. This is 4 × 4 = 16 CPUs if all nodes are in use.
  • --node-vm-size Standard_E4s_v5 This is a general-use VM with 32 GB RAM and 4 CPUs. Similar to an AWS r5.xlarge.
  • --labels hub.jupyter.org/node-purpose=usere4sv6 So that we can decipher our node and pod output.
  • --zones 2 Very important. We have to pin our user storage to a specific region and zone. We don’t want our nodes (VMs) to ever spin up in a zone different than our user (cloud) storage.

Create the 128 GB RAM / 16 CPU node pool. Everything is the same except Standard_E16s_v5 and the node pool name. This is similar to an AWS r5.4xlarge and costs about $750 a month if running continually.

az aks nodepool add \
    --cluster-name $CLUSTER_NAME \
    --name usere16sv5 \
    --resource-group $RES_GP \
    --enable-cluster-autoscaler \
    --min-count 0 \
    --max-count 3 \
    --node-vm-size Standard_E16s_v5 \
    --labels hub.jupyter.org/node-purpose=usere16sv5 \
    --zones 2
  • --min-count 0 We definitely do NOT want this one running continually.
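
If you later need to change the autoscaler bounds on a pool (for example, to temporarily allow more of the big nodes), az aks nodepool update can do it. A sketch with example values, not something you need to run now:

az aks nodepool update \
    --cluster-name $CLUSTER_NAME \
    --resource-group $RES_GP \
    --name usere16sv5 \
    --update-cluster-autoscaler \
    --min-count 0 \
    --max-count 2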

Check out what you created

Click on the Kubernetes Services button and you should see something like this

Click Kubernetes Cluster and then on the jhub cluster. Poke around on the tabs and look at the node pools.

Install DaskHub on your cluster

These next steps are done in the shell after connecting to your cluster, so first you need to get to the shell for your cluster. Make sure you don’t have the ephemeral warning when you launch your cloud shell.

Connect to your cluster

Once you have created your Kubernetes cluster, you want to go to its dashboard (by clicking on the name you gave it). You’ll see something like this (in this image the cluster is named daskhub).

Click on the Connect icon to the right of “+ Create”.

You then see this

Click on the link that says “Open Cloud Shell”.

You will get to a terminal and it should automatically run those az commands.
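
If the commands do not run automatically, or you come back later in a fresh shell, you can connect manually; a sketch using the names created earlier:

# Fetch credentials so kubectl can talk to the cluster
az aks get-credentials --resource-group SAFSJHub --name jhub

# Confirm the connection by listing the nodes
kubectl get nodes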

Create config.yaml

This will be the configuration file for your JupyterHub. For now, it can be just comments. Note the name is unimportant but should end in .yaml.

nano config.yaml

This will open the nano editor. Edit your file; it can be just a # (an empty comment) for now. Then Ctrl-O to save and Ctrl-X to exit.

Install daskhub via helm chart

Instructions: https://artifacthub.io/packages/helm/dask/daskhub.

Check that helm is installed. It should be.

helm version

Tell helm about the dask helm repository

helm repo add dask https://helm.dask.org
helm repo update

Now install. The first dhub is your release name and the second is the namespace name.

helm upgrade --wait --install --render-subchart-notes \
    dhub dask/daskhub \
    --namespace=dhub --create-namespace \
    --values=config.yaml

You will see this on successful installation (it’s long; much has been cut).
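
You can also confirm the release and its pods from the shell:

# The release should show as deployed
helm list --namespace dhub

# The hub and proxy pods should be Running
kubectl get pods --namespace dhub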

Set-up your external IP address

Set the namespace context for the Kubernetes cluster and get the external IP address.

kubectl config set-context $(kubectl config current-context) --namespace dhub
kubectl --namespace=dhub get service proxy-public

These commands will show the IP address under EXTERNAL-IP. Save the public IP address; you will need it when you create the DNS entry.

Set up https

You can log out of your cluster. The next steps are done elsewhere.

Create a domain name

You will need a domain name for https, which you want for security (and JupyterHub won’t stop complaining if you don’t have it). Find a domain name provider and set one up; it is not expensive. I used GoDaddy. If you already have a domain, you don’t need another; you create a subdomain under it.

Create a DNS entry

Let’s pretend you set up bluemountain123.live as the domain. Go to the DNS settings for your domain and add a type A record that points to the external IP address you saved earlier. This does two things: it creates the subdomain you will use to access your JupyterHub, and it routes that subdomain to your cluster. So if you create dhub as the type A DNS entry, then dhub.bluemountain123.live will be the url. You can have as many subdomains as you need.

Test if the url is working

http://dhub.bluemountain123.live would be the url using the example domain above. Test that it is working (shows a JupyterHub login) before moving on. This is what you should see:
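
DNS changes can take a little while to propagate. Before opening the browser, you can check that the name resolves to the EXTERNAL-IP you saved (example domain):

# Should return the proxy-public external IP
nslookup dhub.bluemountain123.live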

Set-up https on your JupyterHub

Log back into your Kubernetes cluster: go to portal.azure.com, click on your Kubernetes cluster name, and then click on “Connect”. Then click on “Open Cloud Shell”. Read the Z2JH documentation about https if you want more background.

Once you are on the shell, type

nano config.yaml

to edit the config file. Paste this in and save. Note the additional jupyterhub: key in the yaml file. This is not in a plain JupyterHub-on-Kubernetes config file (i.e., in a non-daskhub install, the jupyterhub: bit is not there and everything is shifted two spaces to the left).

jupyterhub:
  proxy:
    https:
      enabled: true
      hosts:
        - dhub.bluemountain123.live
      letsencrypt:
        contactEmail: your@email.com

Update the JupyterHub installation

Anytime you change config.yaml you need to run this command. Find the latest chart version on the ArtifactHub page linked above.

helm upgrade --cleanup-on-fail --render-subchart-notes dhub dask/daskhub --namespace dhub --version=2024.1.1 --values config.yaml

Test if https is working

COME BACK IN AN HOUR. It takes a while for a certificate to be issued. If it still doesn’t work, try editing the config file and re-upgrading.

Try https://dhub.bluemountain123.live and you should see the JupyterHub login without the http warning.
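
If you prefer to check from the shell, curl will fail the TLS handshake if the certificate has not been issued yet (example domain again):

# A clean response (no certificate error) means https is working
curl -I https://dhub.bluemountain123.live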

Set up GitHub authentication

I am going to show an example where I use a team in a GitHub organization to manage authentication. There are many other ways to manage users; search online for those.

Create a new Oauth Application on GitHub

This is going to be associated with your (personal) GitHub account, but you can use a team in a GitHub org that you are an owner of.

Log into GitHub and go to GitHub > Settings > Developer Settings > Oauth Apps > Register New Oauth Application

Look carefully at how I filled in the boxes.

Next you will see something like this

You need to copy the client ID, then click the button to generate a new client secret and copy that too. Save both for later.

Create a team in your GitHub org

You will be added by default and add anyone else who needs access to the hub. Let’s say your org is MyOrg and the team is called DaskHub. So then the allowed organization is MyOrg:DaskHub. You can leave off :DaskHub if you want to allow all members of the organization to log in.

Edit the config.yaml file

nano config.yaml

Add this to your config file so it now looks like the following. Replace the id, secret, and url with your values.

jupyterhub:
  hub:
    config:
      GitHubOAuthenticator:
        client_id: <replace with your OAuth id>
        client_secret: <replace with your OAuth app secret>
        oauth_callback_url: https://dhub.bluemountain123.live/hub/oauth_callback
        allowed_organizations:
          - MyOrg:DaskHub
        scope:
          - read:org
      JupyterHub:
        authenticator_class: github
      KubeSpawner:
        working_dir: /home/jovyan
  proxy:
    https:
      enabled: true
      hosts:
        - dhub.bluemountain123.live
      letsencrypt:
        contactEmail: your@email.com        

Update the hub

helm upgrade --cleanup-on-fail --render-subchart-notes dhub dask/daskhub --namespace dhub --version=2024.1.1 --values config.yaml

Test

You should now see this and can authenticate with GitHub.

Set up the container image

Now you need to specify the Docker image that will be used. Edit the config.yaml file and add the user image info under the jupyterhub: key (hence the two-space indent below). Note the spacing matters (a lot). The simplest version defines a single default image:

  singleuser:
    image:
      name: openscapes/python
      tag: f577786
    cmd: null

Alternatively, define a profileList so users can choose among several images. Use one form or the other; the file should not contain two singleuser: blocks.

  singleuser:
    # Defines the default image
    image:
      name: openscapes/python
      tag: f577786
    profileList:
      - display_name: "Python3"
        description: "NASA Openscapes Python image"
        default: true
      - display_name: "R"
        description: "NASA Openscapes RStudio image"
        kubespawner_override:
          image: openscapes/rocker:a7596b5        

Test a full config file

My config.yaml is a bit more customized to allow the user to reserve the amount of RAM that they need. Full config file: https://github.com/nmfs-opensci/nmfs-jhub/blob/main/config-template/fish-config.yaml

Cut out the part regarding the shared drive (the extraVolumes and extraVolumeMounts block shown below) and paste the rest of the text into config.yaml to try this config file out.

      extraVolumes:
        - name: jupyterhub-shared
          persistentVolumeClaim:
            claimName: daskhub-pvc
      extraVolumeMounts:
        - name: jupyterhub-shared
          mountPath: /home/jovyan/shared

Make sure the hub is working nicely before adding a shared drive; the shared drive part tends to break things.

Create a 2 TB shared drive

This will cost about $172 a month. Here are some notes on how to do this: https://www.c2labs.com/post/persistent-storage-on-kubernetes-for-azure

Create a storage account in the portal

First we must create a storage account (add the resource). The name must be globally unique, so don’t use the one here.

  • Go to the dashboard and look for “storage account”
  • Create one with a unique name
  • Use the resource group that the Kubernetes cluster is in. For me it is SAFSJHub
  • Choose Standard LRS storage
  • Kind StorageV2
  • Key bit: you need to set the storage account to use the VNET associated with the cluster. Look at the cluster; the VNET you want will be named something like aks-vnet-123.
  • Key bit: you need to create a file share. Make it 2 TB by setting the quota to 2048. Make sure it uses the same resource group as the cluster.

Something like this might work but you need to set the VNET.

az storage account create \
  --name safshubnfs \
  --resource-group SAFSJHub \
  --location westus2 \
  --sku Standard_LRS \
  --kind StorageV2 
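
Creating the file share from the CLI might look something like this (a sketch; the share name daskhub-shared and the 2048 GB quota match what the storage class and PV below expect, but you still need to sort out the VNET and protocol settings as described above):

az storage share-rm create \
  --resource-group SAFSJHub \
  --storage-account safshubnfs \
  --name daskhub-shared \
  --quota 2048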

Set up the storage class, PV and PVC

mkdir nfs-shared-setup
cd nfs-shared-setup

In here we create 3 files.

sc.yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: dask-fileshare
provisioner: file.csi.azure.com
allowVolumeExpansion: true
mountOptions:
  - nconnect=10 # per pod number of connections
parameters:
  resourceGroup: SAFSJHub
  server: safshubnfs.file.core.windows.net
  storageAccount: safshubnfs # storage account created manually
  shareName: daskhub-shared # a file share created manually
  protocol: nfs

pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: file.csi.azure.com
  name: daskhub-pv
spec:
  capacity:
    storage: 2048Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain  # if set to "Delete", the file share would be removed when the PVC is deleted
  storageClassName: dask-fileshare
  csi:
    driver: file.csi.azure.com
    readOnly: false
    volumeHandle: daskhub-shared
    volumeAttributes:
      resourceGroup: SAFSJHub
      storageAccount: safshubnfs # must match the storage account created above
      shareName: daskhub-shared
      protocol: nfs

pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: daskhub-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: dask-fileshare
  resources:
    requests:
      storage: 2048Gi

Now set everything up. Storage Classes and Persistent Volumes are not scoped to a namespace, but Persistent Volume Claims are.

# Apply storage class. View with "kubectl get sc"
kubectl apply -f sc.yaml

# Apply the Persistent Volume
kubectl apply -f pv.yaml

# Apply the Persistent Volume Claim
kubectl apply -f pvc.yaml --namespace=dhub
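
Check that everything bound correctly:

# Storage class (cluster-scoped)
kubectl get sc dask-fileshare

# Persistent volume (cluster-scoped); STATUS should be Available or Bound
kubectl get pv daskhub-pv

# Persistent volume claim (namespaced); STATUS should be Bound
kubectl get pvc daskhub-pvc --namespace=dhub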

Add the shared volume to the config.yaml

It will look like this.

  singleuser:
    startTimeout: 3600
    defaultUrl: /lab
    storage:
      capacity: 50Gi
      extraVolumes:
        - name: jupyterhub-shared
          persistentVolumeClaim:
            claimName: daskhub-pvc
      extraVolumeMounts:
        - name: jupyterhub-shared
          mountPath: /home/jovyan/shared
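
As with any change to config.yaml, run the helm upgrade again so the new mounts take effect (running user servers will not see the shared mount until they are restarted):

helm upgrade --cleanup-on-fail --render-subchart-notes dhub dask/daskhub --namespace dhub --version=2024.1.1 --values config.yaml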

Final config.yaml

My config.yaml is a bit more customized to allow the user to reserve the amount of RAM that they need.

Full config file: https://github.com/nmfs-opensci/nmfs-jhub/blob/main/config-template/fish-config.yaml