Troubleshooting
Misc tips
Check your hub
Verify that created Pods enter a Running state:
kubectl --namespace=dhub get pod
If a pod is stuck with a Pending or ContainerCreating status, diagnose with:
kubectl --namespace=dhub describe pod <name of pod>
If a pod keeps restarting, diagnose with:
kubectl --namespace=dhub logs --previous <name of pod>
Verify an external IP is provided for the k8s Service proxy-public.
kubectl --namespace=dhub get service proxy-public
If the external IP remains Pending, diagnose with:
kubectl --namespace=dhub describe service proxy-public
Finding core files
These are big and storage is expensive. Within a JHub terminal, run
find / -iname 'core.[0-9]*'
Then delete them.
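find can also delete the matches directly; double-check the list from the previous command first, since this removes the files permanently:
# delete all core dump files found
find / -iname 'core.[0-9]*' -delete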
List kernels
Open a shell for the cluster in the cloud provider, e.g. Cloud Shell in Azure from the overview tab for the Kubernetes cluster, then list the installed kernels:
jupyter kernelspec list
Remove a kernel:
jupyter kernelspec remove <kernel_name>
If the kernel is not in the usual place, use something like this to remove it:
jupyter kernelspec remove -p /home/jovyan/.local/share/jupyter/kernels notebook
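If you are working from a cloud shell rather than a terminal inside the hub, you can open a shell in a running user pod first and run the kernelspec commands there; a sketch assuming the dhub namespace used above and the default jupyter-<username> pod naming:
# open a shell inside a running user pod, then list kernels from there
kubectl --namespace=dhub exec -it jupyter-<username> -- /bin/bash
jupyter kernelspec list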
Create a kernel
# make sure ipykernel is in your env
conda install ipykernel
python -m ipykernel install --user --name mykernel
Creating a persistent environment
https://nmfs-opensci.github.io/nmfs-jhub/posts/JHub-User-Guide.html#using-your-own-conda-environment
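A minimal sketch of that workflow, assuming the home directory (/home/jovyan) is on the persistent volume and that conda is available; the env name and Python version here are just examples:
# create the environment under the persistent home directory
conda create --prefix /home/jovyan/envs/myenv python=3.11 ipykernel -y
# register it as a kernel so it shows up in the launcher and survives server restarts
/home/jovyan/envs/myenv/bin/python -m ipykernel install --user --name myenv --display-name "Python (myenv)"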
Troubleshooting hanging pods
- Search history
history | grep thingtosearch
- Find info on the nodes and regions/zones
kubectl get nodes --show-labels | grep topology.kubernetes.io
- Verify that created Pods enter a Running state:
kubectl --namespace=jhubk8 get pod
- If a pod is stuck with a Pending or ContainerCreating status, diagnose with:
kubectl --namespace=jhubk8 describe pod <name of pod>
- If a pod keeps restarting, diagnose with:
kubectl --namespace=jhubk8 logs --previous <name of pod>
- Delete a pod
kubectl --namespace=jhubk8 delete pod <name of pod>
- If the output says a specific container (e.g. an init container) is the problem, get that container's logs:
kubectl --namespace=dhub logs --previous hub-5f5d96968d-z59bx -c git-clone-templates
- Verify an external IP is provided for the k8s Service proxy-public.
kubectl --namespace=jhubk8 get service proxy-public
- If the external IP remains Pending, diagnose with:
kubectl --namespace=jhubk8 describe service proxy-public
- Get info on persistent volumes. Pods sometimes hang if there is a mismatch between the node region/zone and the PV region/zone (see the one-liner after this list):
kubectl get pv -n jhub
kubectl describe pv pvc-25a4c791-d2e7-4aaa-bf5a-459c3de0e60c -n jhub
Look for topology.kubernetes.io
- Get the pod specification (created by the JupyterHub helm chart):
kubectl get pod hub-5f5d96968d-z59bx -n dhub -oyaml > test2.yaml
Note: don't try kubectl apply -f test2.yaml to change the config on the fly. It breaks things with a JupyterHub deployment.
- Open a shell into a container. The container must be running:
kubectl exec --stdin --tty hub-5f5d96968d-z59bx --container git-clone-templates -- /bin/bash
- Debug a helm upgrade by adding --debug to the command:
helm upgrade --cleanup-on-fail --render-subchart-notes dhub dask/daskhub --namespace dhub --version=2024.1.1 --values config.yaml --debug
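A quick way to compare zones is to grep the topology labels on both the nodes and the PV; the PV name here is the example from above, so substitute your own:
# zone labels on the nodes
kubectl get nodes --show-labels | grep topology.kubernetes.io
# zone label in the PV's node affinity
kubectl describe pv pvc-25a4c791-d2e7-4aaa-bf5a-459c3de0e60c -n jhub | grep -A 3 topology.kubernetes.io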
History of problems I have solved
Problem with pod stuck in Init:CrashLoopBackOff
This was due to the git-clone-templates init container showing "user not known". Somehow the repo being cloned had been set to private, so the git clone needed credentials which it didn't have, and that caused the init container to fail.
- Verify that created Pods enter a Running state:
kubectl --namespace=jhubk8 get pod
- Get some info on the problem:
kubectl --namespace=jhubk8 describe pod <name of pod>
- If a pod keeps restarting, diagnose with:
kubectl --namespace=jhubk8 logs --previous <name of pod>
- Fix
- Tried applying an empty config.yaml, but that didn't replace the old one.
- Created a config-test.yaml without the init container part that had the git clone. Now the hub would start.
- Discovered that the repo was private. Fixed.
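To try the stripped-down config, the same helm upgrade command from above can be pointed at the test file; a sketch assuming the dhub release and namespace used elsewhere in these notes:
# apply the test config (no git-clone init container) to see if the hub starts
helm upgrade --cleanup-on-fail dhub dask/daskhub --namespace dhub --version=2024.1.1 --values config-test.yaml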
Problem with pod unable to start due to node affinity mismatch
I had set up my node pools to be in one region but across multiple zones. When I stopped the cluster and restarted, the system node ended up in a different zone than the hub database PV. I tried to stop and restart multiple times to see if the system node would by chance start in the right zone, but it didn't work. I had to tear down the cluster and start again with the region and a single zone specified.
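If you are on Azure and recreate the cluster, the node pools can be pinned to a single availability zone at creation time; a sketch with hypothetical resource group and cluster names (double-check the az flags for your CLI version):
# create the cluster with nodes pinned to one zone (zone 1 is just an example)
az aks create --resource-group myResourceGroup --name myAKSCluster --location westus2 --zones 1
# additional node pools should be pinned to the same zone
az aks nodepool add --resource-group myResourceGroup --cluster-name myAKSCluster --name userpool --zones 1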
- Get list of pvs
kubectl get pv -n jhub
- Find the one associated with the hub database (dhub/hub-db-dir).
- Get info on the PV region/zone. Look for topology.kubernetes.io.
kubectl get pv -n jhub
kubectl describe pv pvc-25a4c791-d2e7-4aaa-bf5a-459c3de0e60c -n jhub
- Get info on the region/zone for nodes.
kubectl get nodes --show-labels | grep topology.kubernetes.io
- Look for the one that is the system node (kubernetes.azure.com/mode=system).
- Look for a mismatch in zones, like westus2-1 versus westus2-3.
Node affinity mismatch prevents some user pods from starting
Write up in Jupyter Discourse: https://discourse.jupyter.org/t/fixed-node-affinity-mismatch-stopping-some-pods-from-starting/23020
I set up a JupyterHub with Kubernetes on Azure and had been using it with a small team of 3-4 for a year. Then I did a workshop to test it with more people. It worked great during the workshop. After the workshop, I crashed my server (ran out of RAM). No problem. That often happens and I restart. This time, I got a volume / node affinity error and the pod was stuck in Pending. Some other people could still launch pods, but I could not.
Turns out it was a mismatch between the zone that my user PVC was on and the zone of the node. As the cluster scaled up during the workshop, new nodes on westus2-1, westus2-2, and westus2-3 were created because I didn't specify the zone of my nodes when setting up the Kubernetes node pools. I only set the region: westus2. As the cluster auto-scaled back down, it just so happened that the 'last node standing' was on westus2-2. My user PVC is on westus2-1, and so there was a PVC / node mismatch.
Google installations need init pause for SSL
https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/2601
proxy:
  traefik:
    extraInitContainers:
      # This startup delay can help the k8s container network find the
      # https certificate and allow letsencrypt to work
      - name: startup-delay
        image: busybox:stable
        command: ["sh", "-c", "sleep 10"]
No https after restarting hub
The hub was stopped for a while, then restarted (from the Azure dashboard), and then helm was updated. I got connection refused due to an improper SSL certificate. I needed to delete the autohttps pod to get https working again.
kubectl get namespace
kubectl get pods --namespace dhub
kubectl delete pods --namespace dhub autohttps-554f6c47f-z86d9
Wait a bit and then it started working again.
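If you'd rather not look up the exact autohttps pod name, deleting by label should also work; this assumes the standard z2jh labels (app=jupyterhub, component=autohttps), which you can confirm with kubectl get pods --show-labels:
# delete the autohttps pod by label; the deployment recreates it automatically
kubectl delete pod --namespace dhub -l app=jupyterhub,component=autohttps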