Add GPU
Setting up GPU-enabled computing
Background
For certain types of model fitting, GPU-enabled computing can speed up the calculations by orders of magnitude. This is especially the case for machine-learning tasks.
Reading
- Zero2JH Instructions
- Pangeo ML Cloud
- WIP on GPU on GKE. From 2019-2020 but lots of good info.
Requirements:
- GPU-enabled node pool
- Drivers specific to the chip, e.g. an NVIDIA T4 chip requires NVIDIA drivers. Note that in a Kubernetes cluster, drivers may be installed by default for GPU-enabled node pools. This is part of what causes GPU-enabled nodes to take a long time to spin up.
- For Kubernetes, an NVIDIA device plugin (for NVIDIA chips). This advertises the GPUs on each node to Kubernetes so that pods can request access to them.
Basic workflow
- Update your quotas on your cloud provider to include GPU compute (e.g. Azure).
- Look up how to add GPU to a Kubernetes cluster in your cloud provider:
  - Azure
  - GCP
  - AWS
- Add a nodepool with GPU. Make sure autoscale has min 0 so that no nodes are spun up yet.
- Install the driver (unless it is installed by default for your node pool) and install the Kubernetes device plugin.
- Force the nodepool to scale up by setting autoscale min to 1.
- Check that the node with the GPU reports GPU capacity.
- Edit your config.yaml for the JHub to add a GPU capacity request and an image with CUDA-enabled packages (meaning the packages recognize that a GPU is available and use it).
- Restart the JHub with helm upgrade
- Test
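The "check that the node reports GPU capacity" step can be sketched in Python. This is a minimal sketch, assuming you have saved or piped the output of kubectl get nodes -o json and that the NVIDIA device plugin advertises GPUs under the resource name nvidia.com/gpu. The helper name gpu_nodes and the sample node names are hypothetical.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # resource name advertised by the NVIDIA device plugin


def gpu_nodes(node_list: dict) -> dict:
    """Map node name -> GPU count for every node that reports GPU capacity."""
    found = {}
    for node in node_list.get("items", []):
        capacity = node.get("status", {}).get("capacity", {})
        # Kubernetes reports resource quantities as strings, e.g. "1"
        count = int(capacity.get(GPU_RESOURCE, "0"))
        if count > 0:
            found[node["metadata"]["name"]] = count
    return found


# Hypothetical, heavily trimmed sample of `kubectl get nodes -o json` output
sample = {
    "items": [
        {
            "metadata": {"name": "gpu-node-0"},
            "status": {"capacity": {"cpu": "4", "nvidia.com/gpu": "1"}},
        },
        {
            "metadata": {"name": "cpu-node-0"},
            "status": {"capacity": {"cpu": "4"}},
        },
    ]
}

print(gpu_nodes(sample))  # {'gpu-node-0': 1}
```

An empty result after the node pool has scaled up usually suggests that the driver or the device plugin is not running on the node yet.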
Size of machine
In a teaching or workshop setting, the following are commonly chosen machine sizes.
- 1 GPU; 16 GB RAM
  - AWS: g4dn.xlarge $385/mo
  - GCP: n1-standard-4 with an nvidia-tesla-t4 accelerator attached (T4 GPUs attach to the N1 machine family)
  - Azure: Standard_NC4as_T4_v3 (or ) $383/mo
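The monthly prices above apply to a node that runs all month. For a short workshop with autoscaling, a rough estimate can be derived from them. The sketch below assumes 730 hours per month, one GPU node per participant, and that the quoted monthly price scales linearly with hours. The function name workshop_cost is hypothetical, and actual billing depends on your provider and region.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def workshop_cost(monthly_price: float, nodes: int, hours: float) -> float:
    """Estimate the cost of running `nodes` GPU nodes for `hours` hours each."""
    hourly_price = monthly_price / HOURS_PER_MONTH
    return round(hourly_price * nodes * hours, 2)


# Example: 10 participants, one g4dn.xlarge-sized node each, for an 8-hour workshop
print(workshop_cost(385, nodes=10, hours=8))
```

Keeping the autoscale minimum at 0 outside of the workshop is what makes this estimate meaningful, since idle GPU nodes are billed at the same rate.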
Adding GPU in Azure
Follow the instructions in "Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS)".
My steps in June 2024.
Test
Open a notebook and run
import torch
torch.cuda.is_available()
It should return True.
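If the check does not return True, a slightly longer report can help narrow down whether the problem is the image (no PyTorch), or the node or driver (no CUDA device visible). This is a minimal sketch; the helper name gpu_report is hypothetical, and it only uses torch.cuda.is_available, torch.cuda.device_count, and torch.cuda.get_device_name.

```python
import importlib.util


def gpu_report() -> dict:
    """Summarize whether PyTorch is installed and which CUDA devices it sees."""
    report = {"torch_installed": False, "cuda_available": False, "devices": []}
    if importlib.util.find_spec("torch") is None:
        # The image does not ship PyTorch at all
        return report

    import torch

    report["torch_installed"] = True
    report["cuda_available"] = bool(torch.cuda.is_available())
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return report


print(gpu_report())
```

On a working GPU profile, cuda_available should be True and devices should list one entry for the single requested GPU.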
Config file
Add this in the profile list:
profileList:
  - display_name: NVIDIA Tesla T4, 28 GB, 4 CPUs
    description: "Start a container on a dedicated node with a GPU"
    slug: "gpu"
    profile_options:
      image:
        display_name: Image
        choices:
          pytorch:
            display_name: Pangeo PyTorch ML Notebook
            default: true
            slug: "pytorch"
            kubespawner_override:
              image: "quay.io/pangeo/pytorch-notebook:2023.09.19"
    kubespawner_override:
      environment:
        NVIDIA_DRIVER_CAPABILITIES: compute,utility
      mem_limit: null
      mem_guarantee: 14G
      node_selector:
        node.kubernetes.io/instance-type: Standard_NC4as_T4_v3
      extra_resource_limits:
        nvidia.com/gpu: "1"
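Before running helm upgrade, it can be useful to sanity-check that a profile entry actually requests a GPU and is pinned to the GPU node pool. The sketch below is only an illustration: the profile dict mirrors the kubespawner_override part of the entry above, and check_gpu_profile is a hypothetical helper, not part of JupyterHub or KubeSpawner.

```python
def check_gpu_profile(profile: dict) -> list:
    """Return a list of problems found in one profileList entry (empty if none)."""
    problems = []
    override = profile.get("kubespawner_override", {})

    limits = override.get("extra_resource_limits", {})
    if "nvidia.com/gpu" not in limits:
        problems.append("no nvidia.com/gpu entry in extra_resource_limits")

    if not override.get("node_selector"):
        problems.append("no node_selector pinning the pod to the GPU node pool")

    return problems


# Python mirror of the relevant part of the profile entry
profile = {
    "display_name": "NVIDIA Tesla T4, 28 GB, 4 CPUs",
    "slug": "gpu",
    "kubespawner_override": {
        "mem_guarantee": "14G",
        "node_selector": {
            "node.kubernetes.io/instance-type": "Standard_NC4as_T4_v3"
        },
        "extra_resource_limits": {"nvidia.com/gpu": "1"},
    },
}

print(check_gpu_profile(profile))  # []
```

The same check could be run on the parsed config.yaml if a YAML parser is available in your environment.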
Problems
Whenever I change the image, the hub will not restart and I get timeout errors. Does it need to pull the image first?