```python
# import needed modules
import time, random

# define our functions
def inc(x):
    time.sleep(random.random())
    return x + 1

def dec(x):
    time.sleep(random.random())
    return x - 1

def add(x, y):
    time.sleep(random.random())
    return x + y
```
# Simple Dask example: Coiled

## Learning Objectives

- Set up a Coiled cluster
- Learn how to set up a Coiled account
## Overview

In this notebook, I set up a Dask cluster with Coiled. This cluster runs on new pods, on new virtual machines, outside of the JupyterHub. A Coiled cluster is quite similar to a Dask Gateway cluster, but it runs outside the JupyterHub, on your own cloud account.
| Feature | LocalCluster | Coiled |
|---|---|---|
| Runs in your notebook? | ✅ Yes | ❌ No (runs in new virtual machines) |
| Uses multiple pods? | ❌ No | ✅ Yes (scheduler + workers) |
| Scales beyond one pod? | ❌ No | ✅ Yes |
| Can use files in `/home`? | ✅ Yes | ❌ No |
## Set up a Coiled account

- Go to coiled.io and set up an account.
- You need to associate it with a cloud account on AWS, Azure, or GCP (Google). Each offers a free 12-month trial.
- I find the Google Cloud dashboard more intuitive to use, but the Coiled set-up there was difficult: I had to do it from my computer using Python, and then it asked me to install the Google Cloud SDK CLI. I stopped at that point.
- AWS was easier. I already had an AWS account. I clicked buttons on coiled.io, logged into AWS as instructed, and got it linked.

Now back to the Jupyter notebook:
- Go to a terminal and make sure you are in the same conda environment that you will run your notebook in. In the nmfs-openscapes JupyterHub you will be in that environment by default, but on other set-ups you might not be.
- Run `coiled setup aws`. This assumes you have the coiled module installed; if not, do `pip install coiled`.
- It is going to send you to the AWS CloudShell and will give you a link for that.
- It will give you some code to run in the shell. Note that I had to clean up the code to remove some extra spaces; I couldn't just paste and run.
- When everything is good, it will say it is authenticated.
- Now run `coiled` (still in the terminal) to make sure it is set up.
Example of the output from `coiled setup aws`:
```
(coiled) jovyan@jupyter-eeholmes:~$ coiled setup aws

Introduction

This uses your AWS credentials to set up Coiled.

This will do the following ...
1. Create limited IAM roles and grant them to Coiled
2. Check and expand your AWS quota if needed
3. Create initial resources to deploy clusters

This will not ...
1. Create resources that cost money
2. Grant Coiled access to your data

Missing: You don't have local AWS credentials.
That's ok, you can run setup from AWS CloudShell.

Run setup from AWS CloudShell with the following steps:

1. Go to https://console.aws.amazon.com/cloudshell
2. Sign in to your AWS account
   (if you usually switch role or profile, you should do this)
3. Run the following command in CloudShell:

   pip3 install coiled && \
   coiled login \
     --token b2bfb56 \
     --token 42713203-d \
     --token b82f66c && \
   coiled setup aws --region us-east-1
```
Now that we have Coiled set up, we can run a Coiled cluster in Python.
## Important note on the image

The new pods need to use the same image as the one you started your notebook with. Using as small an image as possible will reduce the set-up time, since each new pod needs to pull the image. For this tutorial, I am using `openscapes/python:07980b9` via the "Other" option.
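A minimal sketch of how the image could be passed when creating the cluster. This assumes your installed coiled version supports the `container` keyword for a Docker image; check the `coiled.Cluster` documentation for your version, since older set-ups use Coiled "software environments" instead:

```python
import coiled

# Sketch only: `container` is assumed to accept a Docker image name.
# Use the same image as the notebook pod so worker environments match.
cluster = coiled.Cluster(
    n_workers=15,
    container="openscapes/python:07980b9",
)
client = cluster.get_client()
```

This cannot run without a linked Coiled account, so treat it as a configuration sketch rather than a tested recipe.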
## Important note on file access

Since the Coiled pods (scheduler + workers) are separate from your notebook pod, they do not have access to files on local paths like `/home`.

If your code references a local file, e.g. `ds = xr.open_dataset("file.json")`, it will fail when `ds["sst"].mean().compute()` is called, because the workers can't see that file (which is in `/home`). `ds` is lazy, and it needs the information in "file.json" to know where the underlying data files that it needs to read are.
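The pitfall can be illustrated without a cluster. This is a hedged stdlib-only sketch (the `open_lazy` helper is hypothetical, standing in for a lazy xarray dataset): a lazy object only remembers the path, and the path must exist on whatever machine finally does the reading.

```python
import json, os, tempfile

def open_lazy(path):
    # Hypothetical stand-in for a lazy dataset: returns a zero-argument
    # task that reads the file only when called (i.e., at compute time).
    return lambda: json.load(open(path))

tmp = os.path.join(tempfile.mkdtemp(), "file.json")
with open(tmp, "w") as f:
    json.dump({"sst": [1, 2, 3]}, f)

task = open_lazy(tmp)   # cheap: no file access happens yet
print(task()["sst"])    # prints [1, 2, 3] here, where the file exists
# On a Coiled worker, task() would raise FileNotFoundError, because the
# notebook pod's /home is not mounted there. Put data in cloud object
# storage (e.g. s3:// URLs) so every worker can reach it.
```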
## Set up the functions

We use the `inc` and `dec` functions defined at the top of the notebook. Locally the task takes 15-20 seconds: 20 iterations of two functions that each sleep 0-1 second (about 0.5 seconds on average), so roughly 20 seconds of sleeping in total.
```python
%%time
# a sequential example with no parallelization
results = []
for x in range(20):
    result = inc(x)
    result = dec(result)
    results.append(result)
print(results)
```

```
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
CPU times: user 0 ns, sys: 3.51 ms, total: 3.51 ms
Wall time: 22.5 s
```
```python
import coiled  # use a Coiled cluster
```

Set up a cluster in the cloud where we can grab more workers. These 15 workers cost about 5 cents a minute, and this job takes about 5 minutes from set-up to finish, so roughly 25 cents. Almost all of that time is spent setting up the workers. You can go to your Coiled dashboard to see how much compute time you have used.
```python
cluster = coiled.Cluster(n_workers=15)  # run on a cluster in the cloud
client = cluster.get_client()
```
Coiled syncs your local packages to the cluster and reports anything that does not match. Example output:

```
Package Info
  coiled_local_jovyan: Source wheel built from /home/jovyan

Not Synced with Cluster
  myst-parser (Warning): Pip check had the following issues that need
  resolving: myst-parser 0.18.1 has requirement mdit-py-plugins~=0.3.1,
  but you have mdit-py-plugins 0.4.0.
```
```python
%%time
results = []
for x in range(20):  # scale 100x
    result = client.submit(inc, x)
    result = client.submit(dec, result)
    results.append(result)
results = client.gather(results)
print(results)
```

```
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
CPU times: user 140 ms, sys: 1.54 ms, total: 141 ms
Wall time: 2.18 s
```
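The submit/gather pattern above is the same futures model as Python's standard library `concurrent.futures`, which is handy for debugging task logic locally before paying for cloud workers. A self-contained sketch (sleeps shortened so it runs fast; `chain` is a helper introduced here, not part of the notebook):

```python
import time, random
from concurrent.futures import ThreadPoolExecutor

def inc(x):
    time.sleep(random.random() * 0.05)
    return x + 1

def dec(x):
    time.sleep(random.random() * 0.05)
    return x - 1

def chain(x):
    # compose the two tasks, like submitting dec on the result of inc
    return dec(inc(x))

# submit returns futures immediately; result() gathers them, like
# client.submit / client.gather on the Coiled cluster
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(chain, x) for x in range(20)]
    results = [f.result() for f in futures]

print(results)  # prints [0, 1, 2, ..., 19]
```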
```python
# Close our cluster as soon as we are done because running it uses $
client.close()
cluster.close()
```
## Summary

Coiled is a flexible and inexpensive way to churn through lots of data: pennies per terabyte. Get your workflow debugged locally or with Dask Gateway, and then run the heavy lifting on Coiled. It works in the background, so unlike Dask Gateway you don't need to keep your notebook connected the whole time.