```python
# import needed modules
import time, random

# define our functions
def inc(x):
    time.sleep(random.random())
    return x + 1

def dec(x):
    time.sleep(random.random())
    return x - 1

def add(x, y):
    time.sleep(random.random())
    return x + y
```
# Simple Dask example: Coiled

## Learning Objectives

- Set up a Coiled cluster
- Learn how to set up a Coiled account
## Overview

In this notebook, I set up a Dask cluster with Coiled. This cluster runs on new pods, on new virtual machines, outside of the JupyterHub. A Coiled cluster is quite similar to a Dask Gateway cluster, but it runs outside the JupyterHub, on your own cloud account.
| Feature | LocalCluster | Coiled |
|---|---|---|
| Runs in your notebook? | ✅ Yes | ❌ No (runs in new virtual machines) |
| Uses multiple pods? | ❌ No | ✅ Yes (scheduler + workers) |
| Scales beyond one pod? | ❌ No | ✅ Yes |
| Can use files in `/home`? | ✅ Yes | ❌ No |
## Set up a Coiled account

- Go to coiled.io and set up an account.
- You need to associate it with a cloud account on AWS, Azure, or GCP (Google). Each offers a free 12-month trial.
- I find the Google Cloud dashboard more intuitive to use, but the Coiled set-up there was difficult: I had to do it from my computer using Python, and then it asked me to install the Google Cloud SDK CLI. I stopped at that point.
- AWS was easier. I already had an AWS account. I clicked buttons on coiled.io, logged into AWS as instructed, and got it linked.

Now back to the Jupyter notebook:
- Go to a terminal and make sure you are in the same conda environment that you will run your notebook in. In the nmfs-openscapes JupyterHub you will be in that environment by default, but on other set-ups you might not be.
- Run `coiled setup aws`. This assumes you have the coiled module installed; if not, do `pip install coiled`.
- It is going to send you to the AWS CloudShell and will give you a link for that.
- It will give you some code to run in the shell. Note that I had to clean up the code to remove some extra spaces; I couldn't just paste and run.
- When everything is good, it will say it is authenticated.
- Now run `coiled` (still in the terminal) to make sure it is set up.
Example of the output from `coiled setup aws`:
```
(coiled) jovyan@jupyter-eeholmes:~$ coiled setup aws

Introduction

This uses your AWS credentials to set up Coiled.

This will do the following ...
1. Create limited IAM roles and grant them to Coiled
2. Check and expand your AWS quota if needed
3. Create initial resources to deploy clusters

This will not ...
1. Create resources that cost money
2. Grant Coiled access to your data

Missing: You don't have local AWS credentials.
That's ok, you can run setup from AWS CloudShell.

Run setup from AWS CloudShell with the following steps:

1. Go to https://console.aws.amazon.com/cloudshell
2. Sign in to your AWS account
   (if you usually switch role or profile, you should do this)
3. Run the following command in CloudShell:

   pip3 install coiled && \
   coiled login \
     --token b2bfb56 \
     --token 42713203-d \
     --token b82f66c && \
   coiled setup aws --region us-east-1
```
Now that we have Coiled set up, we can run a Coiled cluster in Python.
## Important note on the image

The new pods need to use the same image as the one you started your notebook with. Using as small an image as possible will reduce the set-up time, since each new pod needs to pull the image. For this tutorial, I am using `openscapes/python:07980b9` via the "Other" option.
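A minimal sketch of how the image could be passed when creating the cluster. This assumes your installed coiled version supports the `container` keyword for a Docker image; check the `coiled.Cluster` documentation for your version, since older set-ups use Coiled "software environments" instead:

```python
import coiled

# Sketch only: `container` is assumed to accept a Docker image name.
# Use the same image as the notebook pod so worker environments match.
cluster = coiled.Cluster(
    n_workers=15,
    container="openscapes/python:07980b9",
)
client = cluster.get_client()
```

This cannot run without a linked Coiled account, so treat it as a configuration sketch rather than a tested recipe.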
## Important note on file access

Since the Coiled pods (scheduler + workers) are separate from your notebook pod, they do not have access to files on local paths like `/home`.

If your code references a local file, e.g. `ds = xr.open_dataset("file.json")`, it will fail when `ds["sst"].mean().compute()` is called, because the workers can't see that file (which is in `/home`). `ds` is lazy, and it needs the information in "file.json" to know where the underlying data files that it needs to read are.
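The pitfall can be illustrated without a cluster. This is a hedged stdlib-only sketch (the `open_lazy` helper is hypothetical, standing in for a lazy xarray dataset): a lazy object only remembers the path, and the path must exist on whatever machine finally does the reading.

```python
import json, os, tempfile

def open_lazy(path):
    # Hypothetical stand-in for a lazy dataset: returns a zero-argument
    # task that reads the file only when called (i.e., at compute time).
    return lambda: json.load(open(path))

tmp = os.path.join(tempfile.mkdtemp(), "file.json")
with open(tmp, "w") as f:
    json.dump({"sst": [1, 2, 3]}, f)

task = open_lazy(tmp)   # cheap: no file access happens yet
print(task()["sst"])    # prints [1, 2, 3] here, where the file exists
# On a Coiled worker, task() would raise FileNotFoundError, because the
# notebook pod's /home is not mounted there. Put data in cloud object
# storage (e.g. s3:// URLs) so every worker can reach it.
```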
## Set up the functions

We use the `inc` and `dec` functions defined at the top of the notebook. Locally the task takes 15-20 seconds: 20 iterations of two functions that each sleep 0-1 second (about 0.5 seconds on average), so roughly 20 seconds of sleeping in total.
```python
%%time
# a sequential example with no parallelization
results = []
for x in range(20):
    result = inc(x)
    result = dec(result)
    results.append(result)
print(results)
```

```
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
CPU times: user 0 ns, sys: 3.51 ms, total: 3.51 ms
Wall time: 22.5 s
```
```python
import coiled  # use a Coiled cluster
```

Set up a cluster in the cloud where we can grab more workers. These 15 workers cost about 5 cents a minute, and this job takes about 5 minutes from set-up to finish, so roughly 25 cents. Almost all of that time is spent setting up the workers. You can go to your Coiled dashboard to see how much compute time you have used.
```python
cluster = coiled.Cluster(n_workers=15)  # run on a cluster in the cloud
client = cluster.get_client()
```
Coiled syncs your local packages to the cluster and reports anything that does not match. Example output:

```
Package Info
  coiled_local_jovyan: Source wheel built from /home/jovyan

Not Synced with Cluster
  myst-parser (Warning): Pip check had the following issues that need
  resolving: myst-parser 0.18.1 has requirement mdit-py-plugins~=0.3.1,
  but you have mdit-py-plugins 0.4.0.
```
```python
%%time
results = []
for x in range(20):  # scale 100x
    result = client.submit(inc, x)
    result = client.submit(dec, result)
    results.append(result)
results = client.gather(results)
print(results)
```

```
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
CPU times: user 140 ms, sys: 1.54 ms, total: 141 ms
Wall time: 2.18 s
```
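The submit/gather pattern above is the same futures model as Python's standard library `concurrent.futures`, which is handy for debugging task logic locally before paying for cloud workers. A self-contained sketch (sleeps shortened so it runs fast; `chain` is a helper introduced here, not part of the notebook):

```python
import time, random
from concurrent.futures import ThreadPoolExecutor

def inc(x):
    time.sleep(random.random() * 0.05)
    return x + 1

def dec(x):
    time.sleep(random.random() * 0.05)
    return x - 1

def chain(x):
    # compose the two tasks, like submitting dec on the result of inc
    return dec(inc(x))

# submit returns futures immediately; result() gathers them, like
# client.submit / client.gather on the Coiled cluster
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(chain, x) for x in range(20)]
    results = [f.result() for f in futures]

print(results)  # prints [0, 1, 2, ..., 19]
```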
```python
# Close our cluster as soon as we are done because running it uses $
client.close()
cluster.close()
```
## Summary

Coiled is a flexible and inexpensive way to churn through lots of data: pennies per terabyte. Get your workflow debugged locally or with Dask Gateway, and then run the heavy lifting on Coiled. It works in the background, so unlike Dask Gateway you don't need to keep your notebook connected the whole time.