Using the S3 Scratch Bucket

The JupyterHub has a preconfigured S3 “Scratch Bucket” that automatically deletes files after 7 days. This is a great resource for experimenting with large datasets and working collaboratively on a shared dataset with other users.

Access the scratch bucket

The scratch bucket is hosted at s3://nmfs-openscapes-scratch. The JupyterHub automatically sets an environment variable, SCRATCH_BUCKET, that appends your GitHub username as a suffix to the S3 URL. This is intended to keep track of file ownership, stay organized, and prevent users from overwriting each other's data!

Everyone has full access to the scratch bucket, so be careful not to overwrite data from other users when uploading files. Also, any data you put there will be deleted 7 days after it is uploaded.
If you need more permanent S3 bucket storage, refer to the AWS_S3_bucket documentation (left) to configure your own S3 bucket.

We’ll use the S3FS Python package, which provides a nice interface for interacting with S3 buckets.

import os
import s3fs
import fsspec
import boto3
import xarray as xr
import geopandas as gpd
# My GitHub username is `eeholmes`
scratch = os.environ['SCRATCH_BUCKET']
scratch 
's3://nmfs-openscapes-scratch/eeholmes'
# But you can set a different S3 object prefix to use:
scratch = 's3://nmfs-openscapes-scratch/hackhours'

Uploading data

It’s great to store data in S3 buckets because this storage offers very high network throughput. If many users are simultaneously accessing the same file on a spinning networked hard drive (/home/jovyan/shared), performance can be quite slow. S3 has much higher performance for such cases.

Upload a single file

local_file = '~/NOAAHackDays/topics-2025/2025-02-14-earthdata/littlecube.nc'

remote_object = f"{scratch}/littlecube.nc"

# create the filesystem object used for all S3 operations below
s3 = s3fs.S3FileSystem()
s3.upload(local_file, remote_object)
[None]

Once the bucket prefix has files, you can list them. If the prefix is empty, `s3.ls` raises an error instead of returning `[]` (see the guarded example below).

s3.ls(scratch)
['nmfs-openscapes-scratch/hackhours/littlecube.nc']
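Here is a minimal guard for the empty-prefix case, reusing the same `s3` and `scratch` objects:

# ls raises FileNotFoundError when there are no objects under the prefix yet
try:
    files = s3.ls(scratch)
except FileNotFoundError:
    files = []
files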
s3.stat(remote_object)
{'Key': 'nmfs-openscapes-scratch/hackhours/littlecube.nc',
 'LastModified': datetime.datetime(2025, 2, 13, 21, 41, 5, tzinfo=tzlocal()),
 'ETag': '"d73616d9e3ad84cf58a4a676b1e3d454"',
 'ChecksumAlgorithm': ['CRC32'],
 'ChecksumType': 'FULL_OBJECT',
 'Size': 50224,
 'StorageClass': 'STANDARD',
 'type': 'file',
 'size': 50224,
 'name': 'nmfs-openscapes-scratch/hackhours/littlecube.nc'}
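Because everyone shares the bucket, it can be worth checking whether an object already exists before uploading over it. A minimal sketch, reusing the `s3`, `local_file`, and `remote_object` defined above:

# avoid clobbering an object that someone else (or an earlier run) already uploaded
if s3.exists(remote_object):
    print(f"{remote_object} already exists -- choose a different name")
else:
    s3.upload(local_file, remote_object)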

Upload a directory

local_dir = '~/NOAAHackDays/topics-2025/resources'

!ls -lh {local_dir}
total 5.9M
-rw-r--r-- 1 jovyan jovyan 5.9M Feb 12 21:05 e_sst.nc
drwxr-xr-x 3 jovyan jovyan  281 Feb 12 21:18 longhurst_v4_2010
s3.upload(local_dir, scratch, recursive=True)
[None, None, None, None, None, None, None, None, None]

The remote directory name is just the final directory name of the local path (resources), not the full local path.

s3.ls(f'{scratch}/resources')
['nmfs-openscapes-scratch/hackhours/resources/e_sst.nc',
 'nmfs-openscapes-scratch/hackhours/resources/longhurst_v4_2010']
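You can pull the whole directory back down the same way. A minimal sketch that copies it to local scratch space:

# download the uploaded directory recursively to /tmp
s3.download(f'{scratch}/resources', '/tmp/resources', recursive=True)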

Accessing data

Some software packages can stream data directly from S3 buckets, but you can always pull objects from S3 and work with local file paths.

This download-first, then analyze workflow typically works well for older file formats like HDF and netCDF that were designed to perform well on local hard drives rather than Cloud storage systems like S3.

For best performance, do not work with data in your home directory. Instead, use local scratch space like `/tmp`.
remote_object
's3://nmfs-openscapes-scratch/hackhours/littlecube.nc'
local_object = '/tmp/test.nc'
s3.download(remote_object, local_object)
[None]
ds = xr.open_dataset(local_object)
ds
<xarray.Dataset> Size: 97kB
Dimensions:       (time: 366, lat: 8, lon: 8)
Coordinates:
  * time          (time) datetime64[ns] 3kB 2020-01-01 2020-01-02 ... 2020-12-31
  * lat           (lat) float32 32B 33.62 33.88 34.12 ... 34.88 35.12 35.38
  * lon           (lon) float32 32B -75.38 -75.12 -74.88 ... -73.88 -73.62
Data variables:
    analysed_sst  (time, lat, lon) float32 94kB ...
If you don't want to think about downloading files, you can let `fsspec` handle this behind the scenes for you! This way you only need to think about remote paths.
fs = fsspec.filesystem("simplecache", 
                       cache_storage='/tmp/files/',
                       same_names=True,  
                       target_protocol='s3',
                       )
# The `simplecache` setting above will download the full file to /tmp/files
print(remote_object)
with fs.open(remote_object) as f:
    ds = xr.open_dataset(f.name) # NOTE: pass f.name for local cached path
s3://nmfs-openscapes-scratch/hackhours/littlecube.nc
ds
<xarray.Dataset> Size: 97kB
Dimensions:       (time: 366, lat: 8, lon: 8)
Coordinates:
  * time          (time) datetime64[ns] 3kB 2020-01-01 2020-01-02 ... 2020-12-31
  * lat           (lat) float32 32B 33.62 33.88 34.12 ... 34.88 35.12 35.38
  * lon           (lon) float32 32B -75.38 -75.12 -74.88 ... -73.88 -73.62
Data variables:
    analysed_sst  (time, lat, lon) float32 94kB ...
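Because `same_names=True`, the cached copy keeps its original filename under the cache directory, so you can confirm the download landed there. A minimal check:

import os
# the full file should now be cached locally under /tmp/files/
os.listdir('/tmp/files/')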

Cloud-optimized formats

Other formats like COG, Zarr, and Parquet are ‘Cloud-optimized’ and allow for very efficient streaming directly from S3. In other words, you do not need to download entire files; instead you can easily read subsets of the data.

The example below reads a Parquet file directly into memory (RAM) from S3 without using a local disk:

# first upload the file
local_file = '~/NOAAHackDays/topics-2025/resources/example.parquet'

remote_object = f"{scratch}/example.parquet"

s3.upload(local_file, remote_object)
[None]
gf = gpd.read_parquet(remote_object)
gf.head(2)
      pop_est continent      name iso_a3  gdp_md_est                                           geometry
0    889953.0   Oceania      Fiji    FJI        5496  MULTIPOLYGON (((180 -16.06713, 180 -16.55522, ...
1  58005463.0    Africa  Tanzania    TZA       63177  POLYGON ((33.90371 -0.95, 34.07262 -1.05982, 3...
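Because Parquet is columnar, you can also read just a subset of columns instead of the whole object. A minimal sketch using pandas (assumed to be available alongside geopandas) against the same file:

import pandas as pd
# only the requested columns are fetched from S3
df = pd.read_parquet(remote_object, columns=['name', 'pop_est'])
df.head(2)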

Advanced: Access the scratch bucket outside of JupyterHub

Let’s say you have a lot of files on your laptop that you want to work with. The S3 scratch bucket is a convenient way to upload large datasets for collaborative analysis. To do this, you need to copy AWS credentials from the JupyterHub to use on other machines. More extensive documentation on this workflow can be found in this repository: https://github.com/scottyhq/jupyter-cloud-scoped-creds.

The following code must be run on the JupyterHub to get temporary credentials:

client = boto3.client('sts')

with open(os.environ['AWS_WEB_IDENTITY_TOKEN_FILE']) as f:
    TOKEN = f.read()

response = client.assume_role_with_web_identity(
    RoleArn=os.environ['AWS_ROLE_ARN'],
    RoleSessionName=os.environ['JUPYTERHUB_CLIENT_ID'],
    WebIdentityToken=TOKEN,
    DurationSeconds=3600
)

`response` will be a Python dictionary that looks like this:

{'Credentials': {'AccessKeyId': 'ASIAYLNAJMXY2KXXXXX',
  'SecretAccessKey': 'J06p5IOHcxq1Rgv8XE4BYCYl8TG1XXXXXXX',
  'SessionToken': 'IQoJb3JpZ2luX2VjEDsaCXVzLXdlc////0dsD4zHfjdGi/0+s3XKOUKkLrhdXgZ8nrch2KtzKyYyb...',
  'Expiration': datetime.datetime(2023, 7, 21, 19, 51, 56, tzinfo=tzlocal())},
  ...
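One quick way to get these values off the hub is to print them as JSON (a minimal sketch; the expiration datetime is converted to a string so it can be serialized):

import json
# print the temporary credentials so they can be copied to another machine
creds = {k: str(v) for k, v in response['Credentials'].items()}
print(json.dumps(creds, indent=2))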

You can copy and paste the values to another computer, and use them to configure your access to S3:

s3 = s3fs.S3FileSystem(key=response['Credentials']['AccessKeyId'],
                       secret=response['Credentials']['SecretAccessKey'],
                       token=response['Credentials']['SessionToken'] )
# Confirm your credentials give you access
s3.ls('nmfs-openscapes-scratch', refresh=True)