import os
import s3fs
import fsspec
import boto3
import xarray as xr
import geopandas as gpd
Using the S3 Scratch Bucket
The JupyterHub has a preconfigured S3 “Scratch Bucket” that automatically deletes files after 7 days. This is a great resource for experimenting with large datasets and working collaboratively on a shared dataset with other users.
Access the scratch bucket
The scratch bucket is hosted at s3://nmfs-openscapes-scratch. The JupyterHub automatically sets an environment variable SCRATCH_BUCKET that appends your GitHub username as a suffix to the S3 URL. This is intended to keep track of file ownership, stay organized, and prevent users from overwriting data!
Everyone has full access to the scratch bucket, so be careful not to overwrite data from other users when uploading files. Also, any data you put there will be deleted 7 days after it is uploaded.
If you need more permanent S3 storage, refer to the AWS_S3_bucket documentation to configure your own S3 bucket.
We’ll use the S3FS Python package, which provides a nice interface for interacting with S3 buckets.
# My GitHub username is `eeholmes`
scratch = os.environ['SCRATCH_BUCKET']
scratch
's3://nmfs-openscapes-scratch/eeholmes'
# But you can set a different S3 object prefix to use:
scratch = 's3://nmfs-openscapes-scratch/hackhours'
Uploading data
It’s great to store data in S3 buckets because this storage features very high network throughput. If many users are simultaneously accessing the same file on a spinning networked hard drive (/home/jovyan/shared), performance can be quite slow. S3 offers much higher performance for such cases.
Upload a single file
s3 = s3fs.S3FileSystem()

local_file = '~/NOAAHackDays/topics-2025/2025-02-14-earthdata/littlecube.nc'
remote_object = f"{scratch}/littlecube.nc"

s3.upload(local_file, remote_object)
[None]
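Because everyone shares the scratch bucket, it can be worth checking whether an object already exists before uploading over it. A minimal sketch (not part of the original notebook; it reuses the `s3`, `scratch`, and `local_file` names defined above):

# Hedged sketch: skip the upload if the object already exists in the shared bucket
target = f"{scratch}/littlecube.nc"
if s3.exists(target):
    print(f"{target} already exists -- choose a different name or prefix")
else:
    s3.upload(local_file, target)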
Once a bucket has files, I can list them. If the bucket is empty, you will get an error instead of [].
s3.ls(scratch)
['nmfs-openscapes-scratch/hackhours/littlecube.nc']
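As noted above, listing a prefix with no objects raises an error rather than returning an empty list. A small sketch of guarding against that (an illustration, not code from the original notebook):

# Hedged sketch: treat a missing or empty prefix like an empty listing
try:
    files = s3.ls(scratch)
except FileNotFoundError:
    files = []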
s3.stat(remote_object)
{'Key': 'nmfs-openscapes-scratch/hackhours/littlecube.nc',
'LastModified': datetime.datetime(2025, 2, 13, 21, 41, 5, tzinfo=tzlocal()),
'ETag': '"d73616d9e3ad84cf58a4a676b1e3d454"',
'ChecksumAlgorithm': ['CRC32'],
'ChecksumType': 'FULL_OBJECT',
'Size': 50224,
'StorageClass': 'STANDARD',
'type': 'file',
'size': 50224,
'name': 'nmfs-openscapes-scratch/hackhours/littlecube.nc'}
Upload a directory
local_dir = '~/NOAAHackDays/topics-2025/resources'
!ls -lh {local_dir}
total 5.9M
-rw-r--r-- 1 jovyan jovyan 5.9M Feb 12 21:05 e_sst.nc
drwxr-xr-x 3 jovyan jovyan 281 Feb 12 21:18 longhurst_v4_2010
s3.upload(local_dir, scratch, recursive=True)
[None, None, None, None, None, None, None, None, None]
The remote directory takes the name (only) of the local directory, not its full local path.
s3.ls(f'{scratch}/resources')
['nmfs-openscapes-scratch/hackhours/resources/e_sst.nc',
'nmfs-openscapes-scratch/hackhours/resources/longhurst_v4_2010']
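The same call works in reverse if you want to pull the whole directory back out of the bucket. A short sketch (the local destination path here is just an example):

# Hedged sketch: recursively download the uploaded directory to local scratch space
s3.download(f'{scratch}/resources', '/tmp/resources', recursive=True)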
Accessing Data
Some software packages allow you to stream data directly from S3 Buckets. But you can always pull objects from S3 and work with local file paths.
This download-first, then analyze workflow typically works well for older file formats like HDF and netCDF that were designed to perform well on local hard drives rather than Cloud storage systems like S3.
For best performance, do not work with data in your home directory. Instead, use a local scratch space like `/tmp`.
remote_object
's3://nmfs-openscapes-scratch/hackhours/littlecube.nc'
local_object = '/tmp/test.nc'
s3.download(remote_object, local_object)
[None]
ds = xr.open_dataset(local_object)
ds
<xarray.Dataset> Size: 97kB
Dimensions:       (time: 366, lat: 8, lon: 8)
Coordinates:
  * time          (time) datetime64[ns] 3kB 2020-01-01 2020-01-02 ... 2020-12-31
  * lat           (lat) float32 32B 33.62 33.88 34.12 ... 34.88 35.12 35.38
  * lon           (lon) float32 32B -75.38 -75.12 -74.88 ... -73.88 -73.62
Data variables:
    analysed_sst  (time, lat, lon) float32 94kB ...
If you don't want to think about downloading files, you can let `fsspec` handle this behind the scenes for you! This way you only need to think about remote paths.
fs = fsspec.filesystem("simplecache",
                       cache_storage='/tmp/files/',
                       same_names=True,
                       target_protocol='s3')
# The `simplecache` setting above will download the full file to /tmp/files
print(remote_object)
with fs.open(remote_object) as f:
    ds = xr.open_dataset(f.name)  # NOTE: pass f.name for local cached path
s3://nmfs-openscapes-scratch/hackhours/littlecube.nc
ds
<xarray.Dataset> Size: 97kB
Dimensions:       (time: 366, lat: 8, lon: 8)
Coordinates:
  * time          (time) datetime64[ns] 3kB 2020-01-01 2020-01-02 ... 2020-12-31
  * lat           (lat) float32 32B 33.62 33.88 34.12 ... 34.88 35.12 35.38
  * lon           (lon) float32 32B -75.38 -75.12 -74.88 ... -73.88 -73.62
Data variables:
    analysed_sst  (time, lat, lon) float32 94kB ...
Cloud-optimized formats
Other formats like COG, Zarr, and Parquet are ‘Cloud-optimized’ and allow for very efficient streaming directly from S3. In other words, you do not need to download entire files and can instead easily read subsets of the data.
The example below reads a Parquet file directly into memory (RAM) from S3 without using a local disk:
# first upload the file
local_file = '~/NOAAHackDays/topics-2025/resources/example.parquet'
remote_object = f"{scratch}/example.parquet"

s3.upload(local_file, remote_object)
[None]
gf = gpd.read_parquet(remote_object)
gf.head(2)
|   | pop_est | continent | name | iso_a3 | gdp_md_est | geometry |
|---|---------|-----------|------|--------|------------|----------|
| 0 | 889953.0 | Oceania | Fiji | FJI | 5496 | MULTIPOLYGON (((180 -16.06713, 180 -16.55522, ... |
| 1 | 58005463.0 | Africa | Tanzania | TZA | 63177 | POLYGON ((33.90371 -0.95, 34.07262 -1.05982, 3... |
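Zarr works the same way: you write a chunked store to the bucket and read back only the pieces you need. A minimal sketch (not from the original notebook; it assumes the zarr package is installed and reuses the `ds` dataset opened earlier):

# Hedged sketch: write the small dataset to a Zarr store in the scratch bucket,
# then stream a subset back without downloading whole files
store = s3fs.S3Map(root=f"{scratch}/littlecube.zarr", s3=s3)
ds.to_zarr(store, mode="w")

ds_zarr = xr.open_zarr(store)
june_sst = ds_zarr["analysed_sst"].sel(time="2020-06")  # reads only the needed chunks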
Advanced: Access Scratch bucket outside of JupyterHub
Let’s say you have a lot of files on your laptop that you want to work with. The S3 bucket is a convenient way to upload large datasets for collaborative analysis. To do this, you need to copy AWS credentials from the JupyterHub to use on other machines. More extensive documentation on this workflow can be found in this repository: https://github.com/scottyhq/jupyter-cloud-scoped-creds.
The following code must be run on the JupyterHub to get temporary credentials:
client = boto3.client('sts')

with open(os.environ['AWS_WEB_IDENTITY_TOKEN_FILE']) as f:
    TOKEN = f.read()

response = client.assume_role_with_web_identity(
    RoleArn=os.environ['AWS_ROLE_ARN'],
    RoleSessionName=os.environ['JUPYTERHUB_CLIENT_ID'],
    WebIdentityToken=TOKEN,
    DurationSeconds=3600
)
`response` will be a Python dictionary that looks like this:
{'Credentials': {'AccessKeyId': 'ASIAYLNAJMXY2KXXXXX',
'SecretAccessKey': 'J06p5IOHcxq1Rgv8XE4BYCYl8TG1XXXXXXX',
'SessionToken': 'IQoJb3JpZ2luX2VjEDsaCXVzLXdlc////0dsD4zHfjdGi/0+s3XKOUKkLrhdXgZ8nrch2KtzKyYyb...',
'Expiration': datetime.datetime(2023, 7, 21, 19, 51, 56, tzinfo=tzlocal())},
...
You can copy and paste the values to another computer, and use them to configure your access to S3:
s3 = s3fs.S3FileSystem(key=response['Credentials']['AccessKeyId'],
                       secret=response['Credentials']['SecretAccessKey'],
                       token=response['Credentials']['SessionToken'])
# Confirm your credentials give you access
# Confirm your credentials give you access
s3.ls('nmfs-openscapes-scratch', refresh=True)
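Alternatively, on the other machine you can export the same values as the standard AWS environment variables so that boto3, the AWS CLI, and s3fs pick them up automatically. A sketch (remember these temporary credentials expire after the DurationSeconds you requested):

# Hedged sketch: export the temporary credentials as standard AWS environment variables
creds = response['Credentials']
os.environ['AWS_ACCESS_KEY_ID'] = creds['AccessKeyId']
os.environ['AWS_SECRET_ACCESS_KEY'] = creds['SecretAccessKey']
os.environ['AWS_SESSION_TOKEN'] = creds['SessionToken']

s3 = s3fs.S3FileSystem()  # now picks up the credentials from the environment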