Loading ERDDAP Data with xarray

Author

Eli Holmes with help from Roy Mendelssohn, Sunny Hospital


📘 Learning Objectives

  1. Load ERDDAP data with xarray
  2. Use some xarray functions on data cubes

Overview

We can open ERDDAP gridded dataset collections with xarray. ERDDAP automatically creates an aggregated metadata layer that combines all granules in a collection, so we can call xarray.open_dataset(collection_url) to get a data cube without having to point to each individual file. Make sure the collection URL points to the dataset without any subsetting specification. Under the hood, xarray loads the collection metadata lazily; after loading, you can subset the full collection as needed.

We will use the Multi-scale Ultra-high Resolution (MUR) SST Analysis fv04.1, Global, 0.01°, 2002-present, Daily data from NASA JPL. This is a high-resolution dataset of over 150 terabytes.

For those not working in the JupyterHub

Create a code cell and run pip install xarray.
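
For example, in a notebook cell. The extra packages here are assumptions about what the later steps in this tutorial typically need: netCDF4 for the OPeNDAP access, dask for the chunked computations, and zarr for the Zarr example.

# Install xarray plus assumed extras used later in this tutorial:
# netCDF4 (OPeNDAP access), dask (chunked computations), zarr (the Zarr example)
!pip install xarray netCDF4 dask zarr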

Load the data set (lazily)

import xarray as xr
url = "https://coastwatch.pfeg.noaa.gov/erddap/griddap/jplMURSST41"
ds = xr.open_dataset(url)
print(f"Size in TB: {ds.nbytes / 1e12:.2f} TB")
ds.sizes
Size in TB: 150.79 TB
Frozen({'time': 8311, 'latitude': 17999, 'longitude': 36000})
ds
<xarray.Dataset> Size: 151TB
Dimensions:           (time: 8310, latitude: 17999, longitude: 36000)
Coordinates:
  * time              (time) datetime64[ns] 66kB 2002-06-01T09:00:00 ... 2025...
  * latitude          (latitude) float32 72kB -89.99 -89.98 ... 89.98 89.99
  * longitude         (longitude) float32 144kB -180.0 -180.0 ... 180.0 180.0
Data variables:
    analysed_sst      (time, latitude, longitude) float64 43TB ...
    analysis_error    (time, latitude, longitude) float64 43TB ...
    mask              (time, latitude, longitude) float32 22TB ...
    sea_ice_fraction  (time, latitude, longitude) float64 43TB ...
Attributes: (12/51)
    _NCProperties:              version=2,netcdf=4.7.4,hdf5=1.8.12
    acknowledgement:            Please acknowledge the use of these data with...
    cdm_data_type:              Grid
    comment:                    MUR = "Multi-scale Ultra-high Resolution"
    Conventions:                CF-1.6, COARDS, ACDD-1.3
    creator_email:              ghrsst@podaac.jpl.nasa.gov
    ...                         ...
    summary:                    This is a merged, multi-sensor L4 Foundation ...
    testOutOfDate:              now-3days
    time_coverage_end:          2025-03-04T09:00:00Z
    time_coverage_start:        2002-06-01T09:00:00Z
    title:                      Multi-scale Ultra-high Resolution (MUR) SST A...
    Westernmost_Easting:        -179.99
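
Because the dataset was opened lazily, only metadata has been transferred so far. You can inspect that metadata before deciding how to subset; for example, using attributes visible in the listing above:

# Nothing but metadata has been downloaded yet; these attributes
# come from the dataset listing above
print(ds.attrs["time_coverage_start"], "to", ds.attrs["time_coverage_end"])
print(ds["analysed_sst"].attrs["units"])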

Subset this data set

First we will select the SST variable and subset to a 2 degree box and 1 year of data. This creates a much smaller data set of about 120 MB.

dc = ds['analysed_sst'].sel(
                        latitude=slice(33.5, 35.5), 
                        longitude=slice(-75.5, -73.5), 
                        time=slice('2003-01-01', '2003-12-31'))
print(f"Size in GB: {dc.nbytes / 1e9:.2f} GB")
dc.sizes
Size in GB: 0.12 GB
Frozen({'time': 365, 'latitude': 201, 'longitude': 201})
%%time
# Load the data into memory so the next steps are fast
dc.load();
CPU times: user 1.15 s, sys: 275 ms, total: 1.43 s
Wall time: 1min 33s

Plot one day

dc.sel(time="2003-10-01").plot();

Compute the mean SST over the whole data set

Computing the mean of a data array returns a data array; here .item() is used to return just the value.

dc.mean().item()
24.8006164888092

We can plot the spatial mean raster

We do this by taking the mean over the time dimension.

dc.mean(dim=['time']).plot();

We can plot the daily means

We do this by taking the mean over the space dimensions.

dc.mean(dim=['latitude', 'longitude']).plot();

Resample to create monthly means

The resample function takes a long time if the data are not already loaded in memory, so you will want to run something like dc.load() (or dc = dc.compute()) before running the resample code.

dc_monthly = dc.resample(time='ME').mean()
dc_monthly.mean(dim=['latitude', 'longitude']).plot();

Chunking to do larger than memory computations

If we load in all the days for our 2 degree square, it is 2.7 GB, which is larger than the memory on the small virtual machines (2 GB). If we ran the daily-mean code on it, xarray would load all of that data into memory and crash the kernel. In addition, ERDDAP has a 2 GB per-request limit, so we would run into that as well. We can use dask and chunking to do the calculation in smaller pieces so we neither run out of memory nor hit the per-request download limit.

dc_all = ds['analysed_sst'].sel(
                        latitude=slice(33.5, 35.5), 
                        longitude=slice(-75.5, -73.5))
print(f"Size in GB: {dc_all.nbytes / 1e9:.2f} GB")
dc_all.sizes
Size in GB: 2.69 GB
Frozen({'time': 8311, 'latitude': 201, 'longitude': 201})

Use dask to chunk the data into 500-day chunks. Now when we process the data, only one chunk is loaded at a time, and we can (slowly) work through all the chunks without maxing out the memory.

import dask
dc_chunk = dc_all.chunk({'time': 500, 'latitude': -1, 'longitude': -1})
dc_chunk
<xarray.DataArray 'analysed_sst' (time: 8311, latitude: 201, longitude: 201)> Size: 3GB
dask.array<xarray-<this-array>, shape=(8311, 201, 201), dtype=float64, chunksize=(500, 201, 201), chunktype=numpy.ndarray>
Coordinates:
  * time       (time) datetime64[ns] 66kB 2002-06-01T09:00:00 ... 2025-03-05T...
  * latitude   (latitude) float32 804B 33.5 33.51 33.52 ... 35.48 35.49 35.5
  * longitude  (longitude) float32 804B -75.5 -75.49 -75.48 ... -73.51 -73.5
Attributes:
    colorBarMaximum:  32.0
    colorBarMinimum:  0.0
    comment:          Interim near-real-time (nrt) version using Multi-Resolu...
    ioos_category:    Temperature
    long_name:        Analysed Sea Surface Temperature
    source:           MODIS_T-JPL, MODIS_A-JPL, AMSR2-REMSS, AVHRRMTB_G-NAVO,...
    standard_name:    sea_surface_foundation_temperature
    units:            degree_C
    valid_max:        57.767
    valid_min:        -7.767000000000003

Here I will use just a few years of data (2002 through 2006) to illustrate the concept, since chugging through all the years would take a while. You can watch the memory usage (in the bottom nav bar of the JupyterHub) and see that it stays low. Note that these chunks (roughly 160 MB each) are a bit smaller than needed, and we could speed things up by optimizing the chunk size.
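
As a rough check on that chunk size, here is a back-of-the-envelope sketch using the float64 dtype and the (500, 201, 201) chunk shape shown in the dask listing above.

# Approximate size of one dask chunk: 500 days x 201 lats x 201 lons x 8 bytes (float64)
chunk_bytes = 500 * 201 * 201 * 8
print(f"One chunk is about {chunk_bytes / 1e6:.0f} MB")  # ~162 MB, well under ERDDAP's 2 GB per-request limit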

# This takes about 6 minutes
from dask.diagnostics import ProgressBar

with ProgressBar():
    mean_all_years = dc_chunk.sel(time=slice("2002", "2006")).mean(dim=['latitude', 'longitude']).compute()
[########################################] | 100% Completed | 400.90 s
mean_all_years.plot();

We can resample to monthly.

mean_all_years.resample(time='ME').mean().plot();

Note that earlier, when we had loaded all the data into memory, the resample code ran fast because the data were already there. With the chunked data we don't keep all of it in memory, so running resample on it would mean chugging through all the chunks again, i.e. it would not be fast. Instead, above, we resampled the already-computed daily means to monthly rather than doing this:

dc_monthly = dc_chunk.resample(time='ME').mean()

Compare to Zarr

This dataset is also available on AWS in Zarr format. This is a cloud-optimized format and computations with it are considerably faster.

import xarray as xr
ds2 = xr.open_zarr("https://mur-sst.s3.us-west-2.amazonaws.com/zarr-v1")
dc2 = ds2['analysed_sst'].sel(
                        lat=slice(33.5, 35.5), 
                        lon=slice(-75.5, -73.5),
                        time=slice('2003-01-01', '2006-12-31'))

Processing the Zarr dataset on AWS takes only about 1.2 minutes, versus about 6.7 minutes with the data on ERDDAP.

import dask
dc2_chunk = dc2.chunk({'time': 500, 'lat': -1, 'lon': -1})
from dask.diagnostics import ProgressBar

with ProgressBar():
    mean_all_years = dc2_chunk.mean(dim=['lat', 'lon']).compute()
[########################################] | 100% Completed | 73.60 s
mean_all_years.plot();

Summary

This tutorial shows you how to create xarray data cubes for gridded data served by ERDDAP. The key is to use the URL in the right format, <erddap server>/erddap/griddap/<collection id>, with no extra information such as .html, a file type, or any time or space subsetting.
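
As a minimal recap, using the same dataset as in this tutorial, the URL to pass to xarray is the bare griddap collection URL:

import xarray as xr

# Works: the bare collection URL, <erddap server>/erddap/griddap/<collection id>
ds = xr.open_dataset("https://coastwatch.pfeg.noaa.gov/erddap/griddap/jplMURSST41")

# Avoid adding anything after the collection id, for example:
#   .../erddap/griddap/jplMURSST41.html            (the dataset's web page)
#   .../erddap/griddap/jplMURSST41.nc?analysed_sst (a file-type / subset request)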

Loading data available in Zarr format (a cloud-optimized format) works similarly and results in faster processing, but cloud-optimized data formats are still relatively uncommon.