Creating data cubes on THREDDS servers via OPeNDAP protocol
📘 Learning Objectives
- Practice working with data on THREDDS and OPeNDAP servers
- Open a single remote file
- Create a data cube from URLs to multiple files
- Do some basic data aggregation
Overview
THREDDS is a common type of data server that usually offers multiple ways to access the same data. One of those ways is the OPeNDAP protocol, which lets you subset data on the server instead of downloading whole data files. Here you will learn to use xarray's open_mfdataset to create data cubes from files on THREDDS servers via the OPeNDAP protocol. This tutorial uses an example server that does not require authentication (username and password).
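To see what server-side subsetting looks like, here is a minimal sketch using one of the files from the server used below. Only metadata is read when the file is opened, and only the selected slab is transferred when we load; the particular slices are arbitrary choices for illustration.

```python
import xarray as xr

# open one remote file via OPeNDAP; this reads only metadata
url = "https://psl.noaa.gov/thredds/dodsC/Datasets/ncep.reanalysis/surface/air.sig995.1948.nc"
ds = xr.open_dataset(url)

# select a small slab; lat runs 90 -> -90 on this grid, so slice(60, 20)
subset = ds["air"].sel(lat=slice(60, 20), lon=slice(230, 300), time="1948-07")
subset.load()  # only this subset crosses the network
```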
Acknowledgments: This example was adapted from Ryan Abernathey: https://rabernat.github.io/research_computing_2018/xarray-tips-and-tricks.html
NCEP-NCAR Reanalysis 1
For our first example, we will use the 4xDaily Air temperature at sigma level 995 data (air.sig995) from the NCEP-NCAR Reanalysis 1. First look at the THREDDS catalog to orient yourself to the file naming convention. We can click on one of the files and see our access options. We are looking for the OPeNDAP entry, since we need its URL. Clicking the OPeNDAP link reveals the URL format of the files:
https://psl.noaa.gov/thredds/dodsC/Datasets/ncep.reanalysis/surface/air.sig995.1948.nc
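As an aside, you can also list a catalog's datasets programmatically instead of reading URLs off the web page. Below is a hedged sketch using the siphon package (not used elsewhere in this tutorial); the catalog.xml URL and the 'OPENDAP' service key are my assumptions about how this particular server is configured.

```python
from siphon.catalog import TDSCatalog

# assumed location of the surface-level reanalysis catalog
cat = TDSCatalog("https://psl.noaa.gov/thredds/catalog/Datasets/ncep.reanalysis/surface/catalog.xml")

# collect the OPeNDAP access URL for each air.sig995 file
urls = [d.access_urls["OPENDAP"]
        for name, d in cat.datasets.items()
        if name.startswith("air.sig995")]
```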
Now we can construct the file URLs.
Set-up if on Colab

```python
# uncomment and run if on Colab
#!pip install -q netCDF4 pydap
```
Open files
```python
import xarray as xr

base_url = 'http://psl.noaa.gov/thredds/dodsC/Datasets/ncep.reanalysis/surface/air.sig995'
files = [f'{base_url}.{year}.nc' for year in range(1948, 2025)]
len(files)
```
77
Open a single file.
```python
files[1]
```
'http://psl.noaa.gov/thredds/dodsC/Datasets/ncep.reanalysis/surface/air.sig995.1949.nc'

```python
%%time
ds = xr.open_dataset(files[1])
ds
```
CPU times: user 1.55 s, sys: 183 ms, total: 1.73 s
Wall time: 2.07 s
<xarray.Dataset> Size: 61MB
Dimensions: (time: 1460, lat: 73, lon: 144)
Coordinates:
* time (time) datetime64[ns] 12kB 1949-01-01 ... 1949-12-31T18:00:00
* lat (lat) float32 292B 90.0 87.5 85.0 82.5 ... -82.5 -85.0 -87.5 -90.0
* lon (lon) float32 576B 0.0 2.5 5.0 7.5 10.0 ... 350.0 352.5 355.0 357.5
Data variables:
air (time, lat, lon) float32 61MB ...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1949)
description: Data is from NMC initialized reanalysis\...
platform: Model
history: created 99/05/29 by Hoop (netCDF2.3)\nCo...
dataset_title: NCEP-NCAR Reanalysis 1
References: http://www.psl.noaa.gov/data/gridded/dat...
_NCProperties: version=2,netcdf=4.6.3,hdf5=1.10.5
DODS_EXTRA.Unlimited_Dimension: time

```python
ds.air.isel(time=1).plot()
```
We will load the file metadata with open_mfdataset and create our virtual data cube. This takes tens of seconds, but it doesn't use much memory because only the file metadata are read.
```python
%%time
import xarray as xr
ds = xr.open_mfdataset(files)
```
CPU times: user 2.91 s, sys: 413 ms, total: 3.32 s
Wall time: 25.8 s

```python
ds
```
<xarray.Dataset> Size: 5GB
Dimensions: (time: 112500, lat: 73, lon: 144)
Coordinates:
* time (time) datetime64[ns] 900kB 1948-01-01 ... 2024-12-31T18:00:00
* lat (lat) float32 292B 90.0 87.5 85.0 82.5 ... -82.5 -85.0 -87.5 -90.0
* lon (lon) float32 576B 0.0 2.5 5.0 7.5 10.0 ... 350.0 352.5 355.0 357.5
Data variables:
air (time, lat, lon) float32 5GB dask.array<chunksize=(1464, 73, 144), meta=np.ndarray>
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\...
platform: Model
history: created 99/05/11 by Hoop (netCDF2.3)\nCo...
dataset_title: NCEP-NCAR Reanalysis 1
References: http://www.psl.noaa.gov/data/gridded/dat...
_NCProperties: version=2,netcdf=4.6.3,hdf5=1.10.5
DODS_EXTRA.Unlimited_Dimension: time

```python
ds['air'].isel(time=2).plot();
```
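open_mfdataset also takes options that can matter for remote data. The sketch below shows two of them; they are optional, not something this tutorial requires, and produce the same cube here. Note that parallel=True opens files concurrently, which some servers throttle.

```python
# combine="by_coords" concatenates the files along their shared time coordinate;
# parallel=True opens the files concurrently via dask.delayed
ds = xr.open_mfdataset(files, combine="by_coords", parallel=True)
```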
The data set is not that huge, but it is bigger than the 2 GB of RAM on our minimal JupyterHub.
print(f"{ds.nbytes / 1e9} Gb")0.61452314 Gb
But loading even a small subset of the data would crash our 2 GB of RAM.
```python
ds_sub = ds.sel(time=slice("1948", "1958"))
print(f"{ds_sub.nbytes / 1e9} Gb")
# ds_sub.load()  # this would crash a 2 GB RAM machine
```
0.6759249 Gb
Creating daily means
For one year we can create a daily mean, since a single year is not that much data and fits into memory.
```python
ds_mean = (
    ds["air"]
    .sel(time="1950")
    .mean(dim=['lat', 'lon'])
)
ds_mean.plot();
```
Error:curl error: Stream error in the HTTP/2 framing layer
Error:DAP DATADDS packet is apparently too short

Errors like these appear to be transient server-side hiccups; re-running the cell typically succeeds.

```python
# resample to daily
ds_mean.resample(time='D').mean().plot();
```
But if we try to do all years at once, we will run out of memory.
Chunking the data
Fortunately, xarray can process our data in chunks rather than loading everything into memory. The data were chunked automatically into yearly chunks because we created the dataset with open_mfdataset and each file holds one year. A chunk is 1460 time steps (4 × 365) in a normal year and 1464 in a leap year.
print(ds["air"].chunks)((1464, 1460, 1460, 1460, 1464), (73,), (144,))
I am going to rechunk smaller, to roughly 1 MB chunks (24 time steps × 73 lats × 144 lons × 4 bytes ≈ 1.0 MB). This should make the computation go a little faster and use a little less memory.
```python
import dask
ds_chunk = ds["air"].sel(time=slice("1948", "1958")).chunk({'time': 24, 'lat': -1, 'lon': -1})
ds_chunk
```
<xarray.DataArray 'air' (time: 7308, lat: 73, lon: 144)> Size: 307MB
dask.array<rechunk-merge, shape=(7308, 73, 144), dtype=float32, chunksize=(24, 73, 144), chunktype=numpy.ndarray>
Coordinates:
* time (time) datetime64[ns] 58kB 1948-01-01 ... 1952-12-31T18:00:00
* lat (lat) float32 292B 90.0 87.5 85.0 82.5 ... -82.5 -85.0 -87.5 -90.0
* lon (lon) float32 576B 0.0 2.5 5.0 7.5 10.0 ... 350.0 352.5 355.0 357.5
Attributes: (12/13)
long_name: 4xDaily Air temperature at sigma level 995
units: degK
precision: 2
GRIB_id: 11
GRIB_name: TMP
var_desc: Air temperature
... ...
parent_stat: Other
actual_range: [185.16 322.1 ]
valid_range: [185.16 331.16]
dataset: NCEP Reanalysis
level_desc: 0.995 sigma
_ChunkSizes: [ 1 73 144]

```python
# This takes about 4 minutes; 1.4 GB
from dask.diagnostics import ProgressBar
with ProgressBar():
    mean_all_years = ds_chunk.mean(dim=['lat', 'lon']).compute()
```
[######## ] | 21% Completed | 211.11 ms
Error:DAP DATADDS packet is apparently too short
[########################################] | 100% Completed | 615.40 ms
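Since DAP read errors like the one above come and go, it can help to wrap the compute in a simple retry loop. This is a hedged sketch, not part of the original tutorial; compute_with_retries, n_tries, and wait are hypothetical names, and the broad Exception catch is a deliberate simplification.

```python
import time

def compute_with_retries(obj, n_tries=3, wait=5):
    """Retry .compute() a few times, since DAP read errors are often transient."""
    for attempt in range(n_tries):
        try:
            return obj.compute()
        except Exception:
            if attempt == n_tries - 1:
                raise
            time.sleep(wait)

# usage (hypothetical):
# mean_all_years = compute_with_retries(ds_chunk.mean(dim=['lat', 'lon']))
```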
```python
mean_all_years.plot();
```
```python
# resample to monthly; 'ME' (month end) is the newer pandas alias, older versions use 'M'
mean_all_years.resample(time='ME').mean().plot();
```
Conclusion
We practiced creating data cubes from URLs to files on a THREDDS server via the OPeNDAP protocol.
References
- xarray and OPeNDAP: https://www.jamstec.go.jp/ridinfo/xarray-and-opendap/
- More OPeNDAP + xarray debugging: https://github.com/stuckyb/gcdl/issues/24
- Another OPeNDAP example: https://github.com/ornldaac/daymet-python-opendap-xarray/blob/master/1_daymetv4_discovery_access_subsetting.ipynb
- Nice example: https://rabernat.github.io/research_computing_2018/xarray-tips-and-tricks.html