Comparing DAP2 and DAP4

Author

Eli Holmes (NOAA)

Key Differences

DAP2 is the older protocol and DAP4 is the newer one. In this notebook, I run some comparison code. According to ChatGPT, these are some of the differences.

| Feature | DAP2 | DAP4 |
| --- | --- | --- |
| Data Model | Supports simple types (e.g., integers, floats, strings, arrays) and some complex structures. | Supports a richer set of data types, including better handling of nested structures and new metadata constructs. |
| Data Encoding | Uses ASCII and binary (older, less efficient encoding). | Uses more modern binary encoding, including NetCDF-4/HDF5-like structures. |
| Metadata Handling | Limited support for additional metadata. | Supports richer metadata, allowing self-describing datasets. |
| Chunked Data Access | Limited ability to access specific parts of large datasets. | Improved ability to request and return chunks of data efficiently. |
| Support for Modern Formats | Less support for modern formats like HDF5 and NetCDF-4. | Native support for HDF5 and NetCDF-4, allowing better integration with existing scientific workflows. |
| Efficient Transfers | Less efficient for large datasets. | More efficient for large datasets due to better compression and structured data requests. |
| Constraint Expressions | Limited filtering and subsetting capabilities. | More expressive constraints, enabling more sophisticated data selection. |

Setup

import earthaccess
import pydap
import xarray as xr
earthaccess.login()
session = earthaccess.get_requests_https_session()

Comparison

This dataset has both DAP4 and DAP2 access: https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/hyrax/MERRA/MAI1NXINT.5.2.0/2016/01/contents.html

If we use https:// in the URL, pydap will automatically use DAP2. To use DAP4, we use dap4:// instead. pydap issues a warning when we access data with DAP2 this way, so let’s turn that off.

import warnings

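# Suppress only the specific warning from PyDAP.
# `message` is a regex matched against the start of the warning text, so no trailing wildcard is needed.
warnings.filterwarnings("ignore", message="PyDAP was unable to determine the DAP protocol", category=UserWarning)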
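
Since the DAP2 and DAP4 URLs used in this notebook differ only in the scheme, one can be derived from the other. The following is a minimal sketch; `with_scheme` is a hypothetical helper name and is not part of pydap.

```python
def with_scheme(url: str, scheme: str) -> str:
    """Return `url` with everything before '://' replaced by `scheme`."""
    _, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"URL has no scheme: {url!r}")
    return f"{scheme}://{rest}"


base = (
    "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/"
    "2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf"
)
dap4_url = with_scheme(base, "dap4")      # ask pydap for DAP4
dap2_url = with_scheme(dap4_url, "https")  # back to the DAP2 default
print(dap4_url)
```

Keeping a single base URL and switching the scheme avoids copy-paste differences between the two URLs being compared.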

DAP4 is faster for lazy loading

%%time
url = "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf"
ds2 = xr.open_dataset(url, engine="pydap", session=session)
CPU times: user 111 ms, sys: 8.42 ms, total: 119 ms
Wall time: 6.84 s
%%time
url = "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf"
ds4 = xr.open_dataset(url, engine="pydap", session=session)
CPU times: user 56.9 ms, sys: 9.32 ms, total: 66.2 ms
Wall time: 3.32 s

DAP4 adds slashes to the dims

This is not a pydap thing. You can see it on the OPeNDAP access page too.

ds4.dims
FrozenMappingWarningOnValuesAccess({'/TIME': 24, '/YDim': 361, '/XDim': 540})
ds2.dims
FrozenMappingWarningOnValuesAccess({'TIME': 24, 'YDim': 361, 'XDim': 540})
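
If the same code needs to work with both protocols, the slash-prefixed names can be mapped back to the plain ones. This is a small sketch in plain Python; `strip_leading_slash` is a hypothetical helper, and it assumes the only difference between the names is the leading slash, as in the output above.

```python
def strip_leading_slash(names):
    """Map each name that starts with '/' to the same name without the slash."""
    return {name: name.lstrip("/") for name in names if name.startswith("/")}


mapping = strip_leading_slash(["/TIME", "/YDim", "/XDim"])
print(mapping)
# {'/TIME': 'TIME', '/YDim': 'YDim', '/XDim': 'XDim'}
```

A mapping like this could presumably be passed to `ds4.rename(...)`, which is the same rename that is written out by hand in the saving step later in this notebook.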

Data loading time is faster with DAP2

At least in this example.

%%time
ds4["TQL"].isel({"/TIME": 2, "/YDim": 10}).load();
CPU times: user 56.2 ms, sys: 3.21 ms, total: 59.4 ms
Wall time: 3.55 s
%%time
ds2["TQL"].isel({"TIME": 2, "YDim": 10}).load();
CPU times: user 16.9 ms, sys: 0 ns, total: 16.9 ms
Wall time: 977 ms
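
A single `%%time` measurement of a network request can vary a lot from run to run, so a comparison like this is more convincing when repeated. The following is a minimal sketch of a repeat-timing helper; `time_repeated` is a hypothetical name, and the workload here is a stand-in so the block runs without network access.

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Call `func` `repeats` times and return the wall-clock seconds of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in workload. In this notebook the callable would instead wrap a request,
# e.g. lambda: ds2["TQL"].isel({"TIME": 2, "YDim": 10}).load()
timings = time_repeated(lambda: sum(range(100_000)), repeats=5)
print(f"median {statistics.median(timings):.4f} s, min {min(timings):.4f} s")
```

If client-side or server-side caching is a concern, the callable may need to re-open the dataset on each call so that later repeats are not measuring cached data.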

Mean is not too different

I need to re-open the datasets because I loaded data in the previous example.

%%time
url = "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf"
ds2 = xr.open_dataset(url, engine="pydap", session=session)
CPU times: user 674 ms, sys: 57.7 ms, total: 732 ms
Wall time: 7.5 s
%%time
ds2["TQL"].mean().compute();
CPU times: user 132 ms, sys: 76.9 ms, total: 208 ms
Wall time: 2.61 s
%%time
url = "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf"
ds4 = xr.open_dataset(url, engine="pydap", session=session)
CPU times: user 136 ms, sys: 8.96 ms, total: 145 ms
Wall time: 3.59 s
%%time
ds4["TQL"].mean().compute();
CPU times: user 144 ms, sys: 66.7 ms, total: 210 ms
Wall time: 2.44 s

Adding constraints decreases lazy loading time, but DAP4 is still faster

For DAP4, add ?dap4.ce=/TQL to select just TQL. For DAP2, add ?TQL.

%%time
url = "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf?TQL"
#pydap_ds2 = pydap.client.open_url(url, protocol="dap2", session=session)
ds2 = xr.open_dataset(url, engine="pydap", session=session)
CPU times: user 36 ms, sys: 0 ns, total: 36 ms
Wall time: 1.74 s
%%time
url = "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf?dap4.ce=/TQL"
ds4 = xr.open_dataset(url, engine="pydap", session=session)
CPU times: user 15.1 ms, sys: 7.53 ms, total: 22.6 ms
Wall time: 947 ms
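
The two constraint styles above can be wrapped in one function so that the variable name is written only once. This is a sketch under two assumptions: the URL has no query string yet, and the variable sits at the root of the dataset (hence the single leading `/` for DAP4). `add_constraint` is a hypothetical helper, not a pydap function.

```python
def add_constraint(url: str, variable: str) -> str:
    """Append a single-variable constraint in the style used above for each protocol."""
    if "?" in url:
        raise ValueError("URL already has a query string")
    if url.startswith("dap4://"):
        return f"{url}?dap4.ce=/{variable}"  # DAP4 constraint expression
    return f"{url}?{variable}"               # DAP2 projection


path = "goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf"
print(add_constraint("https://" + path, "TQL"))
print(add_constraint("dap4://" + path, "TQL"))
```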

xarray.open_mfdataset is faster with DAP4

The speed difference is smaller than it was with xarray.open_dataset, though.

%%time
urls = [
    "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf?dap4.ce=/TQL",
    "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160102.hdf?dap4.ce=/TQL",
    "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160103.hdf?dap4.ce=/TQL"
]
ds4 = xr.open_mfdataset(urls, engine="pydap",
                        parallel=True, 
                        combine='nested', 
                        concat_dim='/TIME',
                        session=session)
CPU times: user 68.8 ms, sys: 5 ms, total: 73.8 ms
Wall time: 968 ms
%%time
urls = [
    "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf?TQL",
    "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160102.hdf?TQL",
    "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160103.hdf?TQL"
]
ds2 = xr.open_mfdataset(urls, engine="pydap",
                        parallel=True, 
                        combine='nested', 
                        concat_dim='TIME',
                        session=session)
CPU times: user 114 ms, sys: 11.1 ms, total: 125 ms
Wall time: 1.59 s
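
For more than a few days, the URL lists above could be generated instead of typed out. The sketch below assumes that the `YYYY/MM/` directory layout and the `MERRA300.prod.assim.inst1_2d_int_Nx.YYYYMMDD.hdf` file name seen in the three URLs hold for the other dates requested. I have only checked that pattern against the three URLs used here, so treat it as an assumption. `daily_urls` is a hypothetical helper.

```python
from datetime import date, timedelta

# Taken from the URLs used in this notebook.
HOST_PATH = "goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0"
FILE_STEM = "MERRA300.prod.assim.inst1_2d_int_Nx"  # assumed constant across dates


def daily_urls(start: date, ndays: int, scheme: str = "dap4"):
    """Build one URL per day, starting at `start`, without any constraint."""
    urls = []
    for offset in range(ndays):
        day = start + timedelta(days=offset)
        urls.append(
            f"{scheme}://{HOST_PATH}/{day:%Y}/{day:%m}/{FILE_STEM}.{day:%Y%m%d}.hdf"
        )
    return urls


# Append the DAP4 constraint used above to select only TQL.
urls = [u + "?dap4.ce=/TQL" for u in daily_urls(date(2016, 1, 1), 3)]
for u in urls:
    print(u)
```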

Mean is now faster with DAP4

For this example.

%%time
ds4.mean().compute();
CPU times: user 475 ms, sys: 143 ms, total: 619 ms
Wall time: 2.95 s
%%time
ds2.mean().compute();
CPU times: user 393 ms, sys: 108 ms, total: 501 ms
Wall time: 3.12 s

Rechunking affects DAP2 and DAP4 similarly

All of the datasets start with the chunks shown below. Rechunking doesn’t seem to slow one protocol down more than the other.

print(ds4["TQL"].chunks)
((24, 24, 24), (361,), (540,))
ds2_rechunked = ds2.chunk({"TIME": 2, "XDim": 25, "YDim": 25})
%%time
ds2_rechunked.mean().compute();
CPU times: user 9.39 s, sys: 1.28 s, total: 10.7 s
Wall time: 11.9 s
ds4_rechunked = ds4.chunk({"/TIME": 2, "/XDim": 25, "/YDim": 25})
%%time
ds4_rechunked.mean().compute();
CPU times: user 9.06 s, sys: 1.29 s, total: 10.4 s
Wall time: 11.4 s
ds2_rechunked = ds2.chunk({"TIME": 50, "XDim": -1, "YDim": -1})
%%time
ds2_rechunked.mean().compute();
CPU times: user 411 ms, sys: 98 ms, total: 509 ms
Wall time: 2.96 s
ds4_rechunked = ds4.chunk({"/TIME": 50, "/XDim": -1, "/YDim": -1})
%%time
ds4_rechunked.mean().compute();
CPU times: user 465 ms, sys: 182 ms, total: 647 ms
Wall time: 2.85 s
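
To put the two rechunking choices in perspective, it helps to count how many chunks each one produces. The sketch below is plain arithmetic on the array shape implied by the chunks printed above: 72 time steps (3 files of 24), 361 rows, and 540 columns. `n_chunks` is a hypothetical helper, not a dask function.

```python
import math


def n_chunks(shape, chunk_sizes):
    """Count chunks when each dimension is split into blocks of the given size.

    A chunk size of -1 means a single chunk spanning the whole dimension.
    """
    total = 1
    for dim_len, size in zip(shape, chunk_sizes):
        total *= 1 if size == -1 else math.ceil(dim_len / size)
    return total


shape = (72, 361, 540)  # (TIME, YDim, XDim) after combining three daily files

print(n_chunks(shape, (24, -1, -1)))  # as opened: 3
print(n_chunks(shape, (2, 25, 25)))   # small chunks: 11880
print(n_chunks(shape, (50, -1, -1)))  # large chunks: 2
```

The small-chunk layout yields thousands of chunks versus two or three for the others. That is one plausible reason the small-chunk mean took around 11 s for both protocols, although I did not test that explanation here.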

Saving to netcdf

The timings are not too different. I have to remove the slashes from the DAP4 dimension names before saving.

%%time
ds2.attrs = {}
ds2.to_netcdf("output.nc")
CPU times: user 389 ms, sys: 178 ms, total: 567 ms
Wall time: 4.17 s
ds4_clean = ds4.rename({"/TIME": "TIME", "/YDim": "YDim", "/XDim": "XDim"})
%%time
ds4_clean.attrs = {}
ds4_clean.to_netcdf("output.nc")
CPU times: user 554 ms, sys: 190 ms, total: 744 ms
Wall time: 3.66 s

Conclusion

The main difference I was able to see with these tests is that lazy loading was faster with DAP4, but this showed up mostly when using open_dataset. Once the xarray Dataset was lazily loaded, loading the data or doing computations (mean) was similar, or sometimes faster, with DAP2.

ds4.nbytes / 1e6
299.44264