Comparing DAP2 and DAP4
Key Differences
DAP2 is the older protocol and DAP4 is the newer one. In this notebook, I will run some comparison code. Per ChatGPT, here are some of the differences.
| Feature | DAP2 | DAP4 |
|---|---|---|
| Data Model | Supports simple types (e.g., integers, floats, strings, arrays) and some complex structures. | Supports a richer set of data types, including better handling of nested structures and new metadata constructs. |
| Data Encoding | Uses ASCII and binary (older, less efficient encoding). | Uses more modern binary encoding, including NetCDF-4/HDF5-like structures. |
| Metadata Handling | Limited support for additional metadata. | Supports richer metadata, allowing self-describing datasets. |
| Chunked Data Access | Limited ability to access specific parts of large datasets. | Improved ability to request and return chunks of data efficiently. |
| Support for Modern Formats | Less support for modern formats like HDF5 and NetCDF-4. | Native support for HDF5 and NetCDF-4, allowing better integration with existing scientific workflows. |
| Efficient Transfers | Less efficient for large datasets. | More efficient for large datasets due to better compression and structured data requests. |
| Constraint Expressions | Limited filtering and subsetting capabilities. | More expressive constraints, enabling more sophisticated data selection. |
Setup
import earthaccess
import pydap
import xarray as xr
earthaccess.login()
session = earthaccess.get_requests_https_session()
Comparison
This dataset has both DAP4 and DAP2 access: https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/hyrax/MERRA/MAI1NXINT.5.2.0/2016/01/contents.html
If we use https:// in the URL, pydap will automatically use DAP2. To use DAP4, we use dap4:// instead. pydap issues a warning if we are accessing with DAP2, so let’s turn that off.
import warnings
# Suppress only the specific warning from PyDAP
warnings.filterwarnings("ignore", message="PyDAP was unable to determine the DAP protocol*", category=UserWarning)
DAP4 is faster for data lazy loading
%%time
url = "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf"
ds2 = xr.open_dataset(url, engine="pydap", session=session)
CPU times: user 111 ms, sys: 8.42 ms, total: 119 ms
Wall time: 6.84 s
%%time
url = "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf"
ds4 = xr.open_dataset(url, engine="pydap", session=session)
CPU times: user 56.9 ms, sys: 9.32 ms, total: 66.2 ms
Wall time: 3.32 s
DAP4 adds slashes to the dims
This is not a pydap thing. You can see it on the OPeNDAP access page too.
ds4.dims
FrozenMappingWarningOnValuesAccess({'/TIME': 24, '/YDim': 361, '/XDim': 540})
ds2.dims
FrozenMappingWarningOnValuesAccess({'TIME': 24, 'YDim': 361, 'XDim': 540})
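If the slashes get in the way, they can be stripped with a rename. A minimal sketch (strip_slash_dims is my own helper, not something from pydap or xarray); the notebook does the same thing by hand with rename before saving to netCDF below.
def strip_slash_dims(ds):
    # Rename DAP4-style names such as "/TIME" to "TIME" (hypothetical helper)
    return ds.rename({d: d.lstrip("/") for d in ds.dims if d.startswith("/")})
strip_slash_dims(ds4).dims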
Data loading time is faster with DAP2
At least in this example.
%%time
ds4["TQL"].isel({"/TIME": 2, "/YDim": 10}).load();CPU times: user 56.2 ms, sys: 3.21 ms, total: 59.4 ms
Wall time: 3.55 s
%%time
ds2["TQL"].isel({"TIME": 2, "YDim": 10}).load();CPU times: user 16.9 ms, sys: 0 ns, total: 16.9 ms
Wall time: 977 ms
Mean is not too different
I need to re-open the datasets since I loaded the data in the previous example.
%%time
url = "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf"
ds2 = xr.open_dataset(url, engine="pydap", session=session)
CPU times: user 674 ms, sys: 57.7 ms, total: 732 ms
Wall time: 7.5 s
%%time
ds2["TQL"].mean().compute();CPU times: user 132 ms, sys: 76.9 ms, total: 208 ms
Wall time: 2.61 s
%%time
url = "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf"
ds4 = xr.open_dataset(url, engine="pydap", session=session)
CPU times: user 136 ms, sys: 8.96 ms, total: 145 ms
Wall time: 3.59 s
%%time
ds4["TQL"].mean().compute();CPU times: user 144 ms, sys: 66.7 ms, total: 210 ms
Wall time: 2.44 s
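These are single-run wall times over a network, so they are noisy. To get steadier numbers you could average a few repeats; a minimal sketch (avg_wall_time is a hypothetical helper, and server-side caching can still skew repeated requests):
import time
def avg_wall_time(fn, n=3):
    # Average wall-clock seconds over n calls of fn (hypothetical helper)
    runs = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        runs.append(time.perf_counter() - t0)
    return sum(runs) / n
# e.g. avg_wall_time(lambda: ds2["TQL"].mean().compute())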
Adding constraints decreases lazy loading time, but DAP4 is still faster
For DAP4, add ?dap4.ce=/TQL to select just TQL. For DAP2, add ?TQL.
%%time
url = "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf?TQL"
#pydap_ds2 = pydap.client.open_url(url, protocol="dap2", session=session)
ds2 = xr.open_dataset(url, engine="pydap", session=session)
CPU times: user 36 ms, sys: 0 ns, total: 36 ms
Wall time: 1.74 s
%%time
url = "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf?dap4.ce=/TQL"
ds4 = xr.open_dataset(url, engine="pydap", session=session)
CPU times: user 15.1 ms, sys: 7.53 ms, total: 22.6 ms
Wall time: 947 ms
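A small convenience for building these constrained URLs could look like the sketch below (constrained_url is my own helper and only covers the single-variable constraints used here):
def constrained_url(host_path, var, protocol="dap4"):
    # Return an OPeNDAP URL that requests just one variable (hypothetical helper)
    if protocol == "dap4":
        return f"dap4://{host_path}?dap4.ce=/{var}"
    return f"https://{host_path}?{var}"
# e.g. constrained_url("goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf", "TQL")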
xarray.open_mfdataset is faster with DAP4
Though the speed difference is smaller than it was with xarray.open_dataset.
%%time
urls = [
"dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf?dap4.ce=/TQL",
"dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160102.hdf?dap4.ce=/TQL",
"dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160103.hdf?dap4.ce=/TQL"
]
ds4 = xr.open_mfdataset(urls, engine="pydap",
parallel=True,
combine='nested',
concat_dim='/TIME',
                        session=session)
CPU times: user 68.8 ms, sys: 5 ms, total: 73.8 ms
Wall time: 968 ms
%%time
urls = [
"https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf?TQL",
"https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160102.hdf?TQL",
"https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160103.hdf?TQL"
]
ds2 = xr.open_mfdataset(urls, engine="pydap",
parallel=True,
combine='nested',
concat_dim='TIME',
                        session=session)
CPU times: user 114 ms, sys: 11.1 ms, total: 125 ms
Wall time: 1.59 s
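Hard-coding three URLs is fine for a demo, but for longer runs the list can be generated from a date range. A sketch, assuming the MERRA300 file prefix and the year/month directory layout hold for the requested dates:
import pandas as pd
base = ("dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/"
        "{d:%Y/%m}/MERRA300.prod.assim.inst1_2d_int_Nx.{d:%Y%m%d}.hdf?dap4.ce=/TQL")
urls = [base.format(d=d) for d in pd.date_range("2016-01-01", "2016-01-03")]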
Mean is now faster with DAP4
For this example.
%%time
ds4.mean().compute();
CPU times: user 475 ms, sys: 143 ms, total: 619 ms
Wall time: 2.95 s
%%time
ds2.mean().compute();
CPU times: user 393 ms, sys: 108 ms, total: 501 ms
Wall time: 3.12 s
Effect of rechunking doesn’t seem to be different
All of the datasets start with the chunking shown below. Rechunking doesn’t seem to slow one protocol down more than the other.
print(ds4["TQL"].chunks)
((24, 24, 24), (361,), (540,))
ds2_rechunked = ds2.chunk({"TIME": 2, "XDim": 25, "YDim": 25})
%%time
ds2_rechunked.mean().compute();
CPU times: user 9.39 s, sys: 1.28 s, total: 10.7 s
Wall time: 11.9 s
ds4_rechunked = ds4.chunk({"/TIME": 2, "/XDim": 25, "/YDim": 25})
%%time
ds4_rechunked.mean().compute();
CPU times: user 9.06 s, sys: 1.29 s, total: 10.4 s
Wall time: 11.4 s
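Most of that slowdown is overhead from the sheer number of chunks (and therefore tasks and requests) that the tiny chunking creates: thousands of small chunks instead of a handful. A quick way to count them (n_chunks is my own helper):
import math
def n_chunks(da):
    # Total number of dask chunks backing a DataArray (hypothetical helper)
    return math.prod(len(c) for c in da.chunks)
print(n_chunks(ds2_rechunked["TQL"]), "vs", n_chunks(ds2["TQL"]))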
ds2_rechunked = ds2.chunk({"TIME": 50, "XDim": -1, "YDim": -1})
%%time
ds2_rechunked.mean().compute();
CPU times: user 411 ms, sys: 98 ms, total: 509 ms
Wall time: 2.96 s
ds4_rechunked = ds4.chunk({"/TIME": 50, "/XDim": -1, "/YDim": -1})
%%time
ds4_rechunked.mean().compute();
CPU times: user 465 ms, sys: 182 ms, total: 647 ms
Wall time: 2.85 s
Saving to netcdf
Not too different. I have to do some cleanup on the DAP4 dim names (removing the leading slashes) before saving.
%%time
ds2.attrs = {}
ds2.to_netcdf("output.nc")
CPU times: user 389 ms, sys: 178 ms, total: 567 ms
Wall time: 4.17 s
ds4_clean = ds4.rename({"/TIME": "TIME", "/YDim": "YDim", "/XDim": "XDim"})
%%time
ds4_clean.attrs = {}
ds4_clean.to_netcdf("output.nc")
CPU times: user 554 ms, sys: 190 ms, total: 744 ms
Wall time: 3.66 s
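As a quick sanity check (not part of the original timing comparison), the saved file can be reopened locally:
check = xr.open_dataset("output.nc")
print(check.dims)
check.close()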
Conclusion
The main difference I was able to see in these tests is that lazy loading was faster with DAP4, and this showed up mostly when using open_dataset. Once the xarray Dataset was lazily loaded, loading the data or doing computations (the mean) took a similar amount of time with either protocol, and was sometimes faster with DAP2.
For reference, the size of ds4 in MB:
ds4.nbytes / 1e6
299.44264