import earthaccess
import pydap
import xarray as xr
earthaccess.login()
session = earthaccess.get_requests_https_session()
Comparing DAP2 and DAP4
Key Differences
DAP2 is the older protocol and DAP4 is the newer one. In this notebook, I run some comparison code. Per ChatGPT, here are some of the differences.
| Feature | DAP2 | DAP4 |
|---|---|---|
| Data Model | Supports simple types (e.g., integers, floats, strings, arrays) and some complex structures. | Supports a richer set of data types, including better handling of nested structures and new metadata constructs. |
| Data Encoding | Uses ASCII and binary (older, less efficient encoding). | Uses more modern binary encoding, including NetCDF-4/HDF5-like structures. |
| Metadata Handling | Limited support for additional metadata. | Supports richer metadata, allowing self-describing datasets. |
| Chunked Data Access | Limited ability to access specific parts of large datasets. | Improved ability to request and return chunks of data efficiently. |
| Support for Modern Formats | Less support for modern formats like HDF5 and NetCDF-4. | Native support for HDF5 and NetCDF-4, allowing better integration with existing scientific workflows. |
| Efficient Transfers | Less efficient for large datasets. | More efficient for large datasets due to better compression and structured data requests. |
| Constraint Expressions | Limited filtering and subsetting capabilities. | More expressive constraints, enabling more sophisticated data selection. |
Setup
Comparison
This dataset has both DAP4 and DAP2 access: https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/hyrax/MERRA/MAI1NXINT.5.2.0/2016/01/contents.html

If we use `https://` in the url, pydap will automatically use DAP2. To use DAP4, we use `dap4://` instead. pydap issues a warning if we are accessing with DAP2, so let's turn that off.
import warnings

# Suppress only the specific warning from PyDAP
warnings.filterwarnings("ignore", message="PyDAP was unable to determine the DAP protocol*", category=UserWarning)
DAP4 is faster for lazy loading data
%%time
= "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf"
url = xr.open_dataset(url, engine="pydap", session=session) ds2
CPU times: user 111 ms, sys: 8.42 ms, total: 119 ms
Wall time: 6.84 s
%%time
= "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf"
url = xr.open_dataset(url, engine="pydap", session=session) ds4
CPU times: user 56.9 ms, sys: 9.32 ms, total: 66.2 ms
Wall time: 3.32 s
DAP4 adds slashes to the dims
This is not a pydap thing. You can see it on the OPeNDAP access page too.
ds4.dims
FrozenMappingWarningOnValuesAccess({'/TIME': 24, '/YDim': 361, '/XDim': 540})
ds2.dims
FrozenMappingWarningOnValuesAccess({'TIME': 24, 'YDim': 361, 'XDim': 540})
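If the leading slashes get in the way, they can be stripped generically. A small sketch (the manual rename near the end of this notebook does the same thing explicitly; `ds4_flat` is just an illustrative name):

```python
# Strip the leading slash from every DAP4 dimension name so ds4 uses the same
# keys as ds2. Plain dict comprehension over the dimension names.
ds4_flat = ds4.rename({dim: dim.lstrip("/") for dim in ds4.dims})
print(ds4_flat.dims)
```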
Data loading time is faster with DAP2
At least in this example.
%%time
"TQL"].isel({"/TIME": 2, "/YDim": 10}).load(); ds4[
CPU times: user 56.2 ms, sys: 3.21 ms, total: 59.4 ms
Wall time: 3.55 s
%%time
"TQL"].isel({"TIME": 2, "YDim": 10}).load(); ds2[
CPU times: user 16.9 ms, sys: 0 ns, total: 16.9 ms
Wall time: 977 ms
Mean is not too different
I need to set the data up again since I loaded it in the previous example.
%%time
= "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf"
url = xr.open_dataset(url, engine="pydap", session=session) ds2
CPU times: user 674 ms, sys: 57.7 ms, total: 732 ms
Wall time: 7.5 s
%%time
"TQL"].mean().compute(); ds2[
CPU times: user 132 ms, sys: 76.9 ms, total: 208 ms
Wall time: 2.61 s
%%time
= "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf"
url = xr.open_dataset(url, engine="pydap", session=session) ds4
CPU times: user 136 ms, sys: 8.96 ms, total: 145 ms
Wall time: 3.59 s
%%time
"TQL"].mean().compute(); ds4[
CPU times: user 144 ms, sys: 66.7 ms, total: 210 ms
Wall time: 2.44 s
Adding constraints decreases lazy loading time, but DAP4 is still faster
For DAP4, add `?dap4.ce=/TQL` to the url to select just TQL. For DAP2, add `?TQL`.
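Both forms can also list more than one variable; as far as I can tell the separator is a comma for DAP2 and a semicolon for DAP4, but treat this sketch as an assumption (and TQV as a hypothetical second variable in the same granule), not something timed below.

```python
# Assumed constraint syntax, not tested in this notebook: request two variables.
# DAP2 uses a comma-separated variable list; DAP4 separates paths with ';' (assumption).
url2 = "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf?TQL,TQV"
url4 = "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf?dap4.ce=/TQL;/TQV"
```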
%%time
= "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf?TQL"
url #pydap_ds2 = pydap.client.open_url(url, protocol="dap2", session=session)
= xr.open_dataset(url, engine="pydap", session=session) ds2
CPU times: user 36 ms, sys: 0 ns, total: 36 ms
Wall time: 1.74 s
%%time
= "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf?dap4.ce=/TQL"
url = xr.open_dataset(url, engine="pydap", session=session) ds4
CPU times: user 15.1 ms, sys: 7.53 ms, total: 22.6 ms
Wall time: 947 ms
xarray.open_mfdataset is faster with DAP4
Though the speed difference has decreased compared to `xarray.open_dataset`.
%%time
urls = [
    "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf?dap4.ce=/TQL",
    "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160102.hdf?dap4.ce=/TQL",
    "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160103.hdf?dap4.ce=/TQL",
]
ds4 = xr.open_mfdataset(urls, engine="pydap",
                        parallel=True,
                        combine='nested',
                        concat_dim='/TIME',
                        session=session)
CPU times: user 68.8 ms, sys: 5 ms, total: 73.8 ms
Wall time: 968 ms
%%time
urls = [
    "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160101.hdf?TQL",
    "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160102.hdf?TQL",
    "https://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0/2016/01/MERRA300.prod.assim.inst1_2d_int_Nx.20160103.hdf?TQL",
]
ds2 = xr.open_mfdataset(urls, engine="pydap",
                        parallel=True,
                        combine='nested',
                        concat_dim='TIME',
                        session=session)
CPU times: user 114 ms, sys: 11.1 ms, total: 125 ms
Wall time: 1.59 s
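The URL lists above are short enough to write out by hand; for a longer time range they could be generated. A minimal sketch, assuming the same MERRA file-naming pattern (including the MERRA300 stream prefix) holds for every day in the range:

```python
import pandas as pd

base = "dap4://goldsmr1.gesdisc.eosdis.nasa.gov/opendap/MERRA/MAI1NXINT.5.2.0"
urls = [
    f"{base}/{day:%Y/%m}/MERRA300.prod.assim.inst1_2d_int_Nx.{day:%Y%m%d}.hdf?dap4.ce=/TQL"
    for day in pd.date_range("2016-01-01", "2016-01-03")
]
```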
Mean is now faster with DAP4
For this example.
%%time
ds4.mean().compute();
CPU times: user 475 ms, sys: 143 ms, total: 619 ms
Wall time: 2.95 s
%%time
ds2.mean().compute();
CPU times: user 393 ms, sys: 108 ms, total: 501 ms
Wall time: 3.12 s
The effect of rechunking doesn't seem to be different
All the datasets start with chunks like this. Rechunking doesn't seem to slow one down more than the other.
print(ds4["TQL"].chunks)
((24, 24, 24), (361,), (540,))
= ds2.chunk({"TIME": 2, "XDim": 25, "YDim": 25}) ds2_rechunked
%%time
ds2_rechunked.mean().compute();
CPU times: user 9.39 s, sys: 1.28 s, total: 10.7 s
Wall time: 11.9 s
= ds4.chunk({"/TIME": 2, "/XDim": 25, "/YDim": 25}) ds4_rechunked
%%time
ds4_rechunked.mean().compute();
CPU times: user 9.06 s, sys: 1.29 s, total: 10.4 s
Wall time: 11.4 s
= ds2.chunk({"TIME": 50, "XDim": -1, "YDim": -1}) ds2_rechunked
%%time
ds2_rechunked.mean().compute();
CPU times: user 411 ms, sys: 98 ms, total: 509 ms
Wall time: 2.96 s
= ds4.chunk({"/TIME": 50, "/XDim": -1, "/YDim": -1}) ds4_rechunked
%%time
ds4_rechunked.mean().compute();
CPU times: user 465 ms, sys: 182 ms, total: 647 ms
Wall time: 2.85 s
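To see where the slowdown with tiny chunks comes from, a quick sketch that counts the dask chunks each rechunking produces; roughly one constrained request goes to the OPeNDAP server per chunk, so more chunks means more round trips.

```python
from math import prod

# numblocks is the per-axis chunk count of the underlying dask array.
for label, ds in [("2 x 25 x 25", ds2.chunk({"TIME": 2, "XDim": 25, "YDim": 25})),
                  ("50 x full x full", ds2.chunk({"TIME": 50, "XDim": -1, "YDim": -1}))]:
    nblocks = ds["TQL"].data.numblocks
    print(label, nblocks, "->", prod(nblocks), "chunks")
```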
Saving to netcdf
Not too different. I have to clean up the DAP4 dim names with slashes before saving.
%%time
ds2.attrs = {}
ds2.to_netcdf("output.nc")
CPU times: user 389 ms, sys: 178 ms, total: 567 ms
Wall time: 4.17 s
ds4_clean = ds4.rename({"/TIME": "TIME", "/YDim": "YDim", "/XDim": "XDim"})
%%time
ds4_clean.attrs = {}
ds4_clean.to_netcdf("output.nc")
CPU times: user 554 ms, sys: 190 ms, total: 744 ms
Wall time: 3.66 s
Conclusion
The main difference I was able to see with these tests is that lazy loading was faster with DAP4, and this showed up mostly when using `open_dataset`. Once the xarray Dataset was lazily loaded, loading the data or doing computations (mean) was similar, and sometimes faster, with DAP2.
ds4.nbytes / 1e6
299.44264