Adding new data

Written by Minh Phan

We can also add new data to our Zarr store, as long as the shape of the new dataset matches the shape of the original in every dimension except one. In other words, we can append along one dimension at a time, while the other dimensions and all the variables (including metadata) must be identical. For example, if our dataset has size 100 lat x 100 lon x 200 time with five variables, the dataset we append must have exactly the same five variables, and two of its dimensions must be the same size (in the most logical case, we append along the time dimension, so our new data must be 100 lat x 100 lon).
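The shape rule above can be sketched as a small pre-flight check. This is only an illustration: the helper name `can_append` and the plain dictionaries of dimension sizes are assumptions made for the example, not part of xarray or Zarr.

```python
def can_append(original_sizes, new_sizes, append_dim):
    """Return (ok, problems) for appending new data along append_dim.

    Both size arguments are plain mappings of dimension name -> length.
    """
    problems = []
    if append_dim not in original_sizes:
        problems.append(f"append dimension {append_dim!r} not in original dataset")
    if set(original_sizes) != set(new_sizes):
        problems.append(
            f"dimension names differ: {sorted(original_sizes)} vs {sorted(new_sizes)}"
        )
    for dim in sorted(set(original_sizes) & set(new_sizes)):
        if dim == append_dim:
            continue  # only the append dimension may differ in length
        if original_sizes[dim] != new_sizes[dim]:
            problems.append(f"{dim}: {original_sizes[dim]} != {new_sizes[dim]}")
    return (not problems, problems)


# Sizes from the example in the text, plus one deliberately mismatched grid.
original = {"lat": 100, "lon": 100, "time": 200}
same_grid = {"lat": 100, "lon": 100, "time": 50}
other_grid = {"lat": 100, "lon": 90, "time": 50}

ok_same, problems_same = can_append(original, same_grid, "time")
ok_other, problems_other = can_append(original, other_grid, "time")
print(ok_same, problems_same)
print(ok_other, problems_other)
```

With xarray datasets, the two mappings could be obtained with `dict(og_ds.sizes)` and `dict(new_ds.sizes)`, and the variable names could be compared in the same spirit with `set(og_ds.data_vars) == set(new_ds.data_vars)`.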

It is sometimes recommended to rechunk the data after appending, because unequal chunk sizes can slow down later computations.
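One way to see where uneven chunks can come from is a bit of arithmetic, assuming the store keeps a regular grid of 100 time steps per chunk. The time lengths used here (731 existing steps, i.e. 3287 minus 2556, and 2556 new steps) and the chunk size of 100 are read from the outputs in this notebook; the helper names are hypothetical, and this is one plausible reading of the cost, not a measurement.

```python
def chunk_lengths(n, chunk):
    """Lengths of the chunks that cover n elements on a regular grid."""
    full, rest = divmod(n, chunk)
    return [chunk] * full + ([rest] if rest else [])


def straddling_blocks(offset, n_new, block, chunk):
    """Count blocks of new data, written starting at `offset`,
    that cross a chunk boundary of the existing grid."""
    count = 0
    for start in range(0, n_new, block):
        stop = min(start + block, n_new)
        first_chunk = (offset + start) // chunk
        last_chunk = (offset + stop - 1) // chunk
        if first_chunk != last_chunk:
            count += 1
    return count


existing, new, chunk = 731, 2556, 100

before = chunk_lengths(existing, chunk)       # trailing chunk is shorter
after = chunk_lengths(existing + new, chunk)  # layout of the combined axis
crossing = straddling_blocks(existing, new, block=chunk, chunk=chunk)

print(before[-1], len(after), after[-1], crossing)
```

Because the existing time axis does not end on a multiple of the chunk size, most blocks of the new data fall across two chunks of the grid. In xarray, `Dataset.chunk` is the method that changes the dask chunking of a dataset.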

import xarray as xr
import pandas as pd
import numpy as np

For demonstration purposes, I will not go through the process of creating another dataset again; instead, I provide an already cleaned dataset for us to practice on. Start by loading this cleaned dataset, as well as the original dataset that we already exported (so that we can compare and double-check the metadata before we export).

To keep our original dataset intact, I made a copy of our original Zarr file. Please load the copy instead.

Load in data

og_ds = xr.open_zarr('demonstrated data/final-sample-appending.zarr/')
new_ds = xr.open_zarr('demonstrated data/new-data-sample.zarr/')

Note that our new dataset does not have any metadata. As shown in the previous notebooks, metadata is added in the last step, so we will now copy all metadata from the original dataset to the new one.

Add metadata

# copy dataset-level metadata
new_ds.attrs = og_ds.attrs

# copy variable and coordinate metadata
# every variable in new_ds must also exist in og_ds,
# otherwise og_ds[var] raises a KeyError
for var in new_ds.variables:
    new_ds[var].attrs = og_ds[var].attrs
# double-check
new_ds
<xarray.Dataset>
Dimensions:          (time: 2556, lat: 81, lon: 81)
Coordinates:
  * lat              (lat) float32 25.0 24.75 24.5 24.25 ... 5.75 5.5 5.25 5.0
  * lon              (lon) float32 60.0 60.25 60.5 60.75 ... 79.5 79.75 80.0
  * time             (time) datetime64[ns] 1993-01-01 1993-01-02 ... 1999-12-31
Data variables: (12/14)
    CHL              (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    CHL_uncertainty  (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    adt              (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    air_temp         (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    direction        (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    sla              (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    ...               ...
    u_curr           (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    u_wind           (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    ug_curr          (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    v_curr           (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    v_wind           (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    vg_curr          (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
Attributes: (12/17)
    creator_email:              minhphan@uw.edu
    creator_name:               Minh Phan
    creator_type:               person
    date_created:               2023-11-11
    geospatial_lat_max:         25.0
    geospatial_lat_min:         5.0
    ...                         ...
    geospatial_lon_units:       degrees_east
    source:                     OSCAR, ERA5 Reanalysis, Copernicus Climate Ch...
    summary:                    Daily mean of 0.25 x 0.25 degrees gridded dat...
    time_coverage_end:          2002-12-31T23:59:59
    time_coverage_start:        2000-01-01T00:00:00
    title:                      Sample of Climate Data for Coastal Upwelling ...
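Besides inspecting the printed dataset by eye, the metadata can be compared programmatically. The following is a minimal sketch on tiny stand-in datasets; the helper `attrs_mismatches`, the variable `sst`, and all values are hypothetical and only serve to illustrate the copying pattern used above.

```python
import numpy as np
import xarray as xr


def attrs_mismatches(reference, candidate):
    """Names of variables in `candidate` that are missing from `reference`
    or whose attrs differ from it.

    Note: plain dict comparison is used, so attrs that hold arrays
    would need a more careful comparison.
    """
    mismatched = []
    for name in candidate.variables:
        if name not in reference.variables:
            mismatched.append(name)
        elif dict(candidate[name].attrs) != dict(reference[name].attrs):
            mismatched.append(name)
    return mismatched


# Stand-in for the original dataset, with metadata.
reference = xr.Dataset(
    {"sst": (("time",), np.zeros(3, dtype="float32"), {"units": "K"})},
    coords={"time": ("time", np.arange(3), {"long_name": "time index"})},
    attrs={"title": "demo"},
)
# Stand-in for the new dataset, without metadata.
candidate = xr.Dataset(
    {"sst": (("time",), np.ones(2, dtype="float32"))},
    coords={"time": ("time", np.arange(3, 5))},
)

before = attrs_mismatches(reference, candidate)

# Same copying pattern as in the notebook.
candidate.attrs = reference.attrs
for var in candidate.variables:
    candidate[var].attrs = reference[var].attrs

after = attrs_mismatches(reference, candidate)
same_global = dict(candidate.attrs) == dict(reference.attrs)
print(sorted(before), after, same_global)
```

On the real data, a call such as `attrs_mismatches(og_ds, new_ds)` returning an empty list would suggest that every variable carries the same metadata in both datasets.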

Appending data

new_ds.to_zarr('demonstrated data/final-sample-appending.zarr/', append_dim='time', mode='a')
<xarray.backends.zarr.ZarrStore at 0x7f1455c51dd0>

Final result

xr.open_zarr('demonstrated data/final-sample-appending.zarr/')
<xarray.Dataset>
Dimensions:          (time: 3287, lat: 81, lon: 81)
Coordinates:
  * lat              (lat) float32 25.0 24.75 24.5 24.25 ... 5.75 5.5 5.25 5.0
  * lon              (lon) float32 60.0 60.25 60.5 60.75 ... 79.5 79.75 80.0
  * time             (time) datetime64[ns] 2000-01-01 2000-01-02 ... 1999-12-31
Data variables: (12/14)
    CHL              (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    CHL_uncertainty  (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    adt              (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    air_temp         (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    direction        (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    sla              (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    ...               ...
    u_curr           (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    u_wind           (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    ug_curr          (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    v_curr           (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    v_wind           (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
    vg_curr          (time, lat, lon) float32 dask.array<chunksize=(100, 81, 81), meta=np.ndarray>
Attributes: (12/17)
    creator_email:              minhphan@uw.edu
    creator_name:               Minh Phan
    creator_type:               person
    date_created:               2023-11-11
    geospatial_lat_max:         25.0
    geospatial_lat_min:         5.0
    ...                         ...
    geospatial_lon_units:       degrees_east
    source:                     OSCAR, ERA5 Reanalysis, Copernicus Climate Ch...
    summary:                    Daily mean of 0.25 x 0.25 degrees gridded dat...
    time_coverage_end:          2002-12-31T23:59:59
    time_coverage_start:        2000-01-01T00:00:00
    title:                      Sample of Climate Data for Coastal Upwelling ...
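In the final result above, the time coordinate starts at 2000-01-01 and ends at 1999-12-31: the appended block is placed after the existing data without being reordered, and the copied time_coverage_start and time_coverage_end attributes still describe only the original range. A minimal sketch of one possible follow-up is shown here on tiny stand-in data. It uses `xr.concat` to mimic the append in memory, which is an assumption made for illustration and not what `to_zarr` does on disk; the helper `make_ds` and the variable `sst` are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_ds(start, periods):
    """Small stand-in dataset with a daily time axis (made-up values)."""
    time = pd.date_range(start, periods=periods, freq="D")
    return xr.Dataset(
        {"sst": (("time",), np.arange(periods, dtype="float32"))},
        coords={"time": time},
    )


later = make_ds("2000-01-01", 3)    # plays the role of the original store
earlier = make_ds("1999-12-29", 3)  # plays the role of the appended data

# Mimic the append: the earlier dates end up after the later ones.
appended = xr.concat([later, earlier], dim="time")

# Restore chronological order.
ordered = appended.sortby("time")

# Refresh the coverage attributes from the actual time axis.
first = pd.Timestamp(ordered["time"].values[0])
last = pd.Timestamp(ordered["time"].values[-1])
ordered.attrs["time_coverage_start"] = first.strftime("%Y-%m-%dT%H:%M:%S")
ordered.attrs["time_coverage_end"] = last.strftime("%Y-%m-%dT23:59:59")

print(appended.indexes["time"].is_monotonic_increasing)
print(ordered.indexes["time"].is_monotonic_increasing)
print(ordered.attrs)
```

For the real store, the sorted dataset would still have to be written out again (for example to a new Zarr path) for the ordering to persist; whether that is worth doing depends on how the data will be queried.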