VirtualiZarr → Concat files → Icechunk

Author: Rich Signell

Colab Badge Download Badge JupyterHub

  1. Open a single virtual dataset from S3
  2. Combine 5 daily files with open_virtual_mfdataset
  3. Write virtual references to an Icechunk store
  4. Read back and plot

Applies the VirtualiZarr workflow to the NOAA Climate Data Record (CDR) of AVHRR NDVI — a 0.05° global daily vegetation index dataset publicly available on AWS S3. Data: NOAA CDR NDVI v5 (AVHRR)
Bucket: s3://noaa-cdr-ndvi-pds (anonymous access, us-east-1)

# Colab users, uncomment and run this
#!pip install -q icechunk  virtualizarr xarray obspec_utils obstore hvplot
import warnings
import shutil
from pathlib import Path

import xarray as xr
import icechunk
from obstore.store import from_url
from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser
from obspec_utils.registry import ObjectStoreRegistry

warnings.filterwarnings(
    "ignore",
    message="Numcodecs codecs are not in the Zarr version 3 specification*",
    category=UserWarning,
)

Setup: S3 store and registry

Point obstore at the public NDVI bucket and register it so VirtualiZarr can resolve chunk references.

bucket = "s3://noaa-cdr-ndvi-pds"
base = "data/2000"

# 5 consecutive daily files — January 2000
filenames = [
    "AVHRR-Land_v005_AVH13C1_NOAA-14_20000101_c20170623095628.nc",
    "AVHRR-Land_v005_AVH13C1_NOAA-14_20000102_c20170623101557.nc",
    "AVHRR-Land_v005_AVH13C1_NOAA-14_20000103_c20170623103338.nc",
    "AVHRR-Land_v005_AVH13C1_NOAA-14_20000104_c20170623105028.nc",
    "AVHRR-Land_v005_AVH13C1_NOAA-14_20000105_c20170623110559.nc",
]

urls = [f"{bucket}/{base}/{f}" for f in filenames]

store = from_url(bucket, region="us-east-1", skip_signature=True)
registry = ObjectStoreRegistry({bucket: store})
parser = HDFParser()

urls[0]
's3://noaa-cdr-ndvi-pds/data/2000/AVHRR-Land_v005_AVH13C1_NOAA-14_20000101_c20170623095628.nc'

1. Open a single virtual dataset

open_virtual_dataset reads only the file metadata — no chunk data is downloaded. The result looks like an xarray dataset but chunk references point back to the original S3 file.

We load time, lat, and lon as concrete arrays (they are small and needed for coordinates).

vds = open_virtual_dataset(
    url=urls[0],
    parser=parser,
    registry=registry,
    loadable_variables=["time", "latitude", "longitude"],
    decode_times=True,
)
vds
<xarray.Dataset> Size: 156MB
Dimensions:    (latitude: 3600, longitude: 7200, time: 1, ncrs: 1, nv: 2)
Coordinates:
  * latitude   (latitude) float32 14kB 89.97 89.93 89.88 ... -89.93 -89.97
  * longitude  (longitude) float32 29kB -180.0 -179.9 -179.9 ... 179.9 180.0
  * time       (time) datetime64[ns] 8B 2000-01-01
    ncrs       (ncrs) float32 4B ManifestArray<shape=(1,), dtype=float32, chu...
    nv         (nv) float32 8B ManifestArray<shape=(2,), dtype=float32, chunk...
Data variables:
    crs        (ncrs) int16 2B ManifestArray<shape=(1,), dtype=int16, chunks=...
    lat_bnds   (latitude, nv) float32 29kB ManifestArray<shape=(3600, 2), dty...
    lon_bnds   (longitude, nv) float32 58kB ManifestArray<shape=(7200, 2), dt...
    NDVI       (time, latitude, longitude) int16 52MB ManifestArray<shape=(1,...
    TIMEOFDAY  (time, latitude, longitude) int16 52MB ManifestArray<shape=(1,...
    QA         (time, latitude, longitude) int16 52MB ManifestArray<shape=(1,...
Attributes: (12/48)
    title:                                  Normalized Difference Vegetation ...
    institution:                            NASA/GSFC/SED/ESD/HBSL/TIS/MODIS-...
    Conventions:                            CF-1.6, ACDD-1.3
    standard_name_vocabulary:               CF Standard Name Table (v25, 05 J...
    naming_authority:                       gov.noaa.ncei
    license:                                See the Use Agreement for this CD...
    ...                                     ...
    PercentValidDaytimeData:                32.03
    PercentValidDaytimeLand:                32.03
    PercentValidClearDaytimeLand:           3.94
    PercentValidDaytimeLandInCloudShadow:   1.19
    PercentValidClearDaytimeWater:          0.00
    PercentValidDaytimeWaterInCloudShadow:  0.00
# ManifestArray shows the chunk layout — no data downloaded yet
vds["NDVI"].data
ManifestArray<shape=(1, 3600, 7200), dtype=int16, chunks=(1, 1200, 2400)>

2. Combine 5 daily files

These files have a dimensionless ncrs coordinate used for the CRS definition. Because ncrs has no index, coordinate-based combining can fail. Instead of using combine="by_coords", we open each file individually and concatenate explicitly along time with xr.concat(...). We then drop nv, ncrs, and crs because they are file-level metadata variables that interfere with concatenation and are not needed for this virtualized NDVI data product.

vds_list = [
    open_virtual_dataset(
        url=url,
        parser=parser,
        registry=registry,
        loadable_variables=["time", "latitude", "longitude"],
        decode_times=True,
    )
    for url in urls
]

combined_vds = xr.concat(
    vds_list,
    dim="time",
    data_vars="minimal",
    coords="minimal",
    combine_attrs="drop_conflicts",
    compat="override"
).drop_vars(["nv", "ncrs", "crs"], errors="ignore")
combined_vds
<xarray.Dataset> Size: 778MB
Dimensions:    (latitude: 3600, nv: 2, longitude: 7200, time: 5)
Coordinates:
  * latitude   (latitude) float32 14kB 89.97 89.93 89.88 ... -89.93 -89.97
  * longitude  (longitude) float32 29kB -180.0 -179.9 -179.9 ... 179.9 180.0
  * time       (time) datetime64[ns] 40B 2000-01-01 2000-01-02 ... 2000-01-05
Dimensions without coordinates: nv
Data variables:
    lat_bnds   (latitude, nv) float32 29kB ManifestArray<shape=(3600, 2), dty...
    lon_bnds   (longitude, nv) float32 58kB ManifestArray<shape=(7200, 2), dt...
    NDVI       (time, latitude, longitude) int16 259MB ManifestArray<shape=(5...
    TIMEOFDAY  (time, latitude, longitude) int16 259MB ManifestArray<shape=(5...
    QA         (time, latitude, longitude) int16 259MB ManifestArray<shape=(5...
Attributes: (12/32)
    institution:                            NASA/GSFC/SED/ESD/HBSL/TIS/MODIS-...
    Conventions:                            CF-1.6, ACDD-1.3
    standard_name_vocabulary:               CF Standard Name Table (v25, 05 J...
    naming_authority:                       gov.noaa.ncei
    license:                                See the Use Agreement for this CD...
    cdm_data_type:                          Grid
    ...                                     ...
    InputDataType:                          GAC
    ESDT:                                   AVH13C1
    RangeBeginningTime:                     00:00:00.0000
    RangeEndingTime:                        23:59:59.9999
    PercentValidClearDaytimeWater:          0.00
    PercentValidDaytimeWaterInCloudShadow:  0.00

3a. Export to kerchunk JSON (optional)

Write the combined virtual references to a single kerchunk-compatible JSON file that any fsspec-aware reader can consume.

out = "combined_refs.json"

combined_vds.vz.to_kerchunk(
    filepath=out,
    format="json",
)

3b. Write to Icechunk

Create a local Icechunk repository and write the virtual references into it. The VirtualChunkContainer tells Icechunk where to fetch the actual chunk bytes at read time — the original S3 files are never copied.

repo_path = Path("ndvi-icechunk-concat")
if repo_path.exists():
    shutil.rmtree(repo_path)
    print(f"Cleared existing repo at {repo_path}/")

config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        url_prefix="s3://noaa-cdr-ndvi-pds/",
        store=icechunk.s3_store(region="us-east-1", anonymous=True),
    ),
)

storage = icechunk.local_filesystem_storage(str(repo_path))
repo = icechunk.Repository.create(storage, config)

session = repo.writable_session("main")
combined_vds.vz.to_icechunk(session.store)
snapshot_id = session.commit("Add 5 days NDVI CDR Jan 2000")
print("Committed:", snapshot_id)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[1], line 1
----> 1 repo_path = Path("ndvi-icechunk-concat")
      2 if repo_path.exists():
      3     shutil.rmtree(repo_path)

NameError: name 'Path' is not defined

4. Read back and plot

Open the Icechunk store with xarray — all 5 days appear as a single continuous dataset. Chunk data is fetched lazily from S3 on demand.

credentials = icechunk.containers_credentials({
    "s3://noaa-cdr-ndvi-pds/": icechunk.s3_credentials(anonymous=True)
})

repo2 = icechunk.Repository.open(
    storage, config, authorize_virtual_chunk_access=credentials
)
session2 = repo2.readonly_session("main")

ds = xr.open_zarr(session2.store, consolidated=False, chunks=None)
ds
<xarray.Dataset> Size: 2GB
Dimensions:    (time: 5, latitude: 3600, longitude: 7200, nv: 2)
Coordinates:
  * time       (time) datetime64[ns] 40B 2000-01-01 2000-01-02 ... 2000-01-05
  * latitude   (latitude) float32 14kB 89.97 89.93 89.88 ... -89.93 -89.97
  * longitude  (longitude) float32 29kB -180.0 -179.9 -179.9 ... 179.9 180.0
Dimensions without coordinates: nv
Data variables:
    NDVI       (time, latitude, longitude) float64 1GB ...
    QA         (time, latitude, longitude) int16 259MB ...
    lon_bnds   (longitude, nv) float32 58kB ...
    lat_bnds   (latitude, nv) float32 29kB ...
    TIMEOFDAY  (time, latitude, longitude) float64 1GB ...
Attributes: (12/32)
    institution:                            NASA/GSFC/SED/ESD/HBSL/TIS/MODIS-...
    Conventions:                            CF-1.6, ACDD-1.3
    standard_name_vocabulary:               CF Standard Name Table (v25, 05 J...
    naming_authority:                       gov.noaa.ncei
    license:                                See the Use Agreement for this CD...
    cdm_data_type:                          Grid
    ...                                     ...
    InputDataType:                          GAC
    ESDT:                                   AVH13C1
    RangeBeginningTime:                     00:00:00.0000
    RangeEndingTime:                        23:59:59.9999
    PercentValidClearDaytimeWater:          0.00
    PercentValidDaytimeWaterInCloudShadow:  0.00
import hvplot.xarray  # noqa

# Global NDVI map for the first day
ds["NDVI"].isel(time=0).hvplot(rasterize=True, geo=True, global_extent=True,
    x="longitude", y="latitude", tiles='OSM',
    cmap="YlGn", clim=(-0.1, 1.0),
    title="AVHRR NDVI — 2000-01-01",
    width=800, height=400,
)
ds['crs'].values
# Global mean NDVI over the 5-day period
ds["NDVI"].mean(["latitude", "longitude"]).hvplot(
    title="Global mean NDVI — Jan 1–5, 2000",
    ylabel="NDVI",
)