Data subsetting and plotting with earthaccess and xarray

Author

Luis Lopez (NASA), adapted by Eli Holmes (NOAA)

📘 Learning Objectives

  1. How to crop a single data file
  2. How to create a data cube (Dataset) with xarray
  3. How to extract variables, temporal slices, and spatial slices from an xarray dataset

Summary

In this example we will use xarray and earthaccess to subset data and make figures.

For this tutorial we will use the GHRSST Level 4 MUR Global Foundation Sea Surface Temperature Analysis (v4.1) dataset. This is much higher resolution than the AVHRR data, and we will spatially subset it to a small area of interest.

For those not working in the JupyterHub

Create a code cell and run pip install earthaccess, as sketched below.
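
A minimal sketch of that install cell, assuming a standard Jupyter environment (the %pip magic installs into the kernel's own environment):

# Run this once in a notebook code cell to install earthaccess for the current kernel
%pip install earthaccess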

Import Required Packages

# Suppress warnings
import warnings
warnings.simplefilter('ignore')
warnings.filterwarnings('ignore')
from pprint import pprint

import earthaccess
import xarray as xr

Authenticate to NASA Earthdata

We will authenticate our earthaccess session, and then open the results as we did in the Search & Discovery section.

auth = earthaccess.login()
# are we authenticated?
if not auth.authenticated:
    # ask for credentials and persist them in a .netrc file
    auth.login(strategy="interactive", persist=True)

Get a list of granules (links to our netCDF files)

short_name = 'MUR-JPL-L4-GLOB-v4.1'
version = "4.1"
date_start = "2020-01-01"
date_end = "2020-04-01"
date_range = (date_start, date_end)
# min lon, min lat, max lon, max lat
bbox = (-75.5, 33.5, -73.5, 35.5)  

results = earthaccess.search_data(
    short_name = short_name,
    version = version,
    cloud_hosted = True,
    temporal = date_range,
    bounding_box = bbox,
)
Granules found: 92
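
We can peek at the first result before opening anything. A minimal sketch, assuming the earthaccess granule methods size() and data_links() (pprint was imported above):

# Inspect the first granule: approximate size (MB) and links to the underlying netCDF file
print(results[0].size())
pprint(results[0].data_links())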

Crop and plot one netCDF file

Each MUR SST netCDF file is large, so we do not want to download whole files. Instead, we will open the data remotely and subset it before loading any values into memory. We will start with one file.

fileset = earthaccess.open(results[0:1])
ds = xr.open_dataset(fileset[0])
Opening 1 granules, approx size: 0.66 GB

Note that xarray works with “lazy” computation whenever possible. In this case, the metadata are loaded into JupyterHub memory, but the data arrays and their values are not — until there is a need for them.
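
A minimal sketch of that difference (the .nbytes value is computed from the array shape and dtype, not from data that has been read; accessing .values on a single pixel triggers an actual, small remote read):

# Nothing has been read yet: this only reports the nominal (uncompressed) array size
print(f"analysed_sst nominal size: {ds['analysed_sst'].nbytes / 1e9:.1f} GB")

# Selecting one point and asking for .values forces a small, real read from the remote file
one_pixel = ds['analysed_sst'].sel(lat=35.0, lon=-74.0, method='nearest')
print(one_pixel.values.item())   # sea surface temperature in kelvin at that point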

Let’s print out all the variable names.

for v in ds.variables:
    print(v)
time
lat
lon
analysed_sst
analysis_error
mask
sea_ice_fraction
dt_1km_data
sst_anomaly

Of the variables listed above, we are interested in analysed_sst.

ds.variables['analysed_sst'].attrs
{'long_name': 'analysed sea surface temperature',
 'standard_name': 'sea_surface_foundation_temperature',
 'units': 'kelvin',
 'valid_min': -32767,
 'valid_max': 32767,
 'comment': '"Final" version using Multi-Resolution Variational Analysis (MRVA) method for interpolation',
 'source': 'MODIS_T-JPL, MODIS_A-JPL, AMSR2-REMSS, AVHRRMTA_G-NAVO, AVHRRMTB_G-NAVO, iQUAM-NOAA/NESDIS, Ice_Conc-OSISAF'}
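
The units attribute tells us the values are in kelvin. A minimal sketch of converting to Celsius for later use (the sst_celsius name is ours; xarray drops attributes during arithmetic by default, so we reattach the units ourselves):

# Convert from kelvin to degrees Celsius; arithmetic drops attrs, so set units again
sst_celsius = ds['analysed_sst'] - 273.15
sst_celsius.attrs['units'] = 'degree_C'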

Subsetting

In addition to directly accessing the files archived and distributed by each of the NASA DAACs, many datasets also support services that allow us to customize the data via subsetting, reformatting, reprojection/regridding, and file aggregation. What does subsetting mean? To subset means to extract only the portions of a dataset that are needed for a given purpose.

There are three primary types of subsetting that we will walk through:

  1. Temporal
  2. Spatial
  3. Variable

In each case, we will use xarray to exclude the parts of the dataset that are not wanted. Note that “subsetting” is also called a data “transformation”. Temporal and spatial subsetting are demonstrated below; a small variable-subsetting sketch follows this paragraph.
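
Variable subsetting is the simplest of the three: we keep only the variables we care about. A minimal sketch (selecting a single variable returns a DataArray; selecting with a list of names returns a smaller Dataset):

# Keep just the SST variable (a DataArray) ...
sst = ds['analysed_sst']

# ... or keep a handful of variables as a smaller Dataset
ds_small = ds[['analysed_sst', 'mask']]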

# Display the full dataset's metadata
ds
<xarray.Dataset> Size: 29GB
Dimensions:           (time: 1, lat: 17999, lon: 36000)
Coordinates:
  * time              (time) datetime64[ns] 8B 2020-01-16T09:00:00
  * lat               (lat) float32 72kB -89.99 -89.98 -89.97 ... 89.98 89.99
  * lon               (lon) float32 144kB -180.0 -180.0 -180.0 ... 180.0 180.0
Data variables:
    analysed_sst      (time, lat, lon) float64 5GB ...
    analysis_error    (time, lat, lon) float64 5GB ...
    mask              (time, lat, lon) float32 3GB ...
    sea_ice_fraction  (time, lat, lon) float64 5GB ...
    dt_1km_data       (time, lat, lon) timedelta64[ns] 5GB ...
    sst_anomaly       (time, lat, lon) float64 5GB ...
Attributes: (47)

Now we will prepare a subset. We are using essentially the same spatial bounds as above; however, as opposed to the earthaccess inputs above, here we must provide inputs in the formats expected by xarray. Instead of a single four-element bounding box, we use Python slice objects, which are defined by starting and ending values.

ds_subset = ds.sel(time=date_start, lat=slice(33.5, 35.5), lon=slice(-75.5, -73.5)) 
ds_subset
<xarray.Dataset> Size: 2MB
Dimensions:           (time: 1, lat: 201, lon: 201)
Coordinates:
  * time              (time) datetime64[ns] 8B 2020-01-16T09:00:00
  * lat               (lat) float32 804B 33.5 33.51 33.52 ... 35.48 35.49 35.5
  * lon               (lon) float32 804B -75.5 -75.49 -75.48 ... -73.51 -73.5
Data variables:
    analysed_sst      (time, lat, lon) float64 323kB ...
    analysis_error    (time, lat, lon) float64 323kB ...
    mask              (time, lat, lon) float32 162kB ...
    sea_ice_fraction  (time, lat, lon) float64 323kB ...
    dt_1km_data       (time, lat, lon) timedelta64[ns] 323kB ...
    sst_anomaly       (time, lat, lon) float64 323kB ...
Attributes: (47)

Plotting

We will first plot using the methods built into the xarray package.

Note that, as opposed to the “lazy” loading of metadata previously, this will now perform “eager” computation, pulling the required data chunks.

ds_subset['analysed_sst'].plot(figsize=(10,6), x='lon', y='lat');
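
If you plan to plot or compute on the same subset repeatedly, you can pull it into memory once up front. A minimal sketch (load() is the standard xarray method for forcing an eager read; ds_subset_loaded is our name, and the subset here is only a couple of megabytes):

# Read the subset into memory once so later plots and statistics are fast
ds_subset_loaded = ds_subset['analysed_sst'].load()
print(ds_subset_loaded.mean().values)   # mean SST of the subset, in kelvin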

Create a data cube by combining multiple netCDF files

When we open multiple files, we use open_mfdataset(). Once again, this is lazy loading. Note that this method works best if you are in the same Amazon Web Services (AWS) region as the data (us-west-2) and can use a direct S3 connection. For the EDM workshop, we are on an Azure JupyterHub using an HTTPS connection, so this is much slower. If we had spun up this JupyterHub in AWS us-west-2, where the NASA data are hosted, we could load a whole year of data very quickly. We will load just a few days so it does not take too long.

fileset = earthaccess.open(results[0:6])
ds = xr.open_mfdataset(fileset[0:5])
Opening 6 granules, approx size: 4.01 GB
ds['analysed_sst']
<xarray.DataArray 'analysed_sst' (time: 5, lat: 17999, lon: 36000)> Size: 26GB
dask.array<concatenate, shape=(5, 17999, 36000), dtype=float64, chunksize=(1, 1023, 2047), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 40B 2020-01-01T09:00:00 ... 2020-01-05T09:...
  * lat      (lat) float32 72kB -89.99 -89.98 -89.97 ... 89.97 89.98 89.99
  * lon      (lon) float32 144kB -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
Attributes:
    long_name:      analysed sea surface temperature
    standard_name:  sea_surface_foundation_temperature
    units:          kelvin
    valid_min:      -32767
    valid_max:      32767
    comment:        "Final" version using Multi-Resolution Variational Analys...
    source:         MODIS_T-JPL, MODIS_A-JPL, AMSR2-REMSS, AVHRRMTA_G-NAVO, A...

We can subset a spatial box as we did with a single file.

ds_subset = ds.sel(lat=slice(33.5, 35.5), lon=slice(-75.5, -73.5)) 
ds_subset
<xarray.Dataset> Size: 9MB
Dimensions:           (time: 5, lat: 201, lon: 201)
Coordinates:
  * time              (time) datetime64[ns] 40B 2020-01-01T09:00:00 ... 2020-...
  * lat               (lat) float32 804B 33.5 33.51 33.52 ... 35.48 35.49 35.5
  * lon               (lon) float32 804B -75.5 -75.49 -75.48 ... -73.51 -73.5
Data variables:
    analysed_sst      (time, lat, lon) float64 2MB dask.array<chunksize=(1, 201, 201), meta=np.ndarray>
    analysis_error    (time, lat, lon) float64 2MB dask.array<chunksize=(1, 201, 201), meta=np.ndarray>
    mask              (time, lat, lon) float32 808kB dask.array<chunksize=(1, 201, 201), meta=np.ndarray>
    sea_ice_fraction  (time, lat, lon) float64 2MB dask.array<chunksize=(1, 201, 201), meta=np.ndarray>
    dt_1km_data       (time, lat, lon) timedelta64[ns] 2MB dask.array<chunksize=(1, 201, 201), meta=np.ndarray>
    sst_anomaly       (time, lat, lon) float64 2MB dask.array<chunksize=(1, 201, 201), meta=np.ndarray>
Attributes: (12/47)
    Conventions:                CF-1.7
    title:                      Daily MUR SST, Final product
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    references:                 http://podaac.jpl.nasa.gov/Multi-scale_Ultra-...
    institution:                Jet Propulsion Laboratory
    history:                    created at nominal 4-day latency; replaced nr...
    ...                         ...
    project:                    NASA Making Earth Science Data Records for Us...
    publisher_name:             GHRSST Project Office
    publisher_url:              http://www.ghrsst.org
    publisher_email:            ghrsst-po@nceo.ac.uk
    processing_level:           L4
    cdm_data_type:              grid
We can also subset a slice of days.

ds_subset_time = ds_subset.sel(time=slice("2020-01-01", "2020-01-04"))
ds_subset_time
<xarray.Dataset> Size: 7MB
Dimensions:           (time: 4, lat: 201, lon: 201)
Coordinates:
  * time              (time) datetime64[ns] 32B 2020-01-01T09:00:00 ... 2020-...
  * lat               (lat) float32 804B 33.5 33.51 33.52 ... 35.48 35.49 35.5
  * lon               (lon) float32 804B -75.5 -75.49 -75.48 ... -73.51 -73.5
Data variables:
    analysed_sst      (time, lat, lon) float64 1MB dask.array<chunksize=(1, 201, 201), meta=np.ndarray>
    analysis_error    (time, lat, lon) float64 1MB dask.array<chunksize=(1, 201, 201), meta=np.ndarray>
    mask              (time, lat, lon) float32 646kB dask.array<chunksize=(1, 201, 201), meta=np.ndarray>
    sea_ice_fraction  (time, lat, lon) float64 1MB dask.array<chunksize=(1, 201, 201), meta=np.ndarray>
    dt_1km_data       (time, lat, lon) timedelta64[ns] 1MB dask.array<chunksize=(1, 201, 201), meta=np.ndarray>
    sst_anomaly       (time, lat, lon) float64 1MB dask.array<chunksize=(1, 201, 201), meta=np.ndarray>
Attributes: (12/47)
    Conventions:                CF-1.7
    title:                      Daily MUR SST, Final product
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    references:                 http://podaac.jpl.nasa.gov/Multi-scale_Ultra-...
    institution:                Jet Propulsion Laboratory
    history:                    created at nominal 4-day latency; replaced nr...
    ...                         ...
    project:                    NASA Making Earth Science Data Records for Us...
    publisher_name:             GHRSST Project Office
    publisher_url:              http://www.ghrsst.org
    publisher_email:            ghrsst-po@nceo.ac.uk
    processing_level:           L4
    cdm_data_type:              grid
ds_subset_time['analysed_sst'].plot(x='lon', y='lat', col="time", col_wrap=2);

Summary

We learned how to subset xarray data cubes by time and space using sel() and slice(). Next we will show how to select data using a shapefile. If you instead want to jump to creating monthly and seasonal means from a data cube, see the 4-data-cube.ipynb tutorial or explore the gallery of xarray examples; a small preview sketch follows.
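
As a quick preview of that next step, a minimal sketch of a monthly mean over the subset (groupby over time.month is standard xarray; with only a few January days loaded here, it produces a single monthly mean):

# Average the SST subset by calendar month (only January days are present here)
monthly_mean = ds_subset_time['analysed_sst'].groupby('time.month').mean()
monthly_mean.sel(month=1).plot(x='lon', y='lat');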