NASA Earth Data

Written by Minh Phan

This tutorial demonstrates one of many ways to download data from NASA's Earthdata database, focusing on datasets (called collections in Earthdata's terminology) hosted in the cloud.

Authorize credentials

To download data from NASA's Earthdata database, it's recommended that you set up a .netrc credentials file so that you don't have to log in manually every time you run a download script. To do this, consult the 2021 Cloud Hackathon's tutorial here. Make sure to register an Earthdata account before following the tutorial.
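
For reference, the credentials live in a plain-text file named .netrc in your home directory; a minimal Earthdata entry looks like this (replace the placeholders with your own username and password):

machine urs.earthdata.nasa.gov login <your_username> password <your_password>

On Linux and macOS, it is good practice to restrict the file's permissions with chmod 600 ~/.netrc.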

When you have finished setting up the .netrc file, continue with the tutorial below.

Import necessary libraries

import xarray as xr
import earthaccess
import numpy as np
import pandas as pd
import os, glob

Download data to your local machine using the earthaccess library

The earthaccess library streamlines searching for, slicing, and downloading granules. For cloud-hosted datasets (which this tutorial works best with), we choose to download granules to the local machine instead of streaming them into the working Python script, since streaming is only effective for users running in the AWS us-west-2 region where the data is hosted. Downloading locally can consume a lot of disk space, but it is consistent, and we also provide some tweaks to save as much space as possible, especially if your area of interest requires a long temporal range and does not cover the globe.

# Log in using .netrc file
auth = earthaccess.login(strategy="netrc")
We are already authenticated with NASA EDL

Streaming granules

# EarthAccess's approach to collecting granules
results = earthaccess.search_data(
    short_name='MUR-JPL-L4-GLOB-v4.1',
    cloud_hosted=True,
    bounding_box=(60, 5, 80, 25),
    temporal=("2003-01-01", "2003-02-28")
)
Granules found: 59
files = earthaccess.open(results) # s3
 Opening 59 granules, approx size: 0.0 GB
ds = xr.open_mfdataset(files)
ds
<xarray.Dataset>
Dimensions:           (time: 59, lat: 17999, lon: 36000)
Coordinates:
  * time              (time) datetime64[ns] 2003-01-01T09:00:00 ... 2003-02-2...
  * lat               (lat) float32 -89.99 -89.98 -89.97 ... 89.97 89.98 89.99
  * lon               (lon) float32 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
Data variables:
    analysed_sst      (time, lat, lon) float32 dask.array<chunksize=(1, 17999, 36000), meta=np.ndarray>
    analysis_error    (time, lat, lon) float32 dask.array<chunksize=(1, 17999, 36000), meta=np.ndarray>
    mask              (time, lat, lon) float32 dask.array<chunksize=(1, 17999, 36000), meta=np.ndarray>
    sea_ice_fraction  (time, lat, lon) float32 dask.array<chunksize=(1, 17999, 36000), meta=np.ndarray>
Attributes: (12/47)
    Conventions:                CF-1.5
    title:                      Daily MUR SST, Final product
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    references:                 http://podaac.jpl.nasa.gov/Multi-scale_Ultra-...
    institution:                Jet Propulsion Laboratory
    history:                    created at nominal 4-day latency; replaced nr...
    ...                         ...
    project:                    NASA Making Earth Science Data Records for Us...
    publisher_name:             GHRSST Project Office
    publisher_url:              http://www.ghrsst.org
    publisher_email:            ghrsst-po@nceo.ac.uk
    processing_level:           L4
    cdm_data_type:              grid
ds.sel(lat=slice(5, 25), lon=slice(60, 80))
<xarray.Dataset>
Dimensions:           (time: 59, lat: 2001, lon: 2001)
Coordinates:
  * time              (time) datetime64[ns] 2003-01-01T09:00:00 ... 2003-02-2...
  * lat               (lat) float32 5.0 5.01 5.02 5.03 ... 24.98 24.99 25.0
  * lon               (lon) float32 60.0 60.01 60.02 60.03 ... 79.98 79.99 80.0
Data variables:
    analysed_sst      (time, lat, lon) float32 dask.array<chunksize=(1, 2001, 2001), meta=np.ndarray>
    analysis_error    (time, lat, lon) float32 dask.array<chunksize=(1, 2001, 2001), meta=np.ndarray>
    mask              (time, lat, lon) float32 dask.array<chunksize=(1, 2001, 2001), meta=np.ndarray>
    sea_ice_fraction  (time, lat, lon) float32 dask.array<chunksize=(1, 2001, 2001), meta=np.ndarray>
Attributes: (12/47)
    Conventions:                CF-1.5
    title:                      Daily MUR SST, Final product
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    references:                 http://podaac.jpl.nasa.gov/Multi-scale_Ultra-...
    institution:                Jet Propulsion Laboratory
    history:                    created at nominal 4-day latency; replaced nr...
    ...                         ...
    project:                    NASA Making Earth Science Data Records for Us...
    publisher_name:             GHRSST Project Office
    publisher_url:              http://www.ghrsst.org
    publisher_email:            ghrsst-po@nceo.ac.uk
    processing_level:           L4
    cdm_data_type:              grid

Download granules for an extended period of time

Because earthaccess downloads whole granules (the data cannot be sliced spatially before download), they can take up a lot of disk space even if you only need data within a small bounding box. For our task, we wrote a simple function that slices the data and writes the result to a new file.

You can consult the earthaccess library website or its notebooks for code snippets on how to browse and look up collections. In this notebook, we mainly focus on the downloading aspect. First, we need to get the list of granules to download.

Since earthaccess does not support spatial slicing, we developed a method to download, slice, combine, and export data month by month, then finally delete the temporary downloaded files to save disk space. Assuming you already know the temporal and spatial range of your chosen dataset, we first download each month's data into a temporary folder, then slice it and export the combined result to another folder.

# Our approach
# Month end not included: pd.date_range with freq='M' yields month-end dates
# strictly before month_end, so '2003-03' covers January and February only

def download_and_combine_granules(short_name, month_start, month_end, lat1=5, lat2=25, lon1=60, lon2=80):
    for month in pd.date_range(month_start, month_end, freq='M').strftime('%Y-%m-%d'):
        print('Collecting granules')
        # query every granule from the first day of the month through its last day
        granules = earthaccess.granule_query().short_name(short_name).temporal(f'{month[:7]}-01', month).get(366)

        MAIN_FOLDER = 'demonstrated data/earth_data'
        TEMP_FOLDER = 'temp'
        path_temp_folder = os.path.join(MAIN_FOLDER, TEMP_FOLDER)
        path_processed_folder = os.path.join(MAIN_FOLDER, short_name)
        # create folders to store temporary and processed data
        os.makedirs(path_temp_folder, exist_ok=True)
        os.makedirs(path_processed_folder, exist_ok=True)
        earthaccess.download(granules, path_temp_folder)

        ## if dataset coordinates are slice-able, use:
        print('Slicing...')
        data = xr.open_mfdataset(f'{path_temp_folder}/*.nc').sel(lat=slice(lat1, lat2), lon=slice(lon1, lon2))

        ## for some collections, coordinate names are 'lat' and 'lon' while their
        ## underlying indices are 'latitude' and 'longitude', respectively;
        ## this may or may not apply to other datasets on the site.
        ## if the coordinates are not slice-able, get the bounding box by index instead:
        ### lat1_idx, lat2_idx, lon1_idx, lon2_idx = get_bounding_box(os.path.join(path_temp_folder, first_file), lat1, lat2, lon1, lon2)
        ### data = xr.open_mfdataset(f'{path_temp_folder}/*.nc').isel(latitude=slice(lat1_idx, lat2_idx+1), longitude=slice(lon1_idx, lon2_idx+1))

        # combine the sliced granules into a single file for the month
        data.to_netcdf(f'{path_processed_folder}/{month}.nc')
        data.close()

        # delete files in the temporary folder
        print('Deleting temporary files...')
        for f in glob.glob(f'{path_temp_folder}/*.*'):
            os.remove(f)

def get_bounding_box(file_path, lat1=0, lat2=30, lon1=60, lon2=80):
    """
    The dataset we experimented with did not have indexed coordinates,
    so we resorted to slicing by index position.
    """
    ds = xr.open_dataset(file_path)

    lat_vals = ds.lat.values
    lon_vals = ds.lon.values

    # index of the grid point closest to each bound
    lat1_idx = np.abs(lat_vals - lat1).argmin()
    lat2_idx = np.abs(lat_vals - lat2).argmin()
    lon1_idx = np.abs(lon_vals - lon1).argmin()
    lon2_idx = np.abs(lon_vals - lon2).argmin()

    return lat1_idx, lat2_idx, lon1_idx, lon2_idx
# download and combine every month's worth of data
download_and_combine_granules(short_name='MUR25-JPL-L4-GLOB-v04.2',
                              month_start='2003-01', month_end='2003-03',
                              lat1=5, lat2=25, lon1=60, lon2=80)
Collecting granules
 Getting 31 granules, approx download size: 0.0 GB
Slicing...
Deleting temporary files...
Collecting granules
 Getting 28 granules, approx download size: 0.0 GB
Slicing...
Deleting temporary files...
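
In our case, a few dates turned out to be missing from the downloaded data (individual granules can occasionally fail to download), so we searched for and re-downloaded those dates one by one:
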
list_of_missing_dates = ["2003-01-04", "2003-01-18", "2003-02-18"]

for date in list_of_missing_dates:
    result = earthaccess.search_data(
        short_name='MUR-JPL-L4-GLOB-v4.1',
        cloud_hosted=True,
        bounding_box=(60, 5, 80, 25),
        temporal=(date, date)
    )
    
    MAIN_FOLDER = 'demonstrated data/earth_data'
    TEMP_FOLDER = 'temp'
    path_temp_folder = os.path.join(MAIN_FOLDER, TEMP_FOLDER)
    file = earthaccess.download(result, path_temp_folder)
Granules found: 1
 Getting 1 granules, approx download size: 0.0 GB
Error while downloading the file 20030218090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/earthaccess/store.py", line 488, in _download_file
    r.raise_for_status()
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20030218090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
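
A 502 Bad Gateway response like the one above is usually a transient server-side error; re-running the download for the affected date typically resolves it.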

Combine files together

Now that we have all netCDF4 files in one place, all spatially sliced, combining the rest of the data is a piece of cake! Note that some of the data may overlap as a result of combining the data month by month earlier, so it's best practice to remove duplicates, if any (a short sketch follows the combined dataset's output below).

combined = xr.open_mfdataset('demonstrated data/earth_data/MUR25-JPL-L4-GLOB-v04.2/*.nc')

combined
<xarray.Dataset>
Dimensions:           (time: 56, lat: 80, lon: 80)
Coordinates:
  * time              (time) datetime64[ns] 2003-01-01T09:00:00 ... 2003-02-2...
  * lat               (lat) float32 5.125 5.375 5.625 ... 24.38 24.62 24.88
  * lon               (lon) float32 60.12 60.38 60.62 ... 79.38 79.62 79.88
Data variables:
    analysed_sst      (time, lat, lon) float32 dask.array<chunksize=(29, 80, 80), meta=np.ndarray>
    analysis_error    (time, lat, lon) float32 dask.array<chunksize=(29, 80, 80), meta=np.ndarray>
    mask              (time, lat, lon) float32 dask.array<chunksize=(29, 80, 80), meta=np.ndarray>
    sea_ice_fraction  (time, lat, lon) float32 dask.array<chunksize=(29, 80, 80), meta=np.ndarray>
    sst_anomaly       (time, lat, lon) float32 dask.array<chunksize=(29, 80, 80), meta=np.ndarray>
Attributes: (12/54)
    Conventions:                CF-1.7, ACDD-1.3
    title:                      Daily 0.25-degree MUR SST, Final product
    summary:                    A low-resolution version of the MUR SST analy...
    keywords:                   Oceans > Ocean Temperature > Sea Surface Temp...
    keywords_vocabulary:        NASA Global Change Master Directory (GCMD) Sc...
    standard_name_vocabulary:   NetCDF Climate and Forecast (CF) Metadata Con...
    ...                         ...
    publisher_name:             GHRSST Project Office
    publisher_url:              https://www.ghrsst.org
    publisher_email:            gpc@ghrsst.org
    file_quality_level:         3
    metadata_link:              http://podaac.jpl.nasa.gov/ws/metadata/datase...
    acknowledgment:             Please acknowledge the use of these data with...
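
As promised above, here is a minimal sketch of removing duplicate timestamps; it keeps only the first occurrence of each time value and assumes overlapping files contain identical data for the repeated dates (newer xarray versions also offer Dataset.drop_duplicates for this):

# build a boolean mask that is False for repeated timestamps,
# keep the first occurrence of each, then sort chronologically
combined = combined.isel(time=~combined.get_index('time').duplicated())
combined = combined.sortby('time')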

You can see that we may not need all of the variables: let's take a look at each and decide which ones we need.
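
For example, if only the analysed SST and its anomaly matter downstream (a hypothetical choice), you could keep just those two variables:

# select a subset of data variables; the result is a smaller Dataset
subset = combined[['analysed_sst', 'sst_anomaly']]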

The data is “chunked”: broken down into smaller, more manageable pieces that Dask can load and process lazily. Therefore, you can't see the actual data values by clicking on the interactive panel above; the dataset is lazy-loaded and shows only general information about the whole object.

Simply load the data into memory to see the actual underlying values:

# load sst_anomaly to see underlying data
combined.sst_anomaly.compute()
<xarray.DataArray 'sst_anomaly' (time: 56, lat: 80, lon: 80)>
array([[[-0.086     , -0.02      , -0.026     , ..., -0.252     ,
         -0.49400002, -0.499     ],
        [-0.03      , -0.24300002, -0.163     , ..., -0.344     ,
         -0.416     , -0.38200003],
        [-0.41700003, -0.37100002, -0.082     , ...,  0.1       ,
         -0.066     , -0.1       ],
        ...,
        [-0.628     , -1.077     , -1.2720001 , ...,         nan,
                 nan,         nan],
        [-0.39000002, -0.564     , -0.80500007, ...,         nan,
                 nan,         nan],
        [-0.75600004, -0.675     , -0.53900003, ...,         nan,
                 nan,         nan]],

       [[-0.046     , -0.06900001, -0.089     , ..., -0.333     ,
         -0.51000005, -0.60400003],
        [-0.177     , -0.379     , -0.277     , ..., -0.34300002,
         -0.418     , -0.43100002],
        [-0.358     , -0.38700002, -0.16600001, ..., -0.026     ,
         -0.158     , -0.12900001],
...
        [ 0.43300003,  0.48000002,  0.633     , ...,         nan,
                 nan,         nan],
        [ 0.439     ,  0.54200006,  0.51900005, ...,         nan,
                 nan,         nan],
        [ 0.642     ,  0.637     ,  0.609     , ...,         nan,
                 nan,         nan]],

       [[ 0.194     ,  0.432     ,  0.31      , ..., -0.081     ,
          0.162     ,  0.14      ],
        [ 0.147     ,  0.20400001,  0.164     , ...,  0.035     ,
          0.32500002,  0.094     ],
        [ 0.194     ,  0.216     ,  0.11300001, ...,  0.24700001,
          0.33900002,  0.231     ],
        ...,
        [-0.085     ,  0.098     ,  0.23200001, ...,         nan,
                 nan,         nan],
        [-0.156     , -0.20700002, -0.014     , ...,         nan,
                 nan,         nan],
        [-0.037     , -0.07      , -0.012     , ...,         nan,
                 nan,         nan]]], dtype=float32)
Coordinates:
  * time     (time) datetime64[ns] 2003-01-01T09:00:00 ... 2003-02-28T09:00:00
  * lat      (lat) float32 5.125 5.375 5.625 5.875 ... 24.12 24.38 24.62 24.88
  * lon      (lon) float32 60.12 60.38 60.62 60.88 ... 79.12 79.38 79.62 79.88
Attributes:
    long_name:              SST anomaly from a seasonal SST climatology based...
    coverage_content_type:  auxiliaryInformation
    units:                  kelvin
    valid_min:              -32767
    valid_max:              32767
    comment:                anomaly reference to the day-of-year average betw...