NASA Earth Data
Written by Minh Phan
This tutorial demonstrates one of many ways a user can download data from NASA's EarthData database, focusing on datasets (called collections in EarthData's terminology) hosted in the cloud.
Import necessary libraries
import xarray as xr
import earthaccess
import numpy as np
import pandas as pd
import os, glob
Download data to your local machine using earthaccess library
The earthaccess library streamlines downloading, slicing, and searching for granules. For cloud-hosted datasets (which is what this tutorial works best with), we choose to download granules to the local machine instead of streaming them into the working Python scripts, as some users may not be physically located in the us-west region for streaming to be effective. Local downloading may result in heavy file sizes, but it is consistent, and we also provide some tweaks to save as much disk space as possible, especially if your research area of interest requires a long temporal range and does not cover the globe.
# Log in using .netrc file
auth = earthaccess.login(strategy="netrc")
We are already authenticated with NASA EDL
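If you do not already have a .netrc file, earthaccess can also prompt for your Earthdata Login credentials; a minimal sketch (the persist flag, which writes the credentials to a .netrc file for future sessions, is optional):
# Alternative: interactive login, prompting for Earthdata Login username/password
auth = earthaccess.login(strategy="interactive", persist=True)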
Streaming granules
# EarthAccess's approach to collecting granules
results = earthaccess.search_data(
    short_name='MUR-JPL-L4-GLOB-v4.1',
    cloud_hosted=True,
    bounding_box=(60, 5, 80, 25),
    temporal=("2003-01-01", "2003-02-28")
)
Granules found: 59
files = earthaccess.open(results)  # s3 files
Opening 59 granules, approx size: 0.0 GB
ds = xr.open_mfdataset(files)
ds
<xarray.Dataset> Dimensions: (time: 59, lat: 17999, lon: 36000) Coordinates: * time (time) datetime64[ns] 2003-01-01T09:00:00 ... 2003-02-2... * lat (lat) float32 -89.99 -89.98 -89.97 ... 89.97 89.98 89.99 * lon (lon) float32 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0 Data variables: analysed_sst (time, lat, lon) float32 dask.array<chunksize=(1, 17999, 36000), meta=np.ndarray> analysis_error (time, lat, lon) float32 dask.array<chunksize=(1, 17999, 36000), meta=np.ndarray> mask (time, lat, lon) float32 dask.array<chunksize=(1, 17999, 36000), meta=np.ndarray> sea_ice_fraction (time, lat, lon) float32 dask.array<chunksize=(1, 17999, 36000), meta=np.ndarray> Attributes: (12/47) Conventions: CF-1.5 title: Daily MUR SST, Final product summary: A merged, multi-sensor L4 Foundation SST anal... references: http://podaac.jpl.nasa.gov/Multi-scale_Ultra-... institution: Jet Propulsion Laboratory history: created at nominal 4-day latency; replaced nr... ... ... project: NASA Making Earth Science Data Records for Us... publisher_name: GHRSST Project Office publisher_url: http://www.ghrsst.org publisher_email: ghrsst-po@nceo.ac.uk processing_level: L4 cdm_data_type: grid
ds.sel(lat=slice(5, 25), lon=slice(60, 80))
<xarray.Dataset> Dimensions: (time: 59, lat: 2001, lon: 2001) Coordinates: * time (time) datetime64[ns] 2003-01-01T09:00:00 ... 2003-02-2... * lat (lat) float32 5.0 5.01 5.02 5.03 ... 24.98 24.99 25.0 * lon (lon) float32 60.0 60.01 60.02 60.03 ... 79.98 79.99 80.0 Data variables: analysed_sst (time, lat, lon) float32 dask.array<chunksize=(1, 2001, 2001), meta=np.ndarray> analysis_error (time, lat, lon) float32 dask.array<chunksize=(1, 2001, 2001), meta=np.ndarray> mask (time, lat, lon) float32 dask.array<chunksize=(1, 2001, 2001), meta=np.ndarray> sea_ice_fraction (time, lat, lon) float32 dask.array<chunksize=(1, 2001, 2001), meta=np.ndarray> Attributes: (12/47) Conventions: CF-1.5 title: Daily MUR SST, Final product summary: A merged, multi-sensor L4 Foundation SST anal... references: http://podaac.jpl.nasa.gov/Multi-scale_Ultra-... institution: Jet Propulsion Laboratory history: created at nominal 4-day latency; replaced nr... ... ... project: NASA Making Earth Science Data Records for Us... publisher_name: GHRSST Project Office publisher_url: http://www.ghrsst.org publisher_email: ghrsst-po@nceo.ac.uk processing_level: L4 cdm_data_type: grid
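If streaming is fast enough from your location, you could persist this sliced subset directly; a minimal sketch (the output filename is illustrative, and writing it will pull the full sliced subset over the network, which may be slow and large):
# illustrative: save the streamed, spatially sliced subset to a local NetCDF file
subset = ds.sel(lat=slice(5, 25), lon=slice(60, 80))
subset.to_netcdf('mur_sst_2003-01_2003-02_subset.nc')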
Download granules for an extended period of time
Since you cannot slice data spatially before downloading, granules will take up a lot of disk space even if you only need data within a small bounding box. For our task, we wrote a simple function that slices the data and writes it to a new file.
You can consult the earthaccess library website or their notebooks for code snippets on how to browse and look up collections; a quick example is sketched below. For this notebook, we mainly focus on the downloading aspect. First, we need to get the list of granules to download.
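(A quick aside: if you still need to find a collection's short name, a lookup might look like this; the keyword is illustrative.)
# illustrative: browse cloud-hosted collections matching a keyword
datasets = earthaccess.search_datasets(keyword='MUR sea surface temperature', cloud_hosted=True)
for d in datasets[:5]:
    # each result wraps the collection's UMM metadata; 'ShortName' is the id used when querying granules
    print(d['umm']['ShortName'])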
Since earthaccess does not support spatial slicing, we developed a method to download, slice, combine, and export data month by month, then delete the temporary downloaded files to save disk space. Assuming that you already know the temporal and spatial range of your chosen dataset, we first download each month's data into a temporary folder, then slice it and export the combined data to another folder.
# Our approach
# Month end not included
def download_and_combine_granules(short_name, month_start, month_end, lat1=5, lat2=25, lon1=60, lon2=80):
    for month in pd.date_range(month_start, month_end, freq='M').strftime('%Y-%m-%d'):
        print('Collecting granules')
        granules = earthaccess.granule_query().short_name(short_name).temporal(f'{month[:7]}-01', month).get(366)

        MAIN_FOLDER = 'demonstrated data/earth_data'
        TEMP_FOLDER = 'temp'
        path_temp_folder = os.path.join(MAIN_FOLDER, TEMP_FOLDER)
        path_processed_folder = os.path.join(MAIN_FOLDER, short_name)

        # create folders to store data
        if not os.path.exists(path_temp_folder):
            os.makedirs(path_temp_folder)
        if not os.path.exists(path_processed_folder):
            os.makedirs(path_processed_folder)

        files = earthaccess.download(granules, path_temp_folder)

        # combine files together and slice
        print('Slicing...')
        ## if dataset coordinates are slice-able, use:
        data = xr.open_mfdataset(f'{path_temp_folder}/*.nc').sel(lat=slice(lat1, lat2), lon=slice(lon1, lon2))
        ## for some collections, coordinate names are 'lat' and 'lon' while their underlying indices are 'latitude' and 'longitude', respectively
        ## may or may not be applicable for other datasets on the site.
        ## get bounding box if not sliceable:
        ### lat1_idx, lat2_idx, lon1_idx, lon2_idx = get_bounding_box(os.path.join(path_temp_folder, first_file), lat1, lat2, lon1, lon2)
        ### data = xr.open_mfdataset(f'{path_temp_folder}/*.nc').isel(latitude=slice(lat1_idx, lat2_idx+1), longitude=slice(lon1_idx, lon2_idx+1))

        data.to_netcdf(f'{path_processed_folder}/{month}.nc')

        # delete files in the temporary folder
        print('Deleting temporary files...')
        files = glob.glob(f'{path_temp_folder}/*.*')
        for f in files:
            os.remove(f)
def get_bounding_box(file_path, lat1=0, lat2=30, lon1=60, lon2=80):
    """
    The dataset we experimented with did not have indexed coordinates,
    so we resorted to slicing using index positions
    """
    ds = xr.open_dataset(file_path)

    lat_vals = ds.lat.values
    lon_vals = ds.lon.values

    # find the index of the grid point closest to each bound
    lat1_idx = np.abs(lat_vals - lat1).argmin()
    lat2_idx = np.abs(lat_vals - lat2).argmin()
    lon1_idx = np.abs(lon_vals - lon1).argmin()
    lon2_idx = np.abs(lon_vals - lon2).argmin()

    return lat1_idx, lat2_idx, lon1_idx, lon2_idx
# download and combine every month's worth of data
download_and_combine_granules(short_name='MUR25-JPL-L4-GLOB-v04.2',
                              month_start='2003-01', month_end='2003-03',
                              lat1=5, lat2=25, lon1=60, lon2=80)
Collecting granules
Getting 31 granules, approx download size: 0.0 GB
Slicing...
Deleting temporary files...
Collecting granules
Getting 28 granules, approx download size: 0.0 GB
Slicing...
Deleting temporary files...
A few dates may end up missing from the processed output, for example when a download fails; we can re-download those granules individually.
list_of_missing_dates = ["2003-01-04", "2003-01-18", "2003-02-18"]

for date in list_of_missing_dates:
    result = earthaccess.search_data(
        short_name='MUR-JPL-L4-GLOB-v4.1',
        cloud_hosted=True,
        bounding_box=(60, 5, 80, 25),
        temporal=(date, date)
    )

    MAIN_FOLDER = 'demonstrated data/earth_data'
    TEMP_FOLDER = 'temp'
    path_temp_folder = os.path.join(MAIN_FOLDER, TEMP_FOLDER)
    file = earthaccess.download(result, path_temp_folder)
Granules found: 1
Getting 1 granules, approx download size: 0.0 GB
Error while downloading the file 20030218090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.9/site-packages/earthaccess/store.py", line 488, in _download_file
r.raise_for_status()
File "/srv/conda/envs/notebook/lib/python3.9/site-packages/requests/models.py", line 960, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20030218090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
Combine files together
Now that we have all NetCDF4 files in one place, all spatially sliced, combining the rest of the data is a piece of cake! Note that some of the data may overlap from the earlier month-by-month combining, so it's best practice to remove duplicates (if any).
ds
<xarray.Dataset> Dimensions: (time: 59, lat: 17999, lon: 36000) Coordinates: * time (time) datetime64[ns] 2003-01-01T09:00:00 ... 2003-02-2... * lat (lat) float32 -89.99 -89.98 -89.97 ... 89.97 89.98 89.99 * lon (lon) float32 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0 Data variables: analysed_sst (time, lat, lon) float32 dask.array<chunksize=(1, 17999, 36000), meta=np.ndarray> analysis_error (time, lat, lon) float32 dask.array<chunksize=(1, 17999, 36000), meta=np.ndarray> mask (time, lat, lon) float32 dask.array<chunksize=(1, 17999, 36000), meta=np.ndarray> sea_ice_fraction (time, lat, lon) float32 dask.array<chunksize=(1, 17999, 36000), meta=np.ndarray> Attributes: (12/47) Conventions: CF-1.5 title: Daily MUR SST, Final product summary: A merged, multi-sensor L4 Foundation SST anal... references: http://podaac.jpl.nasa.gov/Multi-scale_Ultra-... institution: Jet Propulsion Laboratory history: created at nominal 4-day latency; replaced nr... ... ... project: NASA Making Earth Science Data Records for Us... publisher_name: GHRSST Project Office publisher_url: http://www.ghrsst.org publisher_email: ghrsst-po@nceo.ac.uk processing_level: L4 cdm_data_type: grid
combined = xr.open_mfdataset('demonstrated data/earth_data/MUR25-JPL-L4-GLOB-v04.2/*.nc')
combined
<xarray.Dataset> Dimensions: (time: 56, lat: 80, lon: 80) Coordinates: * time (time) datetime64[ns] 2003-01-01T09:00:00 ... 2003-02-2... * lat (lat) float32 5.125 5.375 5.625 ... 24.38 24.62 24.88 * lon (lon) float32 60.12 60.38 60.62 ... 79.38 79.62 79.88 Data variables: analysed_sst (time, lat, lon) float32 dask.array<chunksize=(29, 80, 80), meta=np.ndarray> analysis_error (time, lat, lon) float32 dask.array<chunksize=(29, 80, 80), meta=np.ndarray> mask (time, lat, lon) float32 dask.array<chunksize=(29, 80, 80), meta=np.ndarray> sea_ice_fraction (time, lat, lon) float32 dask.array<chunksize=(29, 80, 80), meta=np.ndarray> sst_anomaly (time, lat, lon) float32 dask.array<chunksize=(29, 80, 80), meta=np.ndarray> Attributes: (12/54) Conventions: CF-1.7, ACDD-1.3 title: Daily 0.25-degree MUR SST, Final product summary: A low-resolution version of the MUR SST analy... keywords: Oceans > Ocean Temperature > Sea Surface Temp... keywords_vocabulary: NASA Global Change Master Directory (GCMD) Sc... standard_name_vocabulary: NetCDF Climate and Forecast (CF) Metadata Con... ... ... publisher_name: GHRSST Project Office publisher_url: https://www.ghrsst.org publisher_email: gpc@ghrsst.org file_quality_level: 3 metadata_link: http://podaac.jpl.nasa.gov/ws/metadata/datase... acknowledgment: Please acknowledge the use of these data with...
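As noted above, overlapping files can leave duplicate time steps in the combined dataset; one way to drop them is sketched here (assuming any duplicates fall along the time dimension):
# illustrative: keep only the first occurrence of each time step
combined = combined.isel(time=~combined.get_index('time').duplicated())
# newer xarray versions also offer combined.drop_duplicates(dim='time')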
You can see that we do not need all variables: let's take a look at each and see which ones we need.
The data is "chunked" (broken down into smaller, more manageable pieces that can be read and processed lazily). Therefore you can't see the actual data by clicking on the interactive panel above, as it is lazy-loaded and only shows general information about the whole object.
Simply load the dataset to see the actual underlying data values.
# load sst_anomaly to see underlying data
combined.sst_anomaly.compute()
<xarray.DataArray 'sst_anomaly' (time: 56, lat: 80, lon: 80)> array([[[-0.086 , -0.02 , -0.026 , ..., -0.252 , -0.49400002, -0.499 ], [-0.03 , -0.24300002, -0.163 , ..., -0.344 , -0.416 , -0.38200003], [-0.41700003, -0.37100002, -0.082 , ..., 0.1 , -0.066 , -0.1 ], ..., [-0.628 , -1.077 , -1.2720001 , ..., nan, nan, nan], [-0.39000002, -0.564 , -0.80500007, ..., nan, nan, nan], [-0.75600004, -0.675 , -0.53900003, ..., nan, nan, nan]], [[-0.046 , -0.06900001, -0.089 , ..., -0.333 , -0.51000005, -0.60400003], [-0.177 , -0.379 , -0.277 , ..., -0.34300002, -0.418 , -0.43100002], [-0.358 , -0.38700002, -0.16600001, ..., -0.026 , -0.158 , -0.12900001], ... [ 0.43300003, 0.48000002, 0.633 , ..., nan, nan, nan], [ 0.439 , 0.54200006, 0.51900005, ..., nan, nan, nan], [ 0.642 , 0.637 , 0.609 , ..., nan, nan, nan]], [[ 0.194 , 0.432 , 0.31 , ..., -0.081 , 0.162 , 0.14 ], [ 0.147 , 0.20400001, 0.164 , ..., 0.035 , 0.32500002, 0.094 ], [ 0.194 , 0.216 , 0.11300001, ..., 0.24700001, 0.33900002, 0.231 ], ..., [-0.085 , 0.098 , 0.23200001, ..., nan, nan, nan], [-0.156 , -0.20700002, -0.014 , ..., nan, nan, nan], [-0.037 , -0.07 , -0.012 , ..., nan, nan, nan]]], dtype=float32) Coordinates: * time (time) datetime64[ns] 2003-01-01T09:00:00 ... 2003-02-28T09:00:00 * lat (lat) float32 5.125 5.375 5.625 5.875 ... 24.12 24.38 24.62 24.88 * lon (lon) float32 60.12 60.38 60.62 60.88 ... 79.12 79.38 79.62 79.88 Attributes: long_name: SST anomaly from a seasonal SST climatology based... coverage_content_type: auxiliaryInformation units: kelvin valid_min: -32767 valid_max: 32767 comment: anomaly reference to the day-of-year average betw...
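Once you have decided which variables to keep, the final product can be subset and written out in one go; a minimal sketch (the variable choice and output path are illustrative):
# illustrative: keep only the variables of interest and export the combined, sliced product
final = combined[['analysed_sst', 'sst_anomaly']]
final.to_netcdf('demonstrated data/earth_data/MUR25_2003-01_2003-02_combined.nc')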