Exploring ship echosounder data from the Pacific Hake survey

Jupyter notebook accompanying the manuscript:

Echopype: A Python library for interoperable and scalable processing of ocean sonar data for biological information
Authors: Wu-Jung Lee, Emilio Mayorga, Landung Setiawan, Kavin Nguyen, Imran Majeed, Valentina Staneva

Introduction

Goals

Illustrate a common workflow for echosounder data conversion, calibration and use. This workflow leverages the standardization applied by echopype and the power, ease of use and familiarity of libraries in the scientific Python ecosystem.
Extract and visualize data with relative ease using geospatial and temporal filters.

Description

This notebook uses EK60 echosounder data collected during the 2017 Joint U.S.-Canada Integrated Ecosystem and Pacific Hake Acoustic Trawl Survey (‘Pacific Hake Survey’) to illustrate a common workflow for data conversion, calibration and analysis using echopype and core scientific Python software packages, particularly xarray, GeoPandas, pandas and NumPy.

Two days of cloud-hosted .raw data files are accessed by echopype directly from an Amazon Web Services (AWS) S3 “bucket” maintained by the NOAA NCEI Water-Column Sonar Data Archive. The total data used are 170 .raw files at approximately 25 MB each (1 Hz pinging rate from first light to dusk), corresponding to approximately 4.2 GB. With echopype, each file is converted to a standardized representation based on the SONAR-netCDF4 v1.0 convention and saved to the cloud-optimized Zarr format.

Data stored in the netCDF-based SONAR-netCDF4 convention can be conveniently and intuitively manipulated with xarray in combination with related scientific Python packages. Mean Volume Backscattering Strength (MVBS) is computed with echopype from each raw data file and exported to a netCDF file. Here, we define two geographical bounding boxes encompassing two ship tracks and use these to extract corresponding timestamp intervals from the GPS data, and then the corresponding MVBS data based on those intervals. Finally, these extracted MVBS subsets are plotted as track echograms.

Outline

Establish AWS S3 file system connection and generate list of target EK60 .raw files
Process S3-hosted raw files with echopype: convert, calibrate and export to standardized files
Extract and process GPS locations from the Platform group of converted raw files
Read MVBS and plot track echograms for time periods corresponding to two ship tracks

Running the notebook

This notebook can be run with a conda environment created using the conda environment file https://github.com/OSOceanAcoustics/echopype-examples/blob/main/binder/environment.yml. The notebook creates two directories, if not already present: ./exports/hakesurvey_convertedzarr and ./exports/hakesurvey_calibratednc. netCDF and Zarr files will be exported there.

Note

We encourage importing echopype as ep for consistency.

from pathlib import Path

import fsspec
import numpy as np
import geopandas as gpd
import xarray as xr

import matplotlib.pyplot as plt
from shapely.geometry import box
import cartopy.crs as ccrs
import cartopy.io.img_tiles as cimgt
from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER

import echopype as ep
from echopype.qc import exist_reversed_time

import warnings
warnings.simplefilter("ignore", category=DeprecationWarning)

Establish AWS S3 file system connection and generate list of target EK60 `.raw` files

Access and inspect the publicly accessible NCEI WCSD S3 bucket on the AWS cloud as if it were a local file system. This will be done through the Python fsspec file system and bytes storage interface. We will use fsspec.filesystem.glob (fs.glob) to generate a list of all EK60 .raw data files in the bucket, then filter on file names for target dates of interest.

The directory path on the ncei-wcsd-archive S3 bucket is s3://ncei-wcsd-archive/data/raw/Bell_M._Shimada/SH1707/EK60/. All .raw files from the 2017 Hake survey cruise are found here.

fs = fsspec.filesystem('s3', anon=True)

bucket = "ncei-wcsd-archive"
rawdirpath = "data/raw/Bell_M._Shimada/SH1707/EK60"

s3rawfiles = fs.glob(f"{bucket}/{rawdirpath}/*.raw")

# print out the last two S3 raw file paths in the list
s3rawfiles[-2:]

['ncei-wcsd-archive/data/raw/Bell_M._Shimada/SH1707/EK60/Summer2017-D20170913-T180733.raw',
 'ncei-wcsd-archive/data/raw/Bell_M._Shimada/SH1707/EK60/Winter2017-D20170615-T002629.raw']

Generate list of target EK60 .raw files from AWS S3 bucket based on dates. The dates are found in the middle string token (e.g., “D20170913”). Select files from 2 days, 2017-07-28 and 2017-07-29.

s3rawfiles = [
    s3path for s3path in s3rawfiles 
    if any([f"D2017{datestr}" in s3path for datestr in ['0728', '0729']])
]

print(f"There are {len(s3rawfiles)} target raw files available")

There are 170 target raw files available

Process S3-hosted raw files with `echopype`: convert, calibrate and export to standardized files

Loop through all the selected raw files on S3 and convert, calibrate and generate Mean Volume Backscattering Strength (MVBS). Save the raw converted and MVBS data to local files, as zarr and netCDF, respectively.

def populate_metadata(ed, raw_fname):
    """
    Manually populate into the "ed" EchoData object 
    additional metadata about the dataset and the platform
    """
    
    # -- SONAR-netCDF4 Top-level Group attributes
    survey_name = (
        "2017 Joint U.S.-Canada Integrated Ecosystem and "
        "Pacific Hake Acoustic Trawl Survey ('Pacific Hake Survey')"
    )
    ed['Top-level'].attrs['title'] = f"{survey_name}, file {raw_fname}"
    ed['Top-level'].attrs['summary'] = (
        f"EK60 raw file {raw_fname} from the {survey_name}, converted to a SONAR-netCDF4 file using echopype."
        "Information about the survey program is available at "
        "https://www.fisheries.noaa.gov/west-coast/science-data/"
        "joint-us-canada-integrated-ecosystem-and-pacific-hake-acoustic-trawl-survey"
    )

    # -- SONAR-netCDF4 Platform Group attributes
    # Per SONAR-netCDF4, for platform_type see https://vocab.ices.dk/?ref=311
    ed['Platform'].attrs['platform_type'] = "Research vessel"
    ed['Platform'].attrs['platform_name'] = "Bell M. Shimada"  # A NOAA ship
    ed['Platform'].attrs['platform_code_ICES'] = "315"

Create the directories where the exported files will be saved, if these directories don’t already exist.

base_dpath = Path('./exports/notebook2')
base_dpath.mkdir(exist_ok=True, parents=True)

converted_dpath = Path(base_dpath / 'hakesurvey_convertedzarr')
converted_dpath.mkdir(exist_ok=True)
calibrated_dpath = (base_dpath / 'hakesurvey_calibratednc')
calibrated_dpath.mkdir(exist_ok=True)

echopype processing

EchoData is an echopype object for conveniently handling raw converted data from either raw instrument files or previously converted and standardized raw netCDF4 and Zarr files. It is essentially a container for multiple xarray.Dataset objects, each corresponds to one of the netCDF4 groups specified in the SONAR-netCDF4 convention – the convention followed by echopype. The EchoData object can be used to conveniently accesse and explore the echosounder raw data and for calibration and other processing.

The cell below contains the main echopype workflow steps. For each raw file: - Access file directly from S3 via ep.open_raw to create a converted EchoData object in memory - Add global and platform attributes to EchoData object - Export to a local Zarr dataset (a collection of files encapsulated in a directory) - Generate calibrated Sv and then MVBS from the raw data in the EchoData object - Export MVBS to a local netcdf file

Note: Depending on your internet speed, this cell may take some time to run (potentially 20-30 mins).

%%time

for s3rawfpath in s3rawfiles:
    raw_fpath = Path(s3rawfpath)
    try:
        # Access file directly from S3 to create a converted EchoData object in memory
        ed = ep.open_raw(
            f's3://{s3rawfpath}',
            sonar_model='EK60',
            storage_options={'anon': True}
        )
        # Manually populate additional metadata about the dataset and the platform
        populate_metadata(ed, raw_fpath.name)

        # Save to converted Zarr format
        ed.to_zarr(save_path=converted_dpath, overwrite=True)

        # Use the EchoData object "ed" to generate calibrated and
        # computed MVBS files that will be saved to netcdf
        ds_Sv = ep.calibrate.compute_Sv(ed)
        ds_MVBS = ep.commongrid.compute_MVBS(
            ds_Sv,
            range_bin='5m',  # in meters
            ping_time_bin='20s',  # in seconds
            
        )
        ds_MVBS.to_netcdf(calibrated_dpath / f'MVBS_{raw_fpath.stem}.nc')
    except Exception as e:
        print(f'Failed to process raw file {raw_fpath.name}: {e}')

/Users/wujung/miniconda3/envs/echopype_examples_v084/lib/python3.10/site-packages/xarray/core/duck_array_ops.py:215: RuntimeWarning: invalid value encountered in cast
  return data.astype(dtype, **kwargs)
/Users/wujung/miniconda3/envs/echopype_examples_v084/lib/python3.10/site-packages/xarray/core/duck_array_ops.py:215: RuntimeWarning: invalid value encountered in cast
  return data.astype(dtype, **kwargs)
/Users/wujung/miniconda3/envs/echopype_examples_v084/lib/python3.10/site-packages/xarray/core/duck_array_ops.py:215: RuntimeWarning: invalid value encountered in cast
  return data.astype(dtype, **kwargs)

CPU times: user 3min 18s, sys: 53.2 s, total: 4min 11s
Wall time: 9min

Test for time reversals

Small time reversals are found in EK60 datasets, including the 2017 Pacific Hake survey, where the ping_time (or GPS time1) value may be lower (older) than the preceding ping_time by a second. Such discontinuities can interfere with concatenating individual raw files to produce an aggregated dataset. The capability to identify and address these reversals is in the echopype.qc subpackage.

for datapath in converted_dpath.glob('*'):
    ed = ep.open_converted(datapath)
    # Test for a negative ping_time increment in sequential timestamps, in the Sonar/Beam_group1 group
    if exist_reversed_time(ds=ed['Sonar/Beam_group1'], time_name='ping_time'):
        print(f'Reversed time in {datapath}')

There are no time reversals in this two-day dataset, fortunately.

Examine the EchoData object for one of the data files

echopype provides a user-friendly, convenient representation of an EchoData object that leverages the user-friendly xarray Dataset HTML representation. Since an EchoData object is effectively a container for multiple xarray.Dataset objects corresponding to netCDF4 groups, the notebook “print out” provides a summary view of all the groups and interactive access to summaries of each group.

Here, ed is the last object opened in the time reversal test, in the preceding cell.

ed

Extract and process GPS locations from the `Platform` group of converted raw files

Use xarray.open_mfdataset to open the Platform group from all the converted raw netcdf files as a single concatenated (combined) xarray dataset. Then extract GPS time1 (time stamp), latitude and longitude from this group and transform that data into a GeoPandas GeoDataFrame containing point-geometry objects that are readily manipulated via geospatial operations. A GeoDataFrame adds geospatial capabilities to a Pandas DataFrame.

Due to the presence of multiple time coordinates in this group, care must be taken in defining how the concatenation (combine) operation is to be performed. This is captured in the arguments passed to open_mfdataset.

%%time
platform_ds = xr.open_mfdataset(
    str(converted_dpath / '*.zarr'), group='Platform', 
    engine='zarr',
    data_vars='minimal', coords='minimal',
    combine='nested'
)

CPU times: user 32.5 s, sys: 1.2 s, total: 33.7 s
Wall time: 34.6 s

platform_ds

<xarray.Dataset> Size: 11MB
Dimensions:              (channel: 3, time1: 244846, time2: 88959)
Coordinates:
  * channel              (channel) <U37 444B 'GPT  18 kHz 009072058c8d 1-1 ES...
  * time1                (time1) datetime64[ns] 2MB 2017-07-28T00:05:36.10331...
  * time2                (time2) datetime64[ns] 712kB 2017-07-28T00:05:34.897...
Data variables: (12/20)
    MRU_offset_x         float64 8B nan
    MRU_offset_y         float64 8B nan
    MRU_offset_z         float64 8B nan
    MRU_rotation_x       float64 8B nan
    MRU_rotation_y       float64 8B nan
    MRU_rotation_z       float64 8B nan
    ...                   ...
    sentence_type        (time1) object 2MB dask.array<chunksize=(244846,), meta=np.ndarray>
    transducer_offset_x  (channel) float64 24B dask.array<chunksize=(3,), meta=np.ndarray>
    transducer_offset_y  (channel) float64 24B dask.array<chunksize=(3,), meta=np.ndarray>
    transducer_offset_z  (channel) float64 24B dask.array<chunksize=(3,), meta=np.ndarray>
    vertical_offset      (time2) float64 712kB dask.array<chunksize=(88959,), meta=np.ndarray>
    water_level          float64 8B 9.15
Attributes:
    platform_code_ICES:  315
    platform_name:       Bell M. Shimada
    platform_type:       Research vessel

We can use time1 (the timestamps for NMEA datagrams, or GPS timestamps) to examine the exact timestamp interval spanned by the combined dataset.

print(f"{platform_ds.time1.values.min()}, {platform_ds.time1.values.max()}")

2017-07-28T00:05:36.103315000, 2017-07-30T00:17:59.381813000

To create the GeoPandas GeoDataFrame, first transform the latitude and longitude arrays to a single Pandas DataFrame that retains the time1 coordinate as a common index. This is done by using the DataFrame to_dataframe method together with a Pandas join operation. Then, a point GeoDataFrame is created from this DataFrame.

gps_df = platform_ds.latitude.to_dataframe().join(platform_ds.longitude.to_dataframe())

gps_df.head(3)

	latitude	longitude
time1
2017-07-28 00:05:36.103315	43.533072	-124.683998
2017-07-28 00:05:37.511097	43.533080	-124.684005
2017-07-28 00:05:37.669428	43.533167	-124.684000

gps_gdf = gpd.GeoDataFrame(
    gps_df,
    geometry=gpd.points_from_xy(gps_df['longitude'], gps_df['latitude']), 
    crs="epsg:4326"
)

A simple, easily generated map plot of the point GeoDataFrame

gps_gdf.plot(markersize=2)

Read MVBS and plot track echograms for time periods corresponding to two ship tracks

Read MVBS as a concatenated dataset

Use xarray.open_mfdataset again to read and concatenate (combine) data files into a single xarray Dataset. This time, we’re reading the MVBS netCDF files.

%%time
MVBS_ds = xr.open_mfdataset(
    str(calibrated_dpath / 'MVBS_*.nc'), 
    data_vars='minimal', coords='minimal',
    combine='by_coords'
)

CPU times: user 2.14 s, sys: 180 ms, total: 2.32 s
Wall time: 3.16 s

MVBS_ds

<xarray.Dataset> Size: 32MB
Dimensions:            (channel: 3, ping_time: 8819, echo_range: 150)
Coordinates:
  * ping_time          (ping_time) datetime64[ns] 71kB 2017-07-28T00:05:20 .....
  * channel            (channel) <U37 444B 'GPT  18 kHz 009072058c8d 1-1 ES18...
  * echo_range         (echo_range) float64 1kB 0.0 5.0 10.0 ... 740.0 745.0
Data variables:
    Sv                 (channel, ping_time, echo_range) float64 32MB dask.array<chunksize=(3, 60, 150), meta=np.ndarray>
    water_level        float64 8B 9.15
    frequency_nominal  (channel) float64 24B dask.array<chunksize=(3,), meta=np.ndarray>
Attributes:
    processing_software_name:     echopype
    processing_software_version:  0.8.4
    processing_time:              2024-04-25T17:58:23Z
    processing_function:          commongrid.compute_MVBS

Replace the channel dimension and coordinate with the frequency_nominal variable containing actual frequency values. Note that this step is possible only because there are no duplicated frequencies present.

MVBS_ds = ep.consolidate.swap_dims_channel_frequency(MVBS_ds)

MVBS_ds

<xarray.Dataset> Size: 32MB
Dimensions:            (frequency_nominal: 3, ping_time: 8819, echo_range: 150)
Coordinates:
  * ping_time          (ping_time) datetime64[ns] 71kB 2017-07-28T00:05:20 .....
  * echo_range         (echo_range) float64 1kB 0.0 5.0 10.0 ... 740.0 745.0
  * frequency_nominal  (frequency_nominal) float64 24B 1.8e+04 3.8e+04 1.2e+05
Data variables:
    Sv                 (frequency_nominal, ping_time, echo_range) float64 32MB dask.array<chunksize=(3, 60, 150), meta=np.ndarray>
    channel            (frequency_nominal) <U37 444B 'GPT  18 kHz 009072058c8...
    water_level        float64 8B 9.15
Attributes:
    processing_software_name:     echopype
    processing_software_version:  0.8.4
    processing_time:              2024-04-25T17:58:23Z
    processing_function:          commongrid.compute_MVBS

Extract MVBS along two N-S tracks selected via geographical bounding boxes

Define rectangular bounding boxes around two tracks oriented North-South, then plot a reference map showing all the GPS points (in red), the two bounding boxes (black), and the OOI mooring location (CE04 Oregon Offshore, yellow star) examined in the accompanying Jupyter notebook.

tracksouth_bbox = gpd.GeoSeries(box(-125.17, 43.65, -125.14, 43.84), crs=gps_gdf.crs)
tracknorth_bbox = gpd.GeoSeries(box(-125.17, 43.98, -125.14, 44.17), crs=gps_gdf.crs)

basemap = cimgt.OSM()

_, ax = plt.subplots(
    figsize=(7, 7), subplot_kw={"projection": basemap.crs}
)
bnd = gps_gdf.geometry.bounds
ax.set_extent([bnd.minx.min() - 1, bnd.maxx.max() + 2, 
               bnd.miny.min() - 0.8, bnd.maxy.max() + 2.2])
ax.add_image(basemap, 7)
ax.gridlines(draw_labels=True, xformatter=LONGITUDE_FORMATTER, yformatter=LATITUDE_FORMATTER)

# GPS points
gps_gdf.plot(ax=ax, markersize=0.1, color='red', 
             transform=ccrs.PlateCarree())

# Bounding box for selected tracks
tracksouth_bbox.plot(ax=ax, edgecolor="black", linewidth=1.2, facecolor='none', 
                     transform=ccrs.PlateCarree())
tracknorth_bbox.plot(ax=ax, edgecolor="black", linewidth=1.2, facecolor='none', 
                     transform=ccrs.PlateCarree())

# OOI CE04 Oregon Offshore mooring location
plt.plot(-124.95, 44.37, marker='*', color='yellow', markersize=13, 
         transform=ccrs.PlateCarree());

Clip the GPS locations GeoPandas GeoDataFrame gps_gdf generated from the Platform group with the two rectangular regions, tracksouth_bbox and tracknorth_bbox. Extract from those clipped GeoDataFrames (tracksouth_gps_gdf and tracknorth_gps_gdf) the minimum and maximum time1 timestamps for each track and use them to select the corresponding time span in the MVBS data.

Finally, create a MVBS Dataset subset for each track (tracksouth_MVBS_ds and tracknorth_MVBS_ds) by taking advantage of xarray’s “label”-based selection capability via the sel method. We select MVBS_ds data with ping_time values within the timestamp interval “slice” extracted from the geographical track. As we can see above, ping_time is a coordinate variable in the MVBS_ds Dataset.

tracksouth_gps_gdf = gpd.clip(gps_gdf, tracksouth_bbox)
tracksouth_MVBS_ds = MVBS_ds.sel(
    ping_time=slice(tracksouth_gps_gdf.index.min(), tracksouth_gps_gdf.index.max())
)

tracknorth_gps_gdf = gpd.clip(gps_gdf, tracknorth_bbox)
tracknorth_MVBS_ds = MVBS_ds.sel(
    ping_time=slice(tracknorth_gps_gdf.index.min(), tracknorth_gps_gdf.index.max())
)

Plot MVBS echograms for the two N-S tracks, for all 3 frequencies

The final step is to create echogram plots (range vs ping_time) for each echosounder frequency and each of the two selected ship tracks. That’s 6 subplots. We first define two functions to simplify the task. plot_echograms plots the echograms of the 3 frequencies as a column of subplots, extracting the data for each frequency by using xarray’s isel index selector method, which uses index counts rather than values. As we can see above, frequency_nominal is a coordinate of the Sv (MVBS) DataArray.

def track_interval_str(trackdt):
    """ Create the timestamp interval title string for a plot column
    """
    track_interval_title_str = (
        f"{trackdt.index.min().strftime('%b-%d %H:%MZ')}"
        f" to {trackdt.index.max().strftime('%b-%d %H:%MZ')}"
    )
    return track_interval_title_str

def plot_echograms(ds, freq_len, column_idx):
    """Plot echograms of the 3 frequencies for xarray dataset ds,
       as a column of subplots"""
    for f in range(freq_len):
        ax = axes[f][column_idx]
        # Select Sv data by frequency using the frequency_nominal coordinate index "f",
        # then plot the echogram of the selected data
        ds.Sv.isel(frequency_nominal=f).plot(
            ax=ax, 
            x='ping_time',
            y='echo_range',
            yincrease=False,
            vmin=-80,
            vmax=-50,
        )
        if f < 2:
            ax.set_xlabel(None);

freq_len = len(MVBS_ds.frequency_nominal)

fig, axes = plt.subplots(nrows=freq_len, ncols=2, constrained_layout=True, figsize=(16, 16))

fig.suptitle(
    (f"    South Track, {track_interval_str(tracksouth_gps_gdf)}"
     "                                         "
     f"North Track, {track_interval_str(tracknorth_gps_gdf)}"),
    fontsize=16)

plot_echograms(tracksouth_MVBS_ds, freq_len, column_idx=0) # left column
plot_echograms(tracknorth_MVBS_ds, freq_len, column_idx=1) # right column

Package versions

import datetime, s3fs
print(f"echopype: {ep.__version__}, xarray: {xr.__version__}, geopandas: {gpd.__version__}, "
      f"fsspec: {fsspec.__version__}, s3fs: {s3fs.__version__}")

print(f"\n{datetime.datetime.utcnow()} +00:00")

echopype: 0.8.4, xarray: 2024.3.0, geopandas: 0.14.3, fsspec: 2024.3.1, s3fs: 2024.3.1

2024-04-25 18:08:15.744108 +00:00