9 Developer notes

9.1 Design

py-rocket-base is inspired by repo2docker and the Pangeo Docker stack design. py-rocket-base is built on the Pangeo base Dockerfile (with ONBUILD commands stripped out), and that Dockerfile makes the choices regarding the environment design: things like how the conda environment is set up and the base directory structure and permissions.

The Pangeo Docker stack does not use repo2docker, but mimics repo2docker’s environment design. The Pangeo base-image behaves similarly to repo2docker: using the base-image in the FROM line of a Dockerfile causes the build to look for files with the same names as repo2docker’s configuration files and then take the proper action with those files. This means that routine users do not need to know how to write Dockerfile code in order to extend the image with new packages or applications. The py-rocket-base Docker image uses this Pangeo base-image design. The Pangeo behavior is based on ONBUILD commands in the Dockerfile that trigger actions only when the image is used in the FROM line of another Dockerfile.

py-rocket-base does not include this ONBUILD behavior. Instead it follows the rocker docker stack design and provides helper scripts for building on the base image. py-rocket-base includes a directory called /pyrocket_scripts that will help you do common tasks for scientific docker images. These scripts are not required. If users are familiar with writing Docker files, they can write their own code. Helper scripts were adopted after feedback that the Pangeo ONBUILD behavior makes it harder to customize images that need a very specific structure or order of operations.

There are many ways to install R and RStudio into an image designed for JupyterHubs. The objective of py-rocket-base is not to install R and RStudio, per se, and there are other leaner and faster ways to install R/RStudio if that is your goal¹. The objective of py-rocket-base is to create a JupyterHub image such that when you use R in the JupyterHub (whether in RStudio, Jupyter Lab, or VSCode), you enter an environment that is the same as if you had used a Rocker image. If you are in the JupyterLab UI, the environment is the same as if you had used repo2docker (or Pangeo base-image) to create the environment.

9.2 Documentation

To build the documentation book, clone the repo and then run

cd book
quarto render .

Set GitHub Pages to serve from the docs folder.

9.3 Building the images

The .github/workflows/build.yaml file defines a GitHub Action that builds the image. The URL of the built image will look like one of these

ghcr.io/nmfs-opensci/repo-name/image-name:latest
ghcr.io/nmfs-opensci/image-name:latest

For example, for this repo the image is ghcr.io/nmfs-opensci/py-rocket-base:latest.

9.4 base-image

The base-image directory contains the Pangeo base-image Dockerfile minus the ONBUILD statements. Thus the base-image for py-rocket-base is the same as Pangeo base-image but doesn’t have the behavior of automatically processing files like environment.yml in child images (that use the base image in the FROM line).

py-rocket-base uses base-image and adds on the pangeo-notebook metapackage, which adds the basic JupyterHub and JupyterLab packages. py-rocket-base then adds on R/RStudio, more conda packages and Desktop via install scripts.

9.5 py-rocket-base

The Dockerfile does the following in order:

Move files into /srv/repo
Move py-rocket scripts into /pyrocket_scripts and rocker scripts into /rocker_scripts
Install conda packages with the pangeo-notebook metapackage as the main set of packages plus the extra server packages
Install R and RStudio plus the verse set of packages with the rocker scripts via pyrocket_scripts/install-rocker.sh
Set up the R kernel to point to the rocker installed R and fix a few things re reticulate and R in VSCode (happens in pyrocket_scripts/install-rocker.sh).
Set up the Desktop environment and ensure that applications go into /etc/xdg/userconfig instead of $HOME.
Move the start script to /srv/start.

The pieces of the Dockerfile are explained below. The numbered notes after the code describe what each code block does.

FROM ghcr.io/nmfs-opensci/py-rocket-base/base-image:latest

USER root

# Define environment variables
# DISPLAY Tell applications where to open desktop apps - this allows notebooks to pop open GUIs
ENV REPO_DIR="/srv/repo" \
    DISPLAY=":1.0" \
    R_VERSION="4.4.1"

# Add NB_USER to staff group (required for rocker script)
# Ensure the staff group exists first
RUN groupadd -f staff && usermod -a -G staff "${NB_USER}"

# Copy files into REPO_DIR and make sure staff group can edit (use staff for rocker)
COPY --chown=${NB_USER}:${NB_USER} . ${REPO_DIR}
RUN chgrp -R staff ${REPO_DIR} && \
    chmod -R g+rwx ${REPO_DIR} && \
    rm -rf ${REPO_DIR}/book ${REPO_DIR}/docs

# Copy scripts to /pyrocket_scripts and set permissions
RUN mkdir -p /pyrocket_scripts && \
    cp -r ${REPO_DIR}/scripts/* /pyrocket_scripts/ && \
    chown -R root:staff /pyrocket_scripts && \
    chmod -R 775 /pyrocket_scripts

# Install conda packages (will switch to NB_USER in script)
RUN /pyrocket_scripts/install-conda-packages.sh ${REPO_DIR}/environment.yml

# Install R, RStudio via Rocker scripts. Requires the prefix for a rocker Dockerfile
RUN /pyrocket_scripts/install-rocker.sh "verse_${R_VERSION}"

# Install extra apt packages
# Install linux packages after R installation since the R install scripts get rid of packages
RUN /pyrocket_scripts/install-apt-packages.sh ${REPO_DIR}/apt.txt

# Install some basic VS Code extensions
RUN /pyrocket_scripts/install-vscode-extensions.sh ${REPO_DIR}/vscode-extensions.txt

# Re-enable man pages disabled in Ubuntu 18 minimal image
# https://wiki.ubuntu.com/Minimal
RUN yes | unminimize
ENV MANPATH="${NB_PYTHON_PREFIX}/share/man:${MANPATH}"
RUN mandb

# Add custom Jupyter server configurations
RUN mkdir -p ${NB_PYTHON_PREFIX}/etc/jupyter/jupyter_server_config.d/ && \
    mkdir -p ${NB_PYTHON_PREFIX}/etc/jupyter/jupyter_notebook_config.d/ && \
    cp ${REPO_DIR}/custom_jupyter_server_config.json  ${NB_PYTHON_PREFIX}/etc/jupyter/jupyter_server_config.d/ && \
    cp ${REPO_DIR}/custom_jupyter_server_config.json ${NB_PYTHON_PREFIX}/etc/jupyter/jupyter_notebook_config.d/

# Set up the defaults for Desktop. 
ENV XDG_CONFIG_HOME=/etc/xdg/userconfig
RUN mkdir -p ${XDG_CONFIG_HOME} && \
    chown -R ${NB_USER}:${NB_USER} ${XDG_CONFIG_HOME} && \
    chmod -R u+rwx,g+rwX,o+rX ${XDG_CONFIG_HOME} && \
    mv ${REPO_DIR}/user-dirs.dirs ${XDG_CONFIG_HOME} && \
    chmod +x ${REPO_DIR}/scripts/setup-desktop.sh && \
    ${REPO_DIR}/scripts/setup-desktop.sh

# Fix home permissions. Not needed in JupyterHub with persistent memory but needed if not used in that context
RUN /pyrocket_scripts/fix-home-permissions.sh

# Set up the start command 
USER ${NB_USER}
RUN chmod +x ${REPO_DIR}/start \
    && cp ${REPO_DIR}/start /srv/start
    
# Revert to default user and home as pwd
USER ${NB_USER}
WORKDIR ${HOME}

1: Some commands need to be run as root, such as installing linux packages with apt-get
2: Set variables. CONDA_ENV is useful for child builds
3: Copy the py-rocket-base files into /srv/repo directory. book and docs are the documentation files and are not needed in the image.
4: Copy the pyrocket scripts into the image and set the permissions so they can be executed by the staff group (which includes jovyan). The pyrocket scripts are used to do most of the installation tasks and these can also be used to extend py-rocket-base.
5: Use the pyrocket script to install the conda packages in environment.yml. The script does clean-up. The core package is the pangeo-notebook metapackage; to this are added some JupyterLab extensions and packages needed for RStudio and Desktop. Scientific packages are not added here. They will be added via child images that use py-rocket-base as the base image (in the FROM line).
6: This section runs the script install-rocker.sh which installs R and RStudio using rocker scripts.
7: The linux packages are installed with the install-apt-packages script which takes care of clean-up. These packages need to be installed after R is installed because the R scripts uninstall packages as part of cleanup.
8: The VSCode extensions are installed into the conda environment directory instead of the home directory, since the home directory is replaced by the user persistent home directory in a JupyterHub.
9: Ubuntu does not have man pages installed by default. These lines activate man so users have the common help files.
10: This is some custom jupyter config to allow hidden files to be listed in the folder browser.
11: Setting up Desktop. Keep config in /etc so it doesn’t trash the user environment (that they might want for other environments). Setting up Desktop configuration is very poorly documented. The key is setting the environmental variable XDG_CONFIG_HOME and then putting the file user-dirs.dirs within that directory. In that file, one can specify XDG_DESKTOP_DIR="/usr/share/Desktop" which says where application files are kept.
12: Ensure that none of the directories in /home are owned by root. When the image is used in a JupyterHub, this won’t matter if home is replaced by the user persistent directory but in other applications having any directories in home owned by root will cause problems.
13: The start file mainly includes a subshell to run any start files used in extensions of the py-rocket-base image.
14: The parent docker build completes by setting the user to jovyan and the working directory to ${HOME}. Within a JupyterHub deployment, ${HOME} will often be re-mapped to the user persistent memory so it is important not to write anything that needs to be persistent to ${HOME}, for example configuration. You can do this in the start script since that runs after the user directory is mapped or you can put configuration files in some place other than ${HOME}.
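
As a minimal sketch of the last point, a child start script could seed a configuration file into ${HOME} at start-up, after the persistent home directory has been mapped. The function name seed_user_config, the /srv/defaults directory and the myapp settings file are all hypothetical placeholders and not part of py-rocket-base.

```shell
#!/bin/bash
# Hypothetical locations: defaults shipped in the image are copied
# into ${HOME} at start-up, after the persistent home is mounted.
DEFAULTS_DIR="${DEFAULTS_DIR:-/srv/defaults}"
TARGET="${HOME}/.config/myapp/settings.conf"

# Copy src to dest only when src exists and dest does not,
# so a file the user already has is never overwritten.
seed_user_config() {
    local src="$1" dest="$2"
    if [ -f "$src" ] && [ ! -e "$dest" ]; then
        mkdir -p "$(dirname "$dest")"
        cp "$src" "$dest"
    fi
}

# Does nothing if the (hypothetical) defaults file is absent
seed_user_config "${DEFAULTS_DIR}/settings.conf" "${TARGET}"
```

Because the copy is skipped when the target already exists, the same script can run at every server start without clobbering changes the user has made.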

9.6 install-rocker.sh

This script copies the rocker scripts from rocker-versioned2 into ${REPO_DIR} to install things. It reads one of the rocker docker files using R_DOCKERFILE defined in the appendix file (which is inserted into the main docker file). Variables defined here will only be available in this script. The numbered notes after the script explain what each section is doing. This shows the core of the script, but the script also has code that sets up the R kernel for Jupyter Lab and VSCode and does some more setup to make sure R works well in the Hub.

#!/bin/bash
set -e

# Copy in the rocker files. Work in ${REPO_DIR} to make sure I don't clobber anything
cd ${REPO_DIR}
wget https://github.com/rocker-org/rocker-versioned2/archive/refs/tags/R${R_VERSION}.tar.gz
tar zxvf R${R_VERSION}.tar.gz && \
mv rocker-versioned2-R${R_VERSION}/scripts /rocker_scripts && \
ROCKER_DOCKERFILE_NAME="${R_DOCKERFILE}.Dockerfile"
mv rocker-versioned2-R${R_VERSION}/dockerfiles/${ROCKER_DOCKERFILE_NAME}  /rocker_scripts/original.Dockerfile && \
rm R${R_VERSION}.tar.gz && \
rm -rf rocker-versioned2-R${R_VERSION}

cd /
# Read the Dockerfile and process each line
while IFS= read -r line; do
    # Check if the line starts with ENV or RUN
    if [[ "$line" == ENV* ]]; then
        # Assign variable
        var_assignment=$(echo "$line" | sed 's/^ENV //g')
        # Replace ENV DEFAULT_USER="jovyan"
        if [[ "$var_assignment" == DEFAULT_USER* ]]; then
            var_assignment="DEFAULT_USER=${NB_USER}"
        fi
        # Run this way eval "export ..." otherwise the " will get turned to %22
        eval "export $var_assignment"
        # Write the exported variable to env.txt
        echo "export $var_assignment" >> ${REPO_DIR}/env.txt
    elif [[ "$line" == RUN* ]]; then
        # Run the command from the RUN line
        cmd=$(echo "$line" | sed 's/^RUN //g')
        echo "Executing: $cmd"
        eval "$cmd" # || echo ${cmd}" encountered an error, but continuing..."
    fi
done < /rocker_scripts/original.Dockerfile

# Install extra tex packages that are not installed by default
if command -v tlmgr &> /dev/null; then
    echo "Installing texlive collection-latexrecommended..."
    tlmgr install collection-latexrecommended
    tlmgr install pdfcol tcolorbox eurosym upquote adjustbox titling enumitem ulem soul rsfs
fi

1: The rocker-versioned2 repository for a particular R version is copied into ${REPO_DIR} and unpacked. R_VERSION is defined in appendix.
2: The unzipped directory will be named rocker-versioned2-R${R_VERSION}. We move the scripts directory to /rocker_scripts (base level) because the rocker scripts expect the scripts to be there.
3: R_DOCKERFILE is defined as verse_${R_VERSION}. The docker file we will process (find ENV and RUN lines) is called ROCKER_DOCKERFILE_NAME in the rocker files. We move this to /rocker_scripts/original.Dockerfile so we can refer to it later.
4: Clean up the rocker directories that we no longer need.
5: cd to the base level where /rocker_scripts is.
6: The big while loop processes /rocker_scripts/original.Dockerfile line by line. The loop reads its input through redirection (<), and the input file is specified at the end of the while loop code.
7: This checks whether the line starts with ENV and, if it does, strips off ENV and stores the variable assignment statement in $var_assignment.
8: The rocker docker files do not use the NB_USER environmental variable (defined in appendix). If the ENV line is defining the default user, we need to change that assignment to the variable NB_USER. This part is specific to the rocker docker files.
9: We need to export any variables (ENV) found in the docker file so it is available to the scripts that will run in the RUN statements. We need to export the variables as done here (with eval and export) otherwise they don’t make it to the child scripts about to be run. Getting variables to be exported to child scripts being called by a parent script is tricky and this line required a lot of testing and debugging to get variables exported properly.
10: The export line will only make the variable available to the child scripts. We also want them available in the final image. To do that, we write them to a file that we will source from the docker file. Scripts are run in an ephemeral subshell during docker builds so we cannot define the variable here.
11: If the docker file line starts with RUN then run the command. This command should be a rocker script because that is how rocker docker files are organized. See an example rocker docker file.
12: Here the input file for the while loop is specified.
13: The rocker install_texlive.sh script (which is part of verse) will provide a basic texlive installation. Here a few more packages are added so that the user is able to run vanilla Quarto to PDF and Myst to PDF. See the chapter on texlive.
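
The ENV/RUN processing can be tried out safely on a toy file. The following is a minimal sketch, not the actual install-rocker.sh: it works in a temporary directory, the sample Dockerfile content and the OUT_FILE variable are made up for illustration, and it uses case in place of the [[ ... ]] tests, but it follows the same steps (strip the keyword, swap DEFAULT_USER, export, record to env.txt, eval the RUN command).

```shell
#!/bin/bash
set -e

# Work in a throwaway directory so nothing on the system is touched
workdir="$(mktemp -d)"

# A tiny stand-in for a rocker Dockerfile (hypothetical content)
cat > "${workdir}/sample.Dockerfile" <<'EOF'
FROM ubuntu:latest
ENV R_VERSION="4.4.1"
ENV DEFAULT_USER="rstudio"
RUN echo "R ${R_VERSION} for ${DEFAULT_USER}" > "${OUT_FILE}"
EOF

NB_USER="jovyan"
# OUT_FILE is only used by the sample RUN line above
export OUT_FILE="${workdir}/run-output.txt"

while IFS= read -r line; do
    case "$line" in
        ENV\ *)
            # Strip the leading "ENV " to leave NAME=value
            var_assignment="${line#ENV }"
            # Swap the rocker default user for the notebook user
            case "$var_assignment" in
                DEFAULT_USER*) var_assignment="DEFAULT_USER=${NB_USER}" ;;
            esac
            # Export for the commands run below and record for later sourcing
            eval "export $var_assignment"
            echo "export $var_assignment" >> "${workdir}/env.txt"
            ;;
        RUN\ *)
            # Strip the leading "RUN " and execute the rest
            cmd="${line#RUN }"
            echo "Executing: $cmd"
            eval "$cmd"
            ;;
    esac
done < "${workdir}/sample.Dockerfile"

cat "${workdir}/run-output.txt"
cat "${workdir}/env.txt"
```

The FROM line is ignored because it matches neither pattern, and the RUN command sees the exported variables, including the substituted DEFAULT_USER.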

9.7 start

Within a JupyterHub, the user home directory $HOME is typically re-mapped to the user persistent home directory. That means that the image build process cannot put things into $HOME; they would just be lost when $HOME is re-mapped. If a process needs to have something in the home directory, e.g. some local user configuration, this must be done in the start script. The repo2docker docker image specifies that the start script is ${REPO_DIR}/start. In py-rocket-base, the start script in a child docker file is sourced in a subshell from the py-rocket-base start script.

#!/bin/bash
set -euo pipefail

# Start - Set any environment variables here
# These are inherited by all processes, *except* RStudio
# USE export <parname>=value
# source this file to get the variables defined in the rocker Dockerfile
source ${REPO_DIR}/env.txt
# End - Set any environment variables here

# Run child start scripts in a subshell to contain its environment
# ${REPO_DIR}/childstart/ is created by setup-start.sh
if [ -d "${REPO_DIR}/childstart/" ]; then
    for script in ${REPO_DIR}/childstart/*; do
        if [ -f "$script" ]; then
            echo "Sourcing script: $script"
            source "$script" || {
                echo "Error: Failed to source $script. Moving on to the next script."
            }
        fi
    done
fi
exec "$@"

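1: There is no way to set environmental variables dynamically in a Docker file, so the env.txt file with the export <var>=<value> lines is sourced at start up.
2: Run any child start script in a subshell. Running in a subshell contains any set statements or similar. start scripts are moved into childstart by the setup-start.sh pyrocket script. As shown, the loop calls source directly, so each child script actually runs in the current shell and its variables and set statements carry through to exec.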
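
The difference between sourcing a script in the current shell and sourcing it inside a subshell can be seen with a small experiment. This is only an illustrative sketch: the child-start file, the CHILD_SETTING variable and the noglob_state helper are hypothetical and are not part of the py-rocket-base start script.

```shell
#!/bin/bash
workdir="$(mktemp -d)"

# Hypothetical child start script: sets a variable and changes a shell option
cat > "${workdir}/child-start" <<'EOF'
CHILD_SETTING="from-child"
set -f
EOF

# Report whether the noglob option (set -f) is active in the current shell
noglob_state() {
    case "$-" in
        *f*) echo "on" ;;
        *)   echo "off" ;;
    esac
}

# 1. Source inside ( ... ): the changes stay in the subshell
( . "${workdir}/child-start" )
var_after_subshell="${CHILD_SETTING:-unset}"
noglob_after_subshell="$(noglob_state)"

# 2. Source directly: the changes persist in the current shell
. "${workdir}/child-start"
var_after_source="${CHILD_SETTING:-unset}"
noglob_after_source="$(noglob_state)"
set +f   # restore globbing afterwards

echo "subshell: CHILD_SETTING=${var_after_subshell}, noglob=${noglob_after_subshell}"
echo "source:   CHILD_SETTING=${var_after_source}, noglob=${noglob_after_source}"
```

Starting from a shell where CHILD_SETTING is unset and globbing is enabled, the first line reports the variable as unset and noglob as off, while the second shows both changes leaking into the calling shell. Wrapping the source call in parentheses is therefore what contains a child script; the trade-off is that exports made by the child would no longer reach the process launched by exec.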

9.8 setup-desktop.sh

The default for XDG and xfce4 is for Desktop files to be in ~/Desktop, but this leads to a variety of problems. First, we would be altering the user directory, which seems rude. Second, orphan desktop files might be in ~/Desktop, so who knows what the user Desktop experience will be. Here the Desktop dir is set to /usr/share/Desktop so it is part of the image. Users that really want to customize Desktop can change ~/.config/user-dirs.dirs, though py-rocket-base might not respect that. It is not clear why you’d do that instead of just using a different image that doesn’t have the py-rocket-base behavior.

#!/bin/bash
set -e

# Copy in the Desktop files
APPLICATIONS_DIR=/usr/share/applications
DESKTOP_DIR=/usr/share/Desktop
mkdir -p "${DESKTOP_DIR}"
chown :staff "${DESKTOP_DIR}"
chmod 775 "${DESKTOP_DIR}"
# set the Desktop dir default for XDG
# Double quotes so that ${DESKTOP_DIR} is expanded before it is written
echo "XDG_DESKTOP_DIR=\"${DESKTOP_DIR}\"" > /etc/xdg/user-dirs.defaults

# The for loops will fail if they return null (no files). Set shell option nullglob
shopt -s nullglob

for desktop_file_path in ${REPO_DIR}/Desktop/*.desktop; do
    cp "${desktop_file_path}" "${APPLICATIONS_DIR}/."
    # Symlink application to desktop and set execute permission so xfce (desktop) doesn't complain
    desktop_file_name="$(basename "${desktop_file_path}")"
    # Set execute permissions on the copied .desktop file
    chmod +x "${APPLICATIONS_DIR}/${desktop_file_name}"
    ln -sf "${APPLICATIONS_DIR}/${desktop_file_name}" "${DESKTOP_DIR}/${desktop_file_name}"
done
update-desktop-database "${APPLICATIONS_DIR}"

# Add MIME Type data from XML files  to the MIME database.
MIME_DIR="/usr/share/mime"
MIME_PACKAGES_DIR="${MIME_DIR}/packages"
mkdir -p "${MIME_PACKAGES_DIR}"
for mime_file_path in ${REPO_DIR}/Desktop/*.xml; do
    cp "${mime_file_path}" "${MIME_PACKAGES_DIR}/."
done
update-mime-database "${MIME_DIR}"

# Add icons
ICON_DIR="/usr/share/icons"
ICON_PACKAGES_DIR="${ICON_DIR}/packages"
mkdir -p "${ICON_PACKAGES_DIR}"
for icon_file_path in "${REPO_DIR}"/Desktop/*.png; do
    cp "${icon_file_path}" "${ICON_PACKAGES_DIR}/" || echo "Failed to copy ${icon_file_path}"
done
for icon_file_path in "${REPO_DIR}"/Desktop/*.svg; do
    cp "${icon_file_path}" "${ICON_PACKAGES_DIR}/" || echo "Failed to copy ${icon_file_path}"
done
gtk-update-icon-cache "${ICON_DIR}"

1: This is the default location for system applications.
2: Create the Desktop directory and make sure jovyan can put files there. This is mainly for debugging.
3: This is not needed. It is the user-dirs.dirs file that is used.
4: Copy the .desktop file in the Desktop directory into the applications directory and make a symlink to the Desktop directory. The former means that the applications will appear in the menu in xfce4 desktop and the latter means there will be a desktop icon.
5: Add any mime xml files to the mime folder and update the mime database.
6: Add any png or svg icon files to the icon folder and update the icon database.
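
The reason for shopt -s nullglob in the script can be checked in isolation. The sketch below counts loop iterations over a glob that matches nothing, first without and then with nullglob; the count_matches helper and the empty temporary directory are only for illustration.

```shell
#!/bin/bash
workdir="$(mktemp -d)"   # empty directory: no *.desktop files here

# Count how many times a for loop over the glob runs
count_matches() {
    local n=0 f
    for f in "${workdir}"/*.desktop; do
        n=$((n + 1))
    done
    echo "$n"
}

shopt -u nullglob
without_nullglob="$(count_matches)"   # pattern stays literal, loop runs once
shopt -s nullglob
with_nullglob="$(count_matches)"      # pattern expands to nothing, loop is skipped

echo "without nullglob: ${without_nullglob} iteration(s)"
echo "with nullglob:    ${with_nullglob} iteration(s)"
```

Without nullglob the loop body runs once with the unexpanded pattern as its value, so a cp inside the loop would fail on a file literally named *.desktop. With nullglob the loop is skipped when there are no matching files.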

9.9 Notes on the Jupyter Lab environment

9.9.1 Terminal in Jupyter Lab

When a terminal is launched from the Launcher, it starts a login bash shell (bash -l). When login bash shells are started, the /etc/profile script is run. For this image, this script executes all the scripts in the directory /etc/profile.d. Among them is the script init_conda.sh, which ensures that the conda notebook environment is activated. The user might override this if they create ~/.bash_profile, which a login shell reads after /etc/profile.

For non-login bash shells (interactive), /etc/bash.bashrc determines the shell environment unless the user has created ~/.bashrc, in which case that file determines the shell environment. In Jupyter Lab, you can start an interactive shell by running bash (from a terminal). Be aware that if you run bash it might look like the conda environment is deactivated, but it really is not since the PATH still includes conda. If you are trying to remove conda from the path (and get rid of all the conda environment variables), you need to run conda deactivate (2x).
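
When debugging the terminal environment, it can help to check directly which kind of shell you are in and whether conda is still on the PATH. This is a small sketch that assumes bash; the path_has_conda helper is a hypothetical name, and it simply looks for a /conda/ component such as /srv/conda/envs/notebook/bin.

```shell
#!/bin/bash
# Report whether this bash process was started as a login shell
if shopt -q login_shell; then
    shell_kind="login"
else
    shell_kind="non-login"
fi

# Succeed if the given PATH-style string has an entry containing /conda/
path_has_conda() {
    case ":$1:" in
        *"/conda/"*) return 0 ;;
        *)           return 1 ;;
    esac
}

if path_has_conda "${PATH}"; then
    conda_on_path="yes"
else
    conda_on_path="no"
fi

echo "shell: ${shell_kind}; conda on PATH: ${conda_on_path}"
```

Running this in a terminal opened from the Launcher and again after typing bash should show the shell kind changing while conda remains on the PATH.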

9.10 Notes on R in the RStudio or Jupyter Lab environment

jupyter-rsession-proxy in environment.yml allows us to launch RStudio from Jupyter Lab. IRkernel runs in our R installation via these lines in the Docker file:

RUN Rscript -e "install.packages('IRkernel')" && \
    PATH=/srv/conda/envs/notebook/bin:$PATH Rscript -e "IRkernel::installspec(name = 'ir', displayname = 'R ${R_VERSION}')"

This creates a Jupyter Lab kernel called R X.X.X using our R installation (in /usr/local/bin/R) with all our libraries.

However we enter the R environment (either in Jupyter Lab or RStudio), the environment is different from the one you get with the default Python environment (the conda ‘notebook’ environment).

9.10.1 Environmental variables

PATH is different. conda is not on the path. Try Sys.getenv("PATH"). This is on purpose because the R geospatial packages get confused if one uses the GDAL associated with the conda environment. This is critical to know if you are using reticulate and Python inside of R. conda will not be on the path and all the Python libraries will not be accessible. If you want to use the conda environment, you have to run this. The first line tells R where the conda binary is because it has no way to find it since conda is not on its system PATH.

options(reticulate.conda_binary = "/srv/conda/condabin/conda")
reticulate::use_condaenv("/srv/conda/envs/notebook", required = TRUE)

For RStudio only: None of the environmental variables in the docker file will be in the /rstudio environment. The start command affects /lab and /notebook but not /rstudio. I have made some attempt to add back a few required ones in /usr/local/lib/R/etc/Rprofile.site but it is very minimal.

If your users need some environmental variable set, they will need to set those in $R_HOME/etc/Rprofile.site which is run when R starts.
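
One way to do this from a build script is to append a Sys.setenv() call to Rprofile.site. The following is a sketch under assumptions: the add_r_env_var helper, the MY_DATA_DIR variable and its value are placeholders, and the demo writes to a temporary file rather than to the real $R_HOME/etc/Rprofile.site.

```shell
#!/bin/bash
# Append an R statement that sets an environment variable when R starts.
# Arguments: path to an Rprofile.site file, variable name, value.
add_r_env_var() {
    local rprofile="$1" name="$2" value="$3"
    printf 'Sys.setenv(%s = "%s")\n' "$name" "$value" >> "$rprofile"
}

# Demo on a temporary file; MY_DATA_DIR is a placeholder name
demo_profile="$(mktemp)"
add_r_env_var "$demo_profile" MY_DATA_DIR /srv/data
cat "$demo_profile"
```

In an image build, the first argument would be "${R_HOME}/etc/Rprofile.site" and the command would need to run as a user with write access to that file.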

9.10.2 Terminal in RStudio

The default way that a terminal is started in RStudio is bash -l, which means it is a login terminal. When a login terminal launches, the /etc/profile script is run. For this image, this script executes all the scripts in the directory /etc/profile.d. You can add scripts there that you want to run when a login terminal is started. In particular, there is the script init_conda.sh. This ensures that when a terminal is opened from the Launcher in JupyterLab, the conda notebook environment is activated. However, we do not want this to happen in RStudio, so the script checks whether RSTUDIO==1 and R_HOME is set. If both are true, then we are in the RStudio UI and conda should not be initialized (and is not).
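
The check described above can be sketched as a small guard. This is not the actual init_conda.sh; the in_rstudio function name is hypothetical and the conda activation itself is left as a comment, so the snippet only reports which branch would be taken.

```shell
#!/bin/bash
# Succeed when the shell appears to be running inside the RStudio UI:
# RSTUDIO must equal 1 and R_HOME must be set to a non-empty value.
in_rstudio() {
    [ "${RSTUDIO:-}" = "1" ] && [ -n "${R_HOME:-}" ]
}

if in_rstudio; then
    echo "RStudio terminal detected: leaving conda uninitialized"
else
    echo "Not in RStudio: conda would be initialized here"
    # e.g. conda activate notebook   (left as a comment in this sketch)
fi
```

Requiring both conditions avoids skipping conda in a JupyterLab terminal where only one of the two variables happens to be set.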