Best Coding Practices in R

Best RACE-GAP Coding Practices Prepared by: Caitlin Allen Akselrud - NOAA Federal

Aims:

  • To create and share reproducible, transparent, and transferable code. Specifically:
  • To maintain consistent file structure and coding syntax for ease of sharing and reading code.
  • To have thoroughly documented code, that is understandable by others and reproducible.
  • To implement version control measures and track coding changes.
  • To have others view code for quality control, implementing issue trackers, and correcting issues in a timely manner.

Let go of your entrenched habits in favor of group consistency!

Standard file structure:

This file structure example is based on R code, and other softwares may have slightly different configurations. At a minimum, your file structure should generally be in keeping with Wilson et al. (2017, see Box 3). For R code and R projects, your file structure should minimally include:

|- project_name

|- - project_name.Rproj

|- - code

|- - data

|- - documentation

|- - figures

|- - functions

|- - output

You may include subfolders within your folders, such as:

|- project_name

|- - data

|- - - data_raw

Or add a year folder for annual projects, such as:

|- project_name

|- - [year]

|- - - code

|- - - [other standard subfolders]

You may also wish to add additional folders, such as:

|- project_name

|- - presentations

|- - images

|- - reports

|- - paper

|- - license

|- - citation

|- - README

An example function for automatically building these folders:

# here are the names of the file folders you want to create:
dirs = c("code", "data", "documentation", "figures", "functions", "output")
for(i in 1:length(dirs)){
  if(dir.exists(dirs[i])==FALSE){
    dir.create(dirs[i])
  }
}

Structure and syntax of code:

Code header metadata

Every .R file should start with a standard header. This header can be a comment in base format (#), in {roxygen} format (#’), or as an RMarkdown comment (— your comments —). You header should include, at a minimum:

# A short description of what this file is or does
# Created by: creator name
# Modified by: modifier name (if applicable)
# Contact: your@noaa.gov email address
# Created: date in yyyy-mm-dd format
# Modified: date in yyyy-mm-dd format

Dependencies

It is important to track all packages and package versions used in your code. Current recommendations are to use the {renv} package: https://rstudio.github.io/renv/articles/renv.html

  • At the start of a project, run: renv::init()
  • During a project, run: renv::snapshot()
  • When reopening the project or opening on a different computer, run: renv::restore()
  • You will need to make sure you have a copy of the .RProfile file and the renv.lock files, and that you share those files when sharing your code with others or between computers.

You should load all packages and libraries required in your script at the top of your script, not throughout. Naming conventions and coding style All names should be descriptive. Not descriptive: var1, file1.R Better: cpue, cpue_calc.R Best: cpue_kg, cpue_calc_from_raw_data.R

Use snake_case, all lowercase and spaces replaced with underscores (not periods, and definitely not spaces). * File names * Use snake_case (underscores, not periods, and definitely never spaces): * lowercase_and_underscore.R

Use YYYY-MM-DD for any dates in file names. It can be helpful to use dates for 1) original data download, 2) files that change (for example, based on code runs), 3) to order items systematically in a folder, or 4) versions of files:

  • orig_data_2020-12-01.csv
  • formatted_data_2020-12-02.csv
  • 2020-12-20_FirstLastName_XYZConf.pptx

Use descriptive names. For example, it is helpful to include datasource_briefdescription_firstyear_lastyear_version_date.csv: * racebt_pollock_2010_2020_v_2020-12-01.csv * Origdata_racebt_pollock_2010_2020_v_2020-12-01.csv

Use numbers to identify the order items should be in for your fellow human readers. For example, this can be helpful if you and your colleagues expect to find files saved in a predictable order: * 1_intro_map_of_study_area.jpg * 2_intro_map_of_stations_surveyed.jpg * 3_method_net_diagram.jpg

It is better to have a longer filename than something nonsensical, nondescript, or only makes sense to you (e.g., pollock20.csv). However, be mindful that it is possible to have file names that are too long and you must balance descriptiveness with succinctness. Some functions in R will require paths or file names to be below a certain character limit (often around 256 characters).

When in doubt, defer to Jenny Bryan: https://speakerdeck.com/jennybc/how-to-name-files * Object names * Short, but descriptive is ideal here. Again, use snake_case (lowercase with underscores between words, and not periods because they can cause problems in certain packages and functions).

Do not use reserved object and base R function names. Here are some common names that can cause problems: * c: c() is a function for concatenate, so do not use ‘c’ as a parameter or variable name. * T and F: these are boolean TRUE and FALSE, respectively. Note: Use TRUE and FALSE rather than T or F because some functions and packages can be picky and the former are much clearer. * t and f: t() and f() are the functions for t-tests and f-tests(). * plot: plot() is a base R function for creating a plot. A common mistake is to name a plot object ‘plot’. Instead, use something more descriptive (e.g., study_map_plot or figure_*).

Descriptive: for multiple datasets, use something like: squid_dat or env_dat rather than: dat1 or dat2

When reading in data files, there is a {tidyverse} package to correct names: rename_all(tolower). The {janitor} package is also an excellent and expedient way to convert column names to follow Tidy data conventions.

Note: for tracking derived versus original data column names, it is best to call in all data being used, keeping all original and raw data in a raw data sub-folder. Then, processed data happens in a separate function and goes into a separate processed data sub-folder.

Some people use all caps for a parameter that is fixed in the global environment. This is ok, but please define the parameter at the top of the script, and note that it is a fixed value. For example: * SELEX <- 0.5 # selectivity is fixed at 0.5

Some people prefer to capitalize the first letter of an object when turning something into a factor. This is fine. E.g., age as a continuous variable and Age as a grouped factor for plotting. Function names Use snake_case for function names (lowercase with underscores between words, and not periods because they can cause problems with package functions). descriptive_function_name is great. * Naming the function with an f_ at the start is even better for clarity to other users and can make finding the function easier: f_calc_catchability If using a specific package for a function that may have the same name among packages, the best practice is to use package::function().

Dates

Always use the international convention for dates: YYYY-MM-DD. This allows files to be stored and sorted easily, as well as limit confusion on which number is for the year, the month, or the day. Note that if this fill is being written for a single digit month or day, we use a “0” to fill the space (e.g., “2021-01-02”).

Assignments

  • Use <- instead of = when assigning a data type. Reserve = for use within function designations.
  • Good: pop_dy <- f_get_pop(growth_rate = 0.3, carrying_cap = 10000)
  • Works, but is bad practice: pop_dy = f_get_pop(growth_rate = 0.3, carrying_cap = 10000)
  • Someone will cry: x=foo.fxn(r=0.3, K=10000)

Spaces

  • Place spaces around all binary operators (=, +, -, <-, etc.).
  • Do not place a space before a comma, but always place one after a comma.
  • Detailed examples and additional recommendations are available in the tidyverse style guide.

R Sections

  • Use R sections to create outline sections with comments:
  • Use Ctrl+Shift+R to create outline sections (e.g., # — your section —).
  • Use Ctrl+Shift+O to see the outline of the script. This will help organize the script and make it easy for others to follow the script.
  • Please make a general description of the section such as: “packages”, “load data”, “data processing”, “model”, “results”, “graphics” are all good, short, descriptive names.
  • The developer can always create additional subsections within each section by adding a * before the section name (e.g., # — * your subsection—).

For example:

# --- Load Libraries ---

# --- Source Scripts ---
# --- * Load Data ---
# --- * Load Functions ---

# --- Preliminary Analysis ---

# --- Model 1 ---
# --- * Save Results ---
# --- * Graphs ---

Starting a new project with best practices:

  • Use R Projects. R experts keep all the files associated with a project together — input data, R scripts, analytical results, figures. This is such a wise and common practice that RStudio has built-in support for projects.
  • Use standard file structure. See the Standard file structures section.
  • Do NOT use fixed paths. Refer to the {here} package and use here::here("mydir", "myfile") types of file specifications.
  • Do NOT save the R environment (reproducible code should run start to end without needing to load the environment).
  • Include a header with developer information at the top of each script. See the Code header section.
  • Annotate code generously for clarity. Add comments to lines for anything that might be tricky, confusing, or where the calculation comes from a specific source or paper.
  • Include all of dependencies
  • Please limit dependencies to packages/libraries actually being use.
  • Use R sections. See the R Sections section.
  • Use modular coding practices (functions!).
  • Build in regular checks to ensure the script is doing what it is supposed to do {testthat} package: https://testthat.r-lib.org/; https://cran.r-project.org/web/packages/testthat/testthat.pdf
  • Use version control.
  • Use the tidyverse style guide (or {styler} package), so the code has consistent formatting.
  • Strongly preferred practices
  • Use Git, GitHub, and/or GitLab for version control; see GitHub use section below.
  • When prepping a package for distribution, it is up to the author as to what dependencies they wish to include, or if they prefer to work exclusively in base R coding language.
  • Converting old code to best practices:
  • If you are converting old code to a new project, please follow the below guidance in addition to the guidance in the Starting a new project with best practices section.
  • Create an R Project (if one does not exist).
  • Set up standard file structure (to the extent possible) and organize files into appropriate folders. See the Standard file structures section.
  • Correct any paths that were written to local fixed directories (or “fixed” paths). Refer to the {here} package or getwd().
  • Include your header and dependencies.
  • Only include libraries you actually use.
  • Check the code and note any dependencies that may no longer work, result in errors, or are deprecated (no longer maintained or updated in favor of a newer package). Where possible, replace the use of these packages.
  • Add sections to the code for clarity (to the extent possible). See R sections.
  • Add notes within the code for clarity (to the extent possible).
  • For consistent formatting, select all your code and reformat using Code > Reformat code (or ctrl + shift + a). Or you can run the {styler} package, so your code has consistent formatting with the R Style Guide.
  • Old non-R code to new R code
  • When creating new code based on code written in another language, it’s important to compare results of the new R code with the outputs of the previous code. It is important to test whether the two pieces of code (non-R and R) are producing the same outputs. If not, it is important to understand and document the differences.
  • It is best to create new R code that exactly replicates the original code, and then do updates and corrections to the new code, as needed, documenting where changes have been made.
  • Maintaining and sharing code:
  • Writing code for other people/computers
  • Create packages for code that are going to consistently be used or may be needed by others.
  • When writing a code or package for a wide audience/machines, check the code works using a virtual machine, with programs like Docker: https://www.docker.com/.
  • All packages can include the developer’s ORCID ID (https://orcid.org/).
  • Provide visual progress updates, so another user knows the code is running properly, even if it takes a long time.
  • Progress bar: https://ryouready.wordpress.com/2009/03/16/r-monitor-function-progress-with-a-progress-bar/. print(paste(“some note about code progress”) is also acceptable.
  • Options that rely on the code running on an open computer to execute are not encouraged (e.g., {beepr} package), as they may cause code to error out before completion.
  • If code takes a long time to run, it is helpful to annotate that in the code notes, in the script header, or to print a message to the user during the run.
  • Test the functions in the script (http://r-pkgs.had.co.nz/tests.html)

R profile options

  • Common example: machine precision settings are changed.
  • If changes are made to the options in profile, it must be specified in the code and documentation.
  • Build in a check to verify the user settings against the needed settings.
  • Save a record of what the profile options are.
  • It is recommended to not manually change options and to have the code make changes in the script.

Data sharing:

  • We often do not keep copies of raw data on sites like GitHub (too much storage). Instead, please have files in a space that is widely accessible to the group on a shared drive. Specify in the code notes where those files are.
  • If outside collaborators need access for authorized uses, another option is Google Drive (to automate downloads or otherwise access such data from R; see https://googledrive.tidyverse.org/](https://googledrive.tidyverse.org/)). To officially archive data, please see the Archiving section.

Attribution

Adapted from the Best AFSC RACE-GAP Coding Practices Guidelines compiled by Caitlin Allen Akselrud, Emily Markowitz, and others.