All experienced statisticians or data analysts have heard about those one or two instances where a coding error led to an entire conclusion changing, even leading to a retraction. It’s the sort of stuff that keeps people up at night. Unfortunately, not many of us think about this stuff until we realize it’s highly possible that it could happen to us (p < 0.05).
So it really is worth spending the time setting up principled statistical workflows that will reduce the likelihood of making errors. I’d like to quickly go over a few things that I’ve found helpful over the years, and I’ll first start with data management/entry and then move onto analysis workflows. I largely consider this to be a living document, and I’m sure many people who will read this will have far better suggestions, so please leave them down below in the comments!
Data Entry & Management
A classic paper that I’d like to touch on is the one by Broman & Woo, 2018 on how to manage your data when working with spreadsheets. And there really are many risks to working with spreadsheets, just ask any statistician who works in genetics or a bioinformatician. I’d like to touch on some of the most important points of their paper before moving onto some other “principles” I’d like to share:
Summary: Be Consistent With Everything
- When labeling missing values (NA, N/A, 0, etc.)
Good: choose one prespecified method and be consistent
Bad: leaving cells empty
- When labeling variables everywhere
Good: if you label your one response variable
Bad: if you label your one response variable
fat-free-massin one script/sheet, and
- When formatting your variable names
- When formatting your dates
Bad: everything else
- When filling out cells
Good: one cell = one thing
Bad: one cell = multiple entries
- When keeping track of your data
Good: create a comprehensive data dictionary so anyone else can look at it and understand the spreadsheets
Bad: expecting yourself and others to figure it out based on the variable names
- When using Excel or Google Sheet’s shiny features
Good: avoid using any of them, any formulas, highlighting, italicizing, bolding, etc.
Bad: you make charts in Excel
- Planning for disasters
Good: backup your files, save them in .txt files
Bad: you don’t plan for disasters
Setting Up a Principled Statistical Workflow
Here’s what I’ve been doing for a while and what seems to work for me (none of these ideas are originally mine, and I actually picked them up over the years from others’ advice, some of which will be linked below).
First, I will create a folder with the title of my project, and usually have an RStudio project set up in there (while this advice is for people who use R, I think the general approaches may be useful). This folder will contain many other folders (more on that below), so the structure will end up being a bit complex.
If this ends up confusing you, you can just scroll all the way down to see an image of what the folder structure looks like.
Instead of constantly changing the working directory to each subfolder when I need to do something inside that folder, I set the project folder as the working directory only once, and then run the following:
library(here) set_here() here()
This not only sets the working directory, but gives you much more control over how you can save your files from any place within the hierarchical structure. It will create a file inside the main project folder called
.here, which will indicate that this is the top level of the hierarchy.
Then, I will have create a
README.rmd file in the main project folder with updates on what I’ve done and what I still need to work on, so I will remember and so others will too.
Next, I’ll set up a
Data folder inside the main project folder. This is where all my data/spreadsheets/.txt files and data dictionaries will go. The original data files will stay inside this folder, while I create two other folders inside this
Data folder called
Transformed, I will typically save .rds files that were a result of cleaning and transforming the data, including imputing missing data. I will touch more on exactly how I do that later below.
Models will obviously store fitted models and any validation of those models.
Now, back to the main project folder, I’ll set up another folder called
R. This will be where all of my .R scripts live. I will number them consistently, like so
All potential R packages and custom functions that I need will belong in
01-functions.R. This .R file will only serve this purpose and nothing else. As you’ve guessed by now, the other .R files will be doing the same, strictly for very specific purposes.
However, I will not be running each of these .R scripts individually, line by line.
Instead, after I’ve created all these files (figuring out what I need to do carefully and writing it down and annotating it), every single .R file except for
main.R will have the following script (or some iteration to match the name) at the beginning
library(here) set_here() source(here("R", "01-functions.R")) cleaning_02 <- file(here("Errors", "02-cleaning.txt"), open = "wt") sink(cleaning_02, type = "message")
and this script at the end
sink(type = "message") close(cleaning_02) readLines(here("Errors", "02-cleaning.txt"))
You’ll notice several things, first I’m once again calling the
here package and telling it that I’m working within this folder, and then I’m calling the
01-functions.R file by using the
source() function, but also notice how the
source() function is followed by a
here() function, this
here() function allows you to fully control files in a specific folder from anywhere else, without having to actually be in that folder. So, suppose I was in the
Main Project ->
Models folder and I had my working directory set there, which is pretty far from the
R folder, I can still using functions like
save(), etc. and call or manipulate files from a totally different folder by specifying the hierarchy using
This is how I will always be calling the necessary packages and functions I need from every R script. the next line
cleaning_02 <- file(here("Errors", "02-cleaning.txt"), open = "wt")
is designed to catch any errors that occur inside the .R script and save them to a .txt file in that same
R folder. So here’s an example:
# set up error control script library(here) set_here() source(here("R", "01-functions.R")) cleaning_02 <- file(here("Errors", "02-cleaning.txt"), open = "wt") sink(cleaning_02, type = "message") # start relevant analysis/cleaning df <- data.frame((x <- rnorm(500)), (y <- rnorm(500))) lm(y ~ x, data = df) # end of analysis/cleaning # close error catching script and save file sink(type = "message") close(cleaning_02) readLines(here("Errors", "02-cleaning.txt"))
I will admit that it is not perfect at capturing errors but usually if something wrong occurred when I fit my simple regression there, the file would catch the warning so I would know that something went wrong.
Why might I want to catch these errors and save them if I ran the scripts?
- suppose the analyses take a very long time, and are very computationally heavy, and R/RStudio crashes, you may not be able to figure out what went wrong
- suppose the analysis completed but because there was so much occurring in the console you missed the warnings/errors and got results that you don’t know have issues.
This also helps us automate the entire workflow once we’ve carefully set it up. After each .R script has one of these error-catching scripts, we can run everything from
main.R with one command
library(here) set_here() here() source(here("R", "01-functions.R")) source(here("R", "02-cleaning.R")) source(here("R", "03-inspection.R")) source(here("R", "03.5-load_data.R")) source(here("R", "04-analysis.R")) source(here("R", "04.5-validation.R")) source(here("R", "05-sensitivity.R")) source(here("R", "06-tables.R")) source(here("R", "07-export.R"))
So, now the
Error folder will contain .txt files with any errors or warning messages.
Typically, when I’m setting up a new project, I will also add a few other folders such as an
Outputs folder, where I will save tables, and graphs and a
Report folder, where I might be working on a manuscript.
Now that I’ve mentioned my hierarchy and how I catch errors, I can mention some other things that I believe are absolutely heinous practices. Many people often will run an entire analysis and have several objects/vectors stored in their Global Environment. Whether they’re taking a break or finished with the analysis, so what they’ll do is click the save button on top of the pane and “Save the workspace image”.
This is a horrific practice and you should never do this because it can lead to several problems such as:
- having several saved objects and packages interfere with one another once they’re all fully loaded together at once
- having giant workspace images, that will probably cause R to constantly crash
- not allowing you to load very specific objects that you need at a time while leaving everything else
I would say that it is a good idea to never save the workspace image ever, and instead, always use
saveRDS(), where the first argument is the object in your environment that you want to save, and the second argument is the path where you want to save it.
So, suppose I just used multiple imputation because I had a dataset with many missing values, and I wanted to save the imputed dataset, and only load the imputed dataset next time. This is how I would do it
source(here("R", "01-functions.R")) library(beepr) library(mice) z <- mice::parlmice( data = temp, method = "midastouch", m = 15, maxit = 60, cluster.seed = 1031, n.core = 4, n.imp.core = 15, cl.type = "FORK" ) beep(3) saveRDS(z, here("Main Project", "Data", "z.rds"))
Now I have saved my imputed dataset only as an object called
z.rds and it is saved in the
Data folder even though I would hypothetically be working from the
Also notice that I use the
beep() function from the beepr package, which can be quite handy to let you know a particular script or analysis is done.
Next time, if I cleared my entire global environment and wanted to ONLY load the imputed dataset, I would simply have to just click on
z.rds and it would load into the environment or I could use the following command
z <- readRDS("z.rds", here("Main Project", "Data", "z.rds"))
This gives me full control over my environment, and it’s also why I encourage people to turn off the automatic prompt to save the workspace that RStudio gives you by default (I think even Hadley Wickham has said the same).
So if I now wanted to include the imputed dataset into one of my .R scripts that I am trying to automate with one command, whether for reproducibility, or speed, etc., I could do it with absolute control.
This is also how I save my models and how I validate them, which is especially necessary because they can take a long time.
Here’s an example of a script that would be in the
04-analysis.R script, I’m leaving out the error script now to avoid making this too long. I don’t need to load the
brms package because I’m already calling it from the first .R script,
01-functions.R (but I’m just showing it for now to avoid confusion) and then I am saving it in the
Models folder which is found in the
Data folder. And again, the
beep() function would let me know when the script is over, and scripts like this can take very long.
source(here("R", "01-functions.R")) library(brms) n_1 <- nrow(z) # of observations k_1 <- (ncol(z) - 1) # of predictors p0_1 <- 5 # prior guess for the number of relevant variables tau0_1 <- p0_1 / (k_1 - p0_1) * 1 / sqrt(n_1) # scale for tau # regularized horseshoe prior hs_prior <- brms::set_prior("horseshoe(scale_global = tau0_1, scale_slab = 1)", class = "b") model1_full_penalized <- brm_multiple(brmsformula(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21), data = z, prior = hs_prior, family = student(), save_all_pars = TRUE, iter = 20000, warmup = 5000, chains = 4, cores = 4, thin = 1, combine = TRUE, seed = 1031, control = list(max_treedepth = 15, adapt_delta = 0.9999)) saveRDS(model1_full_penalized, here("Main Project", "Data", "Models", "model1_full_penalized.rds")) beep(3)
So I’ve saved the full contents of the model, which are usually several gigabytes large in the
Once I’ve written all my analyses scripts, I’ll typically create a subfolder in the
Outputs folder called
Figures and then
Tables and save them appropriately with the help of the
Then, I’ll go back to the
README.rmd file, make notes of the hierarchy of the entire folder and subfolders, and what has been done and what still needs to be done.
Now, suppose I deleted all of my saved models, imputed data, graphs, and tables, and only left the original dataset. Anyone could go into my project and simply highlight the entire
main.R file and run it, and they would get all the same outputs and models. Although, it might take them a long time…. depending on what they’re working with.
But anyway, here’s what my project folders tend to look like.
Also, it should go without saying that sharing your data and code will help catch errors before it’s too late!