Quality Control in Statistical Analyses

All experienced statisticians or data analysts have heard of those one or two instances where a coding error changed an entire conclusion, sometimes even leading to a retraction. It’s the sort of thing that keeps people up at night. Unfortunately, few of us think about it until we realize it could very plausibly happen to us (\(p < 0.05\) or \(s > 4.32\) bits).

So it is worth spending time to set up a principled statistical workflow that will reduce the likelihood of making errors. I’d like to quickly go over a few things that I’ve found helpful over the years, starting with data entry/management and then moving on to analysis workflows. I largely consider this to be a living document, and I’m sure many readers will have far better suggestions, so please leave them in the comments below!

But before I get into this, I think it’s worth stressing that backing up your data, scripts, etc., and using version control are extremely important. I really don’t think there’s much getting around this, unfortunately. Version control lets collaborators and colleagues inspect your work, catch potential mistakes, and see overall progress; more importantly, it will keep you from losing your data in a disaster, and it will help you catch your own mistakes, since you’ll be the one most familiar with the data and scripts.
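
If you work in RStudio, one low-friction way to get version control going is through the usethis package; this is just a sketch of one option (you can of course run git from the command line instead):


# a minimal sketch, assuming you're happy using git via the usethis package
library(usethis)

use_git()      # initialize a git repository for the current RStudio project
# use_github() # optionally, create and push to a GitHub remote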


Data Entry & Management


A classic paper that I’d like to touch on is Broman & Woo, 2018, on how to manage your data when working with spreadsheets. The sad reality is that even though spreadsheet programs like Microsoft Excel or Google Sheets are available almost everywhere and are relatively easy to use, there really are many risks to working with spreadsheets; just ask any statistician who works in genetics, or any bioinformatician. I’d like to touch on some of the most important points of their paper (Broman & Woo, 2018) before moving on to some other “principles” I’d like to share:


General Principle: Be Consistent With Everything


  • When labeling missing values (NA, N/A, 0, etc.):
    • Good: Choose one prespecified method and be consistent.

    • Bad: Leaving cells empty.

  • When labeling variables:
    • Good: Labeling your one response variable fat-free-mass consistently across every script and sheet.

    • Bad: Labeling your one response variable fat-free-mass in one script/sheet, and ffm/fat_free_mass in another.

  • When formatting variable names:
    • Good: predictor_1_week12, predictor_2_week12, response_variable_week12. The order of names and dates/weeks is consistent, so the variables are easier to organize, inspect, and clean.

    • Bad: week12_predictor_1, predictor_2_week12, 12_week_response. Just no.

  • When formatting dates:
    • Good: Use one format, YYYY-MM-DD, consistently; I believe it’s superior to all other formats.

    • Bad: Everything else.

  • When filling out cells:
    • Good: One cell = one thing.

    • Bad: One cell = multiple entries.

  • When keeping track of the data:
    • Good: Create a comprehensive data dictionary so anyone can look at it and understand the spreadsheets/dataframes/variables (see the sketch just after this list).

    • Bad: Expecting yourself and others to figure it out from the variable names you thought were brilliant.

  • When using Excel or Google Sheets’ shiny features:
    • Good: Avoid using any of them: no formulas, highlighting, italicizing, bolding, etc. Keep the sheet as plain data.

    • Bad: You’re the type of person who makes charts in Excel.

  • Planning for disasters:
    • Good: Always, always back up your files, save plain-text (.txt) copies, and keep backups of those.

    • Bad: You don’t plan for disasters.
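
To make the data-dictionary point concrete, here’s roughly what I have in mind, sketched with hypothetical variable names (the exact columns are up to you):


# a minimal data dictionary, saved as a plain .txt file alongside the data
# (variable names and descriptions here are hypothetical)
dictionary <- data.frame(
  variable    = c("id", "fat_free_mass", "predictor_1_week12"),
  description = c("participant identifier",
                  "fat-free mass at baseline",
                  "first predictor measured at week 12"),
  units       = c(NA, "kg", "kg"),
  type        = c("integer", "numeric", "numeric")
)
write.table(dictionary, "data_dictionary.txt",
            sep = "\t", row.names = FALSE, quote = FALSE)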


Setting Up a Principled Statistical Workflow


Here’s what I’ve been doing for a while and what seems to work for me (none of these ideas are originally mine, and I actually picked them up over the years from others’ advice, some of which will be linked below).

Now, on to setting up this workflow. When setting up a folder for a specific project (if you’re not doing this, you seriously need to start), I first create a folder with the title of my project and usually set up an RStudio project inside it. (Disclaimer: while this advice is aimed at people who use R, I think the general approach will be useful to anyone who works with scripts.) This folder will contain many other folders (more on that below), so the structure will end up being a bit complex.

If this ends up confusing you, you can just scroll all the way down to see an image of what the folder structure looks like.

Instead of constantly changing the working directory to each subfolder whenever I need to work inside it, I set the project folder as the working directory once and then run the following:


# Very seriously consider installing the `here` package
# https://cran.r-project.org/package=here
library(here)

# set_here() writes a .here file to the current directory, marking it as the
# top level of the project; here() then returns that top-level path
set_here()
here()

This gives you much more control over how you save and load files from any place within the hierarchical structure: here() always builds paths starting from the top of the project. set_here() creates a file called .here inside the Main Project folder, which indicates that this is the top level of the hierarchy.
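
To make that concrete, a call like the one below (with a hypothetical file name) resolves to the same location no matter which subfolder you happen to be working from:


library(here)

# hypothetical example: builds ".../Main Project/Data/raw_data.txt"
# regardless of the current subfolder
here("Data", "raw_data.txt")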

Then I create a README.Rmd file in the Main Project folder with notes on what I’ve done and what I still need to work on, so I remember where things stand and my collaborators can see it too.

Next, I’ll set up a Data folder inside the Main Project folder. This is where all my data/spreadsheets/.txt files and data dictionaries will go. The original data files stay in this folder, while I create two subfolders inside it called Transformed and Models. In Transformed, I typically save .rds files produced by cleaning and transforming the data, including imputing missing data (more on exactly how I do that below). The Models subfolder, obviously, stores fitted models and any validation/sensitivity analyses of those models.
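
You can make these folders by hand, but if you’d rather set them up from R, here’s a quick sketch (the showWarnings argument just silences complaints if a folder already exists):


library(here)

# create the project subfolders described above (and a few used later on)
dir.create(here("Data", "Transformed"), recursive = TRUE, showWarnings = FALSE)
dir.create(here("Data", "Models"), recursive = TRUE, showWarnings = FALSE)
dir.create(here("R"), showWarnings = FALSE)
dir.create(here("Errors"), showWarnings = FALSE)
dir.create(here("Outputs"), showWarnings = FALSE)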

Now, back in the Main Project folder, I’ll set up another folder called R. This is where all of my .R scripts live. I number them consistently, like so:

  • main.R
  • 01-functions.R
  • 02-cleaning.R
  • 03-inspection.R
  • 03.5-load_data.R
  • 04-analysis.R
  • 04.5-validation.R
  • 05-sensitivity.R
  • 06-tables.R
  • 07-export.R

All the R packages and custom functions that I need belong in 01-functions.R. That .R file serves only this purpose and nothing else. As you’ve guessed by now, the other .R files work the same way, each strictly for one specific purpose.
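
For a sense of what that looks like, here’s a rough sketch of a 01-functions.R file; the exact packages and the helper function are just placeholders for whatever your project needs:


# 01-functions.R: packages and custom functions only, nothing else
library(here)
library(mice)          # multiple imputation
library(brms)          # Bayesian models
library(summarytools)  # dfSummary() and friends
library(beepr)         # beep() when long scripts finish

# a hypothetical helper used across the other scripts
count_missing <- function(data) {
  colSums(is.na(data))
}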

However, I will not be running each of these .R scripts individually, line by line.

Instead, after I’ve created all these files (carefully figuring out what I need to do, writing it down, and annotating it), every single .R file except for main.R gets the following script (or some iteration of it, adjusted to match the file’s name) at the beginning


library(here)
set_here()

source(here("R", "01-functions.R"))

cleaning_02 <- file(here("Errors", "02-cleaning.txt"), open = "wt")
sink(cleaning_02, type = "message")

and this script at the end.


sink(type = "message")
close(cleaning_02)

readLines(here("Errors", "02-cleaning.txt"))

You’ll notice several things. First, I’m once again calling the here package and marking the project’s top level, and then I’m calling the 01-functions.R file with source(). Notice also that source() is wrapped around a here() call: here() lets you point to files in a specific folder from anywhere else, without actually having to be in that folder. So suppose I were working in Main Project -> Data -> Models and saving my work there (in the Models folder), which is pretty far from the R folder; I could still, using functions like source(), save(), etc., call or manipulate files in a totally different folder by spelling out the hierarchy with here().

This is how I will always call the necessary packages and functions from every R script. Now let’s look at the next line:


cleaning_02 <- file(here("Errors", "02-cleaning.txt"), open = "wt")

This line creates a .txt file called 02-cleaning.txt inside a separate Errors folder in the Main Project folder and opens a connection to it; the sink() call that follows then redirects any warnings and messages produced by the .R script into that file. So here’s an example:


# set up the warning-catching script
library(here)
set_here()

# call the .R script that has all the packages/functions
source(here("R", "01-functions.R"))

# file() creates a .txt file within the Errors folder
cleaning_02 <- file(here("Errors", "02-cleaning.txt"), open = "wt")

# sink() starts redirecting messages/warnings into that .txt file
sink(cleaning_02, type = "message")

# This is where my data inspection and cleaning might start

set.seed(1031) # My birthday

# Time to simulate fake data
df <- data.frame(x = rnorm(500),
                 y = rnorm(500))

str(df)      # inspect the data frame
summary(df)

sum(is.na(df))     # look at missing values, etc.
colSums(is.na(df))

# I think you get the point
# dfSummary()/view() are from the summarytools package (loaded via 01-functions.R)
view(dfSummary(df))

# end of cleaning/inspection

# close the message-catching sink and save the file
sink(type = "message")
close(cleaning_02)

# open the .txt file for inspection
readLines(here("Errors", "02-cleaning.txt"))

I will admit that this is not perfect at capturing warnings, but if something goes wrong when I run a script, sink() and its .txt file will usually catch the warning, so I know that something went wrong during the analysis.
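
If you want something a bit more airtight, one option (not part of my usual workflow, just a sketch) is to wrap the risky chunk in base R’s withCallingHandlers(), which logs each warning the moment it is raised:


library(here)

log_file <- here("Errors", "02-cleaning-warnings.txt")

withCallingHandlers(
  {
    # ... cleaning code would go here ...
    as.numeric("not a number")  # toy line that throws a warning
  },
  warning = function(w) {
    # append the warning message to the log, then continue
    cat(conditionMessage(w), "\n", file = log_file, append = TRUE)
    invokeRestart("muffleWarning")
  }
)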

Why might I want to catch these messages and save them when I run the scripts?

Many reasons:

  • Suppose the analyses take a very long time and are computationally heavy, and R/RStudio crashes; without a log you may not be able to figure out what went wrong.
  • Suppose the analysis completed, but because there was so much output in the console you missed the warnings/errors and got results that you don’t know have issues.

This also lets us automate the entire workflow once we’ve carefully set it up. After each .R script has one of these message-catching blocks, we can run everything from main.R with one command:


# load here and mark the project top level
library(here)
set_here()
here()

# one line to load all necessary packages/functions
source(here("R", "01-functions.R"))

# one line to clean the data
source(here("R", "02-cleaning.R"))

# one line to inspect the data; usually where I also impute
source(here("R", "03-inspection.R"))

# load imputed data/other necessary objects saved by the previous script
source(here("R", "03.5-load_data.R"))

# main analysis script
source(here("R", "04-analysis.R"))

# validation of the main analysis results
source(here("R", "04.5-validation.R"))

# test the sensitivity of the results by varying assumptions/parameters
source(here("R", "05-sensitivity.R"))

# generate tables and figures of all the results
source(here("R", "06-tables.R"))

# export these into files that can be inserted into papers, reports, etc.
source(here("R", "07-export.R"))

So now the Errors folder will contain .txt files with any warning messages or errors.

Typically, when I’m setting up a new project, I will also add a few other folders, such as an Outputs folder, where I save tables and graphs, and a Report folder, where I might be working on a manuscript.

Now that I’ve described my hierarchy and how I catch errors, I can mention some other practices that I believe are absolutely heinous. Many people will run an entire analysis and end up with several objects/vectors stored in their Global Environment. Then, whether they’re taking a break or finished with the analysis, they’ll click the save button at the top of the pane and “Save the workspace image.”

This is a horrific practice and you should never do this because it can lead to several problems such as:

  • having several saved objects and packages interfere with one another once they’re all loaded together

  • having giant workspace images that will probably cause R to crash constantly

  • not being able to load only the specific objects you need at a given time while leaving everything else out

I would say it is a good idea to never save the workspace image, ever. Instead, always use the base R function saveRDS(), where the first argument is the object in your environment that you want to save and the second argument is the path where you want to save it, which of course should be built with the here() function.
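
In its simplest form, that looks something like this (the object and file names are just placeholders):


# save one specific object to an explicit location...
saveRDS(some_object, here("Data", "Transformed", "some_object.rds"))

# ...and load it back later, by itself, into a clean environment
some_object <- readRDS(here("Data", "Transformed", "some_object.rds"))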

So suppose I had a dataset with many missing values, I had a grasp of the missing-data mechanism, and so I used multiple imputation with something like weighted predictive mean matching. I want to save the imputed dataset and only load that imputed dataset next time. This is how I would do it (in the vaguest way possible):


source(here("R", "01-functions.R"))
library(beepr)

# multiple imputation package (multiple imputation by chained equations)
library(mice)

# temporary dataset to be used in the imputation workflow
temp <- df

# parallel MICE with weighted predictive mean matching ("midastouch")
z <- parlmice(data = temp, method = "midastouch", m = 100, maxit = 100,
              cluster.seed = 1031, n.core = 8, n.imp.core = 25, cl.type = "FORK")

# I'm not going to walk through the function above because it would take an
# extraordinary amount of time. I recommend checking out Stef van Buuren's
# papers and books for R, as he is an expert on MICE and multiple imputation.
# Also check out the work of other missing-data researchers
# (Tim Morris, Ian White, Donald Rubin, Jonathan Bartlett, Paul Gustafson, etc.)

beep(3) # I will touch on this function below.

saveRDS(z, here("Data", "z.rds"))

Here the workflow really shows itself: saveRDS() saves one very specific object (z, the imputed dataset) into the Data folder by using the hierarchical structure of here(). Notice how the path starts from the top of the project, goes into Data, and ends with the file name (“z.rds”).

Now I have saved my imputed dataset as a single file called z.rds in the Data folder, even though I would hypothetically be working from the R folder.

Also notice that I use the beep() function from the beepr package, which is quite handy for letting you know when a particular script or analysis is done.

Next time, if I cleared my entire global environment and wanted to load ONLY the imputed dataset, I would simply click on the z.rds file to load it into the environment, or I could use the following command:


z <- readRDS(here("Data", "z.rds"))

This gives me full control over my environment, and it’s also why I encourage people to turn off the automatic prompt to save the workspace that RStudio gives you by default (I think even Hadley Wickham has said the same).
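
You can make that the default in RStudio under Tools -> Global Options -> General (set “Save workspace to .RData on exit” to “Never”), or, as a sketch, let the usethis package do it for you:


# one option (assuming you have the usethis package installed)
usethis::use_blank_slate("user")  # never save or restore .RData by default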

So if I now wanted to include the imputed dataset into one of my .R scripts that I am trying to automate with one command, whether for reproducibility, or speed, etc., I could do it with absolute control.

This is also how I save my models and how I validate them, which is especially necessary because they can take a long time.

Here’s an example of a script that would live in 04-analysis.R; I’m leaving out the warning-catching block this time to keep things from getting too long. I don’t actually need to load the brms package, because it’s already being called from the first .R script, 01-functions.R (I’m only showing it here to avoid confusion). I then save the fitted model in the Models folder, which sits inside the Data folder. And again, the beep() function lets me know when the script is done, since scripts like this can take a very long time.


# calling all packages, please report for duty
source(here("R", "01-functions.R"))

library(brms)

# This is a script I used previously, where I used Bayes for penalization,
# although these are all generic options

# number of observations (z is a mids object, so take dimensions from the
# original data stored inside it)
n_1 <- nrow(z$data)

# number of predictors
k_1 <- ncol(z$data) - 1

# prior guess for the number of relevant variables
p0_1 <- 5

# scale for tau
tau0_1 <- p0_1 / (k_1 - p0_1) * 1 / sqrt(n_1)

# regularized horseshoe prior
hs_prior <- set_prior("horseshoe(scale_global = tau0_1, scale_slab = 1)",
                      class = "b")

model1_full_penalized <- brm_multiple(
              bf(y ~ x1 + x2 + x3 + x4 +
              x5 + x6 + x7 + x8 +
              x9 + x10 + x11 + x12 +
              x13 + x14 + x15 + x16 +
              x17 + x18 + x19 + x20 + x21, quantile = 0.50),
              data = z, prior = hs_prior,
              family = asym_laplace(), save_all_pars = TRUE,
              iter = 20000, warmup = 5000, chains = 4,
              cores = 4, thin = 1, combine = TRUE, seed = 1031,
              control = list(max_treedepth = 15, adapt_delta = 0.9999))

# principled workflow: save the fitted model to Data/Models
saveRDS(model1_full_penalized,
        here("Data", "Models", "model1_full_penalized.rds"))

beep(3)

So I’ve saved the full contents of the model (model1_full_penalized), which are usually several gigabytes large, into the Models folder. Once again, notice how the here() function lets you work from a subfolder like R, which is inside the Main Project folder, while spelling out the hierarchy and saving to whichever folder you wish, in this case Models, which is inside Data.
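
As a rough sketch of what 04.5-validation.R might then do (the exact checks will depend on the model, and with multiply imputed data they need extra care):


source(here("R", "01-functions.R"))

# load only the fitted model saved by 04-analysis.R
model1_full_penalized <- readRDS(
  here("Data", "Models", "model1_full_penalized.rds"))

summary(model1_full_penalized)      # convergence diagnostics and estimates
pp_check(model1_full_penalized)     # posterior predictive check from brms

beep(3)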

Once I’ve written all my analysis scripts, I’ll typically create subfolders in the Outputs folder called Figures and Tables and save the figures and tables there, again with the help of the here package.
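
So a typical save call might look like this, assuming, say, a ggplot2 object called fig_1:


library(ggplot2)

# save a hypothetical figure into Outputs/Figures via here()
ggsave(here("Outputs", "Figures", "figure-1.png"),
       plot = fig_1, width = 6, height = 4, dpi = 300)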

Then, I’ll go back to the README.Rmd file, make notes of the hierarchy of the entire folder and subfolders, and what’s been completed and what still needs to be done.

Now, suppose I deleted all of my saved models, imputed data, graphs, and tables, and left only the original dataset. Anyone could go into my project, highlight the entire main.R file, and run it, and they would get all the same outputs and models, assuming I’ve set a seed, which I sometimes do. Although it might take them a long time, depending on how computationally intensive the scripts are.

Also, please, please annotate your scripts, like I’ve done above to explain some of the functions. This is helpful not only for others trying to go through your code, but also for yourself, if you end up forgetting what does what and which steps you were taking.

But anyway, here’s what my project folders tend to look like, and here is a dummy GitHub repo showing what the structure typically looks like, as a template.



There are some other quality checks that I’d like to incorporate into my own workflows very soon, and that I plan on adding to this particular blog post. Most of them involve practices/principles from R package development, so they have to do with writing tests that check whether your scripts are doing what they should (unit testing, parameter-recovery tests, etc.).
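
To give a flavor of the parameter-recovery idea, here’s a toy sketch (with ordinary least squares rather than my actual models): simulate data with known coefficients, refit, and check that the estimates land near the truth.


# toy parameter-recovery check
set.seed(1031)
x <- rnorm(500)
y <- 2 + 0.5 * x + rnorm(500)   # true intercept = 2, true slope = 0.5

fit <- lm(y ~ x)

# fail loudly if the recovered slope is far from the truth
stopifnot(abs(coef(fit)["x"] - 0.5) < 0.2)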

Also, it should go without saying that sharing your data and code with collaborators will help catch errors before it’s too late, and sharing them with others/the public after a project is done will help catch possible errors that you, your collaborators, and reviewers missed.

