What Makes a Sensitivity Analysis?

Cargo-Cult Statistical Rituals


Sensitivity analyses (SA) are common in trials and observational studies, but often little thought is given to what they should entail. None of this is surprising given that they are not usually taught in traditional settings, although they have been discussed extensively within the statistical and epidemiological literature.17 Indeed, the Handbook of Missing Data Methodology,8 which is the most authoritative book to date on statistical methods to handle missing data, has nearly six chapters devoted to the topic of sensitivity analyses and different approaches to use. Clearly, they (SA) are an important component of many statistical analyses, and are therefore worth carefully understanding.

This article will focus on what a sensitivity analyses is often assumed to be based on practices in the literature, some differing points of view regarding sensitivity analyses, and a practical example of conducting principled sensitivity analyses in the context of missing data in a randomized clinical trial.

The first word of the phrase is clearly an obvious tell, it suggests an analysis that examines how sensitive or robust a result is. But a natural question is, sensitive specifically to what? Can I vary anything in the analysis to test the sensitivity of the result? For example, if I obtained an odds ratio of 1.7 from the primary analysis in my experiment and I conducted an extra analysis to see whether the odds ratio would change by resampling, is that a sensitivity analysis?

What if I did a completely different type of analysis, for example, if the primary analysis was a classical statistical test with an adjustment for multiple comparisons and I decided to run a hierarchical Bayesian regression with a concentrated prior on the null, would that be a sensitivity analysis? The possibilities of what can be varied are endless when a loose definition of sensitivity analysis is assumed. This is unique to every discipline and their culture.


Frequent Misconceptions


Take for example in in clinical trials and observational studies, in particular, it is common to see primary analyses often being intent-to-treat (ITT) analyses and sensitivity analyses being per-protocol analyses (PP). The following is a similar example from a high-profile trial published in the The Lancet in 2002,9


The Multicentre Aneurysm Screening Study group randomised 67,800 men to receive an invitation to an abdominal ultrasound scan or not [6]. Of those invited to receive an abdominal scan, 20% did not accept. The primary analysis was by intention to treat, thus estimating the effect of being randomised to abdominal ultrasound. Another analysis investigated the complier average causal effect, which considers what the (average) effect of treatment was in patients who would have adhered to protocol however they were randomised [7].


To many, this may seem perfectly fine, even great. The authors used the question they were primarily interested in as the main analysis, and then conducted an additional analysis to see if these results are consistent. Unfortunately, this is highly problematic as Morris et al. (2014)10 describes


These questions are different, and observing different results should not shake our confidence in either. The CACE analysis was a secondary analysis, not a sensitivity analysis.

It is common for authors to compare the results of intention-to-treat with per-protocol analysis; see for example [8, 9]. While it is hard to pin down the precise question of per-protocol analysis [10], this is clearly different to the question intention-to-treat addresses. Per-protocol analysis should not therefore be considered as a sensitivity analysis for intention-to-treat but as a secondary analysis, if at all.


As mentioned above, there are many possible ways to conduct a sensitivity analysis, especially if the phrase is used in a loose/vague way. Whether or not these are valid and principled approaches to conducting sensitivity analyses is another question. Luckily, many of us do not have to ponder day and night about the semantics about this because both The Handbook of Missing Data Methodology and the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) have given this topic much thought, and for the latter it is reflected by the fact that they recently created an entire new addition to their E9 guidance document (Statistical Principles for Clinical Trials) which has served as a reference for clinical trial statisticians for decades. In the addendum, titled Addendum on Estimands and Sensitivity Analysis In Clinical Trials, which has now been legally adopted by regulatory agencies around the world including the FDA and EMA, they elaborate on the concept of estimands and the role sensitivity analyses play.

To be clear, estimands are not new and have been discussed in the statistical literature since Tukey11 but the ICH working group’s addendum formalized its adoption in the clinical trial world and the importance of sensitivity analyses.1214


Estimands & Sensitivity


A.5.2.1. Role of Sensitivity Analysis

Inferences based on a particular estimand should be robust to limitations in the data and deviations from the assumptions used in the statistical model for the main estimator. This robustness is evaluated through a sensitivity analysis. Sensitivity analysis should be planned for the main estimators of all estimands that will be important for regulatory decision making and labelling in the product information. This can be a topic for discussion and agreement between sponsor and regulator.

The statistical assumptions that underpin the main estimator should be documented. One or more analyses, focused on the same estimand, should then be pre-specified to investigate these assumptions with the objective of verifying whether or not the estimate derived from the main estimator is robust to departures from its assumptions. This might be characterised as the extent of departures from assumptions that change the interpretation of the results in terms of their statistical or clinical significance (e.g. tipping point analysis).

Distinct from sensitivity analysis, where investigations are conducted with the intent of exploring robustness of departures from assumptions, other analyses that are conducted in order to more fully investigate and understand the trial data can be termed “supplementary analysis” (see Glossary; A.5.3.). Where the primary estimand(s) of interest is agreed between sponsor and regulator, the main estimator is pre-specified unambiguously, and the sensitivity analysis verifies that the estimate derived is reliable for interpretation, supplementary analyses should generally be given lower priority in assessment.

A.5.2.2. Choice of Sensitivity Analysis

When planning and conducting a sensitivity analysis, altering multiple aspects of the main analysis simultaneously can make it challenging to identify which assumptions, if any, are responsible for any potential differences seen. It is therefore desirable to adopt a structured approach, specifying the changes in assumptions that underlie the alternative analyses, rather than simply comparing the results of different analyses based on different sets of assumptions. The need for analyses varying multiple assumptions simultaneously should then be considered on a case by case basis. A distinction between testable and untestable assumptions may be useful when assessing the interpretation and relevance of different analyses.

ICH E9 (R1) Guideline

The need for sensitivity analysis in respect of missing data is established and retains its importance in this framework. Missing data should be defined and considered in respect of a particular estimand (see A.4.). The distinction between data that are missing in respect of a specific estimand and data that are not directly relevant to a specific estimand gives rise to separate sets of assumptions to be examined in sensitivity analysis.


they explicitly define a sensitivity analysis as being an analysis which realistically varies the assumptions from the primary analysis, still targets the same estimand, examines the robustness of the results to assumption violations, and can possibly change the results/conclusions drawn. They contrast this with more extensive analyses that investigate these violations/departures of assumptions, and characterize them as being supplementary rather than sensitivity.

Similar views are echoed by the National Research Council’s advice on clinical trials (National Research Council 2010)15


Recommendation 15: Sensitivity analysis should be part of the primary reporting of findings from clinical trials. Examining sensitivity to the assumptions about the missing data mechanism should be a mandatory component of reporting.



Now, back to the NEJM paper. The reason why an ITT analysis and a PP analysis cannot serve as primary and sensitivity analyses, respectively, is because they are targeting entirely different estimands.10 Thus, because they are answering completely different questions, they’re just two different analyses. And the additional PP analysis, although important to certain stakeholders, can be classified as a supplementary analysis or a non-sensitivity analysis.

Indeed, the following flowchart from Morris et al. (2014)10 is particularly useful as a guide to differentiate sensitivity analyses from non-sensitivity analyses.


Adapted flowchart from Morris et al. (2014)


So what does a sensitivity analysis actually look like? Below, I walk through the analysis of a trial to give some examples.


An Example From a Trial


I’ll use a sample clinical trial dataset, which is described here:


"This dataset is typical of diastolic blood pressure data measured in small clinical trials in hypertension from the mid-to-late 1960s and for approximately a decade thereafter.

During this time, hypertension was more severe, the number of effective treatments was relatively small, and the definition (DBP > 95 mmHg) of essential hypertension was not as stringent as it is now (DBP > 80 mmHg) - as seen in the 1967 report from the Veterans Administration Cooperative Study Group on Antihypertensive Agents VA Study Group (1967).

In Table 3.1, diastolic blood pressure (DBP) was measured (mmHg) in the supine position at baseline (i.e., DBP) before randomization… Patients’ age and sex were recorded at baseline and represent potential covariates. The primary objective in this chapter in the analysis of this dataset is to test whether treatment A (new drug) may be effective in lowering DBP as compared to B (placebo) and to describe changes in DBP across the times at which it was measured."


Although we will not be using some of the scripts and project setup that I discussed in my statistical workflow post for the sake of efficiency and ease, I would urge and recommend others do so that it remains easy to catch errors. In particular, I would especially recommend using the following packages:





# Simulate Clinical Trial Data  -------------------------------------

set.seed(1031, kind = "Mersenne-Twister", 
         normal.kind = "Inversion", 
         sample.kind = "Rejection")
BaselineDBP <- rnorm(500, 117, 2)
Group <- rbinom(500, 1, 0.5)
Age <- rnorm(500, 50, 7)
R1 <- rnorm(500, 0, 1)
R2 <- rnorm(500, 1, 0.5)
errors <- rmvnorm(500, mean = c(0, 0),
                  sigma = matrix(c(1, 0.5, 0.5, 1),
                                 nrow = 2, byrow = TRUE))
PostDBP <- BaselineDBP + (Age * 0.33) + 
           (Group * 9.7) + errors[, 1]

I have made some minor adjustments to this dataset and also generated a new variable (Z) based on the existing ones in the dataset that is strongly correlated with the outcome of interest (PostDBP).


# Constructing Covariates  -------------------------------------

n     <- length(BaselineDBP)   # length of vector
rho   <- 0.7                   # desired correlation
theta <- acos(rho)             # corresponding angle
x1    <- BaselineDBP           # fixed given data
x2    <- rnorm(n, 60, 10)      # new random data
X     <- cbind(x1, x2)         # matrix
Xctr  <- scale(X, center = TRUE,
               scale = FALSE)                 
Id   <- diag(n)                               
Q    <- qr.Q(qr(Xctr[ , 1, drop = FALSE]))    # QR-decomposition
P    <- tcrossprod(Q)          
x2o  <- (Id-P) %*% Xctr[ , 2]                 
Xc2  <- cbind(Xctr[ , 1], x2o)                # bind to matrix
Y1    <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2))) 
Z <- Y1[ , 2] + (1 / tan(theta)) * Y1[ , 1]   # final new vector
BP_full <- data.frame(PostDBP, BaselineDBP,
                      Group, Age, Z, R1, R2)


# Explore Data  -------------------------------------
str(BP_full)
#> 'data.frame':    500 obs. of  7 variables:
#>  $ PostDBP    : num  148 134 130 139 133 ...
#>  $ BaselineDBP: num  119 117 117 116 118 ...
#>  $ Group      : int  1 0 0 1 0 0 0 1 1 0 ...
#>  $ Age        : num  59.3 54.3 44.3 46.1 51.7 ...
#>  $ Z          : num  0.00547 0.06653 0.07261 -0.02679 0.05205 ...
#>  $ R1         : num  -0.921 -0.359 -0.774 -0.195 0.864 ...
#>  $ R2         : num  0.203 1.385 0.999 0.637 1.165 ...
summary(BP_full)
#>     PostDBP       BaselineDBP        Group            Age              Z                   R1                 R2         
#>  Min.   :124.8   Min.   :110.7   Min.   :0.000   Min.   :30.79   Min.   :-0.188210   Min.   :-3.13501   Min.   :-0.6206  
#>  1st Qu.:133.6   1st Qu.:115.7   1st Qu.:0.000   1st Qu.:46.17   1st Qu.:-0.040797   1st Qu.:-0.68427   1st Qu.: 0.6348  
#>  Median :138.2   Median :117.0   Median :1.000   Median :50.11   Median :-0.001858   Median : 0.02296   Median : 0.9880  
#>  Mean   :138.5   Mean   :117.0   Mean   :0.506   Mean   :50.47   Mean   : 0.000000   Mean   :-0.02118   Mean   : 0.9916  
#>  3rd Qu.:143.3   3rd Qu.:118.3   3rd Qu.:1.000   3rd Qu.:55.56   3rd Qu.: 0.040447   3rd Qu.: 0.72599   3rd Qu.: 1.3382  
#>  Max.   :154.7   Max.   :122.1   Max.   :1.000   Max.   :75.82   Max.   : 0.258358   Max.   : 2.67130   Max.   : 2.4550
ztable(head(BP_full, 10))
PostDBP BaselineDBP Group Age Z R1 R2
148.18 118.72 1 59.28 0.01 -0.92 0.20
134.34 117.38 0 54.32 0.07 -0.36 1.38
129.92 116.74 0 44.30 0.07 -0.77 1.00
138.80 116.02 1 46.13 -0.03 -0.19 0.64
133.35 117.85 0 51.72 0.05 0.86 1.16
139.83 119.93 0 55.73 0.09 -0.73 1.46
133.89 117.30 0 44.52 -0.03 1.57 0.46
138.24 113.28 1 45.70 -0.09 -0.31 1.18
139.87 115.34 1 48.75 -0.03 -1.43 1.24
135.31 118.42 0 53.10 0.03 0.25 -0.24


Now, we can quickly explore the other characteristics of the dataset, and first we look at the rank correlations between the predictors and the response variable.


# Inspect Data Attributes -------------------------------------

study_formula <- as.formula(PostDBP ~ .)

plot(spearman2(study_formula, BP_full), pch = 19,
     cex = 1,  col = alpha("#d46c5b", 1.95))
abline(v = 0, col = alpha("#d46c5b", 0.75), lty = 3)


Along with a correlation matrix between the predictors of interest.



We will not be looking at the estimates from the fitted model yet. We are mainly looking at a few characteristics of this dataset before we move on to the next step.


Now that we have familiarized ourselves with this dataset, we suddenly notice that there are missing data (which I actually generated below using a custom function) and suppose we didn’t actually know how the missing data were generated, we might suspect or start with the assumption that the data are missing at random (MAR).


I do not plan on touching what the different missing data mechanisms are (MCAR, MAR, MNAR) as described by Rubin (1978). Although it is necessary to be familiar with them to better understand the approaches used throughout this post. For a simple introduction to the topic, please see


Of course, here, we do have an idea of what the missing data mechanism is, but we shall assume the role of a data analyst who has just been given a dataset with missing values in the outcome from their clinical colleague.


# Generate Missing Data  -------------------------------------

set.seed(1031)
Ry <- ifelse(lesslikely() > 0, 1, 0)
PostDBP[Ry == 0] <- NA
BP_miss <- data.frame(PostDBP, BaselineDBP,
                      Group, Age, Z, R1, R2)
str(BP_miss)
#> 'data.frame':    500 obs. of  7 variables:
#>  $ PostDBP    : num  NA 134 130 139 133 ...
#>  $ BaselineDBP: num  119 117 117 116 118 ...
#>  $ Group      : int  1 0 0 1 0 0 0 1 1 0 ...
#>  $ Age        : num  59.3 54.3 44.3 46.1 51.7 ...
#>  $ Z          : num  0.00547 0.06653 0.07261 -0.02679 0.05205 ...
#>  $ R1         : num  -0.921 -0.359 -0.774 -0.195 0.864 ...
#>  $ R2         : num  0.203 1.385 0.999 0.637 1.165 ...

sum(is.na(BP_miss))
#> [1] 258

missing_col <- c("#3f8f9b", "#d46c5b")
aggr(BP_miss, col = alpha(missing_col, 0.60),
     plot = TRUE, prop = F, numbers = T)

par(mfrow = c(2, 2))
histMiss(BP_miss[,c("Age", "PostDBP")], 
         only.miss = TRUE,  
          col = alpha(c("#d46c5b", "gray"), 0.65))

histMiss(BP_miss[,c("Z", "PostDBP")], 
         only.miss = TRUE,  
          col = alpha(c("#d46c5b", "gray"), 0.65))

histMiss(BP_miss[,c("R1", "PostDBP")], 
         only.miss = TRUE,  
         col = alpha(c("#d46c5b", "gray"), 0.65))

histMiss(BP_miss[,c("R2", "PostDBP")], 
         only.miss = TRUE,  
        col = alpha(c("#d46c5b", "gray"), 0.65))


So now, we must examine this missing dataset and its characteristics and look for any systematic differences. We also examine the proportion of missingness and the influx and outflux patterns.


# Examine Flux Patterns  -------------------------------------

round(fico(BP_miss), 1)[2:6]
#> BaselineDBP       Group         Age           Z          R1 
#>         0.5         0.5         0.5         0.5         0.5
ztable(flux_temp)
POBS Influx Outflux AINB AOUT FICO
PostDBP 0.48 0.48 0 1 0.00 0.00
BaselineDBP 1.00 0.00 1 0 0.09 0.52
Group 1.00 0.00 1 0 0.09 0.52
Age 1.00 0.00 1 0 0.09 0.52
Z 1.00 0.00 1 0 0.09 0.52
R1 1.00 0.00 1 0 0.09 0.52
R2 1.00 0.00 1 0 0.09 0.52
fluxplot(BP_miss, labels = FALSE, pch = 19,
         col = "#d46c5b", cex = .95,
         main = "Influx-Outflux Pattern")


Exploratory Analyses


We start off doing some preliminary analyses to eyeball differences between the observed data and the missing amputed data.


#> [1] 258
pobs influx outflux ainb aout fico
PostDBP 0.48 0.48 0 1 0.00 0.00
BaselineDBP 1.00 0.00 1 0 0.09 0.52
Group 1.00 0.00 1 0 0.09 0.52
Age 1.00 0.00 1 0 0.09 0.52
Z 1.00 0.00 1 0 0.09 0.52
R1 1.00 0.00 1 0 0.09 0.52
R2 1.00 0.00 1 0 0.09 0.52

We fit a model quickly to these missing data using the rlm() function from the MASS package and glance at what our estimates look like.


# Exploratory Model Estimates  -------------------------------------

eda_1 <- rlm(PostDBP ~ Group + BaselineDBP +
             Age + Z + R1 + R2, data = BP_miss)

ztable(summary(eda_1)[["coefficients"]])
Value Std. Error t value
(Intercept) -7.48 5.47 -1.37
Group 9.72 0.13 74.39
BaselineDBP 1.05 0.05 22.56
Age 0.35 0.01 38.61
Z -1.76 1.49 -1.18
R1 -0.18 0.07 -2.63
R2 -0.20 0.13 -1.59
ztable(performance(eda_1))
AIC BIC RMSE Sigma
669.45 697.36 0.93 0.95


Because the MASS::rlm() function will automatically delete missing observations and do a complete-case analysis, we can gauge what our estimates would look like if we assume that the data are missing completely at random (MCAR), which is almost always unrealistic.

It is almost always more realistic to assume that the data are missing at random (MAR), \(\operatorname{Pr}\left(R=0 \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \psi\right)=\operatorname{Pr}\left(R=0 \mid Y_{\mathrm{obs}}, \psi\right)\). This can often be a reasonable assumption to start off with for the primary analysis.16


The Analysis Workflow


For our main analysis, we will build an imputation model to impute our missing data. We include all possible information in our imputation model, meaning all the variables in our dataset, and impute the missing data using the predictive mean matching method, and then analyze each dataset and pool the results. Our workflow will look like this


Adapted flowchart from van Buuren (2018)16


We do this with a fully-adjusted model, with all relevant covariates and a reduced model with just the treatment group. This is because we want both the adjusted and unadjusted estimates. We will then compare these models using the likelihood-ratio test17 to see whether there are any substantial differences between the fully-adjusted and reduced model.


MI Power Analysis


However, before we are able to move onto this step, we must determine how many imputations we will construct. To many, this may seem like a trivial and even nonsensical task when any individual can impute a dataset with the click of a button and enter in any number of datasets they wish to impute.

The issue with this is that without giving careful thought to how many imputations may be needed for a particular study, one may run the risk of imputing too few datasets and losing information, or they may waste resources imputing a nonsensical amount that may not be necessary. Indeed, the latter seems especially trivial, however, it may often yield no advantages whatsoever, and only incur costs as many authors have argued.


The argument is that “the additional resources that would be required to create and store more than a few imputations would not be well spent” (Schafer 1997, 107), and “in most situations there is simply little advantage to producing and analyzing more than a few imputed datasets” (Schafer and Olsen 1998, 549).


Therefore, we will examine certain characteristics of our dataset with missing values and use them in an analysis to determine how many datasets we should impute. We will base this off a number of characteristics such as the fraction of missing information, the proportion of missing observations, and the losses that we are willing to incur.

We first start by running a “dry” or naive imputation of our data to quickly examine some of these characteristics. This imputed dataset will not be used for our primary, analysis, it is simply serving to guide us in our main imputation model. This is also why we have set the number of iterations to 0.


# Dry Imputation -------------------------------------

form <- list(PostDBP ~ Group + BaselineDBP + Age + Z + R1 + R2)
pred <- make.predictorMatrix(BP_miss)

mice(BP_miss, method = "pmm", m = 20, 
     maxit = 0, cores = 4, donors = 5,
     seed = 1031, formulas = form, ridge = 0, 
     predictorMatrix = pred, remove.collinear = FALSE, 
     remove.constant = FALSE, allow.na = FALSE,
     printFlag = FALSE, visitSequence = "monotone") -> init_imp

head(init_imp$loggedEvents, 10) 
#> NULL

Now that we have imputed our dataset, we may examine it for any issues before moving onto the next step of analyzing it.


str(complete(init_imp, 1))
#> 'data.frame':    500 obs. of  7 variables:
#>  $ PostDBP    : num  147 134 130 139 133 ...
#>  $ BaselineDBP: num  119 117 117 116 118 ...
#>  $ Group      : int  1 0 0 1 0 0 0 1 1 0 ...
#>  $ Age        : num  59.3 54.3 44.3 46.1 51.7 ...
#>  $ Z          : num  0.00547 0.06653 0.07261 -0.02679 0.05205 ...
#>  $ R1         : num  -0.921 -0.359 -0.774 -0.195 0.864 ...
#>  $ R2         : num  0.203 1.385 0.999 0.637 1.165 ...

densityplot(init_imp, plot.points = TRUE,
            theme = aesthetics, col = col)

bwplot(init_imp, PostDBP, col = col)


Although it is only a dry run, there seems to be few issues with it.


init_res <- pool(with(init_imp,
                      rlm(PostDBP ~ Group + 
                          BaselineDBP + Age + 
                          Z + R1 + R2)))

init_res_sum <- summary(init_res)

colnames(init_res_sum) <- c("Term", "Estimate", "SE",
                            "Statistic", "df", "P-val")

ztable(init_res_sum)
Term Estimate SE Statistic df P-val
(Intercept) 35.07 23.38 1.50 61.57 0.14
Group 6.46 0.64 10.18 43.87 0.00
BaselineDBP 0.74 0.20 3.72 61.17 0.00
Age 0.26 0.04 7.19 122.38 0.00
Z -4.12 6.40 -0.64 60.56 0.52
R1 -0.01 0.26 -0.05 94.39 0.96
R2 0.08 0.54 0.15 72.50 0.88

Now that we have our vector, we can quickly examine certain characteristics such as the fraction of missing information (FIMO), the between-imputation variance and the within-imputation variance. The reason that we focus on these particular numbers is because they are crucial for several tasks in any statistical analysis plan, such as having enough power/precision, proper long-coverage, etc.

We use the FIMO from the dry imputation and a number of other criteria to determine how many imputations we need:


  • Some have suggested taking the FIMO and multiplying by 100 to obtain the number of imputed datasets needed

\[m \approx 100 \lambda\]


  • we wish to ensure that the monte-carlo standard error from the imputations are less than 10% of the between-imputation standard error

\[mcse < B^{\hat{-}}_{SE} * 0.10\]


  • we wish to minimize the monte-carlo error that is derived from dividing the chosen number of datasets to impute from the fraction of missing information so that the monte-carlo standard error is less than 0.01

\[FMI / m ≈ 0.01\]


  • we also wish to minimize the total variance so that the square-root of it is no greater than the ideal variance and the corresponding confidence interval width.

\[T_{m}=\left(1+\frac{\gamma_{0}}{m}\right) T_{\infty}\]


  • Furthermore, it is commonly advocated by missing data experts to impute the same or similar number of datasets as fractions of missing observations \(\gamma_{0} * 100\)

We start by extracting the FMI’s from our dry imputation and calculating all these numbers.


grid(init_res$pooled$fmi)

fmi <- max(summary(init_res, type = c("all"), # Minimum Imputations
                   conf.int = TRUE, conf.level = 0.95)["fmi"])  

m <- fmi / 0.01  # FMI Imputations

mcse <- sqrt(max(init_res$pooled$b) / m) # MCSE
se <- max((init_res$pooled$ubar) / sqrt(500)) # SE

What we find is that the fraction of missing information is nearly 50% which is very high and that in order to get a monte-carlo error of less than 0.01, we would need a minimum of at least 50 imputed datasets. We also calculate both the monte-carlo standard error and the standard error from the dry run and find that they are practically equivalent.

However, we must also explore other possibilities in terms of number of imputed datasets to reduce the monte-carlo errors so we run a quick function to do, checking the effects of a number of imputed datasets on the width of the confidence interval, which is directly tied to the monte-carlo error.

Indeed, Von Hippel (2018) proposed a relationship between the fraction of missing information and the number of imputations needed to achieve a targeted CI length or monte-carlo error using a quadratic formula,


\[m=1+\frac{1}{2}\left(\frac{\gamma_{0}}{\operatorname{SD}\left(\sqrt{U_{\ell}}\right) \mathrm{E}\left(\sqrt{U_{\ell}}\right)}\right)^{2}\]


where \(\mathrm{E}\left(\sqrt{U_{\ell}}\right)\) and \(\mathrm{SD}\left(\sqrt{U_{\ell}}\right)\) are the coefficient of variation (CV), summarizing the imputation variation in the SE estimates. We can construct a function to depict this relationship and calculate how many imputations we will need based on some desired error and further display the distribution of this relationship by varying these parameters.

In the function below, adopted from von Hippel, and van Buuren,16 I vary both the coefficient of variation and the \(\alpha\)-level.



With this function, we can now graphically examine the effects of different numbers of imputations on monte-carlo errors



We can see above that as we aim for smaller monte-carlo errors, we many more imputations to achieve our desired target.

Now that I have constructed and graphed three differing relationships between monte carlo errors and number of imputations needed to achieve those errors, and varied the \(\alpha\)-levels, we can now move onto choosing a specific number of imputations to construct.

If I choose an \(\alpha\) level of 5%, which the maximum tolerable type-I error rate, and a coefficient of variation of 2.5%


# Power || Information Analysis for Number of Imputations

powerimp(model = init_res, cv = 0.05, alpha = 0.01)
#> [1] 121

I need to construct approximately 50 imputations to achieve this coefficient of variation, which is what I will use for all the analyses from here on.


The Primary Analysis


For the primary analysis, I will use the predictive-mean matching method and specify 5 donors (based on a number of simulation studies that have examined the effect of various donors specified)18, and fit a robust regression that includes all the variables within the dataset. Our imputation model will be


\[Y^{\operatorname{Post}} = \beta_{0} + \beta_{1}^{\operatorname{Group}} + \beta_{2}^{\operatorname{Base}} + \beta_{3}^{\operatorname{Age}} + \beta_{4}^{\operatorname{Z}} + \beta_{5}^{\operatorname{R1}} + \beta_{6}^{\operatorname{R2}} + \epsilon\]


# Impute Missing Data  -------------------------------------

form <- list(PostDBP ~ Group + BaselineDBP + Age + Z + R1 + R2)
pred <- make.predictorMatrix(BP_miss)

mice(BP_miss, method = "pmm", 
     predictorMatrix = pred,
     m = 20, maxit = 10, seed = 1031, 
     cores = 8, formulas = form, ridge = 0, 
     donors = 5, remove.collinear = FALSE,
     remove.constant = FALSE, allow.na = FALSE, 
     printFlag = FALSE, visitSequence = "monotone") -> imp1

head(imp1$loggedEvents, 10) 
#> NULL

Now that we have imputed our dataset, we may examine it for any issues before moving onto the next step of analyzing it.



Our imputed datasets seem to look fine so far. We now fit our models to each of these datasets and combine them using Rubin’s rules.


# Primary Analysis  -------------------------------------

analysis1 <- with(imp1, rlm(PostDBP ~ Group +
                          BaselineDBP + Age + Z + R1 + R2))
analysis2 <- with(imp1, rlm(PostDBP ~ Group))

anova(analysis1, analysis2, method = "D3", use = "LR")
#>    test statistic df1      df2 dfcom p.value      riv
#>  1 ~~ 2         0   5 95.65188   493       1 273.8017

result1 <- pool(analysis1)

results1_sum <- summary(result1)

colnames(results1_sum) <- c("Term", "Estimate", "SE",
                            "Statistic", "df", "P-val")

ztable(results1_sum)
Term Estimate SE Statistic df P-val
(Intercept) -7.46 5.44 -1.37 52.70 0.18
Group 9.68 0.12 78.33 68.20 0.00
BaselineDBP 1.05 0.05 22.46 50.57 0.00
Age 0.35 0.01 39.44 67.78 0.00
Z -1.94 1.72 -1.13 35.92 0.27
R1 -0.17 0.07 -2.58 50.12 0.01
R2 -0.16 0.15 -1.09 38.13 0.28

While we have our primary analysis estimates, we continue to examine our imputations.


ps <- rep(rowMeans(sapply(analysis1$analyses, fitted.values)),
          imp1$m + 1)
xyplot(imp1, PostDBP ~ ps | as.factor(.imp),
       xlab = "Probability that record is incomplete",
       ylab = "PostDBP", pch = c(1, 19), col = col)


We’ve done our primary analysis, in which we imputed the missing data under the assumption of MAR. Unfortunately, we cannot verify whether or not this assumption is true, but we can vary this missing data assumption and assume that the data are missing not at random (MNAR) \(\operatorname{Pr}\left(R=0 \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \psi\right)\), and that in the individuals with missing data, we can assume different numbers than those with complete data. Our likelihood-ratio test also suggests that there’s very little difference between the full model and the reduced model, so we generally will go with the full model.



SA I: Controlled MI


Now suppose we wish to handle these missing under the assumption of MNAR, we would now be conducting a sensitivity analysis because we are still targeting the same estimand, but only varying the assumptions. The question / target remain the same. To handle missing data that we assume are MNAR, there are a number of different and complex approaches.

We start off with the \(\delta\)-adjustment, controlled multiple imputation method,19 in which we assume that the group with missing observations differs systematically from the group with complete observations by a certain quantity. We produce a range of these quantities and add or subtract them from the imputed values and then conduct our analysis on this adjusted, imputed dataset. Here I will make the assumption that the average PostDBP estimate from the missing value group differs from at least 0 mmHg to at most 30 mmHg from the group with no missing observations.


# Controlled Multiple Imputation  -------------------------------------

ini <- mice(BP_miss, method = "pmm", maxit = 0,
            predictorMatrix = pred)

ini$method["BaselineDBP"] <- ""
ini$method["PostDBP"] <- "pmm"
ini$method["Age"] <- ""
ini$method["R1"] <- ""
ini$method["Group"] <- ""
ini$method["R2"] <- ""
ini$method["Z"] <- ""
meth <- ini$method

delta <- c(-5:5)

imp.all <- vector("list", length(delta))
post <- ini$post
for (i in 1:length(delta)) {
  d <- delta[i]
  cmd <- paste("imp[[j]][, i] <- imp[[j]][, i] +", d)
  post["PostDBP"] <- cmd
  imp <- mice(BP_miss, method = meth,
              predictorMatrix = pred,
              m = 5, maxit = 5, donors = 5,
              cores = 4, ridge = 0, formulas = form,
              remove.collinear = FALSE,
              remove.constant = FALSE,
              allow.na = FALSE, post = post,
              seed = i * 1031, printFlag = FALSE)
  imp.all[[i]] <- imp
}

Now that we have our imputed datasets that are modified by \(\delta\)-adjustments, we can fit our models to each imputed dataset and combine the estimates using Rubin’s rules.


# Sensitivity Analysis 1  -------------------------------------

output <- sapply(imp.all,
                 function(x) pool(with(x, rlm(PostDBP ~ Group +
                                              BaselineDBP + Age + Z + R1 + R2))),
                 simplify = "array", USE.NAMES = TRUE)
a <- (as.data.frame(t(output)))$pooled
r <- array(rep(NA, length(delta) * 7),
           dim = c(length(delta), 7), dimnames = list(c(delta),
                                               c("Intercept", "Group",
                                                 "BaselineDBP", "Age",
                                                 "Z", "R1", "R2")))
for (i in 1:length(delta)) {
   r[i, ] <- cbind(a[[i]][["estimate"]])
}
r <- as.data.frame(r)
r <- data.frame(Delta = row.names(r), r, row.names = NULL)
ztable(r)
Delta Intercept Group BaselineDBP Age Z R1 R2
-5 -0.43 9.68 0.99 0.32 -0.89 -0.12 -0.11
-4 -3.38 9.63 1.01 0.32 -1.69 -0.14 -0.10
-3 -0.84 9.71 0.99 0.33 -1.14 -0.15 -0.21
-2 -3.53 9.67 1.02 0.34 -1.08 -0.15 -0.17
-1 -4.46 9.64 1.03 0.35 -1.40 -0.20 -0.22
0 -11.04 9.68 1.08 0.35 -3.46 -0.17 -0.13
1 -6.74 9.68 1.05 0.36 -1.69 -0.20 -0.17
2 -11.15 9.64 1.09 0.36 -2.61 -0.16 -0.18
3 -13.48 9.59 1.11 0.37 -3.04 -0.18 -0.19
4 -14.59 9.74 1.12 0.37 -3.63 -0.16 -0.17
5 -14.85 9.64 1.13 0.38 -3.63 -0.23 -0.19

The table above gives the various \(\delta\)’s (the first column with values ranging from -30 - 30 in increments of 5) that were added to the multiply imputed datasets, and the corresponding estimates from the models fit to those datasets. So now we’ve conducted our first sensitivity analysis using the \(\delta\)-adjustment, controlled multiple imputation method. This form of sensitivity analysis is also typically what is used and preferred by regulators.19.


Regulators prefer simple methods that impute the missing outcomes under MAR, and then add an adjustment δ to the imputes, while varying δ over a plausible range and independently for each treatment group (Permutt 2016). The most interesting scenarios will be those where the difference between the δ’s correspond to the size of the treatment effect in the completers. Contours of the p-values may be plotted on a graph as a function of the δ’s to assist in a tipping-point analysis (Liublinska and Rubin 2014).


And as expected, lower \(\delta\) values gave us results that were consistent with the primary analysis, while the upper end of the extreme \(\delta\) values gave us estimates that were substantially off the mark.


Next, we quickly examine the state of a random sample of these imputations to make sure they’re reliable.


# Examine Imputations  -------------------------------------

l <- sample(1:length(delta))[1]
densityplot(imp.all[[l]], plot.points = TRUE,
            theme = aesthetics, col = col)

bwplot(imp.all[[l]], PostDBP,  col = col)


Now we move onto our next and final sensitivity analysis.



SA II: Selection Models


To further explore the MNAR assumption, we could also use a Heckman selection model \(P(Y,R)=P(Y)P(R|Y)\), as described by van Buuren here:16


The selection model multiplies the marginal distribution \(P(Y)\) in the population with the response weights \(P(R|Y)\). Both \(P(Y)\) and \(P(R|Y)\) are unknown, and must be specified by the user. The model where \(P(Y)\) is normal and where \(P(R|Y)\) is a probit model is known as the Heckman model. This model is widely used in economics to correct for selection bias.


We can implement this in R using the miceMNAR package, which is an extension to the mice R package.


# Selection Models  -------------------------------------

JointModelEq <- generate_JointModelEq(data = BP_miss,
                                      varMNAR = "PostDBP")
JointModelEq[,"PostDBP_var_sel"] <- c(0, 1, 1, 1, 1, 1, 1)
JointModelEq[,"PostDBP_var_out"] <- c(0, 1, 1, 1, 0, 1, 1)

arg <- MNARargument(data = BP_miss, varMNAR = "PostDBP",
                    JointModelEq = JointModelEq)

imp2 <- mice(data = arg$data_mod,
             method = arg$method, seed = 1031,
             predictorMatrix = arg$predictorMatrix,
             JointModelEq = arg$JointModelEq,
             control = arg$control, formulas = form,
             maxit = 5, m = 5, cores = 4,
             ridge = 0, remove.collinear = FALSE,
             remove.constant = FALSE, allow.na = FALSE,
             printFlag = FALSE, visitSequence =  "monotone")

head(imp2$loggedEvents, 10)
#> NULL

imp2$data$Group <- as.factor(imp2$data$Group)

densityplot(imp2, col = col,
            plot.points = TRUE,
            theme = aesthetics)

plot(imp2, layout = c(1, 2),  col = col, theme = aesthetics)

bwplot(imp2, PostDBP, col = col)


Now we fit our models to these new datasets and pool them using Rubin’s rules.


# Sensitivity Analysis 2  -------------------------------------

analysis3 <- with(imp2, rlm(PostDBP ~ Group +
                            BaselineDBP + Age + Z + R1 + R2))
result3 <- pool(analysis3)

results3_sum <- summary(result3)
colnames(results3_sum) <- c("Term", "Estimate", "SE",
                            "Statistic", "df", "P-val")

ztable(results3_sum)
Term Estimate SE Statistic df P-val
(Intercept) -2.24 5.68 -0.39 24.46 0.70
Group1 9.74 0.19 51.73 8.50 0.00
BaselineDBP 1.01 0.05 20.62 22.76 0.00
Age 0.34 0.01 42.60 383.20 0.00
Z -0.68 1.38 -0.50 65.96 0.62
R1 -0.14 0.08 -1.79 12.54 0.10
R2 -0.21 0.14 -1.53 25.59 0.14

Now we examine our imputed values.


ps2 <- rep(rowMeans(sapply(analysis3$analyses, fitted.values)),
          imp2$m + 1)
xyplot(imp2, PostDBP ~ ps2 | as.factor(.imp),
       xlab = "Probability that record is incomplete",
       ylab = "PostDBP", pch = c(1, 19), col = col)


so now we have conducted our second sensitivity analysis.



The results from both the primary analysis, the first sensitivity analysis, and the second seem somewhat consistent. However, with the first sensitivity analysis where we applied \(\delta\)-adjustments, as the adjustments became larger, the coefficients changed drastically. However, such large \(\delta\)-adjustments are not realistic and those that were smaller led to coefficients that were closer to the other two analyses.


Supplementary Analyses


The following analyses are what I mostly consider to be supplementary analyses that may further investigate some violations of assumptions but often go far beyond what would be necessary for a principled sensitivity analysis that would typically accompany a primary analysis.


First, I like to check that the results from my analyses are consistent across different statistical packages. So I may check to see that the results from my primary analysis from R are also similar to the results from another statistical package like Stata.

So in Stata, I would likely run something similar to the following in order to mimic the primary analysis in R.

import delimited "https://lesslikely.com/datasets/temp.csv", numericcols(2 3 5 6 7 8)
#> . import delimited "https://lesslikely.com/datasets/temp.csv", numericcols(2> 5 6 7 8)
#> (encoding automatically selected: ISO-8859-2)
#> (8 vars, 500 obs)
#> 
#> . 
#> . mi set mlong
#> 
#> . 
#> . mi register imputed baselinedbp postdbp age z group r1 r2
#> (291 m=0 obs now marked as incomplete)
#> 
#> . 
#> . mi impute chained (pmm, knn(5)) postdbp = baselinedbp age z r1 r2 group, burn
#> > in(10) add(40) rseed (1031) savetrace(trace1, replace) bootstrap
#> note: missing-value pattern is monotone; no iteration performed.
#> 
#> Conditional models (monotone):
#>            postdbp: pmm postdbp baselinedbp age z r1 r2 group , bootstrap
#>                      knn(5)
#> 
#> Performing chained iterations ...
#> 
#> Multivariate imputation                     Imputations =       40
#> Chained equations                                 added =       40
#> Imputed: m=1 through m=40                       updated =        0
#> 
#> Initialization: monotone                     Iterations =        0
#>                                                 burn-in =        0
#> 
#>            postdbp: predictive mean matching
#> 
#> ------------------------------------------------------------------
#>                    |               Observations per m             
#>                    |----------------------------------------------
#>           Variable |   Complete   Incomplete   Imputed |     Total
#> -------------------+-----------------------------------+----------
#>            postdbp |        209          291       291 |       500
#> ------------------------------------------------------------------
#> (Complete + Incomplete = Total; Imputed is the minimum across m
#>  of the number of filled-in observations.)
#> 
#> Note: Posterior parameters obtained using sampling with replacement.
#> 
#> . 
#> . mi estimate, mcerror: rreg postdbp i.group baselinedbp age z r1 r2
#> 
#> Multiple-imputation estimates                   Imputations       =         40
#> Robust regression                               Number of obs     =        500
#>                                                 Average RVI       =     1.4641
#>                                                 Largest FMI       =     0.7196
#>                                                 Complete DF       =        493
#> DF adjustment:   Small sample                   DF:     min       =      50.16
#>                                                         avg       =      93.58
#>                                                         max       =     126.75
#> Model F test:       Equal FMI                   F(   6,  285.9)   =    1235.31
#>                                                 Prob > F          =     0.0000
#> 
#> ------------------------------------------------------------------------------
#>      postdbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
#> -------------+----------------------------------------------------------------
#>      1.group |   10.41553   .1343827    77.51   0.000     10.14839    10.68266
#>              |   .0153068   .0088366     5.25   0.000     .0251455    .0224951
#>              |
#>  baselinedbp |   .9541257   .0431295    22.12   0.000     .8687595    1.039492
#>              |   .0043726   .0017532     0.89   0.000      .005078    .0062013
#>              |
#>          age |   .3238384   .0099478    32.55   0.000     .3040228     .343654
#>              |   .0011781   .0008449     2.90   0.000     .0021126    .0021126
#>              |
#>            z |  -2.393966   1.341859    -1.78   0.077    -5.049314    .2613814
#>              |   .1349552   .0550873     0.12   0.019     .1878212      .16354
#>              |
#>           r1 |  -.3490421   .0852886    -4.09   0.000    -.5203355   -.1777487
#>              |   .0112217   .0081904     0.49   0.000     .0159233    .0242459
#>              |
#>           r2 |  -.1611967   .1469517    -1.10   0.276    -.4541912    .1317979
#>              |   .0176771    .008492     0.14   0.060     .0242385    .0257259
#>              |
#>        _cons |   6.453088    5.10019     1.27   0.208    -3.643662    16.54984
#>              |   .5205047   .2107395     0.12   0.043     .7220854    .6295723
#> ------------------------------------------------------------------------------
#> Note: Values displayed beneath estimates are Monte Carlo error estimates.
#> 
#> . 
#> . quietly mi passive: generate byte imputed = _mi_miss
#> 
#> . 
#> . replace imputed = 1 if imputed==.
#> (0 real changes made)
#> 
#> . 
#> . generate float postdbp_Delta_A1=postdbp
#> (291 missing values generated)
#> 
#> . 
#> . replace postdbp_Delta_A1=postdbp + 5 if imputed==1
#> (11,640 real changes made)
#> 
#> . 
#> . mi estimate, mcerror: rreg postdbp_Delta_A1 i.group baselinedbp age z r1 r2
#> 
#> Multiple-imputation estimates                   Imputations       =         40
#> Robust regression                               Number of obs     =        500
#>                                                 Average RVI       =     0.4338
#>                                                 Largest FMI       =     0.4403
#>                                                 Complete DF       =        493
#> DF adjustment:   Small sample                   DF:     min       =     118.94
#>                                                         avg       =     235.19
#>                                                         max       =     315.98
#> Model F test:       Equal FMI                   F(   6,  418.9)   =     918.73
#>                                                 Prob > F          =     0.0000
#> 
#> ------------------------------------------------------------------------------
#> postdbp_De~1 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
#> -------------+----------------------------------------------------------------
#>      1.group |   13.62051   .1967079    69.24   0.000     13.23315    14.00788
#>              |   .0145068   .0055858     2.00   0.000     .0195646     .017133
#>              |
#>  baselinedbp |   1.001326   .0687089    14.57   0.000     .8660981    1.136554
#>              |    .004666   .0012464     0.24   0.000     .0041898     .006217
#>              |
#>          age |   .3137277   .0148054    21.19   0.000     .2845127    .3429427
#>              |   .0013031   .0006316     0.91   0.000      .001624    .0020197
#>              |
#>            z |  -2.339549   2.121875    -1.10   0.271    -6.514338     1.83524
#>              |   .1355884   .0353007     0.06   0.026     .1713074    .1324027
#>              |
#>           r1 |  -1.441007   .1151125   -12.52   0.000    -1.668943   -1.213072
#>              |   .0118336   .0065291     0.77   0.000     .0145274    .0207451
#>              |
#>           r2 |  -.6562613   .2091866    -3.14   0.002    -1.068731   -.2437918
#>              |   .0174729   .0054034     0.12   0.001     .0202383     .021055
#>              |
#>        _cons |   3.054198    8.13298     0.38   0.708    -12.95496    19.06336
#>              |     .56593   .1605893     0.07   0.054     .7663276    .5137369
#> ------------------------------------------------------------------------------
#> Note: Values displayed beneath estimates are Monte Carlo error estimates.

Trace plot of imputed datasets


As you can see from above, I imported the missing data I generated in R into Stata, and then used it to multiply impute the datasets using the predictive mean matching method with 5 donors. I chose the exact same number of variables to include in the imputation model as I did with the primary and sensitivity analyses. Just like the primary analysis, I also used robust regression, rreg in Stata, to analyze the datasets. The numbers seem to be fairly consistent with those from the primary analysis and the some of the results from the sensitivity analyses.

I will now conduct another supplementary analysis, in which I fit a Bayesian regression model using a t-distribution so that it is analogous to the other robust regression models I have utilized so far, and I will specify a weakly informative prior.


# Bayesian Regression  -------------------------------------

brm_form <- bf(PostDBP ~ Group + BaselineDBP + 
               Age + Z + R1 + R2)

preds <-  c("Intercept", "Group", "BaselineDBP", 
            "Age", "Z", "R1", "R2")

prior1 <- c(prior(normal(0, 100), class = Intercept),
            prior(normal(0, 100), class = b),
            prior(normal(0, 100), class = sigma))

brm_multiple(brm_form, data = imp1, prior = prior1, 
             silent = 2, family = student(), iter = 40, refresh = 0,
             warmup = 10, chains = 1, inits = "0",
             thin = 2, open_progress = FALSE,
             combine = TRUE, sample_prior = "no", 
             seed = 1031, algorithm = "sampling", backend = "rstan",
             control = list(max_treedepth = 10, adapt_delta = 0.8),
             file = NULL, file_refit = "never") -> b_1

b_1[["fit"]]@sim[["fnames_oi"]] <- preds 
b_1
#>  Family: student 
#>   Links: mu = identity; sigma = identity; nu = identity 
#> Formula: PostDBP ~ Group + BaselineDBP + Age + Z + R1 + R2 
#>    Data: imp1 (Number of observations: 500) 
#> Samples: 20 chains, each with iter = 40; warmup = 10; thin = 2;
#>          total post-warmup samples = 300
#> 
#> Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
ztable(posterior_summary(b_1))
Estimate Est.Error Q2.5 Q97.5
Intercept -100.40 172.83 -500.80 196.50
Group 3.28 3.34 -4.45 8.19
BaselineDBP 1.31 1.19 -1.23 4.23
Age 0.71 0.98 -0.12 4.00
Z -5.07 3.04 -12.05 -0.08
R1 0.44 2.43 -6.78 4.28
R2 3.95 3.46 -3.38 8.68

We examine our model diagnostics to look for any anomalies.


#> Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not found in Windows font database

#> Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not found in Windows font database
#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font family not found in Windows font database
#> Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not found in Windows font database
#> Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, : font family not found in Windows font database
#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font family not found in Windows font database

#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font family not found in Windows font database

#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font family not found in Windows font database


Nothing seems out of the ordinary.




Indeed, it is no surprise that the Bayesian model with a weakly dispersed prior would give similar results to the base rlm() function, while specifying a regularizing prior like the horseshoe prior with three degrees of freedom, would practically shrink all the coefficients towards zero, including our covariate Z. We now look at robust methods and resampling and the results they give.


Although this is currently under the supplementary analysis section, I would very much consider these next two analyses to be sensitivity analyses. The first method utilized here is known as the random indicator method, and it is an experimental method that essentially combines a selection model and a pattern-mixture model and is designed to handle data that are missing not at random. As van Buuren writes16


The random indicator method (Jolani 2012)20 is an experimental iterative method that redraws the missing data indicator under a selection model, and imputes the missing data under a pattern-mixture model, with the objective of estimating δ from the data under relaxed assumptions. Initial simulation results look promising. The algorithm is available as the ri method in mice.


The script below utilizes this method (random indicator), but bootstraps the missing dataset before the imputation procedure, and then after multiply imputing the datasets, obtains estimates using maximum likelihood multiple (MLMI) imputation, an alternative and more efficient method to the more popular posterior draw multiple imputation (PDMI). More about MLMI can be found here.21


# Bootstrap Inference -------------------------------------

bootMice(BP_miss, method = "ri", maxit = 5, 
         predictorMatrix = pred, nBoot = 10, nImp = 2,
         seed = 1031, formulas = form,
         ridge = 0, remove.collinear = FALSE,
         remove.constant = FALSE, allow.na = TRUE,
         printFlag = FALSE, visitSequence = "monotone") -> imps

analyseImp <- function(inputData) {
  return(coef(rlm(PostDBP ~ Group + BaselineDBP + Age + Z + R1 + R2,
                    data = inputData)))
}

ests <- bootImputeAnalyse(imps, analyseImp, quiet = TRUE)

ests <- as.data.frame(ests)
colnames(ests) <- c("Estimate", "Var",
                            "CI.ll", "CI.ul", "df")
rownames(ests) <- c("Intercept", "Group", "BaselineDBP",
                    "Age", "Z", "R1", "R2")

ztable(ests)
Estimate Var CI.ll CI.ul df
Intercept -6.97 18.40 -18.24 4.31 4.66
Group 9.71 0.01 9.45 9.97 3.00
BaselineDBP 1.05 0.00 0.95 1.15 4.55
Age 0.35 0.00 0.32 0.39 6.31
Z -1.88 3.07 -6.12 2.35 6.33
R1 -0.15 0.01 -0.35 0.06 7.90
R2 -0.08 0.02 -0.45 0.29 5.79

For our second supplementary approach, We will utilize the GAMLSS method, which is an extension of the generalized linear model and the generalized additive model. For the vast majority of the imputations we have conducted, we have drawn values from an assumed normal distribution. Thus, GAMLSS allows us to be far more flexible with our assumptions. However, it is also worth noting that general normal imputations are robust against violations of the assumption of normality


In general, normal imputations appear to be robust against violations of normality. Demirtas, Freels, and Yucel (2008) found that flatness of the density, heavy tails, non-zero peakedness, skewness and multimodality do not appear to hamper the good performance of multiple imputation for the mean structure in samples n>400, even for high percentages (75%) of missing data in one variable. The variance parameter is more critical though, and could be off-target in smaller samples.


Regardless, we will be using the GAMLSS method and also imputing from a t-distribution, which is much more robust to outliers.


data(db)
data <- subset(db, age > 1 & age < 2, c("age", "head"))
names(data) <- c("age", "hc")
synthetic <- rep(c(FALSE, TRUE), each = nrow(data))
data2 <- rbind(data, data)
row.names(data2) <- 1:nrow(data2)
data2[synthetic, "hc"] <- NA
imp <- mice(data2, m = 10, meth = "gamlssTF", seed = 88009,
            print = FALSE)
syn <- subset(mice::complete(imp), synthetic)

Indeed, I shall try one more supplementary analysis which piqued my curiosity. I recently came across a few papers that concluded that random forests outperformed other methods in imputing real world data. They primarily based some of these conclusions on the root mean squared error. However, this is problematic for a number of reasons; imputation is not the same as prediction, and so the same loss functions are not as equally useful; most of the settings used in the program were simple defaults.

Indeed, the following simulation shows why relying so heavily on RMSE is misleading.


create.data <- function(beta = 1, sigma2 = 1, n = 50,
                        run = 1) {
  set.seed(seed = run)
  x <- rnorm(n)
  y <- beta * x + rnorm(n, sd = sqrt(sigma2))
  cbind(x = x, y = y)
}
make.missing <- function(data, p = 0.5){
  rx <- rbinom(nrow(data), 1, p)
  data[rx == 0, "x"] <- NA
  data
}
test.impute <- function(data, m = 5, method = method, ...) {
  imp <- mice(data, method = method, m = m, print = FALSE)
  fit <- with(imp, lm(y ~ x))
  tab <- summary(pool(fit), "all", conf.int = TRUE)
  tab[2, c(3, 8, 9)]
}
simulate <- function(runs = 10) {
  res <- array(NA, dim = c(2, runs, 3))
  dimnames(res) <- list(c("norm.predict", "norm.nob"),
                        as.character(1:runs),
                        c("x","estimate", "2.5 %","97.5 %"))
  for(run in 1:runs) {
    data <- create.data(run = run)
    data <- make.missing(data)
    res[1, run, ] <- test.impute(data, method = "norm.predict",
                                 m = 2)
    res[2, run, ] <- test.impute(data, method = "norm.nob")
  }
  res
}
rmse <- function(truedata, imp, v = "x") {
  mx <- is.na(mice::complete(imp, 0))[, v]
  mse <- rep(NA, imp$m)
  for (k in seq_len(imp$m)) {
    filled <- mice::complete(imp, k)[mx, v]
    true <- truedata[mx, v]
    mse[k] <- mean((filled - true)^2)
  }
  sqrt(mean(mse))
}
simulate2 <- function(runs = 10) {
  res <- array(NA, dim = c(2, runs, 4))
  dimnames(res) <- list(c("rf", "midastouch"),
                        as.character(1:runs), 
                        c("RMSE", "estimate", "2.5 %","97.5 %"))
  for(run in 1:runs) {
    truedata <- create.data(run = run)
    data <- make.missing(truedata)
    imp <- mice(data, method = "rf", m = 2,
                print = FALSE)
    res[1, run, 1] <- rmse(truedata, imp)
    imp <- mice(data, method = "midastouch", print = FALSE)
    res[2, run, 1] <- rmse(truedata, imp)
    res[1, run, 2:4] <- as.numeric(test.impute(data, method = "rf", m = 5))
    res[2, run, 2:4] <- as.numeric(test.impute(data, method = "midastouch",  m = 5))
  }
  res
}

Indeed, both the random forest method and predictive mean matching method yield similar RMSEs. A more informative approach to evaluating imputation methods is to look at other measures including


  • Raw Bias: where \(RB\) is bias and \(Q\) is the difference between the estimate and the truth \(RB=E(Q)-Q\), which may also be expressed as a percent \(100*\frac{RB=E(Q)-Q}{Q}\)

  • Coverage Rate: which is simply the proportion of times the intervals actually contain the true parameter value

  • Average Width: the width of the interval estimates, with longer widths indicating less information

res1 <- simulate(100)
#> Error in dimnames(res) <- list(c("norm.predict", "norm.nob"), as.character(1:runs), : length of 'dimnames' [3] not equal to array extent
ztable(apply(res1, c(1, 3), mean, na.rm = TRUE))
#> Error in apply(res1, c(1, 3), mean, na.rm = TRUE): object 'res1' not found
res2 <- simulate2(100)
ztable(apply(res2, c(1, 3), FUN = mean, na.rm = TRUE))
RMSE estimate 2.5 % 97.5 %
rf 1.00 1.01 0.63 1.38
midastouch 1.01 1.03 0.62 1.44

The authors specifically used the missForest package. I tried using the package, but encountered too many difficulties so I simply specified the rf method in mice, and worked from there.


form <- list(PostDBP ~ Group + BaselineDBP + Age + Z + R1 + R2)
rf <- mice(BP_miss, method = "rf", ntree = 20, 
           predictorMatrix = pred, 
           m = 20, maxit = 5, seed = 1031,
           cores = 4, formulas = form, ridge = 0,
           remove.collinear = FALSE, remove.constant = FALSE,
           allow.na = FALSE, printFlag = FALSE,
           visitSequence = "monotone") 

rf1 <- with(rf, rlm(PostDBP ~ Group + 
                    BaselineDBP + Age + 
                    Z + R1 + R2))
ztable(summary(pool(rf1)))
term estimate std.error statistic df p.value
(Intercept) 14.54 8.80 1.65 55.42 0.10
Group 9.60 0.19 50.19 89.28 0.00
BaselineDBP 0.88 0.08 11.60 53.46 0.00
Age 0.32 0.01 21.77 67.46 0.00
Z 0.14 2.06 0.07 109.41 0.95
R1 -0.08 0.10 -0.80 78.37 0.43
R2 -0.09 0.18 -0.53 135.18 0.60

Full Analysis Set


After we run all those analyses, we can finally check our numbers from the full dataset with no missing observations to see how our approaches to handle the missing data have performed. We first look at the fully adjusted model fit with rlm().


# Full Analysis Set  -------------------------------------

final_full_mod <- rlm(PostDBP ~ Group + BaselineDBP +
                      Age + Z + R1 + R2, data = BP_full)

ztable(summary(final_full_mod)[["coefficients"]])
Value Std. Error t value
(Intercept) -1.96 3.86 -0.51
Group 9.73 0.09 103.37
BaselineDBP 1.01 0.03 30.93
Age 0.34 0.01 49.15
Z -1.85 1.05 -1.76
R1 -0.10 0.05 -2.16
R2 -0.09 0.09 -1.02
ztable(check_collinearity(final_full_mod))
Term VIF SE_factor
Group 1.01 1.01
BaselineDBP 1.97 1.40
Age 1.00 1.00
Z 1.98 1.41
R1 1.00 1.00
R2 1.01 1.00

We can also graphically inspect our assumptions.



Then we look at the unadjusted model, where the difference is the group that the participants were randomized to.


# Unadjusted Model  -------------------------------------

final_reduced_mod <- rlm(PostDBP ~ Group, data = BP_full)
ztable(summary(final_reduced_mod)[["coefficients"]])
Value Std. Error t value
(Intercept) 133.56 0.20 653.25
Group 9.76 0.29 33.95

We can also compare the two to see if there are any substantial differences.


# Compare Models -------------------------------------

ztable(compare_performance(final_full_mod, final_reduced_mod))
Name Model AIC BIC RMSE Sigma
final_full_mod rlm 1424.59 1458.31 0.99 1.00
final_reduced_mod rlm 2575.88 2588.53 3.16 3.17

While the results have mostly been consistent, I also wanted to run a few additional analyses of my own to see whether these effect estimates were consistent I first started by constructing the confidence distribution using the concurve package to see the range of parameter values consistent with the data given the background model assumptions.22, 23


# Compatibility of Parameter Values -------------------------------------

fit1 <- lm(PostDBP ~ Group + BaselineDBP +
           Age + Z + R1 + R2, data = BP_full)

curve1 <- curve_gen(fit1, var = "Group", method = "rlm", 
                    log = FALSE, penalty = NULL,
                    m = NULL, steps = 1000, 
                    table = TRUE)
#> Error in mclapply(X, FUN, ..., mc.cores = mc.cores, mc.preschedule = mc.preschedule, : 'mc.cores' > 1 is not supported on Windows

ggcurve(curve1[[1]], type = "c", levels = c(0.5, 0.75, 0.95, 0.99)) +
labs(title = "Confidence Curve for 'Group' Parameter",
     subtitle = "Displays interval estimates at every level",
     x = "Parameter Value (Treatment Effect)") +
theme_less()  
#> Error in ncol(data): object 'curve1' not found

#> Error in mclapply(X, FUN, ..., mc.cores = mc.cores, mc.preschedule = mc.preschedule, : 'mc.cores' > 1 is not supported on Windows

We can see that that the overall treatment effect is approximately 9.7-9.9, and for most of the principled methods we used to handle the missing data, we did not obtain similar estimates. For example, both the primary analysis and the controlled multiple imputation methods did not yield the correct treatment effect estimate in the adjusted model.

In fact, nearly all of the methods used except for the Heckman selection model, yielded overestimates of the treatment effect. Even though such analyses can be substantially improved with more specification and more information, the reality is that data that are assumed to be MNAR are difficult to handle and that no method can guarantee the correct answer. Indeed, we knew the missing data model / mechanism, via our custom function


# Missing Data Mechanism -------------------------------------

lesslikely <- function() {
  set.seed(1031)
  logistic <- function(x) exp(x) / (1 + exp(x))
  1 - rbinom(n, 1, logistic(errors[, 2]))
}
set.seed(1031)
Ry <- ifelse(lesslikely() > 0, 1, 0)
PostDBP[Ry == 0] <- NA

\[\begin{equation} \Pr(R_1=0)=\psi_0+\frac{e^{Y_1}}{1+e^{Y_1}}\psi_1 \end{equation}\]


for MNAR we set \(\psi_\mathrm{MNAR}=(0,1)\)


Thus, we obtain the following model:


\[\begin{align} \mathrm{MNAR}&:\mathrm{logit}(\Pr(R_1=0)) = Y_1 \\ \end{align}\]


Even the random indicator method, which has shown much promise in simulation studies20 and offers the advantage of not having to specify a pattern-mixture or selection model could not recover the correct treatment effect estimate either, showing the difficulty of handling missing data and the limitations of automated approaches.


I hope it is clear from the examples above that it is quite easy to do any analyses with missing data and use a variety of ad-hoc approaches or even principled approaches to explore whether or not the results you’ve obtained are trustworthy or at least even give that impression, but doing a careful and principled one is quite difficult and requires much thought that will take into account real departures from assumptions and whether or not those will lead to substantial changes in results and conclusions/decisions.


Statistical Environment


#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#> 
#> Locale:
#>   LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
#>   LC_NUMERIC=C                           LC_TIME=English_United States.1252    
#> 
#> Package version:
#>   bayesplot_1.8.1     boot_1.3-28         bootImpute_1.2.0    brms_2.15.0         car_3.0-11          concurve_2.7.7     
#>   cowplot_1.1.1       doParallel_1.0.16   foreach_1.5.1       future_1.21.0       gamlss_5.3-4        ggcorrplot_0.1.3   
#>   ggplot2_3.3.5       here_1.0.1          Hmisc_4.5-0         ImputeRobust_1.3-1  kableExtra_1.3.4    knitr_1.33         
#>   lattice_0.20-44     magrittr_2.0.1      MASS_7.3-54         mice_3.13.0         miceMNAR_1.0.2      missForest_1.4     
#>   mvtnorm_1.1-2       parallel_4.1.0      parallelly_1.26.1   performance_0.7.2   qqplotr_0.0.5       rms_6.2-0          
#>   rstan_2.21.2        showtext_0.9-2      Statamarkdown_0.6.1 summarytools_0.9.9  tidyr_1.1.3         VIM_6.1.0          
#>   wesanderson_0.3.6

References


1. Greenland S. (2005). ‘Multiple-bias modelling for analysis of observational data.’ Journal of the Royal Statistical Society Series A (Statistics in Society). 168:267–306. doi: 10.1111/j.1467-985X.2004.00349.x.
2. Greenland S, Lash TL. (2008). ‘Bias analysis.’ In: Rothman KJ, Greenland S, Lash TL, editors. Modern Epidemiology. 3rd edition. Lippincott Williams & Wilkins. p. 345–380.
3. Lash TL, Fox MP, Fink AK. (2009). ‘Applying Quantitative Bias Analysis to Epidemiologic Data.’ Springer New York.
4. Lash TL, Fox MP, MacLehose RF, Maldonado G, McCandless LC, Greenland S. (2014). ‘Good practices for quantitative bias analysis.’ International Journal of Epidemiology. 43:1969–1985. doi: 10.1093/ije/dyu149.
5. Lash TL, Ahern TP, Collin LJ, Fox MP, MacLehose RF. (2021). ‘Bias Analysis Gone Bad.’ American Journal of Epidemiology. doi: 10.1093/aje/kwab072.
6. Gustafson P. (2021). ‘Invited Commentary: Toward Better Bias Analysis.’ American Journal of Epidemiology. doi: 10.1093/aje/kwab068.
7. Greenland S. (2021). ‘Dealing with the Inevitable Deficiencies of Bias Analysis and All Analyses.’ American Journal of Epidemiology. doi: 10.1093/aje/kwab069.
8. Molenberghs G, Fitzmaurice G, Kenward MG, Tsiatis A, Verbeke G. (2014). ‘Handbook of Missing Data Methodology.’ CRC Press.
9. Scott RaP. (2002). ‘The Multicentre Aneurysm Screening Study (MASS) into the effect of abdominal aortic aneurysm screening on mortality in men: A randomised controlled trial.’ The Lancet. 360:1531–1539. doi: 10.1016/S0140-6736(02)11522-4.
10. Morris TP, Kahan BC, White IR. (2014). ‘Choosing sensitivity analyses for randomised trials: principles.’ BMC Medical Research Methodology. 14:11. doi: 10.1186/1471-2288-14-11.
11. Mosteller F, Tukey JW. (1987). ‘Data analysis including statistics.’ In: The Collected Works of John W. Tukey: Philosophy and Principles of Data Analysis 1965-1986. CRC Press.
12. Akacha M, Bretz F, Ohlssen D, Rosenkranz G, Schmidli H. (2017). ‘Estimands and Their Role in Clinical Trials.’ Statistics in Biopharmaceutical Research. 9:268–271. doi: 10.1080/19466315.2017.1302358.
13. Permutt T. (2020). ‘Do Covariates Change the Estimand?’ Statistics in Biopharmaceutical Research. 12:45–53. doi: 10.1080/19466315.2019.1647874.
14. Mitroiu M, Oude Rengerink K, Teerenstra S, Pétavy F, Roes KCB. (2020). ‘A narrative review of estimands in drug development and regulatory evaluation: Old wine in new barrels?’ Trials. 21:671. doi: 10.1186/s13063-020-04546-1.
15. Little RJ, D’Agostino R, Cohen ML, Dickersin K, Emerson SS, Farrar JT, et al. (2012). ‘The Prevention and Treatment of Missing Data in Clinical Trials.’ New England Journal of Medicine. 367:1355–1360. doi: 10.1056/NEJMsr1203730.
16. Buuren S van. (2018). ‘Flexible Imputation of Missing Data, Second Edition.’ CRC Press.
17. Meng X-L, Rubin DB. (1992). ‘Performing Likelihood Ratio Tests with Multiply-Imputed Data Sets.’ Biometrika. 79:103–111. doi: 10.2307/2337151.
18. Schenker N, Taylor J. (1996). ‘Partially parametric techniques for multiple imputation.’ Computational Statistics & Data Analysis. 22:425–446. doi: 10.1016/0167-9473(95)00057-7.
19. Permutt T. (2016). ‘Sensitivity analysis for missing data in regulatory submissions.’ Statistics in Medicine. 35:2876–2879. doi: 10.1002/sim.6753.
20. Jolani S. (2012). ‘Dual Imputation Strategies for Analyzing Incomplete Data.’
21. von Hippel PT, Bartlett J. (2019). ‘Maximum likelihood multiple imputation: Faster imputations and consistent standard errors without posterior draws.’ arXiv:12100870 [stat]. http://arxiv.org/abs/1210.0870.
22. Rafi Z, Greenland S. (2020). ‘Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise.’ BMC Medical Research Methodology. 20:244. doi: 10.1186/s12874-020-01105-9.
23. Greenland S, Rafi Z. (2020). ‘To Aid Scientific Inference, Emphasize Unconditional Descriptions of Statistics.’ arXiv:190908583 [statME]. http://arxiv.org/abs/1909.08583.

  • Cite this blog post


  • comments powered by Disqus