# What Makes a Sensitivity Analysis?

## The Rise of Crony Statistical Rituals

This post will be split into two sections, in the first part, I will discuss the nature and difficulty of conducting sensitivity analyses and the quality of many published sensitivity analyses in various scientific fields, including medicine. I will then discuss a general framework that has existed for a long time but seen a major resurgence by regulatory agencies around the world that may help researchers conduct these analyses and other analyses in general.

In the second part of this post, I will use data from a historical parallel-arm RCT and analyze it from start to finish, however, the caveat is that there are extensive missing data, therefore I will have to utilize a number of non ad-hoc sensitivity analysis and missing data techniques to better understand these data were collected and how informative they are. Much of the code in this second part will be in R and Stata, but references will also be given to specific techniques and theory.

# Cargo-Cult Uncertainty

Sensitivity analyses are an important part of statistical science and many other disciplines when conducted in a principle and systematic manner. However, in the published sensitivity analyses literature, there are many inconsistencies, misconceptions, and highly misleading findings from these analyses. A giant obstacle that prevents these issues from recurring is that they (sensitivity analysis techniques) are quite difficult to learn and often advanced statistical methods that even many statisticians have difficulty with.

Yet, the appearance of difficulty does not dissuade certain researchers away from adopting them for their own work, (so that they may give their colleagues and their stakeholders that they report to, the impression of rigor and methodological expertise), and when this is done mindlessly, researchers will often skip over learning the details and theory, and like many statistical procedures, they will rely on default settings built in the statistical software.

While there are many suites/commands/functions/libraries available to conduct such analyses, much of these procedures masquerade as meticulous sensitivity analyses and to the users, are often a formality to appease stakeholders and give researchers a false sense of confidence about what they are doing. And yet, their users have little to no idea what they’re actually doing. Thus, like many statistical procedures new and old, they too will inevitably be abused as they become more popular and as they are required in research reports.

As Stark & Saltelli1, along with many others such as Gigerenzer2 and Greenland3 have written in the past. Below is an excerpt where Stark describes that much of statistics is simply people masquerading as data analysts and being rewarded as experts despite not having a single clue what they’re doing.

In our experience, many applications of statistics are cargo-cult statistics: practitioners go through the motions of fitting models, computing p-values or confidence intervals, or simulating posterior distributions. They invoke statistical terms and procedures as incantations, with scant understanding of the assumptions or relevance of the calculations, or even the meaning of the terminology. This demotes statistics from a way of thinking about evidence and avoiding self-deception to a formal “blessing” of claims. The effectiveness of cargo-cult statistics is predictably uneven. But it is effective at getting weak work published - and is even required by some journals…

Here, Stark takes on the issue of widely accessible statistical software for academics, students, and analysts.

Statistical software does not help you know what to compute, nor how to interpret the result. It does not offer to explain the assumptions behind methods, nor does it flag delicate or dubious assumptions. It does not warn you about multiplicity or p-hacking. It does not check whether you picked the hypothesis or analysis after looking at the data, nor track the number of analyses you tried before arriving at the one you sought to publish - another form of multiplicity. The more “powerful” and “user-friendly” the software is, the more it invites cargo-cult statistics.1

Greenland3 focuses on how traditional statistical education is inadequate to deal with the messy and chaotic nature of the world and the missing data that is produced by it. Greenland writes that traditional statistical practice opens the room for several cognitive biases, making many reports and conclusions highly misleading. He provides some hope however, by pointing to the utility of sensitivity analyses for modeling bias and uncertainty.

I argue that current training in statistics and analytical methods is inadequate for addressing major sources of inference distortion, and that it should be expanded to cover the biased perceptual and thinking processes (cognitive biases) that plague research reports. As commonly misused, null-hypothesis significance testing (NHST) combines several cognitive problems to create highly distorted interpretations of study results. Interval estimation has proven highly vulnerable to the same problems. Sensitivity and bias analyses address model uncertainties by varying and relaxing assumptions, but (like Bayesian analyses) they are difficult to perform with proper accounting for prior information and are easily manipulated because they depend on specification of many models and parameters.

However, he then follows up to discuss how difficult sensitivity and bias analyses can be due to the plethora of decisions that the analyst must make when specifying parameters in the analysis. These judgements will differ from researcher to researcher, even in the same group, and such analyses have not been studied as widely as conventional statistical practice.

Even with realistic choices, the sensitivity of sensitivity and bias analyses must be evaluated (51). The plausibility of an estimated bias function is determined by intuitions, prejudices, and understanding of the applied context; those can vary dramatically across researchers, in turn leading to very different specifications and inferences even if they are anchored to the same conventional analysis. Adding to this problem, sensitivity and bias analyses are more difficult to perform correctly and more easily massaged toward preferred conclusions, because they require specification of many more equations and their parameters.

And unlike NHST, abuse of sensitivity and bias analysis is as yet barely studied because the pool of such analyses remains small and highly selective. It thus seems implausible that these analyses will increase replicability of inferences, although they can reveal how assumptions affect those inferences. (Here “replicability” is used according to recommendations of the American Statistical Association (52) to denote independent checks of reported results with new data; “reproducibility” then denotes checks of reported results using the original data and computer code.)

Indeed, this is one drawback of (relatively) newly adopted methods/procedures, they have not been studied long enough as say somethng such as a P-value, which has been studied for centuries. Even something as useful as exploring assumptions and uncertainty such as a principled sensitivity analysis is not free from this.

it seems Greenland’s predictions were accurate; in a large systematic review of published sensitivity analyses led by the mathematician Andrea Saltelli and his group, they found that highly cited papers in top impact factor journals rarely contained sensitivity analyses, and that many of these were poorly done and one-dimensional, in the sense of only varying one parameter at a time, which is hardly informative as a multi-dimensional analysis in which multiple parameters are varied at the same time.

Here, we will explore one example of how mindless sensitivity analyses are conducted within the context of clinical trials, although we will not explore multi-dimensional sensitivity analyses here. I make this distinction because the environment to conduct sensitivity/bias analyses differs substantially in trials where regulators are involved versus the environment in which many observational studies are analyzed and presented.

In medicine, sensitivity analyses (SA) are often conducted in trials and observational studies, but again, little thought is given to what they should entail.1 None of this is surprising given that they are not usually taught in traditional settings, although they have been discussed extensively within the statistical and epidemiological literature410 Yet, the other barrier that remains is that they are incredibly technical and will require the assistance of an analyst familiar with both the theory and the applications. It also does not help that the topic is vast and there are multiple new statistical methods being published every day.

Indeed, the Handbook of Missing Data Methodology,11 one of the most authoritative and comprehensive books to date on statistical methods to handle missing data in clinical trials, has nearly six chapters devoted to the topic of sensitivity analyses and different approaches to use. Unfortunately, at the rate that this field is advancing, this book may already be outdated, despite only coming out six years ago.

This article will focus on what a sensitivity analysis is often assumed to be based on practices in the literature, some differing points of view regarding sensitivity analyses (with an example of a highly promoted measure for conducting sensitivity analyses that has been very controversial, and a practical example of conducting principled(by principled, I mean non ad-hoc techniques, as James Carpenter would say) sensitivity analyses in the context of missing data in a randomized clinical trial.

## Vagueness of ‘Sensitivity’

Before we move forward, we must consider what a sensitivity analysis actually entails. The first word of the phrase is an obvious tell, it suggests an analysis that examines how sensitive or robust a result is. But a natural question is , sensitive specifically to what? What sort of perturbations and how relevant are they? Can I vary anything in the analysis to test the sensitivity of the result or assumption? For example, if I obtained an odds ratio of 1.7 from the primary analysis in my experiment and I conducted an extra analysis to see whether the odds ratio would change by a resampling method, is that a sensitivity analysis?

What if I did a completely different type of analysis, for example, suppose the primary analysis was a classical statistical test with an adjustment for multiple comparisons and I decided to run a hierarchical Bayesian regression with a spiked prior on the null, would that be a sensitivity analysis, given they are different frameworks and procedures (although being used for similar goals)? The possibilities of what can be varied are endless when a loose definition of sensitivity analysis is assumed. This is unique to every discipline and their culture. In epidemiology, sensitivity analyses and bias analyses which attempt to extensively explore assumptions and the robustness of the result are used synonymously, yet in the clinical trial world, this is not the case.

## Frequent Misconceptions

Take for example in clinical trials and observational studies, in particular, it is common to see primary analyses often being intent-to-treat (ITT) analyses and sensitivity analyses being per-protocol analyses (PP).

I will not define these terms here and would encourage readers to consult other sources.

The following is a similar example from a high-profile trial published in the The Lancet in 2002,12

The Multicentre Aneurysm Screening Study group randomised 67,800 men to receive an invitation to an abdominal ultrasound scan or not [6]. Of those invited to receive an abdominal scan, 20% did not accept. The primary analysis was by intention to treat, thus estimating the effect of being randomised to abdominal ultrasound. Another analysis investigated the complier average causal effect, which considers what the (average) effect of treatment was in patients who would have adhered to protocol however they were randomised [7].

To many, this may seem perfectly fine, even great. The authors used the question they were primarily interested in as the main analysis, and then conducted an additional analysis to see if these results are consistent. Unfortunately, this is highly problematic as Morris et al. (2014)13 describes

These questions are different, and observing different results should not shake our confidence in either. The CACE analysis was a secondary analysis, not a sensitivity analysis.

It is common for authors to compare the results of intention-to-treat with per-protocol analysis; see for example [8, 9]. While it is hard to pin down the precise question of per-protocol analysis [10], this is clearly different to the question intention-to-treat addresses. Per-protocol analysis should not therefore be considered as a sensitivity analysis for intention-to-treat but as a secondary analysis, if at all.

## Portable Sensitivity Analyses

These sorts of misunderstandings are so common and so prevalent throughout the literature, so it should come as no surprise that sensitivity analyses are rarely done, or they are done incorrectly. Although not in clinical trials, one particular controversial example has been the promotion and use of the $E$-value (the ‘E’ apparently stands for ‘evidence’, but I am not sure.) within epidemiology to assess the amount of confounding necessary within an observational study result to practically reduce the effect estimate to something that is practically null.

The method and value has been adopted with open arms by many epidemiologists and health researchers given that it has simplified a task that is often arduous and requires extensive and careful thought in comparison to the traditional classical analysis. Yet, others have also been highly critical of it for a number of reasons, and although they see the value in promoting sensitivity analyses to more researchers around the world, many are concerned that this measure will also eventually go down the road of P-values, but for observational research.

## Regulators & Sensitivity Analyses

As mentioned above, there are many possible ways to conduct a sensitivity analysis, especially if the phrase is used in a loose/vague way. Whether or not these are valid and principled approaches to conducting sensitivity analyses is another question. Luckily, many of us do not have to ponder day and night about the semantics about this because both The Handbook of Missing Data Methodology and the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) have given this topic much thought, and for the latter it is reflected by the fact that they recently created an entire new addition to their E9 guidance document (Statistical Principles for Clinical Trials) which has served as a reference for clinical trial statisticians for decades. In the addendum, titled Addendum on Estimands and Sensitivity Analysis In Clinical Trials, which has now been legally adopted by regulatory agencies around the world including the FDA and EMA, they elaborate on the concept of estimands and the role sensitivity analyses play.

To be clear, estimands are not new and have been discussed in the statistical literature since Tukey14 but the ICH working group’s addendum formalized its adoption in the clinical trial world and the importance of sensitivity analyses.1517

## Estimands & Sensitivity

A.5.2.1. Role of Sensitivity Analysis

Inferences based on a particular estimand should be robust to limitations in the data and deviations from the assumptions used in the statisticral model for the main estimator. This robustness is evaluated through a sensitivity analysis. Sensitivity analysis should be planned for the main estimators of all estimands that will be important for regulatory decision making and labelling in the product information. This can be a topic for discussion and agreement between sponsor and regulator.

The statistical assumptions that underpin the main estimator should be documented. One or more analyses, focused on the same estimand, should then be pre-specified to investigate these assumptions with the objective of verifying whether or not the estimate derived from the main estimator is robust to departures from its assumptions. This might be characterised as the extent of departures from assumptions that change the interpretation of the results in terms of their statistical or clinical significance (e.g. tipping point analysis).

Distinct from sensitivity analysis, where investigations are conducted with the intent of exploring robustness of departures from assumptions, other analyses that are conducted in order to more fully investigate and understand the trial data can be termed “supplementary analysis” (see Glossary; A.5.3.). Where the primary estimand(s) of interest is agreed between sponsor and regulator, the main estimator is pre-specified unambiguously, and the sensitivity analysis verifies that the estimate derived is reliable for interpretation, supplementary analyses should generally be given lower priority in assessment.

A.5.2.2. Choice of Sensitivity Analysis

When planning and conducting a sensitivity analysis, altering multiple aspects of the main analysis simultaneously can make it challenging to identify which assumptions, if any, are responsible for any potential differences seen. It is therefore desirable to adopt a structured approach, specifying the changes in assumptions that underlie the alternative analyses, rather than simply comparing the results of different analyses based on different sets of assumptions. The need for analyses varying multiple assumptions simultaneously should then be considered on a case by case basis. A distinction between testable and untestable assumptions may be useful when assessing the interpretation and relevance of different analyses.

ICH E9 (R1) Guideline

The need for sensitivity analysis in respect of missing data is established and retains its importance in this framework. Missing data should be defined and considered in respect of a particular estimand (see A.4.). The distinction between data that are missing in respect of a specific estimand and data that are not directly relevant to a specific estimand gives rise to separate sets of assumptions to be examined in sensitivity analysis.

they explicitly define a sensitivity analysis as being an analysis which:

• realistically varies the assumptions from the primary analysis,

• still targets the same estimand,

• examines the robustness of the results to

• assumption violations or departures, and

• can possibly change the results/conclusions drawn

They contrast this with more extensive analyses that investigate these violations/departures of assumptions, and characterize those analyse as being supplementary rather than sensitivity. The latter is typically what is seen in rigorous epidemiological studies that employ quantitative bias analyses (a more specialized form of sensitivity analysis).

The ICH E9 addendum contains similar views that are echoed by the National Research Council’s advice on clinical trials (National Research Council 2010)18 regarding estimands and sensitivity

Recommendation 15: Sensitivity analysis should be part of the primary reporting of findings from clinical trials. Examining sensitivity to the assumptions about the missing data mechanism should be a mandatory component of reporting.

Now, back to the NEJM paper. The reason why an ITT analysis and a PP analysis cannot serve as primary and sensitivity analyses, respectively, is because they are targeting entirely different estimands.13 Thus, because they are answering completely different questions, they’re just two different analyses. And the additional PP analysis, although important to certain stakeholders, can be classified as a supplementary analysis or a non-sensitivity analysis.

Indeed, the following flowchart from Morris et al. (2014)13 is particularly useful as a guide to differentiate sensitivity analyses from non-sensitivity analyses.

## Utility of Subject-Matter Experts

So what does a sensitivity analysis actually look like within the context of a clinical trial or experiment? Below, I walk through the analysis of a trial to give some examples.

# Sensitivity in Clinical Trials

I’ll use a sample clinical trial dataset, which is described here:19

“This dataset is typical of diastolic blood pressure data measured in small clinical trials in hypertension from the mid-to-late 1960s and for approximately a decade thereafter.

During this time, hypertension was more severe, the number of effective treatments was relatively small, and the definition (DBP > 95 mmHg) of essential hypertension was not as stringent as it is now (DBP > 80 mmHg) - as seen in the 1967 report from the Veterans Administration Cooperative Study Group on Antihypertensive Agents VA Study Group (1967).

In Table 3.1, diastolic blood pressure (DBP) was measured (mmHg) in the supine position at baseline (i.e., DBP) before randomization… Patients’ age and sex were recorded at baseline and represent potential covariates. The primary objective in this chapter in the analysis of this dataset is to test whether treatment A (new drug) may be effective in lowering DBP as compared to B (placebo) and to describe changes in DBP across the times at which it was measured.

So to recap, the primary investigators of this trial were interested to see whether treatment A was more effective in lowering DBP when compared to placebo.

Although we will not be using some of the scripts and project setup that I discussed in my statistical workflow post for the sake of efficiency and ease, I would urge and recommend others do so that it remains easy to catch errors. In particular, I would especially recommend using the following packages:

For the analyses in this post, see the R packages in the session info section at the end of this post, these will need to be loaded in order for the scripts below to work.

req_packs <- c("rms", "mice", "magrittr", "tidyr",
"future", "showtext", "concurve",
"ggtext", "tidyverse", "heavy", "
cmdstanr", "ggplot2", "parallel",
"kableExtra", "cowplot", "Cairo",
"svglite", "yardstick", "broom",
"mi", "Hmisc", "here", "VIM",
"wesanderson", "bayesplot", "lattice",
"doParallel", "blogdown", "Amelia",
"bootImpute", "brms", "rmarkdown",
"tinytex", "texPreview", "ggcorrplot",
"qqplotr", "car", "Statamarkdown",
"rstan", "missForest", "htmltools",
"tidybayes", "summarytools", "performance",
"mvtnorm", "ImputeRobust", "gamlss",
"ProfileLikelihood", "reshape2", "boot",
"knitr", "MASS", "miceMNAR")

# Load all packages at once

lapply(req_packs, library,
character.only = TRUE)

I will now simulate some of the data from this clinical trial.

# Simulate Trial Data  -------------------------------------

set.seed(1031, kind = "L'Ecuyer-CMRG",
normal.kind = "Inversion",
sample.kind = "Rejection")

BaselineDBP <- rnorm(500, 117, 2)
Group <- rbinom(500, 1, 0.5)
Age <- rnorm(500, 50, 7)
R1 <- rnorm(500, 0, 1)
R2 <- rnorm(500, 1, 0.5)

errors <- mvtnorm::rmvnorm(500, mean = c(0, 0),
sigma = matrix(c(1, 0.5, 0.5, 1),
nrow = 2, byrow = TRUE))

PostDBP <- BaselineDBP + (Age * 0.33) +
(Group * 9.7) + errors[, 1]

I have made some minor adjustments to this dataset and also generated a new variable (Z) based on the existing ones in the dataset that is strongly correlated with the outcome of interest (PostDBP).

# Explore Data  -------------------------------------

str(BP_full)
#> 'data.frame':    500 obs. of  7 variables:
#>  $PostDBP : num 142 140 129 148 135 ... #>$ BaselineDBP: num  119 116 115 119 119 ...
#>  $Group : Factor w/ 2 levels "Drug X","Drug Y": 1 1 2 1 2 2 1 2 1 1 ... #>$ Age        : num  41.8 44.4 41.1 56.6 50.2 ...
#>  $Z : num 0.0141 -0.0629 -0.1101 0.0313 0.0495 ... #>$ R1         : num  0.523 1.021 -0.839 0.413 -0.175 ...
#>  $R2 : num 0.451 0.35 0.897 1.227 1.166 ... Table 1: Patient Characteristics Variable N Overall, N = 5001 Drug X, N = 2581 Drug Y, N = 2421 PostDBP 500 138.5 (133.4, 142.8) 142.5 (140.5, 144.5) 133.4 (131.3, 135.3) BaselineDBP 500 116.88 (115.56, 118.22) 116.90 (115.54, 118.20) 116.81 (115.60, 118.22) Age 500 49 (45, 55) 49 (45, 54) 50 (46, 55) Z 500 0.00 (-0.04, 0.05) 0.00 (-0.04, 0.04) 0.00 (-0.05, 0.05) R1 500 0.00 (-0.73, 0.55) 0.05 (-0.72, 0.59) -0.10 (-0.74, 0.52) R2 500 0.96 (0.64, 1.28) 0.96 (0.60, 1.30) 0.97 (0.67, 1.27) 1 Median (IQR) summary(BP_full) #> PostDBP BaselineDBP Group Age Z R1 R2 #> Min. :124 Min. :110 Drug X:258 Min. :27.7 Min. :-0.1845 Min. :-3.21 Min. :-0.646 #> 1st Qu.:133 1st Qu.:116 Drug Y:242 1st Qu.:45.3 1st Qu.:-0.0444 1st Qu.:-0.73 1st Qu.: 0.645 #> Median :138 Median :117 Median :49.4 Median : 0.0001 Median : 0.00 Median : 0.960 #> Mean :138 Mean :117 Mean :49.7 Mean : 0.0000 Mean :-0.04 Mean : 0.962 #> 3rd Qu.:143 3rd Qu.:118 3rd Qu.:54.7 3rd Qu.: 0.0456 3rd Qu.: 0.55 3rd Qu.: 1.283 #> Max. :150 Max. :123 Max. :68.4 Max. : 0.2025 Max. : 3.38 Max. : 2.480 PostDBP BaselineDBP Group Age Z R1 R2 142 119 Drug X 41.8 0.01 0.52 0.45 140 116 Drug X 44.4 -0.06 1.02 0.35 129 115 Drug Y 41.1 -0.11 -0.84 0.90 148 119 Drug X 56.6 0.03 0.41 1.23 135 119 Drug Y 50.2 0.05 -0.17 1.17 135 118 Drug Y 55.3 0.01 -0.89 0.77 143 117 Drug X 54.3 0.00 1.07 0.34 133 116 Drug Y 55.8 -0.05 -1.91 0.96 144 118 Drug X 46.9 0.08 1.03 1.75 150 119 Drug X 62.0 0.01 -0.94 2.48 Now, we can quickly explore the other characteristics of the dataset, and first we look at the rank correlations between the predictors and the response variable. # Data Attributes ------------------------------------- study_formula <- as.formula(PostDBP ~ .) plot(spearman2(study_formula, BP_full), cex = 1, pch = 18, col = alpha("darkred", 0.75)) abline(v = 0, col = alpha(zred, 0.75), lty = 3) Along with a correlation matrix between the variables of interest. We will not be looking at the estimates from a fitted model yet. We are mainly looking at a few characteristics of this dataset before we move on to the next step. Now that we have familiarized ourselves with this dataset, we suddenly “notice” that there are missing data (which I actually generated below using a custom function) and suppose we didn’t actually know how the missing data were generated, we might suspect or start with the assumption that the data are missing at random (MAR). I do not plan on touching what the different missing data mechanisms are (MCAR, MAR, MNAR) as described by Rubin (1978). Although it is necessary to be familiar with them to better understand the approaches used throughout this post. Some individuals have remarked that the following diagram below in Statistical Rethinking helped them grasp these concepts, so I am pasting that here as a reference. Of course, here, we do have an idea of what the missing data mechanism is, but we shall assume the role of a data analyst who has just been given a dataset with missing values in the outcome from their clinical colleague. # Generate Missing Data ------------------------------------- set.seed(1031, kind = "L'Ecuyer-CMRG", normal.kind = "Inversion", sample.kind = "Rejection") Ry <- ifelse(lesslikely() > 0, 1, 0) PostDBP[Ry == 0] <- NA BP_miss <- data.frame(PostDBP, BaselineDBP, Group, Age, Z, R1, R2) str(BP_miss) #> 'data.frame': 500 obs. of 7 variables: #>$ PostDBP    : num  142 140 129 148 135 ...
#>  $BaselineDBP: num 119 116 115 119 119 ... #>$ Group      : Factor w/ 2 levels "Drug X","Drug Y": 1 1 2 1 2 2 1 2 1 1 ...
#>  $Age : num 41.8 44.4 41.1 56.6 50.2 ... #>$ Z          : num  0.0141 -0.0629 -0.1101 0.0313 0.0495 ...
#>  $R1 : num 0.523 1.021 -0.839 0.413 -0.175 ... #>$ R2         : num  0.451 0.35 0.897 1.227 1.166 ...

sum(is.na(BP_miss))
#> [1] 245

We will now attempt to visualize our dataset and examine what proportion of these data are missing.

missing_col <- c(zblue, zred)
suppressMessages(aggr(BP_miss, col = alpha(missing_col, 0.60),
plot = TRUE, prop = F, numbers = T))

So now, we must examine this missing dataset and its characteristics and look for any systematic differences. We also examine the proportion of missingness and the influx and outflux patterns. This is important to assess how well connectected the missing data are to the observed data, explained by van Buuren here

The influx of a variable quantifies how well its missing data connect to the observed data on other variables. The outflux of a variable quantifies how well its observed data connect to the missing data on other variables. In general, higher influx and outflux values are preferred…

Influx and outflux are summaries of the missing data pattern intended to aid in the construction of imputation models. Keeping everything else constant, variables with high influx and outflux are preferred. Realize that outflux indicates the potential (and not actual) contribution to impute other variables. A variable with high $O_{j}$ may turn out to be useless for imputation if it is unrelated to the incomplete variables. On the other hand, the usefulness of a highly predictive variable is severely limited by a low $O_{j}$.

More refined measures of usefulness are conceivable, e.g., multiplying $O_{j}$ by the average proportion of explained variance. Also, we could specialize to one or a few key variables to impute. Alternatively, analogous measures for $I_{j}$ could be useful. The further development of diagnostic summaries for the missing data pattern is a promising area for further investigation.

# Examine Flux Patterns  -------------------------------------

round(fico(BP_miss), 1)[2:6]
#> BaselineDBP       Group         Age           Z          R1
#>         0.5         0.5         0.5         0.5         0.5

POBS Influx Outflux AINB AOUT FICO
PostDBP 0.51 0.45 0 1 0.00 0.00
BaselineDBP 1.00 0.00 1 0 0.08 0.49
Group 1.00 0.00 1 0 0.08 0.49
Age 1.00 0.00 1 0 0.08 0.49
Z 1.00 0.00 1 0 0.08 0.49
R1 1.00 0.00 1 0 0.08 0.49
R2 1.00 0.00 1 0 0.08 0.49

## Exploratory Analyses

Next, We start off doing some preliminary analyses to eyeball differences between the observed data and the missing data.

#> [1] 245

pobs influx outflux ainb aout fico
PostDBP 0.51 0.45 0 1 0.00 0.00
BaselineDBP 1.00 0.00 1 0 0.08 0.49
Group 1.00 0.00 1 0 0.08 0.49
Age 1.00 0.00 1 0 0.08 0.49
Z 1.00 0.00 1 0 0.08 0.49
R1 1.00 0.00 1 0 0.08 0.49
R2 1.00 0.00 1 0 0.08 0.49

Having explored these data, we fit a preliminary model quickly to these missing data using the rlm() function from the MASS package and glance at what our estimates look like.

# Exploratory Model Estimates  -------------------------------------

eda_1 <- rlm(PostDBP ~ factor(Group) + BaselineDBP +
Age + Z + R1 + R2, data = BP_miss,
method = "MM", psi = psi.huber)

Value Std. Error t value
(Intercept) 11.68 5.77 2.02
factor(Group)Drug Y -9.58 0.13 -71.57
BaselineDBP 0.98 0.05 19.86
Age 0.34 0.01 34.33
Z -0.93 1.48 -0.62
R1 -0.03 0.07 -0.43
R2 -0.13 0.13 -0.94

AIC BIC RMSE Sigma
749 778 1.02 1.03

Because the MASS::rlm() function will automatically delete missing observations and do a complete-case analysis, we can gauge what our estimates would look like if we assume that the data are missing completely at random (MCAR), which is almost always unrealistic.

It is almost always more realistic to assume that the data are missing at random (MAR), $\operatorname{Pr}\left(R=0 \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \psi\right)=\operatorname{Pr}\left(R=0 \mid Y_{\mathrm{obs}}, \psi\right)$. This can often be a reasonable assumption to start off with for the primary analysis.20

## The Analysis Workflow

For our main analysis, we will build an imputation model to impute our missing data. We include all possible information in our imputation model, meaning all the variables in our dataset, and impute the missing data using the predictive mean matching method, and then analyze each dataset and pool the results. Our basic workflow will look like this:

We do this with a fully-adjusted model, with all relevant covariates and a reduced model with just the treatment group. This is because we want both the adjusted and unadjusted estimates. We will then compare these models using the likelihood-ratio test21 to see whether there are any substantial differences between the fully-adjusted and reduced model.

## Monte Carlo Error Analysis

However, before we are able to move onto this step, we must determine how many imputations we will construct. To many, this may seem like a trivial and even nonsensical task when any individual can impute a large number of datasets. Indeed, it may seem intuitive that more is better. This is why we must conduct an MI power analysis.

Multiple imputation is a stochastic procedure. Each time we reimpute our data, we get different sets of imputations because of the randomness of the imputation step, and therefore we get different multiple-imputation estimates. However, we want to be able to reproduce MI results. Of course, we can always set the random-number seed to ensure reproducibility by obtaining the same imputed values. However, what if we use a different seed? Would we not want our results to be similar regardless of what seed we use? This leads us to a notion we call statistical reproducibility—we want results to be similar across repeated uses of the same imputation procedure; that is, we want to minimize the simulation error associated with our results.

To assess the level of simulation error, White, Royston, and Wood (2011) propose to use a Monte Carlo error of the MI results, defined as the standard deviation of the results across repeated runs of the same imputation procedure using the same data. The authors suggest evaluating Monte Carlo error estimates not only for parameter estimates but also for other statistics, including p-values and confidence intervals, as well as MI statistics including RVI and FMI.

The issue with this is that without giving careful thought to how many imputations may be needed for a particular study, one may run the risk of imputing too few datasets and losing information, or they may waste resources imputing a nonsensical amount that may not be necessary. Indeed, the latter seems especially trivial, however, it may often yield no advantages whatsoever, and only incur costs as many authors have argued.

The argument is that “the additional resources that would be required to create and store more than a few imputations would not be well spent” (Schafer 1997, 107), and “in most situations there is simply little advantage to producing and analyzing more than a few imputed datasets” (Schafer and Olsen 1998, 549).

Therefore, we will examine certain characteristics of our dataset with missing values and use them in an analysis to determine how many datasets we should impute to efficiently achieve our goals. We will base this off a number of characteristics such as the fraction of missing information, the proportion of missing observations, and the losses that we are willing to incur.

We first start by running a “dry” or naive imputation of our data to quickly examine some of these characteristics. This imputed dataset will not be used for our primary analysis, it is simply serving to guide us in our main imputation model. This is also why we have set the number of iterations to 0.

Once again, we will include all the variables in our dataset, and use the predictive mean matching method with approximately 5 donors.

# Dry Imputation -------------------------------------

form <- list(PostDBP ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2)
pred <- make.predictorMatrix(BP_miss)

parlmice(BP_miss, method = "pmm", m = 20,
maxit = 0, cores = cores, donors = 5,
cluster.seed = 1031, formulas = form, ridge = 0,
predictorMatrix = pred, remove.collinear = TRUE,
remove.constant = TRUE, allow.na = TRUE,
printFlag = FALSE, visitSequence = "monotone") -> init_imp

head(init_imp$loggedEvents, 10) #> NULL Now that we have imputed our dataset, we may examine it for any issues before moving onto the next step of analyzing it. Although it is only a dry run, there seems to be no issues with it. We now pool our estimates using Rubin’s rules. init_res <- mice::pool(with(init_imp, rlm(PostDBP ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2, method = "MM", psi = psi.huber))) init_res_sum <- summary(init_res) colnames(init_res_sum) <- c("Term", "Estimate", "SE", "Statistic", "df", "P-val") ztable(init_res_sum) Term Estimate SE Statistic df P-val (Intercept) 36.43 18.33 1.99 45.2 0.05 factor(Group)Drug Y -8.23 0.41 -20.09 59.0 0.00 BaselineDBP 0.78 0.15 5.16 49.1 0.00 Age 0.28 0.04 8.00 36.7 0.00 Z 0.71 3.99 0.18 112.1 0.86 R1 0.02 0.18 0.13 136.5 0.90 R2 0.02 0.38 0.07 80.2 0.95 Now that we have our vector, we can quickly examine certain characteristics such as the fraction of missing information (FIMO), the between-imputation variance and the within-imputation variance. The reason that we focus on these particular attributes is because they are crucial for several tasks in any statistical analysis plan, such as having enough power/precision, minimum long-run coverage, etc. We use the FIMO from the dry imputation and a number of other criteria to determine how many imputations we need: • Some have suggested taking the FIMO and multiplying by 100 to obtain the number of imputed datasets needed $m \approx 100 \lambda$ • For our goals, we wish to ensure that the monte-carlo standard error from the imputations are less than 10% of the between-imputation standard error $mcse < B^{\hat{-}}_{SE} * 0.10$ • We wish to minimize the monte-carlo error that is derived from dividing the chosen number of datasets to impute from the fraction of missing information so that the monte-carlo standard error is less than 0.01 $\frac{FMI}{m} ≈ 0.01$ • We also wish to minimize the total variance so that the square-root of it is no greater than the ideal variance and the corresponding confidence interval width. $T_{m}=\left(1+\frac{\gamma_{0}}{m}\right) T_{\infty}$ • Furthermore, it is commonly advocated by missing data experts to impute the same or similar number of datasets as fractions of missing observations $\gamma_{0}*100$ We start by extracting the FMIs from our dry imputation and calculating all these numbers. # Calculate Fraction of Missing Information fmi <- max(summary(init_res, type = c("all"), conf.int = TRUE, conf.level = 0.95)["fmi"]) m <- fmi / 0.01 # FMI Imputations mcse <- sqrt(max(init_res$pooled$b) / m) # MCSE se <- max((init_res$pooled$ubar) / sqrt(500)) # SE What we find is that the fraction of missing information is nearly 50% which is very high and that in order to get a monte-carlo error of less than 0.01, we would need a minimum of at least 50 imputed datasets. We also calculate both the monte-carlo standard error and the standard error from the dry run and find that they are practically equivalent. However, we must also explore other possibilities in terms of number of imputed datasets to reduce the monte-carlo errors so we run a quick function to do that, checking the effects of a number of imputed datasets on the width of the confidence interval, which is directly tied to the monte-carlo error. Indeed, Von Hippel (2018) proposed a relationship between the fraction of missing information and the number of imputations needed to achieve a targeted CI length or monte-carlo error using a quadratic formula, $m=1+\frac{1}{2}\left(\frac{\gamma_{0}}{\operatorname{SD}\left(\sqrt{U_{\ell}}\right) \mathrm{E}\left(\sqrt{U_{\ell}}\right)}\right)^{2}$ where $\mathrm{E}\left(\sqrt{U_{\ell}}\right)$ and $\mathrm{SD}\left(\sqrt{U_{\ell}}\right)$ are the coefficient of variation (CV), summarizing the imputation variation in the SE estimates. We can graphically construct a function to depict this relationship and calculate how many imputations we will need based on some desired error and further display the distribution of this relationship by varying these parameters. In the function below, adopted from von Hippel, and van Buuren,20 I vary both the coefficient of variation and the $\alpha$-level. With this function, we can now graphically examine the effects of different parameter values on the monte-carlo errors from the imputations. We can see above that as we aim for smaller monte-carlo errors, we need many more imputations to achieve our desired target. Now that I have constructed and graphed three differing relationships between monte-carlo errors and number of imputations needed to achieve those errors, and varied the $\alpha$-levels, we can now move onto choosing a specific number of imputations to construct. If I choose an $\alpha$ level of 5%, which the maximum tolerable type-I error rate, and a coefficient of variation of 5% # Estimate Number of Imputations ------------------------------------- powerimp(model = init_res, cv = 0.10, alpha = 0.05) #> [1] 32 powerimp(model = init_res, cv = 0.05, alpha = 0.05) #> [1] 124 powerimp(model = init_res, cv = 0.01, alpha = 0.05) #> [1] 3057 I need to construct approximately 124 imputations to achieve this coefficient of variation, which is what I will use for all the analyses from here on, and it is clear that it is far more reasonable than trying to impute 3000 datasets to achieve a tiny standard error. However, I want to verify my calculations by running a similar estimation command that Stata provides. The documentation says the following regarding RVI and FMI: Returning to the output, average RVI reports the average relative increase (averaged over all coefficients) in variance of the estimates because of the missing bmi values. A relative variance increase is an increase in the variance of the estimate because of the loss of information about the parameter due to nonresponse relative to the variance of the estimate with no information lost. The closer this number is to zero, the less effect missing data have on the variance of the estimate. Note that the reported RVI will be zero if you use mi estimate with the complete data or with missing data that have not been imputed. In our case, average RVI is small: 0.0312. Largest FMI reports the largest of all the FMI about coefficient estimates due to nonresponse. This number can be used to get an idea of whether the specified number of imputations is sufficient for the analysis. A rule of thumb is that M ≥ 100 × FMI provides an adequate level of reproducibility of MI analysis. In our example, the largest FMI is 0.14 and the number of imputations, 20, exceeds the required number of imputations: 14 Although I cannot get the number of imputations directly from a command, I can run the mi estimate command, and look at the overall fraction of missing information as the number of imputations increases. If my previous calculations are consistent, I should look to increase the number of imputations until I achieve an FMI of 1.24 or less: 124/100 clear all set seed 1031 parallel setclusters 8 parallel initialize 8 quietly import delimited "~/Dropbox/LessLikely/static/datasets/temp.csv", numericcols(2 3 5 6 7 8) quietly mi set mlong quietly mi register imputed baselinedbp postdbp age z group r1 r2 quietly parallel: mi impute chained (pmm, knn(5)) postdbp = baselinedbp age z r1 r2 group, burnin(5) add(120) rseed (1031) nomonotone quietly mi estimate, saving(miest, replace): rreg postdbp i.group baselinedbp age z r1 r2 mi estimate using miest, mcerror level(95) mi estimate, saving(miest, replace) vartable mcerror level(95) nocitable: rreg postdbp i.group baselinedbp age z r1 r2 mi estimate using miest, mcerror level(95) #> . clear . #> . set seed 1031 #> #> . #> . parallel setclusters 8 #> N Child processes: 8 #> Stata dir: /Applications/Stata/StataMP.app/Contents/MacOS/stata-mp #> #> . #> . parallel initialize 8 #> N Child processes: 8 #> Stata dir: /Applications/Stata/StataMP.app/Contents/MacOS/stata-mp #> #> . #> . quietly import delimited "~/Dropbox/LessLikely/static/datasets/temp.csv", numeri #> > ccols(2 3 5 6 7 8) #> #> . #> . quietly mi set mlong #> #> . #> . quietly mi register imputed baselinedbp postdbp age z group r1 r2 #> #> . #> . quietly parallel: mi impute chained (pmm, knn(5)) postdbp = baselinedbp age z r1 #> > r2 group, burnin(5) add(120) rseed (1031) nomonotone #> #> . #> . quietly mi estimate, saving(miest, replace): rreg postdbp i.group baselinedbp ag #> > e z r1 r2 #> #> . #> . mi estimate using miest, mcerror level(95) #> #> Multiple-imputation estimates Imputations = 120 #> Robust regression Number of obs = 500 #> Average RVI = 1.2446 #> Largest FMI = 0.8029 #> Complete DF = 493 #> DF adjustment: Small sample DF: min = 64.87 #> avg = 259.72 #> max = 342.89 #> Model F test: Equal FMI F( 6, 410.1) = 282.34 #> Prob > F = 0.0000 #> #> ------------------------------------------------------------------------------ #> postdbp | Coefficient Std. err. t P>|t| [95% conf. interval] #> -------------+---------------------------------------------------------------- #> 1.group | 8.689445 .387477 22.43 0.000 7.915569 9.46332 #> | .0314616 .0185369 1.08 0.000 .0474504 .0514125 #> | #> baselinedbp | .9682818 .0721626 13.42 0.000 .826246 1.110318 #> | .0034724 .0013932 0.27 0.000 .0046169 .0042748 #> | #> age | .3065044 .0129133 23.74 0.000 .2811046 .3319043 #> | .000536 .0001961 0.35 0.000 .0005838 .0007338 #> | #> z | -7.566645 2.253298 -3.36 0.001 -12.00257 -3.130721 #> | .1116652 .0526051 0.10 0.000 .1475993 .1589098 #> | #> r1 | -.0677966 .1120335 -0.61 0.546 -.2886173 .1530242 #> | .0063738 .0031658 0.06 0.038 .0096883 .0082499 #> | #> r2 | .1295295 .2002722 0.65 0.518 -.2643871 .5234461 #> | .0082479 .0026444 0.04 0.027 .0099567 .0096033 #> | #> _cons | 5.40357 8.429241 0.64 0.522 -11.18598 21.99312 #> | .3991835 .1614704 0.05 0.031 .4936807 .5316961 #> ------------------------------------------------------------------------------ #> Note: Values displayed beneath estimates are Monte Carlo error estimates. #> #> . #> . mi estimate, saving(miest, replace) vartable mcerror level(95) nocitable: rreg p #> > ostdbp i.group baselinedbp age z r1 r2 #> (system variable _mi_id updated because of changed number of obs) #> #> Multiple-imputation estimates Imputations = 120 #> Robust regression #> #> Variance information #> ------------------------------------------------------------------------------ #> | Imputation variance Relative #> | Within Between Total RVI FMI efficiency #> -------------+---------------------------------------------------------------- #> 1.group | .030369 .11878 .150138 3.94388 .802896 .993354 #> | .000351 .014187 .014313 .472623 .019598 .000162 #> | #> baselinedbp | .003749 .001447 .005207 .389201 .282239 .997654 #> | .000043 .000197 .000201 .053504 .028234 .000236 #> | #> age | .000132 .000034 .000167 .263424 .209925 .998254 #> | 1.5e-06 4.8e-06 5.1e-06 .037066 .023617 .000198 #> | #> z | 3.56859 1.49629 5.07735 .422789 .299391 .997511 #> | .041205 .233251 .236253 .06639 .03357 .000281 #> | #> r1 | .007636 .004875 .012552 .643772 .394767 .996721 #> | .000088 .000706 .000707 .094418 .035853 .000299 #> | #> r2 | .031878 .008163 .040109 .258217 .206621 .998281 #> | .000368 .000989 .001058 .031477 .020141 .000169 #> | #> _cons | 51.7711 19.1217 71.0521 .372429 .273359 .997727 #> | .597777 2.66617 2.71602 .052377 .028329 .000237 #> ------------------------------------------------------------------------------ #> Note: Values displayed beneath estimates are Monte Carlo error estimates. #> #> #> . #> . mi estimate using miest, mcerror level(95) #> #> Multiple-imputation estimates Imputations = 120 #> Robust regression Number of obs = 500 #> Average RVI = 1.2446 #> Largest FMI = 0.8029 #> Complete DF = 493 #> DF adjustment: Small sample DF: min = 64.87 #> avg = 259.72 #> max = 342.89 #> Model F test: Equal FMI F( 6, 410.1) = 282.34 #> Prob > F = 0.0000 #> #> ------------------------------------------------------------------------------ #> postdbp | Coefficient Std. err. t P>|t| [95% conf. interval] #> -------------+---------------------------------------------------------------- #> 1.group | 8.689445 .387477 22.43 0.000 7.915569 9.46332 #> | .0314616 .0185369 1.08 0.000 .0474504 .0514125 #> | #> baselinedbp | .9682818 .0721626 13.42 0.000 .826246 1.110318 #> | .0034724 .0013932 0.27 0.000 .0046169 .0042748 #> | #> age | .3065044 .0129133 23.74 0.000 .2811046 .3319043 #> | .000536 .0001961 0.35 0.000 .0005838 .0007338 #> | #> z | -7.566645 2.253298 -3.36 0.001 -12.00257 -3.130721 #> | .1116652 .0526051 0.10 0.000 .1475993 .1589098 #> | #> r1 | -.0677966 .1120335 -0.61 0.546 -.2886173 .1530242 #> | .0063738 .0031658 0.06 0.038 .0096883 .0082499 #> | #> r2 | .1295295 .2002722 0.65 0.518 -.2643871 .5234461 #> | .0082479 .0026444 0.04 0.027 .0099567 .0096033 #> | #> _cons | 5.40357 8.429241 0.64 0.522 -11.18598 21.99312 #> | .3991835 .1614704 0.05 0.031 .4936807 .5316961 #> ------------------------------------------------------------------------------ #> Note: Values displayed beneath estimates are Monte Carlo error estimates. ## The Primary Analysis For the primary analysis, I will use the predictive-mean matching method and specify 5 donors (based on a number of simulation studies that have examined the effect of various donors specified)22, and fit a robust regression that includes all the variables within the dataset. Our imputation model will be $Y^{\operatorname{Post}} = \beta_{0} + \beta_{1}^{\operatorname{Group}} + \beta_{2}^{\operatorname{Base}} + \beta_{3}^{\operatorname{Age}} + \beta_{4}^{\operatorname{Z}} + \beta_{5}^{\operatorname{R1}} + \beta_{6}^{\operatorname{R2}} + \epsilon$ So we will now impute our datasets. # Impute Missing Data ------------------------------------- form <- list(PostDBP ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2) pred <- make.predictorMatrix(BP_miss) parlmice(BP_miss, method = "pmm", predictorMatrix = pred, m = 120, maxit = 5, cluster.seed = 1031, n.core = cores, n.imp.core = (m / cores), formulas = form, ridge = 0, cl.type = "FORK", donors = 5, remove.collinear = TRUE, remove.constant = TRUE, allow.na = TRUE, printFlag = FALSE, visitSequence = "monotone") -> imp1 head(imp1$loggedEvents, 10)
#> NULL

Now that we have imputed our datasets, we may examine them for any issues before moving onto the next step of analyzing it.

Our imputed datasets seem to look fine so far. We now fit our models to each of these datasets and combine them using Rubin’s rules.

# Primary Analysis -------------------------------------

analysis1 <- with(imp1, rlm(PostDBP ~ (Group)  +
BaselineDBP + Age + Z + R1 + R2,
method = "MM", psi = psi.huber))
analysis2 <- with(imp1, rlm(PostDBP ~ (Group) ,
method = "MM", psi = psi.huber))

anova(analysis1, analysis2,
method = "D3", use = "LR")
#>    test statistic df1 df2 dfcom p.value riv
#>  1 ~~ 2         0   5 318   493       1 221

result1 <- mice::pool(analysis1)

results1_sum <- summary(result1)

colnames(results1_sum) <- c("Term", "Estimate", "SE",
"Statistic", "df", "P-val")

Term Estimate SE Statistic df P-val
(Intercept) 10.34 5.51 1.88 136 0.06
GroupDrug Y -9.54 0.13 -71.16 136 0.00
BaselineDBP 0.99 0.05 21.19 138 0.00
Age 0.33 0.01 33.27 126 0.00
Z -1.31 1.46 -0.89 146 0.37
R1 -0.03 0.07 -0.40 132 0.69
R2 -0.12 0.14 -0.84 114 0.40

While we have our primary analysis estimates, we also continue to examine our imputations for any anomalies.

We’ve done our primary analysis, in which we imputed the missing data under the assumption of MAR, which is often a reasonable assumption to start with. Unfortunately, we cannot verify whether or not this assumption is true, but we can vary this missing data assumption and assume that the data are missing not at random (MNAR) $\operatorname{Pr}\left(R=0 \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \psi\right)$, and that in the individuals with missing data, we can assume different numbers than those with complete data. Our likelihood-ratio test also suggests that there’s very little difference between the full model and the reduced model, so we generally will go with the full model.

## SA I: Controlled MI

Now suppose we wish to handle these missing data under the assumption of MNAR, we would now be conducting a sensitivity analysis because we are still targeting the same estimand, but only varying the assumptions. The question / target remain the same. To handle missing data that we assume are MNAR, there are a number of different and complex approaches.

We start off with the $\delta$-adjustment, controlled multiple imputation method,23 in which we assume that the group with missing observations differs systematically from the group with complete observations by a certain quantity. We produce a range of these quantities and add or subtract them from the imputed values and then conduct our analysis on this adjusted, imputed dataset. Here I will make the assumption that the average PostDBP estimate from the missing value group differs from at least 0 mmHg to at most 20 mmHg from the group with no missing observations.

# Controlled MI  -------------------------------------

ini <- mice(BP_miss, method = "pmm", maxit = 0,
predictorMatrix = pred)

ini$method["BaselineDBP"] <- "" ini$method["PostDBP"] <- "pmm"
ini$method["Age"] <- "" ini$method["R1"] <- ""
ini$method["Group"] <- "" ini$method["R2"] <- ""
ini$method["Z"] <- "" meth <- ini$method

delta <- seq(-20, 20, 5)

imp.all <- vector("list", length(delta))
post <- ini$post for (i in 1:length(delta)) { d <- delta[i] cmd <- paste("imp[[j]][, i] <- imp[[j]][, i] +", d) post["PostDBP"] <- cmd imp <- mice(BP_miss, method = meth, predictorMatrix = pred, m = 5, maxit = 1, donors = 5, cores = cores, ridge = 0, formulas = form, remove.collinear = TRUE, remove.constant = TRUE, allow.na = TRUE, post = post, printFlag = FALSE) imp.all[[i]] <- imp } Now that we have our imputed datasets that are modified by $\delta$-adjustments, we can fit our models to each imputed dataset and combine the estimates using Rubin’s rules. # Sensitivity Analysis 1 ------------------------------------- output <- sapply(imp.all, function(x) mice::pool(with(x, rlm(PostDBP ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2, method = "MM", psi = psi.huber))), simplify = "array", USE.NAMES = TRUE) a <- (as.data.frame(t(output)))$pooled
r <- array(rep(NA, length(delta) * 7),
dim = c(length(delta), 7),
dimnames = list(c(delta),
c("Intercept", "Group",
"BaselineDBP", "Age",
"Z", "R1", "R2")))
for (i in 1:length(delta)) {
r[i, ] <- cbind(a[[i]][["estimate"]])
}

r <- as.data.frame(r)
r <- data.frame(Delta = row.names(r), r, row.names = NULL)

Delta Intercept Group BaselineDBP Age Z R1 R2
-20 -107.61 -11.81 1.95 0.30 -2.98 0.08 -1.04
-15 -53.50 -10.76 1.49 0.31 -2.14 0.03 -0.57
-10 -26.05 -10.24 1.27 0.32 -1.69 0.00 -0.38
-5 -6.16 -9.88 1.12 0.33 -1.29 -0.02 -0.26
0 10.25 -9.58 0.99 0.33 -1.16 -0.06 -0.13
5 30.07 -9.23 0.84 0.33 -0.79 -0.08 -0.03
10 50.22 -8.87 0.68 0.34 -0.34 -0.10 0.09
15 79.01 -8.34 0.45 0.34 0.26 -0.13 0.29
20 137.18 -7.25 -0.04 0.35 1.74 -0.20 0.82

The table above gives the various $\delta$’s (the first column with values ranging from -30 - 30 in increments of 5) that were added to the multiply imputed datasets, and the corresponding estimates from the models fit to those datasets. So now we’ve conducted our first sensitivity analysis using the $\delta$-adjustment, controlled multiple imputation method. This form of sensitivity analysis is also typically what is used and preferred by regulators.23

Regulators prefer simple methods that impute the missing outcomes under MAR, and then add an adjustment δ to the imputes, while varying δ over a plausible range and independently for each treatment group (Permutt 2016).

The most interesting scenarios will be those where the difference between the δ’s correspond to the size of the treatment effect in the completers. Contours of the p-values may be plotted on a graph as a function of the δ’s to assist in a tipping-point analysis (Liublinska and Rubin 2014).

And as expected, lower $\delta$ values gave us results that were consistent with the primary analysis, while the upper end of the extreme $\delta$ values gave us estimates that were substantially off the mark.

Next, we quickly examine the state of a random sample of these imputations to make sure they’re reliable.

Now we move onto our next and final sensitivity analysis.

## SA II: Selection Models

To further explore the MNAR assumption, we could also use a Heckman selection model $P(Y,R)=P(Y)P(R|Y)$, as described by van Buuren here20

The selection model multiplies the marginal distribution $P(Y)$ in the population with the response weights $P(R|Y)$. Both $P(Y)$ and $P(R|Y)$ are unknown, and must be specified by the user. The model where $P(Y)$ is normal and where $P(R|Y)$ is a probit model is known as the Heckman model. This model is widely used in economics to correct for selection bias.

We can implement this in R using the miceMNAR package, which is an extension to the mice R package.

# Selection Models  -------------------------------------

JointModelEq <- generate_JointModelEq(data = BP_miss,
varMNAR = "PostDBP")

JointModelEq[,"PostDBP_var_sel"] <- c(0, 1, 1, 1, 1, 1, 1)
JointModelEq[,"PostDBP_var_out"] <- c(0, 1, 1, 1, 0, 1, 1)

arg <- MNARargument(data = BP_miss, varMNAR = "PostDBP",
JointModelEq = JointModelEq)

imp2 <- mice(data = arg$data_mod, method = "pmm", seed = 1031, predictorMatrix = arg$predictorMatrix,
JointModelEq = arg$JointModelEq, control = arg$control,
remove.collinear = TRUE,
remove.constant = FALSE, allow.na = FALSE,
printFlag = FALSE, visitSequence = "monotone")

imp2$data$Group <- as.factor(imp2$data$Group)

Now we fit our models to these new datasets and pool them using Rubin’s rules.

# Sensitivity Analysis 2  -------------------------------------

analysis3 <- with(imp2, rlm(PostDBP ~ factor(Group)  +
BaselineDBP + Age + Z + R1 + R2,
method = "MM", psi = psi.huber))
result3 <- mice::pool(analysis3)

results3_sum <- summary(result3)
colnames(results3_sum) <- c("Term", "Estimate", "SE",
"Statistic", "df", "P-val")

Term Estimate SE Statistic df P-val
(Intercept) 10.12 6.14 1.65 11.38 0.13
factor(Group)Drug Y -9.50 0.20 -47.69 6.51 0.00
BaselineDBP 0.99 0.06 18.01 9.96 0.00
Age 0.33 0.01 27.99 8.89 0.00
Z -1.09 1.85 -0.59 8.84 0.57
R1 -0.02 0.08 -0.30 12.16 0.77
R2 -0.15 0.16 -0.96 9.70 0.36

Now we examine our imputed values.

so now we have completed our second sensitivity analysis.

The results from both the primary analysis, the first sensitivity analysis, and the second seem somewhat consistent. However, with the first sensitivity analysis where we applied $\delta$-adjustments, as the adjustments became larger, the coefficients changed drastically. However, such large $\delta$-adjustments are not realistic and those that were smaller led to coefficients that were closer to the other two analyses.

## Supplementary Analyses

The following analyses are what I mostly consider to be supplementary analyses that may further investigate some violations of assumptions but often go far beyond what would be necessary for a principled sensitivity analysis that would typically accompany a primary analysis.

First, I like to check that the results from my analyses are consistent across different statistical software So I may check to see that the results from my primary analysis from R are also similar to the results from another statistical software suite like Stata.

So in Stata, I would likely run something similar to the following in order to mimic the primary analysis in R.

clear all
#> . clear .
#> . set seed 1031
#>
#> .
#> . parallel setclusters 8, statapath()
#> N Child processes: 8
#> Stata dir:  /Applications/Stata/StataMP.app/Contents/MacOS/stata-mp
#>
#> .
#> . parallel initialize 8, statapath()
#> N Child processes: 8
#> Stata dir:  /Applications/Stata/StataMP.app/Contents/MacOS/stata-mp
#>
#> .
#> . import delimited "~/Dropbox/LessLikely/static/datasets/temp.csv", numericcols(2
#> > 3 5 6 7 8)
#> (encoding automatically selected: ISO-8859-2)
#> (8 vars, 500 obs)
#>
#> .
#> . summarize
#>
#>     Variable |        Obs        Mean    Std. dev.       Min        Max
#> -------------+---------------------------------------------------------
#>           v1 |        500       250.5    144.4818          1        500
#>      postdbp |        209    135.0179    4.550848   125.9651     149.47
#>  baselinedbp |        500    116.9717    1.961848    110.615   123.7551
#>        group |        500        .508    .5004367          0          1
#>          age |        500     49.4683     7.08616   26.96167   70.72639
#> -------------+---------------------------------------------------------
#>            z |        500   -9.29e-11    .0626851   -.209414   .2471141
#>           r1 |        500   -.0715254    .9988079  -3.568279   3.036855
#>           r2 |        500    .9799438    .4913294  -.6787649   2.356054
#>
#> .
#> . mi set mlong
#>
#> .
#> . mi register imputed baselinedbp postdbp age z group r1 r2
#> (291 m=0 obs now marked as incomplete)
#>
#> .
#> . mi impute chained (pmm, knn(5)) postdbp = baselinedbp age z r1 r2 group, burnin(
#> > 20) add(100) rseed (1031) savetrace(trace1, replace) nomonotone
#> note: missing-value pattern is monotone.
#>
#> Conditional models:
#>            postdbp: pmm postdbp baselinedbp age z r1 r2 group , knn(5)
#>
#> Performing chained iterations ...
#>
#> Multivariate imputation                     Imputations =      100
#> Chained equations                                 added =      100
#> Imputed: m=1 through m=100                      updated =        0
#>
#> Initialization: monotone                     Iterations =     2000
#>                                                 burn-in =       20
#>
#>            postdbp: predictive mean matching
#>
#> ------------------------------------------------------------------
#>                    |               Observations per m
#>                    |----------------------------------------------
#>           Variable |   Complete   Incomplete   Imputed |     Total
#> -------------------+-----------------------------------+----------
#>            postdbp |        209          291       291 |       500
#> ------------------------------------------------------------------
#> (Complete + Incomplete = Total; Imputed is the minimum across m
#>  of the number of filled-in observations.)
#>
#> .
#> . mi estimate, mcerror: rreg postdbp i.group baselinedbp age z r1 r2
#>
#> Multiple-imputation estimates                   Imputations       =        100
#> Robust regression                               Number of obs     =        500
#>                                                 Average RVI       =     1.6877
#>                                                 Largest FMI       =     0.7159
#>                                                 Complete DF       =        493
#> DF adjustment:   Small sample                   DF:     min       =      82.46
#>                                                         avg       =     127.69
#>                                                         max       =     191.63
#> Model F test:       Equal FMI                   F(   6,  374.8)   =    1171.55
#>                                                 Prob > F          =     0.0000
#>
#> ------------------------------------------------------------------------------
#>      postdbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
#> -------------+----------------------------------------------------------------
#>      1.group |      10.46   .1504111    69.54   0.000     10.16183    10.75818
#>              |   .0118239   .0053087     2.46   0.000     .0144955    .0173849
#>              |
#>  baselinedbp |   .9542017   .0509793    18.72   0.000     .8532671    1.055136
#>              |   .0038745    .002412     0.91   0.000     .0063144    .0061611
#>              |
#>          age |   .3229139   .0085025    37.98   0.000     .3061434    .3396845
#>              |   .0005449    .000226     1.02   0.000     .0007102    .0007091
#>              |
#>            z |  -2.622068   1.511717    -1.73   0.085    -5.611361    .3672254
#>              |   .1101344    .054734     0.09   0.016     .1620533    .1500265
#>              |
#>           r1 |   -.360883   .0856236    -4.21   0.000    -.5312012   -.1905648
#>              |   .0071805   .0051837     0.29   0.000      .011663    .0138291
#>              |
#>           r2 |  -.1834333   .1392627    -1.32   0.190    -.4588711    .0920045
#>              |   .0102243   .0053406     0.09   0.030     .0145569    .0151887
#>              |
#>        _cons |   6.511102    5.96798     1.09   0.277     -5.30351    18.32571
#>              |   .4518723   .2785653     0.09   0.040     .7162938    .7298489
#> ------------------------------------------------------------------------------
#> Note: Values displayed beneath estimates are Monte Carlo error estimates.
#>
#> .
#> . quietly mi passive: generate byte imputed = _mi_miss
#>
#> .
#> . describe
#>
#> Contains data
#>  Observations:        29,600
#>     Variables:            12
#> ----------------------------------------------------------------------------------
#> Variable      Storage   Display    Value
#>     name         type    format    label      Variable label
#> ----------------------------------------------------------------------------------
#> v1              int     %8.0g
#> postdbp         float   %8.0g                 PostDBP
#> baselinedbp     float   %9.0g                 BaselineDBP
#> group           byte    %8.0g                 Group
#> age             float   %9.0g                 Age
#> z               float   %9.0g                 Z
#> r1              float   %9.0g                 R1
#> r2              float   %9.0g                 R2
#> _mi_miss        byte    %8.0g
#> imputed         byte    %8.0g
#> _mi_m           int     %8.0g
#> _mi_id          int     %8.0g
#> ----------------------------------------------------------------------------------
#> Sorted by: _mi_m  _mi_id
#>      Note: Dataset has changed since last saved.

As you can see from above, I imported the missing data I generated in R into Stata, and then used it to multiply impute the datasets using the predictive mean matching method with 5 donors. I chose the exact same number of variables to include in the imputation model as I did with the primary and sensitivity analyses. Just like the primary analysis, I also used robust regression, rreg in Stata, to analyze the datasets. The numbers seem to be fairly consistent with those from the primary analysis and the some of the results from the sensitivity analyses.

I will now conduct another supplementary analysis, in which I fit a Bayesian regression model using a t-distribution so that it is analogous to the other robust regression models I have utilized so far, and I will specify a weakly informative prior. I will fit one model using the brm_multiple() function which will fit the model to each of the imputed datasets from before, while I fit another Bayesian model which will impute values during the model fitting process and compare the results.

# Quantile Regression  --------------------------

implist <- mitml::mids2mitml.list(imp1)
vcov.rq <- function(object, ...) summary(object,
se = "nid",
covariance = TRUE)$cov fit1 <- with(implist, rq(PostDBP ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2, tau = 0.5)) mitml::testEstimates(fit1) #> #> Call: #> #> mitml::testEstimates(model = fit1) #> #> Final parameter estimates and inferences obtained from 64 imputed data sets. #> #> Estimate Std.Error t.value df P(>|t|) RIV FMI #> (Intercept) 9.750 6.596 1.478 265.309 0.141 0.950 0.491 #> factor(Group)Drug Y -9.465 0.162 -58.283 268.058 0.000 0.941 0.489 #> BaselineDBP 0.993 0.056 17.726 269.331 0.000 0.937 0.487 #> Age 0.337 0.011 30.523 304.349 0.000 0.835 0.459 #> Z -1.278 1.752 -0.729 298.242 0.466 0.851 0.463 #> R1 -0.054 0.083 -0.643 285.295 0.521 0.887 0.474 #> R2 -0.111 0.170 -0.652 190.065 0.515 1.357 0.580 #> #> Unadjusted hypothesis test as appropriate in larger samples. confint(mitml::testEstimates(fit1)) #> 2.5 % 97.5 % #> (Intercept) -3.237 22.738 #> factor(Group)Drug Y -9.784 -9.145 #> BaselineDBP 0.883 1.104 #> Age 0.315 0.359 #> Z -4.726 2.170 #> R1 -0.218 0.111 #> R2 -0.446 0.224 # Bayesian Regression -------------------------- BP_miss$Group <- as.factor(BP_miss$Group) brm_form2 <- bf(PostDBP | mi() ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2) prior1 <- c(prior(normal(0, 100), class = Intercept), prior(normal(0, 100), class = b), prior(normal(0, 100), class = sigma)) b_2 <- brm(brm_form2, data = BP_miss, prior = prior1, silent = 2, family = student(), threads = 8, iter = 4000, refresh = 0, backend = "cmdstanr", warmup = 2000, chains = 4, seed = 1031) #> Running MCMC with 4 chains, at most 8 in parallel, with 8 thread(s) per chain... #> #> Chain 3 finished in 62.1 seconds. #> Chain 4 finished in 73.9 seconds. #> Chain 2 finished in 78.8 seconds. #> Chain 1 finished in 87.5 seconds. #> #> All 4 chains finished successfully. #> Mean chain execution time: 75.6 seconds. #> Total execution time: 87.6 seconds. post <- summary(b_2)[] summary(b_2) #> Family: student #> Links: mu = identity; sigma = identity; nu = identity #> Formula: PostDBP | mi() ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2 #> Data: BP_miss (Number of observations: 500) #> Draws: 4 chains, each with iter = 4000; warmup = 2000; thin = 1; #> total post-warmup draws = 8000 #> #> Population-Level Effects: #> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS #> Intercept 11.23 5.66 0.17 22.46 1.00 5214 5557 #> factorGroupDrugY -9.59 0.13 -9.85 -9.32 1.00 6888 6042 #> BaselineDBP 0.98 0.05 0.89 1.08 1.00 5167 5707 #> Age 0.33 0.01 0.32 0.35 1.00 6033 6483 #> Z -0.94 1.48 -3.86 1.98 1.00 5493 6155 #> R1 -0.02 0.07 -0.16 0.12 1.00 6249 6331 #> R2 -0.14 0.13 -0.39 0.12 1.00 6555 6619 #> #> Family Specific Parameters: #> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS #> sigma 0.99 0.05 0.89 1.10 1.00 3230 5491 #> nu 26.10 14.13 8.19 61.29 1.00 7172 5393 #> #> Draws were sampled using sample(hmc). For each parameter, Bulk_ESS #> and Tail_ESS are effective sample size measures, and Rhat is the potential #> scale reduction factor on split chains (at convergence, Rhat = 1). We can also compare the results we get from brms/Stan with MCMCpack. posterior_95 <- MCMCquantreg(PostDBP ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2, data = BP_miss, tau = 0.95, mcmc = 5000, chains = 4, thin = 10, seed = 1031) summary(posterior_95) #> #> Iterations = 1001:5991 #> Thinning interval = 10 #> Number of chains = 1 #> Sample size per chain = 500 #> #> 1. Empirical mean and standard deviation for each variable, #> plus standard error of the mean: #> #> Mean SD Naive SE Time-series SE #> (Intercept) 10.0201 19.7535 0.88340 0.99771 #> factor(Group)Drug Y -9.6461 0.4302 0.01924 0.02502 #> BaselineDBP 1.0216 0.1686 0.00754 0.00852 #> Age 0.3109 0.0319 0.00143 0.00143 #> Z -3.1186 5.5476 0.24810 0.27155 #> R1 -0.0275 0.2707 0.01210 0.01401 #> R2 -0.4018 0.3971 0.01776 0.01933 #> #> 2. Quantiles for each variable: #> #> 2.5% 25% 50% 75% 97.5% #> (Intercept) -27.140 -2.353 10.2582 21.561 50.228 #> factor(Group)Drug Y -10.537 -9.927 -9.6197 -9.374 -8.842 #> BaselineDBP 0.670 0.925 1.0188 1.131 1.333 #> Age 0.249 0.289 0.3097 0.333 0.373 #> Z -14.156 -6.870 -3.2881 0.843 7.019 #> R1 -0.536 -0.215 -0.0178 0.144 0.492 #> R2 -1.244 -0.651 -0.3759 -0.132 0.303 Next, we examine the results from the Bayesian model in which values were imputed during model fitting. posterior_summary(b_2)[1:8] (post <- posterior_samples(b_2)) They seem to be similar to the results we’ve been seeing so far. Indeed, it is no surprise that the Bayesian model with a weakly dispersed prior would give similar results to the base rlm() function. We now look at robust methods and resampling and the results they give. Although this is currently under the supplementary analysis section, I would very much consider these next two analyses to be sensitivity analyses. The first method utilized here is known as the random indicator method, and it is an experimental method that essentially combines a selection model and a pattern-mixture model and is designed to handle data that are missing not at random. As van Buuren writes20 The random indicator method (Jolani 2012)24 is an experimental iterative method that redraws the missing data indicator under a selection model, and imputes the missing data under a pattern-mixture model, with the objective of estimating δ from the data under relaxed assumptions. Initial simulation results look promising. The algorithm is available as the ri method in mice. The script below utilizes this method (random indicator), but bootstraps the missing dataset before the imputation procedure, and then after multiply imputing the datasets, obtains estimates using maximum likelihood multiple (MLMI) imputation, an alternative and more efficient method to the more popular posterior draw multiple imputation (PDMI). More about MLMI can be found here.25 # Bootstrap Inference ------------------------------------- bootMice(BP_miss, method = "ri", maxit = 5, predictorMatrix = pred, nBoot = 96, nImp = 2, seed = 1031, formulas = form, nCores = 8L, ridge = 0, remove.collinear = TRUE, remove.constant = TRUE, allow.na = TRUE, printFlag = FALSE, visitSequence = "monotone") -> imps analyseImp <- function(inputData) { return(coef(rlm(PostDBP ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2, data = inputData, method = "MM", psi = psi.huber))) } analyseImp1 <- function(inputData) { return(coef(glm(PostDBP ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2, data = inputData, family = gaussian()))) } ests <- bootImputeAnalyse(imps, analyseImp, quiet = TRUE) ests1 <- bootImputeAnalyse(imps, analyseImp1, quiet = TRUE) ests <- as.data.frame(ests) colnames(ests) <- c("Estimate", "Var", "CI.ll", "CI.ul", "df") rownames(ests) <- c("Intercept", "Group", "BaselineDBP", "Age", "Z", "R1", "R2") ests1 <- as.data.frame(ests1) colnames(ests1) <- c("Estimate", "Var", "CI.ll", "CI.ul", "df") rownames(ests1) <- c("Intercept", "Group", "BaselineDBP", "Age", "Z", "R1", "R2") Estimate Var CI.ll CI.ul df Intercept 12.13 20.79 2.88 21.38 35.2 Group -9.55 0.03 -9.87 -9.24 60.3 BaselineDBP 0.97 0.00 0.89 1.05 38.2 Age 0.34 0.00 0.32 0.36 51.5 Z -0.69 1.83 -3.40 2.03 50.5 R1 -0.02 0.01 -0.18 0.14 51.6 R2 -0.14 0.02 -0.41 0.13 53.9 Estimate Var CI.ll CI.ul df Intercept 11.79 20.94 2.50 21.07 35.2 Group -9.56 0.02 -9.88 -9.25 63.9 BaselineDBP 0.98 0.00 0.90 1.06 38.2 Age 0.33 0.00 0.31 0.35 53.4 Z -0.73 1.91 -3.50 2.04 53.9 R1 -0.01 0.01 -0.17 0.14 52.4 R2 -0.15 0.02 -0.42 0.13 54.6 For our second supplementary analysis, We will utilize the GAMLSS method, which is an extension of the generalized linear model and the generalized additive model. For the vast majority of the imputations we have conducted, we have drawn values from an assumed normal distribution. Thus, GAMLSS allows us to be far more flexible with our assumptions. However, it is also worth noting that general normal imputations are robust against violations of the assumption of normality In general, normal imputations appear to be robust against violations of normality. Demirtas, Freels, and Yucel (2008) found that flatness of the density, heavy tails, non-zero peakedness, skewness and multimodality do not appear to hamper the good performance of multiple imputation for the mean structure in samples n > 400, even for high percentages (75%) of missing data in one variable. The variance parameter is more critical though, and could be off-target in smaller samples. Regardless, we will be using the GAMLSS method and also imputing from a t-distribution, which is much more robust to outliers. gamt <- mice(BP_miss, method = "gamlssTF", predictorMatrix = pred, m = 5, maxit = 1, seed = 1031, cores = cores, formulas = form, ridge = 0, remove.collinear = TRUE, remove.constant = TRUE, allow.na = TRUE, printFlag = FALSE) gt1 <- with(gamt, rlm(PostDBP ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2)) resultsgt <- mice::pool(gt1)  ztable(summary(resultsgt)) term estimate std.error statistic df p.value (Intercept) 13.23 4.95 2.67 18.2 0.02 factor(Group)Drug Y -9.52 0.13 -76.02 15.4 0.00 BaselineDBP 0.97 0.04 23.48 20.2 0.00 Age 0.32 0.01 35.44 14.8 0.00 Z -0.62 1.39 -0.44 15.9 0.66 R1 -0.02 0.07 -0.34 13.1 0.74 R2 -0.14 0.10 -1.35 42.1 0.19 Having done that, I shall try one more supplementary analysis which piqued my curiosity. I recently came across a few papers that concluded that random forests outperformed other imputation algorithms in recovering the true parameter values. Generally, imputation algorithms are evaluated based on a general simulation design which often resembles the following: 1. Sampling mechanism only. The basic simulation steps are: choose $Q$, take samples $Y^{(s)}$, fit the complete-data model, estimate $\hat{Q}^{(s)}$ and $U^{(s)}$ and calculate the outcomes aggregated over $s$. 1. Sampling and missing data mechanisms combined. The basic simulation steps are: choose $Q$, take samples $Y^{(s)}$, generate incomplete data $Y_{\mathrm{obs}}^{(s, t)}$, impute, estimate $\hat{Q}^{(s, t)}$ and $T^{(s, t)}$ and calculate outcomes aggregated over $s$ and $t$. 1. Missing data mechanism only. The basic simulation steps are: choose $(\hat{Q}, U)$, generate incomplete data $Y_{o b s}^{(t)}$, impute, estimate $(Q, U)^{(t)}$ and $B^{(t)}$ and calculate outcomes aggregated over $t$. These designs generally lead to evaluating a number of metrics which include the following: • Raw Bias: where $RB$ is bias and $Q$ is the difference between the estimate and the truth $RB=E(Q)-Q$, which may also be expressed as a percent $100*\frac{RB=E(Q)-Q}{Q}$ • Coverage Rate: which is simply the proportion of times the intervals actually contain the true parameter value • Average Width: the width of the interval estimates, with longer widths indicating less information • Root mean squared error (RMSE) $\operatorname{RMSE}=\sqrt{\frac{1}{n_{\text {mis }}} \sum_{i=1}^{n_{\text {mits }}}\left(y_{i}^{\text {mis }}-\dot{y}_{i}\right)^{2}}$ The authors primarily based their conclusions on the root mean squared error (RMSE). However, this is problematic for a number of reasons; imputation is not the same as prediction, and so the same loss functions are not as useful; indeed, the RMSE is highly misleading as a measure of performance. van Buuren explains here: It is well known that the minimum RMSE is attained by predicting the missing $\dot{{y}}_{i}$ by the linear model with the regression weights set to their least squares estimates. According to this reasoning the “best” method replaces each missing value by its most likely value under the model. However, this will find the same values over and over, and is single imputation. This method ignores the inherent uncertainty of the missing values (and acts as if they were known after all), resulting in biased estimates and invalid statistical inferences. Hence, the method yielding the lowest RMSE is bad for imputation. More generally, measures based on similarity between the true and imputed values do not separate valid from invalid imputation methods. Having set up our simulation design, we can now test the performances between regression imputation norm.predict and and regular regression imputation norm and we will be examining the metrics above to evaluate performance. Regression imputation generally gives very biased results given that it uses its predictions to impute the missing values. Thus, I will show here how RMSE is highly misleading as a metric to evaluate the performance of imputation algorithms. Our simulation above will generate a dataset, delete values from the dataset and then impute the missing values for that dataset using regression imputation and stochastic regression imputation. We will use two metrics to evaluate the performance of these two algorithms, RMSE and MAE. Once again, it is worth keeping in mind that regression imputation generally performs very poorly when it comes to imputation and that stochastic regression imputation performs better. What this simulation clearly shows is that despite regression imputation norm.predict giving highly biased results as usual relative to the truth, it still has a much lower RMSE than stochastic regression imputation, which both had a higher RMSE and a closer distance to the correct parameter values. Once again, this is not surprising given that imputation is not the same thing as prediction and therefore the loss functions cannot be used synonymously. However, when it came to raw bias and mean absolute error, stochastic regression imputation outperformed regression imputation which had a much larger mean absolute error and raw bias. This is why the most informative metrics to focus on when evaluating an imputation algorithm include information/width, sufficient long-run coverage, and less misleading metrics such as mean absolute error and mean absolute percentage error. The authors specifically used the missForest package. I tried using the package, but encountered too many difficulties so I simply specified the rf method in mice, and worked from there. # MI Using Random Forest ------------------------------------- form <- list(PostDBP ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2) rf <- parlmice(BP_miss, method = "rf", ntree = 20, predictorMatrix = pred, m = 10, maxit = 5, cluster.seed = 1031, cores = cores, formulas = form, ridge = 0, remove.collinear = TRUE, remove.constant = TRUE, allow.na = TRUE, printFlag = FALSE, visitSequence = "monotone") rf1 <- with(rf, rlm(PostDBP ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2)) results5 <- mice::pool(rf1) results5_sum <- summary(results5) colnames(results5_sum) <- c("Term", "Estimate", "SE", "Statistic", "df", "P-val") ztable(results5_sum) Term Estimate SE Statistic df P-val (Intercept) 41.38 9.24 4.48 26.9 0.00 factor(Group)Drug Y -9.32 0.19 -48.05 50.4 0.00 BaselineDBP 0.74 0.08 9.31 25.7 0.00 Age 0.28 0.01 21.98 90.5 0.00 Z 1.55 2.53 0.61 26.2 0.54 R1 0.03 0.11 0.24 43.7 0.81 R2 0.04 0.19 0.19 55.8 0.85 ## Full Analysis Set After we run all those analyses, we can finally check our numbers from the full dataset with no missing observations to see how our approaches to handle the missing data have performed. We first look at the fully adjusted model fit with rlm(). # Full Analysis Set ------------------------------------- final_full_mod <- rlm(PostDBP ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2, data = BP_full, method = "MM", psi = psi.huber) Value Std. Error t value (Intercept) 9.19 3.72 2.47 factor(Group)Drug Y -9.66 0.09 -106.85 BaselineDBP 1.00 0.03 31.68 Age 0.33 0.01 50.17 Z -2.02 1.01 -2.00 R1 0.01 0.05 0.21 R2 -0.01 0.09 -0.09 Term VIF VIF_CI_low VIF_CI_high SE_factor Tolerance Tolerance_CI_low Tolerance_CI_high factor(Group) 1.01 1.00 489.74 1.00 0.99 0.00 1.00 BaselineDBP 1.97 1.75 2.26 1.40 0.51 0.44 0.57 Age 1.02 1.00 4.09 1.01 0.98 0.24 1.00 Z 1.98 1.75 2.27 1.41 0.51 0.44 0.57 R1 1.01 1.00 33.65 1.01 0.99 0.03 1.00 R2 1.01 1.00 1603.17 1.00 0.99 0.00 1.00 We can also graphically inspect our assumptions. Then we look at the unadjusted model, where the difference is the group that the participants were randomized to. # Unadjusted Model ------------------------------------- final_reduced_mod <- rlm(PostDBP ~ factor(Group) , data = BP_full, method = "MM", psi = psi.huber) final_reduced_mod_sum <- summary(final_reduced_mod)[["coefficients"]] colnames(final_reduced_mod_sum) <- c("Value", "Std. Error", "t-value") Value Std. Error t-value (Intercept) 142.6 0.20 728.8 factor(Group)Drug Y -9.2 0.28 -32.7 We can also compare the two to see if there are any substantial differences. Name Model AIC AIC_wt BIC BIC_wt RMSE Sigma final_full_mod rlm 1420 1 1453 1 0.98 0.99 final_reduced_mod rlm 2556 0 2569 0 3.10 3.11 While the results have mostly been consistent, I also wanted to run a few additional analyses of my own to see whether these effect estimates were consistent I first started by constructing the confidence distribution using the concurve package to see the range of parameter values consistent with the data given the background model assumptions.26, 27 # Compatibility of Values ------------------------------------- fit1 <- glm(PostDBP ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2, data = BP_full, family = gaussian()) summary(fit1) #> #> Call: #> glm(formula = PostDBP ~ factor(Group) + BaselineDBP + Age + Z + #> R1 + R2, family = gaussian(), data = BP_full) #> #> Deviance Residuals: #> Min 1Q Median 3Q Max #> -3.350 -0.638 0.048 0.690 2.311 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 9.02736 3.65879 2.47 0.014 * #> factor(Group)Drug Y -9.64556 0.08906 -108.30 <2e-16 *** #> BaselineDBP 1.00567 0.03120 32.23 <2e-16 *** #> Age 0.32696 0.00645 50.69 <2e-16 *** #> Z -1.97391 0.99551 -1.98 0.048 * #> R1 0.01071 0.04711 0.23 0.820 #> R2 -0.02493 0.08799 -0.28 0.777 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> (Dispersion parameter for gaussian family taken to be 0.983) #> #> Null deviance: 15519.32 on 499 degrees of freedom #> Residual deviance: 484.49 on 493 degrees of freedom #> AIC: 1419 #> #> Number of Fisher Scoring iterations: 2 fit2 <- glm(PostDBP ~ factor(Group), data = BP_full, family = gaussian()) summary(fit2) #> #> Call: #> glm(formula = PostDBP ~ factor(Group), family = gaussian(), data = BP_full) #> #> Deviance Residuals: #> Min 1Q Median 3Q Max #> -12.354 -2.042 -0.059 1.957 7.965 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 142.596 0.193 737.3 <2e-16 *** #> factor(Group)Drug Y -9.263 0.278 -33.3 <2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> (Dispersion parameter for gaussian family taken to be 9.65) #> #> Null deviance: 15519.3 on 499 degrees of freedom #> Residual deviance: 4805.8 on 498 degrees of freedom #> AIC: 2556 #> #> Number of Fisher Scoring iterations: 2 curve1 <- curve_gen(fit1, var = "factor(Group)Drug Y", method = "lm", log = FALSE, steps = 1000, table = TRUE) Lower Limit Upper Limit Interval Width Interval Level (%) CDF P-value S-value (bits) -9.67 -9.62 0.057 25.0 0.625 0.750 0.415 -9.71 -9.59 0.120 50.0 0.750 0.500 1.000 -9.75 -9.54 0.205 75.0 0.875 0.250 2.000 -9.76 -9.53 0.228 80.0 0.900 0.200 2.322 -9.77 -9.52 0.256 85.0 0.925 0.150 2.737 -9.79 -9.50 0.293 90.0 0.950 0.100 3.322 -9.82 -9.47 0.349 95.0 0.975 0.050 4.322 -9.85 -9.45 0.399 97.5 0.988 0.025 5.322 -9.88 -9.42 0.459 99.0 0.995 0.010 6.644 Table of Statistics for Various Interval Estimate Percentiles Indeed, our confidence distribution ends up being very similar to our Bayesian posterior distribution for the treatment effect. Under certain situations, the maximum likelihood estimate ends up also being equivalent to the median unbiased estimate. In this case, it happens to be very similar to the maximum a posteriori estimate. $C V_{n}(\theta)=1-2\left|H_{n}(\theta)-0.5\right|=2 \min \left\{H_{n}(\theta), 1-H_{n}(\theta)\right\}$ Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS Intercept 11.23 5.66 0.17 22.46 1 5214 5557 factorGroupDrugY -9.59 0.13 -9.85 -9.32 1 6888 6042 BaselineDBP 0.98 0.05 0.89 1.08 1 5167 5707 Age 0.33 0.01 0.32 0.35 1 6033 6483 Z -0.94 1.48 -3.86 1.98 1 5493 6155 R1 -0.02 0.07 -0.16 0.12 1 6249 6331 R2 -0.14 0.13 -0.39 0.12 1 6555 6619 We can also check to see that our confidence distribution and posterior distribution are consistent with the profile-likelihood function, which is free of nuisance parameters. The advantage of profiling is allowing us to recover more accurate parameter estimates when there are several nuisance parameters or messy data. Below are implementations in both R and Stata, except this time we do not specify errors from the t-distribution. analyseImp3 <- function(inputData, i) { dt <- inputData[i, ] (coef(glm(PostDBP ~ factor(Group) + BaselineDBP + Age + Z + R1 + R2, data = dt, family = gaussian() ))[2]) } bootdist <- boot::boot(data = BP_full, statistic = analyseImp3, R = 4000, sim = "ordinary", stype = "i", parallel = "multicore", ncpus = 8) sd <- boot::boot.ci(bootdist, conf = c((1:99) / 100), type = "bca")[4] A <- sd$bca

qplot((c(A[, 4], A[, 5])), geom = "density", fill = I("#3f8f9b44")) +
theme_less() +
labs(title= "bCA Bootstrap Density Plot",
x = "'Group' Parameter Value (Treatment Effect mmHg)",
y = "Density") 

$L_{t_{n}}(\theta)=f_{n}\left(F_{n}^{-1}\left(H_{p i v}(\theta)\right)\right)\left|\frac{\partial}{\partial t} \psi\left(t_{n}, \theta\right)\right|=h_{p i v}(\theta)\left|\frac{\partial}{\partial t} \psi(t, \theta)\right| /\left.\left|\frac{\partial}{\partial \theta} \psi(t, \theta)\right|\right|_{t=t_{n}}$

We now focus on profiling on the treatment effect and calculating the profile deviance function. We then multiply our deviance by -0.5 to obtain the profile log-likelihood function.

# Profile Likelihood -------------------------------------

prof <- profile(fit1)
disp <- attr(prof, "summary")$dispersion mindev <- attr(prof, "original.fit")$deviance

dev1 <- prof[[1]]$tau^2 dev2 <- dev1 * disp + mindev tmpf <- function(x, n) { data.frame(par = n, tau = x$tau,
deviance = x$tau^2 * disp + mindev, x$par.vals, check.names = FALSE)
}

pp <- do.call(rbind, mapply(tmpf, prof, names(prof),
SIMPLIFY = FALSE))

pp2 <- melt(pp, id.var = 1:3)
pp3 <- subset(pp2, par == variable, select = -variable)
pp4 <- pp3[-c(1:13), ]

pp4$loglik <- pp4$deviance * -0.5

clear all
#> . clear . set seed 1031
#>
#> . import delimited "~/Dropbox/LessLikely/static/datasets/temp.csv", numericcols(2
#> > 3 5 6 7 8)
#> (encoding automatically selected: ISO-8859-2)
#> (8 vars, 500 obs)
#>
#> . foreach var of varlist  group baselinedbp age z r1 r2 {
#>   2. grstyle init
#>   3. grstyle color background white, opacity(.1)
#>   4. quietly pllf glm postdbp group baselinedbp age z r1 r2, profile(var')
#>   5. graph copy Graph var'
#>   6. }
#>
#> . graph combine  group baselinedbp age z r1 r2
#>
#> . graph export "loglikfunction.svg", replace
#> file loglikfunction.svg saved as SVG format

Lower Limit Upper Limit CI Width Level (%) CDF P-value S-value
-9.67 -9.62 0.057 25.0 0.625 0.750 0.415
-9.71 -9.59 0.120 50.0 0.750 0.500 1.000
-9.75 -9.54 0.205 75.0 0.875 0.250 2.000
-9.76 -9.53 0.228 80.0 0.900 0.200 2.322
-9.77 -9.52 0.256 85.0 0.925 0.150 2.737
-9.79 -9.50 0.293 90.0 0.950 0.100 3.322
-9.82 -9.47 0.349 95.0 0.975 0.050 4.322
-9.85 -9.45 0.399 97.5 0.988 0.025 5.322
-9.88 -9.42 0.459 99.0 0.995 0.010 6.644
Table of Statistics for Various Interval Estimate Percentiles

I then compared the results of the full analysis set to the results from the primary analysis and sensitivity analyses.

## Examining the Analyses

Table 2: Evaluation of Imputation Algorithms
PMM Heckman Stata Bayes GAMLSS RI RF True
(Intercept) 10.34 10.12 6.84 11.23 13.23 11.79 41.38 9.19
factor(Group)Drug Y -9.54 -9.50 10.44 -9.59 -9.52 -9.56 -9.32 -9.66
BaselineDBP 0.99 0.99 0.95 0.98 0.97 0.98 0.74 1.00
Age 0.33 0.33 0.32 0.33 0.32 0.33 0.28 0.33
Z -1.31 -1.09 -2.64 -0.94 -0.62 -0.73 1.55 -2.02
R1 -0.03 -0.02 -0.36 -0.02 -0.02 -0.01 0.03 0.01
R2 -0.12 -0.15 -0.17 -0.14 -0.14 -0.15 0.04 -0.01
MPE -145.79 -214.28 243.82 -190.96 -194.37 -220.40 42.63 0.00
Huber Loss 0.13 0.13 3.10 0.30 0.64 0.41 4.98 0.00
MAE 0.31 0.32 3.38 0.48 0.83 0.60 5.21 0.00
MSD -0.26 -0.26 -2.37 -0.43 -0.77 -0.54 -5.12 0.00
Metric Abbreviations:
MPE = Mean Percentage Error; MAE = Mean Absolute Error; MSD = Mean Signed Deviation
MI Abbreviations:
PMM = Predictive Mean Matching; Heckman = Heckman Selection Model; RI = Random Indicator; RF = Random Forest; Bayes = Imputation During Model Fitting

We can see that that the overall treatment effect is approximately 9.7-9.9, and for most of the principled methods we used to handle the missing data, we did not obtain similar estimates. For example, both the primary analysis and the controlled multiple imputation methods did not yield the correct treatment effect estimate in the adjusted model.

In fact, nearly all of the methods used except for the Heckman selection model, yielded overestimates of the treatment effect. Even though such analyses can be substantially improved with more specification and more information, the reality is that data that are assumed to be MNAR are difficult to handle and that no method can guarantee the correct answer. Indeed, we knew the missing data model / mechanism, via our custom function

# Missing Data Mechanism -------------------------------------

lesslikely <- function() {
set.seed(1031, kind = "L'Ecuyer-CMRG",
normal.kind = "Inversion",
sample.kind = "Rejection")
logistic <- function(x) exp(x) / (1 + exp(x))
1 - rbinom(n, 1, logistic(errors[, 2]))
}

Ry <- ifelse(lesslikely() > 0, 1, 0)
PostDBP[Ry == 0] <- NA

$$$\Pr(R_1=0)=\psi_0+\frac{e^{Y_1}}{1+e^{Y_1}}\psi_1$$$

for MNAR we set $\psi_\mathrm{MNAR}=(0,1)$

Thus, we obtain the following model:

\begin{align} \mathrm{MNAR}&:\mathrm{logit}(\Pr(R_1=0)) = Y_1 \ \end{align}

Even the random indicator method, which has shown much promise in simulation studies24 and offers the advantage of not having to specify a pattern-mixture or selection model could not recover the correct treatment effect estimate either, showing the difficulty of handling missing data and the limitations of automated approaches.

I hope it is clear from the examples above that it is quite easy to do any analyses with missing data and use a variety of ad-hoc approaches or even principled approaches to explore whether or not the results you’ve obtained are trustworthy or at least even give that impression, but doing a careful and principled one is quite difficult and requires much thought that will take into account real departures from assumptions and whether or not those will lead to substantial changes in results and conclusions/decisions.

# Acknowledgements

I’d like to thank Dr. Tim Morris and Dr. Andrew Althouse for helpful comments on an earlier version of this piece. My acknowledgment of their help does not imply endorsement of my views by these colleagues or vice versa, and I remain solely responsible for the views expressed herein. All errors are my own. To point out such errors, please leave a comment below or use the contact form to get in touch.

# Statistical Environments

## R Environment

si <- sessionInfo()
print(si, RNG = TRUE, locale = TRUE)
#> R version 4.2.1 (2022-06-23)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.5.1
#>
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#>
#> Random number generation:
#>  RNG:     L'Ecuyer-CMRG
#>  Normal:  Inversion
#>  Sample:  Rejection
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#>  [1] splines   grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#>  [1] cli_3.3.0               texPreview_2.0.0        tinytex_0.41            rmarkdown_2.16.1        brms_2.17.0
#>  [6] bootImpute_1.2.0        miceMNAR_1.0.2          knitr_1.40              boot_1.3-28             gtsummary_1.6.1
#> [11] reshape2_1.4.4          ProfileLikelihood_1.2   ImputeRobust_1.3-1      gamlss_5.4-3            gamlss.dist_6.0-5
#> [16] gamlss.data_6.0-2       mvtnorm_1.1-3           performance_0.9.2       summarytools_1.0.1      tidybayes_3.0.2
#> [21] htmltools_0.5.3         Statamarkdown_0.7.1     car_3.1-0               carData_3.0-5           qqplotr_0.0.5
#> [26] ggcorrplot_0.1.3        mitml_0.4-3             pbmcapply_1.5.1         Amelia_1.8.0            Rcpp_1.0.9
#> [31] blogdown_1.11.3         doParallel_1.0.17       iterators_1.0.14        foreach_1.5.2           bayesplot_1.9.0
#> [36] wesanderson_0.3.6       VIM_6.2.2               colorspace_2.0-3        here_1.0.1              progress_1.2.2
#> [41] loo_2.5.1               mi_1.1                  Matrix_1.4-1            broom_1.0.1             yardstick_1.0.0
#> [46] svglite_2.1.0           Cairo_1.6-0             cowplot_1.1.1           mgcv_1.8-40             nlme_3.1-159
#> [51] xfun_0.32               broom.mixed_0.2.9.4     reticulate_1.26         kableExtra_1.3.4        posterior_1.3.0
#> [56] checkmate_2.1.0         parallelly_1.32.1       miceFast_0.8.1          ggmice_0.0.1            randomForest_4.7-1.1
#> [61] missForest_1.5          miceadds_3.14-3         mice.mcerror_0.0.0-9000 quantreg_5.94           MCMCpack_1.6-3
#> [66] MASS_7.3-58.1           coda_0.19-4             latex2exp_0.9.4         rstan_2.21.5            StanHeaders_2.21.0-7
#> [71] cmdstanr_0.5.3          lubridate_1.8.0         forcats_0.5.2           stringr_1.4.1           dplyr_1.0.10
#> [76] purrr_0.3.4             readr_2.1.2             tibble_3.1.8            tidyverse_1.3.2.9000    ggtext_0.1.1
#> [81] concurve_2.7.7          showtext_0.9-5          showtextdb_3.0          sysfonts_0.8.8          future.apply_1.9.0
#> [86] future_1.28.0           tidyr_1.2.0             magrittr_2.0.3          mice_3.14.0             rms_6.3-0
#> [91] SparseM_1.81            Hmisc_4.7-1             ggplot2_3.3.6           Formula_1.2-4           survival_3.4-0
#> [96] lattice_0.20-45
#>
#> loaded via a namespace (and not attached):
#>   [1] mitools_2.4          pander_0.6.5         tcltk_4.2.1          vctrs_0.4.1          usethis_2.1.6
#>   [6] gmp_0.6-6            later_1.3.0          nloptr_2.0.3         DBI_1.1.3            distributional_0.3.1
#>  [11] jpeg_0.1-9           MatrixModels_0.5-0   pspline_1.0-19       htmlwidgets_1.5.4    miscTools_0.6-26
#>  [16] inline_0.3.19        laeken_0.5.2         markdown_1.1         DEoptimR_1.0-11      DT_0.24
#>  [21] promises_1.2.0.1     pkgload_1.3.0        magick_2.7.3         RcppParallel_5.1.5   startupmsg_0.9.6
#>  [26] fs_1.5.2             mnormt_2.1.0         ranger_0.14.1        digest_0.6.29        png_0.1-7
#>  [31] polspline_1.1.20     bookdown_0.28.2      pkgconfig_2.0.3      dygraphs_1.1.1.6     estimability_1.4.1
#>  [36] minqa_1.2.4          gt_0.7.0             bslib_0.4.0          zoo_1.8-10           tidyselect_1.1.2
#>  [41] pcaPP_2.0-2          gamm4_0.2-6          viridisLite_0.4.1    pkgbuild_1.3.1       rlang_1.0.5
#>  [46] jquerylib_0.1.4      Rmpfr_0.8-9          glue_1.6.2           gdtools_0.2.4        RColorBrewer_1.1-3
#>  [51] pryr_0.1.5           distrEx_2.8.0        emmeans_1.8.0        matrixStats_0.62.0   ggsignif_0.6.3
#>  [56] threejs_0.3.3        labeling_0.4.2       httpuv_1.6.5         class_7.3-20         TH.data_1.1-1
#>  [61] clipr_0.8.0          shinystan_2.6.0      webshot_0.5.3        jsonlite_1.8.0       mime_0.12
#>  [66] systemfonts_1.0.4    gridExtra_2.3        easypackages_0.1.0   stringi_1.7.8        insight_0.18.2
#>  [71] processx_3.7.0       distr_2.8.0          gsl_2.1-7.1          survey_4.1-1         ggdist_3.2.0
#>  [76] data.table_1.14.2    survminer_0.4.9      officer_0.4.3        rstudioapi_0.14      sfsmisc_1.1-13
#>  [81] listenv_0.8.0        miniUI_0.1.1.1       survMisc_0.5.6       pan_1.6              urlchecker_1.0.1
#>  [86] sessioninfo_1.2.2    lifecycle_1.0.1      commonmark_1.8.0     munsell_0.5.0        visNetwork_2.1.0
#>  [91] codetools_0.2-18     systemfit_1.1-28     magic_1.6-0          lmtest_0.9-40        sys_3.4
#>  [96] htmlTable_2.4.1      itertools_0.1-3      rstantools_2.2.0     gWidgets2_1.0-9      xtable_1.8-4
#>  [ reached getOption("max.print") -- omitted 115 entries ]

## Stata Environment

about
#> . abStata/MP 17.0 for Mac (Apple Silicon)
#> Revision 23 Aug 2022
#>
#> Total physical memory: 16.00 GB
#>
#> Stata license: Single-user 2-core , expiring  8 Sep 2023
#> Serial number: 501709400845
#>                Atlantis

# References

1. Stark PB, Saltelli A. (2018). ‘Cargo-cult statistics and scientific crisis’. Significance. 15:40–43. doi: 10.1111/j.1740-9713.2018.01174.x.
2. Gigerenzer G. (2018). ‘Statistical Rituals: The Replication Delusion and How We Got There. Advances in Methods and Practices in Psychological Science. 1:198–218. doi: 10.1177/2515245918771329.
3. Greenland S. (2017). ‘Invited commentary: The need for cognitive science in methodology’. American Journal of Epidemiology. 186:639–645. doi: 10.1093/aje/kwx259.
4. Greenland S. (2005). ‘Multiple-bias modelling for analysis of observational data’. Journal of the Royal Statistical Society Series A (Statistics in Society). 168:267–306. doi: 10.1111/j.1467-985X.2004.00349.x.
5. Greenland S, Lash TL. (2008). ‘Bias analysis’. In: Rothman KJ, Greenland S, Lash TL, editors. Modern Epidemiology. 3rd edition. Lippincott Williams & Wilkins. p. 345–380.
6. Lash TL, Fox MP, Fink AK. (2009). ‘Applying Quantitative Bias Analysis to Epidemiologic Data. Springer New York.
7. Lash TL, Fox MP, MacLehose RF, Maldonado G, McCandless LC, Greenland S. (2014). ‘Good practices for quantitative bias analysis’. International Journal of Epidemiology. 43:1969–1985. doi: 10.1093/ije/dyu149.
8. Lash TL, Ahern TP, Collin LJ, Fox MP, MacLehose RF. (2021). ‘Bias Analysis Gone Bad. American Journal of Epidemiology. doi: 10.1093/aje/kwab072.
9. Gustafson P. (2021). ‘Invited Commentary: Toward Better Bias Analysis. American Journal of Epidemiology. doi: 10.1093/aje/kwab068.
10. Greenland S. (2021). ‘Dealing with the Inevitable Deficiencies of Bias Analysis and All Analyses. American Journal of Epidemiology. doi: 10.1093/aje/kwab069.
11. Molenberghs G, Fitzmaurice G, Kenward MG, Tsiatis A, Verbeke G. (2014). ‘Handbook of Missing Data Methodology. CRC Press.
12. Scott RaP. (2002). ‘The Multicentre Aneurysm Screening Study (MASS) into the effect of abdominal aortic aneurysm screening on mortality in men: A randomised controlled trial’. The Lancet. 360:1531–1539. doi: 10.1016/S0140-6736(02)11522-4.
13. Morris TP, Kahan BC, White IR. (2014). ‘Choosing sensitivity analyses for randomised trials: principles’. BMC Medical Research Methodology. 14:11. doi: 10.1186/1471-2288-14-11.
14. Mosteller F, Tukey JW. (1987). ‘Data analysis including statistics’. In: The Collected Works of John W. Tukey: Philosophy and Principles of Data Analysis 1965-1986. CRC Press.
15. Akacha M, Bretz F, Ohlssen D, Rosenkranz G, Schmidli H. (2017). ‘Estimands and Their Role in Clinical Trials. Statistics in Biopharmaceutical Research. 9:268–271. doi: 10.1080/19466315.2017.1302358.
16. Permutt T. (2020). ‘Do Covariates Change the Estimand?’ Statistics in Biopharmaceutical Research. 12:45–53. doi: 10.1080/19466315.2019.1647874.
17. Mitroiu M, Oude Rengerink K, Teerenstra S, Pétavy F, Roes KCB. (2020). ‘A narrative review of estimands in drug development and regulatory evaluation: Old wine in new barrels?’ Trials. 21:671. doi: 10.1186/s13063-020-04546-1.
18. Little RJ, D’Agostino R, Cohen ML, Dickersin K, Emerson SS, Farrar JT, et al. (2012). ‘The Prevention and Treatment of Missing Data in Clinical Trials. New England Journal of Medicine. 367:1355–1360. doi: 10.1056/NEJMsr1203730.
19. Chen D-G(Din), Peace KE, Zhang P. (2017). ‘Clinical trial data analysis using r and SAS’. 2nd edition. Chapman; Hall/CRC. doi: 10.1201/9781315155104.
20. Buuren S van. (2018). ‘Flexible Imputation of Missing Data, Second Edition. CRC Press.
21. Meng X-L, Rubin DB. (1992). ‘Performing Likelihood Ratio Tests with Multiply-Imputed Data Sets. Biometrika. 79:103–111. doi: 10.2307/2337151.
22. Schenker N, Taylor J. (1996). ‘Partially parametric techniques for multiple imputation’. Computational Statistics & Data Analysis. 22:425–446. doi: 10.1016/0167-9473(95)00057-7.
23. Permutt T. (2016). ‘Sensitivity analysis for missing data in regulatory submissions’. Statistics in Medicine. 35:2876–2879. doi: 10.1002/sim.6753.
24. Jolani S. (2012). ‘Dual Imputation Strategies for Analyzing Incomplete Data.
25. von Hippel PT, Bartlett J. (2019). ‘Maximum likelihood multiple imputation: Faster imputations and consistent standard errors without posterior draws’. arXiv:12100870 [stat]. https://arxiv.org/abs/1210.0870.
26. Rafi Z, Greenland S. (2020). ‘Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise’. BMC Medical Research Methodology. 20:244. doi: 10.1186/s12874-020-01105-9.
27. Greenland S, Rafi Z. (2020). ‘To Aid Scientific Inference, Emphasize Unconditional Descriptions of Statistics. arXiv:190908583 [statME]. https://arxiv.org/abs/1909.08583.

Help support the website!