The \(P\)-value doesn’t have many fans. There are those who don’t
understand it, often treating it as a measure it’s not, whether that’s a
posterior probability, the probability of getting results due to chance
alone, or some other bizarre/incorrect interpretation.^{1–3} Then there are those who dislike it because they think the concept is
too difficult to understand or because they see it as a noisy statistic
we’re not interested in.

However, the groups of people mentioned above aren’t mutually exclusive.
Many who dislike and criticize the \(P\)-value also do not understand its
properties and behavior. This is unfortunate, given how important and widely used the statistic is.
In this article, which could also have been titled, *\(P\)-values: More Than You Ever Wanted to Know*,
I take on the task of explaining:

- what \(P\)-values are
- the assumptions behind them
- their properties and behavior
- different schools of interpretation
- misleading criticisms of \(P\)-values
- some valid issues in interpretation
- how these issues can be resolved

# What is a \(P\)-value Anyway?

## Definition

The \(P\)-value is the probability of getting a result (specifically, a
test statistic) at least as extreme as what was observed if **every
model assumption**, in addition to the targeted test hypothesis (usually
a null hypothesis), used to compute it **were correct**.^{3–5}

A simple, mathematically rigorous definition of a \(P\)-value (for those interested) is given by Stark (2015).

Let \(P\) be the probability distribution of the data \(X\), which takes values in the measurable space \(\mathcal{X}\). Let \(\left\{R_{\alpha}\right\}_{\alpha \in[0,1]}\) be a collection of \(P\)-measurable subsets of \(\mathcal{X}\) such that (1) \(P\left(R_{\alpha}\right)=\alpha\) and (2) if \(\alpha^{\prime}<\alpha\), then \(R_{\alpha^{\prime}} \subset R_{\alpha}\). Then the \(P\)-value of \(H_{0}\) for data \(X=x\) is \(\inf_{\alpha \in[0,1]}\left\{\alpha: x \in R_{\alpha}\right\}\).

## Misleading Definitions

It is very common to see the \(P\)-value defined as

The probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.

Indeed, this is the definition currently given on the Wikipedia page for the topic. However, it is inadequate and misleading because it hides and reifies the other assumptions used to compute the \(P\)-value and focuses exclusively on the null hypothesis.

The test hypothesis (often the null hypothesis) is only one component of the entire model that is being tested. This is reflected in the first definition I gave above, which explicitly emphasizes that every model assumption must be true. Thus, the \(P\)-value is sensitive to all these assumptions and their violation(s).

## Auxiliary Assumptions

Some of these key **assumptions** behind the computation of a \(P\)-value are that
some sort of **random process was employed** (random sampling,
random assignment, etc.), that there are **no uncontrolled sources of bias**
(confounding, programming errors, equipment defects, sparse-data bias)^{6} in the
results, and that **the test hypothesis** (often the null hypothesis) **is
correct**. Some of these assumptions can be seen in the figure below
from,^{7} which will be discussed
later on. This entire set of assumptions is generally referred to as the
*test model*, and that is because the entire assumed model is being tested.

We often start from the position that all those assumptions are correct (hence, we “condition”
on them, even though they are often not
correct)^{7} when calculating the
\(P\)-value, so that any deviation of the data from what was expected
under those assumptions would be **purely random error**. But in reality
such deviations could also be the result of **any** assumptions being false,
including *but not limited to* the test hypothesis.

Note: “Conditioning” here refers to taking the assumptions in the model as given, and should not be confused with conditional probability.

For example, in high-energy physics, one experiment’s large test statistic and correspondingly small \(P\)-value suggested that neutrinos travel faster than light, but this result was later traced to a defect in the experiment’s fiber-optic timing system.^{8} Thus, the low \(P\)-value arose not because the assumed null hypothesis was false, but because of a bias in the procedure.

So the \(P\)-value **cannot** be the probability of one of these
assumptions, such as *“the probability of getting results due to chance
alone.”* A statement like this is **backwards** because it’s quantifying one
of the assumptions behind the computation of a \(P\)-value.

When calculating the \(P\)-value, the assumption that chance alone produced the
results is taken as given (i.e., treated as 100% true), along with several
other assumptions. But taking an assumption as given does not make it correct,
and the resulting \(P\)-value *cannot* be the *probability* of one of those
**assumptions**.

## Probability of What?

It is also important to clarify that \(P\)-values are not
*probabilities of data* or parameter values, which many like to say to differentiate from
probabilities of hypotheses. Rather, \(P\)-values are probabilities of
“data features”, such as test statistics (i.e. a z-score or \(\chi^{2}\)
statistic) or can be interpreted as the percentile at which the observed test
statistic falls within the expected distribution for the test statistic,
assuming all the model assumptions are
true.^{9, 10}
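To make the “data feature” point concrete, here is a minimal sketch in base R: the two-sided \(P\)-value for a z-statistic is just the tail area of the reference distribution beyond the observed statistic.

```
# Two-sided P-value for an observed z-statistic: the probability,
# under the test model, of a statistic at least this extreme
z_obs <- 1.96
2 * pnorm(-abs(z_obs))  # roughly 0.05

# The complement gives the percentile at which the observed
# statistic falls in the expected distribution
1 - 2 * pnorm(-abs(z_obs))  # roughly 0.95
```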

## Properties (Uniformity)

A \(P\)-value is considered valid if, over repeated trials, it would be uniformly distributed whenever the tested hypothesis and all other assumptions used to compute it are correct (the histogram below shows what this looks like). Typically, the test hypothesis is a null hypothesis where the tested parameter value is 0 or 1, but this property applies to any test hypothesis and any parameter value. Thus, there is the random variable \(P\), which (when valid) follows this uniform distribution, and the realization of this random variable, \(p\), which is the observed \(P\)-value. The latter is what most researchers interpret from studies.

Many frequentist statisticians do not consider \(P\)-values to be valid/useful if they fail to meet this validity criterion of being uniform, hence they do not recognize variants such as the posterior predictive \(P\)-value (which concentrates around values such as 0.5, rather than being uniform) to be valid.

Indeed, there have been great efforts to calibrate the \(P\)-value,
ranging from mathematical solutions such as computing \((1 + [-e \cdot p \cdot \log(p)]^{-1})^{-1}\), which gives a lower bound on the conditional type I error rate,^{11, 12}
to applying the square-root calibrator \(C_{1}(K):=\sqrt{K}-1\) to the \(P\)-value, yielding a test martingale,^{13}
or even empirically recalibrating the \(P\)-value by collecting observed \(P\)-values from observational studies
with negative controls (test hypotheses where the exposure is not believed to cause the outcome) and using them to estimate the empirical null distribution.^{14}

The latter approach is used because observational studies are prone to more biases than controlled, randomized experiments;
the observed \(P\)-values and estimated effect sizes are used to estimate the systematic error in the sampling distribution,
which is then used to recalibrate the \(P\)-value. Whether this approach is effective, however, is a different matter.^{15}
In short, calibration is an often sought-after property of \(P\)-values.
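As a small sketch, the first of these calibrations is straightforward to compute directly (it is valid for \(p < 1/e\)):

```
# Lower bound on the conditional type I error rate implied by an
# observed P-value (valid for p < 1/e), per the calibration above
calibrate_p <- function(p) {
  stopifnot(p > 0, p < 1 / exp(1))
  (1 + (-exp(1) * p * log(p))^(-1))^(-1)
}
calibrate_p(0.05)  # roughly 0.29
```

So even a \(P\)-value of 0.05 corresponds to a conditional type I error rate of at least about 29% under this calibration.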

```
library("ggplot2")

# Simulate 10,000 studies in which all model assumptions hold:
# both samples come from the same N(0, 1) distribution
RNGkind(kind = "L'Ecuyer-CMRG")
set.seed(1031)
n.sim <- 10000
p.sim <- numeric(n.sim)
n.samp <- 1000
pb <- txtProgressBar(min = 0, max = n.sim,
                     initial = 0, style = 3)
for (i in 1:n.sim) {
  X <- rnorm(n.samp, mean = 0, sd = 1)
  Y <- rnorm(n.samp, mean = 0, sd = 1)
  p.sim[i] <- t.test(X, Y)$p.value
  setTxtProgressBar(pb, i)
}

# A valid P-value is uniform when the entire test model is correct
ggplot(NULL, aes(x = p.sim)) +
  geom_histogram(bins = 20, col = "black",
                 fill = "#99c7c7", alpha = 0.5) +
  scale_x_continuous(breaks = seq(0, 1, 0.10)) +
  theme_bw()
```

## The Different Interpretations

### The Decision-Theoretic Approach

Many researchers interpret the \(P\)-value in a behavioral, decision-guiding
way, declaring results statistically significant or not (defined below) depending
on whether the observed \(p\) from a study (the realization of the random variable \(P\))
falls below a fixed cutoff level (\(\alpha\), the maximum tolerable
type I error rate).^{16}

### Statistical Significance

Thus, in this approach, users do not care how small or large the observed \(P\)-value \(p\) is, but simply whether or not it fell below
the pre-specified \(\alpha\) level (often 0.05). If it falls below \(\alpha\), they
behave in line with rejection of the test hypothesis; if it fails to fall below
\(\alpha\), they behave as though they accept the test hypothesis.
The phrase *statistical significance* simply indicates that the observed \(P\)-value \(p\) fell
below this pre-specified \(\alpha\) level, and nothing else.
It does not indicate any meaningful significance on its own.

The pioneers of this approach, Jerzy Neyman and Egon Pearson, describe this behavioral guidance in their 1933 paper,
“On the Problem of the Most Efficient Tests of Statistical Hypotheses”:^{16}

Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.

This decision-making framework may be useful in certain
scenarios,^{17} where some form of randomization is possible, where experiments can be repeated,
and where there is substantial control over the experimental conditions. One
of the most notable historical examples is Egon Pearson (son of
Karl Pearson and collaborator of Jerzy Neyman) using it to improve quality
control in industrial settings.

Contrary to some claims,^{18} this approach does **NOT** require
exact replications of the experiments; instead, it requires that a valid \(\alpha\) level is used consistently.^{16, 19} In this approach, the exact observed \(P\)-value
from a study is not as relevant and cannot validly be interpreted
without an entire series of studies compared against the fixed error rate (\(\alpha\)).

### The Inductive Approach

Others interpret the \(P\)-value \(p\) in an inductive inferential/evidential (**Fisherian**)
way,^{20, 21} as a **continuous** measure of
evidence against the very test hypothesis and entire model (all
assumptions) used to compute it (let’s go with this for now, even though
there are some problems with this interpretation, more on that below).

This interpretation as a continuous measure of evidence **against** the
test hypothesis and the entire model used to compute it can be seen in
the figure below from^{7}. In one
framework (left panel), we may assume certain assumptions to be true
(“conditioning” on them, i.e, use of random assignment), and in the
other (right panel), we question all assumptions, hence the
“unconditional” interpretation. Unlike the **Neyman-Pearson** approach, this
inferential approach allows interpretation of \(P\)-values from single studies, and
indeed, lower values of it are taken as more evidence against the tested hypothesis.

### Null-Hypothesis Significance Testing

However, it is also worth pointing out that most individuals do not
interpret \(P\)-values from a strictly **Neyman-Pearson** or **Fisherian** standpoint; rather, they fuse both approaches together into what we commonly know today as “null-hypothesis significance testing.” This approach is regarded
by most as an incompatible hybrid, given that it often confuses error rates (\(\alpha\), \(\beta\)), which are
fixed before a study, with the \(P\)-value, which is not a fixed error rate, and this fusion has been blamed by many statisticians for the
replication crisis in science. Some, however, believe
these approaches can be reconciled and are useful.^{22}

Back to the **Fisherian** approach: the interpretation of the \(P\)-value as a continuous measure of evidence
against the test model that produced it shouldn’t be confused with other
statistics that serve as support measures. Likelihood ratios and Bayes
factors are measures of evidence **for** one model compared to another,
whereas the \(P\)-value is a **relative** measure of “evidence” (more on that below) that can be tricky to interpret.^{23–25} Indeed, this is why some Bayesians convert the \(P\)-value to a lower bound on the
Bayes factor by taking \(-e \cdot p \cdot \log(p)\).^{11, 12}
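A minimal sketch of that conversion (valid for \(p < 1/e\)):

```
# Lower bound on the Bayes factor implied by an observed P-value
# (valid for p < 1/e)
bf_bound <- function(p) -exp(1) * p * log(p)
bf_bound(0.05)  # roughly 0.41
```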

## Measure of Compatibility

The \(P\)-value is not an absolute measure of evidence for a model (such as the
null/alternative model), it is a continuous **measure of the
compatibility** of the **observed data** with the **model** used to
compute it.^{3}

If it’s high, it means the observed data are **very compatible** with
the model used to compute it. If it’s very low, then it indicates that
the data are **not as compatible** with the model used to calculate
it, and this low compatibility may be due to random variation and/or it may be
due to a violation of assumptions (such as the null model not being
true, not using randomization, a programming error or equipment defect
such as that seen with
neutrinos,
etc.).

Low compatibility of the data with the model can be taken as evidence
against the test hypothesis, if we accept the rest of the model used to
compute the \(P\)-value. Thus, lower \(P\)-values from a **Fisherian**
perspective are seen as stronger evidence against the test hypothesis,
given the rest of the model.

# Common, Misleading Criticisms

## Estimation and Intervals

A common criticism put forth by many is that \(P\)-values are useless because they cannot tell you the size of the effect, because they conflate sample size and effect size, and that researchers should instead report compatibility (confidence) intervals. However, this criticism misses the mark: both can be reported, and they serve different purposes.

A \(P\)-value for a particular parameter value gives the compatibility between the test model in question, which will vary from one parameter value to the next, and the data. An interval estimate such as a 95% frequentist interval simply gives the region of parameter values with \(P\)-values above the corresponding \(\alpha\) level, and which are more consistent with the data than the parameter values outside the interval limits. An interval estimate by itself does not explicitly tell one how consistent a parameter value is with the data, which the \(P\)-value does.
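This relationship between intervals and \(P\)-values can be checked directly. A sketch using a one-sample t-test on simulated data (the data here are purely illustrative):

```
# For a t-test, the limits of the 95% interval are exactly the
# parameter values whose P-value equals 0.05
set.seed(1031)
x <- rnorm(30, mean = 0.5)
ci <- t.test(x)$conf.int
t.test(x, mu = ci[1])$p.value  # 0.05
t.test(x, mu = ci[2])$p.value  # 0.05
```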

## Overstating the Evidence

\(P\)-values are routinely criticized for overstating the amount of evidence from a study. Such claims are often made using Bayesian arguments, of which many are skeptical. However, the \(P\)-value cannot overstate evidence: it simply reports where the test statistic fell in the expected distribution, given that every model assumption is true. It indicates how surprising or extreme the observed result was, given certain assumptions.

Any overstating of evidence is not a problem with the statistic itself, but with its users.
If we treat the \(P\)-value as nothing more or less than a continuous
measure of compatibility of the observed data with the model used to
compute it (the observed \(p\)), given certain model assumptions, we won’t run into common
misinterpretations such as “the \(P\)-value is the probability of a
hypothesis,” the “probability of chance alone,” or “the probability
of being incorrect.”^{3}

Indeed, many of the “problems” commonly associated with the \(P\)-value are not due to the actual statistic itself, but rather researchers’ misinterpretations of what it is and what it means for a study.

The answer to these misconceptions may be compatibilism, with less compatibility (smaller \(P\)-values) indicating a poor fit between the data and the test model and hence more evidence against the test hypothesis.

A \(P\)-value of 0.04 means that, assuming **all** the assumptions of
the model used to compute it are correct, random variation would produce
data (a test statistic) at least as extreme as what was observed only 4%
of the time.

To many, such low compatibility between the data and the model may lead them to reject the test hypothesis (the null hypothesis).
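A quick simulation sketch illustrates this interpretation: when all model assumptions hold, \(P\)-values at or below 0.04 occur about 4% of the time.

```
# Simulate studies in which the entire test model is correct;
# the proportion of P-values at or below 0.04 should be about 4%
set.seed(1031)
p_vals <- replicate(2000, t.test(rnorm(50), rnorm(50))$p.value)
mean(p_vals <= 0.04)  # close to 0.04
```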

# Some Valid Issues

## Mismatch With Direction

If you recall from above, I wrote that the \(P\)-value is seen by many as
a continuous measure of evidence against the test hypothesis and
model. Technically speaking, it is incorrect to define it this way,
because as the \(P\)-value rises (with the highest value being 1, or
100%), there is **less** evidence against the test hypothesis, since the
data are **more compatible** with the test model; a value of 1 indicates
perfect compatibility of the data with the test model.

As the \(P\)-value gets lower (with the lowest value being 0), there is
**less compatibility** between the data and the model, hence **more**
evidence against the test hypothesis used to compute \(p\).

Thus, saying that \(P\)-values are measures of evidence against the hypothesis used to compute them is a backward definition. The definition would only be correct if higher \(P\)-values implied more evidence against the test hypothesis, and vice versa.

## Difficulties Due to Scale

Another problem with \(P\)-values and their interpretation is scaling. Since the statistic is meant to be a continuous measure of compatibility (and of relative evidence against the test model and hypothesis), we would hope that equal differences between \(P\)-values carry equal meaning (i.e., an additive scale), as this would make them easier to interpret.

For example, the difference between 0 and 10 dollars is the same as the difference between 90 and 100 dollars, in that both are a difference of 10 dollars. And this property remains consistent across various intervals, 120 and 130, 1,000,000 and 1,000,010.

Unfortunately, this doesn’t apply to the \(P\)-value because it is on the inverse-exponential scale. The difference between a \(P\)-value of 0.01 and 0.10 is not the same as the difference between 0.90 and 0.99.

For example, with a standard normal distribution, a z-score of 0 yields a \(P\)-value of 1 (perfect compatibility). Moving to a z-score of 1, the \(P\)-value becomes 0.31, a dramatic decrease of 0.69 over a single z-score unit.

Now let’s move from a z-score of 1 to a z-score of 2. We saw a
decrease of 0.69 with the change in **one** z-score before, so the
new \(P\)-value must be 0.31 - 0.69 = -0.38 right? **No**. The \(P\)-value for a
z-score of 2 is 0.045. The \(P\)-value for a z-score of 3 is 0.003.
Even though we’ve only been moving by **one** z-score at a time, the
changes in \(P\)-values don’t remain constant; the decreases become larger and
larger.

Thus, the difference between the \(P\)-values of 0.01 and 0.10, in terms of z-score, is substantially larger than the difference between 0.90 and 0.99. Again, this makes it difficult to interpret as a statistic across the board, especially as a continuous measure. This can further be seen in the figure from Rafi & Greenland (2020).
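A short sketch showing this non-additive behavior, using two-sided \(P\)-values from the standard normal distribution:

```
# The P-value shrinks multiplicatively, not additively, as the
# z-statistic grows one unit at a time
z <- 0:4
round(2 * pnorm(-abs(z)), 4)
# 1.0000 0.3173 0.0455 0.0027 0.0001
```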

# Resolution with Surprisals

The issues described above, such as the backward definition and the
problem of scaling, can make it difficult to conceptualize the \(P\)-value
as a measure of evidence against the test hypothesis and test model.
However, these issues can be addressed by taking the negative base-2 logarithm of the
\(P\)-value, \(-\log_{2}(p)\), which yields something known as the Shannon
information value or *surprisal (\(s\))
value*,^{4, 5, 26} named after Claude
Shannon, the father of
information theory.^{27}

Unlike the \(P\)-value, this value is not a probability but a
continuous measure of *information*, in **bits**, against the test
hypothesis, computed from the observed test statistic under the test
model.

It also provides a more intuitive way to think about \(P\)-values. Let \(k\) be the nearest integer to the calculated value of \(s\). Now take, for example, a \(P\)-value of 0.05: the \(S\)-value would be \(s = -\log_{2}(0.05)\), which equals 4.3 bits of information embedded in the test statistic, which can be taken as evidence against the test hypothesis.

How much evidence is this? \(k\) can help us think about this. The
nearest integer to 4.3 is 4. Thus, the data which yield a \(P\)-value of
0.05 which results in an \(s\) value of 4.3 bits of information is
**no more surprising** than getting **all heads** on 4 fair coin tosses.

Another example: suppose our study gives a \(P\)-value of 0.005, which would indicate to many very low compatibility between the test model and the observed data. This yields an \(s\) value of \(-\log_{2}(0.005) = 7.6\) bits of information. \(k\), the closest integer to \(s\), would be 8. Thus, data which yield a \(P\)-value of 0.005 are no more surprising than getting all heads on 8 fair coin tosses.

A table of various \(P\)-values and their corresponding \(S\)-values, maximum-likelihood ratios, and likelihood-ratio statistics can be found below from Rafi & Greenland (2020), which includes the general cutoffs used in different scientific fields such as high-energy physics and genome-wide association studies. It also shows how the traditional cutoffs used in these fields can be problematic.

For example, an \(\alpha\) of 0.05, which only corresponds to seeing all heads on 4 fair coin tosses, is practically nothing when compared to the cutoffs used in particle physics and GWAS, which correspond to seeing all heads on 22 and 30 fair coin tosses, respectively.

Unlike the \(P\)-value, the \(S\)-value is more intuitive as a measure of refutational evidence against the test hypothesis since its value (bits of information against the test hypothesis) increases with less compatibility, whereas the opposite is true for the \(P\)-value.
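A one-line helper makes the transformation concrete (a sketch; the rounding to \(k\) can be done mentally or with `round()`):

```
# S-value: bits of refutational information against the test hypothesis
s_value <- function(p) -log2(p)
s_value(0.05)   # about 4.3 bits (k = 4)
s_value(0.005)  # about 7.6 bits (k = 8)
```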

## Some Examples

Let’s try using some data to see this in action. I’ll take a sample
experimental dataset from `R` on the effects of different conditions on
dried plant weight. We can plot the data and run a one-way ANOVA.

```
library("rms")
library("ggstatsplot")

pg <- PlantGrowth

ggbetweenstats(data = pg, x = group, y = weight,
               xlab = "Grouping Condition",
               ylab = "Dried Weight of Plants",
               type = "parametric", var.equal = TRUE,
               plot.type = "boxviolin", bf.message = FALSE,
               mean.ci = TRUE, nboot = 10000,
               title = "Plant Growth Results")
```

Looks interesting. We can see some differences from the graph. Here’s what our test output gives us,

`(res <- anova(ols(weight ~ group, data = pg)))`

```
## Analysis of Variance Response: weight
##
## Factor d.f. Partial SS MS F P
## group 2 3.77 1.883 4.85 0.0159
## REGRESSION 2 3.77 1.883 4.85 0.0159
## ERROR 27 10.49 0.389
```

`obs_p <- res[1, 5]`

If we had set our \(\alpha\) to the traditional 0.05 level before the experiment, we can reject the test hypothesis (the null hypothesis), but that is not as interesting from a continuous evidential perspective. How can I interpret this \(P\)-value of 0.0159 more intuitively?

Let’s convert it into an \(S\)-value.

`-log2(obs_p)`

`## [1] 5.97`

\[–\log_2(0.0159) = 5.97\]

\[s= 5.97\]

That is 5.97 bits of information against the null hypothesis.

Remember, \(k\) is the nearest integer to the calculated value of \(s\), which in this case is 6. So these results (the test statistic, \(F = 4.85\)) are about as surprising as getting all heads on 6 fair coin tosses. Somewhat surprising, depending on who is interpreting the results.

How would we interpret this within the context of a given confidence interval? The \(S\)-value tells us that parameter values within the computed 95% CI have at most 4.3 bits of information against them, because all parameter values within a 95% CI have \(P\)-values greater than 0.05.

So parameter values inside the 95% interval estimate have fewer bits of information against them than values further and further away from the center of the interval. The point estimate is the most compatible with the data (it has the least refutational information against it), while values near the limits have more information against them.

In other words, as values head in the directions outside the interval, there is more refutational information against them, as depicted by the following function from Rafi & Greenland, 2020, which is known as the surprisal function.
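A minimal sketch of such a surprisal function, using a normal approximation with a hypothetical point estimate and standard error (these numbers are illustrative and not taken from the PlantGrowth analysis above):

```
# Surprisal (compatibility) curve: S-values over a grid of candidate
# parameter values, assuming a normal approximation
est <- 0.5   # hypothetical point estimate
se  <- 0.25  # hypothetical standard error
theta <- seq(est - 4 * se, est + 4 * se, length.out = 200)
p_theta <- 2 * pnorm(-abs((est - theta) / se))
s_theta <- -log2(p_theta)

# The curve bottoms out near 0 bits at the point estimate and rises
# as candidate values move away from it
plot(theta, s_theta, type = "l",
     xlab = "Candidate parameter value",
     ylab = "S-value (bits against the value)")
```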

The \(S\)-value is not meant to replace the \(P\)-value, and it isn’t superior to the \(P\)-value. It is merely a logarithmic transformation of it that rescales it on an additive scale and tells us how much information is embedded within the test statistic and can be used as evidence against the test hypothesis. It is meant to be a device to help interpret the information one obtains from a calculated \(P\)-value.

# A Surprisal Calculator

I’ve constructed a calculator that converts observed \(P\)-values into \(S\)-values and provides an intuitive way to think about them. For a more detailed discussion of \(S\)-values, see this article, in addition to the references below.

Acknowledgments: I’m very grateful to Sander Greenland for his extensive commentary and corrections on several versions of this article. My acknowledgment does not imply endorsement of my views, and I remain solely responsible for the views expressed herein.

# Environment

The analyses were run on:

```
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## Random number generation:
## RNG: L'Ecuyer-CMRG
## Normal: Inversion
## Sample: Rejection
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cachem_1.0.5 ggstatsplot_0.7.2 rms_6.2-0 SparseM_1.81 Hmisc_4.5-0 Formula_1.2-4
## [7] survival_3.2-11 lattice_0.20-44 TeachingDemos_2.12 ggplot2_3.3.3
##
## loaded via a namespace (and not attached):
## [1] TH.data_1.0-10 colorspace_2.0-1 ggsignif_0.6.1 ellipsis_0.3.2
## [5] estimability_1.3 htmlTable_2.2.1 parameters_0.14.0 base64enc_0.1-3
## [9] mc2d_0.1-19 rstudioapi_0.13 farver_2.1.0 MatrixModels_0.5-0
## [13] ggrepel_0.9.1 fansi_0.5.0 mvtnorm_1.1-1 codetools_0.2-18
## [17] splines_4.1.0 knitr_1.33 SuppDists_1.1-9.5 zeallot_0.1.0
## [21] jsonlite_1.7.2 cluster_2.1.2 Rmpfr_0.8-4 png_0.1-7
## [25] effectsize_0.4.5 compiler_4.1.0 emmeans_1.6.0 PMCMRplus_1.9.0
## [29] backports_1.2.1 ggcorrplot_0.1.3 assertthat_0.2.1 Matrix_1.3-3
## [33] fastmap_1.1.0 htmltools_0.5.1.1 quantreg_5.85 tools_4.1.0
## [37] gmp_0.6-2 coda_0.19-4 gtable_0.3.0 glue_1.4.2
## [41] dplyr_1.0.6 Rcpp_1.0.6 jquerylib_0.1.4 vctrs_0.3.8
## [45] nlme_3.1-152 conquer_1.0.2 blogdown_1.3 insight_0.14.1
## [49] xfun_0.23 stringr_1.4.0 lifecycle_1.0.0 ipmisc_6.0.2
## [53] sys_3.4 gtools_3.8.2 polspline_1.1.19 MASS_7.3-54
## [57] zoo_1.8-9 scales_1.1.1 BayesFactor_0.9.12-4.2 credentials_1.3.0
## [61] parallel_4.1.0 sandwich_3.0-1 rematch2_2.1.2 RColorBrewer_1.1-2
## [65] prismatic_1.0.0 yaml_2.2.1 memoise_2.0.0 pbapply_1.4-3
## [69] gridExtra_2.3 sass_0.4.0 rpart_4.1-15 reshape_0.8.8
## [73] latticeExtra_0.6-29 stringi_1.6.2 paletteer_1.3.0 bayestestR_0.9.0
## [77] highr_0.9 checkmate_2.0.0 rlang_0.4.11 pkgconfig_2.0.3
## [81] matrixStats_0.58.0 evaluate_0.14 purrr_0.3.4.9000 patchwork_1.1.1
## [85] htmlwidgets_1.5.3 labeling_0.4.2 tidyselect_1.1.1 plyr_1.8.6
## [89] magrittr_2.0.1 bookdown_0.22 R6_2.5.0 pairwiseComparisons_3.1.5
## [93] generics_0.1.0 multcompView_0.1-8 multcomp_1.4-17 BWStest_0.2.2
## [97] DBI_1.1.1 pillar_1.6.1 foreign_0.8-81 withr_2.4.2
## [101] nnet_7.3-16 performance_0.7.2 tibble_3.1.2 crayon_1.4.1
## [105] WRS2_1.1-1 utf8_1.2.1 correlation_0.6.1 rmarkdown_2.8
## [109] kSamples_1.2-9 jpeg_0.1-8.1 grid_4.1.0 data.table_1.14.0
## [113] digest_0.6.27 xtable_1.8-4 tidyr_1.1.3 statsExpressions_1.1.0
## [117] openssl_1.4.4 munsell_0.5.0 bslib_0.2.5.1 askpass_1.1
```

# References

*Advances in Methods and Practices in Psychological Science*. **1**:198–218. doi: 10.1177/2515245918771329.

*Seminars in Hematology*. **45**:135–140. doi: 10.1053/j.seminhematol.2008.04.003.

*European Journal of Epidemiology*. **31**:337–350. doi: 10.1007/s10654-016-0149-3.

*BMC Medical Research Methodology*. **20**:244. doi: 10.1186/s12874-020-01105-9.

*The American Statistician*. **73**:106–114. doi: 10.1080/00031305.2018.1529625.

*BMJ*. **352**:i1981. doi: 10.1136/bmj.i1981.

*arXiv:1909.08583 [stat.ME]*. https://arxiv.org/abs/1909.08583.

*Scientific American*.

*Frontiers in Psychology*. **6**. doi: 10.3389/fpsyg.2015.00341.

*The American Statistician*. **73**:135–147. doi: 10.1080/00031305.2018.1556735.

*The American Statistician*. **55**:62–71. doi: 10.1198/000313001300339950.

*arXiv:2008.12991 [stat.ME]*. https://arxiv.org/abs/2008.12991.

*Statistical Science*. **26**:84–101. doi: fkcvt5.

*Statistics in Medicine*. **35**:3883–3888. doi: ghqmsb.

*Statistics in Medicine*. **35**:3869–3882. doi: ghqmtn.

*Philosophical Transactions of the Royal Society of London Series A, Containing Papers of a Mathematical or Physical Character*. **231**:289–337. doi: 10.1098/rsta.1933.0009.

*Statistics in Medicine*. **12**:1405–1413. doi: 10.1002/sim.4780121506.

*Synthese*. doi: 10.1007/s11229-019-02433-0.

*Journal of the Royal Statistical Society Series B (Methodological)*. **17**:69–78. doi: 10.1111/j.2517-6161.1955.tb00180.x.

*The American Statistician*. **0**:1–16. doi: 10.1080/00031305.2019.1699443.

*Mathematical Proceedings of the Cambridge Philosophical Society*. **31**:203–222. doi: 10.1017/S030500410001330X.

*American Journal of Epidemiology*. doi: gg63md.

*The Bell System Technical Journal*. **27**:379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x.