# P-Values Are Tough and S-Values Can Help

The $$P$$-value doesn’t have many fans. There are those who don’t understand it, often treating it as something it’s not, whether that’s a posterior probability, the probability of getting results due to chance alone, or some other incorrect interpretation [1–3]. Then there are those who dislike it because they think the concept is too difficult to understand or because they see it as a noisy statistic we’re not interested in.

These groups aren’t mutually exclusive. Many who dislike and criticize the $$P$$-value also do not understand its properties and behavior.

# What is a $$P$$-value Anyway?

## Definition

The $$P$$-value is the probability of getting a result (specifically, a test statistic) at least as extreme as what was observed if every model assumption used to compute it, in addition to the targeted test hypothesis (usually a null hypothesis), were correct [3–5].

Key assumptions are that some sort of random process was employed (sampling, assignment, etc.), that there are no uncontrolled sources of bias (confounding, programming errors, equipment defects, sparse-data bias [6]) in the results, and that the test hypothesis (often the null hypothesis) is correct. Some of these assumptions can be seen in the figure from Greenland & Rafi [7], which is discussed further below.

We assume all those assumptions to be correct (hence, we “condition” on them, even though they are often not correct) [7] when calculating the $$P$$-value, so that any deviation of the data from what was expected under those assumptions would be purely random error. But in reality such deviations could also be the result of any assumption being false, including but not limited to the test hypothesis.

For example, in high-energy physics, neutrinos appeared in one experiment to travel faster than light, a result accompanied by a very small $$P$$-value. The finding was later traced to a loose fiber optic cable that introduced a delay in the timing system [8]. Thus, the low $$P$$-value was due to a bias in the procedure, not to the test hypothesis being false.

So the $$P$$-value cannot be the probability of one of these assumptions, such as “the probability of getting results due to chance alone.” A statement like this is backwards because it’s quantifying one of the assumptions behind the computations of a $$P$$-value.

We assumed this condition (that all deviations arise from random error), along with several others, when calculating the $$P$$-value. That does not make the condition true, and a quantity computed under an assumption cannot also be the probability of that assumption. It is also worth clarifying that $$P$$-values are not probabilities of data or parameter values, as many like to say to distinguish them from probabilities of hypotheses. Rather, $$P$$-values are probabilities of “data features” such as test statistics (e.g., a z-score or $$\chi^{2}$$ statistic). Equivalently, they can be interpreted as the percentile at which the observed test statistic falls within its expected distribution, assuming all the model assumptions are true [9, 10].
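To make the definition concrete, here is a minimal sketch in R. The observed z-statistic is a made-up illustrative value (an assumption, not taken from any study discussed here), and the standard normal reference distribution is assumed to hold.

```r
# Hypothetical observed z-statistic (illustrative value only)
z_obs <- 1.8

# Two-sided P-value: probability of a statistic at least as extreme as z_obs
# in either direction, assuming the standard normal reference distribution
p_two_sided <- 2 * pnorm(-abs(z_obs))

# Percentile view: where z_obs falls within the reference distribution
percentile <- pnorm(z_obs)

round(c(p = p_two_sided, percentile = percentile), 3)
```

Both numbers describe the test statistic under the assumed model; neither is a probability that the model or the test hypothesis is true.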

## Properties (Calibration)

The $$P$$-value is a random variable, and it is considered valid if it is well calibrated, meaning that it is uniformly distributed under the null hypothesis of no effect, so that every value between 0 and 1 is equally likely (see the histogram below). Many frequentist statisticians do not consider $$P$$-values useful if they fail this validity criterion; hence they do not recognize variants such as the posterior predictive $$P$$-value (which concentrates around values such as 0.5 rather than being uniform) as valid.

In fact, there is an entire body of work on calibrating the $$P$$-value. It ranges from mathematical solutions such as computing $$\left(1 + [-e \, p \log(p)]^{-1}\right)^{-1}$$, which gives a lower bound on the conditional type I error, as proposed by Sellke et al. [11, 12], to applying $$C_{1}(K):=\sqrt{K}-1$$ (the square-root calibrator), which yields something known as a test martingale [13]. There are also empirical approaches that attempt to recalibrate the $$P$$-value by collecting observed $$P$$-values from observational studies with negative controls (“test-hypotheses where the exposure is not believed to cause the outcome”) and using them to estimate the empirical null distribution [14].

The latter is done because observational studies are prone to several more biases than controlled, randomized experiments. The observed $$P$$-values and estimated effect sizes are used to estimate the systematic errors within the sampling distribution, which are then used to recalibrate the $$P$$-value. Whether or not this approach is effective, however, is a different matter [15]. In short, calibration is a very important and sought-after property of $$P$$-values.
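As a rough illustration of the Sellke et al. calibration, the sketch below evaluates $$-e \, p \log(p)$$ and the corresponding conditional type I error bound in R. The function names `bf_bound` and `alpha_bound` are hypothetical helpers introduced here, and the restriction to $$p < 1/e$$ is the range over which the bound is usually stated.

```r
# Lower bound on the Bayes factor for the null: -e * p * log(p),
# applied here only for p < 1/e (hypothetical helper name)
bf_bound <- function(p) {
  stopifnot(all(p > 0 & p < exp(-1)))
  -exp(1) * p * log(p)
}

# Lower bound on the conditional type I error:
# (1 + [-e * p * log(p)]^-1)^-1 (hypothetical helper name)
alpha_bound <- function(p) {
  1 / (1 + 1 / bf_bound(p))
}

p <- c(0.05, 0.01, 0.005)
data.frame(p = p,
           bf_bound = round(bf_bound(p), 3),
           alpha_bound = round(alpha_bound(p), 3))
```

For $$p = 0.05$$ the error bound comes out near 0.29, which is one way of seeing that a $$P$$-value of 0.05 is far weaker evidence than the number itself might suggest.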

library("ggplot2")

RNGkind(kind = "L'Ecuyer-CMRG")
set.seed(1031)  # call set.seed(); assigning to it would not seed the RNG

n.sim <- 1000             # number of simulated studies
n.samp <- 1000            # observations per group
p.sim <- numeric(n.sim)   # storage for the simulated P-values
pb <- txtProgressBar(min = 0, max = n.sim,
                     initial = 0, style = 3)

for (i in 1:n.sim) {
  # Both groups come from the same distribution, so the null is true
  X <- rnorm(n.samp, mean = 0, sd = 1)
  Y <- rnorm(n.samp, mean = 0, sd = 1)
  fit <- t.test(X, Y)
  p.sim[i] <- fit$p.value
  setTxtProgressBar(pb, i)
}
close(pb)

ggplot(NULL, aes(x = p.sim)) +
  geom_histogram(bins = 30, col = "black",
                 fill = "#99c7c7", alpha = 0.25) +
  labs(title = "Distribution of P-values Under the Null",
       subtitle = "Histogram of observed P-values from 1000 t-tests",
       x = "P-value") +
  scale_x_continuous(breaks = seq(0, 1, 0.05)) +
  theme_bw()
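A numerical check can complement the histogram. The sketch below is a smaller, self-contained version of the simulation (the sample size of 50 per group is an arbitrary choice made here for speed). If the $$P$$-values are roughly uniform under the null, about a fraction $$a$$ of them should fall at or below any cutoff $$a$$.

```r
set.seed(1031)
n_sim <- 1000   # number of simulated studies
n_samp <- 50    # observations per group (arbitrary, chosen for speed)

# P-values from two-sample t-tests where the null (equal means) is true
p_null <- replicate(n_sim, t.test(rnorm(n_samp), rnorm(n_samp))$p.value)

# Share of P-values at or below each cutoff; uniformity implies
# shares close to the cutoffs themselves
cutoffs <- c(0.05, 0.25, 0.50)
frac_below <- sapply(cutoffs, function(a) mean(p_null <= a))
data.frame(cutoff = cutoffs, share = frac_below)
```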

## The Different Interpretations

### The Neyman-Pearson Approach

Many researchers interpret the $$P$$-value in a behavioral, decision-guiding way, declaring a result statistically significant or not depending on whether the observed $$p$$ from a study (the realization of the random variable) falls below a fixed cutoff level ($$\alpha$$, the maximum tolerable type I error rate) [16].

This decision-making framework (Neyman-Pearson) may be useful in certain scenarios [17], where some sort of randomization is possible, where experiments can be repeated, and where there is large control over the experimental conditions. One of the most notable historical examples is Egon Pearson (son of Karl Pearson and coauthor of Jerzy Neyman) using it to improve quality control in industrial settings.

Contrary to popular belief, this approach does NOT require exact replications of the experiments; instead, it requires that a valid $$\alpha$$, and the same one, is used throughout [16, 18]. In this approach, the exact $$P$$-value from a study is not as relevant and cannot validly be interpreted without an entire set of studies that are compared to the fixed error rate ($$\alpha$$).

### The Fisherian Approach

Others interpret the $$P$$-value in an inductive inferential/evidential (Fisherian) way [19, 20], as a continuous measure of evidence against the very test hypothesis and entire model (all assumptions) used to compute it. Let’s go with this for now, even though there are some problems with this interpretation (more on that below).

This interpretation as a continuous measure of evidence against the test hypothesis and the entire model used to compute it can be seen in the figure from Greenland & Rafi [7]. In one framework (left panel), we may assume certain assumptions to be true (“conditioning” on them, e.g., the use of random assignment), and in the other (right panel), we question all assumptions, hence the “unconditional” interpretation. Unlike the Neyman-Pearson approach, this inferential approach allows interpretation of $$P$$-values from single studies, and indeed, lower values are taken as more evidence against the tested null hypothesis.

### Null-Hypothesis Significance Testing

However, it is also worth pointing out that most individuals do not interpret $$P$$-values from a Neyman-Pearson or Fisherian standpoint; rather, they fuse both approaches together, which is what we commonly know today as “null-hypothesis significance testing.” This approach is regarded by many as a horrific hybrid, given that it often confuses error rates ($$\alpha$$, $$\beta$$), which are fixed before a study, with the $$P$$-value, which is not a fixed error rate. This fusion has often been blamed for the replication crisis in science, since many statisticians believe that the two approaches are incompatible, though some believe they can be reconciled.

Back to the Fisherian approach: the interpretation of the $$P$$-value as a continuous measure of evidence against the test model that produced it shouldn’t be confused with other statistics that serve as support measures. Likelihood ratios and Bayes factors are absolute measures of evidence for a model compared to another model, whereas the $$P$$-value is a relative measure of evidence that can be tricky to interpret [21–23]. Indeed, this is why the $$P$$-value is sometimes converted by some Bayesians to a lower bound of the Bayes factor by using $$-e \, p \log(p)$$, as proposed by Sellke et al. [11, 12].

## Measure of Compatibility

The $$P$$-value is not an absolute measure of evidence for a model (such as the null/alternative model); it is a continuous measure of the compatibility of the observed data with the model used to compute it [3].

If it’s high, it means the observed data are very compatible with the model used to compute it. If it’s very low, then it indicates that the data are not as compatible with the model used to calculate it, and this low compatibility may be due to random variation and/or it may be due to a violation of assumptions (such as the null model not being true, not using randomization, a programming error or equipment defect such as that seen with neutrinos, etc.).

Low compatibility of the data with the model can be taken as evidence against the test hypothesis, if we accept the rest of the model used to compute the $$P$$-value. Thus, lower $$P$$-values from a Fisherian perspective are seen as stronger evidence against the test hypothesis given the rest of the model.

# Common Criticisms Don’t Hold Up

If we treat the $$P$$-value as nothing more or less than a continuous measure of compatibility of the observed data with the model used to compute it (observed $$p$$), we won’t run into some of the common misinterpretations such as “the $$P$$-value is the probability of a hypothesis”, or the “probability of chance alone”, or “the probability of being incorrect” [3].

Indeed, many of the “problems” commonly associated with the $$P$$-value are not due to the actual statistic itself, but rather researchers’ misinterpretations of what it is and what it means for a study.

The answer to these misconceptions may be compatibilism, with less compatibility (smaller $$P$$-values) indicating a poor fit between the data and the test model and hence more evidence against the test hypothesis.

A $$P$$-value of 0.04 means that, if all the assumptions of the model used to compute it (including the test hypothesis) were correct, random variation alone would produce a test statistic at least as extreme as the one observed no more than 4% of the time.

To many, such low compatibility between the data and the model may lead them to reject the test hypothesis (the null hypothesis).

# Some Difficulties To Think About

## Conceptual Mismatch With Direction

If you recall from above, I wrote that the $$P$$-value is seen by many as being a continuous measure of evidence against the test hypothesis and model. Technically speaking, it would be incorrect to define it this way because as the $$P$$-value goes up (with the highest value being 1 or 100%), there is less evidence against the test hypothesis since the data are more compatible with the test model. A value of 1 indicates perfect compatibility of the data with the test model.

As the $$P$$-value gets lower (with the lowest value being 0), there is less compatibility between the data and the model, hence more evidence against the test hypothesis used to compute $$p$$.

Thus, saying that $$P$$-values are measures of evidence against the hypothesis used to compute them is a backward definition. This definition would be correct if higher $$P$$-values implied more evidence against the test hypothesis and vice versa.

## Interpretation Difficulty Due to Scale

Another problem with $$P$$-values and their interpretation is scaling. Since the statistic is meant to be a continuous measure of compatibility (and relative evidence against the test model + hypothesis), we would hope that differences between $$P$$-values would be equal (on an additive scale), as this makes it easier to interpret.

For example, the difference between 0 and 10 dollars is the same as the difference between 90 and 100 dollars. This makes it easy to think about and compare across various intervals.

Unfortunately, this doesn’t apply to the $$P$$-value because it is on the inverse-exponential scale. The difference between 0.01 and 0.10 is not the same as the difference between 0.90 and 0.99.

For example, with a normal distribution (above), a z-score of 0 results in a two-sided $$P$$-value of 1 (perfect compatibility). If we now move to a z-score of 1, the $$P$$-value is 0.32. Thus, we saw a dramatic decrease from a $$P$$-value of 1 to 0.32 with one z-score: a difference of 0.68 in the $$P$$-value.

Now let’s move from a z-score of 1 to a z-score of 2. We saw a difference of 0.68 with the change in one z-score before, so the new $$P$$-value must be 0.32 - 0.68 = -0.36, right? No. The $$P$$-value for a z-score of 2 is 0.046. The $$P$$-value for a z-score of 3 is 0.003. Even though we’ve only been moving by one z-score at a time, the changes in $$P$$-values don’t remain constant; they become smaller and smaller.

Thus, the difference between the $$P$$-values of 0.01 and 0.10 in terms of z-scores is substantially larger than the difference between 0.90 and 0.99. Again, this makes the $$P$$-value difficult to interpret across its range, especially as a continuous measure. This can further be seen in the figure from Rafi & Greenland [4].
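The scaling problem can be checked directly in R. This sketch assumes two-sided $$P$$-values from a standard normal reference distribution and computes them for z-scores of 0 through 3, then compares the z-score gaps behind two pairs of $$P$$-values that are each 0.09 apart.

```r
# Two-sided P-values for z-scores 0, 1, 2, 3
z <- 0:3
p <- 2 * pnorm(-abs(z))

# Drop in P for each one-unit step in z: the steps are far from equal
data.frame(z = z,
           p = round(p, 4),
           drop_from_previous = c(NA, round(-diff(p), 4)))

# z-score corresponding to a two-sided P-value
z_from_p <- function(p) qnorm(1 - p / 2)

# Same 0.09 gap on the P scale, very different gaps on the z scale
gap_small_p <- z_from_p(0.01) - z_from_p(0.10)
gap_large_p <- z_from_p(0.90) - z_from_p(0.99)
round(c(gap_small_p = gap_small_p, gap_large_p = gap_large_p), 3)
```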

# Surprisal-values as Support Aids

The issues described above, such as the backward definition and the problem of scaling, can make it difficult to conceptualize the $$P$$-value as an evidence measure against the test hypothesis and test model. However, these issues can be addressed by taking the negative base-2 log of the $$P$$-value, $$s = -\log_{2}(p)$$, which yields something known as the Shannon information value or surprisal ($$s$$) value [4, 5, 24], named after Claude Shannon, the father of information theory [25].

Unlike the $$P$$-value, this value is not a probability but a continuous measure, in bits, of the information against the test hypothesis supplied by the observed test statistic computed from the test model.

It also provides a more intuitive way to think about $$P$$-values. Imagine that the variable $$k$$ is always the nearest integer to the calculated value of $$s$$. Now, take for example a $$P$$-value of 0.05. The $$S$$-value for this would be $$s = -\log_{2}(0.05)$$, which equals 4.3 bits of information embedded in the test statistic, which can be taken as evidence against the test hypothesis.

How much evidence is this? $$k$$ can help us think about this. The nearest integer to 4.3 is 4. Thus, data that yield a $$P$$-value of 0.05, and hence an $$s$$ value of 4.3 bits of information, are no more surprising than getting all heads on 4 fair coin tosses.

Another example. Let’s say our study gives us a $$P$$-value of 0.005, which would indicate to many very low compatibility between the test model and the observed data; this would yield an $$s$$ value of $$-\log_{2}(0.005) = 7.6$$ bits of information. $$k$$, the closest integer to $$s$$, would be 8. Thus, data that yield a $$P$$-value of 0.005 are no more surprising than getting all heads on 8 fair coin tosses.
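The two conversions above can be reproduced with a few lines of R. The helper name `s_value` is a hypothetical one introduced for this sketch; the last column shows the probability of all heads on $$k$$ fair coin tosses for comparison with the original $$P$$-value.

```r
# Convert P-values to S-values (bits of information); hypothetical helper
s_value <- function(p) {
  stopifnot(all(p > 0 & p <= 1))
  -log2(p)
}

p <- c(0.05, 0.005)
s <- s_value(p)
k <- round(s)  # nearest whole number of fair coin tosses

data.frame(p = p,
           s = round(s, 1),
           k = k,
           prob_all_heads = 0.5^k)
```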

Unlike the $$P$$-value, the $$S$$-value is more intuitive as a measure that provides evidence against the test hypothesis since its value (bits of information against the test hypothesis) increases with less compatibility, whereas it is the opposite for the $$P$$-value.
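A small sketch of the rescaling, using arbitrary illustrative $$P$$-values: because the $$S$$-value is a base-2 logarithm, every halving of the $$P$$-value adds exactly one bit, no matter where on the $$P$$ scale the halving happens.

```r
# Each P-value is half the previous one (illustrative values)
p <- c(0.08, 0.04, 0.02, 0.01)
s <- -log2(p)

# Differences on the P scale shrink, differences on the S scale stay at 1 bit
data.frame(p = p,
           s = round(s, 2),
           change_in_p = c(NA, diff(p)),
           change_in_s = c(NA, round(diff(s), 2)))
```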

## Examples

Let’s try using some data to see this in action. I’ll take a sample experimental dataset from R on the effects of different conditions on dried plant weight. We can plot the data and also run a one-way ANOVA.

pg <- force(PlantGrowth)  # built-in dataset on dried plant weights
library("rms")
library("ggstatsplot")
library("cachem")

# Box-violin plot of weight by group with a parametric one-way ANOVA
ggbetweenstats(
  data = pg,
  x = group,
  y = weight,
  xlab = "Grouping Condition",
  ylab = "Dried Weight of Plants",
  type = "parametric",
  var.equal = TRUE,
  plot.type = "boxviolin",
  bf.message = FALSE,
  grouping.var = group,
  mean.ci = TRUE,
  nboot = 10000,
  title = "Plant Growth Results")

Looks interesting. We can see some differences from the graph. Here’s what our test output gives us:

(res <- anova(ols(weight ~ group, data = pg)))
##                 Analysis of Variance          Response: weight
##
##  Factor     d.f. Partial SS MS        F    P
##  group       2    3.76634   1.8831700 4.85 0.0159
##  REGRESSION  2    3.76634   1.8831700 4.85 0.0159
##  ERROR      27   10.49209   0.3885959
obs_p <- res[1, 5]

If we set our alpha to the traditional 0.05 level, we can reject the test hypothesis (the null hypothesis), but that is not as interesting from a continuous evidential perspective. How can I interpret this $$P$$-value of 0.0159 more intuitively?

Let’s convert it into an $$S$$-value.

-log2(obs_p)
## [1] 5.973926

$$-\log_{2}(0.0159) = 5.97$$

$$S\text{-value} = 5.97$$

That is 5.97 bits of information against the null hypothesis.

Remember, k is the nearest integer to the calculated value of s and in this case, would be 6. So these results (the test statistic) are as surprising as getting all heads on 6 fair coin tosses. Somewhat surprising, depending on the individual interpreting the results.
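As a cross-check that does not depend on the `rms` package, the same one-way ANOVA can be run with base R’s `aov()`. This is an alternative route to the same $$P$$-value, not the code used above, and the variable names are introduced here only for the sketch.

```r
# One-way ANOVA on the built-in PlantGrowth data using base R
fit <- aov(weight ~ group, data = PlantGrowth)

# P-value for the group factor from the ANOVA table
p_aov <- summary(fit)[[1]][["Pr(>F)"]][1]

# S-value in bits and the nearest whole number of coin tosses
s_aov <- -log2(p_aov)
k_aov <- round(s_aov)

round(c(p = p_aov, s = s_aov, k = k_aov), 4)
```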

How would we interpret it within the context of a given confidence interval? The $$S$$-value tells us that values within the computed 95% CI have at most 4.3 bits of information against them. That is because all parameter values within a 95% CI have $$P$$-values greater than 0.05.

So those parameter values that are inside the 95% interval estimate have fewer bits of information against them than values that lie further and further away from the center of the 95% interval estimate. The point estimate is the most compatible with the data (meaning it has the least information against it), while those values near the limits have more information against them.

In other words, as values head in the directions outside the interval, there is more refutational information against them, as depicted by the function shown in Rafi & Greenland [4], which is known as the surprisal function.
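A minimal sketch of a surprisal function, assuming a normal approximation and a made-up point estimate and standard error (both hypothetical, chosen only for illustration). It returns the $$S$$-value against each candidate parameter value, which is 0 at the point estimate and about 4.3 bits at the 95% limits.

```r
# Hypothetical point estimate and standard error (illustrative values)
est <- 0.5
se <- 0.2

# S-value against a candidate parameter value theta,
# based on a two-sided P-value under a normal approximation
surprisal <- function(theta) {
  z <- (est - theta) / se
  -log2(2 * pnorm(-abs(z)))
}

# 95% interval limits under the same approximation
limits <- est + c(-1, 1) * qnorm(0.975) * se

s_at_estimate <- surprisal(est)    # least information against it
s_at_limits <- surprisal(limits)   # about 4.3 bits at each limit

# Information against values moving away from the estimate
theta <- seq(est - 4 * se, est + 4 * se, length.out = 9)
data.frame(theta = theta, s = round(surprisal(theta), 2))
```

Plotting `surprisal(theta)` against a finer grid of `theta` values gives a curve that is lowest at the point estimate and rises in both directions.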

The $$S$$-value is not meant to replace the $$P$$-value, and it isn’t superior to the $$P$$-value. It is merely a logarithmic transformation of it that rescales it on an additive scale and tells us how much information is embedded within the test statistic and can be used as evidence against the test hypothesis. It is meant to be a device to help interpret the information one obtains from a calculated $$P$$-value.

I’ve constructed a calculator that converts observed $$P$$-values into $$S$$-values and provides an intuitive way to think about them. For a more detailed discussion of $$S$$-values, see this article, in addition to the references below.

Acknowledgments: I thank Sander Greenland for his extensive commentary and corrections on several versions of this article. My acknowledgment does not imply endorsement of my views by these colleagues, and I remain solely responsible for the views expressed herein.

# Environment

The analyses were run on:

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## Random number generation:
##  RNG:     L'Ecuyer-CMRG
##  Normal:  Inversion
##  Sample:  Rejection
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] cachem_1.0.4       ggstatsplot_0.7.0  rms_6.1-1          SparseM_1.81       Hmisc_4.4-2        Formula_1.2-4
##  [7] survival_3.2-7     lattice_0.20-41    TeachingDemos_2.12 ggplot2_3.3.3
##
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1              pairwiseComparisons_3.1.3 backports_1.2.1           plyr_1.8.6
##   [5] splines_4.0.4             gmp_0.6-2                 kSamples_1.2-9            ipmisc_6.0.0
##   [9] TH.data_1.0-10            digest_0.6.27             SuppDists_1.1-9.5         htmltools_0.5.1.1
##  [13] lmerTest_3.1-3            fansi_0.4.2               magrittr_2.0.1            checkmate_2.0.0
##  [17] memoise_2.0.0             paletteer_1.3.0           cluster_2.1.1             openxlsx_4.2.3
##  [21] credentials_1.3.0         matrixStats_0.58.0        sandwich_3.0-0            askpass_1.1
##  [25] jpeg_0.1-8.1              colorspace_2.0-0          ggrepel_0.9.1             haven_2.3.1
##  [29] xfun_0.21                 dplyr_1.0.4               prismatic_1.0.0           crayon_1.4.1
##  [33] jsonlite_1.7.2            lme4_1.1-26               zeallot_0.1.0             zoo_1.8-8
##  [37] glue_1.4.2                gtable_0.3.0              emmeans_1.5.4             MatrixModels_0.4-1
##  [41] statsExpressions_0.7.1    car_3.0-10                Rmpfr_0.8-2               abind_1.4-5
##  [45] scales_1.1.1              mvtnorm_1.1-1             DBI_1.1.1                 PMCMRplus_1.9.0
##  [49] miniUI_0.1.1.1            Rcpp_1.0.6                xtable_1.8-4              performance_0.7.0
##  [53] htmlTable_2.1.0           foreign_0.8-81            htmlwidgets_1.5.3         RColorBrewer_1.1-2
##  [57] ellipsis_0.3.1            pkgconfig_2.0.3           reshape_0.8.8             farver_2.0.3
##  [61] nnet_7.3-15               multcompView_0.1-8        sass_0.3.1                utf8_1.1.4
##  [65] reshape2_1.4.4            tidyselect_1.1.0          labeling_0.4.2            rlang_0.4.10
##  [69] later_1.1.0.1             ggcorrplot_0.1.3          effectsize_0.4.3          cellranger_1.1.0
##  [73] munsell_0.5.0             tools_4.0.4               generics_0.1.0            evaluate_0.14
##  [77] stringr_1.4.0             fastmap_1.1.0             BWStest_0.2.2             yaml_2.2.1
##  [81] sys_3.4                   rematch2_2.1.2            knitr_1.31                zip_2.1.1
##  [85] purrr_0.3.4               WRS2_1.1-1                pbapply_1.4-3             nlme_3.1-152
##  [89] mime_0.10                 quantreg_5.83             ggExtra_0.9               correlation_0.6.0
##  [93] debugme_1.1.0             compiler_4.0.4            rstudioapi_0.13           curl_4.3
##  [97] png_0.1-7                 ggsignif_0.6.0            statmod_1.4.35            tibble_3.0.6
## [101] afex_0.28-1               bslib_0.2.4               stringi_1.5.3             highr_0.8
## [105] blogdown_1.1              parameters_0.12.0         forcats_0.5.1             Matrix_1.3-2
## [109] nloptr_1.2.2.2            vctrs_0.3.6               pillar_1.5.0              lifecycle_1.0.0
## [113] mc2d_0.1-18               jquerylib_0.1.3           estimability_1.3          data.table_1.13.6
## [117] insight_0.13.1            conquer_1.0.2             httpuv_1.5.5              patchwork_1.1.1
## [121] R6_2.5.0                  latticeExtra_0.6-29       bookdown_0.21             promises_1.2.0.1
## [125] rio_0.5.16                gridExtra_2.3             BayesFactor_0.9.12-4.2    codetools_0.2-18
## [129] polspline_1.1.19          boot_1.3-27               MASS_7.3-53.1             gtools_3.8.2
## [133] assertthat_0.2.1          openssl_1.4.3             withr_2.4.1               multcomp_1.4-16
## [137] hms_1.0.0                 bayestestR_0.8.2          parallel_4.0.4            grid_4.0.4
## [141] rpart_4.1-15              minqa_1.2.4               tidyr_1.1.2               coda_0.19-4
## [145] rmarkdown_2.7             carData_3.0-4             numDeriv_2016.8-1.1       shiny_1.6.0
## [149] base64enc_0.1-3

# References

1. Gigerenzer G. (2018). ‘Statistical Rituals: The Replication Delusion and How We Got There.’ Advances in Methods and Practices in Psychological Science. 1:198–218. doi: 10.1177/2515245918771329.
2. Goodman S. (2008). ‘A Dirty Dozen: Twelve P-Value Misconceptions.’ Seminars in Hematology. 45:135–140. doi: 10.1053/j.seminhematol.2008.04.003.
3. Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, et al. (2016). ‘Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations.’ European Journal of Epidemiology. 31:337–350. doi: 10.1007/s10654-016-0149-3.
4. Rafi Z, Greenland S. (2020). ‘Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise.’ BMC Medical Research Methodology. 20:244. doi: 10.1186/s12874-020-01105-9.
5. Greenland S. (2019). ‘Valid P-values behave exactly as they should: Some misleading criticisms of P-values and their resolution with S-values.’ The American Statistician. 73:106–114. doi: 10.1080/00031305.2018.1529625.
6. Greenland S, Mansournia MA, Altman DG. (2016). ‘Sparse data bias: A problem hiding in plain sight.’ BMJ. 352:i1981. doi: 10.1136/bmj.i1981.
7. Greenland S, Rafi Z. (2020). ‘To Aid Scientific Inference, Emphasize Unconditional Descriptions of Statistics.’ arXiv:1909.08583 [stat.ME]. http://arxiv.org/abs/1909.08583.
8. Moskowitz C. (2012). ‘Faster-than-light neutrinos aren’t.’ Scientific American.
9. Perezgonzalez JD. (2015). ‘P-values as percentiles. Commentary on: “Null hypothesis significance tests. A mix-up of two different theories: The basis for widespread confusion and numerous misinterpretations”.’ Frontiers in Psychology. 6. doi: 10.3389/fpsyg.2015.00341.
10. Fraser DAS. (2019). ‘The P-value function and statistical inference.’ The American Statistician. 73:135–147. doi: 10.1080/00031305.2018.1556735.
11. Sellke T, Bayarri MJ, Berger JO. (2001). ‘Calibration of $$p$$ values for testing precise null hypotheses.’ The American Statistician. 55:62–71. doi: 10.1198/000313001300339950.
12. Greenland S, Rafi Z. (2020). ‘Technical Issues in the Interpretation of S-values and Their Relation to Other Information Measures.’ arXiv:2008.12991 [stat.ME]. http://arxiv.org/abs/2008.12991.
13. Shafer G, Shen A, Vereshchagin N, Vovk V. (2011). ‘Test Martingales, Bayes Factors and p-Values.’ Statistical Science. 26:84–101. doi: fkcvt5.
14. Schuemie MJ, Hripcsak G, Ryan PB, Madigan D, Suchard MA. (2016). ‘Robust empirical calibration of p-values using observational data.’ Statistics in Medicine. 35:3883–3888. doi: ghqmsb.
15. Gruber S, Tchetgen ET. (2016). ‘Limitations of empirical calibration of p-values using observational data.’ Statistics in Medicine. 35:3869–3882. doi: ghqmtn.
16. Neyman J, Pearson ES. (1933). ‘On the Problem of the Most Efficient Tests of Statistical Hypotheses.’ Philosophical Transactions of the Royal Society of London Series A, Containing Papers of a Mathematical or Physical Character. 231:289–337. doi: 10.1098/rsta.1933.0009.
17. Whitehead J. (1993). ‘The case for frequentism in clinical trials.’ Statistics in Medicine. 12:1405–1413. doi: 10.1002/sim.4780121506.
18. Lehmann EL. (2011). ‘Fisher, Neyman, and the Creation of Classical Statistics.’ Springer New York. doi: 10.1007/978-1-4419-9500-1.
19. Fisher RA. (1935). ‘The Design of Experiments.’ Oxford, England: Oliver & Boyd.
20. Fisher R. (1955). ‘Statistical Methods and Scientific Induction.’ Journal of the Royal Statistical Society Series B (Methodological). 17:69–78. doi: 10.1111/j.2517-6161.1955.tb00180.x.
21. Jeffreys H. (1935). ‘Some Tests of Significance, Treated by the Theory of Probability.’ Mathematical Proceedings of the Cambridge Philosophical Society. 31:203–222. doi: 10.1017/S030500410001330X.
22. Jeffreys H. (1998). ‘The Theory of Probability.’ OUP Oxford.
23. Royall R. (1997). ‘Statistical Evidence: A Likelihood Paradigm.’ CRC Press.
24. Cole SR, Edwards JK, Greenland S. (2020). ‘Surprise!’ American Journal of Epidemiology. doi: gg63md.
25. Shannon CE. (1948). ‘A mathematical theory of communication.’ The Bell System Technical Journal. 27:379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x.
