# Simulation of valid P-values when the test hypothesis is true.
#   X, Y   - the two simulated variables
#   n.sim  - the number of simulations
#   t.sim  - numeric vector to collect the simulated P-values
#   n.samp - sample size in each group
# Note: the null hypothesis does not have to be 0; it can be any value.
n.sim <- 10000
t.sim <- numeric(n.sim)
n.samp <- 1000
for (i in 1:n.sim) {
  X <- rnorm(n.samp, mean = 0, sd = 1)
  Y <- rnorm(n.samp, mean = 0, sd = 1)
  t <- t.test(X, Y, mu = 0, paired = FALSE, var.equal = TRUE)
  t.sim[i] <- t$p.value
}
P-values Are Tough And S-values Can Help
The P-value has long had both devoted users and harsh critics. However, these groups of people aren’t mutually exclusive. Many who dislike and criticize the P-value still use and report it. In this post, I cover:
- what P-values are
- the assumptions behind them
- their properties and behavior
- different schools of interpretation
- misleading criticisms of P-values
- some valid issues in interpretation
- how these issues can be resolved
What is a P-value Anyway?
Some Definitions & Descriptions
The P-value has been defined in several ways, some more rigorous than others.
A simple, mathematically rigorous definition of a P-value is the following:
Let $P$ be the probability distribution of the data $y$, which takes values in the measurable space $(\mathcal{Y}, \mathcal{B})$. Let $\{R_c : c \in [0, 1]\}$ be a collection of $\mathcal{B}$-measurable subsets of $\mathcal{Y}$ such that (1) $P(R_c) \le c$ and (2) if $c \le c'$ then $R_c \subseteq R_{c'}$. Then the $P$-value for data $y$ is $\inf\{c : y \in R_c\}$.
A descriptive but technical definition is given by Sander Greenland below. The description can seem dense, so feel free to skip over it for now and revisit it after reading the rest of the post.
A single $P$-value is the quantile location of a directional measure of divergence $D = d(y, M)$ of the data point $y$ (usually, the vector in $n$-space formed by $n$ individual observations) from a test model manifold $M$ in the $n$-dimensional expectation space defined by the logical structure of the data generator (“experiment” or causal structure) that produced the data $y$. $M$ is the subset of the $n$-space into which the conjunction of the model constraints (assumptions) force the data expectation, or predict where $y$ would be were there no ‘random’ variability. I also use $M$ to denote the set of all the model constraints, as well as their conjunction. With this logical set-up, the observed
$P$-value is the quantile for the observed value of $D = d(y, M)$. This is read off a reference distribution for $D$ derived from $M$. This formulation is essentially that of the “value of P” appearing in Pearson’s seminal 1900 paper on goodness-of-fit tests. Notably, his famed chi-squared statistic is the squared Euclidean distance from $y$ to $M$, with coordinates expressed in standard-deviation units derived from $M$. More broadly, the statistic
$D$ can be taken as a measure of divergence of a more general embedding or background model manifold $A$ (which includes all ‘auxiliary’ assumptions) from a more restrictive model $M$, with the goodness-of-fit case taking $A$ as a saturated model covering the entire observation space, and the more common “hypothesis testing” case taking $M$ as the conjunction of an unsaturated $A$ with a targeted ‘test’ constraint (or set of constraints) $H$. $H$ is logically independent of and consistent with $A$, with $M$ = $H$ & $A$ in logical terms, or $M$ = $H$ + $A$ in set-theoretic terms with + being union (in particular, we assume no element in $A$ is entailed or contradicted by $H$ and no element in $H$ is entailed or contradicted by $A$).
Misleading Definitions
It is very common to see the P-value defined as something like the following:
The probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.
Indeed, this is the actual definition currently given on the Wikipedia page for the topic. However, it is inadequate and misleading, because it hides and reifies the other assumptions used to compute the P-value.
The test hypothesis (often the null hypothesis) is only one component of the entire model that is being tested. This is reflected in the first definition I gave above, which explicitly emphasizes that every model assumption must be true. Thus, the P-value is computed under the entire set of model assumptions, not the test hypothesis alone.
Auxiliary Assumptions
Some of these key assumptions behind the computation of a P-value include random sampling or random assignment, no systematic measurement error, no programming or equipment errors, and a correctly specified statistical model.
We often start from the position that all those assumptions are correct (hence, we “condition” on them, even though they are often not correct)7 when calculating the P-value.
Note: “Conditioning” here refers to taking the assumptions in the model as given, and should not be confused with conditional probability.
For example, in high-energy physics, neutrinos were found in one study to be apparently faster than light due to the resulting large test statistic and corresponding small P-value. The anomaly was later traced to an equipment defect (a loose fiber-optic cable), not to neutrinos actually outrunning light.
So the small P-value flagged a violation of an auxiliary assumption (properly functioning equipment), not a false test hypothesis. Chance alone producing the results is assumed to be true (i.e., taken as 100% certain), along with several other assumptions, when calculating the P-value.
Probability of What?
It is also important to clarify that the P-value is a probability statement about the data (via the test statistic), given the model used to compute it; it is not the probability that a hypothesis is true.
Properties (Uniformity)
A key property of a valid P-value is that, when every assumption used to compute it (including the test hypothesis) is true, it is uniformly distributed between 0 and 1.
Thus, if we were to simulate two variables that are practically the same (meaning there’s no difference between them), compare them using, say, a t-test, iterate this process 10,000 times, and plot the distribution of the observed P-values, the distribution would be uniform, indicating that any P-value between 0 and 1 is as likely to be observed as any other.
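To make the uniformity concrete, here is a minimal base-R sketch that re-runs a smaller version of the simulation and plots the resulting P-values; the seed and the reduced sizes are arbitrary choices for speed, not part of the original analysis.

```r
# Simulate P-values under a true null and inspect their distribution:
# with all assumptions true, the histogram should look flat (uniform).
set.seed(1839)                      # arbitrary seed for reproducibility
n.sim  <- 2000
n.samp <- 100
p.sim  <- replicate(n.sim, {
  X <- rnorm(n.samp)
  Y <- rnorm(n.samp)
  t.test(X, Y, var.equal = TRUE)$p.value
})
hist(p.sim, breaks = 20,
     main = "P-values under a true null",
     xlab = "Observed P-value")
```

Any histogram bin should capture roughly the same share of simulated P-values, which is the uniformity described above.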
Many frequentist statisticians do not consider the P-value itself to be a formal measure of evidence.
Indeed, there have been great efforts to calibrate the P-value so that it better reflects evidence against the test hypothesis, particularly in settings outside of randomized experiments.
The latter is done since observational studies are prone to several more biases than controlled, randomized experiments; thus, the observed P-values from such studies deserve extra scrutiny before being given an evidential reading.
The Different Interpretations
The Decision-Theoretic Approach
Many researchers interpret the P-value within a decision-theoretic framework: the observed P-value is compared to a fixed, prespecified error rate (the alpha level), and the test hypothesis is rejected if the P-value falls below it.
Statistical Significance
Thus, in this approach, users do not care how small or large the observed P-value is; all that matters is whether it falls above or below the prespecified alpha level.
The pioneers of this approach, Jerzy Neyman and Egon Pearson, define this behavioral guidance in their 1933 paper, “On the Problem of the Most Efficient Tests of Statistical Hypotheses”16
Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.
This decision-making framework may be useful in certain scenarios,17 where some sort of randomization is possible, where experiments can be repeated, and where there is large control over the experimental conditions, with one of the most notable historical examples being Egon Pearson (son of Karl Pearson and coauthor of Jerzy Neyman) using it to improve quality control in industrial settings.
Contrary to some claims,18 this approach does NOT require exact replications of the experiments; instead, it requires a valid testing procedure whose long-run error rates are controlled across a sequence of (possibly quite different) experiments.

The Inductive Approach
Others interpret the P-value inferentially, as a continuous measure of evidence against the test hypothesis and the model used to compute it, an approach commonly associated with R. A. Fisher.
This interpretation as a continuous measure of evidence against the test hypothesis and the entire model used to compute it can be seen in the figure below from7. In one framework (left panel), we may assume certain assumptions to be true (“conditioning” on them, i.e., use of random assignment), and in the other (right panel), we question all assumptions, hence the “unconditional” interpretation. Unlike the Neyman-Pearson approach, this inferential approach allows a single observed P-value to be interpreted on its own, without reference to a fixed cutoff.
Null-Hypothesis Significance Testing
However, it is also worth pointing out that most individuals do not interpret the P-value strictly within either framework; instead, they practice an incoherent hybrid of the two, often referred to as null-hypothesis significance testing (NHST).
Back to the Fisherian approach: there, the interpretation of the P-value is best understood as a measure of compatibility between the observed data and the model used to compute it.
Measure of Compatibility
The P-value can be seen as a continuous measure of the compatibility between the observed data and the entire model used to compute it.
If it’s high, it means the observed data are very compatible with the model used to compute it. If it’s very low, then the data are not very compatible with that model, and this low compatibility may be due to random variation and/or a violation of assumptions (such as the test hypothesis not being true, not using randomization, or a programming error or equipment defect such as that seen with the neutrinos).
Low compatibility of the data with the model can be taken as evidence against the test hypothesis, if we are willing to accept the rest of the model used to compute the P-value.
Common, Misleading Criticisms
Estimation and Intervals
A common criticism put forth by many is that P-values should be abandoned in favor of estimation, i.e., point estimates accompanied by interval estimates such as confidence intervals.
A confidence interval, however, is built from the very same machinery: a 95% confidence interval contains all the parameter values whose P-values exceed 0.05, so intervals and P-values summarize the same underlying information.
Overstating the Evidence
Any overstating of evidence is not an issue of the statistic itself, but rather of its users. If we treat the P-value as the continuous measure it is, rather than dichotomizing it into “significant” and “nonsignificant,” we are far less likely to overstate the evidence.
Indeed, many of the “problems” commonly associated with the P-value are really problems of dichotomization and misinterpretation.
The answer to these misconceptions may be compatibilism, with less compatibility (smaller P-values) suggesting that at least one assumption of the model, possibly but not necessarily the test hypothesis, is violated.
A small observed P-value, such as 0.04, indicates relatively low compatibility between the data and the model used to compute it.
To many, such low compatibility between the data and the model may lead them to reject the test hypothesis (the null hypothesis).
Some Valid Issues
Mismatch With Direction
If you recall from above, I wrote that the P-value is a probability statement about the data given the model, not about the test hypothesis given the data. The direction of the conditional runs opposite to the question most people actually want answered.
As the P-value shrinks, many readers take the probability that the test hypothesis is true to be shrinking with it, but the P-value is simply not that quantity.
Thus, saying that a P-value of 0.04 means there is only a 4% probability that the null hypothesis is true, or that the results are due to chance, is a misinterpretation.
Difficulties Due to Scale
Another problem with P-values is that they do not lie on an additive, equal-interval scale, unlike familiar quantities such as money.
For example, the difference between 0 and 10 dollars is the same as the difference between 90 and 100 dollars, in that both are a difference of 10 dollars. And this property remains consistent across various intervals, 120 and 130, 1,000,000 and 1,000,010.
Unfortunately, this doesn’t apply to the P-value: equal-sized differences in P do not carry equal-sized meaning across the 0-1 range.
For example, with a normal distribution (above), a z-score of 0 corresponds to a two-sided P-value of 1, while a z-score of 1 corresponds to a two-sided P-value of about 0.31, a decrease of roughly 0.69.
Now let’s move from a z-score of 1 to a z-score of 2. We saw a decrease of 0.69 with the change of one z-score before, so if the scale were additive the new P-value would be 0.31 - 0.69 = -0.38, an impossible value; the actual two-sided P-value at a z-score of 2 is about 0.046.
Thus, the difference between the P-values produced by equal steps in the test statistic is not constant; the same one-unit change in z yields very different changes in P depending on where you start.
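A quick sketch of this non-additivity, computing two-sided normal P-values at successive z-scores with pnorm:

```r
# Equal one-unit steps in z produce very unequal steps in the P-value.
z <- 0:3
p <- 2 * pnorm(-abs(z))   # two-sided P-value for each z-score
round(p, 4)               # 1.0000 0.3173 0.0455 0.0027
round(diff(p), 4)         # -0.6827 -0.2718 -0.0428
```

The successive drops shrink from about 0.68 to 0.27 to 0.04, even though z changes by one unit each time.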
Resolution with Surprisals
The issues described above, such as the backward direction of the definition and the problem of scaling, can make it difficult to conceptualize the P-value. A useful resolution is to transform the P-value into a surprisal, or S-value: the negative base-2 logarithm of the P-value, s = -log2(p), measured in bits of information against the test model.
Unlike the P-value, the S-value lives on an additive scale from 0 to infinity, so equal differences in S-values represent equal differences in refutational information.
It also provides a more intuitive way to think about the information supplied by a P-value, via an analogy to tosses of a fair coin.
Suppose we observe a P-value of 0.05, which corresponds to an S-value of -log2(0.05) ≈ 4.3 bits. How much evidence is this? It is only about as surprising as getting four heads in a row from tosses of a fair coin.
Another example. Let’s say our study gives us a P-value of 0.005. That corresponds to -log2(0.005) ≈ 7.6 bits of information against the test model, roughly as surprising as getting seven heads in a row.
A table of various P-values and their corresponding S-values, maximum likelihood ratios, and deviance statistics is given below.
For example, an S-value of 4.32 bits (from a P-value of 0.05) supplies hardly more refutational information than four heads in a row from a fair coin.
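A small sketch of the coin-toss reading, assuming the fair-coin analogy above:

```r
# S-value in bits: s = -log2(p). An S-value of s bits is roughly as
# surprising as getting about ceiling(s) heads in a row from a fair coin.
p <- c(0.05, 0.005)
s <- -log2(p)
round(s, 2)          # 4.32 7.64
0.5 ^ ceiling(s)     # chance of that many heads in a row from a fair coin
```

The coin-run probabilities land near the original P-values, which is what makes the analogy a useful mental yardstick.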
| P-value (compatibility) | S-value (bits) | Maximum Likelihood Ratio | Deviance Statistic 2ln(MLR) |
|---|---|---|---|
| 0.99 | 0.01 | 1.000e+00 | 0.00 |
| 0.9 | 0.15 | 1.010e+00 | 0.02 |
| 0.5 | 1.00 | 1.260e+00 | 0.45 |
| 0.25 | 2.00 | 1.940e+00 | 1.32 |
| 0.1 | 3.32 | 3.870e+00 | 2.71 |
| 0.05 | 4.32 | 6.830e+00 | 3.84 |
| 0.025 | 5.32 | 1.230e+01 | 5.02 |
| 0.01 | 6.64 | 2.760e+01 | 6.63 |
| 0.005 | 7.64 | 5.140e+01 | 7.88 |
| 1e-04 | 13.29 | 1.935e+03 | 15.10 |
| 5 sigma (~ 2.9 in 10 million) | 21.70 | 5.200e+05 | 26.30 |
| 1 in 100 million (GWAS) | 26.60 | 1.400e+07 | 32.80 |
| 6 sigma (~ 1 in a billion) | 29.90 | 1.300e+08 | 37.40 |
| Abbreviations: MLR = maximum likelihood ratio; GWAS = genome-wide association study | | | |

Table 1: $P$-values and binary $S$-values, with corresponding maximum-likelihood ratios (MLR) and deviance (likelihood-ratio) statistics for a simple test hypothesis H under background assumptions A.
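As a check, the table's columns can be reproduced for a simple one-degree-of-freedom test, using s = -log2(p), deviance = qchisq(1 - p, 1), and MLR = exp(deviance / 2); this sketch assumes the simple normal-test setting the caption describes.

```r
# Reproduce the S-value, deviance, and MLR columns from the P-values.
p        <- c(0.99, 0.9, 0.5, 0.25, 0.1, 0.05, 0.025, 0.01, 0.005, 1e-4)
s        <- -log2(p)                 # bits of information against the model
deviance <- qchisq(1 - p, df = 1)    # likelihood-ratio (deviance) statistic
mlr      <- exp(deviance / 2)        # maximum likelihood ratio
round(cbind(p, s, deviance, mlr), 2)
```

For instance, the row for P = 0.05 recovers a deviance of about 3.84 and an MLR of about 6.83, matching the table.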
Unlike the P-value, which is compressed into the interval from 0 to 1, the S-value ranges from 0 to infinity, making it far easier to distinguish and compare very small P-values.
Some Examples
Let’s try using some data to see this in action. I’ll take a sample experimental dataset from R on the effects of different conditions on dried plant weight. We can plot the data and run a one-way ANOVA.
pg <- PlantGrowth
(Hmisc::describe(pg))
#> pg
#>
#> 2 Variables 30 Observations
#> --------------------------------------------------------------------------------
#> weight
#> n missing distinct Info Mean pMedian Gmd .05
#> 30 0 29 1 5.073 5.09 0.8131 3.983
#> .10 .25 .50 .75 .90 .95
#> 4.170 4.550 5.155 5.530 6.038 6.132
#>
#> lowest : 3.59 3.83 4.17 4.32 4.41, highest: 5.87 6.03 6.11 6.15 6.31
#> --------------------------------------------------------------------------------
#> group
#> n missing distinct
#> 30 0 3
#>
#> Value ctrl trt1 trt2
#> Frequency 10 10 10
#> Proportion 0.333 0.333 0.333
#> --------------------------------------------------------------------------------
Looks interesting. We can see some differences from the graph. Here’s what our test output gives us,
res <- anova(lm(weight ~ group, data = pg))
res
(obs_p <- res[1, 5])
#> [1] 0.01590996
If we had set our alpha level at 0.05, we would reject the test hypothesis here, since the observed P-value of 0.016 falls below that threshold.
Let’s convert it into an S-value:
-log2(obs_p)
#> [1] 5.973926
That is 5.97 bits of information against the null hypothesis.
Remember, those bits quantify the information against the entire model used to compute the P-value, not the test hypothesis in isolation.
How would we interpret it within the context of a given confidence interval? The 95% interval estimate contains all the parameter values with P-values greater than 0.05, which is to say, all the values with fewer than -log2(0.05) ≈ 4.32 bits of information against them.
So the parameter values inside the 95% interval estimate have fewer bits of information against them than the parameter values that fall further and further outside it. The point estimate is the most compatible with the data (meaning it has the least refutational information against it), while values near the limits have more information against them.
In other words, as values head in the directions outside the interval, there is more refutational information against them, as depicted by the following function from Rafi & Greenland, 2020, which is known as the surprisal function.
The surprisal function plots the S-value against each candidate parameter value: it bottoms out at the point estimate and climbs as values move toward and beyond the interval limits.
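As a rough sketch of such a surprisal function for these data, here is one way to trace S-values across candidate mean differences between trt2 and ctrl using repeated t-tests; the grid of candidate values and the equal-variance choice are assumptions of this sketch, not part of the original analysis.

```r
# Surprisal (S-value) function for the trt2 - ctrl mean difference:
# test each candidate difference mu and convert its P-value to bits.
pg    <- PlantGrowth
ctrl  <- pg$weight[pg$group == "ctrl"]
trt2  <- pg$weight[pg$group == "trt2"]
mus   <- seq(-0.5, 1.5, by = 0.01)    # candidate mean differences
pvals <- sapply(mus, function(m)
  t.test(trt2, ctrl, mu = m, var.equal = TRUE)$p.value)
svals <- -log2(pvals)                  # bits of information against each mu
plot(mus, svals, type = "l",
     xlab = "Candidate mean difference (trt2 - ctrl)",
     ylab = "S-value (bits)")
mus[which.min(svals)]                  # least-refuted value, near the point estimate
```

The curve's minimum sits at the observed mean difference, and the S-values rise smoothly as candidate values move away from it in either direction.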
I’ve constructed a calculator that converts observed P-values into S-values.
S-value Calculator
Acknowledgments: I’m very grateful to Sander Greenland for his extensive commentary and corrections on several versions of this article. My acknowledgment does not imply his endorsement of my views, and I remain solely responsible for the views expressed herein.
The analyses were run on:
#> R version 4.5.2 (2025-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Tahoe 26.3
#>
#> Matrix products: default
#> BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> Random number generation:
#> RNG: Mersenne-Twister
#> Normal: Inversion
#> Sample: Rejection
#>
#> locale:
#> [1] C.UTF-8/C.UTF-8/C.UTF-8/C/C.UTF-8/C.UTF-8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] splines grid stats4 parallel stats graphics grDevices
#> [8] utils datasets methods base
#>
#> other attached packages:
#> [1] pbmcapply_1.5.1 texPreview_2.1.0 tinytex_0.58
#> [4] rmarkdown_2.30 brms_2.23.0 bootImpute_1.3.0
#> [7] knitr_1.51 boot_1.3-32 reshape2_1.4.5
#> [10] ProfileLikelihood_1.3 ImputeRobust_1.3-1 gamlss_5.5-0
#> [13] gamlss.dist_6.1-1 gamlss.data_6.0-7 mvtnorm_1.3-3
#> [16] performance_0.15.3.6 summarytools_1.1.5 tidybayes_3.0.7
#> [19] htmltools_0.5.9 Statamarkdown_0.9.6 car_3.1-3
#> [22] carData_3.0-5 qqplotr_0.0.7 ggcorrplot_0.1.4.1
#> [25] Amelia_1.8.3 Rcpp_1.1.1 blogdown_1.23
#> [28] doParallel_1.0.17 iterators_1.0.14 foreach_1.5.2
#> [31] lattice_0.22-7 bayesplot_1.15.0 wesanderson_0.3.7
#> [34] VIM_7.0.0 colorspace_2.1-2 here_1.0.2
#> [37] progress_1.2.3 loo_2.9.0 mi_1.2
#> [40] Matrix_1.7-4 broom_1.0.12 yardstick_1.3.2
#> [43] svglite_2.2.2 Cairo_1.7-0 cowplot_1.2.0
#> [46] mgcv_1.9-4 nlme_3.1-168 xfun_0.56
#> [49] broom.mixed_0.2.9.6 reticulate_1.44.1 kableExtra_1.4.0
#> [52] posterior_1.6.1 checkmate_2.3.3 parallelly_1.46.1
#> [55] miceFast_0.8.5 randomForest_4.7-1.2 missForest_1.6.1
#> [58] miceadds_3.18-36 mice_3.19.0 quantreg_6.1
#> [61] SparseM_1.84-2 MCMCpack_1.7-1 MASS_7.3-65
#> [64] coda_0.19-4.1 latex2exp_0.9.8 rstan_2.32.7
#> [67] StanHeaders_2.32.10 lubridate_1.9.4 forcats_1.0.1
#> [70] stringr_1.6.0 dplyr_1.1.4 purrr_1.2.1
#> [73] readr_2.1.6 tibble_3.3.1 ggplot2_4.0.1
#> [76] tidyverse_2.0.0 ggtext_0.1.2 concurve_3.0.0
#> [79] showtext_0.9-7 showtextdb_3.0 sysfonts_0.8.9
#> [82] future.apply_1.20.1 future_1.69.0 tidyr_1.3.2
#> [85] magrittr_2.0.4 rms_8.1-0 Hmisc_5.2-5
#>
#> loaded via a namespace (and not attached):
#> [1] dichromat_2.0-0.1 nnet_7.3-20 TH.data_1.1-5
#> [4] vctrs_0.7.1 digest_0.6.39 png_0.1-8
#> [7] shape_1.4.6.1 proxy_0.4-29 magick_2.9.0
#> [10] fontLiberation_0.1.0 withr_3.0.2 ggpubr_0.6.2
#> [13] survival_3.8-6 doRNG_1.8.6.2 emmeans_2.0.1
#> [16] MatrixModels_0.5-4 systemfonts_1.3.1 ragg_1.5.0
#> [19] zoo_1.8-15 V8_8.0.1 ggdist_3.3.3
#> [22] DEoptimR_1.1-4 Formula_1.2-5 prettyunits_1.2.0
#> [25] rematch2_2.1.2 httr_1.4.7 otel_0.2.0
#> [28] rstatix_0.7.3 globals_0.18.0 ps_1.9.1
#> [31] rstudioapi_0.18.0 extremevalues_2.4.1 pan_1.9
#> [34] generics_0.1.4 processx_3.8.6 base64enc_0.1-3
#> [37] curl_7.0.0 mitools_2.4 lgr_0.5.0
#> [40] desc_1.4.3 xtable_1.8-4 svUnit_1.0.8
#> [43] pracma_2.4.6 evaluate_1.0.5 hms_1.1.4
#> [46] glmnet_4.1-10 rcartocolor_2.1.2 lmtest_0.9-40
#> [49] palmerpenguins_0.1.1 robustbase_0.99-6 matrixStats_1.5.0
#> [52] svgPanZoom_0.3.4 class_7.3-23 pillar_1.11.1
#> [55] caTools_1.18.3 compiler_4.5.2 stringi_1.8.7
#> [58] paradox_1.0.1 jomo_2.7-6 minqa_1.2.8
#> [61] plyr_1.8.9 crayon_1.5.3 abind_1.4-8
#> [64] metadat_1.4-0 sp_2.2-0 mathjaxr_2.0-0
#> [67] rapportools_1.2 twosamples_2.0.1 sandwich_3.1-1
#> [70] whisker_0.4.1 codetools_0.2-20 multcomp_1.4-29
#> [73] textshaping_1.0.4 bcaboot_0.2-3 openssl_2.3.4
#> [76] flextable_0.9.10 QuickJSR_1.9.0 e1071_1.7-17
#> [79] gridtext_0.1.5 lme4_1.1-38 fs_1.6.6
#> [82] itertools_0.1-3 listenv_0.10.0 Rdpack_2.6.5
#> [85] pkgbuild_1.4.8.9000 estimability_1.5.1 ggsignif_0.6.4
#> [88] callr_3.7.6 tzdb_0.5.0 pkgconfig_2.0.3
#> [91] tools_4.5.2 rbibutils_2.4.1 viridisLite_0.4.2
#> [94] DBI_1.2.3 numDeriv_2016.8-1.1 fastmap_1.2.0
#> [97] scales_1.4.0 officer_0.7.3 opdisDownsampling_1.0.1
#> [100] insight_1.4.5 rpart_4.1.24 farver_2.1.2
#> [103] reformulas_0.4.3.1 survminer_0.5.1 yaml_2.3.12
#> [106] foreign_0.8-90 cli_3.6.5 lifecycle_1.0.5
#> [109] askpass_1.2.1 bbotk_1.8.1 backports_1.5.0
#> [112] Brobdingnag_1.2-9 mlr3tuning_1.5.1 timechange_0.3.0
#> [115] gtable_0.3.6 arrayhelpers_1.1-0 metafor_4.8-0
#> [118] jsonlite_2.0.0 mitml_0.4-5 bitops_1.0-9
#> [121] qqconf_1.3.2 mlr3learners_0.14.0 zip_2.3.3
#> [124] ranger_0.18.0 RcppParallel_5.1.11-1 polspline_1.1.25
#> [127] bridgesampling_1.2-1 survMisc_0.5.6 distributional_0.6.0
#> [130] pander_0.6.6 details_0.4.0 KMsurv_0.1-6
#> [133] mlr3pipelines_0.10.0 glue_1.8.0 tcltk_4.5.2
#> [136] gdtools_0.4.4 rprojroot_2.1.1 mcmc_0.9-8
#> [139] gridExtra_2.3 mlr3_1.3.0 R6_2.6.1
#> [142] arm_1.14-4 km.ci_0.5-6 vcd_1.4-13
#> [145] clipr_0.8.0 cluster_2.1.8.1 rngtools_1.5.2
#> [148] nloptr_2.2.1 mlr3misc_0.19.0 rstantools_2.6.0
#> [151] tidyselect_1.2.1 htmlTable_2.4.3 tensorA_0.36.2.1
#> [154] xml2_1.5.2 inline_0.3.21 fontBitstreamVera_0.1.1
#> [157] S7_0.2.1 furrr_0.3.1 laeken_0.5.3
#> [160] fontquiver_0.2.1 data.table_1.18.2.1 htmlwidgets_1.6.4
#> [163] RColorBrewer_1.1-3 rlang_1.1.7 uuid_1.2-2
References
Citation
@online{panda2018,
author = {Panda, Sir and Rafi, Zad},
title = {P-Values {Are} {Tough} {And} {S-values} {Can} {Help}},
date = {2018-11-11},
url = {https://lesslikely.com/posts/statistics/s-values.html},
langid = {en}
}