P-Values Are Tough and S-Values Can Help

The P-value doesn’t have many fans. There are those who don’t understand it, often treating it as a measure it’s not, whether that’s a posterior probability, the probability of getting results due to chance alone, or some other incorrect interpretation. [1–3]

Then there are those who dislike it for reasons such as believing that the concept is too difficult to understand or because they see it as a noisy statistic that provides something we’re not interested in.
However, the groups of people mentioned above aren’t mutually exclusive. Many who dislike and criticize the P-value also do not understand its properties and behavior.

What is a P-Value Anyway?

The P-value is the probability of getting results at least as extreme as what was observed if every model assumption used to compute it were correct. [1]

Key assumptions are that randomization was employed (sampling, assignment, etc.), there are no uncontrolled sources of bias (systematic error) in the results, and the test hypothesis (often the null hypothesis) is correct.

We assume all those conditions to be correct (even though they are often not) when calculating the P-value, so that any deviation of the data from what was expected under those assumptions would be purely random error. But in reality such deviations could also be the result of assumptions being false, including but not limited to the test hypothesis.

So the P-value cannot be the probability of one of these assumptions, such as “the probability of getting results due to chance alone.” A statement like this is already asserting that one of the assumptions behind the computations of a P-value is correct.

When calculating the P-value, we assumed this (that deviations arise from random error alone), along with several other things, to be true. That does not mean it is actually correct, and a quantity calculated under those assumptions cannot be the probability of any one of them.

The Different Frameworks Accompanying P-Values

Many choose to interpret the P-value in a dichotomous way such as being statistically significant or not statistically significant depending on a fixed cutoff (alpha). [4] This framework (Neyman-Pearson) may be useful in certain scenarios [5] with one of the most notable examples being Egon Pearson (son of Karl Pearson and coauthor of Jerzy Neyman) using it to improve quality control in industrial settings.


Picture of the giants who founded frequentist statistics such as Egon Pearson, Ronald Fisher, and Jerzy Neyman


Others choose to interpret the P-value in a Fisherian way [6, 7], as a continuous measure of evidence against the very test hypothesis + model used to compute it (let’s go with this for now, even though there are some problems with this interpretation, more on that below).

This interpretation as a continuous measure of evidence against the test hypothesis shouldn’t be confused with other statistics that serve as support measures. Likelihood ratios and Bayes factors are measures of evidence for a model compared to another model. [8–10]

Compatibilism To The Rescue

The P-value is not a measure of evidence for a model (such as the null/alternative model); it is a continuous measure of the compatibility of the observed data with the model used to compute it. [1]

If it’s high, it means the observed data are very compatible with the model used to compute it. If it’s very low, then it indicates that the data are not very compatible with the model used to calculate it, and this low value may be due to random variation and/or it may be due to a violation of assumptions (such as the null model not being true, not using randomization, etc.).

Low compatibility of the data with the model can be taken as evidence against the test hypothesis, if we accept the rest of the model used to compute the P-value. Thus, lower P-values from a Fisherian perspective are seen as stronger evidence against the test hypothesis given the rest of the model.

Many Criticisms Don’t Hold Up

If we treat the P-value as nothing more or less than a continuous measure of compatibility of the observed data with the model used to compute it (observed p), we won’t run into some of the common misinterpretations, such as treating the P-value as the probability of a hypothesis, the probability of chance alone, or the probability of being incorrect. [1–3]

Thus, many of the “problems” commonly associated with the P-value are not due to the actual statistic itself, but rather researchers’ misinterpretations of what it is and what it means for a study.

The answer to these misconceptions is compatibilism, with less compatibility (smaller P-values) indicating a poor fit between the data and the test model and hence more evidence against the test hypothesis.

A P-value of \(0.04\) means that, if all the assumptions of the model used to compute it are correct, random variation alone would produce data at least as extreme as what was observed \(4\) % of the time.
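To make that reading concrete, here is a minimal simulation sketch in R. It assumes a two-sided z-test in which every assumption holds, so the test statistic follows a standard normal distribution. The names `p_obs`, `z_obs`, and `z_sim`, as well as the seed, are illustrative choices and not part of the analysis in this post.

```r
# Sketch: what P = 0.04 means when every assumption of the test model holds.
# Assumption: a two-sided z-test, so the statistic is standard normal.
set.seed(1)                                # arbitrary seed, for reproducibility only

p_obs <- 0.04
z_obs <- qnorm(1 - p_obs / 2)              # |z| that gives a two-sided P of 0.04

z_sim <- rnorm(1e5)                        # statistics generated by the test model
prop_extreme <- mean(abs(z_sim) >= z_obs)  # share at least as extreme as z_obs

prop_extreme                               # should land close to 0.04
```

The share only matches the P-value because the simulated statistics really do come from the test model. If any assumption were violated, the two could diverge.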

To many, such low compatibility between the data and the model may lead them to reject the test hypothesis (the null hypothesis).

Difficulties To Think About

Conceptual Mismatch With Direction

If you recall from above, I wrote that the P-value is seen by many as a continuous measure of evidence against the test hypothesis. Technically speaking, it would be incorrect to define it this way because as the P-value goes up (with the highest value being \(1\) or \(100\) %), there is less evidence against the test hypothesis, since the data are more compatible with the test model. A P-value of \(1\) indicates perfect compatibility of the data with the test model.

As the P-value gets lower (with the lowest value being \(0\)), there is less compatibility between the data and the model, hence more evidence against the test hypothesis used to compute p. 

Thus, saying that P-values are measures of evidence against the hypothesis used to compute them is a backward definition. This definition would be correct if higher P-values implied more evidence against the test hypothesis and vice versa.

Scaling

Another problem with P-values and their interpretation is scaling. Since the statistic is meant to be a continuous measure of compatibility (and evidence against the test model + hypothesis), we would hope that equal differences between P-values represent equal differences in compatibility, as this would make them easier to interpret.

For example, the difference between \(0\) and \(10\) dollars is the same as the difference between \(90\) and \(100\) dollars. This makes it easy to think about and compare across various intervals.

Unfortunately, this doesn’t apply to the P-value. The difference between \(0.01\) and \(0.10\) does not represent the same change in evidence as the difference between \(0.90\) and \(0.99\), even though both pairs are \(0.09\) apart.


Simple image of the normal distribution


For example, with a normal distribution (above), a Z score of \(0\) results in a two-sided P-value of \(1\) (perfect compatibility). If we now move to a Z score of \(1\), the P-value is \(0.32\). Thus, we saw a dramatic decrease from a P-value of \(1\) to \(0.32\) with one Z score, a difference of \(0.68\) in the P-value.

Now let’s go from a Z score of \(1\) to a Z score of \(2\). We saw a difference of \(0.68\) with the change in one Z score before, so the new P-value must be \(0.32-0.68=-0.36\), right? No. The P-value for a Z score of \(2\) is \(0.046\). The P-value for a Z score of \(3\) is \(0.003\). Even though we’ve only been moving by one Z score at a time, the changes in P-values don’t remain constant; they become smaller and smaller.

Thus, the difference between the P-values of \(0.01\) and \(0.10\) in terms of Z scores is substantially larger than the difference between \(0.90\) and \(0.99\).

Again, this makes it difficult to interpret as a statistic across the board, especially as a continuous measure.
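The Z-score figures in this section can be checked with a few lines of R. This is a small sketch that assumes two-sided P-values from a standard normal distribution; `z_from_p` is a hypothetical helper introduced here only for illustration.

```r
# Two-sided P-values for Z scores of 0, 1, 2, and 3 under a standard normal
z <- 0:3
p <- 2 * pnorm(-abs(z))
round(p, 4)

# Change in P for each one-unit step in Z: the steps shrink rapidly
round(diff(p), 4)

# Z-score distance spanned by two pairs of P-values that are both 0.09 apart
z_from_p <- function(pv) qnorm(1 - pv / 2)  # hypothetical helper: P back to |Z|
gap_low  <- z_from_p(0.01) - z_from_p(0.10)
gap_high <- z_from_p(0.90) - z_from_p(0.99)
c(gap_low = gap_low, gap_high = gap_high)
```

The first gap is several times larger than the second, which is the uneven scaling described above.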

S-values To The Rescue

The issues described above, such as the backward definition and the problem of scaling, can make it difficult to conceptualize the P-value as an evidential measure against the test hypothesis and test model. However, these issues can be addressed by taking the negative base-2 log of the P-value, \(-\log_{2}(p)\), which yields something known as the Shannon information value or surprisal (s) value, [11–13] named after Claude Shannon, a notable contributor to information theory. [14]


Image of Claude Shannon conducting an experiment on mice


Unlike the P-value, this value is not a probability but rather a continuous measure of information in bits against the test hypothesis and is taken from the observed test statistic computed by the test model.

It also provides a highly intuitive way to think about P-values. Imagine that the variable k is always the nearest integer to the calculated value of s. Now, take for example a P-value of \(0.05\). The S-value for this would be \(s=-\log_{2}(0.05)\), which equals \(4.3\) bits of information embedded in the test statistic, which can be used as evidence against the test hypothesis.

How much evidence is this? k can help us think about this. The nearest integer to \(4.3\) is \(4\). Thus, data that yield a P-value of \(0.05\), and hence an S-value of \(4.3\) bits, are about as surprising as getting all heads in \(4\) fair coin tosses.

Let’s try another example. Let’s say our study gives us a P-value of \(0.005\), which would indicate to many very low compatibility between the test model and the observed data; this would yield an S-value of \(-\log_{2}(0.005)=7.6\) bits of information. k, the closest integer to s, would be \(8\). Thus, data that yield a P-value of \(0.005\) are about as surprising as getting all heads in \(8\) fair coin tosses.

Unlike the P-value, the S-value is more intuitive as a measure that provides evidence against the test hypothesis since its value (information against the test hypothesis) increases with less compatibility, whereas it is the opposite for the P-value.
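The conversion is a one-liner in R. The sketch below uses a hypothetical helper named `s_value` (not from any package) and reproduces the two examples above, alongside the probability of all heads in k fair tosses for comparison with the original P-value.

```r
# S-value (surprisal) in bits: the negative base-2 log of an observed P-value
s_value <- function(p) -log2(p)   # hypothetical helper name

p <- c(0.05, 0.005)
s <- s_value(p)
k <- round(s)                     # nearest whole number of fair coin tosses

# Probability of all heads in k fair tosses, to compare against p
data.frame(p = p, s = round(s, 1), k = k, all_heads = 0.5^k)
```

Because k is rounded, the coin-toss probability only approximates the P-value; it is a reference point for intuition, not an exact equivalent.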

Examples

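Let’s try using some data to see this in action. I’ll simulate some random data in R from a uniform distribution with the following code:

# Draw 10 values per group from a uniform distribution on [0, 20]
GroupA <- runif(10, 0, 20)
GroupB <- runif(10, 0, 20)

# Combine both groups into one data frame
RandomData <- data.frame(GroupA, GroupB)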

We can plot the data and also run an independent samples t-test.
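A rough sketch of those two steps in base R might look like the following. The seed is an arbitrary assumption added so the sketch is repeatable, which means its numbers will differ from the output reported in this post, and `stripchart()` is a base-graphics stand-in, not necessarily how the original plot was made.

```r
# Sketch: plot the two groups and run an independent samples (Welch) t-test.
set.seed(2018)  # arbitrary seed (assumption); results will not match the post
GroupA <- runif(10, 0, 20)
GroupB <- runif(10, 0, 20)

# Dot plot of both groups, with jitter so overlapping points stay visible
stripchart(list(GroupA = GroupA, GroupB = GroupB),
           vertical = TRUE, method = "jitter", pch = 16, ylab = "Value")

# t.test() defaults to Welch's unequal-variance version (var.equal = FALSE)
result <- t.test(GroupA, GroupB)
result
```

The observed P-value is then available as `result$p.value` for any further transformation.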


Dot plot made with R showing differences between groups of random data


Looks interesting. We can obviously see some differences from the graph. Here’s what our test output gives us:

Welch Two Sample t-test

data: GroupA and GroupB  

t = 1.358, df = 14.856, p-value = 0.1947  

alternative hypothesis: true difference in means is not equal to 0  

95 percent confidence interval:  

-2.137637   9.627015  

sample estimates:  

mean of GroupA mean of GroupB  

10.258502   6.513812

Okay, we cannot reject the test hypothesis (the null hypothesis) at the \(5\) % level and the confidence interval is ridiculously wide. How can I interpret this P-value of \(0.1947\) more intuitively?

Let’s convert it into an S-value (here’s a calculator I constructed that converts P-values into S-values).

\[-\log_{2}(0.1947)=2.36\]

S-value= \(2.36\)

That is \(2.36\) bits of information against the null hypothesis.

How would we interpret it within the context of a given confidence interval? Every value within the computed \(95\) % CI: (\(-2.137637\), \(9.627015\)) has a P-value of at least \(0.05\), so the S-value tells us that these values have at most \(-\log_{2}(0.05)=4.3\) bits of information against them.

Remember, k is the nearest integer to the calculated value of s and in this case, would be \(2\).

So these results (the test statistic) are as surprising as getting all heads in \(2\) fair coin tosses. Not that surprising.
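The arithmetic in this example can be retraced in R from the numbers printed in the test output. This is a sketch that hard-codes the reported values, since the simulated data themselves are not reproducible without a seed; the variable names are illustrative.

```r
# Values reported in the Welch t-test output
t_stat <- 1.358
df     <- 14.856
p_obs  <- 0.1947

# Cross-check: two-sided P-value recomputed from the reported t and df
p_check <- 2 * pt(-abs(t_stat), df = df)

s_obs <- -log2(p_obs)  # bits of information against the test hypothesis
k     <- round(s_obs)  # nearest whole number of fair coin tosses
s_ci  <- -log2(0.05)   # upper bound on S for values inside the 95% CI

c(p_check = p_check, s_obs = s_obs, k = k, all_heads = 0.5^k, s_ci = s_ci)
```

The recomputed P-value should agree with the reported one up to the rounding of the printed t statistic and degrees of freedom.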

The S-value is not meant to replace the P-value, and it isn’t superior to the P-value. It is merely a logarithmic transformation of it that tells us how much information is embedded in the test statistic and can be used as evidence against the test hypothesis.

It is a useful cognitive device that can help us better interpret the information that we get from a calculated P-value.

I’ve constructed a calculator that converts observed p-values into s-values and provides an intuitive way to think about them.

Acknowledgement: The analogies and concepts in this blog can be attributed to Sander Greenland and his works (many of which are referenced below) and I thank him for his extensive commentary and corrections on several versions of this article.

References

  1. Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31(4):337-350.

  2. Gigerenzer G. Statistical Rituals: The Replication Delusion and How We Got There. Advances in Methods and Practices in Psychological Science. 2018;1(2):198-218.

  3. Goodman S. A dirty dozen: twelve p-value misconceptions. Semin Hematol. 2008;45(3):135-140.

  4. Neyman J, Pearson ES. IX. On the problem of the most efficient tests of statistical hypotheses. Philos Trans R Soc Lond A. 1933;231(694-706):289-337.

  5. Lakens D, Adolfi FG, Albers CJ, et al. Justify your alpha. Nature Human Behaviour. 2018;2(3):168-171.

  6. Fisher R. Statistical Methods and Scientific Induction. J R Stat Soc Series B Stat Methodol. 1955;17(1):69-78.

  7. Fisher RA. The Design of Experiments. New York: Hafner Press; 1974.

  8. Royall R. Statistical Evidence: A Likelihood Paradigm. Routledge; 2017.

  9. Jeffreys H. Some Tests of Significance, Treated by the Theory of Probability. Math Proc Cambridge Philos Soc. 1935;31(2):203-222.

  10. Jeffreys H. The Theory of Probability. OUP Oxford; 1998.

  11. Amrhein V, Trafimow D, Greenland S. Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis If We Don’t Expect Replication. Am Stat. 2018.

  12. Greenland S. Valid P-values behave exactly as they should: Some misleading criticisms of P-values and their resolution with S-values. Am Stat. 2018;18(136).

  13. Greenland S. The unconditional information in P-values, and its refutational interpretation via S-values. 2018.

  14. Shannon CE. A Mathematical Theory of Communication. Bell System Technical Journal. 1948;27(3):379-423.

