Statistical Glossary

Definitions of common statistical terms, with notes on their proper interpretation

A reference glossary of foundational statistical concepts — P-values, confidence intervals, S-values, likelihoods, Bayesian and frequentist terminology — with formal definitions and common misinterpretations.
Author
Affiliation

Less Likely

Published

May 26, 2026

Keywords

glossary, statistics, p-value, confidence interval, s-value, likelihood, bayes factor, null hypothesis, statistical power, effect size

This glossary collects working definitions for terms that appear frequently in statistical practice. The first two entries, P-value and confidence interval, are treated at greater depth because they are the most widely used and the most widely misinterpreted [1]. Subsequent entries are intentionally shorter and ordered roughly alphabetically.

Where a term is treated in more depth elsewhere on the site, you’ll find a cross-link to the relevant post.


P-value

The P-value is the probability, assuming a specified statistical model is correct, of obtaining a test statistic at least as extreme as the one actually observed.

Formally, for an observed test statistic t_{\text{obs}} and a model M (which includes the null hypothesis H_{0} together with all auxiliary assumptions about sampling, independence, distributional form, and measurement),

P = \Pr\!\bigl(T \ge t_{\text{obs}} \,\big|\, M \bigr)

for a one-sided test. Two-sided P-values double the smaller one-sided tail (for symmetric reference distributions) or sum the two tails (more generally).
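As a minimal sketch of the definition above, assume the test statistic is a z statistic whose reference distribution under M is standard normal. The function name `z_test_p_values` and the observed value 1.96 are illustrative choices, not part of the glossary.

```python
from statistics import NormalDist


def z_test_p_values(z_obs):
    """One- and two-sided P-values for an observed z statistic.

    Assumes the reference distribution under the model M is standard normal.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    upper = 1.0 - std_normal.cdf(z_obs)  # Pr(T >= z_obs | M)
    lower = std_normal.cdf(z_obs)        # Pr(T <= z_obs | M)
    # Symmetric reference distribution: double the smaller tail.
    two_sided = min(1.0, 2.0 * min(upper, lower))
    return {"upper": upper, "lower": lower, "two_sided": two_sided}


p = z_test_p_values(1.96)  # hypothetical observed statistic
print(p)
```

For a non-symmetric reference distribution the doubling step would need to be replaced by a rule suited to that distribution.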

Properties

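  • Takes values in the unit interval [0, 1]; it is continuous when the test statistic has a continuous distribution.
  • Uniformly distributed on [0, 1] when the model M is exactly correct and the test statistic is continuous.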
  • Smaller P-values indicate that the data are less compatible with the model than larger ones.
  • The P-value refers to the entire model, not just the null hypothesis. A small P-value can reflect a failure of any assumption inside the model, including but not limited to the null [2].
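The uniformity property can be checked by simulation. The sketch below assumes a simple setting chosen for illustration: normal data with known standard deviation 1, a true mean of 0, and a two-sided z-test of that same mean, so that every assumption in M holds. The sample size, number of simulations, and seed are arbitrary.

```python
import random
from statistics import NormalDist


def simulate_p_values(n_sims=10_000, n=30, seed=1):
    """Two-sided z-test P-values when the model (mean 0, sd 1) is exactly true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_sims):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(sample) / n) * n ** 0.5  # standard error is 1 / sqrt(n)
        p_values.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
    return p_values


p_values = simulate_p_values()
# Under uniformity, about 5% of P-values fall below 0.05 and about 50% below 0.5.
share_below_05 = sum(p < 0.05 for p in p_values) / len(p_values)
share_below_50 = sum(p < 0.50 for p in p_values) / len(p_values)
print(share_below_05, share_below_50)
```

If any assumption were violated in the simulation (for example, a non-zero true mean), the distribution of P-values would pile up near zero instead.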

Common misinterpretations

Warning: What a P-value is not

A P-value is not:

  • the probability that the null hypothesis is true;
  • the probability that the data arose by chance alone;
  • 1 minus the probability that the alternative hypothesis is true;
  • a measure of the size or importance of an effect;
  • a measure of the strength of evidence against the null, except in a very narrow technical sense.

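The first three interpretations are wrong because they confuse \Pr(\text{data} \mid \text{hypothesis}) with \Pr(\text{hypothesis} \mid \text{data}); the latter requires Bayes’ theorem and a prior [3, 4]. The last two fail because a P-value depends on the sample size and the precision of the estimate as well as on the size of the effect.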

The American Statistical Association’s 2016 statement on P-values [3] is the canonical reference for what P-values can and cannot tell us.

Confidence interval

A confidence interval is an interval estimator constructed so that, under repeated application of the same procedure to data generated by the same process, the interval contains the true parameter value a specified proportion of the time.

Formally, a (1-\alpha) confidence interval for a parameter \theta is a random interval \bigl[L(\mathbf{X}),\; U(\mathbf{X})\bigr] depending on the data \mathbf{X} such that

\Pr\!\bigl(L(\mathbf{X}) \le \theta \le U(\mathbf{X}) \,\big|\, \theta\bigr) \ge 1 - \alpha \quad \text{for every value of } \theta.

The probability statement is about the interval, which is random before the data are observed. Once the data are in hand and the interval is computed, the parameter is either inside it or it isn’t — but we don’t know which (Neyman, 1937).

Properties

  • The conventional 95\% level corresponds to \alpha = 0.05: the parameter values inside the interval are those with a two-sided P-value above 0.05.
  • For a normally distributed estimator, the standard form is \hat{\theta} \pm z_{1 - \alpha/2}\,\widehat{\mathrm{SE}}(\hat{\theta}), where z_{1-\alpha/2} is the appropriate normal quantile and \widehat{\mathrm{SE}} is the estimated standard error.
  • Coverage is a long-run, frequency property of the procedure — not a property of any particular computed interval.
  • The interval contains exactly those parameter values that would not be rejected by a test at significance level \alpha. For this reason, some authors call it a compatibility interval [5].
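The standard form for a normally distributed estimator can be sketched for a sample mean. The data values and the function name `wald_interval` are hypothetical. The normal quantile is used as in the formula above; for a sample this small, a t quantile would usually be preferred in practice.

```python
from statistics import NormalDist, mean, stdev


def wald_interval(sample, level=0.95):
    """Estimate +/- z * SE interval for the mean of a sample."""
    n = len(sample)
    estimate = mean(sample)
    se = stdev(sample) / n ** 0.5  # estimated standard error of the mean
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)  # z_{1 - alpha/2}
    return estimate - z * se, estimate + z * se


data = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 5.1]  # made-up measurements
lower, upper = wald_interval(data)
print(lower, upper)
```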

Common misinterpretations

Warning: What a confidence interval is not

A 95\% confidence interval is not:

  • a 95\% probability that the true parameter lies in the calculated interval (that is a credible interval and requires a prior);
  • a range that contains 95\% of future observations (that is a prediction interval);
  • a range of equally plausible parameter values (the values near the point estimate are typically more compatible with the data than those at the edges);
  • a region outside of which results are “non-significant”; the relationship runs the other way, since a parameter value \theta_{0} is rejected at level \alpha exactly when it lies outside the (1-\alpha) interval.

A useful way to understand the interval is through coverage simulation: repeatedly drawing samples and constructing an interval each time. Roughly 95\% of those intervals — across the simulations, not across parameter values — will cover the true parameter.
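A coverage simulation of this kind might look like the following sketch. It assumes normal data with a known standard deviation, so the interval is the sample mean plus or minus z times \sigma / \sqrt{n}; the true mean, sample size, number of simulations, and seed are arbitrary illustrative values.

```python
import random
from statistics import NormalDist


def coverage(true_mean=10.0, sd=2.0, n=50, n_sims=5_000, level=0.95, seed=42):
    """Share of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    se = sd / n ** 0.5  # sd is treated as known in this sketch
    hits = 0
    for _ in range(n_sims):
        sample_mean = sum(rng.gauss(true_mean, sd) for _ in range(n)) / n
        if sample_mean - z * se <= true_mean <= sample_mean + z * se:
            hits += 1
    return hits / n_sims


observed_coverage = coverage()
print(observed_coverage)  # expected to be close to 0.95
```

Each simulated interval either covers the true mean or does not; the 95\% figure only emerges as a proportion across the repetitions.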

Bayes factor

The Bayes factor comparing hypotheses H_{1} and H_{0} is the ratio of the marginal likelihoods of the data under each hypothesis:

\mathrm{BF}_{10} \;=\; \frac{\Pr(\text{data} \mid H_{1})}{\Pr(\text{data} \mid H_{0})}.

It quantifies how much the data should shift the prior odds in favour of H_{1} relative to H_{0}. Unlike a P-value, a Bayes factor requires the analyst to fully specify both hypotheses, including any unknown parameters’ prior distributions (Kass & Raftery, 1995).
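As a hedged illustration, consider binomial data with k successes in n trials, where H_{0} fixes the success probability at 0.5 and H_{1} places a uniform prior on it. Both hypotheses and the prior are assumptions made for this example. Under the uniform prior, the marginal likelihood of any k is 1/(n+1).

```python
from math import comb


def bayes_factor_binomial(k, n, theta0=0.5):
    """BF_10 for H1: theta ~ Uniform(0, 1) versus H0: theta = theta0."""
    # Marginal likelihood under H1: the binomial probability integrated
    # over a uniform prior on theta, which equals 1 / (n + 1) for every k.
    marginal_h1 = 1.0 / (n + 1)
    # Under H0 there is no free parameter, so the marginal likelihood
    # is just the binomial probability at theta0.
    marginal_h0 = comb(n, k) * theta0 ** k * (1.0 - theta0) ** (n - k)
    return marginal_h1 / marginal_h0


bf_60 = bayes_factor_binomial(k=60, n=100)  # hypothetical data
bf_80 = bayes_factor_binomial(k=80, n=100)  # hypothetical data
print(bf_60, bf_80)
```

A different prior under H_{1} would give a different Bayes factor for the same data, which is why the prior has to be stated.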

Bayesian inference

A school of inference in which probability represents a degree of belief about a proposition, updated as data arrive via Bayes’ theorem:

\underbrace{\Pr(\theta \mid \mathbf{X})}_{\text{posterior}} \;\propto\; \underbrace{\Pr(\mathbf{X} \mid \theta)}_{\text{likelihood}} \;\times\; \underbrace{\Pr(\theta)}_{\text{prior}}.

Bayesian methods deliver full probability distributions over parameters and predictions, conditional on the specified model and prior.
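The proportionality above can be made concrete with a grid approximation. The sketch assumes a binomial likelihood for a success probability \theta and a flat prior; the counts (7 successes, 3 failures) and the grid size are illustrative.

```python
def grid_posterior(successes, failures, grid_size=1001):
    """Posterior over theta on an evenly spaced grid, using a flat prior."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size  # flat prior (an assumption of this sketch)
    likelihood = [t ** successes * (1.0 - t) ** failures for t in thetas]
    unnormalised = [lik * pri for lik, pri in zip(likelihood, prior)]
    total = sum(unnormalised)
    posterior = [u / total for u in unnormalised]  # sums to 1 over the grid
    return thetas, posterior


thetas, posterior = grid_posterior(successes=7, failures=3)
posterior_mean = sum(t * p for t, p in zip(thetas, posterior))
print(posterior_mean)
```

With a flat prior the exact posterior here is Beta(8, 4), whose mean is 8/12, so the grid result can be checked against a known answer.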

Bayesian credible interval

An interval [L, U] such that the posterior probability of the parameter lying in [L, U] equals a chosen level, e.g. \Pr(L \le \theta \le U \mid \mathbf{X}) = 0.95. This is the interval that can be interpreted as “a 95% probability that the parameter lies in the interval” — but only because it conditions on a prior.
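One way to obtain such an interval is from posterior draws. The sketch below assumes the posterior is a Beta(8, 4) distribution (for instance, a flat prior updated with 7 successes and 3 failures) and reports an equal-tailed interval from Monte Carlo samples; the number of draws and the seed are arbitrary.

```python
import random


def beta_credible_interval(a, b, level=0.95, n_draws=50_000, seed=7):
    """Equal-tailed credible interval estimated from Beta(a, b) draws."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(a, b) for _ in range(n_draws))
    tail = (1.0 - level) / 2.0
    lower = draws[int(tail * n_draws)]
    upper = draws[int((1.0 - tail) * n_draws) - 1]
    return lower, upper


ci_lower, ci_upper = beta_credible_interval(8.0, 4.0)
print(ci_lower, ci_upper)
```

An equal-tailed interval is only one convention; a highest-density interval would generally differ for a skewed posterior like this one.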

Effect size

A quantitative summary of the magnitude of a phenomenon. Examples include a mean difference, a risk ratio, an odds ratio, a correlation coefficient, or a standardised effect such as Cohen’s d:

d \;=\; \frac{\bar{x}_{1} - \bar{x}_{2}}{s_{\text{pooled}}}.

Effect sizes describe how much; P-values describe how compatible the data are with a model. Confusing the two is one of the most common errors in applied research [6].
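A short sketch of the Cohen's d formula follows, taking s_{\text{pooled}} to be the usual pooled standard deviation weighted by degrees of freedom (an assumption, since the glossary does not spell it out). The two groups are made-up numbers.

```python
from statistics import mean, variance


def cohens_d(group1, group2):
    """Standardised mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / pooled_var ** 0.5


treated = [2, 4, 6]  # hypothetical measurements
control = [1, 3, 5]  # hypothetical measurements
d = cohens_d(treated, control)
print(d)  # means 4 and 3, pooled SD 2, so d = 0.5
```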

Frequentist inference

A school of inference in which probability refers to the long-run relative frequency of events under hypothetical repetitions of an experiment. Parameters are treated as fixed but unknown quantities; uncertainty is described by properties of estimators across repeated sampling (e.g. unbiasedness, coverage, type I error rate).

Likelihood

For a fixed dataset \mathbf{X}, the likelihood function is the probability (or density) of the data viewed as a function of the parameter \theta:

L(\theta) \;=\; \Pr(\mathbf{X} \mid \theta).

Likelihood is not a probability over parameters — it is a function of \theta for fixed data. Its shape is what most inferential procedures (maximum likelihood, likelihood ratio tests, Bayesian updating) ultimately depend on (Fisher, 1922).
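To illustrate, the sketch below evaluates a binomial likelihood on a grid of \theta values for fixed data. The binomial model and the data (7 successes in 10 trials) are assumptions chosen for the example.

```python
from math import comb


def binomial_likelihood(theta, k, n):
    """Probability of k successes in n trials, viewed as a function of theta."""
    return comb(n, k) * theta ** k * (1.0 - theta) ** (n - k)


thetas = [i / 100 for i in range(101)]
values = [binomial_likelihood(t, k=7, n=10) for t in thetas]

# The grid point with the highest likelihood approximates the
# maximum likelihood estimate, here k / n = 0.7.
mle = thetas[values.index(max(values))]

# Approximate area under the likelihood curve over theta (grid spacing 0.01).
area = sum(values) * 0.01
print(mle, area)
```

The area under the curve comes out near 1/11 rather than 1, which is one way to see that the likelihood is not a probability distribution over \theta.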

NHST (Null Hypothesis Significance Testing)

The hybrid testing procedure most often taught in introductory courses: compute a test statistic, compute a P-value against a null hypothesis, and reject the null if the P-value falls below a pre-specified threshold \alpha (typically 0.05).

NHST mixes elements of Fisherian significance testing and Neyman–Pearson hypothesis testing, which are conceptually distinct frameworks. The hybrid form has been criticised for producing dichotomous “significant / not significant” thinking that ignores effect sizes and uncertainty [1, 5].

Null hypothesis

A specific value or set of values for a parameter that is taken as the reference point for a statistical test, usually denoted H_{0}. In a comparison of two group means, H_{0} is typically \mu_{1} - \mu_{2} = 0 (“no difference”), but the null hypothesis can be any value the analyst chooses to test.

Power

The probability that a test correctly rejects the null hypothesis when a specified alternative is true:

\text{Power} \;=\; 1 - \beta \;=\; \Pr(\text{reject } H_{0} \mid H_{1} \text{ true}).

Power depends on the effect size, the sample size, the variability in the data, and the chosen significance level \alpha. In underpowered studies, the estimates that do reach statistical significance tend to exaggerate the true effect (Type M error) and are more likely to have the wrong sign (Type S error) [7].
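The dependence on these four inputs can be sketched for one simple case: a two-sided z-test of a mean with known standard deviation. That test, the function name `z_test_power`, and the numbers passed to it are all assumptions made for illustration.

```python
from statistics import NormalDist


def z_test_power(effect, sd, n, alpha=0.05):
    """Power of a two-sided z-test for a mean when the true shift is `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = effect * n ** 0.5 / sd  # true effect in standard-error units
    # Probability that the statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


power_small_n = z_test_power(effect=0.5, sd=1.0, n=10)
power_large_n = z_test_power(effect=0.5, sd=1.0, n=32)
print(power_small_n, power_large_n)
```

When the effect is zero the function returns \alpha, since rejecting a true null is a Type I error rather than a correct detection.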

Posterior distribution

The probability distribution over parameters after observing data, \Pr(\theta \mid \mathbf{X}). The posterior is the answer that Bayesian inference is designed to produce. All Bayesian summaries — point estimates, credible intervals, predictive distributions — are derived from it.

Prior distribution

The probability distribution that encodes information about parameters before observing data, \Pr(\theta). Priors can be informative (reflecting substantive knowledge), weakly informative (regularising), or attempted as “non-informative” (a more contentious notion than it sounds).

S-value

A transformation of the P-value into Shannon information bits:

S \;=\; -\log_{2}(P).

An S-value of s bits represents the same amount of information against the model as observing s heads in a row from a hypothetically fair coin. The transformation puts P-values on an additive scale where differences are easier to interpret: P = 0.10 and P = 0.01 can look similar as probabilities, but they correspond to about 3.3 and 6.6 bits, so the second carries roughly twice the information against the model [8].
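The transformation is a one-liner; the sketch below applies it to a few illustrative P-values. The function name `s_value` is a label chosen here.

```python
from math import log2


def s_value(p):
    """Shannon information (in bits) against the model for a P-value p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("P-value must lie in (0, 1]")
    return -log2(p)


for p in (0.5, 0.25, 0.10, 0.05, 0.01):
    print(f"P = {p:<5} ->  S = {s_value(p):.2f} bits")
```

P = 0.5 gives exactly 1 bit (one head from a fair coin) and P = 0.25 gives 2 bits (two heads in a row).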

Standard error

The standard deviation of an estimator’s sampling distribution — i.e., how much the estimate would vary from sample to sample if the experiment were repeated. For an estimator \hat{\theta},

\mathrm{SE}(\hat{\theta}) \;=\; \sqrt{\operatorname{Var}(\hat{\theta})}.

The standard error is what goes into the denominator of test statistics (t, z) and into the half-width of confidence intervals. It is not the standard deviation of the data — that’s the sample SD. The two are related by \mathrm{SE}(\bar{x}) = s / \sqrt{n} for the sample mean.
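The "sample to sample" definition can be checked by simulation: draw many samples, compute the mean of each, and take the standard deviation of those means. The sketch assumes normal data with a population standard deviation of 3 and samples of size 25, so the theoretical value \sigma / \sqrt{n} is 0.6; these numbers and the seed are arbitrary.

```python
import random
from statistics import stdev


def simulate_se_of_mean(sd=3.0, n=25, n_sims=5_000, seed=3):
    """Standard deviation of the sample mean across repeated samples."""
    rng = random.Random(seed)
    sample_means = [
        sum(rng.gauss(0.0, sd) for _ in range(n)) / n for _ in range(n_sims)
    ]
    return stdev(sample_means)


empirical_se = simulate_se_of_mean()
theoretical_se = 3.0 / 25 ** 0.5  # sigma / sqrt(n)
print(empirical_se, theoretical_se)
```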

Type I and Type II error

In the Neyman–Pearson framework, the two ways a test can be wrong:

  • Type I error: rejecting the null hypothesis when it is actually true. Its probability is \alpha (the significance level).
  • Type II error: failing to reject the null when an alternative is true. Its probability is \beta; power is 1 - \beta.

Choosing \alpha and the sample size sets a target for both error rates. The two trade off: at a fixed sample size and effect size, lowering \alpha raises \beta [9].


Further reading

For deeper treatments of the concepts covered here, the following references are particularly useful:

Note: A note on terminology

Several entries above flag a tension between conventional definitions and newer terminology proposed in the statistical reform literature — for example, the suggestion to read confidence intervals as “compatibility intervals.” Both framings are mathematically equivalent; the disagreement is about which interpretation is least likely to mislead. This glossary uses the conventional terminology while pointing out the relevant alternatives.

References

1. Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, et al. (2016). “Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations.” European Journal of Epidemiology. 31:337–350. doi: 10.1007/s10654-016-0149-3.
2. Greenland S. (2019). “Valid P-values behave exactly as they should: Some misleading criticisms of P-values and their resolution with S-values.” The American Statistician. 73:106–114. doi: 10.1080/00031305.2018.1529625.
3. Wasserstein RL, Lazar NA. (2016). “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician. 70:129–133. doi: 10.1080/00031305.2016.1154108.
4. Goodman S. (2008). “A Dirty Dozen: Twelve P-Value Misconceptions.” Seminars in Hematology. 45:135–140. doi: 10.1053/j.seminhematol.2008.04.003.
5. Amrhein V, Greenland S, McShane B. (2019). “Scientists rise up against statistical significance.” Nature. 567:305. doi: 10.1038/d41586-019-00857-9.
6. Cohen J. (1988). “Statistical Power Analysis for the Behavioral Sciences.” Erlbaum Associates, Hillsdale.
7. Gelman A, Loken E. (2014). “The statistical crisis in science.” American Scientist.
8. Rafi Z, Greenland S. (2020). “Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise.” BMC Medical Research Methodology. 20:244. doi: 10.1186/s12874-020-01105-9.
9. Neyman J, Pearson ES. (1933). “On the Problem of the Most Efficient Tests of Statistical Hypotheses.” Philosophical Transactions of the Royal Society of London Series A, Containing Papers of a Mathematical or Physical Character. 231:289–337. doi: 10.1098/rsta.1933.0009.