Around April of last year, a group of surgeons published a paper in the *Annals of Surgery* (apparently one of the most-read journals in surgical science) in which they suggested that the CONSORT and STROBE guidelines be modified to recommend calculations of post-hoc power, specifically observed power.

They write,

“But, as 80% power is difficult to achieve in surgical studies, we argue that the CONSORT and STROBE guidelines should be modified to include the disclosure of power—even if less than 80%—with the given sample size and effect size observed in that study.”

Some folks noticed, including Andrew Gelman, who decided to write a letter to the editor (LTE) explaining why this is a bad idea.

Gelman writes in the LTE,

"This would be a bad idea. The problem is that the (estimated) effect size observed in a study is noisy, especially so in the sorts of studies discussed by the authors. Using estimated effect size can give a terrible estimate of power, and in many cases can lead to drastic overestimates of power (thus, extreme overconfidence of the sort that is rightly deplored by Bababekov et al. in their article), with the problem becoming even worse for studies that happen to achieve statistical significance.

The problem is well known in the statistical and medical literatures; see, e.g., Lane and Dunlap (1978), Hedges (1984), Goodman and Berlin (1994), Senn (2002), and Lenth (2007). For some discussion of the systemic consequences of biased power calculations based on noisy estimates of effect size, see Button et al. (2013), and for an alternative approach to design and power analysis, see Gelman and Carlin (2014)."

The authors, of course, replied, and most of their argument boils down to this:

“We respectfully disagree that it is wrong to report post hoc power in the surgical literature. We fully understand that P value and post hoc power based on observed effect size are mathematically redundant; however, we would point out that being redundant is not the same as being incorrect… We also respectfully disagree that knowing the power after the fact is not useful in surgical science.”

Andrew Gelman noticed and happened to blog about it, and also got a chance to write a response to the response to the response to the original paper.

You'd guess that *now* it's over. Well… so we thought!

The same group of surgeons came back with a completely new paper in which they reviewed the surgical literature, removed all significant results, and calculated the observed power of each nonsignificant study.

They summarize their findings,

“In this analysis of the current, high-impact surgical literature, we found the median power in studies concluding insignificant differences was 0.16, with only four studies reaching the standard of 0.8.”

They also draw some interesting parallels in their discussion section to show how change can be difficult but necessary,

"The continued inability of surgical studies to reach power of 0.8 should raise the following salient questions: Are we aiming for an arbitrary standard that can rarely be achieved? Are we discrediting the academic contributions of studies that do not reach the arbitrary 0.8 threshold? And, should we reconsider the power standard in surgical science so that it is more applicable to surgery?

It is not unreasonable to challenge standards that may not be applicable.

Before 1971, the voting age was arbitrarily set at 21 years old in the United States. The 26th Amendment lowered the voting age to 18 years in 1971 to acknowledge the contributions of all military servicemen and women, many of whom were under age 21 years and could not vote under earlier laws. Similarly, the surgical community should reconsider the arbitrary power threshold to recognize the contribution of studies in the surgical literature that have been arbitrarily discounted as underpowered."

The statistical community noticed once again, and this led to an explosion on Twitter, DataMethods, and PubPeer.

I also happened to write something (found below, which I also posted on DataMethods and PubPeer) and constructed some simulations to argue why the authors are promoting something misleading.

Despite statisticians warning against the use of observed power for decades (Hoenig & Heisey, 2001), the practice continues, usually among research groups unaware of its problems. Bababekov et al. (the authors of this paper) have not only defended their use of observed power in previous papers, even when criticized by statisticians who have examined the problem, but they unfortunately *encourage others* in surgical science to do the same.

The arguments put forth in this paper have **no mathematical** or **empirical basis**. That is, Bababekov et al. have not given any mathematical proofs or simulated data to make their case. Their argument merely consists of:

“Although it has been argued that P value and post hoc power based on observed effect size are mathematically redundant; we assert that being redundant is not the same as being incorrect.”

“We believe this redundancy is necessary in the unique case of surgical science, as post hoc power improves the communication of results by supplementing information provided by the P value.(12) Moreover, post hoc power provides a single metric to help readers better appreciate the limitation of a study.”
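The redundancy the authors concede is, in fact, exact: for a two-sided z-test, observed power is a deterministic function of the p-value and nothing else, so it cannot "supplement" the p-value with any information. A minimal Python sketch (my own illustration, not the authors' code; the function name and the z-test framing are mine):

```python
from statistics import NormalDist

def observed_power_from_p(p, alpha=0.05):
    """'Observed power' of a two-sided z-test, recovered from the p-value alone.

    Illustrative: the observed effect is treated as if it were the true effect,
    so the result is a fixed transformation of p -- no new information.
    """
    nd = NormalDist()
    z_obs = nd.inv_cdf(1 - p / 2)        # |z| statistic implied by the p-value
    z_crit = nd.inv_cdf(1 - alpha / 2)   # critical value for the chosen alpha
    return (1 - nd.cdf(z_crit - z_obs)) + nd.cdf(-z_crit - z_obs)

# A result sitting exactly at p = 0.05 always maps to observed power of ~0.50:
print(round(observed_power_from_p(0.05), 2))
```

Because the mapping is one-to-one and decreasing, a nonsignificant result (p > 0.05) always yields an observed power below roughly 50%; reporting it alongside p adds nothing a reader could not already deduce.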

The problem is that this practice is **not only redundant**, it is **incorrect** and **highly misleading**. This can easily be shown with simulated data (see images below). Here, I simulate \(45000\) studies from a normal distribution, with a true between-group difference of d = \(0.5\).

Group A has a mean of \(25\) and standard deviation of \(10\), and group B has a mean of \(20\) and also a standard deviation of \(10\). Thus, \((25-20)/10\) gives us a standardized effect size of \(0.5\). This is the *true* effect size, something we would rarely know in the real world (the true effect size is unknown, we can only attempt to estimate it).

In this simulation, the studies fall into nine different sample-size scenarios, so they have varying true power (something we know only because it is a simulation). True power ranges from \(10\)% to \(90\)%, and each of the nine scenarios has samples simulated from a normal distribution.

The graph above has some random variation added between points to make them easier to distinguish, but in reality, this (below) is what the graph looks like without any random variation at all. (These are not lines, but actual points plotted!)

In animated form:

Notice that despite true power being known for each of these studies, observed power varies wildly within each scenario, especially for studies with low(er) power. For example, when true power is set at \(10\)%, many studies yield results showing an "observed power" of \(50\)%. Therefore, if we were to follow the recommendations put forth by Bababekov et al., in several scenarios we would mistakenly put more confidence in results that truly have low power and that yield noisy estimates. So observed power is not reliable at *"improving the communication of results."*
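This spread is easy to reproduce on a smaller scale. Below is a quick Python sketch (my own illustration, separate from the R simulation in this post: it assumes a two-sample z-test with d = 0.5, σ = 1, and n = 4 per group, which gives a true power of roughly 11%):

```python
import random
from statistics import NormalDist

random.seed(1)
nd = NormalDist()
z_crit = nd.inv_cdf(0.975)
z_true = 0.5 * (4 / 2) ** 0.5            # noncentrality: d * sqrt(n / 2)

def power_at(z):
    # Two-sided power of a z-test if the true standardized effect were z
    return (1 - nd.cdf(z_crit - z)) + nd.cdf(-z_crit - z)

print(f"true power: {power_at(z_true):.2f}")

# Observed power plugs the noisy estimate z_hat in place of z_true:
obs = [power_at(abs(random.gauss(z_true, 1.0))) for _ in range(10_000)]
high = sum(p >= 0.5 for p in obs) / len(obs)
print(f"studies reporting 'observed power' >= 50%: {high:.0%}")
```

Even though every one of these simulated studies has the same ~11% true power, a nontrivial share of them would report an observed power of 50% or more, purely from sampling noise.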

This is because Bababekov et al. do not seem to understand long-term frequencies. Statistical power is a frequentist concept: if someone designs a study hoping that the test will correctly detect a violation of the test hypothesis (such as the null hypothesis) \(80\)% of the time, this will occur in the *long run*, given that such a violation exists and that the other assumptions of the test hold.

This does not mean that the study *itself* has an \(80\)% probability of correctly detecting a violation. For example, when we say that a fair coin has a \(50\)% probability of getting either heads or tails, it does not mean we expect the coin to always give us results that fairly balance the distribution of heads and tails. We will not always expect alternating results of heads and tails (H T H T H T). In fact, it is very plausible that if we flip the coin 10 times, we may get all heads or all tails. However, in the long term, we expect the proportion of heads and tails to be fairly equal.
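The coin-flip analogy is easy to check by simulation (a quick Python sketch, not part of the original argument):

```python
import random

random.seed(7)

# 100,000 runs of 10 fair-coin flips each
runs = [[random.random() < 0.5 for _ in range(10)] for _ in range(100_000)]

# Short runs are often badly unbalanced -- all heads or all tails happens
# at a rate of about 2 * (1/2)**10, roughly 1 run in 500:
all_same = sum(all(r) or not any(r) for r in runs) / len(runs)

# ...yet across all 1,000,000 flips the long-run frequency settles near 50%:
heads_rate = sum(sum(r) for r in runs) / (10 * len(runs))

print(f"all-heads or all-tails runs: {all_same:.2%}")
print(f"overall proportion of heads: {heads_rate:.3f}")
```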

Furthermore, because the P-value is a random variable and measures random error in the estimate (Greenland, 2019), taking it (observed P) from a single study and transforming it into a distorted version of power (observed power) is beyond fallacious. The authors not only confuse observed power and true power, but they also confuse the random variable P with its realization, observed P. Observed P-values are useful because they indicate statistical compatibility between the observed data and the test hypothesis.

If they are being used for decision rules with a set alpha, and the researchers suspect that a type-II error was very likely, they can use the computed confidence / compatibility / consonance intervals to see what effect sizes are within the interval and how wide the interval estimates are (which measure random error/precision).
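As a concrete sketch of that workflow (made-up data, and a large-sample normal quantile in place of a t quantile to keep it dependency-free):

```python
from statistics import NormalDist, mean, stdev

# Hypothetical small two-group study where the t-test gives p > 0.05
a = [27, 31, 22, 25, 30, 24, 29, 26]
b = [24, 28, 20, 23, 27, 21, 26, 22]

diff = mean(a) - mean(b)
se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
z = NormalDist().inv_cdf(0.975)   # normal approximation, not a t quantile

lo, hi = diff - z * se, diff + z * se
print(f"difference: {diff:.2f}, 95% interval: ({lo:.2f}, {hi:.2f})")
```

The interval stretches from just below zero up to a large effect: the data are compatible with effects ranging from none to substantial, which communicates the imprecision directly instead of compressing it into a single "observed power" number.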

However, Bababekov et al. do not see this as a solution, as they argue here:

“An alternative solution to mitigate risk of misinterpreting P > 0.05 is for surgical investigators to report the confidence intervals in their results along with the P value.”

“However, the general audience would likely pay more attention to the overlap of confidence intervals in a comparative study with two or more comparison groups with multiple confidence intervals, and still misinterpret the findings as demonstrating equivalency.”

The problem here is that investigators confusing overlapping intervals for equivalency is a *cognitive* problem (Greenland, 2017), a failure to understand what frequentist intervals are, rather than a defect in the method itself.

A computed frequentist interval, such as a \(95\)% confidence / consonance interval, contains all effect sizes that yield P-values greater than \(0.05\) (Cox, 2006). The computed point estimate, often the maximum likelihood estimate (MLE), has the largest P-value; values close to it have slightly smaller P-values and indicate greater statistical compatibility with the data than effect sizes further from the MLE, which have much smaller P-values.

Thus, it is largely irrelevant whether two or more confidence/consonance intervals overlap. The useful questions are:

- how wide are these intervals?
- how much do they overlap?
- what effect sizes do they span?
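The overlap question in particular deserves care: two 95% intervals can overlap even when the test of their difference is significant, because the standard error of a difference pools the two standard errors in quadrature and is therefore smaller than their sum. A small Python sketch with made-up numbers:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)

# Two hypothetical independent estimates (made-up numbers):
m1, se1 = 0.0, 1.0
m2, se2 = 3.2, 1.0

ci1 = (m1 - z * se1, m1 + z * se1)
ci2 = (m2 - z * se2, m2 + z * se2)
overlap = ci1[1] > ci2[0]             # upper end of one crosses the other

# The test of the difference pools the SEs in quadrature, so it is tighter:
z_diff = abs(m2 - m1) / (se1 ** 2 + se2 ** 2) ** 0.5
p_diff = 2 * (1 - NormalDist().cdf(z_diff))

print(f"intervals overlap: {overlap}, p for the difference: {p_diff:.3f}")
```

Here the two intervals overlap, yet the difference between the estimates is significant at the 5% level, so "the intervals overlap" on its own settles nothing.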

Better than computing a single interval estimate for a particular alpha level is to compute the consonance function, where every interval estimate is plotted with its corresponding confidence level (the \(10\)%, \(15\)%, \(25\)%, \(50\)%, \(90\)%, \(95\)%, and \(99\)% confidence / consonance intervals), since the \(95\)% interval is largely arbitrary and a relic of NHST with a \(5\)% alpha.
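A minimal sketch of such a function in Python (hypothetical point estimate and standard error, with a normal approximation): each confidence level yields its own interval, and stacking them shows how compatibility falls off away from the point estimate.

```python
from statistics import NormalDist

est, se = 2.9, 1.5     # hypothetical point estimate and standard error
nd = NormalDist()

intervals = {}
for conf in (0.50, 0.80, 0.90, 0.95, 0.99):
    z = nd.inv_cdf(1 - (1 - conf) / 2)
    intervals[conf] = (est - z * se, est + z * se)
    lo, hi = intervals[conf]
    print(f"{conf:.0%} interval: ({lo:.2f}, {hi:.2f})")
```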

Bababekov et al. instead see observed power as a solution to the “problems” they highlight with interval estimates:

“Power provides a single metric to help readers better appreciate the limitation of a study. As such, we believe reporting power is the most reasonable solution to avoid misinterpreting P > 0.05.”

This argument further highlights the problems with trying to come to conclusions from single studies. The authors wish to answer:

“Was my study large enough and reliable, and if not, what single metric can I use to better understand how reliable the decision (reject/fail to reject) is?”

For this, they have found a method, one that they reify, that seems to address their uncertainty about the data-generating process; unfortunately, it is unreliable and cannot do what they wish it could.

As others have pointed out above, the arguments put forth in this paper are fallacious and thus the manuscript itself is defective and has the potential to seriously mislead many surgical investigators who are not well versed in statistical methods. Letters to the editor (of which there are numerous) and corrections are simply not enough. A retraction is necessary when a manuscript is this defective.

Also, the `R` code to reproduce the simulations:

```r
# Install and load these packages if you haven't already
library(pwr)
library(ggplot2)

# Function that draws two random samples from a normal distribution, runs a
# t-test, and computes the observed power from the observed effect size.
# No need to edit anything inside the function unless you want something to
# be computed differently.
falsepower <- function(N, MeanA, MeanB, SD, pwr) {
  # Generate random values from a normal distribution
  x <- rnorm(N, MeanA, SD)
  y <- rnorm(N, MeanB, SD)
  # Run a t-test between the two groups and extract the observed P-value
  pvalues <- t.test(x, y)$p.value
  # Compute the standard deviations of both groups and pool them
  pooledsd <- sqrt(((N - 1) * (sd(x)^2) + (N - 1) * (sd(y)^2)) / ((N * 2) - 2))
  # Calculate the standardized, observed effect size
  obses <- abs(mean(x) - mean(y)) / pooledsd
  # Calculate the observed power using the observed effect size
  obspower <- pwr.t.test(N, obses, sig.level = 0.05, type = "two.sample",
                         alternative = "two.sided")$power
  df <- data.frame(pvalues, obspower)
  df$pvalues <- round(df$pvalues, digits = 2)
  df$obspower <- round(df$obspower * 100, digits = 2)
  df$truepower <- pwr
  return(df)
}

# Edit the values here if you wish to set different means, standard
# deviations, or sample sizes
df10 <- do.call("rbind",
                replicate(5000,
                          falsepower(N = 5, MeanA = 25, MeanB = 20, SD = 10,
                                     pwr = "10% True Power (N = 10)"),
                          simplify = FALSE))
df20 <- do.call("rbind",
                replicate(5000,
                          falsepower(N = 11, MeanA = 25, MeanB = 20, SD = 10,
                                     pwr = "20% True Power (N = 22)"),
                          simplify = FALSE))
df30 <- do.call("rbind",
                replicate(5000,
                          falsepower(N = 18, MeanA = 25, MeanB = 20, SD = 10,
                                     pwr = "30% True Power (N = 36)"),
                          simplify = FALSE))
df40 <- do.call("rbind",
                replicate(5000,
                          falsepower(N = 25, MeanA = 25, MeanB = 20, SD = 10,
                                     pwr = "40% True Power (N = 50)"),
                          simplify = FALSE))
df50 <- do.call("rbind",
                replicate(5000,
                          falsepower(N = 32, MeanA = 25, MeanB = 20, SD = 10,
                                     pwr = "50% True Power (N = 64)"),
                          simplify = FALSE))
df60 <- do.call("rbind",
                replicate(5000,
                          falsepower(N = 41, MeanA = 25, MeanB = 20, SD = 10,
                                     pwr = "60% True Power (N = 82)"),
                          simplify = FALSE))
df70 <- do.call("rbind",
                replicate(5000,
                          falsepower(N = 51, MeanA = 25, MeanB = 20, SD = 10,
                                     pwr = "70% True Power (N = 102)"),
                          simplify = FALSE))
df80 <- do.call("rbind",
                replicate(5000,
                          falsepower(N = 64, MeanA = 25, MeanB = 20, SD = 10,
                                     pwr = "80% True Power (N = 128)"),
                          simplify = FALSE))
df90 <- do.call("rbind",
                replicate(5000,
                          falsepower(N = 86, MeanA = 25, MeanB = 20, SD = 10,
                                     pwr = "90% True Power (N = 172)"),
                          simplify = FALSE))

# Combine all nine scenarios and plot observed power against observed P-value
df <- rbind(df10, df20, df30, df40, df50, df60, df70, df80, df90)

ggplot(data = df, mapping = aes(x = pvalues, y = obspower)) +
  geom_jitter(aes(fill = truepower), alpha = 0.2, shape = 21,
              stroke = 0.7, size = 2) +
  annotate("rect", xmin = 0.05, xmax = 1, ymin = 0, ymax = 50,
           color = "#990000", alpha = 0.005, size = 0.3) +
  annotate("pointrange", x = 0.05, y = 50, ymin = 0, ymax = 100,
           colour = "#990000", alpha = 0.8, size = 0.5) +
  annotate("segment", x = 0, xend = 0.05, y = 50, yend = 50,
           colour = "#990000") +
  annotate(geom = "text", x = 0.5, y = 58,
           label = "Nonsignificant Results Infrequently \nExceed 50% Observed Power",
           size = 3.5, color = "#000000") +
  annotate(geom = "text", x = 0.5, y = 26,
           label = "Nonsignificant Results", size = 5, color = "#000000") +
  annotate(geom = "text", x = 0.5, y = 75,
           label = "Nonsignificant Results", size = 5, color = "#000000") +
  facet_wrap(~ truepower, ncol = 3) +
  labs(title = "Relationship Between P-values, Observed Power, & True Power",
       subtitle = "Data from 45,000 Simulated Studies with an Effect Size of d = 0.5 (No Random Variation Between Points)",
       x = "Observed P-Value",
       y = "Observed Power (%)",
       fill = "True Power") +
  theme_light() +
  theme(legend.position = "none",
        axis.title.x = element_text(size = 14, vjust = -2),
        axis.text.x = element_text(color = '#000000'),
        axis.title.y = element_text(size = 14),
        strip.background = element_rect(fill = "#dddddd"),
        strip.text = element_text(color = 'black', size = 12)) +
  scale_x_continuous(breaks = seq(0, 1, .10), expand = c(0, 0)) +
  scale_y_continuous(breaks = seq(0, 100, 25), expand = c(0, 0)) +
  theme(text = element_text(color = "#000000")) +
  theme(plot.title = element_text(size = 16),
        plot.subtitle = element_text(size = 14))
```