Two months ago, JAMA published a study comparing the antidepressant escitalopram with placebo for reducing long-term major adverse cardiac events (MACE).
The authors explained in the methods section of their paper how they calculated their sample size and what differences they were looking for between groups.
First, they used some previously published data to get an idea of the incidence rates to expect,
“Because previous studies in this field have shown conflicting results, there was no appropriate reference for power calculation within the designated sample size. The KAMIR study reported a 10.9% incidence of major adverse cardiac events (MACE) over 1 year… Therefore, approximately 50% MACE incidence was expected during a 5-year follow-up.”
Then, they calculated how much power their sample size would give them to detect the differences they were interested in,
“Assuming 2-sided tests, α = .05, and a follow-up sample size of 300, the expected power was 70% and 96% for detecting 10% and 15% group differences, respectively.”
So far so good.
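For readers who have not run a design power calculation, here is a minimal sketch of one for comparing two proportions. It assumes a two-sided z-test with a normal approximation and equal group sizes. The quoted methods do not say which test or software the authors used, so the function name `design_power`, its parameters, and the inputs in the loop are my own illustrative choices, and the output is not an attempt to reproduce their figures.

```python
from statistics import NormalDist


def design_power(p_control, p_treatment, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for two proportions.

    Assumes equal group sizes and a normal approximation with an
    unpooled standard error. Hypothetical helper, not the authors' method.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference in proportions under the design assumptions
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    se = (variance / n_per_group) ** 0.5
    z_effect = abs(p_control - p_treatment) / se
    # Probability that the test statistic lands in either rejection region
    return std_normal.cdf(z_effect - z_crit) + std_normal.cdf(-z_effect - z_crit)


if __name__ == "__main__":
    # Illustrative inputs: assumed incidences of 50% vs 40%, varying group size
    for n in (100, 200, 400):
        print(n, round(design_power(0.50, 0.40, n), 3))
```

Everything that goes into this function is an assumption made before any data exist: the incidences, the group size, and α. That is what makes it a design calculation.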
Then, we get to the results,
“A significant difference was found: composite MACE incidence was 40.9% (61/149) in the escitalopram group and 53.6% (81/151) in the placebo group (hazard ratio [HR], 0.69; 95% CI, 0.49-0.96; P = .03). The model assumption was met (Schoenfeld P = .48). The estimated statistical power to detect the observed difference in MACE incidence rates between the 2 groups was 89.7%.”
Ouch. This bothered me so much that I wrote a letter to the editor (LTE) to point out the issue. Unfortunately, the LTE was rejected. Andrew Althouse suggested that I discuss it over at DataMethods, so I did. I also discussed it on Twitter, but I still wanted to publish the LTE on my blog. Here it is.
Misplaced Confidence in Post-Hoc Power
Kim et al (1) present the data from a randomized trial showing that treatment with escitalopram for 24 weeks lowered the risk of major adverse cardiac events (MACE) in depressed patients following recent acute coronary syndrome. The authors should be commended for exploring new approaches to reduce major cardiac events.
However, in their paper, they seem to misunderstand the purpose of statistical power. In hypothesis testing, power is the probability of correctly rejecting the null hypothesis, given that the alternative hypothesis is true. (2) Power analyses are useful for designing studies but lose their utility after the study data have been analyzed. (3)
In the design phase, Kim et al used data from the Korea Acute Myocardial Infarction Registry (KAMIR) study and expected the power of their study to be 70% and 96% for detecting 10% and 15% group differences, respectively, with a sample of 300 participants. This is an appropriate use of power analysis, often referred to as “design power.” It is based on the study design and beliefs about the true population structure.
After completing the study and analyzing the data, the authors calculated how much power their test had based on the hazard ratio (HR=0.69). They estimated that their test had 89.7% power to detect between-group differences in MACE incidence rates. This is referred to as “observed power” because it is calculated from the observed study data. (3) This form of power analysis is misleading.
One cannot know the true power of a statistical test because there is no way to be sure that the effect size estimate from the study (HR=0.69) is the true parameter. Therefore, one cannot be certain that the observed power (89.7%) is the true power of the test. Observed power can often mislead researchers into having a false sense of confidence in their results. Furthermore, observed power, which is a 1:1 function of the p-value, yields no relevant information not already provided by the p-value. (3,4)
Statistical power is useful for designing future studies, but once the data from a study have been produced, it is the effect sizes, the p-values, and the confidence intervals that contain insightful information. Observed power adds little to this and can be misleading. (3) If one wishes to see how uncertain (3) they should be about their point estimate, then they should look at the width of the confidence intervals and the values they cover. (2)
1. Kim J-M, Stewart R, Lee Y-S, et al. Effect of Escitalopram vs Placebo Treatment for Depression on Long-term Cardiac Outcomes in Patients With Acute Coronary Syndrome: A Randomized Clinical Trial. JAMA. 2018;320(4):350-358.
2. Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31(4):337-350.
3. Hoenig JM, Heisey DM. The Abuse of Power. Am Stat. 2001;55(1):19-24.
4. Greenland S. Nonsignificance plus high power does not imply support for the null over the alternative. Ann Epidemiol. 2012;22(5):364-368.
In a similar tale, a group of surgeons published a methodological article advocating this practice of calculating observed power, which I further discuss here.
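To make the letter's point about observed power and the p-value concrete, here is a minimal sketch for the simplest case: a two-sided z-test where the observed effect is plugged in as if it were the true effect. The function name `observed_power` and the example p-values are hypothetical choices for illustration. Other tests need their own formulas, but the same dependence on the p-value holds.

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """Observed (post-hoc) power implied by a two-sided z-test p-value.

    Treats the observed z statistic as the true standardized effect,
    which is exactly the step the letter argues against.
    Requires 0 < p_value <= 1.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Recover |z| from the two-sided p-value
    z_obs = std_normal.inv_cdf(1 - p_value / 2)
    # Same power formula as at the design stage, fed with the observed effect
    return std_normal.cdf(z_obs - z_crit) + std_normal.cdf(-z_obs - z_crit)


if __name__ == "__main__":
    for p in (0.20, 0.05, 0.01):
        print(p, round(observed_power(p), 3))
```

The function takes nothing but the p-value and α, so it cannot tell you anything the p-value did not already say. In this setup, a result sitting right at p = 0.05 has an observed power of about 50%, and smaller p-values always map to higher observed power.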