A recent survey of biostatisticians in the US found that many of them received inappropriate requests when analyzing data for their collaborators/clients.1 Requests ranged from things like spinning results to meet expectations to deleting observations so that the data would better fit the hypothesis.
In a (somewhat) similar tale, a recent review2 of introductory psychology textbooks found that nearly 89% of 30 books (many of which were best sellers) had incorrect definitions of statistical significance and P-values.
What am I trying to say here by pointing to these studies?
Statistics is necessary for nearly every discipline in science, yet little attention is given to these topics and their controversies during the training of researchers. These issues aren’t limited to psychology or medicine; they’re prevalent in nearly every field.3–5
To many applied researchers, statistics may just be an obstacle in the way of their real interests, which may require years of formal training. If a biologist has already spent several years learning the processes that occur in cells, or something like the Krebs cycle, what would incentivize them to spend more time learning how to analyze data outside of their formal training? Why can’t everyone just use the default settings of statistical programs without understanding what’s going on under the hood?6 What’s wrong with shortcuts?
And of course, there are several shortcut proposals such as redefining statistical significance,10 using completely new methods,11 or just switching over to Bayesian inference (especially Bayesian testing).12
But I think many of these proposals miss the target, and Andrew Gelman explains why:
“The only reasonable inference to conclude here is that applied statistics is hard. Doing a statistical analysis is like playing basketball, or knitting a sweater. You can get better with practice.”
Unfortunately, there aren’t many resources that are both correct and easy to read and that can help researchers practice well. Many of the issues still present today have already been discussed to death by statisticians… in journals for statisticians, with hundreds of proofs and simulations. Yet, applied researchers keep making the same mistakes again and again.
Why Bayesian Inference Isn’t the Answer
Again, statistics is hard, probability is hard, and both frequentism and Bayesian inference are hard. None of the cognitive issues that we encounter with frequentist statistics will magically vanish with Bayesian statistics. Many of these issues have been discussed at length by Greenland, 2017:13
- Pseudoskepticism from placing unrealistic emphasis on the point null
- The desire to dichotomize continuous variables for ease
- Reifying statistical models and not questioning their assumptions
The issues of P-values, statistical significance, confidence intervals, statistical power, and the overall replication crisis can’t be resolved simply by correcting someone on social media, writing a letter to the editor, or condescendingly screaming about how Bayesian statistics is superior.
I believe these topics need deep discussions, explaining what they are, what they are not, how to use them, how to think about them, how to correctly write interpretations, things to question about them, along with several other discussions.
To effectively tackle these issues, rote memorization is not enough. We need better language to describe what we’re speaking of, better analogies, better visualizations, and easy access to tools that can aid understanding of data.
With these things in mind, Sander Greenland and I have written a pair of papers14,15 hoping to make some of these things easier to understand, while also using real examples where researchers have fallen prey to some notable cognitive biases. We discuss potential changes to the way we talk about statistics and think about probabilities to improve understanding and to reduce misinterpretations.
We originally wrote a single paper hoping to discuss some of these things, but after getting feedback from 18 statisticians and methodologists, we ended up splitting it into two papers so we could discuss each topic in greater depth.
In paper 1, we cover:
- P-values, in depth, and their reconciliation with S-values
- Testing several alternative hypotheses of interest rather than just the null
- Graphical functions/tables to present alternative results
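For readers who have not met S-values before, here is a minimal sketch of the conversion, assuming the standard definition of the S-value as the binary surprisal of a P-value, s = log2(1/p). The function name `s_value` is a hypothetical helper chosen for illustration and is not taken from the papers.

```python
import math


def s_value(p: float) -> float:
    """Return the S-value (binary surprisal, in bits) of a P-value.

    Assumes the definition s = log2(1/p); `s_value` is an
    illustrative name, not code from the papers.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return math.log2(1.0 / p)


if __name__ == "__main__":
    # Print the surprisal for a few commonly quoted P-values.
    for p in (0.5, 0.25, 0.05, 0.005):
        print(f"p = {p:<5} -> s = {s_value(p):.2f} bits")
```

One way to read the result is through coin tossing: s bits of information against the tested hypothesis is about as surprising as seeing s heads in a row from a coin assumed to be fair. A P-value of 0.05 gives roughly 4.3 bits, so it is only a little more surprising than four heads in a row.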
In paper 2:
- Why unconditional interpretations of statistics need to be emphasized
- Why terminology change is needed for reform
- How discussion needs to move on to decisions and their costs
I think many individuals will find these discussions to be interesting, even if they were to disagree with most of what we say.
And if anyone wishes to discuss the contents of these papers, they are invited to comment below on this blog or on PubPeer. The PubPeer links for the corresponding papers are below:
Andrew Gelman has also posted about the second paper and there may be an ongoing discussion there.
1. Wang MQ, Yan AF, Katz RV. Researcher requests for inappropriate analysis and reporting: A U.S. Survey of consulting biostatisticians. Annals of Internal Medicine. 2018;169(8):554. doi:10.7326/M18-1230
2. Cassidy SA, Dimova R, Giguère B, Spence JR, Stanley DJ. Failing grade: 89% of introduction-to-psychology textbooks that define or explain statistical significance do so incorrectly. Advances in Methods and Practices in Psychological Science. June 2019. doi:10.1177/2515245919858072
3. Camerer CF, Dreber A, Forsell E, et al. Evaluating replicability of laboratory experiments in economics. Science. 2016;351(6280):1433-1436. doi:10.1126/science.aaf0918
4. Freedman LP, Cockburn IM, Simcoe TS. The economics of reproducibility in preclinical research. PLoS Biol. 2015;13(6):e1002165. doi:10.1371/journal.pbio.1002165
5. Lash TL, Collin LJ, Van Dyke ME. The replication crisis in epidemiology: Snowball, snow job, or winter solstice? Current Epidemiology Reports. 2018;5(2):175-183. doi:10.1007/s40471-018-0148-x
6. Stark PB, Saltelli A. Cargo-cult statistics and scientific crisis. Significance. 2018;15(4):40-43. doi:10.1111/j.1740-9713.2018.01174.x
7. Camerer CF, Dreber A, Holzmeister F, et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour. 2018;2(9):637-644. doi:10.1038/s41562-018-0399-z
8. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716. doi:10.1126/science.aac4716
9. Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci. 2011;22(11):1359-1366. doi:10.1177/0956797611417632
10. Benjamin DJ, Berger JO, Johannesson M, et al. Redefine statistical significance. Nature Human Behaviour. 2017;2(1):6-10. doi:10.1038/s41562-017-0189-z
11. Colquhoun D. The False Positive Risk: A Proposal Concerning What to Do About p-Values. The American Statistician. 2019;73(sup1):192-201. doi:10.1080/00031305.2018.1529622
12. Goodman SN. Of P-values and Bayes: A modest proposal. Epidemiology. 2001;12(3):295-297.
13. Greenland S. Invited commentary: The need for cognitive science in methodology. American Journal of Epidemiology. 2017;186(6):639-645. doi:10.1093/aje/kwx259
14. Chow ZR, Greenland S. Semantic and cognitive tools to aid statistical inference: Replace confidence and significance by compatibility and surprise. arXiv:1909.08579 [stat]. September 2019. http://arxiv.org/abs/1909.08579.
15. Greenland S, Chow ZR. To aid statistical inference, emphasize unconditional descriptions of statistics. arXiv:1909.08583 [stat]. September 2019. http://arxiv.org/abs/1909.08583.