Medicine Is Being Treated with Snake-Oil Statistics

Authors
Affiliations

Zad Rafi

Department of Population Health, NYU Langone Medical Center, New York, NY, USA

Andrew Gelman

Department of Statistics and Department of Political Science, Columbia University, New York, NY, USA

Aleksi Reito

Department of Surgery, Central Finland Hospital, Jyväskylä, Keski-Suomi, Finland

Sander Greenland

Department of Epidemiology and Department of Statistics, University of California, Los Angeles, CA, USA

Published

November 11, 2020

Suggested Running Head: Snake Oil Statistics


Statistics has helped medicine move away from an eminence-based framework, where subject-matter experts decided what worked and what didn’t, towards an evidence-based one. Among the techniques deployed, null-hypothesis significance testing (NHST) is the most common, where the word “null” is invariably taken by users to mean that the only hypotheses tested are those of “no association” or “no effect” (sometimes labeled “nil hypotheses”). Despite objections extending throughout the past century1, these tests are supposed to serve as a safeguard against researchers fooling themselves and others, as well as serving as a central component of experimental design and analysis.

Unfortunately, these tests are easily subverted into playing the reverse role of providing a badge of approval, allowing researchers to make stronger claims than warranted by valid statistical analyses. Confidence intervals have been extensively promoted to address this problem, but they too have been subverted by being treated as if they are only testing null hypotheses. The consequences for evidence-based medicine have been dire.

There are three commonly stated principles (shown in Figure 1) of evidence-based medicine2:

  1. reliance on statistically significant results (and thus NHST) from randomized controlled trials,
  2. balancing of costs, benefits, and uncertainties in decision making, and
  3. combining clinical expertise with external evidence to tailor treatments for individuals.

Figure 1

Unfortunately, the use of NHST can get in the way of the movement toward an evidence-based framework. This may sound paradoxical, given that one of the foundations of evidence-based medicine is hypothesis testing based on randomized controlled trials, deemed by many to be the most reliable form of evidence2. One problem is that reliance on statistical significance (principle 1) may conflict with the other principles. In conflict with balancing costs and benefits under uncertainty (principle 2), statistical significance or non-significance is typically used to replace uncertainty with certainty3, 4; indeed, researchers are encouraged to do this, and some journals even require it5.

Reliance on statistical significance can interact badly with expertise and background evidence (principle 3), often leading to incoherent attempts at resolution. For example, researchers may do separate analyses for men and women, often with no biological basis for expecting other than small differences in effects (if any). In doing so they often find that one group (usually the larger one, which is usually men) shows a “significant” effect while the other does not. They then misreport this difference in “significance” as if it represented a significant difference in effect between groups, when in fact it may reflect nothing more than a difference in the group sizes plus random differences in group-specific P-values. Such misinterpretations fool researchers into thinking that the data support targeting treatments at the group showing “significance” and mislead clinicians into tailoring treatment plans for individuals based on nothing more than random variation.
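The subgroup fallacy described above can be made concrete with a small calculation. The following is a minimal sketch under stated assumptions: the effect estimates, group sizes, and outcome standard deviation are hypothetical numbers chosen for illustration, and the P-values use a simple normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate, se):
    """Two-sided P-value for the hypothesis of zero effect (normal approximation)."""
    z = abs(estimate / se)
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical trial: the estimated treatment effect is identical in both
# subgroups, but one subgroup is four times larger than the other.
sd = 10.0                          # assumed outcome standard deviation
effect_men, n_men = 2.5, 400       # estimated effect, patients per arm
effect_women, n_women = 2.5, 100

# Standard error of a difference in means between two equal-sized arms.
se_men = sd * sqrt(2 / n_men)
se_women = sd * sqrt(2 / n_women)

p_men = two_sided_p(effect_men, se_men)
p_women = two_sided_p(effect_women, se_women)

# The comparison that matters for tailoring treatment:
# does the effect differ between the subgroups?
difference = effect_men - effect_women
se_difference = sqrt(se_men**2 + se_women**2)
p_difference = two_sided_p(difference, se_difference)

print(f"men:        P = {p_men:.4f}")
print(f"women:      P = {p_women:.4f}")
print(f"difference: P = {p_difference:.4f}")
```

With these made-up inputs the larger subgroup falls below 0.05 and the smaller one does not, even though the estimated effects are the same and the data contain no evidence at all of a difference between the groups.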

The unfortunate reality is that estimating effects for individuals or population subsets requires far more sophistication than basic testing procedures6. It can take several times more patients to estimate variation in effects (“interactions”) than average effects7. Given that few studies are large enough to estimate main effects of interest, it will typically be impossible to obtain reliable estimates of effect variation even if the study is otherwise flawless. That problem should be dealt with under principle (2) by recognizing that subgroups will have very imprecise estimates, leading to enormous uncertainties about who if anyone should be targeted for or excluded from treatment. Unfortunately, the prevalent misunderstandings of classical statistics have trained most researchers, editors, and reviewers to demand statistical significance and certainty as a prerequisite for publication8 and decision making. Applying those demands within subgroups all but guarantees that distorted impressions of patient-specific effects will follow.
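To see why effect variation demands so many more patients, consider a minimal sketch under simplifying assumptions that are not in the original text: a two-arm trial with equal allocation, two equal-sized subgroups, a continuous outcome with a common standard deviation, and precision judged by the standard error. The function names are hypothetical.

```python
from math import sqrt


def se_main_effect(n_total, sd):
    """SE of the average treatment effect: difference in means, n_total / 2 per arm."""
    n_per_arm = n_total / 2
    return sd * sqrt(2 / n_per_arm)


def se_interaction(n_total, sd):
    """SE of the difference between two subgroup-specific treatment effects."""
    n_per_arm_in_subgroup = n_total / 4          # two subgroups, two arms each
    se_subgroup = sd * sqrt(2 / n_per_arm_in_subgroup)
    return sqrt(2) * se_subgroup                 # difference of two independent estimates


n_total, sd = 1000, 1.0                          # illustrative values only
se_ratio = se_interaction(n_total, sd) / se_main_effect(n_total, sd)

# Sample-size multiplier needed to estimate the interaction as precisely
# as the main effect (SE shrinks with the square root of the sample size).
multiplier_same_size = se_ratio**2

# If the interaction is assumed to be half as large as the main effect,
# matching the signal-to-noise ratio requires halving the SE again.
multiplier_half_size = (se_ratio / 0.5) ** 2

print(se_ratio, multiplier_same_size, multiplier_half_size)
```

Under these assumptions the interaction estimate has twice the standard error of the main effect, so matching its precision takes about four times as many patients, and about sixteen times as many if the interaction is half the size of the main effect. Other designs and assumptions will give different multipliers.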

Through neglect of basic education, the statistics profession is partly responsible for these issues, but there is nothing new about medicine’s desire for certainty. That desire is part of human nature and thus existed long before the adoption of statistical methods. The resulting demands for certainty in reported results have provided ample opportunities for overconfident researchers to rise to prominence. Physicians have long capitalized on such opportunities with their sophisticated knowledge of physiology and anatomy, and used their medical authority to argue which treatments worked, shape public policy, and design clinical guidelines, maintaining a form of medicine characterized as “eminence based.”

The landscape changed when advances in quantitative methods eventually reached medicine9, leading to a newfound demand for scientific rigor and making many individuals wary of the claims of subject-matter experts10. These demands may be some of the biggest contributors to medicine’s adoption of statistical methods and its movement towards an evidence-based framework. Unfortunately, in its pursuit of objectivity, medicine was sold snake oil statistics, and as a result its desire for objectivity and rigor backfired.

Figure 2: Snake oil statistics. Methodological interventions that have been oversold by early adopters as cure-alls despite no supporting evidence of utility in a particular application.

To see the reach of these problems, one simply needs to look wherever a new drug has been considered effective only if it has been shown to be statistically significantly better than a control in one or more randomized clinical trials (a standard that applies to FDA new-drug approvals, though not to medical devices, which can be cleared through other pathways). The largest enforcers of these methods have been regulatory agencies that wish to minimize the approval of treatments that do not work (false positives) and minimize adverse events from treatments. One should ask: what specifically convinced these agencies that statistical significance, and in particular NHST, was the best analysis criterion to achieve these goals?

In a nonexistent ideal world, the regulatory agencies looked at several statistical methods, tested each of them in many settings to see how well they identified true benefits and harms while avoiding false conclusions, and from that determined NHST performed best out of all options, with periodic updates as new methods appeared. Of course, history shows something else entirely, even indicating that adoption of NHST was mainly a result of political desperation11, 12. To understand the harsh reality, we must go back to the mid-20th century, when pharmaceutical companies submitted new drug applications to the FDA that often lacked any protocols and statistical analysis plans, making the entire drug approval process chaotic. By the end of the 1960s, the FDA had become desperate to standardize the drug-approval process and make it scientifically rigorous, especially under pressure from the Drug Amendments of 1962, which demanded rigorous evidence for drug approval.

At the same time, applied researchers were looking for rigorous ways to summarize experiments with numerical quantities. Their search struck gold with the works of the prominent statisticians Ronald A. Fisher13–15, Jerzy Neyman16, and Egon Pearson17, who created and popularized powerful statistical tools for researchers who lacked statistical training. Eventually, researchers across many disciplines adopted NHST and the now-infamous 0.05 cutoff. Soon, the FDA followed and incorporated these methods into its regulatory process, without any formal debate about the methods’ utility or evidence12.

Figure 3: Austin Bradford Hill. The statistician and epidemiologist who conducted the first randomized controlled trial in medicine.

It thus appears that the embrace of the NHST paradigm arose from expediently following the ascendant trend in research rather than from a critical evaluation of various options emerging in the same period. Since its adoption into the regulatory process, this framework has been rigorously enforced by the FDA as the gold standard in the approval process.

We find it ironic that the gatekeepers of evidence-based medicine (regulatory agencies, journal editors and reviewers, and medical researchers) continue to insist on enforcement of a statistical framework that was never critically examined for its utility in medicine12. This failure lends credibility to the argument that conventional statistical methods have done more harm than good in medical science. These methods, including null-hypothesis tests and so-called “confidence” intervals, are a set of decision-making tools designed for tightly controlled randomized experiments in which the treatment effects (if any) can be distinguished from all other causal effects, and can be distinguished from random error (noise) by increasing the study size in an affordable manner. Examples of such scenarios arise in agricultural research, where the experimental units are plants or plots, and in industrial quality control, where the experimental unit is a part or product. The tools were originally developed for these environments, in which (compared to clinical studies) experimenters have almost godlike control over the selection of units and their subsequent experiences, aided by having a rather short follow-up period and low cost per experimental unit. Moreover, the decisions to be made are relatively simple and easily monitored, e.g., change a fertilizer formulation or manufacturing tolerance17. Experimental psychology is similar in terms of control and cost, with even lower cost-benefit consequences.

In clinical environments, however, the cost per experimental unit (the patient) is far higher, creating severe limits on study size and thus noise reduction, and the costs and benefits of decisions can be enormous. At the same time, there is far less control of extraneous selection and confounding effects: Physicians and their patients can and do selectively refuse to participate or cease to adhere to assigned treatment protocols, and may drop out for unknown reasons. Meanwhile, direct physical control of the patient environment is extremely limited or nonexistent, especially when follow-up extends beyond hospital stay. And then, amplifying these limits, the required decisions are often complex and of highly uncertain consequence, yet may be pivotal to the experimental results and clinical decisions, e.g., when to withdraw treatment from patients apparently experiencing side effects or when to switch treatments for nonresponsive patients.

These vast differences haven’t stopped medical researchers from using conventional methods as if they were operating in a tightly controlled environment, treating ambiguous trial results as if they supported decisive conclusions despite obvious uncertainties and potentially devastating consequences. The usual depictions of this problem involve researchers trying to “game” statistical methods so that they show “significant” effects18, 19; while such “significance questing” is a real problem, warnings about it usually ignore or dismiss the opposite behavior, in which showing “nonsignificance” will facilitate publication in prestigious medical journals, especially where so-called “replication failure” has become a hot topic3. Adding to these distortions is the publication bias that results when researchers or editors deem results unworthy of submission or acceptance because they fail to report a “discovery” or only replicate what is “known.” This desire for novelty or publicity, along with demands for certainty, has distorted the medical literature with dubious, inflated effects20 and misleading claims of replication failure based on fundamentally ambiguous results21, 22.
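One mechanism behind inflated published effects can be sketched numerically. The following is an illustration under assumptions that are ours rather than the article's: the estimate is normally distributed around a known true effect, only “significant” estimates are reported, and the function name `exaggeration_ratio` is hypothetical. It computes, in closed form, how much a significant estimate overstates the true effect on average, in the spirit of the Type M error of reference 40.

```python
from statistics import NormalDist


def exaggeration_ratio(true_effect, se, alpha=0.05):
    """Return (power, expected |estimate| among significant results / true effect).

    Assumes estimate ~ Normal(true_effect, se) with true_effect > 0 and a
    two-sided test of zero effect at level alpha.
    """
    std = NormalDist()
    crit = std.inv_cdf(1 - alpha / 2) * se      # |estimate| must exceed this
    a = (crit - true_effect) / se
    b = (crit + true_effect) / se
    p_upper = 1 - std.cdf(a)                    # P(estimate > crit)
    p_lower = 1 - std.cdf(b)                    # P(estimate < -crit)
    power = p_upper + p_lower
    # Partial expectations of |estimate| over the two rejection regions,
    # using the standard truncated-normal mean formula.
    upper_part = true_effect * p_upper + se * std.pdf(a)
    lower_part = -true_effect * p_lower + se * std.pdf(b)
    return power, (upper_part + lower_part) / power / true_effect


# Illustrative inputs only: a noisy study and a precise one.
power_low, ratio_low = exaggeration_ratio(true_effect=1.0, se=1.0)
power_high, ratio_high = exaggeration_ratio(true_effect=3.0, se=1.0)

print(f"noisy study:   power = {power_low:.2f}, exaggeration = {ratio_low:.2f}")
print(f"precise study: power = {power_high:.2f}, exaggeration = {ratio_high:.2f}")
```

In this sketch the noisy study rarely reaches significance, and when it does the estimate is on average more than twice the true effect, whereas the precise study shows only mild inflation.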

Responses to the problems

In response to the ongoing abuse of statistical testing, the American Statistical Association released a statement in 2016 cautioning against fixating on P-values and statistical significance23. Three years later, the organization published a special issue of The American Statistician titled “A World Beyond P < 0.05” with 43 commentaries from statistical experts on how to improve statistical inference, with or without P-values24. The issue was accompanied by a highly discussed commentary in Nature titled “Scientists Rise up Against Statistical Significance” that discouraged mindless automation and dichotomization of statistical results21, and was supported by the signatures of more than 800 researchers.

While these calls are impressive and a growing number of journals — including the New England Journal of Medicine and, in modified form, JAMA — have updated their statistical-reporting policies, most regulatory agencies and the bulk of the medical literature continue to fixate on statistical significance5, 25. Correspondingly, we can expect medical researchers to keep cutting corners to achieve or remove statistical significance18, 19 or misinterpret ambiguous results as if definitive, a practice that is often labeled as “spin”26.

We can interpret the persistence of null-hypothesis significance testing (NHST) in two ways. One story is that the value of NHST is recognized by real-world decision makers, despite the carping of ivory-tower critics such as the authors of the present article. The other story is that the counterproductive nature of NHST has been denied by the medical establishment5, 25, and that better alternatives are available12. There are legitimate practical concerns behind both these perspectives. On one hand, active researchers and regulators have legitimate concerns about working on or approving treatments that do not work. On the other hand, many well-publicized examples have made it clear that NHST can easily lead to overconfidence and erroneous inferences, as seen in discussions that treat statistical significance as demonstrating presence of effects and nonsignificance as demonstrating absence of effects.

Methodologists propose new methods to address NHST in their field

Fear of false positives has dominated most discussions of scientific rigor18, 27, 28, often relegating false negatives to more technical discussions of power. Although false positives can be costly, so can missing clinically meaningful effects21. Take postmarketing surveillance for pharmaceuticals: in such situations, serious adverse events from drugs are often underreported and have in some cases been actively suppressed [Vandenbroucke2008?, Doshi2012?], as seen in Cochrane reviews of unpublished trial data. The result is a scarcity of data and studies that will never be able to show a “significant” effect because of the lack of resources and time. If a study comparing adverse events between those taking an approved drug and some control group is unable to show statistical significance, even when the estimated effect is important and plausible, the results will typically be mistaken for and used as evidence of absence21, 29, and prescribing is unlikely to be curtailed, leading to continued harm.
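The gap between “nonsignificant” and “no effect” is easy to see in a worked example. The sketch below uses made-up adverse-event counts, a hypothetical helper named `risk_ratio_summary`, and the usual large-sample normal approximation on the log risk ratio, which is itself rough with counts this small.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def risk_ratio_summary(events_a, n_a, events_b, n_b, level=0.95):
    """Risk ratio with an approximate interval and P-value (log-scale normal approximation)."""
    rr = (events_a / n_a) / (events_b / n_b)
    se_log = sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = exp(log(rr) - z_crit * se_log)
    upper = exp(log(rr) + z_crit * se_log)
    p_value = 2 * (1 - NormalDist().cdf(abs(log(rr)) / se_log))
    return rr, lower, upper, p_value


# Hypothetical surveillance data: 6 serious adverse events among 200 patients
# on the drug versus 2 among 200 controls.
rr, lower, upper, p_value = risk_ratio_summary(6, 200, 2, 200)

print(f"risk ratio = {rr:.1f}, 95% interval {lower:.2f} to {upper:.2f}, P = {p_value:.2f}")
```

With these invented counts the P-value exceeds 0.05, yet the point estimate is a tripling of risk and the interval stretches from a modest reduction to more than a tenfold increase. Reading such a result as evidence of safety would be exactly the confusion described above.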

The confusion of statistical nonsignificance with evidence of absence will remain a problem in fields where effects are often small and yet studies powered to detect them are infeasible. Many studies in surgical science, for example, have no more than a dozen participants. Again, this makes it incredibly difficult to achieve statistical significance, even when clinically important effects are likely. In such fields, adopting the NHST framework for analysis all but guarantees failure to detect those effects. As a result, researchers have looked to other methods (including Bayes factors, second-generation P-values, and post-hoc power calculations) to circumvent the bad hand that was dealt to them30–32. Unfortunately, such transitions are not always helpful: while reasonably defensible alternatives such as interval testing and posterior intervals do exist, several of the most-publicized proposals introduce their own errors, and some may actually result in more errors than before.

For example, a group of surgeons recently published several statistical recommendations on what to do if a surgical study result was nonsignificant33–35, providing hope to researchers who have had their inquiries halted by that result. But the recommendations are statistically invalid and clinically misleading, and recommendations of this kind have been discouraged by statisticians for nearly two decades36, 37. In a similar tale, two sports scientists published a statistical method38 which was supposed to improve the classification of individual treatment responses by reducing the influence of random error in small studies. Unfortunately, reducing the influence of random error in a valid manner requires improvement of study design, including an increase in study size, so unsurprisingly the proposed method has unacceptably poor statistical properties39.
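The surgical recommendations cited above centre on post hoc power (see the titles of references 33 to 35). A minimal sketch of why statisticians reject it, assuming a two-sided z-test and the common practice of plugging the observed estimate in as the true effect: “observed power” is then just a transformation of the P-value and adds no information. The function name `observed_power` is hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()


def observed_power(p_value, alpha=0.05):
    """Post hoc power of a two-sided z-test, treating the observed estimate as the true effect."""
    z_obs = STD.inv_cdf(1 - p_value / 2)        # |z| implied by the P-value
    z_crit = STD.inv_cdf(1 - alpha / 2)
    # Probability that a replicate z-statistic centred at z_obs lands in
    # either rejection region.
    return (1 - STD.cdf(z_crit - z_obs)) + STD.cdf(-z_crit - z_obs)


p_values = [0.05, 0.10, 0.20, 0.50, 0.80]
powers = [observed_power(p) for p in p_values]

for p, power in zip(p_values, powers):
    print(f"P = {p:.2f}  ->  observed power = {power:.3f}")
```

Every nonsignificant P-value maps to an observed power at or below roughly one half, so a “negative” study will always look “underpowered” by this measure, whatever the true state of affairs.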

The efforts behind new methods and recommendations are a response to an arbitrary dichotomy that has been imposed upon researchers by the medical and scientific establishment: decide whether a result is “positive” or “negative”, with no allowance for what is usually the most reasonable interpretation, namely ambiguous or indecisive. Unfortunately, conventional statistics has aggravated this problem by offering methods like NHST, which produce only dichotomous answers. These methods swept the research sciences in the mid-20th century and are now firmly rooted in tradition, as if that tradition were the best we can do. But it isn’t.

The problem with upending this tradition, however, is that there is no consensus about what to replace it with, a problem that only grows as alternatives continue to proliferate. An idealistic (and we think naïve) view would simply allow authors to choose alternatives of their liking. The problem with this anarchic approach is that many of the alternatives are themselves misleading and even defective, or at best inferior to other methods in demonstrable ways. Yet seeing these problems requires not only sufficient technical expertise to evaluate methods, but also a willingness to see them, a willingness that should not be assumed for originators and adopters of the method. Unfortunately, mere publication of a method in a peer-reviewed journal is a faulty indicator of method validity. That is especially so if publication is not in a statistics journal, for in that case it means only that peer review was by referees who may have had little of the special technical expertise needed for thorough evaluation.

Solutions

There are no simple solutions or universal guidelines that medical researchers can always use to improve scientific rigor within their area of work. We nonetheless offer some recommendations and resources that we believe may be useful to those who recognize problems in their field and wish for some sort of guidance:

  1. Collaborate with well-qualified statisticians to design and conduct studies that are rigorous, efficient, and cost effective40, 41. Look to these statistical collaborators for guidance about honestly reporting uncertainty, not certainty19.
  2. If possible, pool resources to run larger studies that are likely to be more precise and informative than individual studies, which may simply waste resources and offer little yield42. However, more data is not synonymous with more information: Increasing study size may be detrimental if it entails a reduction in data quality43.
  3. Aim to be more descriptive and less inferential22. Accept uncertainty and the anxiety that comes with it4, along with the idea that no one study warrants conclusions about the true nature of a phenomenon; the belief that a single study can do so is a delusion, and it does not hold even in particle physics! And if a phenomenon is subtle, even several studies may be insufficient for valid conclusions beyond “more research is needed.”
  4. When making real world decisions, use all information available to you, balancing costs, benefits, and uncertainties44, rather than basing decisions on whether a single numerical value is above or below an arbitrary cutoff.
  5. Be mindful throughout the study of the many cognitive biases that may distort your conclusions and that will afflict you, your colleagues, and your collaborators3, 45, 46.
  6. Use statistical methods that have been reviewed and validated by the statistical community beyond their developers and promoters. That validation is provided not only by publication in statistical journals, but also by applications that can be judged as having reached sound, well-cautioned conclusions in context. Especially, beware of any method that (like NHST) claims to offer firm conclusions based on purely numeric comparisons.

References

1. Boring EG. (1919). “Mathematical vs. Scientific significance.” Psychological Bulletin. 16:335–338. doi: 10.1037/h0074554.
2. Sackett DL, Rosenberg WM, Gray JA, Haynes RB, Richardson WS. (1996). “Evidence based medicine: What it is and what it isn’t.” BMJ (Clinical research ed). 312:71–72. doi: 10.1136/bmj.312.7023.71.
3. Greenland S. (2017). “Invited commentary: The need for cognitive science in methodology.” American Journal of Epidemiology. 186:639–645. doi: 10.1093/aje/kwx259.
4. McShane BB, Gal D, Gelman A, Robert C, Tackett JL. (2019). “Abandon Statistical Significance.” The American Statistician. 73:235–245. doi: 10.1080/00031305.2018.1527253.
5. Bauchner H, Golub RM, Fontanarosa PB. (2019). “Reporting and Interpretation of Randomized Clinical Trials.” Journal of the American Medical Association. 322:732–735. doi: 10.1001/jama.2019.12056.
6. Senn S. (2018). “Statistical pitfalls of personalized medicine.” Nature. 563:619–621. doi: 10.1038/d41586-018-07535-2.
7. Greenland S. (1983). “Tests for interaction in epidemiologic studies: A review and a study of power.” Statistics in Medicine. 2:243–251. doi: 10.1002/sim.4780020219.
8. Rosenthal R. (1979). “The file drawer problem and tolerance for null results.” Psychological Bulletin. 86:638. doi: 10.1037/0033-2909.86.3.638.
9. Medical Research Council. (1948). “Streptomycin Treatment of Pulmonary Tuberculosis.” British Medical Journal. 2:769–782.
10. Goodman SN. (2019). “Why is Getting Rid of P-Values So Hard? Musings on Science and Statistics.” The American Statistician. 73:26–30. doi: 10.1080/00031305.2018.1558111.
11. Kennedy-Shaffer L. (2017). “When the Alpha is the Omega: P-Values, ‘Substantial Evidence,’ and the 0.05 Standard at FDA.” Food Drug Law J. 72:595–635.
12. Ruberg SJ, Harrell FE Jr, Gamalo-Siebers M, LaVange L, Lee JJ, Price K, et al. (2019). “Inference and Decision Making for 21st-Century Drug Development and Approval.” The American Statistician. 73:319–327. doi: 10.1080/00031305.2019.1566091.
13. Fisher R. (1955). “Statistical Methods and Scientific Induction.” Journal of the Royal Statistical Society Series B (Methodological). 17:69–78. doi: 10.1111/j.2517-6161.1955.tb00180.x.
14. Fisher RA. (1925). “Statistical Methods for Research Workers.” Edinburgh: Oliver and Boyd.
15. Fisher RA. (1935). “The Design of Experiments.” Edinburgh: Oliver and Boyd.
16. Neyman J, Pearson ES. (1933). “On the Problem of the Most Efficient Tests of Statistical Hypotheses.” Philosophical Transactions of the Royal Society of London Series A, Containing Papers of a Mathematical or Physical Character. 231:289–337. doi: 10.1098/rsta.1933.0009.
17. Pearson ES. (1933). “A Survey of the Uses of Statistical Method in the Control and Standardization of the Quality of Manufactured Products.” Journal of the Royal Statistical Society. 96:21–75. doi: 10.2307/2341869.
18. Simmons JP, Nelson LD, Simonsohn U. (2011). “False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant.” Psychological Science. 22:1359–1366. doi: 10.1177/0956797611417632.
19. Wang MQ, Yan AF, Katz RV. (2018). “Researcher requests for inappropriate analysis and reporting: A U.S. Survey of consulting biostatisticians.” Annals of Internal Medicine. 169:554. doi: 10.7326/M18-1230.
20. Dickersin K, Berlin JA. (1992). “Meta-analysis: State-of-the-Science.” Epidemiologic Reviews. 14:154–176. doi: 10.1093/oxfordjournals.epirev.a036084.
21. Amrhein V, Greenland S, McShane B. (2019). “Scientists rise up against statistical significance.” Nature. 567:305. doi: 10.1038/d41586-019-00857-9.
22. Amrhein V, Trafimow D, Greenland S. (2019). “Inferential statistics as descriptive statistics: There is no replication crisis if we don’t expect replication.” The American Statistician. 73:262–270. doi: 10.1080/00031305.2018.1543137.
23. Wasserstein RL, Lazar NA. (2016). “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician. 70:129–133. doi: 10.1080/00031305.2016.1154108.
24. Wasserstein RL, Schirm AL, Lazar NA. (2019). “Moving to a world beyond ‘p < 0.05’.” The American Statistician. 73:1–19. doi: 10.1080/00031305.2019.1583913.
25. Harrington D, D’Agostino RB, Gatsonis C, Hogan JW, Hunter DJ, Normand S-LT, et al. (2019). “New guidelines for statistical reporting in the journal.” New England Journal of Medicine. 381:285–286. doi: 10.1056/NEJMe1906559.
26. Khan MS, Lateef N, Siddiqi TJ, Rehman KA, Alnaimat S, Khan SU, et al. (2019). “Level and Prevalence of Spin in Published Cardiovascular Randomized Clinical Trial Reports With Statistically Nonsignificant Primary Outcomes: A Systematic Review.” JAMA Network Open. 2:e192622–e192622. doi: 10.1001/jamanetworkopen.2019.2622.
27. Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers E, Berk R, et al. (2017). “Redefine statistical significance.” Nature Human Behaviour. 2:6–10. doi: 10.1038/s41562-017-0189-z.
28. Ioannidis JPA. (2018). “The Proposal to Lower P Value Thresholds to .005.” JAMA. 319:1429–1430. doi: 10.1001/jama.2018.1536.
29. Altman DG, Bland JM. (1995). “Absence of evidence is not evidence of absence.” BMJ (Clinical research ed). 311:485. doi: 10.1136/bmj.311.7003.485.
30. Gigerenzer G, Marewski JN. (2015). “Surrogate Science: The Idol of a Universal Method for Scientific Inference.” Journal of Management. 41:421–440. doi: 10.1177/0149206314547522.
31. van Ravenzwaaij D, Monden R, Tendeiro JN, Ioannidis JPA. (2019). “Bayes factors for superiority, non-inferiority, and equivalence designs.” BMC Medical Research Methodology. 19:71. doi: 10.1186/s12874-019-0699-7.
32. Simonsohn U, Nelson LD, Simmons JP. (2014). “P-Curve and Effect Size: Correcting for Publication Bias Using Only Significant Results.” Perspectives on Psychological Science. 9:666–681. doi: 10.1177/1745691614553988.
33. Bababekov YJ, Stapleton SM, Mueller JL, Fong ZV, Chang DC. (2018). “A Proposal to Mitigate the Consequences of Type 2 Error in Surgical Science.” Annals of Surgery. 267:621. doi: 10.1097/SLA.0000000000002547.
34. Bababekov YJ, Hung Y-C, Hsu Y-T, Udelsman BV, Mueller JL, Lin H-Y, et al. (2019). “Is the Power Threshold of 0.8 Applicable to Surgical Science? The Underpowered Study.” Journal of Surgical Research. 241:235–239. doi: 10.1016/j.jss.2019.03.062.
35. Bababekov YJ, Chang DC. (2019). “Post Hoc Power: A Surgeon’s First Assistant in Interpreting Negative Studies.” Annals of Surgery. doi: 10.1097/SLA.0000000000002914.
36. Greenland S. (2012). “Nonsignificance plus high power does not imply support for the null over the alternative.” Annals of Epidemiology. 22:364–368. doi: 10.1016/j.annepidem.2012.02.007.
37. Hoenig JM, Heisey DM. (2001). “The Abuse of Power.” The American Statistician. 55:19–24. doi: 10.1198/000313001300339897.
38. Dankel SJ, Loenneke JP. (2019). “A Method to Stop Analyzing Random Error and Start Analyzing Differential Responders to Exercise.” Sports Medicine. doi: 10.1007/s40279-019-01147-0.
39. Tenan M, Vigotsky AD, Caldwell AR. (2019). “On the Statistical Properties of the Dankel-Loenneke Method.” SportRxiv. doi: 10.31236/osf.io/8ndhg.
40. Gelman A, Carlin J. (2014). “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.” Perspectives on Psychological Science. 9:641–651. doi: 10.1177/1745691614551642.
41. Rothman KJ, Greenland S. (2018). “Planning study size based on precision rather than power.” Epidemiology. 29:599–603. doi: 10.1097/EDE.0000000000000876.
42. Moshontz H, Campbell L, Ebersole CR, IJzerman H, Urry HL, Forscher PS, et al. (2018). “The Psychological Science Accelerator: Advancing Psychology Through a Distributed Collaborative Network.” Advances in Methods and Practices in Psychological Science. doi: 10.1177/2515245918797607.
43. Cox DR, Donnelly CA. (2011). “Principles of Applied Statistics.” Cambridge University Press.
44. Parmigiani G, Inoue L. (2009). “Decision Theory: Principles and Approaches.” John Wiley & Sons.
45. Gigerenzer G. (2004). “Mindless statistics.” The Journal of Socio-Economics. 33:587–606. doi: 10.1016/j.socec.2004.09.033.
46. Stark PB, Saltelli A. (2018). “Cargo-cult statistics and scientific crisis.” Significance. 15:40–43. doi: 10.1111/j.1740-9713.2018.01174.x.

Citation

For attribution, please cite this work as:
1. Rafi Z, Gelman A, Reito A, Greenland S. (2020). ‘Medicine Is Being Treated with Snake-Oil Statistics’. https://lesslikely.com/posts/statistics/snakeoilstats.html.
