How Useful Is Nutritional Epidemiology?

Nutritional epidemiological studies often generate the most buzz, but they’re also the ones that get criticized most harshly. Some folks will even go out of their way to say that the entire field produces findings that are mostly useless.

Here’s what one of the leading meta researchers has to say about nutritional epidemiology:

“Nutritional Epidemiology is a scandal. It should just go to the waste bin.” - John Ioannidis

This is clearly a very extreme statement. It’s important to note that this isn’t an argument against epidemiology as a field, but against nutritional epidemiology specifically. It would be hard to argue that epidemiology produces useless findings. Remember, epidemiological evidence, along with evidence from multiple other lines, helped us establish causal relationships between smoking and lung cancer, LDL and heart disease, and Zika and birth defects.

And those who aren’t delusional know very well that randomized controlled trials (RCTs) cannot answer all of our questions. So, yes, epidemiology is useful, but back to the topic at hand. Critics of nutritional epidemiology mostly claim that nutritional epidemiology isn’t very useful because effects are:

And I would have to mostly agree with the critics. I wouldn’t go so far as to say anything as extreme as Ioannidis, but we should probably be skeptical of a lot of the findings that have come out of this field, and we should avoid getting influenced by the media hype that accompanies these studies.

Analytical Flexibility

Why? Well, here’s what some of the evidence shows. In one meta-study, titled “Assessment of vibration of effects due to model specification can demonstrate the instability of observational associations”, John Ioannidis showed that by using different combinations of a certain number of covariates, you could virtually make effects go in either direction (type S errors) or change their magnitude (type M errors).1 These are the variables that researchers typically adjust for in multiple regression models.

Ioannidis downloaded 13 variables from the NHANES dataset that were linked to all-cause mortality, and that had a substantial number of participants associated with each variable (at least 1000 participants and 100 deaths). From those 13 variables, he was able to produce 8,192 different statistical models that all resulted in different hazard ratios (HR), as seen in the image below. The variables included age, smoking, BMI, hypertension, diabetes, cholesterol, alcohol consumption, education, income, sex, family history of heart disease, heart disease, and any cancer.
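Where does 8,192 come from? Each of the 13 covariates can be either in or out of the model, which gives 2^13 possible adjustment sets. Here is a minimal sketch that simply enumerates them. The function name is my own and this is not code from the paper.

```python
from itertools import combinations

# The 13 adjustment variables listed in the text.
COVARIATES = [
    "age", "smoking", "BMI", "hypertension", "diabetes", "cholesterol",
    "alcohol consumption", "education", "income", "sex",
    "family history of heart disease", "heart disease", "any cancer",
]

def all_adjustment_sets(covariates):
    """Yield every subset of covariates, from the empty set to the full set."""
    for size in range(len(covariates) + 1):
        yield from combinations(covariates, size)

n_models = sum(1 for _ in all_adjustment_sets(COVARIATES))
print(n_models)  # 8192, i.e. 2 ** 13
```

Every extra candidate covariate doubles the count, which is why the number of defensible models gets out of hand so quickly.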

The vibration of effects, depicting how much analytical flexibility there is with covariate choice in model building and selection

For one particular relationship between vitamin D and all-cause mortality, Ioannidis reported that with no adjustment for covariates, vitamin D had an impressive HR of 0.64. Wow. A 36% decrease. However, when all 13 covariates are included in the model, the HR increases to 0.75. Maybe some researcher in California decided to include five variables in a model, while another researcher in New York decided to include 10.

The fact that more than 8,000 statistical models are possible with just 13 variables, and far more with additional covariates, is a significant concern when considering the problem of multiple comparisons. Of course, in areas like epidemiology where there are no random mechanisms, it largely seems inappropriate to test statistical hypotheses and far more appropriate to focus on estimation and interval estimates. However, many epidemiologists still discuss whether their results are statistically significant at a particular alpha level, which to me seems misguided.

Ioannidis and colleagues recommended fitting the models for all possible combinations of covariates and reporting the median across them, rather than selectively reporting a few models. They also recommended noting whether there was something called a “Janus effect”, where the effect estimate goes in both directions depending on the adjustments.

In the last pattern, as exemplified by α-tocopherol, the estimated HRs can be both greater and less than the null value (HR > 1 and HR ≤ 1) depending on what adjustments were made. We call this the Janus effect after the two-headed representation of the ancient Roman god. For α-tocopherol, most of the HR and p-values were concentrated around 1 and non-significance, respectively… The Janus effect is common: 131 (31%) of the 417 variables had their 99th percentile HR>1 and their 1st percentile HR<1.
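To make the median-and-percentiles idea concrete, here is a rough sketch that summarizes a collection of hazard ratios, one per model, and flags a Janus effect using the 1st and 99th percentile rule from the quote. The hazard ratios are simulated placeholders, not NHANES results, and the function name is hypothetical.

```python
import math
import random
from statistics import median, quantiles

def summarize_vibration(hazard_ratios):
    """Median HR, 1st and 99th percentile HR, and a Janus-effect flag."""
    cuts = quantiles(hazard_ratios, n=100)  # 99 cut points; cuts[0] is the 1st percentile
    p1, p99 = cuts[0], cuts[-1]
    return {
        "median": median(hazard_ratios),
        "p1": p1,
        "p99": p99,
        # Janus effect: estimates fall on both sides of the null value HR = 1.
        "janus": p1 < 1 < p99,
    }

rng = random.Random(1)
# Hypothetical HRs from 8,192 models, scattered around the null.
near_null = [math.exp(rng.gauss(0.0, 0.05)) for _ in range(8192)]
# Hypothetical HRs from 8,192 models, consistently protective.
protective = [math.exp(rng.gauss(math.log(0.7), 0.03)) for _ in range(8192)]

print(summarize_vibration(near_null))
print(summarize_vibration(protective))
```

In a real analysis the list would hold the HR from each fitted model, one per covariate combination, instead of simulated draws.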

However, statistician and epidemiologist Sander Greenland has been a critic of meta-research and epidemiological studies that lack biochemical sophistication and lump several compounds together as if they were the same and behaved the same in the body.

“For nutrition, the lack of biochemistry sophistication among the trial designers leads to a lot of dubious and noncomparable studies, while the meta-analysts and reviewers do a lot of distortive lumping, e.g., talking about “vitamin E” as if that were a single entity - a recent review by prominent authors didn’t even notice that almost all trials used the racemic synthetic mixture, dl-alpha-tocopheryl, (which they misidentify with “alpha tocopherol”) with vastly different and unjustified dosages, and which hardly resembles the eight or so natural d-tocopherols or d-tocotrienols that account for dietary intake.”1

Prior to this study, statistician Stan Young also criticized the analytical flexibility in the journal Significance.2

For example, consider the use of linear regression to adjust the risk levels of two treatments to the same background level of risk. There can be many covariates, and each set of covariates can be in or out of the model. With ten covariates, there are over 1000 possible models. Consider a maze as a metaphor for modelling (Figure 3).

Analytical flexibility depicted by attempting to get through a maze
The maze of forking paths.

The red line traces the correct path out of the maze. The path through the maze looks simple, once it is known. Returning to a linear regression model, terms can be put into and taken out of a regression model. Once you get a p‐value smaller than 0.05, the model can be frozen and the model selection justified after the fact. It is easy to justify each turn.

In another meta-study, Ioannidis showed that several foods had large associations with cancer as shown below.

Many nutritional epidemiological findings showing that all foods are associated with an increased or decreased risk of cancer

However, when several of the studies were pooled in meta-analyses, these large effects shrunk as shown below.

Meta-analyses of nutrition studies shrink the large effects

Which makes sense considering that as you pool more and more models together, you’re likely to get closer to the real effect and you reduce the impact of large effects that passed the significance threshold. So, here, we again see the problem of multiplicity and publication bias in the field.
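A quick simulation shows why this happens. The sketch below assumes a small true effect (HR = 0.95) and many equally noisy studies, then compares the average of only the “significant” estimates with the pooled estimate from all of them. All numbers are made up for illustration.

```python
import math
import random

def simulate_studies(true_log_hr, se, n_studies, seed=0):
    """Draw one log-HR estimate per study around the true value."""
    rng = random.Random(seed)
    return [rng.gauss(true_log_hr, se) for _ in range(n_studies)]

TRUE_LOG_HR = math.log(0.95)  # hypothetical small protective effect
SE = 0.10                     # hypothetical standard error, same in every study

estimates = simulate_studies(TRUE_LOG_HR, SE, n_studies=5000)

# Studies that would clear the usual two-sided 0.05 threshold.
significant = [b for b in estimates if abs(b / SE) > 1.96]

# With equal standard errors, the inverse-variance pooled estimate is the plain mean.
pooled_all = sum(estimates) / len(estimates)
mean_abs_significant = sum(abs(b) for b in significant) / len(significant)
wrong_sign_share = sum(b > 0 for b in significant) / len(significant)

print(f"true HR:                       {math.exp(TRUE_LOG_HR):.2f}")
print(f"pooled HR, all studies:        {math.exp(pooled_all):.2f}")
print(f"typical |log HR|, significant: {mean_abs_significant:.2f}")
print(f"significant with wrong sign:   {wrong_sign_share:.1%}")
```

Any estimate that clears the threshold here has to be at least 1.96 standard errors from zero, so the significant subset exaggerates the effect by construction (a type M error), and some of it points the wrong way (a type S error).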

Luckily, the constant push to preregister all data-analysis protocols may help attenuate the problem of selective reporting. Furthermore, Ioannidis and several other statisticians explain that rather than focus on model selection, it would be ideal to report all statistical models and look for the median and mean effect sizes/p-values from all the included models. This would yield far more useful information than the results from a few associations.

Measurement Error

One of the most concerning aspects of nutritional epidemiology is the reliance on food frequency questionnaires and other memory-based questionnaires. The data collected from these studies often suffer from a number of serious errors, such as classical measurement error, as modeled here \(W_{ij}=X_{i}+ \epsilon_{ij}\), where \(W_{ij}\) is the \(j\)th reported intake for person \(i\), \(X_{i}\) is their true intake, and \(\epsilon_{ij}\) is random error.

Some researchers have proposed using methods like regression calibration, moment reconstruction, multiple imputation, and graphical methods to help address many of the errors that plague such data.3 However, others are skeptical and believe that such data are beyond saving.4
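To see what classical measurement error does, here is a small simulation under the model \(W_{ij}=X_{i}+ \epsilon_{ij}\). It assumes a linear outcome, two replicate measurements per person, and made-up parameter values. The correction shown (dividing the naive slope by an estimated reliability ratio) is the simplest single-covariate linear case of regression calibration, not the full toolkit described by Keogh and White.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

def slope(y, x):
    """Least-squares slope from regressing y on x."""
    return cov(x, y) / cov(x, x)

rng = random.Random(42)
n, beta, sd_eps = 20000, 0.5, 1.0  # hypothetical values

x = [rng.gauss(0, 1) for _ in range(n)]           # true intake X_i (unobserved)
w1 = [xi + rng.gauss(0, sd_eps) for xi in x]      # first report  W_i1
w2 = [xi + rng.gauss(0, sd_eps) for xi in x]      # second report W_i2
y = [beta * xi + rng.gauss(0, 1) for xi in x]     # outcome depends on true intake

naive = slope(y, w1)                       # attenuated toward zero
reliability = cov(w1, w2) / cov(w1, w1)    # estimates var(X) / var(W)
corrected = naive / reliability

print(f"true slope:      {beta:.2f}")
print(f"naive slope:     {naive:.2f}")
print(f"reliability:     {reliability:.2f}")
print(f"corrected slope: {corrected:.2f}")
```

With the error variance equal to the variance of true intake, the naive slope comes out at roughly half the true slope. The correction only works because the errors here are independent across the two reports, which is exactly the kind of assumption the skeptics doubt holds for real dietary recall.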

Archer and colleagues argue the following:

  1. The use of memory-based methods is founded upon two inter-related logical fallacies: a category error and reification
  2. Human memory and recall are not valid instruments for scientific data collection
  3. The measurement errors of memory-based dietary assessment methods are neither quantifiable nor falsifiable; this renders these methods and data pseudoscientific
  4. The post hoc process of pseudoquantification is impermissible and invalid
  5. Memory-based dietary data were repeatedly demonstrated to be physiologically implausible (i.e., meaningless numbers)
  6. The failure to cite or acknowledge contrary evidence and empirical refutations contributed to a fictional discourse on diet-disease relations

Our conclusion is that nutrition epidemiology is a degenerating research paradigm in which the use of expedient but nonfalsifiable anecdotal evidence impeded scientific progress and engendered a fictional discourse on diet-health relations. The continued funding and use of memory-based methods such as Food Frequency Questionnaires and 24-hour dietary interviews is anathema to evidence-based public policy.

Thus, our recommendation is simply that the use of memory-based methods must stop and all previous articles presenting memory-based methods data and conclusions must be corrected to include contrary evidence and address the consequences of a half century of pseudoscientific claims.

These are clearly harsh criticisms. However, many other researchers in the field believe that Archer et al. overstate the evidence against FFQ data.5 It seems more plausible to me that FFQ data have some value and that we should attempt to extract as much signal as we can from the noise. With that perspective, Keogh et al.’s3 recommendations are both practical and valuable.

Poor Concordance with Randomized Trials

Another common criticism is that the findings of nutritional epidemiological studies are rarely corroborated by randomized trials. For example, Stan Young writes in his 2011 Significance paper:

We ourselves carried out an informal but comprehensive accounting of 12 randomised clinical trials that tested observational claims – see Table 1. The 12 clinical trials tested 52 observational claims. They all confirmed no claims in the direction of the observational claims. We repeat that figure: 0 out of 52. To put it another way, 100% of the observational claims failed to replicate.

In fact, five claims (9.6%) are statistically significant in the clinical trials in the opposite direction to the observational claim. To us, a false discovery rate of over 80% is potent evidence that the observational study process is not in control. The problem, which has been recognised at least since 1988, is systemic.

My problem with this assessment is twofold:

  1. The randomized trials are nowhere near as large as the observational studies, so it is quite possible that they were underpowered in the first place! Thus, focusing on statistical significance makes even less sense.
  2. In observational studies, there often is no random mechanism such as random assignment or random sampling, so putting so much weight on a decision-theoretic device like statistical significance is rarely justified. Greenland summarized this with clarity in his 1990 paper published in Epidemiology.6
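To put a rough number on the first point, here is a sketch of approximate power for detecting a given hazard ratio as a function of the number of events. It relies on the common large-sample approximation that, with 1:1 allocation, the standard error of the log HR is about 2 divided by the square root of the number of events. The hazard ratio and event counts are hypothetical, not taken from the trials Young reviewed.

```python
from math import log, sqrt
from statistics import NormalDist

def approx_power(hazard_ratio, n_events, alpha=0.05):
    """Approximate two-sided power to detect hazard_ratio with n_events events.

    Assumes 1:1 allocation and SE(log HR) ~ 2 / sqrt(n_events).
    """
    z = NormalDist()
    se = 2 / sqrt(n_events)
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(log(hazard_ratio)) / se
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Hypothetical: a modest true effect (HR = 0.9) at different numbers of events.
for events in (200, 1000, 4000):
    print(events, round(approx_power(0.9, events), 2))
```

Under these assumptions, a trial with a few hundred events has little chance of reaching significance for an HR of 0.9, so a “failed” replication by that standard says very little on its own.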

Randomization provides the key link between inferential statistics and causal parameters. Inferential statistics, such as P values, confidence intervals, and likelihood ratios, have very limited meaning in causal analysis when the mechanism of exposure assignment is largely unknown or is known to be nonrandom. It is my impression that such statistics are often given a weight of authority appropriate only in randomized studies

I’m reasonably confident that Greenland has since changed his mind on the appropriate weight of authority such statistics should be given, even in randomized studies, especially given the recent Nature commentary he published on the problems with statistical significance and dichotomous thinking.7

In the same paper, Greenland also points out the problems with several probabilistic arguments that are used to justify classical statistics in areas where the exposure is unknown or unlikely to be random. He proposes some possible solutions:

In causal analysis of observational data, valid use of inferential statistics as measures of compatibility, conflict, or support depends crucially on randomization assumptions about which we are at best agnostic and more usually doubtful. Among the possible remedies are:

  • Restrain our interpretation of classical statistics by explicating and criticizing any randomization assumptions that are necessary for probabilistic interpretations;
  • train our students and retrain ourselves to focus on nonprobabilistic interpretations of inferential statistics;
  • deemphasize inferential statistics in favor of pure data descriptors, such as graphs and tables;
  • expand our analytic repertoire to include more elaborate techniques that depend on assumptions in the “agnostic” rather than the “doubtful” realm, and subject the results of these techniques to influence and sensitivity analysis.

These are neither mutually exclusive nor exhaustive possibilities, but I think any one of them would constitute an improvement over much of what we have done in the past.

As we can see, there are plenty of issues with nutritional epidemiology as it’s often done. But it is certainly not doomed! Many of these issues can be addressed by improving measurements and being more transparent with the analyses.

It’s also important to remember that even though small effects may not seem relevant, especially on a personal level, they are worth chasing. If they happen to be real and policy changes are based on them, the outcomes could be large at the population level. But as my colleague Kevin points out below, jumping on the results of every nutritional epidemiological study as if it were groundbreaking is simply delusional.


Also, I did write about a few nutritional epidemiological studies:

That’s all for today!


1. Patel CJ, Burford B, Ioannidis JPA. Assessment of vibration of effects due to model specification can demonstrate the instability of observational associations. J Clin Epidemiol. 2015;68(9):1046-1058. doi:10.1016/j.jclinepi.2015.05.029

2. Young SS, Karr A. Deming, data and observational studies. Significance. 2011;8(3):116-120. doi:10.1111/j.1740-9713.2011.00506.x

3. Keogh RH, White IR. A toolkit for measurement error correction, with a focus on nutritional epidemiology. Statistics in Medicine. 2014;33(12):2137-2155. doi:10.1002/sim.6095

4. Archer E, Marlow ML, Lavie CJ. Controversy and debate: Memory-Based Methods Paper 1: The fatal flaws of food frequency questionnaires and other memory-based dietary assessment methods. Journal of Clinical Epidemiology. 2018;104:113-124. doi:10.1016/j.jclinepi.2018.08.003

5. Subar AF, Freedman LS, Tooze JA, et al. Addressing Current Criticism Regarding the Value of Self-Report Dietary Data. The Journal of Nutrition. 2015;145(12):2639-2645. doi:10.3945/jn.115.219634

6. Greenland S. Randomization, statistics, and causal inference. Epidemiology. 1990;1(6):421-429.

7. Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019;567(7748):305. doi:10.1038/d41586-019-00857-9
