
Datasets

Sample datasets for teaching and reproducible research

Author

Zad Rafi

A curated collection of datasets for teaching statistical concepts, demonstrating analytical methods, and practicing reproducible research. All datasets are available for download and use in educational settings.


Dataset Summaries

Dataset | Description | Observations | Download
PUFA Meta-Analysis | Impact of increased PUFA consumption as replacement for SFA on CHD events (Mozaffarian et al., 2010) | 6 studies | CSV
Beta Blocker RCT | Early randomized trial examining mortality after beta blocker treatment | 44 patients | CSV
Polyp Prevention Trial | Randomized trial to reduce colorectal polyps | 20 patients | CSV
Crossover BP Medication Trial | Crossover trial comparing blood pressure medications | 2,000 patients | CSV
Acupuncture Headache Trial | RCT of acupuncture for chronic headache (Vickers et al., 2004) | 401 patients | CSV
International Stroke Trial | Large prospective RCT testing aspirin and heparin effectiveness (1991-1996) | 19,435 patients | CSV

PUFA Meta-Analysis

Description

This dataset examines the impact of increased polyunsaturated fatty acid (PUFA) consumption as a replacement for saturated fatty acids (SFA) on coronary heart disease (CHD) events. Based on Mozaffarian et al. (2010).

Data Structure

6 observations of 5 variables:

  • study - Study identifier
  • chd_pufa - CHD events in PUFA group
  • total_pufa - Total participants in PUFA group
  • chd_control - CHD events in control group
  • total_control - Total participants in control group

Sample Data

study | chd_pufa | total_pufa | chd_control | total_control
Study A | 12 | 846 | 25 | 852
Study B | 8 | 221 | 11 | 229
Study C | 51 | 2033 | 63 | 2036

Download PUFA dataset

Use cases: Meta-analysis, risk ratios, forest plots, heterogeneity assessment
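
As a rough illustration of the risk-ratio and pooling use cases, the sketch below computes a risk ratio with a large-sample 95% interval for each of the three sample rows shown above, then a fixed-effect (inverse-variance) pooled estimate. The helper name `log_risk_ratio` is hypothetical, and the pooled value covers only the three sample rows, so it should not be read as the result of the published meta-analysis.

```python
import math

# The three sample rows shown above:
# (study, chd_pufa, total_pufa, chd_control, total_control)
rows = [
    ("Study A", 12, 846, 25, 852),
    ("Study B", 8, 221, 11, 229),
    ("Study C", 51, 2033, 63, 2036),
]


def log_risk_ratio(events_t, n_t, events_c, n_c):
    """Return the log risk ratio and its large-sample standard error."""
    log_rr = math.log((events_t / n_t) / (events_c / n_c))
    se = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    return log_rr, se


estimates = []  # one (study, log_rr, se) tuple per study
for study, e_t, n_t, e_c, n_c in rows:
    log_rr, se = log_risk_ratio(e_t, n_t, e_c, n_c)
    estimates.append((study, log_rr, se))
    low, high = math.exp(log_rr - 1.96 * se), math.exp(log_rr + 1.96 * se)
    print(f"{study}: RR = {math.exp(log_rr):.2f} (95% CI {low:.2f} to {high:.2f})")

# Fixed-effect pooling: weight each log risk ratio by 1 / SE^2
weights = [1 / se**2 for _, _, se in estimates]
pooled_log_rr = sum(w * lr for w, (_, lr, _) in zip(weights, estimates)) / sum(weights)
pooled_rr = math.exp(pooled_log_rr)
print(f"Pooled RR (fixed effect, sample rows only) = {pooled_rr:.2f}")
```

A random-effects model and a heterogeneity statistic would be the natural next steps, and are usually better handled by a dedicated meta-analysis package than by hand.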


Beta Blocker RCT

Description

An early randomized controlled trial examining mortality outcomes after beta blocker treatment post-myocardial infarction.

Data Structure

44 observations of 3 variables:

  • treatment - Treatment group (beta blocker vs control)
  • deaths - Number of deaths
  • total - Total number of patients

Sample Data

treatment | deaths | total
Control | 39 | 685
Treated | 27 | 674

Download Beta Blocker dataset

Use cases: Risk difference, number needed to treat (NNT), confidence intervals, hypothesis testing
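
The risk difference and NNT use cases can be sketched directly from the two sample rows above. This is a minimal example that assumes a Wald (normal-approximation) interval for the difference of two independent proportions; other interval methods exist and may be preferable with small counts.

```python
import math

# Sample rows shown above: deaths and total patients per arm
control_deaths, control_total = 39, 685
treated_deaths, treated_total = 27, 674

risk_control = control_deaths / control_total
risk_treated = treated_deaths / treated_total

# Absolute risk reduction: control risk minus treated risk
arr = risk_control - risk_treated

# Wald standard error for a difference of two independent proportions
se = math.sqrt(
    risk_control * (1 - risk_control) / control_total
    + risk_treated * (1 - risk_treated) / treated_total
)
ci_low, ci_high = arr - 1.96 * se, arr + 1.96 * se

# NNT is the reciprocal of the absolute risk reduction
nnt = 1 / arr

print(f"ARR = {arr:.4f} (95% CI {ci_low:.4f} to {ci_high:.4f})")
print(f"NNT = {nnt:.1f}")
```

With these two rows the Wald interval for the risk difference spans zero, so the NNT cannot be given a simple closed interval. That makes the example useful for discussing how to report NNT alongside an uncertain risk difference.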


Polyp Prevention Trial

Description

A randomized trial investigating interventions to reduce the recurrence of colorectal polyps.

Data Structure

20 observations of 4 variables:

  • treatment - Treatment assignment
  • baseline - Baseline polyp count
  • followup - Follow-up polyp count
  • age - Patient age

Download Polyp dataset

Use cases: Paired data analysis, change scores, regression to the mean


Crossover BP Medication Trial

Description

A crossover trial comparing the effectiveness of different blood pressure medications in controlling hypertension.

Data Structure

2,000 observations of 6 variables:

  • patient_id - Patient identifier
  • period - Treatment period (1 or 2)
  • treatment - Medication type
  • sbp - Systolic blood pressure (mmHg)
  • dbp - Diastolic blood pressure (mmHg)
  • sequence - Treatment sequence group

Download BP Crossover dataset

Use cases: Crossover designs, period effects, carryover effects, mixed models
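
The core idea of a crossover analysis, each patient serving as their own control, can be sketched with the documented columns. The records below are hypothetical values invented for illustration (they are not taken from the dataset), and the treatment labels "A"/"B" and sequence codes "AB"/"BA" are assumptions about how the file is coded.

```python
from statistics import mean, stdev

# Hypothetical records in the documented long layout (invented values,
# not taken from the dataset; treatment and sequence codes are assumed).
records = [
    {"patient_id": 1, "period": 1, "treatment": "A", "sbp": 142, "sequence": "AB"},
    {"patient_id": 1, "period": 2, "treatment": "B", "sbp": 136, "sequence": "AB"},
    {"patient_id": 2, "period": 1, "treatment": "B", "sbp": 150, "sequence": "BA"},
    {"patient_id": 2, "period": 2, "treatment": "A", "sbp": 155, "sequence": "BA"},
    {"patient_id": 3, "period": 1, "treatment": "A", "sbp": 138, "sequence": "AB"},
    {"patient_id": 3, "period": 2, "treatment": "B", "sbp": 135, "sequence": "AB"},
]

# Reshape to one entry per patient: {patient_id: {treatment: sbp}}
by_patient = {}
for row in records:
    by_patient.setdefault(row["patient_id"], {})[row["treatment"]] = row["sbp"]

# Within-patient difference in systolic pressure (B minus A)
differences = [sbp["B"] - sbp["A"] for sbp in by_patient.values()]

print("Within-patient differences (B - A):", differences)
print(f"Mean difference: {mean(differences):.2f} mmHg")
print(f"SD of differences: {stdev(differences):.2f} mmHg")
```

This simple paired summary ignores period and carryover effects. Comparing the differences between the two sequence groups, or fitting a mixed model with period and treatment terms, would address those.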


Acupuncture Headache Trial

Description

A randomized controlled trial of acupuncture for chronic headache prevention, based on Vickers et al. (2004).

Data Structure

401 observations of 6 variables:

  • id - Patient identifier
  • age - Age in years
  • chronicity - Years with chronic headache
  • treatment - Acupuncture vs control
  • pk1 - Primary outcome: headache score at follow-up
  • pk5 - Secondary outcome at 12 months

Download Acupuncture dataset

Use cases: Analysis of covariance (ANCOVA), baseline adjustment, intention-to-treat analysis


International Stroke Trial

Description

A large prospective randomized controlled trial conducted from 1991-1996 testing the effectiveness of aspirin and heparin in acute ischemic stroke.

Data Structure

19,435 observations of 12 variables:

  • id - Patient identifier
  • aspirin - Aspirin treatment (yes/no)
  • heparin - Heparin dose (high/medium/none)
  • dead - Death within 14 days
  • fdead - Death within 6 months
  • recstroke - Recurrent stroke within 14 days
  • age - Age in years
  • sex - Sex (M/F)
  • rconsc - Conscious state at randomization
  • rdelay - Hours from stroke to randomization
  • stype - Stroke type
  • rxhep - Heparin received as allocated

Sample Data

id | aspirin | heparin | dead | age | sex | stype
1 | Y | None | N | 75 | F | Infarction
2 | N | Medium | N | 83 | M | Infarction
3 | Y | High | Y | 71 | F | Unknown

Download Stroke dataset

Use cases: Factorial designs, survival analysis, subgroup analysis, multiple outcomes, large sample methods
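
For the factorial design use case, a first step is usually to tabulate outcomes within each aspirin-by-heparin cell and then across the margins. The sketch below does this for the three sample rows shown above, using only the standard library. It assumes the "Y"/"N" and "None"/"Medium"/"High" codes in the sample match the coding of the full file.

```python
from collections import Counter

# The three sample rows shown above (subset of columns)
rows = [
    {"id": 1, "aspirin": "Y", "heparin": "None", "dead": "N"},
    {"id": 2, "aspirin": "N", "heparin": "Medium", "dead": "N"},
    {"id": 3, "aspirin": "Y", "heparin": "High", "dead": "Y"},
]

# Patients and 14-day deaths in each aspirin x heparin cell
patients = Counter((r["aspirin"], r["heparin"]) for r in rows)
deaths = Counter((r["aspirin"], r["heparin"]) for r in rows if r["dead"] == "Y")

for cell in sorted(patients):
    aspirin, heparin = cell
    print(f"aspirin={aspirin}, heparin={heparin}: "
          f"{deaths[cell]}/{patients[cell]} died within 14 days")

# Marginal counts for aspirin, collapsing over heparin allocation
aspirin_n = Counter(r["aspirin"] for r in rows)
aspirin_deaths = Counter(r["aspirin"] for r in rows if r["dead"] == "Y")

for arm in sorted(aspirin_n):
    print(f"aspirin={arm}: {aspirin_deaths[arm]}/{aspirin_n[arm]} died within 14 days")
```

On the full file the same counts feed directly into risk comparisons for each factor, and into a check for interaction between aspirin and heparin.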


Using These Datasets

R

# Read CSV files
pufa <- read.csv("datasets/pufa.csv")
stroke <- read.csv("datasets/stroke.csv")

# Or use readr, which returns a tibble and reports the column types it guessed
library(readr)
pufa <- read_csv("datasets/pufa.csv")

Stata

* Import CSV files (the clear option replaces the data in memory, so run one import at a time)
import delimited "datasets/pufa.csv", clear
import delimited "datasets/stroke.csv", clear

Python

import pandas as pd

# Read CSV files
pufa = pd.read_csv("datasets/pufa.csv")
stroke = pd.read_csv("datasets/stroke.csv")
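
After loading a file, it can help to confirm that it contains the columns documented in its "Data Structure" section. The following is a dependency-free sketch using the standard `csv` module. The helper `check_columns` is hypothetical, and the in-memory text stands in for the downloaded file on the assumption that its header row matches the documented variable names.

```python
import csv
import io

# Columns documented in the PUFA "Data Structure" section
EXPECTED_COLUMNS = ["study", "chd_pufa", "total_pufa", "chd_control", "total_control"]


def check_columns(handle, expected):
    """Read the CSV header from an open handle and return any missing columns."""
    reader = csv.DictReader(handle)
    found = reader.fieldnames or []
    return [name for name in expected if name not in found]


# Stand-in for open("datasets/pufa.csv", newline=""). The text mirrors the
# sample rows shown earlier and assumes the file has a matching header row.
sample = io.StringIO(
    "study,chd_pufa,total_pufa,chd_control,total_control\n"
    "Study A,12,846,25,852\n"
    "Study B,8,221,11,229\n"
)

missing = check_columns(sample, EXPECTED_COLUMNS)
print("Missing columns:", missing)
```

With pandas, the same check can be made by comparing the expected names against `list(pufa.columns)`.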

References

  • Mozaffarian D, Micha R, Wallace S. Effects on coronary heart disease of increasing polyunsaturated fat in place of saturated fat: a systematic review and meta-analysis of randomized controlled trials. PLoS Med. 2010;7(3):e1000252.

  • Vickers AJ, Rees RW, Zollman CE, et al. Acupuncture for chronic headache in primary care: large, pragmatic, randomised trial. BMJ. 2004;328(7442):744.

  • International Stroke Trial Collaborative Group. The International Stroke Trial (IST): a randomised trial of aspirin, subcutaneous heparin, both, or neither among 19435 patients with acute ischaemic stroke. Lancet. 1997;349(9065):1569-1581.


Citation

If you use these datasets in your teaching or research, please cite appropriately and acknowledge the original sources listed in the references above.


Citation

BibTeX citation:
@online{panda,
  author = {Panda, Sir and Rafi, Zad},
  title = {Datasets},
  url = {https://lesslikely.com/datasets.html},
  langid = {en}
}
For attribution, please cite this work as:
1. Panda S, Rafi Z. ‘Datasets’. https://lesslikely.com/datasets.html.