Datasets
Sample datasets for teaching and reproducible research
A curated collection of datasets for teaching statistical concepts, demonstrating analytical methods, and practicing reproducible research. All datasets are available for download and use in educational settings.
Dataset Summaries
| Dataset | Description | Observations | Download |
|---|---|---|---|
| PUFA Meta-Analysis | Impact of increased PUFA consumption as replacement for SFA on CHD events (Mozaffarian et al., 2010) | 6 studies | CSV |
| Beta Blocker RCT | Early randomized trial examining mortality after beta blocker treatment | 44 patients | CSV |
| Polyp Prevention Trial | Randomized trial to reduce colorectal polyps | 20 patients | CSV |
| Crossover BP Medication Trial | Crossover trial comparing blood pressure medications | 2,000 patients | CSV |
| Acupuncture Headache Trial | RCT of acupuncture for chronic headache (Vickers et al., 2004) | 401 patients | CSV |
| International Stroke Trial | Large prospective RCT testing aspirin and heparin effectiveness (1991-1996) | 19,435 patients | CSV |
PUFA Meta-Analysis
Description
This dataset examines the impact of increased polyunsaturated fatty acid (PUFA) consumption as a replacement for saturated fatty acids (SFA) on coronary heart disease (CHD) events. Based on Mozaffarian et al. (2010).
Data Structure
6 observations of 5 variables:
- study: Study identifier
- chd_pufa: CHD events in PUFA group
- total_pufa: Total participants in PUFA group
- chd_control: CHD events in control group
- total_control: Total participants in control group
Sample Data
| study | chd_pufa | total_pufa | chd_control | total_control |
|---|---|---|---|---|
| Study A | 12 | 846 | 25 | 852 |
| Study B | 8 | 221 | 11 | 229 |
| Study C | 51 | 2033 | 63 | 2036 |
Use cases: Meta-analysis, risk ratios, forest plots, heterogeneity assessment
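As a minimal sketch of the risk-ratio use case, the three sample rows above can be pooled with an inverse-variance (fixed-effect) estimate on the log scale. The helper names `log_risk_ratio` and `pooled_fixed_effect` are illustrative, and the fixed-effect model is an assumption made for brevity; it is not necessarily the model used by Mozaffarian et al.

```python
import math

# Rows copied from the sample table above:
# (study, chd_pufa, total_pufa, chd_control, total_control)
rows = [
    ("Study A", 12, 846, 25, 852),
    ("Study B", 8, 221, 11, 229),
    ("Study C", 51, 2033, 63, 2036),
]


def log_risk_ratio(events_t, total_t, events_c, total_c):
    """Return the log risk ratio and its large-sample variance."""
    log_rr = math.log((events_t / total_t) / (events_c / total_c))
    var = 1 / events_t - 1 / total_t + 1 / events_c - 1 / total_c
    return log_rr, var


def pooled_fixed_effect(rows, z=1.96):
    """Inverse-variance pooled risk ratio with an approximate 95% CI."""
    estimates = [log_risk_ratio(*row[1:]) for row in rows]
    weights = [1 / var for _, var in estimates]
    pooled = sum(w * lrr for w, (lrr, _) in zip(weights, estimates)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return tuple(math.exp(x) for x in (pooled, pooled - z * se, pooled + z * se))


# Per-study risk ratios, then the pooled estimate.
for study, *counts in rows:
    lrr, _ = log_risk_ratio(*counts)
    print(f"{study}: RR = {math.exp(lrr):.2f}")

rr, lo, hi = pooled_fixed_effect(rows)
print(f"Pooled RR = {rr:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Because only three of the six studies appear in the sample table, the pooled value here illustrates the calculation and is not a result for the full dataset.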
Beta Blocker RCT
Description
An early randomized controlled trial examining mortality outcomes after beta blocker treatment post-myocardial infarction.
Data Structure
44 observations of 3 variables:
- treatment: Treatment group (beta blocker vs control)
- deaths: Number of deaths
- total: Total number of patients
Sample Data
| treatment | deaths | total |
|---|---|---|
| Control | 39 | 685 |
| Treated | 27 | 674 |
Use cases: Risk difference, number needed to treat (NNT), confidence intervals, hypothesis testing
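The risk difference and number needed to treat can be worked out directly from the two rows of the sample table. The sketch below uses a Wald (normal-approximation) interval, which is one common choice and an assumption here; the function name `risk_difference` is illustrative.

```python
import math

# Counts copied from the sample table above.
control_deaths, control_total = 39, 685
treated_deaths, treated_total = 27, 674


def risk_difference(events_t, total_t, events_c, total_c, z=1.96):
    """Risk difference (treated minus control) with a Wald confidence interval."""
    p_t = events_t / total_t
    p_c = events_c / total_c
    rd = p_t - p_c
    se = math.sqrt(p_t * (1 - p_t) / total_t + p_c * (1 - p_c) / total_c)
    return rd, rd - z * se, rd + z * se


rd, lo, hi = risk_difference(
    treated_deaths, treated_total, control_deaths, control_total
)
print(f"Risk difference = {rd:.4f} (95% CI {lo:.4f} to {hi:.4f})")

# The NNT is the reciprocal of the absolute risk reduction. It is only
# straightforward to interpret when the interval for the risk difference
# excludes zero.
if rd != 0:
    print(f"NNT = {1 / abs(rd):.1f}")
```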
Polyp Prevention Trial
Description
A randomized trial investigating interventions to reduce the recurrence of colorectal polyps.
Data Structure
20 observations of 4 variables:
- treatment: Treatment assignment
- baseline: Baseline polyp count
- followup: Follow-up polyp count
- age: Patient age
Use cases: Paired data analysis, change scores, regression to the mean
Crossover BP Medication Trial
Description
A crossover trial comparing the effectiveness of different blood pressure medications in controlling hypertension.
Data Structure
2,000 observations of 6 variables:
- patient_id: Patient identifier
- period: Treatment period (1 or 2)
- treatment: Medication type
- sbp: Systolic blood pressure (mmHg)
- dbp: Diastolic blood pressure (mmHg)
- sequence: Treatment sequence group
Use cases: Crossover designs, period effects, carryover effects, mixed models
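A period-adjusted treatment effect for a two-period crossover can be sketched from within-patient differences. The records below are made-up values for illustration only, and the treatment labels "A"/"B" and sequence labels "AB"/"BA" are assumptions about how the file is coded; check the actual CSV before reusing this.

```python
from statistics import mean

# Hypothetical records, NOT taken from the dataset:
# (patient_id, period, treatment, sbp, sequence)
records = [
    (1, 1, "A", 150, "AB"), (1, 2, "B", 142, "AB"),
    (2, 1, "A", 160, "AB"), (2, 2, "B", 155, "AB"),
    (3, 1, "B", 148, "BA"), (3, 2, "A", 152, "BA"),
    (4, 1, "B", 139, "BA"), (4, 2, "A", 147, "BA"),
]


def period_differences(records):
    """Period 1 minus period 2 SBP for each patient, grouped by sequence."""
    by_patient = {}
    for pid, period, _treatment, sbp, sequence in records:
        by_patient.setdefault((pid, sequence), {})[period] = sbp
    diffs = {}
    for (_pid, sequence), sbp_by_period in by_patient.items():
        diffs.setdefault(sequence, []).append(sbp_by_period[1] - sbp_by_period[2])
    return diffs


def treatment_effect(records, first="AB", second="BA"):
    """A minus B effect: half the difference in mean period differences.

    Averaging across the two sequences cancels a constant period effect.
    """
    diffs = period_differences(records)
    return (mean(diffs[first]) - mean(diffs[second])) / 2


print(f"Estimated A minus B effect on SBP: {treatment_effect(records)} mmHg")
```

This estimate assumes no carryover between periods; a mixed model is the more general tool when that assumption is in doubt.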
Acupuncture Headache Trial
Description
A randomized controlled trial of acupuncture for chronic headache prevention, based on Vickers et al. (2004).
Data Structure
401 observations of 6 variables:
- id: Patient identifier
- age: Age in years
- chronicity: Years with chronic headache
- treatment: Acupuncture vs control
- pk1: Primary outcome: headache score at follow-up
- pk5: Secondary outcome at 12 months
Use cases: Analysis of covariance (ANCOVA), baseline adjustment, intention-to-treat analysis
The International Stroke Trial
Description
A large prospective randomized controlled trial, conducted from 1991 to 1996, testing the effectiveness of aspirin and heparin in acute ischemic stroke.
Data Structure
19,435 observations of 12 variables:
- id: Patient identifier
- aspirin: Aspirin treatment (yes/no)
- heparin: Heparin dose (high/medium/none)
- dead: Death within 14 days
- fdead: Death within 6 months
- recstroke: Recurrent stroke within 14 days
- age: Age in years
- sex: Sex (M/F)
- rconsc: Conscious state at randomization
- rdelay: Hours from stroke to randomization
- stype: Stroke type
- rxhep: Heparin received as allocated
Sample Data
| id | aspirin | heparin | dead | age | sex | stype |
|---|---|---|---|---|---|---|
| 1 | Y | None | N | 75 | F | Infarction |
| 2 | N | Medium | N | 83 | M | Infarction |
| 3 | Y | High | Y | 71 | F | Unknown |
Use cases: Factorial designs, survival analysis, subgroup analysis, multiple outcomes, large sample methods
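For the factorial-design use case, a natural first step is to tabulate outcomes within each aspirin by heparin cell. The sketch below does this for the three sample rows above, assuming the `Y`/`N` coding shown in the sample table; the function name `cell_counts` is illustrative.

```python
from collections import defaultdict

# Rows copied from the sample table above: (id, aspirin, heparin, dead)
rows = [
    (1, "Y", "None", "N"),
    (2, "N", "Medium", "N"),
    (3, "Y", "High", "Y"),
]


def cell_counts(rows):
    """Count patients and 14-day deaths in each aspirin x heparin cell."""
    cells = defaultdict(lambda: {"n": 0, "deaths": 0})
    for _id, aspirin, heparin, dead in rows:
        cell = cells[(aspirin, heparin)]
        cell["n"] += 1
        cell["deaths"] += int(dead == "Y")
    return dict(cells)


for (aspirin, heparin), cell in sorted(cell_counts(rows).items()):
    print(f"aspirin={aspirin} heparin={heparin}: {cell['deaths']}/{cell['n']} died")
```

With the full file, the same tabulation gives the cell-level death proportions that a factorial analysis compares across the aspirin and heparin margins.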
Using These Datasets
R
# Read CSV files
pufa <- read.csv("datasets/pufa.csv")
stroke <- read.csv("datasets/stroke.csv")
# Or use readr for better parsing
library(readr)
pufa <- read_csv("datasets/pufa.csv")
Stata
* Import CSV files
import delimited "datasets/pufa.csv", clear
import delimited "datasets/stroke.csv", clear
Python
import pandas as pd
# Read CSV files
pufa = pd.read_csv("datasets/pufa.csv")
stroke = pd.read_csv("datasets/stroke.csv")
References
Mozaffarian D, Micha R, Wallace S. Effects on coronary heart disease of increasing polyunsaturated fat in place of saturated fat: a systematic review and meta-analysis of randomized controlled trials. PLoS Med. 2010;7(3):e1000252.
Vickers AJ, Rees RW, Zollman CE, et al. Acupuncture for chronic headache in primary care: large, pragmatic, randomised trial. BMJ. 2004;328(7442):744.
International Stroke Trial Collaborative Group. The International Stroke Trial (IST): a randomised trial of aspirin, subcutaneous heparin, both, or neither among 19435 patients with acute ischaemic stroke. Lancet. 1997;349(9065):1569-1581.
Citation
If you use these datasets in your teaching or research, please cite appropriately and acknowledge the original sources listed in the references above.
@online{panda,
author = {Panda, Sir and Rafi, Zad},
title = {Datasets},
url = {https://lesslikely.com/datasets.html},
langid = {en}
}