---
title: "Unmatched case-control odds ratios"
code-fold: show
code-tools: true
vignette: >
%\VignetteIndexEntry{Unmatched case-control odds ratios}
%\VignetteEngine{quarto::html}
%\VignetteEncoding{UTF-8}
---
```{r}
#| include: false
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
In an *unmatched* case-control study, cases and controls are sampled
independently from the source population — no individual or frequency matching.
The classical analysis is the **conditional odds ratio**, identified from the
case-control sample even though the marginal disease frequency is fixed by the
sampling design. matchatr provides two engines on `unmatched_cc()`:
- `estimator = "logistic"` — logistic regression
(`outcome ~ exposure + confounders`), the conditional OR for any exposure
type. Because case-control sampling shifts only the intercept (Prentice &
Pyke, 1979), the slope ORs equal the source-population ORs.
- `estimator = "mh"` — the closed-form Mantel-Haenszel stratified OR with a
Robins-Breslow-Greenland confidence interval.
```{r}
#| message: false
library(matchatr)
```
## Logistic conditional OR
### A binary exposure
We simulate a cohort with a known log odds ratio, then draw an unmatched
case-control sample (all cases plus an equal number of controls). The conditional
OR should recover the cohort OR, `exp(log(2.5)) = 2.5`.
```{r}
set.seed(1)
n <- 8000
x <- rbinom(n, 1, 0.4)
age <- rnorm(n, 50, 10)
# True conditional log-OR for x is log(2.5).
lp <- -1 + log(2.5) * x + 0.03 * (age - 50)
case <- rbinom(n, 1, plogis(lp))
cohort <- data.frame(case = case, x = x, age = age)
# Unmatched case-control sample: every case + an equal random sample of controls.
cases <- cohort[cohort$case == 1, ]
controls <- cohort[cohort$case == 0, ]
cc <- rbind(cases, controls[sample(nrow(controls), nrow(cases)), ])
fit <- matcha(
cc,
outcome = "case", exposure = "x",
design = unmatched_cc(),
confounders = ~ age, estimator = "logistic"
)
contrast(fit, type = "or")
```
The recovered OR is close to the true 2.5 even though the sample is roughly 50%
cases — the odds ratio is invariant to the case / control sampling fractions.
The full coefficient table is available through `tidy()` (the intercept is
*not* an interpretable baseline risk under case-control sampling, so treat it as
a nuisance):
```{r}
tidy(fit, exponentiate = TRUE)
```
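As a rough illustration of why the intercept is a nuisance, the sketch below uses base R's `glm()` rather than matchatr, on a separate simulated cohort (all `pp_*` names are hypothetical and exist only for this example). Under the Prentice-Pyke result, the case-control fit should reproduce the cohort slope up to sampling noise, while its intercept is shifted by roughly the log of the ratio of the case and control sampling fractions.

```{r}
set.seed(11)
pp_n <- 20000
pp_x <- rbinom(pp_n, 1, 0.4)
pp_y <- rbinom(pp_n, 1, plogis(-3 + log(2.5) * pp_x))
pp_cohort <- data.frame(y = pp_y, x = pp_x)

# Case-control sample: keep every case and a 10% random sample of non-cases.
pp_ctrl_frac <- 0.10
pp_ctrl_rows <- which(pp_cohort$y == 0)
pp_keep <- c(
  which(pp_cohort$y == 1),
  sample(pp_ctrl_rows, round(pp_ctrl_frac * length(pp_ctrl_rows)))
)
pp_cc <- pp_cohort[pp_keep, ]

pp_fit_cohort <- glm(y ~ x, family = binomial, data = pp_cohort)
pp_fit_cc <- glm(y ~ x, family = binomial, data = pp_cc)

# Slopes should agree up to sampling noise; intercepts should not.
rbind(cohort = coef(pp_fit_cohort), case_control = coef(pp_fit_cc))

# Expected intercept shift: log(case fraction / control fraction) = log(1 / 0.10).
pp_shift <- unname(coef(pp_fit_cc)[1] - coef(pp_fit_cohort)[1])
c(observed_shift = pp_shift, expected_shift = log(1 / pp_ctrl_frac))
```

Because the shift depends on sampling fractions that are usually unknown in a real case-control study, the fitted intercept cannot be converted to a baseline risk without outside information.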
### A continuous exposure
A continuous exposure yields a single per-unit OR:
```{r}
set.seed(2)
n <- 6000
dose <- rnorm(n)
case <- rbinom(n, 1, plogis(-1 + 0.8 * dose))
d_cont <- data.frame(case = case, dose = dose)
fit_cont <- matcha(
d_cont,
outcome = "case", exposure = "dose",
design = unmatched_cc(), estimator = "logistic"
)
contrast(fit_cont, type = "or")
```
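A per-unit OR depends on the unit of the exposure. The following sketch, again with plain `glm()` and hypothetical `pu_*` names rather than the matchatr API, shows that the OR for a two-unit increase is the per-unit OR squared, and that rescaling the covariate before fitting gives the same number.

```{r}
set.seed(12)
pu_dose <- rnorm(4000)
pu_case <- rbinom(4000, 1, plogis(-1 + 0.8 * pu_dose))

pu_fit <- glm(pu_case ~ pu_dose, family = binomial)
pu_beta <- unname(coef(pu_fit)["pu_dose"])

# OR for a 1-unit and a 2-unit increase, derived from the same coefficient.
pu_or <- c(per_unit = exp(pu_beta), per_2_units = exp(2 * pu_beta))
pu_or

# Refit with the exposure expressed in 2-unit steps: the slope OR matches.
pu_fit2 <- glm(pu_case ~ I(pu_dose / 2), family = binomial)
pu_or_rescaled <- unname(exp(coef(pu_fit2)[2]))
pu_or_rescaled
```

In practice it is often clearer to rescale the exposure to a clinically meaningful step before fitting than to exponentiate a multiple of the coefficient afterwards.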
### A categorical exposure: the `esoph` dose-response
The built-in `esoph` data is the Ille-et-Vilaine case-control study of
oesophageal cancer (Breslow & Day, 1980), stored as per-cell case / control
counts. We expand it to one row per subject so it can enter the logistic engine.
```{r}
esoph_long <- do.call(rbind, lapply(seq_len(nrow(esoph)), function(i) {
with(esoph[i, ], data.frame(
case = c(rep(1L, ncases), rep(0L, ncontrols)),
agegp = agegp, alcgp = alcgp, tobgp = tobgp
))
}))
# Use alcohol as an unordered factor so we get one OR per consumption band.
esoph_long$alcgp <- factor(esoph_long$alcgp, ordered = FALSE)
fit_alc <- matcha(
esoph_long,
outcome = "case", exposure = "alcgp",
design = unmatched_cc(),
confounders = ~ agegp + tobgp, estimator = "logistic"
)
contrast(fit_alc, type = "or")
```
A factor exposure reports one OR per non-reference level versus the baseline
(here `0-39g/day`), recovering the canonical monotone alcohol dose-response.
The reference level used is recorded on the result:
```{r}
fit_alc_result <- contrast(fit_alc, type = "or")
fit_alc_result$reference
```
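The choice of baseline changes which contrasts are reported, not the fitted model. The sketch below illustrates this with base R only: it fits a grouped-binomial `glm()` directly to the `esoph` counts, then refits after `relevel()` moves the baseline to `40-79`. The `rl_*` names are hypothetical, and whether matchatr takes its reference from the first factor level in the same way is an assumption, not something verified here.

```{r}
rl_dat <- esoph
# relevel() only works on unordered factors, so drop the ordering first.
rl_dat$agegp <- factor(rl_dat$agegp, ordered = FALSE)
rl_dat$tobgp <- factor(rl_dat$tobgp, ordered = FALSE)
rl_dat$alcgp <- factor(rl_dat$alcgp, ordered = FALSE)
rl_dat$alcgp_r <- relevel(rl_dat$alcgp, ref = "40-79")

rl_fit_a <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
                family = binomial, data = rl_dat)
rl_fit_b <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp_r,
                family = binomial, data = rl_dat)

# OR for 80-119 versus 40-79, obtained two ways.
rl_from_a <- exp(coef(rl_fit_a)[["alcgp80-119"]] - coef(rl_fit_a)[["alcgp40-79"]])
rl_from_b <- exp(coef(rl_fit_b)[["alcgp_r80-119"]])
c(ratio_of_baseline_ORs = rl_from_a, releveled_fit = rl_from_b)
```

The two numbers agree, so any pairwise OR can be recovered from the default output; releveling mainly saves the delta-method step needed for its standard error.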
### An ordinal trend
Entering an ordered exposure as a numeric *score* gives a single per-step trend
OR instead of per-level contrasts:
```{r}
esoph_long$alc_score <- as.integer(factor(
esoph_long$alcgp,
levels = c("0-39g/day", "40-79", "80-119", "120+")
))
fit_trend <- matcha(
esoph_long,
outcome = "case", exposure = "alc_score",
design = unmatched_cc(),
confounders = ~ agegp + tobgp, estimator = "logistic"
)
contrast(fit_trend, type = "or")
```
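A per-step trend OR assumes the log-odds rise by the same amount at every step. One way to see what that implies, sketched here with base `glm()` on the grouped `esoph` counts (hypothetical `tr_*` names, not matchatr calls), is to compare the ORs implied by the trend with the freely estimated per-level ORs, and to compare the two nested models by deviance.

```{r}
tr_dat <- esoph
tr_dat$agegp <- factor(tr_dat$agegp, ordered = FALSE)
tr_dat$tobgp <- factor(tr_dat$tobgp, ordered = FALSE)
tr_dat$alc_fac <- factor(tr_dat$alcgp, ordered = FALSE)
tr_dat$alc_score <- as.integer(tr_dat$alcgp)  # 1..4, following the level order

tr_fit_score <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alc_score,
                    family = binomial, data = tr_dat)
tr_fit_level <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alc_fac,
                    family = binomial, data = tr_dat)

# Under the trend model, level k versus baseline has OR = (per-step OR)^(k - 1).
tr_step_or <- exp(coef(tr_fit_score)[["alc_score"]])
tr_implied <- tr_step_or^(1:3)
tr_fitted <- exp(coef(tr_fit_level)[paste0("alc_fac", levels(tr_dat$alc_fac)[-1])])
rbind(implied_by_trend = tr_implied, per_level = unname(tr_fitted))

# Likelihood-ratio comparison of the nested models (departure from linearity).
anova(tr_fit_score, tr_fit_level, test = "Chisq")
```

A large gap between the two rows, or a small p-value in the comparison, suggests the single trend OR is summarising a dose-response that is not log-linear in the score.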
### Smooth confounder adjustment with `model_fn`
The logistic fitter is pluggable. Pass `model_fn = mgcv::gam` and a smooth term
in `confounders` to adjust for a confounder flexibly while keeping the exposure
parametric (so it still has an interpretable OR). The smooth-basis coefficients
are not reported as odds ratios.
```{r}
#| eval: !expr requireNamespace("mgcv", quietly = TRUE)
fit_gam <- matcha(
cc,
outcome = "case", exposure = "x",
design = unmatched_cc(),
confounders = ~ s(age), model_fn = mgcv::gam, estimator = "logistic"
)
contrast(fit_gam, type = "or")
```
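To make the split between parametric and smooth terms concrete, here is a minimal sketch that calls `mgcv::gam()` directly on a separate simulated data set. It is not a description of what matchatr does internally; the `gs_*` names are hypothetical, and the chunk attaches mgcv so that `s()` is found. Only the parametric row for the exposure is exponentiated, and the smooth for age is left as a nuisance adjustment.

```{r}
#| eval: !expr requireNamespace("mgcv", quietly = TRUE)
#| message: false
library(mgcv)

set.seed(14)
gs_n <- 3000
gs_x <- rbinom(gs_n, 1, 0.4)
gs_age <- rnorm(gs_n, 50, 10)
gs_y <- rbinom(gs_n, 1, plogis(-1 + log(2.5) * gs_x + 0.03 * (gs_age - 50)))

# Exposure enters linearly; age enters through a penalised smooth.
gs_fit <- gam(gs_y ~ gs_x + s(gs_age), family = binomial, method = "REML")

# p.table holds the parametric terms only; smooth-basis coefficients are excluded.
gs_row <- summary(gs_fit)$p.table["gs_x", c("Estimate", "Std. Error")]
gs_or <- exp(gs_row[["Estimate"]])
gs_ci <- exp(gs_row[["Estimate"]] + c(-1, 1) * qnorm(0.975) * gs_row[["Std. Error"]])
c(or = gs_or, ci_lower = gs_ci[1], ci_upper = gs_ci[2])
```

When the true age effect is close to linear, as it is in this simulation, the smooth shrinks towards a straight line and the exposure OR is nearly the same as from an ordinary logistic fit.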
## Mantel-Haenszel stratified OR
When confounding is controlled by *stratification* rather than a model, the
Mantel-Haenszel estimator pools the per-stratum 2×2 odds ratios in closed form.
Declare the stratifying variable(s) on the design and select `estimator = "mh"`.
```{r}
set.seed(3)
n <- 8000
agegrp <- sample(c("30s", "40s", "50s", "60s"), n, replace = TRUE)
x <- rbinom(n, 1, 0.4)
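# True conditional OR for x is 3, with an age-group main effect. Note that x is
# drawn independently of agegrp here, so age group acts as a stratification
# covariate rather than a true confounder in this simulation.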
lp <- -1 + log(3) * x + 0.2 * as.integer(factor(agegrp))
case <- rbinom(n, 1, plogis(lp))
d_mh <- data.frame(case = case, x = x, agegrp = agegrp)
fit_mh <- matcha(
d_mh,
outcome = "case", exposure = "x",
design = unmatched_cc(strata = "agegrp"), estimator = "mh"
)
contrast(fit_mh, type = "or")
```
The reported interval is built from the Robins-Breslow-Greenland (1986) variance
estimator, which is consistent in both the sparse-data and large-stratum limits.
The MH estimator needs a 2×2 table per stratum, so the exposure must be binary;
a categorical (k > 2) or continuous exposure is rejected with a classed error
that points to `estimator = "logistic"`.
With no `strata`, `"mh"` reduces to the crude single-table OR.
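For readers who want to see the closed form, the sketch below computes the Mantel-Haenszel OR and the Robins-Breslow-Greenland variance of its log by hand from the per-stratum 2×2 counts, then checks them against base R's `stats::mantelhaen.test()`. It runs on a separate simulated data set with hypothetical `mh_*` names, and it illustrates the formulas only; it makes no claim about matchatr's internal implementation.

```{r}
set.seed(13)
mh_n <- 4000
mh_g <- sample(c("30s", "40s", "50s", "60s"), mh_n, replace = TRUE)
mh_x <- rbinom(mh_n, 1, 0.4)
mh_y <- rbinom(mh_n, 1, plogis(-1 + log(3) * mh_x + 0.2 * as.integer(factor(mh_g))))
mh_tab <- table(x = mh_x, case = mh_y, stratum = mh_g)

# Per-stratum cell counts.
mh_a <- as.numeric(mh_tab["1", "1", ])  # exposed cases
mh_b <- as.numeric(mh_tab["1", "0", ])  # exposed non-cases
mh_c <- as.numeric(mh_tab["0", "1", ])  # unexposed cases
mh_d <- as.numeric(mh_tab["0", "0", ])  # unexposed non-cases
mh_nk <- mh_a + mh_b + mh_c + mh_d

# Mantel-Haenszel OR: sum(a * d / n) / sum(b * c / n).
mh_R <- mh_a * mh_d / mh_nk
mh_S <- mh_b * mh_c / mh_nk
mh_or <- sum(mh_R) / sum(mh_S)

# Robins-Breslow-Greenland variance of log(OR_MH).
mh_P <- (mh_a + mh_d) / mh_nk
mh_Q <- (mh_b + mh_c) / mh_nk
mh_var <- sum(mh_P * mh_R) / (2 * sum(mh_R)^2) +
  sum(mh_P * mh_S + mh_Q * mh_R) / (2 * sum(mh_R) * sum(mh_S)) +
  sum(mh_Q * mh_S) / (2 * sum(mh_S)^2)
mh_ci <- mh_or * exp(c(-1, 1) * qnorm(0.975) * sqrt(mh_var))

# Cross-check against the base R implementation.
mh_ref <- mantelhaen.test(mh_tab)
rbind(
  by_hand = c(or = mh_or, lower = mh_ci[1], upper = mh_ci[2]),
  base_r = c(unname(mh_ref$estimate), as.numeric(mh_ref$conf.int))
)
```

Each stratum contributes through the weights `b * c / n`, so small strata are down-weighted automatically instead of producing unstable stratum-specific ORs.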
## What the design does and does not identify
Only the odds ratio is identified from an unmatched case-control sample. Risk
differences and risk ratios require the source-population prevalence q0 and a
case-control-weighted estimator:
```{r}
#| error: true
contrast(fit, type = "difference")
```
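The reason is visible in a single 2×2 table. A case-control sample estimates the exposure distribution within cases and within controls; turning those into risks requires Bayes' rule with the source-population prevalence q0. The sketch below uses made-up counts and hypothetical `id_*` helpers (it is not the case-control-weighted estimator itself) to show that the implied risk difference and risk ratio change with q0 while the odds ratio does not.

```{r}
# Hypothetical case-control counts by exposure status.
id_cases <- c(unexposed = 120, exposed = 180)
id_ctrls <- c(unexposed = 210, exposed = 90)

# What the design identifies: P(exposure | case) and P(exposure | control).
id_px_case <- id_cases / sum(id_cases)
id_px_ctrl <- id_ctrls / sum(id_ctrls)

# Bayes' rule: P(case | exposure) for an assumed prevalence q0.
id_risk <- function(q0) {
  num <- id_px_case * q0
  num / (num + id_px_ctrl * (1 - q0))
}

id_summary <- function(q0) {
  r <- id_risk(q0)
  r0 <- r[["unexposed"]]
  r1 <- r[["exposed"]]
  c(q0 = q0, rd = r1 - r0, rr = r1 / r0,
    or = (r1 / (1 - r1)) / (r0 / (1 - r0)))
}

# Same sample, three assumed prevalences.
id_tab <- t(sapply(c(0.01, 0.05, 0.20), id_summary))
id_tab
```

The `or` column is constant and equals the sample cross-product ratio, whereas `rd` and `rr` vary with the assumed prevalence, which is why they cannot be reported without it.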
## Covered combinations
**Legend.** ✅ truth-pinned in tests · ⛔ rejected with an informative error.
| Exposure | Estimator | Estimand | Variance | Status |
|---|---|---|---|---|
| Binary | `logistic` | conditional OR | model / sandwich | ✅ |
| Continuous | `logistic` | per-unit OR | model / sandwich | ✅ |
| Categorical (k>2) | `logistic` | per-level OR | model / sandwich | ✅ |
| Ordinal score | `logistic` | per-step trend OR | model / sandwich | ✅ |
| Any (GAM-adjusted) | `logistic` + `model_fn` | conditional OR | model | ✅ |
| Binary | `mh` | Mantel-Haenszel OR | Robins-Breslow-Greenland | ✅ |
| Ordered factor | `logistic` | — | — | ⛔ use a numeric score |
| Categorical / continuous | `mh` | — | — | ⛔ use `logistic` |
| Any | `logistic` / `mh` | RD / RR | — | ⛔ need q0 |
See `FEATURE_COVERAGE_MATRIX.md` for the authoritative status of every
combination.
## References
Breslow NE, Day NE (1980). *Statistical Methods in Cancer Research, Volume 1:
The Analysis of Case-Control Studies*. IARC.
Prentice RL, Pyke R (1979). Logistic disease incidence models and case-control
studies. *Biometrika* 66(3):403-411.
Robins J, Breslow N, Greenland S (1986). Estimators of the Mantel-Haenszel
variance consistent in both sparse data and large-strata limiting models.
*Biometrics* 42(2):311-323.