Unmatched case-control odds ratios

In an unmatched case-control study, cases and controls are sampled independently from the source population, with no individual or frequency matching. The classical estimand is the conditional odds ratio (OR), which is identified from the case-control sample even though the sampling design fixes the marginal disease frequency. matchatr provides two estimators for the unmatched_cc() design:

estimator = "logistic" — logistic regression (outcome ~ exposure + confounders), which gives the conditional OR for any exposure type. Because case-control sampling shifts only the intercept (Prentice and Pyke 1979), the exponentiated slopes estimate the source-population ORs.
estimator = "mh" — the closed-form Mantel-Haenszel stratified OR with a Robins-Breslow-Greenland confidence interval.
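
The intercept-shift result can be checked outside matchatr with base R's glm(). The sketch below is illustrative only: the simulated cohort, the seed, and every variable name are assumptions introduced here, and no matchatr function is called. It fits the same logistic model to a cohort and to a case-control sample drawn from it, then compares slopes and intercepts.

```r
# Illustrative check of the Prentice-Pyke result with base R only.
set.seed(101)
n <- 20000
x <- rbinom(n, 1, 0.4)
y <- rbinom(n, 1, plogis(-2 + log(2) * x))
cohort <- data.frame(y = y, x = x)

# Case-control sample: all cases plus an equal number of random controls.
cases    <- cohort[cohort$y == 1, ]
controls <- cohort[cohort$y == 0, ]
cc <- rbind(cases, controls[sample(nrow(controls), nrow(cases)), ])

fit_cohort <- glm(y ~ x, family = binomial, data = cohort)
fit_cc     <- glm(y ~ x, family = binomial, data = cc)

# Both slopes estimate the same log-OR; they differ only by sampling noise.
slopes <- c(cohort = unname(coef(fit_cohort)["x"]),
            cc     = unname(coef(fit_cc)["x"]))

# The intercept moves by about log(case fraction / control fraction).
# Cases are kept with fraction 1, controls with nrow(cases) / nrow(controls).
expected_shift <- log(nrow(controls) / nrow(cases))
observed_shift <- unname(coef(fit_cc)[1] - coef(fit_cohort)[1])

print(round(slopes, 3))
print(round(c(expected = expected_shift, observed = observed_shift), 3))
```

The two slopes should agree up to sampling error, while the intercepts differ by roughly the log ratio of the sampling fractions.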

library(matchatr)

Logistic conditional OR

A binary exposure

We simulate a cohort with a known log odds ratio, then draw an unmatched case-control sample (all cases plus an equal number of controls). The conditional OR should recover the cohort OR of 2.5.

set.seed(1)
n <- 8000
x <- rbinom(n, 1, 0.4)
age <- rnorm(n, 50, 10)
# True conditional log-OR for x is log(2.5).
lp <- -1 + log(2.5) * x + 0.03 * (age - 50)
case <- rbinom(n, 1, plogis(lp))
cohort <- data.frame(case = case, x = x, age = age)

# Unmatched case-control sample: every case + an equal random sample of controls.
cases <- cohort[cohort$case == 1, ]
controls <- cohort[cohort$case == 0, ]
cc <- rbind(cases, controls[sample(nrow(controls), nrow(cases)), ])

fit <- matcha(
  cc,
  outcome = "case", exposure = "x",
  design = unmatched_cc(),
  confounders = ~ age, estimator = "logistic"
)
contrast(fit, type = "or")
#> <matchatr_result>
#>  Estimator:  logistic  (engine: glm_logistic)
#>  Estimand:   conditional OR
#>  Contrast:   Odds ratio
#>  CI method:  model
#>  N:          5492
#> 
#> Contrasts:
#>    comparison estimate        se ci_lower ci_upper
#>        <char>    <num>     <num>    <num>    <num>
#> 1:          x 2.369472 0.1338476 2.121137 2.646882

The recovered OR is close to the true 2.5, and its 95% interval (2.12 to 2.65) covers it, even though the sample is 50% cases by construction: the odds ratio is invariant to the case and control sampling fractions.

The full coefficient table is available through tidy() (the intercept is not an interpretable baseline risk under case-control sampling, so treat it as a nuisance):

tidy(fit, exponentiate = TRUE)
#>           term  estimate   std.error statistic      p.value  conf.low conf.high
#>         <char>     <num>       <num>     <num>        <num>     <num>     <num>
#> 1: (Intercept) 0.1413099 0.152205029 -12.85634 7.923957e-38 0.1048614 0.1904276
#> 2:           x 2.3694722 0.056488381  15.27159 1.182591e-52 2.1211365 2.6468822
#> 3:         age 1.0320131 0.002901316  10.86106 1.766951e-27 1.0261612 1.0378983
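
The std.error for x in tidy() (0.0565) differs from the se printed by contrast() (0.1338). The printed numbers are consistent with tidy() reporting the standard error on the log-odds scale and contrast() reporting a delta-method standard error on the OR scale. This reading is inferred from the output above, not taken from the package documentation, so treat the sketch below as a plausibility check.

```r
# Numbers copied from the printed tidy() table above.
or_x   <- 2.3694722    # exponentiated estimate for x
se_log <- 0.056488381  # std.error for x

# Delta method: se(OR) is approximately OR * se(log OR).
se_or <- or_x * se_log

# Wald interval built on the log scale, then exponentiated.
ci <- exp(log(or_x) + c(-1, 1) * qnorm(0.975) * se_log)

print(signif(se_or, 7))
print(round(ci, 4))
```

Both results match the se and interval bounds shown by contrast(fit, type = "or") to the printed precision.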

A continuous exposure

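A continuous exposure yields a single per-unit OR. Here the true value is exp(0.8) ≈ 2.23, and for simplicity the whole simulated cohort is analysed rather than a case-control subsample: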

set.seed(2)
n <- 6000
dose <- rnorm(n)
case <- rbinom(n, 1, plogis(-1 + 0.8 * dose))
d_cont <- data.frame(case = case, dose = dose)

fit_cont <- matcha(
  d_cont,
  outcome = "case", exposure = "dose",
  design = unmatched_cc(), estimator = "logistic"
)
contrast(fit_cont, type = "or")
#> <matchatr_result>
#>  Estimator:  logistic  (engine: glm_logistic)
#>  Estimand:   conditional OR
#>  Contrast:   Odds ratio
#>  CI method:  model
#>  N:          6000
#> 
#> Contrasts:
#>    comparison estimate         se ci_lower ci_upper
#>        <char>    <num>      <num>    <num>    <num>
#> 1:       dose 2.240782 0.07515095 2.098225 2.393024
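
A per-unit OR rescales multiplicatively: under the fitted logistic model, a change of k units in dose multiplies the odds by the per-unit OR raised to the power k, and the interval bounds transform the same way. The short sketch below uses only the numbers printed above; the variable names are introduced here for illustration.

```r
# Printed per-unit estimate and interval for dose.
or_per_unit <- 2.240782
ci_per_unit <- c(2.098225, 2.393024)

# Simulation truth for comparison.
true_or <- exp(0.8)

# OR for a 2-unit increase and for a half-unit increase in dose.
or_two_units <- or_per_unit^2
ci_two_units <- ci_per_unit^2
or_half_unit <- sqrt(or_per_unit)

print(round(c(truth = true_or, per_unit = or_per_unit,
              two_units = or_two_units, half_unit = or_half_unit), 3))
print(round(ci_two_units, 3))
```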

A categorical exposure: the `esoph` dose-response

The built-in esoph data is the Ille-et-Vilaine case-control study of oesophageal cancer (Breslow and Day 1980), stored as per-cell case / control counts. We expand it to one row per subject so it can enter the logistic engine.

esoph_long <- do.call(rbind, lapply(seq_len(nrow(esoph)), function(i) {
  with(esoph[i, ], data.frame(
    case  = c(rep(1L, ncases), rep(0L, ncontrols)),
    agegp = agegp, alcgp = alcgp, tobgp = tobgp
  ))
}))
# Use alcohol as an unordered factor so we get one OR per consumption band.
esoph_long$alcgp <- factor(esoph_long$alcgp, ordered = FALSE)

fit_alc <- matcha(
  esoph_long,
  outcome = "case", exposure = "alcgp",
  design = unmatched_cc(),
  confounders = ~ agegp + tobgp, estimator = "logistic"
)
contrast(fit_alc, type = "or")
#> <matchatr_result>
#>  Estimator:  logistic  (engine: glm_logistic)
#>  Estimand:   conditional OR
#>  Contrast:   Odds ratio
#>  CI method:  model
#>  N:          975
#> 
#> Contrasts:
#>     comparison  estimate        se  ci_lower  ci_upper
#>         <char>     <num>     <num>     <num>     <num>
#> 1:  alcgp40-79  4.198086  1.049782  2.571569  6.853375
#> 2: alcgp80-119  7.247940  2.063936  4.147868 12.664971
#> 3:   alcgp120+ 36.703378 14.132151 17.256874 78.063846
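
The one-row-per-subject expansion carries the same information as the per-cell counts. As a cross-check that does not involve matchatr, the sketch below fits a grouped-binomial glm() directly to the counts and an ordinary Bernoulli glm() to an expanded copy, then confirms that the coefficients agree. The object names (d, long, fit_grouped, fit_long) are introduced here for illustration, and no output is shown.

```r
# Grouped-binomial fit straight from the per-cell counts in datasets::esoph.
d <- esoph
d$alcgp <- factor(d$alcgp, ordered = FALSE)  # treatment contrasts for alcohol

fit_grouped <- glm(cbind(ncases, ncontrols) ~ alcgp + agegp + tobgp,
                   family = binomial, data = d)

# One row per subject, built with rep() rather than a row-by-row loop.
n_cell <- d$ncases + d$ncontrols
long <- d[rep(seq_len(nrow(d)), times = n_cell), c("agegp", "alcgp", "tobgp")]
long$case <- unlist(Map(function(a, b) c(rep(1L, a), rep(0L, b)),
                        d$ncases, d$ncontrols))

fit_long <- glm(case ~ alcgp + agegp + tobgp, family = binomial, data = long)

# Exponentiate the alcohol terms to get one OR per consumption band.
alc_terms <- grep("^alcgp", names(coef(fit_long)), value = TRUE)
print(round(exp(coef(fit_long)[alc_terms]), 3))

# The two fits should give the same coefficients.
print(all.equal(coef(fit_grouped), coef(fit_long), tolerance = 1e-4))
```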

A factor exposure reports one OR per non-reference level versus the baseline (here 0-39g/day), recovering the canonical monotone alcohol dose-response. The reference level used is recorded on the result:

fit_alc_result <- contrast(fit_alc, type = "or")
fit_alc_result$reference
#> [1] "0-39g/day"
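
The reported reference matches the first factor level, which is the baseline under R's default treatment contrasts. A different baseline can be prepared in base R with relevel(). Whether matcha() picks up a releveled factor in the same way is an assumption here, not something the output above confirms.

```r
# Alcohol bands as an unordered factor (relevel() requires this).
alc <- factor(esoph$alcgp, ordered = FALSE)
print(levels(alc))      # the first level is the treatment-contrast baseline

# Move "40-79" to the front so it would become the comparison group.
alc_ref <- relevel(alc, ref = "40-79")
print(levels(alc_ref))
```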

An ordinal trend

Entering an ordered exposure as a numeric score gives a single per-step trend OR instead of per-level contrasts:

esoph_long$alc_score <- as.integer(factor(
  esoph_long$alcgp,
  levels = c("0-39g/day", "40-79", "80-119", "120+")
))

fit_trend <- matcha(
  esoph_long,
  outcome = "case", exposure = "alc_score",
  design = unmatched_cc(),
  confounders = ~ agegp + tobgp, estimator = "logistic"
)
contrast(fit_trend, type = "or")
#> <matchatr_result>
#>  Estimator:  logistic  (engine: glm_logistic)
#>  Estimand:   conditional OR
#>  Contrast:   Odds ratio
#>  CI method:  model
#>  N:          975
#> 
#> Contrasts:
#>    comparison estimate        se ci_lower ci_upper
#>        <char>    <num>     <num>    <num>    <num>
#> 1:  alc_score 2.921659 0.3093806 2.374073 3.595547
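
The score model assumes that each step up the alcohol scale multiplies the odds by the same factor. The sketch below sets the ORs implied by the printed trend estimate against the per-level ORs from the categorical fit, using only numbers already shown in this article; the vector names are illustrative.

```r
trend_or  <- 2.921659   # printed per-step trend OR
per_level <- c(`40-79` = 4.198086, `80-119` = 7.247940, `120+` = 36.703378)

# Under the score model, k steps above baseline multiply the odds by trend_or^k.
implied <- setNames(trend_or^(1:3), names(per_level))

# Rows: per-level ORs from the factor fit, then ORs implied by the trend fit.
print(round(rbind(per_level, implied), 2))
```

The implied values (about 2.92, 8.54, and 24.9) sit below the per-level estimates at the lowest and highest bands and above at the middle band, which shows how much the log-linear constraint smooths the dose-response.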

Smooth confounder adjustment with `model_fn`

The logistic fitter is pluggable. Pass model_fn = mgcv::gam and a smooth term in confounders to adjust for a confounder flexibly while keeping the exposure parametric (so it still has an interpretable OR). The smooth-basis coefficients are not reported as odds ratios.

fit_gam <- matcha(
  cc,
  outcome = "case", exposure = "x",
  design = unmatched_cc(),
  confounders = ~ s(age), model_fn = mgcv::gam, estimator = "logistic"
)
contrast(fit_gam, type = "or")
#> <matchatr_result>
#>  Estimator:  logistic  (engine: glm_logistic)
#>  Estimand:   conditional OR
#>  Contrast:   Odds ratio
#>  CI method:  model
#>  N:          5492
#> 
#> Contrasts:
#>    comparison estimate        se ci_lower ci_upper
#>        <char>    <num>     <num>    <num>    <num>
#> 1:          x 2.369472 0.1338477 2.121136 2.646882
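
For orientation, this is roughly what a model with a parametric exposure and a smooth confounder looks like when fitted directly with mgcv::gam(). It is a standalone sketch on hypothetical simulated data with a curved age effect, not a description of what matcha() does internally.

```r
library(mgcv)  # recommended package, normally installed alongside R

set.seed(11)
n <- 4000
x <- rbinom(n, 1, 0.4)
age <- rnorm(n, 50, 10)
# Hypothetical truth: log-OR of log(2.5) for x and a curved age effect.
lp <- -1 + log(2.5) * x + 0.002 * (age - 50)^2
case <- rbinom(n, 1, plogis(lp))
d <- data.frame(case = case, x = x, age = age)

fit <- gam(case ~ x + s(age), family = binomial, data = d)

# x enters parametrically, so exp(coef) is an odds ratio; the s(age) basis
# coefficients have no such reading.
or_x   <- exp(unname(coef(fit)["x"]))
se_log <- summary(fit)$p.table["x", "Std. Error"]
print(round(c(or = or_x,
              lower = or_x * exp(-qnorm(0.975) * se_log),
              upper = or_x * exp(qnorm(0.975) * se_log)), 3))
```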

Mantel-Haenszel stratified OR

When confounding is controlled by stratification rather than a model, the Mantel-Haenszel estimator pools the per-stratum 2×2 odds ratios in closed form. Declare the stratifying variable(s) on the design and select estimator = "mh".

set.seed(3)
n <- 8000
agegrp <- sample(c("30s", "40s", "50s", "60s"), n, replace = TRUE)
x <- rbinom(n, 1, 0.4)
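# True OR for x is 3, with an age-group main effect on the outcome.
# x is drawn independently of agegrp, so agegrp acts as a stratifier here rather than a true confounder.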
lp <- -1 + log(3) * x + 0.2 * as.integer(factor(agegrp))
case <- rbinom(n, 1, plogis(lp))
d_mh <- data.frame(case = case, x = x, agegrp = agegrp)

fit_mh <- matcha(
  d_mh,
  outcome = "case", exposure = "x",
  design = unmatched_cc(strata = "agegrp"), estimator = "mh"
)
contrast(fit_mh, type = "or")
#> <matchatr_result>
#>  Estimator:  mh  (engine: mantel_haenszel)
#>  Estimand:   Mantel-Haenszel OR
#>  Contrast:   Odds ratio
#>  CI method:  model
#>  N:          8000
#> 
#> Contrasts:
#>    comparison estimate        se ci_lower ci_upper
#>        <char>    <num>     <num>    <num>    <num>
#> 1:          x 3.001334 0.1440983 2.731788 3.297477

The reported interval is based on the Robins-Breslow-Greenland variance (Robins et al. 1986), which is valid in both the sparse-data and large-stratum limits. The MH estimator needs a 2×2 table per stratum, so the exposure must be binary; a categorical (k > 2) or continuous exposure is rejected with a classed error that points to estimator = "logistic". With no strata, "mh" reduces to the crude single-table OR.
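
The closed form is short enough to compute by hand. The sketch below builds the stratified 2×2×K table in base R, pools it with the Mantel-Haenszel formula, and compares the result with stats::mantelhaen.test(). The simulated data and all names are assumptions made for illustration, and the code is independent of matchatr.

```r
set.seed(21)
n <- 6000
g <- sample(c("a", "b", "c"), n, replace = TRUE)
x <- rbinom(n, 1, 0.4)
y <- rbinom(n, 1, plogis(-1 + log(3) * x + 0.3 * as.integer(factor(g))))

# 2 x 2 x K array: rows are exposure (0/1), columns are outcome (0/1).
tab <- table(x = x, y = y, g = g)

# Mantel-Haenszel pooling: sum(a_k * d_k / n_k) / sum(b_k * c_k / n_k).
n_k <- apply(tab, 3, sum)
num <- tab[2, 2, ] * tab[1, 1, ] / n_k   # exposed cases * unexposed controls
den <- tab[2, 1, ] * tab[1, 2, ] / n_k   # exposed controls * unexposed cases
or_mh <- sum(num) / sum(den)

mh <- mantelhaen.test(tab)
print(c(by_hand = or_mh, mantelhaen = unname(mh$estimate)))
print(mh$conf.int)
```

With a single stratum the sums collapse to one term each and the n_k cancels, leaving the crude cross-product ratio.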

What the design does and does not identify

Only the odds ratio is identified from an unmatched case-control sample. Risk differences and risk ratios require the source-population prevalence q0 and a case-control-weighted estimator:

contrast(fit, type = "difference")
#> Error in `contrast()`:
#> ! The risk difference is not identified from an unmatched case-control sample without the source-population prevalence q0.
#> ℹ Report the conditional odds ratio with `type = "or"`.
#> ℹ For a marginal risk difference / ratio, supply `prevalence =` on the design and use a case-control-weighted estimator (e.g. `estimator = "ccw_gformula"`).
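
A small worked table makes the identification argument concrete. The counts below are invented for illustration: a hypothetical source population is sub-sampled by keeping every case and 25% of the non-cases, and the odds ratio and the naive risk difference are computed before and after.

```r
# Hypothetical source population (counts invented for illustration).
pop <- matrix(c(300, 1000,
                200, 2000),
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("exposed", "unexposed"),
                              status   = c("case", "noncase")))

# Case-control sample: keep every case, sample 25% of the non-cases.
cc <- pop
cc[, "noncase"] <- pop[, "noncase"] * 0.25

odds_ratio <- function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])
risk_diff  <- function(m) m[1, 1] / sum(m[1, ]) - m[2, 1] / sum(m[2, ])

print(c(or_pop = odds_ratio(pop), or_cc = odds_ratio(cc)))            # 3 and 3
print(round(c(rd_pop = risk_diff(pop), rd_cc = risk_diff(cc)), 3))    # 0.140 and 0.260
```

The sampling fraction cancels in the cross-product ratio but not in the row proportions, so the naive risk difference from the sample reflects the design rather than the population.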

Covered combinations

Legend. ✅ truth-pinned in tests · ⛔ rejected with an informative error.

| Exposure | Estimator | Estimand | Variance | Status |
|---|---|---|---|---|
| Binary | `logistic` | conditional OR | model / sandwich | ✅ |
| Continuous | `logistic` | per-unit OR | model / sandwich | ✅ |
| Categorical (k>2) | `logistic` | per-level OR | model / sandwich | ✅ |
| Ordinal score | `logistic` | per-step trend OR | model / sandwich | ✅ |
| Any (GAM-adjusted) | `logistic` + `model_fn` | conditional OR | model | ✅ |
| Binary | `mh` | Mantel-Haenszel OR | Robins-Breslow-Greenland | ✅ |
| Ordered factor | `logistic` | — | — | ⛔ use a numeric score |
| Categorical / continuous | `mh` | — | — | ⛔ use `logistic` |
| Any | `logistic` / `mh` | RD / RR | — | ⛔ need q0 |

See FEATURE_COVERAGE_MATRIX.md for the authoritative status of every combination.

References

Breslow, Norman E., and Nicholas E. Day. 1980. Statistical Methods in Cancer Research, Volume 1: The Analysis of Case-Control Studies. International Agency for Research on Cancer (IARC Scientific Publications No. 32).

Prentice, Ross L., and Ronald Pyke. 1979. “Logistic Disease Incidence Models and Case-Control Studies.” Biometrika 66 (3): 403–11. https://doi.org/10.1093/biomet/66.3.403.

Robins, James, Norman Breslow, and Sander Greenland. 1986. “Estimators of the Mantel-Haenszel Variance Consistent in Both Sparse Data and Large-Strata Limiting Models.” Biometrics 42 (2): 311–23. https://doi.org/10.2307/2531052.
