Unmatched case-control odds ratios

In an unmatched case-control study, cases and controls are sampled independently from the source population, with no individual or frequency matching. The classical analysis targets the conditional odds ratio, which is identified from the case-control sample even though the marginal disease frequency is fixed by the sampling design. matchatr provides two engines on unmatched_cc(): logistic regression (estimator = "logistic") and the Mantel-Haenszel stratified estimator (estimator = "mh").

Code
library(matchatr)

Logistic conditional OR

A binary exposure

We simulate a cohort with a known log odds ratio, then draw an unmatched case-control sample (all cases plus an equal number of randomly sampled controls). The conditional OR from the sample should recover the cohort OR of 2.5.

Code
set.seed(1)
n <- 8000
x <- rbinom(n, 1, 0.4)
age <- rnorm(n, 50, 10)
# True conditional log-OR for x is log(2.5).
lp <- -1 + log(2.5) * x + 0.03 * (age - 50)
case <- rbinom(n, 1, plogis(lp))
cohort <- data.frame(case = case, x = x, age = age)

# Unmatched case-control sample: every case + an equal random sample of controls.
cases <- cohort[cohort$case == 1, ]
controls <- cohort[cohort$case == 0, ]
cc <- rbind(cases, controls[sample(nrow(controls), nrow(cases)), ])

fit <- matcha(
  cc,
  outcome = "case", exposure = "x",
  design = unmatched_cc(),
  confounders = ~ age, estimator = "logistic"
)
contrast(fit, type = "or")
#> <matchatr_result>
#>  Estimator:  logistic  (engine: glm_logistic)
#>  Estimand:   conditional OR
#>  Contrast:   Odds ratio
#>  CI method:  model
#>  N:          5492
#> 
#> Contrasts:
#>    comparison estimate        se ci_lower ci_upper
#>        <char>    <num>     <num>    <num>    <num>
#> 1:          x 2.369472 0.1338476 2.121137 2.646882

The recovered OR is close to the true 2.5, and the true value lies inside the 95% interval, even though the sample is 50% cases by construction: the odds ratio is invariant to the case and control sampling fractions (Prentice & Pyke, 1979).
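The invariance can be checked with plain arithmetic on a 2×2 table. The sketch below uses hypothetical source-population counts (they are not taken from the simulation above) and base R only: scaling the case row and the control row by different sampling fractions leaves the cross-product ratio unchanged.

```r
# Hypothetical source-population counts (illustrative only).
pop <- matrix(
  c( 300,  200,   # cases:    exposed, unexposed
    2700, 6800),  # controls: exposed, unexposed
  nrow = 2, byrow = TRUE,
  dimnames = list(status = c("case", "control"),
                  exposure = c("exposed", "unexposed"))
)

# Cross-product odds ratio of a 2x2 table.
odds_ratio <- function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Keep every case but only 10% of controls. A length-2 vector recycles down
# the columns, so row 1 is scaled by f_case and row 2 by f_ctrl.
f_case <- 1
f_ctrl <- 0.1
cc_tab <- pop * c(f_case, f_ctrl)

or_pop <- odds_ratio(pop)
or_cc  <- odds_ratio(cc_tab)
```

The sampling fractions cancel in the cross-product, so or_pop and or_cc agree for any positive f_case and f_ctrl, provided sampling does not depend on exposure within cases or within controls.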

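The full coefficient table is available through tidy(). The intercept is not an interpretable baseline risk under case-control sampling, so treat it as a nuisance. With exponentiate = TRUE the estimates and interval bounds are odds ratios, while std.error in the output below is still on the log-odds scale, which is why it differs from the se reported by contrast():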

Code
tidy(fit, exponentiate = TRUE)
#>           term  estimate   std.error statistic      p.value  conf.low conf.high
#>         <char>     <num>       <num>     <num>        <num>     <num>     <num>
#> 1: (Intercept) 0.1413099 0.152205029 -12.85634 7.923957e-38 0.1048614 0.1904276
#> 2:           x 2.3694722 0.056488381  15.27159 1.182591e-52 2.1211365 2.6468822
#> 3:         age 1.0320131 0.002901316  10.86106 1.766951e-27 1.0261612 1.0378983
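To see why the intercept is a nuisance, the following base-R sketch fits the same saturated logistic model to a source population and to a case-control sample that keeps all cases and 10% of non-cases. The grouped counts are hypothetical and the fit uses plain glm() rather than matchatr. The slope is unchanged, while the intercept shifts by the log ratio of the sampling fractions.

```r
# Hypothetical grouped counts: one row per exposure level.
pop <- data.frame(x = c(0, 1), cases = c(200, 300), noncases = c(6800, 2700))

# Case-control sample: all cases (f_case = 1), 10% of non-cases (f_ctrl = 0.1).
# round() only removes floating-point fuzz; the products are whole numbers.
f_case <- 1
f_ctrl <- 0.1
cc_counts <- transform(pop, noncases = round(noncases * f_ctrl))

fit_pop <- glm(cbind(cases, noncases) ~ x, family = binomial, data = pop)
fit_cc  <- glm(cbind(cases, noncases) ~ x, family = binomial, data = cc_counts)

# The log odds ratio for x is the same in both fits ...
slope_diff <- unname(coef(fit_cc)["x"] - coef(fit_pop)["x"])

# ... but the intercept is offset by log(f_case / f_ctrl).
intercept_shift <- unname(coef(fit_cc)["(Intercept)"] - coef(fit_pop)["(Intercept)"])
expected_shift  <- log(f_case / f_ctrl)
```

Without knowledge of the sampling fractions, the offset cannot be removed, so the fitted intercept says nothing about baseline risk in the source population.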

A continuous exposure

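A continuous exposure yields a single per-unit OR. Here the full simulated cohort is analysed without subsampling, and the true per-unit OR is exp(0.8) ≈ 2.23: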

Code
set.seed(2)
n <- 6000
dose <- rnorm(n)
case <- rbinom(n, 1, plogis(-1 + 0.8 * dose))
d_cont <- data.frame(case = case, dose = dose)

fit_cont <- matcha(
  d_cont,
  outcome = "case", exposure = "dose",
  design = unmatched_cc(), estimator = "logistic"
)
contrast(fit_cont, type = "or")
#> <matchatr_result>
#>  Estimator:  logistic  (engine: glm_logistic)
#>  Estimand:   conditional OR
#>  Contrast:   Odds ratio
#>  CI method:  model
#>  N:          6000
#> 
#> Contrasts:
#>    comparison estimate         se ci_lower ci_upper
#>        <char>    <num>      <num>    <num>    <num>
#> 1:       dose 2.240782 0.07515095 2.098225 2.393024
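A per-unit OR is tied to the unit of the exposure. Under the log-linear dose model it can be rescaled to any increment k by raising it to the power k, and Wald interval bounds computed on the log scale rescale the same way. The arithmetic below reuses the estimate and bounds printed above; rescale_or() is a hypothetical convenience function written for this illustration, not part of matchatr.

```r
# Hypothetical helper: OR for a k-unit increase, given a per-unit OR.
rescale_or <- function(or, k) exp(k * log(or))

# Per-unit estimate and interval copied from the printed result above.
per_unit <- c(estimate = 2.240782, ci_lower = 2.098225, ci_upper = 2.393024)

# OR for a 2-unit increase and for a half-unit increase in dose.
or_2    <- rescale_or(per_unit, k = 2)
or_half <- rescale_or(per_unit, k = 0.5)
```

Rescaling changes the reported number but not the fitted model, so it is a presentation choice rather than a different analysis.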

A categorical exposure: the esoph dose-response

The built-in esoph data is the Ille-et-Vilaine case-control study of oesophageal cancer (Breslow & Day, 1980), stored as per-cell case / control counts. We expand it to one row per subject so it can enter the logistic engine.

Code
esoph_long <- do.call(rbind, lapply(seq_len(nrow(esoph)), function(i) {
  with(esoph[i, ], data.frame(
    case  = c(rep(1L, ncases), rep(0L, ncontrols)),
    agegp = agegp, alcgp = alcgp, tobgp = tobgp
  ))
}))
# Use alcohol as an unordered factor so we get one OR per consumption band.
esoph_long$alcgp <- factor(esoph_long$alcgp, ordered = FALSE)

fit_alc <- matcha(
  esoph_long,
  outcome = "case", exposure = "alcgp",
  design = unmatched_cc(),
  confounders = ~ agegp + tobgp, estimator = "logistic"
)
contrast(fit_alc, type = "or")
#> <matchatr_result>
#>  Estimator:  logistic  (engine: glm_logistic)
#>  Estimand:   conditional OR
#>  Contrast:   Odds ratio
#>  CI method:  model
#>  N:          975
#> 
#> Contrasts:
#>     comparison  estimate        se  ci_lower  ci_upper
#>         <char>     <num>     <num>     <num>     <num>
#> 1:  alcgp40-79  4.198086  1.049782  2.571569  6.853375
#> 2: alcgp80-119  7.247940  2.063936  4.147868 12.664971
#> 3:   alcgp120+ 36.703378 14.132151 17.256874 78.063846

A factor exposure reports one OR per non-reference level versus the baseline (here 0-39g/day), recovering the canonical monotone alcohol dose-response. The reference level used is recorded on the result:

Code
fit_alc_result <- contrast(fit_alc, type = "or")
fit_alc_result$reference
#> [1] "0-39g/day"
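As a cross-check that does not depend on matchatr, per-band ORs can also be estimated directly from the grouped counts with base R, because a grouped binomial fit and a one-row-per-subject fit give the same coefficient estimates. If the logistic engine fits an ordinary glm() on these terms (an assumption suggested by the printed engine name, not something verified here), the ORs should agree with the table above up to numerical tolerance.

```r
# Grouped binomial fit on the built-in counts; no expansion to long format.
d <- esoph
d$alcgp <- factor(d$alcgp, ordered = FALSE)  # treatment contrasts vs. first level

fit_grouped <- glm(
  cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
  family = binomial, data = d
)

# Per-band odds ratios with Wald intervals computed on the log scale.
est      <- coef(summary(fit_grouped))
alc_rows <- grep("^alcgp", rownames(est))
z        <- qnorm(0.975)

or_table <- data.frame(
  term     = rownames(est)[alc_rows],
  or       = exp(est[alc_rows, "Estimate"]),
  ci_lower = exp(est[alc_rows, "Estimate"] - z * est[alc_rows, "Std. Error"]),
  ci_upper = exp(est[alc_rows, "Estimate"] + z * est[alc_rows, "Std. Error"]),
  row.names = NULL
)
or_table
```

The age and tobacco factors stay ordered here, so glm() codes them with polynomial contrasts; that changes their own coefficients but not the fitted alcohol contrasts.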

An ordinal trend

Entering an ordered exposure as a numeric score gives a single per-step trend OR instead of per-level contrasts:

Code
esoph_long$alc_score <- as.integer(factor(
  esoph_long$alcgp,
  levels = c("0-39g/day", "40-79", "80-119", "120+")
))

fit_trend <- matcha(
  esoph_long,
  outcome = "case", exposure = "alc_score",
  design = unmatched_cc(),
  confounders = ~ agegp + tobgp, estimator = "logistic"
)
contrast(fit_trend, type = "or")
#> <matchatr_result>
#>  Estimator:  logistic  (engine: glm_logistic)
#>  Estimand:   conditional OR
#>  Contrast:   Odds ratio
#>  CI method:  model
#>  N:          975
#> 
#> Contrasts:
#>    comparison estimate        se ci_lower ci_upper
#>        <char>    <num>     <num>    <num>    <num>
#> 1:  alc_score 2.921659 0.3093806 2.374073 3.595547
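One way to judge whether a single per-step OR is an adequate summary is to compare the trend model with the per-level model in which it is nested. The base-R sketch below runs that likelihood-ratio comparison on the grouped esoph counts. It uses plain glm() and anova() as an illustration and is not a matchatr feature.

```r
d <- esoph
d$alcgp     <- factor(d$alcgp, ordered = FALSE)
d$alc_score <- as.integer(d$alcgp)  # 1..4, following the band order

# Trend model: one slope per step of the alcohol score.
fit_trend_glm <- glm(
  cbind(ncases, ncontrols) ~ agegp + tobgp + alc_score,
  family = binomial, data = d
)

# Per-level model: a separate contrast for each non-reference band.
fit_level_glm <- glm(
  cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
  family = binomial, data = d
)

# Likelihood-ratio test of the linear-in-score restriction (2 df).
lrt <- anova(fit_trend_glm, fit_level_glm, test = "Chisq")
lrt
```

A small p-value would suggest the per-band ORs depart from a constant per-step multiplier, in which case the per-level contrasts are the safer summary.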

Smooth confounder adjustment with model_fn

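The logistic fitter is pluggable. Pass model_fn = mgcv::gam and a smooth term in confounders to adjust for a confounder flexibly while keeping the exposure parametric (so it still has an interpretable OR). The smooth-basis coefficients are not reported as odds ratios. Because age enters the simulated model linearly, the smooth adjustment is expected to give almost the same OR as the linear adjustment above.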

Code
fit_gam <- matcha(
  cc,
  outcome = "case", exposure = "x",
  design = unmatched_cc(),
  confounders = ~ s(age), model_fn = mgcv::gam, estimator = "logistic"
)
contrast(fit_gam, type = "or")
#> <matchatr_result>
#>  Estimator:  logistic  (engine: glm_logistic)
#>  Estimand:   conditional OR
#>  Contrast:   Odds ratio
#>  CI method:  model
#>  N:          5492
#> 
#> Contrasts:
#>    comparison estimate        se ci_lower ci_upper
#>        <char>    <num>     <num>    <num>    <num>
#> 1:          x 2.369472 0.1338477 2.121136 2.646882
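For orientation, this is roughly what such a fit looks like when mgcv is called directly. The sketch re-creates a case-control sample with the same simulation code as above and separates the parametric table, where the exposure lives, from the smooth-term table. It is a standalone illustration; how matchatr wires model_fn internally is not shown here.

```r
library(mgcv)

# Re-create a simulated case-control sample like the one used above.
set.seed(1)
n <- 8000
x <- rbinom(n, 1, 0.4)
age <- rnorm(n, 50, 10)
case <- rbinom(n, 1, plogis(-1 + log(2.5) * x + 0.03 * (age - 50)))
cohort <- data.frame(case = case, x = x, age = age)
cases <- cohort[cohort$case == 1, ]
controls <- cohort[cohort$case == 0, ]
cc <- rbind(cases, controls[sample(nrow(controls), nrow(cases)), ])

# Exposure enters parametrically; age enters as a penalised smooth.
g <- gam(case ~ x + s(age), family = binomial, data = cc)
s <- summary(g)

# Parametric terms: exponentiating the x coefficient gives its conditional OR.
or_x <- exp(s$p.table["x", "Estimate"])

# Smooth terms are summarised by effective degrees of freedom, not by ORs.
smooth_edf <- s$s.table[, "edf"]
```

An effective degrees of freedom near 1 indicates that the penalty has shrunk the smooth towards a straight line, which is what the linear simulation would lead one to expect.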

Mantel-Haenszel stratified OR

When confounding is controlled by stratification rather than a model, the Mantel-Haenszel estimator pools the per-stratum 2×2 odds ratios in closed form. Declare the stratifying variable(s) on the design and select estimator = "mh".

Code
set.seed(3)
n <- 8000
agegrp <- sample(c("30s", "40s", "50s", "60s"), n, replace = TRUE)
x <- rbinom(n, 1, 0.4)
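# True conditional OR for x is 3, with an age-group main effect on risk. x is drawn
# independently of agegrp, so agegrp acts as a stratifier here rather than a true confounder.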
lp <- -1 + log(3) * x + 0.2 * as.integer(factor(agegrp))
case <- rbinom(n, 1, plogis(lp))
d_mh <- data.frame(case = case, x = x, agegrp = agegrp)

fit_mh <- matcha(
  d_mh,
  outcome = "case", exposure = "x",
  design = unmatched_cc(strata = "agegrp"), estimator = "mh"
)
contrast(fit_mh, type = "or")
#> <matchatr_result>
#>  Estimator:  mh  (engine: mantel_haenszel)
#>  Estimand:   Mantel-Haenszel OR
#>  Contrast:   Odds ratio
#>  CI method:  model
#>  N:          8000
#> 
#> Contrasts:
#>    comparison estimate        se ci_lower ci_upper
#>        <char>    <num>     <num>    <num>    <num>
#> 1:          x 3.001334 0.1440983 2.731788 3.297477

The reported interval is based on the Robins-Breslow-Greenland (1986) variance estimator, which is consistent in both the sparse-data and large-stratum limits. The MH estimator needs a 2×2 table per stratum, so the exposure must be binary; a categorical (k > 2) or continuous exposure is rejected with a classed error that points to estimator = "logistic". With no strata, "mh" reduces to the crude single-table OR.
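The closed form is short enough to write out. The sketch below computes the Mantel-Haenszel pooled OR and the Robins-Breslow-Greenland variance of its logarithm by hand on a hypothetical 2×2×3 table of counts, then compares the result with base R's mantelhaen.test(), which for 2×2×K tables reports the MH estimate with an interval that, to the best of my knowledge, uses the same variance. It is an illustration of the formulas, not matchatr's implementation.

```r
# Hypothetical 2 x 2 x 3 table of counts: exposure x status x stratum.
tab <- array(
  c(40, 20, 60,  80,   # stratum 1, filled column by column
    30, 25, 45, 100,   # stratum 2
    20, 30, 20,  90),  # stratum 3
  dim = c(2, 2, 3),
  dimnames = list(
    exposure = c("exposed", "unexposed"),
    status   = c("case", "control"),
    stratum  = c("s1", "s2", "s3")
  )
)

a  <- tab[1, 1, ]; b <- tab[1, 2, ]  # exposed:   cases, controls
c0 <- tab[2, 1, ]; d <- tab[2, 2, ]  # unexposed: cases, controls
n  <- a + b + c0 + d

# Mantel-Haenszel pooled odds ratio.
R <- a * d / n
S <- b * c0 / n
or_mh <- sum(R) / sum(S)

# Robins-Breslow-Greenland variance of log(OR_MH).
P <- (a + d) / n
Q <- (b + c0) / n
var_log <- sum(P * R) / (2 * sum(R)^2) +
  sum(P * S + Q * R) / (2 * sum(R) * sum(S)) +
  sum(Q * S) / (2 * sum(S)^2)
ci <- or_mh * exp(c(-1, 1) * qnorm(0.975) * sqrt(var_log))

# Cross-check against base R.
mh <- mantelhaen.test(tab)
```

Each stratum contributes to the numerator and denominator sums separately, so strata with a zero cell still contribute without any continuity correction.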

What the design does and does not identify

Only the odds ratio is identified from an unmatched case-control sample. Risk differences and risk ratios require the source-population prevalence q0 and a case-control-weighted estimator:

Code
contrast(fit, type = "difference")
#> Error in `contrast()`:
#> ! The risk difference is not identified from an unmatched case-control sample without the source-population prevalence q0.
#> ℹ Report the conditional odds ratio with `type = "or"`.
#> ℹ For a marginal risk difference / ratio, supply `prevalence =` on the design and use a case-control-weighted estimator (e.g. `estimator = "ccw_gformula"`).
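The role of q0 can be illustrated with a small weighting exercise in base R. The counts and the prevalence below are hypothetical, and the reweighting is a generic inverse-sampling-probability sketch, not the ccw_gformula estimator itself. It assumes sampling is independent of exposure within cases and within controls.

```r
# Hypothetical case-control counts and an externally known prevalence q0.
cc_tab <- matrix(
  c(300, 200,   # cases:    exposed, unexposed
    270, 680),  # controls: exposed, unexposed
  nrow = 2, byrow = TRUE,
  dimnames = list(status = c("case", "control"),
                  exposure = c("exposed", "unexposed"))
)
q0 <- 0.05  # assumed source-population prevalence of disease

# Naive "risks" from the sample reflect the sampling design, not the population.
naive_risk <- cc_tab["case", ] / colSums(cc_tab)

# Reweight so that cases make up a fraction q0 of the weighted sample.
n_case <- sum(cc_tab["case", ])
n_ctrl <- sum(cc_tab["control", ])
w <- c(case = q0 / n_case, control = (1 - q0) / n_ctrl)
weighted <- cc_tab * w  # recycles down columns: one weight per row

# Exposure-specific risks, risk difference, and risk ratio on the weighted table.
risk <- weighted["case", ] / colSums(weighted)
rd <- unname(risk["exposed"] - risk["unexposed"])
rr <- unname(risk["exposed"] / risk["unexposed"])
```

Changing q0 changes rd and rr but leaves the odds ratio of the weighted table untouched, which is why only the OR can be reported without external prevalence information.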

Covered combinations

Legend. ✅ truth-pinned in tests · ⛔ rejected with an informative error.

| Exposure | Estimator | Estimand | Variance | Status |
| --- | --- | --- | --- | --- |
| Binary | logistic | conditional OR | model / sandwich | ✅ |
| Continuous | logistic | per-unit OR | model / sandwich | ✅ |
| Categorical (k>2) | logistic | per-level OR | model / sandwich | ✅ |
| Ordinal score | logistic | per-step trend OR | model / sandwich | ✅ |
| Any (GAM-adjusted) | logistic + model_fn | conditional OR | model | ✅ |
| Binary | mh | Mantel-Haenszel OR | Robins-Breslow-Greenland | ✅ |
| Ordered factor | logistic | | | ⛔ use a numeric score |
| Categorical / continuous | mh | | | ⛔ use logistic |
| Any | logistic / mh | RD / RR | | ⛔ need q0 |

See FEATURE_COVERAGE_MATRIX.md for the authoritative status of every combination.

References

Breslow NE, Day NE (1980). Statistical Methods in Cancer Research, Volume 1: The Analysis of Case-Control Studies. IARC.

Prentice RL, Pyke R (1979). Logistic disease incidence models and case-control studies. Biometrika 66(3):403-411.

Robins J, Breslow N, Greenland S (1986). Estimators of the Mantel-Haenszel variance consistent in both sparse data and large-strata limiting models. Biometrics 42(2):311-323.