Code
library(matchatr)Some case-control studies have more than two outcome groups: several disease subtypes (does the exposure act differently on each?), or several control groups. The design-faithful analysis is baseline-category multinomial logistic regression, fit by nnet::multinom. Each non-reference equation’s exposure coefficient is that subtype’s log odds ratio versus the reference, so contrast(type = "or") reports one OR per (subtype, exposure coefficient).
We simulate an outcome with a control group and two disease subtypes, where the binary exposure raises the odds of subtype A by a factor of 2 and subtype B by a factor of 4 (versus control).
set.seed(5)
n <- 6000
x <- rbinom(n, 1, 0.4)
# Baseline-category logits relative to "control": OR_A = 2, OR_B = 4.
eta <- cbind(
control = 0,
A = -1.0 + log(2) * x,
B = -1.5 + log(4) * x
)
prob <- exp(eta) / rowSums(exp(eta))
subtype <- apply(prob, 1, function(p) sample(c("control", "A", "B"), 1, prob = p))
d_poly <- data.frame(subtype = subtype, x = x)
table(d_poly$subtype)
#>
#> A B control
#> 1507 1289 3204Select estimator = "polytomous" and name the baseline group with reference. The outcome must be a factor or character column with three or more groups (a two-group, numeric, or logical outcome is routed back to the binary estimators).
fit <- matcha(
d_poly,
outcome = "subtype", exposure = "x",
design = unmatched_cc(),
estimator = "polytomous", reference = "control"
)
fit
#> <matchatr_fit>
#> Design: Unmatched case-control
#> Estimator: polytomous (engine: multinom)
#> Outcome: subtype
#> Exposure: x
#> Confounders: none
#> N: 6000 (groups: control*: 3204, A: 1507, B: 1289)
#> * reference groupThe fit prints the per-group counts and flags the reference group. The contrast reports each subtype’s exposure OR versus the reference, recovering the true 2 and 4:
contrast(fit, type = "or")
#> <matchatr_result>
#> Estimator: polytomous (engine: multinom)
#> Estimand: subtype OR
#> Contrast: Odds ratio
#> CI method: model
#> N: 6000
#>
#> Contrasts:
#> comparison estimate se ci_lower ci_upper
#> <char> <num> <num> <num> <num>
#> 1: A: x 2.058045 0.1328040 1.813541 2.335512
#> 2: B: x 3.881635 0.2686833 3.389185 4.445638tidy() renders the whole multinomial model, one equation per non-reference group, with a leading y.level column naming the group each row contrasts against the reference:
tidy(fit, exponentiate = TRUE)
#> y.level term estimate std.error statistic p.value conf.low
#> <char> <char> <num> <num> <num> <num> <num>
#> 1: A (Intercept) 0.3580303 0.04102209 -25.03865 2.320950e-138 0.3303709
#> 2: A x 2.0580449 0.06452919 11.18496 4.831211e-29 1.8135415
#> 3: B (Intercept) 0.2169466 0.04988641 -30.63167 4.636816e-206 0.1967385
#> 4: B x 3.8816352 0.06921911 19.59367 1.750921e-85 3.3891853
#> conf.high
#> <num>
#> 1: 0.3880055
#> 2: 2.3355124
#> 3: 0.2392303
#> 4: 4.4456383The interval is the multinomial information matrix; the Huber-White sandwich is not defined for this fit, so robust = TRUE / ci_method = "sandwich" are declined.
The reference group is releveled to the front before fitting, so it can be any observed group. Confounders enter the same way as in the binary engines, holding the exposure ORs adjusted:
set.seed(6)
d_poly$age <- rnorm(n, 50, 10)
fit_adj <- matcha(
d_poly,
outcome = "subtype", exposure = "x",
design = unmatched_cc(),
confounders = ~ age,
estimator = "polytomous", reference = "control"
)
contrast(fit_adj, type = "or")
#> <matchatr_result>
#> Estimator: polytomous (engine: multinom)
#> Estimand: subtype OR
#> Contrast: Odds ratio
#> CI method: model
#> N: 6000
#>
#> Contrasts:
#> comparison estimate se ci_lower ci_upper
#> <char> <num> <num> <num> <num>
#> 1: A: x 2.058611 0.1328494 1.814025 2.336174
#> 2: B: x 3.882546 0.2687620 3.389954 4.446717The central question of a subtype analysis is etiologic heterogeneity: is the exposure odds ratio the same across the disease subtypes, or does the exposure act through different mechanisms on different subtypes? test_homogeneity() answers it. For each exposure term it runs a Wald test of the null that the exposure log odds ratio is equal across the non-reference subtypes (H0: β1 = … = βM) and reports the efficient pooled (“common”) odds ratio that holds under homogeneity.
test_homogeneity(fit)
#> <matchatr_homogeneity>
#> Estimator: polytomous (engine: multinom)
#> Test: Homogeneity of subtype odds ratios (Wald)
#> Reference: control
#> N: 6000
#>
#> Common (pooled) odds ratio per exposure term and homogeneity test:
#> term common_or se ci_lower ci_upper statistic df p.value
#> <char> <num> <num> <num> <num> <num> <int> <num>
#> 1: x 2.733635 0.1489887 2.456679 3.041815 67.51785 1 2.087887e-16
#>
#> Per-subtype odds ratios (pooled):
#> comparison or ci_lower ci_upper
#> <char> <num> <num> <num>
#> 1: A: x 2.058045 1.813541 2.335512
#> 2: B: x 3.881635 3.389185 4.445638
#>
#> A small p-value is evidence the exposure odds ratio differs across subtypes.Here the exposure doubles the odds of subtype A but quadruples them for subtype B, so the test returns a small p-value — evidence the effect is not homogeneous. The pooled odds ratio is the minimum-variance (inverse-variance / GLS) combination of the subtype odds ratios — the restricted estimator under homogeneity, asymptotically equal to the constrained maximum-likelihood fit — so it is more efficient than any single subtype odds ratio. It is the canonical etiologic- heterogeneity Wald test (Begg & Gray, 1984), computed on the already-fitted model, so it needs no refit and handles continuous confounders directly:
test_homogeneity(fit_adj)
#> <matchatr_homogeneity>
#> Estimator: polytomous (engine: multinom)
#> Test: Homogeneity of subtype odds ratios (Wald)
#> Reference: control
#> N: 6000
#>
#> Common (pooled) odds ratio per exposure term and homogeneity test:
#> term common_or se ci_lower ci_upper statistic df p.value
#> <char> <num> <num> <num> <num> <num> <int> <num>
#> 1: x 2.734342 0.1490397 2.457292 3.042628 67.50644 1 2.099995e-16
#>
#> Per-subtype odds ratios (pooled):
#> comparison or ci_lower ci_upper
#> <char> <num> <num> <num>
#> 1: A: x 2.058611 1.814025 2.336174
#> 2: B: x 3.882546 3.389954 4.446717
#>
#> A small p-value is evidence the exposure odds ratio differs across subtypes.tidy() returns the per-term table (the common odds ratio, its interval, and the homogeneity χ², degrees of freedom, and p-value), one row per exposure term:
An unordered factor exposure is tested one level at a time (one “risk factor” per level). A small p-value points to distinct etiologic mechanisms; a large one licenses reporting the single pooled odds ratio.
Unlike stats::glm, nnet::multinom does not alias a rank-deficient exposure column to NA — it splits the coefficient across the outcome equations, which would silently attenuate the OR. matchatr guards against this by checking the design-matrix rank up front: an exposure that is constant, or collinear with a confounder, is rejected with a classed matchatr_unestimable_exposure error rather than returning a misleading number. Collinearity confined to the confounders leaves the exposure estimable and is not rejected.
Legend. ✅ truth-pinned in tests · ⛔ rejected with an informative error.
| Outcome | Estimator | Estimand | Status |
|---|---|---|---|
| ≥3-group factor / character | polytomous |
per-subtype OR vs reference | ✅ |
| ≥3-group (confounder-adjusted) | polytomous |
adjusted per-subtype OR | ✅ |
| ≥3-group | polytomous + test_homogeneity() |
homogeneity Wald test + pooled common OR | ✅ |
| 2-group / numeric / logical | polytomous |
— | ⛔ use logistic / clogit |
| matched design | polytomous |
— | ⛔ unsupported |
| any | polytomous |
RD / RR | ⛔ need q0 |
| any | polytomous |
sandwich / bootstrap variance | ⛔ information matrix only |
See FEATURE_COVERAGE_MATRIX.md for the authoritative status of every combination.
Begg CB, Gray R (1984). Calculation of polytomous logistic regression parameters using individualized regressions. Biometrika 71(1), 11-18.
Borgan Ø, Breslow N, Chatterjee N, Gail MH, Scott A, Wild CJ (2018). Handbook of Statistical Methods for Case-Control Studies, Chapter 5. Chapman & Hall/CRC.
Venables WN, Ripley BD (2002). Modern Applied Statistics with S, 4th ed. Springer. (The nnet::multinom fitter.)
---
title: "Multiple case / control groups"
code-fold: show
code-tools: true
vignette: >
%\VignetteIndexEntry{Multiple case / control groups}
%\VignetteEngine{quarto::html}
%\VignetteEncoding{UTF-8}
---
```{r}
#| include: false
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
Some case-control studies have more than two outcome groups: several *disease
subtypes* (does the exposure act differently on each?), or several *control
groups*. The design-faithful analysis is **baseline-category multinomial
logistic regression**, fit by `nnet::multinom`. Each non-reference equation's
exposure coefficient is that subtype's log odds ratio versus the reference, so
`contrast(type = "or")` reports one OR per (subtype, exposure coefficient).
```{r}
#| message: false
library(matchatr)
```
## A three-group outcome
We simulate an outcome with a control group and two disease subtypes, where the
binary exposure raises the odds of subtype A by a factor of 2 and subtype B by a
factor of 4 (versus control).
```{r}
set.seed(5)
n <- 6000
x <- rbinom(n, 1, 0.4)
# Baseline-category logits relative to "control": OR_A = 2, OR_B = 4.
eta <- cbind(
control = 0,
A = -1.0 + log(2) * x,
B = -1.5 + log(4) * x
)
prob <- exp(eta) / rowSums(exp(eta))
subtype <- apply(prob, 1, function(p) sample(c("control", "A", "B"), 1, prob = p))
d_poly <- data.frame(subtype = subtype, x = x)
table(d_poly$subtype)
```
Select `estimator = "polytomous"` and name the baseline group with `reference`.
The outcome must be a factor or character column with three or more groups (a
two-group, numeric, or logical outcome is routed back to the binary estimators).
```{r}
fit <- matcha(
d_poly,
outcome = "subtype", exposure = "x",
design = unmatched_cc(),
estimator = "polytomous", reference = "control"
)
fit
```
The fit prints the per-group counts and flags the reference group. The contrast
reports each subtype's exposure OR versus the reference, recovering the true 2
and 4:
```{r}
contrast(fit, type = "or")
```
## The full per-equation table
`tidy()` renders the whole multinomial model, one equation per non-reference
group, with a leading `y.level` column naming the group each row contrasts
against the reference:
```{r}
tidy(fit, exponentiate = TRUE)
```
The interval is the multinomial information matrix; the Huber-White sandwich is
not defined for this fit, so `robust = TRUE` / `ci_method = "sandwich"` are
declined.
## Choosing the reference and adjusting for confounders
The `reference` group is releveled to the front before fitting, so it can be any
observed group. Confounders enter the same way as in the binary engines, holding
the exposure ORs adjusted:
```{r}
set.seed(6)
d_poly$age <- rnorm(n, 50, 10)
fit_adj <- matcha(
d_poly,
outcome = "subtype", exposure = "x",
design = unmatched_cc(),
confounders = ~ age,
estimator = "polytomous", reference = "control"
)
contrast(fit_adj, type = "or")
```
## Does the exposure act the same way on every subtype?
The central question of a subtype analysis is *etiologic heterogeneity*: is the
exposure odds ratio the same across the disease subtypes, or does the exposure
act through different mechanisms on different subtypes? `test_homogeneity()`
answers it. For each exposure term it runs a Wald test of the null that the
exposure log odds ratio is equal across the non-reference subtypes
(H~0~: β~1~ = … = β~M~) and reports the efficient pooled ("common") odds ratio
that holds under homogeneity.
```{r}
test_homogeneity(fit)
```
Here the exposure doubles the odds of subtype A but quadruples them for subtype
B, so the test returns a small p-value — evidence the effect is *not* homogeneous.
The pooled odds ratio is the minimum-variance (inverse-variance / GLS) combination
of the subtype odds ratios — the restricted estimator under homogeneity,
asymptotically equal to the constrained maximum-likelihood fit — so it is more
efficient than any single subtype odds ratio. It is the canonical etiologic-
heterogeneity Wald test (Begg & Gray, 1984), computed on the already-fitted
model, so it needs no refit and handles continuous confounders directly:
```{r}
test_homogeneity(fit_adj)
```
`tidy()` returns the per-term table (the common odds ratio, its interval, and the
homogeneity χ², degrees of freedom, and p-value), one row per exposure term:
```{r}
tidy(test_homogeneity(fit_adj))
```
An unordered factor exposure is tested one level at a time (one "risk factor"
per level). A small p-value points to distinct etiologic mechanisms; a large one
licenses reporting the single pooled odds ratio.
## A note on collinearity
Unlike `stats::glm`, `nnet::multinom` does not alias a rank-deficient exposure
column to `NA` — it splits the coefficient across the outcome equations, which
would silently attenuate the OR. matchatr guards against this by checking the
design-matrix rank up front: an exposure that is constant, or collinear with a
confounder, is rejected with a classed `matchatr_unestimable_exposure` error
rather than returning a misleading number. Collinearity confined to the
confounders leaves the exposure estimable and is not rejected.
## Covered combinations
**Legend.** ✅ truth-pinned in tests · ⛔ rejected with an informative error.
| Outcome | Estimator | Estimand | Status |
|---|---|---|---|
| ≥3-group factor / character | `polytomous` | per-subtype OR vs reference | ✅ |
| ≥3-group (confounder-adjusted) | `polytomous` | adjusted per-subtype OR | ✅ |
| ≥3-group | `polytomous` + `test_homogeneity()` | homogeneity Wald test + pooled common OR | ✅ |
| 2-group / numeric / logical | `polytomous` | — | ⛔ use `logistic` / `clogit` |
| matched design | `polytomous` | — | ⛔ unsupported |
| any | `polytomous` | RD / RR | ⛔ need q0 |
| any | `polytomous` | sandwich / bootstrap variance | ⛔ information matrix only |
See `FEATURE_COVERAGE_MATRIX.md` for the authoritative status of every
combination.
## References
Begg CB, Gray R (1984). Calculation of polytomous logistic regression parameters
using individualized regressions. *Biometrika* 71(1), 11-18.
Borgan Ø, Breslow N, Chatterjee N, Gail MH, Scott A, Wild CJ (2018). *Handbook
of Statistical Methods for Case-Control Studies*, Chapter 5. Chapman & Hall/CRC.
Venables WN, Ripley BD (2002). *Modern Applied Statistics with S*, 4th ed.
Springer. (The `nnet::multinom` fitter.)