Multiple case / control groups

Some case-control studies have more than two outcome groups: several disease subtypes (does the exposure act differently on each?), or several control groups. The design-faithful analysis is baseline-category multinomial logistic regression, fit by nnet::multinom. Each non-reference equation’s exposure coefficient is that subtype’s log odds ratio versus the reference, so contrast(type = "or") reports one OR per (subtype, exposure coefficient).

Code

library(matchatr)

A three-group outcome

We simulate an outcome with a control group and two disease subtypes, where the binary exposure raises the odds of subtype A by a factor of 2 and subtype B by a factor of 4 (versus control).

Code

set.seed(5)
n <- 6000
x <- rbinom(n, 1, 0.4)
# Baseline-category logits relative to "control": OR_A = 2, OR_B = 4.
eta <- cbind(
  control = 0,
  A       = -1.0 + log(2) * x,
  B       = -1.5 + log(4) * x
)
prob <- exp(eta) / rowSums(exp(eta))
subtype <- apply(prob, 1, function(p) sample(c("control", "A", "B"), 1, prob = p))
d_poly <- data.frame(subtype = subtype, x = x)
table(d_poly$subtype)
#> 
#>       A       B control 
#>    1507    1289    3204

Select estimator = "polytomous" and name the baseline group with reference. The outcome must be a factor or character column with three or more groups (a two-group, numeric, or logical outcome is routed back to the binary estimators).

Code

fit <- matcha(
  d_poly,
  outcome = "subtype", exposure = "x",
  design = unmatched_cc(),
  estimator = "polytomous", reference = "control"
)
fit
#> <matchatr_fit>
#>  Design:     Unmatched case-control
#>  Estimator:  polytomous  (engine: multinom)
#>  Outcome:    subtype
#>  Exposure:   x
#>  Confounders: none
#>  N:          6000  (groups: control*: 3204, A: 1507, B: 1289)
#>              * reference group

The fit prints the per-group counts and flags the reference group. The contrast reports each subtype’s exposure OR versus the reference, recovering the true 2 and 4:

Code

contrast(fit, type = "or")
#> <matchatr_result>
#>  Estimator:  polytomous  (engine: multinom)
#>  Estimand:   subtype OR
#>  Contrast:   Odds ratio
#>  CI method:  model
#>  N:          6000
#> 
#> Contrasts:
#>    comparison estimate        se ci_lower ci_upper
#>        <char>    <num>     <num>    <num>    <num>
#> 1:       A: x 2.058045 0.1328040 1.813541 2.335512
#> 2:       B: x 3.881635 0.2686833 3.389185 4.445638

The full per-equation table

tidy() renders the whole multinomial model, one equation per non-reference group, with a leading y.level column naming the group each row contrasts against the reference:

Code

tidy(fit, exponentiate = TRUE)
#>    y.level        term  estimate  std.error statistic       p.value  conf.low
#>     <char>      <char>     <num>      <num>     <num>         <num>     <num>
#> 1:       A (Intercept) 0.3580303 0.04102209 -25.03865 2.320950e-138 0.3303709
#> 2:       A           x 2.0580449 0.06452919  11.18496  4.831211e-29 1.8135415
#> 3:       B (Intercept) 0.2169466 0.04988641 -30.63167 4.636816e-206 0.1967385
#> 4:       B           x 3.8816352 0.06921911  19.59367  1.750921e-85 3.3891853
#>    conf.high
#>        <num>
#> 1: 0.3880055
#> 2: 2.3355124
#> 3: 0.2392303
#> 4: 4.4456383

The interval is the multinomial information matrix; the Huber-White sandwich is not defined for this fit, so robust = TRUE / ci_method = "sandwich" are declined.

Choosing the reference and adjusting for confounders

The reference group is releveled to the front before fitting, so it can be any observed group. Confounders enter the same way as in the binary engines, holding the exposure ORs adjusted:

Code

set.seed(6)
d_poly$age <- rnorm(n, 50, 10)

fit_adj <- matcha(
  d_poly,
  outcome = "subtype", exposure = "x",
  design = unmatched_cc(),
  confounders = ~ age,
  estimator = "polytomous", reference = "control"
)
contrast(fit_adj, type = "or")
#> <matchatr_result>
#>  Estimator:  polytomous  (engine: multinom)
#>  Estimand:   subtype OR
#>  Contrast:   Odds ratio
#>  CI method:  model
#>  N:          6000
#> 
#> Contrasts:
#>    comparison estimate        se ci_lower ci_upper
#>        <char>    <num>     <num>    <num>    <num>
#> 1:       A: x 2.058611 0.1328494 1.814025 2.336174
#> 2:       B: x 3.882546 0.2687620 3.389954 4.446717

Does the exposure act the same way on every subtype?

The central question of a subtype analysis is etiologic heterogeneity: is the exposure odds ratio the same across the disease subtypes, or does the exposure act through different mechanisms on different subtypes? test_homogeneity() answers it. For each exposure term it runs a Wald test of the null that the exposure log odds ratio is equal across the non-reference subtypes (H₀: β₁ = … = β_M) and reports the efficient pooled (“common”) odds ratio that holds under homogeneity.

Code

test_homogeneity(fit)
#> <matchatr_homogeneity>
#>  Estimator:  polytomous  (engine: multinom)
#>  Test:       Homogeneity of subtype odds ratios (Wald)
#>  Reference:  control
#>  N:          6000
#> 
#> Common (pooled) odds ratio per exposure term and homogeneity test:
#>      term common_or        se ci_lower ci_upper statistic    df      p.value
#>    <char>     <num>     <num>    <num>    <num>     <num> <int>        <num>
#> 1:      x  2.733635 0.1489887 2.456679 3.041815  67.51785     1 2.087887e-16
#> 
#> Per-subtype odds ratios (pooled):
#>    comparison       or ci_lower ci_upper
#>        <char>    <num>    <num>    <num>
#> 1:       A: x 2.058045 1.813541 2.335512
#> 2:       B: x 3.881635 3.389185 4.445638
#> 
#> A small p-value is evidence the exposure odds ratio differs across subtypes.

Here the exposure doubles the odds of subtype A but quadruples them for subtype B, so the test returns a small p-value — evidence the effect is not homogeneous. The pooled odds ratio is the minimum-variance (inverse-variance / GLS) combination of the subtype odds ratios — the restricted estimator under homogeneity, asymptotically equal to the constrained maximum-likelihood fit — so it is more efficient than any single subtype odds ratio. It is the canonical etiologic- heterogeneity Wald test (Begg and Gray 1984), computed on the already-fitted model, so it needs no refit and handles continuous confounders directly:

Code

test_homogeneity(fit_adj)
#> <matchatr_homogeneity>
#>  Estimator:  polytomous  (engine: multinom)
#>  Test:       Homogeneity of subtype odds ratios (Wald)
#>  Reference:  control
#>  N:          6000
#> 
#> Common (pooled) odds ratio per exposure term and homogeneity test:
#>      term common_or        se ci_lower ci_upper statistic    df      p.value
#>    <char>     <num>     <num>    <num>    <num>     <num> <int>        <num>
#> 1:      x  2.734342 0.1490397 2.457292 3.042628  67.50644     1 2.099995e-16
#> 
#> Per-subtype odds ratios (pooled):
#>    comparison       or ci_lower ci_upper
#>        <char>    <num>    <num>    <num>
#> 1:       A: x 2.058611 1.814025 2.336174
#> 2:       B: x 3.882546 3.389954 4.446717
#> 
#> A small p-value is evidence the exposure odds ratio differs across subtypes.

tidy() returns the per-term table (the common odds ratio, its interval, and the homogeneity χ², degrees of freedom, and p-value), one row per exposure term:

Code

tidy(test_homogeneity(fit_adj))
#>      term estimate std.error conf.low conf.high statistic    df      p.value
#>    <char>    <num>     <num>    <num>     <num>     <num> <int>        <num>
#> 1:      x 2.734342 0.1490397 2.457292  3.042628  67.50644     1 2.099995e-16

An unordered factor exposure is tested one level at a time (one “risk factor” per level). A small p-value points to distinct etiologic mechanisms; a large one licenses reporting the single pooled odds ratio.

A note on collinearity

Unlike stats::glm, nnet::multinom does not alias a rank-deficient exposure column to NA — it splits the coefficient across the outcome equations, which would silently attenuate the OR. matchatr guards against this by checking the design-matrix rank up front: an exposure that is constant, or collinear with a confounder, is rejected with a classed matchatr_unestimable_exposure error rather than returning a misleading number. Collinearity confined to the confounders leaves the exposure estimable and is not rejected.

Covered combinations

Legend. ✅ truth-pinned in tests · ⛔ rejected with an informative error.

Outcome	Estimator	Estimand	Status
≥3-group factor / character	`polytomous`	per-subtype OR vs reference	✅
≥3-group (confounder-adjusted)	`polytomous`	adjusted per-subtype OR	✅
≥3-group	`polytomous` + `test_homogeneity()`	homogeneity Wald test + pooled common OR	✅
2-group / numeric / logical	`polytomous`	—	⛔ use `logistic` / `clogit`
matched design	`polytomous`	—	⛔ unsupported
any	`polytomous`	RD / RR	⛔ need q0
any	`polytomous`	sandwich / bootstrap variance	⛔ information matrix only

See FEATURE_COVERAGE_MATRIX.md for the authoritative status of every combination.

References

Begg, Colin B., and Robert Gray. 1984. “Calculation of Polychotomous Logistic Regression Parameters Using Individualized Regressions.” Biometrika 71 (1): 11–18. https://doi.org/10.1093/biomet/71.1.11.

Borgan, Ørnulf, Norman Breslow, Nilanjan Chatterjee, Mitchell H. Gail, Alastair Scott, and Chris J. Wild, eds. 2018. Handbook of Statistical Methods for Case-Control Studies. Chapman & Hall/CRC. https://doi.org/10.1201/9781315154084.

Venables, William N., and Brian D. Ripley. 2002. Modern Applied Statistics with s. Fourth. Springer. https://doi.org/10.1007/978-0-387-21706-2.