Introduction to matchatr

matchatr is the etverse package for causal inference from case-control-type study designs: (matched) case-control, nested case-control, and case-cohort samples. Its scope is twofold: design-faithful classical estimators (conditional logistic regression, Mantel-Haenszel, McNemar, weighted Cox, case-cohort pseudo-likelihood), and marginal causal effects via case-control weighting, with the heavy lifting delegated to its siblings causatr (g-computation / IPW / AIPW) and survatr (causal survival). Not every estimator is implemented yet; the "What works today" section lists the current status.

This vignette introduces the design taxonomy and the two-step API. The estimator-specific articles cover the worked examples:

Unmatched case-control — logistic and Mantel-Haenszel odds ratios.
Matched case-control — conditional logistic, McNemar, and stratum-specific effect modification.
Multiple groups — polytomous (multinomial) subtype odds ratios.

Two orthogonal axes: design and estimator

Every matchatr analysis is specified along two independent axes:

A design object encodes the sampling structure — how the sample was drawn from the source population. It carries the strata / matched-set ids, the time scale (for risk-set designs), the source-population prevalence q0, and the intended weighting scheme.
An estimator chooses the analysis — conditional vs marginal, odds ratio vs hazard ratio vs risk difference.

The two are deliberately separate: the same matched case-control sample can be analysed by conditional logistic regression (a conditional odds ratio) or, with a prevalence q0, by a case-control-weighted g-formula (a marginal risk difference). You change the estimand without re-describing the design.

library(matchatr)

# Three of the six design constructors (one per sampling structure):
unmatched_cc()                              # independent case / control sampling
#> <matchatr_design>
#>  Type:       Unmatched case-control
#>  Weights:    none
matched_cc(strata = "set", ratio = 2)       # individually / frequency matched
#> <matchatr_design>
#>  Type:       Matched case-control
#>  Strata:     set
#>  Ratio:      2:1
#>  Weights:    none
nested_cc(strata = "set", time = "t")       # risk-set sampling from a cohort
#> <matchatr_design>
#>  Type:       Nested case-control
#>  Strata:     set
#>  Time:       t
#>  Weights:    inclusion-probability

Each returns a matchatr_design object that prints its structure:

matched_cc(strata = c("age_grp", "sex"), ratio = 2)
#> <matchatr_design>
#>  Type:       Matched case-control
#>  Strata:     age_grp, sex
#>  Ratio:      2:1
#>  Weights:    none

The full set of constructors is unmatched_cc(), matched_cc(), nested_cc(), case_cohort(), two_phase(), and counter_matched().

The two-step API

matchatr mirrors the etverse verb convention (causatr::causat(), survatr::surv_fit()):

# Step 1 — fit: resolve the (design, estimator) pair and run the engine.
fit <- matcha(data, outcome = "case", exposure = "x",
              design = unmatched_cc(), confounders = ~ age + smoke,
              estimator = "logistic")

# Step 2 — contrast: report the effect on the requested scale.
contrast(fit, type = "or")

matcha() returns a matchatr_fit; contrast() returns a matchatr_result. When you omit estimator, the design’s canonical default is used; when you omit type, contrast() reports the estimand the design identifies.

A worked example

We use R’s built-in infert data, a matched case-control study of prior spontaneous / induced abortion and secondary infertility. Controls were matched to cases on age, parity, and education; the stratum column identifies the matched sets.

fit <- matcha(
  infert,
  outcome = "case", exposure = "spontaneous",
  design = matched_cc(strata = "stratum"),
  confounders = ~ induced, estimator = "clogit"
)
fit
#> <matchatr_fit>
#>  Design:     Matched case-control
#>  Estimator:  clogit  (engine: clogit)
#>  Outcome:    case
#>  Exposure:   spontaneous
#>  Confounders: ~induced
#>  N:          248  (cases: 83, controls: 165)

The fit echoes the resolved engine and the case / control counts. The second step reports the exposure’s conditional odds ratio:

contrast(fit, type = "or")
#> <matchatr_result>
#>  Estimator:  clogit  (engine: clogit)
#>  Estimand:   conditional OR
#>  Contrast:   Odds ratio
#>  CI method:  model
#>  N:          248
#> 
#> Contrasts:
#>     comparison estimate     se ci_lower ci_upper
#>         <char>    <num>  <num>    <num>    <num>
#> 1: spontaneous 7.285423 2.5677 3.651357 14.53635

Each additional prior spontaneous abortion multiplies the conditional odds of secondary infertility by about 7.3 (95% CI 3.7 to 14.5), adjusting for the number of induced abortions and conditioning on the matched sets.
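
As a cross-check, the same conditional logistic model can be fitted directly with survival::clogit(), which ships with R as a recommended package. That matchatr's "clogit" engine wraps this function is an assumption based on the engine name printed above, not something this vignette states. A minimal sketch:

```r
library(survival)  # recommended package; provides clogit() and strata()

# Conditional logistic regression on the matched sets in infert.
# Assumption: this mirrors what matcha(..., estimator = "clogit") fits.
ref <- clogit(case ~ spontaneous + induced + strata(stratum), data = infert)

# Conditional odds ratios with Wald 95% confidence intervals.
or_table <- cbind(OR = exp(coef(ref)), exp(confint(ref)))
round(or_table, 3)
```

If the assumption holds, the OR column should agree with the contrast() estimate for spontaneous shown above.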

Inspecting a fit

A matchatr_fit works with the broom-style generics. tidy() returns the coefficient table (log-odds scale, or the odds-ratio scale with exponentiate = TRUE); summary() prints the model summary.

tidy(fit, exponentiate = TRUE)
#>           term estimate std.error statistic      p.value conf.low conf.high
#>         <char>    <num>     <num>     <num>        <num>    <num>     <num>
#> 1: spontaneous 7.285423 0.3524435  5.634592 1.754734e-08 3.651357 14.536346
#> 2:     induced 4.091909 0.3607124  3.906191 9.376245e-05 2.017841  8.297838

contrast() results also tidy to a one-row-per-contrast table:

tidy(contrast(fit, type = "or"))
#>           term estimate std.error   type conf.low conf.high
#>         <char>    <num>     <num> <char>    <num>     <num>
#> 1: spontaneous 7.285423    2.5677     or 3.651357  14.53635
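
The two tables report different standard errors for the same estimate (0.352 from tidy(fit), 2.568 from the contrast). The numbers are consistent with the first being on the log-odds scale and the second being a delta-method standard error on the odds-ratio scale, with the confidence interval built on the log scale. This reading is inferred from the printed values, not from documented matchatr behaviour; the arithmetic below uses only base R and the numbers shown above:

```r
# Values copied from the tidy(fit, exponentiate = TRUE) output above.
or     <- 7.285423    # exponentiated coefficient for spontaneous
se_log <- 0.3524435   # standard error on the log-odds scale

# Delta method: se(exp(b)) is approximately exp(b) * se(b).
se_or <- or * se_log

# Wald interval computed on the log scale, then exponentiated.
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log)

# Wald z statistic and two-sided p-value, also on the log scale.
z <- log(or) / se_log
p <- 2 * pnorm(-abs(z))

round(se_or, 4)   # compare with the se column of contrast()
round(ci, 3)      # compare with ci_lower / ci_upper
```

Because the interval is exponentiated from the log scale, it is asymmetric around the odds ratio and cannot be recovered as estimate plus or minus 1.96 times the odds-ratio-scale standard error.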

What is identified depends on the design

From a case-control sample the marginal outcome frequency is fixed by the sampling, so only the odds ratio is identified without extra information. Asking for a risk difference or risk ratio aborts with an informative, classed error pointing to the prevalence q0 you would need:

contrast(fit, type = "difference")
#> Error in `contrast()`:
#> ! The risk difference is not identified from an unmatched case-control sample without the source-population prevalence q0.
#> ℹ Report the conditional odds ratio with `type = "or"`.
#> ℹ For a marginal risk difference / ratio, supply `prevalence =` on the design and use a case-control-weighted estimator (e.g. `estimator = "ccw_gformula"`).

Supplying the source-population prevalence on the design (unmatched_cc(prevalence = 0.02)) is what unlocks the case-control-weighted marginal contrasts. That layer is planned but not yet implemented; see "What works today" below.
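
To see why q0 matters, the following base-R sketch builds a hypothetical source population from expected cell counts, draws a case-control sample from it, and compares the odds ratio and risk difference before and after weighting. The weights (q0 for cases, (1 - q0) / J for controls, with J controls per case) follow the case-control weighting scheme described by Rose and van der Laan; all counts and helper names are illustrative assumptions, and this is not matchatr code:

```r
# Hypothetical source population as expected cell counts (deterministic).
N <- 100000
pop <- data.frame(
  x = c(1, 1, 0, 0),                 # exposure
  y = c(1, 0, 1, 0),                 # outcome (1 = case)
  n = c(1500, 28500, 1400, 68600)    # risk 0.05 if exposed, 0.02 if not
)
q0 <- sum(pop$n[pop$y == 1]) / N     # source-population prevalence

# Case-control sample: keep every case, draw J controls per case at random
# from the non-cases (expected counts again).
J <- 1
n_cases <- sum(pop$n[pop$y == 1])
cc <- pop
cc$n[cc$y == 0] <- cc$n[cc$y == 0] * J * n_cases / sum(pop$n[pop$y == 0])

# Helpers that take a vector of (possibly weighted) cell counts.
cell <- function(d, w, x, y) w[d$x == x & d$y == y]
odds_ratio <- function(d, w = d$n) {
  (cell(d, w, 1, 1) * cell(d, w, 0, 0)) / (cell(d, w, 1, 0) * cell(d, w, 0, 1))
}
risk_diff <- function(d, w = d$n) {
  risk <- function(x) cell(d, w, x, 1) / sum(w[d$x == x])
  risk(1) - risk(0)
}

# Case-control weights: q0 for cases, (1 - q0) / J for controls.
w_ccw <- cc$n * ifelse(cc$y == 1, q0, (1 - q0) / J)

c(or_pop = odds_ratio(pop), or_cc = odds_ratio(cc))
c(rd_pop = risk_diff(pop), rd_cc = risk_diff(cc), rd_ccw = risk_diff(cc, w_ccw))
```

The odds ratio is the same in the population and in the sample, while the unweighted sample risk difference is far from the population value; reweighting with q0 recovers it. With real data q0 must come from outside the case-control sample, which is why the design object carries it.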

What works today

matchatr is built phase by phase against the Handbook of Statistical Methods for Case-Control Studies (Borgan et al. 2018). The classical odds-ratio engines are implemented:

Design	Estimator	Estimand	Article
Unmatched case-control	`"logistic"`	conditional OR (any exposure type)	Unmatched CC
Unmatched case-control	`"mh"`	Mantel-Haenszel stratified OR	Unmatched CC
Matched case-control	`"clogit"`	conditional OR (+ effect modification)	Matched CC
Matched case-control	`"mcnemar"`	1:1 matched-pair OR	Matched CC
Multi-group outcome	`"polytomous"`	per-subtype OR vs reference	Multiple groups

The estimators for the time-to-event sampling designs (nested case-control, case-cohort, IPW-NCC) and the marginal causal layer (case-control-weighted g-formula / IPW / AIPW / TMLE, design-weighted causal survival) are designed but not yet implemented; see FEATURE_COVERAGE_MATRIX.md for the authoritative status of every cell.

References

Borgan, Ørnulf, Norman Breslow, Nilanjan Chatterjee, Mitchell H. Gail, Alastair Scott, and Chris J. Wild, eds. 2018. Handbook of Statistical Methods for Case-Control Studies. Chapman & Hall/CRC. https://doi.org/10.1201/9781315154084.

Rose, Sherri, and Mark J. van der Laan. 2009. “Why Match? Investigating Matched Case-Control Study Designs with Causal Effect Estimation.” The International Journal of Biostatistics 5 (1): Article 1. https://doi.org/10.2202/1557-4679.1127.