Nested case-control hazard ratios

A nested case-control (NCC) study samples controls from inside a cohort by risk-set (incidence-density) sampling (Thomas 1977): at each case’s failure time, a few controls are drawn at random from the subjects still at risk. The design-faithful analysis is the conditional partial likelihood with each sampled risk set as a stratum — exactly the conditional logistic regression matchatr already uses for matched case-control data, because a matched set and a sampled risk set are the same stratum construction.
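The shared stratum construction can be checked with the survival package alone. The sketch below does not use matchatr: it fits the infert matched case-control data bundled with R twice, once with survival::clogit() and once as a Cox partial likelihood stratified on the matched set with a dummy follow-up time of 1. The choice of covariates is ours and is for illustration only; the two coefficient vectors should agree.

```r
library(survival)

# Conditional logistic regression on the matched sets (stratum = matched-set id).
fit_clogit <- clogit(case ~ spontaneous + induced + strata(stratum),
                     data = infert)

# The same likelihood written as a stratified Cox model: every subject gets
# the same dummy time, the case is the "event", and each set is a stratum.
fit_cox <- coxph(Surv(rep(1, nrow(infert)), case) ~ spontaneous + induced +
                   strata(stratum),
                 data = infert, method = "exact")

# Compare the two sets of coefficients.
all.equal(coef(fit_clogit), coef(fit_cox))
```

Read this way, a sampled risk set is simply a stratum whose "event" is the case that defined it.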

The estimand, however, is different. Under proper risk-set sampling the conditional odds ratio estimates the hazard ratio itself, with no rare-disease approximation involved (Prentice and Breslow 1978). matchatr therefore reports it on the hazard-ratio scale (type = "hr"), which is the default for nested_cc().
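A short derivation shows why the baseline hazard drops out. The notation is introduced here and is not taken from the package: subject i has hazard \lambda_i(t) = \lambda_0(t) e^{\beta x_i}, t_j is the j-th failure time, i_j is the case failing at t_j, and \tilde{R}_j is the sampled risk set (the case plus its m controls, drawn by simple random sampling from those at risk).

```latex
\Pr\bigl(i_j \text{ fails at } t_j \,\bigm|\, \tilde{R}_j,\ \text{one failure in } \tilde{R}_j \text{ at } t_j\bigr)
  = \frac{\lambda_0(t_j)\, e^{\beta x_{i_j}}}{\sum_{k \in \tilde{R}_j} \lambda_0(t_j)\, e^{\beta x_k}}
  = \frac{e^{\beta x_{i_j}}}{\sum_{k \in \tilde{R}_j} e^{\beta x_k}},
\qquad
L(\beta) = \prod_{j} \frac{e^{\beta x_{i_j}}}{\sum_{k \in \tilde{R}_j} e^{\beta x_k}} .
```

Each factor has the form of a conditional logistic regression contribution, but \beta is the log hazard ratio of the Cox model, which is why no rare-disease argument is needed.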

Code

library(matchatr)

A cohort and a nested sample

We simulate a cohort with a constant baseline hazard and a known Cox log hazard ratio for a binary exposure x.

Code

set.seed(51)
n <- 3000
x <- rbinom(n, 1, 0.4)
beta <- log(2.2)                       # the true log hazard ratio for x
rate <- 0.08 * exp(beta * x)           # PH hazard: baseline times exp(lp)
time <- rexp(n, rate)
tau  <- 4                              # administrative censoring
cohort <- data.frame(
  id = seq_len(n), t = pmin(time, tau), d = as.integer(time <= tau), x = x
)

sample_ncc() draws the risk-set (incidence-density) sample: for each case it takes m controls at random from the subjects still at risk at that failure time. The result is analysis-ready: it carries a per-set case indicator case (distinct from the cohort-wide event d, since a sampled control may itself fail later, as id 2178 does in set 1 below), a matched-set id set, and the set’s failure time risk_time.

Code

set.seed(1)
ncc <- sample_ncc(cohort, time = "t", event = "d", m = 3)
head(ncc)
#>       id           t     d     x   set  case   risk_time
#>    <int>       <num> <int> <int> <int> <int>       <num>
#> 1:    33 0.002633132     1     1     1     1 0.002633132
#> 2:  1018 4.000000000     0     0     1     0 0.002633132
#> 3:   680 4.000000000     0     0     1     0 0.002633132
#> 4:  2178 2.139775944     1     0     1     0 0.002633132
#> 5:   726 0.007580645     1     1     2     1 0.007580645
#> 6:   932 4.000000000     0     0     2     0 0.007580645
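To make the sampling rule concrete, here is a minimal base-R sketch of a risk-set draw on a small toy cohort. It is our own illustration, not the implementation behind sample_ncc(): the helper draw_risk_sets() and the toy data are hypothetical, and ties and delayed entry are ignored.

```r
# Toy cohort; all names below are illustrative, not matchatr's internals.
set.seed(7)
n    <- 200
x    <- rbinom(n, 1, 0.4)
time <- rexp(n, 0.08 * exp(log(2.2) * x))
tau  <- 4
toy  <- data.frame(id = seq_len(n), t = pmin(time, tau),
                   d = as.integer(time <= tau), x = x)

draw_risk_sets <- function(cohort, m) {
  case_rows <- which(cohort$d == 1)
  case_rows <- case_rows[order(cohort$t[case_rows])]  # failures in time order
  sets <- lapply(seq_along(case_rows), function(k) {
    i <- case_rows[k]
    # Eligible controls: everyone else still under follow-up at the case's
    # failure time, including subjects who fail later.
    at_risk <- setdiff(which(cohort$t >= cohort$t[i]), i)
    if (length(at_risk) < m) {
      stop("fewer than m eligible controls for id ", cohort$id[i])
    }
    ctrl <- at_risk[sample.int(length(at_risk), m)]
    out <- cohort[c(i, ctrl), ]
    out$set       <- k                      # matched-set id
    out$case      <- c(1L, rep(0L, m))      # per-set case indicator
    out$risk_time <- cohort$t[i]            # failure time defining the set
    out
  })
  do.call(rbind, sets)
}

toy_ncc <- draw_risk_sets(toy, m = 3)
head(toy_ncc, 4)
```

Indexing with sample.int() rather than calling sample() on the vector of eligible rows avoids the base-R pitfall where sample() on a length-one numeric vector samples from 1:n instead.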

Two further sampling options match real designs: match = ~ sex + birth_cohort confines each case’s controls to its own population stratum, and entry = supplies a delayed-entry (left-truncation) column so a subject counts as at risk only after it enters follow-up. If a case is left with no eligible control, the call aborts with an informative matchatr_empty_risk_set error rather than returning a degenerate set.
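The eligibility rule these two options imply can be sketched in a few lines of base R. This is an assumption-laden illustration: eligible_controls() and the columns entry, exit and sex are hypothetical names, and the (entry, exit] convention is borrowed from the survival package rather than confirmed for matchatr.

```r
# Rows eligible as controls for the case in row i: in the same matching
# stratum, already under follow-up (entry < s), and not yet exited (exit >= s),
# where s is the case's failure time.
eligible_controls <- function(cohort, i) {
  s <- cohort$exit[i]
  which(cohort$entry < s & cohort$exit >= s &
          cohort$sex == cohort$sex[i] &
          seq_len(nrow(cohort)) != i)
}

toy <- data.frame(
  entry = c(0,   0, 2, 0, 0),
  exit  = c(1.5, 5, 5, 1, 5),
  sex   = c("f", "f", "f", "f", "m")
)

# Case in row 1 fails at 1.5: row 3 has not entered yet, row 4 has already
# left, and row 5 is in another stratum, so only row 2 is eligible.
eligible_controls(toy, 1)
#> [1] 2
```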

The risk-set hazard ratio

matcha() with design = nested_cc() fits the conditional partial likelihood; contrast() reports the hazard ratio with the partial-likelihood information-matrix Wald interval.

Code

fit <- matcha(
  ncc,
  outcome = "case", exposure = "x",
  design = nested_cc(strata = "set", time = "risk_time"),
  estimator = "clogit"
)
fit
#> <matchatr_fit>
#>  Design:     Nested case-control
#>  Estimator:  clogit  (engine: clogit)
#>  Outcome:    case
#>  Exposure:   x
#>  Confounders: none
#>  N:          4500  (cases: 1125, controls: 3375)

contrast(fit)        # type = "hr" is the default for a nested design
#> <matchatr_result>
#>  Estimator:  clogit  (engine: clogit)
#>  Estimand:   hazard ratio
#>  Contrast:   Hazard ratio
#>  CI method:  model
#>  N:          4500
#> 
#> Contrasts:
#>    comparison estimate        se ci_lower ci_upper
#>        <char>    <num>     <num>    <num>    <num>
#> 1:          x 2.352281 0.1666529 2.047311  2.70268

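The estimate is consistent with the truth: the estimated hazard ratio 2.35 (log hazard ratio ≈ 0.86) is close to the true value 2.2 (log(2.2) ≈ 0.79), which lies inside the reported interval (2.05 to 2.70). The design’s time column records how the controls were sampled; the risk-set membership is read from strata, so time itself does not enter the conditional likelihood here.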
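The printed interval can be reproduced by hand, which also clarifies the scale of each column. The sketch assumes that se is reported on the hazard-ratio scale (delta method) and that the interval is a 95% Wald interval formed on the log scale; both assumptions are ours, inferred from the fact that they reproduce the printed limits.

```r
hr    <- 2.352281      # printed estimate
se_hr <- 0.1666529     # printed se, assumed to be on the hazard-ratio scale

se_log <- se_hr / hr   # delta method back to the log-hazard-ratio scale
ci <- exp(log(hr) + c(-1, 1) * qnorm(0.975) * se_log)
round(ci, 3)
#> [1] 2.047 2.703

# The true hazard ratio used in the simulation lies inside the interval.
ci[1] < 2.2 && 2.2 < ci[2]
#> [1] TRUE
```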

OR = HR exactly

The whole point of risk-set sampling is that the sampled odds ratio is the hazard ratio, with no rare-disease assumption. We can see this directly: the NCC estimate matches the Cox hazard ratio fit on the full cohort (within sampling error, since the NCC sample is a subset).

Code

full_cohort <- survival::coxph(survival::Surv(t, d) ~ x, data = cohort)
c(ncc_HR    = exp(coef(fit$model)[["x"]]),
  cohort_HR = exp(coef(full_cohort)[["x"]]))
#>    ncc_HR cohort_HR 
#>  2.352281  2.338995
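For contrast, it helps to see what the odds ratio would target if controls were instead drawn from subjects still event-free at the end of follow-up (cumulative sampling). The arithmetic below uses the simulation's own parameters and the exponential model, and involves no sampling; the variable names are ours.

```r
lambda0 <- 0.08   # baseline hazard used in the simulation
hr      <- 2.2    # true hazard ratio
tau     <- 4      # administrative censoring time

# Cumulative risks by tau under a constant hazard.
risk0 <- 1 - exp(-lambda0 * tau)        # unexposed
risk1 <- 1 - exp(-lambda0 * hr * tau)   # exposed

# Odds ratio comparing cases with survivors at tau, and the risk ratio.
cum_or     <- (risk1 / (1 - risk1)) / (risk0 / (1 - risk0))
risk_ratio <- risk1 / risk0
round(c(HR = hr, cumulative_OR = cum_or, risk_ratio = risk_ratio), 2)
```

With roughly 27% of the unexposed and 51% of the exposed failing by tau, the disease is far from rare, so the cumulative odds ratio (about 2.71) overshoots the hazard ratio of 2.2. Risk-set sampling avoids this gap by construction.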

One scale per design

Each conditional design identifies exactly one scale: the matched design reports the odds ratio, the nested design the hazard ratio. Asking for the off-design scale is declined — the value would be the same number, but it is not the estimand the design targets:

Code

contrast(fit, type = "or")
#> Error in `contrast()`:
#> ! A nested case-control design is reported on the hazard-ratio scale.
#> ℹ Risk-set (incidence-density) sampling identifies the hazard ratio (OR = HR exactly; Prentice & Breslow 1978). Use `type = "hr"` (the default).

As with the other case-control designs, a marginal risk difference or risk ratio needs a source-population prevalence q0 and a case-control-weighted estimator. The conditional fit reports the information-matrix interval only, so ci_method = "sandwich" and ci_method = "bootstrap" are declined.

Counter-matching: concentrating power near the exposure-surrogate boundary

When a binary surrogate (e.g. the measured variant for a gene of interest) is correlated with the true exposure, counter-matching draws controls from the opposite surrogate stratum, concentrating the study in subjects whose surrogate value differs from the case. This is more efficient than uniform sampling when the surrogate is highly correlated with exposure.

sample_ncc_counter_matched() generates the dataset; analysis requires survival::coxph with the Langholz-Borgan log-weights (Langholz and Borgan 1995) as an offset (the unweighted clogit is biased for this design):

Code

cohort$z_bin <- cohort$x          # binary surrogate (here equal to x, for illustration)
ncc_cm <- sample_ncc_counter_matched(
  cohort, time = "t", event = "d", surrogate = "z_bin", m = 1L
)
fit_cm <- matcha(ncc_cm, "case", "x",
  design = counter_matched(strata = "set", time = "risk_time",
                           weights = "log_w"),
  estimator = "weighted_cox"
)
contrast(fit_cm)                   # hazard ratio by default

The log-weights (log_w) encode how many subjects each observation represents: the case represents its entire same-stratum risk set (log_w = log(n_same + 1)), and each control represents the opposite stratum divided by the number of controls drawn (log_w = log(n_opp / m_take)). Requesting type = "or" or a sandwich/bootstrap CI is declined.
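The weighting can be sketched end to end with base R and survival::coxph(). This is our own minimal illustration of 1:1 counter-matching, not the code behind sample_ncc_counter_matched(): the toy cohort, the 90% agreement between surrogate z and exposure x, and all object names are assumptions. The case's stratum count here includes the case itself, which is how we read the n_same + 1 in the text above.

```r
library(survival)

set.seed(11)
n    <- 2000
x    <- rbinom(n, 1, 0.4)
z    <- ifelse(runif(n) < 0.9, x, 1L - x)   # surrogate, agrees with x ~90% of the time
time <- rexp(n, 0.08 * exp(log(2.2) * x))
tau  <- 4
toy  <- data.frame(t = pmin(time, tau), d = as.integer(time <= tau), x = x, z = z)

case_rows <- which(toy$d == 1)
sets <- lapply(seq_along(case_rows), function(k) {
  i <- case_rows[k]
  at_risk <- which(toy$t >= toy$t[i])               # risk set, including the case
  same <- at_risk[toy$z[at_risk] == toy$z[i]]       # case's surrogate stratum
  opp  <- at_risk[toy$z[at_risk] != toy$z[i]]       # opposite surrogate stratum
  if (length(opp) == 0) stop("no eligible control in the opposite stratum")
  ctrl <- opp[sample.int(length(opp), 1)]           # one control, counter-matched
  data.frame(set = k, case = c(1L, 0L), x = toy$x[c(i, ctrl)],
             # Weight = stratum size at risk / number sampled from that stratum.
             log_w = log(c(length(same), length(opp))))
})
cm <- do.call(rbind, sets)

# Weighted partial likelihood: the log-weights enter as an offset, and each
# sampled set is its own stratum with a dummy follow-up time of 1.
fit_toy <- coxph(Surv(rep(1, nrow(cm)), case) ~ x + offset(log_w) + strata(set),
                 data = cm)
exp(coef(fit_toy))
```

Dropping the offset from this fit would ignore how the controls were selected, which is the bias the text above warns about for the unweighted clogit.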

Covered combinations

Legend. ✅ truth-pinned in tests · ⛔ rejected with an informative error.

| Sampling | Estimator | Estimand | Status |
| --- | --- | --- | --- |
| `sample_ncc()` risk-set draw (+ `match` / `entry`) | — | NCC sample from a cohort | ✅ |
| m:1 risk-set | `clogit` | conditional hazard ratio | ✅ |
| m:1 risk-set + confounder | `clogit` | conditional hazard ratio | ✅ |
| risk-set + `effect_modifier` | `clogit` | stratum-specific hazard ratio | ✅ |
| nested | `clogit` | odds ratio | ⛔ use `type = "hr"` |
| nested | `clogit` | RD / RR | ⛔ need q0 |
| nested, non-binary outcome | — | — | ⛔ |
| `sample_ncc_counter_matched()` draw | — | CM NCC sample | ✅ |
| counter-matched | `weighted_cox` | conditional hazard ratio | ✅ |
| counter-matched | `weighted_cox` | odds ratio | ⛔ use `type = "hr"` |
| counter-matched | `weighted_cox` | sandwich / bootstrap CI | ⛔ |

See FEATURE_COVERAGE_MATRIX.md for the authoritative status of every combination.

References

Goldstein, Larry, and Bryan Langholz. 1992. “Asymptotic Theory for Nested Case-Control Sampling in the Cox Regression Model.” The Annals of Statistics 20 (4): 1903–28. https://doi.org/10.1214/aos/1176348895.

Langholz, Bryan, and Ørnulf Borgan. 1995. “Counter-Matching: A Stratified Nested Case-Control Sampling Method.” Biometrika 82 (1): 69–79. https://doi.org/10.1093/biomet/82.1.69.

Prentice, Ross L., and Norman E. Breslow. 1978. “Retrospective Studies and Failure Time Models.” Biometrika 65 (1): 153–58. https://doi.org/10.1093/biomet/65.1.153.

Thomas, Duncan C. 1977. “Addendum to ‘Methods of Cohort Analysis: Appraisal by Application to Asbestos Mining’ (Liddell, F. D. K., McDonald, J. C. and Thomas, D. C.).” Journal of the Royal Statistical Society, Series A 140 (4): 469–91. https://doi.org/10.2307/2345280.