Draw a nested case-control sample from a cohort by risk-set sampling

Description

Generates a nested case-control (NCC) dataset from a full cohort by risk-set (incidence-density) sampling: at each event time the failing subject (the case) is matched to up to m controls drawn at random from the subjects still at risk at that instant. Each case and its sampled controls form one matched set (a sampled risk set), which is the stratum the conditional partial likelihood conditions on. The result feeds straight into matcha(design = nested_cc(strata = "set", time = "risk_time")), whose conditional logistic fit reports the hazard ratio (OR = HR exactly under risk-set sampling; Prentice & Breslow 1978).

Usage

sample_ncc(cohort, time, event, m = 1, match = NULL, entry = NULL)

Arguments

cohort A data.frame or data.table with one row per subject. Copied, never mutated.
time A single character string naming the exit / event-time column (the time scale on which risk sets are formed). Must be numeric.
event A single character string naming the cohort event indicator (logical, a two-level factor, or numeric 0/1); at least one event must occur.
m A single whole number >= 1, the number of controls sampled per case (default 1).
match NULL (no additional matching) or a one-sided formula naming population-stratum column(s) controls must share with the case (e.g. ~ sex + birth_cohort).
entry NULL (everyone enters at the time origin) or a single character string naming a delayed-entry / left-truncation column. Must be numeric when supplied; a subject is at risk at a failure time tc only if entry < tc.

Details

The risk set at a failure time tc is the set of subjects under observation then: those who have entered follow-up and not yet left, i.e. entry < tc <= time (with entry taken as the origin when not supplied, so the condition reduces to time >= tc). The case is excluded from its own control pool, and min(m, n_eligible) controls are sampled without replacement; a late failure time with at least one but fewer than m eligible controls yields a smaller set rather than an error, mirroring real NCC sampling. A subject sampled as a control may itself fail later in the cohort (it serves as a control before its own event), so the per-set case indicator is distinct from the cohort-wide event column.
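The risk-set rule can be sketched in base R for a single failure time. This is an illustration only, not the package's implementation: eligible_controls() is a hypothetical helper, and the column names id and t are assumptions borrowed from the example cohort.

```r
# Hypothetical helper (not part of matchatr): ids eligible as controls
# for the case `case_id` failing at time `tc`.
eligible_controls <- function(cohort, tc, case_id, entry = NULL) {
  # Without an entry column everyone has entered at the origin.
  entered <- if (is.null(entry)) rep(TRUE, nrow(cohort)) else cohort[[entry]] < tc
  at_risk <- entered & cohort$t >= tc          # entry < tc <= time
  cohort$id[at_risk & cohort$id != case_id]    # the case is not its own control
}

cohort <- data.frame(
  id = 1:8,
  t  = c(2, 5, 1, 8, 3, 9, 4, 7),
  d  = c(1, 0, 1, 0, 1, 0, 0, 0)
)

# Subject 5 fails at t = 3; subjects 1 and 3 have already left follow-up.
pool <- eligible_controls(cohort, tc = 3, case_id = 5)
pool
# 2 4 6 7 8

# Draw min(m, n_eligible) controls without replacement.
m <- 2
n_take <- min(m, length(pool))
controls <- pool[sample.int(length(pool), n_take)]
```

Indexing with sample.int() rather than calling sample(pool, ...) avoids the base R pitfall where a length-one numeric pool is treated as 1:pool.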

Additional matching (match = ~ s1 + s2) restricts each case’s control pool to subjects sharing the case’s values on the named population-stratum variables (e.g. sex, birth cohort). Sampling uses the ambient random-number stream, so wrap the call in withr::with_seed() or precede it with set.seed() for reproducibility.
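As a rough base R sketch of both points, the control pool can be narrowed to the case's stratum and the draw repeated under a fixed seed. The helper matched_pool() and the sex column are hypothetical; in sample_ncc() the same restriction would be requested with match = ~ sex.

```r
cohort <- data.frame(
  id  = 1:8,
  t   = c(2, 5, 1, 8, 3, 9, 4, 7),
  sex = c("F", "M", "F", "F", "M", "M", "F", "M")
)

# Hypothetical helper: control pool for a case, restricted to subjects
# that share the case's values on every matching variable.
matched_pool <- function(cohort, case_id, match_vars) {
  case <- cohort[cohort$id == case_id, ]
  keep <- cohort$t >= case$t & cohort$id != case_id
  for (v in match_vars) keep <- keep & cohort[[v]] == case[[v]]
  cohort$id[keep]
}

# Case 5 (t = 3, sex "M"): only male subjects still at risk remain.
pool <- matched_pool(cohort, case_id = 5, match_vars = "sex")
pool
# 2 6 8

# Same seed, same draw: the sampling is reproducible once the seed is fixed.
draw <- function(seed) {
  set.seed(seed)
  pool[sample.int(length(pool), 2)]
}
identical(draw(42), draw(42))
# TRUE
```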

The sampler is implemented natively (base R + data.table) so it is always available and deterministically seedable; Epi::ccwc() is an equivalent external implementation used as a cross-check in the package’s tests.

A case whose risk set contains no eligible control carries no conditional-likelihood information and signals a sampling failure (a misspecified time origin/scale, an entry/exit mismatch, or over-fine match strata), so sample_ncc() aborts with matchatr_empty_risk_set rather than silently producing a singleton set.
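One way to anticipate this abort is to count eligible controls per case before sampling. The following base R check is a sketch under simplifying assumptions (no delayed entry, no additional matching, hypothetical column names id, t and d) and is not a function provided by the package.

```r
cohort <- data.frame(
  id = 1:5,
  t  = c(2, 5, 1, 8, 9),
  d  = c(1, 0, 1, 0, 1)
)

# For each case, count the other subjects still at risk at its failure time.
case_rows <- which(cohort$d == 1)
n_eligible <- vapply(
  case_rows,
  function(i) sum(cohort$t >= cohort$t[i]) - 1L,  # minus the case itself
  integer(1)
)

data.frame(id = cohort$id[case_rows], n_eligible = n_eligible)
#   id n_eligible
# 1  1          3
# 2  3          4
# 3  5          0
```

Subject 5 fails last, so nobody else is at risk at t = 9; a cohort like this would be expected to trigger the empty-risk-set abort described above.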

Value

A data.table with one row per sampled subject: the selected rows of cohort (all original columns) plus set (integer matched-set id), case (per-set 0/1 indicator, 1 for the case), and risk_time (the set’s failure time). Aborts with matchatr_empty_risk_set when any case has no eligible control.
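The documented structure implies a few invariants that are easy to check on a returned object. The sketch below rebuilds, by hand and as a plain data.frame, the rows shown in the Examples output (columns t and x omitted); the checks themselves are illustrative and not part of the package.

```r
# Hand-copied from the Examples output; a real call returns a data.table.
ncc <- data.frame(
  id        = c(3, 1, 5, 1, 2, 4, 5, 8, 6),
  d         = c(1, 1, 1, 1, 0, 0, 1, 0, 0),
  set       = rep(1:3, each = 3),
  case      = rep(c(1, 0, 0), times = 3),
  risk_time = rep(c(1, 2, 3), each = 3)
)

# Exactly one case per matched set.
cases_per_set <- tapply(ncc$case, ncc$set, sum)

# risk_time takes a single value within each set.
times_per_set <- tapply(ncc$risk_time, ncc$set, function(x) length(unique(x)))

# Controls that are cohort events themselves: d == 1 but case == 0.
future_cases <- ncc[ncc$d == 1 & ncc$case == 0, c("id", "set")]
future_cases
#   id set
# 2  1   1
# 3  5   1
```

Subjects 1 and 5 serve as controls in set 1 before their own events, which is why the analysis outcome should be case rather than the cohort-wide event column.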

See Also

nested_cc(), matcha(), Epi::ccwc()

Examples

library("matchatr")

# A small cohort with event times; draw 2 controls per case.
cohort <- data.frame(
  id = 1:8,
  t  = c(2, 5, 1, 8, 3, 9, 4, 7),
  d  = c(1, 0, 1, 0, 1, 0, 0, 0),
  x  = c(1, 0, 1, 0, 0, 1, 0, 1)
)
set.seed(1)
ncc <- sample_ncc(cohort, time = "t", event = "d", m = 2)
ncc
      id     t     d     x   set  case risk_time
   <int> <num> <num> <num> <int> <int>     <num>
1:     3     1     1     1     1     1         1
2:     1     2     1     1     1     0         1
3:     5     3     1     0     1     0         1
4:     1     2     1     1     2     1         2
5:     2     5     0     0     2     0         2
6:     4     8     0     0     2     0         2
7:     5     3     1     0     3     1         3
8:     8     7     0     1     3     0         3
9:     6     9     0     1     3     0         3
# Analyse it: each sampled risk set is a stratum -> hazard ratio.
fit <- matcha(ncc, outcome = "case", exposure = "x",
              design = nested_cc(strata = "set", time = "risk_time"),
              estimator = "clogit")
contrast(fit)
<matchatr_result>
 Estimator:  clogit  (engine: clogit)
 Estimand:   hazard ratio
 Contrast:   Hazard ratio
 CI method:  model
 N:          9

Contrasts:
   comparison estimate       se  ci_lower ci_upper
       <char>    <num>    <num>     <num>    <num>
1:          x 1.686141 2.174996 0.1345573 21.12907
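
The point estimate above can be reproduced by hand from the three matched sets, which may help clarify what the fit conditions on. This sketch assumes the fit maximises the standard conditional logistic likelihood with the single covariate x; the internals of matcha() are not shown here, so treat it as a cross-check rather than a description of the package's code.

```r
# Exposure x of the case and of every member (case first) in each matched set,
# read off the printed ncc table above.
sets <- list(
  list(case_x = 1, x = c(1, 1, 0)),  # set 1: case 3, controls 1 and 5
  list(case_x = 1, x = c(1, 0, 0)),  # set 2: case 1, controls 2 and 4
  list(case_x = 0, x = c(0, 1, 1))   # set 3: case 5, controls 8 and 6
)

# Conditional log-likelihood: for each set,
# beta * x_case - log(sum over set members of exp(beta * x_i)).
cond_loglik <- function(beta) {
  sum(vapply(
    sets,
    function(s) beta * s$case_x - log(sum(exp(beta * s$x))),
    numeric(1)
  ))
}

opt <- optimize(cond_loglik, interval = c(-5, 5), maximum = TRUE, tol = 1e-10)
exp(opt$maximum)
# approximately 1.686, in line with the estimate reported by contrast(fit)
```

For this tiny dataset the score equation reduces to 2u^2 - u - 4 = 0 with u = exp(beta), so the maximiser is (1 + sqrt(33)) / 4.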