sample_ncc – matchatr

Draw a nested case-control sample from a cohort by risk-set sampling

Description

Generates a nested case-control (NCC) dataset from a full cohort by risk-set (incidence-density) sampling: at each event time the failing subject (the case) is matched to m controls drawn at random from the subjects still at risk at that instant. Each case and its sampled controls form one matched set (a sampled risk set), which is the stratum the conditional partial likelihood conditions on. The result feeds straight into matcha(design = nested_cc(strata = "set", time = "risk_time")), whose conditional logistic fit reports the hazard ratio (OR = HR exactly under risk-set sampling; Prentice & Breslow 1978).

Usage

sample_ncc(
  cohort,
  time,
  event,
  m = 1,
  match = NULL,
  entry = NULL,
  incl_prob = FALSE
)

Arguments

cohort A data.frame or data.table with one row per subject. Copied, never mutated.

time A single character string naming the exit / event-time column (the time scale on which risk sets are formed). Must be numeric.

event A single character string naming the cohort event indicator (logical, a two-level factor, or numeric 0/1); at least one event must occur.

m A single whole number >= 1, the number of controls sampled per case (default 1).

match NULL (no additional matching) or a one-sided formula naming population-stratum column(s) controls must share with the case (e.g. ~ sex + birth_cohort).

entry NULL (everyone enters at the time origin) or a single character string naming a delayed-entry / left-truncation column. Must be numeric when supplied; a subject is at risk at tc only if entry < tc.

incl_prob Logical; if TRUE, compute Samuelsen (1997) Kaplan-Meier inclusion probabilities for each sampled subject and append two columns to the output: .cohort_row (integer row index of the subject in cohort) and ipw_weight (inverse inclusion probability: 1 for subjects who fail in the cohort, including when they are sampled as a control in an earlier set, and 1/pi_j for sampled controls that never fail). The weights are needed by the IPW nested case-control analysis (estimator = "ipw_cox"). The inclusion probability for control j is pi_j = 1 - prod_i (1 - m_i / n_elig_i), with the product taken over the event times at which j was eligible, where m_i is the number of controls sampled and n_elig_i the size of the eligible pool at event i (Samuelsen 1997, Biometrika). Computation is O(n x K), where n is the cohort size and K is the number of events. Default FALSE.

Details

The risk set at a failure time tc is the set of subjects under observation then: those who have entered follow-up and not yet left, i.e. entry < tc <= time (with entry taken as the origin when not supplied, so the condition reduces to time >= tc). The case is excluded from its own control pool, and min(m, n_eligible) controls are sampled without replacement; a late failure time with fewer than m eligible controls yields a smaller set rather than an error, mirroring real NCC sampling. A subject sampled as a control may itself fail later in the cohort — it serves as a control before its own event — so the per-set case indicator is distinct from the cohort-wide event column.
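The risk-set rule above can be sketched in base R for a single failure time. This is a minimal illustration under stated assumptions, not the package's implementation: draw_risk_set() and its arguments are hypothetical names introduced here, and the toy cohort reuses the values from the Examples section.

```r
# Hypothetical helper (not part of matchatr): sample one risk set.
draw_risk_set <- function(cohort, case_row, m, time = "t", entry = NULL) {
  tc <- cohort[[time]][case_row]
  # With no entry column everyone enters at the origin, so entry < tc always holds.
  t0 <- if (is.null(entry)) rep(-Inf, nrow(cohort)) else cohort[[entry]]
  at_risk <- t0 < tc & cohort[[time]] >= tc
  pool <- setdiff(which(at_risk), case_row)  # the case is never its own control
  if (length(pool) == 0L) stop("empty risk set at time ", tc)
  k <- min(m, length(pool))                  # late sets may be smaller than m
  # Index into pool so a length-one pool is not read by sample() as 1:n.
  controls <- pool[sample.int(length(pool), k)]
  data.frame(row = c(case_row, controls),
             case = c(1L, rep(0L, k)),
             risk_time = tc)
}

cohort <- data.frame(
  id = 1:8,
  t  = c(2, 5, 1, 8, 3, 9, 4, 7),
  d  = c(1, 0, 1, 0, 1, 0, 0, 0)
)
set.seed(1)
rs <- draw_risk_set(cohort, case_row = 5, m = 2)  # subject 5 fails at t = 3
rs
```

Subject 5 fails at t = 3, so its control pool is the subjects with t >= 3 other than itself (rows 2, 4, 6, 7 and 8), and two of them are drawn without replacement.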

Additional matching (match = ~ s1 + s2) restricts each case’s control pool to subjects sharing the case’s values on the named population-stratum variables (e.g. sex, birth cohort). Sampling uses the ambient random-number stream, so wrap the call in withr::with_seed() or precede it with set.seed() for reproducibility.
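As a rough base-R illustration of these two points (not the package's code), additional matching amounts to intersecting the at-risk pool with the subjects who share the case's stratum values, and seeding the random-number stream makes the draw repeatable. The sex column and its values are invented for this sketch.

```r
cohort <- data.frame(
  id  = 1:8,
  t   = c(2, 5, 1, 8, 3, 9, 4, 7),
  sex = c("f", "m", "f", "m", "f", "f", "m", "f")  # assumed stratum variable
)
case_row <- 1                    # subject 1 fails at t = 2
tc <- cohort$t[case_row]

at_risk  <- cohort$t >= tc
same_sex <- cohort$sex == cohort$sex[case_row]
pool <- setdiff(which(at_risk & same_sex), case_row)
pool                             # rows 5, 6, 8

# The same seed gives the same controls.
set.seed(42); first  <- pool[sample.int(length(pool), 2)]
set.seed(42); second <- pool[sample.int(length(pool), 2)]
identical(first, second)         # TRUE
```

Without the sex restriction the pool would hold six subjects; matching shrinks it to three, which is why over-fine strata can leave a case with no eligible control.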

The sampler is implemented natively (base R + data.table) so it is always available and deterministically seedable; Epi::ccwc() is an equivalent external implementation used as a cross-check in the package’s tests.

A case whose risk set contains no eligible control carries no conditional-likelihood information and signals a sampling failure (a misspecified time origin/scale, an entry/exit mismatch, or over-fine match strata), so it aborts with matchatr_empty_risk_set rather than silently producing a singleton set.

Value

A data.table with one row per sampled subject: the selected rows of cohort (all original columns) plus set (integer matched-set id), case (per-set 0/1 indicator, 1 for the case), and risk_time (the set's failure time). When incl_prob = TRUE, also .cohort_row (integer, 1-indexed row index in the original cohort) and ipw_weight (numeric; 1 for subjects who fail in the cohort, 1/pi_j for sampled controls that never fail). Aborts with matchatr_empty_risk_set when any case has no eligible control.

References

Thomas DC (1977). Addendum to "Methods of cohort analysis: appraisal by application to asbestos mining" (Liddell FDK, McDonald JC, Thomas DC). Journal of the Royal Statistical Society, Series A 140(4):469-491.

Prentice RL, Breslow NE (1978). Retrospective studies and failure time models. Biometrika 65(1):153-158.

Samuelsen SO (1997). A pseudolikelihood approach to analysis of nested case-control studies. Biometrika 84(2):379-394. (Kaplan-Meier inclusion probabilities returned when incl_prob = TRUE.)

Examples

library("matchatr")

# A small cohort with event times; draw 2 controls per case.
cohort <- data.frame(
  id = 1:8,
  t  = c(2, 5, 1, 8, 3, 9, 4, 7),
  d  = c(1, 0, 1, 0, 1, 0, 0, 0),
  x  = c(1, 0, 1, 0, 0, 1, 0, 1)
)
set.seed(1)
ncc <- sample_ncc(cohort, time = "t", event = "d", m = 2)
ncc

      id     t     d     x   set  case risk_time
   <int> <num> <num> <num> <int> <int>     <num>
1:     3     1     1     1     1     1         1
2:     1     2     1     1     1     0         1
3:     5     3     1     0     1     0         1
4:     1     2     1     1     2     1         2
5:     2     5     0     0     2     0         2
6:     4     8     0     0     2     0         2
7:     5     3     1     0     3     1         3
8:     8     7     0     1     3     0         3
9:     6     9     0     1     3     0         3

# Analyse it: each sampled risk set is a stratum -> hazard ratio.
fit <- matcha(ncc, outcome = "case", exposure = "x",
              design = nested_cc(strata = "set", time = "risk_time"),
              estimator = "clogit")
contrast(fit)

<matchatr_result>
 Estimator:  clogit  (engine: clogit)
 Estimand:   hazard ratio
 Contrast:   Hazard ratio
 CI method:  model
 N:          9

Contrasts:
   comparison estimate       se  ci_lower ci_upper
       <char>    <num>    <num>     <num>    <num>
1:          x 1.686141 2.174996 0.1345573 21.12907

# With Samuelsen KM inclusion probabilities for IPW analysis.
set.seed(1)
ncc_ipw <- sample_ncc(cohort, time = "t", event = "d", m = 2, incl_prob = TRUE)
ncc_ipw[, c("id", "case", "set", "ipw_weight", ".cohort_row")]

      id  case   set ipw_weight .cohort_row
   <int> <int> <int>      <num>       <int>
1:     3     1     1        1.0           3
2:     1     0     1        1.0           1
3:     5     0     1        1.0           5
4:     1     1     2        1.0           1
5:     2     0     2        1.4           2
6:     4     0     2        1.4           4
7:     5     1     3        1.0           5
8:     8     0     3        1.4           8
9:     6     0     3        1.4           6
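
The weight of 1.4 printed above can be reproduced by hand from the inclusion-probability formula given under incl_prob. The following base-R sketch assumes no delayed entry and uses only the example cohort's times and event indicators; the variable names are illustrative.

```r
t_all <- c(2, 5, 1, 8, 3, 9, 4, 7)   # exit times of the example cohort
d_all <- c(1, 0, 1, 0, 1, 0, 0, 0)   # cohort event indicator
m     <- 2                           # controls requested per case

ev_time <- sort(t_all[d_all == 1])   # event times 1, 2, 3
# Eligible pool at each event: subjects still at risk, minus the case itself.
n_elig <- sapply(ev_time, function(tc) sum(t_all >= tc) - 1)   # 7, 6, 5

# Subject 2 (t = 5, never fails) is eligible at all three event times.
pi_2 <- 1 - prod(1 - pmin(m, n_elig) / n_elig)
pi_2       # 5/7, about 0.714
1 / pi_2   # 1.4
```

Every sampled control that never fails in this cohort (ids 2, 4, 6 and 8) exits after the last event time, so all four share the same inclusion probability and the same weight.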