Fit a causal survival hazard model on person-period data

Description

Fit-only entry point for survatr. Builds the risk set and fits the pooled-logistic discrete-time hazard model logit h(t | A, L) = alpha(t) + beta_A A + beta_L L on the at-risk person-period rows. Survival curves, risk / RMST contrasts, and variance live in contrast() (time-indexed curve-shaped result). This two-step split lets the user fit the hazard model once and cheaply contrast many interventions on top.

Usage

surv_fit(
  data,
  outcome,
  treatment,
  confounders,
  id,
  time,
  censoring = NULL,
  competing = NULL,
  time_formula = ~splines::ns(time, 4),
  weights = NULL,
  estimator = "gcomp",
  model_fn = stats::glm,
  propensity_model_fn = stats::glm,
  trim = NULL,
  confounders_tv = NULL,
  history = Inf,
  ...
)

Arguments

data A person-period (long) data.frame or data.table, rectangular across ids: every unique id must have one row at every unique time value. Ragged PP (ids dropped post-event / post-censor) is rejected with class survatr_ragged_pp. Pad ragged data before calling surv_fit() by appending rows with outcome = 0 and (if used) censoring = 1 so the risk-set builder drops them from the fit. Wide data (one row per id across a multi-period study) must be reshaped with causatr::to_person_period().
outcome Character scalar. Column name of the event indicator (1 = event at this period, 0 = no event). Must be in data.
treatment Character scalar. Column name of the treatment. For Track A the treatment is constant within id. Under estimator = “ice” (Track B) the treatment may vary within id and must be numeric (binary, or a numeric dose entered linearly); factor / categorical (k > 2) treatments are rejected with class survatr_ice_treatment_unsupported (a treatment-design formula path ships in a later chunk).
confounders A one-sided formula (e.g. ~ L1 + L2) describing the baseline (time-invariant) covariate adjustment set. Under estimator = “ice” (Track B), time-varying covariates go in confounders_tv instead; baseline terms here are never lagged.
id Character scalar. Column name of the individual identifier.
time Character scalar. Column name of the discrete period index (integer-valued; sorted within id).
censoring Character scalar or NULL. Column name of the censoring indicator (1 = censored at this period). NA or 0 means uncensored. When NULL, every uncensored period is treated as at-risk until the first event.
competing Character scalar or NULL (the default). When non-NULL, activates the competing-risks path: competing names a multi-valued event-type column (0 = no event this period, 1..J = the cause of the event this period), and surv_fit() fits J parallel cause-specific pooled-logistic hazard models on a shared all-cause risk set. It must name the same column as outcome; competing risks are gcomp / Track A only in this release (a non-“gcomp” estimator, fewer than two distinct causes, or outcome != competing aborts with survatr_competing_estimator / survatr_competing_misuse). Cumulative-incidence functions (CIF), CIF contrasts, and all-cause survival come from contrast(). Fine–Gray / subdistribution hazards are out of scope (cause-specific only).
time_formula One-sided formula for the baseline hazard alpha(t). Defaults to ~ splines::ns(time, 4) (4 df natural spline on the time variable). Pass ~ factor(time) for period dummies or ~ 1 for a time-constant hazard.
weights Optional numeric vector of external weights, length nrow(data). When supplied, the hazard model is fit with stats::quasibinomial() rather than stats::binomial() (same score equations, free dispersion – drops the "non-integer #successes" warning). The variance engine in later chunks reads the family from fit$family to pick the right dispersion.
estimator Character scalar. “gcomp” (pooled-logistic standardization), “ipw” (weighted marginal hazard MSM with stabilized density-ratio weights), or “ice” (Track B: longitudinal iterated-conditional-expectation hazards for a time-varying treatment). Matching is a hard reject with class survatr_matching_rejected pointing to survival::coxph(…, weights = , cluster = ).
model_fn Fitting function for the hazard model. Defaults to stats::glm. Accepts any function matching the stats::glm interface (formula, data, family, weights, …), e.g. mgcv::gam with an s(time) term in time_formula.
propensity_model_fn Fitting function for the treatment model under estimator = “ipw” (ignored otherwise). Same stats::glm-style interface as model_fn. Defaults to stats::glm. The treatment model is fit on the baseline rows (one per id) with confounders as predictors; the hazard MSM then uses A only.
trim Numeric scalar in (0, 1] or NULL (the default). Under estimator = “ipw”, the per-id stabilized weights are winsorized at the trim-th quantile (Cole & Hernán 2008) before being broadcast onto the person-period rows. NULL / 1 means no truncation. The resolved fixed cutoff is reused by the sandwich variance.
confounders_tv A one-sided formula of time-varying confounders for Track B (estimator = “ice”), lag-expanded at each backward step (e.g. ~ L builds L + lag1_L + …). NULL (the default) means no time-varying confounders. Ignored by Track A (gcomp / ipw).
history Markov lag order for Track B. Inf (the default) uses the full available history (capped at n_times - 1); an integer restricts the lag structure (e.g. history = 1 for first-order Markov). Ignored by Track A.
Forwarded to model_fn. na.action = na.exclude is rejected with class survatr_bad_na_actionna.exclude pads working residuals with NAs while model.matrix() drops them, which silently corrupts the sandwich variance downstream.

Value

An object of class survatr_fit holding the fitted hazard model, the person-period data (internal .survatr_* columns stripped), the time grid, and metadata needed by contrast() and diagnose().

See Also

causatr::to_person_period() for reshaping wide data.

Other survatr_fit functions: print.survatr_fit()

Examples

library("survatr")

# Small rectangular person-period dataset: 30 ids over 4 periods.
set.seed(1)
n_id <- 30L
K <- 4L
pp <- data.frame(
  id = rep(seq_len(n_id), each = K),
  t = rep(seq_len(K), times = n_id),
  A = rep(rbinom(n_id, 1L, 0.5), each = K),
  Y = rbinom(n_id * K, 1L, 0.1)
)

# Pooled-logistic hazard with period dummies for the baseline hazard.
fit <- surv_fit(pp, "Y", "A", ~1, "id", "t", time_formula = ~ factor(t))
fit
<survatr_fit>
  Track:       A
  Estimator:   gcomp
  Family:      binomial
  Outcome:     Y
  Treatment:   A
  ID:          id
  Time:        t
  Censoring:   none
  N:           30 individuals, 120 PP rows (104 at risk)
  Time grid:   [1, 4] (4 unique times)