---
title: "Validation against reference implementations"
code-fold: show
code-tools: true
vignette: >
%\VignetteIndexEntry{Validation against reference implementations}
%\VignetteEngine{quarto::html}
%\VignetteEncoding{UTF-8}
---
```{r}
#| include: false
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
causatr is built around a single influence-function engine that reproduces the
sandwich-variance construction used by several purpose-built R packages. This
vignette runs side-by-side comparisons against those packages so you can see
the agreement (and disagreement, where it's expected) directly. Each section
fits the same causal quantity with causatr and with a reference implementation
and prints a three-row table: the point estimate, standard error, and 95% CI
from causatr, from the reference, and their absolute difference.
**What counts as "matching".** Some references implement the *same* estimator
and should agree to high precision:
- `stdReg2::standardize_glm()` uses the same g-formula influence function as
causatr's point-gcomp engine, so the point estimate should match to ~1e-6 and
the SE to within rounding of the covariance inverse.
- `WeightIt::glm_weightit()` uses the same M-estimator sandwich as causatr's
IPW Branch A, so the SE should match within rounding.
- `marginaleffects::avg_comparisons()` on a `MatchIt::matchit()` output with
`vcov = ~subclass` uses the same cluster-robust aggregation as causatr's
matching branch, so point estimate and SE should match closely.
Other references implement a *different* estimator that targets the same
causal quantity. In those cases we expect agreement on the point estimate (up
to finite-sample noise) but not on the SE construction:
- `lmtp::lmtp_tmle()` is a TMLE, not a plug-in g-formula. It targets the same
causal parameter but has a different finite-sample distribution; agreement
is a sanity check, not a correctness proof.
The packages used here are all in `Suggests`; every chunk below is guarded
with `requireNamespace()` so the vignette still builds if any of them is
missing.
## Setup
```{r}
#| message: false
library(causatr)
library(tinytable)
compare_row <- function(label, result, name, ref_name,
ref_est, ref_se, level = 0.95) {
con <- result$contrasts
est <- result$estimates
if (name %in% con$comparison) {
row <- con[con$comparison == name, ]
} else if (name %in% est$intervention) {
row <- est[est$intervention == name, ]
} else {
stop(
"name '", name,
"' not found in result$contrasts$comparison or result$estimates$intervention"
)
}
stopifnot(
length(ref_est) == 1L && is.finite(ref_est),
length(ref_se) == 1L && is.finite(ref_se)
)
z <- stats::qnorm(1 - (1 - level) / 2)
# Three rows: causatr, reference, and a row of absolute differences.
# The diff row is the whole point of the vignette — it makes
# "do the two implementations agree?" visually obvious without making
# the reader compare two numbers to 6 decimal places.
est_c <- row$estimate
se_c <- row$se
ci_lc <- row$ci_lower
ci_uc <- row$ci_upper
ci_lr <- ref_est - z * ref_se
ci_ur <- ref_est + z * ref_se
dt <- data.table::data.table(
source = c("causatr", ref_name, "abs diff"),
estimate = c(est_c, ref_est, abs(est_c - ref_est)),
se = c(se_c, ref_se, abs(se_c - ref_se)),
ci_lower = c(ci_lc, ci_lr, abs(ci_lc - ci_lr)),
ci_upper = c(ci_uc, ci_ur, abs(ci_uc - ci_ur))
)
tinytable::tt(dt, caption = label) |>
tinytable::format_tt(
j = 2:5, digits = 4, num_fmt = "decimal"
) |>
tinytable::style_tt(i = 3, italic = TRUE, color = "#888")
}
```
`compare_row()` pulls an estimate from a `causatr_result` (either an
intervention mean or a contrast by name) and pairs it with a reference
estimate + SE under a normal-approximation CI. The output is a
three-row tinytable: causatr, the reference, and the absolute
discrepancy on each column, so agreement (or the lack of it) is
immediately legible.
We use three small simulated DGMs. NHEFS would work too, but
inline simulation keeps the vignette self-contained and cheap to render. The
data-generating mechanisms are described below.
### DGM 1: Point treatment (`simulate_point()`)
A single confounder $L$, binary treatment $A$, and outcome $Y$ (continuous or
binary depending on the `binary_outcome` flag). Both versions share the same
confounding structure: $L$ affects both treatment assignment and the outcome.
**Continuous outcome** (`binary_outcome = FALSE`):
$$
\begin{aligned}
L &\sim \mathcal{N}(0, 1) \\
A &\sim \text{Bernoulli}\bigl(\text{logit}^{-1}(0.4 \, L)\bigr) \\
Y &= 1 + 2 A + 0.5 \, L + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 0.8^2)
\end{aligned}
$$
The structural coefficient on $A$ is 2, so the true ATE is exactly 2.
**Binary outcome** (`binary_outcome = TRUE`):
$$
\begin{aligned}
L &\sim \mathcal{N}(0, 1) \\
A &\sim \text{Bernoulli}\bigl(\text{logit}^{-1}(0.4 \, L)\bigr) \\
Y &\sim \text{Bernoulli}\bigl(\text{logit}^{-1}(-0.5 + 0.8 \, A + 0.3 \, L)\bigr)
\end{aligned}
$$
The conditional odds ratio is $\exp(0.8) \approx 2.23$.
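Because risk differences are non-collapsible over $L$, the marginal risk
difference implied by this DGM is not a simple function of the conditional
coefficients. It is easy to approximate, though, by plugging $A = 1$ and
$A = 0$ into the structural outcome model and averaging over $L$ — a base-R
sketch, independent of causatr:

```{r}
# Monte Carlo approximation of the true marginal risk difference of DGM 1
# (binary outcome): average the counterfactual risks over L ~ N(0, 1).
set.seed(99)
n_mc <- 1e6
L_mc <- rnorm(n_mc)
p1 <- plogis(-0.5 + 0.8 + 0.3 * L_mc)  # risk under A = 1
p0 <- plogis(-0.5 + 0.3 * L_mc)        # risk under A = 0
rd_true <- mean(p1) - mean(p0)
rd_true
```

This is the value the point estimates in section 1b should be close to, up to
sampling noise in the $n = 2000$ analysis sample.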
### DGM 2: Longitudinal binary (`simulate_longitudinal()`)
$K$ periods with binary treatment, a time-varying confounder, and a binary
end-of-follow-up outcome. Treatment--confounder feedback is present: past
treatment shifts the distribution of the next confounder.
$$
\begin{aligned}
L_k &\sim \mathcal{N}(0.3 \, A_{k-1},\; 1) \quad (A_0 \equiv 0) \\
A_k &\sim \text{Bernoulli}\bigl(\text{logit}^{-1}(0.3 \, L_k)\bigr) \\
Y &\sim \text{Bernoulli}\!\left(\text{logit}^{-1}\!\left(-1 + 0.4 \sum_{k=1}^{K} A_k + 0.2 \, L_K\right)\right)
\end{aligned}
$$
The treatment--confounder feedback ($L_k$ depends on $A_{k-1}$) means naive
conditioning on time-varying confounders is biased; g-computation (ICE) or IPW
with sequential weights is needed.
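The true "always vs never" risk difference for this DGM has no closed form,
but under a forced regime $A_k \equiv a$ the structural equations collapse
nicely: only $L_K$ enters $Y$, $\sum_k A_k = Ka$, and for $K \ge 2$,
$L_K \sim \mathcal{N}(0.3a, 1)$. A base-R Monte Carlo sketch (using the
default $K = 3$ from `simulate_longitudinal()` below):

```{r}
# Monte Carlo truth for DGM 2's "always vs never" risk difference.
set.seed(7)
n_mc <- 1e6
K <- 3
risk_forced <- function(a) {
  LK <- rnorm(n_mc, mean = 0.3 * a)          # L_K under A_k == a, K >= 2
  mean(plogis(-1 + 0.4 * K * a + 0.2 * LK))  # E[Y] under the forced regime
}
rd_true_long <- risk_forced(1) - risk_forced(0)
rd_true_long
```

Both estimates in section 4a should land near this value.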
### DGM 3: Longitudinal continuous (`simulate_longitudinal_continuous()`)
$K$ periods with continuous (Gaussian) treatment and a continuous outcome.
The same treatment--confounder feedback structure as DGM 2, but with
continuous variables throughout.
$$
\begin{aligned}
L_k &\sim \mathcal{N}(0.2 \, A_{k-1},\; 1) \quad (A_0 \equiv 0) \\
A_k &\sim \mathcal{N}(0.3 \, L_k,\; 1) \\
Y &= 1 + \sum_{k=1}^{K} A_k + 0.3 \, L_K + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 0.5^2)
\end{aligned}
$$
Under a `shift(+1)` intervention that adds 1 to every period's treatment,
each period contributes 1 unit to the outcome directly. The
treatment--confounder feedback adds a small mediated contribution on top: for
the default $K = 2$, shifting $A_1$ raises $L_2$ by $0.2$, which adds
$0.2 \times 0.3 = 0.06$ through $L_2 \to Y$ and another $0.06$ through
$L_2 \to A_2 \to Y$. The true effect of `shift(+1)` vs the natural value is
therefore $2.12$, slightly above the $K = 2$ direct units.
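Simulating the counterfactual trajectories directly (base R, default $K = 2$)
shows the target value, including the small contribution that the
treatment--confounder feedback adds on top of the $K$ direct units:

```{r}
# E[Y] under the shift(+1) policy vs the natural regime for K = 2.
# Under the policy, each A_k equals its natural draw (given the intervened
# past) plus delta, so the path A_1 -> L_2 -> {Y, A_2} is active.
set.seed(11)
n_mc <- 1e6
sim_mean_Y <- function(delta) {
  L1 <- rnorm(n_mc)
  A1 <- rnorm(n_mc, mean = 0.3 * L1) + delta
  L2 <- rnorm(n_mc) + 0.2 * A1
  A2 <- rnorm(n_mc, mean = 0.3 * L2) + delta
  mean(1 + A1 + A2 + 0.3 * L2)  # E[eps] = 0, so the noise term drops out
}
effect_true <- sim_mean_Y(1) - sim_mean_Y(0)
effect_true
```

Both section 4b estimates should land near this value.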
```{r}
simulate_point <- function(n = 2000, seed = 1L, binary_outcome = FALSE) {
set.seed(seed)
L <- stats::rnorm(n)
A <- stats::rbinom(n, 1L, stats::plogis(0.4 * L))
if (binary_outcome) {
Y <- stats::rbinom(n, 1L, stats::plogis(-0.5 + 0.8 * A + 0.3 * L))
} else {
Y <- 1 + 2 * A + 0.5 * L + stats::rnorm(n, sd = 0.8)
}
data.frame(L = L, A = A, Y = Y)
}
simulate_longitudinal <- function(n = 500L, K = 3L, seed = 2L) {
set.seed(seed)
L <- matrix(0, n, K)
A <- matrix(0L, n, K)
for (k in seq_len(K)) {
L[, k] <- stats::rnorm(n) +
if (k == 1L) 0 else 0.3 * A[, k - 1L]
A[, k] <- stats::rbinom(n, 1L, stats::plogis(0.3 * L[, k]))
}
Y <- stats::rbinom(n, 1L, stats::plogis(-1 + 0.4 * rowSums(A) + 0.2 * L[, K]))
wide <- data.table::data.table(id = seq_len(n))
for (k in seq_len(K)) {
wide[[paste0("L", k)]] <- L[, k]
wide[[paste0("A", k)]] <- A[, k]
}
wide$Y <- Y
wide[]
}
simulate_longitudinal_continuous <- function(n = 400L, K = 2L, seed = 3L) {
set.seed(seed)
L <- matrix(0, n, K)
A <- matrix(0, n, K)
for (k in seq_len(K)) {
L[, k] <- stats::rnorm(n) +
if (k == 1L) 0 else 0.2 * A[, k - 1L]
A[, k] <- stats::rnorm(n, mean = 0.3 * L[, k], sd = 1)
}
Y <- 1 + rowSums(A) + 0.3 * L[, K] + stats::rnorm(n, sd = 0.5)
wide <- data.table::data.table(id = seq_len(n))
for (k in seq_len(K)) {
wide[[paste0("L", k)]] <- L[, k]
wide[[paste0("A", k)]] <- A[, k]
}
wide$Y <- Y
wide[]
}
```
## 1. Point g-computation vs `stdReg2`
`stdReg2::standardize_glm()` implements the parametric g-formula with the same
influence-function sandwich variance that causatr uses in its point-gcomp
branch. Agreement here should be exact up to rounding.
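Before comparing packages, it helps to see the target computation spelled out.
The point estimate both implementations produce is plain regression
standardization — fit the outcome model, predict for everyone under each
treatment level, and average. A base-R sketch (with seed 1 this draws the
same data as `simulate_point(n = 2000, seed = 1, binary_outcome = TRUE)` used
in section 1b; the sandwich SE is what the packages add on top):

```{r}
# Regression standardization (g-formula point estimate) by hand.
set.seed(1)
n <- 2000
L <- rnorm(n)
A <- rbinom(n, 1, plogis(0.4 * L))
Y <- rbinom(n, 1, plogis(-0.5 + 0.8 * A + 0.3 * L))
d <- data.frame(L = L, A = A, Y = Y)
fit_hand <- glm(Y ~ A + L, data = d, family = binomial())
mu1 <- mean(predict(fit_hand, transform(d, A = 1), type = "response"))
mu0 <- mean(predict(fit_hand, transform(d, A = 0), type = "response"))
rd_hand <- mu1 - mu0  # standardized risk difference
rd_hand
```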
### 1a. Continuous outcome, risk difference
```{r}
df_pt <- simulate_point(n = 2000, binary_outcome = FALSE)
fit_gcomp <- causat(
df_pt,
outcome = "Y",
treatment = "A",
confounders = ~L,
estimator = "gcomp"
)
res_gcomp <- contrast(
fit_gcomp,
interventions = list(a1 = static(1), a0 = static(0)),
reference = "a0",
type = "difference",
ci_method = "sandwich"
)
res_gcomp
```
```{r}
#| eval: !expr requireNamespace("stdReg2", quietly = TRUE)
std_fit <- stdReg2::standardize_glm(
Y ~ A + L,
data = df_pt,
values = list(A = c(0, 1)),
contrasts = "difference",
reference = 0
)
std_res <- generics::tidy(std_fit)
std_diff <- std_res[std_res$contrast == "difference" & std_res$A == 1, ]
compare_row(
"ATE (risk difference)", res_gcomp,
name = "a1 vs a0",
ref_name = "stdReg2",
ref_est = std_diff$Estimate,
ref_se = std_diff$Std.Error
)
```
### 1b. Binary outcome, risk difference
```{r}
df_pt_bin <- simulate_point(n = 2000, binary_outcome = TRUE)
fit_bin <- causat(
df_pt_bin,
outcome = "Y",
treatment = "A",
confounders = ~L,
family = binomial(),
estimator = "gcomp"
)
res_bin_rd <- contrast(
fit_bin,
interventions = list(a1 = static(1), a0 = static(0)),
reference = "a0",
type = "difference",
ci_method = "sandwich"
)
res_bin_rd
```
```{r}
#| eval: !expr requireNamespace("stdReg2", quietly = TRUE)
std_bin <- stdReg2::standardize_glm(
Y ~ A + L,
data = df_pt_bin,
family = "binomial",
values = list(A = c(0, 1)),
contrasts = "difference",
reference = 0
)
std_bin_res <- generics::tidy(std_bin)
std_bin_diff <- std_bin_res[std_bin_res$contrast == "difference" & std_bin_res$A == 1, ]
compare_row(
"Risk difference", res_bin_rd,
name = "a1 vs a0",
ref_name = "stdReg2",
ref_est = std_bin_diff$Estimate,
ref_se = std_bin_diff$Std.Error
)
```
### 1c. Binary outcome, risk ratio
`stdReg2` also supports `contrasts = "ratio"` and returns the SE on the
ratio scale using the same influence-function approach causatr uses
internally. Note that causatr's CI is built on the *log* scale,
$\exp\bigl(\log(\widehat{RR}) \pm z \cdot \widehat{SE}/\widehat{RR}\bigr)$,
while `stdReg2`'s CI is linear on the ratio scale, so the CI bounds will
differ slightly even when the point estimate and SE agree to rounding.
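The two constructions from the same (estimate, SE) pair — illustrative
numbers, not outputs from the fits below:

```{r}
# Same ratio estimate and SE, two CI constructions.
est <- 1.8
se  <- 0.15
z   <- qnorm(0.975)
ci_linear <- est + c(-1, 1) * z * se                  # stdReg2 style
ci_log    <- exp(log(est) + c(-1, 1) * z * se / est)  # causatr style:
                                                      # delta method on log(RR)
rbind(linear = ci_linear, log_scale = ci_log)
```

The log-scale interval is asymmetric around the estimate and always stays
positive, which is why the bounds differ even when estimate and SE match.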
```{r}
res_bin_rr <- contrast(
fit_bin,
interventions = list(a1 = static(1), a0 = static(0)),
reference = "a0",
type = "ratio",
ci_method = "sandwich"
)
res_bin_rr
```
```{r}
#| eval: !expr requireNamespace("stdReg2", quietly = TRUE)
std_rr <- stdReg2::standardize_glm(
Y ~ A + L,
data = df_pt_bin,
family = "binomial",
values = list(A = c(0, 1)),
contrasts = "ratio",
reference = 0
)
std_rr_res <- generics::tidy(std_rr)
std_rr_row <- std_rr_res[std_rr_res$contrast == "ratio" & std_rr_res$A == 1, ]
compare_row(
"Risk ratio", res_bin_rr,
name = "a1 vs a0",
ref_name = "stdReg2",
ref_est = std_rr_row$Estimate,
ref_se = std_rr_row$Std.Error
)
```
## 2. IPW vs `WeightIt::glm_weightit`
causatr's IPW Branch A wraps `WeightIt::glm_weightit` for the propensity
fit and applies the Mparts correction to the marginal structural model.
Comparing against `glm_weightit` directly verifies that the bookkeeping
matches: point estimate and SE should agree.
```{r}
fit_ipw <- causat(
df_pt,
outcome = "Y",
treatment = "A",
confounders = ~L,
estimator = "ipw"
)
res_ipw <- contrast(
fit_ipw,
interventions = list(a1 = static(1), a0 = static(0)),
reference = "a0",
type = "difference",
ci_method = "sandwich"
)
res_ipw
```
```{r}
#| eval: !expr requireNamespace("WeightIt", quietly = TRUE)
w_ref <- WeightIt::weightit(
A ~ L,
data = df_pt,
method = "glm",
estimand = "ATE"
)
msm_ref <- WeightIt::glm_weightit(
Y ~ A,
data = df_pt,
weightit = w_ref
)
coef_msm <- stats::coef(msm_ref)
vcov_msm <- stats::vcov(msm_ref)
ref_est <- unname(coef_msm["A"])
ref_se <- sqrt(vcov_msm["A", "A"])
compare_row(
"ATE (IPW)", res_ipw,
name = "a1 vs a0",
ref_name = "WeightIt (glm_weightit)",
ref_est = ref_est,
ref_se = ref_se
)
```
With a saturated `Y ~ A` MSM, the coefficient on `A` *is* the marginal
contrast, so reading it off `glm_weightit` gives the reference directly. If
you wanted to use a richer MSM (e.g. to coarsen a multi-level treatment), you
would standardize predictions from `glm_weightit` via `marginaleffects`
instead — the same logic, just with an extra delta-method step.
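The bookkeeping behind that coefficient can be reproduced in base R: with a
binary treatment, the `A` coefficient of the weighted least-squares fit of
`Y ~ A` is exactly the inverse-propensity-weighted difference of group means.
A sketch on a fresh draw of the same point-treatment DGM (no package code):

```{r}
# IPW ATE by hand: weighted difference of means == weighted MSM coefficient.
set.seed(1)
n <- 2000
L <- rnorm(n)
A <- rbinom(n, 1, plogis(0.4 * L))
Y <- 1 + 2 * A + 0.5 * L + rnorm(n, sd = 0.8)
ps <- fitted(glm(A ~ L, family = binomial()))  # propensity score
w  <- A / ps + (1 - A) / (1 - ps)              # ATE weights
ate_ht  <- weighted.mean(Y[A == 1], w[A == 1]) -
  weighted.mean(Y[A == 0], w[A == 0])
ate_msm <- unname(coef(lm(Y ~ A, weights = w))["A"])
c(ht = ate_ht, msm = ate_msm)  # identical up to numerical precision
```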
## 3. Matching vs `marginaleffects` on `MatchIt`
For matching, the canonical post-match estimator is to refit the outcome on
the matched subset with cluster-robust SE on the subclass. causatr's matching
branch does exactly this internally; `marginaleffects::avg_comparisons()` with
`vcov = ~subclass` reproduces the same construction.
```{r}
fit_match <- causat(
df_pt,
outcome = "Y",
treatment = "A",
confounders = ~L,
estimator = "matching",
estimand = "ATT"
)
res_match <- contrast(
fit_match,
interventions = list(a1 = static(1), a0 = static(0)),
reference = "a0",
type = "difference",
ci_method = "sandwich"
)
res_match
```
```{r}
#| eval: !expr (requireNamespace("MatchIt", quietly = TRUE) && requireNamespace("marginaleffects", quietly = TRUE))
m_ref <- MatchIt::matchit(
A ~ L,
data = df_pt,
method = "nearest",
estimand = "ATT"
)
md_ref <- MatchIt::match.data(m_ref)
lm_ref <- stats::lm(Y ~ A, data = md_ref, weights = weights)
me_match <- marginaleffects::avg_comparisons(
lm_ref,
variables = list(A = c(0, 1)),
vcov = ~subclass,
wts = "weights",
newdata = subset(md_ref, A == 1)
)
compare_row(
"ATT (matching)", res_match,
name = "a1 vs a0",
ref_name = "marginaleffects+MatchIt",
ref_est = me_match$estimate,
ref_se = me_match$std.error
)
```
The `newdata = subset(md_ref, A == 1)` argument tells `marginaleffects` to
average over the treated-subject target population, which is what
`estimand = "ATT"` means.
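The post-match point estimate itself needs nothing beyond a difference of
means on the matched sample. A toy base-R matcher makes that concrete — it
matches with replacement for simplicity, whereas `MatchIt`'s default
`method = "nearest"` matches without replacement and also returns the
`weights` and `subclass` columns that the cluster-robust SE machinery needs:

```{r}
# Toy 1:1 nearest-neighbor match on L (with replacement), ATT by hand.
set.seed(1)
n <- 2000
L <- rnorm(n)
A <- rbinom(n, 1, plogis(0.4 * L))
Y <- 1 + 2 * A + 0.5 * L + rnorm(n, sd = 0.8)
treated  <- which(A == 1)
controls <- which(A == 0)
nearest <- vapply(
  treated,
  function(i) controls[which.min(abs(L[controls] - L[i]))],
  integer(1)
)
att_hand <- mean(Y[treated]) - mean(Y[nearest])  # matched diff in means
att_hand
```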
## 4. Longitudinal ICE vs `lmtp::lmtp_tmle`
`lmtp::lmtp_tmle()` is a TMLE that targets the same parameter as ICE
g-computation but with a different finite-sample bias-variance profile.
We expect agreement on the point estimate up to sampling noise (both methods
are consistent for the same target), but not exact agreement on the SE.
### 4a. Binary treatment, binary outcome, "always" vs "never"
```{r}
wide_bin <- simulate_longitudinal(n = 500L, K = 3L)
long_bin <- to_person_period(
wide_bin,
id = "id",
time_varying = list(L = c("L1", "L2", "L3"), A = c("A1", "A2", "A3")),
time_invariant = "Y"
)
fit_ice <- causat(
long_bin,
outcome = "Y",
treatment = "A",
confounders = ~1,
confounders_tv = ~L,
family = binomial(),
id = "id",
time = "time",
estimator = "gcomp",
type = "longitudinal",
history = 1L
)
res_ice <- contrast(
fit_ice,
interventions = list(always = static(1), never = static(0)),
reference = "never",
type = "difference",
ci_method = "sandwich"
)
res_ice
```
```{r}
#| eval: !expr requireNamespace("lmtp", quietly = TRUE)
set.seed(42)
lmtp_always <- tryCatch(
lmtp::lmtp_tmle(
data = as.data.frame(wide_bin),
trt = c("A1", "A2", "A3"),
outcome = "Y",
time_vary = list("L1", "L2", "L3"),
shift = function(data, trt) rep(1, nrow(data)),
outcome_type = "binomial",
learners_outcome = "SL.glm",
learners_trt = "SL.glm",
folds = 2
),
error = function(e) NULL
)
lmtp_never <- tryCatch(
lmtp::lmtp_tmle(
data = as.data.frame(wide_bin),
trt = c("A1", "A2", "A3"),
outcome = "Y",
time_vary = list("L1", "L2", "L3"),
shift = function(data, trt) rep(0, nrow(data)),
outcome_type = "binomial",
learners_outcome = "SL.glm",
learners_trt = "SL.glm",
folds = 2
),
error = function(e) NULL
)
if (!is.null(lmtp_always) && !is.null(lmtp_never)) {
lmtp_diff <- lmtp::lmtp_contrast(lmtp_always, ref = lmtp_never, type = "additive")
ref_row <- lmtp_diff$estimates
compare_row(
"Always vs never (RD)", res_ice,
name = "always vs never",
ref_name = "lmtp_tmle",
ref_est = ref_row$estimate,
ref_se = ref_row$std.error
)
} else {
message("lmtp fit failed; skipping comparison row.")
}
```
### 4b. Continuous treatment, shift intervention
```{r}
wide_cont <- simulate_longitudinal_continuous(n = 400L, K = 2L)
long_cont <- to_person_period(
wide_cont,
id = "id",
time_varying = list(L = c("L1", "L2"), A = c("A1", "A2")),
time_invariant = "Y"
)
fit_ice_shift <- causat(
long_cont,
outcome = "Y",
treatment = "A",
confounders = ~1,
confounders_tv = ~L,
family = gaussian(),
id = "id",
time = "time",
estimator = "gcomp",
type = "longitudinal",
history = 1L
)
res_ice_shift <- contrast(
fit_ice_shift,
interventions = list(plus1 = shift(1), zero = shift(0)),
reference = "zero",
type = "difference",
ci_method = "sandwich"
)
res_ice_shift
```
```{r}
#| eval: !expr requireNamespace("lmtp", quietly = TRUE)
set.seed(43)
shift_up <- function(data, trt) data[[trt]] + 1
shift_id <- function(data, trt) data[[trt]]
lmtp_up <- tryCatch(
lmtp::lmtp_tmle(
data = as.data.frame(wide_cont),
trt = c("A1", "A2"),
outcome = "Y",
time_vary = list("L1", "L2"),
shift = shift_up,
outcome_type = "continuous",
learners_outcome = "SL.glm",
learners_trt = "SL.glm",
folds = 2
),
error = function(e) NULL
)
lmtp_id <- tryCatch(
lmtp::lmtp_tmle(
data = as.data.frame(wide_cont),
trt = c("A1", "A2"),
outcome = "Y",
time_vary = list("L1", "L2"),
shift = shift_id,
outcome_type = "continuous",
learners_outcome = "SL.glm",
learners_trt = "SL.glm",
folds = 2
),
error = function(e) NULL
)
if (!is.null(lmtp_up) && !is.null(lmtp_id)) {
lmtp_diff <- lmtp::lmtp_contrast(lmtp_up, ref = lmtp_id, type = "additive")
ref_row <- lmtp_diff$estimates
compare_row(
"Shift(+1) vs identity", res_ice_shift,
name = "plus1 vs zero",
ref_name = "lmtp_tmle",
ref_est = ref_row$estimate,
ref_se = ref_row$std.error
)
} else {
message("lmtp fit failed; skipping comparison row.")
}
```
The structural effect of `shift(+1)` on this DGP is approximately 2.12: one
unit per period directly, plus 0.12 through the treatment--confounder
feedback (see DGM 3 above). Both estimators should recover this up to
sampling noise; the two SEs correspond to different estimators (ICE plug-in
vs TMLE) and are expected to differ in the second decimal place.
## 5. Other references
Several additional R packages would extend this validation suite:
- **`gfoRmula`** implements the parametric g-formula on wide longitudinal
data and is the most direct peer of causatr's ICE engine. Its intervention
specification is more verbose (lists of `intvars`, `interventions`,
`int_descript`), but a comparison on static "always vs never" would make
a natural addition to section 4a.
- **`survey::svyglm()`** (combined with a user-supplied weights column) is
the standard reference for survey-weighted IPW. Since causatr validates
survey-weight handling in its tests, a vignette comparison would mostly
duplicate that coverage.
- **Survival validation** will land once `causat_survival()`'s contrast path
is implemented in Phase 6; the natural references are
`survival::survfit()` for the unadjusted Kaplan-Meier sanity check and
`gfoRmula::gformula_survival()` for the adjusted g-formula reference.
If you'd like to contribute a reference comparison for one of these (or for a
package not listed here), open an issue — the pattern is the same as in
sections 1–4: fit the same quantity two ways, pass both to `compare_row()`,
and print the resulting three-row table.