Variance estimation for causal effects

This vignette develops the theory behind variance estimation for causal effect estimators in causatr. The core message is simple: there is one method — the influence function (IF) — applied uniformly to g-computation, IPW, and matching. What varies across methods is only the estimating equations that define the estimator; the variance formula is always the same.

Every non-trivial equation in the sections that follow has a “Verify numerically” tabset alongside it: you can switch from the algebra to a short block of R code that computes the same quantity on a fixed running example, and to the resulting value. Those code blocks are real — the vignette is rendered each time the package is built, so if a formula and its implementation ever drift apart, the table in the “Result” tab will make it obvious.

0. Roadmap

The destination is \(\mathrm{Var}(\hat\mu)\), the sampling variance of a causal estimand like \(\mathbb{E}[Y^a]\) or \(\mathbb{E}[Y^{a_1}] - \mathbb{E}[Y^{a_0}]\). We get there in six steps:

  1. Section 1 — Sandwich variance of \(\hat\beta\). Any estimator defined as the root of \(\sum_i \psi(O_i;\beta) = 0\) has an asymptotic variance of the form \(A^{-1} M A^{-1}/n\) (bread × meat × bread). This is the “variance of the parameters”, not yet of the estimand.
  2. Section 2 — The influence function. The sandwich is equivalent to \(\mathrm{Var}(\hat\beta) = \frac{1}{n^2}\sum_i \mathrm{IF}_i \mathrm{IF}_i^{\mathsf T}\), where \(\mathrm{IF}_i = A^{-1}\psi_i\) is a per-individual “gradient of the estimate with respect to individual \(i\)”. Same number, different parametrisation — but the IF form is the one that extends to causal estimands.
  3. Section 3 — Two-channel IF for \(\hat\mu\). A causal estimand \(\hat\mu = \frac{1}{n}\sum_i m(X_i;\hat\beta)\) mixes two sources of randomness: the random draw of \(X_i\) (Channel 1) and the estimation of \(\hat\beta\) (Channel 2). The correct IF is the sum of both channels. The delta method \(J V_\beta J^{\mathsf T}\) is the Channel-2-only shortcut — exact when predictions are constant per intervention, too small otherwise.
  4. Section 4 — The three estimators. G-computation, IPW, and matching are just three choices of estimating equations plugged into the same machinery. Each gets its own IF, but all three are aggregated the same way.
  5. Section 5 — Chained estimators (ICE). When several nuisance models feed into each other (as in iterated conditional expectation g-computation for longitudinal data), the IF picks up one extra Channel-2 correction per model. We derive these via a forward sensitivity recursion.
  6. Section 6 — Summary. A table mapping each estimator to its estimating equations, its IF, and the line in causatr where that IF is computed.

A good way to read this vignette is: skim the math in a section, then expand the Verify numerically tabset under each key equation to see the formula instantiated on real data and cross-checked against either sandwich::sandwich() or causatr’s own variance_if() engine.

0.1 Running example

Every verification block below uses the same small logistic example. It’s deliberately tiny (so the tables fit in the vignette) but not so tiny that asymptotic approximations fall apart.

Code
library(causatr)
library(sandwich)
library(tinytable)

set.seed(42)
n <- 200
L <- rnorm(n)
A <- rbinom(n, 1, plogis(0.5 * L))
Y <- rbinom(n, 1, plogis(-1 + 0.8 * A + 0.6 * L))
dat <- data.frame(Y = Y, A = A, L = L)

# The outcome model we'll use throughout Sections 1-3.
fit <- glm(Y ~ A + L, data = dat, family = binomial)
beta_hat <- coef(fit)
X   <- model.matrix(fit)           # design matrix, n x p
mu  <- fitted(fit)                 # predicted probabilities
p   <- length(beta_hat)

# A small helper: print two named numeric matrices / vectors side-by-side
# as a tinytable, so that "manual" and "gold-standard" results sit next
# to each other and any disagreement is visible.
verify_tbl <- function(manual, reference, label_m = "manual",
                       label_r = "reference", digits = 4) {
  m <- as.numeric(manual)
  r <- as.numeric(reference)
  out <- data.frame(
    cell      = if (length(m) == 1) "scalar" else seq_along(m),
    `value`   = m,
    reference = r,
    `abs diff` = abs(m - r),
    check.names = FALSE
  )
  names(out)[2] <- label_m
  names(out)[3] <- label_r
  tt(out) |>
    format_tt(j = 2:4, digits = digits, num_fmt = "significant")
}

A quick sanity check: \(\hat\beta\) and the model-based (non-sandwich) fit.

Code
tt(data.frame(coef = names(beta_hat), estimate = round(beta_hat, 4)))
coef estimate
(Intercept) -1.0622
A 0.7498
L 0.8197
flowchart TB
  EE["Complete estimating equations<br/>(nuisance models + estimand)"] -->|"linearise<br/>(Sec 1–2)"| IF["IF_i per individual<br/>= direct + model correction"]
  IF -->|"Var = (1/n²) Σ IF²<br/>(always correct)"| VM["Var(μ̂)"]

  EE -->|"shortcut:<br/>J V_β Jᵀ"| SC{"Predictions<br/>constant?"}
  SC -->|"Yes<br/>(saturated MSM,<br/>static intervention)"| VM
  SC -->|"No<br/>(g-comp,<br/>non-saturated MSM)"| UNDER["Underestimates ❌"]

  style EE fill:#f0f4ff,stroke:#3182BD
  style IF fill:#fff5eb,stroke:#E6550D
  style VM fill:#e8f5e9,stroke:#31A354
  style SC fill:#fff5eb,stroke:#E6550D
  style UNDER fill:#fee,stroke:#c00
Figure 1: The variance estimation pipeline for causal effects. The influence function (IF) is the one method: write down the estimating equations for the full system (nuisance models + target estimand), compute each individual’s IF, then square and sum. The delta method (J V_β J^T) is a shortcut that captures only parameter uncertainty — it’s exact when predictions are constant per intervention, but underestimates otherwise.

1. The sandwich estimator for parametric models

This section derives \(\text{Var}(\hat\beta)\) — the variance of model parameters. This is a building block, not the final answer. Section 3 shows how to get from \(\text{Var}(\hat\beta)\) to \(\text{Var}(\hat\mu)\).

1.1 What is an estimating equation?

The starting point is a simple idea: many statistical estimators can be defined as the solution to an equation of the form

\[ \sum_{i=1}^n \psi(O_i; \beta) = 0 \]

where \(O_i\) is the observed data for individual \(i\), \(\beta\) is the parameter we want to estimate, and \(\psi\) is a function — called the estimating function — that maps each observation and a candidate parameter value to a vector. The estimator \(\hat\beta\) is the value of \(\beta\) that makes this sum equal to zero.

But what is \(\psi\), concretely? And where does it come from?

1.2 Deriving \(\psi\): three routes to the same place

Route 1: Score equations (maximum likelihood)

The most familiar source of \(\psi\) is maximum likelihood. If observations have log-likelihood \(\ell_i(\beta) = \log f(O_i; \beta)\), the MLE maximises \(\sum_i \ell_i(\beta)\). Setting the derivative to zero gives:

\[ \sum_{i=1}^n \underbrace{\frac{\partial \ell_i(\beta)}{\partial \beta}}_ {\psi(O_i;\beta)} = 0 \]

So \(\psi\) is the score function — the gradient of the individual log-likelihood. For a GLM with outcome \(Y_i\), covariates \(X_i\), linear predictor \(\eta_i = X_i^T\beta\), and mean function \(\mu_i = g^{-1}(\eta_i)\):

\[ \psi(O_i; \beta) = \underset{\color{#E6550D}{\text{residual}}}{\color{#E6550D}{(Y_i - \mu_i)}} \;\cdot\; \underset{\color{#3182BD}{\text{scale factor}}} {\color{#3182BD}{\frac{d\mu/d\eta}{V(\mu_i)}}} \;\cdot\; \underset{\color{#31A354}{\text{direction}}}{\color{#31A354}{X_i}} \]

where \(V(\mu_i)\) is the variance function of the exponential family (\(V(\mu) = \mu(1-\mu)\) for binomial, \(V(\mu) = \mu\) for Poisson, \(V(\mu) = 1\) for Gaussian — note that \(\sigma^2\) enters separately as the dispersion parameter \(\phi\), not as \(V(\mu)\)). Reading the three factors:

  • Residual \(\color{#E6550D}{(Y_i - \mu_i)}\): how far the prediction is from the observation. This is zero on average at the truth, which is why \(E[\psi_i] = 0\).
  • Scale factor \(\color{#3182BD}{(d\mu/d\eta) / V(\mu_i)}\): combines the link derivative (converting from \(\eta\)-scale to \(\mu\)-scale) with the variance weight (downweighting noisy observations). For canonical links this product simplifies (see below).
  • Direction \(\color{#31A354}{X_i}\): the covariates tell us which parameters this observation informs. An observation with large \(|X_{ij}|\) provides more information about \(\beta_j\).
NoteExample: logistic regression

For logistic regression, \(\mu_i = \text{expit}(\eta_i)\), so \(d\mu/d\eta = \mu_i(1-\mu_i)\) and \(V(\mu_i) = \mu_i(1 - \mu_i)\). The scale factor cancels to \(1\), giving:

\[\psi(O_i;\beta) = (Y_i - \mu_i) \cdot X_i\]

The estimating equation \(\sum_i (Y_i - \mu_i) X_i = 0\) says: at the MLE, the covariates are uncorrelated with the residuals.
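This zero-sum property can be checked directly on the running example from Section 0.1 — a minimal sketch that re-creates the data and fit locally so it stands alone:

```r
# Sketch: verify that the logistic score equations hold at the MLE.
# Re-creates the running example from Section 0.1 so the block stands alone.
set.seed(42)
n <- 200
L <- rnorm(n)
A <- rbinom(n, 1, plogis(0.5 * L))
Y <- rbinom(n, 1, plogis(-1 + 0.8 * A + 0.6 * L))
fit <- glm(Y ~ A + L, family = binomial)

# sum_i (Y_i - mu_i) X_i should be (numerically) the zero vector:
score_sums <- colSums((Y - fitted(fit)) * model.matrix(fit))
round(score_sums, 8)
```

Each entry is the in-sample correlation between the residuals and one column of the design matrix; all are zero up to IRLS convergence tolerance.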

Route 2: Moment conditions (method of moments)

Not every estimating equation comes from a likelihood. The generalised method of moments (GMM) starts from a theoretical property of the true parameter \(\beta_0\) — a moment condition — and turns it into \(\psi\).

For example, if we know that \(E[X_i(Y_i - X_i^T\beta_0)] = 0\), we can define:

\[ \psi(O_i; \beta) = X_i(Y_i - X_i^T\beta) \]

Setting \(\sum_i \psi = 0\) gives the OLS normal equations. No distributional assumption was needed — just the moment condition \(E[X_i \varepsilon_i] = 0\).
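As a quick illustration (toy data, not the vignette's running example), the moment condition becomes exact in-sample at the OLS solution:

```r
# Sketch of Route 2: the OLS normal equations as an estimating equation,
# on toy data. At beta_ols the in-sample moment condition is exactly zero.
set.seed(1)
n <- 100
x <- cbind(`(Intercept)` = 1, x1 = rnorm(n))
y <- as.numeric(x %*% c(2, -1)) + rnorm(n)
beta_ols <- solve(crossprod(x), crossprod(x, y))

# psi_i = X_i (Y_i - X_i^T beta); summing over i recovers the normal equations.
psi_sum <- colSums(as.numeric(y - x %*% beta_ols) * x)
round(psi_sum, 10)
```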

This is the key insight of M-estimation: the estimating function \(\psi\) does not have to be a score function. It just needs to satisfy \(E[\psi(O_i; \beta_0)] = 0\) at the true parameter value (the unbiasedness condition). This is what guarantees consistency of \(\hat\beta\).

Route 3: Quasi-likelihood

Quasi-likelihood provides a middle ground. We specify only the first two moments of \(Y_i \mid X_i\) — the mean function \(\mu_i = g^{-1}(X_i^T\beta)\) and a variance function \(V(\mu_i)\) — without assuming a full distributional form. The quasi-score is:

\[ \psi(O_i; \beta) = \frac{(Y_i - \mu_i)}{V(\mu_i)} \cdot \frac{d\mu_i}{d\eta_i} \cdot X_i \]

This looks identical to the GLM score, but now \(V(\mu)\) is a working variance — it might be wrong. The sandwich estimator (Section 1.4) protects against this misspecification.

1.3 The general M-estimation framework

All three routes produce estimators of the same form: find \(\hat\beta\) solving \(\sum_i \psi(O_i; \hat\beta) = 0\). The M-estimation framework (Huber, 1964; Stefanski & Boos, 2002) unifies them under two conditions:

  1. Unbiasedness: \(E[\psi(O_i; \beta_0)] = 0\) at the true \(\beta_0\). This guarantees that the estimating equation has the right solution in expectation.
  2. Regularity: \(\psi\) is smooth enough in \(\beta\) for a Taylor expansion.

Given these, \(\hat\beta\) is consistent and asymptotically normal, and its variance has the sandwich form — regardless of whether \(\psi\) came from a likelihood, a moment condition, or a quasi-likelihood.

TipWhy “M-estimation”?

The “M” stands for “maximum” (as in maximum-likelihood), but the framework is much broader. Any estimator defined as the root of a sum of functions of the data and parameters is an M-estimator — this includes MLE, weighted least squares, quantile regression, robust regression (Huber loss), generalised estimating equations, and the causal estimators in Sections 4–5 (Stefanski & Boos, 2002).

1.4 Asymptotic variance of M-estimators

We want \(\text{Var}(\hat\beta)\). The derivation has three steps: linearise \(\hat\beta\) around the truth, identify the variance of the linear approximation, then estimate it from data.

Step 1: Linearise via Taylor expansion

We know that \(\hat\beta\) solves \(\sum_i \psi(O_i; \hat\beta) = 0\). We also know that the true \(\beta_0\) satisfies \(E[\psi(O_i; \beta_0)] = 0\), but the finite-sample sum \(\sum_i \psi(O_i; \beta_0)\) is generally not zero — it is a random quantity that fluctuates around zero. The idea is to relate these two facts via a Taylor expansion.

Expand \(\sum_i \psi(O_i; \hat\beta)\) around \(\beta_0\), keeping only the first-order term (valid for large \(n\) by regularity):

\[ \underbrace{\sum_i \psi(O_i; \hat\beta)}_{= \, 0 \text{ (by definition)}} \approx \sum_i \psi(O_i; \beta_0) + \left[\sum_i \frac{\partial \psi(O_i;\beta_0)} {\partial \beta^T}\right] (\hat\beta - \beta_0) \]

Rearranging for the estimation error \(\hat\beta - \beta_0\):

\[ \hat\beta - \beta_0 \approx -\left[\sum_i \frac{\partial \psi(O_i; \beta_0)}{\partial \beta^T}\right]^{-1} \sum_i \psi(O_i; \beta_0) \]

Dividing numerator and denominator by \(n\) and labelling the two factors:

\[ \underset{p \times 1}{\hat\beta - \beta_0} \;\approx\; \underbrace{\left[-\frac{1}{n}\sum_i \frac{\partial \psi_i} {\partial \beta^T}\right]^{-1}}_{\to\; \color{#3182BD}{A^{-1}} \;(p \times p)} \;\cdot\; \underbrace{\frac{1}{n}\sum_i \psi(O_i; \beta_0)}_{\text{mean zero, covariance } \color{#E6550D}{B/n}} \]

This is the key linearisation: the estimation error is approximately the bread inverse times the average score. The first factor converges to the constant matrix \(\color{#3182BD}{A^{-1}}\); the second is a sum of iid mean-zero random \(p\)-vectors whose covariance we call \(\color{#E6550D}{B}\).

Step 2: Identify the variance

Name the two matrices that appeared in Step 1:

  • Bread \(\color{#3182BD}{A}\): the expected negative Jacobian of the estimating function (\(p \times p\)):

\[ \color{#3182BD}{A} \;\equiv\; -E\!\left[\frac{\partial \psi(O_i; \beta_0)} {\partial \beta^T}\right] \qquad (p \times p) \]

For a GLM, \(\color{#3182BD}{A} = E[w_i X_i X_i^T]\) where \(w_i = (d\mu/d\eta)^2 / V(\mu)\) are the IWLS working weights — essentially the curvature of the (quasi-)log-likelihood.

  • Meat \(\color{#E6550D}{B}\): the covariance of the estimating function (\(p \times p\)):

\[ \color{#E6550D}{B} \;\equiv\; E\!\left[\psi(O_i; \beta_0)\,\psi(O_i; \beta_0)^T\right] \qquad (p \times p) \]

This measures how much the individual scores \(\psi_i\) vary across observations. It appeared in Step 1 as the covariance of the second factor: \(\text{Var}\! \left(\frac{1}{n}\sum_i \psi_i\right) = \frac{1}{n}\color{#E6550D}{B}\).

Now propagate through the linearisation. Taking \(\text{Var}(\cdot)\) of both sides of the Step 1 equation:

\[ \underset{p \times p}{\text{Var}(\hat\beta)} \;\approx\; \underset{p \times p}{\color{#3182BD}{A^{-1}}} \;\cdot\; \underset{p \times p}{\frac{\color{#E6550D}{B}}{n}} \;\cdot\; \underset{p \times p}{\color{#3182BD}{(A^{-1})^T}} \;=\; \frac{1}{n}\, \color{#3182BD}{A^{-1}}\, \color{#E6550D}{B}\, \color{#3182BD}{(A^{-1})^T} \]

Step 3: Estimate from data

Replace the population expectations \(A\) and \(B\) with sample averages:

\[ \color{#3182BD}{\hat{A}} = -\frac{1}{n}\sum_i \frac{\partial \psi(O_i; \hat\beta)}{\partial \beta^T} \qquad (p \times p) \]

\[ \color{#E6550D}{\hat{B}} = \frac{1}{n}\sum_i \psi(O_i; \hat\beta)\,\psi(O_i; \hat\beta)^T \qquad (p \times p) \]

The sandwich estimator is:

\[ \underset{p \times p}{\widehat{\text{Var}}(\hat\beta)} = \frac{1}{n}\; \underset{p \times p}{\color{#3182BD}{\hat{A}^{-1}}} \; \underset{p \times p}{\color{#E6550D}{\hat{B}}} \; \underset{p \times p}{\color{#3182BD}{(\hat{A}^{-1})^T}} \tag{1}\]

so called because the meat \(\color{#E6550D}{\hat B}\) is “sandwiched” between two slices of bread \(\color{#3182BD}{\hat A^{-1}}\).

ImportantWhy is this robust?

The bread \(\color{#3182BD}{A}\) and meat \(\color{#E6550D}{B}\) are estimated separately. Under correct model specification (e.g. the GLM variance function is right), \(A = B\) and the sandwich simplifies to \(\frac{1}{n} A^{-1}\) — the usual model-based variance. But if the variance function is wrong, \(A \neq B\): the bread still captures the curvature correctly (it only depends on the mean model), while the meat captures the actual variability of the scores. The sandwich remains consistent regardless — this is the Huber–White robustness property (Huber, 1967; White, 1982).

\[ \widehat{\text{Var}}(\hat\beta) = \frac{1}{n}\hat A^{-1}\hat B\,\hat A^{-1} \]

with the meat \(\hat B = \frac{1}{n}\sum_i \psi_i\psi_i^{\mathsf T}\) and the bread inverse \(\hat A^{-1}\). For a GLM with canonical logit link, \(\hat A = \frac{1}{n} X^{\mathsf T} W X\) where \(W\) is the diagonal matrix with entries \(\mu_i(1-\mu_i)\) (binomial case).

Code
# Meat: outer product of per-individual score contributions.
psi   <- (Y - mu) * X                       # n x p: (Y-mu) * X
M_hat <- crossprod(psi) / n                 # (1/n) sum psi psi^T, p x p

# Bread^{-1}: for logistic, A_hat = (1/n) X^T W X, with W = diag(mu(1-mu))
W     <- diag(mu * (1 - mu))
A_hat_inv <- solve(crossprod(X, W %*% X) / n)

# Sandwich variance of beta
V_beta_manual <- A_hat_inv %*% M_hat %*% A_hat_inv / n

# Gold standard: sandwich::sandwich(fit)
V_beta_gold <- sandwich::sandwich(fit)
Sandwich $V_\beta$: manual vs `sandwich::sandwich(fit)`
cell manual reference abs diff
(Intercept) × (Intercept) 0.0540815 0.0540815 0.00000000175022669
A × (Intercept) -0.0544496 -0.0544496 0.00000000252000550
L × (Intercept) 0.0009559 0.0009559 0.00000000006972407
(Intercept) × A -0.0544496 -0.0544496 0.00000000252000548
A × A 0.1076365 0.1076365 0.00000000984253996
L × A -0.0081033 -0.0081033 0.00000000556503194
(Intercept) × L 0.0009559 0.0009559 0.00000000006972407
A × L -0.0081033 -0.0081033 0.00000000556503193
L × L 0.0324843 0.0324843 0.00000000941471914

The manual sandwich and sandwich::sandwich(fit) agree to numerical precision (≲ 1e-8), confirming Equation 1 is implemented correctly.

1.5 Pseudo-code: sandwich variance for a GLM

Pseudo-code: sandwich_variance()
sandwich_variance <- function(X, Y, beta_hat, family) {
  n <- length(Y)
  eta <- as.numeric(X %*% beta_hat)  # drop the n x 1 matrix class so elementwise ops broadcast
  mu <- family$linkinv(eta)
  mu_eta <- family$mu.eta(eta)    # dmu/deta
  V <- family$variance(mu)        # Var(Y|X) under the model

  # Bread: A = (1/n) X^T W X where W = diag((dmu/deta)^2 / V(mu))
  W <- mu_eta^2 / V
  A <- (1 / n) * crossprod(X, X * W)

  # Meat: B = (1/n) sum psi_i psi_i^T
  residuals <- Y - mu
  score_weights <- mu_eta / V
  psi <- X * (residuals * score_weights)  # n x p matrix of score contributions
  B <- (1 / n) * crossprod(psi)

  # Sandwich: (1/n) A^{-1} B (A^{-1})^T
  A_inv <- solve(A)
  V_beta <- (1 / n) * A_inv %*% B %*% t(A_inv)
  V_beta
}
TipIn practice

You almost never code this yourself. In R, sandwich::sandwich(model) computes \(\hat A^{-1} \hat B (\hat A^{-1})^T\) from any glm object. The sandwich package (Zeileis, 2004, 2006) provides variants for clustered data (vcovCL()), panel data (vcovPL()), and heteroscedasticity-autocorrelation (vcovHAC()).
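As a usage sketch, the Section 1.5 routine can be run on the running example and cross-checked against the canonical-link shortcut from Section 1.2 (for logit the scale factor cancels to 1). The function is restated compactly here, with an `as.numeric()` guard so the `n × 1` matrix from `X %*% beta_hat` broadcasts correctly:

```r
# Sketch: run the Section 1.5 routine on the running example and check it
# against the canonical-link shortcut psi_i = (Y_i - mu_i) X_i. The same
# matrix can also be compared with sandwich::sandwich(fit).
set.seed(42)
n <- 200
L <- rnorm(n); A <- rbinom(n, 1, plogis(0.5 * L))
Y <- rbinom(n, 1, plogis(-1 + 0.8 * A + 0.6 * L))
fit <- glm(Y ~ A + L, family = binomial)
X <- model.matrix(fit)

# Compact restatement of sandwich_variance() from Section 1.5.
sandwich_variance <- function(X, Y, beta_hat, family) {
  n <- length(Y)
  eta <- as.numeric(X %*% beta_hat)
  mu <- family$linkinv(eta); mu_eta <- family$mu.eta(eta)
  V <- family$variance(mu)
  A <- crossprod(X, X * (mu_eta^2 / V)) / n          # bread
  B <- crossprod(X * ((Y - mu) * mu_eta / V)) / n    # meat
  solve(A) %*% B %*% solve(A) / n
}
V_general <- sandwich_variance(X, Y, coef(fit), binomial())

# Canonical-link shortcut for logit: scale factor cancels, psi = (Y - mu) X.
mu <- fitted(fit)
A_hat <- crossprod(X, X * (mu * (1 - mu))) / n
B_hat <- crossprod((Y - mu) * X) / n
V_canonical <- solve(A_hat) %*% B_hat %*% solve(A_hat) / n

max(abs(V_general - V_canonical))   # agreement to numerical precision
```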

1.6 Concrete example: linear regression

Let us work through the sandwich for OLS step by step.

Model: \(Y_i = X_i^T\beta + \varepsilon_i\) where \(E[\varepsilon_i \mid X_i] = 0\).

Step 1: Write down \(\psi\). The OLS estimator minimises \(\sum_i (Y_i - X_i^T\beta)^2\). Setting the gradient to zero gives \(\sum_i X_i(Y_i - X_i^T\beta) = 0\), so:

\[ \psi(O_i; \beta) = X_i(Y_i - X_i^T\beta) \qquad (p \times 1) \]

At \(\beta_0\): \(E[\psi_i] = E[X_i \varepsilon_i] = 0\) by exogeneity.

Step 2: Compute the bread \(\color{#3182BD}{A}\).

\[ \frac{\partial \psi(O_i; \beta)}{\partial \beta^T} = \frac{\partial}{\partial \beta^T}\left[X_i(Y_i - X_i^T\beta)\right] = -X_i X_i^T \qquad (p \times p) \]

So \(\color{#3182BD}{A} = E[X_i X_i^T]\), estimated by \(\color{#3182BD}{\hat{A}} = \frac{1}{n} X^T X\).

Step 3: Compute the meat \(\color{#E6550D}{B}\). Since \(\psi_i = X_i \hat e_i\), the outer product is \(\psi_i\psi_i^T = \hat e_i^2 X_iX_i^T\):

\[ \color{#E6550D}{\hat{B}} = \frac{1}{n}\sum_i \hat{e}_i^2 \, X_i X_i^T \qquad (p \times p) \]

The meat depends on the squared residuals, not on any assumed variance. This is the source of robustness.

Step 4: Assemble the sandwich.

\[ \underset{p \times p}{\widehat{\text{Var}}(\hat\beta)} = \frac{1}{n}\, \color{#3182BD}{\hat{A}^{-1}}\; \color{#E6550D}{\hat{B}}\; \color{#3182BD}{(\hat{A}^{-1})^T} \]

Substituting:

\[ = \underset{p \times p}{\color{#3182BD}{(X^TX)^{-1}}} \;\; \underset{p \times p}{\color{#E6550D}{\left[\sum_i \hat{e}_i^2\,X_iX_i^T\right]}} \;\; \underset{p \times p}{\color{#3182BD}{(X^TX)^{-1}}} \]

This is the HC0 heteroscedasticity-consistent estimator (White, 1980).

What if homoscedasticity holds? If \(\text{Var}(\varepsilon_i \mid X_i) = \sigma^2\) for all \(i\), then \(E[\hat{e}_i^2 X_i X_i^T] = \sigma^2 E[X_i X_i^T]\), so \(\color{#E6550D}{B} = \sigma^2 \color{#3182BD}{A}\) and the sandwich simplifies to \(\frac{\sigma^2}{n} \color{#3182BD}{A^{-1}} = \hat\sigma^2(X^TX)^{-1}\), the usual OLS variance.
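A short sketch makes the contrast concrete: on toy data with deliberately heteroscedastic errors, the hand-assembled HC0 matrix matches `sandwich::vcovHC(type = "HC0")` exactly, while the classical \(\hat\sigma^2(X^TX)^{-1}\) variance disagrees. The data and model here are illustrative only:

```r
# Sketch: the HC0 sandwich for OLS assembled directly from Steps 1-4,
# checked against sandwich::vcovHC(). Toy data, heteroscedastic by design.
library(sandwich)
set.seed(7)
n  <- 500
x2 <- runif(n)
y  <- 1 + 2 * x2 + rnorm(n, sd = 0.2 + 2 * x2)   # Var(eps | x) grows with x2
fit_lm <- lm(y ~ x2)

Xd <- model.matrix(fit_lm)
e  <- residuals(fit_lm)
XtX_inv <- solve(crossprod(Xd))
# (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}
V_hc0 <- XtX_inv %*% crossprod(Xd * e) %*% XtX_inv

max(abs(V_hc0 - vcovHC(fit_lm, type = "HC0")))   # exact match
```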

2. The influence function

Section 1 gave us \(\text{Var}(\hat\beta)\) as a matrix product (the sandwich). The influence function (IF) gives the same answer but expressed as a per-individual contribution. This is more than a curiosity — IFs are the tool that lets us compute \(\text{Var}(\hat\mu)\) for causal estimands (Section 3) and for multi-model pipelines (Section 5).

2.1 The idea: how much does each observation matter?

Imagine removing observation \(i\) from the dataset and re-estimating \(\hat\beta\). The change \(\hat\beta_{(-i)} - \hat\beta\) tells you how much observation \(i\) “influenced” the estimate. The influence function (IF) is the first-order approximation to this quantity, scaled by \(n\):

\[ \text{IF}_i \approx -n(\hat\beta_{(-i)} - \hat\beta) \]

But we do not need to actually refit the model \(n\) times. For M-estimators, the IF can be computed in closed form.

2.2 Deriving the IF from the linearisation

In Section 1.4, Step 1, we derived the linearisation:

\[ \underset{p \times 1}{\hat\beta - \beta_0} \;\approx\; \frac{1}{n}\sum_{i=1}^n \underset{p \times p}{\color{#3182BD}{A^{-1}}} \;\; \underset{p \times 1}{\psi(O_i; \beta_0)} \]

Each summand \(\color{#3182BD}{A^{-1}}\psi_i\) is the contribution of observation \(i\) to the total estimation error. This is exactly the influence function:

\[ \underset{p \times 1}{\text{IF}_i(\hat\beta)} \;=\; \underset{p \times p}{\color{#3182BD}{A^{-1}}} \;\cdot\; \underset{p \times 1}{\psi(O_i; \hat\beta)} \tag{2}\]

Reading each factor:

  • \(\psi_i\) (\(p \times 1\)): the score from Section 1.2 — how much observation \(i\)’s data “disagrees” with the current \(\hat\beta\).
  • \(\color{#3182BD}{A^{-1}}\) (\(p \times p\)): the bread inverse from Section 1.4 — converts from score-space to parameter-space, accounting for curvature.
  • \(\text{IF}_i\) (\(p \times 1\)): the resulting shift in \(\hat\beta\) attributable to observation \(i\) (times \(n\)).

No new derivation is needed. The IF is the linearisation, read term-by-term.
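Equation 2 can be checked against the brute-force definition from Section 2.1 by actually deleting one observation and refitting — a sketch on the running example (observation 33 is an arbitrary choice):

```r
# Sketch: compare the closed-form IF for one observation against an
# actual leave-one-out refit, on the running example from Section 0.1.
set.seed(42)
n <- 200
L <- rnorm(n); A <- rbinom(n, 1, plogis(0.5 * L))
Y <- rbinom(n, 1, plogis(-1 + 0.8 * A + 0.6 * L))
dat <- data.frame(Y, A, L)
fit <- glm(Y ~ A + L, data = dat, family = binomial)
X <- model.matrix(fit); mu <- fitted(fit)

# Closed form: IF_i = A^{-1} psi_i with psi_i = (Y_i - mu_i) X_i.
A_hat <- crossprod(X, X * (mu * (1 - mu))) / n
IF_33 <- solve(A_hat, (Y[33] - mu[33]) * X[33, ])

# Brute force: drop observation 33 and refit.
fit_m33 <- glm(Y ~ A + L, data = dat[-33, ], family = binomial)
loo_33  <- -n * (coef(fit_m33) - coef(fit))

round(cbind(IF = IF_33, loo = loo_33), 3)   # first-order agreement
```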

2.3 IF = sandwich (the key equivalence)

Property 1: Linearisation. The IF lets us write a complex estimator as a simple average:

\[ \hat\beta - \beta_0 \approx \frac{1}{n}\sum_i \text{IF}_i \]

Property 2: Variance recovery. Since \(E[\text{IF}_i] = 0\):

\[ \text{Var}(\hat\beta) \approx \frac{1}{n^2}\sum_i \text{IF}_i \, \text{IF}_i^T \]

Expanding \(\text{IF}_i \text{IF}_i^T = \color{#3182BD}{A^{-1}} \psi_i \psi_i^T \color{#3182BD}{(A^{-1})^T}\) and factoring:

\[ \frac{1}{n^2}\sum_i \color{#3182BD}{A^{-1}} \psi_i\psi_i^T \color{#3182BD}{(A^{-1})^T} = \frac{1}{n} \color{#3182BD}{A^{-1}} \underbrace{\frac{1}{n}\sum_i \psi_i\psi_i^T}_ {\color{#E6550D}{B}\text{ (the meat)}} \color{#3182BD}{(A^{-1})^T} \]

This is the sandwich from Section 1.4.

NoteThe key equivalence

\[ \underbrace{\frac{1}{n^2}\sum_i \text{IF}_i\,\text{IF}_i^T}_ {\text{IF-based variance}} \;=\; \underbrace{\frac{1}{n}\,\color{#3182BD}{A^{-1}}\,\color{#E6550D}{B}\, \color{#3182BD}{(A^{-1})^T}}_{\text{sandwich variance}} \tag{3}\]

The sandwich operates on matrices. The IF operates on per-individual vectors. Same answer, different representation. We will use the IF representation from here on, because it extends cleanly to multi-step estimators (Sections 3–5).

\[ \mathrm{IF}_i(\hat\beta) = \hat A^{-1}\psi_i, \qquad \widehat{\text{Var}}(\hat\beta) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{IF}_i\,\mathrm{IF}_i^{\mathsf T}. \]

Code
# Per-individual IF for beta (n x p). psi and A_hat_inv come from the
# sandwich verification block above.
IF_beta <- psi %*% t(A_hat_inv)

# Assemble Var(beta) two different ways.
V_beta_via_IF  <- crossprod(IF_beta) / n^2
V_beta_via_sw  <- sandwich::sandwich(fit)
Sandwich $= \tfrac{1}{n^2}\sum_i \mathrm{IF}_i\mathrm{IF}_i^{\mathsf T}$
cell IF (1/n^2 Σ IF²) sandwich abs diff
(Intercept) × (Intercept) 0.0540815 0.0540815 0.00000000175022673
A × (Intercept) -0.0544496 -0.0544496 0.00000000252000553
L × (Intercept) 0.0009559 0.0009559 0.00000000006972406
(Intercept) × A -0.0544496 -0.0544496 0.00000000252000554
A × A 0.1076365 0.1076365 0.00000000984253995
L × A -0.0081033 -0.0081033 0.00000000556503194
(Intercept) × L 0.0009559 0.0009559 0.00000000006972406
A × L -0.0081033 -0.0081033 0.00000000556503194
L × L 0.0324843 0.0324843 0.00000000941471913

The two representations are equal to machine precision (abs diff ≲ 1e-16) — so anywhere we see “sandwich variance”, we can equivalently write “sum of squared influence functions over \(n^2\)”.

2.4 Example: IF for a sample mean

\(\hat\mu = \frac{1}{n}\sum_i Y_i\). Estimating equation: \(\psi(O_i; \mu) = Y_i - \mu\). Bread: \(A = 1\). IF: \(\text{IF}_i = Y_i - \hat\mu\). Variance: \(\frac{1}{n^2}\sum_i(Y_i - \hat\mu)^2 \approx s^2/n\).

\[ \mathrm{IF}_i(\hat\mu) = Y_i - \hat\mu, \qquad \widehat{\mathrm{Var}}(\hat\mu) = \frac{1}{n^2}\sum_{i=1}^n (Y_i - \hat\mu)^2 = \frac{n-1}{n}\cdot\frac{s^2}{n}. \]

The IF-based SE is, up to the \((n-1)/n\) bias factor, just the standard error of the mean from an intro stats textbook.

Code
y   <- dat$Y
mu  <- mean(y)

# IF-based variance: (1/n²) Σ (yᵢ - μ̂)²
var_if <- sum((y - mu)^2) / length(y)^2
se_if  <- sqrt(var_if)

# Textbook: s² / n   (n-1 denominator)
se_textbook <- sqrt(var(y) / length(y))
Sample mean IF variance matches the textbook SE
quantity value
IF-based SE 0.03373
Textbook SE (\(s / \sqrt{n}\)) 0.03381
Ratio \(\sqrt{n/(n-1)}\) 1.00251

The two agree up to the standard \((n-1)/n\) degrees-of-freedom correction — by construction the ratio is exactly \(\sqrt{n/(n-1)}\), which at \(n = 200\) is a negligible 1.00251.

3. Variance of a causal estimand

3.1 The problem

The sandwich (Section 1) gives \(\text{Var}(\hat\beta)\). But the causal estimand is typically a nonlinear function of \(\hat\beta\) and the data. For instance, the g-computation marginal mean under intervention \(a\) is:

\[ \hat\mu_a = \frac{1}{n}\sum_{i=1}^n g^{-1}(X_i^{*T} \hat\beta) \]

where \(X_i^*\) is individual \(i\)’s covariate vector with treatment set to \(a\), and \(g^{-1}(\cdot)\) is the inverse link function.

This quantity depends on two random things:

  1. \(\hat\beta\) — the estimated model parameters (random because of finite data).
  2. \(\{X_i\}\) — the covariate distribution (random because a different sample gives different covariates, hence a different average of predictions).

A complete variance estimator must account for both sources. As we will see, the standard “delta method” shortcut only captures source (1).

3.2 The two-channel influence function

The g-computation marginal mean is a two-step estimator: first estimate \(\hat\beta\), then compute \(\hat\mu_a = \frac{1}{n}\sum_i g^{-1}(X_i^{*T} \hat\beta)\). Individual \(i\) affects \(\hat\mu_a\) through two channels.

Channel 1: Direct contribution (covariate sampling)

Even if \(\hat\beta\) were fixed, observation \(i\)’s prediction \(g^{-1}(X_i^{*T}\hat\beta)\) is one of the \(n\) terms being averaged. A different sample gives a different set of predictions. Individual \(i\)’s direct contribution to the estimation error is:

\[ \text{IF}_i^{\text{direct}} = g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a \qquad (\text{scalar}) \]

This is the same structure as the sample mean IF from Section 2.4 — the deviation of one term from the average.

Channel 2: Model correction (parameter uncertainty)

Observation \(i\) also affects \(\hat\mu_a\) indirectly by shifting \(\hat\beta\). From Section 2.2, \(i\)’s influence on \(\hat\beta\) is approximately \(\frac{1}{n}\color{#3182BD}{A^{-1}}\psi_i\). This shift changes \(\hat\mu_a\) by:

\[ \delta\hat\mu_a \;\approx\; \underset{1 \times p}{\color{#31A354}{J_a}} \;\cdot\; \underset{p \times 1}{\color{#3182BD}{A^{-1}}\psi_i} \]

where \(\color{#31A354}{J_a} = \frac{\partial \hat\mu_a}{\partial \beta^T}\) is the \(1 \times p\) Jacobian — how the marginal mean changes as each coefficient shifts:

\[ \underset{1 \times p}{\color{#31A354}{J_a}} = \frac{1}{n} \sum_{i=1}^n \underbrace{\frac{d\mu}{d\eta}\bigg|_{\eta_i^*}}_{\text{scalar}} \;\cdot\; \underbrace{X_i^{*T}}_{1 \times p} \]

The link derivatives for common GLMs:

Family Link \(d\mu/d\eta\) Effect
Gaussian Identity \(1\) All observations contribute equally
Binomial Logit \(\mu(1 - \mu)\) Observations near \(\mu = 0.5\) contribute most
Poisson Log \(\mu\) Observations with larger means contribute more

Combining both channels

The full IF for the marginal mean \(\hat\mu_a\) is:

\[ \text{IF}_i(\hat\mu_a) = \underbrace{g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a}_ {\text{Channel 1: covariate sampling}} \;+\; \underbrace{ \underset{1 \times p}{\color{#31A354}{J_a}} \;\; \underset{p \times p}{\color{#3182BD}{A^{-1}}} \;\; \underset{p \times 1}{\psi_i} }_{\text{Channel 2: parameter uncertainty}} \tag{4}\]

And the variance is:

\[ \widehat{\text{Var}}(\hat\mu_a) = \frac{1}{n^2}\sum_{i=1}^n \text{IF}_i(\hat\mu_a)^2 \tag{5}\]

For two interventions \(a\) and \(b\), the covariance is:

\[ \widehat{\text{Cov}}(\hat\mu_a, \hat\mu_b) = \frac{1}{n^2}\sum_{i=1}^n \text{IF}_i^{(a)} \cdot \text{IF}_i^{(b)} \]

This gives the full \(k \times k\) covariance matrix needed for contrasts (risk difference, risk ratio, etc.).

NoteWhy the covariance matters

Both \(\hat\mu_1\) and \(\hat\mu_0\) are functions of the same data, so they are correlated. This covariance typically reduces the variance of the difference. Ignoring it (computing variances separately and adding them) would overestimate the SE.

For the g-computation estimate of \(\hat\mu_1 = \mathbb{E}[Y^{A=1}]\) under a logistic outcome model,

\[ \mathrm{IF}_i(\hat\mu_1) = \underbrace{m(X_i^{(1)};\hat\beta) - \hat\mu_1}_{\text{Channel 1}} \;+\; \underbrace{J_1\,\hat A^{-1}\,\psi_i}_{\text{Channel 2}}, \]

with \(m(X_i^{(1)};\hat\beta) = \text{expit}(X_i^{(1)\mathsf T}\hat\beta)\), \(J_1 = \frac{1}{n}\sum_j \mu_1^{(j)}(1-\mu_1^{(j)})\,X_j^{(1)\mathsf T}\) (a \(1 \times p\) row vector, as in Section 3.2), and \(\widehat{\mathrm{Var}}(\hat\mu_1) = \frac{1}{n^2}\sum_i \mathrm{IF}_i^2\).

Code
# Counterfactual design matrix and predictions with A set to 1 for everyone.
X1   <- X
X1[, "A"] <- 1
eta1 <- X1 %*% beta_hat
mu1_i <- as.numeric(plogis(eta1))
mu1_hat <- mean(mu1_i)

# Channel 1: direct covariate-sampling contribution.
C1 <- mu1_i - mu1_hat

# Channel 2: gradient of mu1_hat in beta times IF_beta.
# For logistic, d mu1_i / d beta = mu1_i (1 - mu1_i) * X1_i.
J1 <- colMeans(X1 * (mu1_i * (1 - mu1_i)))     # p-vector
C2 <- as.numeric(IF_beta %*% J1)

IF_mu1 <- C1 + C2
var_mu1_manual <- sum(IF_mu1^2) / n^2
se_mu1_manual  <- sqrt(var_mu1_manual)

# Gold standard: causatr's variance_if() routed through contrast().
fit_c <- causat(dat, outcome = "Y", treatment = "A",
                confounders = ~ L, family = binomial)
res_c <- contrast(
  fit_c,
  interventions = list(a1 = static(1), a0 = static(0)),
  reference    = "a0",
  ci_method    = "sandwich"
)
se_mu1_causatr <- res_c$estimates[intervention == "a1"]$se
mu1_causatr    <- res_c$estimates[intervention == "a1"]$estimate
Two-channel IF for g-comp: manual vs `causatr::variance_if`
quantity manual causatr abs diff
\(\mathbb{E}[Y^{A=1}]\) 0.4278 0.4278 0.000000000000000
\(\mathrm{SE}(\hat\mu_1)\) 0.05039 0.05039 0.000000002311528

The manual two-channel IF reproduces causatr’s SE for \(\hat\mu_1\) to ≲ 1e-9. Both channels are needed — dropping Channel 1 would give a different (smaller) number, as the next tabset shows.
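The covariance formula from Section 3.2 can be exercised end-to-end: compute the IF for both arms, then assemble the risk-difference SE two equivalent ways. A base-R sketch that re-creates the running example (causatr itself is not needed here):

```r
# Sketch: IFs for both interventions and the risk-difference SE via the
# covariance formula from Section 3.2. Re-creates the running example and
# uses only base R.
set.seed(42)
n <- 200
L <- rnorm(n); A <- rbinom(n, 1, plogis(0.5 * L))
Y <- rbinom(n, 1, plogis(-1 + 0.8 * A + 0.6 * L))
fit <- glm(Y ~ A + L, family = binomial)
X <- model.matrix(fit); mu <- fitted(fit); bh <- coef(fit)

A_inv   <- solve(crossprod(X, X * (mu * (1 - mu))) / n)
IF_beta <- ((Y - mu) * X) %*% A_inv                    # n x p

if_arm <- function(a) {
  Xa <- X; Xa[, "A"] <- a
  mua <- plogis(as.numeric(Xa %*% bh))
  Ja  <- colMeans(Xa * (mua * (1 - mua)))              # Jacobian, p-vector
  (mua - mean(mua)) + as.numeric(IF_beta %*% Ja)       # Ch.1 + Ch.2
}
IF1 <- if_arm(1); IF0 <- if_arm(0)

# SE of the risk difference, two equivalent ways:
se_direct <- sqrt(sum((IF1 - IF0)^2) / n^2)            # IF of the contrast
v1  <- sum(IF1^2) / n^2; v0 <- sum(IF0^2) / n^2
c10 <- sum(IF1 * IF0) / n^2
se_via_cov <- sqrt(v1 + v0 - 2 * c10)                  # Var + Var - 2 Cov

c(se_direct = se_direct, se_via_cov = se_via_cov)
```

The two assemblies agree identically, and the \(-2\,\widehat{\mathrm{Cov}}\) term is exactly the correlation correction that the Note above warns against dropping.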

3.3 The delta method shortcut

A widely used shortcut computes the Jacobian \(\color{#31A354}{J}\) and the parameter sandwich \(V_\beta\) separately, then propagates via:

\[ \underset{k \times k}{\text{Var}(\hat\mu)} \;\approx\; \underset{k \times p}{\color{#31A354}{J}} \;\; \underset{p \times p}{V_\beta} \;\; \underset{p \times k}{\color{#31A354}{J^T}} \tag{6}\]

This is the multivariate delta method. To see what it captures, expand the IF-based variance (Equation 5) using the two-channel decomposition (Equation 4):

\[ \frac{1}{n^2}\sum_i \text{IF}_i^2 = \underbrace{\frac{1}{n^2}\sum_i \left(\text{Ch.1}_i\right)^2}_ {\text{covariate sampling}} + \underbrace{\frac{1}{n^2}\sum_i \left(\text{Ch.2}_i\right)^2}_{ = \; \color{#31A354}{J} \, V_\beta \, \color{#31A354}{J^T}} + \underbrace{\frac{2}{n^2}\sum_i \text{Ch.1}_i \cdot \text{Ch.2}_i}_ {\text{cross-term}} \]

The delta method gives only the middle term — Channel 2 squared. It misses the covariate sampling variance (Channel 1 squared) and the cross-term. All three terms are \(O(1/n)\); under a correctly specified model the cross-term vanishes asymptotically (\(\mathbb{E}[\psi_i \mid X_i] = 0\), so the two channels are uncorrelated), and the delta method underestimates by the Channel 1 variance.

\[ \widehat{\mathrm{Var}}(\hat\mu_a) = \underbrace{\tfrac{1}{n^2}\sum_i \mathrm{Ch.1}_i^2}_{\text{full}} + \underbrace{J\,V_\beta\,J^{\mathsf T}}_{\text{delta}} + \underbrace{\tfrac{2}{n^2}\sum_i \mathrm{Ch.1}_i\,\mathrm{Ch.2}_i}_{\text{cross}}. \]

The delta method is the second term alone.

Code
var_full   <- sum(IF_mu1^2) / n^2
var_ch1    <- sum(C1^2)     / n^2
var_ch2    <- sum(C2^2)     / n^2
var_cross  <- 2 * sum(C1 * C2) / n^2
var_delta  <- as.numeric(t(J1) %*% V_beta_gold %*% J1)  # J V_beta J^T

# Ch.2² and J V_beta J^T are algebraically identical; numerically they
# agree up to the bread solve precision.
stopifnot(abs(var_ch2 - var_delta) < 1e-6)
ratio <- sqrt(var_full) / sqrt(var_delta)
Decomposition of $\mathrm{Var}(\hat\mu_1)$: delta captures only the middle row

| term | value |
|---|---|
| Full IF variance \(\mathrm{Ch.1}^2 + \mathrm{Ch.2}^2 + \text{cross}\) | 0.0025392 |
| Channel 1 squared (covariate sampling) | 0.00014651 |
| Channel 2 squared \(=\) delta \(J\,V_\beta\,J^{\mathsf T}\) | 0.00241281 |
| Cross term \(2 \cdot \mathrm{Ch.1} \cdot \mathrm{Ch.2}\) | -0.00002013 |
| SE ratio \(\mathrm{SE}_{\text{full}} / \mathrm{SE}_{\text{delta}}\) | 1.02585667 |

On this example the delta-method SE is about \(1/1.026 \approx 97\%\) of the correct SE — a modest underestimate, because the covariate sampling variance (\(\mathrm{Ch.1}^2\)) happens to be small relative to the parameter uncertainty. Under non-saturated MSMs and models with strong covariate-driven heterogeneity, the ratio can be much larger; Equation 6 is not safe as a default.

Important: The delta method treats covariates as fixed

\(\color{#31A354}{J} V_\beta \color{#31A354}{J^T}\) answers the question: “given these specific covariates \(\{X_i\}\), how does uncertainty in \(\hat\beta\) propagate to \(\hat\mu_a\)?” It conditions on the covariate distribution. But \(\{X_i\}\) are random — a different sample gives different covariates, hence a different average of predictions even for fixed \(\beta\). The full IF captures both sources; the delta method captures only one.
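The distinction can be made concrete by simulation. The sketch below (a hypothetical data-generating process, not the vignette’s running example) re-estimates \(\hat\mu_1\) by g-computation across replications two ways: redrawing only the outcomes with the covariates held fixed — the notion of variability the delta method measures — and redrawing the covariates as well. The second SD should be visibly larger, which is exactly the Channel 1 component.

```r
# Monte Carlo sketch (hypothetical DGP): delta-method variability corresponds
# to redrawing Y with covariates fixed; full variability also redraws X.
set.seed(1)
n <- 300; reps <- 1000

draw_y <- function(A, L) rbinom(length(A), 1, plogis(-2 + A + 2 * L))
mu1_hat <- function(Y, A, L) {
  fit <- glm(Y ~ A + L, family = binomial)
  mean(plogis(cbind(1, 1, L) %*% coef(fit)))  # predictions with A set to 1
}

# (a) Covariates fixed across replications: only outcome noise.
L0 <- rnorm(n); A0 <- rbinom(n, 1, plogis(0.5 * L0))
mu_fixed <- replicate(reps, mu1_hat(draw_y(A0, L0), A0, L0))

# (b) Covariates redrawn each replication: outcome noise + covariate sampling.
mu_random <- replicate(reps, {
  L <- rnorm(n); A <- rbinom(n, 1, plogis(0.5 * L))
  mu1_hat(draw_y(A, L), A, L)
})

c(sd_fixed_X = sd(mu_fixed), sd_random_X = sd(mu_random))
```

The gap between the two SDs is the covariate-sampling contribution that the delta method ignores.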

3.4 When the shortcut is exact

The delta method underestimates whenever Channel 1 is nonzero. Channel 1 is:

\[ \text{Ch.1}_i = g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a \]

This is zero for all \(i\) if and only if every individual gets the same prediction — i.e. \(g^{-1}(X_i^{*T}\hat\beta)\) does not depend on \(i\). This happens when:

  1. The model is saturated in the treatment (e.g. \(Y \sim A\) for binary treatment) and the intervention is static (\(A = a\) for everyone). Then all individuals share the same \(X^*\) (only the intercept and treatment indicator matter), so predictions are identical.

  2. The model has no covariates beyond treatment (e.g. a linear MSM \(Y \sim A\) for continuous treatment) and the intervention is static. Same reasoning: every individual shares the same \(X^*\), so predictions are identical.

The shortcut fails when predictions vary across individuals:

  • G-computation with confounders: \(Y \sim A + L\). Predictions \(g^{-1}(\hat\beta_0 + \hat\beta_1 a + \hat\beta_2 L_i)\) vary with \(L_i\) across individuals. Channel 1 is nonzero.

  • Non-saturated MSM: \(Y \sim A\) for continuous \(A\) is a linear model, not saturated. With non-static interventions (e.g. shift each person’s treatment by \(+\delta\)), different individuals get different \(A\) values, hence different predictions.

| Situation | Predictions constant? | Delta method exact? |
|---|---|---|
| G-comp with confounders (\(Y \sim A + L\)) | No | No — underestimates |
| IPW/matching, binary \(A\), saturated MSM (\(Y \sim A\)), static intervention | Yes | Yes |
| IPW/matching, continuous \(A\), linear MSM (\(Y \sim A\)), static intervention | Yes | Yes |
| IPW, continuous \(A\), non-static intervention (shift, MTP) | No | No — underestimates |
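A quick check of the first two rows on simulated data (not the vignette’s running example): under a treatment-only model every counterfactual design row is identical, so Channel 1 vanishes term by term, while a confounder-adjusted model leaves Channel 1 nonzero.

```r
# Channel 1 under a saturated-in-treatment model vs a confounder-adjusted one
# (simulated data, illustrative only).
set.seed(2)
n <- 200
L <- rnorm(n)
A <- rbinom(n, 1, 0.5)
Y <- rbinom(n, 1, plogis(-0.5 + A + L))

# Y ~ A: every counterfactual row X* is (1, 1), so Channel 1 is identically 0.
fit_sat   <- glm(Y ~ A, family = binomial)
preds_sat <- plogis(cbind(1, 1) %*% coef(fit_sat))[rep(1, n)]
C1_sat    <- preds_sat - mean(preds_sat)

# Y ~ A + L: predictions vary with L_i, so Channel 1 is nonzero.
fit_adj   <- glm(Y ~ A + L, family = binomial)
preds_adj <- as.numeric(plogis(cbind(1, 1, L) %*% coef(fit_adj)))
C1_adj    <- preds_adj - mean(preds_adj)

c(max_abs_C1_saturated = max(abs(C1_sat)),
  max_abs_C1_adjusted  = max(abs(C1_adj)))
```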

3.5 The IF is the stacked sandwich, solved analytically

The two-channel formula (Equation 4) may seem to appear from thin air. In fact it is the sandwich (Equation 1) applied to the complete estimating equations — model score plus mean equation — with the block-triangular bread inverted by back-substitution. This section shows the derivation.

The complete estimating equations

G-computation has two estimating equations per individual:

\[ \underset{(p+1) \times 1}{\Psi(O_i;\,\beta,\,\mu_a)} = \begin{pmatrix} \underset{p \times 1}{\psi_\beta(O_i;\,\beta)} \\[4pt] \underset{1 \times 1}{\psi_\mu(O_i;\,\beta,\,\mu_a)} \end{pmatrix} = \begin{pmatrix} \text{GLM score (Section 1.2)} \\[4pt] g^{-1}(X_i^{*T}\beta) - \mu_a \end{pmatrix} \]

The first block \(\psi_\beta\) defines the model. The second block \(\psi_\mu\) defines the estimand: \(\hat\mu_a\) is the value of \(\mu_a\) that makes \(\sum_i \psi_\mu = 0\), i.e. \(\hat\mu_a = \frac{1}{n}\sum_i g^{-1}(X_i^{*T} \hat\beta)\).

The block-triangular bread

The bread matrix of the stacked system is:

\[ \underset{(p+1) \times (p+1)}{A} = -\frac{1}{n}\sum_i \frac{\partial\Psi_i}{\partial(\beta,\mu_a)^T} = \begin{pmatrix} \underset{p \times p}{\color{#3182BD}{A_{\beta\beta}}} & \underset{p \times 1}{0} \\[6pt] \underset{1 \times p}{\color{#E6550D}{A_{\mu\beta}}} & \underset{1 \times 1}{\color{#E6550D}{A_{\mu\mu}}} \end{pmatrix} \]

where:

  • \(\color{#3182BD}{A_{\beta\beta}} = -\frac{1}{n}\sum_i \frac{\partial\psi_\beta}{\partial\beta^T}\) — the GLM Hessian (Section 1.4).
  • \(\color{#E6550D}{A_{\mu\beta}} = -\frac{1}{n}\sum_i \frac{\partial\psi_\mu}{\partial\beta^T} = -\frac{1}{n}\sum_i \frac{d\mu}{d\eta}\big|_{\eta_i^*} X_i^{*T} = -\color{#31A354}{J_a}\) — the negative Jacobian from Section 3.2.
  • \(\color{#E6550D}{A_{\mu\mu}} = -\frac{\partial\psi_\mu}{\partial\mu_a} = 1\).
  • Upper-right = \(0\): the GLM score does not depend on \(\mu_a\).

The block-triangular structure arises because \(\psi_\beta\) depends only on \(\beta\) (not on \(\mu_a\)), while \(\psi_\mu\) depends on both \(\beta\) and \(\mu_a\).

Inverting by back-substitution

For a \(2 \times 2\) block-lower-triangular matrix, the inverse is:

\[ A^{-1} = \begin{pmatrix} \color{#3182BD}{A_{\beta\beta}^{-1}} & 0 \\[4pt] -A_{\mu\mu}^{-1} \, \color{#E6550D}{A_{\mu\beta}} \, \color{#3182BD}{A_{\beta\beta}^{-1}} & A_{\mu\mu}^{-1} \end{pmatrix} \]
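As a sanity check of the block formula itself — on random matrices, not the running example — assembling the inverse from the blocks should agree with a direct `solve()` of the full matrix:

```r
# Numeric check of the 2 x 2 block-lower-triangular inverse formula
# on random matrices (illustrative only).
set.seed(3)
p <- 3
A_bb <- crossprod(matrix(rnorm(p * p), p)) + diag(p)  # A_{beta beta}, p x p, PD
A_mb <- matrix(rnorm(p), 1)                           # A_{mu beta},   1 x p
A_mm <- matrix(1, 1, 1)                               # A_{mu mu},     1 x 1

A <- rbind(cbind(A_bb, matrix(0, p, 1)),
           cbind(A_mb, A_mm))

# Inverse assembled block by block, as in the display above.
A_inv_blocks <- rbind(
  cbind(solve(A_bb), matrix(0, p, 1)),
  cbind(-solve(A_mm) %*% A_mb %*% solve(A_bb), solve(A_mm))
)

max(abs(A_inv_blocks - solve(A)))  # should be at machine precision
```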

The IF for the full parameter \((\beta, \mu_a)\) is \(A^{-1}\Psi_i\). Extracting the \(\mu_a\) row (the last row of \(A^{-1}\) times \(\Psi_i\)):

\[ \text{IF}_i(\hat\mu_a) = \underbrace{A_{\mu\mu}^{-1}\;\psi_{\mu,i}}_{\text{Channel 1}} \;+\; \underbrace{(-A_{\mu\mu}^{-1})\; \color{#E6550D}{A_{\mu\beta}}\; \color{#3182BD}{A_{\beta\beta}^{-1}}\; \psi_{\beta,i}}_{\text{Channel 2}} \]

Substituting \(A_{\mu\mu} = 1\), \(A_{\mu\beta} = -\color{#31A354}{J_a}\), and \(\psi_{\mu,i} = g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a\):

\[ \text{IF}_i(\hat\mu_a) = \underbrace{g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a}_ {\text{Channel 1}} \;+\; \underbrace{ \color{#31A354}{J_a}\; \color{#3182BD}{A_{\beta\beta}^{-1}}\; \psi_{\beta,i}}_{\text{Channel 2}} \]

This is exactly Equation 4. The bread \(\color{#3182BD}{A_{\beta\beta}^{-1}}\) is model_bread_inv() = \((X^TWX)^{-1}\). The score \(\psi_{\beta,i}\) is \(r_i \cdot (d\mu/d\eta) / V(\mu_i) \cdot X_i\). The Jacobian \(\color{#31A354}{J_a}\) projects the \(p\)-vector result to a scalar.

Note: The meat is hidden, not absent

The IF formula computes \(\color{#3182BD}{A^{-1}}\psi_i\) for each individual, then squares: \(\text{Var} = \frac{1}{n^2}\sum_i \text{IF}_i^2\). Expanding:

\[ \frac{1}{n^2}\sum_i \text{IF}_i^2 = \frac{1}{n}\;\color{#3182BD}{A^{-1}} \underbrace{\frac{1}{n}\sum_i \Psi_i \Psi_i^T}_ {\color{#E6550D}{B} \text{ (the meat)}} \color{#3182BD}{(A^{-1})^T} \]

This is the sandwich (Equation 1) for the stacked system. The IF avoids forming the \((p+1) \times (p+1)\) meat matrix \(\color{#E6550D}{B}\) — it computes the same answer by summing scalar products.
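The identity is easy to confirm on a self-contained example. The block below fits a plain OLS model by hand (not the vignette’s running data) and checks that summing the squared per-individual IFs gives the same matrix as the explicit bread–meat–bread product:

```r
# IF-squared vs explicit sandwich on an OLS fit (self-contained example).
set.seed(4)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)
beta_hat <- solve(crossprod(X), crossprod(X, y))
r <- as.numeric(y - X %*% beta_hat)

psi   <- X * r                        # n x p matrix of scores psi_i
A_inv <- solve(crossprod(X) / n)      # bread inverse (symmetric here)
IF    <- psi %*% A_inv                # row i is IF_i' = psi_i' A^{-1}

V_if <- crossprod(IF) / n^2           # (1/n^2) sum_i IF_i IF_i'
B    <- crossprod(psi) / n            # meat
V_sw <- A_inv %*% B %*% A_inv / n     # sandwich

max(abs(V_if - V_sw))                 # identical up to float rounding
```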

3.6 Target subpopulations (ATT, ATU, custom subset)

Sections 3.1–3.5 define the marginal mean as an average over the full sample: \(\hat\mu_a = (1/n)\sum_{i=1}^n g^{-1}(X_i^{*T}\hat\beta)\). Many causal estimands — the ATT, the ATU, or an estimand on a user-supplied subset — instead average only over a target subpopulation \(T \subseteq \{1,\ldots,n\}\) of size \(n_t\):

\[ \hat\mu_a \;=\; \frac{1}{n_t}\sum_{i\in T} g^{-1}(X_i^{*T}\hat\beta) \]

The stacked estimating equations need one small change to accommodate this: the mean equation is now indicator-weighted. Let \(t_i = \mathbb{1}[i \in T]\) and write

\[ \psi_{\mu,i}(\beta,\mu_a) \;=\; t_i \cdot \bigl[g^{-1}(X_i^{*T}\beta) - \mu_a\bigr] \]

so that \(\sum_i \psi_{\mu,i} = 0\) still recovers \(\hat\mu_a\).

Redoing the bread calculation from Section 3.5 with this indicator-weighted mean equation:

  • \(\color{#E6550D}{A_{\mu\mu}} = -\frac{1}{n}\sum_i \frac{\partial\psi_{\mu,i}}{\partial\mu_a} = -\frac{1}{n}\sum_i(-t_i) = \frac{n_t}{n}\), so \(A_{\mu\mu}^{-1} = \frac{n}{n_t}\).
  • \(\color{#E6550D}{A_{\mu\beta}} = -\frac{1}{n}\sum_i t_i \cdot \frac{d\mu}{d\eta}\Big|_{\eta_i^*} X_i^{*T} = -\frac{n_t}{n}\,\color{#31A354}{J_a}\), where \(\color{#31A354}{J_a} = \frac{1}{n_t}\sum_{i\in T}\frac{d\mu}{d\eta}\big|_{\eta_i^*} X_i^{*T}\) is the Jacobian averaged over the target.

Plugging into the back-substitution formula for the last row of \(A^{-1}\):

\[ \text{IF}_i(\hat\mu_a) \;=\; -A_{\mu\mu}^{-1}\,\color{#E6550D}{A_{\mu\beta}}\,\color{#3182BD}{A_{\beta\beta}^{-1}}\,\psi_{\beta,i} \;+\; A_{\mu\mu}^{-1}\,\psi_{\mu,i} \]

Substituting \(A_{\mu\mu}^{-1} = n/n_t\) and \(A_{\mu\beta} = -(n_t/n)\,\color{#31A354}{J_a}\):

\[ \text{IF}_i(\hat\mu_a) \;=\; \underbrace{\frac{n}{n_t}\cdot t_i \cdot (g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a)}_ {\text{Channel 1 (rescaled)}} \;+\; \underbrace{\color{#31A354}{J_a}\,\color{#3182BD}{A_{\beta\beta}^{-1}}\,\psi_{\beta,i}}_ {\text{Channel 2 (unchanged)}} \tag{7}\]

Two things to notice:

  1. Channel 1 picks up a factor of \(n/n_t\) and is nonzero only for individuals inside the target. Without this factor the Channel 1 contribution to the variance is too small by a factor of \((n_t/n)^2\) — a serious undercount when the target is a proper subset (e.g., the ATT with only ~40% treated).
  2. Channel 2 is unchanged. The \(n_t/n\) in \(A_{\mu\beta}\) cancels with the \(n/n_t\) in \(A_{\mu\mu}^{-1}\), so the Channel 2 formula from Section 3.2 carries over verbatim — \(\color{#31A354}{J_a}\) is just averaged over the target population instead of the full sample, and \(\psi_{\beta,i}\) still runs over every individual the outcome model used (inside and outside the target), because the outcome model was fit on all those rows.

The variance is still \(\widehat{\text{Var}}(\hat\mu_a) = (1/n^2)\sum_{i=1}^n \text{IF}_i^2\), summed over all \(n\) individuals. Individuals outside the target have \(\text{Ch.1}_i = 0\) and contribute only through Channel 2.

Setting \(T = \{1,\ldots,n\}\) (so \(n_t = n\), \(t_i \equiv 1\)) recovers Equation 4 exactly. The full-sample case is the special case of Equation 7 where the scaling factor is \(1\).
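This reduction is easy to see in code. The sketch below (simulated data, not the running example) computes Equation 7’s Channel 1 for an ATT-style target (treated rows only) and for \(T\) equal to the full sample, and confirms that the latter collapses to the plain centred predictions of Equation 4:

```r
# Channel 1 of Equation 7, ATT-style target vs full-sample target
# (simulated data, illustrative only).
set.seed(5)
n <- 200
L <- rnorm(n)
A <- rbinom(n, 1, plogis(L))
Y <- rbinom(n, 1, plogis(-1 + A + L))
fit   <- glm(Y ~ A + L, family = binomial)
preds <- as.numeric(plogis(cbind(1, 1, L) %*% coef(fit)))  # A set to 1

ch1_target <- function(target) {
  t_i  <- as.numeric(seq_len(n) %in% target)
  n_t  <- length(target)
  mu_a <- mean(preds[target])
  (n / n_t) * t_i * (preds - mu_a)   # Equation 7's Channel 1
}

ch1_full <- ch1_target(seq_len(n))      # T = {1, ..., n}
ch1_att  <- ch1_target(which(A == 1))   # T = treated only

# Full-sample case reduces to Equation 4: plain centred predictions.
max(abs(ch1_full - (preds - mean(preds))))
# ATT case: exactly zero outside the target, rescaled by n/n_t inside.
range(ch1_att[A == 0])
```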

3.7 Pseudo-code: the full IF for a marginal mean

This incorporates the target subpopulation scaling from Section 3.6 and the correct non-canonical link handling for the score (multiply response residuals by \(\mu_\eta/V\); see Section 1.2 route 3).

Pseudo-code: gcomp_influence()
gcomp_influence <- function(model, data_a, target_idx) {
  n <- nrow(data_a)
  beta_hat <- coef(model)
  pred_terms <- delete.response(terms(model))
  family <- model$family

  # Predictions under intervention (target rows only)
  X_star <- model.matrix(pred_terms, data = data_a[target_idx, ])
  eta_star <- as.numeric(X_star %*% beta_hat)
  preds <- family$linkinv(eta_star)
  mu_hat <- mean(preds)
  n_t <- length(preds)

  # Channel 1: (n/n_t) * t_i * (pred_i - mu_hat), zero outside target
  IF_direct <- rep(0, n)
  IF_direct[target_idx] <- preds - mu_hat

  # Channel 2 gradient: J_a = (1/n_t) sum_{target} (dmu/deta) * X_star
  mu_eta_star <- family$mu.eta(eta_star)
  J <- as.numeric(crossprod(X_star, mu_eta_star)) / n_t  # p x 1

  # Bread inverse (X'WX)^{-1} with IWLS working weights
  X_fit <- model.matrix(model)
  eta_fit <- model$linear.predictors
  mu_eta_fit <- family$mu.eta(eta_fit)
  V_fit <- family$variance(fitted(model))
  W <- mu_eta_fit^2 / V_fit
  XtWX_inv <- solve(crossprod(X_fit, X_fit * W))
  h <- as.numeric(XtWX_inv %*% J)  # p x 1

  # Per-observation directional sensitivity and SCORE residual.
  # Score residual = response residual * (dmu/deta) / V(mu), so that
  # psi_i = X_i * r_i * (mu_eta/V). For canonical links mu_eta/V = 1
  # and this reduces to the response residual.
  d_fit <- as.numeric(X_fit %*% h)
  r_response <- residuals(model, type = "response")
  r_score <- r_response * mu_eta_fit / V_fit
  IF_correction <- d_fit * r_score                         # n_fit x 1

  # Combine: full IF = (n/n_t) * Ch.1 + n * Ch.2
  # Factor of n on Ch.2 from A_{beta,beta}^{-1} = n * (X'WX)^{-1}
  # Factor of (n/n_t) on Ch.1 from A_{mu,mu}^{-1} = n/n_t (Sec. 3.6)
  IF <- (n / n_t) * IF_direct
  IF[seq_along(IF_correction)] <- IF[seq_along(IF_correction)] +
    n * IF_correction

  IF
}

4. The three causal estimators

This section applies the IF framework from Section 3 to g-computation, IPW, and matching. The variance formula is always the same (Equation 5); what differs is the estimating equations and whether Channel 1 is zero.

4.1 G-computation

Estimating equations. G-computation fits an outcome model \(E[Y_i \mid A_i, L_i] = g^{-1}(X_i^T \beta)\) and computes \(\hat\mu_a = \frac{1}{n}\sum_i g^{-1}(X_i^{*T}\hat\beta)\) where \(X_i^*\) sets treatment to \(a\). The complete estimating equations are:

\[ \begin{pmatrix} \psi_\beta(O_i; \beta) \\[4pt] \psi_\mu(O_i; \beta, \mu_a) \end{pmatrix} = \begin{pmatrix} \text{GLM score for outcome model} \\[4pt] g^{-1}(X_i^{*T}\beta) - \mu_a \end{pmatrix} \]

IF. From Section 3.2 (Equation 4):

\[ \text{IF}_i(\hat\mu_a) = \underbrace{g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a}_ {\text{Channel 1 (nonzero)}} \;+\; \underbrace{ \color{#31A354}{J_a} \color{#3182BD}{A^{-1}} \psi_i }_{\text{Channel 2}} \]

Both channels are needed because predictions \(g^{-1}(\hat\beta_0 + \hat\beta_1 a + \hat\beta_2 L_i)\) vary across individuals through \(L_i\). The delta method shortcut \(\color{#31A354}{J} V_\beta \color{#31A354}{J^T}\) underestimates.

Variance: \(\frac{1}{n^2}\sum_i \text{IF}_i^2\) using the full two-channel IF.

4.2 Inverse probability weighting (IPW)

IPW is more complex because two models are involved, and the uncertainty from the first (propensity scores) must propagate into the second (MSM).

Tip: Companion vignette

The density-ratio weight formulas (Horvitz–Thompson indicator, pushforward with sign/Jacobian, IPSI closed form), the make_weight_fn() closure design, and the numerical \(A_{\beta\alpha}\) verification are developed in vignette("ipw-variance-theory").

The two-model setup

  1. Propensity score model: estimate \(\hat\alpha\) (\(q\)-vector) from \(\text{logit}[P(A_i = 1 \mid L_i)] = L_i^T \alpha\).
  2. Weighted outcome model (MSM): using the estimated weights \(w_i = 1/\hat{P}(A_i \mid L_i; \hat\alpha)\), fit \(E[Y_i \mid A_i] = g^{-1}(\beta_0 + \beta_1 A_i)\) by weighted regression to get \(\hat\beta\) (\(p\)-vector).

The weights \(w_i\) depend on \(\hat\alpha\), so \(\hat\beta\) depends on \(\hat\alpha\).

Why the naive approach fails

A naive sandwich on the outcome model alone — treating the weights as known constants — ignores the variability in \(\hat\alpha\). Each new sample produces different \(\hat\alpha\), hence different weights, hence different \(\hat\beta\). The true variability of \(\hat\beta\) has two sources:

  • Outcome variability: randomness in \(Y_i\) given fixed weights.
  • Weight variability: randomness in \(\hat\alpha\) (and hence \(w_i\)).

The naive sandwich captures only the first. The direction of the bias depends on the estimand: for the ATE, using estimated (rather than known) propensity scores actually reduces the asymptotic variance, so the naive sandwich is conservative — it overestimates the SE (Hirano et al., 2003; Lunceford & Davidian, 2004). For the ATT, the effect can go either direction.
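The counter-intuitive direction of this bias for the ATE can be checked by simulation. The sketch below (a hypothetical DGP, unrelated to the running example) computes the Horvitz–Thompson mean twice per replication — once with the true propensity \(e(L)\) and once with a freshly estimated \(\hat e(L)\) — and compares the empirical SDs:

```r
# Monte Carlo sketch: HT weighting with *estimated* propensity scores has
# smaller sampling variance than with the *true* ones (hypothetical DGP).
set.seed(6)
n <- 500; reps <- 2000

one_rep <- function() {
  L <- rnorm(n)
  e <- plogis(0.5 * L)
  A <- rbinom(n, 1, e)
  Y <- 1 + A + 2 * L + rnorm(n)
  e_hat <- fitted(glm(A ~ L, family = binomial))
  c(true_ps = mean(A * Y / e),      # weights from the true e(L)
    est_ps  = mean(A * Y / e_hat))  # weights from the fitted model
}

sims <- replicate(reps, one_rep())
apply(sims, 1, sd)                  # SD with estimated PS is the smaller one
```

A naive sandwich that plugs in the estimated weights as if they were the true ones targets the first (larger) SD, which is why it is conservative here.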

The correct approach: joint M-estimation

Treat both models together as a single M-estimation problem. Define the joint parameter \(\theta = (\alpha, \beta)\) — a \((q+p)\)-vector — and the stacked estimating function:

\[ \underset{(q+p) \times 1}{\Psi(O_i; \theta)} = \begin{pmatrix} \color{#3182BD}{\psi_\alpha(O_i; \alpha)} \\[4pt] \color{#E6550D}{\psi_\beta(O_i; \alpha, \beta)} \end{pmatrix} \]

where \(\color{#3182BD}{\psi_\alpha}\) (\(q \times 1\)) is the score of the propensity model, and \(\color{#E6550D}{\psi_\beta}\) (\(p \times 1\)) is the weighted score of the outcome model (which depends on \(\alpha\) through \(w_i\)).

The bread matrix \(\hat A = -\frac{1}{n}\sum_i \frac{\partial\Psi_i}{\partial\theta^T}\) has a block structure:

\[ \underset{(q+p) \times (q+p)}{\hat{A}} = \begin{pmatrix} \underset{q \times q}{\color{#3182BD}{A_{\alpha\alpha}}} & \underset{q \times p}{0} \\[6pt] \underset{p \times q}{\color{#E6550D}{A_{\beta\alpha}}} & \underset{p \times p}{\color{#E6550D}{A_{\beta\beta}}} \end{pmatrix} \]

Each block has a concrete meaning:

  • \(\color{#3182BD}{A_{\alpha\alpha}}\) (\(q \times q\)): the Hessian of the propensity model — how \(\psi_\alpha\) changes with \(\alpha\).
  • \(\color{#E6550D}{A_{\beta\beta}}\) (\(p \times p\)): the weighted Hessian of the outcome model — how \(\psi_\beta\) changes with \(\beta\).
  • \(\color{#E6550D}{A_{\beta\alpha}}\) (\(p \times q\)): the cross-term — how the weighted outcome score changes with \(\alpha\). This involves \(\frac{\partial w_i}{\partial\alpha^T}\) (how the weights shift when the propensity parameters change). This is the block that captures the dependency.
  • Upper-right = \(0\) (\(q \times p\)): the propensity score does not depend on \(\beta\).

The full sandwich \(\frac{1}{n}\hat A^{-1}\hat B(\hat A^{-1})^T\) applied to the stacked system correctly propagates weight uncertainty into \(\text{Var}(\hat\beta)\) through the \(A_{\beta\alpha}\) block.
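The mechanics can be sketched end-to-end on the simplest possible IPW stack: \(\theta = (\alpha, \mu)\), with \(\psi_\alpha\) the logistic propensity score and \(\psi_\mu = A_i(Y_i - \mu)/e(L_i;\alpha)\) in place of a full MSM. Everything below is an illustrative hand computation on simulated data, not causatr internals; the adjusted score subtracts \(A_{\mu\alpha}A_{\alpha\alpha}^{-1}\psi_{\alpha,i}\), with the blocks derived by differentiating \(\psi_\mu\) in \(\alpha\) and \(\mu\):

```r
# Stacked sandwich for theta = (alpha, mu): hand-built blocks, simulated data.
set.seed(7)
n <- 1000
L <- rnorm(n)
A <- rbinom(n, 1, plogis(0.5 * L))
Y <- 1 + A + 2 * L + rnorm(n)

ps <- glm(A ~ L, family = binomial)
e  <- fitted(ps)
Lm <- model.matrix(ps)                    # (1, L)
mu <- sum(A * Y / e) / sum(A / e)         # root of sum_i A(Y - mu)/e = 0

psi_alpha <- Lm * (A - e)                 # n x 2 propensity scores
psi_mu    <- A * (Y - mu) / e             # n x 1 mean-equation scores

A_aa <- crossprod(Lm, Lm * (e * (1 - e))) / n          # propensity Hessian
A_mm <- mean(A / e)                                    # -d psi_mu / d mu
A_ma <- colMeans(Lm * (A * (Y - mu) * (1 - e) / e))    # -d psi_mu / d alpha'

IF_naive <- psi_mu / A_mm                              # weights treated as known
IF_corr  <- (psi_mu - as.numeric(psi_alpha %*% solve(A_aa, A_ma))) / A_mm

ses <- c(se_naive     = sqrt(sum(IF_naive^2)) / n,
         se_corrected = sqrt(sum(IF_corr^2)) / n)
ses
```

The corrected SE is the smaller of the two on this ATE-type mean, matching the Hirano et al. result quoted above.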

Tip: In practice

causatr’s self-contained IPW engine computes this stacked sandwich directly via compute_ipw_if_self_contained_one(), which uses numDeriv::jacobian() on the make_weight_fn() closure to evaluate the \(A_{\beta\alpha}\) block numerically. See vignette("ipw-variance-theory") for the closure design and verification.

The full IF for \(\hat\mu_a\)

The stacked sandwich above gives \(V_\beta\) — the variance of the MSM parameters \(\hat\beta\), accounting for weight uncertainty. But we need \(\text{Var}(\hat\mu_a)\), not \(\text{Var}(\hat\beta)\). Following the same approach as Section 3.5, we add the mean equation to the stack:

\[ \underset{(q+p+1) \times 1}{\Psi(O_i;\,\alpha,\,\beta,\,\mu_a)} = \begin{pmatrix} \color{#3182BD}{\psi_\alpha(O_i;\,\alpha)} \\[4pt] \color{#E6550D}{\psi_\beta(O_i;\,\alpha,\,\beta)} \\[4pt] \psi_\mu(O_i;\,\beta,\,\mu_a) \end{pmatrix} \]

The bread is now \(3 \times 3\) block-lower-triangular:

\[ A = \begin{pmatrix} \color{#3182BD}{A_{\alpha\alpha}} & 0 & 0 \\[4pt] \color{#E6550D}{A_{\beta\alpha}} & \color{#E6550D}{A_{\beta\beta}} & 0 \\[4pt] 0 & A_{\mu\beta} & A_{\mu\mu} \end{pmatrix} \]

where \(A_{\mu\beta} = -\color{#31A354}{J}\) (negative Jacobian of predictions w.r.t. \(\beta\)) and \(A_{\mu\mu} = 1\) (same as Section 3.5). The zero in the \((\mu, \alpha)\) position arises because the mean equation \(g^{-1}(X_i^{*T}\beta) - \mu_a\) does not depend on \(\alpha\) directly — it depends on \(\alpha\) only through \(\beta\).

Inverting by back-substitution and extracting the \(\mu_a\) row gives:

\[ \text{IF}_i(\hat\mu_a) = \underbrace{g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a}_ {\text{Channel 1 (direct)}} + \underbrace{\color{#31A354}{J}\; \color{#E6550D}{A_{\beta\beta}^{-1}}\; \psi_{\beta,i}}_{\text{MSM correction}} - \underbrace{\color{#31A354}{J}\; \color{#E6550D}{A_{\beta\beta}^{-1}}\; \color{#E6550D}{A_{\beta\alpha}}\; \color{#3182BD}{A_{\alpha\alpha}^{-1}}\; \psi_{\alpha,i}}_{\text{propensity correction}} \tag{8}\]

Reading each term:

  • Channel 1: how individual \(i\)’s prediction deviates from the mean (same as g-comp). Zero when predictions are constant (saturated MSM + static intervention); nonzero for non-static interventions or non-saturated MSMs.
  • MSM correction: how \(i\)’s outcome residual (captured in \(\psi_{\beta,i}\)) propagates to \(\hat\mu_a\) through the MSM parameters. Same structure as the g-comp Channel 2.
  • Propensity correction: how \(i\)’s propensity score residual (captured in \(\psi_{\alpha,i}\)) propagates to \(\hat\mu_a\) through the weights. The chain is: \(\psi_{\alpha,i} \to \hat\alpha \to w_i \to \hat\beta \to \hat\mu_a\). The matrix product \(\color{#E6550D}{A_{\beta\alpha}}\color{#3182BD}{A_{\alpha\alpha}^{-1}}\) (with the minus sign supplied by the block inverse) captures the first two links (propensity parameters \(\to\) weights \(\to\) MSM parameters); \(\color{#31A354}{J}\color{#E6550D}{A_{\beta\beta}^{-1}}\) captures the last two (MSM parameters \(\to\) predictions \(\to\) estimand).

The second and third terms can be combined:

\[ \text{IF}_i(\hat\mu_a) = \text{Ch.1}_i \;+\; \color{#31A354}{J}\; \underbrace{ \color{#E6550D}{A_{\beta\beta}^{-1}} \left( \psi_{\beta,i} - \color{#E6550D}{A_{\beta\alpha}}\; \color{#3182BD}{A_{\alpha\alpha}^{-1}}\; \psi_{\alpha,i} \right)}_{\text{IF}_i(\hat\beta) \text{ from the stacked } (\alpha,\beta) \text{ system}} \]

The expression in parentheses is the adjusted score: the MSM score \(\psi_{\beta,i}\) corrected for propensity estimation. This is exactly what WeightIt::glm_weightit() computes internally when producing its M-estimation sandwich — we are now exposing it per-individual.

Note: Where are the bread and meat?

The bread appears three times: \(\color{#3182BD}{A_{\alpha\alpha}^{-1}}\) (propensity bread inverse), \(\color{#E6550D}{A_{\beta\beta}^{-1}}\) (MSM bread inverse), and \(\color{#31A354}{J}\) (Jacobian from the mean equation). The individual scores \(\psi_{\alpha,i}\) and \(\psi_{\beta,i}\) are what would form the meat if we squared and summed them. The IF avoids forming the \((q+p+1) \times (q+p+1)\) meat matrix.

For a saturated MSM \(Y \sim A\) with inverse-probability weights \(w_i = A_i / e(L_i)\) and propensity \(e(L_i) = \mathrm{expit}(L_i^{\mathsf T}\alpha)\),

\[ \hat\mu_1 = \frac{1}{n}\sum_i w_i Y_i, \]

and the IF of \(\hat\mu_1\) is the Channel-1 + propensity-corrected Channel-2 form given in Equation 8. causatr computes this via compute_ipw_if_self_contained_one(), which assembles the adjusted score per individual using numDeriv::jacobian() on the make_weight_fn() closure (see vignette("ipw-variance-theory")).

Code
# Fit the propensity model manually and check the point estimate matches
# causatr.
ps_fit <- glm(A ~ L, data = dat, family = binomial)
e_i    <- fitted(ps_fit)
w1     <- dat$A / e_i
mu1_ipw_manual <- sum(w1 * dat$Y) / n

fit_ipw <- causat(
  dat, outcome = "Y", treatment = "A", confounders = ~ L,
  estimator = "ipw", family = binomial
)
res_ipw <- contrast(
  fit_ipw,
  interventions = list(a1 = static(1), a0 = static(0)),
  reference = "a0", ci_method = "sandwich"
)
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
mu1_ipw_causatr <- res_ipw$estimates[intervention == "a1"]$estimate
se_ipw_causatr  <- res_ipw$estimates[intervention == "a1"]$se
IPW $\hat\mu_1$ and its IF-based SE

| quantity | value |
|---|---|
| \(\hat\mu_1\) (Horvitz–Thompson) | 0.4251 |
| \(\hat\mu_1\) (causatr IPW) | 0.41726 |
| \(\mathrm{SE}(\hat\mu_1)\) via causatr IF engine | 0.04932 |

The HT estimate and the causatr IPW estimate are close but not identical: the weighted-regression route returns the weight-normalised (Hájek) mean \(\sum_{i:A_i=1} w_i Y_i \big/ \sum_{i:A_i=1} w_i\), whereas the HT form divides by \(n\), and the two coincide only when the treated weights happen to sum to \(n\). The SE comes from the propensity-corrected adjusted score — the third term in Equation 8 — computed internally.

4.3 Matching

After matching, the outcome model is fitted on the matched sample. Two issues require special treatment.

Cluster-robust variance

Matched individuals within the same subclass (pair or stratum) are correlated — they were selected together. The standard sandwich (which assumes independence) would understate the meat.

The cluster-robust sandwich accounts for this. The bread \(\color{#3182BD}{A}\) is unchanged, but the meat sums scores within each subclass before squaring:

\[ V_\beta = \color{#3182BD}{A^{-1}} \;\color{#E6550D}{B_{\text{CL}}}\; \color{#3182BD}{(A^{-1})^T} \]

where:

\[ \color{#E6550D}{B_{\text{CL}}} = \frac{1}{n}\sum_{c=1}^C \bar\psi_c \, \bar\psi_c^T, \qquad \bar\psi_c = \sum_{i \in \text{subclass } c} \psi_i \qquad (p \times 1) \]

Squaring the summed scores (rather than summing the squared scores) allows within-cluster correlations to inflate the meat.

The full IF for matching

The per-individual IF has the same two-channel structure as g-comp:

\[ \text{IF}_i(\hat\mu_a) = \underbrace{g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a}_ {\text{Channel 1}} \;+\; \underbrace{ \color{#31A354}{J_a}\; \color{#3182BD}{A_{\beta\beta}^{-1}}\; \psi_{\beta,i}}_{\text{Channel 2}} \]

where \(\psi_{\beta,i}\) is the outcome model score for individual \(i\) on the matched data. The difference from g-comp is in the aggregation step: instead of treating each individual’s IF as independent, we account for within-subclass correlation by summing IFs within each matched set before squaring:

\[ \widehat{\text{Var}}(\hat\mu_a) = \frac{1}{n^2}\sum_{c=1}^{C} \left(\sum_{i \in \text{subclass } c} \text{IF}_i\right)^2 \]

This is the IF analogue of the cluster-robust sandwich: instead of summing squared IFs (which assumes independence), we sum the squared cluster-level IFs. It is identical to \(\color{#3182BD}{A^{-1}} \color{#E6550D}{B_{\text{CL}}} \color{#3182BD}{(A^{-1})^T}\) from the cluster-robust meat above, applied to the stacked system.

For a fit on a matched sample with subclass indicator \(c_i\),

\[ \widehat{\mathrm{Var}}(\hat\beta)_{\text{CL}} = \hat A^{-1} \left(\frac{1}{n}\sum_{c=1}^C \bar\psi_c\,\bar\psi_c^{\mathsf T}\right) \hat A^{-1} / n, \qquad \bar\psi_c = \sum_{i \in c}\psi_i. \]

Equivalently, \(\widehat{\mathrm{Var}}(\hat\beta)_{\text{CL}} = \frac{1}{n^2}\sum_{c=1}^C \bar{\mathrm{IF}}_c\,\bar{\mathrm{IF}}_c^{\mathsf T}\), with \(\bar{\mathrm{IF}}_c = \sum_{i \in c}\mathrm{IF}_i\).

Code
# Build a toy matched design on dat: assign each pair of consecutive
# rows to the same subclass. This is not a real match, but it exercises
# the cluster-robust machinery on a dataset we already have.
subclass <- rep(seq_len(n / 2), each = 2)

# Manual cluster-robust sandwich, using the IF decomposition.
# IF_beta was computed in Section 2.3 (n x p matrix).
IF_by_cluster <- rowsum(IF_beta, group = subclass)          # C x p
V_beta_CL_manual <- crossprod(IF_by_cluster) / n^2

# Gold standard: sandwich::vcovCL() with the same subclass grouping.
# Use HC0 (no finite-sample adjustment) so we match the textbook formula
# above exactly.
V_beta_CL_gold <- sandwich::vcovCL(
  fit,
  cluster = subclass,
  type    = "HC0",
  cadjust = FALSE
)
Cluster-robust $V_\beta$: manual vs `sandwich::vcovCL`

| cell | manual | reference | abs diff |
|---|---|---|---|
| (Intercept) × (Intercept) | 0.065728 | 0.065728 | 0.0000000048472695 |
| A × (Intercept) | -0.067156 | -0.067156 | 0.0000000046773747 |
| L × (Intercept) | 0.002579 | 0.002579 | 0.0000000003900766 |
| (Intercept) × A | -0.067156 | -0.067156 | 0.0000000046773747 |
| A × A | 0.120543 | 0.120543 | 0.0000000110195071 |
| L × A | -0.005809 | -0.005809 | 0.0000000067333065 |
| (Intercept) × L | 0.002579 | 0.002579 | 0.0000000003900766 |
| A × L | -0.005809 | -0.005809 | 0.0000000067333065 |
| L × L | 0.033929 | 0.033929 | 0.0000000096446357 |

The “sum within cluster, then square” IF aggregation reproduces sandwich::vcovCL() (with cadjust = FALSE) to numerical precision. Matching’s bread is unchanged from the g-comp case — only the meat is formed differently, by accumulating IFs inside each subclass before squaring.

4.4 Summary: one method, three estimators

All three methods use the same variance formula — the IF squared and summed (Equation 5). What differs is the estimating equations and how Channel 2 is computed.

| Method | Stacked equations | Channel 2 models | Aggregation |
|---|---|---|---|
| G-comp | Outcome model + mean eq. | 1 (outcome) | Standard |
| IPW | Propensity + MSM + mean eq. | 2 (propensity, MSM) | Standard |
| Matching | Outcome model + mean eq. | 1 (outcome) | Cluster-robust |
| ICE | \(K\!+\!1\) outcome models + mean eq. | \(K\!+\!1\) (backward chain) | Standard |

The IF formula handles all cases uniformly. The only per-method variation is which models contribute to Channel 2 and how IFs are aggregated (summed within clusters for matching, directly for all others).

5. Chained estimators: ICE g-computation

5.1 The chain rule for influence functions

Section 3 derived the IF for a two-step estimator (fit \(\hat\beta\), then compute \(\hat\mu_a\)). What happens with multiple models in a chain, where each model’s response is a prediction from the next? The same logic applies recursively. If \(\hat\mu = f(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_K)\) and each \(\hat\beta_k\) is an M-estimator with bread \(\color{#3182BD}{A_k}\) and score \(\psi_k\), the IF of \(\hat\mu\) is (Hampel et al., 1986; Huber, 1981):

\[ \text{IF}_i(\hat\mu) \;=\; \text{IF}_i^{\text{direct}} \;+\; \sum_{k=0}^{K} \frac{\partial \hat\mu}{\partial \hat\beta_k^T} \;\cdot\; \color{#3182BD}{A_k^{-1}} \; \psi_k(O_i; \hat\beta_k) \tag{9}\]

Each term in the sum is a “Channel 2” from Section 3.2 applied to model \(k\): the Jacobian \(\partial\hat\mu/\partial\hat\beta_k^T\) (how \(\hat\mu\) depends on model \(k\)) times the IF of model \(k\)’s parameters (\(\color{#3182BD}{A_k^{-1}}\psi_{k,i}\)).
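The Jacobian for an upstream model is the only genuinely new ingredient, and it can be checked numerically. The sketch below builds a toy two-model linear chain on simulated data (illustrative, not causatr code): model 1 produces pseudo-outcomes under the intervention, model 0 regresses them on baseline variables, and \(\hat\mu\) averages model 0’s intervened predictions. Perturbing \(\hat\beta_1\) and re-running the downstream refit gives a finite-difference \(\partial\hat\mu/\partial\hat\beta_1\) that should match the analytic chain-rule product:

```r
# Numeric check of d mu_hat / d beta_1 for a toy two-model linear chain.
set.seed(8)
n <- 300
L0 <- rnorm(n); A0 <- rbinom(n, 1, 0.5)
L1 <- rnorm(n); A1 <- rbinom(n, 1, 0.5)
Y  <- 1 + A1 + L1 + L0 + rnorm(n)

X1 <- cbind(1, A1, L1); X1s <- cbind(1, 1, L1)   # model 1; A1 set to 1
X0 <- cbind(1, A0, L0); X0s <- cbind(1, 1, L0)   # model 0; A0 set to 1

mu_of_beta1 <- function(b1) {
  Ytilde <- as.numeric(X1s %*% b1)               # pseudo-outcome from model 1
  b0 <- solve(crossprod(X0), crossprod(X0, Ytilde))  # refit model 0
  mean(X0s %*% b0)
}

b1_hat <- solve(crossprod(X1), crossprod(X1, Y))

# Analytic chain-rule Jacobian: xbar0s' (X0'X0)^{-1} X0' X1s
J1_analytic <- colMeans(X0s) %*% solve(crossprod(X0)) %*% crossprod(X0, X1s)

# Finite-difference Jacobian, one coordinate of beta_1 at a time.
eps <- 1e-6
J1_numeric <- sapply(seq_along(b1_hat), function(j) {
  d <- b1_hat; d[j] <- d[j] + eps
  (mu_of_beta1(d) - mu_of_beta1(b1_hat)) / eps
})

max(abs(as.numeric(J1_analytic) - J1_numeric))
```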

Note: Connections to other frameworks

This IF-based approach is equivalent to:

  1. Stacked estimating equations (Stefanski & Boos, 2002; Zivich et al., 2024): concatenate all models’ estimating equations and apply the generic sandwich to the full stack.
  2. Semiparametric efficiency theory (Tsiatis, 2006): derive the efficient influence function (EIF) and project onto the parametric model.

All yield the same variance estimate.

5.2 ICE g-computation: the motivating example

Iterated conditional expectation (ICE) g-computation estimates the counterfactual mean under a treatment strategy in longitudinal data (Robins, 1986; Bang & Robins, 2005; Zivich et al., 2024). It fits \(K+1\) outcome models backward in time:

flowchart RL
  MK["Model K (final time)<br/>Fit: E[Y | A_K, H_K]<br/>Predict → Ỹ_K"] -->|"Ỹ_K becomes<br/>response for"| MK1["Model K-1<br/>Fit: E[Ỹ_K | A_{K-1}, H_{K-1}]<br/>Predict → Ỹ_{K-1}"]
  MK1 -->|"Ỹ_{K-1} becomes<br/>response for"| Mdots["⋯"]
  Mdots -->|"response for"| M0["Model 0 (baseline)<br/>Fit: E[Ỹ_1 | A_0, H_0]<br/>Predict → Ỹ_0"]
  M0 -->|"average"| MU["μ̂ = mean(Ỹ_0)"]

  style MK fill:#f0f4ff,stroke:#3182BD
  style MK1 fill:#f0f4ff,stroke:#3182BD
  style M0 fill:#f0f4ff,stroke:#3182BD
  style MU fill:#e8f5e9,stroke:#31A354
Figure 2: ICE backward iteration. Models are fit from final time to baseline. Each model’s predictions become the next model’s response.

Each model’s response (\(\tilde{Y}_k\)) is a prediction from the next model in the chain. This creates a cascade of dependencies.

5.3 Why naive approaches fail

Naive 1: Variance of pseudo-outcomes only. Uses only Channel 1 — ignores all model correction terms.

Naive 2: \(\color{#31A354}{J}V_\beta\color{#31A354}{J^T}\) from model 0 only. Captures Channel 2 for model 0 but ignores models \(1, \ldots, K\).

Correct: Full IF chain. Account for all \(K+1\) models via the chain rule.

5.4 Deriving the IF for chained models

The IF decomposition

Applying the chain rule (Section 5.1):

\[ \text{IF}_i = \underbrace{\frac{n}{n_t}(\tilde{Y}_{0,i} - \hat\mu)}_{\text{Channel 1}} + \sum_{k=0}^{K} \underbrace{ n_k \cdot \color{#E6550D}{d_{k,i}} \cdot r^{\text{score}}_{k,i} }_{\text{Channel 2 for model } k} \tag{10}\]

where \(\color{#E6550D}{d_{k,i}}\) is the sensitivity — how much a perturbation at model \(k\) propagates to \(\hat\mu\) — \(n_k\) is the size of model \(k\)’s fitting set (the scaling factor derived in Section 5.5), and \(r^{\text{score}}_{k,i}\) is individual \(i\)’s score residual at model \(k\):

\[ r^{\text{score}}_{k,i} \;=\; (Y_{k,i} - \hat{Y}_{k,i}) \cdot \frac{d\mu/d\eta\big|_{\eta_{k,i}}}{V(\hat{Y}_{k,i})} \]

i.e. the response residual \(r_{k,i} = Y_{k,i} - \hat{Y}_{k,i}\) weighted by the link-scale factor from the GLM score (Section 1.2, route 3). For canonical links (logistic-logit, Poisson-log, Gaussian-identity) the ratio \((d\mu/d\eta)/V(\mu) = 1\) and \(r^{\text{score}}_{k,i}\) collapses to the ordinary response residual \(r_{k,i}\). For non-canonical links (probit, cloglog, Gamma-log, Poisson-sqrt, etc.) the factor is not \(1\) and must be applied.
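As a standalone check (plain base R, not causatr code), the factor \((d\mu/d\eta)/V(\mu)\) can be read directly off a `family` object: it is identically \(1\) for the canonical logit link but varies with \(\eta\) under probit.

```r
# Compute the link-scale factor (dmu/deta) / V(mu) from base-R family
# objects, at a few arbitrary linear-predictor values.
eta <- c(-1, 0, 1.5)

score_factor <- function(fam, eta) {
  mu <- fam$linkinv(eta)
  fam$mu.eta(eta) / fam$variance(mu)
}

score_factor(binomial(link = "logit"), eta)   # canonical: 1, 1, 1
score_factor(binomial(link = "probit"), eta)  # non-canonical: varies with eta
```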

Warning: Why the score residual, not the response residual

\(\text{Ch.2}_i = J_k A_k^{-1} \psi_{k,i}\) contains the full GLM score \(\psi_{k,i} = (Y - \mu)\cdot(d\mu/d\eta)/V(\mu)\cdot X_i\), not the bare residual \((Y - \mu)\cdot X_i\). Working through the algebra:

\[ \color{#31A354}{J_k}\color{#3182BD}{A_k^{-1}}\psi_{k,i} \;=\; n\cdot d_{k,i}\cdot (Y_{k,i} - \hat Y_{k,i})\cdot \frac{d\mu/d\eta\big|_{\eta_{k,i}}}{V(\hat Y_{k,i})} \;=\; n\cdot d_{k,i}\cdot r^{\text{score}}_{k,i} \]

The factor of \(n\) arises from \(A_k^{-1} = n(X_k^T W_k X_k)^{-1}\) (Section 5.5). Omitting the \((d\mu/d\eta)/V\) factor and using the plain response residual is correct only when it equals \(1\) — i.e., only for canonical links. This mistake is silent on logistic/Poisson-log/Gaussian models but miscalibrates probit and cloglog.

What the sensitivities measure

Each model correction requires a sensitivity \(\color{#E6550D}{d_{k,i}}\): a scalar that answers “if individual \(i\)’s residual at model \(k\) increased by one unit, how much would \(\hat\mu\) change?” Computing \(\color{#E6550D}{d_{k,i}}\) takes three steps, each with a clear role:

  1. Gradient \(g_k\) (\(p_k \times 1\)): the total sensitivity of \(\hat\mu\) to model \(k\)’s coefficient vector \(\beta_k\). This is a Jacobian — the same object as \(\color{#31A354}{J_a}\) from Section 3.2, but computed recursively for \(k > 0\).
  2. Solve \(h_k = (X_k^T W_k X_k)^{-1} g_k\) (\(p_k \times 1\)): convert from “how \(\hat\mu\) responds to \(\beta_k\)” to “how \(\hat\mu\) responds to individual score contributions.” This is the bread inverse \(\color{#3182BD}{A_k^{-1}}\) applied to \(g_k\).
  3. Distribute \(\color{#E6550D}{d_{k,i}} = h_k^T X_{k,i}\) (scalar): project onto individual \(i\)’s covariate vector to get their personal sensitivity.

Base case: model 0

\(\hat\mu\) is a direct average of model 0’s predictions. The gradient of \(\hat\mu\) w.r.t. \(\beta_0\) is therefore the Jacobian from Section 3.2:

\[ \underset{p_0 \times 1}{g_0} = \frac{1}{n_t}\sum_{j \in \text{target}} \frac{d\mu}{d\eta}\bigg|_{\eta_{0,j}^*} \;\cdot\; X_{0,j}^* \]

\(g_0\) tells us the total sensitivity of \(\hat\mu\) to \(\beta_0\), but the IF requires a per-individual quantity. We need to ask: “of the total shift in \(\hat\beta_0\) caused by the data, how much came from individual \(i\)?” The answer uses two facts: (a) individual \(i\)’s influence on \(\hat\beta_0\) is proportional to their score \(\psi_{0,i} = (d\mu/d\eta)_i / V(\mu_i) \cdot r_{0,i} \cdot X_{0,i}\); and (b) the constant of proportionality involves the bread inverse \((X_0^T W_0 X_0)^{-1}\). Combining:

\[ h_0 = \underbrace{(X_0^T W_0 X_0)^{-1}}_{\text{bread inverse}} g_0, \qquad \color{#E6550D}{d_{0,i}} = h_0^T X_{0,i} \]

Now \(\color{#E6550D}{d_{0,i}} \cdot r^{\text{score}}_{0,i}\) gives individual \(i\)’s Channel 2 contribution: their score residual \(r^{\text{score}}_{0,i} = (Y_{0,i} - \hat Y_{0,i}) \cdot (d\mu/d\eta)_i / V(\mu_i)\) (how much their data “disagrees” with model 0, on the link-scaled score metric) scaled by \(\color{#E6550D}{d_{0,i}}\) (how much that disagreement matters for \(\hat\mu\)). For canonical links \((d\mu/d\eta)/V = 1\) and the score residual collapses to the response residual \(r_{0,i}\). For non-canonical links the link-scale factor must be kept. This is equivalent to \(\color{#31A354}{J_0}\color{#3182BD}{A_0^{-1}}\psi_{0,i}\) from Section 3.2, factored into the three steps above.
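To see that equivalence concretely, here is a standalone sketch using plain `glm` (not causatr internals) with a hypothetical "set \(a = 1\)" intervention: the three-step route (gradient, solve, distribute) reproduces \(\color{#31A354}{J_0}\color{#3182BD}{A_0^{-1}}\psi_{0,i}\) computed directly.

```r
# Two routes to the per-individual Channel 2 term for one logistic
# model under a "set a = 1" intervention; they agree exactly.
set.seed(1)
n <- 200
x <- rnorm(n)
a <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, plogis(-0.5 + x + a))
fit <- glm(y ~ x + a, family = binomial())

X  <- model.matrix(fit)
Xs <- X; Xs[, "a"] <- 1                    # intervention design matrix
dmu_s <- binomial()$mu.eta(drop(Xs %*% coef(fit)))

W <- fit$weights                           # IWLS weights = mu(1 - mu) here
r <- y - fitted(fit)                       # response = score residual (canonical)

# Route 1: J A^{-1} psi_i, with A = (1/n) X'WX and psi_i = r_i * X_i
J     <- colSums(dmu_s * Xs) / n
A     <- crossprod(X, X * W) / n
ch2_1 <- drop((r * X) %*% solve(A, J))

# Route 2: gradient -> solve -> distribute (g, h, d of this section)
g <- colSums(dmu_s * Xs) / n
h <- solve(crossprod(X, X * W), g)
d <- drop(X %*% h)
ch2_2 <- n * d * r

all.equal(unname(ch2_1), unname(ch2_2))    # TRUE
```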

Recursive case: model \(k = 1, \ldots, K\)

Model \(k\) does not feed into \(\hat\mu\) directly. Instead, model \(k\)’s predictions become model \(k-1\)’s pseudo-outcomes (response variable). We already know how sensitive \(\hat\mu\) is to each of those pseudo-outcomes — that is \(\color{#E6550D}{d_{k-1,j}}\) from the previous step.

The gradient of \(\hat\mu\) w.r.t. \(\beta_k\) therefore weights each individual’s contribution by the previously-computed sensitivity, further scaled by the prior weight \(w_{k-1,j}\) that the \((k-1)\)-th score function carries for row \(j\):

\[ \underset{p_k \times 1}{g_k} = \sum_{j \in \text{fit}_{k-1}} w_{k-1,j}\;\cdot\; \color{#E6550D}{d_{k-1,j}} \;\cdot\; \frac{d\mu}{d\eta}\bigg|_{\eta_{k,j}^*} \;\cdot\; X_{k,j}^* \]

The \(w_{k-1,j}\) factor comes from \(A_{k-1,k} = E[\partial s_{k-1,j}/\partial \beta_k]\), and \(s_{k-1,j}\) inherits \(w_{k-1,j}\) because the \((k-1)\)-th pseudo-outcome model is a weighted fit. In the unweighted case \(w_{k-1,j} \equiv 1\) and the factor drops out; dropping it under non-uniform external weights (e.g. IPCW, survey weights) silently underestimates the sandwich variance by a factor that scales with weight heterogeneity.
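A standalone illustration of why the prior weights belong in the score (plain weighted `glm` on hypothetical data, not causatr code): the weighted fit’s normal equations zero the weighted score, not the unweighted one.

```r
# A weighted GLM solves sum_j w_j * r_j * X_j = 0 (canonical link), so
# each row's score contribution carries its prior weight w_j.
set.seed(3)
n <- 100
x <- rnorm(n)
w <- runif(n, 0.5, 2)                # non-uniform prior weights
y <- 1 + 2 * x + rnorm(n)
fit <- glm(y ~ x, weights = w)       # gaussian family, identity link

X <- model.matrix(fit)
r <- y - fitted(fit)
colSums(w * r * X)   # ~ c(0, 0): the weighted score vanishes at the fit
colSums(r * X)       # generally nonzero: dropping w breaks the identity
```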

Compare this to \(g_0\): the uniform weight \(\frac{1}{n_t}\) over the target population is replaced by \(w_{k-1,j} \cdot \color{#E6550D}{d_{k-1,j}}\) over the previous model’s fitting set. Everything else — solve and distribute — is identical:

\[ h_k = (X_k^T W_k X_k)^{-1} g_k, \qquad \color{#E6550D}{d_{k,i}} = h_k^T X_{k,i} \]

5.5 Scaling: the factor of \(n\)

Each model correction involves \(\color{#3182BD}{A_k^{-1}}\). Since \(\color{#3182BD}{A_k} = \frac{1}{n_k} X_k^T W_k X_k\), we have \(\color{#3182BD}{A_k^{-1}} = n_k(X_k^T W_k X_k)^{-1}\). So the correction at step \(k\) is:

\[ n_k \cdot \color{#E6550D}{d_{k,i}} \cdot r^{\text{score}}_{k,i} \]

The factor \(n_k\) ensures each IF contribution is \(O(1)\) per individual, giving the correct \(\text{Var}(\hat\mu) = O(1/n)\).
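The identity in isolation, as a quick numerical check on hypothetical inputs:

```r
# A_k = (1/n_k) X'WX, so inverting the bread picks up a factor of n_k
# relative to the plain crossproduct inverse used in the solve step.
set.seed(4)
n_k <- 50
X_k <- cbind(1, rnorm(n_k))
W_k <- runif(n_k, 0.5, 2)
A_k <- crossprod(X_k, X_k * W_k) / n_k
all.equal(solve(A_k), n_k * solve(crossprod(X_k, X_k * W_k)))  # TRUE
```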

Warning: A common bug

Omitting \(n_k\) makes each correction \(O(1/n)\), yielding \(\text{Var} = O(1/n^3)\) — confidence intervals that are much too narrow.

For a 2-period ICE with outcome model \(m_1\) at \(k=1\) and pseudo-outcome model \(m_0\) at \(k=0\), the IF is

\[ \mathrm{IF}_i(\hat\mu) = \big(\hat m_0(H_0^{(a)}) - \hat\mu\big) + n_1\,d_{1,i}\,r^{\text{score}}_{1,i} + n_0\,d_{0,i}\,r^{\text{score}}_{0,i}, \]

with the sensitivities \(d_{k,i}\) propagated forward by the recursion of Section 5.4. The \(n_k\) factors are essential (Section 5.5). For the linear-Gaussian DGP below, the true joint ATE under \((a_0=1, a_1=1)\) vs \((a_0=0, a_1=0)\) is exactly \(0.8 + 0.6 + 0.4 \times 0.4 = 1.56\): the direct coefficients on \(A_0\) and \(A_1\), plus the mediated \(A_0 \to L_1 \to Y\) pathway, which a static joint intervention includes. We check that causatr’s sandwich CI covers it.

Code

```r
set.seed(7)
n_ice <- 500
id    <- seq_len(n_ice)
L0    <- rnorm(n_ice)
A0    <- rbinom(n_ice, 1, plogis(0.3 * L0))
L1    <- 0.5 * L0 + 0.4 * A0 + rnorm(n_ice)
A1    <- rbinom(n_ice, 1, plogis(0.2 * L1 + 0.3 * A0))
Y_ice <- 1 + 0.8 * A0 + 0.6 * A1 + 0.5 * L0 + 0.4 * L1 + rnorm(n_ice)

long <- rbind(
  data.frame(id = id, time = 0L, A = A0, L = L0, Y = NA_real_),
  data.frame(id = id, time = 1L, A = A1, L = L1, Y = Y_ice)
)

fit_ice <- causat(
  long, outcome = "Y", treatment = "A",
  confounders     = ~ 1,
  confounders_tv  = ~ L,
  id = "id", time = "time"
)
res_ice <- contrast(
  fit_ice,
  interventions = list(a11 = static(1), a00 = static(0)),
  reference     = "a00",
  ci_method     = "sandwich"
)

ate_ice <- res_ice$contrasts$estimate
se_ice  <- res_ice$contrasts$se
ci_lo   <- res_ice$contrasts$ci_lower
ci_hi   <- res_ice$contrasts$ci_upper
truth   <- 0.8 + 0.6 + 0.4 * 0.4  # direct effects plus the A0 -> L1 -> Y path
covered <- (truth >= ci_lo) && (truth <= ci_hi)
```

ICE chained IF on a 2-period linear-Gaussian DGP

| quantity | value |
|---|---|
| ICE ATE (causatr sandwich) | 1.3539 |
| \(\mathrm{SE}\) | 0.1314 |
| 95% CI lower | 1.0964 |
| 95% CI upper | 1.6113 |
| Truth (\(0.8 + 0.6 + 0.16\)) | 1.56 |
| Truth \(\in\) 95% CI? | yes |

The sandwich CI produced by variance_if_ice() covers the structural truth of \(1.56\), and the chained IF machinery is doing what Section 5.4 says it should.

5.6 Pseudo-code: IF for a chained estimator

Pseudo-code: chained_influence()
```r
# Pseudo-code sketch: `number_of_individuals`, `pseudo_final`,
# `intervention_data`, and `$fitting_ids` stand in for package internals.
chained_influence <- function(models, data_iv, target) {
  n <- number_of_individuals
  K <- length(models) - 1
  mu_hat <- mean(pseudo_final[target])
  n_target <- sum(target)

  # Channel 1 with target-subpopulation scaling (see Sec. 3.6).
  IF <- rep(0, n)
  IF[target] <- (n / n_target) * (pseudo_final[target] - mu_hat)

  d <- rep(0, n)  # sensitivities from previous step

  for (step in 0:K) {
    model_k <- models[[step + 1]]
    family_k <- model_k$family
    X_k <- model.matrix(model_k)         # fitting design matrix
    eta_k <- model_k$linear.predictors
    mu_eta_k <- family_k$mu.eta(eta_k)   # dmu/deta at fitted values
    V_k <- family_k$variance(fitted(model_k))
    W_k <- mu_eta_k^2 / V_k              # IWLS working weights
    fit_ids_k <- model_k$fitting_ids
    n_k <- length(fit_ids_k)

    # Score residual = response residual * (dmu/deta) / V(mu). For
    # canonical links this factor is 1; for non-canonical (probit,
    # cloglog, Gamma-log, ...) it is not.
    r_response_k <- residuals(model_k, type = "response")
    r_score_k <- r_response_k * mu_eta_k / V_k

    # Intervention design matrix and link derivatives (dmu/deta)
    X_star_k <- model.matrix(model_k, newdata = intervention_data[[step + 1]])
    eta_star <- X_star_k %*% coef(model_k)
    mu_eta_star <- family_k$mu.eta(eta_star)

    # Step 1: Gradient g_k = d mu_hat / d beta_k. The step > 0 branch
    # carries w_{k-1} from the previous-step fit's prior weights.
    if (step == 0) {
      g_k <- crossprod(X_star_k[target, ], mu_eta_star[target]) / n_target
    } else {
      prev_ids <- models[[step]]$fitting_ids
      w_prev <- models[[step]]$prior.weights  # 1-vector if unweighted
      g_k <- crossprod(
        X_star_k[prev_ids, ],
        w_prev * d[prev_ids] * mu_eta_star[prev_ids]
      )
    }

    # Step 2: Solve h_k = (X'WX)^{-1} g_k
    h_k <- solve(crossprod(X_k, X_k * W_k)) %*% g_k

    # Step 3: Distribute d_{k,i} = h_k' X_{k,i}
    d <- rep(0, n)
    d[fit_ids_k] <- X_k %*% h_k

    # Channel 2 for model k: n_k * d_{k,i} * r_score_{k,i}
    r_vec <- rep(0, n)
    r_vec[fit_ids_k] <- r_score_k
    IF <- IF + n_k * d * r_vec
  }
  IF
}
```

5.7 From individual IFs to the variance

Once we have the per-individual \(\text{IF}_i\) from the recursion above, variance estimation follows from the same formula as always (Equation 5):

\[ \widehat{\text{Var}}(\hat\mu) = \frac{1}{n^2}\sum_{i=1}^n \text{IF}_i^2 \]

For two interventions \(a\) and \(b\) on the same data:

\[ \widehat{\text{Cov}}(\hat\mu_a, \hat\mu_b) = \frac{1}{n^2}\sum_{i=1}^n \text{IF}_i^{(a)} \cdot \text{IF}_i^{(b)} \]

The full \(k \times k\) covariance matrix is then used for contrasts.
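As a sketch with simulated placeholder IF vectors (real ones come from the recursion above): the contrast variance follows either from the full covariance matrix or, equivalently, from the IF of the difference.

```r
# Contrast variance from per-individual IFs. IF_a and IF_b here are
# simulated placeholders standing in for chained-recursion output.
set.seed(2)
n <- 400
IF_a <- rnorm(n)
IF_b <- 0.6 * IF_a + rnorm(n, sd = 0.5)   # correlated, as on shared data

V <- crossprod(cbind(IF_a, IF_b)) / n^2   # 2x2 covariance of (mu_a, mu_b)
cvec <- c(1, -1)                          # contrast: mu_a - mu_b
var_diff <- drop(t(cvec) %*% V %*% cvec)

# Equivalent: the IF of a difference is the difference of the IFs.
all.equal(var_diff, sum((IF_a - IF_b)^2) / n^2)  # TRUE
```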

6. Summary

  1. One method: the influence function. For any causal estimand \(\hat\mu\), write down the complete estimating equations (nuisance models + target estimand), compute each individual’s \(\text{IF}_i\), and take \(\widehat{\text{Var}}(\hat\mu) = \frac{1}{n^2}\sum_i \text{IF}_i^2\).

  2. The IF is the sandwich: the IF formula is the sandwich \(\color{#3182BD}{A^{-1}} \color{#E6550D}{B} \color{#3182BD}{(A^{-1})^T}\) applied to the stacked estimating equations, with the block-triangular bread inverted by back-substitution. The bread, meat, and scores are all present — the IF just computes \(A^{-1}\psi_i\) per individual instead of forming the full matrices.

  3. Two channels: the IF always decomposes into a direct term (Channel 1: covariate sampling) and model corrections (Channel 2: one per nuisance model). What varies across methods is only which models contribute to Channel 2.

  4. G-computation: one model correction (outcome model). Channel 1 nonzero.

  5. IPW: two model corrections (MSM + propensity). The propensity correction captures weight estimation uncertainty via the cross-derivative \(A_{\beta\alpha}\).

  6. Matching: one model correction (outcome model), same as g-comp. Aggregation is cluster-robust (sum IFs within matched subclasses).

  7. Chained estimators (ICE): \(K\!+\!1\) model corrections computed by backward recursion of sensitivities.

  8. The delta method (\(\color{#31A354}{J} V_\beta \color{#31A354}{J^T}\)) is a shortcut that captures Channel 2 only. It is exact when Channel 1 = 0 (constant predictions), but underestimates otherwise. The full IF avoids this limitation.

References

  • Bang H, Robins JM (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61:962–973.
  • Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley.
  • Hirano K, Imbens GW, Ridder G (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71:1161–1189.
  • Hernán MA, Robins JM (2025). Causal Inference: What If. Chapman & Hall/CRC.
  • Huber PJ (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium 1:221–233.
  • Huber PJ (1981). Robust Statistics. Wiley.
  • Lunceford JK, Davidian M (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine 23:2937–2960.
  • Robins JM (1986). A new approach to causal inference in mortality studies with a sustained exposure period. Mathematical Modelling 7:1393–1512.
  • Stefanski LA, Boos DD (2002). The calculus of M-estimation. The American Statistician 56:29–38.
  • Tsiatis AA (2006). Semiparametric Theory and Missing Data. Springer.
  • van der Laan MJ, Robins JM (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer.
  • White H (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48:817–838.
  • White H (1982). Maximum likelihood estimation of misspecified models. Econometrica 50:1–25.
  • Zeileis A (2004). Econometric computing with HC and HAC covariance matrix estimators. Journal of Statistical Software 11(10):1–17.
  • Zeileis A (2006). Object-oriented computation of sandwich estimators. Journal of Statistical Software 16(9):1–16.
  • Zivich PN, Klose M, Cole SR, Edwards JK, Shook-Sa BE (2022). Delicatessen: a Python package for the generalized computation of the sandwich estimator. arXiv preprint arXiv:2203.11300.
  • Zivich PN, Ross RK, Shook-Sa BE, Cole SR, Edwards JK (2024). Empirical sandwich variance estimator for iterated conditional expectation g-computation. Statistics in Medicine 43:5562–5572.