Variance estimation for causal effects

This vignette develops the theory behind variance estimation for causal effect estimators in causatr. The core message is simple: there is one method — the influence function (IF) — applied uniformly to g-computation, IPW, and matching. What varies across methods is only the estimating equations that define the estimator; the variance formula is always the same.

Every non-trivial equation in the sections that follow has a “Verify numerically” tabset alongside it: you can switch from the algebra to a short block of R code that computes the same quantity on a fixed running example, and to the resulting value. Those code blocks are real — the vignette is rendered each time the package is built, so if a formula and its implementation ever drift apart, the table in the “Result” tab will make it obvious.

0. Roadmap

The destination is \(\mathrm{Var}(\hat\mu)\), the sampling variance of a causal estimand like \(\mathbb{E}[Y^a]\) or \(\mathbb{E}[Y^{a_1}] - \mathbb{E}[Y^{a_0}]\). We get there in six steps:

  1. Section 1 — Sandwich variance of \(\hat\beta\). Any estimator defined as the root of \(\sum_i \psi(O_i;\beta) = 0\) has an asymptotic variance of the form \(A^{-1} M A^{-1}/n\) (bread × meat × bread). This is the “variance of the parameters”, not yet of the estimand.
  2. Section 2 — The influence function. The sandwich is equivalent to \(\mathrm{Var}(\hat\beta) = \frac{1}{n^2}\sum_i \mathrm{IF}_i \mathrm{IF}_i^{\mathsf T}\), where \(\mathrm{IF}_i = A^{-1}\psi_i\) is a per-individual “gradient of the estimate with respect to individual \(i\)”. Same number, different parametrisation — but the IF form is the one that extends to causal estimands.
  3. Section 3 — Two-channel IF for \(\hat\mu\). A causal estimand \(\hat\mu = \frac{1}{n}\sum_i m(X_i;\hat\beta)\) mixes two sources of randomness: the random draw of \(X_i\) (Channel 1) and the estimation of \(\hat\beta\) (Channel 2). The correct IF is the sum of both channels. The delta method \(J V_\beta J^{\mathsf T}\) is the Channel-2-only shortcut — exact when predictions are constant per intervention, too small otherwise.
  4. Section 4 — The three estimators. G-computation, IPW, and matching are just three choices of estimating equations plugged into the same machinery. Each gets its own IF, but all three are aggregated the same way.
  5. Section 5 — Chained estimators (ICE). When several nuisance models feed into each other (as in iterated conditional expectation g-computation for longitudinal data), the IF picks up one extra Channel-2 correction per model. We derive these via a forward sensitivity recursion.
  6. Section 6 — Summary. A table mapping each estimator to its estimating equations, its IF, and the line in causatr where that IF is computed.

A good way to read this vignette is: skim the math in a section, then expand the Verify numerically tabset under each key equation to see the formula instantiated on real data and cross-checked against either sandwich::sandwich() or causatr’s own variance_if() engine.

0.1 Running example

Every verification block below uses the same small logistic example. It’s deliberately tiny (so the tables fit in the vignette) but not so tiny that asymptotic approximations fall apart.

Code
library(causatr)
library(sandwich)
library(tinytable)

set.seed(42)
n <- 200
L <- rnorm(n)
A <- rbinom(n, 1, plogis(0.5 * L))
Y <- rbinom(n, 1, plogis(-1 + 0.8 * A + 0.6 * L))
dat <- data.frame(Y = Y, A = A, L = L)

# The outcome model we'll use throughout Sections 1-3.
fit <- glm(Y ~ A + L, data = dat, family = binomial)
beta_hat <- coef(fit)
X   <- model.matrix(fit)           # design matrix, n x p
mu  <- fitted(fit)                 # predicted probabilities
p   <- length(beta_hat)

# A small helper: print two named numeric matrices / vectors side-by-side
# as a tinytable, so that "manual" and "gold-standard" results sit next
# to each other and any disagreement is visible.
verify_tbl <- function(manual, reference, label_m = "manual",
                       label_r = "reference", digits = 4) {
  m <- as.numeric(manual)
  r <- as.numeric(reference)
  out <- data.frame(
    cell      = if (length(m) == 1) "scalar" else seq_along(m),
    `value`   = m,
    reference = r,
    `abs diff` = abs(m - r),
    check.names = FALSE
  )
  names(out)[2] <- label_m
  names(out)[3] <- label_r
  tt(out) |>
    format_tt(j = 2:4, digits = digits, num_fmt = "significant")
}

A quick sanity check: \(\hat\beta\) and the model-based (non-sandwich) fit.

Code
tt(data.frame(coef = names(beta_hat), estimate = round(beta_hat, 4)))
coef estimate
(Intercept) -1.0622
A 0.7498
L 0.8197
flowchart TB
  EE["Complete estimating equations<br/>(nuisance models + estimand)"] -->|"linearise<br/>(Sec 1–2)"| IF["IF_i per individual<br/>= direct + model correction"]
  IF -->|"Var = (1/n²) Σ IF²<br/>(always correct)"| VM["Var(μ̂)"]

  EE -->|"shortcut:<br/>J V_β Jᵀ"| SC{"Predictions<br/>constant?"}
  SC -->|"Yes<br/>(saturated MSM,<br/>static intervention)"| VM
  SC -->|"No<br/>(g-comp,<br/>non-saturated MSM)"| UNDER["Underestimates ❌"]

  style EE fill:#f0f4ff,stroke:#3182BD
  style IF fill:#fff5eb,stroke:#E6550D
  style VM fill:#e8f5e9,stroke:#31A354
  style SC fill:#fff5eb,stroke:#E6550D
  style UNDER fill:#fee,stroke:#c00
Figure 1: The variance estimation pipeline for causal effects. The influence function (IF) is the one method: write down the estimating equations for the full system (nuisance models + target estimand), compute each individual’s IF, then square and sum. The delta method (J V_β J^T) is a shortcut that captures only parameter uncertainty — it’s exact when predictions are constant per intervention, but underestimates otherwise.

1. The sandwich estimator for parametric models

This section derives \(\text{Var}(\hat\beta)\) — the variance of model parameters. This is a building block, not the final answer. Section 3 shows how to get from \(\text{Var}(\hat\beta)\) to \(\text{Var}(\hat\mu)\).

1.1 What is an estimating equation?

The starting point is a simple idea: many statistical estimators can be defined as the solution to an equation of the form

\[ \sum_{i=1}^n \psi(O_i; \beta) = 0 \]

where \(O_i\) is the observed data for individual \(i\), \(\beta\) is the parameter we want to estimate, and \(\psi\) is a function — called the estimating function — that maps each observation and a candidate parameter value to a vector. The estimator \(\hat\beta\) is the value of \(\beta\) that makes this sum equal to zero.

But what is \(\psi\), concretely? And where does it come from?

1.2 Deriving \(\psi\): three routes to the same place

Route 1: Score equations (maximum likelihood)

The most familiar source of \(\psi\) is maximum likelihood. If observations have log-likelihood \(\ell_i(\beta) = \log f(O_i; \beta)\), the MLE maximises \(\sum_i \ell_i(\beta)\). Setting the derivative to zero gives:

\[ \sum_{i=1}^n \underbrace{\frac{\partial \ell_i(\beta)}{\partial \beta}}_ {\psi(O_i;\beta)} = 0 \]

So \(\psi\) is the score function — the gradient of the individual log-likelihood. For a GLM with outcome \(Y_i\), covariates \(X_i\), linear predictor \(\eta_i = X_i^T\beta\), and mean function \(\mu_i = g^{-1}(\eta_i)\):

\[ \psi(O_i; \beta) = \underset{\color{#E6550D}{\text{residual}}}{\color{#E6550D}{(Y_i - \mu_i)}} \;\cdot\; \underset{\color{#3182BD}{\text{scale factor}}} {\color{#3182BD}{\frac{d\mu/d\eta}{V(\mu_i)}}} \;\cdot\; \underset{\color{#31A354}{\text{direction}}}{\color{#31A354}{X_i}} \]

where \(V(\mu_i)\) is the variance function of the exponential family (\(V(\mu) = \mu(1-\mu)\) for binomial, \(V(\mu) = \mu\) for Poisson, \(V(\mu) = 1\) for Gaussian — note that \(\sigma^2\) enters separately as the dispersion parameter \(\phi\), not as \(V(\mu)\)). Reading the three factors:

  • Residual \(\color{#E6550D}{(Y_i - \mu_i)}\): how far the prediction is from the observation. This is zero on average at the truth, which is why \(E[\psi_i] = 0\).
  • Scale factor \(\color{#3182BD}{(d\mu/d\eta) / V(\mu_i)}\): combines the link derivative (converting from \(\eta\)-scale to \(\mu\)-scale) with the variance weight (downweighting noisy observations). For canonical links this product simplifies (see below).
  • Direction \(\color{#31A354}{X_i}\): the covariates tell us which parameters this observation informs. An observation with large \(|X_{ij}|\) provides more information about \(\beta_j\).
NoteExample: logistic regression

For logistic regression, \(\mu_i = \text{expit}(\eta_i)\), so \(d\mu/d\eta = \mu_i(1-\mu_i)\) and \(V(\mu_i) = \mu_i(1 - \mu_i)\). The scale factor cancels to \(1\), giving:

\[\psi(O_i;\beta) = (Y_i - \mu_i) \cdot X_i\]

The estimating equation \(\sum_i (Y_i - \mu_i) X_i = 0\) says: at the MLE, the covariates are uncorrelated with the residuals.
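This zero-sum property can be checked directly on the running example from Section 0.1 — a minimal sketch that re-creates the data and fit locally so it stands alone:

```r
# Sketch: verify that the logistic score equations hold at the MLE.
# Re-creates the running example from Section 0.1 so the block stands alone.
set.seed(42)
n <- 200
L <- rnorm(n)
A <- rbinom(n, 1, plogis(0.5 * L))
Y <- rbinom(n, 1, plogis(-1 + 0.8 * A + 0.6 * L))
fit <- glm(Y ~ A + L, family = binomial)

# sum_i (Y_i - mu_i) X_i should be (numerically) the zero vector:
score_sums <- colSums((Y - fitted(fit)) * model.matrix(fit))
round(score_sums, 8)
```

Each entry is the in-sample correlation between the residuals and one column of the design matrix; all are zero up to IRLS convergence tolerance.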

Route 2: Moment conditions (method of moments)

Not every estimating equation comes from a likelihood. The generalised method of moments (GMM) starts from a theoretical property of the true parameter \(\beta_0\) — a moment condition — and turns it into \(\psi\).

For example, if we know that \(E[X_i(Y_i - X_i^T\beta_0)] = 0\), we can define:

\[ \psi(O_i; \beta) = X_i(Y_i - X_i^T\beta) \]

Setting \(\sum_i \psi = 0\) gives the OLS normal equations. No distributional assumption was needed — just the moment condition \(E[X_i \varepsilon_i] = 0\).
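As a quick illustration (toy data, not the vignette's running example), the moment condition becomes exact in-sample at the OLS solution:

```r
# Sketch of Route 2: the OLS normal equations as an estimating equation,
# on toy data. At beta_ols the in-sample moment condition is exactly zero.
set.seed(1)
n <- 100
x <- cbind(`(Intercept)` = 1, x1 = rnorm(n))
y <- as.numeric(x %*% c(2, -1)) + rnorm(n)
beta_ols <- solve(crossprod(x), crossprod(x, y))

# psi_i = X_i (Y_i - X_i^T beta); summing over i recovers the normal equations.
psi_sum <- colSums(as.numeric(y - x %*% beta_ols) * x)
round(psi_sum, 10)
```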

This is the key insight of M-estimation: the estimating function \(\psi\) does not have to be a score function. It just needs to satisfy \(E[\psi(O_i; \beta_0)] = 0\) at the true parameter value (the unbiasedness condition). This is what guarantees consistency of \(\hat\beta\).

Route 3: Quasi-likelihood

Quasi-likelihood provides a middle ground. We specify only the first two moments of \(Y_i \mid X_i\) — the mean function \(\mu_i = g^{-1}(X_i^T\beta)\) and a variance function \(V(\mu_i)\) — without assuming a full distributional form. The quasi-score is:

\[ \psi(O_i; \beta) = \frac{(Y_i - \mu_i)}{V(\mu_i)} \cdot \frac{d\mu_i}{d\eta_i} \cdot X_i \]

This looks identical to the GLM score, but now \(V(\mu)\) is a working variance — it might be wrong. The sandwich estimator (Section 1.4) protects against this misspecification.

1.3 The general M-estimation framework

All three routes produce estimators of the same form: find \(\hat\beta\) solving \(\sum_i \psi(O_i; \hat\beta) = 0\). The M-estimation framework (Huber, 1964; Stefanski & Boos, 2002) unifies them under two conditions:

  1. Unbiasedness: \(E[\psi(O_i; \beta_0)] = 0\) at the true \(\beta_0\). This guarantees that the estimating equation has the right solution in expectation.
  2. Regularity: \(\psi\) is smooth enough in \(\beta\) for a Taylor expansion.

Given these, \(\hat\beta\) is consistent and asymptotically normal, and its variance has the sandwich form — regardless of whether \(\psi\) came from a likelihood, a moment condition, or a quasi-likelihood.

TipWhy “M-estimation”?

The “M” stands for “maximum” (as in maximum-likelihood), but the framework is much broader. Any estimator defined as the root of a sum of functions of the data and parameters is an M-estimator — this includes MLE, weighted least squares, quantile regression, robust regression (Huber loss), generalised estimating equations, and the causal estimators in Sections 4–5 (Stefanski & Boos, 2002).

1.4 Asymptotic variance of M-estimators

We want \(\text{Var}(\hat\beta)\). The derivation has three steps: linearise \(\hat\beta\) around the truth, identify the variance of the linear approximation, then estimate it from data.

Step 1: Linearise via Taylor expansion

We know that \(\hat\beta\) solves \(\sum_i \psi(O_i; \hat\beta) = 0\). We also know that the true \(\beta_0\) satisfies \(E[\psi(O_i; \beta_0)] = 0\), but the finite-sample sum \(\sum_i \psi(O_i; \beta_0)\) is generally not zero — it is a random quantity that fluctuates around zero. The idea is to relate these two facts via a Taylor expansion.

Expand \(\sum_i \psi(O_i; \hat\beta)\) around \(\beta_0\), keeping only the first-order term (valid for large \(n\) by regularity):

\[ \underbrace{\sum_i \psi(O_i; \hat\beta)}_{= \, 0 \text{ (by definition)}} \approx \sum_i \psi(O_i; \beta_0) + \left[\sum_i \frac{\partial \psi(O_i;\beta_0)} {\partial \beta^T}\right] (\hat\beta - \beta_0) \]

Rearranging for the estimation error \(\hat\beta - \beta_0\):

\[ \hat\beta - \beta_0 \approx -\left[\sum_i \frac{\partial \psi(O_i; \beta_0)}{\partial \beta^T}\right]^{-1} \sum_i \psi(O_i; \beta_0) \]

Dividing numerator and denominator by \(n\) and labelling the two factors:

\[ \underset{p \times 1}{\hat\beta - \beta_0} \;\approx\; \underbrace{\left[-\frac{1}{n}\sum_i \frac{\partial \psi_i} {\partial \beta^T}\right]^{-1}}_{\to\; \color{#3182BD}{A^{-1}} \;(p \times p)} \;\cdot\; \underbrace{\frac{1}{n}\sum_i \psi(O_i; \beta_0)}_{\text{mean zero, covariance } \color{#E6550D}{B/n}} \]

This is the key linearisation: the estimation error is approximately the bread inverse times the average score. The first factor converges to the constant matrix \(\color{#3182BD}{A^{-1}}\); the second is a sum of iid mean-zero random \(p\)-vectors whose covariance we call \(\color{#E6550D}{B}\).

Step 2: Identify the variance

Name the two matrices that appeared in Step 1:

  • Bread \(\color{#3182BD}{A}\): the expected negative Jacobian of the estimating function (\(p \times p\)):

\[ \color{#3182BD}{A} \;\equiv\; -E\!\left[\frac{\partial \psi(O_i; \beta_0)} {\partial \beta^T}\right] \qquad (p \times p) \]

For a GLM, \(\color{#3182BD}{A} = E[w_i X_i X_i^T]\) where \(w_i = (d\mu/d\eta)^2 / V(\mu)\) are the IWLS working weights — essentially the curvature of the (quasi-)log-likelihood.

  • Meat \(\color{#E6550D}{B}\): the covariance of the estimating function (\(p \times p\)):

\[ \color{#E6550D}{B} \;\equiv\; E\!\left[\psi(O_i; \beta_0)\,\psi(O_i; \beta_0)^T\right] \qquad (p \times p) \]

This measures how much the individual scores \(\psi_i\) vary across observations. It appeared in Step 1 as the covariance of the second factor: \(\text{Var}\! \left(\frac{1}{n}\sum_i \psi_i\right) = \frac{1}{n}\color{#E6550D}{B}\).

Now propagate through the linearisation. Taking \(\text{Var}(\cdot)\) of both sides of the Step 1 equation:

\[ \underset{p \times p}{\text{Var}(\hat\beta)} \;\approx\; \underset{p \times p}{\color{#3182BD}{A^{-1}}} \;\cdot\; \underset{p \times p}{\frac{\color{#E6550D}{B}}{n}} \;\cdot\; \underset{p \times p}{\color{#3182BD}{(A^{-1})^T}} \;=\; \frac{1}{n}\, \color{#3182BD}{A^{-1}}\, \color{#E6550D}{B}\, \color{#3182BD}{(A^{-1})^T} \]

Step 3: Estimate from data

Replace the population expectations \(A\) and \(B\) with sample averages:

\[ \color{#3182BD}{\hat{A}} = -\frac{1}{n}\sum_i \frac{\partial \psi(O_i; \hat\beta)}{\partial \beta^T} \qquad (p \times p) \]

\[ \color{#E6550D}{\hat{B}} = \frac{1}{n}\sum_i \psi(O_i; \hat\beta)\,\psi(O_i; \hat\beta)^T \qquad (p \times p) \]

The sandwich estimator is:

\[ \underset{p \times p}{\widehat{\text{Var}}(\hat\beta)} = \frac{1}{n}\; \underset{p \times p}{\color{#3182BD}{\hat{A}^{-1}}} \; \underset{p \times p}{\color{#E6550D}{\hat{B}}} \; \underset{p \times p}{\color{#3182BD}{(\hat{A}^{-1})^T}} \tag{1}\]

so called because the meat \(\color{#E6550D}{\hat B}\) is “sandwiched” between two slices of bread \(\color{#3182BD}{\hat A^{-1}}\).

ImportantWhy is this robust?

The bread \(\color{#3182BD}{A}\) and meat \(\color{#E6550D}{B}\) are estimated separately. Under correct model specification (e.g. the GLM variance function is right), \(A = B\) and the sandwich simplifies to \(\frac{1}{n} A^{-1}\) — the usual model-based variance. But if the variance function is wrong, \(A \neq B\): the bread still captures the curvature correctly (it only depends on the mean model), while the meat captures the actual variability of the scores. The sandwich remains consistent regardless — this is the Huber–White robustness property (Huber, 1967; White, 1982).

\[ \widehat{\text{Var}}(\hat\beta) = \frac{1}{n}\hat A^{-1}\hat B\,\hat A^{-1} \]

with the meat \(\hat B = \frac{1}{n}\sum_i \psi_i\psi_i^{\mathsf T}\) and the bread inverse \(\hat A^{-1}\). For a GLM with canonical logit link, \(\hat A = \frac{1}{n} X^{\mathsf T} W X\) where \(W\) is the diagonal matrix with entries \(\mu_i(1-\mu_i)\) (binomial case).

Code
# Meat: outer product of per-individual score contributions.
psi   <- (Y - mu) * X                       # n x p: (Y-mu) * X
M_hat <- crossprod(psi) / n                 # (1/n) sum psi psi^T, p x p

# Bread^{-1}: for logistic, A_hat = (1/n) X^T W X, with W = diag(mu(1-mu))
W     <- diag(mu * (1 - mu))
A_hat_inv <- solve(crossprod(X, W %*% X) / n)

# Sandwich variance of beta
V_beta_manual <- A_hat_inv %*% M_hat %*% A_hat_inv / n

# Gold standard: sandwich::sandwich(fit)
V_beta_gold <- sandwich::sandwich(fit)
Sandwich $V_\beta$: manual vs `sandwich::sandwich(fit)`
cell manual reference abs diff
(Intercept) × (Intercept) 0.0540815 0.0540815 0.00000000175022669
A × (Intercept) -0.0544496 -0.0544496 0.00000000252000550
L × (Intercept) 0.0009559 0.0009559 0.00000000006972407
(Intercept) × A -0.0544496 -0.0544496 0.00000000252000548
A × A 0.1076365 0.1076365 0.00000000984253996
L × A -0.0081033 -0.0081033 0.00000000556503194
(Intercept) × L 0.0009559 0.0009559 0.00000000006972407
A × L -0.0081033 -0.0081033 0.00000000556503193
L × L 0.0324843 0.0324843 0.00000000941471914

The manual sandwich and sandwich::sandwich(fit) agree to numerical precision (≲ 1e-8), confirming Equation 1 is implemented correctly.

1.5 Pseudo-code: sandwich variance for a GLM

Pseudo-code: sandwich_variance()
sandwich_variance <- function(X, Y, beta_hat, family) {
  n <- length(Y)
  eta <- as.numeric(X %*% beta_hat)  # drop the n x 1 matrix class so elementwise ops broadcast
  mu <- family$linkinv(eta)
  mu_eta <- family$mu.eta(eta)    # dmu/deta
  V <- family$variance(mu)        # Var(Y|X) under the model

  # Bread: A = (1/n) X^T W X where W = diag((dmu/deta)^2 / V(mu))
  W <- mu_eta^2 / V
  A <- (1 / n) * crossprod(X, X * W)

  # Meat: B = (1/n) sum psi_i psi_i^T
  residuals <- Y - mu
  score_weights <- mu_eta / V
  psi <- X * (residuals * score_weights)  # n x p matrix of score contributions
  B <- (1 / n) * crossprod(psi)

  # Sandwich: (1/n) A^{-1} B (A^{-1})^T
  A_inv <- solve(A)
  V_beta <- (1 / n) * A_inv %*% B %*% t(A_inv)
  V_beta
}
TipIn practice

You almost never code this yourself. In R, sandwich::sandwich(model) computes \(\hat A^{-1} \hat B (\hat A^{-1})^T\) from any glm object. The sandwich package (Zeileis, 2004, 2006) provides variants for clustered data (vcovCL()), panel data (vcovPL()), and heteroscedasticity-autocorrelation (vcovHAC()).
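As a usage sketch, the Section 1.5 routine can be run on the running example and cross-checked against the canonical-link shortcut from Section 1.2 (for logit the scale factor cancels to 1). The function is restated compactly here, with an `as.numeric()` guard so the `n × 1` matrix from `X %*% beta_hat` broadcasts correctly:

```r
# Sketch: run the Section 1.5 routine on the running example and check it
# against the canonical-link shortcut psi_i = (Y_i - mu_i) X_i. The same
# matrix can also be compared with sandwich::sandwich(fit).
set.seed(42)
n <- 200
L <- rnorm(n); A <- rbinom(n, 1, plogis(0.5 * L))
Y <- rbinom(n, 1, plogis(-1 + 0.8 * A + 0.6 * L))
fit <- glm(Y ~ A + L, family = binomial)
X <- model.matrix(fit)

# Compact restatement of sandwich_variance() from Section 1.5.
sandwich_variance <- function(X, Y, beta_hat, family) {
  n <- length(Y)
  eta <- as.numeric(X %*% beta_hat)
  mu <- family$linkinv(eta); mu_eta <- family$mu.eta(eta)
  V <- family$variance(mu)
  A <- crossprod(X, X * (mu_eta^2 / V)) / n          # bread
  B <- crossprod(X * ((Y - mu) * mu_eta / V)) / n    # meat
  solve(A) %*% B %*% solve(A) / n
}
V_general <- sandwich_variance(X, Y, coef(fit), binomial())

# Canonical-link shortcut for logit: scale factor cancels, psi = (Y - mu) X.
mu <- fitted(fit)
A_hat <- crossprod(X, X * (mu * (1 - mu))) / n
B_hat <- crossprod((Y - mu) * X) / n
V_canonical <- solve(A_hat) %*% B_hat %*% solve(A_hat) / n

max(abs(V_general - V_canonical))   # agreement to numerical precision
```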

1.6 Concrete example: linear regression

Let us work through the sandwich for OLS step by step.

Model: \(Y_i = X_i^T\beta + \varepsilon_i\) where \(E[\varepsilon_i \mid X_i] = 0\).

Step 1: Write down \(\psi\). The OLS estimator minimises \(\sum_i (Y_i - X_i^T\beta)^2\). Setting the gradient to zero gives \(\sum_i X_i(Y_i - X_i^T\beta) = 0\), so:

\[ \psi(O_i; \beta) = X_i(Y_i - X_i^T\beta) \qquad (p \times 1) \]

At \(\beta_0\): \(E[\psi_i] = E[X_i \varepsilon_i] = 0\) by exogeneity.

Step 2: Compute the bread \(\color{#3182BD}{A}\).

\[ \frac{\partial \psi(O_i; \beta)}{\partial \beta^T} = \frac{\partial}{\partial \beta^T}\left[X_i(Y_i - X_i^T\beta)\right] = -X_i X_i^T \qquad (p \times p) \]

So \(\color{#3182BD}{A} = E[X_i X_i^T]\), estimated by \(\color{#3182BD}{\hat{A}} = \frac{1}{n} X^T X\).

Step 3: Compute the meat \(\color{#E6550D}{B}\). Since \(\psi_i = X_i \hat e_i\), the outer product is \(\psi_i\psi_i^T = \hat e_i^2 X_iX_i^T\):

\[ \color{#E6550D}{\hat{B}} = \frac{1}{n}\sum_i \hat{e}_i^2 \, X_i X_i^T \qquad (p \times p) \]

The meat depends on the squared residuals, not on any assumed variance. This is the source of robustness.

Step 4: Assemble the sandwich.

\[ \underset{p \times p}{\widehat{\text{Var}}(\hat\beta)} = \frac{1}{n}\, \color{#3182BD}{\hat{A}^{-1}}\; \color{#E6550D}{\hat{B}}\; \color{#3182BD}{(\hat{A}^{-1})^T} \]

Substituting:

\[ = \underset{p \times p}{\color{#3182BD}{(X^TX)^{-1}}} \;\; \underset{p \times p}{\color{#E6550D}{\left[\sum_i \hat{e}_i^2\,X_iX_i^T\right]}} \;\; \underset{p \times p}{\color{#3182BD}{(X^TX)^{-1}}} \]

This is the HC0 heteroscedasticity-consistent estimator (White, 1980).

What if homoscedasticity holds? If \(\text{Var}(\varepsilon_i \mid X_i) = \sigma^2\) for all \(i\), then \(E[\hat{e}_i^2 X_i X_i^T] = \sigma^2 E[X_i X_i^T]\), so \(\color{#E6550D}{B} = \sigma^2 \color{#3182BD}{A}\) and the sandwich simplifies to \(\frac{\sigma^2}{n} \color{#3182BD}{A^{-1}} = \hat\sigma^2(X^TX)^{-1}\), the usual OLS variance.
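A short sketch makes the contrast concrete: on toy data with deliberately heteroscedastic errors, the hand-assembled HC0 matrix matches `sandwich::vcovHC(type = "HC0")` exactly, while the classical \(\hat\sigma^2(X^TX)^{-1}\) variance disagrees. The data and model here are illustrative only:

```r
# Sketch: the HC0 sandwich for OLS assembled directly from Steps 1-4,
# checked against sandwich::vcovHC(). Toy data, heteroscedastic by design.
library(sandwich)
set.seed(7)
n  <- 500
x2 <- runif(n)
y  <- 1 + 2 * x2 + rnorm(n, sd = 0.2 + 2 * x2)   # Var(eps | x) grows with x2
fit_lm <- lm(y ~ x2)

Xd <- model.matrix(fit_lm)
e  <- residuals(fit_lm)
XtX_inv <- solve(crossprod(Xd))
# (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}
V_hc0 <- XtX_inv %*% crossprod(Xd * e) %*% XtX_inv

max(abs(V_hc0 - vcovHC(fit_lm, type = "HC0")))   # exact match
```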

2. The influence function

Section 1 gave us \(\text{Var}(\hat\beta)\) as a matrix product (the sandwich). The influence function (IF) gives the same answer but expressed as a per-individual contribution. This is more than a curiosity — IFs are the tool that lets us compute \(\text{Var}(\hat\mu)\) for causal estimands (Section 3) and for multi-model pipelines (Section 5).

2.1 The idea: how much does each observation matter?

Imagine removing observation \(i\) from the dataset and re-estimating \(\hat\beta\). The change \(\hat\beta_{(-i)} - \hat\beta\) tells you how much observation \(i\) “influenced” the estimate. The influence function (IF) is the first-order approximation to this quantity, scaled by \(n\):

\[ \text{IF}_i \approx -n(\hat\beta_{(-i)} - \hat\beta) \]

But we do not need to actually refit the model \(n\) times. For M-estimators, the IF can be computed in closed form.

2.2 Deriving the IF from the linearisation

In Section 1.4, Step 1, we derived the linearisation:

\[ \underset{p \times 1}{\hat\beta - \beta_0} \;\approx\; \frac{1}{n}\sum_{i=1}^n \underset{p \times p}{\color{#3182BD}{A^{-1}}} \;\; \underset{p \times 1}{\psi(O_i; \beta_0)} \]

Each summand \(\color{#3182BD}{A^{-1}}\psi_i\) is the contribution of observation \(i\) to the total estimation error. This is exactly the influence function:

\[ \underset{p \times 1}{\text{IF}_i(\hat\beta)} \;=\; \underset{p \times p}{\color{#3182BD}{A^{-1}}} \;\cdot\; \underset{p \times 1}{\psi(O_i; \hat\beta)} \tag{2}\]

Reading each factor:

  • \(\psi_i\) (\(p \times 1\)): the score from Section 1.2 — how much observation \(i\)’s data “disagrees” with the current \(\hat\beta\).
  • \(\color{#3182BD}{A^{-1}}\) (\(p \times p\)): the bread inverse from Section 1.4 — converts from score-space to parameter-space, accounting for curvature.
  • \(\text{IF}_i\) (\(p \times 1\)): the resulting shift in \(\hat\beta\) attributable to observation \(i\) (times \(n\)).

No new derivation is needed. The IF is the linearisation, read term-by-term.
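Equation 2 can be checked against the brute-force definition from Section 2.1 by actually deleting one observation and refitting — a sketch on the running example (observation 33 is an arbitrary choice):

```r
# Sketch: compare the closed-form IF for one observation against an
# actual leave-one-out refit, on the running example from Section 0.1.
set.seed(42)
n <- 200
L <- rnorm(n); A <- rbinom(n, 1, plogis(0.5 * L))
Y <- rbinom(n, 1, plogis(-1 + 0.8 * A + 0.6 * L))
dat <- data.frame(Y, A, L)
fit <- glm(Y ~ A + L, data = dat, family = binomial)
X <- model.matrix(fit); mu <- fitted(fit)

# Closed form: IF_i = A^{-1} psi_i with psi_i = (Y_i - mu_i) X_i.
A_hat <- crossprod(X, X * (mu * (1 - mu))) / n
IF_33 <- solve(A_hat, (Y[33] - mu[33]) * X[33, ])

# Brute force: drop observation 33 and refit.
fit_m33 <- glm(Y ~ A + L, data = dat[-33, ], family = binomial)
loo_33  <- -n * (coef(fit_m33) - coef(fit))

round(cbind(IF = IF_33, loo = loo_33), 3)   # first-order agreement
```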

2.3 IF = sandwich (the key equivalence)

Property 1: Linearisation. The IF lets us write a complex estimator as a simple average:

\[ \hat\beta - \beta_0 \approx \frac{1}{n}\sum_i \text{IF}_i \]

Property 2: Variance recovery. Since \(E[\text{IF}_i] = 0\):

\[ \text{Var}(\hat\beta) \approx \frac{1}{n^2}\sum_i \text{IF}_i \, \text{IF}_i^T \]

Expanding \(\text{IF}_i \text{IF}_i^T = \color{#3182BD}{A^{-1}} \psi_i \psi_i^T \color{#3182BD}{(A^{-1})^T}\) and factoring:

\[ \frac{1}{n^2}\sum_i \color{#3182BD}{A^{-1}} \psi_i\psi_i^T \color{#3182BD}{(A^{-1})^T} = \frac{1}{n} \color{#3182BD}{A^{-1}} \underbrace{\frac{1}{n}\sum_i \psi_i\psi_i^T}_ {\color{#E6550D}{B}\text{ (the meat)}} \color{#3182BD}{(A^{-1})^T} \]

This is the sandwich from Section 1.4.

NoteThe key equivalence

\[ \underbrace{\frac{1}{n^2}\sum_i \text{IF}_i\,\text{IF}_i^T}_ {\text{IF-based variance}} \;=\; \underbrace{\frac{1}{n}\,\color{#3182BD}{A^{-1}}\,\color{#E6550D}{B}\, \color{#3182BD}{(A^{-1})^T}}_{\text{sandwich variance}} \tag{3}\]

The sandwich operates on matrices. The IF operates on per-individual vectors. Same answer, different representation. We will use the IF representation from here on, because it extends cleanly to multi-step estimators (Sections 3–5).

\[ \mathrm{IF}_i(\hat\beta) = \hat A^{-1}\psi_i, \qquad \widehat{\text{Var}}(\hat\beta) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{IF}_i\,\mathrm{IF}_i^{\mathsf T}. \]

Code
# Per-individual IF for beta (n x p). psi and A_hat_inv come from the
# sandwich verification block above.
IF_beta <- psi %*% t(A_hat_inv)

# Assemble Var(beta) two different ways.
V_beta_via_IF  <- crossprod(IF_beta) / n^2
V_beta_via_sw  <- sandwich::sandwich(fit)
Sandwich $= \tfrac{1}{n^2}\sum_i \mathrm{IF}_i\mathrm{IF}_i^{\mathsf T}$
cell IF (1/n^2 Σ IF²) sandwich abs diff
(Intercept) × (Intercept) 0.0540815 0.0540815 0.00000000175022673
A × (Intercept) -0.0544496 -0.0544496 0.00000000252000553
L × (Intercept) 0.0009559 0.0009559 0.00000000006972406
(Intercept) × A -0.0544496 -0.0544496 0.00000000252000554
A × A 0.1076365 0.1076365 0.00000000984253995
L × A -0.0081033 -0.0081033 0.00000000556503194
(Intercept) × L 0.0009559 0.0009559 0.00000000006972406
A × L -0.0081033 -0.0081033 0.00000000556503194
L × L 0.0324843 0.0324843 0.00000000941471913

The two representations are equal to machine precision (abs diff ≲ 1e-16) — so anywhere we see “sandwich variance”, we can equivalently write “sum of squared influence functions over \(n^2\)”.

2.4 Example: IF for a sample mean

\(\hat\mu = \frac{1}{n}\sum_i Y_i\). Estimating equation: \(\psi(O_i; \mu) = Y_i - \mu\). Bread: \(A = 1\). IF: \(\text{IF}_i = Y_i - \hat\mu\). Variance: \(\frac{1}{n^2}\sum_i(Y_i - \hat\mu)^2 \approx s^2/n\).

\[ \mathrm{IF}_i(\hat\mu) = Y_i - \hat\mu, \qquad \widehat{\mathrm{Var}}(\hat\mu) = \frac{1}{n^2}\sum_{i=1}^n (Y_i - \hat\mu)^2 = \frac{n-1}{n}\cdot\frac{s^2}{n}. \]

The IF-based SE is, up to the \((n-1)/n\) bias factor, just the standard error of the mean from an intro stats textbook.

Code
y   <- dat$Y
mu  <- mean(y)

# IF-based variance: (1/n²) Σ (yᵢ - μ̂)²
var_if <- sum((y - mu)^2) / length(y)^2
se_if  <- sqrt(var_if)

# Textbook: s² / n   (n-1 denominator)
se_textbook <- sqrt(var(y) / length(y))
Sample mean IF variance matches the textbook SE
quantity value
IF-based SE 0.03373
Textbook SE (\(s / \sqrt{n}\)) 0.03381
Ratio \(\sqrt{n/(n-1)}\) 1.00251

The two agree up to the standard \((n-1)/n\) degrees-of-freedom correction — by construction the ratio is exactly \(\sqrt{n/(n-1)}\), which at \(n = 200\) is a negligible 1.00251.

3. Variance of a causal estimand

3.1 The problem

The sandwich (Section 1) gives \(\text{Var}(\hat\beta)\). But the causal estimand is typically a nonlinear function of \(\hat\beta\) and the data. For instance, the g-computation marginal mean under intervention \(a\) is:

\[ \hat\mu_a = \frac{1}{n}\sum_{i=1}^n g^{-1}(X_i^{*T} \hat\beta) \]

where \(X_i^*\) is individual \(i\)’s covariate vector with treatment set to \(a\), and \(g^{-1}(\cdot)\) is the inverse link function.

This quantity depends on two random things:

  1. \(\hat\beta\) — the estimated model parameters (random because of finite data).
  2. \(\{X_i\}\) — the covariate distribution (random because a different sample gives different covariates, hence a different average of predictions).

A complete variance estimator must account for both sources. As we will see, the standard “delta method” shortcut only captures source (1).

3.2 The two-channel influence function

The g-computation marginal mean is a two-step estimator: first estimate \(\hat\beta\), then compute \(\hat\mu_a = \frac{1}{n}\sum_i g^{-1}(X_i^{*T} \hat\beta)\). Individual \(i\) affects \(\hat\mu_a\) through two channels.

Channel 1: Direct contribution (covariate sampling)

Even if \(\hat\beta\) were fixed, observation \(i\)’s prediction \(g^{-1}(X_i^{*T}\hat\beta)\) is one of the \(n\) terms being averaged. A different sample gives a different set of predictions. Individual \(i\)’s direct contribution to the estimation error is:

\[ \text{IF}_i^{\text{direct}} = g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a \qquad (\text{scalar}) \]

This is the same structure as the sample mean IF from Section 2.4 — the deviation of one term from the average.

Channel 2: Model correction (parameter uncertainty)

Observation \(i\) also affects \(\hat\mu_a\) indirectly by shifting \(\hat\beta\). From Section 2.2, \(i\)’s influence on \(\hat\beta\) is approximately \(\frac{1}{n}\color{#3182BD}{A^{-1}}\psi_i\). This shift changes \(\hat\mu_a\) by:

\[ \delta\hat\mu_a \;\approx\; \underset{1 \times p}{\color{#31A354}{J_a}} \;\cdot\; \underset{p \times 1}{\color{#3182BD}{A^{-1}}\psi_i} \]

where \(\color{#31A354}{J_a} = \frac{\partial \hat\mu_a}{\partial \beta^T}\) is the \(1 \times p\) Jacobian — how the marginal mean changes as each coefficient shifts:

\[ \underset{1 \times p}{\color{#31A354}{J_a}} = \frac{1}{n} \sum_{i=1}^n \underbrace{\frac{d\mu}{d\eta}\bigg|_{\eta_i^*}}_{\text{scalar}} \;\cdot\; \underbrace{X_i^{*T}}_{1 \times p} \]

The link derivatives for common GLMs:

Family Link \(d\mu/d\eta\) Effect
Gaussian Identity \(1\) All observations contribute equally
Binomial Logit \(\mu(1 - \mu)\) Observations near \(\mu = 0.5\) contribute most
Poisson Log \(\mu\) Observations with larger means contribute more

Combining both channels

The full IF for the marginal mean \(\hat\mu_a\) is:

\[ \text{IF}_i(\hat\mu_a) = \underbrace{g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a}_ {\text{Channel 1: covariate sampling}} \;+\; \underbrace{ \underset{1 \times p}{\color{#31A354}{J_a}} \;\; \underset{p \times p}{\color{#3182BD}{A^{-1}}} \;\; \underset{p \times 1}{\psi_i} }_{\text{Channel 2: parameter uncertainty}} \tag{4}\]

And the variance is:

\[ \widehat{\text{Var}}(\hat\mu_a) = \frac{1}{n^2}\sum_{i=1}^n \text{IF}_i(\hat\mu_a)^2 \tag{5}\]

For two interventions \(a\) and \(b\), the covariance is:

\[ \widehat{\text{Cov}}(\hat\mu_a, \hat\mu_b) = \frac{1}{n^2}\sum_{i=1}^n \text{IF}_i^{(a)} \cdot \text{IF}_i^{(b)} \]

This gives the full \(k \times k\) covariance matrix needed for contrasts (risk difference, risk ratio, etc.).

NoteWhy the covariance matters

Both \(\hat\mu_1\) and \(\hat\mu_0\) are functions of the same data, so they are correlated. This covariance typically reduces the variance of the difference. Ignoring it (computing variances separately and adding them) would overestimate the SE.

For the g-computation estimate of \(\hat\mu_1 = \mathbb{E}[Y^{A=1}]\) under a logistic outcome model,

\[ \mathrm{IF}_i(\hat\mu_1) = \underbrace{m(X_i^{(1)};\hat\beta) - \hat\mu_1}_{\text{Channel 1}} \;+\; \underbrace{J_1\,\hat A^{-1}\,\psi_i}_{\text{Channel 2}}, \]

with \(m(X_i^{(1)};\hat\beta) = \text{expit}(X_i^{(1)\mathsf T}\hat\beta)\), \(J_1 = \frac{1}{n}\sum_j \mu_1^{(j)}(1-\mu_1^{(j)})\,X_j^{(1)\mathsf T}\) (a \(1 \times p\) row vector, as in Section 3.2), and \(\widehat{\mathrm{Var}}(\hat\mu_1) = \frac{1}{n^2}\sum_i \mathrm{IF}_i^2\).

Code
# Counterfactual design matrix and predictions with A set to 1 for everyone.
X1   <- X
X1[, "A"] <- 1
eta1 <- X1 %*% beta_hat
mu1_i <- as.numeric(plogis(eta1))
mu1_hat <- mean(mu1_i)

# Channel 1: direct covariate-sampling contribution.
C1 <- mu1_i - mu1_hat

# Channel 2: gradient of mu1_hat in beta times IF_beta.
# For logistic, d mu1_i / d beta = mu1_i (1 - mu1_i) * X1_i.
J1 <- colMeans(X1 * (mu1_i * (1 - mu1_i)))     # p-vector
C2 <- as.numeric(IF_beta %*% J1)

IF_mu1 <- C1 + C2
var_mu1_manual <- sum(IF_mu1^2) / n^2
se_mu1_manual  <- sqrt(var_mu1_manual)

# Gold standard: causatr's variance_if() routed through contrast().
fit_c <- causat(dat, outcome = "Y", treatment = "A",
                confounders = ~ L, family = binomial)
res_c <- contrast(
  fit_c,
  interventions = list(a1 = static(1), a0 = static(0)),
  reference    = "a0",
  ci_method    = "sandwich"
)
se_mu1_causatr <- res_c$estimates[intervention == "a1"]$se
mu1_causatr    <- res_c$estimates[intervention == "a1"]$estimate
Two-channel IF for g-comp: manual vs `causatr::variance_if`
quantity manual causatr abs diff
\(\mathbb{E}[Y^{A=1}]\) 0.4278 0.4278 0.000000000000000
\(\mathrm{SE}(\hat\mu_1)\) 0.05039 0.05039 0.000000002311528

The manual two-channel IF reproduces causatr’s SE for \(\hat\mu_1\) to ≲ 1e-9. Both channels are needed — dropping Channel 1 would give a different (smaller) number, as the next tabset shows.
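The covariance formula from Section 3.2 can be exercised end-to-end: compute the IF for both arms, then assemble the risk-difference SE two equivalent ways. A base-R sketch that re-creates the running example (causatr itself is not needed here):

```r
# Sketch: IFs for both interventions and the risk-difference SE via the
# covariance formula from Section 3.2. Re-creates the running example and
# uses only base R.
set.seed(42)
n <- 200
L <- rnorm(n); A <- rbinom(n, 1, plogis(0.5 * L))
Y <- rbinom(n, 1, plogis(-1 + 0.8 * A + 0.6 * L))
fit <- glm(Y ~ A + L, family = binomial)
X <- model.matrix(fit); mu <- fitted(fit); bh <- coef(fit)

A_inv   <- solve(crossprod(X, X * (mu * (1 - mu))) / n)
IF_beta <- ((Y - mu) * X) %*% A_inv                    # n x p

if_arm <- function(a) {
  Xa <- X; Xa[, "A"] <- a
  mua <- plogis(as.numeric(Xa %*% bh))
  Ja  <- colMeans(Xa * (mua * (1 - mua)))              # Jacobian, p-vector
  (mua - mean(mua)) + as.numeric(IF_beta %*% Ja)       # Ch.1 + Ch.2
}
IF1 <- if_arm(1); IF0 <- if_arm(0)

# SE of the risk difference, two equivalent ways:
se_direct <- sqrt(sum((IF1 - IF0)^2) / n^2)            # IF of the contrast
v1  <- sum(IF1^2) / n^2; v0 <- sum(IF0^2) / n^2
c10 <- sum(IF1 * IF0) / n^2
se_via_cov <- sqrt(v1 + v0 - 2 * c10)                  # Var + Var - 2 Cov

c(se_direct = se_direct, se_via_cov = se_via_cov)
```

The two assemblies agree identically, and the \(-2\,\widehat{\mathrm{Cov}}\) term is exactly the correlation correction that the Note above warns against dropping.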

3.3 The delta method shortcut

A widely used shortcut computes the Jacobian \(\color{#31A354}{J}\) and the parameter sandwich \(V_\beta\) separately, then propagates via:

\[ \underset{k \times k}{\text{Var}(\hat\mu)} \;\approx\; \underset{k \times p}{\color{#31A354}{J}} \;\; \underset{p \times p}{V_\beta} \;\; \underset{p \times k}{\color{#31A354}{J^T}} \tag{6}\]

This is the multivariate delta method. To see what it captures, expand the IF-based variance (Equation 5) using the two-channel decomposition (Equation 4):

\[ \frac{1}{n^2}\sum_i \text{IF}_i^2 = \underbrace{\frac{1}{n^2}\sum_i \left(\text{Ch.1}_i\right)^2}_ {\text{covariate sampling}} + \underbrace{\frac{1}{n^2}\sum_i \left(\text{Ch.2}_i\right)^2}_{ = \; \color{#31A354}{J} \, V_\beta \, \color{#31A354}{J^T}} + \underbrace{\frac{2}{n^2}\sum_i \text{Ch.1}_i \cdot \text{Ch.2}_i}_ {\text{cross-term}} \]

The delta method gives only the middle term — Channel 2 squared. It misses the covariate sampling variance (Channel 1 squared) and the cross-term. All three terms are \(O(1/n)\); under a correctly specified model the cross-term vanishes asymptotically (\(\mathbb{E}[\psi_i \mid X_i] = 0\), so the two channels are uncorrelated), and the delta method underestimates by the Channel 1 variance.

\[ \widehat{\mathrm{Var}}(\hat\mu_a) = \underbrace{\tfrac{1}{n^2}\sum_i \mathrm{Ch.1}_i^2}_{\text{full}} + \underbrace{J\,V_\beta\,J^{\mathsf T}}_{\text{delta}} + \underbrace{\tfrac{2}{n^2}\sum_i \mathrm{Ch.1}_i\,\mathrm{Ch.2}_i}_{\text{cross}}. \]

The delta method is the second term alone.

Code
var_full   <- sum(IF_mu1^2) / n^2
var_ch1    <- sum(C1^2)     / n^2
var_ch2    <- sum(C2^2)     / n^2
var_cross  <- 2 * sum(C1 * C2) / n^2
var_delta  <- as.numeric(t(J1) %*% V_beta_gold %*% J1)  # J V_beta J^T

# Ch.2² and J V_beta J^T are algebraically identical; numerically they
# agree up to the bread solve precision.
stopifnot(abs(var_ch2 - var_delta) < 1e-6)
ratio <- sqrt(var_full) / sqrt(var_delta)
Decomposition of $\mathrm{Var}(\hat\mu_1)$: delta captures only the middle row

| term | value |
|---|---|
| Full IF variance \(\mathrm{Ch.1}^2 + \mathrm{Ch.2}^2 + \text{cross}\) | 0.0025392 |
| Channel 1 squared (covariate sampling) | 0.00014651 |
| Channel 2 squared \(=\) delta \(J\,V_\beta\,J^{\mathsf T}\) | 0.00241281 |
| Cross term \(2 \cdot \mathrm{Ch.1} \cdot \mathrm{Ch.2}\) | -0.00002013 |
| SE ratio \(\mathrm{SE}_{\text{full}} / \mathrm{SE}_{\text{delta}}\) | 1.02585667 |

On this example the delta-method SE is about \(1/1.026 \approx 97\%\) of the correct SE — a modest underestimate, because the covariate sampling variance (\(\mathrm{Ch.1}^2\)) happens to be small relative to the parameter uncertainty. Under non-saturated MSMs and models with strong covariate-driven heterogeneity, the ratio can be much larger; Equation 6 is not safe as a default.

Important: The delta method treats covariates as fixed

\(\color{#31A354}{J} V_\beta \color{#31A354}{J^T}\) answers the question: “given these specific covariates \(\{X_i\}\), how does uncertainty in \(\hat\beta\) propagate to \(\hat\mu_a\)?” It conditions on the covariate distribution. But \(\{X_i\}\) are random — a different sample gives different covariates, hence a different average of predictions even for fixed \(\beta\). The full IF captures both sources; the delta method captures only one.
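The distinction can be made concrete by simulation. The sketch below (a hypothetical data-generating process, not the vignette’s running example) re-estimates \(\hat\mu_1\) by g-computation across replications two ways: redrawing only the outcomes with the covariates held fixed — the notion of variability the delta method measures — and redrawing the covariates as well. The second SD should be visibly larger, which is exactly the Channel 1 component.

```r
# Monte Carlo sketch (hypothetical DGP): delta-method variability corresponds
# to redrawing Y with covariates fixed; full variability also redraws X.
set.seed(1)
n <- 300; reps <- 1000

draw_y <- function(A, L) rbinom(length(A), 1, plogis(-2 + A + 2 * L))
mu1_hat <- function(Y, A, L) {
  fit <- glm(Y ~ A + L, family = binomial)
  mean(plogis(cbind(1, 1, L) %*% coef(fit)))  # predictions with A set to 1
}

# (a) Covariates fixed across replications: only outcome noise.
L0 <- rnorm(n); A0 <- rbinom(n, 1, plogis(0.5 * L0))
mu_fixed <- replicate(reps, mu1_hat(draw_y(A0, L0), A0, L0))

# (b) Covariates redrawn each replication: outcome noise + covariate sampling.
mu_random <- replicate(reps, {
  L <- rnorm(n); A <- rbinom(n, 1, plogis(0.5 * L))
  mu1_hat(draw_y(A, L), A, L)
})

c(sd_fixed_X = sd(mu_fixed), sd_random_X = sd(mu_random))
```

The gap between the two SDs is the covariate-sampling contribution that the delta method ignores.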

3.4 When the shortcut is exact

The delta method underestimates whenever Channel 1 is nonzero. Channel 1 is:

\[ \text{Ch.1}_i = g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a \]

This is zero for all \(i\) if and only if every individual gets the same prediction — i.e. \(g^{-1}(X_i^{*T}\hat\beta)\) does not depend on \(i\). This happens when:

  1. The model is saturated in the treatment (e.g. \(Y \sim A\) for binary treatment) and the intervention is static (\(A = a\) for everyone). Then all individuals share the same \(X^*\) (only the intercept and treatment indicator matter), so predictions are identical.

  2. The model has no covariates beyond treatment (e.g. a linear MSM \(Y \sim A\) for continuous treatment) and the intervention is static. Same reasoning: every individual shares the same \(X^*\), so predictions are identical.

The shortcut fails when predictions vary across individuals:

  • G-computation with confounders: \(Y \sim A + L\). Predictions \(g^{-1}(\hat\beta_0 + \hat\beta_1 a + \hat\beta_2 L_i)\) vary with \(L_i\) across individuals. Channel 1 is nonzero.

  • Non-saturated MSM: \(Y \sim A\) for continuous \(A\) is a linear model, not saturated. With non-static interventions (e.g. shift each person’s treatment by \(+\delta\)), different individuals get different \(A\) values, hence different predictions.

| Situation | Predictions constant? | Delta method exact? |
|---|---|---|
| G-comp with confounders (\(Y \sim A + L\)) | No | No — underestimates |
| IPW/matching, binary \(A\), saturated MSM (\(Y \sim A\)), static intervention | Yes | Yes |
| IPW/matching, continuous \(A\), linear MSM (\(Y \sim A\)), static intervention | Yes | Yes |
| IPW, continuous \(A\), non-static intervention (shift, MTP) | No | No — underestimates |
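A quick check of the first two rows on simulated data (not the vignette’s running example): under a treatment-only model every counterfactual design row is identical, so Channel 1 vanishes term by term, while a confounder-adjusted model leaves Channel 1 nonzero.

```r
# Channel 1 under a saturated-in-treatment model vs a confounder-adjusted one
# (simulated data, illustrative only).
set.seed(2)
n <- 200
L <- rnorm(n)
A <- rbinom(n, 1, 0.5)
Y <- rbinom(n, 1, plogis(-0.5 + A + L))

# Y ~ A: every counterfactual row X* is (1, 1), so Channel 1 is identically 0.
fit_sat   <- glm(Y ~ A, family = binomial)
preds_sat <- plogis(cbind(1, 1) %*% coef(fit_sat))[rep(1, n)]
C1_sat    <- preds_sat - mean(preds_sat)

# Y ~ A + L: predictions vary with L_i, so Channel 1 is nonzero.
fit_adj   <- glm(Y ~ A + L, family = binomial)
preds_adj <- as.numeric(plogis(cbind(1, 1, L) %*% coef(fit_adj)))
C1_adj    <- preds_adj - mean(preds_adj)

c(max_abs_C1_saturated = max(abs(C1_sat)),
  max_abs_C1_adjusted  = max(abs(C1_adj)))
```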

3.5 The IF is the stacked sandwich, solved analytically

The two-channel formula (Equation 4) may seem to appear from thin air. In fact it is the sandwich (Equation 1) applied to the complete estimating equations — model score plus mean equation — with the block-triangular bread inverted by back-substitution. This section shows the derivation.

The complete estimating equations

G-computation has two estimating equations per individual:

\[ \underset{(p+1) \times 1}{\Psi(O_i;\,\beta,\,\mu_a)} = \begin{pmatrix} \underset{p \times 1}{\psi_\beta(O_i;\,\beta)} \\[4pt] \underset{1 \times 1}{\psi_\mu(O_i;\,\beta,\,\mu_a)} \end{pmatrix} = \begin{pmatrix} \text{GLM score (Section 1.2)} \\[4pt] g^{-1}(X_i^{*T}\beta) - \mu_a \end{pmatrix} \]

The first block \(\psi_\beta\) defines the model. The second block \(\psi_\mu\) defines the estimand: \(\hat\mu_a\) is the value of \(\mu_a\) that makes \(\sum_i \psi_\mu = 0\), i.e. \(\hat\mu_a = \frac{1}{n}\sum_i g^{-1}(X_i^{*T} \hat\beta)\).

The block-triangular bread

The bread matrix of the stacked system is:

\[ \underset{(p+1) \times (p+1)}{A} = -\frac{1}{n}\sum_i \frac{\partial\Psi_i}{\partial(\beta,\mu_a)^T} = \begin{pmatrix} \underset{p \times p}{\color{#3182BD}{A_{\beta\beta}}} & \underset{p \times 1}{0} \\[6pt] \underset{1 \times p}{\color{#E6550D}{A_{\mu\beta}}} & \underset{1 \times 1}{\color{#E6550D}{A_{\mu\mu}}} \end{pmatrix} \]

where:

  • \(\color{#3182BD}{A_{\beta\beta}} = -\frac{1}{n}\sum_i \frac{\partial\psi_\beta}{\partial\beta^T}\) — the GLM Hessian (Section 1.4).
  • \(\color{#E6550D}{A_{\mu\beta}} = -\frac{1}{n}\sum_i \frac{\partial\psi_\mu}{\partial\beta^T} = -\frac{1}{n}\sum_i \frac{d\mu}{d\eta}\big|_{\eta_i^*} X_i^{*T} = -\color{#31A354}{J_a}\) — the negative Jacobian from Section 3.2.
  • \(\color{#E6550D}{A_{\mu\mu}} = -\frac{\partial\psi_\mu}{\partial\mu_a} = 1\).
  • Upper-right = \(0\): the GLM score does not depend on \(\mu_a\).

The block-triangular structure arises because \(\psi_\beta\) depends only on \(\beta\) (not on \(\mu_a\)), while \(\psi_\mu\) depends on both \(\beta\) and \(\mu_a\).

Inverting by back-substitution

For a \(2 \times 2\) block-lower-triangular matrix, the inverse is:

\[ A^{-1} = \begin{pmatrix} \color{#3182BD}{A_{\beta\beta}^{-1}} & 0 \\[4pt] -A_{\mu\mu}^{-1} \, \color{#E6550D}{A_{\mu\beta}} \, \color{#3182BD}{A_{\beta\beta}^{-1}} & A_{\mu\mu}^{-1} \end{pmatrix} \]
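As a sanity check of the block formula itself — on random matrices, not the running example — assembling the inverse from the blocks should agree with a direct `solve()` of the full matrix:

```r
# Numeric check of the 2 x 2 block-lower-triangular inverse formula
# on random matrices (illustrative only).
set.seed(3)
p <- 3
A_bb <- crossprod(matrix(rnorm(p * p), p)) + diag(p)  # A_{beta beta}, p x p, PD
A_mb <- matrix(rnorm(p), 1)                           # A_{mu beta},   1 x p
A_mm <- matrix(1, 1, 1)                               # A_{mu mu},     1 x 1

A <- rbind(cbind(A_bb, matrix(0, p, 1)),
           cbind(A_mb, A_mm))

# Inverse assembled block by block, as in the display above.
A_inv_blocks <- rbind(
  cbind(solve(A_bb), matrix(0, p, 1)),
  cbind(-solve(A_mm) %*% A_mb %*% solve(A_bb), solve(A_mm))
)

max(abs(A_inv_blocks - solve(A)))  # should be at machine precision
```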

The IF for the full parameter \((\beta, \mu_a)\) is \(A^{-1}\Psi_i\). Extracting the \(\mu_a\) row (the last row of \(A^{-1}\) times \(\Psi_i\)):

\[ \text{IF}_i(\hat\mu_a) = \underbrace{A_{\mu\mu}^{-1}\;\psi_{\mu,i}}_{\text{Channel 1}} \;+\; \underbrace{(-A_{\mu\mu}^{-1})\; \color{#E6550D}{A_{\mu\beta}}\; \color{#3182BD}{A_{\beta\beta}^{-1}}\; \psi_{\beta,i}}_{\text{Channel 2}} \]

Substituting \(A_{\mu\mu} = 1\), \(A_{\mu\beta} = -\color{#31A354}{J_a}\), and \(\psi_{\mu,i} = g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a\):

\[ \text{IF}_i(\hat\mu_a) = \underbrace{g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a}_ {\text{Channel 1}} \;+\; \underbrace{ \color{#31A354}{J_a}\; \color{#3182BD}{A_{\beta\beta}^{-1}}\; \psi_{\beta,i}}_{\text{Channel 2}} \]

This is exactly Equation 4. The bread \(\color{#3182BD}{A_{\beta\beta}^{-1}}\) is model_bread_inv() = \((X^TWX)^{-1}\). The score \(\psi_{\beta,i}\) is \(r_i \cdot (d\mu/d\eta) / V(\mu_i) \cdot X_i\). The Jacobian \(\color{#31A354}{J_a}\) projects the \(p\)-vector result to a scalar.

Note: The meat is hidden, not absent

The IF formula computes \(\color{#3182BD}{A^{-1}}\psi_i\) for each individual, then squares: \(\text{Var} = \frac{1}{n^2}\sum_i \text{IF}_i^2\). Expanding:

\[ \frac{1}{n^2}\sum_i \text{IF}_i^2 = \frac{1}{n}\;\color{#3182BD}{A^{-1}} \underbrace{\frac{1}{n}\sum_i \Psi_i \Psi_i^T}_ {\color{#E6550D}{B} \text{ (the meat)}} \color{#3182BD}{(A^{-1})^T} \]

This is the sandwich (Equation 1) for the stacked system. The IF avoids forming the \((p+1) \times (p+1)\) meat matrix \(\color{#E6550D}{B}\) — it computes the same answer by summing scalar products.
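The identity is easy to confirm on a self-contained example. The block below fits a plain OLS model by hand (not the vignette’s running data) and checks that summing the squared per-individual IFs gives the same matrix as the explicit bread–meat–bread product:

```r
# IF-squared vs explicit sandwich on an OLS fit (self-contained example).
set.seed(4)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)
beta_hat <- solve(crossprod(X), crossprod(X, y))
r <- as.numeric(y - X %*% beta_hat)

psi   <- X * r                        # n x p matrix of scores psi_i
A_inv <- solve(crossprod(X) / n)      # bread inverse (symmetric here)
IF    <- psi %*% A_inv                # row i is IF_i' = psi_i' A^{-1}

V_if <- crossprod(IF) / n^2           # (1/n^2) sum_i IF_i IF_i'
B    <- crossprod(psi) / n            # meat
V_sw <- A_inv %*% B %*% A_inv / n     # sandwich

max(abs(V_if - V_sw))                 # identical up to float rounding
```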

3.6 Target subpopulations (ATT, ATU, custom subset)

Sections 3.1–3.5 define the marginal mean as an average over the full sample: \(\hat\mu_a = (1/n)\sum_{i=1}^n g^{-1}(X_i^{*T}\hat\beta)\). Many causal estimands — the ATT, the ATU, or an estimand on a user-supplied subset — instead average only over a target subpopulation \(T \subseteq \{1,\ldots,n\}\) of size \(n_t\):

\[ \hat\mu_a \;=\; \frac{1}{n_t}\sum_{i\in T} g^{-1}(X_i^{*T}\hat\beta) \]

The stacked estimating equations need one small change to accommodate this: the mean equation is now indicator-weighted. Let \(t_i = \mathbb{1}[i \in T]\) and write

\[ \psi_{\mu,i}(\beta,\mu_a) \;=\; t_i \cdot \bigl[g^{-1}(X_i^{*T}\beta) - \mu_a\bigr] \]

so that \(\sum_i \psi_{\mu,i} = 0\) still recovers \(\hat\mu_a\).

Redoing the bread calculation from Section 3.5 with this indicator-weighted mean equation:

  • \(\color{#E6550D}{A_{\mu\mu}} = -\frac{1}{n}\sum_i \frac{\partial\psi_{\mu,i}}{\partial\mu_a} = -\frac{1}{n}\sum_i(-t_i) = \frac{n_t}{n}\), so \(A_{\mu\mu}^{-1} = \frac{n}{n_t}\).
  • \(\color{#E6550D}{A_{\mu\beta}} = -\frac{1}{n}\sum_i t_i \cdot \frac{d\mu}{d\eta}\Big|_{\eta_i^*} X_i^{*T} = -\frac{n_t}{n}\,\color{#31A354}{J_a}\), where \(\color{#31A354}{J_a} = \frac{1}{n_t}\sum_{i\in T}\frac{d\mu}{d\eta}\big|_{\eta_i^*} X_i^{*T}\) is the Jacobian averaged over the target.

Plugging into the back-substitution formula for the last row of \(A^{-1}\):

\[ \text{IF}_i(\hat\mu_a) \;=\; -A_{\mu\mu}^{-1}\,\color{#E6550D}{A_{\mu\beta}}\,\color{#3182BD}{A_{\beta\beta}^{-1}}\,\psi_{\beta,i} \;+\; A_{\mu\mu}^{-1}\,\psi_{\mu,i} \]

Substituting \(A_{\mu\mu}^{-1} = n/n_t\) and \(A_{\mu\beta} = -(n_t/n)\,\color{#31A354}{J_a}\):

\[ \text{IF}_i(\hat\mu_a) \;=\; \underbrace{\frac{n}{n_t}\cdot t_i \cdot (g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a)}_ {\text{Channel 1 (rescaled)}} \;+\; \underbrace{\color{#31A354}{J_a}\,\color{#3182BD}{A_{\beta\beta}^{-1}}\,\psi_{\beta,i}}_ {\text{Channel 2 (unchanged)}} \tag{7}\]

Two things to notice:

  1. Channel 1 picks up a factor of \(n/n_t\) and is nonzero only for individuals inside the target. Without this factor the Channel 1 contribution to the variance is too small by a factor of \((n_t/n)^2\) — a serious undercount when the target is a proper subset (e.g., the ATT with only ~40% treated).
  2. Channel 2 is unchanged. The \(n_t/n\) in \(A_{\mu\beta}\) cancels with the \(n/n_t\) in \(A_{\mu\mu}^{-1}\), so the Channel 2 formula from Section 3.2 carries over verbatim — \(\color{#31A354}{J_a}\) is just averaged over the target population instead of the full sample, and \(\psi_{\beta,i}\) still runs over every individual the outcome model used (inside and outside the target), because the outcome model was fit on all those rows.

The variance is still \(\widehat{\text{Var}}(\hat\mu_a) = (1/n^2)\sum_{i=1}^n \text{IF}_i^2\), summed over all \(n\) individuals. Individuals outside the target have \(\text{Ch.1}_i = 0\) and contribute only through Channel 2.

Setting \(T = \{1,\ldots,n\}\) (so \(n_t = n\), \(t_i \equiv 1\)) recovers Equation 4 exactly. The full-sample case is the special case of Equation 7 where the scaling factor is \(1\).
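This reduction is easy to see in code. The sketch below (simulated data, not the running example) computes Equation 7’s Channel 1 for an ATT-style target (treated rows only) and for \(T\) equal to the full sample, and confirms that the latter collapses to the plain centred predictions of Equation 4:

```r
# Channel 1 of Equation 7, ATT-style target vs full-sample target
# (simulated data, illustrative only).
set.seed(5)
n <- 200
L <- rnorm(n)
A <- rbinom(n, 1, plogis(L))
Y <- rbinom(n, 1, plogis(-1 + A + L))
fit   <- glm(Y ~ A + L, family = binomial)
preds <- as.numeric(plogis(cbind(1, 1, L) %*% coef(fit)))  # A set to 1

ch1_target <- function(target) {
  t_i  <- as.numeric(seq_len(n) %in% target)
  n_t  <- length(target)
  mu_a <- mean(preds[target])
  (n / n_t) * t_i * (preds - mu_a)   # Equation 7's Channel 1
}

ch1_full <- ch1_target(seq_len(n))      # T = {1, ..., n}
ch1_att  <- ch1_target(which(A == 1))   # T = treated only

# Full-sample case reduces to Equation 4: plain centred predictions.
max(abs(ch1_full - (preds - mean(preds))))
# ATT case: exactly zero outside the target, rescaled by n/n_t inside.
range(ch1_att[A == 0])
```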

3.7 Pseudo-code: the full IF for a marginal mean

This incorporates the target subpopulation scaling from Section 3.6 and the correct non-canonical link handling for the score (multiply response residuals by \(\mu_\eta/V\); see Section 1.2 route 3).

Pseudo-code: gcomp_influence()
gcomp_influence <- function(model, data_a, target_idx) {
  n <- nrow(data_a)
  beta_hat <- coef(model)
  pred_terms <- delete.response(terms(model))
  family <- model$family

  # Predictions under intervention (target rows only)
  X_star <- model.matrix(pred_terms, data = data_a[target_idx, ])
  eta_star <- as.numeric(X_star %*% beta_hat)
  preds <- family$linkinv(eta_star)
  mu_hat <- mean(preds)
  n_t <- length(preds)

  # Channel 1: (n/n_t) * t_i * (pred_i - mu_hat), zero outside target
  IF_direct <- rep(0, n)
  IF_direct[target_idx] <- preds - mu_hat

  # Channel 2 gradient: J_a = (1/n_t) sum_{target} (dmu/deta) * X_star
  mu_eta_star <- family$mu.eta(eta_star)
  J <- as.numeric(crossprod(X_star, mu_eta_star)) / n_t  # p x 1

  # Bread inverse (X'WX)^{-1} with IWLS working weights
  X_fit <- model.matrix(model)
  eta_fit <- model$linear.predictors
  mu_eta_fit <- family$mu.eta(eta_fit)
  V_fit <- family$variance(fitted(model))
  W <- mu_eta_fit^2 / V_fit
  XtWX_inv <- solve(crossprod(X_fit, X_fit * W))
  h <- as.numeric(XtWX_inv %*% J)  # p x 1

  # Per-observation directional sensitivity and SCORE residual.
  # Score residual = response residual * (dmu/deta) / V(mu), so that
  # psi_i = X_i * r_i * (mu_eta/V). For canonical links mu_eta/V = 1
  # and this reduces to the response residual.
  d_fit <- as.numeric(X_fit %*% h)
  r_response <- residuals(model, type = "response")
  r_score <- r_response * mu_eta_fit / V_fit
  IF_correction <- d_fit * r_score                         # n_fit x 1

  # Combine: full IF = (n/n_t) * Ch.1 + n * Ch.2
  # Factor of n on Ch.2 from A_{beta,beta}^{-1} = n * (X'WX)^{-1}
  # Factor of (n/n_t) on Ch.1 from A_{mu,mu}^{-1} = n/n_t (Sec. 3.6)
  IF <- (n / n_t) * IF_direct
  IF[seq_along(IF_correction)] <- IF[seq_along(IF_correction)] +
    n * IF_correction

  IF
}

4. The three causal estimators

This section applies the IF framework from Section 3 to g-computation, IPW, and matching. The variance formula is always the same (Equation 5); what differs is the estimating equations and whether Channel 1 is zero.

4.1 G-computation

Estimating equations. G-computation fits an outcome model \(E[Y_i \mid A_i, L_i] = g^{-1}(X_i^T \beta)\) and computes \(\hat\mu_a = \frac{1}{n}\sum_i g^{-1}(X_i^{*T}\hat\beta)\) where \(X_i^*\) sets treatment to \(a\). The complete estimating equations are:

\[ \begin{pmatrix} \psi_\beta(O_i; \beta) \\[4pt] \psi_\mu(O_i; \beta, \mu_a) \end{pmatrix} = \begin{pmatrix} \text{GLM score for outcome model} \\[4pt] g^{-1}(X_i^{*T}\beta) - \mu_a \end{pmatrix} \]

IF. From Section 3.2 (Equation 4):

\[ \text{IF}_i(\hat\mu_a) = \underbrace{g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a}_ {\text{Channel 1 (nonzero)}} \;+\; \underbrace{ \color{#31A354}{J_a} \color{#3182BD}{A^{-1}} \psi_i }_{\text{Channel 2}} \]

Both channels are needed because predictions \(g^{-1}(\hat\beta_0 + \hat\beta_1 a + \hat\beta_2 L_i)\) vary across individuals through \(L_i\). The delta method shortcut \(\color{#31A354}{J} V_\beta \color{#31A354}{J^T}\) underestimates.

Variance: \(\frac{1}{n^2}\sum_i \text{IF}_i^2\) using the full two-channel IF.

4.2 Inverse probability weighting (IPW)

IPW is more complex because two models are involved, and the uncertainty from the first (propensity scores) must propagate into the second (MSM).

Tip: Companion vignette

The density-ratio weight formulas (Horvitz–Thompson indicator, pushforward with sign/Jacobian, IPSI closed form), the make_weight_fn() closure design, and the numerical \(A_{\beta\alpha}\) verification are developed in vignette("ipw-variance-theory").

The two-model setup

  1. Propensity score model: estimate \(\hat\alpha\) (\(q\)-vector) from \(\text{logit}[P(A_i = 1 \mid L_i)] = L_i^T \alpha\).
  2. Weighted outcome model (MSM): using the estimated weights \(w_i = 1/\hat{P}(A_i \mid L_i; \hat\alpha)\), fit \(E[Y_i \mid A_i] = g^{-1}(\beta_0 + \beta_1 A_i)\) by weighted regression to get \(\hat\beta\) (\(p\)-vector).

The weights \(w_i\) depend on \(\hat\alpha\), so \(\hat\beta\) depends on \(\hat\alpha\).

Why the naive approach fails

A naive sandwich on the outcome model alone — treating the weights as known constants — ignores the variability in \(\hat\alpha\). Each new sample produces different \(\hat\alpha\), hence different weights, hence different \(\hat\beta\). The true variability of \(\hat\beta\) has two sources:

  • Outcome variability: randomness in \(Y_i\) given fixed weights.
  • Weight variability: randomness in \(\hat\alpha\) (and hence \(w_i\)).

The naive sandwich captures only the first. The direction of the bias depends on the estimand: for the ATE, using estimated (rather than known) propensity scores actually reduces the asymptotic variance, so the naive sandwich is conservative — it overestimates the SE (Hirano et al., 2003; Lunceford & Davidian, 2004). For the ATT, the effect can go either direction.
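The counter-intuitive direction of this bias for the ATE can be checked by simulation. The sketch below (a hypothetical DGP, unrelated to the running example) computes the Horvitz–Thompson mean twice per replication — once with the true propensity \(e(L)\) and once with a freshly estimated \(\hat e(L)\) — and compares the empirical SDs:

```r
# Monte Carlo sketch: HT weighting with *estimated* propensity scores has
# smaller sampling variance than with the *true* ones (hypothetical DGP).
set.seed(6)
n <- 500; reps <- 2000

one_rep <- function() {
  L <- rnorm(n)
  e <- plogis(0.5 * L)
  A <- rbinom(n, 1, e)
  Y <- 1 + A + 2 * L + rnorm(n)
  e_hat <- fitted(glm(A ~ L, family = binomial))
  c(true_ps = mean(A * Y / e),      # weights from the true e(L)
    est_ps  = mean(A * Y / e_hat))  # weights from the fitted model
}

sims <- replicate(reps, one_rep())
apply(sims, 1, sd)                  # SD with estimated PS is the smaller one
```

A naive sandwich that plugs in the estimated weights as if they were the true ones targets the first (larger) SD, which is why it is conservative here.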

The correct approach: joint M-estimation

Treat both models together as a single M-estimation problem. Define the joint parameter \(\theta = (\alpha, \beta)\) — a \((q+p)\)-vector — and the stacked estimating function:

\[ \underset{(q+p) \times 1}{\Psi(O_i; \theta)} = \begin{pmatrix} \color{#3182BD}{\psi_\alpha(O_i; \alpha)} \\[4pt] \color{#E6550D}{\psi_\beta(O_i; \alpha, \beta)} \end{pmatrix} \]

where \(\color{#3182BD}{\psi_\alpha}\) (\(q \times 1\)) is the score of the propensity model, and \(\color{#E6550D}{\psi_\beta}\) (\(p \times 1\)) is the weighted score of the outcome model (which depends on \(\alpha\) through \(w_i\)).

The bread matrix \(\hat A = -\frac{1}{n}\sum_i \frac{\partial\Psi_i}{\partial\theta^T}\) has a block structure:

\[ \underset{(q+p) \times (q+p)}{\hat{A}} = \begin{pmatrix} \underset{q \times q}{\color{#3182BD}{A_{\alpha\alpha}}} & \underset{q \times p}{0} \\[6pt] \underset{p \times q}{\color{#E6550D}{A_{\beta\alpha}}} & \underset{p \times p}{\color{#E6550D}{A_{\beta\beta}}} \end{pmatrix} \]

Each block has a concrete meaning:

  • \(\color{#3182BD}{A_{\alpha\alpha}}\) (\(q \times q\)): the Hessian of the propensity model — how \(\psi_\alpha\) changes with \(\alpha\).
  • \(\color{#E6550D}{A_{\beta\beta}}\) (\(p \times p\)): the weighted Hessian of the outcome model — how \(\psi_\beta\) changes with \(\beta\).
  • \(\color{#E6550D}{A_{\beta\alpha}}\) (\(p \times q\)): the cross-term — how the weighted outcome score changes with \(\alpha\). This involves \(\frac{\partial w_i}{\partial\alpha^T}\) (how the weights shift when the propensity parameters change). This is the block that captures the dependency.
  • Upper-right = \(0\) (\(q \times p\)): the propensity score does not depend on \(\beta\).

The full sandwich \(\frac{1}{n}\hat A^{-1}\hat B(\hat A^{-1})^T\) applied to the stacked system correctly propagates weight uncertainty into \(\text{Var}(\hat\beta)\) through the \(A_{\beta\alpha}\) block.
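The mechanics can be sketched end-to-end on the simplest possible IPW stack: \(\theta = (\alpha, \mu)\), with \(\psi_\alpha\) the logistic propensity score and \(\psi_\mu = A_i(Y_i - \mu)/e(L_i;\alpha)\) in place of a full MSM. Everything below is an illustrative hand computation on simulated data, not causatr internals; the adjusted score subtracts \(A_{\mu\alpha}A_{\alpha\alpha}^{-1}\psi_{\alpha,i}\), with the blocks derived by differentiating \(\psi_\mu\) in \(\alpha\) and \(\mu\):

```r
# Stacked sandwich for theta = (alpha, mu): hand-built blocks, simulated data.
set.seed(7)
n <- 1000
L <- rnorm(n)
A <- rbinom(n, 1, plogis(0.5 * L))
Y <- 1 + A + 2 * L + rnorm(n)

ps <- glm(A ~ L, family = binomial)
e  <- fitted(ps)
Lm <- model.matrix(ps)                    # (1, L)
mu <- sum(A * Y / e) / sum(A / e)         # root of sum_i A(Y - mu)/e = 0

psi_alpha <- Lm * (A - e)                 # n x 2 propensity scores
psi_mu    <- A * (Y - mu) / e             # n x 1 mean-equation scores

A_aa <- crossprod(Lm, Lm * (e * (1 - e))) / n          # propensity Hessian
A_mm <- mean(A / e)                                    # -d psi_mu / d mu
A_ma <- colMeans(Lm * (A * (Y - mu) * (1 - e) / e))    # -d psi_mu / d alpha'

IF_naive <- psi_mu / A_mm                              # weights treated as known
IF_corr  <- (psi_mu - as.numeric(psi_alpha %*% solve(A_aa, A_ma))) / A_mm

ses <- c(se_naive     = sqrt(sum(IF_naive^2)) / n,
         se_corrected = sqrt(sum(IF_corr^2)) / n)
ses
```

The corrected SE is the smaller of the two on this ATE-type mean, matching the Hirano et al. result quoted above.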

Tip: In practice

causatr’s self-contained IPW engine computes this stacked sandwich directly via compute_ipw_if_self_contained_one(), which uses numDeriv::jacobian() on the make_weight_fn() closure to evaluate the \(A_{\beta\alpha}\) block numerically. See vignette("ipw-variance-theory") for the closure design and verification.

The full IF for \(\hat\mu_a\)

The stacked sandwich above gives \(V_\beta\) — the variance of the MSM parameters \(\hat\beta\), accounting for weight uncertainty. But we need \(\text{Var}(\hat\mu_a)\), not \(\text{Var}(\hat\beta)\). Following the same approach as Section 3.5, we add the mean equation to the stack:

\[ \underset{(q+p+1) \times 1}{\Psi(O_i;\,\alpha,\,\beta,\,\mu_a)} = \begin{pmatrix} \color{#3182BD}{\psi_\alpha(O_i;\,\alpha)} \\[4pt] \color{#E6550D}{\psi_\beta(O_i;\,\alpha,\,\beta)} \\[4pt] \psi_\mu(O_i;\,\beta,\,\mu_a) \end{pmatrix} \]

The bread is now \(3 \times 3\) block-lower-triangular:

\[ A = \begin{pmatrix} \color{#3182BD}{A_{\alpha\alpha}} & 0 & 0 \\[4pt] \color{#E6550D}{A_{\beta\alpha}} & \color{#E6550D}{A_{\beta\beta}} & 0 \\[4pt] 0 & A_{\mu\beta} & A_{\mu\mu} \end{pmatrix} \]

where \(A_{\mu\beta} = -\color{#31A354}{J}\) (negative Jacobian of predictions w.r.t. \(\beta\)) and \(A_{\mu\mu} = 1\) (same as Section 3.5). The zero in the \((\mu, \alpha)\) position arises because the mean equation \(g^{-1}(X_i^{*T}\beta) - \mu_a\) does not depend on \(\alpha\) directly — it depends on \(\alpha\) only through \(\beta\).

Inverting by back-substitution and extracting the \(\mu_a\) row gives:

\[ \text{IF}_i(\hat\mu_a) = \underbrace{g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a}_ {\text{Channel 1 (direct)}} + \underbrace{\color{#31A354}{J}\; \color{#E6550D}{A_{\beta\beta}^{-1}}\; \psi_{\beta,i}}_{\text{MSM correction}} - \underbrace{\color{#31A354}{J}\; \color{#E6550D}{A_{\beta\beta}^{-1}}\; \color{#E6550D}{A_{\beta\alpha}}\; \color{#3182BD}{A_{\alpha\alpha}^{-1}}\; \psi_{\alpha,i}}_{\text{propensity correction}} \tag{8}\]

Reading each term:

  • Channel 1: how individual \(i\)’s prediction deviates from the mean (same as g-comp). Zero when predictions are constant (saturated MSM + static intervention); nonzero for non-static interventions or non-saturated MSMs.
  • MSM correction: how \(i\)’s outcome residual (captured in \(\psi_{\beta,i}\)) propagates to \(\hat\mu_a\) through the MSM parameters. Same structure as the g-comp Channel 2.
  • Propensity correction: how \(i\)’s propensity score residual (captured in \(\psi_{\alpha,i}\)) propagates to \(\hat\mu_a\) through the weights. The chain is: \(\psi_{\alpha,i} \to \hat\alpha \to w_i \to \hat\beta \to \hat\mu_a\). The matrix product \(\color{#E6550D}{A_{\beta\alpha}}\color{#3182BD}{A_{\alpha\alpha}^{-1}}\) (with the minus sign supplied by the block inverse) captures the first two links (propensity parameters \(\to\) weights \(\to\) MSM parameters); \(\color{#31A354}{J}\color{#E6550D}{A_{\beta\beta}^{-1}}\) captures the last two (MSM parameters \(\to\) predictions \(\to\) estimand).

The second and third terms can be combined:

\[ \text{IF}_i(\hat\mu_a) = \text{Ch.1}_i \;+\; \color{#31A354}{J}\; \underbrace{ \color{#E6550D}{A_{\beta\beta}^{-1}} \left( \psi_{\beta,i} - \color{#E6550D}{A_{\beta\alpha}}\; \color{#3182BD}{A_{\alpha\alpha}^{-1}}\; \psi_{\alpha,i} \right)}_{\text{IF}_i(\hat\beta) \text{ from the stacked } (\alpha,\beta) \text{ system}} \]

The expression in parentheses is the adjusted score: the MSM score \(\psi_{\beta,i}\) corrected for propensity estimation. This is exactly what WeightIt::glm_weightit() computes internally when producing its M-estimation sandwich — we are now exposing it per-individual.

Note: Where are the bread and meat?

The bread appears three times: \(\color{#3182BD}{A_{\alpha\alpha}^{-1}}\) (propensity bread inverse), \(\color{#E6550D}{A_{\beta\beta}^{-1}}\) (MSM bread inverse), and \(\color{#31A354}{J}\) (Jacobian from the mean equation). The individual scores \(\psi_{\alpha,i}\) and \(\psi_{\beta,i}\) are what would form the meat if we squared and summed them. The IF avoids forming the \((q+p+1) \times (q+p+1)\) meat matrix.

For a saturated MSM \(Y \sim A\) with inverse-probability weights \(w_i = A_i / e(L_i)\) and propensity \(e(L_i) = \mathrm{expit}(L_i^{\mathsf T}\alpha)\),

\[ \hat\mu_1 = \frac{1}{n}\sum_i w_i Y_i, \]

and the IF of \(\hat\mu_1\) is the Channel-1 + propensity-corrected Channel-2 form given in Equation 8. causatr computes this via compute_ipw_if_self_contained_one(), which assembles the adjusted score per individual using numDeriv::jacobian() on the make_weight_fn() closure (see vignette("ipw-variance-theory")).

Code
# Fit the propensity model manually and check the point estimate matches
# causatr.
ps_fit <- glm(A ~ L, data = dat, family = binomial)
e_i    <- fitted(ps_fit)
w1     <- dat$A / e_i
mu1_ipw_manual <- sum(w1 * dat$Y) / n

fit_ipw <- causat(
  dat, outcome = "Y", treatment = "A", confounders = ~ L,
  estimator = "ipw", family = binomial
)
res_ipw <- contrast(
  fit_ipw,
  interventions = list(a1 = static(1), a0 = static(0)),
  reference = "a0", ci_method = "sandwich"
)
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
mu1_ipw_causatr <- res_ipw$estimates[intervention == "a1"]$estimate
se_ipw_causatr  <- res_ipw$estimates[intervention == "a1"]$se
IPW $\hat\mu_1$ and its IF-based SE

| quantity | value |
|---|---|
| \(\hat\mu_1\) (Horvitz–Thompson) | 0.4251 |
| \(\hat\mu_1\) (causatr IPW) | 0.41726 |
| \(\mathrm{SE}(\hat\mu_1)\) via causatr IF engine | 0.04932 |

The HT estimate and the causatr IPW estimate are close but not identical: the weighted-regression route returns the weight-normalised (Hájek) mean \(\sum_{i:A_i=1} w_i Y_i \big/ \sum_{i:A_i=1} w_i\), whereas the HT form divides by \(n\), and the two coincide only when the treated weights happen to sum to \(n\). The SE comes from the propensity-corrected adjusted score — the third term in Equation 8 — computed internally.

4.3 Matching

After matching, the outcome model is fitted on the matched sample. Two issues require special treatment.

Cluster-robust variance

Matched individuals within the same subclass (pair or stratum) are correlated — they were selected together. The standard sandwich (which assumes independence) would understate the meat.

The cluster-robust sandwich accounts for this. The bread \(\color{#3182BD}{A}\) is unchanged, but the meat sums scores within each subclass before squaring:

\[ V_\beta = \color{#3182BD}{A^{-1}} \;\color{#E6550D}{B_{\text{CL}}}\; \color{#3182BD}{(A^{-1})^T} \]

where:

\[ \color{#E6550D}{B_{\text{CL}}} = \frac{1}{n}\sum_{c=1}^C \bar\psi_c \, \bar\psi_c^T, \qquad \bar\psi_c = \sum_{i \in \text{subclass } c} \psi_i \qquad (p \times 1) \]

Squaring the summed scores (rather than summing the squared scores) allows within-cluster correlations to inflate the meat.

The full IF for matching

The per-individual IF has the same two-channel structure as g-comp:

\[ \text{IF}_i(\hat\mu_a) = \underbrace{g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a}_ {\text{Channel 1}} \;+\; \underbrace{ \color{#31A354}{J_a}\; \color{#3182BD}{A_{\beta\beta}^{-1}}\; \psi_{\beta,i}}_{\text{Channel 2}} \]

where \(\psi_{\beta,i}\) is the outcome model score for individual \(i\) on the matched data. The difference from g-comp is in the aggregation step: instead of treating each individual’s IF as independent, we account for within-subclass correlation by summing IFs within each matched set before squaring:

\[ \widehat{\text{Var}}(\hat\mu_a) = \frac{1}{n^2}\sum_{c=1}^{C} \left(\sum_{i \in \text{subclass } c} \text{IF}_i\right)^2 \]

This is the IF analogue of the cluster-robust sandwich: instead of summing squared IFs (which assumes independence), we sum the squared cluster-level IFs. It is identical to \(\color{#3182BD}{A^{-1}} \color{#E6550D}{B_{\text{CL}}} \color{#3182BD}{(A^{-1})^T}\) from the cluster-robust meat above, applied to the stacked system.

For a fit on a matched sample with subclass indicator \(c_i\),

\[ \widehat{\mathrm{Var}}(\hat\beta)_{\text{CL}} = \hat A^{-1} \left(\frac{1}{n}\sum_{c=1}^C \bar\psi_c\,\bar\psi_c^{\mathsf T}\right) \hat A^{-1} / n, \qquad \bar\psi_c = \sum_{i \in c}\psi_i. \]

Equivalently, \(\widehat{\mathrm{Var}}(\hat\beta)_{\text{CL}} = \frac{1}{n^2}\sum_{c=1}^C \bar{\mathrm{IF}}_c\,\bar{\mathrm{IF}}_c^{\mathsf T}\), with \(\bar{\mathrm{IF}}_c = \sum_{i \in c}\mathrm{IF}_i\).

Code
# Build a toy matched design on dat: assign each pair of consecutive
# rows to the same subclass. This is not a real match, but it exercises
# the cluster-robust machinery on a dataset we already have.
subclass <- rep(seq_len(n / 2), each = 2)

# Manual cluster-robust sandwich, using the IF decomposition.
# IF_beta was computed in Section 2.3 (n x p matrix).
IF_by_cluster <- rowsum(IF_beta, group = subclass)          # C x p
V_beta_CL_manual <- crossprod(IF_by_cluster) / n^2

# Gold standard: sandwich::vcovCL() with the same subclass grouping.
# Use HC0 (no finite-sample adjustment) so we match the textbook formula
# above exactly.
V_beta_CL_gold <- sandwich::vcovCL(
  fit,
  cluster = subclass,
  type    = "HC0",
  cadjust = FALSE
)
Cluster-robust $V_\beta$: manual vs `sandwich::vcovCL`

| cell | manual | reference | abs diff |
|---|---|---|---|
| (Intercept) × (Intercept) | 0.065728 | 0.065728 | 0.0000000048472695 |
| A × (Intercept) | -0.067156 | -0.067156 | 0.0000000046773747 |
| L × (Intercept) | 0.002579 | 0.002579 | 0.0000000003900766 |
| (Intercept) × A | -0.067156 | -0.067156 | 0.0000000046773747 |
| A × A | 0.120543 | 0.120543 | 0.0000000110195071 |
| L × A | -0.005809 | -0.005809 | 0.0000000067333065 |
| (Intercept) × L | 0.002579 | 0.002579 | 0.0000000003900766 |
| A × L | -0.005809 | -0.005809 | 0.0000000067333065 |
| L × L | 0.033929 | 0.033929 | 0.0000000096446357 |

The “sum within cluster, then square” IF aggregation reproduces sandwich::vcovCL() (with cadjust = FALSE) to numerical precision. Matching’s bread is unchanged from the g-comp case — only the meat is formed differently, by accumulating IFs inside each subclass before squaring.

4.4 Summary: one method, three estimators

All three methods use the same variance formula — the IF squared and summed (Equation 5). What differs is the estimating equations and how Channel 2 is computed.

| Method | Stacked equations | Channel 2 models | Aggregation |
|---|---|---|---|
| G-comp | Outcome model + mean eq. | 1 (outcome) | Standard |
| IPW | Propensity + MSM + mean eq. | 2 (propensity, MSM) | Standard |
| Matching | Outcome model + mean eq. | 1 (outcome) | Cluster-robust |
| ICE | \(K\!+\!1\) outcome models + mean eq. | \(K\!+\!1\) (backward chain) | Standard |

The IF formula handles all cases uniformly. The only per-method variation is which models contribute to Channel 2 and how IFs are aggregated (summed within clusters for matching, directly for all others).

5. Chained estimators: ICE g-computation

5.1 The chain rule for influence functions

Section 3 derived the IF for a two-step estimator (fit \(\hat\beta\), then compute \(\hat\mu_a\)). What happens with multiple models in a chain, where each model’s response is a prediction from the next? The same logic applies recursively. If \(\hat\mu = f(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_K)\) and each \(\hat\beta_k\) is an M-estimator with bread \(\color{#3182BD}{A_k}\) and score \(\psi_k\), the IF of \(\hat\mu\) is (Hampel et al., 1986; Huber, 1981):

\[ \text{IF}_i(\hat\mu) \;=\; \text{IF}_i^{\text{direct}} \;+\; \sum_{k=0}^{K} \frac{\partial \hat\mu}{\partial \hat\beta_k^T} \;\cdot\; \color{#3182BD}{A_k^{-1}} \; \psi_k(O_i; \hat\beta_k) \tag{9}\]

Each term in the sum is a “Channel 2” from Section 3.2 applied to model \(k\): the Jacobian \(\partial\hat\mu/\partial\hat\beta_k^T\) (how \(\hat\mu\) depends on model \(k\)) times the IF of model \(k\)’s parameters (\(\color{#3182BD}{A_k^{-1}}\psi_{k,i}\)).
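The Jacobian for an upstream model is the only genuinely new ingredient, and it can be checked numerically. The sketch below builds a toy two-model linear chain on simulated data (illustrative, not causatr code): model 1 produces pseudo-outcomes under the intervention, model 0 regresses them on baseline variables, and \(\hat\mu\) averages model 0’s intervened predictions. Perturbing \(\hat\beta_1\) and re-running the downstream refit gives a finite-difference \(\partial\hat\mu/\partial\hat\beta_1\) that should match the analytic chain-rule product:

```r
# Numeric check of d mu_hat / d beta_1 for a toy two-model linear chain.
set.seed(8)
n <- 300
L0 <- rnorm(n); A0 <- rbinom(n, 1, 0.5)
L1 <- rnorm(n); A1 <- rbinom(n, 1, 0.5)
Y  <- 1 + A1 + L1 + L0 + rnorm(n)

X1 <- cbind(1, A1, L1); X1s <- cbind(1, 1, L1)   # model 1; A1 set to 1
X0 <- cbind(1, A0, L0); X0s <- cbind(1, 1, L0)   # model 0; A0 set to 1

mu_of_beta1 <- function(b1) {
  Ytilde <- as.numeric(X1s %*% b1)               # pseudo-outcome from model 1
  b0 <- solve(crossprod(X0), crossprod(X0, Ytilde))  # refit model 0
  mean(X0s %*% b0)
}

b1_hat <- solve(crossprod(X1), crossprod(X1, Y))

# Analytic chain-rule Jacobian: xbar0s' (X0'X0)^{-1} X0' X1s
J1_analytic <- colMeans(X0s) %*% solve(crossprod(X0)) %*% crossprod(X0, X1s)

# Finite-difference Jacobian, one coordinate of beta_1 at a time.
eps <- 1e-6
J1_numeric <- sapply(seq_along(b1_hat), function(j) {
  d <- b1_hat; d[j] <- d[j] + eps
  (mu_of_beta1(d) - mu_of_beta1(b1_hat)) / eps
})

max(abs(as.numeric(J1_analytic) - J1_numeric))
```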

Note: Connections to other frameworks

This IF-based approach is equivalent to:

  1. Stacked estimating equations (Stefanski & Boos, 2002; Zivich et al., 2024): concatenate all models’ estimating equations and apply the generic sandwich to the full stack.
  2. Semiparametric efficiency theory (Tsiatis, 2006): derive the efficient influence function (EIF) and project onto the parametric model.

All yield the same variance estimate.

5.2 ICE g-computation: the motivating example

Iterated conditional expectation (ICE) g-computation estimates the counterfactual mean under a treatment strategy in longitudinal data (Robins, 1986; Bang & Robins, 2005; Zivich et al., 2024). It fits \(K+1\) outcome models backward in time:

flowchart RL
  MK["Model K (final time)<br/>Fit: E[Y | A_K, H_K]<br/>Predict → Ỹ_K"] -->|"Ỹ_K becomes<br/>response for"| MK1["Model K-1<br/>Fit: E[Ỹ_K | A_{K-1}, H_{K-1}]<br/>Predict → Ỹ_{K-1}"]
  MK1 -->|"Ỹ_{K-1} becomes<br/>response for"| Mdots["⋯"]
  Mdots -->|"response for"| M0["Model 0 (baseline)<br/>Fit: E[Ỹ_1 | A_0, H_0]<br/>Predict → Ỹ_0"]
  M0 -->|"average"| MU["μ̂ = mean(Ỹ_0)"]

  style MK fill:#f0f4ff,stroke:#3182BD
  style MK1 fill:#f0f4ff,stroke:#3182BD
  style M0 fill:#f0f4ff,stroke:#3182BD
  style MU fill:#e8f5e9,stroke:#31A354
Figure 2: ICE backward iteration. Models are fit from final time to baseline. Each model’s predictions become the next model’s response.

Each model’s response (\(\tilde{Y}_k\)) is a prediction from the next model in the chain. This creates a cascade of dependencies.

5.3 Why naive approaches fail

Naive 1: Variance of pseudo-outcomes only. Uses only Channel 1 — ignores all model correction terms.

Naive 2: \(\color{#31A354}{J}V_\beta\color{#31A354}{J^T}\) from model 0 only. Captures Channel 2 for model 0 but ignores models \(1, \ldots, K\).

Correct: Full IF chain. Account for all \(K+1\) models via the chain rule.

5.4 Deriving the IF for chained models

The IF decomposition

Applying the chain rule (Section 5.1):

\[ \text{IF}_i = \underbrace{\frac{n}{n_t}(\tilde{Y}_{0,i} - \hat\mu)}_{\text{Channel 1}} + \sum_{k=0}^{K} \underbrace{ n_k \cdot \color{#E6550D}{d_{k,i}} \cdot r^{\text{score}}_{k,i} }_{\text{Channel 2 for model } k} \tag{10}\]

where \(\color{#E6550D}{d_{k,i}}\) is the sensitivity — how much a perturbation at model \(k\) propagates to \(\hat\mu\) — \(n_k\) is the size of model \(k\)’s fitting set (the scaling factor derived in Section 5.5), and \(r^{\text{score}}_{k,i}\) is individual \(i\)’s score residual at model \(k\):

\[ r^{\text{score}}_{k,i} \;=\; (Y_{k,i} - \hat{Y}_{k,i}) \cdot \frac{d\mu/d\eta\big|_{\eta_{k,i}}}{V(\hat{Y}_{k,i})} \]

i.e. the response residual \(r_{k,i} = Y_{k,i} - \hat{Y}_{k,i}\) weighted by the link-scale factor from the GLM score (Section 1.2, route 3). For canonical links (logistic-logit, Poisson-log, Gaussian-identity) the ratio \((d\mu/d\eta)/V(\mu) = 1\) and \(r^{\text{score}}_{k,i}\) collapses to the ordinary response residual \(r_{k,i}\). For non-canonical links (probit, cloglog, Gamma-log, Poisson-sqrt, etc.) the factor is not \(1\) and must be applied.
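As a standalone check (plain base R, not causatr code), the factor \((d\mu/d\eta)/V(\mu)\) can be read directly off a `family` object: it is identically \(1\) for the canonical logit link but varies with \(\eta\) under probit.

```r
# Compute the link-scale factor (dmu/deta) / V(mu) from base-R family
# objects, at a few arbitrary linear-predictor values.
eta <- c(-1, 0, 1.5)

score_factor <- function(fam, eta) {
  mu <- fam$linkinv(eta)
  fam$mu.eta(eta) / fam$variance(mu)
}

score_factor(binomial(link = "logit"), eta)   # canonical: 1, 1, 1
score_factor(binomial(link = "probit"), eta)  # non-canonical: varies with eta
```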

Warning: Why the score residual, not the response residual

\(\text{Ch.2}_i = J_k A_k^{-1} \psi_{k,i}\) contains the full GLM score \(\psi_{k,i} = (Y - \mu)\cdot(d\mu/d\eta)/V(\mu)\cdot X_i\), not the bare residual \((Y - \mu)\cdot X_i\). Working through the algebra:

\[ \color{#31A354}{J_k}\color{#3182BD}{A_k^{-1}}\psi_{k,i} \;=\; n\cdot d_{k,i}\cdot (Y_{k,i} - \hat Y_{k,i})\cdot \frac{d\mu/d\eta\big|_{\eta_{k,i}}}{V(\hat Y_{k,i})} \;=\; n\cdot d_{k,i}\cdot r^{\text{score}}_{k,i} \]

The factor of \(n\) arises from \(A_k^{-1} = n(X_k^T W_k X_k)^{-1}\) (Section 5.5). Omitting the \((d\mu/d\eta)/V\) factor and using the plain response residual is correct only when it equals \(1\) — i.e., only for canonical links. This mistake is silent on logistic/Poisson-log/Gaussian models but miscalibrates probit and cloglog.

What the sensitivities measure

Each model correction requires a sensitivity \(\color{#E6550D}{d_{k,i}}\): a scalar that answers “if individual \(i\)’s residual at model \(k\) increased by one unit, how much would \(\hat\mu\) change?” Computing \(\color{#E6550D}{d_{k,i}}\) takes three steps, each with a clear role:

  1. Gradient \(g_k\) (\(p_k \times 1\)): the total sensitivity of \(\hat\mu\) to model \(k\)’s coefficient vector \(\beta_k\). This is a Jacobian — the same object as \(\color{#31A354}{J_a}\) from Section 3.2, but computed recursively for \(k > 0\).
  2. Solve \(h_k = (X_k^T W_k X_k)^{-1} g_k\) (\(p_k \times 1\)): convert from “how \(\hat\mu\) responds to \(\beta_k\)” to “how \(\hat\mu\) responds to individual score contributions.” This is the bread inverse \(\color{#3182BD}{A_k^{-1}}\) applied to \(g_k\).
  3. Distribute \(\color{#E6550D}{d_{k,i}} = h_k^T X_{k,i}\) (scalar): project onto individual \(i\)’s covariate vector to get their personal sensitivity.

Base case: model 0

\(\hat\mu\) is a direct average of model 0’s predictions. The gradient of \(\hat\mu\) w.r.t. \(\beta_0\) is therefore the Jacobian from Section 3.2:

\[ \underset{p_0 \times 1}{g_0} = \frac{1}{n_t}\sum_{j \in \text{target}} \frac{d\mu}{d\eta}\bigg|_{\eta_{0,j}^*} \;\cdot\; X_{0,j}^* \]

\(g_0\) tells us the total sensitivity of \(\hat\mu\) to \(\beta_0\), but the IF requires a per-individual quantity. We need to ask: “of the total shift in \(\hat\beta_0\) caused by the data, how much came from individual \(i\)?” The answer uses two facts: (a) individual \(i\)’s influence on \(\hat\beta_0\) is proportional to their score \(\psi_{0,i} = (d\mu/d\eta)_i / V(\mu_i) \cdot r_{0,i} \cdot X_{0,i}\); and (b) the constant of proportionality involves the bread inverse \((X_0^T W_0 X_0)^{-1}\). Combining:

\[ h_0 = \underbrace{(X_0^T W_0 X_0)^{-1}}_{\text{bread inverse}} g_0, \qquad \color{#E6550D}{d_{0,i}} = h_0^T X_{0,i} \]

Now \(\color{#E6550D}{d_{0,i}} \cdot r^{\text{score}}_{0,i}\) gives individual \(i\)’s Channel 2 contribution: their score residual \(r^{\text{score}}_{0,i} = (Y_{0,i} - \hat Y_{0,i}) \cdot (d\mu/d\eta)_i / V(\mu_i)\) (how much their data “disagrees” with model 0, on the link-scaled score metric) scaled by \(\color{#E6550D}{d_{0,i}}\) (how much that disagreement matters for \(\hat\mu\)). For canonical links \((d\mu/d\eta)/V = 1\) and the score residual collapses to the response residual \(r_{0,i}\). For non-canonical links the link-scale factor must be kept. This is equivalent to \(\color{#31A354}{J_0}\color{#3182BD}{A_0^{-1}}\psi_{0,i}\) from Section 3.2, factored into the three steps above.
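To see that equivalence concretely, here is a standalone sketch using plain `glm` (not causatr internals) with a hypothetical "set \(a = 1\)" intervention: the three-step route (gradient, solve, distribute) reproduces \(\color{#31A354}{J_0}\color{#3182BD}{A_0^{-1}}\psi_{0,i}\) computed directly.

```r
# Two routes to the per-individual Channel 2 term for one logistic
# model under a "set a = 1" intervention; they agree exactly.
set.seed(1)
n <- 200
x <- rnorm(n)
a <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, plogis(-0.5 + x + a))
fit <- glm(y ~ x + a, family = binomial())

X  <- model.matrix(fit)
Xs <- X; Xs[, "a"] <- 1                    # intervention design matrix
dmu_s <- binomial()$mu.eta(drop(Xs %*% coef(fit)))

W <- fit$weights                           # IWLS weights = mu(1 - mu) here
r <- y - fitted(fit)                       # response = score residual (canonical)

# Route 1: J A^{-1} psi_i, with A = (1/n) X'WX and psi_i = r_i * X_i
J     <- colSums(dmu_s * Xs) / n
A     <- crossprod(X, X * W) / n
ch2_1 <- drop((r * X) %*% solve(A, J))

# Route 2: gradient -> solve -> distribute (g, h, d of this section)
g <- colSums(dmu_s * Xs) / n
h <- solve(crossprod(X, X * W), g)
d <- drop(X %*% h)
ch2_2 <- n * d * r

all.equal(unname(ch2_1), unname(ch2_2))    # TRUE
```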

Recursive case: model \(k = 1, \ldots, K\)

Model \(k\) does not feed into \(\hat\mu\) directly. Instead, model \(k\)’s predictions become model \(k-1\)’s pseudo-outcomes (response variable). We already know how sensitive \(\hat\mu\) is to each of those pseudo-outcomes — that is \(\color{#E6550D}{d_{k-1,j}}\) from the previous step.

The gradient of \(\hat\mu\) w.r.t. \(\beta_k\) therefore weights each individual’s contribution by the previously-computed sensitivity, further scaled by the prior weight \(w_{k-1,j}\) that the \((k-1)\)-th score function carries for row \(j\):

\[ \underset{p_k \times 1}{g_k} = \sum_{j \in \text{fit}_{k-1}} w_{k-1,j}\;\cdot\; \color{#E6550D}{d_{k-1,j}} \;\cdot\; \frac{d\mu}{d\eta}\bigg|_{\eta_{k,j}^*} \;\cdot\; X_{k,j}^* \]

The \(w_{k-1,j}\) factor comes from \(A_{k-1,k} = E[\partial s_{k-1,j}/\partial \beta_k]\), and \(s_{k-1,j}\) inherits \(w_{k-1,j}\) because the \((k-1)\)-th pseudo-outcome model is a weighted fit. In the unweighted case \(w_{k-1,j} \equiv 1\) and the factor drops out; dropping it under non-uniform external weights (e.g. IPCW, survey weights) silently underestimates the sandwich variance by a factor that scales with weight heterogeneity.
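A standalone illustration of why the prior weights belong in the score (plain weighted `glm` on hypothetical data, not causatr code): the weighted fit’s normal equations zero the weighted score, not the unweighted one.

```r
# A weighted GLM solves sum_j w_j * r_j * X_j = 0 (canonical link), so
# each row's score contribution carries its prior weight w_j.
set.seed(3)
n <- 100
x <- rnorm(n)
w <- runif(n, 0.5, 2)                # non-uniform prior weights
y <- 1 + 2 * x + rnorm(n)
fit <- glm(y ~ x, weights = w)       # gaussian family, identity link

X <- model.matrix(fit)
r <- y - fitted(fit)
colSums(w * r * X)   # ~ c(0, 0): the weighted score vanishes at the fit
colSums(r * X)       # generally nonzero: dropping w breaks the identity
```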

Compare this to \(g_0\): the uniform weight \(\frac{1}{n_t}\) over the target population is replaced by \(w_{k-1,j} \cdot \color{#E6550D}{d_{k-1,j}}\) over the previous model’s fitting set. Everything else — solve and distribute — is identical:

\[ h_k = (X_k^T W_k X_k)^{-1} g_k, \qquad \color{#E6550D}{d_{k,i}} = h_k^T X_{k,i} \]

5.5 Scaling: the factor of \(n\)

Each model correction involves \(\color{#3182BD}{A_k^{-1}}\). Since \(\color{#3182BD}{A_k} = \frac{1}{n_k} X_k^T W_k X_k\), we have \(\color{#3182BD}{A_k^{-1}} = n_k(X_k^T W_k X_k)^{-1}\). So the correction at step \(k\) is:

\[ n_k \cdot \color{#E6550D}{d_{k,i}} \cdot r^{\text{score}}_{k,i} \]

The factor \(n_k\) ensures each IF contribution is \(O(1)\) per individual, giving the correct \(\text{Var}(\hat\mu) = O(1/n)\).
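The identity in isolation, as a quick numerical check on hypothetical inputs:

```r
# A_k = (1/n_k) X'WX, so inverting the bread picks up a factor of n_k
# relative to the plain crossproduct inverse used in the solve step.
set.seed(4)
n_k <- 50
X_k <- cbind(1, rnorm(n_k))
W_k <- runif(n_k, 0.5, 2)
A_k <- crossprod(X_k, X_k * W_k) / n_k
all.equal(solve(A_k), n_k * solve(crossprod(X_k, X_k * W_k)))  # TRUE
```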

Warning: A common bug

Omitting \(n_k\) makes each correction \(O(1/n)\), yielding \(\text{Var} = O(1/n^3)\) — confidence intervals that are much too narrow.

For a 2-period ICE with outcome model \(m_1\) at \(k=1\) and pseudo-outcome model \(m_0\) at \(k=0\), the IF is

\[ \mathrm{IF}_i(\hat\mu) = \big(\hat m_0(H_0^{(a)}) - \hat\mu\big) + n_1\,d_{1,i}\,r^{\text{score}}_{1,i} + n_0\,d_{0,i}\,r^{\text{score}}_{0,i}, \]

with the sensitivities \(d_{k,i}\) propagated forward by the recursion of Section 5.4. The \(n_k\) factors are essential (Section 5.5). For the linear-Gaussian DGP below, the true joint ATE under \((a_0=1, a_1=1)\) vs \((a_0=0, a_1=0)\) is exactly \(0.8 + 0.6 + 0.4 \times 0.4 = 1.56\): the direct coefficients on \(A_0\) and \(A_1\), plus the mediated \(A_0 \to L_1 \to Y\) pathway, which a static joint intervention includes. We check that causatr’s sandwich CI covers it.

Code

```r
set.seed(7)
n_ice <- 500
id    <- seq_len(n_ice)
L0    <- rnorm(n_ice)
A0    <- rbinom(n_ice, 1, plogis(0.3 * L0))
L1    <- 0.5 * L0 + 0.4 * A0 + rnorm(n_ice)
A1    <- rbinom(n_ice, 1, plogis(0.2 * L1 + 0.3 * A0))
Y_ice <- 1 + 0.8 * A0 + 0.6 * A1 + 0.5 * L0 + 0.4 * L1 + rnorm(n_ice)

long <- rbind(
  data.frame(id = id, time = 0L, A = A0, L = L0, Y = NA_real_),
  data.frame(id = id, time = 1L, A = A1, L = L1, Y = Y_ice)
)

fit_ice <- causat(
  long, outcome = "Y", treatment = "A",
  confounders     = ~ 1,
  confounders_tv  = ~ L,
  id = "id", time = "time"
)
res_ice <- contrast(
  fit_ice,
  interventions = list(a11 = static(1), a00 = static(0)),
  reference     = "a00",
  ci_method     = "sandwich"
)

ate_ice <- res_ice$contrasts$estimate
se_ice  <- res_ice$contrasts$se
ci_lo   <- res_ice$contrasts$ci_lower
ci_hi   <- res_ice$contrasts$ci_upper
truth   <- 0.8 + 0.6 + 0.4 * 0.4  # direct effects plus the A0 -> L1 -> Y path
covered <- (truth >= ci_lo) && (truth <= ci_hi)
```

ICE chained IF on a 2-period linear-Gaussian DGP

| quantity | value |
|---|---|
| ICE ATE (causatr sandwich) | 1.3539 |
| \(\mathrm{SE}\) | 0.1314 |
| 95% CI lower | 1.0964 |
| 95% CI upper | 1.6113 |
| Truth (\(0.8 + 0.6 + 0.16\)) | 1.56 |
| Truth \(\in\) 95% CI? | yes |

The sandwich CI produced by variance_if_ice() covers the structural truth of \(1.56\), and the chained IF machinery is doing what Section 5.4 says it should.

5.6 Pseudo-code: IF for a chained estimator

Pseudo-code: chained_influence()
```r
# Pseudo-code sketch: `number_of_individuals`, `pseudo_final`,
# `intervention_data`, and `$fitting_ids` stand in for package internals.
chained_influence <- function(models, data_iv, target) {
  n <- number_of_individuals
  K <- length(models) - 1
  mu_hat <- mean(pseudo_final[target])
  n_target <- sum(target)

  # Channel 1 with target-subpopulation scaling (see Sec. 3.6).
  IF <- rep(0, n)
  IF[target] <- (n / n_target) * (pseudo_final[target] - mu_hat)

  d <- rep(0, n)  # sensitivities from previous step

  for (step in 0:K) {
    model_k <- models[[step + 1]]
    family_k <- model_k$family
    X_k <- model.matrix(model_k)         # fitting design matrix
    eta_k <- model_k$linear.predictors
    mu_eta_k <- family_k$mu.eta(eta_k)   # dmu/deta at fitted values
    V_k <- family_k$variance(fitted(model_k))
    W_k <- mu_eta_k^2 / V_k              # IWLS working weights
    fit_ids_k <- model_k$fitting_ids
    n_k <- length(fit_ids_k)

    # Score residual = response residual * (dmu/deta) / V(mu). For
    # canonical links this factor is 1; for non-canonical (probit,
    # cloglog, Gamma-log, ...) it is not.
    r_response_k <- residuals(model_k, type = "response")
    r_score_k <- r_response_k * mu_eta_k / V_k

    # Intervention design matrix and link derivatives (dmu/deta)
    X_star_k <- model.matrix(model_k, newdata = intervention_data[[step + 1]])
    eta_star <- X_star_k %*% coef(model_k)
    mu_eta_star <- family_k$mu.eta(eta_star)

    # Step 1: Gradient g_k = d mu_hat / d beta_k. The step > 0 branch
    # carries w_{k-1} from the previous-step fit's prior weights.
    if (step == 0) {
      g_k <- crossprod(X_star_k[target, ], mu_eta_star[target]) / n_target
    } else {
      prev_ids <- models[[step]]$fitting_ids
      w_prev <- models[[step]]$prior.weights  # 1-vector if unweighted
      g_k <- crossprod(
        X_star_k[prev_ids, ],
        w_prev * d[prev_ids] * mu_eta_star[prev_ids]
      )
    }

    # Step 2: Solve h_k = (X'WX)^{-1} g_k
    h_k <- solve(crossprod(X_k, X_k * W_k)) %*% g_k

    # Step 3: Distribute d_{k,i} = h_k' X_{k,i}
    d <- rep(0, n)
    d[fit_ids_k] <- X_k %*% h_k

    # Channel 2 for model k: n_k * d_{k,i} * r_score_{k,i}
    r_vec <- rep(0, n)
    r_vec[fit_ids_k] <- r_score_k
    IF <- IF + n_k * d * r_vec
  }
  IF
}
```

5.7 From individual IFs to the variance

Once we have the per-individual \(\text{IF}_i\) from the recursion above, variance estimation follows from the same formula as always (Equation 5):

\[ \widehat{\text{Var}}(\hat\mu) = \frac{1}{n^2}\sum_{i=1}^n \text{IF}_i^2 \]

For two interventions \(a\) and \(b\) on the same data:

\[ \widehat{\text{Cov}}(\hat\mu_a, \hat\mu_b) = \frac{1}{n^2}\sum_{i=1}^n \text{IF}_i^{(a)} \cdot \text{IF}_i^{(b)} \]

The full \(k \times k\) covariance matrix is then used for contrasts.
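As a sketch with simulated placeholder IF vectors (real ones come from the recursion above): the contrast variance follows either from the full covariance matrix or, equivalently, from the IF of the difference.

```r
# Contrast variance from per-individual IFs. IF_a and IF_b here are
# simulated placeholders standing in for chained-recursion output.
set.seed(2)
n <- 400
IF_a <- rnorm(n)
IF_b <- 0.6 * IF_a + rnorm(n, sd = 0.5)   # correlated, as on shared data

V <- crossprod(cbind(IF_a, IF_b)) / n^2   # 2x2 covariance of (mu_a, mu_b)
cvec <- c(1, -1)                          # contrast: mu_a - mu_b
var_diff <- drop(t(cvec) %*% V %*% cvec)

# Equivalent: the IF of a difference is the difference of the IFs.
all.equal(var_diff, sum((IF_a - IF_b)^2) / n^2)  # TRUE
```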

6. Summary

  1. One method: the influence function. For any causal estimand \(\hat\mu\), write down the complete estimating equations (nuisance models + target estimand), compute each individual’s \(\text{IF}_i\), and take \(\widehat{\text{Var}}(\hat\mu) = \frac{1}{n^2}\sum_i \text{IF}_i^2\).

  2. The IF is the sandwich: the IF formula is the sandwich \(\color{#3182BD}{A^{-1}} \color{#E6550D}{B} \color{#3182BD}{(A^{-1})^T}\) applied to the stacked estimating equations, with the block-triangular bread inverted by back-substitution. The bread, meat, and scores are all present — the IF just computes \(A^{-1}\psi_i\) per individual instead of forming the full matrices.

  3. Two channels: the IF always decomposes into a direct term (Channel 1: covariate sampling) and model corrections (Channel 2: one per nuisance model). What varies across methods is only which models contribute to Channel 2.

  4. G-computation: one model correction (outcome model). Channel 1 nonzero.

  5. IPW: two model corrections (MSM + propensity). The propensity correction captures weight estimation uncertainty via the cross-derivative \(A_{\beta\alpha}\).

  6. Matching: one model correction (outcome model), same as g-comp. Aggregation is cluster-robust (sum IFs within matched subclasses).

  7. Chained estimators (ICE): \(K\!+\!1\) model corrections computed by backward recursion of sensitivities.

  8. The delta method (\(\color{#31A354}{J} V_\beta \color{#31A354}{J^T}\)) is a shortcut that captures Channel 2 only. It is exact when Channel 1 = 0 (constant predictions), but underestimates otherwise. The full IF avoids this limitation.

References

  • Bang H, Robins JM (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61:962–973.
  • Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley.
  • Hirano K, Imbens GW, Ridder G (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71:1161–1189.
  • Hernán MA, Robins JM (2025). Causal Inference: What If. Chapman & Hall/CRC.
  • Huber PJ (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium 1:221–233.
  • Huber PJ (1981). Robust Statistics. Wiley.
  • Lunceford JK, Davidian M (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine 23:2937–2960.
  • Robins JM (1986). A new approach to causal inference in mortality studies with a sustained exposure period. Mathematical Modelling 7:1393–1512.
  • Stefanski LA, Boos DD (2002). The calculus of M-estimation. The American Statistician 56:29–38.
  • Tsiatis AA (2006). Semiparametric Theory and Missing Data. Springer.
  • van der Laan MJ, Robins JM (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer.
  • White H (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48:817–838.
  • White H (1982). Maximum likelihood estimation of misspecified models. Econometrica 50:1–25.
  • Zeileis A (2004). Econometric computing with HC and HAC covariance matrix estimators. Journal of Statistical Software 11(10):1–17.
  • Zeileis A (2006). Object-oriented computation of sandwich estimators. Journal of Statistical Software 16(9):1–16.
  • Zivich PN, Klose M, Cole SR, Edwards JK, Shook-Sa BE (2022). Delicatessen: a Python package for the generalized computation of the sandwich estimator. arXiv preprint arXiv:2203.11300.
  • Zivich PN, Ross RK, Shook-Sa BE, Cole SR, Edwards JK (2024). Empirical sandwich variance estimator for iterated conditional expectation g-computation. Statistics in Medicine 43:5562–5572.