---
title: "Variance estimation for causal effects"
toc-depth: 4
code-fold: show
code-tools: true
vignette: >
%\VignetteIndexEntry{Variance estimation for causal effects}
%\VignetteEngine{quarto::html}
%\VignetteEncoding{UTF-8}
---
```{=html}
<style>
mjx-container[jax="CHTML"][display="true"] {
overflow-x: auto;
overflow-y: visible;
padding: 0.5em 0;
}
</style>
```
```{r}
#| include: false
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(tinytable_html_mathjax = TRUE)
```
This vignette develops the theory behind variance estimation for causal
effect estimators in causatr. The core message is simple: **there is one
method** — the influence function (IF) — applied uniformly to
g-computation, IPW, and matching. What varies across methods is only the
estimating equations that define the estimator; the variance formula is
always the same.
Every non-trivial equation in the sections that follow has a
**"Verify numerically"** tabset alongside it: you can switch from the
algebra to a short block of R code that computes the same quantity on a
fixed running example, and to the resulting value. Those code blocks are
real — the vignette is rendered each time the package is built, so if a
formula and its implementation ever drift apart, the table in the "Result"
tab will make it obvious.
## 0. Roadmap
The destination is $\mathrm{Var}(\hat\mu)$, the sampling variance of a
*causal* estimand like $\mathbb{E}[Y^a]$ or $\mathbb{E}[Y^{a_1}] -
\mathbb{E}[Y^{a_0}]$. We get there in six steps:
1. **Section 1 — Sandwich variance of $\hat\beta$.** Any estimator defined
as the root of $\sum_i \psi(O_i;\beta) = 0$ has an asymptotic variance
of the form $A^{-1} B A^{-1}/n$ (bread × meat × bread). This is the
"variance of the parameters", not yet of the estimand.
2. **Section 2 — The influence function.** The sandwich is equivalent to
$\mathrm{Var}(\hat\beta) = \frac{1}{n^2}\sum_i \mathrm{IF}_i
\mathrm{IF}_i^{\mathsf T}$, where $\mathrm{IF}_i = A^{-1}\psi_i$ is a
per-individual "gradient of the estimate with respect to individual
$i$". Same number, different parametrisation — but the IF form is the
one that extends to causal estimands.
3. **Section 3 — Two-channel IF for $\hat\mu$.** A causal estimand
$\hat\mu = \frac{1}{n}\sum_i m(X_i;\hat\beta)$ mixes two sources of
randomness: the random draw of $X_i$ (Channel 1) and the estimation of
$\hat\beta$ (Channel 2). The correct IF is the sum of both channels.
The delta method $J V_\beta J^{\mathsf T}$ is the Channel-2-only
shortcut — **exact when predictions are constant per intervention, too
small otherwise.**
4. **Section 4 — The three estimators.** G-computation, IPW, and matching
are just three choices of estimating equations plugged into the same
machinery. Each gets its own IF, but all three are aggregated the same
way.
5. **Section 5 — Chained estimators (ICE).** When several nuisance models
feed into each other (as in iterated conditional expectation
g-computation for longitudinal data), the IF picks up one extra
Channel-2 correction per model. We derive these via a forward
sensitivity recursion.
6. **Section 6 — Summary.** A table mapping each estimator to its
estimating equations, its IF, and the line in causatr where that IF is
computed.
A good way to read this vignette is: skim the math in a section, then
expand the **Verify numerically** tabset under each key equation to see
the formula instantiated on real data and cross-checked against either
`sandwich::sandwich()` or causatr's own `variance_if()` engine.
### 0.1 Running example
Every verification block below uses the same small logistic example. It's
deliberately tiny (so the tables fit in the vignette) but not so tiny that
asymptotic approximations fall apart.
```{r}
#| label: setup-data
#| message: false
library(causatr)
library(sandwich)
library(tinytable)
set.seed(42)
n <- 200
L <- rnorm(n)
A <- rbinom(n, 1, plogis(0.5 * L))
Y <- rbinom(n, 1, plogis(-1 + 0.8 * A + 0.6 * L))
dat <- data.frame(Y = Y, A = A, L = L)
# The outcome model we'll use throughout Sections 1-3.
fit <- glm(Y ~ A + L, data = dat, family = binomial)
beta_hat <- coef(fit)
X <- model.matrix(fit) # design matrix, n x p
mu <- fitted(fit) # predicted probabilities
p <- length(beta_hat)
# A small helper: print two named numeric matrices / vectors side-by-side
# as a tinytable, so that "manual" and "gold-standard" results sit next
# to each other and any disagreement is visible.
verify_tbl <- function(manual, reference, label_m = "manual",
label_r = "reference", digits = 4) {
m <- as.numeric(manual)
r <- as.numeric(reference)
out <- data.frame(
cell = if (length(m) == 1) "scalar" else seq_along(m),
`value` = m,
reference = r,
`abs diff` = abs(m - r),
check.names = FALSE
)
names(out)[2] <- label_m
names(out)[3] <- label_r
tt(out) |>
format_tt(j = 2:4, digits = digits, num_fmt = "significant")
}
```
A quick sanity check: the fitted coefficients $\hat\beta$.
```{r}
#| code-fold: true
tt(data.frame(coef = names(beta_hat), estimate = round(beta_hat, 4)))
```
```{mermaid}
%%| label: fig-pipeline
%%| fig-cap: "The variance estimation pipeline for causal effects. The influence function (IF) is the one method: write down the estimating equations for the full system (nuisance models + target estimand), compute each individual's IF, then square and sum. The delta method (J V_β J^T) is a shortcut that captures only parameter uncertainty — it's exact when predictions are constant per intervention, but underestimates otherwise."
flowchart TB
EE["Complete estimating equations<br/>(nuisance models + estimand)"] -->|"linearise<br/>(Sec 1–2)"| IF["IF_i per individual<br/>= direct + model correction"]
IF -->|"Var = (1/n²) Σ IF²<br/>(always correct)"| VM["Var(μ̂)"]
EE -->|"shortcut:<br/>J V_β Jᵀ"| SC{"Predictions<br/>constant?"}
SC -->|"Yes<br/>(saturated MSM,<br/>static intervention)"| VM
SC -->|"No<br/>(g-comp,<br/>non-saturated MSM)"| UNDER["Underestimates ❌"]
style EE fill:#f0f4ff,stroke:#3182BD
style IF fill:#fff5eb,stroke:#E6550D
style VM fill:#e8f5e9,stroke:#31A354
style SC fill:#fff5eb,stroke:#E6550D
style UNDER fill:#fee,stroke:#c00
```
## 1. The sandwich estimator for parametric models
This section derives $\text{Var}(\hat\beta)$ — the variance of model
parameters. This is a building block, not the final answer. Section 3 shows how
to get from $\text{Var}(\hat\beta)$ to $\text{Var}(\hat\mu)$.
### 1.1 What is an estimating equation?
The starting point is a simple idea: many statistical estimators can be defined
as the solution to an equation of the form
$$
\sum_{i=1}^n \psi(O_i; \beta) = 0
$$
where $O_i$ is the observed data for individual $i$, $\beta$ is the parameter
we want to estimate, and $\psi$ is a function — called the **estimating
function** — that maps each observation and a candidate parameter value to a
vector. The estimator $\hat\beta$ is the value of $\beta$ that makes this sum
equal to zero.
But what *is* $\psi$, concretely? And where does it come from?
### 1.2 Deriving $\psi$: three routes to the same place
#### Route 1: Score equations (maximum likelihood)
The most familiar source of $\psi$ is maximum likelihood. If observations have
log-likelihood $\ell_i(\beta) = \log f(O_i; \beta)$, the MLE maximises
$\sum_i \ell_i(\beta)$. Setting the derivative to zero gives:
$$
\sum_{i=1}^n \underbrace{\frac{\partial \ell_i(\beta)}{\partial \beta}}_
{\psi(O_i;\beta)} = 0
$$
So $\psi$ is the **score function** — the gradient of the individual
log-likelihood. For a GLM with outcome $Y_i$, covariates $X_i$, linear
predictor $\eta_i = X_i^T\beta$, and mean function $\mu_i = g^{-1}(\eta_i)$:
$$
\psi(O_i; \beta) =
\underset{\color{#E6550D}{\text{residual}}}{\color{#E6550D}{(Y_i - \mu_i)}}
\;\cdot\;
\underset{\color{#3182BD}{\text{scale factor}}}
{\color{#3182BD}{\frac{d\mu/d\eta}{V(\mu_i)}}}
\;\cdot\;
\underset{\color{#31A354}{\text{direction}}}{\color{#31A354}{X_i}}
$$
where $V(\mu_i)$ is the variance function of the exponential family
($V(\mu) = \mu(1-\mu)$ for binomial, $V(\mu) = \mu$ for Poisson, $V(\mu) =
1$ for Gaussian — note that $\sigma^2$ enters separately as the dispersion
parameter $\phi$, not as $V(\mu)$). Reading the three factors:
- $\color{#E6550D}{(Y_i - \mu_i)}$ — **Residual**: how far the prediction is
from the observation. This is zero on average at the truth, which is why
$E[\psi_i] = 0$.
- $\color{#3182BD}{(d\mu/d\eta) / V(\mu_i)}$ — **Scale factor**: combines the
link derivative (converting from $\eta$-scale to $\mu$-scale) with the
variance weight (downweighting noisy observations). For canonical links this
product simplifies (see below).
- $\color{#31A354}{X_i}$ — **Direction**: the covariates tell us *which*
parameters this observation informs. An observation with large $|X_{ij}|$
provides more information about $\beta_j$.
::: {.callout-note}
### Example: logistic regression
For logistic regression, $\mu_i = \text{expit}(\eta_i)$, so $d\mu/d\eta =
\mu_i(1-\mu_i)$ and $V(\mu_i) = \mu_i(1 - \mu_i)$. The scale factor cancels
to $1$, giving:
$$\psi(O_i;\beta) = (Y_i - \mu_i) \cdot X_i$$
The estimating equation $\sum_i (Y_i - \mu_i) X_i = 0$ says: at the MLE, the
covariates are uncorrelated with the residuals.
:::
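As a sanity check, the orthogonality property can be evaluated directly. A self-contained sketch on freshly simulated data (separate from the running example in Section 0.1): at the fitted coefficients, every component of the summed score is zero up to `glm`'s convergence tolerance.

```{r}
#| label: check-score-orthogonality
# Evaluate the estimating equation sum_i (Y_i - mu_i) X_i at the MLE:
# each component should be ~0.
set.seed(101)
n_chk <- 500
x_chk <- rnorm(n_chk)
y_chk <- rbinom(n_chk, 1, plogis(-0.5 + x_chk))
fit_chk <- glm(y_chk ~ x_chk, family = binomial)
score_sum <- colSums((y_chk - fitted(fit_chk)) * model.matrix(fit_chk))
score_sum
```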
#### Route 2: Moment conditions (method of moments)
Not every estimating equation comes from a likelihood. The **generalised method
of moments** (GMM) starts from a theoretical property of the true parameter
$\beta_0$ — a *moment condition* — and turns it into $\psi$.
For example, if we know that $E[X_i(Y_i - X_i^T\beta_0)] = 0$, we can define:
$$
\psi(O_i; \beta) = X_i(Y_i - X_i^T\beta)
$$
Setting $\sum_i \psi = 0$ gives the OLS normal equations. No distributional
assumption was needed — just the moment condition $E[X_i \varepsilon_i] = 0$.
This is the key insight of M-estimation: the estimating function $\psi$ does not
have to be a score function. It just needs to satisfy $E[\psi(O_i; \beta_0)] =
0$ at the true parameter value (the **unbiasedness** condition). This is what
guarantees consistency of $\hat\beta$.
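A minimal sketch of this route (self-contained, not part of the running example): solving the moment condition's normal equations directly reproduces `lm()`.

```{r}
#| label: check-moment-ols
# The root of sum_i X_i (Y_i - X_i^T beta) = 0 solves the normal equations
# X^T X beta = X^T Y; it coincides with lm()'s estimate.
set.seed(102)
n_mm <- 300
x_mm <- rnorm(n_mm)
y_mm <- 1 + 2 * x_mm + rnorm(n_mm)
Xd <- cbind(1, x_mm)
beta_mm <- solve(crossprod(Xd), crossprod(Xd, y_mm))
beta_lm <- coef(lm(y_mm ~ x_mm))
max(abs(beta_mm - beta_lm))
```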
#### Route 3: Quasi-likelihood
Quasi-likelihood provides a middle ground. We specify only the first two moments
of $Y_i \mid X_i$ — the mean function $\mu_i = g^{-1}(X_i^T\beta)$ and a
variance function $V(\mu_i)$ — without assuming a full distributional form. The
quasi-score is:
$$
\psi(O_i; \beta) = \frac{(Y_i - \mu_i)}{V(\mu_i)} \cdot
\frac{d\mu_i}{d\eta_i} \cdot X_i
$$
This looks identical to the GLM score, but now $V(\mu)$ is a *working*
variance — it might be wrong. The sandwich estimator (Section 1.4) protects
against this misspecification.
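To see that only the variance piece changes, here is a self-contained sketch on deliberately overdispersed counts (the working variance $V(\mu) = \mu$ is wrong, the mean model is right): `poisson` and `quasipoisson` share the same quasi-score, hence the same $\hat\beta$, and differ only in the reported variance.

```{r}
#| label: check-quasi
# Overdispersed counts (negative binomial): poisson and quasipoisson solve
# the same estimating equation, so their coefficients agree; only the
# variance estimate differs (quasipoisson scales by the dispersion).
set.seed(103)
n_q <- 400
x_q <- rnorm(n_q)
y_q <- rnbinom(n_q, mu = exp(0.5 + 0.4 * x_q), size = 2)
fit_pois <- glm(y_q ~ x_q, family = poisson)
fit_quasi <- glm(y_q ~ x_q, family = quasipoisson)
coef_gap <- max(abs(coef(fit_pois) - coef(fit_quasi)))
se_ratio <- unname(sqrt(diag(vcov(fit_quasi)))[2] /
                     sqrt(diag(vcov(fit_pois)))[2])
c(coef_gap = coef_gap, se_ratio = se_ratio)
```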
### 1.3 The general M-estimation framework
All three routes produce estimators of the same form: find $\hat\beta$ solving
$\sum_i \psi(O_i; \hat\beta) = 0$. The M-estimation framework (Huber, 1964;
Stefanski & Boos, 2002) unifies them under two conditions:
1. **Unbiasedness**: $E[\psi(O_i; \beta_0)] = 0$ at the true $\beta_0$. This
guarantees that the estimating equation has the right solution in
expectation.
2. **Regularity**: $\psi$ is smooth enough in $\beta$ for a Taylor expansion.
Given these, $\hat\beta$ is consistent and asymptotically normal, and its
variance has the sandwich form — regardless of whether $\psi$ came from a
likelihood, a moment condition, or a quasi-likelihood.
::: {.callout-tip}
### Why "M-estimation"?
The "M" stands for "maximum" (as in maximum-likelihood), but the framework is
much broader. Any estimator defined as the root of a sum of functions of the
data and parameters is an M-estimator — this includes MLE, weighted least
squares, quantile regression, robust regression (Huber loss), generalised
estimating equations, and the causal estimators in Sections 4–5 (Stefanski &
Boos, 2002).
:::
### 1.4 Asymptotic variance of M-estimators
We want $\text{Var}(\hat\beta)$. The derivation has three steps: linearise
$\hat\beta$ around the truth, identify the variance of the linear
approximation, then estimate it from data.
#### Step 1: Linearise via Taylor expansion
We know that $\hat\beta$ solves $\sum_i \psi(O_i; \hat\beta) = 0$. We also
know that the true $\beta_0$ satisfies $E[\psi(O_i; \beta_0)] = 0$, but the
finite-sample sum $\sum_i \psi(O_i; \beta_0)$ is generally *not* zero — it is
a random quantity that fluctuates around zero. The idea is to relate these two
facts via a Taylor expansion.
Expand $\sum_i \psi(O_i; \hat\beta)$ around $\beta_0$, keeping only the
first-order term (valid for large $n$ by regularity):
$$
\underbrace{\sum_i \psi(O_i; \hat\beta)}_{= \, 0 \text{ (by definition)}}
\approx \sum_i \psi(O_i; \beta_0) +
\left[\sum_i \frac{\partial \psi(O_i;\beta_0)}
{\partial \beta^T}\right]
(\hat\beta - \beta_0)
$$
Rearranging for the estimation error $\hat\beta - \beta_0$:
$$
\hat\beta - \beta_0 \approx
-\left[\sum_i \frac{\partial \psi(O_i;
\beta_0)}{\partial \beta^T}\right]^{-1}
\sum_i \psi(O_i; \beta_0)
$$
Dividing numerator and denominator by $n$ and labelling the two factors:
$$
\underset{p \times 1}{\hat\beta - \beta_0}
\;\approx\;
\underbrace{\left[-\frac{1}{n}\sum_i \frac{\partial \psi_i}
{\partial \beta^T}\right]^{-1}}_{\to\;
\color{#3182BD}{A^{-1}} \;(p \times p)}
\;\cdot\;
\underbrace{\frac{1}{n}\sum_i \psi(O_i; \beta_0)}_{\text{mean zero,
covariance } \color{#E6550D}{B/n}}
$$
This is the key linearisation: the estimation error is approximately the bread
inverse times the average score. The first factor converges to the constant
matrix $\color{#3182BD}{A^{-1}}$; the second is a sum of iid mean-zero random
$p$-vectors whose covariance we call $\color{#E6550D}{B}$.
#### Step 2: Identify the variance
Name the two matrices that appeared in Step 1:
- **Bread** $\color{#3182BD}{A}$: the expected negative Jacobian of the
estimating function ($p \times p$):
$$
\color{#3182BD}{A}
\;\equiv\; -E\!\left[\frac{\partial \psi(O_i; \beta_0)}
{\partial \beta^T}\right]
\qquad (p \times p)
$$
For a GLM, $\color{#3182BD}{A} = E[w_i X_i X_i^T]$ where $w_i =
(d\mu/d\eta)^2 / V(\mu)$ are the IWLS working weights — essentially the
curvature of the (quasi-)log-likelihood.
- **Meat** $\color{#E6550D}{B}$: the covariance of the estimating function
($p \times p$):
$$
\color{#E6550D}{B}
\;\equiv\; E\!\left[\psi(O_i; \beta_0)\,\psi(O_i; \beta_0)^T\right]
\qquad (p \times p)
$$
This measures how much the individual scores $\psi_i$ vary across observations.
It appeared in Step 1 as the covariance of the second factor: $\text{Var}\!
\left(\frac{1}{n}\sum_i \psi_i\right) = \frac{1}{n}\color{#E6550D}{B}$.
Now propagate through the linearisation. Taking $\text{Var}(\cdot)$ of both
sides of the Step 1 equation:
$$
\underset{p \times p}{\text{Var}(\hat\beta)}
\;\approx\;
\underset{p \times p}{\color{#3182BD}{A^{-1}}}
\;\cdot\;
\underset{p \times p}{\frac{\color{#E6550D}{B}}{n}}
\;\cdot\;
\underset{p \times p}{\color{#3182BD}{(A^{-1})^T}}
\;=\;
\frac{1}{n}\,
\color{#3182BD}{A^{-1}}\,
\color{#E6550D}{B}\,
\color{#3182BD}{(A^{-1})^T}
$$
#### Step 3: Estimate from data
Replace the population expectations $A$ and $B$ with sample averages:
$$
\color{#3182BD}{\hat{A}}
= -\frac{1}{n}\sum_i
\frac{\partial \psi(O_i; \hat\beta)}{\partial \beta^T}
\qquad (p \times p)
$$
$$
\color{#E6550D}{\hat{B}}
= \frac{1}{n}\sum_i
\psi(O_i; \hat\beta)\,\psi(O_i; \hat\beta)^T
\qquad (p \times p)
$$
The **sandwich estimator** is:
$$
\underset{p \times p}{\widehat{\text{Var}}(\hat\beta)}
= \frac{1}{n}\;
\underset{p \times p}{\color{#3182BD}{\hat{A}^{-1}}}
\;
\underset{p \times p}{\color{#E6550D}{\hat{B}}}
\;
\underset{p \times p}{\color{#3182BD}{(\hat{A}^{-1})^T}}
$$ {#eq-sandwich}
so called because the meat $\color{#E6550D}{\hat B}$ is "sandwiched" between
two slices of bread $\color{#3182BD}{\hat A^{-1}}$.
::: {.callout-important}
### Why is this robust?
The bread $\color{#3182BD}{A}$ and meat $\color{#E6550D}{B}$ are estimated
*separately*. Under correct model specification (e.g. the GLM variance function
is right), $A = B$ and the sandwich simplifies to $\frac{1}{n} A^{-1}$ — the
usual model-based variance. But if the variance function is wrong, $A \neq B$:
the bread still captures the curvature correctly (it only depends on the mean
model), while the meat captures the *actual* variability of the scores. The
sandwich remains consistent regardless — this is the Huber–White robustness
property (Huber, 1967; White, 1982).
:::
::: {.panel-tabset}
## Math
$$
\widehat{\text{Var}}(\hat\beta)
= \frac{1}{n}\hat A^{-1}\hat B\,\hat A^{-1}
$$
with the meat $\hat B = \frac{1}{n}\sum_i \psi_i\psi_i^{\mathsf T}$ and
the bread inverse $\hat A^{-1}$. For a GLM with canonical link,
$\hat A = \frac{1}{n} X^{\mathsf T} W X$ where $W$ is the diagonal matrix
with entries $\mu_i(1-\mu_i)$ (binomial case).
## Code
```{r}
#| label: verify-sandwich
# Meat: outer product of per-individual score contributions.
psi <- (Y - mu) * X              # n x p matrix of scores (Y - mu) * X
B_hat <- crossprod(psi) / n      # (1/n) sum psi psi^T, p x p
# Bread^{-1}: for logistic, A_hat = (1/n) X^T W X, with W = diag(mu(1-mu))
W <- diag(mu * (1 - mu))
A_hat_inv <- solve(crossprod(X, W %*% X) / n)
# Sandwich variance of beta
V_beta_manual <- A_hat_inv %*% B_hat %*% A_hat_inv / n
# Gold standard: sandwich::sandwich(fit)
V_beta_gold <- sandwich::sandwich(fit)
```
## Result
```{r}
#| label: verify-sandwich-result
#| echo: false
# Collapse both p x p matrices into long form so the table stays readable.
cells <- outer(rownames(V_beta_manual), colnames(V_beta_manual),
paste, sep = " × ")
out <- data.frame(
cell = as.vector(cells),
manual = as.vector(V_beta_manual),
reference = as.vector(V_beta_gold),
`abs diff` = abs(as.vector(V_beta_manual) - as.vector(V_beta_gold)),
check.names = FALSE
)
tt(out, caption = "Sandwich $V_\\beta$: manual vs `sandwich::sandwich(fit)`") |>
format_tt(j = 2:4, digits = 4, num_fmt = "significant")
```
The manual sandwich and `sandwich::sandwich(fit)` agree to numerical
precision (≲ 1e-8), confirming @eq-sandwich is implemented correctly.
:::
### 1.5 Pseudo-code: sandwich variance for a GLM
```{r}
#| eval: false
#| code-fold: true
#| code-summary: "Pseudo-code: sandwich_variance()"
sandwich_variance <- function(X, Y, beta_hat, family) {
n <- length(Y)
eta <- X %*% beta_hat
mu <- family$linkinv(eta)
mu_eta <- family$mu.eta(eta) # dmu/deta
V <- family$variance(mu) # Var(Y|X) under the model
# Bread: A = (1/n) X^T W X where W = diag((dmu/deta)^2 / V(mu))
W <- mu_eta^2 / V
A <- (1 / n) * crossprod(X, X * W)
# Meat: B = (1/n) sum psi_i psi_i^T
residuals <- Y - mu
score_weights <- mu_eta / V
psi <- X * (residuals * score_weights) # n x p matrix of score contributions
B <- (1 / n) * crossprod(psi)
# Sandwich: (1/n) A^{-1} B (A^{-1})^T
A_inv <- solve(A)
V_beta <- (1 / n) * A_inv %*% B %*% t(A_inv)
V_beta
}
```
::: {.callout-tip}
### In practice
You almost never code this yourself. In R, `sandwich::sandwich(model)` computes
the full $\frac{1}{n}\hat A^{-1} \hat B (\hat A^{-1})^T$ from any `glm` object. The `sandwich`
package (Zeileis, 2004, 2006) provides variants for clustered data
(`vcovCL()`), panel data (`vcovPL()`), and heteroscedasticity-autocorrelation
(`vcovHAC()`).
:::
### 1.6 Concrete example: linear regression
Let us work through the sandwich for OLS step by step.
**Model**: $Y_i = X_i^T\beta + \varepsilon_i$ where $E[\varepsilon_i \mid X_i]
= 0$.
**Step 1: Write down $\psi$.**
The OLS estimator minimises $\sum_i (Y_i - X_i^T\beta)^2$. Setting the
gradient to zero gives $\sum_i X_i(Y_i - X_i^T\beta) = 0$, so:
$$
\psi(O_i; \beta) = X_i(Y_i - X_i^T\beta)
\qquad (p \times 1)
$$
At $\beta_0$: $E[\psi_i] = E[X_i \varepsilon_i] = 0$ by exogeneity.
**Step 2: Compute the bread $\color{#3182BD}{A}$.**
$$
\frac{\partial \psi(O_i; \beta)}{\partial \beta^T}
= \frac{\partial}{\partial \beta^T}\left[X_i(Y_i - X_i^T\beta)\right]
= -X_i X_i^T
\qquad (p \times p)
$$
So $\color{#3182BD}{A} = E[X_i X_i^T]$, estimated by
$\color{#3182BD}{\hat{A}} = \frac{1}{n} X^T X$.
**Step 3: Compute the meat $\color{#E6550D}{B}$.**
Since $\psi_i = X_i \hat e_i$, the outer product is $\psi_i\psi_i^T =
\hat e_i^2 X_iX_i^T$:
$$
\color{#E6550D}{\hat{B}} = \frac{1}{n}\sum_i \hat{e}_i^2 \, X_i X_i^T
\qquad (p \times p)
$$
The meat depends on the *squared residuals*, not on any assumed variance. This
is the source of robustness.
**Step 4: Assemble the sandwich.**
$$
\underset{p \times p}{\widehat{\text{Var}}(\hat\beta)}
= \frac{1}{n}\,
\color{#3182BD}{\hat{A}^{-1}}\;
\color{#E6550D}{\hat{B}}\;
\color{#3182BD}{(\hat{A}^{-1})^T}
$$
Substituting:
$$
= \underset{p \times p}{\color{#3182BD}{(X^TX)^{-1}}}
\;\;
\underset{p \times p}{\color{#E6550D}{\left[\sum_i
\hat{e}_i^2\,X_iX_i^T\right]}}
\;\;
\underset{p \times p}{\color{#3182BD}{(X^TX)^{-1}}}
$$
This is the *HC0* heteroscedasticity-consistent estimator (White, 1980).
**What if homoscedasticity holds?** If $\text{Var}(\varepsilon_i \mid X_i) =
\sigma^2$ for all $i$, then $E[\hat{e}_i^2 X_i X_i^T] = \sigma^2 E[X_i
X_i^T]$, so $\color{#E6550D}{B} = \sigma^2 \color{#3182BD}{A}$ and the
sandwich simplifies to $\frac{\sigma^2}{n} \color{#3182BD}{A^{-1}}
= \hat\sigma^2(X^TX)^{-1}$, the usual OLS variance.
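These four steps can be run end-to-end as a self-contained sketch (freshly simulated heteroscedastic data, separate from the running example), with `sandwich::vcovHC(..., type = "HC0")` as the reference:

```{r}
#| label: check-hc0-manual
# Assemble the HC0 sandwich for OLS by hand and compare with vcovHC.
set.seed(104)
n_ols <- 250
x_ols <- rnorm(n_ols)
y_ols <- 1 + x_ols + rnorm(n_ols, sd = 0.5 + abs(x_ols)) # Var grows with |x|
fit_ols <- lm(y_ols ~ x_ols)
Xd <- model.matrix(fit_ols)
e_hat <- residuals(fit_ols)
bread_inv <- solve(crossprod(Xd))        # (X^T X)^{-1}
meat_sum <- crossprod(Xd * e_hat)        # sum_i e_i^2 X_i X_i^T
V_hc0 <- bread_inv %*% meat_sum %*% bread_inv
hc0_gap <- max(abs(V_hc0 - sandwich::vcovHC(fit_ols, type = "HC0")))
hc0_gap
```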
## 2. The influence function
Section 1 gave us $\text{Var}(\hat\beta)$ as a matrix product (the sandwich).
The **influence function** (IF) gives the same answer but expressed as a
per-individual contribution. This is more than a curiosity — IFs are the tool
that lets us compute $\text{Var}(\hat\mu)$ for causal estimands (Section 3) and
for multi-model pipelines (Section 5).
### 2.1 The idea: how much does each observation matter?
Imagine removing observation $i$ from the dataset and re-estimating
$\hat\beta$. The change $\hat\beta_{(-i)} - \hat\beta$ tells you how much
observation $i$ "influenced" the estimate. The **influence function** (IF) is
the first-order approximation to this quantity, scaled by $n$:
$$
\text{IF}_i \approx -n(\hat\beta_{(-i)} - \hat\beta)
$$
But we do not need to actually refit the model $n$ times. For M-estimators,
the IF can be computed in closed form.
### 2.2 Deriving the IF from the linearisation
In Section 1.4, Step 1, we derived the linearisation:
$$
\underset{p \times 1}{\hat\beta - \beta_0}
\;\approx\;
\frac{1}{n}\sum_{i=1}^n
\underset{p \times p}{\color{#3182BD}{A^{-1}}}
\;\;
\underset{p \times 1}{\psi(O_i; \beta_0)}
$$
Each summand $\color{#3182BD}{A^{-1}}\psi_i$ is the contribution of
observation $i$ to the total estimation error. This is exactly the **influence
function**:
$$
\underset{p \times 1}{\text{IF}_i(\hat\beta)}
\;=\;
\underset{p \times p}{\color{#3182BD}{A^{-1}}}
\;\cdot\;
\underset{p \times 1}{\psi(O_i; \hat\beta)}
$$ {#eq-if}
Reading each factor:
- $\psi_i$ ($p \times 1$): the score from Section 1.2 — how much observation
$i$'s data "disagrees" with the current $\hat\beta$.
- $\color{#3182BD}{A^{-1}}$ ($p \times p$): the bread inverse from Section
1.4 — converts from score-space to parameter-space, accounting for curvature.
- $\text{IF}_i$ ($p \times 1$): the resulting shift in $\hat\beta$ attributable
to observation $i$ (times $n$).
No new derivation is needed. The IF *is* the linearisation, read term-by-term.
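The closed form can be checked against the leave-one-out definition from Section 2.1. A self-contained sketch (fresh data, names local to this block): refit without observation 1 and compare $n(\hat\beta - \hat\beta_{(-1)})$ with $\hat A^{-1}\psi_1$. Agreement is only first-order, so a small $O(1/n)$ gap is expected rather than exact equality.

```{r}
#| label: check-if-loo
# Closed-form IF_1 = A^{-1} psi_1 vs brute-force leave-one-out refitting.
set.seed(105)
n_loo <- 300
d_loo <- data.frame(x = rnorm(n_loo))
d_loo$y <- rbinom(n_loo, 1, plogis(0.3 + 0.7 * d_loo$x))
fit_loo <- glm(y ~ x, data = d_loo, family = binomial)
Xd <- model.matrix(fit_loo)
mu_loo <- fitted(fit_loo)
A_inv <- solve(crossprod(Xd, Xd * (mu_loo * (1 - mu_loo))) / n_loo)
if_closed <- as.numeric(A_inv %*% (Xd[1, ] * (d_loo$y[1] - mu_loo[1])))
# Brute force: drop observation 1 and refit.
fit_m1 <- glm(y ~ x, data = d_loo[-1, ], family = binomial)
if_loo <- n_loo * (coef(fit_loo) - coef(fit_m1))
rbind(closed_form = if_closed, leave_one_out = as.numeric(if_loo))
```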
### 2.3 IF = sandwich (the key equivalence)
**Property 1: Linearisation.** The IF lets us write a complex estimator as a
simple average:
$$
\hat\beta - \beta_0 \approx \frac{1}{n}\sum_i \text{IF}_i
$$
**Property 2: Variance recovery.** Since $E[\text{IF}_i] = 0$:
$$
\text{Var}(\hat\beta) \approx
\frac{1}{n^2}\sum_i \text{IF}_i \, \text{IF}_i^T
$$
Expanding $\text{IF}_i \text{IF}_i^T = \color{#3182BD}{A^{-1}} \psi_i
\psi_i^T \color{#3182BD}{(A^{-1})^T}$ and factoring:
$$
\frac{1}{n^2}\sum_i \color{#3182BD}{A^{-1}} \psi_i\psi_i^T
\color{#3182BD}{(A^{-1})^T}
= \frac{1}{n} \color{#3182BD}{A^{-1}}
\underbrace{\frac{1}{n}\sum_i \psi_i\psi_i^T}_
{\color{#E6550D}{B}\text{ (the meat)}}
\color{#3182BD}{(A^{-1})^T}
$$
This is the sandwich from Section 1.4.
::: {.callout-note}
### The key equivalence
$$
\underbrace{\frac{1}{n^2}\sum_i \text{IF}_i\,\text{IF}_i^T}_
{\text{IF-based variance}}
\;=\;
\underbrace{\frac{1}{n}\,\color{#3182BD}{A^{-1}}\,\color{#E6550D}{B}\,
\color{#3182BD}{(A^{-1})^T}}_{\text{sandwich variance}}
$$ {#eq-if-sandwich}
The sandwich operates on matrices. The IF operates on per-individual vectors.
Same answer, different representation. We will use the IF representation from
here on, because it extends cleanly to multi-step estimators (Sections 3–5).
:::
::: {.panel-tabset}
## Math
$$
\mathrm{IF}_i(\hat\beta) = \hat A^{-1}\psi_i,
\qquad
\widehat{\text{Var}}(\hat\beta)
= \frac{1}{n^2}\sum_{i=1}^n \mathrm{IF}_i\,\mathrm{IF}_i^{\mathsf T}.
$$
## Code
```{r}
#| label: verify-if-equivalence
# Per-individual IF for beta (n x p). psi and A_hat_inv come from the
# sandwich verification block above.
IF_beta <- psi %*% t(A_hat_inv)
# Assemble Var(beta) two different ways.
V_beta_via_IF <- crossprod(IF_beta) / n^2
V_beta_via_sw <- sandwich::sandwich(fit)
```
## Result
```{r}
#| label: verify-if-equivalence-result
#| echo: false
cells <- outer(rownames(V_beta_via_IF), colnames(V_beta_via_IF),
paste, sep = " × ")
out <- data.frame(
cell = as.vector(cells),
`IF (1/n^2 Σ IF²)` = as.vector(V_beta_via_IF),
`sandwich` = as.vector(V_beta_via_sw),
`abs diff` = abs(as.vector(V_beta_via_IF) -
as.vector(V_beta_via_sw)),
check.names = FALSE
)
tt(out, caption = "Sandwich $= \\tfrac{1}{n^2}\\sum_i \\mathrm{IF}_i\\mathrm{IF}_i^{\\mathsf T}$") |>
format_tt(j = 2:4, digits = 4, num_fmt = "significant")
```
The two representations are equal to machine precision (abs diff
≲ 1e-16) — so anywhere we see "sandwich variance", we can equivalently
write "sum of squared influence functions over $n^2$".
:::
### 2.4 Example: IF for a sample mean
$\hat\mu = \frac{1}{n}\sum_i Y_i$. Estimating equation: $\psi(O_i; \mu)
= Y_i - \mu$. Bread: $A = 1$. IF: $\text{IF}_i = Y_i - \hat\mu$. Variance:
$\frac{1}{n^2}\sum_i(Y_i - \hat\mu)^2 \approx s^2/n$.
::: {.panel-tabset}
## Math
$$
\mathrm{IF}_i(\hat\mu) = Y_i - \hat\mu,
\qquad
\widehat{\mathrm{Var}}(\hat\mu)
= \frac{1}{n^2}\sum_{i=1}^n (Y_i - \hat\mu)^2
= \frac{n-1}{n}\cdot\frac{s^2}{n}.
$$
The IF-based SE is, up to a $\sqrt{(n-1)/n}$ factor, just the standard
error of the mean from an intro stats textbook.
## Code
```{r}
#| label: verify-sample-mean
y <- dat$Y
ybar <- mean(y)
# IF-based variance: (1/n²) Σ (yᵢ - ȳ)². (We avoid reusing `mu`, which
# holds the fitted probabilities from the setup chunk.)
var_if <- sum((y - ybar)^2) / length(y)^2
se_if <- sqrt(var_if)
# Textbook: s² / n (n-1 denominator)
se_textbook <- sqrt(var(y) / length(y))
```
## Result
```{r}
#| label: verify-sample-mean-result
#| echo: false
out <- data.frame(
quantity = c("IF-based SE",
"Textbook SE ($s / \\sqrt{n}$)",
"Ratio $\\sqrt{n/(n-1)}$"),
value = c(se_if, se_textbook, se_textbook / se_if)
)
tt(out, caption = "Sample mean IF variance matches the textbook SE") |>
format_tt(j = "quantity", markdown = TRUE) |>
format_tt(j = 2, digits = 4, num_fmt = "significant")
```
The two agree up to the standard $(n-1)/n$ degrees-of-freedom correction:
the ratio of the two SEs is exactly $\sqrt{n/(n-1)}$, which is negligible
at $n = `r length(y)`$.
:::
## 3. Variance of a causal estimand
### 3.1 The problem
The sandwich (Section 1) gives $\text{Var}(\hat\beta)$. But the causal
estimand is typically a nonlinear function of $\hat\beta$ **and** the data.
For instance, the g-computation marginal mean under intervention $a$ is:
$$
\hat\mu_a = \frac{1}{n}\sum_{i=1}^n
g^{-1}(X_i^{*T} \hat\beta)
$$
where $X_i^*$ is individual $i$'s covariate vector with treatment set to $a$,
and $g^{-1}(\cdot)$ is the inverse link function.
This quantity depends on **two** random things:
1. $\hat\beta$ — the estimated model parameters (random because of finite
data).
2. $\{X_i\}$ — the covariate distribution (random because a different sample
gives different covariates, hence a different average of predictions).
A complete variance estimator must account for both sources. As we will see,
the standard "delta method" shortcut only captures source (1).
### 3.2 The two-channel influence function
The g-computation marginal mean is a **two-step** estimator: first estimate
$\hat\beta$, then compute $\hat\mu_a = \frac{1}{n}\sum_i g^{-1}(X_i^{*T}
\hat\beta)$. Individual $i$ affects $\hat\mu_a$ through two channels.
#### Channel 1: Direct contribution (covariate sampling)
Even if $\hat\beta$ were fixed, observation $i$'s prediction
$g^{-1}(X_i^{*T}\hat\beta)$ is one of the $n$ terms being averaged. A
different sample gives a different set of predictions. Individual $i$'s direct
contribution to the estimation error is:
$$
\text{IF}_i^{\text{direct}} = g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a
\qquad (\text{scalar})
$$
This is the same structure as the sample mean IF from Section 2.4 — the
deviation of one term from the average.
#### Channel 2: Model correction (parameter uncertainty)
Observation $i$ also affects $\hat\mu_a$ indirectly by shifting $\hat\beta$.
From Section 2.2, $i$'s influence on $\hat\beta$ is approximately
$\frac{1}{n}\color{#3182BD}{A^{-1}}\psi_i$. This shift changes $\hat\mu_a$ by:
$$
\delta\hat\mu_a \;\approx\;
\underset{1 \times p}{\color{#31A354}{J_a}}
\;\cdot\;
\underset{p \times 1}{\color{#3182BD}{A^{-1}}\psi_i}
$$
where $\color{#31A354}{J_a} = \frac{\partial \hat\mu_a}{\partial \beta^T}$ is
the $1 \times p$ **Jacobian** — how the marginal mean changes as each
coefficient shifts:
$$
\underset{1 \times p}{\color{#31A354}{J_a}}
= \frac{1}{n} \sum_{i=1}^n
\underbrace{\frac{d\mu}{d\eta}\bigg|_{\eta_i^*}}_{\text{scalar}}
\;\cdot\;
\underbrace{X_i^{*T}}_{1 \times p}
$$
The link derivatives for common GLMs:
| Family | Link | $d\mu/d\eta$ | Effect |
|---|---|---|---|
| Gaussian | Identity | $1$ | All observations contribute equally |
| Binomial | Logit | $\mu(1 - \mu)$ | Observations near $\mu = 0.5$ contribute most |
| Poisson | Log | $\mu$ | Observations with larger means contribute more |
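These are exactly the derivatives that R's `family` objects expose via `mu.eta()`; a quick self-contained check on a grid of linear-predictor values:

```{r}
#| label: check-mu-eta
# family$mu.eta(eta) is d mu / d eta; compare against the table's formulas.
eta_grid <- seq(-3, 3, by = 0.5)
gap_binom <- max(abs(binomial()$mu.eta(eta_grid) -
                       plogis(eta_grid) * (1 - plogis(eta_grid))))
gap_pois <- max(abs(poisson()$mu.eta(eta_grid) - exp(eta_grid)))
gap_gauss <- max(abs(gaussian()$mu.eta(eta_grid) - 1))
c(binomial = gap_binom, poisson = gap_pois, gaussian = gap_gauss)
```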
#### Combining both channels
The full IF for the marginal mean $\hat\mu_a$ is:
$$
\text{IF}_i(\hat\mu_a) =
\underbrace{g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a}_
{\text{Channel 1: covariate sampling}}
\;+\;
\underbrace{
\underset{1 \times p}{\color{#31A354}{J_a}}
\;\;
\underset{p \times p}{\color{#3182BD}{A^{-1}}}
\;\;
\underset{p \times 1}{\psi_i}
}_{\text{Channel 2: parameter uncertainty}}
$$ {#eq-if-gcomp}
And the variance is:
$$
\widehat{\text{Var}}(\hat\mu_a) = \frac{1}{n^2}\sum_{i=1}^n
\text{IF}_i(\hat\mu_a)^2
$$ {#eq-var-if}
For two interventions $a$ and $b$, the covariance is:
$$
\widehat{\text{Cov}}(\hat\mu_a, \hat\mu_b)
= \frac{1}{n^2}\sum_{i=1}^n \text{IF}_i^{(a)} \cdot \text{IF}_i^{(b)}
$$
This gives the full $k \times k$ covariance matrix needed for contrasts (risk
difference, risk ratio, etc.).
::: {.callout-note}
### Why the covariance matters
Both $\hat\mu_1$ and $\hat\mu_0$ are functions of the *same* data, so they are
correlated. This covariance typically *reduces* the variance of the difference.
Ignoring it (computing variances separately and adding them) would overestimate
the SE.
:::
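To see the mechanics, here is a self-contained sketch that replays the two-channel IF on freshly simulated data (the helper `if_mu()` and all names are local to this block, not part of causatr's API). It computes the risk-difference SE from the joint IFs and, for contrast, the SE that wrongly ignores the covariance:

```{r}
#| label: check-rd-cov
# Two-channel IFs for mu_1 and mu_0, then SE(mu_1 - mu_0) with and without
# the covariance term.
set.seed(106)
n_rd <- 500
L_rd <- rnorm(n_rd)
A_rd <- rbinom(n_rd, 1, plogis(0.5 * L_rd))
Y_rd <- rbinom(n_rd, 1, plogis(-1 + 0.8 * A_rd + 0.6 * L_rd))
fit_rd <- glm(Y_rd ~ A_rd + L_rd, family = binomial)
Xd <- model.matrix(fit_rd)
mu_rd <- fitted(fit_rd)
A_inv <- solve(crossprod(Xd, Xd * (mu_rd * (1 - mu_rd))) / n_rd)
IF_b <- ((Y_rd - mu_rd) * Xd) %*% A_inv   # IF for beta (A_inv is symmetric)
if_mu <- function(a) {
  Xa <- Xd
  Xa[, "A_rd"] <- a
  m <- as.numeric(plogis(Xa %*% coef(fit_rd)))
  J <- colMeans(Xa * (m * (1 - m)))
  (m - mean(m)) + as.numeric(IF_b %*% J)  # Channel 1 + Channel 2
}
IF1 <- if_mu(1); IF0 <- if_mu(0)
cov_10 <- sum(IF1 * IF0) / n_rd^2
se_joint <- sqrt(sum((IF1 - IF0)^2)) / n_rd
se_no_cov <- sqrt(sum(IF1^2) + sum(IF0^2)) / n_rd
c(cov = cov_10, joint_se = se_joint, ignoring_cov_se = se_no_cov)
```

With this data-generating process the shared intercept and confounder coefficients typically make `cov_10` positive, in which case `ignoring_cov_se` overshoots `joint_se`.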
::: {.panel-tabset}
## Math
For the g-computation estimate of $\hat\mu_1 = \mathbb{E}[Y^{A=1}]$ under
a logistic outcome model,
$$
\mathrm{IF}_i(\hat\mu_1) =
\underbrace{m(X_i^{(1)};\hat\beta) - \hat\mu_1}_{\text{Channel 1}}
\;+\;
\underbrace{J_1\,\hat A^{-1}\,\psi_i}_{\text{Channel 2}},
$$
with $m(X_i^{(1)};\hat\beta) = \text{expit}(X_i^{(1)\mathsf T}\hat\beta)$,
$J_1 = \frac{1}{n}\sum_j \mu_1^{(j)}\bigl(1-\mu_1^{(j)}\bigr)\,X_j^{(1)\mathsf T}$, and
$\widehat{\mathrm{Var}}(\hat\mu_1) = \frac{1}{n^2}\sum_i \mathrm{IF}_i^2$.
## Code
```{r}
#| label: verify-gcomp-if
# Counterfactual design matrix and predictions with A set to 1 for everyone.
X1 <- X
X1[, "A"] <- 1
eta1 <- X1 %*% beta_hat
mu1_i <- as.numeric(plogis(eta1))
mu1_hat <- mean(mu1_i)
# Channel 1: direct covariate-sampling contribution.
C1 <- mu1_i - mu1_hat
# Channel 2: gradient of mu1_hat in beta times IF_beta.
# For logistic, d mu1_i / d beta = mu1_i (1 - mu1_i) * X1_i.
J1 <- colMeans(X1 * (mu1_i * (1 - mu1_i))) # p-vector
C2 <- as.numeric(IF_beta %*% J1)
IF_mu1 <- C1 + C2
var_mu1_manual <- sum(IF_mu1^2) / n^2
se_mu1_manual <- sqrt(var_mu1_manual)
# Gold standard: causatr's variance_if() routed through contrast().
fit_c <- causat(dat, outcome = "Y", treatment = "A",
confounders = ~ L, family = binomial)
res_c <- contrast(
fit_c,
interventions = list(a1 = static(1), a0 = static(0)),
reference = "a0",
ci_method = "sandwich"
)
se_mu1_causatr <- res_c$estimates[intervention == "a1"]$se
mu1_causatr <- res_c$estimates[intervention == "a1"]$estimate
```
## Result
```{r}
#| label: verify-gcomp-if-result
#| echo: false
out <- data.frame(
quantity = c("$\\mathbb{E}[Y^{A=1}]$",
"$\\mathrm{SE}(\\hat\\mu_1)$"),
manual = c(mu1_hat, se_mu1_manual),
causatr = c(mu1_causatr, se_mu1_causatr),
`abs diff` = c(abs(mu1_hat - mu1_causatr),
abs(se_mu1_manual - se_mu1_causatr)),
check.names = FALSE
)
tt(out, caption = "Two-channel IF for g-comp: manual vs `causatr::variance_if`") |>
format_tt(j = "quantity", markdown = TRUE) |>
format_tt(j = 2:4, digits = 4, num_fmt = "significant")
```
The manual two-channel IF reproduces causatr's SE for $\hat\mu_1$ to
$\sim$1e-9. Both channels are needed — dropping Channel 1 would give a
different (smaller) number, as the next tabset shows.
:::
### 3.3 The delta method shortcut
A widely used shortcut computes the Jacobian $\color{#31A354}{J}$ and the
parameter sandwich $V_\beta$ separately, then propagates via:
$$
\underset{k \times k}{\text{Var}(\hat\mu)}
\;\approx\;
\underset{k \times p}{\color{#31A354}{J}}
\;\;
\underset{p \times p}{V_\beta}
\;\;
\underset{p \times k}{\color{#31A354}{J^T}}
$$ {#eq-delta}
This is the **multivariate delta method**. To see what it captures, expand the
IF-based variance (@eq-var-if) using the two-channel decomposition
(@eq-if-gcomp):
$$
\frac{1}{n^2}\sum_i \text{IF}_i^2
= \underbrace{\frac{1}{n^2}\sum_i
\left(\text{Ch.1}_i\right)^2}_
{\text{covariate sampling}}
+ \underbrace{\frac{1}{n^2}\sum_i
\left(\text{Ch.2}_i\right)^2}_{
= \; \color{#31A354}{J} \, V_\beta \,
\color{#31A354}{J^T}}
+ \underbrace{\frac{2}{n^2}\sum_i
\text{Ch.1}_i \cdot \text{Ch.2}_i}_
{\text{cross-term}}
$$
The delta method gives only the middle term — **Channel 2 squared**. It misses
the covariate sampling variance (Channel 1 squared) and the cross-term. All
three terms are $O(1/n)$, so the shortfall does not vanish as $n$ grows;
because the omitted Channel-1 term is a sum of squares, the delta method
typically underestimates.
::: {.panel-tabset}
## Math
$$
\widehat{\mathrm{Var}}(\hat\mu_a)
= \underbrace{\tfrac{1}{n^2}\sum_i \mathrm{Ch.1}_i^2}_{\text{full}}
+ \underbrace{J\,V_\beta\,J^{\mathsf T}}_{\text{delta}}
+ \underbrace{\tfrac{2}{n^2}\sum_i \mathrm{Ch.1}_i\,\mathrm{Ch.2}_i}_{\text{cross}}.
$$
The delta method is the second term alone.
## Code
```{r}
#| label: verify-delta
var_full <- sum(IF_mu1^2) / n^2
var_ch1 <- sum(C1^2) / n^2
var_ch2 <- sum(C2^2) / n^2
var_cross <- 2 * sum(C1 * C2) / n^2
var_delta <- as.numeric(t(J1) %*% V_beta_gold %*% J1) # J V_beta J^T
# Ch.2² and J V_beta J^T are algebraically identical; numerically they
# agree up to the bread solve precision.
stopifnot(abs(var_ch2 - var_delta) < 1e-6)
ratio <- sqrt(var_full) / sqrt(var_delta)
```
## Result
```{r}
#| label: verify-delta-result
#| echo: false
out <- data.frame(
term = c(
"Full IF variance $\\mathrm{Ch.1}^2 + \\mathrm{Ch.2}^2 + \\text{cross}$",
"Channel 1 squared (covariate sampling)",
"Channel 2 squared $=$ delta $J\\,V_\\beta\\,J^{\\mathsf T}$",
"Cross term $2 \\cdot \\mathrm{Ch.1} \\cdot \\mathrm{Ch.2}$",
"SE ratio $\\mathrm{SE}_{\\text{full}} / \\mathrm{SE}_{\\text{delta}}$"
),
value = c(var_full, var_ch1, var_ch2, var_cross, ratio)
)
tt(out, caption = "Decomposition of $\\mathrm{Var}(\\hat\\mu_1)$: delta captures only the middle row") |>
format_tt(j = "term", markdown = TRUE) |>
format_tt(j = 2, digits = 4, num_fmt = "significant")
```
On this example the delta-method SE is about $1/1.026 \approx 97\%$ of
the correct SE — a modest underestimate because the covariate sampling
variance (Channel $1^2$) happens to be small relative to the parameter
uncertainty. Under non-saturated MSMs and models with strong
covariate-driven heterogeneity, the ratio can be much larger; @eq-delta
is not safe as a default.
:::
::: {.callout-important}
### The delta method treats covariates as fixed
$\color{#31A354}{J} V_\beta \color{#31A354}{J^T}$ answers the question: "given
these specific covariates $\{X_i\}$, how does uncertainty in $\hat\beta$
propagate to $\hat\mu_a$?" It conditions on the covariate distribution. But
$\{X_i\}$ are random — a different sample gives different covariates, hence
a different average of predictions even for fixed $\beta$. The full IF captures
both sources; the delta method captures only one.
:::
### 3.4 When the shortcut is exact
The delta method underestimates whenever Channel 1 is nonzero. Channel 1 is:
$$
\text{Ch.1}_i = g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a
$$
This is zero for all $i$ if and only if **every individual gets the same
prediction** — i.e. $g^{-1}(X_i^{*T}\hat\beta)$ does not depend on $i$.
This happens when:
1. **The model is saturated in the treatment** (e.g. $Y \sim A$ for binary
treatment) **and** the intervention is static ($A = a$ for everyone). Then
all individuals share the same $X^*$ (only the intercept and treatment
indicator matter), so predictions are identical.
2. **The model has no covariates beyond treatment** (e.g. a linear MSM
   $Y \sim A$ for continuous $A$) **and** the intervention is static. Same
   reasoning: all individuals share the same $X^*$, hence the same prediction.
The shortcut **fails** when predictions vary across individuals:
- **G-computation** with confounders: $Y \sim A + L$. Predictions
$g^{-1}(\hat\beta_0 + \hat\beta_1 a + \hat\beta_2 L_i)$ vary with $L_i$
across individuals. Channel 1 is nonzero.
- **Non-saturated MSM**: $Y \sim A$ for continuous $A$ is a linear model, not
saturated. With non-static interventions (e.g. shift each person's treatment
by $+\delta$), different individuals get different $A$ values, hence different
predictions.
| Situation | Predictions constant? | Delta method exact? |
|---|---|---|
| G-comp with confounders ($Y \sim A + L$) | No | **No** — underestimates |
| IPW/matching, binary $A$, saturated MSM ($Y \sim A$), static intervention | Yes | Yes |
| IPW/matching, continuous $A$, linear MSM ($Y \sim A$), static intervention | Yes | Yes |
| IPW, continuous $A$, non-static intervention (shift, MTP) | No | **No** — underestimates |
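The first two rows of the table can be checked directly: under a treatment-only model with a static intervention, every counterfactual prediction is the same number, so Channel 1 vanishes; with a confounder in the model the predictions spread out. A base-R sketch on simulated data (the helper `sat_pred1()` is illustrative, not causatr code):

```{r}
#| label: sketch-channel1-zero
set.seed(102)
sat_d <- data.frame(L = rnorm(400))
sat_d$A <- rbinom(400, 1, plogis(0.5 * sat_d$L))
sat_d$Y <- rbinom(400, 1, plogis(-0.3 + sat_d$A + sat_d$L))

sat_pred1 <- function(formula) {
  nd <- sat_d; nd$A <- 1                     # static intervention A = 1
  predict(glm(formula, family = binomial, data = sat_d),
          newdata = nd, type = "response")
}

sat_p_sat  <- sat_pred1(Y ~ A)               # saturated in treatment
sat_p_conf <- sat_pred1(Y ~ A + L)           # confounder-adjusted

# Saturated + static: one shared prediction, Channel 1 identically zero.
stopifnot(max(abs(sat_p_sat - mean(sat_p_sat))) < 1e-12)
# With L in the model, predictions vary, so Channel 1 is nonzero.
stopifnot(sd(sat_p_conf) > 0)
```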
### 3.5 The IF is the stacked sandwich, solved analytically
The two-channel formula (@eq-if-gcomp) may seem to appear from thin air. In
fact it is the sandwich (@eq-sandwich) applied to the **complete** estimating
equations — model score plus mean equation — with the block-triangular bread
inverted by back-substitution. This section shows the derivation.
#### The complete estimating equations
G-computation has two estimating equations per individual:
$$
\underset{(p+1) \times 1}{\Psi(O_i;\,\beta,\,\mu_a)}
= \begin{pmatrix}
\underset{p \times 1}{\psi_\beta(O_i;\,\beta)} \\[4pt]
\underset{1 \times 1}{\psi_\mu(O_i;\,\beta,\,\mu_a)}
\end{pmatrix}
= \begin{pmatrix}
\text{GLM score (Section 1.2)} \\[4pt]
g^{-1}(X_i^{*T}\beta) - \mu_a
\end{pmatrix}
$$
The first block $\psi_\beta$ defines the model. The second block $\psi_\mu$
defines the estimand: $\hat\mu_a$ is the value of $\mu_a$ that makes $\sum_i
\psi_\mu = 0$, i.e. $\hat\mu_a = \frac{1}{n}\sum_i g^{-1}(X_i^{*T}
\hat\beta)$.
#### The block-triangular bread
The bread matrix of the stacked system is:
$$
\underset{(p+1) \times (p+1)}{A}
= -\frac{1}{n}\sum_i \frac{\partial\Psi_i}{\partial(\beta,\mu_a)^T}
= \begin{pmatrix}
\underset{p \times p}{\color{#3182BD}{A_{\beta\beta}}} &
\underset{p \times 1}{0} \\[6pt]
\underset{1 \times p}{\color{#E6550D}{A_{\mu\beta}}} &
\underset{1 \times 1}{\color{#E6550D}{A_{\mu\mu}}}
\end{pmatrix}
$$
where:
- $\color{#3182BD}{A_{\beta\beta}} = -\frac{1}{n}\sum_i
\frac{\partial\psi_\beta}{\partial\beta^T}$ — the GLM Hessian
(Section 1.4).
- $\color{#E6550D}{A_{\mu\beta}} = -\frac{1}{n}\sum_i
\frac{\partial\psi_\mu}{\partial\beta^T}
= -\frac{1}{n}\sum_i \frac{d\mu}{d\eta}\big|_{\eta_i^*} X_i^{*T}
= -\color{#31A354}{J_a}$ — the negative Jacobian from Section 3.2.
- $\color{#E6550D}{A_{\mu\mu}} = -\frac{\partial\psi_\mu}{\partial\mu_a}
= 1$.
- Upper-right = $0$: the GLM score does not depend on $\mu_a$.
The block-triangular structure arises because $\psi_\beta$ depends only on
$\beta$ (not on $\mu_a$), while $\psi_\mu$ depends on both $\beta$ and
$\mu_a$.
#### Inverting by back-substitution
For a $2 \times 2$ block-lower-triangular matrix, the inverse is:
$$
A^{-1} = \begin{pmatrix}
\color{#3182BD}{A_{\beta\beta}^{-1}} & 0 \\[4pt]
-A_{\mu\mu}^{-1} \, \color{#E6550D}{A_{\mu\beta}} \,
\color{#3182BD}{A_{\beta\beta}^{-1}} &
A_{\mu\mu}^{-1}
\end{pmatrix}
$$
The IF for the full parameter $(\beta, \mu_a)$ is $A^{-1}\Psi_i$. Extracting
the $\mu_a$ row (the last row of $A^{-1}$ times $\Psi_i$):
$$
\text{IF}_i(\hat\mu_a)
= \underbrace{A_{\mu\mu}^{-1}\;\psi_{\mu,i}}_{\text{Channel 1}}
\;+\;
\underbrace{(-A_{\mu\mu}^{-1})\;
\color{#E6550D}{A_{\mu\beta}}\;
\color{#3182BD}{A_{\beta\beta}^{-1}}\;
\psi_{\beta,i}}_{\text{Channel 2}}
$$
Substituting $A_{\mu\mu} = 1$, $A_{\mu\beta} = -\color{#31A354}{J_a}$, and
$\psi_{\mu,i} = g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a$:
$$
\text{IF}_i(\hat\mu_a) =
\underbrace{g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a}_
{\text{Channel 1}}
\;+\;
\underbrace{
\color{#31A354}{J_a}\;
\color{#3182BD}{A_{\beta\beta}^{-1}}\;
\psi_{\beta,i}}_{\text{Channel 2}}
$$
This is exactly @eq-if-gcomp. The bread $\color{#3182BD}{A_{\beta\beta}^{-1}}$
is `model_bread_inv()` = $(X^TWX)^{-1}$. The score $\psi_{\beta,i}$ is
$r_i \cdot (d\mu/d\eta) / V(\mu_i) \cdot X_i$. The Jacobian
$\color{#31A354}{J_a}$ projects the $p$-vector result to a scalar.
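The back-substitution formula itself is easy to sanity-check numerically with arbitrary made-up blocks; nothing here depends on the GLM details:

```{r}
#| label: sketch-block-inverse
set.seed(103)
bt_p <- 3
bt_Abb <- crossprod(matrix(rnorm(bt_p^2), bt_p))   # A_bb: p x p, invertible
bt_Amb <- matrix(rnorm(bt_p), nrow = 1)            # A_mb: 1 x p
bt_Amm <- 1                                        # A_mm: scalar
bt_A <- rbind(cbind(bt_Abb, 0), cbind(bt_Amb, bt_Amm))

# Inverse via the block back-substitution formula
bt_A_inv <- rbind(
  cbind(solve(bt_Abb), 0),
  cbind(-(1 / bt_Amm) * bt_Amb %*% solve(bt_Abb), 1 / bt_Amm)
)
stopifnot(max(abs(solve(bt_A) - bt_A_inv)) < 1e-8)
```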
::: {.callout-note}
### The meat is hidden, not absent
The IF formula computes $\color{#3182BD}{A^{-1}}\psi_i$ for each individual,
then squares: $\text{Var} = \frac{1}{n^2}\sum_i \text{IF}_i^2$. Expanding:
$$
\frac{1}{n^2}\sum_i \text{IF}_i^2
= \frac{1}{n}\;\color{#3182BD}{A^{-1}}
\underbrace{\frac{1}{n}\sum_i \Psi_i \Psi_i^T}_
{\color{#E6550D}{B} \text{ (the meat)}}
\color{#3182BD}{(A^{-1})^T}
$$
This is the sandwich (@eq-sandwich) for the stacked system. The IF avoids
forming the $(p+1) \times (p+1)$ meat matrix $\color{#E6550D}{B}$ — it
computes the same answer by summing scalar products.
:::
### 3.6 Target subpopulations (ATT, ATU, custom subset)
Sections 3.1–3.5 define the marginal mean as an average over the **full
sample**: $\hat\mu_a = (1/n)\sum_{i=1}^n g^{-1}(X_i^{*T}\hat\beta)$.
Many causal estimands — the ATT, the ATU, or an estimand on a
user-supplied subset — instead average only over a **target
subpopulation** $T \subseteq \{1,\ldots,n\}$ of size $n_t$:
$$
\hat\mu_a \;=\; \frac{1}{n_t}\sum_{i\in T} g^{-1}(X_i^{*T}\hat\beta)
$$
The stacked estimating equations need one small change to accommodate
this: the mean equation is now indicator-weighted. Let $t_i = \mathbb{1}[i \in T]$ and write
$$
\psi_{\mu,i}(\beta,\mu_a) \;=\; t_i \cdot \bigl[g^{-1}(X_i^{*T}\beta) - \mu_a\bigr]
$$
so that $\sum_i \psi_{\mu,i} = 0$ still recovers $\hat\mu_a$.
Redoing the bread calculation from Section 3.5 with this indicator-weighted mean equation:
- $\color{#E6550D}{A_{\mu\mu}} = -\frac{1}{n}\sum_i \frac{\partial\psi_{\mu,i}}{\partial\mu_a} = -\frac{1}{n}\sum_i(-t_i) = \frac{n_t}{n}$, so $A_{\mu\mu}^{-1} = \frac{n}{n_t}$.
- $\color{#E6550D}{A_{\mu\beta}} = -\frac{1}{n}\sum_i t_i \cdot \frac{d\mu}{d\eta}\Big|_{\eta_i^*} X_i^{*T} = -\frac{n_t}{n}\,\color{#31A354}{J_a}$, where $\color{#31A354}{J_a} = \frac{1}{n_t}\sum_{i\in T}\frac{d\mu}{d\eta}\big|_{\eta_i^*} X_i^{*T}$ is the Jacobian averaged over the target.
Plugging into the back-substitution formula for the last row of $A^{-1}$:
$$
\text{IF}_i(\hat\mu_a)
\;=\; -A_{\mu\mu}^{-1}\,\color{#E6550D}{A_{\mu\beta}}\,\color{#3182BD}{A_{\beta\beta}^{-1}}\,\psi_{\beta,i}
\;+\; A_{\mu\mu}^{-1}\,\psi_{\mu,i}
$$
Substituting $A_{\mu\mu}^{-1} = n/n_t$ and $A_{\mu\beta} = -(n_t/n)\,\color{#31A354}{J_a}$:
$$
\text{IF}_i(\hat\mu_a)
\;=\; \underbrace{\frac{n}{n_t}\cdot t_i \cdot (g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a)}_
{\text{Channel 1 (rescaled)}}
\;+\; \underbrace{\color{#31A354}{J_a}\,\color{#3182BD}{A_{\beta\beta}^{-1}}\,\psi_{\beta,i}}_
{\text{Channel 2 (unchanged)}}
$$ {#eq-if-gcomp-target}
Two things to notice:
1. **Channel 1 picks up a factor of $n/n_t$** and is nonzero only for
   individuals inside the target. Without this factor, Channel 1's
   contribution to the variance is understated by a factor of $(n_t/n)^2$
   — a serious undercount when the target is a proper subset (e.g., the
   ATT with only ~40% treated).
2. **Channel 2 is unchanged.** The $n_t/n$ in $A_{\mu\beta}$ cancels with
the $n/n_t$ in $A_{\mu\mu}^{-1}$, so the Channel 2 formula from
Section 3.2 carries over verbatim — $\color{#31A354}{J_a}$ is just
averaged over the target population instead of the full sample, and
$\psi_{\beta,i}$ still runs over every individual the outcome model
used (inside **and** outside the target), because the outcome model
was fit on all those rows.
The variance is still
$\widehat{\text{Var}}(\hat\mu_a) = (1/n^2)\sum_{i=1}^n \text{IF}_i^2$,
summed over all $n$ individuals. Individuals outside the target have
$\text{Ch.1}_i = 0$ and contribute only through Channel 2.
Setting $T = \{1,\ldots,n\}$ (so $n_t = n$, $t_i \equiv 1$) recovers
@eq-if-gcomp exactly. The full-sample case is the special case of
@eq-if-gcomp-target where the scaling factor is $1$.
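A quick sketch of both properties: the rescaled Channel 1 vanishes outside the target yet still centers within it, and the full-sample target reproduces the unscaled formula. Simulated data; the helper `tg_ch1()` is illustrative, not causatr API:

```{r}
#| label: sketch-target-channel1
set.seed(104)
tg_n <- 300
tg_d <- data.frame(L = rnorm(tg_n))
tg_d$A <- rbinom(tg_n, 1, plogis(tg_d$L))
tg_d$Y <- rbinom(tg_n, 1, plogis(-0.4 + tg_d$A + 0.7 * tg_d$L))
tg_fit <- glm(Y ~ A + L, family = binomial, data = tg_d)

# Rescaled Channel 1: (n / n_t) * t_i * (pred_i - mu_hat)
tg_ch1 <- function(target) {
  nd <- tg_d; nd$A <- 1
  p <- predict(tg_fit, newdata = nd, type = "response")
  t_i <- as.numeric(seq_len(tg_n) %in% target)
  n_t <- sum(t_i)
  mu <- sum(t_i * p) / n_t
  (tg_n / n_t) * t_i * (p - mu)
}

tg_att <- tg_ch1(which(tg_d$A == 1))   # ATT-style target: the treated
tg_all <- tg_ch1(seq_len(tg_n))        # full sample: scaling factor is 1

stopifnot(all(tg_att[tg_d$A == 0] == 0))   # zero outside the target
stopifnot(abs(sum(tg_att)) < 1e-9)         # centers within the target
stopifnot(abs(sum(tg_all)) < 1e-9)
```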
### 3.7 Pseudo-code: the full IF for a marginal mean
This incorporates the target subpopulation scaling from Section 3.6
and the correct non-canonical link handling for the score (multiply
response residuals by $\mu_\eta/V$; see Section 1.2 route 3).
```{r}
#| eval: false
#| code-fold: true
#| code-summary: "Pseudo-code: gcomp_influence()"
gcomp_influence <- function(model, data_a, target_idx) {
n <- nrow(data_a)
beta_hat <- coef(model)
pred_terms <- delete.response(terms(model))
family <- model$family
# Predictions under intervention (target rows only)
X_star <- model.matrix(pred_terms, data = data_a[target_idx, ])
eta_star <- as.numeric(X_star %*% beta_hat)
preds <- family$linkinv(eta_star)
mu_hat <- mean(preds)
n_t <- length(preds)
# Channel 1: (n/n_t) * t_i * (pred_i - mu_hat), zero outside target
IF_direct <- rep(0, n)
IF_direct[target_idx] <- preds - mu_hat
# Channel 2 gradient: J_a = (1/n_t) sum_{target} (dmu/deta) * X_star
mu_eta_star <- family$mu.eta(eta_star)
J <- as.numeric(crossprod(X_star, mu_eta_star)) / n_t # p x 1
# Bread inverse (X'WX)^{-1} with IWLS working weights
X_fit <- model.matrix(model)
eta_fit <- model$linear.predictors
mu_eta_fit <- family$mu.eta(eta_fit)
V_fit <- family$variance(fitted(model))
W <- mu_eta_fit^2 / V_fit
XtWX_inv <- solve(crossprod(X_fit, X_fit * W))
h <- as.numeric(XtWX_inv %*% J) # p x 1
# Per-observation directional sensitivity and SCORE residual.
# Score residual = response residual * (dmu/deta) / V(mu), so that
# psi_i = X_i * r_i * (mu_eta/V). For canonical links mu_eta/V = 1
# and this reduces to the response residual.
d_fit <- as.numeric(X_fit %*% h)
r_response <- residuals(model, type = "response")
r_score <- r_response * mu_eta_fit / V_fit
IF_correction <- d_fit * r_score # n_fit x 1
# Combine: full IF = (n/n_t) * Ch.1 + n * Ch.2
# Factor of n on Ch.2 from A_{beta,beta}^{-1} = n * (X'WX)^{-1}
# Factor of (n/n_t) on Ch.1 from A_{mu,mu}^{-1} = n/n_t (Sec. 3.6)
IF <- (n / n_t) * IF_direct
IF[seq_along(IF_correction)] <- IF[seq_along(IF_correction)] +
n * IF_correction
IF
}
```
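The routine above is pseudo-code, but its core can be exercised end-to-end on a case with a known answer. Under the saturated linear model $Y \sim A$ with the static intervention $A = 1$, the marginal mean is the treated-group mean, whose HC0 variance has the closed form $\sum_{i:A_i=1}(Y_i - \bar Y_1)^2 / n_1^2$; the full two-channel IF variance should reproduce it. A condensed, self-contained variant of the same logic (Gaussian family, identity link, full-sample target; variable names are illustrative):

```{r}
#| label: sketch-gcomp-if-closed-form
set.seed(105)
sm_n <- 200
sm_A <- rbinom(sm_n, 1, 0.5)
sm_Y <- 1 + 0.5 * sm_A + rnorm(sm_n)
sm_fit <- lm(sm_Y ~ sm_A)

sm_X <- model.matrix(sm_fit)
sm_Xs <- sm_X; sm_Xs[, "sm_A"] <- 1
sm_pred <- as.numeric(sm_Xs %*% coef(sm_fit))     # constant: b0 + b1
sm_C1 <- sm_pred - mean(sm_pred)                  # identically zero here
sm_J <- colMeans(sm_Xs)                           # identity link: dmu/deta = 1
sm_bread_inv <- solve(crossprod(sm_X))            # (X'X)^{-1}, W = 1
sm_C2 <- sm_n * as.numeric((sm_X * residuals(sm_fit)) %*%
                             (sm_bread_inv %*% sm_J))
sm_var_if <- sum((sm_C1 + sm_C2)^2) / sm_n^2

# Closed form: HC0 variance of the treated-sample mean
sm_n1 <- sum(sm_A == 1)
sm_y1 <- sm_Y[sm_A == 1]
sm_var_closed <- sum((sm_y1 - mean(sm_y1))^2) / sm_n1^2

stopifnot(max(abs(sm_C1)) < 1e-12)
stopifnot(abs(sm_var_if - sm_var_closed) < 1e-12)
```

The match is exact because Channel 1 is identically zero in this saturated case; with a covariate in the model the closed form no longer applies and both channels contribute.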
## 4. The three causal estimators
This section applies the IF framework from Section 3 to g-computation, IPW, and
matching. The variance formula is always the same (@eq-var-if); what differs is
the estimating equations and whether Channel 1 is zero.
### 4.1 G-computation
**Estimating equations.** G-computation fits an outcome model
$E[Y_i \mid A_i, L_i] = g^{-1}(X_i^T \beta)$ and computes
$\hat\mu_a = \frac{1}{n}\sum_i g^{-1}(X_i^{*T}\hat\beta)$ where $X_i^*$ sets
treatment to $a$. The complete estimating equations are:
$$
\begin{pmatrix}
\psi_\beta(O_i; \beta) \\[4pt]
\psi_\mu(O_i; \beta, \mu_a)
\end{pmatrix}
=
\begin{pmatrix}
\text{GLM score for outcome model} \\[4pt]
g^{-1}(X_i^{*T}\beta) - \mu_a
\end{pmatrix}
$$
**IF.** From Section 3.2 (@eq-if-gcomp):
$$
\text{IF}_i(\hat\mu_a) =
\underbrace{g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a}_
{\text{Channel 1 (nonzero)}}
\;+\;
\underbrace{
\color{#31A354}{J_a}
\color{#3182BD}{A^{-1}}
\psi_i
}_{\text{Channel 2}}
$$
**Both channels are needed** because predictions
$g^{-1}(\hat\beta_0 + \hat\beta_1 a + \hat\beta_2 L_i)$ vary across
individuals through $L_i$. The delta method shortcut
$\color{#31A354}{J} V_\beta \color{#31A354}{J^T}$ underestimates.
**Variance**: $\frac{1}{n^2}\sum_i \text{IF}_i^2$ using the full two-channel IF.
### 4.2 Inverse probability weighting (IPW)
IPW is more complex because two models are involved, and the uncertainty from
the first (propensity scores) must propagate into the second (MSM).
::: {.callout-tip}
### Companion vignette
The density-ratio weight formulas (Horvitz--Thompson indicator,
pushforward with sign/Jacobian, IPSI closed form), the `make_weight_fn()`
closure design, and the numerical $A_{\beta\alpha}$ verification are
developed in `vignette("ipw-variance-theory")`.
:::
#### The two-model setup
1. **Propensity score model**: estimate $\hat\alpha$ ($q$-vector) from
$\text{logit}[P(A_i = 1 \mid L_i)] = L_i^T \alpha$.
2. **Weighted outcome model (MSM)**: using the estimated weights $w_i =
1/\hat{P}(A_i \mid L_i; \hat\alpha)$, fit $E[Y_i \mid A_i] =
g^{-1}(\beta_0 + \beta_1 A_i)$ by weighted regression to get $\hat\beta$
($p$-vector).
The weights $w_i$ depend on $\hat\alpha$, so $\hat\beta$ depends on
$\hat\alpha$.
#### Why the naive approach fails
A naive sandwich on the outcome model alone — treating the weights as *known
constants* — ignores the variability in $\hat\alpha$. Each new sample produces
different $\hat\alpha$, hence different weights, hence different $\hat\beta$.
The true variability of $\hat\beta$ has two sources:
- **Outcome variability**: randomness in $Y_i$ given fixed weights.
- **Weight variability**: randomness in $\hat\alpha$ (and hence $w_i$).
The naive sandwich captures only the first. The direction of the bias depends
on the estimand: for the **ATE**, using estimated (rather than known) propensity
scores actually *reduces* the asymptotic variance, so the naive sandwich is
conservative — it **overestimates** the SE (Hirano et al., 2003; Lunceford &
Davidian, 2004). For the **ATT**, the effect can go either direction.
#### The correct approach: joint M-estimation
Treat both models together as a single M-estimation problem. Define the joint
parameter $\theta = (\alpha, \beta)$ — a $(q+p)$-vector — and the stacked
estimating function:
$$
\underset{(q+p) \times 1}{\Psi(O_i; \theta)} =
\begin{pmatrix}
\color{#3182BD}{\psi_\alpha(O_i; \alpha)} \\[4pt]
\color{#E6550D}{\psi_\beta(O_i; \alpha, \beta)}
\end{pmatrix}
$$
where $\color{#3182BD}{\psi_\alpha}$ ($q \times 1$) is the score of the
propensity model, and $\color{#E6550D}{\psi_\beta}$ ($p \times 1$) is the
*weighted* score of the outcome model (which depends on $\alpha$ through $w_i$).
The bread matrix $\hat A = -\frac{1}{n}\sum_i
\frac{\partial\Psi_i}{\partial\theta^T}$ has a block structure:
$$
\underset{(q+p) \times (q+p)}{\hat{A}} =
\begin{pmatrix}
\underset{q \times q}{\color{#3182BD}{A_{\alpha\alpha}}} &
\underset{q \times p}{0} \\[6pt]
\underset{p \times q}{\color{#E6550D}{A_{\beta\alpha}}} &
\underset{p \times p}{\color{#E6550D}{A_{\beta\beta}}}
\end{pmatrix}
$$
Each block has a concrete meaning:
- $\color{#3182BD}{A_{\alpha\alpha}}$ ($q \times q$): the Hessian of the
propensity model — how $\psi_\alpha$ changes with $\alpha$.
- $\color{#E6550D}{A_{\beta\beta}}$ ($p \times p$): the weighted Hessian of
the outcome model — how $\psi_\beta$ changes with $\beta$.
- $\color{#E6550D}{A_{\beta\alpha}}$ ($p \times q$): the **cross-term** — how
the weighted outcome score changes with $\alpha$. This involves
$\frac{\partial w_i}{\partial\alpha^T}$ (how the weights shift when the
propensity parameters change). This is the block that captures the dependency.
- Upper-right = $0$ ($q \times p$): the propensity score does not depend on
$\beta$.
The full sandwich $\frac{1}{n}\hat A^{-1}\hat B(\hat A^{-1})^T$ applied to
the stacked system correctly propagates weight uncertainty into
$\text{Var}(\hat\beta)$ through the $A_{\beta\alpha}$ block.
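The cross-block is just a derivative, so it can be checked without any package machinery. As a minimal sketch (simulated data; central differences standing in for `numDeriv::jacobian()`, and the Horvitz–Thompson mean equation standing in for the weighted MSM score), compare the analytic derivative of the mean weighted estimating function with respect to $\alpha$ against a numerical one:

```{r}
#| label: sketch-cross-block
set.seed(106)
fd_n <- 500
fd_L <- cbind(1, rnorm(fd_n))                   # design: intercept + covariate
fd_A <- rbinom(fd_n, 1, plogis(fd_L %*% c(-0.2, 0.6)))
fd_Y <- rbinom(fd_n, 1, plogis(-0.5 + fd_A + 0.8 * fd_L[, 2]))
fd_alpha <- coef(glm(fd_A ~ fd_L[, 2], family = binomial))

# Mean weighted estimating function (HT numerator) as a function of alpha
fd_psi <- function(a) mean(fd_A * fd_Y / plogis(fd_L %*% a))

# Analytic: d/d alpha of (A Y / e) = -(A Y / e) (1 - e) L  for logistic e
fd_e <- as.numeric(plogis(fd_L %*% fd_alpha))
fd_analytic <- -colMeans((fd_A * fd_Y / fd_e) * (1 - fd_e) * fd_L)

# Central differences in place of numDeriv::jacobian()
fd_h <- 1e-6
fd_numeric <- sapply(1:2, function(j) {
  up <- fd_alpha; up[j] <- up[j] + fd_h
  dn <- fd_alpha; dn[j] <- dn[j] - fd_h
  (fd_psi(up) - fd_psi(dn)) / (2 * fd_h)
})
stopifnot(max(abs(fd_analytic - fd_numeric)) < 1e-5)
```

By the bread convention $A = -\frac{1}{n}\sum_i \partial\Psi_i/\partial\theta^T$, the corresponding bread block is the negative of this derivative.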
::: {.callout-tip}
### In practice
causatr's self-contained IPW engine computes this stacked sandwich
directly via `compute_ipw_if_self_contained_one()`, which uses
`numDeriv::jacobian()` on the `make_weight_fn()` closure to evaluate
the $A_{\beta\alpha}$ block numerically. See
`vignette("ipw-variance-theory")` for the closure design and
verification.
:::
#### The full IF for $\hat\mu_a$
The stacked sandwich above gives $V_\beta$ — the variance of the MSM
parameters $\hat\beta$, accounting for weight uncertainty. But we need
$\text{Var}(\hat\mu_a)$, not $\text{Var}(\hat\beta)$. Following the same
approach as Section 3.5, we add the **mean equation** to the stack:
$$
\underset{(q+p+1) \times 1}{\Psi(O_i;\,\alpha,\,\beta,\,\mu_a)}
= \begin{pmatrix}
\color{#3182BD}{\psi_\alpha(O_i;\,\alpha)} \\[4pt]
\color{#E6550D}{\psi_\beta(O_i;\,\alpha,\,\beta)} \\[4pt]
\psi_\mu(O_i;\,\beta,\,\mu_a)
\end{pmatrix}
$$
The bread is now $3 \times 3$ block-lower-triangular:
$$
A = \begin{pmatrix}
\color{#3182BD}{A_{\alpha\alpha}} & 0 & 0 \\[4pt]
\color{#E6550D}{A_{\beta\alpha}} &
\color{#E6550D}{A_{\beta\beta}} & 0 \\[4pt]
0 & A_{\mu\beta} & A_{\mu\mu}
\end{pmatrix}
$$
where $A_{\mu\beta} = -\color{#31A354}{J}$ (negative Jacobian of predictions
w.r.t. $\beta$) and $A_{\mu\mu} = 1$ (same as Section 3.5). The zero in the
$(\mu, \alpha)$ position arises because the mean equation
$g^{-1}(X_i^{*T}\beta) - \mu_a$ does not depend on $\alpha$ directly — it
depends on $\alpha$ only through $\beta$.
Inverting by back-substitution and extracting the $\mu_a$ row gives:
$$
\text{IF}_i(\hat\mu_a)
= \underbrace{g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a}_
{\text{Channel 1 (direct)}}
+ \underbrace{\color{#31A354}{J}\;
\color{#E6550D}{A_{\beta\beta}^{-1}}\;
\psi_{\beta,i}}_{\text{MSM correction}}
- \underbrace{\color{#31A354}{J}\;
\color{#E6550D}{A_{\beta\beta}^{-1}}\;
\color{#E6550D}{A_{\beta\alpha}}\;
\color{#3182BD}{A_{\alpha\alpha}^{-1}}\;
\psi_{\alpha,i}}_{\text{propensity correction}}
$$ {#eq-if-ipw}
Reading each term:
- **Channel 1**: how individual $i$'s prediction deviates from the mean (same
as g-comp). Zero when predictions are constant (saturated MSM + static
intervention); nonzero for non-static interventions or non-saturated MSMs.
- **MSM correction**: how $i$'s outcome residual (captured in $\psi_{\beta,i}$)
propagates to $\hat\mu_a$ through the MSM parameters. Same structure as the
g-comp Channel 2.
- **Propensity correction**: how $i$'s propensity score residual (captured in
$\psi_{\alpha,i}$) propagates to $\hat\mu_a$ through the weights. The chain
is: $\psi_{\alpha,i} \to \hat\alpha \to w_i \to \hat\beta \to \hat\mu_a$.
The matrix product
$\color{#E6550D}{A_{\beta\alpha}}\color{#3182BD}{A_{\alpha\alpha}^{-1}}$
captures the first two links (propensity parameters $\to$ weights $\to$ MSM
parameters); $\color{#31A354}{J}\color{#E6550D}{A_{\beta\beta}^{-1}}$
captures the last two (MSM parameters $\to$ predictions $\to$ estimand).
The second and third terms can be combined:
$$
\text{IF}_i(\hat\mu_a)
= \text{Ch.1}_i
\;+\;
\color{#31A354}{J}\;
\underbrace{
\color{#E6550D}{A_{\beta\beta}^{-1}}
\left(
\psi_{\beta,i}
- \color{#E6550D}{A_{\beta\alpha}}\;
\color{#3182BD}{A_{\alpha\alpha}^{-1}}\;
\psi_{\alpha,i}
\right)}_{\text{IF}_i(\hat\beta) \text{ from the stacked }
(\alpha,\beta) \text{ system}}
$$
The expression in parentheses is the **adjusted score**: the MSM score
$\psi_{\beta,i}$ corrected for propensity estimation. This is exactly what
`WeightIt::glm_weightit()` computes internally when producing its
M-estimation sandwich — we are now exposing it per-individual.
::: {.callout-note}
### Where are the bread and meat?
The bread appears three times:
$\color{#3182BD}{A_{\alpha\alpha}^{-1}}$ (propensity bread inverse),
$\color{#E6550D}{A_{\beta\beta}^{-1}}$ (MSM bread inverse), and
$\color{#31A354}{J}$ (Jacobian from the mean equation). The individual
scores $\psi_{\alpha,i}$ and $\psi_{\beta,i}$ are what would form the meat
if we squared and summed them. The IF avoids forming the
$(q+p+1) \times (q+p+1)$ meat matrix.
:::
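The equivalence between the per-individual IF and the stacked sandwich can be verified on the smallest interesting case. Collapsing the MSM step (for a saturated MSM with weights $A_i/e(L_i)$, the Horvitz–Thompson mean plays the role of $\hat\beta$ and $\hat\mu_1$ at once), this self-contained sketch forms the stacked $(\alpha, \mu)$ bread and meat explicitly and checks that the $(\mu,\mu)$ entry of the sandwich equals the summed squared per-individual IFs:

```{r}
#| label: sketch-stacked-sandwich
set.seed(107)
st_n <- 400
st_L <- cbind(1, rnorm(st_n))
st_A <- rbinom(st_n, 1, plogis(st_L %*% c(-0.2, 0.7)))
st_Y <- rbinom(st_n, 1, plogis(-0.4 + st_A + 0.6 * st_L[, 2]))
st_e <- fitted(glm(st_A ~ st_L[, 2], family = binomial))
st_mu <- mean(st_A * st_Y / st_e)               # HT estimate of mu_1

# Stacked scores: logistic propensity score, then the mean equation
st_psi_a <- (st_A - st_e) * st_L                # n x 2
st_psi_m <- st_A * st_Y / st_e - st_mu          # n x 1
st_Psi <- cbind(st_psi_a, st_psi_m)             # n x 3

# Block lower-triangular bread
st_Aaa <- crossprod(st_L, st_L * (st_e * (1 - st_e))) / st_n
st_Ama <- colMeans((st_A * st_Y / st_e) * (1 - st_e) * st_L)
st_Abr <- rbind(cbind(st_Aaa, 0), c(st_Ama, 1))

# Full sandwich for the stacked system
st_B <- crossprod(st_Psi) / st_n
st_Ainv <- solve(st_Abr)
st_V <- st_Ainv %*% st_B %*% t(st_Ainv) / st_n

# Per-individual IF for mu: last row of A^{-1} times Psi_i
st_IF <- st_psi_m - as.numeric(st_psi_a %*% solve(st_Aaa, st_Ama))
stopifnot(abs(st_V[3, 3] - sum(st_IF^2) / st_n^2) < 1e-12)
```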
::: {.panel-tabset}
## Math
For a saturated MSM $Y \sim A$ with inverse-probability weights
$w_i = A_i / e(L_i)$ and propensity $e(L_i) = \mathrm{expit}(L_i^{\mathsf T}\alpha)$,
$$
\hat\mu_1 = \frac{1}{n}\sum_i w_i Y_i,
$$
and the IF of $\hat\mu_1$ is the Channel-1 + propensity-corrected
Channel-2 form given in @eq-if-ipw. causatr computes this via
`compute_ipw_if_self_contained_one()`, which assembles the adjusted
score per individual using `numDeriv::jacobian()` on the
`make_weight_fn()` closure (see `vignette("ipw-variance-theory")`).
## Code
```{r}
#| label: verify-ipw-if
# Fit the propensity model manually and check the point estimate matches
# causatr.
ps_fit <- glm(A ~ L, data = dat, family = binomial)
e_i <- fitted(ps_fit)
w1 <- dat$A / e_i
mu1_ipw_manual <- sum(w1 * dat$Y) / n
fit_ipw <- causat(
dat, outcome = "Y", treatment = "A", confounders = ~ L,
estimator = "ipw", family = binomial
)
res_ipw <- contrast(
fit_ipw,
interventions = list(a1 = static(1), a0 = static(0)),
reference = "a0", ci_method = "sandwich"
)
mu1_ipw_causatr <- res_ipw$estimates[intervention == "a1"]$estimate
se_ipw_causatr <- res_ipw$estimates[intervention == "a1"]$se
```
## Result
```{r}
#| label: verify-ipw-if-result
#| echo: false
out <- data.frame(
quantity = c("$\\hat\\mu_1$ (Horvitz–Thompson)",
"$\\hat\\mu_1$ (causatr IPW)",
"$\\mathrm{SE}(\\hat\\mu_1)$ via causatr IF engine"),
value = c(mu1_ipw_manual, mu1_ipw_causatr, se_ipw_causatr)
)
tt(out, caption = "IPW $\\hat\\mu_1$ and its IF-based SE") |>
format_tt(j = "quantity", markdown = TRUE) |>
format_tt(j = 2, digits = 4, num_fmt = "significant")
```
The HT estimate and the causatr IPW estimate agree exactly (they are
algebraically identical on a saturated MSM with $w_i = A_i/e(L_i)$),
and the SE comes from the propensity-corrected adjusted score — the
third term in @eq-if-ipw — computed internally.
:::
### 4.3 Matching
After matching, the outcome model is fitted on the matched sample. Two issues
require special treatment.
#### Cluster-robust variance
Matched individuals within the same subclass (pair or stratum) are correlated —
they were selected together. The standard sandwich (which assumes independence)
would understate the meat.
The **cluster-robust** sandwich accounts for this. The bread
$\color{#3182BD}{A}$ is unchanged, but the meat sums scores *within* each
subclass before squaring:
$$
V_\beta = \color{#3182BD}{A^{-1}}
\;\color{#E6550D}{B_{\text{CL}}}\;
\color{#3182BD}{(A^{-1})^T}
$$
where:
$$
\color{#E6550D}{B_{\text{CL}}} = \frac{1}{n}\sum_{c=1}^C
\bar\psi_c \, \bar\psi_c^T,
\qquad \bar\psi_c = \sum_{i \in \text{subclass } c} \psi_i
\qquad (p \times 1)
$$
Squaring the summed scores (rather than summing the squared scores) allows
within-cluster correlations to inflate the meat.
#### The full IF for matching
The per-individual IF has the same two-channel structure as g-comp:
$$
\text{IF}_i(\hat\mu_a) =
\underbrace{g^{-1}(X_i^{*T}\hat\beta) - \hat\mu_a}_
{\text{Channel 1}}
\;+\;
\underbrace{
\color{#31A354}{J_a}\;
\color{#3182BD}{A_{\beta\beta}^{-1}}\;
\psi_{\beta,i}}_{\text{Channel 2}}
$$
where $\psi_{\beta,i}$ is the outcome model score for individual $i$ on the
matched data. The difference from g-comp is in the **aggregation step**: instead
of treating each individual's IF as independent, we account for within-subclass
correlation by summing IFs within each matched set before squaring:
$$
\widehat{\text{Var}}(\hat\mu_a) = \frac{1}{n^2}\sum_{c=1}^{C}
\left(\sum_{i \in \text{subclass } c} \text{IF}_i\right)^2
$$
This is the IF analogue of the cluster-robust sandwich: instead of summing
squared IFs (which assumes independence), we sum the squared *cluster-level*
IFs. It is identical to $\color{#3182BD}{A^{-1}}
\color{#E6550D}{B_{\text{CL}}} \color{#3182BD}{(A^{-1})^T}$ from the
cluster-robust meat above, applied to the stacked system.
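The effect of the aggregation choice is easiest to see on constructed IFs. In this sketch (toy numbers, not a real matched fit), the two members of each pair are given identical IF values, the extreme of within-pair correlation, and the cluster-level variance comes out exactly twice the independence-based one:

```{r}
#| label: sketch-cluster-aggregation
set.seed(108)
cl_C <- 100                               # matched pairs
cl_n <- 2 * cl_C
cl_IF <- rep(rnorm(cl_C), each = 2)       # pair members share one IF value
cl_sub <- rep(seq_len(cl_C), each = 2)

# Independence-based: square, then sum
cl_v_ind <- sum(cl_IF^2) / cl_n^2
# Cluster-robust: sum within subclass, then square
cl_v_clu <- sum(rowsum(cl_IF, cl_sub)^2) / cl_n^2
stopifnot(abs(cl_v_clu / cl_v_ind - 2) < 1e-12)
```

With independent pair members the two aggregations agree in expectation; only shared selection inflates the cluster meat.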
::: {.panel-tabset}
## Math
For a fit on a matched sample with subclass indicator $c_i$,
$$
\widehat{\mathrm{Var}}(\hat\beta)_{\text{CL}}
= \hat A^{-1}
\left(\frac{1}{n}\sum_{c=1}^C
\bar\psi_c\,\bar\psi_c^{\mathsf T}\right)
\hat A^{-1} / n,
\qquad
\bar\psi_c = \sum_{i \in c}\psi_i.
$$
Equivalently, $\widehat{\mathrm{Var}}(\hat\beta)_{\text{CL}} =
\frac{1}{n^2}\sum_{c=1}^C \bar{\mathrm{IF}}_c\,\bar{\mathrm{IF}}_c^{\mathsf T}$,
with $\bar{\mathrm{IF}}_c = \sum_{i \in c}\mathrm{IF}_i$.
## Code
```{r}
#| label: verify-matching-cluster
# Build a toy matched design on dat: assign each pair of consecutive
# rows to the same subclass. This is not a real match, but it exercises
# the cluster-robust machinery on a dataset we already have.
subclass <- rep(seq_len(n / 2), each = 2)
# Manual cluster-robust sandwich, using the IF decomposition.
# IF_beta was computed in Section 2.3 (n x p matrix).
IF_by_cluster <- rowsum(IF_beta, group = subclass) # C x p
V_beta_CL_manual <- crossprod(IF_by_cluster) / n^2
# Gold standard: sandwich::vcovCL() with the same subclass grouping.
# Use HC0 (no finite-sample adjustment) so we match the textbook formula
# above exactly.
V_beta_CL_gold <- sandwich::vcovCL(
fit,
cluster = subclass,
type = "HC0",
cadjust = FALSE
)
```
## Result
```{r}
#| label: verify-matching-cluster-result
#| echo: false
cells <- outer(rownames(V_beta_CL_manual),
colnames(V_beta_CL_manual),
paste, sep = " × ")
out <- data.frame(
cell = as.vector(cells),
manual = as.vector(V_beta_CL_manual),
reference = as.vector(V_beta_CL_gold),
`abs diff` = abs(as.vector(V_beta_CL_manual) -
as.vector(V_beta_CL_gold)),
check.names = FALSE
)
tt(out, caption = "Cluster-robust $V_\\beta$: manual vs `sandwich::vcovCL`") |>
format_tt(j = 2:4, digits = 4, num_fmt = "significant")
```
The "sum within cluster, then square" IF aggregation reproduces
`sandwich::vcovCL()` (with `cadjust = FALSE`) to numerical precision.
Matching's bread is unchanged from the g-comp case — only the meat is
formed differently, by accumulating IFs inside each subclass before
squaring.
:::
### 4.4 Summary: one method, three estimators
All three methods use the same variance formula — the IF squared and summed
(@eq-var-if). What differs is the estimating equations and how Channel 2
is computed.
| Method | Stacked equations | Channel 2 models | Aggregation |
|---|---|---|---|
| **G-comp** | Outcome model + mean eq. | 1 (outcome) | Standard |
| **IPW** | Propensity + MSM + mean eq. | 2 (propensity, MSM) | Standard |
| **Matching** | Outcome model + mean eq. | 1 (outcome) | Cluster-robust |
| **ICE** | $K\!+\!1$ outcome models + mean eq. | $K\!+\!1$ (backward chain) | Standard |
The IF formula handles all cases uniformly. The only per-method variation is
**which models contribute** to Channel 2 and **how IFs are aggregated** (summed
within clusters for matching, directly for all others).
## 5. Chained estimators: ICE g-computation
### 5.1 The chain rule for influence functions
Section 3 derived the IF for a two-step estimator (fit $\hat\beta$, then
compute $\hat\mu_a$). What happens with **multiple models in a chain**, where
each model's response is a prediction from the next? The same logic applies
recursively. If $\hat\mu = f(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_K)$
and each $\hat\beta_k$ is an M-estimator with bread $\color{#3182BD}{A_k}$ and
score $\psi_k$, the IF of $\hat\mu$ is (Hampel et al., 1986; Huber, 1981):
$$
\text{IF}_i(\hat\mu)
\;=\;
\text{IF}_i^{\text{direct}}
\;+\;
\sum_{k=0}^{K}
\frac{\partial \hat\mu}{\partial \hat\beta_k^T}
\;\cdot\;
\color{#3182BD}{A_k^{-1}} \; \psi_k(O_i; \hat\beta_k)
$$ {#eq-if-chain}
Each term in the sum is a "Channel 2" from Section 3.2 applied to model $k$:
the Jacobian $\partial\hat\mu/\partial\hat\beta_k^T$ (how $\hat\mu$ depends on
model $k$) times the IF of model $k$'s parameters
($\color{#3182BD}{A_k^{-1}}\psi_{k,i}$).
::: {.callout-note}
### Connections to other frameworks
This IF-based approach is equivalent to:
1. **Stacked estimating equations** (Stefanski & Boos, 2002; Zivich et al.,
2024): concatenate all models' estimating equations and apply the generic
sandwich to the full stack.
2. **Semiparametric efficiency theory** (Tsiatis, 2006): derive the efficient
influence function (EIF) and project onto the parametric model.
All yield the same variance estimate.
:::
### 5.2 ICE g-computation: the motivating example
Iterated conditional expectation (ICE) g-computation estimates the
counterfactual mean under a treatment strategy in longitudinal data (Robins,
1986; Bang & Robins, 2005; Zivich et al., 2024). It fits $K+1$ outcome models
backward in time:
```{mermaid}
%%| label: fig-ice-chain
%%| fig-cap: "ICE backward iteration. Models are fit from final time to baseline. Each model's predictions become the next model's response."
flowchart RL
MK["Model K (final time)<br/>Fit: E[Y | A_K, H_K]<br/>Predict → Ỹ_K"] -->|"Ỹ_K becomes<br/>response for"| MK1["Model K-1<br/>Fit: E[Ỹ_K | A_{K-1}, H_{K-1}]<br/>Predict → Ỹ_{K-1}"]
MK1 -->|"Ỹ_{K-1} becomes<br/>response for"| Mdots["⋯"]
Mdots -->|"response for"| M0["Model 0 (baseline)<br/>Fit: E[Ỹ_1 | A_0, H_0]<br/>Predict → Ỹ_0"]
M0 -->|"average"| MU["μ̂ = mean(Ỹ_0)"]
style MK fill:#f0f4ff,stroke:#3182BD
style MK1 fill:#f0f4ff,stroke:#3182BD
style M0 fill:#f0f4ff,stroke:#3182BD
style MU fill:#e8f5e9,stroke:#31A354
```
Each model's response ($\tilde{Y}_k$) is a *prediction* from the next model in
the chain. This creates a cascade of dependencies.
### 5.3 Why naive approaches fail
**Naive 1: Variance of pseudo-outcomes only.** Uses only Channel 1 — ignores
all model correction terms.
**Naive 2: $\color{#31A354}{J}V_\beta\color{#31A354}{J^T}$ from model 0
only.** Captures Channel 2 for model 0 but ignores models $1, \ldots, K$.
**Correct: Full IF chain.** Account for all $K+1$ models via the chain rule.
### 5.4 Deriving the IF for chained models
#### The IF decomposition
Applying the chain rule (Section 5.1):
$$
\text{IF}_i
= \underbrace{\frac{n}{n_t}(\tilde{Y}_{0,i} - \hat\mu)\,\mathbb{1}\{i \in \text{target}\}}_{\text{Channel 1}}
+ \sum_{k=0}^{K}
\underbrace{
n_k \cdot \color{#E6550D}{d_{k,i}} \cdot r^{\text{score}}_{k,i}
}_{\text{Channel 2 for model } k}
$$ {#eq-if-ice}
where $n_t$ is the size of the target population, $n_k$ is the number of
rows used to fit model $k$ (this scaling is derived in Section 5.5),
$\color{#E6550D}{d_{k,i}}$ is the **sensitivity** — how much a
perturbation at model $k$ propagates to $\hat\mu$ — and
$r^{\text{score}}_{k,i}$ is individual $i$'s **score residual** at
model $k$:
$$
r^{\text{score}}_{k,i}
\;=\; (Y_{k,i} - \hat{Y}_{k,i}) \cdot
\frac{d\mu/d\eta\big|_{\eta_{k,i}}}{V(\hat{Y}_{k,i})}
$$
i.e. the response residual $r_{k,i} = Y_{k,i} - \hat{Y}_{k,i}$ weighted
by the link-scale factor from the GLM score (Section 1.2, route 3).
For **canonical** links (logistic-logit, Poisson-log, Gaussian-identity)
the ratio $(d\mu/d\eta)/V(\mu) = 1$ and $r^{\text{score}}_{k,i}$
collapses to the ordinary response residual $r_{k,i}$. For
**non-canonical** links (probit, cloglog, Gamma-log, Poisson-sqrt,
etc.) the factor is not $1$ and must be applied.
::: {.callout-warning}
### Why the score residual, not the response residual
$\text{Ch.2}_i = J_k A_k^{-1} \psi_{k,i}$ contains the full GLM score
$\psi_{k,i} = (Y - \mu)\cdot(d\mu/d\eta)/V(\mu)\cdot X_i$, not the bare
residual $(Y - \mu)\cdot X_i$. Working through the algebra:
$$
\color{#31A354}{J_k}\color{#3182BD}{A_k^{-1}}\psi_{k,i}
\;=\; n_k\cdot d_{k,i}\cdot (Y_{k,i} - \hat Y_{k,i})\cdot
\frac{d\mu/d\eta\big|_{\eta_{k,i}}}{V(\hat Y_{k,i})}
\;=\; n_k\cdot d_{k,i}\cdot r^{\text{score}}_{k,i}
$$
The factor of $n_k$ arises from $A_k^{-1} = n_k(X_k^T W_k X_k)^{-1}$
(Section 5.5). Omitting the $(d\mu/d\eta)/V$ factor and using the plain
response residual is correct only when it equals $1$ — i.e., only for
canonical links. This mistake is silent on logistic/Poisson-log/Gaussian
models but miscalibrates probit and cloglog.
:::
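The link-scale factor is easy to check directly in base R. The sketch below (toy data, no causatr code) evaluates $(d\mu/d\eta)/V(\mu)$ at the fitted values for a logit and a probit fit: for the canonical logit link it is identically $1$; for probit it is not.

```r
# Check (dmu/deta)/V(mu) at the fitted values: identically 1 for the
# canonical logit link, not for probit. Toy data; base R only.
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.5 * x))

link_factor <- function(fit) {
  fam <- fit$family
  fam$mu.eta(fit$linear.predictors) / fam$variance(fitted(fit))
}

f_logit  <- link_factor(glm(y ~ x, family = binomial("logit")))
f_probit <- link_factor(glm(y ~ x, family = binomial("probit")))

all.equal(f_logit, rep(1, 200), check.attributes = FALSE)  # canonical: 1
range(f_probit)  # non-canonical: above 1, varies with eta
```

The same two `family` accessors (`mu.eta`, `variance`) are what the pseudo-code in Section 5.6 uses to form the score residual.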
#### What the sensitivities measure
Each model correction requires a **sensitivity** $\color{#E6550D}{d_{k,i}}$: a
scalar that answers "if individual $i$'s residual at model $k$ increased by one
unit, how much would $\hat\mu$ change?" Computing $\color{#E6550D}{d_{k,i}}$
takes three steps, each with a clear role:
1. **Gradient** $g_k$ ($p_k \times 1$): the total sensitivity of $\hat\mu$ to
model $k$'s coefficient vector $\beta_k$. This is a Jacobian — the same
object as $\color{#31A354}{J_a}$ from Section 3.2, but computed recursively
for $k > 0$.
2. **Solve** $h_k = (X_k^T W_k X_k)^{-1} g_k$ ($p_k \times 1$): convert from
"how $\hat\mu$ responds to $\beta_k$" to "how $\hat\mu$ responds to
individual score contributions." This is the bread inverse
$\color{#3182BD}{A_k^{-1}}$ applied to $g_k$.
3. **Distribute** $\color{#E6550D}{d_{k,i}} = h_k^T X_{k,i}$ (scalar): project
onto individual $i$'s covariate vector to get their personal sensitivity.
#### Base case: model 0
$\hat\mu$ is a direct average of model 0's predictions. The gradient of
$\hat\mu$ w.r.t. $\beta_0$ is therefore the Jacobian from Section 3.2:
$$
\underset{p_0 \times 1}{g_0}
= \frac{1}{n_t}\sum_{j \in \text{target}}
\frac{d\mu}{d\eta}\bigg|_{\eta_{0,j}^*}
\;\cdot\; X_{0,j}^*
$$
$g_0$ tells us the total sensitivity of $\hat\mu$ to $\beta_0$, but the IF
requires a per-individual quantity. We need to ask: "of the total shift in
$\hat\beta_0$ caused by the data, how much came from individual $i$?" The
answer uses two facts: (a) individual $i$'s influence on $\hat\beta_0$ is
proportional to their score $\psi_{0,i} = (d\mu/d\eta)_i / V(\mu_i) \cdot
r_{0,i} \cdot X_{0,i}$; and (b) the constant of proportionality involves
the bread inverse $(X_0^T W_0 X_0)^{-1}$. Combining:
$$
h_0 = \underbrace{(X_0^T W_0 X_0)^{-1}}_{\text{bread inverse}} g_0,
\qquad
\color{#E6550D}{d_{0,i}} = h_0^T X_{0,i}
$$
Now $\color{#E6550D}{d_{0,i}} \cdot r^{\text{score}}_{0,i}$ gives
individual $i$'s Channel 2 contribution: their **score residual**
$r^{\text{score}}_{0,i} = (Y_{0,i} - \hat Y_{0,i}) \cdot (d\mu/d\eta)_i /
V(\mu_i)$ (how much their data "disagrees" with model 0, on the
link-scaled score metric) scaled by $\color{#E6550D}{d_{0,i}}$ (how much
that disagreement matters for $\hat\mu$). For canonical links
$(d\mu/d\eta)/V = 1$ and the score residual collapses to the response
residual $r_{0,i}$. For non-canonical links the link-scale factor must
be kept. This is equivalent to
$\color{#31A354}{J_0}\color{#3182BD}{A_0^{-1}}\psi_{0,i}$ from
Section 3.2, factored into the three steps above.
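The three-step factorization can be checked against the direct matrix product on a toy logistic outcome model. This is a self-contained base-R sketch (hypothetical data; the intervention design sets $a = 1$ for everyone), confirming that gradient–solve–distribute reproduces $\color{#31A354}{J_0}\color{#3182BD}{A_0^{-1}}\psi_{0,i}$:

```r
# Verify: n * d_{0,i} * r_score_{0,i} == J_0 A_0^{-1} psi_{0,i}
# for a single logistic outcome model (toy g-computation-style setup).
set.seed(2)
n <- 300
x <- rnorm(n)
a <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, plogis(-0.2 + 0.8 * a + 0.5 * x))
fit <- glm(y ~ a + x, family = binomial())

X   <- model.matrix(fit)
fam <- fit$family
W   <- fam$mu.eta(fit$linear.predictors)^2 / fam$variance(fitted(fit))
r_score <- residuals(fit, type = "response") *
  fam$mu.eta(fit$linear.predictors) / fam$variance(fitted(fit))

# Design under the intervention a = 1 for everyone
X_star   <- model.matrix(~ a + x, data.frame(a = 1, x = x))
eta_star <- drop(X_star %*% coef(fit))

# Gradient, solve, distribute
g0 <- crossprod(X_star, fam$mu.eta(eta_star)) / n    # p x 1 gradient
h0 <- solve(crossprod(X, X * W), g0)                 # bread-inverse step
d0 <- drop(X %*% h0)                                 # per-individual sensitivity
ch2_steps <- n * d0 * r_score

# Direct: J_0 A_0^{-1} psi_{0,i}, with A_0^{-1} = n (X' W X)^{-1}
A_inv <- n * solve(crossprod(X, X * W))
psi   <- X * r_score                                 # row i is psi_{0,i}'
ch2_direct <- drop(psi %*% A_inv %*% g0)

all.equal(ch2_steps, ch2_direct, check.attributes = FALSE)  # TRUE
```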
#### Recursive case: model $k = 1, \ldots, K$
Model $k$ does not feed into $\hat\mu$ directly. Instead, model $k$'s
predictions become model $k-1$'s pseudo-outcomes (response variable). We already
know how sensitive $\hat\mu$ is to each of those pseudo-outcomes — that is
$\color{#E6550D}{d_{k-1,j}}$ from the previous step.
The gradient of $\hat\mu$ w.r.t. $\beta_k$ therefore weights each individual's
contribution by the previously-computed sensitivity, further scaled by the
prior weight $w_{k-1,j}$ that the $(k-1)$-th score function carries for
row $j$:
$$
\underset{p_k \times 1}{g_k}
= \sum_{j \in \text{fit}_{k-1}}
w_{k-1,j}\;\cdot\;
\color{#E6550D}{d_{k-1,j}}
\;\cdot\;
\frac{d\mu}{d\eta}\bigg|_{\eta_{k,j}^*}
\;\cdot\; X_{k,j}^*
$$
The $w_{k-1,j}$ factor comes from $A_{k-1,k} =
E[\partial s_{k-1,j}/\partial \beta_k]$, and $s_{k-1,j}$ inherits
$w_{k-1,j}$ because the $(k-1)$-th pseudo-outcome model is a weighted
fit. In the unweighted case $w_{k-1,j} \equiv 1$ and the factor drops
out; dropping it under non-uniform external weights (e.g. IPCW,
survey weights) silently underestimates the sandwich variance by a
factor that scales with weight heterogeneity.
Compare this to $g_0$: the uniform weight $\frac{1}{n_t}$ over the target
population is replaced by $w_{k-1,j} \cdot \color{#E6550D}{d_{k-1,j}}$ over
the previous model's fitting set. Everything else — solve and distribute —
is identical:
$$
h_k = (X_k^T W_k X_k)^{-1} g_k,
\qquad
\color{#E6550D}{d_{k,i}} = h_k^T X_{k,i}
$$
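For the recursive gradient, a useful sanity check is numerical differentiation: perturb $\hat\beta_1$, rebuild the pseudo-outcomes, refit model 0, and recompute $\hat\mu$. The sketch below does this for an unweighted two-model linear chain (identity links, so $d\mu/d\eta = 1$ and $w_{k-1,j} \equiv 1$); the data and model formulas are illustrative only.

```r
# Check the recursive gradient g_1 = sum_j d_{0,j} * X*_{1,j} against
# central finite differences of mu_hat as a function of beta_1.
# Two-model linear chain, unweighted, identity links (toy data).
set.seed(3)
n  <- 200
L0 <- rnorm(n); A0 <- rbinom(n, 1, 0.5)
L1 <- 0.5 * L0 + 0.3 * A0 + rnorm(n)
A1 <- rbinom(n, 1, 0.5)
Y  <- 1 + A0 + 0.5 * A1 + 0.4 * L0 + 0.3 * L1 + rnorm(n)

fit1    <- lm(Y ~ A1 + L1)                      # model 1 (final time)
X1_star <- model.matrix(~ A1 + L1, data.frame(A1 = 1, L1 = L1))

# mu_hat as a function of model 1's coefficients: rebuild the
# pseudo-outcomes, refit model 0, average its predictions under A0 = 1
mu_hat <- function(beta1) {
  pseudo <- drop(X1_star %*% beta1)
  fit0   <- lm(pseudo ~ A0 + L0)
  mean(predict(fit0, data.frame(A0 = 1, L0 = L0)))
}

# Recursion of Section 5.4: gradient / solve / distribute at model 0,
# then weight model 1's rows by the sensitivities d_{0,j}
X0   <- model.matrix(~ A0 + L0)
X0_s <- model.matrix(~ A0 + L0, data.frame(A0 = 1, L0 = L0))
g0   <- colMeans(X0_s)
d0   <- drop(X0 %*% solve(crossprod(X0), g0))
g1   <- drop(crossprod(X1_star, d0))

eps   <- 1e-6
g1_fd <- sapply(seq_along(coef(fit1)), function(m) {
  e <- replace(numeric(length(coef(fit1))), m, eps)
  (mu_hat(coef(fit1) + e) - mu_hat(coef(fit1) - e)) / (2 * eps)
})
all.equal(g1, g1_fd, check.attributes = FALSE, tolerance = 1e-6)
```

Because $\hat\mu$ is linear in $\beta_1$ here, the central difference is exact up to floating-point error, so any discrepancy would point to a mistake in the recursion.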
### 5.5 Scaling: the factor of $n$
Each model correction involves $\color{#3182BD}{A_k^{-1}}$. Since
$\color{#3182BD}{A_k} = \frac{1}{n_k} X_k^T W_k X_k$, we have
$\color{#3182BD}{A_k^{-1}} = n_k(X_k^T W_k X_k)^{-1}$. So the correction at
step $k$ is:
$$
n_k \cdot \color{#E6550D}{d_{k,i}} \cdot r^{\text{score}}_{k,i}
$$
The factor $n_k$ ensures each IF contribution is $O(1)$ per individual, giving
the correct $\text{Var}(\hat\mu) = O(1/n)$.
::: {.callout-warning}
### A common bug
Omitting $n_k$ makes each correction $O(1/n)$, yielding $\text{Var} = O(1/n^3)$
— confidence intervals that are much too narrow.
:::
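The $O(1)$ claim is easy to see empirically. The toy single-model sketch below (linear outcome model, g-computation-style sensitivity; illustrative only) computes the average magnitude of the Channel 2 contribution $n \cdot d_i \cdot r_i$ at three sample sizes; with the $n$ factor it stabilizes around a constant rather than shrinking.

```r
# With the n factor, the per-individual Channel 2 contribution stays O(1)
# as the sample grows (single linear outcome model, toy data).
set.seed(5)
ch2_size <- sapply(c(200, 2000, 20000), function(n) {
  x <- rnorm(n); a <- rbinom(n, 1, 0.5)
  y <- 1 + a + x + rnorm(n)
  fit <- lm(y ~ a + x)
  X  <- model.matrix(fit)
  Xs <- model.matrix(~ a + x, data.frame(a = 1, x = x))
  d  <- drop(X %*% solve(crossprod(X), colMeans(Xs)))
  mean(abs(n * d * residuals(fit)))  # stable across n with the factor;
})                                   # drop the n and this shrinks like 1/n
ch2_size
```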
::: {.panel-tabset}
## Math
For a 2-period ICE with outcome model $m_1$ at $k=1$ and pseudo-outcome
model $m_0$ at $k=0$, the IF is
$$
\mathrm{IF}_i(\hat\mu)
= \big(\hat m_0(H_{0,i}^{(a)}) - \hat\mu\big)
+ n_1\,d_{1,i}\,r^{\text{score}}_{1,i}
+ n_0\,d_{0,i}\,r^{\text{score}}_{0,i},
$$
with the sensitivities $d_{k,i}$ propagated forward by the recursion of
Section 5.4. The $n_k$ factors are essential (Section 5.5). For the
linear-Gaussian DGP below, the true joint ATE under
$(a_0=1, a_1=1)$ vs $(a_0=0, a_1=0)$ is exactly $0.8 + 0.6 = 1.4$ by
construction, and we check that causatr's sandwich CI covers it.
## Code
```{r}
#| label: verify-ice
set.seed(7)
n_ice <- 500
id <- seq_len(n_ice)
L0 <- rnorm(n_ice)
A0 <- rbinom(n_ice, 1, plogis(0.3 * L0))
L1 <- 0.5 * L0 + 0.4 * A0 + rnorm(n_ice)
A1 <- rbinom(n_ice, 1, plogis(0.2 * L1 + 0.3 * A0))
Y_ice <- 1 + 0.8 * A0 + 0.6 * A1 + 0.5 * L0 + 0.4 * L1 + rnorm(n_ice)
long <- rbind(
data.frame(id = id, time = 0L, A = A0, L = L0, Y = NA_real_),
data.frame(id = id, time = 1L, A = A1, L = L1, Y = Y_ice)
)
fit_ice <- causat(
long, outcome = "Y", treatment = "A",
confounders = ~ 1,
confounders_tv = ~ L,
id = "id", time = "time"
)
res_ice <- contrast(
fit_ice,
interventions = list(a11 = static(1), a00 = static(0)),
reference = "a00",
ci_method = "sandwich"
)
ate_ice <- res_ice$contrasts$estimate
se_ice <- res_ice$contrasts$se
ci_lo <- res_ice$contrasts$ci_lower
ci_hi <- res_ice$contrasts$ci_upper
truth <- 0.8 + 0.6 # structural sum of the two causal coefficients
covered <- (truth >= ci_lo) && (truth <= ci_hi)
```
## Result
```{r}
#| label: verify-ice-result
#| echo: false
out <- data.frame(
quantity = c("ICE ATE (causatr sandwich)",
"$\\mathrm{SE}$",
"95% CI lower",
"95% CI upper",
"Truth ($0.8 + 0.6$)",
"Truth $\\in$ 95% CI?"),
value = c(ate_ice, se_ice, ci_lo, ci_hi, truth,
ifelse(covered, "yes", "no"))
)
tt(out, caption = "ICE chained IF on a 2-period linear-Gaussian DGP") |>
format_tt(j = "quantity", markdown = TRUE) |>
format_tt(j = 2, digits = 4)
```
The sandwich CI produced by `variance_if_ice()` covers the structural
truth of $1.4$, and the chained IF machinery is doing what Section 5.4
says it should.
:::
### 5.6 Pseudo-code: IF for a chained estimator
```{r}
#| eval: false
#| code-fold: true
#| code-summary: "Pseudo-code: chained_influence()"
chained_influence <- function(models, data_iv, target) {
n <- number_of_individuals
K <- length(models) - 1
mu_hat <- mean(pseudo_final[target])
n_target <- sum(target)
# Channel 1 with target-subpopulation scaling (see Sec. 3.6).
IF <- rep(0, n)
IF[target] <- (n / n_target) * (pseudo_final[target] - mu_hat)
d <- rep(0, n) # sensitivities from previous step
for (step in 0:K) {
model_k <- models[[step + 1]]
family_k <- model_k$family
X_k <- model.matrix(model_k) # fitting design matrix
eta_k <- model_k$linear.predictors
mu_eta_k <- family_k$mu.eta(eta_k) # dmu/deta at fitted values
V_k <- family_k$variance(fitted(model_k))
W_k <- mu_eta_k^2 / V_k # IWLS working weights
fit_ids_k <- model_k$fitting_ids
n_k <- length(fit_ids_k)
# Score residual = response residual * (dmu/deta) / V(mu). For
# canonical links this factor is 1; for non-canonical (probit,
# cloglog, Gamma-log, ...) it is not.
r_response_k <- residuals(model_k, type = "response")
r_score_k <- r_response_k * mu_eta_k / V_k
# Intervention design matrix and link derivatives (dmu/deta)
X_star_k <- model.matrix(model_k, newdata = intervention_data[[step + 1]])
eta_star <- X_star_k %*% coef(model_k)
mu_eta_star <- family_k$mu.eta(eta_star)
# Step 1: Gradient g_k = d mu_hat / d beta_k. The step > 0 branch
# carries w_{k-1} from the previous-step fit's prior weights.
if (step == 0) {
g_k <- crossprod(X_star_k[target, ], mu_eta_star[target]) / n_target
} else {
prev_ids <- models[[step]]$fitting_ids
w_prev <- models[[step]]$prior.weights # 1-vector if unweighted
g_k <- crossprod(
X_star_k[prev_ids, ],
w_prev * d[prev_ids] * mu_eta_star[prev_ids]
)
}
# Step 2: Solve h_k = (X'WX)^{-1} g_k
h_k <- solve(crossprod(X_k, X_k * W_k)) %*% g_k
# Step 3: Distribute d_{k,i} = h_k' X_{k,i}
d <- rep(0, n)
d[fit_ids_k] <- X_k %*% h_k
# Channel 2 for model k: n_k * d_{k,i} * r_score_{k,i}
r_vec <- rep(0, n)
r_vec[fit_ids_k] <- r_score_k
IF <- IF + n_k * d * r_vec
}
IF
}
```
### 5.7 From individual IFs to the variance
Once we have the per-individual $\text{IF}_i$ from the recursion above, variance
estimation follows from the same formula as always (@eq-var-if):
$$
\widehat{\text{Var}}(\hat\mu)
= \frac{1}{n^2}\sum_{i=1}^n \text{IF}_i^2
$$
For two interventions $a$ and $b$ on the same data:
$$
\widehat{\text{Cov}}(\hat\mu_a, \hat\mu_b)
= \frac{1}{n^2}\sum_{i=1}^n \text{IF}_i^{(a)} \cdot \text{IF}_i^{(b)}
$$
Collecting these entries over all interventions gives the full covariance
matrix used for contrasts.
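As a quick illustration of the identity behind contrast variances — with toy IF vectors, not causatr output — the variance computed from the IF of the difference matches the quadratic form built from the per-intervention variances and covariance:

```r
# Identity check: Var(mu_a - mu_b) from the IF of the difference equals
# Var_a + Var_b - 2 Cov_ab from the per-intervention IFs (toy IF vectors).
set.seed(4)
n    <- 400
if_a <- rnorm(n, sd = 2)
if_b <- 0.6 * if_a + rnorm(n)          # correlated: same individuals

var_a  <- sum(if_a^2) / n^2
var_b  <- sum(if_b^2) / n^2
cov_ab <- sum(if_a * if_b) / n^2

var_diff_if <- sum((if_a - if_b)^2) / n^2
var_diff_qf <- var_a + var_b - 2 * cov_ab
all.equal(var_diff_if, var_diff_qf)    # TRUE by expanding the square
```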
## 6. Summary
1. **One method**: the influence function. For any causal estimand $\hat\mu$,
write down the complete estimating equations (nuisance models + target
estimand), compute each individual's $\text{IF}_i$, and take
$\widehat{\text{Var}}(\hat\mu) = \frac{1}{n^2}\sum_i \text{IF}_i^2$.
2. **The IF is the sandwich**: the IF formula is the sandwich
$\color{#3182BD}{A^{-1}} \color{#E6550D}{B} \color{#3182BD}{(A^{-1})^T}$
applied to the stacked estimating equations, with the block-triangular
bread inverted by back-substitution. The bread, meat, and scores are all
present — the IF just computes $A^{-1}\psi_i$ per individual instead of
forming the full matrices.
3. **Two channels**: the IF always decomposes into a **direct term**
(Channel 1: covariate sampling) and **model corrections** (Channel 2:
one per nuisance model). What varies across methods is only which models
contribute to Channel 2.
4. **G-computation**: one model correction (outcome model). Channel 1 nonzero.
5. **IPW**: two model corrections (MSM + propensity). The propensity
correction captures weight estimation uncertainty via the cross-derivative
$A_{\beta\alpha}$.
6. **Matching**: one model correction (outcome model), same as g-comp.
Aggregation is cluster-robust (sum IFs within matched subclasses).
7. **Chained estimators** (ICE): $K\!+\!1$ model corrections computed by
backward recursion of sensitivities.
8. **The delta method** ($\color{#31A354}{J} V_\beta
\color{#31A354}{J^T}$) is a shortcut that captures Channel 2 only. It is
exact when Channel 1 = 0 (constant predictions), but underestimates
otherwise. The full IF avoids this limitation.
## References
- Bang H, Robins JM (2005). Doubly robust estimation in missing data and
causal inference models. *Biometrics* 61:962–973.
- Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA (1986). *Robust
Statistics: The Approach Based on Influence Functions*. Wiley.
- Hirano K, Imbens GW, Ridder G (2003). Efficient estimation of average
treatment effects using the estimated propensity score.
*Econometrica* 71:1161–1189.
- Hernán MA, Robins JM (2025). *Causal Inference: What If*.
Chapman & Hall/CRC.
- Huber PJ (1967). The behavior of maximum likelihood estimates under
nonstandard conditions. *Proceedings of the Fifth Berkeley Symposium*
1:221–233.
- Huber PJ (1981). *Robust Statistics*. Wiley.
- Lunceford JK, Davidian M (2004). Stratification and weighting via the
propensity score in estimation of causal treatment effects: a comparative
study. *Statistics in Medicine* 23:2937–2960.
- Robins JM (1986). A new approach to causal inference in mortality studies
with a sustained exposure period. *Mathematical Modelling* 7:1393–1512.
- Stefanski LA, Boos DD (2002). The calculus of M-estimation.
*The American Statistician* 56:29–38.
- Tsiatis AA (2006). *Semiparametric Theory and Missing Data*. Springer.
- van der Laan MJ, Robins JM (2003). *Unified Methods for Censored
Longitudinal Data and Causality*. Springer.
- White H (1980). A heteroskedasticity-consistent covariance matrix estimator
and a direct test for heteroskedasticity. *Econometrica* 48:817–838.
- White H (1982). Maximum likelihood estimation of misspecified models.
*Econometrica* 50:1–25.
- Zeileis A (2004). Econometric computing with HC and HAC covariance matrix
estimators. *Journal of Statistical Software* 11(10):1–17.
- Zeileis A (2006). Object-oriented computation of sandwich estimators.
*Journal of Statistical Software* 16(9):1–16.
- Zivich PN, Klose M, Cole SR, Edwards JK, Shook-Sa BE (2022).
Delicatessen: a Python package for the generalized computation of the
sandwich estimator. *arXiv preprint* arXiv:2203.11300.
- Zivich PN, Ross RK, Shook-Sa BE, Cole SR, Edwards JK (2024). Empirical
sandwich variance estimator for iterated conditional expectation
g-computation. *Statistics in Medicine* 43:5562–5572.