Handling missing data

Missing data is common in observational studies. causatr handles it differently depending on which variable is missing (outcome, treatment, or covariates) and which estimator is used. This vignette demonstrates the behaviour for each case and shows how to verify that your analysis is robust.

Setup

Code
library(causatr)
library(tinytable)
library(tinyplot)

We simulate a baseline dataset with known truth, then introduce missingness.

Data-generating mechanism. Two independent confounders \(L_1\) and \(L_2\) influence both a binary treatment \(A\) and a continuous outcome \(Y\). The treatment probability follows a logistic model in \(L_1\) and \(L_2\), and the outcome is linear in \(A\), \(L_1\), and \(L_2\) with additive Gaussian noise.

\[ \begin{aligned} L_1 &\sim \mathcal{N}(0, 1) \\ L_2 &\sim \mathcal{N}(0, 1) \\ A &\sim \text{Bernoulli}\bigl(\text{logit}^{-1}(0.3\,L_1 + 0.2\,L_2)\bigr) \\ Y &= 2 + 3\,A + 1.5\,L_1 + 0.8\,L_2 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 1) \end{aligned} \]

The treatment coefficient is 3 with no effect modification, so the true average treatment effect is \(\text{ATE} = E[Y^{a=1}] - E[Y^{a=0}] = 3\).

Code
set.seed(42)
n <- 2000
L1 <- rnorm(n)
L2 <- rnorm(n)
A <- rbinom(n, 1, plogis(0.3 * L1 + 0.2 * L2))
Y <- 2 + 3 * A + 1.5 * L1 + 0.8 * L2 + rnorm(n)
df <- data.frame(Y = Y, A = A, L1 = L1, L2 = L2)

Missing outcomes (MCAR)

When outcome values are missing completely at random (MCAR), the outcome model is fit on complete cases. contrast() predicts on all rows, then excludes rows with NA predictions from the target-population average. Under MCAR this is unbiased.

Code
df_y_miss <- df
df_y_miss$Y[sample(n, round(0.15 * n))] <- NA

fit_y <- causat(
  df_y_miss,
  outcome = "Y",
  treatment = "A",
  confounders = ~ L1 + L2
)

res_y <- contrast(
  fit_y,
  interventions = list(a1 = static(1), a0 = static(0)),
  reference = "a0",
  type = "difference",
  ci_method = "sandwich"
)
res_y

Estimator: gcomp  ·  Estimand: ATE  ·  Contrast: difference  ·  CI method: sandwich  ·  N: 2000

Intervention means
intervention estimate se ci_lower ci_upper
a1 4.966 0.052 4.864 5.067
a0 1.955 0.051 1.854 2.056
Contrasts
comparison estimate se ci_lower ci_upper
a1 vs a0 3.011 0.049 2.914 3.107

The estimate remains close to 3 despite 15% missing outcomes.

Comparison: complete data vs missing-outcome data

Code
fit_complete <- causat(df, outcome = "Y", treatment = "A", confounders = ~ L1 + L2)
res_complete <- contrast(
  fit_complete,
  interventions = list(a1 = static(1), a0 = static(0)),
  reference = "a0",
  type = "difference",
  ci_method = "sandwich"
)

comp_df <- data.frame(
  Analysis = c("Complete data", "15% outcome MCAR"),
  Estimate = c(res_complete$contrasts$estimate[1], res_y$contrasts$estimate[1]),
  SE = c(res_complete$contrasts$se[1], res_y$contrasts$se[1]),
  CI_lower = c(res_complete$contrasts$ci_lower[1], res_y$contrasts$ci_lower[1]),
  CI_upper = c(res_complete$contrasts$ci_upper[1], res_y$contrasts$ci_upper[1])
)
tt(comp_df, digits = 3)
Analysis Estimate SE CI_lower CI_upper
Complete data 2.99 0.0456 2.91 3.08
15% outcome MCAR 3.01 0.0492 2.91 3.11

The SE increases slightly with missing data (less information), but the point estimate stays on target.

Missing outcomes under IPW

IPW also handles missing outcomes. The treatment model is fit on the outcome-filtered subset (rows where Y is observed), and density-ratio weights are computed on that same subset.

Code
fit_y_ipw <- causat(
  df_y_miss,
  outcome = "Y",
  treatment = "A",
  confounders = ~ L1 + L2,
  estimator = "ipw"
)

res_y_ipw <- contrast(
  fit_y_ipw,
  interventions = list(a1 = static(1), a0 = static(0)),
  reference = "a0",
  type = "difference",
  ci_method = "sandwich"
)
res_y_ipw

Estimator: ipw  ·  Estimand: ATE  ·  Contrast: difference  ·  CI method: sandwich  ·  N: 2000

Intervention means
intervention estimate se ci_lower ci_upper
a1 4.989 0.055 4.881 5.097
a0 1.979 0.055 1.871 2.086
Contrasts
comparison estimate se ci_lower ci_upper
a1 vs a0 3.01 0.049 2.913 3.107

Missing outcomes under matching

Matching handles missing outcomes through MatchIt’s internal mechanisms. Rows with missing Y are excluded from the matched sample.

Code
fit_y_match <- causat(
  df_y_miss,
  outcome = "Y",
  treatment = "A",
  confounders = ~ L1 + L2,
  estimator = "matching",
  estimand = "ATT"
)

res_y_match <- contrast(
  fit_y_match,
  interventions = list(a1 = static(1), a0 = static(0)),
  reference = "a0",
  type = "difference",
  ci_method = "sandwich"
)
res_y_match

Estimator: matching  ·  Estimand: ATT  ·  Contrast: difference  ·  CI method: sandwich  ·  N: 998

Intervention means
intervention estimate se ci_lower ci_upper
a1 5.222 0.066 5.093 5.351
a0 1.753 0.068 1.621 1.885
Contrasts
comparison estimate se ci_lower ci_upper
a1 vs a0 3.469 0.055 3.362 3.577

Three-method comparison under outcome missingness

Code
methods_df <- data.frame(
  Method = c("G-comp", "IPW", "Matching (ATT)"),
  Estimate = c(
    res_y$contrasts$estimate[1],
    res_y_ipw$contrasts$estimate[1],
    res_y_match$contrasts$estimate[1]
  ),
  CI_lower = c(
    res_y$contrasts$ci_lower[1],
    res_y_ipw$contrasts$ci_lower[1],
    res_y_match$contrasts$ci_lower[1]
  ),
  CI_upper = c(
    res_y$contrasts$ci_upper[1],
    res_y_ipw$contrasts$ci_upper[1],
    res_y_match$contrasts$ci_upper[1]
  )
)

tinyplot(
  Estimate ~ Method,
  data = methods_df,
  type = "pointrange",
  ymin = methods_df$CI_lower,
  ymax = methods_df$CI_upper,
  ylab = "ATE estimate",
  main = "Three estimators under 15% outcome MCAR"
)
abline(h = 3, lty = 2, col = "grey40")

Forest plot comparing g-comp, IPW, and matching estimates under 15% missing outcomes.

All three estimators recover the truth (dashed line at 3).

Missing covariates

When covariates have missing values, glm(na.action = na.omit) drops those rows from the model fit. contrast() returns NA predictions for rows with missing covariates and excludes them from the target population.

Code
df_l_miss <- df
df_l_miss$L1[sample(n, round(0.10 * n))] <- NA

fit_l <- causat(
  df_l_miss,
  outcome = "Y",
  treatment = "A",
  confounders = ~ L1 + L2
)

res_l <- contrast(
  fit_l,
  interventions = list(a1 = static(1), a0 = static(0)),
  reference = "a0",
  type = "difference",
  ci_method = "sandwich"
)
#> 200 row(s) with NA predictions excluded from the target population.
res_l

Estimator: gcomp  ·  Estimand: ATE  ·  Contrast: difference  ·  CI method: sandwich  ·  N: 1800

Intervention means
intervention estimate se ci_lower ci_upper
a1 4.995 0.052 4.893 5.096
a0 2.006 0.052 1.903 2.108
Contrasts
comparison estimate se ci_lower ci_upper
a1 vs a0 2.989 0.048 2.895 3.083

Under MCAR covariate missingness, the estimate is unbiased. Under MAR, multiple imputation is needed (planned via causat_mice()).

Both outcome and covariate missing

causatr gracefully handles the case where both Y and L have missing values (possibly on different rows):

Code
df_both <- df
df_both$Y[sample(n, round(0.10 * n))] <- NA
df_both$L1[sample(n, round(0.10 * n))] <- NA

fit_both <- causat(
  df_both,
  outcome = "Y",
  treatment = "A",
  confounders = ~ L1 + L2
)

res_both <- contrast(
  fit_both,
  interventions = list(a1 = static(1), a0 = static(0)),
  reference = "a0",
  type = "difference",
  ci_method = "sandwich"
)
#> 200 row(s) with NA predictions excluded from the target population.
res_both

Estimator: gcomp  ·  Estimand: ATE  ·  Contrast: difference  ·  CI method: sandwich  ·  N: 1800

Intervention means
intervention estimate se ci_lower ci_upper
a1 4.95 0.053 4.846 5.055
a0 1.969 0.054 1.863 2.074
Contrasts
comparison estimate se ci_lower ci_upper
a1 vs a0 2.981 0.051 2.882 3.081

Missing treatment values

Treatment NAs are not allowed — you cannot define a counterfactual without knowing the observed treatment value. causatr aborts with an informative error:

Code
df_a_miss <- df
df_a_miss$A[1:10] <- NA

causat(
  df_a_miss,
  outcome = "Y",
  treatment = "A",
  confounders = ~ L1 + L2
)
#> Error in `causat()`:
#> ! Treatment variable 'A' has 10 missing values.
#> ℹ Use `censoring = '...'` for inverse probability of censoring weights.
#> ℹ Or remove incomplete cases before calling `causat()`.

If treatment is MCAR, drop the rows. If MAR, use multiple imputation before calling causat().

Missing outcomes with continuous treatment

The same principles apply to continuous treatments. Here we show IPW with a shift() intervention under outcome missingness.

Data-generating mechanism. A single confounder \(L\) affects both a continuous treatment \(A\) and a continuous outcome \(Y\). The treatment is Gaussian with its mean depending on \(L\), and the outcome is linear in \(A\) and \(L\).

\[ \begin{aligned} L &\sim \mathcal{N}(0, 1) \\ A &\sim \mathcal{N}(0.5\,L,\; 1) \\ Y &= 1 + 2\,A + L + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 1) \end{aligned} \]

Under a unit additive shift intervention \(A^* = A + 1\), every individual’s treatment increases by 1 and the outcome increases by \(2 \times 1 = 2\). The true causal contrast is \(E[Y(A+1)] - E[Y(A)] = 2\).

Code
set.seed(43)
L <- rnorm(n)
A_cont <- rnorm(n, mean = 0.5 * L)
Y_cont <- 1 + 2 * A_cont + L + rnorm(n)
df_cont <- data.frame(Y = Y_cont, A = A_cont, L = L)
df_cont$Y[sample(n, round(0.10 * n))] <- NA

fit_cont_ipw <- causat(
  df_cont,
  outcome = "Y",
  treatment = "A",
  confounders = ~ L,
  estimator = "ipw"
)

res_cont_ipw <- contrast(
  fit_cont_ipw,
  interventions = list(up1 = shift(1), observed = NULL),
  reference = "observed",
  type = "difference",
  ci_method = "sandwich"
)
res_cont_ipw

Estimator: ipw  ·  Estimand: ATE  ·  Contrast: difference  ·  CI method: sandwich  ·  N: 2000

Intervention means
intervention estimate se ci_lower ci_upper
up1 3.043 0.135 2.779 3.307
observed 1.029 0.071 0.889 1.169
Contrasts
comparison estimate se ci_lower ci_upper
up1 vs observed 2.014 0.114 1.791 2.237

Structural truth: E[Y(A+1)] - E[Y(A)] = 2.

Missing data with categorical treatment

G-comp handles missing outcomes for categorical treatments.

Data-generating mechanism. A three-arm categorical treatment \(A \in \{A, B, C\}\) is drawn independently of the confounder \(L\), and the outcome depends on both the treatment arm and \(L\).

\[ \begin{aligned} A &\sim \text{Categorical}(0.3,\; 0.4,\; 0.3) \\ L &\sim \mathcal{N}(0, 1) \\ Y &= 1 + \mathbf{1}(A = B) \cdot 1 + \mathbf{1}(A = C) \cdot 2 + 0.5\,L + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 1) \end{aligned} \]

The true contrasts relative to arm A are: \(E[Y^{B}] - E[Y^{A}] = 1\) and \(E[Y^{C}] - E[Y^{A}] = 2\).

Code
set.seed(44)
A_cat <- factor(sample(c("A", "B", "C"), n, replace = TRUE, prob = c(0.3, 0.4, 0.3)))
L <- rnorm(n)
Y_cat <- 1 + ifelse(A_cat == "B", 1, ifelse(A_cat == "C", 2, 0)) + 0.5 * L + rnorm(n)
df_cat <- data.frame(Y = Y_cat, A = A_cat, L = L)
df_cat$Y[sample(n, round(0.10 * n))] <- NA

fit_cat <- causat(
  df_cat,
  outcome = "Y",
  treatment = "A",
  confounders = ~ L
)

res_cat <- contrast(
  fit_cat,
  interventions = list(arm_a = static("A"), arm_b = static("B"), arm_c = static("C")),
  reference = "arm_a",
  type = "difference",
  ci_method = "sandwich"
)
res_cat

Estimator: gcomp  ·  Estimand: ATE  ·  Contrast: difference  ·  CI method: sandwich  ·  N: 2000

Intervention means
intervention estimate se ci_lower ci_upper
arm_a 0.999 0.046 0.909 1.089
arm_b 1.976 0.039 1.899 2.054
arm_c 3.005 0.042 2.923 3.088
Contrasts
comparison estimate se ci_lower ci_upper
arm_b vs arm_a 0.977 0.059 0.863 1.092
arm_c vs arm_a 2.006 0.06 1.888 2.125

Missing outcomes in longitudinal data (ICE)

ICE g-computation handles missing final-period outcomes. The backward recursion excludes NA rows when fitting each sequential outcome model.

Data-generating mechanism. A two-period panel (\(K = 2\)) with \(n = 500\) individuals. Each individual has a baseline confounder \(L_0\), a time-varying confounder \(L_k\), and a binary treatment \(A_k\) at each period. The outcome \(Y\) is observed only at the final period and depends on the cumulative treatment received across both periods.

\[ \begin{aligned} L_0 &\sim \mathcal{N}(0, 1) \\ L_k &= L_0 + \eta_k, \quad \eta_k \sim \mathcal{N}(0, 0.25) \\ A_k &\sim \text{Bernoulli}\bigl(\text{logit}^{-1}(0.3\,L_k)\bigr), \quad k = 1, 2 \\ Y &= 2 + 1.5\,(A_1 + A_2) + 0.5\,L_0 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 1) \end{aligned} \]

Under the “always treat” regime (\(A_1 = A_2 = 1\)) vs “never treat” (\(A_1 = A_2 = 0\)), the true ATE is \(1.5 \times 2 = 3\).

Code
set.seed(45)
n_id <- 500
ids <- rep(seq_len(n_id), each = 2)
times <- rep(1:2, n_id)
L0 <- rep(rnorm(n_id), each = 2)
L_tv <- L0 + rnorm(n_id * 2, 0, 0.5)
A_tv <- rbinom(n_id * 2, 1, plogis(0.3 * L_tv))

Y_long <- NA_real_
for (i in seq_len(n_id)) {
  rows <- which(ids == i)
  Y_long[rows[2]] <- 2 + 1.5 * sum(A_tv[rows]) + 0.5 * L0[rows[1]] + rnorm(1)
}

long_df <- data.frame(id = ids, time = times, Y = Y_long, A = A_tv, L0 = L0, L = L_tv)
long_df$Y[sample(which(!is.na(long_df$Y)), round(0.10 * n_id))] <- NA

fit_ice_miss <- causat(
  long_df,
  outcome = "Y",
  treatment = "A",
  confounders = ~ L0,
  confounders_tv = ~ L,
  id = "id",
  time = "time"
)

res_ice_miss <- contrast(
  fit_ice_miss,
  interventions = list(always = static(1), never = static(0)),
  reference = "never",
  type = "difference",
  ci_method = "sandwich"
)
res_ice_miss

Estimator: gcomp  ·  Estimand: ATE  ·  Contrast: difference  ·  CI method: sandwich  ·  N: 500

Intervention means
intervention estimate se ci_lower ci_upper
always 4.991 0.085 4.825 5.158
never 1.905 0.089 1.731 2.079
Contrasts
comparison estimate se ci_lower ci_upper
always vs never 3.086 0.14 2.812 3.361

Structural truth: ATE = 1.5 * 2 = 3 (two periods of binary treatment).

Bootstrap with missing data

Bootstrap inference correctly handles missing data — each replicate resamples individuals (with replacement) and refits the full pipeline:

Code
res_boot_miss <- contrast(
  fit_y,
  interventions = list(a1 = static(1), a0 = static(0)),
  reference = "a0",
  type = "difference",
  ci_method = "bootstrap",
  n_boot = 50L
)
res_boot_miss

Estimator: gcomp  ·  Estimand: ATE  ·  Contrast: difference  ·  CI method: bootstrap  ·  N: 2000

Intervention means
intervention estimate se ci_lower ci_upper
a1 4.966 0.051 4.865 5.066
a0 1.955 0.054 1.848 2.061
Contrasts
comparison estimate se ci_lower ci_upper
a1 vs a0 3.011 0.04 2.932 3.09

Sandwich vs bootstrap with missing data

Code
boot_comp <- data.frame(
  Inference = c("Sandwich", "Bootstrap"),
  Estimate = c(res_y$contrasts$estimate[1], res_boot_miss$contrasts$estimate[1]),
  SE = c(res_y$contrasts$se[1], res_boot_miss$contrasts$se[1]),
  CI_lower = c(res_y$contrasts$ci_lower[1], res_boot_miss$contrasts$ci_lower[1]),
  CI_upper = c(res_y$contrasts$ci_upper[1], res_boot_miss$contrasts$ci_upper[1])
)
tt(boot_comp, digits = 3)
Inference Estimate SE CI_lower CI_upper
Sandwich 3.01 0.0492 2.91 3.11
Bootstrap 3.01 0.0402 2.93 3.09

Manual IPCW weights

When outcomes are missing at random (MAR) conditional on treatment and covariates, you can supply inverse-probability-of-censoring weights (IPCW) via the weights argument. Built-in IPCW is planned for Phase 14; for now, compute the weights externally.

Censoring mechanism. Starting from the baseline DGM above (ATE \(= 3\)), we introduce MAR outcome censoring. The probability of being censored (outcome set to NA) depends on both the treatment \(A\) and confounder \(L_1\):

\[ \begin{aligned} C_i &\sim \text{Bernoulli}\bigl(\text{logit}^{-1}(-2 + 0.5\,A_i + 0.3\,L_{1,i})\bigr) \\ Y_i^{\text{obs}} &= \begin{cases} Y_i & \text{if } C_i = 0 \\ \text{NA} & \text{if } C_i = 1 \end{cases} \end{aligned} \]

The intercept of \(-2\) keeps the censoring rate moderate. Because censoring depends on \(A\) and \(L_1\), a complete-case analysis is biased; IPCW weights \(w_i = 1 / P(C_i = 0 \mid A_i, L_{1,i}, L_{2,i})\) restore consistency. The true ATE remains 3.

Code
df_mar <- df
cens_prob <- plogis(-2 + 0.5 * df_mar$A + 0.3 * df_mar$L1)
df_mar$Y[runif(n) < cens_prob] <- NA

observed <- !is.na(df_mar$Y)
cens_model <- glm(observed ~ A + L1 + L2, data = df_mar, family = binomial)
ipcw <- ifelse(observed, 1 / predict(cens_model, type = "response"), 0)

fit_ipcw <- causat(
  df_mar[observed, ],
  outcome = "Y",
  treatment = "A",
  confounders = ~ L1 + L2,
  weights = ipcw[observed]
)

res_ipcw <- contrast(
  fit_ipcw,
  interventions = list(a1 = static(1), a0 = static(0)),
  reference = "a0",
  type = "difference",
  ci_method = "sandwich"
)
res_ipcw

Estimator: gcomp  ·  Estimand: ATE  ·  Contrast: difference  ·  CI method: sandwich  ·  N: 1690

Intervention means
intervention estimate se ci_lower ci_upper
a1 4.967 0.054 4.861 5.073
a0 1.978 0.055 1.871 2.085
Contrasts
comparison estimate se ci_lower ci_upper
a1 vs a0 2.988 0.05 2.89 3.086

Summary of missingness handling

Variable Mechanism Method Handling Status
Outcome (Y) MCAR G-comp Complete-case model, predict on all rows Supported
Outcome (Y) MCAR IPW Fit treatment model on observed-Y subset Supported
Outcome (Y) MCAR Matching MatchIt excludes NA-Y rows Supported
Outcome (Y) MAR G-comp + IPCW External IPCW weights via weights= Supported
Treatment (A) Any Any Rejected (abort with error) Rejected
Covariates (L) MCAR G-comp na.omit drops rows; predict returns NA Supported
Covariates (L) MAR Any Multiple imputation (causat_mice, planned) Planned
Y + L MCAR G-comp Combines both drop rules Supported

Key takeaways

  1. Missing outcomes (MCAR) are handled automatically by all estimators. The estimate is unbiased; the SE increases proportionally to the information loss.

  2. Missing treatment values are never allowed. Drop rows (MCAR) or impute (MAR) before calling causat().

  3. Missing covariates (MCAR) work out-of-the-box under g-comp via na.omit. For MAR missingness, multiple imputation is needed.

  4. MAR outcome missingness requires IPCW weights today (manual via weights =). Built-in IPCW is planned for Phase 14.

  5. Bootstrap handles all missingness patterns correctly by resampling the full pipeline.

References

Hernán MA, Robins JM (2025). Causal Inference: What If. Chapman & Hall/CRC. Chapter 12 (IP weighting under censoring) and Chapter 19 (censoring in longitudinal studies).

Little RJA, Rubin DB (2019). Statistical Analysis with Missing Data. 3rd ed. Wiley.