matchatr (development version)
2026-06-08 — Python cross-language oracles for the classical estimators
Adds an independent Python (statsmodels) cross-language oracle for every implemented classical estimator, so a bug shared between matchatr and its R engine cannot hide behind a same-package comparison. Each oracle reads the same committed dataset both languages share and compares against a committed Python result CSV; tests never invoke Python, so CI needs no Python toolchain (each is guarded with skip_if(!file.exists(...))). The fixtures and a regeneration recipe live in tests/testthat/fixtures/python/; the comparisons are in test-python-oracle.R.
- Covered: the unmatched logistic OR (vs Logit), the matched conditional OR and the nested case-control HR (vs ConditionalLogit), the Mantel–Haenszel summary OR with the Robins–Breslow–Greenland interval (vs StratifiedTable), the polytomous subtype ORs (vs MNLogit), and the homogeneity Wald χ² + GLS pooled common OR (hand-built from MNLogit’s coefficients and covariance). statsmodels anchors these maximum-likelihood estimators; delicatessen (M-estimation + sandwich) is reserved for the causal / sandwich estimands of the later case-control-weighting phases.
- Test-only change: no user-facing behaviour is affected.
2026-06-08 — Risk-set control sampling: sample_ncc() (PHASE_5 Chunk 2)
Adds the exported sample_ncc(), which generates a nested case-control dataset from a cohort by risk-set (incidence-density) control sampling. Each event anchors a matched set holding the case and m controls drawn without replacement from the subjects at risk at that failure time; the result is an analysis-ready data.table (the cohort columns plus set, the per-set case indicator, and risk_time) that feeds straight into matcha(design = nested_cc(strata = "set", time = "risk_time")). This closes the generate-then-analyse loop opened by Chunk 1.
- sample_ncc(cohort, time, event, m = 1, match = NULL, entry = NULL). match = ~ s1 + s2 confines each case’s controls to its own population stratum; entry = supplies a delayed-entry (left-truncation) column so a subject counts as at risk only after entering follow-up. Sampling uses the ambient random stream (wrap in withr::with_seed() / precede with set.seed()), and the input cohort is never mutated.
- Native sampler, Epi::ccwc as an oracle. Risk-set sampling is implemented natively (base R + data.table) so it is always available and deterministically seedable, consistent with the package’s pattern of delegating only to Imports-tier estimation engines and hand-rolling sampling / closed forms. Epi::ccwc is an external cross-check in the tests (every sampled control must lie in the eligible risk-set pool), not a runtime dependency.
- A case with no eligible control aborts with matchatr_empty_risk_set: in the generation path, producing such a set is a sampling failure (a misspecified time origin/scale, an entry/exit mismatch, or over-fine match strata), unlike an uninformative analysis stratum, which clogit merely drops with a warning. A late failure time with fewer than m eligible controls keeps all of them (a smaller set), which is correct, not an error.
- match strata are crossed with collision-proof interaction() codes, and a missing value in a match column is rejected up front (matchatr_bad_input): an undefined stratum must not silently merge with the literal string "NA" nor match anyone.
- Validated by structural invariants (one case per set, controls genuinely at risk, no within-set reuse), the Epi::ccwc risk-set-definition cross-check, a cohort DGP whose known Cox log-HR the sampled subsample recovers within 3.5 SE, in agreement with the full-cohort survival::coxph fit, and the classical (m+1)/m null efficiency. The test-only sample_ncc_riskset() fixture now delegates to sample_ncc(), so the Chunk 1 analysis tests also exercise the exported sampler.
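The risk-set sampling rule described in this entry can be sketched in base R. This is a minimal illustration, not the sample_ncc() implementation: the helper name sketch_riskset_sample() and the toy cohort are hypothetical, and it assumes no delayed entry and no match strata, with a subject counted as at risk at a failure time when its own follow-up time is at least that time.

```r
# Hypothetical helper: draw up to m controls per event from the subjects
# still at risk at that event's failure time (no delayed entry, no matching).
sketch_riskset_sample <- function(cohort, time, event, m = 1) {
  case_rows <- which(cohort[[event]] == 1)
  sets <- lapply(seq_along(case_rows), function(k) {
    i <- case_rows[k]
    t_fail <- cohort[[time]][i]
    # At risk at t_fail: still under follow-up, excluding the case itself.
    pool <- setdiff(which(cohort[[time]] >= t_fail), i)
    if (length(pool) == 0L) stop("empty risk set for the case in row ", i)
    # Index into the pool so a one-element pool is not expanded by sample().
    picked <- pool[sample.int(length(pool), min(m, length(pool)))]
    out <- cohort[c(i, picked), , drop = FALSE]
    out$set <- k
    out$case <- c(1L, rep(0L, length(picked)))
    out$risk_time <- t_fail
    out
  })
  do.call(rbind, sets)
}

set.seed(1)
cohort <- data.frame(id = 1:8,
                     time = c(2, 5, 3, 9, 7, 4, 8, 6),
                     event = c(1, 0, 1, 0, 1, 0, 0, 0))
ncc <- sketch_riskset_sample(cohort, "time", "event", m = 2)
```

Controls are drawn without replacement within a set, and a later case remains eligible as a control for an earlier failure time, which is what distinguishes incidence-density sampling from sampling among eventual non-cases.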
2026-06-08 — Nested case-control hazard ratio (PHASE_5 Chunk 1)
Opens the time-to-event sampling designs with the classical nested case-control (NCC) analysis. matcha(design = nested_cc(strata = ..., time = ...), estimator = "clogit") now fits the risk-set conditional partial likelihood and contrast() reports the exposure’s hazard ratio (type = "hr"). The NCC analysis reuses the matched-design clogit engine unchanged — a sampled risk set and a matched set are the same stratum construction (outcome ~ exposure + confounders + strata(set)) — so the value is the design-faithful estimand, not new estimation machinery.
- OR = HR exactly under risk-set (incidence-density) sampling (Prentice & Breslow 1978), with no rare-disease assumption. The conditional partial likelihood gives exp(beta); the sampling design fixes its meaning, so a matched case-control design reports the conditional odds ratio and a nested case-control design the hazard ratio. The shared conditional_or_result() / stratum_specific_or_result() assemblies gained a type argument that carries the scale label ("or" / "hr"); the arithmetic is identical.
- A new contrast scale type = "hr" (hazard ratio). contrast() defaults to it for the nested design (default_contrast_type() is now design-aware) and to "or" for the matched design. Each conditional design identifies exactly one scale: requesting an odds ratio from a risk-set design, or a hazard ratio from a matched design, names an estimand the design does not target and aborts with matchatr_unidentified_estimand. summary() and print() label the NCC table / contrast as a hazard ratio.
- The design’s time column records how controls were sampled; the risk-set membership is read from strata, so time does not enter the conditional likelihood here (it feeds the later inclusion-weight / weighted-Cox designs). Risk set reuse, and hence a cluster-robust variance, belongs to those designs, so ci_method = "sandwich" / "bootstrap" stay unsupported; a risk set with no control is an uninformative stratum (matchatr_uninformative_stratum, dropped by clogit), consistent with the matched design.
- Validated against a cohort DGP with a known Cox log-HR (make_ncc_cohort()) from which an incidence-density NCC sample is drawn (sample_ncc_riskset()): the conditional likelihood recovers the cohort β within 3.5 SE and agrees with the full-cohort survival::coxph β (the OR = HR check); exact pass-through against a hand-fit survival::clogit; and the relative efficiency at the null matches the classical (m+1)/m variance ratio (Goldstein & Langholz 1992). Non-binary outcomes are rejected (matchatr_bad_outcome) and a within-set constant exposure is matchatr_unestimable_exposure.
2026-06-08 — Homogeneity test + pooled common OR for disease subtypes (PHASE_4 Chunk 2)
Completes the multiple-case/control-group layer with test_homogeneity(), which answers the etiologic-heterogeneity question the polytomous design is for: does the exposure act the same way on every disease subtype? Given an unconstrained polytomous fit (estimator = "polytomous"), for each exposure term it tests the null that the exposure odds ratio is constant across the non-reference subtypes (H0: beta_1 = … = beta_M) and reports the efficient pooled (“common”) odds ratio that holds under homogeneity. A binary or continuous exposure yields one test; an unordered factor exposure one test per level. The new R/homogeneity.R (test_homogeneity(), homogeneity_one_term(), print / tidy methods for the matchatr_homogeneity class) carries it; new_matchatr_homogeneity() joins the constructors.
- The homogeneity test is the canonical Wald test of etiologic heterogeneity (Begg & Gray 1984; riskclustr::eh_test_subtype): W = (C b)' (C V C')^-1 (C b) ~ chi-squared(M - 1), where b is the stacked subtype log odds ratios and V their multinomial-information covariance (reused from multinom_exposure_or(), now also returning the structured subtype / predictor pieces). The common odds ratio exponentiates the minimum-variance (GLS / inverse-variance) restricted estimator of the log odds ratio, (1' V^-1 b)/(1' V^-1 1), asymptotically equal to the constrained maximum-likelihood fit. Both are computed on the already-fitted unconstrained model, with no constrained refit, so there is no new dependency and continuous confounders are handled directly (chosen over a Poisson-surrogate or VGAM likelihood-ratio test, which would need discrete covariate patterns or a heavy new dependency). It mirrors the C V C' contrast construction already used for stratum-specific odds ratios.
- A non-polytomous fit (logistic / mh / clogit) or a non-matchatr_fit object is matchatr_bad_input; a fit whose engine produced no model is matchatr_not_estimated; a malformed conf_level is matchatr_bad_input.
- Validated against three oracles: the saturated 3-group / binary-exposure case reproduces the closed-form 2x2 Woolf chi-squared and pooled odds ratio (independent of multinom’s vcov()); the chi-squared / pooled OR equal the exact C V C' / GLS functionals of contrast() on an adjusted (continuous-confounder) fit; a truth DGP gives correct size under equal subtype ORs and power under unequal ones, with the pooled SE smaller than each subtype SE (Begg & Gray efficiency); and the heterogeneity p-value cross-checks against riskclustr::eh_test_subtype (the mlogit engine), which joins Suggests.
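The two quantities in this entry, the Wald statistic W and the GLS pooled log odds ratio, are plain matrix functionals of b and V. The sketch below evaluates them in base R on made-up numbers (the values of b and V are illustrative, not package output) and uses a successive-difference contrast matrix as one valid choice of C; any full-rank C whose rows sum to zero gives the same W.

```r
# Illustrative inputs (not package output): log-ORs for M = 3 subtypes
# and their covariance matrix V.
b <- c(0.40, 0.55, 0.25)
V <- matrix(c(0.040, 0.010, 0.008,
              0.010, 0.050, 0.012,
              0.008, 0.012, 0.045), nrow = 3, byrow = TRUE)
M <- length(b)

# Successive-difference contrasts: rows (1, -1, 0) and (0, 1, -1).
C <- cbind(diag(M - 1), 0) - cbind(0, diag(M - 1))

# Wald statistic W = (C b)' (C V C')^-1 (C b), referred to chi-squared(M - 1).
Cb <- C %*% b
W <- drop(t(Cb) %*% solve(C %*% V %*% t(C), Cb))
p_value <- pchisq(W, df = M - 1, lower.tail = FALSE)

# GLS pooled log-OR (1' V^-1 b) / (1' V^-1 1) and its variance 1 / (1' V^-1 1).
Vinv_one <- solve(V, rep(1, M))
pooled_log_or <- sum(Vinv_one * b) / sum(Vinv_one)
pooled_var <- 1 / sum(Vinv_one)
pooled_or <- exp(pooled_log_or)
```

Because the pooled estimator is the minimum-variance linear combination, pooled_var cannot exceed any diagonal entry of V, which is the efficiency property the validation paragraph refers to.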
2026-06-03 — Polytomous logistic for multiple case / control groups (PHASE_4 Chunk 1)
Opens the multiple-case/control-group layer with unconstrained polytomous (multinomial) logistic regression. matcha(design = unmatched_cc(), estimator = "polytomous") fits a baseline-category multinomial model (outcome ~ exposure + confounders) via nnet::multinom for an outcome with three or more groups — multiple disease subtypes, or several control groups. The new reference = argument selects the baseline group (releveled to the front); each non-reference equation’s exposure coefficient is that subtype’s log odds ratio versus the reference, so contrast(type = "or") reports one OR per (subtype, exposure-coefficient) with an information-matrix Wald interval, and tidy() renders the full per-equation table with a y.level column. The new R/polytomous.R (fit_polytomous(), contrast_polytomous(), multinom_exposure_or(), tidy_multinom()) and resolve_polytomous_outcome() carry the engine; the dispatch grew an outcome_kind axis so matcha() picks the multi-group resolver instead of the binary one.
- A two-group, numeric, or logical outcome is matchatr_bad_outcome (routed back to the binary logistic / mh / clogit estimators); an out-of-range reference, a reference on a non-polytomous estimator, an ordered-factor exposure, and effect_modifier are each matchatr_bad_input; a constant exposure, or one collinear with a confounder, is matchatr_unestimable_exposure. Because nnet::multinom (unlike stats::glm) does not alias a rank-deficient column to NA but splits the coefficient across the dependent columns (a silently attenuated OR), reject_collinear_exposure() catches it by the design-matrix rank, mirroring glm’s NA-aliasing; collinearity confined to the confounders leaves the exposure estimable and is not rejected.
- polytomous on a matched design is matchatr_bad_estimator. RD / RR stay matchatr_unidentified_estimand and sandwich / bootstrap variances are matchatr_unsupported_variance (the engine reports the multinomial information matrix only).
- Validated against two oracles: the saturated 3-group / binary-exposure multinomial reproduces the closed-form 2×2 Woolf log-OR and its variance per subtype (independent of multinom’s own vcov()), and a truth-based cohort DGP recovers the known per-subtype exposure log-ORs within 3.5 SE, with nnet::multinom coefficient / variance equality pinning the adjusted, continuous, and factor-exposure fits. nnet is a recommended R package, so it joins Imports without adding an install-time dependency.
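A rank-based collinearity check of the kind this entry describes can be sketched as follows. This is one plausible reading, not the body of reject_collinear_exposure(): the helper exposure_is_collinear() is hypothetical, and it flags the exposure when removing its design-matrix columns leaves the rank unchanged.

```r
# Hypothetical check: the exposure is collinear when dropping its columns
# does not reduce the rank of the design matrix.
exposure_is_collinear <- function(formula, data, exposure) {
  X <- model.matrix(formula, data)
  term_labels <- attr(terms(formula), "term.labels")
  exposure_cols <- attr(X, "assign") == match(exposure, term_labels)
  qr(X)$rank == qr(X[, !exposure_cols, drop = FALSE])$rank
}

d <- data.frame(x = c(0, 1, 0, 1, 1, 0),
                z = c(1, 2, 3, 4, 5, 6))
d$z2 <- 2 * d$x + 1   # exactly collinear with the exposure
d$z3 <- 2 * d$z       # collinear with a confounder only

exposure_is_collinear(~ x + z, d, "x")        # FALSE
exposure_is_collinear(~ x + z + z2, d, "x")   # TRUE
exposure_is_collinear(~ x + z + z3, d, "x")   # FALSE: exposure still estimable
```

The third call mirrors the rule that collinearity confined to the confounders is not rejected.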
2026-06-03 — Effect modification in matched case-control (PHASE_3 Chunk 3)
Completes the matched case-control layer with stratum-specific odds ratios. Passing effect_modifier = "m" to matcha(estimator = "clogit") fits outcome ~ exposure * m + confounders + strata(set), and contrast(type = "or") reports the exposure’s conditional odds ratio within each modifier level — the exposure main coefficient beta_x at the modifier’s reference level and the linear combination beta_x + beta_{x:level} elsewhere — with Wald intervals from the joint partial-likelihood variance (so the per-level SE accounts for the main–interaction covariance). The new R/effect_modification.R (stratum_specific_or_result()) assembles the per-level ORs via a contrast matrix C V C'; the modifier may coincide with a matching variable, which is the canonical use (does the exposure effect differ across the matching factor?).
- Supported for a single-coefficient exposure (binary, continuous, or two-level factor) crossed with a categorical (logical / character / factor) modifier; a character / logical modifier is coerced to a factor. A 3+-level factor exposure is matchatr_unsupported_combination, a continuous (numeric) modifier or a non-clogit engine is matchatr_bad_input, a modifier coinciding with the outcome / exposure is matchatr_bad_input, and an aliased per-level interaction is matchatr_unestimable_exposure. RD / RR stay unidentified and sandwich / bootstrap intervals remain declined.
- Validated three ways: the within-level 1:1 McNemar closed form pins both the per-level point estimate (OR = n10/n01) and the linear-combination variance (Var(log OR) = 1/n10 + 1/n01) independently of survival::clogit; hand-built survival::clogit linear combinations check forwarding, the level labelling, and the interaction-column ordering for a three-level modifier with covariate adjustment; and a truth DGP with known per-level conditional log-ORs confirms recovery within a self-scaling SE band.
- M:1-with-covariates and variable-ratio matching needed no new code (the conditional likelihood handles any matched-set composition); their truth-based cells were already covered by Chunk 1’s clogit engine.
- The effect modifier is droplevels()’d before fitting: a modifier factor carrying an unused (zero-observation) level previously added an all-zero interaction column aliased to NA and aborted the whole stratum-specific OR as unestimable. Unused levels are now dropped (preserving the order of the remaining declared levels, so a user-set reference is kept).
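The per-level odds ratios in this entry are linear combinations of the main and interaction coefficients, with variances read off C V C'. A small base-R sketch on made-up numbers (beta and V are illustrative, not output of matcha() or survival::clogit) shows the arithmetic, followed by the droplevels() behaviour mentioned in the last bullet.

```r
# Illustrative coefficients (not package output): exposure main effect and
# its interactions with a three-level modifier whose reference level is "a".
beta <- c(x = 0.50, "x:mb" = 0.30, "x:mc" = -0.20)
V <- matrix(c( 0.040, -0.030, -0.030,
              -0.030,  0.070,  0.030,
              -0.030,  0.030,  0.090), nrow = 3, byrow = TRUE,
            dimnames = list(names(beta), names(beta)))

# One row per modifier level: beta_x, beta_x + beta_{x:b}, beta_x + beta_{x:c}.
C <- rbind(a = c(1, 0, 0),
           b = c(1, 1, 0),
           c = c(1, 0, 1))
log_or <- drop(C %*% beta)
se <- sqrt(diag(C %*% V %*% t(C)))   # includes the main-interaction covariance
z <- qnorm(0.975)
result <- data.frame(level = rownames(C),
                     or = exp(log_or),
                     lower = exp(log_or - z * se),
                     upper = exp(log_or + z * se))

# droplevels() removes the unused level but keeps the declared order.
m <- factor(c("med", "high", "med"), levels = c("absent", "med", "high"))
levels(droplevels(m))   # "med" "high"
```

For level b the variance is 0.040 + 0.070 + 2 × (−0.030) = 0.050, smaller than a naive sum of the two variances because the main and interaction estimates are negatively correlated here.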
2026-06-03 — McNemar 1:1 matched-pair odds ratio (PHASE_3 Chunk 2)
Adds the closed-form 1:1 estimator alongside the conditional logistic engine. matcha(design = matched_cc(strata = ...), estimator = "mcnemar") computes the matched-pair odds ratio directly from the discordant-pair counts — OR = n10/n01 with Var(log OR) = 1/n10 + 1/n01 (McNemar 1947; Breslow & Day 1980) — without fitting survival::clogit, and contrast(type = "or") reports it with a Wald interval. Pairs concordant on exposure carry no information and cancel.
- The estimator applies only to genuine 1:1 pairs: a matched set with more than one case or more than one control is M:1 (or richer) matching with no two-cell closed form and is rejected with matchatr_not_one_to_one, pointing to estimator = "clogit". A one-sided (or empty) set of discordant pairs gives a boundary OR of 0 / ∞ and aborts with matchatr_unestimable_exposure.
- The exposure must be binary (logical / two-level factor / numeric 0/1); a non-binary exposure is declined (matchatr_bad_input) toward clogit. RD / RR remain unidentified and ci_method = "sandwich" / "bootstrap" are declined (matchatr_unsupported_variance); a missing pair member drops to complete pairs with a matchatr_dropped_rows warning.
- Validated three ways: exact agreement with survival::clogit on the same 1:1 binary data (the conditional likelihood reduces to McNemar’s), the OR and variance against the hand-counted closed form (independent of clogit), and a matched-pair DGP with a known log-OR (CMLE recovers β within a self-scaling SE band). A dedicated test pins the OR²-bias invariant: the unconditional 1:1 MLE of the log odds ratio, fitted with a parameter per pair, is exactly twice the conditional estimate, so its odds ratio is the square of the conditional one.
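The closed form and the OR² invariant from this entry can both be reproduced in base R. The sketch below uses a hypothetical toy set of ten pairs; it counts the discordant pairs by hand and then fits the unconditional logistic model with one intercept per pair via stats::glm to show the doubling on the log scale. It is an illustration of the arithmetic, not the package's estimator code.

```r
# Toy 1:1 matched pairs (hypothetical data): exposure of the case and of
# its matched control, one row per pair.
pairs <- data.frame(case_exposed    = c(1, 1, 1, 0, 1, 0, 1, 1, 0, 1),
                    control_exposed = c(0, 0, 1, 1, 0, 0, 0, 1, 1, 0))

# Discordant-pair counts; concordant pairs carry no information.
n10 <- sum(pairs$case_exposed == 1 & pairs$control_exposed == 0)
n01 <- sum(pairs$case_exposed == 0 & pairs$control_exposed == 1)
if (n10 == 0 || n01 == 0) stop("boundary odds ratio: not estimable")

or <- n10 / n01                          # 5 / 2 = 2.5
se_log_or <- sqrt(1 / n10 + 1 / n01)
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log_or)

# Unconditional fit with one intercept per pair: biased toward OR^2.
long <- data.frame(pair = factor(rep(seq_len(nrow(pairs)), 2)),
                   y = rep(c(1, 0), each = nrow(pairs)),
                   x = c(pairs$case_exposed, pairs$control_exposed))
uncond <- glm(y ~ x + pair, family = binomial, data = long)
coef(uncond)[["x"]] / log(or)            # approximately 2
```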
2026-06-03 — Matched case-control conditional logistic regression (PHASE_3 Chunk 1)
Opens the matched case-control layer with the conditional maximum-likelihood odds ratio. matcha(design = matched_cc(strata = ...), estimator = "clogit") (the design’s default estimator) fits survival::clogit — outcome ~ exposure + confounders + strata(set), each matched set a stratum — and contrast(type = "or") reports the exposure’s conditional odds ratio with a partial-likelihood-information Wald interval. Conditioning on the matched-set totals removes the matching-variable nuisance parameters, so only the exposure / adjustment ORs are reported; the matching variables are controlled implicitly.
- The conditional likelihood is the correctness invariant: unconditional logistic regression on matched-set indicators biases the OR (for 1:1 matching its MLE converges to the squared OR; Pike et al. 1980, Breslow & Day 1980).
- Several matching columns cross into one strata() term (frequency matching); a factor exposure reports one OR per level versus its reference; non-matching covariates adjust via confounders.
- RD / RR are rejected as unidentified (shared with the logistic engine). The conditional fit reports the information-matrix interval only, so ci_method = "sandwich" / "bootstrap" are declined (matchatr_unsupported_variance); cluster-robust variance for reused controls is deferred to the risk-set designs. An exposure with no within-stratum variation aborts with matchatr_unestimable_exposure.
- Validated against an independent closed-form oracle for 1:1 matching (McNemar: OR = n10/n01 and Var(log OR) = 1/n10 + 1/n01, exact and computed without clogit), a matched-set DGP built from the conditional likelihood with a known log-OR (CMLE recovers β for binary, continuous, mixed-ratio, and adjusted cases), survival::clogit pass-through, and a regression pin against the canonical infert clogit example (induced ≈ 4.09, spontaneous ≈ 7.29).
- The shared conditional_or_result() assembly (exposure coefficient by term position, Wald interval on the log scale, exponentiated) now backs both the unmatched logistic and the matched conditional logistic engines.
2026-06-03 — Mantel-Haenszel stratified odds ratio (PHASE_2 Chunk 3)
Completes the unmatched case-control layer with the closed-form Mantel-Haenszel estimator. unmatched_cc() gains a strata argument (the stratifying variable(s); several are crossed into one factor, and none gives the crude single-table OR). matcha(estimator = "mh") computes the summary odds ratio over the per-stratum 2×2 tables, and contrast(type = "or") reports it with a Robins-Breslow-Greenland (1986) Wald confidence interval — valid in both the sparse-data and large-strata limits.
- The exposure must be binary (a 2×2 table per stratum); a categorical (k>2) or continuous exposure is rejected (matchatr_bad_input) with a pointer to estimator = "logistic".
- RD / RR are rejected as unidentified (shared with the logistic engine). The MH variance is the closed-form RBG estimator, so ci_method = "sandwich" / "bootstrap" are declined (matchatr_unsupported_variance); a zero exposure-outcome margin aborts with matchatr_unestimable_exposure.
- Validated against stats::mantelhaen.test(correct = FALSE), whose odds-ratio estimate and confidence interval use the same RBG variance (the matchatr OR and CI match it exactly), plus the closed-form 2×2 odds ratio for the crude case.
- The fitter-agnostic coefficient-extraction helpers (term_assign, estimable_vcov, exposure_coef_index, parametric_positions) moved to R/coef_extract.R, shared by the logistic and Mantel-Haenszel engines.
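The Mantel-Haenszel summary odds ratio and the Robins-Breslow-Greenland variance are short closed forms, so the oracle comparison named in this entry is easy to replay in base R. The counts below are hypothetical and the code is a sketch of the textbook formulas, not the package's internal function.

```r
# Hypothetical counts: a 2 x 2 x K array (exposure x outcome x stratum).
tab <- array(c(12, 8, 20, 40,
               15, 10, 18, 30,
                9, 6, 25, 35),
             dim = c(2, 2, 3),
             dimnames = list(exposure = c("yes", "no"),
                             outcome = c("case", "control"),
                             stratum = c("s1", "s2", "s3")))
a  <- tab[1, 1, ]   # exposed cases
b  <- tab[1, 2, ]   # exposed controls
c0 <- tab[2, 1, ]   # unexposed cases
d  <- tab[2, 2, ]   # unexposed controls
n  <- a + b + c0 + d

# Mantel-Haenszel summary OR.
R <- a * d / n
S <- b * c0 / n
or_mh <- sum(R) / sum(S)

# Robins-Breslow-Greenland variance of log(OR_MH).
P <- (a + d) / n
Q <- (b + c0) / n
var_log <- sum(P * R) / (2 * sum(R)^2) +
  sum(P * S + Q * R) / (2 * sum(R) * sum(S)) +
  sum(Q * S) / (2 * sum(S)^2)
ci <- exp(log(or_mh) + c(-1, 1) * qnorm(0.975) * sqrt(var_log))

# Reference values from base R.
ref <- mantelhaen.test(tab, correct = FALSE)
```

Comparing or_mh with ref$estimate and ci with ref$conf.int reproduces the exact-agreement check described above.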
2026-06-03 — PHASE_2 Chunk 2 critical-review fixes
Follow-up fixes from an adversarial review of the categorical/GAM exposure work.
- Exposure-coefficient extraction is now by term position (collision-free) rather than by reconstructed coefficient name. glm permits non-unique coefficient names from factor x level concatenation (e.g. exposure ses level low and confounder se level slow both yield "seslow"), and the previous name-based selection could grab the confounder’s coefficient as a spurious second exposure OR, while name-based SE indexing returned the wrong (first-match) or NA standard error. Variance is now indexed by position via a unified term_assign() (glm model.matrix assign / gam $assign + $pterms) and a position-based estimable_vcov(), robust to both name collisions and aliasing.
- An ordered-factor exposure is now rejected up front in matcha() (matchatr_bad_input) instead of at contrast() time, so no model is fit and no spurious missing-data warning is emitted for an analysis that cannot yield a per-level OR.
- tidy() / summary() on a GAM fit now report only the parametric coefficients. Previously the smooth-basis terms (s(age).1, …) were listed as rows and, with exponentiate = TRUE, their exp() was reported as an “odds ratio” with a Wald p-value; penalized basis weights are neither.
- model_fn is validated more strictly: it must accept a family argument (or ...), caught at matcha() with a classed matchatr_bad_input instead of a raw “unused argument” base error, and the fitted object must be a binomial fit. A fitter that ignores family and returns, say, an OLS lm now aborts with matchatr_bad_model_fit rather than silently exponentiating a linear-probability slope into a bogus “odds ratio”.
- The recorded reference level for a factor exposure is now read from the fitted model’s xlevels (the baseline actually used), not the factor’s first declared level. A factor with an unused first level (e.g. declared absent < med < high but only med / high observed) previously mislabeled the contrast as “vs absent”; it now correctly reports “vs med”.
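The name collision in the first bullet can be reproduced with model.matrix() alone. The toy data below are hypothetical; the sketch shows that the column names clash while the "assign" attribute still maps each column to its term, which is the idea behind position-based extraction (the package's term_assign() itself is not shown here).

```r
# Hypothetical data: exposure `ses` (levels high / low) and confounder
# `se` (levels fast / slow) both generate a column called "seslow".
d <- data.frame(
  y   = c(0, 1, 0, 1, 1, 0, 1, 0),
  ses = factor(c("high", "low", "high", "low", "low", "high", "high", "low")),
  se  = factor(c("fast", "slow", "slow", "fast", "slow", "fast", "fast", "slow"))
)
form <- y ~ ses + se
X <- model.matrix(form, data = d)
colnames(X)   # "(Intercept)" "seslow" "seslow"

# Selecting by name is ambiguous; selecting by term position is not.
by_name <- which(colnames(X) == "seslow")
term_labels <- attr(terms(form), "term.labels")
by_position <- which(attr(X, "assign") == match("ses", term_labels))
```

Here by_name has two elements while by_position points only at the exposure's own column.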
2026-06-03 — Categorical / ordinal / GAM exposures (PHASE_2 Chunk 2)
Extends the unmatched case-control logistic engine to every exposure type.
- Categorical (k>2) exposure: a factor exposure now yields one odds ratio per non-reference level; contrast() returns k-1 rows and the result records the factor’s reference level (result$reference).
- Ordinal trend: an ordinal exposure entered as a numeric score yields a single per-step trend OR.
- Pluggable fitter: matcha(..., model_fn = ) selects the logistic fitter (default stats::glm); e.g. model_fn = mgcv::gam with confounders = ~ s(age) adjusts for a confounder with a smooth term while the exposure stays parametric with an interpretable OR. Exposure-coefficient extraction is now by name, so it works across glm and gam.
- Ordered-factor exposure rejected (matchatr_bad_input): glm / gam fit it with polynomial contrasts (.L, .Q, …), which are not per-level ORs; the error points to a numeric score (trend) or an unordered factor (per-level).
- Validated against stats::glm for every exposure type, and against the Ille-et-Vilaine esoph case-control data (handbook Ch3): the categorical alcohol ORs reproduce the canonical monotone dose-response.
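Why an ordered factor is rejected is visible from the design matrix R builds for it. The sketch below, on a hypothetical three-level dose variable and under R's default contrast options, contrasts the polynomial columns of an ordered factor with the per-level columns of an unordered one and with a numeric trend score.

```r
dose <- c("low", "mid", "high", "mid", "low", "high")

# Ordered factor: polynomial contrasts, not per-level comparisons.
ord <- factor(dose, levels = c("low", "mid", "high"), ordered = TRUE)
colnames(model.matrix(~ ord))     # "(Intercept)" "ord.L" "ord.Q"

# Unordered factor: one treatment-contrast column per non-reference level.
unord <- factor(dose, levels = c("low", "mid", "high"))
colnames(model.matrix(~ unord))   # "(Intercept)" "unordmid" "unordhigh"

# Numeric score: a single column, giving one per-step trend coefficient.
score <- as.integer(ord)
colnames(model.matrix(~ score))   # "(Intercept)" "score"
```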
2026-06-02 — PHASE_2 Chunk 1 critical-review fixes
Follow-up fixes from an adversarial review of the unmatched case-control logistic engine.
- tidy() / summary() with robust = TRUE, and contrast(type = "or", ci_method = "sandwich"), returned misaligned standard errors when the fitted model had an aliased (rank-deficient / collinear) term. sandwich::sandwich() drops aliased columns while stats::coef() keeps them, so the shorter SE vector was recycled against the coefficient vector and every term at or after the first aliased one got the wrong SE / CI / p-value. Standard errors are now aligned to the coefficient vector by name (aliased terms get NA) via a shared estimable_vcov() helper that reduces both variance sources to the estimable coefficient set.
- A constant or collinear exposure aliases to NA in glm; contrast(type = "or") previously returned an NA odds ratio silently. It now aborts with the classed matchatr_unestimable_exposure, mirroring the degenerate-outcome rejection.
- matchatr_result$n reported nrow(data), overcounting when glm dropped rows with missing covariates. It now reports stats::nobs(model), the complete-case count actually used, matching causatr, and a matchatr_dropped_rows warning reports how many rows were dropped. Actual missing-data handling (multiple imputation) stays delegated to causatr::causat_mice / the future imputation phase, not reimplemented here.
- contrast() now defaults type to the estimand the design identifies: "or" for the classical odds-ratio engines, "difference" otherwise. contrast(fit) on an unmatched case-control logistic fit therefore returns the conditional OR instead of erroring on the previously hard-coded risk-difference default.
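Two of the behaviours fixed here, glm's NA-aliasing of a collinear term and the gap between nrow(data) and nobs(model), can be seen on a small hypothetical data set with base R only (the sandwich alignment itself needs the sandwich package and is not reproduced in this sketch).

```r
# Hypothetical data: one missing confounder value and an exact copy of x.
d <- data.frame(y = c(0, 1, 0, 1, 1, 0, 1, 0, 1, 0),
                x = c(0, 1, 1, 0, 1, 0, 1, 1, 0, 0),
                z = c(1.2, NA, 0.4, 2.2, 0.9, 0.3, 1.1, 1.5, 0.6, 1.8))
d$x_copy <- d$x

# x enters after its copy, so glm aliases it: the coefficient is kept as NA.
fit <- glm(y ~ x_copy + x + z, family = binomial, data = d)
coef(fit)[["x"]]              # NA
estimable <- !is.na(coef(fit))

# Rows with a missing covariate are dropped before fitting.
nrow(d)                       # 10
nobs(fit)                     # 9
n_dropped <- nrow(d) - nobs(fit)
```

Any standard-error vector computed only for the estimable coefficients has to be placed back against names(coef(fit))[estimable] rather than recycled against the full coefficient vector.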
2026-06-02 — Unmatched case-control logistic conditional OR (PHASE_2 Chunk 1)
First estimator engine: the classical unmatched case-control analysis. matcha() now runs the resolved engine as part of the fit — the "logistic" estimator fits stats::glm(family = binomial) for outcome ~ exposure + confounders and stores it in the model slot (engines without a wired estimator still leave model = NULL). Only the slope coefficients carry over from the cohort model; the case-control intercept is offset by the log sampling-fraction ratio (Prentice & Pyke, 1979) and is never reported as a baseline risk.
- contrast(fit, type = "or") returns the exposure’s conditional odds ratio as a matchatr_result, with a Wald interval from the model information matrix (ci_method = "model", the new default) or the Huber-White sandwich (ci_method = "sandwich"). contrast() gained a conf_level argument.
- The risk difference and risk ratio are not identified from an unmatched case-control sample without the source-population prevalence q0: contrast(type = "difference" / "ratio") aborts with the classed matchatr_unidentified_estimand, pointing to type = "or" or to a case-control-weighted estimator with prevalence. Bootstrap CIs for the conditional OR abort with matchatr_unsupported_variance (a Wald interval is reported instead).
- New S3 surface: tidy.matchatr_fit() (broom-style coefficient / OR table on the log-odds or, with exponentiate = TRUE, the OR scale; model-based or robust sandwich SE), tidy.matchatr_result(), summary.matchatr_fit(), and print.matchatr_result().
- Tested against three oracles: stats::glm pass-through (point estimate, Wald and sandwich CI), the closed-form 2×2 odds ratio with Woolf variance, and a cohort data-generating process with a known log-OR (the conditional OR recovers the cohort slope, since case-control sampling shifts only the intercept).
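The closed-form oracle named in the last bullet is the 2×2 cross-product ratio with Woolf's variance, which a saturated logistic regression reproduces exactly. The following base-R sketch uses hypothetical cell counts and fits stats::glm on the four cells with frequency weights.

```r
# Hypothetical 2 x 2 counts.
n_case_exp <- 30; n_case_unexp <- 20
n_ctrl_exp <- 15; n_ctrl_unexp <- 35

# Closed form: cross-product OR and Woolf variance of log(OR).
or_closed <- (n_case_exp * n_ctrl_unexp) / (n_ctrl_exp * n_case_unexp)  # 3.5
var_woolf <- 1 / n_case_exp + 1 / n_case_unexp +
  1 / n_ctrl_exp + 1 / n_ctrl_unexp

# Same quantities from a logistic fit on the four cells.
cells <- data.frame(y = c(1, 1, 0, 0),
                    x = c(1, 0, 1, 0),
                    n = c(n_case_exp, n_case_unexp, n_ctrl_exp, n_ctrl_unexp))
fit <- glm(y ~ x, family = binomial, weights = n, data = cells)
or_glm <- exp(coef(fit)[["x"]])
var_glm <- vcov(fit)["x", "x"]

# Wald interval on the log scale, then exponentiated.
ci <- exp(log(or_closed) + c(-1, 1) * qnorm(0.975) * sqrt(var_woolf))
```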
2026-06-02 — PHASE_1 validation hardening (critical review)
Follow-up fixes from an adversarial review of the PHASE_1 foundation, closing input-validation gaps before the estimation layers depend on them.
- ratio = Inf (or -Inf) now raises the classed matchatr_bad_ratio instead of an unclassed base-R error. Inf %% 1 is NaN and NaN != 0 is NA, so the old guard reached if (NA); a finiteness check now runs before the modulo.
- A degenerate single-class outcome (all cases, all controls, or all NA) is now rejected with matchatr_bad_outcome for every encoding. The contrast check previously fired only for numeric columns, so an all-controls logical or two-level factor slipped through with n_cases = 0; resolve_binary_outcome() now coerces to 0/1 first and applies one uniform both-classes-present check.
- matcha() rejects data with duplicated column names (matchatr_bad_input). [[ resolves a duplicated name to its first match, so a duplicated outcome / exposure / strata column would otherwise be silently chosen and the existence check could even blame the wrong column.
- matcha() rejects a column assigned to two incompatible roles (matchatr_bad_input): the outcome or exposure also appearing as a confounder or a design column (e.g. the exposure double-entered in confounders, or the outcome used as the matched-set id). Confounders and design columns may still overlap, so frequency-matching on a variable and adjusting for it remains valid.
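The ordering issue in the first bullet, where the modulo on an infinite value produces NaN before any guard can fire, can be sketched as follows. check_ratio() is a hypothetical stand-in for the package's validator, and the condition is built with base R rather than whatever condition machinery the package actually uses.

```r
# The failure mode: the modulo of an infinite value is NaN, and comparing
# NaN yields NA, which `if` cannot evaluate.
Inf %% 1          # NaN
NaN != 0          # NA

# Hypothetical validator: finiteness is checked before the modulo, and &&
# short-circuits so the later comparisons never see a non-finite value.
check_ratio <- function(ratio) {
  ok <- is.numeric(ratio) && length(ratio) == 1L && is.finite(ratio) &&
    ratio >= 1 && ratio %% 1 == 0
  if (!ok) {
    stop(structure(
      class = c("matchatr_bad_ratio", "error", "condition"),
      list(message = "`ratio` must be a finite whole number >= 1", call = NULL)
    ))
  }
  invisible(ratio)
}

# Callers can match on the class instead of parsing a message.
outcome <- tryCatch(check_ratio(Inf),
                    matchatr_bad_ratio = function(e) "rejected")
```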
2026-06-02 — Design taxonomy, data model, and two-step API
First implemented layer: the sampling-design objects and the matcha() fit verb that the rest of the package builds on. No estimator runs yet — this phase is plumbing, validation, and dispatch.
- Six design constructors returning a unified matchatr_design S3 object: unmatched_cc(), matched_cc(), nested_cc(), case_cohort(), two_phase(), and counter_matched(). Each carries its sampling structure (strata, time, matching ratio, prevalence q0, subcohort / phase columns) and a weight_spec declaring the intended weighting scheme. Structural arguments are validated at construction (q0 strictly in (0, 1); ratio a whole number >= 1; strata a non-empty character vector) with classed matchatr_* errors.
- matcha(data, outcome, exposure, design, confounders, estimator) validates the request, resolves the orthogonal (design, estimator) axes to an estimation engine via an internal dispatch table, and returns a matchatr_fit. The case-control-weighted family ("ccw_gformula", "ccw_ipw", "ccw_aipw", "ccw_tmle") routes on any design but requires a prevalence q0; the classical estimators are design-specific ("logistic" / "mh", "clogit", "cch", …). The model slot is NULL until an estimation engine is run.
- Weights are never a data column: the fit reserves distinct details$cc_weights and details$design_weights slots (case-control weights and inclusion-probability weights have different variance consequences) plus a details$variance_kind slot for the eventual sampling-variance correction.
- Classed rejection paths, all matching on a matchatr_error parent: matchatr_bad_estimator (unknown / design-incompatible estimator), matchatr_bad_outcome (non-binary case indicator), matchatr_missing_prevalence (CCW without q0), matchatr_bad_prevalence / matchatr_bad_ratio / matchatr_bad_strata (malformed construction), matchatr_bad_design (missing columns / wrong object), and a matchatr_uninformative_stratum warning when a conditional likelihood would drop a matched set with no case or no control.
- print methods for matchatr_design and matchatr_fit.
- contrast(), the second-step verb, is defined as a validated skeleton: it fixes the public signature (type, ci_method) and the matchatr_result return contract, and aborts with matchatr_not_estimated on a fit whose estimation engine has not run (model = NULL). The estimation body is filled in by the causal-contrast phases.
2026-06-02 — Package scaffold and design roadmap
Initial bootstrap of matchatr: causal inference for (matched) case-control, nested case-control, and case-cohort study designs, as part of the etverse ecosystem.
- Package scaffold: DESCRIPTION (Imports causatr, survatr, survival), MIT license, testthat 3 edition, Air formatting, altdoc + Quarto website (matching the other etverse packages), GitHub Actions (R-CMD-check, test-coverage / Codecov, format-check, altdoc), Makefile, and Claude Code configuration (CLAUDE.md, .claude/hard-rules.md, symlinked etverse skills, posit-dev skills bundle).
- Full design roadmap in PHASE_1–PHASE_20 at the repository root, mapping the Handbook of Statistical Methods for Case-Control Studies (Borgan et al., 2018) to an implementation plan: design taxonomy and two-step API (PHASE_1); classical estimators (unmatched / matched / multiple-group, PHASE_2–4); time-to-event sampling designs (nested case-control, case-cohort, IPW-NCC, PHASE_5–7); the causal layer (strategy + case-control weighting g-formula / IPW / AIPW / TMLE + design-weighted causal survival, PHASE_8–10); and efficiency / advanced / extension phases (PHASE_11–20).
No estimator code is implemented yet — every phase is at Status: DESIGN.