Add survey-aware bootstrap for all estimators (Phase 6) by igerber · Pull Request #237 · igerber/diff-diff

igerber · 2026-03-25T00:53:38Z

Summary

- Add survey-aware bootstrap inference for all 8 bootstrap-using estimators
- Two strategies: PSU-level multiplier bootstrap (CS, ImputationDiD, TwoStageDiD, ContinuousDiD, EfficientDiD) and Rao-Wu rescaled bootstrap (SunAbraham, SyntheticDiD, TROP)
- Expand CallawaySantAnna analytical survey support to full strata/PSU/FPC via compute_survey_if_variance()
- Add shared infrastructure: generate_survey_multiplier_weights_batch, generate_rao_wu_weights, compute_survey_if_variance, aggregate_to_psu
- Thread survey weights through bootstrap aggregation/IF/GMM score computation for all estimators
- Add edge-case guards: lonely_psu="adjust" rejection, FPC validation, single-PSU handling, SyntheticDiD placebo+full-design guard
- Update REGISTRY.md with Phase 6 survey bootstrap methodology section
- Iteratively refined through 5 rounds of AI review (gpt-5.4-pro)
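To make the first strategy concrete, here is a minimal sketch of PSU-level multiplier weights, assuming Rademacher draws and an unstratified design. The helper name `psu_multiplier_weights` is hypothetical and is not the PR's `generate_survey_multiplier_weights_batch`; stratum handling and FPC scaling are omitted.

```python
import numpy as np

def psu_multiplier_weights(psu, n_boot, rng):
    """Draw one Rademacher multiplier per PSU and broadcast it to units.

    Hypothetical helper for illustration only (unstratified, no FPC).
    """
    psu = np.asarray(psu)
    # Map each unit to the index of its PSU.
    _, psu_index = np.unique(psu, return_inverse=True)
    n_psu = psu_index.max() + 1
    # One +1/-1 draw per (replicate, PSU).
    xi = rng.choice(np.array([-1.0, 1.0]), size=(n_boot, n_psu))
    # Units in the same PSU share a multiplier within each replicate.
    return xi[:, psu_index]

rng = np.random.default_rng(0)
psu = np.array([10, 10, 20, 20, 20, 30])
weights = psu_multiplier_weights(psu, n_boot=4, rng=rng)
print(weights.shape)  # (replicates, units)
```

Drawing at the PSU level rather than the unit level is what lets the perturbed influence-function sums reflect within-PSU correlation.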

Methodology references (required if estimator / math changes)

Method name(s): Rao-Wu rescaled bootstrap, PSU-level multiplier bootstrap, Taylor Series Linearization
Paper / source link(s):
- Rao & Wu (1988) "Resampling Inference with Complex Survey Data", JASA 83(401)
- Rao, Wu & Yue (1992) "Some Recent Work on Resampling Methods for Complex Surveys", Survey Methodology 18(2)
- Kolenikov (2010) "Resampling Variance Estimation for Complex Survey Data"
- Shao (2003) "Impact of the Bootstrap on Sample Surveys", Statistical Science 18(2)
Any intentional deviations from the source (and why):
- FPC enters Rao-Wu via adjusted resample size m_h = round((1-f_h)*(n_h-1)) per Rao-Wu-Yue (1992) Section 3
- TROP uses cross-classified pseudo-strata (survey_stratum × treatment_group) for Rao-Wu
- lonely_psu="adjust" rejected for bootstrap paths (analytical path supports it)
- Rust TROP bootstrap remains pweight-only; Python fallback for full design
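
For reference, the FPC deviation can be read against the usual Rao-Wu-Yue rescaling. The form below is the textbook one and is an assumption about what the implementation follows. The symbols $r_{hi}$ (number of times PSU $i$ of stratum $h$ is drawn in $m_h$ draws with replacement) and $\lambda_h$ are notation introduced here, not names from the PR.

```latex
\lambda_h = \sqrt{\frac{m_h}{n_h - 1}}, \qquad
m_h = \operatorname{round}\bigl((1 - f_h)(n_h - 1)\bigr), \qquad
w^{*}_{hi} = w_{hi}\left[\,1 - \lambda_h + \lambda_h \,\frac{n_h}{m_h}\, r_{hi}\right]
```

As a worked case, $n_h = 5$ and $f_h = 0.5$ give $m_h = 2$ and $\lambda_h = \sqrt{2/4} \approx 0.707$, so a PSU drawn once is scaled by about $2.061$ and an undrawn PSU by about $0.293$. At $f_h = 1$ the formula gives $m_h = 0$, so the factor $n_h / m_h$ is undefined and that case needs separate handling.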

Validation

- Tests added/updated: tests/test_survey.py, tests/test_survey_phase3.py, tests/test_survey_phase4.py, tests/test_survey_phase5.py
- 285 survey tests passing across all phases
- All deferral tests converted to positive tests
- Smoke tests + scale invariance + uniform weight equivalence tests
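
As an illustration of what the last two test categories check, the sketch below uses a plain weighted least-squares fit rather than any estimator from this library. The `wls` helper is hypothetical.

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares via sqrt-weight rescaling (illustrative helper)."""
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=50)
w = rng.uniform(0.5, 3.0, size=50)

base = wls(X, y, w)
scaled = wls(X, y, 10.0 * w)             # same weights, rescaled
uniform = wls(X, y, np.ones(50))         # all weights equal
unweighted = np.linalg.lstsq(X, y, rcond=None)[0]

# Scale invariance: multiplying all weights by a constant changes nothing.
assert np.allclose(base, scaled)
# Uniform weight equivalence: equal weights reproduce the unweighted fit.
assert np.allclose(uniform, unweighted)
```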

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Implement bootstrap + survey interaction for all 8 bootstrap-using estimators.
Two strategies: PSU-level multiplier bootstrap (CS, ImputationDiD, TwoStageDiD, ContinuousDiD, EfficientDiD) and Rao-Wu rescaled bootstrap (SunAbraham, SyntheticDiD, TROP).
Expand CS analytical support to full strata/PSU/FPC via compute_survey_if_variance.
Add shared infrastructure: generate_survey_multiplier_weights_batch, generate_rao_wu_weights, compute_survey_if_variance, aggregate_to_psu.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

P0: Thread survey weights through bootstrap aggregation/IF paths
- CS: use survey_weight_sum for bootstrap re-aggregation weights
- ImputationDiD: pass survey_weights_0 to _precompute_bootstrap_psi
- TwoStageDiD: add survey weights to _compute_cluster_S_scores
P1: Fix CS df_survey inconsistency (use unit-level df everywhere), fix ContinuousDiD event-study bootstrap weights
P2: Update REGISTRY.md deferred language, clean TODO.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

P0: Pass survey weights to TwoStageDiD Stage-2 solve_ols calls
P1: ImputationDiD event-study/group bootstrap uses survey-weighted target weights; SunAbraham collapses to unit-level before Rao-Wu and stores NaN for failed draws
P2: CS metadata from unit-level resolved survey; registry CS note updated for consistency
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

P0: SunAbraham pairs bootstrap now passes survey weights through to _fit_saturated_regression, _compute_iw_effects, _compute_overall_att
P1: ImputationDiD passes both treated (sw_1) and untreated (sw_0) survey weights to _precompute_bootstrap_psi, fixing array indexing
P1: ContinuousDiD IF scores now include per-unit w_i factor in sandwich meat (w_i * X_i * u_i), fixing weighted IF consistency
P3: REGISTRY.md CS section updated for consistent survey support docs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

P0: TwoStage solve_ols now passes survey weights for all 3 Stage-2 paths (static, event-study, group)
P0: SunAbraham pairs bootstrap passes resolved_survey=None to avoid stale design weights overriding bootstrap-resampled weights
P1: CS bootstrap uses fixed cohort masses from precomputed survey weights (not per-cell survey_weight_sum) for overall and event study
P1: Single-PSU unstratified guard in generate_survey_multiplier_weights_batch
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Reject lonely_psu="adjust" for bootstrap with NotImplementedError
- Add FPC validation in multiplier and Rao-Wu bootstrap generators
- Gate SyntheticDiD placebo + full survey design (require bootstrap)
- Update REGISTRY.md for SyntheticDiD, TROP, and CS survey support
- Update TROP fit() docstring for full design support
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-03-25T01:12:37Z

Overall Assessment

⛔ Blocker

The PR introduces one unmitigated P0 in the shared Rao-Wu bootstrap path and three P1 methodology/documentation issues in the new survey-bootstrap support. The P0 is a silent statistical correctness bug in shared code, so this should not merge as-is.

Executive Summary

- The shared Rao-Wu rescaled bootstrap cannot produce the zero-variance full-census FPC case, so estimators that use it can report nonzero bootstrap uncertainty where the survey design implies zero variance.
- ContinuousDiD’s new survey bootstrap targets the wrong ACRT^{glob} estimand under non-uniform survey weights: the point estimator and analytical IF are weighted, but the bootstrap perturbation is unweighted.
- Both new survey-bootstrap generators reject lonely_psu="adjust", but the new Phase 6 registry text does not disclose that restriction.
- TROP’s Rao-Wu path changes the no-strata design by treating treatment status as strata, which changes the bootstrap/FPC construction in the strata=None case without a clear Phase 6 note.
- The new tests are mostly smoke tests; they do not cover Rao-Wu full-census FPC or ContinuousDiD’s survey-bootstrap overall_acrt_* outputs, which is why the methodology regressions above slip through.

Methodology

Severity: P0
Method: Rao-Wu rescaled bootstrap (SunAbraham, SyntheticDiD, TROP).
Impact: Within this library’s own survey methodology, full-census FPC (f_h = 1) is a zero-variance case in the analytical survey path at diff_diff/survey.py:658 and diff_diff/survey.py:713, with regression coverage at tests/test_survey.py:2504 and tests/test_survey.py:2533. The new shared Rao-Wu generator clamps m_h to at least 1 at diff_diff/bootstrap_utils.py:608, so its “full census” guard at diff_diff/bootstrap_utils.py:612 is unreachable. Callers at diff_diff/sun_abraham.py:1372, diff_diff/synthetic_did.py:760, diff_diff/trop_local.py:1163, and diff_diff/trop_global.py:1149 can therefore emit non-degenerate replicate weights and nonzero SEs for full-census FPC designs. That is silent wrong statistical output.
Concrete fix: Special-case f_h >= 1 before drawing PSUs and keep original weights / zero perturbation for that stratum; also update the Rao-Wu Phase 6 note at docs/methodology/REGISTRY.md:1960 to document the census case.
Severity: P1
Method: ContinuousDiD PSU-level multiplier bootstrap for ACRT^{glob}.
Impact: The registry defines ACRT^{glob} as the treated average marginal effect at docs/methodology/REGISTRY.md:468. The survey point estimator and analytical IF both use survey-weighted treated averages at diff_diff/continuous_did.py:955, diff_diff/continuous_did.py:1005, and diff_diff/continuous_did.py:1176. But the new survey bootstrap perturbation uses an unweighted mean at diff_diff/continuous_did.py:1441. With non-uniform treated survey weights, overall_acrt_se, CI, and p-value no longer target the same estimand as the point estimate.
Concrete fix: Use the stored weighted dpsi_bar (or equivalent weighted average when w_treated is present) in _bootstrap_gt_cell, and add a regression test covering overall_acrt_* under non-uniform treated survey weights.
Severity: P1
Method: survey-aware bootstrap support in generate_survey_multiplier_weights_batch() and generate_rao_wu_weights().
Impact: The general survey registry still presents lonely_psu modes including "adjust" at docs/methodology/REGISTRY.md:1887, and the new Phase 6 section advertises survey-aware bootstrap support at docs/methodology/REGISTRY.md:1945 and docs/methodology/REGISTRY.md:1955. But both new generators hard-reject "adjust" at diff_diff/bootstrap_utils.py:454 and diff_diff/bootstrap_utils.py:570. That is an undocumented behavioral/methodology restriction in a newly advertised feature.
Concrete fix: Either implement "adjust" for bootstrap paths or add explicit **Note:** / **Deviation from R:** entries in REGISTRY.md stating that bootstrap currently supports only "remove" and "certainty", plus tests for the present rejection.
Severity: P1
Method: TROP Rao-Wu resampling design.
Impact: In the TROP full-design bootstrap builders, the survey_design.strata is None case still creates strata from treatment group at diff_diff/trop_local.py:1073 and diff_diff/trop_global.py:1091. The shared survey layer otherwise supports genuinely unstratified FPC designs at diff_diff/survey.py:212. The Phase 6 Rao-Wu note describes resampling within actual strata at docs/methodology/REGISTRY.md:1960, while the TROP pseudo-strata note only describes the case where survey strata and treatment groups both exist at docs/methodology/REGISTRY.md:1977. In the strata=None case, the implementation therefore changes the bootstrap/FPC design without a clear matching Phase 6 note.
Concrete fix: Either keep TROP unstratified when survey_design.strata is None, or document and justify the treatment-group pseudo-strata rule for that case, including its FPC interpretation, and add no-strata+FPC tests for both local and global TROP.
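
The census special case asked for in the P0 finding could look roughly like the sketch below. It is a hypothetical single-replicate function, not the PR's `generate_rao_wu_weights`. The `fpc_fraction` mapping (stratum to sampling fraction f_h) and the choice to leave single-PSU strata unchanged are assumptions made for illustration.

```python
import numpy as np

def rao_wu_weights(weights, stratum, psu, fpc_fraction, rng):
    """One Rao-Wu replicate; strata with f_h >= 1 keep their original weights.

    Hypothetical sketch: `fpc_fraction` maps stratum -> sampling fraction f_h.
    """
    weights = np.asarray(weights, dtype=float)
    stratum = np.asarray(stratum)
    psu = np.asarray(psu)
    out = weights.copy()
    for h in np.unique(stratum):
        in_h = stratum == h
        psus_h, psu_index = np.unique(psu[in_h], return_inverse=True)
        n_h = len(psus_h)
        f_h = fpc_fraction.get(h, 0.0)
        # Census stratum (or a single PSU, by assumption): zero perturbation.
        # This check runs BEFORE any clamping of m_h, so it stays reachable.
        if f_h >= 1.0 or n_h < 2:
            continue
        # Resample size from the PR description, floored at 1 only when f_h < 1.
        m_h = max(1, int(round((1.0 - f_h) * (n_h - 1))))
        draws = rng.integers(0, n_h, size=m_h)
        r = np.bincount(draws, minlength=n_h)  # times each PSU was drawn
        lam = np.sqrt(m_h / (n_h - 1))
        scale = 1.0 - lam + lam * (n_h / m_h) * r
        out[in_h] = weights[in_h] * scale[psu_index]
    return out

rng = np.random.default_rng(0)
w = np.ones(8)
strata = np.array([0, 0, 0, 0, 1, 1, 1, 1])
psus = np.arange(8)
rep = rao_wu_weights(w, strata, psus, {0: 1.0, 1: 0.25}, rng)
```

With stratum 0 sampled as a census, every replicate leaves its weights untouched, so any statistic computed only from that stratum has zero bootstrap variance, which matches the analytical path described in the finding.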

Code Quality

No additional findings beyond the methodology issues above.

Performance

No findings.

Maintainability

No additional findings beyond the methodology/doc alignment issues above.

Tech Debt

No TODO.md mitigation applies to the P0/P1 items above. Under the stated rubric, these are correctness or undocumented-methodology issues, not deferrable tech debt.

Security

No findings.

Documentation/Tests

Severity: P2
Impact: The new tests cover smoke/success paths but miss the key edge cases above. The ContinuousDiD survey-bootstrap test only checks overall_att / overall_att_se at tests/test_survey_phase3.py:673, not overall_acrt_*. The Phase 5 “full design” fixture omits fpc at tests/test_survey_phase5.py:109, so the SyntheticDiD and TROP full-design smoke tests at tests/test_survey_phase5.py:179, tests/test_survey_phase5.py:532, and tests/test_survey_phase5.py:547 never exercise the broken full-census Rao-Wu branch.
Concrete fix: Add targeted regression tests for full-census FPC Rao-Wu behavior, lonely_psu="adjust" rejection/support, and ContinuousDiD overall_acrt_* under non-uniform survey weights.

Path to Approval

- Fix diff_diff/bootstrap_utils.py so Rao-Wu handles f_h >= 1 as a census case with zero perturbation, then add stratified and unstratified full-census regression tests in at least one Rao-Wu estimator and a generator-level unit test.
- Fix diff_diff/continuous_did.py so survey bootstrap ACRT^{glob} uses the same weighted treated derivative average as the point estimator / analytical IF, then add a regression test that asserts overall_acrt_se, CI, and p-value under non-uniform survey weights.
- Resolve the lonely_psu="adjust" mismatch by either implementing it in both survey-bootstrap generators or documenting the unsupported mode with explicit **Note:** / **Deviation from R:** labels in docs/methodology/REGISTRY.md, plus tests for the current behavior.
- Resolve the TROP strata=None design mismatch by either keeping the Rao-Wu path unstratified in that case or explicitly documenting the treatment-group pseudo-strata rule, including FPC semantics, and add SurveyDesign(weights=..., fpc=...) no-strata tests for both TROP methods.
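
To illustrate the ContinuousDiD item, the sketch below shows why an unweighted derivative average targets a different quantity once treated survey weights are non-uniform. It assumes ACRT^{glob} is computed as an averaged derivative basis times a coefficient vector. That structure, the helper `mean_derivative`, and the toy numbers are assumptions, not code from diff_diff/continuous_did.py.

```python
import numpy as np

def mean_derivative(dpsi, w_treated=None):
    """Average derivative-basis rows over treated units.

    Survey-weighted when `w_treated` is given, unweighted otherwise.
    Hypothetical helper for illustration.
    """
    dpsi = np.asarray(dpsi, dtype=float)
    if w_treated is None:
        return dpsi.mean(axis=0)
    return np.average(dpsi, axis=0, weights=np.asarray(w_treated, dtype=float))

dpsi = np.array([[1.0, 0.0], [1.0, 2.0], [1.0, 4.0]])  # one row per treated unit
w = np.array([1.0, 1.0, 6.0])                          # non-uniform survey weights
beta = np.array([0.5, 0.2])                            # toy coefficient vector

acrt_unweighted = mean_derivative(dpsi) @ beta         # uses row mean [1.0, 2.0]
acrt_weighted = mean_derivative(dpsi, w) @ beta        # uses row mean [1.0, 3.25]
print(acrt_unweighted, acrt_weighted)
```

If the point estimate uses the weighted average while bootstrap draws perturb the unweighted one, the resulting SE, CI, and p-value describe a different quantity from the one being reported.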

igerber and others added 6 commits March 24, 2026 13:42

igerber wants to merge 6 commits into main from survey-improvements
