
feat(replica): refuse to create new restore when too many already exist#31

Merged
passcod merged 10 commits into main from restore-prune-fix
May 8, 2026

Conversation

@passcod
Member

@passcod passcod commented May 8, 2026

and don't consider a migration incomplete when there's nothing to migrate


passcod added 6 commits May 8, 2026 17:13
When persistent_schemas is set in spec but migration is skipped (first
restore, empty config, or all configured schemas missing on source),
the early-return Ok(true) branches never patched
status.schemaMigrationPhase. The cleanup gate in reconcile_replica
requires schemaMigrationPhase==complete to delete previousRestore, so
those replicas accumulated stale Active restores indefinitely.
…ter cleanup

Previously the cleanup branch in reconcile_replica only deleted the
single restore named in status.previousRestore. That field is
overwritten on every switchover without first ensuring the prior value
was cleaned up, so either (a) cleanup blocked by the
migration_complete gate or (b) two switchovers within one grace
period orphaned restores forever.

Replace with a sweep over all Active restores != currentRestore. The
gating semantics are unchanged (migration_complete + elapsed >
switchover_grace_period). Refuse to sweep when no live Active matches
status.currentRestore, since that indicates an inconsistent state
better resolved by another path.
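The sweep's selection logic can be sketched as a pure function over the restore list (a minimal model under assumed names — RestoreInfo, Phase, and stale_restores are illustrative, not the operator's actual types, which work on PostgresPhysicalRestore objects):

```rust
// Minimal model of the sweep: collect stale Active restores, but refuse
// to sweep at all when status.currentRestore has no live Active match.
#[derive(PartialEq, Clone, Copy)]
enum Phase { Active, Switching, Failed }

struct RestoreInfo { name: &'static str, phase: Phase }

/// Returns the names of stale Active restores to delete, or None when
/// status.currentRestore does not match a live Active restore — an
/// inconsistent state better resolved by another path.
fn stale_restores(restores: &[RestoreInfo], current: &str) -> Option<Vec<&'static str>> {
    let current_is_live_active = restores
        .iter()
        .any(|r| r.name == current && r.phase == Phase::Active);
    if !current_is_live_active {
        return None;
    }
    Some(
        restores
            .iter()
            .filter(|r| r.phase == Phase::Active && r.name != current)
            .map(|r| r.name)
            .collect(),
    )
}

fn main() {
    let restores = [
        RestoreInfo { name: "restore-a", phase: Phase::Active },
        RestoreInfo { name: "restore-b", phase: Phase::Active },
        RestoreInfo { name: "restore-c", phase: Phase::Failed },
    ];
    // restore-b is current: only restore-a is swept.
    assert_eq!(stale_restores(&restores, "restore-b"), Some(vec!["restore-a"]));
    // current does not match a live Active restore: refuse to sweep.
    assert_eq!(stale_restores(&restores, "restore-z"), None);
    println!("ok");
}
```

The gating (migration_complete + elapsed > switchover_grace_period) sits outside this selection and is unchanged by the PR.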
If a replica accumulates 3 or more PostgresPhysicalRestore objects
(MAX_RESTORES_PER_REPLICA), the operator now refuses to create another
one. It logs a warning, sets a RestoreCreationBlocked status condition
visible in kubectl describe, and emits a Warning event. The condition
is cleared on the next successful create.

Steady state is 1 restore per replica (transiently 2 around a
switchover); 3 already indicates pruning has stopped working. This is
a backstop against repeats of the bug fixed in the prior commits — it
caps blast radius in degenerate states without preventing the operator
from recovering once the underlying issue is resolved.
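The guardrail itself reduces to a single threshold check; a sketch (the constant name matches the PR, while creation_blocked is a hypothetical helper — the real code also logs a warning, patches the RestoreCreationBlocked condition, and emits a Warning event):

```rust
// Backstop guardrail: steady state is 1 restore per replica, transiently
// 2 around a switchover; 3 already means pruning has stopped working.
const MAX_RESTORES_PER_REPLICA: usize = 3;

/// True when the operator should refuse to create another restore.
fn creation_blocked(existing_restores: usize) -> bool {
    existing_restores >= MAX_RESTORES_PER_REPLICA
}

fn main() {
    assert!(!creation_blocked(1)); // steady state
    assert!(!creation_blocked(2)); // transient, around a switchover
    assert!(creation_blocked(3));  // pruning broken: cap blast radius
    assert!(creation_blocked(41)); // the production incident's count
    println!("ok");
}
```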
…restore

Regression for the production bug where a replica with persistent_schemas
configured but all schemas missing on source accumulated stale Active
restores indefinitely (one cluster grew to 41 over a week).

Asserts that after a manual second restore drives a switchover, both
status.schemaMigrationPhase reaches "complete" and the previous restore
is swept once the grace period elapses, leaving exactly one restore.

Phase 3 (too-many-restores guardrail) is not covered end-to-end here:
exercising it would require rotating the kopia snapshot mid-test
(infrastructure not currently in the repo). The guardrail logic is
small and self-contained; deferring its integration test.
Phase 3 (too-many-restores guardrail) shipped end-to-end, but its
integration test is held back: exercising it requires rotating the
kopia snapshot mid-test, which would mean a new fixture. Logic is
small and the new RestoreCreationBlocked condition makes runtime
state self-evident, so deferring rather than blocking the prune fix.
@review-hero

review-hero Bot commented May 8, 2026

🦸 Review Hero Summary
6 agents reviewed this PR | 1 critical | 4 suggestions | 0 nitpicks | Filtering: consensus 3 voters, 1 below threshold

Below consensus threshold (1 unique issue not confirmed by majority):

  src/controllers/replica.rs:1109 · Bugs & Correctness · critical
  Calling mark_schema_migration_complete in the 'first restore' branch sets schemaMigrationPhase = complete, which then trips the short-circuit guard at L1123-1132 on the SECOND restore cycle — c...
Local fix prompt (copy to your coding agent)

Fix these issues identified on the pull request. One commit per issue fixed.


src/controllers/replica.rs:337: After the sweep loop swallows delete errors (lines 331–334, warn! only, no early return), the status patch unconditionally clears schemaMigrationPhase to null. If any delete fails transiently, the surviving stale restore is now permanently un-sweepable: on the next reconcile, migration_complete evaluates to false (phase is null, persistent_schemas is set), so the sweep guard never fires again until an entirely new restore cycle runs mark_schema_migration_complete. This recreates the original accumulation bug under transient API errors. Fix: track whether all deletes succeeded and skip the status patch if any failed, so the next reconcile retries the remaining stale restores.


src/controllers/replica/resources.rs:292: The comment claims 'the count is now below the limit again', but when creation happens with existing.items.len() == MAX_RESTORES_PER_REPLICA - 1 (e.g., 2 restores), the total after creation is exactly MAX_RESTORES_PER_REPLICA (3). The condition is cleared to False/Healthy, but the very next reconcile that attempts to create a restore will immediately re-trip the guardrail and set it back to True. This causes unnecessary condition churn visible to operators. The condition should only be cleared to False when the count drops below MAX_RESTORES_PER_REPLICA, not immediately after creating a restore that brings it to the limit.
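The suggested fix boils down to clearing the condition from the live count rather than from the fact of a successful create; a sketch (should_clear_blocked_condition is a hypothetical helper — the real code patches a Kubernetes status condition):

```rust
const MAX_RESTORES_PER_REPLICA: usize = 3;

/// Clear RestoreCreationBlocked only when the total is strictly below
/// the limit; a create that brings the total to exactly
/// MAX_RESTORES_PER_REPLICA must NOT clear it, or the next reconcile
/// re-trips the guardrail and churns the condition.
fn should_clear_blocked_condition(total_restores: usize) -> bool {
    total_restores < MAX_RESTORES_PER_REPLICA
}

fn main() {
    // 2 existing + 1 created = 3: at the limit, keep the condition set.
    assert!(!should_clear_blocked_condition(2 + 1));
    // 1 existing + 1 created = 2: safely below the limit, clear it.
    assert!(should_clear_blocked_condition(1 + 1));
    println!("ok");
}
```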


src/controllers/replica.rs:307: The comment says 'Refuse to sweep if status.currentRestore doesn't match any live Active restore', but the guard only checks existence (any phase). During a switchover, currentRestore points to the new restore which is in Switching phase — has_matching_current is true. If last_restore_completed_at is from the previous cycle (hours old, past grace period), the sweep fires mid-switchover and deletes the predecessor restore that is still in Active phase and may still be serving traffic. Fix: filter by Active phase too — restore_list.items.iter().any(|r| r.name_any() == current && r.status.as_ref().and_then(|s| s.phase.as_ref()) == Some(&RestorePhase::Active)).


src/controllers/replica.rs:327: Stale restores are deleted sequentially with .await inside a for-loop. In the exact failure mode this PR addresses (41 accumulated restores) that is 41 consecutive round-trips to the API server, and each failed delete blocks the next. Parallelise with futures::future::join_all (or tokio::task::JoinSet) so all deletes are in-flight simultaneously; errors can still be logged per-item after joining.


src/controllers/replica/resources.rs:185: create_restore_for_snapshot issues a full Kubernetes API list call (restores.list(...)) to count existing restores, but the caller in reconcile (replica.rs:146–148) already fetched the identically-labelled restore list moments earlier in the same reconcile pass. This doubles the number of API round-trips on every scheduling cycle. Fix: add an existing_count: usize (or &[PostgresPhysicalRestore]) parameter to create_restore_for_snapshot and pass restore_list.items.len() from the caller, eliminating the redundant fetch.

passcod added 4 commits May 8, 2026 19:28
The first-restore branch in reconcile_schema_migration has no previous
restore to clean up, so the cleanup gate is moot. Marking phase as
complete here would short-circuit the next migration via the
schema_migration_phase == "complete" guard, silently skipping schema
migration on the second restore cycle even when real schemas exist
that need to be carried over.

The empty-config and all-filtered branches still mark complete, since
leaving the phase unset in those branches would block cleanup.
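The resulting per-branch behavior can be modeled as a small match (SkipReason and mark_complete are illustrative names, not the operator's real types):

```rust
// Hypothetical model of the three early-return branches in
// reconcile_schema_migration after this commit.
enum SkipReason {
    FirstRestore,      // no previous restore exists to clean up
    EmptyConfig,       // persistent_schemas is set but empty
    AllSchemasMissing, // every configured schema is absent on source
}

/// Whether the early-return branch should patch
/// status.schemaMigrationPhase = "complete" before returning.
fn mark_complete(reason: &SkipReason) -> bool {
    match reason {
        // The cleanup gate is moot with no previous restore; marking
        // complete here would short-circuit the next cycle's migration.
        SkipReason::FirstRestore => false,
        // Left unset, these branches would block cleanup.
        SkipReason::EmptyConfig | SkipReason::AllSchemasMissing => true,
    }
}

fn main() {
    assert!(!mark_complete(&SkipReason::FirstRestore));
    assert!(mark_complete(&SkipReason::EmptyConfig));
    assert!(mark_complete(&SkipReason::AllSchemasMissing));
    println!("ok");
}
```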
The sweep guard had a comment saying "live Active restore" but the
check itself only matched on name. The current reconcile flow returns
early during switchover so this never fired with a non-Active current
in practice, but the comment/code mismatch was brittle. Tighten the
check to match the comment, costing nothing.
…y fail

Previously the post-sweep status patch unconditionally cleared
schemaMigrationPhase to null. If any individual delete failed
(transient API error, etc.) the surviving stale restore would be
permanently un-sweepable until the next switchover re-marked complete:
migration_complete evaluates to false (phase is null,
persistent_schemas is set), so the sweep guard refuses to retry.

Track per-delete success and skip the status clear unless every
delete succeeded. The next reconcile will then retry the survivors
with the gate still open.
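The decision reduces to an all-succeeded check over the per-delete results; a sketch (should_clear_phase is a hypothetical helper name):

```rust
/// Clear schemaMigrationPhase only when every delete succeeded, so a
/// partial failure leaves the gate open and the next reconcile retries
/// the surviving stale restores.
fn should_clear_phase(delete_results: &[Result<(), String>]) -> bool {
    delete_results.iter().all(|r| r.is_ok())
}

fn main() {
    assert!(should_clear_phase(&[Ok(()), Ok(())]));
    assert!(!should_clear_phase(&[Ok(()), Err("transient API error".into())]));
    println!("ok");
}
```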
AGENTS.md requires imports to be merged and grouped by std,
third-party, then local. The Phase 3 commit added ListParams and warn
as separate import statements; merge them into the existing kube and
tracing blocks.
@passcod passcod merged commit 9439466 into main May 8, 2026
13 checks passed
@passcod passcod deleted the restore-prune-fix branch May 8, 2026 08:02