
feat(replica): refuse to create new restore when too many already exist#31

Merged
passcod merged 10 commits into main from restore-prune-fix
May 8, 2026

Conversation

@passcod
Member

@passcod passcod commented May 8, 2026

and don't consider a migration incomplete when there's nothing to migrate


passcod added 6 commits May 8, 2026 17:13
When persistent_schemas is set in spec but migration is skipped (first
restore, empty config, or all configured schemas missing on source),
the early-return Ok(true) branches never patched
status.schemaMigrationPhase. The cleanup gate in reconcile_replica
requires schemaMigrationPhase==complete to delete previousRestore, so
those replicas accumulated stale Active restores indefinitely.
…ter cleanup

Previously the cleanup branch in reconcile_replica only deleted the
single restore named in status.previousRestore. That field is
overwritten on every switchover without first ensuring the prior value
was cleaned up, so either (a) cleanup blocked by the
migration_complete gate or (b) two switchovers within one grace
period orphaned restores forever.

Replace with a sweep over all Active restores != currentRestore. The
gating semantics are unchanged (migration_complete + elapsed >
switchover_grace_period). Refuse to sweep when no live Active matches
status.currentRestore, since that indicates an inconsistent state
better resolved by another path.
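The sweep's selection logic can be sketched as a pure function over the restore list (a minimal model under assumed names — RestoreInfo, Phase, and stale_restores are illustrative, not the operator's actual types, which work on PostgresPhysicalRestore objects):

```rust
// Minimal model of the sweep: collect stale Active restores, but refuse
// to sweep at all when status.currentRestore has no live Active match.
#[derive(PartialEq, Clone, Copy)]
enum Phase { Active, Switching, Failed }

struct RestoreInfo { name: &'static str, phase: Phase }

/// Returns the names of stale Active restores to delete, or None when
/// status.currentRestore does not match a live Active restore — an
/// inconsistent state better resolved by another path.
fn stale_restores(restores: &[RestoreInfo], current: &str) -> Option<Vec<&'static str>> {
    let current_is_live_active = restores
        .iter()
        .any(|r| r.name == current && r.phase == Phase::Active);
    if !current_is_live_active {
        return None;
    }
    Some(
        restores
            .iter()
            .filter(|r| r.phase == Phase::Active && r.name != current)
            .map(|r| r.name)
            .collect(),
    )
}

fn main() {
    let restores = [
        RestoreInfo { name: "restore-a", phase: Phase::Active },
        RestoreInfo { name: "restore-b", phase: Phase::Active },
        RestoreInfo { name: "restore-c", phase: Phase::Failed },
    ];
    // restore-b is current: only restore-a is swept.
    assert_eq!(stale_restores(&restores, "restore-b"), Some(vec!["restore-a"]));
    // current does not match a live Active restore: refuse to sweep.
    assert_eq!(stale_restores(&restores, "restore-z"), None);
    println!("ok");
}
```

The gating (migration_complete + elapsed > switchover_grace_period) sits outside this selection and is unchanged by the PR.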
If a replica accumulates 3 or more PostgresPhysicalRestore objects
(MAX_RESTORES_PER_REPLICA), the operator now refuses to create another
one. It logs a warning, sets a RestoreCreationBlocked status condition
visible in kubectl describe, and emits a Warning event. The condition
is cleared on the next successful create.

Steady state is 1 restore per replica (transiently 2 around a
switchover); 3 already indicates pruning has stopped working. This is
a backstop against repeats of the bug fixed in the prior commits — it
caps blast radius in degenerate states without preventing the operator
from recovering once the underlying issue is resolved.
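The guardrail itself reduces to a single threshold check; a sketch (the constant name matches the PR, while creation_blocked is a hypothetical helper — the real code also logs a warning, patches the RestoreCreationBlocked condition, and emits a Warning event):

```rust
// Backstop guardrail: steady state is 1 restore per replica, transiently
// 2 around a switchover; 3 already means pruning has stopped working.
const MAX_RESTORES_PER_REPLICA: usize = 3;

/// True when the operator should refuse to create another restore.
fn creation_blocked(existing_restores: usize) -> bool {
    existing_restores >= MAX_RESTORES_PER_REPLICA
}

fn main() {
    assert!(!creation_blocked(1)); // steady state
    assert!(!creation_blocked(2)); // transient, around a switchover
    assert!(creation_blocked(3));  // pruning broken: cap blast radius
    assert!(creation_blocked(41)); // the production incident's count
    println!("ok");
}
```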
…restore

Regression for the production bug where a replica with persistent_schemas
configured but all schemas missing on source accumulated stale Active
restores indefinitely (one cluster grew to 41 over a week).

Asserts that after a manual second restore drives a switchover, both
status.schemaMigrationPhase reaches "complete" and the previous restore
is swept once the grace period elapses, leaving exactly one restore.

Phase 3 (too-many-restores guardrail) is not covered end-to-end here:
exercising it would require rotating the kopia snapshot mid-test
(infrastructure not currently in the repo). The guardrail logic is
small and self-contained; deferring its integration test.
Phase 3 (too-many-restores guardrail) shipped end-to-end, but its
integration test is held back: exercising it requires rotating the
kopia snapshot mid-test, which would mean a new fixture. Logic is
small and the new RestoreCreationBlocked condition makes runtime
state self-evident, so deferring rather than blocking the prune fix.
@review-hero

review-hero Bot commented May 8, 2026

🦸 Review Hero Summary
6 agents reviewed this PR | 1 critical | 4 suggestions | 0 nitpicks | Filtering: consensus 3 voters, 1 below threshold

Below consensus threshold (1 unique issue not confirmed by majority):

  src/controllers/replica.rs:1109 · Bugs & Correctness · critical
  Calling mark_schema_migration_complete in the 'first restore' branch sets schemaMigrationPhase = complete, which then trips the short-circuit guard at L1123-1132 on the SECOND restore cycle — c...
Local fix prompt (copy to your coding agent)

Fix these issues identified on the pull request. One commit per issue fixed.


src/controllers/replica.rs:337: After the sweep loop swallows delete errors (lines 331–334, warn! only, no early return), the status patch unconditionally clears schemaMigrationPhase to null. If any delete fails transiently, the surviving stale restore is now permanently un-sweepable: on the next reconcile, migration_complete evaluates to false (phase is null, persistent_schemas is set), so the sweep guard never fires again until an entirely new restore cycle runs mark_schema_migration_complete. This recreates the original accumulation bug under transient API errors. Fix: track whether all deletes succeeded and skip the status patch if any failed, so the next reconcile retries the remaining stale restores.


src/controllers/replica/resources.rs:292: The comment claims 'the count is now below the limit again', but when creation happens with existing.items.len() == MAX_RESTORES_PER_REPLICA - 1 (e.g., 2 restores), the total after creation is exactly MAX_RESTORES_PER_REPLICA (3). The condition is cleared to False/Healthy, but the very next reconcile that attempts to create a restore will immediately re-trip the guardrail and set it back to True. This causes unnecessary condition churn visible to operators. The condition should only be cleared to False when the count drops below MAX_RESTORES_PER_REPLICA, not immediately after creating a restore that brings it to the limit.
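The suggested fix boils down to clearing the condition from the live count rather than from the fact of a successful create; a sketch (should_clear_blocked_condition is a hypothetical helper — the real code patches a Kubernetes status condition):

```rust
const MAX_RESTORES_PER_REPLICA: usize = 3;

/// Clear RestoreCreationBlocked only when the total is strictly below
/// the limit; a create that brings the total to exactly
/// MAX_RESTORES_PER_REPLICA must NOT clear it, or the next reconcile
/// re-trips the guardrail and churns the condition.
fn should_clear_blocked_condition(total_restores: usize) -> bool {
    total_restores < MAX_RESTORES_PER_REPLICA
}

fn main() {
    // 2 existing + 1 created = 3: at the limit, keep the condition set.
    assert!(!should_clear_blocked_condition(2 + 1));
    // 1 existing + 1 created = 2: safely below the limit, clear it.
    assert!(should_clear_blocked_condition(1 + 1));
    println!("ok");
}
```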


src/controllers/replica.rs:307: The comment says 'Refuse to sweep if status.currentRestore doesn't match any live Active restore', but the guard only checks existence (any phase). During a switchover, currentRestore points to the new restore which is in Switching phase — has_matching_current is true. If last_restore_completed_at is from the previous cycle (hours old, past grace period), the sweep fires mid-switchover and deletes the predecessor restore that is still in Active phase and may still be serving traffic. Fix: filter by Active phase too — restore_list.items.iter().any(|r| r.name_any() == current && r.status.as_ref().and_then(|s| s.phase.as_ref()) == Some(&RestorePhase::Active)).


src/controllers/replica.rs:327: Stale restores are deleted sequentially with .await inside a for-loop. In the exact failure mode this PR addresses (41 accumulated restores) that is 41 consecutive round-trips to the API server, and each failed delete blocks the next. Parallelise with futures::future::join_all (or tokio::task::JoinSet) so all deletes are in-flight simultaneously; errors can still be logged per-item after joining.


src/controllers/replica/resources.rs:185: create_restore_for_snapshot issues a full Kubernetes API list call (restores.list(...)) to count existing restores, but the caller in reconcile (replica.rs:146–148) already fetched the identically-labelled restore list moments earlier in the same reconcile pass. This doubles the number of API round-trips on every scheduling cycle. Fix: add an existing_count: usize (or &[PostgresPhysicalRestore]) parameter to create_restore_for_snapshot and pass restore_list.items.len() from the caller, eliminating the redundant fetch.

passcod added 4 commits May 8, 2026 19:28
The first-restore branch in reconcile_schema_migration has no previous
restore to clean up, so the cleanup gate is moot. Marking phase as
complete here would short-circuit the next migration via the
schema_migration_phase == "complete" guard, silently skipping schema
migration on the second restore cycle even when real schemas exist
that need to be carried over.

The empty-config and all-filtered branches still mark complete, since
leaving the phase unset in those branches would block cleanup.
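The resulting per-branch behavior can be modeled as a small match (SkipReason and mark_complete are illustrative names, not the operator's real types):

```rust
// Hypothetical model of the three early-return branches in
// reconcile_schema_migration after this commit.
enum SkipReason {
    FirstRestore,      // no previous restore exists to clean up
    EmptyConfig,       // persistent_schemas is set but empty
    AllSchemasMissing, // every configured schema is absent on source
}

/// Whether the early-return branch should patch
/// status.schemaMigrationPhase = "complete" before returning.
fn mark_complete(reason: &SkipReason) -> bool {
    match reason {
        // The cleanup gate is moot with no previous restore; marking
        // complete here would short-circuit the next cycle's migration.
        SkipReason::FirstRestore => false,
        // Left unset, these branches would block cleanup.
        SkipReason::EmptyConfig | SkipReason::AllSchemasMissing => true,
    }
}

fn main() {
    assert!(!mark_complete(&SkipReason::FirstRestore));
    assert!(mark_complete(&SkipReason::EmptyConfig));
    assert!(mark_complete(&SkipReason::AllSchemasMissing));
    println!("ok");
}
```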
The sweep guard had a comment saying "live Active restore" but the
check itself only matched on name. The current reconcile flow returns
early during switchover so this never fired with a non-Active current
in practice, but the comment/code mismatch was brittle. Tighten the
check to match the comment, costing nothing.
…y fail

Previously the post-sweep status patch unconditionally cleared
schemaMigrationPhase to null. If any individual delete failed
(transient API error, etc.) the surviving stale restore would be
permanently un-sweepable until the next switchover re-marked complete:
migration_complete evaluates to false (phase is null,
persistent_schemas is set), so the sweep guard refuses to retry.

Track per-delete success and skip the status clear unless every
delete succeeded. The next reconcile will then retry the survivors
with the gate still open.
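The decision reduces to an all-succeeded check over the per-delete results; a sketch (should_clear_phase is a hypothetical helper name):

```rust
/// Clear schemaMigrationPhase only when every delete succeeded, so a
/// partial failure leaves the gate open and the next reconcile retries
/// the surviving stale restores.
fn should_clear_phase(delete_results: &[Result<(), String>]) -> bool {
    delete_results.iter().all(|r| r.is_ok())
}

fn main() {
    assert!(should_clear_phase(&[Ok(()), Ok(())]));
    assert!(!should_clear_phase(&[Ok(()), Err("transient API error".into())]));
    println!("ok");
}
```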
AGENTS.md requires imports to be merged and grouped by std,
third-party, then local. The Phase 3 commit added ListParams and warn
as separate import statements; merge them into the existing kube and
tracing blocks.
@passcod passcod merged commit 9439466 into main May 8, 2026
13 checks passed
@passcod passcod deleted the restore-prune-fix branch May 8, 2026 08:02