feat(replica): refuse to create new restore when too many already exist #31
Conversation
When persistent_schemas is set in the spec but migration is skipped (first restore, empty config, or all configured schemas missing on source), the early-return Ok(true) branches never patched status.schemaMigrationPhase. The cleanup gate in reconcile_replica requires schemaMigrationPhase == "complete" before deleting previousRestore, so those replicas accumulated stale Active restores indefinitely.
…ter cleanup
Previously the cleanup branch in reconcile_replica only deleted the single restore named in status.previousRestore. That field is overwritten on every switchover without first ensuring the prior value was cleaned up, so any combination of (a) cleanup blocked by the migration_complete gate or (b) two switchovers within one grace period orphaned restores forever. Replace with a sweep over all Active restores != currentRestore. The gating semantics are unchanged (migration_complete + elapsed > switchover_grace_period). Refuse to sweep when no live Active matches status.currentRestore, since that indicates an inconsistent state better resolved by another path.
If a replica accumulates 3 or more PostgresPhysicalRestore objects (MAX_RESTORES_PER_REPLICA), the operator now refuses to create another one. It logs a warning, sets a RestoreCreationBlocked status condition visible in kubectl describe, and emits a Warning event. The condition is cleared on the next successful create. Steady state is 1 restore per replica (transiently 2 around a switchover); 3 already indicates pruning has stopped working. This is a backstop against repeats of the bug fixed in the prior commits — it caps blast radius in degenerate states without preventing the operator from recovering once the underlying issue is resolved.
…restore
Regression test for the production bug where a replica with persistent_schemas configured but all schemas missing on source accumulated stale Active restores indefinitely (one cluster grew to 41 over a week). Asserts that after a manual second restore drives a switchover, status.schemaMigrationPhase reaches "complete" and the previous restore is swept once the grace period elapses, leaving exactly one restore. Phase 3 (the too-many-restores guardrail) is not covered end-to-end here: exercising it would require rotating the kopia snapshot mid-test, infrastructure not currently in the repo. The guardrail logic is small and self-contained, so its integration test is deferred.
Phase 3 (too-many-restores guardrail) shipped end-to-end, but its integration test is held back: exercising it requires rotating the kopia snapshot mid-test, which would mean a new fixture. Logic is small and the new RestoreCreationBlocked condition makes runtime state self-evident, so deferring rather than blocking the prune fix.
🦸 Review Hero Summary: below consensus threshold (1 unique issue not confirmed by majority)
The first-restore branch in reconcile_schema_migration has no previous restore to clean up, so the cleanup gate is moot. Marking the phase as complete here would short-circuit the next migration via the schema_migration_phase == "complete" guard, silently skipping schema migration on the second restore cycle even when real schemas exist that need to be carried over. The empty-config and all-filtered branches should still mark complete, since leaving the phase unset there does block cleanup.
The sweep guard had a comment saying "live Active restore" but the check itself only matched on name. The current reconcile flow returns early during switchover, so in practice this never fired with a non-Active current restore, but the comment/code mismatch was brittle. Tightening the check to match the comment costs nothing.
…y fail
Previously the post-sweep status patch unconditionally cleared schemaMigrationPhase to null. If any individual delete failed (a transient API error, say), the surviving stale restore became permanently un-sweepable until the next switchover re-marked the phase complete: migration_complete evaluates to false (phase is null while persistent_schemas is set), so the sweep guard refuses to retry. Track per-delete success and skip the status clear unless every delete succeeded. The next reconcile then retries the survivors with the gate still open.
AGENTS.md requires imports to be merged and grouped by std, third-party, then local. The Phase 3 commit added ListParams and warn as separate import statements; merge them into the existing kube and tracing blocks.
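Assuming the std / third-party / local grouping AGENTS.md prescribes, the merged form would look roughly like the following (items other than ListParams and warn are illustrative, including the local crate path):

```rust
use std::collections::BTreeMap;

use kube::api::{Api, ListParams, Patch};
use tracing::{info, warn};

use crate::restore::PostgresPhysicalRestore; // hypothetical local path
```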
and don't consider a migration incomplete when there's nothing to migrate