Fix CRR PK retrofit corruption, sync schema skew, and failed brain connect wipe#367
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
9249b16 to
c426318
Compare
|
@copilot review but do not make fixes |
Reviewed
No blocking issues found in this pass, and the targeted tests passed (with CR-SQLite-gated tests skipped as expected). |
c426318 to
5deafde
Compare
|
@copilot review but do not make fixes |
5deafde to
6b3f3c4
Compare
|
@copilot review but do not make fixes |
6b3f3c4 to
ce6457b
Compare
|
@copilot review but do not make fixes |
…n connect wipe - Skip primary-key retrofit for tables with __crsql_clock companions (same guard as FK retrofit) - Reject unknown sync tables instead of silently skipping; use applyChanges appliedCount in ACKs - Defer device registry clear until connectToBrain succeeds so failed connects keep local pairing data Co-authored-by: Arul Sharma <arul28@users.noreply.github.com>
ce6457b to
b003b82
Compare
|
@copilot review but do not make fixes |
There was a problem hiding this comment.
PR Review
Scope: 7 file(s), +158 / −19
Verdict: Looks good
This PR hardens CRDT sync and local DB migration: it skips wholesale PK retrofit on CRR-managed tables, reports honest appliedCount in changeset ACKs, rejects unknown inbound sync tables (while still ignoring retired unified_memories* tombstones), and defers clearing the device registry until connectToBrain succeeds. The changes are tightly scoped with matching tests and correct transactional rollback in applyChanges.
Notes
- Good patterns:
applyChangeswraps the loop inbegin/rollback/commit, so a mid-batchunknown_sync_tableerror cannot leave a partially applied batch (kvDb.ts:L3048-L3079). Host and peer services already map apply failures took: falseACKs with retry/disconnect handling. - Operational tradeoff: Rejecting unknown sync tables (instead of silently skipping) will hard-fail sync when peers run mismatched schema versions until all devices upgrade. That is stricter than before but fixes the prior ACK skew where
changes.lengthcould over-report applied rows. - Could not verify locally: Vitest could not run in this environment (
node_modulesnot installed), so test pass/fail was not executed here.
Sent by Cursor Automation: BUGBOT in Versic
| } | ||
| }); | ||
|
|
||
| it.skipIf(!isCrsqliteAvailable())("skips primary-key retrofit for tables that already have __crsql_clock companions", async () => { | ||
| const dbPath = makeDbPath("ade-kvdb-pk-retrofit-skip-crr-"); | ||
| const first = await openKvDb(dbPath, createLogger()); | ||
| first.run("create unique index if not exists temp_ade_pk_retrofit_probe on lanes(project_id, name)"); | ||
| first.close(); | ||
|
|
||
| const reopened = await openKvDb(dbPath, createLogger()); | ||
| try { | ||
| expect( | ||
| reopened.get<{ name: string }>( | ||
| "select name from sqlite_master where type = 'table' and name = 'lanes__crsql_clock' limit 1", | ||
| )?.name, | ||
| ).toBe("lanes__crsql_clock"); | ||
| reopened.run("drop index if exists temp_ade_pk_retrofit_probe"); | ||
| } finally { | ||
| reopened.close(); | ||
| } | ||
| }); | ||
|
|
||
| it("coalesces duplicate lane_linear_issue_links rows during migrate", async () => { |
There was a problem hiding this comment.
Test may not exercise the actual PK-retrofit path for
lanes
The probe index (temp_ade_pk_retrofit_probe) is dropped during the table-rebuild in the old (unfixed) code path, so its presence after reopen could serve as a witness. However, retrofitLegacyPrimaryKeyNotNullSchema only attempts a rebuild when a PK column in the stored CREATE TABLE SQL is missing the NOT NULL keyword. A fresh openKvDb produces a fully-constrained lanes schema, so the retrofit branch is never entered for lanes at all — the assertion that lanes__crsql_clock still exists would pass trivially even without the if (rawHasTable(db, ...)) continue; guard. To actually cover the regression, the test should inject a legacy schema for lanes (strip NOT NULL from the stored DDL, or use a fixture DB created with the old migration) so the migration genuinely reaches the guarded skip.
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/desktop/src/main/services/state/kvDb.migrations.test.ts
Line: 157-179
Comment:
**Test may not exercise the actual PK-retrofit path for `lanes`**
The probe index (`temp_ade_pk_retrofit_probe`) is dropped during the table-rebuild in the old (unfixed) code path, so its presence after reopen could serve as a witness. However, `retrofitLegacyPrimaryKeyNotNullSchema` only attempts a rebuild when a PK column in the stored `CREATE TABLE` SQL is missing the `NOT NULL` keyword. A fresh `openKvDb` produces a fully-constrained `lanes` schema, so the retrofit branch is never entered for `lanes` at all — the assertion that `lanes__crsql_clock` still exists would pass trivially even without the `if (rawHasTable(db, ...)) continue;` guard. To actually cover the regression, the test should inject a legacy schema for `lanes` (strip `NOT NULL` from the stored DDL, or use a fixture DB created with the old migration) so the migration genuinely reaches the guarded skip.
How can I resolve this? If you propose a fix, please make it concise.

Bug and impact
CRR primary-key retrofit corrupts replicated tables (data corruption)
retrofitLegacyPrimaryKeyNotNullSchemacouldDROP TABLE+ rebuild CRR-managed tables (e.g.lanes) withoutcrsql_begin_alter/rebuildCrrTableWithBackfill. The FK retrofit pass already skipped tables with__crsql_clockcompanions; the PK pass did not. Legacy DBs needing PK repair on open could break CRDT clocks and cause desktop/iOS divergence.Unknown-table changesets ACKed without applying (sync data loss)
applyChangessilently skipped rows for tables missing from the local schema, butsyncHostService/syncPeerServiceACKedappliedCount = changes.lengthand advanced the sync cursor. A viewer on an older build receiving batches that include newer tables would permanently skip those rows after upgrading.Failed
connectToBrainwiped local device registry (data loss)clearClusterRegistryForViewerJoin()ran before the WebSocket handshake. On auth/timeout failure, localdevices/sync_cluster_staterows were already deleted and not restored.Root cause
kvDb.ts.applyChangesreturn value; implementation contradictedcrdt-model.md(unknown tables must be rejected, not skipped-and-ACKed).Fix
${table}__crsql_clockexists.unknown_sync_table:<name>for non-retired missing tables; report realappliedCountin changeset ACKs.syncPeerService.connect().Validation
npx vitest run src/main/services/state/kvDb.migrations.test.tskvDb.sync.test.ts(run when cr-sqlite is available on macOS CI)Does not overlap open draft PRs #363–#366 (CRR alter leak, orchestration approval/manifest, viewer join DELETE broadcast).
Greptile Summary
This PR fixes three distinct data-integrity bugs: the CRR PK retrofit could DROP and rebuild cr-sqlite-managed tables,
applyChangessilently advanced the sync cursor for unknown tables, andclearClusterRegistryForViewerJoin()ran before the WebSocket handshake so a failedconnectToBrainwiped the device registry.kvDb.ts: Adds a__crsql_clockcompanion check to skip CRR tables in the PK retrofit pass, and replaces the silent-skip on unknown tables with a thrownunknown_sync_table:<name>error (retired tables continue to be ignored viaSYNC_RETIRED_TABLES).syncHostService.ts/syncPeerService.ts: Both changeset-batch ACK paths now reportapplyResult.appliedCountinstead of echoingchanges.length, so the remote cursor only advances for rows that were actually written.syncService.ts:clearClusterRegistryForViewerJoin()is moved to after a successfulsyncPeerService.connect(), with a new integration test covering the failure path.Confidence Score: 3/5
The three core fixes are logically correct, but the connectToBrain reordering introduces a new failure mode: if anything after connect() throws, the catch block clears the saved draft without closing the live WebSocket, leaving the device syncing silently while the caller believes the connection failed.
The ACK-count and unknown-table fixes are straightforward and well-tested. The registry-wipe reordering solves the stated bug cleanly for the happy path. However, the catch block in connectToBrain does not call syncPeerService.disconnect(), so any exception thrown by clearClusterRegistryForViewerJoin(), touchLocalDevice(), or flushLocalChanges() leaves an open WebSocket connection processing changesets in the background while the UI shows a failed-connect state. The migration regression test also may not reach the guarded code path on a fresh schema.
apps/ade-cli/src/services/sync/syncService.ts (missing disconnect in catch) and apps/desktop/src/main/services/state/kvDb.migrations.test.ts (test fixture may not exercise the retrofit branch)
Important Files Changed
Sequence Diagram
sequenceDiagram participant UI participant SyncService participant PeerService participant DeviceRegistry participant Brain UI->>SyncService: connectToBrain(draft) SyncService->>PeerService: connect(draft) PeerService->>Brain: WebSocket open + hello Brain-->>PeerService: hello_ok (handshake complete) PeerService-->>SyncService: connect() resolves Note over SyncService,DeviceRegistry: Registry cleared AFTER successful connect (fix) SyncService->>DeviceRegistry: clearClusterRegistryForViewerJoin() SyncService->>PeerService: flushLocalChanges() Brain->>PeerService: changeset_batch alt all tables known PeerService->>PeerService: applyChanges() to appliedCount PeerService-->>Brain: "changeset_ack(ok=true, appliedCount)" else unknown non-retired table PeerService->>PeerService: applyChanges() throws unknown_sync_table PeerService-->>Brain: "changeset_ack(ok=false, appliedCount=0)" endComments Outside Diff (2)
apps/ade-cli/src/services/sync/syncService.ts, line 1023-1036 (link)syncPeerService.connect(draft)opens the WebSocket and completes the handshake before returning. If any of the subsequent calls —clearClusterRegistryForViewerJoin(),touchLocalDevice(), orflushLocalChanges()— throw, the catch block clears the saved draft viasetSavedDraft(null)but never callssyncPeerService.disconnect(). The live WebSocket connection remains open and keeps receiving changesets from the brain, while the caller receives an error indicating that the connection failed. The mismatch leaves the device syncing silently in the background in a state the UI believes is disconnected.Prompt To Fix With AI
apps/desktop/src/main/services/state/kvDb.ts"unified_memories_fts"is already covered by thestartsWith("unified_memories_")guard below, making its entry inSYNC_RETIRED_TABLESredundant. The constant only needs to hold names that would NOT match thestartsWithpredicate — currently only"unified_memories"itself.Prompt To Fix With AI
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "Fix CRR PK retrofit corruption, sync sch..." | Re-trigger Greptile