feat(kvs-lance): gridlake step-2 — adaptive batching, WAL/ACID recovery, per-row seq (savant-reviewed) by AdaWorldAPI · Pull Request #30 · AdaWorldAPI/surrealdb

AdaWorldAPI · 2026-05-30T08:47:37Z

What & why

Gridlake step 2, stacked on #29 (base = claude/kvs-lance-timeline, so this PR shows only the step-2 delta; it auto-retargets to main when #29 merges). Goal: align SurrealDB writes with Lance so the kv-lance backend gets RocksDB-grade WAL/ACID and ClickHouse-grade write batching while staying faithful to Lance's columnar/versioned model. Full architecture in .claude/lance-backend/GRIDLAKE.md (800 lines, status-tagged).

Changes (additive, stable-only, no new deps)

Phase	What
P1 — ClickHouse-parity adaptive batching	`FlusherConfig` += `max_pending_bytes` (8 MiB) and `min_flush_interval` (rate floor); the flusher triggers on rows ∣ bytes ∣ tick, gated by the floor so it can't out-run compaction (the "too many parts/versions" discipline). Trigger logic extracted into a pure `should_flush()`. `Memtable::pending_bytes()` added.
P2 — WAL/ACID	`lsm_recovery_atomic_multi_op_batch`: a single multi-op transaction (2 inserts + 1 delete = one `WalRecord`) replays all-or-nothing across a crash sim.
P3 — per-row `seq` column	`seq: UInt64` added to the Lance schema + both batch builders; a `Datastore` `AtomicU64` minted once per transaction and stamped per row through the memtable + flusher. Decouples per-commit replay identity from the batch-granular `version` column.
Tooling	`.rustfmt.toml` made stable-honest (nightly-only opts that no gate enforced are commented out, per the org's 99%-stable policy).

Quality gate: multi-savant review caught a BLOCKER the tests didn't

The agent-authored code passed 96/0 tests, but a 3-savant read-only review (concurrency/ACID, Lance data-model, Rust idiom/tests) found a real BLOCKER single-session tests structurally can't see:

commit_seq reset to 0 on every Datastore::new and was never seeded from the max seq already persisted in Lance — so after any restart where the WAL was truncated, new commits re-minted seqs that collided with / regressed below flushed rows, defeating the column's whole purpose (a cross-restart monotonic, per-commit replay axis).

Fix: Datastore::max_persisted_seq (a tolerant seq-column scan; 0 for empty/legacy datasets) seeds commit_seq at open; replayed WAL records mint above that floor. The distinction that makes this correct: generation is memtable-local and unpersisted (need only clear the WAL tail), but seq is a Lance column and must seed from the persisted max. New regression seq_survives_restart_above_persisted_max fails on the old code, passes on the fix.

Also fixed from the review: debug_assert_eq! length guards → real ArrowError checks (enforced in release); a deterministic flusher_tick_interval knob replacing the timing-fragile coalescing test; NITs (>= symmetry, floor-gating doc, stale doc).

Verification (orchestrator-run; sole cargo runner)

cargo check -p surrealdb-core --features kv-lance --tests → Finished, 0 errors
cargo test -p surrealdb-core --features kv-lance --lib kvs::lance → 98 passed, 0 failed, 3 ignored (incl. the restart-regression proving the BLOCKER fix)

Documented limitations (accepted; pre-release)

Recorded in .claude/board/EPIPHANIES.md + the review log .claude/board/GRIDLAKE_REVIEW.md:

No schema migration for a pre-seq (4-column) on-disk dataset (fresh datasets only today).
The WAL carries no persisted seq, so replay renumbers seqs above the persisted max (monotonic + unique) rather than recovering exact pre-crash values.
LegacyCommitGate stamps seq = version (true per-commit fidelity is an LSM-path property).
seq/generation mint from two atomics, so under concurrency seq order can disagree with commit order (harmless today; reads never consult seq).

Scope / non-goals

Step 2 only. Not here: schema-migration/backfill for seq, persisting seq in the WAL, per-commit seq on the gate path, the columnar/SoA memtable (Phase 3 in GRIDLAKE.md), and compaction-GC of tombstones (Phase 4).

https://claude.ai/code/session_01R9AWgFa65uPnLyS2my2d2R

Generated by Claude Code

The .rustfmt.toml enabled nightly-only options (wrap_comments, imports_granularity, group_imports, comment_width) while the build toolchain is pinned stable 1.95 and the fmt job that would apply them (pinned rust-toolchain.nightly via ci.yml) is not running — GitHub Actions is not enabled on this fork (PR #29 head has zero check runs). So those options were enforced by nothing yet made stable `cargo fmt` warn and reformat divergently. Comment them out so stable `cargo fmt` (the 1.95 pin, matching org policy of 99% stable / nightly only for Miri) is the single authoritative formatter. Re-enable as a block if a nightly-fmt CI job is ever turned on and the tree is normalized under it in one dedicated pass. Findings (no CI on the fork; rustfmt split-brain) recorded in .claude/board/EPIPHANIES.md. https://claude.ai/code/session_01R9AWgFa65uPnLyS2my2d2R

… + SoA roadmap Design doc for aligning SurrealDB writes with the Lance columnar/versioned store: the convergent ingest→migrate→compact pattern and its 5 invariants; a pattern→code mapping (WAL+memtable+flusher == RocksDB; CommitGate == write-group leader; merge_insert == SST/part build; tombstone rows == ClickHouse _row_exists mask / Lance deletion vectors; Lance version == seqno/manifest; background_optimizer == compaction); the ACID story pinned property-by-property to its enforcement point; ClickHouse parity and where it breaks down; the per-row seq keystone (decoupling replay fidelity from physical batching); the SoA/gridlake transcode-free flush; compaction GC; and a status-tagged phased roadmap. Forward-looking items carry explicit CONJECTURE/ROADMAP markers so design is never confused with as-built. https://claude.ai/code/session_01R9AWgFa65uPnLyS2my2d2R

… pending] Checkpoint of agent-authored Phase-1 code: ClickHouse-parity adaptive flush triggers (rows | bytes | tick) with a rate floor so the flusher cannot out-run compaction. - FlusherConfig += { max_pending_bytes (8 MiB), min_flush_interval (50ms) } - Memtable::pending_bytes() (sums key+val; tombstone counts key only) - flusher_loop trigger extracted into a pure `should_flush`, gated by a web_time::Instant rate floor - tests: pending_bytes_*, should_flush_* , config defaults Agent-reported green (91 kvs::lance tests pass) but NOT yet independently verified: this is the "unverified code" stage of the review pipeline. Savant review + orchestrator cargo-verification + marker removal happen before the PR is opened. DO NOT MERGE as-is. https://claude.ai/code/session_01R9AWgFa65uPnLyS2my2d2R

…, per-row seq column P1 ClickHouse-parity adaptive batching: FlusherConfig += {max_pending_bytes (8 MiB), min_flush_interval (50 ms)}; Memtable::pending_bytes(); the flusher triggers on rows | bytes | periodic-tick, gated by a web_time::Instant rate floor (shutdown final-drain bypasses it); trigger logic extracted into a pure should_flush() for testability. P2 WAL/ACID: lsm_recovery_atomic_multi_op_batch proves a single multi-op transaction (2 inserts + 1 delete = one WalRecord) replays all-or-nothing across a Box::leak crash simulation + reopen. P3 per-row commit-sequence column: seq:UInt64 added to the Lance schema, to build_write_batch_lance and build_tombstone_batch_lance (which now take a parallel &[u64] seqs slice); a Datastore-level commit_seq AtomicU64 is minted once per transaction and stamped per row through the memtable + flusher; WAL replay reassigns fresh seqs (on-disk format unchanged); the legacy CommitGate stamps seq = version. This decouples per-commit replay identity from the batch-granular `version` column. Verification: orchestrator independently ran `cargo test -p surrealdb-core --features kv-lance --lib kvs::lance` => 96 passed, 0 failed, 3 ignored (confirms the build agent's result). Multi-savant logic review (concurrency/ACID, Lance data-model, Rust idiom & tests) is in progress; any findings land as follow-up fix commits before this branch is proposed for merge. Additive only; no new dependencies; stable Rust; no WAL on-disk format change. https://claude.ai/code/session_01R9AWgFa65uPnLyS2my2d2R

…d max (BLOCKER) A 3-savant read-only review of the step-2 code (which passed 96/0 tests) caught a BLOCKER that single-session tests can't see: `commit_seq` was re-initialised to 0 on every `Datastore::new` and never advanced past the max `seq` already persisted in Lance, so after a restart (WAL truncated) new commits re-minted seqs that collided with / regressed below flushed rows — defeating the `seq` column's purpose as a cross-restart monotonic per-commit replay axis. `generation` is memtable-local and unpersisted (so it need only clear the WAL tail); `seq` is a Lance column and must seed from the persisted max. Fixes applied: - BLOCKER: add `Datastore::max_persisted_seq` (tolerant `seq`-column scan; 0 for empty/legacy datasets) and seed `commit_seq` from it at open; replayed WAL records mint above that floor. - Upgrade the `debug_assert_eq!` seqs/rows length guards in both batch builders to real `ArrowError` checks (enforced in release too). - Add a `flusher_tick_interval` LanceConfig knob and use it to make the seq-coalescing test deterministic (was timing-fragile on slow disks). - NITs: align the re-loop nudge to `>=`; correct the flusher floor-gating module doc; refresh the stale `KvSchema::build_write_batch` doc. - Tests: +`seq_survives_restart_above_persisted_max` (fails on the old code, proves the fix), +`seq_column_gate_path_equals_version`, +`scan_versions_for_tests` helper. Documented (accepted, pre-release) limitations the savants surfaced are recorded in .claude/board/EPIPHANIES.md (pre-`seq` schema migration; WAL reseq on replay; gate-path seq=version; seq/generation two-atomic ordering). Verified by the orchestrator (sole cargo runner): `cargo check --features kv-lance --tests` clean; `cargo test --features kv-lance --lib kvs::lance` → 98 passed, 0 failed, 3 ignored. https://claude.ai/code/session_01R9AWgFa65uPnLyS2my2d2R

coderabbitai · 2026-05-30T08:47:46Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 392393c7-95cc-4d21-b2c7-d2daa21c8dca

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch claude/sleepy-cori-aRK2x

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e329a7a7c3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

) `max_persisted_seq` treated a dataset whose schema lacks the `seq` column as success (returning 0), but the write paths now always build 5-column batches, so the first flush/merge against such a 4-column dataset would hit an opaque Lance schema mismatch. Return a clear migration error instead — exactly the "fail with a clear migration error before allowing commits" Codex suggested. A fresh dataset created by this code always carries the 5-column schema, so this only fires for genuinely legacy (pre-release) data; all 4 seq tests pass. Also records the Cognitive-RISC substrate↔invariant mapping (and the N1 trap: do not add class_id to the kv-lance schema — it stays policy-free) in .claude/board/EPIPHANIES.md. https://claude.ai/code/session_01R9AWgFa65uPnLyS2my2d2R

claude added 5 commits May 30, 2026 07:36

chatgpt-codex-connector Bot reviewed May 30, 2026

View reviewed changes

Comment thread surrealdb/core/src/kvs/lance/mod.rs

AdaWorldAPI merged commit bf95ad5 into claude/kvs-lance-timeline May 30, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(kvs-lance): gridlake step-2 — adaptive batching, WAL/ACID recovery, per-row seq (savant-reviewed)#30

feat(kvs-lance): gridlake step-2 — adaptive batching, WAL/ACID recovery, per-row seq (savant-reviewed)#30
AdaWorldAPI merged 5 commits into
claude/kvs-lance-timelinefrom
claude/sleepy-cori-aRK2x

AdaWorldAPI commented May 30, 2026

Uh oh!

coderabbitai Bot commented May 30, 2026

Review skipped

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AdaWorldAPI commented May 30, 2026

What & why

Changes (additive, stable-only, no new deps)

Quality gate: multi-savant review caught a BLOCKER the tests didn't

Verification (orchestrator-run; sole cargo runner)

Documented limitations (accepted; pre-release)

Scope / non-goals

Uh oh!

coderabbitai Bot commented May 30, 2026

Review skipped

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants