
fix: fuzz-test accuracy and infrastructure improvements #14

Merged
rophy merged 10 commits into master from chore/fuzz-test-fixes on Mar 29, 2026

Conversation

@rophy (Owner) commented Mar 29, 2026

Summary

Fixes and improvements to the fuzz-test framework for OLR data accuracy validation on Oracle RAC.

Bug fixes:

  • Fix seq map trimming that caused event overwrites in Kafka consumer — trimming evicted entries for rows whose INSERT and later UPDATE/DELETE arrived far apart, causing INSERT OR REPLACE to overwrite the INSERT
  • Fix PL/SQL UPDATE procedures using global instead of local variables for SQL context
  • Fix fuzz-test OLR startup detection — search full log instead of tail -5, which missed the startup message with verbose trace output
  • Make event_id immutable (assigned once at INSERT, unchanged by UPDATE/DELETE) for reliable per-row tracking
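
The seq-map trimming bug is easiest to see with SQLite's INSERT OR REPLACE semantics. A minimal Python sketch (the table and column names are illustrative, not the actual consumer schema): when trimming evicts a row's entry from the in-memory seq map, the next operation on that row restarts at seq 0, and INSERT OR REPLACE on the (event_id, seq) key silently replaces the original INSERT:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id TEXT, seq INTEGER, op TEXT,"
    " PRIMARY KEY (event_id, seq))"
)

seq_map = {}  # event_id -> next seq, as the consumer tracks it in memory

def record(event_id, op):
    seq = seq_map.get(event_id, 0)
    seq_map[event_id] = seq + 1
    conn.execute("INSERT OR REPLACE INTO events VALUES (?, ?, ?)",
                 (event_id, seq, op))

record("E1", "INSERT")   # stored as (E1, 0, INSERT)
seq_map.clear()          # trimming evicts E1's entry between distant ops
record("E1", "UPDATE")   # seq restarts at 0 -> overwrites the INSERT row

rows = conn.execute(
    "SELECT seq, op FROM events WHERE event_id='E1'").fetchall()
```

After the eviction, only the UPDATE survives in `rows`; the INSERT event is gone, which is exactly the overwrite the fix removes by no longer trimming the maps.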

Validator improvements:

  • Per-node frontier validation to eliminate false positives from RAC node interleaving
  • Per-seq comparison for multi-event rows (LOB splits)
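
The per-seq comparison can be sketched as follows (hypothetical helper, not the validator's actual code; the real validator reads SQLite and normalizes values first). Instead of merging all events for an event_id, it walks the union of seqs seen on either side:

```python
def validate_event(event_id, lm_by_seq, olr_by_seq):
    """Compare LM and OLR rows per (event_id, seq) instead of per event_id.

    lm_by_seq / olr_by_seq map seq -> normalized row dict for one event_id.
    Returns lists of (event_id, seq) keys per outcome.
    """
    matches, mismatches, missing_olr, missing_lm = [], [], [], []
    for seq in sorted(set(lm_by_seq) | set(olr_by_seq)):
        lm_r, olr_r = lm_by_seq.get(seq), olr_by_seq.get(seq)
        key = (event_id, seq)
        if lm_r and not olr_r:
            missing_olr.append(key)   # LM saw this op, OLR did not
        elif olr_r and not lm_r:
            missing_lm.append(key)    # OLR saw an extra op
        elif lm_r == olr_r:
            matches.append(key)
        else:
            mismatches.append(key)
    return matches, mismatches, missing_olr, missing_lm

m, mm, mo, ml = validate_event(
    "E1",
    {0: {"c": "a"}, 1: {"c": "b"}},   # LM: insert + update
    {0: {"c": "a"}},                   # OLR missed the update
)
```

A merged event-level comparison would have collapsed the two LM seqs into one row and reported nothing useful; the per-seq walk pinpoints the missing update.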

Infrastructure:

  • 24h Kafka retention for post-mortem investigation
  • Rename Dockerfile.consumer to Dockerfile
  • Revert non-root consumer (incompatible with Docker named volumes)
  • Clearer status messages in fuzz-test.sh
  • Add investigation policy to AGENTS.md

Test Results (30-minute run)

Test plan

  • 5-minute fuzz test: 0 non-LOB mismatches
  • 30-minute fuzz test: 0 non-LOB mismatches on 177K events
  • Redo-log regression: 161/162 pass (1 pre-existing failure)

Summary by CodeRabbit

  • Documentation

    • Added Investigation Policy to development guidelines, requiring evidence-based root cause analysis in investigations.
  • Tests

    • Updated Kafka retention configuration and service definitions in test environment setup.
    • Enhanced test validation and monitoring logic for improved diagnostic accuracy.

rophy added 9 commits March 28, 2026 15:14

event_id is now assigned once at INSERT and never changed by UPDATE/DELETE.
The consumer increments seq naturally for subsequent ops on the same row.
Validator compares per (event_id, seq) pair instead of just distinct event_ids.

- event_id assigned once at INSERT, never changed by UPDATE/DELETE
- Consumer increments seq for subsequent ops on same row
- Validator compares per (event_id, seq) pair
- Dockerfile runs as non-root app user
- Kafka retention increased to 24h for investigation
- Validator shares the consumer's Dockerfile for a consistent user

Search the full OLR log for the startup message instead of tail -5, which
missed it when DUMP trace output was enabled.
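
The startup-detection commit amounts to scanning the whole log rather than its last few lines. A hedged Python sketch (the marker string is illustrative, not OLR's actual startup message):

```python
def olr_started(log_text, marker="replication started"):
    """Contrast the old tail-5 check with a full-log search."""
    lines = log_text.splitlines()
    # Old approach: only look at the last 5 lines -- misses the marker
    # when verbose DUMP trace output follows it.
    old = any(marker in line for line in lines[-5:])
    # Fixed approach: search the full log.
    new = any(marker in line for line in lines)
    return old, new

# Startup message followed by 100 lines of trace output pushes the
# marker out of the tail window.
log = "replication started\n" + "\n".join(f"DUMP trace {i}" for i in range(100))
old, new = olr_started(log)
```

With trace output enabled, the tail-window check reports not-started while the full-log search finds the marker.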
@coderabbitai bot commented Mar 29, 2026

📝 Walkthrough

Walkthrough

Refactors the fuzz testing framework: per-sequence validation replaces per-event merging; sequence maps no longer trimmed; workload SQL stops updating event_id; test scripts and container configs adjusted; and an "Investigation Policy" was added to AGENTS.md. (49 words)

Changes

  • Policy & Documentation (AGENTS.md): Added "Investigation Policy" requiring evidence-backed statements and limiting assertions to what logs/code/data show.
  • Container & Compose (tests/dbz-twin/rac/docker-compose-fuzz.yaml): Changed Kafka retention from 1h to 24h; consumer build uses Dockerfile; validator now built from the local Dockerfile as image fuzz-consumer:latest.
  • Test Orchestration (tests/dbz-twin/rac/fuzz-test.sh): Quoted the OLR readiness grep; switched monitoring from validator logs to consumer logs; improved the completion summary and drain-wait idle detection by comparing the last consumer status line.
  • Sequence Tracking (tests/dbz-twin/rac/kafka-consumer.py): Removed periodic trimming of the lm_seq/olr_seq maps so per-event_id sequence entries persist for the process lifetime.
  • Workload SQL (tests/dbz-twin/rac/perf/fuzz-workload.sql): UPDATE procedures no longer modify event_id; introduced local variables to compute non-event_id values before the UPDATE.
  • Validation Logic (tests/dbz-twin/rac/validator.py): Replaced merged event-level comparison with per-(event_id, seq) mapping and iteration; diagnostics and match counts are emitted per (eid, seq), with values compared after normalization.

Sequence Diagram(s)

sequenceDiagram
    participant Workload as SQL Workload
    participant Kafka as Kafka Broker
    participant Consumer as Kafka Consumer
    participant LM as LM Tracker
    participant OLR as OLR Tracker
    participant Validator as Validator

    Workload->>Kafka: Emit events (preserve event_id)
    Kafka->>Consumer: Deliver messages
    Consumer->>LM: Record (event_id, seq) -> lm_seq
    Consumer->>OLR: Record (event_id, seq) -> olr_seq
    Note over LM,OLR: No periodic trimming (retain all sequences)

    Validator->>LM: Query seqs for event_id
    Validator->>OLR: Query seqs for event_id
    loop for each (event_id, seq)
        Validator->>Validator: Normalize LM row for (eid,seq)
        Validator->>Validator: Normalize OLR row for (eid,seq)
        Validator->>Validator: compare_values -> match/mismatch/diagnostic
    end
    Validator->>Validator: Emit per-seq counts & diagnostics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped through sequences, small and neat,
Per‑seq I checked each heartbeat's beat,
Trimming gone, the maps hold true,
Validator sings each mismatch too,
Carrots for tests — a clean, bold feat! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately reflects the main changes: bug fixes to the fuzz-test framework and supporting infrastructure improvements across multiple components (consumer, validator, shell scripts, database procedures, and config files).



@coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/dbz-twin/rac/perf/fuzz-workload.sql (1)

427-445: ⚠️ Potential issue | 🟠 Major

Seeded rows now stay invisible to CDC validation.

run() seeds FUZZ_SCALAR, FUZZ_LOB, and FUZZ_PART with event_id='SEED' and immediately adds those IDs to the tracked pools at Lines 731-752 in this file. After these UPDATE paths stopped rewriting event_id, any UPDATE that hits one of those seeded rows keeps SEED, and tests/dbz-twin/rac/kafka-consumer.py drops SEED events at Lines 89-90. Early UPDATE traffic on seeded rows—and any later DELETE on those rows—never reaches SQLite or the validator. Please either keep seeded IDs out of the tracked pools or assign them a real event_id before they become UPDATE/DELETE targets.

Also applies to: 463-474, 569-579

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/dbz-twin/rac/perf/fuzz-workload.sql` around lines 427 - 445, Seeded
rows are being added to the tracked ID pools in run(), which causes later
UPDATE/DELETE procedures (e.g., PROCEDURE do_update_scalar using pick_tracked_id
and g_scalar_ids/g_scalar_id_cnt) to hit rows with event_id='SEED' that the
Kafka consumer drops; either stop adding seeded IDs to the tracked pools or give
seeded rows a real event_id before they are added: change run() to exclude
SEED-inserted IDs from g_scalar_ids/g_lob_ids/g_part_ids (or add only after
updating event_id), or immediately update seeded rows' event_id to a non-'SEED'
value before any code pushes their IDs into the tracked arrays so that
pick_tracked_id in do_update_scalar / do_update_lob / do_update_part will never
return SEED rows.
tests/dbz-twin/rac/validator.py (1)

249-264: ⚠️ Potential issue | 🟠 Major

The frontier still skips later seqs for already-validated event_ids.

The new (event_id, seq) comparison only fixes the inner match. The outer scan at Lines 200-208 still selects event_id > cursor_by_node, and Lines 333-335 advance that cursor to the max event_id. With immutable event_ids from tests/dbz-twin/rac/perf/fuzz-workload.sql, later UPDATE/DELETEs append new seqs to older event_ids, so those rows are never revisited after their initial insert was validated. Long update-heavy stretches can therefore hit the idle path with unvalidated events still in SQLite. Track validation progress per (event_id, seq) or keep revisiting seen event_ids until both sides' seq frontiers stop moving.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/dbz-twin/rac/validator.py` around lines 249 - 264, The outer scan and
cursor advancement currently use cursor_by_node and only advance to the max
event_id, which skips later seqs appended to existing event_id values; change
the validation frontier to be per (event_id, seq) or ensure event_ids are
revisited until both sides' seq frontiers stop moving. Concretely, replace or
augment cursor_by_node with a per-event progress tracker (e.g., cursor_by_event
mapping keyed by event_id and last_validated_seq) or, when advancing an event_id
in the loop that builds lm_by_seq / olr_by_seq, only advance cursor_by_node for
that event_id when max(lm_by_seq.keys()) == max(olr_by_seq.keys()) (i.e., both
sides' seq frontiers match); ensure lm_events and olr_events queries and the
all_seqs logic (lm_by_seq, olr_by_seq, all_seqs) use that per-(event_id, seq)
logic so updates/deletes appended to older event_id values are revisited until
validated.
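
The fix this comment asks for can be sketched as a per-event seq watermark (names like `cursor_seq` are hypothetical, not the validator's actual variables): an event_id is revisited whenever its seq frontier has grown, so later UPDATE/DELETE seqs are never stranded behind the cursor.

```python
def select_pending(events, cursor_seq):
    """Yield (event_id, seq) pairs not yet validated.

    events: mapping event_id -> set of seqs currently present.
    cursor_seq: mapping event_id -> highest seq already validated.
    Revisits an event_id whenever its seq frontier has grown, so
    UPDATE/DELETE seqs appended to old event_ids are not skipped.
    """
    pending = []
    for eid, seqs in events.items():
        last = cursor_seq.get(eid, -1)
        for seq in sorted(s for s in seqs if s > last):
            pending.append((eid, seq))
    return pending

cursor = {"E1": 0}   # E1's INSERT was already validated
work = select_pending({"E1": {0, 1}, "E2": {0}}, cursor)
```

A plain `event_id > cursor` scan would have returned only E2; the per-event watermark also surfaces E1's later update at seq 1.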
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/dbz-twin/rac/fuzz-test.sh`:
- Around line 311-328: The idle-check loop in fuzz-test.sh (variables prev_line,
cur_line, idle_count reading docker logs for fuzz-consumer and grepping
'\[consumer\]') can exit prematurely because the consumer prints progress only
every 30s; update the loop to either require two successive identical
'[consumer]' log lines before considering it idle (e.g., track last_two_lines or
a stable_report_count) or replace the log-polling with a direct check of the
SQLite row counts used by the validator (query the DB and ensure counts are
stable across two checks) and then set final_counts accordingly (currently
computed into final_counts) before proceeding to run the validator.
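
One way to implement the suggested idle check, sketched in Python rather than the script's shell (the poll count, status strings, and function names are illustrative): require the consumer status line to be unchanged across consecutive polls before declaring the pipeline drained.

```python
def wait_until_idle(read_status, required_stable=2, max_polls=100):
    """Return True once the status line is unchanged for
    `required_stable` consecutive polls, guarding against the
    consumer's 30s reporting interval causing a premature verdict."""
    prev, stable = None, 0
    for _ in range(max_polls):
        cur = read_status()
        # Count consecutive repeats of a non-empty status line.
        stable = stable + 1 if cur is not None and cur == prev else 0
        if stable >= required_stable:
            return True
        prev = cur
    return False

# Counts still moving (100 -> 150), then stable for three polls.
statuses = iter(["100 rows", "100 rows", "150 rows", "150 rows", "150 rows"])
done = wait_until_idle(lambda: next(statuses, None), required_stable=2)
```

A single-repeat check would have declared idle at "100 rows" while rows were still draining; requiring a longer stable streak (or polling the SQLite counts directly, as the comment suggests) avoids that race.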

In `@tests/dbz-twin/rac/validator.py`:
- Around line 270-289: The per-sequence branches that detect a row present only
on one side (the blocks checking "if lm_r and not olr_r" and "if olr_r and not
lm_r") currently increment total_validated/total_mismatches/total_lob_known but
do not update the aggregate missing counters; update those branches to also
increment total_missing_olr when lm_r exists without olr_r, and increment
total_missing_lm when olr_r exists without lm_r, preserving the existing is_lob
handling (i.e., still count lob rows into total_lob_known instead of the missing
counters) and keep the same print/log behavior so per-seq missing rows
contribute to the final summary.
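
The counter fix described above is mechanical; a minimal sketch of the corrected branches (counter names mirror the comment, the surrounding per-seq loop is omitted):

```python
from collections import defaultdict

def classify(lm_r, olr_r, is_lob, counters):
    """Update aggregate counters for one (event_id, seq) pair
    present on only one side."""
    counters["total_validated"] += 1
    if lm_r and not olr_r:
        if is_lob:
            counters["total_lob_known"] += 1
        else:
            counters["total_mismatches"] += 1
            counters["total_missing_olr"] += 1  # previously not incremented
    elif olr_r and not lm_r:
        if is_lob:
            counters["total_lob_known"] += 1
        else:
            counters["total_mismatches"] += 1
            counters["total_missing_lm"] += 1   # previously not incremented

c = defaultdict(int)
classify({"v": 1}, None, False, c)  # LM-only row -> missing on OLR
classify(None, {"v": 2}, False, c)  # OLR-only row -> missing on LM
```

LOB rows still go to total_lob_known rather than the missing counters, preserving the existing behavior while making one-sided rows show up in the final summary.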


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c9f207d9-f95f-4948-8312-bd9412800bdc

📥 Commits

Reviewing files that changed from the base of the PR and between dc622f8 and 14b791d.

📒 Files selected for processing (7)
  • AGENTS.md
  • tests/dbz-twin/rac/Dockerfile
  • tests/dbz-twin/rac/docker-compose-fuzz.yaml
  • tests/dbz-twin/rac/fuzz-test.sh
  • tests/dbz-twin/rac/kafka-consumer.py
  • tests/dbz-twin/rac/perf/fuzz-workload.sql
  • tests/dbz-twin/rac/validator.py

@rophy changed the title from "fix: fuzz-test startup detection + investigation policy" to "fix: fuzz-test accuracy and infrastructure improvements" on Mar 29, 2026
Per-seq MISSING_OLR/EXTRA_OLR branches incremented total_mismatches
but not total_missing_olr/total_missing_lm, under-reporting in the
summary output.

@coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/dbz-twin/rac/validator.py (1)

249-264: ⚠️ Potential issue | 🔴 Critical

Don't keep the watermark keyed only by event_id.

Lines 249-251 make event_id stable for the whole row lifetime, but the surrounding scan still advances cursor_by_node by event_id and only refetches event_id > cursor. That means once a row has been validated once, any later seq for that same event_id is permanently behind the cursor, so subsequent UPDATE/DELETE mismatches are never checked and the idle pass can exit with unvalidated work still pending. This needs progress tracking per (event_id, seq) or, at minimum, a revisit path for event IDs whose max seq has grown.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/dbz-twin/rac/validator.py` around lines 249 - 264, The validator
currently advances cursor_by_node keyed only by event_id so later seq updates
for the same event_id are skipped; change progress tracking to include seq (or
track max seq per event_id and revisit when it grows). Concretely, modify the
scan/advance logic around cursor_by_node, and when you build lm_by_seq /
olr_by_seq from lm_events and olr_events, compute the highest seq for that
event_id and store cursor_by_node[node] as (event_id, max_seq) or maintain a
separate cursor_seq map; ensure the loop refetches event_ids whose max_seq >
stored seq (or uses tuple comparison) so UPDATE/DELETE seqs are validated, using
the existing variables lm_events, olr_events, lm_by_seq, olr_by_seq, all_seqs to
detect and drive the revisit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7893beae-9807-4f20-8afb-b186d4de2563

📥 Commits

Reviewing files that changed from the base of the PR and between 14b791d and 82d0b26.

📒 Files selected for processing (1)
  • tests/dbz-twin/rac/validator.py

@rophy rophy merged commit 91acc0b into master Mar 29, 2026
2 checks passed
@rophy rophy deleted the chore/fuzz-test-fixes branch March 29, 2026 03:59