
fix: fuzz-test accuracy and infrastructure improvements #14

Merged
rophy merged 10 commits into master from chore/fuzz-test-fixes on Mar 29, 2026

Conversation

@rophy (Owner) commented Mar 29, 2026

Summary

Fixes and improvements to the fuzz-test framework for OLR data accuracy validation on Oracle RAC.

Bug fixes:

  • Fix seq map trimming that caused event overwrites in Kafka consumer — trimming evicted entries for rows whose INSERT and later UPDATE/DELETE arrived far apart, causing INSERT OR REPLACE to overwrite the INSERT
  • Fix PL/SQL UPDATE procedures using global instead of local variables for SQL context
  • Fix fuzz-test OLR startup detection — search full log instead of tail -5, which missed the startup message with verbose trace output
  • Make event_id immutable (assigned once at INSERT, unchanged by UPDATE/DELETE) for reliable per-row tracking
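
The seq-map trimming bug is easiest to see with SQLite's INSERT OR REPLACE semantics. A minimal Python sketch (the table and column names are illustrative, not the actual consumer schema): when trimming evicts a row's entry from the in-memory seq map, the next operation on that row restarts at seq 0, and INSERT OR REPLACE on the (event_id, seq) key silently replaces the original INSERT:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id TEXT, seq INTEGER, op TEXT,"
    " PRIMARY KEY (event_id, seq))"
)

seq_map = {}  # event_id -> next seq, as the consumer tracks it in memory

def record(event_id, op):
    seq = seq_map.get(event_id, 0)
    seq_map[event_id] = seq + 1
    conn.execute("INSERT OR REPLACE INTO events VALUES (?, ?, ?)",
                 (event_id, seq, op))

record("E1", "INSERT")   # stored as (E1, 0, INSERT)
seq_map.clear()          # trimming evicts E1's entry between distant ops
record("E1", "UPDATE")   # seq restarts at 0 -> overwrites the INSERT row

rows = conn.execute(
    "SELECT seq, op FROM events WHERE event_id='E1'").fetchall()
```

After the eviction, only the UPDATE survives in `rows`; the INSERT event is gone, which is exactly the overwrite the fix removes by no longer trimming the maps.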

Validator improvements:

  • Per-node frontier validation to eliminate false positives from RAC node interleaving
  • Per-seq comparison for multi-event rows (LOB splits)
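
The per-seq comparison can be sketched as follows (hypothetical helper, not the validator's actual code; the real validator reads SQLite and normalizes values first). Instead of merging all events for an event_id, it walks the union of seqs seen on either side:

```python
def validate_event(event_id, lm_by_seq, olr_by_seq):
    """Compare LM and OLR rows per (event_id, seq) instead of per event_id.

    lm_by_seq / olr_by_seq map seq -> normalized row dict for one event_id.
    Returns lists of (event_id, seq) keys per outcome.
    """
    matches, mismatches, missing_olr, missing_lm = [], [], [], []
    for seq in sorted(set(lm_by_seq) | set(olr_by_seq)):
        lm_r, olr_r = lm_by_seq.get(seq), olr_by_seq.get(seq)
        key = (event_id, seq)
        if lm_r and not olr_r:
            missing_olr.append(key)   # LM saw this op, OLR did not
        elif olr_r and not lm_r:
            missing_lm.append(key)    # OLR saw an extra op
        elif lm_r == olr_r:
            matches.append(key)
        else:
            mismatches.append(key)
    return matches, mismatches, missing_olr, missing_lm

m, mm, mo, ml = validate_event(
    "E1",
    {0: {"c": "a"}, 1: {"c": "b"}},   # LM: insert + update
    {0: {"c": "a"}},                   # OLR missed the update
)
```

A merged event-level comparison would have collapsed the two LM seqs into one row and reported nothing useful; the per-seq walk pinpoints the missing update.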

Infrastructure:

  • 24h Kafka retention for post-mortem investigation
  • Rename Dockerfile.consumer to Dockerfile
  • Revert non-root consumer (incompatible with Docker named volumes)
  • Clearer status messages in fuzz-test.sh
  • Add investigation policy to AGENTS.md

Test Results (30-minute run)

Test plan

  • 5-minute fuzz test: 0 non-LOB mismatches
  • 30-minute fuzz test: 0 non-LOB mismatches on 177K events
  • Redo-log regression: 161/162 pass (1 pre-existing failure)

Summary by CodeRabbit

  • Documentation

    • Added Investigation Policy to development guidelines, requiring evidence-based root cause analysis in investigations.
  • Tests

    • Updated Kafka retention configuration and service definitions in test environment setup.
    • Enhanced test validation and monitoring logic for improved diagnostic accuracy.

rophy added 9 commits March 28, 2026 15:14

event_id is now assigned once at INSERT and never changed by UPDATE/DELETE.
The consumer increments seq naturally for subsequent ops on the same row.
Validator compares per (event_id, seq) pair instead of just distinct event_ids.

- event_id assigned once at INSERT, never changed by UPDATE/DELETE
- Consumer increments seq for subsequent ops on same row
- Validator compares per (event_id, seq) pair
- Dockerfile runs as non-root app user
- Kafka retention increased to 24h for investigation
- Validator shares the consumer's Dockerfile for a consistent user

Search the full OLR log for the startup message instead of tail -5, which
missed it when DUMP trace output was enabled.
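
The startup-detection commit amounts to scanning the whole log rather than its last few lines. A hedged Python sketch (the marker string is illustrative, not OLR's actual startup message):

```python
def olr_started(log_text, marker="replication started"):
    """Contrast the old tail-5 check with a full-log search."""
    lines = log_text.splitlines()
    # Old approach: only look at the last 5 lines -- misses the marker
    # when verbose DUMP trace output follows it.
    old = any(marker in line for line in lines[-5:])
    # Fixed approach: search the full log.
    new = any(marker in line for line in lines)
    return old, new

# Startup message followed by 100 lines of trace output pushes the
# marker out of the tail window.
log = "replication started\n" + "\n".join(f"DUMP trace {i}" for i in range(100))
old, new = olr_started(log)
```

With trace output enabled, the tail-window check reports not-started while the full-log search finds the marker.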
@coderabbitai bot commented Mar 29, 2026

📝 Walkthrough

Walkthrough

Refactors the fuzz testing framework: per-sequence validation replaces per-event merging; sequence maps no longer trimmed; workload SQL stops updating event_id; test scripts and container configs adjusted; and an "Investigation Policy" was added to AGENTS.md. (49 words)

Changes

  • Policy & Documentation (AGENTS.md): Added "Investigation Policy" requiring evidence-backed statements and limiting assertions to what logs/code/data show.
  • Container & Compose (tests/dbz-twin/rac/docker-compose-fuzz.yaml): Changed Kafka retention from 1h to 24h; consumer build uses Dockerfile; validator now built from the local Dockerfile as image fuzz-consumer:latest.
  • Test Orchestration (tests/dbz-twin/rac/fuzz-test.sh): Quoted the OLR readiness grep; switched monitoring from validator logs to consumer logs; improved the completion summary and drain-wait idle detection by comparing the last consumer status line.
  • Sequence Tracking (tests/dbz-twin/rac/kafka-consumer.py): Removed periodic trimming of the lm_seq/olr_seq maps so per-event_id sequence entries persist for the process lifetime.
  • Workload SQL (tests/dbz-twin/rac/perf/fuzz-workload.sql): UPDATE procedures no longer modify event_id; introduced local variables to compute non-event_id values before the UPDATE.
  • Validation Logic (tests/dbz-twin/rac/validator.py): Replaced merged event-level comparison with per-(event_id, seq) mapping and iteration; diagnostics and match counts are emitted per (eid, seq), with values compared after normalization.

Sequence Diagram(s)

sequenceDiagram
    participant Workload as SQL Workload
    participant Kafka as Kafka Broker
    participant Consumer as Kafka Consumer
    participant LM as LM Tracker
    participant OLR as OLR Tracker
    participant Validator as Validator

    Workload->>Kafka: Emit events (preserve event_id)
    Kafka->>Consumer: Deliver messages
    Consumer->>LM: Record (event_id, seq) -> lm_seq
    Consumer->>OLR: Record (event_id, seq) -> olr_seq
    Note over LM,OLR: No periodic trimming (retain all sequences)

    Validator->>LM: Query seqs for event_id
    Validator->>OLR: Query seqs for event_id
    loop for each (event_id, seq)
        Validator->>Validator: Normalize LM row for (eid,seq)
        Validator->>Validator: Normalize OLR row for (eid,seq)
        Validator->>Validator: compare_values -> match/mismatch/diagnostic
    end
    Validator->>Validator: Emit per-seq counts & diagnostics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped through sequences, small and neat,
Per‑seq I checked each heartbeat's beat,
Trimming gone, the maps hold true,
Validator sings each mismatch too,
Carrots for tests — a clean, bold feat! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately reflects the main changes: bug fixes to the fuzz-test framework and supporting infrastructure improvements across multiple components (consumer, validator, shell scripts, database procedures, and config files).



@coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/dbz-twin/rac/perf/fuzz-workload.sql (1)

427-445: ⚠️ Potential issue | 🟠 Major

Seeded rows now stay invisible to CDC validation.

run() seeds FUZZ_SCALAR, FUZZ_LOB, and FUZZ_PART with event_id='SEED' and immediately adds those IDs to the tracked pools at Lines 731-752 in this file. After these UPDATE paths stopped rewriting event_id, any UPDATE that hits one of those seeded rows keeps SEED, and tests/dbz-twin/rac/kafka-consumer.py drops SEED events at Lines 89-90. Early UPDATE traffic on seeded rows—and any later DELETE on those rows—never reaches SQLite or the validator. Please either keep seeded IDs out of the tracked pools or assign them a real event_id before they become UPDATE/DELETE targets.

Also applies to: 463-474, 569-579

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/dbz-twin/rac/perf/fuzz-workload.sql` around lines 427 - 445, Seeded
rows are being added to the tracked ID pools in run(), which causes later
UPDATE/DELETE procedures (e.g., PROCEDURE do_update_scalar using pick_tracked_id
and g_scalar_ids/g_scalar_id_cnt) to hit rows with event_id='SEED' that the
Kafka consumer drops; either stop adding seeded IDs to the tracked pools or give
seeded rows a real event_id before they are added: change run() to exclude
SEED-inserted IDs from g_scalar_ids/g_lob_ids/g_part_ids (or add only after
updating event_id), or immediately update seeded rows' event_id to a non-'SEED'
value before any code pushes their IDs into the tracked arrays so that
pick_tracked_id in do_update_scalar / do_update_lob / do_update_part will never
return SEED rows.
tests/dbz-twin/rac/validator.py (1)

249-264: ⚠️ Potential issue | 🟠 Major

The frontier still skips later seqs for already-validated event_ids.

The new (event_id, seq) comparison only fixes the inner match. The outer scan at Lines 200-208 still selects event_id > cursor_by_node, and Lines 333-335 advance that cursor to the max event_id. With immutable event_ids from tests/dbz-twin/rac/perf/fuzz-workload.sql, later UPDATE/DELETEs append new seqs to older event_ids, so those rows are never revisited after their initial insert was validated. Long update-heavy stretches can therefore hit the idle path with unvalidated events still in SQLite. Track validation progress per (event_id, seq) or keep revisiting seen event_ids until both sides' seq frontiers stop moving.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/dbz-twin/rac/validator.py` around lines 249 - 264, The outer scan and
cursor advancement currently use cursor_by_node and only advance to the max
event_id, which skips later seqs appended to existing event_id values; change
the validation frontier to be per (event_id, seq) or ensure event_ids are
revisited until both sides' seq frontiers stop moving. Concretely, replace or
augment cursor_by_node with a per-event progress tracker (e.g., cursor_by_event
mapping keyed by event_id and last_validated_seq) or, when advancing an event_id
in the loop that builds lm_by_seq / olr_by_seq, only advance cursor_by_node for
that event_id when max(lm_by_seq.keys()) == max(olr_by_seq.keys()) (i.e., both
sides' seq frontiers match); ensure lm_events and olr_events queries and the
all_seqs logic (lm_by_seq, olr_by_seq, all_seqs) use that per-(event_id, seq)
logic so updates/deletes appended to older event_id values are revisited until
validated.
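
The fix this comment asks for can be sketched as a per-event seq watermark (names like `cursor_seq` are hypothetical, not the validator's actual variables): an event_id is revisited whenever its seq frontier has grown, so later UPDATE/DELETE seqs are never stranded behind the cursor.

```python
def select_pending(events, cursor_seq):
    """Yield (event_id, seq) pairs not yet validated.

    events: mapping event_id -> set of seqs currently present.
    cursor_seq: mapping event_id -> highest seq already validated.
    Revisits an event_id whenever its seq frontier has grown, so
    UPDATE/DELETE seqs appended to old event_ids are not skipped.
    """
    pending = []
    for eid, seqs in events.items():
        last = cursor_seq.get(eid, -1)
        for seq in sorted(s for s in seqs if s > last):
            pending.append((eid, seq))
    return pending

cursor = {"E1": 0}   # E1's INSERT was already validated
work = select_pending({"E1": {0, 1}, "E2": {0}}, cursor)
```

A plain `event_id > cursor` scan would have returned only E2; the per-event watermark also surfaces E1's later update at seq 1.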
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/dbz-twin/rac/fuzz-test.sh`:
- Around line 311-328: The idle-check loop in fuzz-test.sh (variables prev_line,
cur_line, idle_count reading docker logs for fuzz-consumer and grepping
'\[consumer\]') can exit prematurely because the consumer prints progress only
every 30s; update the loop to either require two successive identical
'[consumer]' log lines before considering it idle (e.g., track last_two_lines or
a stable_report_count) or replace the log-polling with a direct check of the
SQLite row counts used by the validator (query the DB and ensure counts are
stable across two checks) and then set final_counts accordingly (currently
computed into final_counts) before proceeding to run the validator.
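
One way to implement the suggested idle check, sketched in Python rather than the script's shell (the poll count, status strings, and function names are illustrative): require the consumer status line to be unchanged across consecutive polls before declaring the pipeline drained.

```python
def wait_until_idle(read_status, required_stable=2, max_polls=100):
    """Return True once the status line is unchanged for
    `required_stable` consecutive polls, guarding against the
    consumer's 30s reporting interval causing a premature verdict."""
    prev, stable = None, 0
    for _ in range(max_polls):
        cur = read_status()
        # Count consecutive repeats of a non-empty status line.
        stable = stable + 1 if cur is not None and cur == prev else 0
        if stable >= required_stable:
            return True
        prev = cur
    return False

# Counts still moving (100 -> 150), then stable for three polls.
statuses = iter(["100 rows", "100 rows", "150 rows", "150 rows", "150 rows"])
done = wait_until_idle(lambda: next(statuses, None), required_stable=2)
```

A single-repeat check would have declared idle at "100 rows" while rows were still draining; requiring a longer stable streak (or polling the SQLite counts directly, as the comment suggests) avoids that race.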

In `@tests/dbz-twin/rac/validator.py`:
- Around line 270-289: The per-sequence branches that detect a row present only
on one side (the blocks checking "if lm_r and not olr_r" and "if olr_r and not
lm_r") currently increment total_validated/total_mismatches/total_lob_known but
do not update the aggregate missing counters; update those branches to also
increment total_missing_olr when lm_r exists without olr_r, and increment
total_missing_lm when olr_r exists without lm_r, preserving the existing is_lob
handling (i.e., still count lob rows into total_lob_known instead of the missing
counters) and keep the same print/log behavior so per-seq missing rows
contribute to the final summary.
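
The counter fix described above is mechanical; a minimal sketch of the corrected branches (counter names mirror the comment, the surrounding per-seq loop is omitted):

```python
from collections import defaultdict

def classify(lm_r, olr_r, is_lob, counters):
    """Update aggregate counters for one (event_id, seq) pair
    present on only one side."""
    counters["total_validated"] += 1
    if lm_r and not olr_r:
        if is_lob:
            counters["total_lob_known"] += 1
        else:
            counters["total_mismatches"] += 1
            counters["total_missing_olr"] += 1  # previously not incremented
    elif olr_r and not lm_r:
        if is_lob:
            counters["total_lob_known"] += 1
        else:
            counters["total_mismatches"] += 1
            counters["total_missing_lm"] += 1   # previously not incremented

c = defaultdict(int)
classify({"v": 1}, None, False, c)  # LM-only row -> missing on OLR
classify(None, {"v": 2}, False, c)  # OLR-only row -> missing on LM
```

LOB rows still go to total_lob_known rather than the missing counters, preserving the existing behavior while making one-sided rows show up in the final summary.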


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c9f207d9-f95f-4948-8312-bd9412800bdc

📥 Commits

Reviewing files that changed from the base of the PR and between dc622f8 and 14b791d.

📒 Files selected for processing (7)
  • AGENTS.md
  • tests/dbz-twin/rac/Dockerfile
  • tests/dbz-twin/rac/docker-compose-fuzz.yaml
  • tests/dbz-twin/rac/fuzz-test.sh
  • tests/dbz-twin/rac/kafka-consumer.py
  • tests/dbz-twin/rac/perf/fuzz-workload.sql
  • tests/dbz-twin/rac/validator.py

@rophy changed the title from "fix: fuzz-test startup detection + investigation policy" to "fix: fuzz-test accuracy and infrastructure improvements" on Mar 29, 2026
Per-seq MISSING_OLR/EXTRA_OLR branches incremented total_mismatches
but not total_missing_olr/total_missing_lm, under-reporting in the
summary output.

@coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/dbz-twin/rac/validator.py (1)

249-264: ⚠️ Potential issue | 🔴 Critical

Don't keep the watermark keyed only by event_id.

Lines 249-251 make event_id stable for the whole row lifetime, but the surrounding scan still advances cursor_by_node by event_id and only refetches event_id > cursor. That means once a row has been validated once, any later seq for that same event_id is permanently behind the cursor, so subsequent UPDATE/DELETE mismatches are never checked and the idle pass can exit with unvalidated work still pending. This needs progress tracking per (event_id, seq) or, at minimum, a revisit path for event IDs whose max seq has grown.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/dbz-twin/rac/validator.py` around lines 249 - 264, The validator
currently advances cursor_by_node keyed only by event_id so later seq updates
for the same event_id are skipped; change progress tracking to include seq (or
track max seq per event_id and revisit when it grows). Concretely, modify the
scan/advance logic around cursor_by_node, and when you build lm_by_seq /
olr_by_seq from lm_events and olr_events, compute the highest seq for that
event_id and store cursor_by_node[node] as (event_id, max_seq) or maintain a
separate cursor_seq map; ensure the loop refetches event_ids whose max_seq >
stored seq (or uses tuple comparison) so UPDATE/DELETE seqs are validated, using
the existing variables lm_events, olr_events, lm_by_seq, olr_by_seq, all_seqs to
detect and drive the revisit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7893beae-9807-4f20-8afb-b186d4de2563

📥 Commits

Reviewing files that changed from the base of the PR and between 14b791d and 82d0b26.

📒 Files selected for processing (1)
  • tests/dbz-twin/rac/validator.py

@rophy rophy merged commit 91acc0b into master Mar 29, 2026
2 checks passed
@rophy rophy deleted the chore/fuzz-test-fixes branch March 29, 2026 03:59