Skip to content

fix(reindex): eliminate TOCTOU race in ReindexThread BulkProcessor rebuild#35323

Open
fabrizzio-dotCMS wants to merge 4 commits intomainfrom
fix/issue-35313-reindex-toctou
Open

fix(reindex): eliminate TOCTOU race in ReindexThread BulkProcessor rebuild#35323
fabrizzio-dotCMS wants to merge 4 commits intomainfrom
fix/issue-35313-reindex-toctou

Conversation

@fabrizzio-dotCMS
Copy link
Copy Markdown
Member

Summary

Fixes the TOCTOU (Time-Of-Check To Time-Of-Use) race condition in ReindexThread.runReindexLoop() identified in issue #35313.

Root cause

Two concurrent hazards on lines 243–248 of ReindexThread.java:

1. Null-check / rebuild not atomic (rebuildBulkIndexer)

if (bulkProcessor == null || rebuildBulkIndexer.get()) {  // CHECK
    closeBulkProcessor(bulkProcessor);
    bulkProcessorListener = new BulkProcessorListener();
    bulkProcessor = indexAPI.createBulkProcessor(bulkProcessorListener); // USE
}

An external thread calling rebuildBulkIndexer() between the check and the creation can leave callbacks pointing to a stale listener, silently losing reindex records.

2. putAll vs afterBulk callback race

bulkProcessorListener.workingRecords.putAll(workingRecords); // ReindexThread
// meanwhile: afterBulk() reads workingRecords on BulkProcessor callback thread

In back-to-back loop iterations, putAll(batch N+1) can interleave with afterBulk(batch N) reading workingRecords, causing deleteReindexEntry/markAsFailed to resolve entries from the wrong batch.

Fix

  • Create a fresh BulkProcessorListener and IndexBulkProcessor per batch via try-with-resources
  • close() blocks until afterBulk completes before the next batch starts — each batch owns its immutable workingRecords snapshot
  • rebuildBulkIndexer flag, rebuildBulkIndexer() method, and closeBulkProcessor() helper removed (all made unnecessary by per-batch approach)
  • finalizeReIndex() simplified: no longer manages a processor parameter
  • BulkProcessorListener.afterBulk: logs warning on >50% failure rate instead of signalling a rebuild

Scope

Only 2 files changed (ReindexThread.java + BulkProcessorListener.java), net -33 lines.

Test plan

  • Existing ReindexThread integration tests pass
  • OSCreateContentIndexIntegrationTest / OpenSearchUpgradeSuite pass
  • Manual concurrent-reindex: trigger rebuildBulkIndexer() while reindex is in flight — no ConcurrentModificationException, no silent record loss
  • Full reindex switchover still works: queue empties → finalizeReIndex → switchover → notification

Closes #35313

…build

Two concurrent hazards existed in runReindexLoop():

1. putAll/afterBulk race: putAll(batch N+1) could interleave with
   afterBulk(batch N) reading workingRecords on the BulkProcessor
   callback thread, causing deleteReindexEntry/markAsFailed to operate
   on the wrong entries.

2. Null-check/rebuild not atomic: rebuildBulkIndexer flag set between
   check and create could leave callbacks pointing to a stale listener.

Fix: create a fresh BulkProcessorListener and IndexBulkProcessor per
batch via try-with-resources. close() blocks until afterBulk completes
before the next batch starts — each batch owns its immutable
workingRecords snapshot, so no shared mutable state exists between
consecutive batches.

- rebuildBulkIndexer flag and method removed (per-batch makes explicit
  rebuild unnecessary)
- closeBulkProcessor() helper removed (no longer needed)
- finalizeReIndex() simplified: no longer manages a processor param
- BulkProcessorListener.afterBulk: logs warning on >50% failure rate
  instead of signalling a rebuild

Refs #35313

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 15, 2026

Claude finished @fabrizzio-dotCMS's task in 3m 6s —— View job


PR Review: fix(reindex): eliminate TOCTOU race in ReindexThread BulkProcessor rebuild

  • Gathering context
  • Reading changed files
  • Reviewing code
  • Posting findings

Summary

The core fix is sound: per-batch BulkProcessorListener isolation definitively eliminates the original TOCTOU race. Issues below are about edge cases and test gaps.


Issues

1. close() blocking guarantee is conditional — the Javadoc comment overclaims

ReindexThread.java:199-201 says:

"The processor is closed (blocking until afterBulk completes) before the next batch starts"

For the ES path (ESIndexBulkProcessor.close()), this calls delegate.awaitClose(awaitTimeoutSeconds, SECONDS) — which has a timeout (default 20s, BULK_PROCESSOR_AWAIT_TIMEOUT). If it times out, awaitClose returns false and afterBulk may still be pending on the ES callback thread when the next batch starts.

The per-batch isolation still protects against the original TOCTOU race regardless (each batch has its own workingRecords map), so correctness is not broken. But the comment establishes a happens-before guarantee that doesn't hold when ES is slow. If awaitClose times out, the queue entries for that batch are silently unresolved (neither deleted nor failed) — they'll be retried on the next poll, which is acceptable but not obvious from the code.

The comment should be corrected: the safety comes from per-batch map isolation, not from close() blocking. Fix this →


2. Test doesn't verify close() is actually called

ReindexThreadUnitTest.java:168-169: mockProc.close() is stubbed as a no-op and never verify-d. The try-with-resources is what guarantees close() fires — but the test doesn't assert it. If someone accidentally refactors the try-with-resources away, the test still passes.

Add verify(mockProc, atLeast(2)).close() after loopThread.join(...). Fix this →


3. Static instance not reset in test teardown — cross-test pollution

ReindexThread.java:146: the @VisibleForTesting constructor sets instance = this. The new test creates a ReindexThread pointing to mock dependencies, and @After never nulls instance. Any test running after testPerBatchListenerIsolation that calls ReindexThread.getInstance() gets the mock-backed instance, not a real one.

The existing testGenerateNotification also sets the singleton. This is a pre-existing pattern but the new test is the first one that calls ReindexThread.stopThread() (which also uses the singleton), so the ordering sensitivity is higher now.


4. contentletsIndexed accumulation skipped on close() exception

ReindexThread.java:231:

} // close() blocks until afterBulk completes before the next batch starts
contentletsIndexed += batchListener.getContentletsIndexed();

batchListener.contentletsIndexed is incremented in beforeBulk() (not afterBulk()), so it's already set before close() is called. But if close() throws, the exception jumps to the outer catch, skipping this line. The counter is under-reported by one batch on every primary-flush failure. Minor, but it's now silently wrong for any monitoring that reads totalESPuts().


5. workingRecords is package-private and mutable after construction

BulkProcessorListener.java:32:

final Map<String, ReindexEntry> workingRecords;

This is a ConcurrentHashMap populated by putAll in runReindexLoop before the processor opens. It's safe for the current call site, but any same-package code can mutate it after population. afterBulk reads it on a different thread without any guarantee the caller won't add more entries. Consider making it either constructor-injected or providing a snapshot-at-construction pattern to make the "immutable after creation" contract explicit.


What's clean

  • Removal of rebuildBulkIndexer flag and closeBulkProcessor helper is correct — no callers remain anywhere in the codebase.
  • The putAll before createBulkProcessor is correct ordering: workingRecords is fully populated before any auto-flush could call afterBulk.
  • CompositeBulkProcessor.close() correctly propagates primary failures while swallowing shadow (OS) failures.
  • OSIndexBulkProcessor.close() calls flush() synchronously — no timeout concern there.
  • The test reflection approach for setStateRunning is reasonable given ThreadState is private.

fabrizzio-dotCMS and others added 2 commits April 15, 2026 08:32
…indexThread

Adds testPerBatchListenerIsolation to ReindexThreadUnitTest.

Runs runReindexLoop() via reflection against mocked dependencies and
uses a CountDownLatch to let exactly two batches complete before
stopping the thread. Verifies:
- createBulkProcessor is called with a fresh listener per batch (not
  the same instance reused), proving the TOCTOU hazard is eliminated.
- Each listener's workingRecords snapshot contains only the entries from
  its own batch, not contaminated by the adjacent batch.

Refs #35313

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Expand wildcard import com.dotmarketing.business.* to explicit imports
- Add @IndexLibraryIndependent annotation (class uses only neutral APIs)
- Make state field final
- Remove unused TimeUnit import (no longer needed after per-batch fix)
- Reorder imports per Google Java style
- Fix brace formatting in finally block

Refs #35313

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@fabrizzio-dotCMS fabrizzio-dotCMS requested review from freddyDOTCMS and nollymar and removed request for nollymar April 15, 2026 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI: Safe To Rollback Area : Backend PR changes Java/Maven backend code

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

fix(ReindexThread): TOCTOU race on BulkProcessor rebuild — use AtomicReference + compareAndSet

4 participants