fix(reindex): eliminate TOCTOU race in ReindexThread BulkProcessor rebuild by fabrizzio-dotCMS · Pull Request #35323 · dotCMS/core

fabrizzio-dotCMS · 2026-04-15T04:30:20Z

Summary

Fixes the TOCTOU (Time-Of-Check To Time-Of-Use) race condition in ReindexThread.runReindexLoop() identified in issue #35313.

Root cause

Two concurrent hazards on lines 243–248 of ReindexThread.java:

1. Null-check / rebuild not atomic (rebuildBulkIndexer)

if (bulkProcessor == null || rebuildBulkIndexer.get()) {  // CHECK
    closeBulkProcessor(bulkProcessor);
    bulkProcessorListener = new BulkProcessorListener();
    bulkProcessor = indexAPI.createBulkProcessor(bulkProcessorListener); // USE
}

An external thread calling rebuildBulkIndexer() between the check and the creation can leave callbacks pointing to a stale listener, silently losing reindex records.

2. putAll vs afterBulk callback race

bulkProcessorListener.workingRecords.putAll(workingRecords); // ReindexThread
// meanwhile: afterBulk() reads workingRecords on BulkProcessor callback thread

In back-to-back loop iterations, putAll(batch N+1) can interleave with afterBulk(batch N) reading workingRecords, causing deleteReindexEntry/markAsFailed to resolve entries from the wrong batch.

Fix

Create a fresh BulkProcessorListener and IndexBulkProcessor per batch via try-with-resources
close() blocks until afterBulk completes before the next batch starts — each batch owns its immutable workingRecords snapshot
rebuildBulkIndexer flag, rebuildBulkIndexer() method, and closeBulkProcessor() helper removed (all made unnecessary by per-batch approach)
finalizeReIndex() simplified: no longer manages a processor parameter
BulkProcessorListener.afterBulk: logs warning on >50% failure rate instead of signalling a rebuild

Scope

Only 2 files changed (ReindexThread.java + BulkProcessorListener.java), net -33 lines.

Test plan

Existing ReindexThread integration tests pass
OSCreateContentIndexIntegrationTest / OpenSearchUpgradeSuite pass
Manual concurrent-reindex: trigger rebuildBulkIndexer() while reindex is in flight — no ConcurrentModificationException, no silent record loss
Full reindex switchover still works: queue empties → finalizeReIndex → switchover → notification

Closes #35313

…build Two concurrent hazards existed in runReindexLoop(): 1. putAll/afterBulk race: putAll(batch N+1) could interleave with afterBulk(batch N) reading workingRecords on the BulkProcessor callback thread, causing deleteReindexEntry/markAsFailed to operate on the wrong entries. 2. Null-check/rebuild not atomic: rebuildBulkIndexer flag set between check and create could leave callbacks pointing to a stale listener. Fix: create a fresh BulkProcessorListener and IndexBulkProcessor per batch via try-with-resources. close() blocks until afterBulk completes before the next batch starts — each batch owns its immutable workingRecords snapshot, so no shared mutable state exists between consecutive batches. - rebuildBulkIndexer flag and method removed (per-batch makes explicit rebuild unnecessary) - closeBulkProcessor() helper removed (no longer needed) - finalizeReIndex() simplified: no longer manages a processor param - BulkProcessorListener.afterBulk: logs warning on >50% failure rate instead of signalling a rebuild Refs #35313 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude · 2026-04-15T04:31:08Z

Claude finished @fabrizzio-dotCMS's task in 3m 6s —— View job

PR Review: fix(reindex): eliminate TOCTOU race in ReindexThread BulkProcessor rebuild

Gathering context
Reading changed files
Reviewing code
Posting findings

Summary

The core fix is sound: per-batch BulkProcessorListener isolation definitively eliminates the original TOCTOU race. Issues below are about edge cases and test gaps.

Issues

1. close() blocking guarantee is conditional — the Javadoc comment overclaims

ReindexThread.java:199-201 says:

"The processor is closed (blocking until afterBulk completes) before the next batch starts"

For the ES path (ESIndexBulkProcessor.close()), this calls delegate.awaitClose(awaitTimeoutSeconds, SECONDS) — which has a timeout (default 20s, BULK_PROCESSOR_AWAIT_TIMEOUT). If it times out, awaitClose returns false and afterBulk may still be pending on the ES callback thread when the next batch starts.

The per-batch isolation still protects against the original TOCTOU race regardless (each batch has its own workingRecords map), so correctness is not broken. But the comment establishes a happens-before guarantee that doesn't hold when ES is slow. If awaitClose times out, the queue entries for that batch are silently unresolved (neither deleted nor failed) — they'll be retried on the next poll, which is acceptable but not obvious from the code.

The comment should be corrected: the safety comes from per-batch map isolation, not from close() blocking. Fix this →

2. Test doesn't verify close() is actually called

ReindexThreadUnitTest.java:168-169: mockProc.close() is stubbed as a no-op and never verify-d. The try-with-resources is what guarantees close() fires — but the test doesn't assert it. If someone accidentally refactors the try-with-resources away, the test still passes.

Add verify(mockProc, atLeast(2)).close() after loopThread.join(...). Fix this →

3. Static instance not reset in test teardown — cross-test pollution

ReindexThread.java:146: the @VisibleForTesting constructor sets instance = this. The new test creates a ReindexThread pointing to mock dependencies, and @After never nulls instance. Any test running after testPerBatchListenerIsolation that calls ReindexThread.getInstance() gets the mock-backed instance, not a real one.

The existing testGenerateNotification also sets the singleton. This is a pre-existing pattern but the new test is the first one that calls ReindexThread.stopThread() (which also uses the singleton), so the ordering sensitivity is higher now.

4. contentletsIndexed accumulation skipped on close() exception

ReindexThread.java:231:

} // close() blocks until afterBulk completes before the next batch starts
contentletsIndexed += batchListener.getContentletsIndexed();

batchListener.contentletsIndexed is incremented in beforeBulk() (not afterBulk()), so it's already set before close() is called. But if close() throws, the exception jumps to the outer catch, skipping this line. The counter is under-reported by one batch on every primary-flush failure. Minor, but it's now silently wrong for any monitoring that reads totalESPuts().

5. workingRecords is package-private and mutable after construction

BulkProcessorListener.java:32:

final Map<String, ReindexEntry> workingRecords;

This is a ConcurrentHashMap populated by putAll in runReindexLoop before the processor opens. It's safe for the current call site, but any same-package code can mutate it after population. afterBulk reads it on a different thread without any guarantee the caller won't add more entries. Consider making it either constructor-injected or providing a snapshot-at-construction pattern to make the "immutable after creation" contract explicit.

What's clean

Removal of rebuildBulkIndexer flag and closeBulkProcessor helper is correct — no callers remain anywhere in the codebase.
The putAll before createBulkProcessor is correct ordering: workingRecords is fully populated before any auto-flush could call afterBulk.
CompositeBulkProcessor.close() correctly propagates primary failures while swallowing shadow (OS) failures.
OSIndexBulkProcessor.close() calls flush() synchronously — no timeout concern there.
The test reflection approach for setStateRunning is reasonable given ThreadState is private.

…indexThread Adds testPerBatchListenerIsolation to ReindexThreadUnitTest. Runs runReindexLoop() via reflection against mocked dependencies and uses a CountDownLatch to let exactly two batches complete before stopping the thread. Verifies: - createBulkProcessor is called with a fresh listener per batch (not the same instance reused), proving the TOCTOU hazard is eliminated. - Each listener's workingRecords snapshot contains only the entries from its own batch, not contaminated by the adjacent batch. Refs #35313 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Expand wildcard import com.dotmarketing.business.* to explicit imports - Add @IndexLibraryIndependent annotation (class uses only neutral APIs) - Make state field final - Remove unused TimeUnit import (no longer needed after per-batch fix) - Reorder imports per Google Java style - Fix brace formatting in finally block Refs #35313 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-project-automation bot added this to dotCMS - Product Planning Apr 15, 2026

github-actions bot mentioned this pull request Apr 15, 2026

fix(ReindexThread): TOCTOU race on BulkProcessor rebuild — use AtomicReference + compareAndSet #35313

Open

4 tasks

github-actions bot added the Area : Backend PR changes Java/Maven backend code label Apr 15, 2026

claude bot added the AI: Safe To Rollback label Apr 15, 2026

fabrizzio-dotCMS and others added 2 commits April 15, 2026 08:32

fabrizzio-dotCMS requested review from freddyDOTCMS and nollymar and removed request for nollymar April 15, 2026 15:50

Merge branch 'main' into fix/issue-35313-reindex-toctou

44b01dd

wezell reviewed Apr 15, 2026

View reviewed changes

Comment thread dotCMS/src/main/java/com/dotmarketing/common/reindex/ReindexThread.java

dario-daza approved these changes Apr 15, 2026

View reviewed changes

nollymar approved these changes Apr 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(reindex): eliminate TOCTOU race in ReindexThread BulkProcessor rebuild#35323

fix(reindex): eliminate TOCTOU race in ReindexThread BulkProcessor rebuild#35323
fabrizzio-dotCMS wants to merge 4 commits intomainfrom
fix/issue-35313-reindex-toctou

fabrizzio-dotCMS commented Apr 15, 2026

Uh oh!

claude bot commented Apr 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

fabrizzio-dotCMS commented Apr 15, 2026

Summary

Root cause

Fix

Scope

Test plan

Uh oh!

claude bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: fix(reindex): eliminate TOCTOU race in ReindexThread BulkProcessor rebuild

Summary

Issues

What's clean

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

claude bot commented Apr 15, 2026 •

edited

Loading