Akhil-Pathivada

Problem

Issue #592 reports intermittent task indexing failures: tasks are saved in the primary DB but are never indexed in OpenSearch, leaving the search index inconsistent with the primary DB.

Root Cause

A race condition exists in the OpenSearchRestDAO.indexObject() method. The critical section (add to buffer → check size → flush) is not atomic, allowing concurrent threads to interfere during the flush operation.

The issue occurs when:

  1. Thread A adds an item to the buffer (size = 1)
  2. Thread A checks size (1 >= threshold), begins flushing to OpenSearch
  3. Thread B adds an item to the buffer (size = 2) while Thread A is sending data to OpenSearch
  4. Thread A receives success from OpenSearch and replaces the entire buffer with a new empty one: bulkRequests.put(docType, new BulkRequests(...))
  5. Thread B's item is discarded when the buffer is replaced
  6. Thread B checks size on the new empty buffer (size = 0), doesn't flush
  7. Result: Thread B's item is lost

This is a time-of-check to time-of-use (TOCTOU) race condition: the buffer is replaced between the moment Thread B adds its item and the moment Thread B checks the buffer size, so the check runs against a buffer that no longer contains the item.
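The interleaving above can be replayed deterministically in a single thread. This is a minimal sketch, not the actual OpenSearchRestDAO code: the class name LostUpdateReplay, the use of plain lists in place of BulkRequests, and the assumption that the in-flight request carries only the items present when the flush began are all illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model of the buffer; not the real DAO. */
final class LostUpdateReplay {

    /** Replays steps 1-7 in a fixed order and returns what reached the simulated index. */
    static List<String> replay() {
        Map<String, List<String>> bulkRequests = new HashMap<>();
        List<String> indexed = new ArrayList<>(); // stands in for OpenSearch
        int indexBatchSize = 1;
        bulkRequests.put("task", new ArrayList<>());

        // Steps 1-2: Thread A adds its item, sees size >= threshold, starts the flush.
        bulkRequests.get("task").add("task-A");
        boolean aFlushes = bulkRequests.get("task").size() >= indexBatchSize;
        // Assumption: the payload sent to OpenSearch is fixed once the flush starts.
        List<String> inFlight = new ArrayList<>(bulkRequests.get("task"));

        // Step 3: Thread B adds its item while A's request is in flight.
        bulkRequests.get("task").add("task-B");

        // Steps 4-5: A's request succeeds; A swaps in a fresh buffer, dropping task-B.
        if (aFlushes) {
            indexed.addAll(inFlight);
            bulkRequests.put("task", new ArrayList<>());
        }

        // Step 6: Thread B checks the size of the new, empty buffer and does not flush.
        boolean bFlushes = bulkRequests.get("task").size() >= indexBatchSize;
        if (bFlushes) {
            indexed.addAll(bulkRequests.get("task"));
        }

        // Step 7: task-B never reached the index.
        return indexed;
    }

    public static void main(String[] args) {
        System.out.println("indexed = " + replay()); // prints "indexed = [task-A]"
    }
}
```

In the real DAO the same ordering arises only occasionally, when two threads overlap, which would be consistent with the failures appearing random.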

Why indexBatchSize=1 exacerbates this:
With the default indexBatchSize=1, every item triggers an immediate flush, so every call to indexObject() opens a window during the OpenSearch network call in which a concurrent add can be lost.

Solution

Wrap the critical section in indexObject() (add to buffer → check size → flush) in a synchronized(this) block.

This ensures:

  • Only one thread can add to the buffer, check size, and trigger flush at a time
  • No thread can add items while another thread is flushing
  • Buffer replacement happens atomically without losing items from concurrent threads
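The locking pattern described above might look like the following. This is a hedged sketch under stated assumptions, not the diff from this PR: SynchronizedIndexer, its list-based buffer, and the simulated bulk call are hypothetical stand-ins, and only the synchronized(this) block around add, check, and flush mirrors the described change.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the DAO; only the locking pattern reflects the fix. */
final class SynchronizedIndexer {
    private final Map<String, List<String>> bulkRequests = new HashMap<>();
    private final List<String> indexed = new ArrayList<>(); // simulated OpenSearch index
    private final int indexBatchSize;

    SynchronizedIndexer(int indexBatchSize) {
        this.indexBatchSize = indexBatchSize;
    }

    void indexObject(String docType, String doc) {
        // Add, size check, and flush form one critical section.
        synchronized (this) {
            List<String> buffer = bulkRequests.computeIfAbsent(docType, k -> new ArrayList<>());
            buffer.add(doc);
            if (buffer.size() >= indexBatchSize) {
                indexed.addAll(buffer);                       // simulated bulk call
                bulkRequests.put(docType, new ArrayList<>()); // no other thread can add in between
            }
        }
    }

    synchronized int indexedCount() {
        return indexed.size();
    }

    /** Runs `threads` workers that each index `perThread` docs; returns how many were indexed. */
    static int run(int threads, int perThread) {
        SynchronizedIndexer dao = new SynchronizedIndexer(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    dao.indexObject("task", "task-" + id + "-" + i);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return dao.indexedCount();
    }

    public static void main(String[] args) {
        System.out.println("indexed " + run(8, 1000) + " of " + (8 * 1000));
    }
}
```

One consequence of this design, visible in the sketch, is that the lock is held for the duration of the flush, so concurrent callers wait while a bulk request is in progress. That is the trade-off that keeps items from being added to a buffer that is about to be replaced.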

Testing

  • Tested with concurrent workflow executions and verified 100% indexing success rate

Changes

  • Wrapped the critical section of OpenSearchRestDAO.indexObject() in a synchronized block
  • No API changes or breaking changes

Fixes #592

@Akhil-Pathivada Akhil-Pathivada force-pushed the fix/issue-592-opensearch-indexing-race-condition branch from ee1209e to f209a12 Compare October 8, 2025 12:02

Akhil-Pathivada commented Oct 8, 2025

@v1r3n @manan164 could you please help review this?

Development

Successfully merging this pull request may close these issues.

Bug: Silent Task Indexing Failures with Async Indexing Enabled
