
Bug: Silent Task Indexing Failures with Async Indexing Enabled #592

@Akhil-Pathivada


Describe the bug
When asyncIndexingEnabled=true, some tasks in multi-task workflows intermittently fail to be indexed in the search DB (OpenSearch in this setup). Exceptions raised inside the indexing CompletableFutures are never handled, so the failures are silent and the search index falls out of sync with the primary database.

Details
Conductor version: Latest (main branch)
Persistence implementation: Postgres
Queue implementation: Redis standalone
Lock: Redis
Workflow definition: Any workflow with 2+ tasks
Task definition: Any SIMPLE tasks
Event handler definition: N/A

To Reproduce
Steps to reproduce the behavior:

  1. Set conductor.app.asyncIndexingEnabled=true in configuration
  2. Create a workflow with 2 or more tasks
  3. Start and complete the workflow (all tasks complete successfully)
  4. Check the search index (OpenSearch/Elasticsearch) - only some tasks are indexed
  5. Check PostgreSQL - all tasks exist correctly
  6. Repeat with different workflows - random tasks are missing from search index

Expected behavior
All completed tasks should be indexed to the Search DB and be searchable via the API and UI.


Additional context

  • Root cause: In ExecutionDAOFacade.java, the CompletableFuture returned by each asyncIndexTask() call is ignored, so any exception it completes with is never observed or logged and the failure is silent
  • Affects all deployments with async indexing enabled
  • Workaround: Set asyncIndexingEnabled=false (impacts performance)
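
To illustrate the failure mode described above, here is a minimal, self-contained sketch of how a discarded CompletableFuture hides an exception, and how attaching a callback surfaces it. This is not Conductor code: the class name, the String task ids, and the stand-in asyncIndexTask() that fails for ids ending in "-bad" are all assumptions made for the example.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

class SilentIndexFailureDemo {

    // Hypothetical stand-in for an async index call (not the real IndexDAO method):
    // it fails for any task id ending in "-bad".
    static CompletableFuture<Void> asyncIndexTask(String taskId) {
        return CompletableFuture.runAsync(() -> {
            if (taskId.endsWith("-bad")) {
                throw new IllegalStateException("index request rejected for " + taskId);
            }
        });
    }

    // Fire-and-forget: each returned future is dropped, so a failed
    // indexing call produces no log line and no exception in the caller.
    static void indexIgnoringResult(List<String> taskIds) {
        taskIds.forEach(SilentIndexFailureDemo::asyncIndexTask);
    }

    // Observes every future and returns the ids whose indexing failed.
    static List<String> indexAndCollectFailures(List<String> taskIds) {
        ConcurrentLinkedQueue<String> failed = new ConcurrentLinkedQueue<>();
        CompletableFuture<?>[] futures = taskIds.stream()
                .map(id -> asyncIndexTask(id).whenComplete((ok, err) -> {
                    if (err != null) {
                        failed.add(id);
                    }
                }))
                .toArray(CompletableFuture[]::new);
        // allOf completes exceptionally if any member failed; handle() absorbs
        // that so join() only waits for completion instead of rethrowing.
        CompletableFuture.allOf(futures).handle((ok, err) -> null).join();
        return List.copyOf(failed);
    }

    public static void main(String[] args) {
        List<String> ids = List.of("t1", "t2-bad", "t3");
        indexIgnoringResult(ids); // nothing is reported for t2-bad
        System.out.println("failed to index: " + indexAndCollectFailures(ids));
    }
}
```

The only difference between the two methods is whether the returned future is kept and observed, which is the gap the root cause points at.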

Proposed Solution
Replace the current individual async indexing approach with bulk task indexing, following the same pattern already used for task execution logs. This approach would:

  1. Add bulk indexing methods to the IndexDAO interface similar to the existing addTaskExecutionLogs methods
  2. Implement these methods in all IndexDAO implementations using their existing bulk request infrastructure
  3. Update ExecutionDAOFacade to collect all tasks and submit them as a single bulk operation instead of individual async calls
  4. Leverage existing error handling, retry logic, and monitoring that's already proven to work with bulk operations
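
The steps above could look roughly like the following sketch. It is only an outline under stated assumptions: the interface BulkIndexDao, the method name asyncIndexTasks, the in-memory index, and the use of plain String task ids are all hypothetical and do not reflect the actual IndexDAO signatures or the existing bulk request infrastructure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

class BulkTaskIndexSketch {

    // Hypothetical bulk method; the name asyncIndexTasks is an assumption,
    // not an existing IndexDAO method.
    interface BulkIndexDao {
        CompletableFuture<Void> asyncIndexTasks(List<String> taskIds);
    }

    // In-memory stand-in for a search index that takes one bulk request per call.
    static class InMemoryIndex implements BulkIndexDao {
        final List<String> indexed = new CopyOnWriteArrayList<>();
        int bulkRequests = 0; // incremented on the calling thread only

        @Override
        public CompletableFuture<Void> asyncIndexTasks(List<String> taskIds) {
            bulkRequests++;
            return CompletableFuture.runAsync(() -> indexed.addAll(taskIds));
        }
    }

    // Collects all tasks of a workflow and submits them as a single bulk
    // operation; the result is observed, so a failure is reported, not dropped.
    static CompletableFuture<Boolean> indexWorkflowTasks(BulkIndexDao dao, List<String> taskIds) {
        return dao.asyncIndexTasks(new ArrayList<>(taskIds))
                .handle((ok, err) -> {
                    if (err != null) {
                        System.err.println("bulk task indexing failed: " + err);
                        return false;
                    }
                    return true;
                });
    }

    public static void main(String[] args) {
        InMemoryIndex index = new InMemoryIndex();
        boolean ok = indexWorkflowTasks(index, List.of("t1", "t2", "t3")).join();
        System.out.println("ok=" + ok + ", bulk requests=" + index.bulkRequests
                + ", tasks indexed=" + index.indexed.size());
    }
}
```

With a single future per workflow there is one place to attach error handling, retries, and metrics, instead of one per task.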
