Fix inconsistent ES8 workflow state refresh handling #1012
    }

    try {
        String indexedStatus = indexDAO.get(workflow.getWorkflowId(), "status");
This will add additional reads making the sweeper slower. This might be better suited inside ES implementation.
@v1r3n @nthmost-orkes agreed. I removed the sweeper reconciliation logic from this PR and pushed the branch update.
This PR is now scoped back to ES8 write-path correctness plus optional refreshOnWrite behavior.
For reconciliation, I will raise a separate PR that adds backend-specific reconciliation capability instead of putting the ES read in the core sweeper hot path. The shape I have in mind is a no-op default capability for non-ES8 backends, with the actual reconciliation moved into an ES8-specific implementation. That keeps non-ES8 backends from paying additional read overhead while still allowing ES8 deployments to opt into index/execution-store drift repair.
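A rough sketch of that shape follows. All names here (`ReconciliationSketch`, `IndexReconciler`, `Es8IndexReconciler`) are hypothetical and are not existing Conductor APIs, and a plain `Map` stands in for the ES8 index so the example stays self-contained. It only illustrates the idea of a no-op default with a backend-specific override.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class ReconciliationSketch {

    /** Hypothetical capability; the default does nothing and performs no index read. */
    interface IndexReconciler {
        /** Returns true only if the indexed state had drifted and was repaired. */
        default boolean reconcile(String workflowId, String executionStatus) {
            return false;
        }
    }

    /** Hypothetical ES8-only implementation; the Map is a stand-in for the ES8 index. */
    static final class Es8IndexReconciler implements IndexReconciler {
        private final Map<String, String> indexedStatusById;

        Es8IndexReconciler(Map<String, String> indexedStatusById) {
            this.indexedStatusById = indexedStatusById;
        }

        @Override
        public boolean reconcile(String workflowId, String executionStatus) {
            // The additional read happens here, so only ES8 deployments pay for it.
            String indexedStatus = indexedStatusById.get(workflowId);
            if (Objects.equals(indexedStatus, executionStatus)) {
                return false; // index already matches the execution store
            }
            indexedStatusById.put(workflowId, executionStatus); // repair the drift
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> index = new HashMap<>();
        index.put("wf-1", "RUNNING");

        IndexReconciler nonEs8 = new IndexReconciler() {};
        IndexReconciler es8 = new Es8IndexReconciler(index);

        System.out.println(nonEs8.reconcile("wf-1", "COMPLETED"));
        System.out.println(es8.reconcile("wf-1", "COMPLETED"));
        System.out.println(index.get("wf-1"));
    }
}
```

With this split, the core sweeper would call `reconcile(...)` unconditionally and the backend decides whether any read takes place.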
Thanks for the detailed work here. The analysis of the inconsistent refresh behavior is correct, and the problem is real. Closing this in favor of a different approach, but want to flag one piece worth preserving.
Why we're going a different direction
PR #823 is fixing the same root cause across ES7, OS v2, and OS v3 by routing
Adding a
The sweeper reconciliation logic is worth saving
The
If you're interested in contributing that as a standalone PR with the ES read gated behind a config flag (off by default), that would be a solid addition. Something like
Closing this now. Appreciate the effort you put into it.
@nthmost-orkes thanks, agree with the concern that
One ES8-specific clarification: PR #823 addresses this for ES7 and OS v2/v3 by routing
This PR already makes that ES8-specific change for workflow/task writes. So I do not think #823 needs to address ES8 here unless it is expanded beyond its current ES7/OS v2/OS v3 scope.
Separate from that,
The problem this flag is trying to solve is narrower than general indexing throughput. Some ES8 deployments require read-after-write search visibility for workflow/task state. Today
I updated the PR description to make this separation clearer. Could you please reopen this PR for review with that scope in mind?
Hi @v1r3n, @nthmost-orkes, can I please get some reviews on this? Much appreciated 🙏🏻
Summary
This PR fixes ES8 indexing consistency in two separate layers:
conductor.elasticsearch.refreshOnWrite, for deployments that require immediate search visibility after every ES8 write/delete.
By default, ES8 remains async/bulk-oriented. Forced refresh is only used when refreshOnWrite=true.
Issue
ES8 workflow writes had diverged from other ES8 write paths. They did not consistently go through the helper that resolves the concrete ILM index/write index before delegating to the shared indexing path. This could leave workflow indexing behavior different from task, message, event, and log indexing behavior.
Separately, some deployments need read-after-write search visibility for workflow/task state.
waitForIndexRefresh is not equivalent to that requirement because it maps to Elasticsearch refresh=wait_for, which waits for the next refresh cycle. It does not force an immediate refresh. Lowering index.refresh_interval can shorten the wait, but it still cannot mean "refresh after every write".
In practice, these gaps could leave Elasticsearch temporarily showing stale workflow/task state even after the execution store had already been updated.
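The difference between the two settings can be sketched as a mapping onto the three values Elasticsearch accepts for its refresh request parameter (true, wait_for, false). The class and method names below are hypothetical, and the precedence when both flags are set (refreshOnWrite wins) is an assumption made for illustration, not something this PR description states.

```java
public class RefreshPolicySketch {

    /** The values accepted by Elasticsearch's refresh request parameter. */
    enum Refresh {
        TRUE("true"),          // refresh the affected shards before the call returns
        WAIT_FOR("wait_for"),  // block until the next scheduled refresh makes the write visible
        FALSE("false");        // return immediately; visibility follows index.refresh_interval

        private final String paramValue;

        Refresh(String paramValue) {
            this.paramValue = paramValue;
        }

        String paramValue() {
            return paramValue;
        }
    }

    /**
     * Hypothetical resolution of the two configuration flags.
     * Assumption: refreshOnWrite takes precedence over waitForIndexRefresh.
     */
    static Refresh resolve(boolean refreshOnWrite, boolean waitForIndexRefresh) {
        if (refreshOnWrite) {
            return Refresh.TRUE;
        }
        if (waitForIndexRefresh) {
            return Refresh.WAIT_FOR;
        }
        return Refresh.FALSE;
    }

    public static void main(String[] args) {
        System.out.println("refresh=" + resolve(true, false).paramValue());
        System.out.println("refresh=" + resolve(false, true).paramValue());
        System.out.println("refresh=" + resolve(false, false).paramValue());
    }
}
```

Only the first case gives read-after-write visibility on every operation. The second still depends on the refresh cycle, which is why shortening index.refresh_interval reduces the wait without removing it.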
Why a fix is needed
Conductor relies on indexed workflow/task state for search, archival, and operational visibility. When ES8 mutations are not handled consistently:
Fix
This change:
indexDocumentWithIlmFallback(...), preserving ILM alias/index resolution before delegating to the shared indexing path
conductor.elasticsearch.refreshOnWrite=false as an ES8-only opt-in
refreshOnWrite=true, applies Elasticsearch refresh=true to ES8 index/update/delete/bulk writes so the affected shard is refreshed before the operation returns
waitForIndexRefresh semantics as refresh=wait_for
updateTime
Follow-up
Sweeper index/execution-store reconciliation was removed from this PR based on review feedback. That logic will be proposed separately as backend-specific reconciliation capability, with a no-op default and an ES8-specific implementation so non-ES8 backends do not pay additional read overhead.
Tests
refreshOnWrite option
Verification
./gradlew spotlessApply
./gradlew :conductor-core:test --tests org.conductoross.conductor.core.execution.WorkflowSweeperTest
Previous verification from this branch:
./gradlew :conductor-es8-persistence:test --tests org.conductoross.conductor.es8.config.ElasticSearchPropertiesTest --tests org.conductoross.conductor.es8.dao.index.TestElasticSearchRestDAOV8
Note: the ES8 integration suite is guarded by Testcontainers and was skipped in this environment rather than failing.