[BugFix] Handle TXN_IN_PROCESSING status with retry mechanism in tran… #473
+100
−62
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What type of PR is this:
Which issues of this PR fixes :
Fixes #
Problem Summary(Required) :
Scenario:
When using the StarRocks Flink connector with Two-Phase Commit (2PC) enabled, the Flink job occasionally throws a
StreamLoadFailException( "Status": "TXN_IN_PROCESSING", "Message": "Transaction in processing, please retry later") during the checkpointing phase.
This typically happens under high I/O load or when the data chunks are small and sent quickly. The StarRocks FE returns a response with "Status": "TXN_IN_PROCESSING" during the prepare request.
Root Cause:
The TXN_IN_PROCESSING status indicates that although the data has been received by the Backend (BE), the internal asynchronous processes (such as flushing MemTables to disk or completing replica synchronization) are still in progress. In the current implementation, the connector treats this as a terminal failure. Because Flink triggers the
prepare call immediately after data transmission (sometimes within milliseconds), it's highly possible that the backend hasn't finished its housekeeping tasks.
Measures:
Retry Mechanism: Modified TransactionStreamLoader.java to capture the TXN_IN_PROCESSING status specifically during the prepare phase. Instead of failing immediately, the connector will now wait and retry the request.
Configurable Parameters: Introduced two new properties to allow users to customize the retry behavior:
sink.properties.prepare_retry_times: Maximum retries (default is 6).
sink.properties.prepare_retry_interval_ms: Sleeping time between retries (default is 1000ms).
Constants Update: Added TXN_IN_PROCESSING and the new property keys to StreamLoadConstants.java to ensure consistency.
This change prevents transient StarRocks status from causing global Flink job restarts, significantly improving the stability of the streaming pipeline.
Checklist: