[BugFix] Handle TXN_IN_PROCESSING status with retry mechanism in tran… #473

shengjk · 2025-12-25T08:54:02Z

What type of PR is this:

BugFix
Enhancement

Which issues does this PR fix:

Fixes #

Problem Summary (Required):

Scenario:
When using the StarRocks Flink connector with Two-Phase Commit (2PC) enabled, the Flink job occasionally throws a StreamLoadFailException ("Status": "TXN_IN_PROCESSING", "Message": "Transaction in processing, please retry later") during the checkpointing phase.

This typically happens under high I/O load, or when data chunks are small and sent in quick succession. The StarRocks FE responds to the prepare request with "Status": "TXN_IN_PROCESSING".

Root Cause:
The TXN_IN_PROCESSING status indicates that although the Backend (BE) has received the data, internal asynchronous work (such as flushing MemTables to disk or completing replica synchronization) is still in progress. The current implementation treats this status as a terminal failure. Because Flink triggers the prepare call immediately after data transmission (sometimes within milliseconds), the backend has often not finished this work when the request arrives.

Measures:
Retry Mechanism: TransactionStreamLoader.java now detects the TXN_IN_PROCESSING status specifically during the prepare phase. Instead of failing immediately, the connector waits and retries the request.

Configurable Parameters: Introduced two new properties to allow users to customize the retry behavior:
sink.properties.prepare_retry_times: Maximum number of retries (default: 6).
sink.properties.prepare_retry_interval_ms: Sleep time between retries, in milliseconds (default: 1000).
Constants Update: Added TXN_IN_PROCESSING and the new property keys to StreamLoadConstants.java to ensure consistency.
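
For reviewers, the retry behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual code in TransactionStreamLoader.java: the class name PrepareRetrySketch, the helper methods, and the "OK" status used as a stand-in for a successful prepare response are all assumptions made for the example. Only the two property keys, their defaults, and the TXN_IN_PROCESSING status come from this PR. The sketch also assumes that "maximum retries" counts attempts made after the first prepare call.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch; names other than the property keys and
// TXN_IN_PROCESSING are illustrative assumptions.
public class PrepareRetrySketch {
    static final String TXN_IN_PROCESSING = "TXN_IN_PROCESSING";

    // Property keys and defaults as described in this PR.
    static final String RETRY_TIMES_KEY = "sink.properties.prepare_retry_times";
    static final String RETRY_INTERVAL_KEY = "sink.properties.prepare_retry_interval_ms";
    static final int DEFAULT_RETRY_TIMES = 6;
    static final long DEFAULT_RETRY_INTERVAL_MS = 1000L;

    // Reads the retry count, falling back to the default when the key is absent.
    static int retryTimes(Map<String, String> props) {
        return Integer.parseInt(
                props.getOrDefault(RETRY_TIMES_KEY, String.valueOf(DEFAULT_RETRY_TIMES)));
    }

    // Reads the retry interval, falling back to the default when the key is absent.
    static long retryIntervalMs(Map<String, String> props) {
        return Long.parseLong(
                props.getOrDefault(RETRY_INTERVAL_KEY, String.valueOf(DEFAULT_RETRY_INTERVAL_MS)));
    }

    // Issues the prepare call once, then retries up to maxRetries times while
    // the returned status is TXN_IN_PROCESSING. Returns the last status seen,
    // so the caller still fails if the status never changes.
    static String prepareWithRetry(Supplier<String> prepareCall, int maxRetries, long intervalMs) {
        String status = prepareCall.get();
        for (int attempt = 0; attempt < maxRetries && TXN_IN_PROCESSING.equals(status); attempt++) {
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting to retry prepare", e);
            }
            status = prepareCall.get();
        }
        return status;
    }

    public static void main(String[] args) {
        // Simulated prepare responses: two transient statuses, then a placeholder success.
        Iterator<String> responses =
                List.of(TXN_IN_PROCESSING, TXN_IN_PROCESSING, "OK").iterator();
        Map<String, String> props = Map.of(RETRY_INTERVAL_KEY, "10");

        String status = prepareWithRetry(responses::next, retryTimes(props), retryIntervalMs(props));
        System.out.println(status); // prints "OK"
    }
}
```

With the defaults and under these assumptions, a prepare request would be given roughly six seconds of additional time before the connector reports a failure. Any status other than TXN_IN_PROCESSING is returned immediately, so genuine errors are not delayed.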

This change prevents a transient StarRocks status from triggering a full Flink job restart, improving the stability of the streaming pipeline.

Checklist:

This PR will affect user behavior
This PR needs user documentation (for new or modified features or behaviors)

…saction prepare phase Signed-off-by: shengjk1 <[email protected]>

shengjk · 2025-12-26T01:04:00Z

@banmoy @meegoo Could you please review this PR?

We've encountered frequent TXN_IN_PROCESSING errors during checkpoints under high I/O pressure. This PR effectively bridges the asynchronous gap between data ingestion and physical persistence in StarRocks BE, significantly reducing job-restart storms caused by transient backend states.

[BugFix] Handle TXN_IN_PROCESSING status with retry mechanism in tran…

a1eab15

…saction prepare phase Signed-off-by: shengjk1 <[email protected]>

shengjk force-pushed the main branch from 59996dd to a1eab15 Compare December 25, 2025 09:09
