shengjk commented Dec 25, 2025

What type of PR is this:

  • BugFix
  • Enhancement

Which issues does this PR fix:

Fixes #

Problem Summary(Required) :

Scenario:
When using the StarRocks Flink connector with two-phase commit (2PC) enabled, the Flink job occasionally throws a StreamLoadFailException ("Status": "TXN_IN_PROCESSING", "Message": "Transaction in processing, please retry later") during the checkpointing phase.

This typically happens under high I/O load or when the data chunks are small and sent quickly. The StarRocks FE returns a response with "Status": "TXN_IN_PROCESSING" during the prepare request.

Root Cause:
The TXN_IN_PROCESSING status indicates that, although the data has been received by the Backend (BE), internal asynchronous processes (such as flushing MemTables to disk or completing replica synchronization) are still in progress. In the current implementation, the connector treats this status as a terminal failure. Because Flink triggers the prepare call immediately after data transmission (sometimes within milliseconds), it is quite likely that the backend has not yet finished this work when the prepare request arrives.

Measures:
Retry Mechanism: Modified TransactionStreamLoader.java to detect the TXN_IN_PROCESSING status specifically during the prepare phase. Instead of failing immediately, the connector now waits and retries the request.

Configurable Parameters: Introduced two new properties that let users customize the retry behavior:
  • sink.properties.prepare_retry_times: maximum number of retries (default: 6).
  • sink.properties.prepare_retry_interval_ms: sleep time between retries (default: 1000 ms).

Constants Update: Added TXN_IN_PROCESSING and the new property keys to StreamLoadConstants.java to ensure consistency.
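
For reviewers who want the intended behavior at a glance, here is a minimal, self-contained sketch of the retry logic described above. It is not the code in TransactionStreamLoader.java: the class name PrepareRetrySketch, the helper methods, the Supplier-based prepare call, and the "OK" success value are all assumptions made for illustration. Only the TXN_IN_PROCESSING status, the two property keys, and their defaults come from this PR.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch of the prepare-phase retry; not the connector's actual code. */
public class PrepareRetrySketch {

    static final String TXN_IN_PROCESSING = "TXN_IN_PROCESSING";
    // Property keys and defaults as stated in this PR description.
    static final String RETRY_TIMES_KEY = "sink.properties.prepare_retry_times";
    static final String RETRY_INTERVAL_KEY = "sink.properties.prepare_retry_interval_ms";
    static final int DEFAULT_RETRY_TIMES = 6;
    static final long DEFAULT_RETRY_INTERVAL_MS = 1000L;

    /** Reads the maximum retry count, falling back to the default. */
    static int retryTimes(Map<String, String> props) {
        String value = props.get(RETRY_TIMES_KEY);
        return value == null ? DEFAULT_RETRY_TIMES : Integer.parseInt(value);
    }

    /** Reads the sleep interval between retries, falling back to the default. */
    static long retryIntervalMs(Map<String, String> props) {
        String value = props.get(RETRY_INTERVAL_KEY);
        return value == null ? DEFAULT_RETRY_INTERVAL_MS : Long.parseLong(value);
    }

    /**
     * Issues the prepare call once, then retries only while the returned status
     * is TXN_IN_PROCESSING and the retry budget is not exhausted.
     * Returns the last status observed; the caller decides whether it is a failure.
     */
    static String prepareWithRetry(Supplier<String> prepareCall, int maxRetries, long intervalMs)
            throws InterruptedException {
        String status = prepareCall.get();
        for (int retry = 0; retry < maxRetries && TXN_IN_PROCESSING.equals(status); retry++) {
            Thread.sleep(intervalMs);
            status = prepareCall.get();
        }
        return status;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated responses: two transient statuses, then a stand-in success value.
        Deque<String> responses =
                new ArrayDeque<>(Arrays.asList(TXN_IN_PROCESSING, TXN_IN_PROCESSING, "OK"));

        Map<String, String> props = new HashMap<>();
        props.put(RETRY_INTERVAL_KEY, "10"); // short interval so the demo finishes quickly

        String status = prepareWithRetry(responses::poll, retryTimes(props), retryIntervalMs(props));
        System.out.println(status); // prints "OK"
    }
}
```

In this sketch any status other than TXN_IN_PROCESSING ends the loop immediately, so genuine failures still surface without delay. If the retry budget runs out, the last TXN_IN_PROCESSING status is returned and the caller would fail as before.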

This change prevents a transient StarRocks status from triggering a global Flink job restart, which improves the stability of the streaming pipeline.

Checklist:

  • This pr will affect users' behaviors
  • This pr needs user documentation (for new or modified features or behaviors)

shengjk commented Dec 26, 2025

@banmoy @meegoo Could you please review this PR?

We've encountered frequent TXN_IN_PROCESSING errors during checkpoints under high I/O pressure. This PR effectively bridges the asynchronous gap between data ingestion and physical persistence in StarRocks BE, significantly reducing job-restart storms caused by transient backend states.
