[Fix](batch) Prevent writer deadlock from currentCacheBytes drift#653

Merged
JNSimba merged 1 commit into apache:master from addu390:fix/batch-writer-deadlock-cachebytes-drift
May 6, 2026

Conversation

@addu390
Contributor

@addu390 addu390 commented May 2, 2026

Proposed changes

Issue Number: close #614

Problem Summary:

DorisBatchStreamLoad increments currentCacheBytes by the client-side record bytes on insert, but decrements it by respContent.getLoadBytes() on a successful load. Whenever the BE-reported value is smaller than what the client buffered (for example with partial_columns=true or compress_type=gz), each load leaves a few unreleased bytes in the counter.

Over time the leak accumulates until currentCacheBytes exceeds maxBlockedBytes, so writeRecord parks on block.await() forever even though bufferMap and flushQueue are empty. The job freezes without throwing an exception; the logs only repeat "Cache full, waiting for flush" and "bufferMap is empty, no need to flush null".
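The drift described above can be reproduced in isolation. The following is a minimal sketch, not the connector's code: the class name `CacheBytesDrift`, the helper `residualAfter`, and the byte values are hypothetical and chosen only to illustrate the arithmetic of an asymmetric add/subtract on an `AtomicLong`.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-alone model of the counter; not part of DorisBatchStreamLoad.
class CacheBytesDrift {

    /**
     * Runs `loads` insert/flush cycles and returns what is left in the counter
     * after every buffer has been flushed. A non-zero result is drift.
     */
    static long residualAfter(int loads, long bufferedBytes, long reportedLoadBytes, boolean symmetric) {
        AtomicLong currentCacheBytes = new AtomicLong();
        for (int i = 0; i < loads; i++) {
            // Insert path: count the bytes the client buffered.
            currentCacheBytes.addAndGet(bufferedBytes);
            // Load-success path: release either the buffered size (symmetric)
            // or the smaller value reported back by the BE (asymmetric).
            long released = symmetric ? bufferedBytes : reportedLoadBytes;
            currentCacheBytes.addAndGet(-released);
        }
        return currentCacheBytes.get();
    }

    public static void main(String[] args) {
        long buffered = 1_000_000L; // assumed client-side bytes per load
        long reported = 400_000L;   // assumed smaller BE-reported bytes per load
        System.out.println("asymmetric residual: " + residualAfter(100, buffered, reported, false));
        System.out.println("symmetric residual:  " + residualAfter(100, buffered, reported, true));
    }
}
```

With these assumed values the asymmetric variant leaves 60000000 bytes in the counter after 100 loads even though nothing is buffered, while the symmetric variant returns to 0. Once such a residual reaches the blocking threshold, a writer waiting for the counter to drop has nothing left that could lower it.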

Two changes:

  1. Decrement currentCacheBytes by buffer.getBufferSizeBytes() so the add and subtract are symmetric regardless of compression / projection.
  2. Move the per-buffer flush check above the global cache-pressure await loop so a buffer that just crossed bufferFlushMaxBytes actually gets flushed instead of being stranded behind backpressure.
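The effect of the second change can be illustrated with a deliberately simplified, single-threaded model. This is a sketch under stated assumptions, not the connector's implementation: the real class is multi-threaded and waits on a condition, whereas here "waiting" is reduced to a boolean. The class `FlushOrderingSketch`, its fields, and both limits are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical single-threaded model of the writer path; not the real DorisBatchStreamLoad.
class FlushOrderingSketch {
    static final long MAX_BLOCKED_BYTES = 100;     // assumed global backpressure limit
    static final long BUFFER_FLUSH_MAX_BYTES = 60; // assumed per-buffer flush limit

    long currentCacheBytes;
    long bufferBytes;
    final Deque<Long> flushQueue = new ArrayDeque<>();

    /**
     * Buffers one record and returns true if the writer would have to wait on
     * cache pressure while nothing is queued that could release bytes.
     */
    boolean writeRecord(long recordBytes, boolean flushCheckFirst) {
        bufferBytes += recordBytes;
        currentCacheBytes += recordBytes;
        if (flushCheckFirst) {
            enqueueIfFull(); // new ordering: hand over the full buffer before waiting
        }
        boolean stranded = currentCacheBytes >= MAX_BLOCKED_BYTES && flushQueue.isEmpty();
        if (!flushCheckFirst && !stranded) {
            enqueueIfFull(); // old ordering: only reached once the wait is over
        }
        return stranded;
    }

    void enqueueIfFull() {
        if (bufferBytes >= BUFFER_FLUSH_MAX_BYTES) {
            flushQueue.add(bufferBytes);
            bufferBytes = 0;
        }
    }

    /** Load-success path: release exactly the bytes the flushed buffer held. */
    void completeOneLoad() {
        Long flushed = flushQueue.poll();
        if (flushed != null) {
            currentCacheBytes -= flushed;
        }
    }

    public static void main(String[] args) {
        System.out.println("old ordering stranded: " + new FlushOrderingSketch().writeRecord(100, false));
        System.out.println("new ordering stranded: " + new FlushOrderingSketch().writeRecord(100, true));
    }
}
```

In this model a single 100-byte record strands the writer under the old ordering, because the full buffer is never queued. Under the new ordering the buffer is queued first, and `completeOneLoad()` then brings the counter back to zero by subtracting the buffered size.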

Checklist(Required)

  1. Does it affect the original behavior: No
  2. Has unit tests been added: No Need
  3. Has document been added or modified: No Need
  4. Does it need to update dependencies: No
  5. Are there any changes that cannot be rolled back: No

Further comments

The same root cause was independently reported in #614 (triggered by gz compression). The fix is config-agnostic: it covers partial_columns, gz, and any future source of client-vs-BE byte asymmetry.

Copilot AI left a comment

Pull request overview

This PR fixes a batch sink stall in DorisBatchStreamLoad by making cache-byte accounting symmetric with what the client actually buffered and by triggering per-buffer flushes before entering global backpressure waits. In the Flink Doris connector, these changes target the writer path that can otherwise freeze when currentCacheBytes drifts upward over time.

Changes:

  • Flush full buffers before waiting on global cache pressure so newly full buffers are not stranded behind backpressure.
  • Decrement currentCacheBytes using the buffered byte count instead of Doris-reported loadBytes.
  • Keep the fix local to the batch stream-load implementation without changing public APIs.

Comment on lines +205 to +214
    if (flushQueue.size() < executionOptions.getFlushQueueSize()
            && (buffer.getBufferSizeBytes() >= executionOptions.getBufferFlushMaxBytes()
                    || buffer.getNumOfRecords() >= executionOptions.getBufferFlushMaxRows())) {
        boolean flush = bufferFullFlush(bufferKey);
        LOG.info("trigger flush by buffer full, flush: {}", flush);
    } else if (buffer.getBufferSizeBytes() >= STREAM_LOAD_MAX_BYTES
            || buffer.getNumOfRecords() >= STREAM_LOAD_MAX_ROWS) {
        // The buffer capacity exceeds the stream load limit, flush
        boolean flush = bufferFullFlush(bufferKey);
        LOG.info("trigger flush by buffer exceeding the limit, flush: {}", flush);
Comment on lines 499 to +500
      long cacheByteBeforeFlush =
    -         currentCacheBytes.getAndAdd(-respContent.getLoadBytes());
    +         currentCacheBytes.getAndAdd(-buffer.getBufferSizeBytes());
Member

@JNSimba JNSimba left a comment

LGTM, Thank you for your contribution.

@JNSimba JNSimba merged commit 0044826 into apache:master May 6, 2026
13 checks passed
@addu390
Contributor Author

addu390 commented May 6, 2026

@JNSimba Thanks for the review. When would the next release cutoff be?

@JNSimba
Member

JNSimba commented May 7, 2026

> @JNSimba Thanks for the review. When would the next release cutoff be?

Yes, a fix will be released quickly, and a vote is expected to be launched within the next two days.

JNSimba added a commit to apache/doris-website that referenced this pull request May 11, 2026
## Versions

- [x] dev
- [x] 4.x
- [ ] 3.x
- [ ] 2.1 or older (not covered by version/language sync gate)

## Languages

- [x] Chinese
- [x] English
- [ ] Japanese candidate translation needed

## Docs Checklist

- [ ] Checked by AI
- [ ] Test Cases Built
- [x] Updated required version and language counterparts, or explained
why not
- [x] If only one language changed, confirmed whether source/translation
counterparts need sync

## Summary

Release Flink Doris Connector 26.1.1, superseding 26.1.0.

- Version table in `flink-doris-connector.md` (dev + 4.x, EN + zh-CN):
replace `26.1.0` row with `26.1.1`.
- `release-notes.md` (dev + 4.x, EN + zh-CN): prepend a `26.1.1`
section.
- Bug fix: batch sink potentially freezing during prolonged operation
when compression is enabled (apache/doris-flink-connector#653).
  - Credits: @addu390
- Download page (`src/constant/download.data.ts`): replace 26.1.0 entry
(label/value/source/binary URLs and the `FLINK_SAME_SOURCE_2610`
constant) with 26.1.1.

Release notes reference: apache/doris-flink-connector#654

Development

Successfully merging this pull request may close these issues.

[Bug] Sink CSV format, enable GZ compression to batch write to Doris, will get stuck and counter release error

3 participants