Conversation

@m1a2st
Collaborator

@m1a2st m1a2st commented Jan 31, 2026

We add a new map that records which topic partitions have overflowed a
segment during cleaning. The next time a group from such a partition is
processed, we scale the segment size to 90% of its normal value so that
the overflow does not happen again. If the partition still overflows, we
multiply the ratio by 0.9 again on each subsequent attempt until the
partition is cleaned successfully.
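
A minimal sketch of the bookkeeping described above, not the code in this PR. The class name `OverflowRatioTracker`, its method names, and the use of a plain `String` as the partition key are all assumptions made for illustration; only the 0.9 factor and the idea of a per-partition map come from the description.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper; the PR keeps this state in a map on the cleaner itself. */
final class OverflowRatioTracker {
    // Decay factor taken from the 0.9 mentioned in the description.
    static final double DECAY = 0.9;

    // Partition name -> size ratio to apply on the next cleaning round.
    private final Map<String, Double> ratios = new ConcurrentHashMap<>();

    /** Ratio for the next round; 1.0 means the partition has no overflow history. */
    double currentRatio(String partition) {
        return ratios.getOrDefault(partition, 1.0);
    }

    /** Record an overflow: the first one stores 0.9, later ones multiply by 0.9 again. */
    double recordOverflow(String partition) {
        return ratios.merge(partition, DECAY, (old, decay) -> old * decay);
    }

    /** Clear the marker after a successful clean; returns true if a marker existed. */
    boolean recordSuccess(String partition) {
        return ratios.remove(partition) != null;
    }

    public static void main(String[] args) {
        OverflowRatioTracker tracker = new OverflowRatioTracker();
        String tp = "topic-0";
        System.out.println(tracker.currentRatio(tp)); // no history yet
        tracker.recordOverflow(tp);
        tracker.recordOverflow(tp);
        System.out.println(tracker.currentRatio(tp)); // after two overflows in a row
        tracker.recordSuccess(tp);
        System.out.println(tracker.currentRatio(tp)); // marker cleared
    }
}
```

Using `Map.merge` keeps the "first overflow" and "repeated overflow" cases in one call; whether the marker should be cleared on the first successful clean is a design choice.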

@github-actions github-actions bot added triage PRs from the community core Kafka Broker storage Pull requests that target the storage module labels Jan 31, 2026
Member

@chia7712 chia7712 left a comment

@m1a2st thanks for this fix

try {
// it's OK not to hold the Log's lock in this case, because this segment is only accessed by other threads
// after `Log.replaceSegments` (which acquires the lock) is called
dest.append(result.maxOffset(), retained);
Member

Could you wrap only dest.append in the try-catch block, to avoid catching unrelated errors?
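
A rough sketch of what a narrower try-catch could look like. Everything here is a stand-in: `Dest`, `filter`, and `SketchOverflowException` are hypothetical, and using `IllegalStateException` as the overflow signal is purely an assumption, since the diff context does not show which exception the append call raises.

```java
/** Stand-in for the PR's SegmentOverflowException (the real one takes a LogSegment). */
final class SketchOverflowException extends RuntimeException {
    SketchOverflowException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class NarrowTryCatchSketch {
    /** Hypothetical destination segment with a hard size limit. */
    static final class Dest {
        private final int maxBytes;
        private int size;

        Dest(int maxBytes) {
            this.maxBytes = maxBytes;
        }

        void append(int bytes) {
            // Placeholder overflow signal; the real append may fail differently.
            if (size + bytes > maxBytes) throw new IllegalStateException("segment full");
            size += bytes;
        }
    }

    /** Hypothetical filtering step that can fail for reasons unrelated to overflow. */
    static int filter(int batch) {
        if (batch < 0) throw new IllegalArgumentException("corrupt batch");
        return batch;
    }

    static void copy(Dest dest, int[] batches) {
        for (int batch : batches) {
            // Outside the try: a failure here propagates unchanged.
            int retained = filter(batch);
            try {
                // Only the append is guarded, so only its failure is translated.
                dest.append(retained);
            } catch (IllegalStateException e) {
                throw new SketchOverflowException("Segment size would overflow during compaction", e);
            }
        }
    }
}
```

With the filtering call outside the try block, an unrelated failure keeps its original type instead of being reported as an overflow.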


public SegmentOverflowException(LogSegment segment) {
super("Segment size would overflow during compaction for segment " + segment);
this.segment = segment;
Member

Why do we need it?

log.name(), new Date(cleanableHorizonMs), new Date(legacyDeleteHorizonMs));
CleanedTransactionMetadata transactionMetadata = new CleanedTransactionMetadata();

double sizeRatio = 1.0;
Member

Would something like this work better?

        double sizeRatio = segmentOverflowPartitions.getOrDefault(log.topicPartition(), 1.0);
        if (sizeRatio != 1.0) {
            logger.info("Partition {} has overflow history. Reducing effective segment size to {}% for this round.",
                    log.topicPartition(), sizeRatio * 100);
        }
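
For illustration only, one way the looked-up ratio might be turned into an effective size limit. The helper `effectiveMaxSize` and its clamping rules are assumptions; the diff context here does not show how `sizeRatio` is actually applied to the segment size.

```java
final class EffectiveSegmentSizeSketch {
    /**
     * Scale a configured maximum segment size by the overflow ratio.
     * Hypothetical helper: rejects ratios outside (0, 1] and never returns less than 1.
     */
    static int effectiveMaxSize(int configuredMaxSize, double sizeRatio) {
        if (configuredMaxSize <= 0) {
            throw new IllegalArgumentException("configuredMaxSize must be positive");
        }
        if (sizeRatio <= 0.0 || sizeRatio > 1.0) {
            throw new IllegalArgumentException("sizeRatio must be in (0, 1]");
        }
        // The ratio is at most 1.0, so the product cannot exceed Integer.MAX_VALUE.
        long scaled = (long) (configuredMaxSize * sizeRatio);
        return (int) Math.max(1L, scaled);
    }

    public static void main(String[] args) {
        int configured = 1024 * 1024 * 1024; // 1 GiB, an arbitrary example value
        double ratio = 1.0;
        for (int round = 0; round < 4; round++) {
            System.out.println("round " + round + ": " + effectiveMaxSize(configured, ratio));
            ratio *= 0.9; // one more overflow, one more reduction
        }
    }
}
```

Because each overflow multiplies the ratio by 0.9, the limit shrinks geometrically: after n consecutive overflows it is 0.9^n of the configured size.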

cleanSegments(log, group, offsetMap, currentTime, stats, transactionMetadata, legacyDeleteHorizonMs, upperBoundOffset);
}

if (segmentOverflowPartitions.containsKey(log.topicPartition())) {
Member

        if (segmentOverflowPartitions.remove(log.topicPartition()) != null) {
            logger.info("Successfully cleaned log {} with degraded size (ratio: {}%). " +
                            "Cleared overflow marker. Next cleaning will use normal size.",
                    log.name(), sizeRatio * 100);
        }

currentTime
);
} catch (SegmentOverflowException e) {
if (segmentOverflowPartitions.containsKey(log.topicPartition())) {
Member

                    var previousRatio = segmentOverflowPartitions.put(log.topicPartition(),
                            segmentOverflowPartitions.getOrDefault(log.topicPartition(), 1.0) * 0.9);
                    if (previousRatio == null) {
                        logger.warn("Segment overflow detected for partition {}: {}. " +
                                        "Marked for degradation to 90% size in next cleaning round.",
                                log.topicPartition(), e.getMessage());
                    } else {
                        logger.warn("Repeated segment overflow for partition {}: {}. " +
                                        "Further degrading to {}% size in next cleaning round.",
                                log.topicPartition(), e.getMessage(), previousRatio * 0.9 * 100);
                    }

@chia7712 chia7712 requested review from jolshan and junrao January 31, 2026 19:38
@github-actions github-actions bot removed the triage PRs from the community label Feb 1, 2026
Labels

core Kafka Broker storage Pull requests that target the storage module

2 participants