KAFKA-20036: Handle LogCleaner segment overflow caused by compression level changes (#21379)
m1a2st wants to merge 32 commits into apache:trunk
Conversation
```java
List<List<LogSegment>> groupedSegments = groupSegmentsBySize(
    log.logSegments(0, endOffset),
    log.config().segmentSize(),
    effectiveMaxSize,
```
Hmm, does this approach work in general? When grouping segments, we need to include at least one segment in the group. It's possible that the cleaning of a single segment can cause it to exceed 2GB.
That is a good point. We could follow the approach used for handling offset overflow: split the segment and then restart the cleanup. The trade-off is that the first half of the segment will be cleaned in isolation, so there might be little to nothing to clean up :)
That approach could work, but one has to guess the size to split the segments into. Have you considered the alternative of creating multiple cleaned segments? log.replaceSegments() already supports replacing multiple segments. If cleanInto() hits a file overflow exception, we could close the current cleaned segment, create a new one and continue the cleaning.
yes, we can roll the new segment when either the "size check" or "overflow check" is triggered.

```java
// 1. Size check: current size + retained size > config limit
// 2. Overflow check: max offset - base offset > Integer.MAX_VALUE
boolean willExceedSize = (long) dest.size() + retained.sizeInBytes() > log.config().segmentSize();
boolean willOverflow = result.maxOffset() - dest.baseOffset() > Integer.MAX_VALUE;
if (willExceedSize || willOverflow) {
    logger.info("Rolling new segment. Condition met: size_exceeded={}, overflow={}. (Segment size: {}, Batch size: {}, BaseOffset: {}, MaxOffset: {})",
        willExceedSize, willOverflow, dest.size(), retained.sizeInBytes(), dest.baseOffset(), result.maxOffset());
    dest = rollNewSegment(log, dest, cleanedSegments, transactionMetadata, retained);
}
```

However, I have another concern regarding "temporary disk usage". If we remove the initial segment grouping entirely, it might require a significant amount of disk space to hold all the cleaned segments simultaneously before the replacement happens.
I believe the grouping logic should be retained, but simplified to serve as a batch size threshold. This way, we can control the cleaning scope to avoid occupying too much disk space, while still allowing the inner logic to split segments dynamically if needed.
I think we should retain groupSegmentsBySize() to control temporary disk usage, and handle overflow dynamically within cleanInto() by creating multiple cleaned segments as needed.
This approach allows us to avoid disk space issues while still handling segment overflow gracefully. The peak disk usage remains bounded by the group size rather than the total log size.
Yes, we will still want to group the segments.
```java
 * @param currentTime The time at which the clean was initiated
 * @param log The log instance for creating new segments if overflow occurs
 *
 * @return The current active destination segment (may be different from the input dest if overflow occurred)
```
This API seems awkward. An alternative is to instead pass in a starting position in sourceRecords to cleanInto(). Initially, we can pass in 0 as the position; if cleanInto() hits a size limit, it throws an exception with the current position in sourceRecords. The caller catches this exception, creates a new destination segment, and calls cleanInto() again with the position from the exception and the new destination segment.
Thanks for the feedback! I agree this is a cleaner design. I've updated cleanInto() to accept a starting position and throw an exception containing the current position on overflow. The caller then creates a new destination segment and resumes cleaning from where it left off.
```java
MemoryRecords retained = MemoryRecords.readableRecords(outputBuffer);

// Check for TWO types of overflow BEFORE appending:
// 1. Offset overflow: offset range exceeds Integer.MAX_VALUE
```
The grouping of the segments takes offset overflow into consideration. So, it seems that we can't hit this?
```java
// Complete current cleaned segment
currentCleaned.onBecomeInactiveSegment();
currentCleaned.flush();
```
Should we call setLastModified on currentCleaned?
Thanks for the review, @junrao. I have addressed all the comments.
```java
// If cleanInto completes without exception, we're done with this segment
cleaningComplete = true;

} catch (SegmentSizeOverflowException e) {
```
Where do we catch LogSegmentOffsetOverflowException now?
Using an exception for control flow feels a bit unnatural here. Could we consider returning a result object instead to indicate where the cleaning stopped?
Let me attach an example to strengthen my comment.
```java
private record Overflow(int position) {}

if (sizeOverflow || offsetOverflow) {
    // log.xxx
    return Optional.of(new Overflow(position - result.bytesRead()));
}
...
return Optional.empty();
```

```java
while (!cleaningComplete) {
    Optional<Overflow> overflowOpt = cleanInto(
        log.topicPartition(),
        currentSegment.log(),
        currentCleaned,
        position,
    );
    if (overflowOpt.isPresent()) {
        Overflow overflow = overflowOpt.get();
        logger.info("Completing cleaned segment {} due to overflow, creating new segment", currentCleaned.baseOffset());
        currentCleaned.onBecomeInactiveSegment();
        currentCleaned.flush();
        currentCleaned.setLastModified(currentSegment.lastModified());
        cleanedSegments.add(currentCleaned);
        Iterator<FileChannelRecordBatch> nextBatches = currentSegment.log().batchesFrom(overflow.position()).iterator();
        long nextBaseOffset = nextBatches.hasNext() ? nextBatches.next().baseOffset() : currentCleaned.readNextOffset();
        currentCleaned = UnifiedLog.createNewCleanedSegment(log.dir(), log.config(), nextBaseOffset);
        transactionMetadata.setCleanedIndex(Optional.of(currentCleaned.txnIndex()));
        position = overflow.position();
    } else {
        cleaningComplete = true;
    }
}
```
```java
cleanedSegments.add(currentCleaned);

// Create new cleaned segment with base offset = next offset of completed segment
long nextBaseOffset = currentCleaned.readNextOffset();
```
If we use the logic here to handle LogSegmentOffsetOverflowException, it's better to set nextBaseOffset as the base offset of the next batch to be cleaned. Since compaction can leave a hole, currentCleaned.readNextOffset() may not always be the base offset of the next batch to be cleaned.
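To make the point concrete, here is a small standalone sketch (all names are hypothetical: `Batch` stands in for `FileChannelRecordBatch`, and the fallback argument stands in for `currentCleaned.readNextOffset()`) of preferring the next batch's base offset over the completed segment's next offset:

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: choose the base offset for the next cleaned segment.
public class NextBaseOffsetSketch {
    record Batch(long baseOffset) {}

    // Prefer the base offset of the next batch still to be cleaned; compaction
    // can leave offset holes, so the completed segment's readNextOffset() may
    // point into a gap and is only used as a fallback when no batches remain.
    static long nextBaseOffset(List<Batch> remainingBatches, long nextOffsetFallback) {
        Iterator<Batch> it = remainingBatches.iterator();
        return it.hasNext() ? it.next().baseOffset() : nextOffsetFallback;
    }

    public static void main(String[] args) {
        // cleaning stopped before a batch at offset 5000; readNextOffset() was 4200
        System.out.println(nextBaseOffset(List.of(new Batch(5000)), 4200)); // 5000
        // no batches left: fall back to the completed segment's next offset
        System.out.println(nextBaseOffset(List.of(), 4200));                // 4200
    }
}
```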
```diff
-// remove the index entry
-if (segment.baseOffset() != sortedNewSegments.get(0).baseOffset()) {
+// remove the index entry; skip removal for base offsets that a new segment is replacing in-place
+if (!newSegmentBaseOffsets.contains(segment.baseOffset())) {
```
Could we merge the code in line 1051 here?
```java
// recompression during cleaning can cause the cleaned segment to exceed that size.
// Similarly, combining multiple source segments into one cleaned segment can cause
// the offset range to exceed Integer.MAX_VALUE.
boolean sizeOverflow = retained.sizeInBytes() > maxCleanedSegmentSize - dest.size();
```
Should we allow the write for the first batch of an empty segment?
We should allow this.
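A minimal, self-contained sketch of what that guard could look like (all names and the size constant are illustrative, not from the patch): admitting the first batch of an empty segment unconditionally ensures progress even when a single cleaned batch exceeds the limit, avoiding an endless roll loop.

```java
// Hypothetical standalone sketch of the overflow checks discussed above.
// destSize, retainedBytes, etc. stand in for dest.size(), retained.sizeInBytes(), ...
public class OverflowCheckSketch {
    static final long MAX_SEGMENT_SIZE = 1024L * 1024L; // assumed config limit

    // Returns true when the retained batch should trigger rolling a new cleaned segment.
    static boolean shouldRoll(long destSize, long retainedBytes,
                              long baseOffset, long maxOffset) {
        if (destSize == 0) {
            return false; // always admit the first batch of an empty segment
        }
        boolean sizeOverflow = retainedBytes > MAX_SEGMENT_SIZE - destSize;
        boolean offsetOverflow = maxOffset - baseOffset > Integer.MAX_VALUE;
        return sizeOverflow || offsetOverflow;
    }

    public static void main(String[] args) {
        // an empty destination accepts even an oversized first batch
        System.out.println(shouldRoll(0, 2L * MAX_SEGMENT_SIZE, 0, 100));   // false
        // a non-empty destination rolls when the size limit would be exceeded
        System.out.println(shouldRoll(MAX_SEGMENT_SIZE - 10, 100, 0, 100)); // true
        // an offset range beyond Integer.MAX_VALUE also triggers a roll
        System.out.println(shouldRoll(1, 1, 0, (long) Integer.MAX_VALUE + 1)); // true
    }
}
```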
Since the integration test is expensive and generating 2GB of data is not ideal, I removed it.
I exercised this patch via a producer flow with positive results. Instead of throwing an exception, the cleaner now yields undersized log files. Typically, these files will be naturally merged in future compaction passes as the dataset accrues duplicate keys.
```diff
- * Clean a group of segments into a single replacement segment.
+ * Clean a group of segments into one or more replacement segments.
  *
  * <p>If cleaning would cause the destination segment's size or offset range to exceed the configured limit
```
```java
 * <p>If cleaning causes the destination segment's size or offset range to exceed the configured limit
 * (e.g., due to recompression or combining multiple source segments), the current cleaned segment is
 * finalized and a new one is started.
```
We add a new map to record which topic partitions have experienced
overflow. When an overflow occurs, the next time the group is
processed, we reduce the segment size by a factor of 0.9 to prevent the
overflow from happening again. If the partition still overflows, we
continue to multiply the ratio by 0.9 on subsequent attempts until the
partition is successfully cleaned.
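As a rough illustration of that bookkeeping (the class and method names are invented for this sketch; the real patch keys the map by topic partition inside the cleaner):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the per-partition shrink ratio described above.
public class OverflowBackoff {
    private final Map<String, Double> ratios = new HashMap<>();

    // Record an overflow: the next clean of this partition uses 0.9x the size.
    void onOverflow(String topicPartition) {
        ratios.merge(topicPartition, 0.9, (old, x) -> old * x);
    }

    // Effective max cleaned-segment size for the next cleaning attempt.
    long effectiveMaxSize(String topicPartition, long configuredMaxSize) {
        return (long) (configuredMaxSize * ratios.getOrDefault(topicPartition, 1.0));
    }

    // A successful clean clears the recorded ratio.
    void onSuccess(String topicPartition) {
        ratios.remove(topicPartition);
    }

    public static void main(String[] args) {
        OverflowBackoff backoff = new OverflowBackoff();
        long max = 1_000_000_000L;
        backoff.onOverflow("t-0");
        System.out.println(backoff.effectiveMaxSize("t-0", max)); // 900000000
        backoff.onOverflow("t-0");
        System.out.println(backoff.effectiveMaxSize("t-0", max)); // 810000000
        System.out.println(backoff.effectiveMaxSize("t-1", max)); // 1000000000
    }
}
```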