
KAFKA-19221: Propagate IOException on LogSegment#close #19607


Merged: 5 commits, Jun 10, 2025

Conversation

gaurav-narula
Contributor

@gaurav-narula gaurav-narula commented Apr 30, 2025

Log segment closure results in right-sizing the segment on disk along
with the associated index files.

This is especially important for TimeIndexes, where a failure to
right-size may eventually cause log roll failures, leading to under-replication
and log cleaner failures.

This change uses `Utils.closeAll`, which propagates exceptions, resulting
in an "unclean" shutdown. That in turn causes the broker to attempt to
recover the log segment and the index on the next startup, thereby avoiding
the failures described above.
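The behavioral difference can be sketched in plain Java. The class below is a hand-rolled approximation of the `Utils.closeAll` contract (close every resource, then rethrow the first failure with the rest attached as suppressed); it is illustrative, not Kafka's actual implementation:

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseAllSketch {
    // Approximation of the closeAll contract: every resource is closed,
    // and the first IOException is rethrown with later ones suppressed.
    // closeQuietly, by contrast, would log and swallow the failure.
    static void closeAll(Closeable... closeables) throws IOException {
        IOException first = null;
        for (Closeable c : closeables) {
            try {
                c.close();
            } catch (IOException e) {
                if (first == null) first = e;
                else first.addSuppressed(e);
            }
        }
        if (first != null) throw first;
    }

    public static void main(String[] args) {
        Closeable ok = () -> { };
        Closeable bad = () -> { throw new IOException("flush failed"); };
        try {
            closeAll(ok, bad, ok); // all three are closed before the throw
        } catch (IOException e) {
            System.out.println("propagated: " + e.getMessage());
        }
    }
}
```

Because the exception now escapes, shutdown is recorded as unclean and segment recovery runs on the next startup.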

Reviewers: Omnia Ibrahim [email protected], Jun Rao
[email protected], Chia-Ping Tsai [email protected]

@github-actions github-actions bot added triage PRs from the community core Kafka Broker storage Pull requests that target the storage module labels Apr 30, 2025
@gaurav-narula
Contributor Author

gaurav-narula commented Apr 30, 2025

CC: @soarez @OmniaGM


github-actions bot commented May 8, 2025

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

@gaurav-narula
Contributor Author

CC: @chia7712 can you please take a look?


Contributor

@OmniaGM OmniaGM left a comment

Nice improvement. This looks like a small change. @ijuma & @chia7712 can one of you please review this change?

@ijuma ijuma requested a review from junrao May 28, 2025 22:53
@github-actions github-actions bot removed needs-attention triage PRs from the community labels May 29, 2025
Contributor

@junrao junrao left a comment

@gaurav-narula : Thanks for the PR. Left a couple of comments.

@@ -57,6 +58,70 @@ public LogManagerIntegrationTest(ClusterInstance cluster) {
this.cluster = cluster;
}

@ClusterTest(types = {Type.KRAFT}, brokers = 3)
Contributor

Could we just use 1 broker in the test?

assertTrue(timeIndexFile.exists());
assertTrue(timeIndexFile.setReadOnly());

cluster.brokers().get(0).shutdown();
Contributor

Could we verify that the cleanShutdown file is not written?
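A check along the lines the reviewer suggests could look like the sketch below. It assumes the broker's clean-shutdown marker is a file named `.kafka_cleanshutdown` inside the log directory (the name used by the broker's clean-shutdown handling; the helper class and directory here are illustrative, not the test's actual code):

```java
import java.io.File;

public class CleanShutdownCheck {
    // Illustrative helper: after an unclean close, the broker should not
    // have written the clean-shutdown marker into the log directory.
    static boolean cleanShutdownMarkerExists(String logDir) {
        return new File(logDir, ".kafka_cleanshutdown").exists();
    }

    public static void main(String[] args) {
        // A directory that was never cleanly shut down has no marker.
        System.out.println(cleanShutdownMarkerExists("/tmp/no-such-log-dir"));
    }
}
```

In the integration test this would be asserted against the broker's configured log directory after the failed shutdown.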

@gaurav-narula
Contributor Author

Thanks for the review @junrao. I've addressed your comments with 2677f90 and rebased against trunk.

Member

@chia7712 chia7712 left a comment

@gaurav-narula thanks for this fix.


cluster.brokers().get(0).shutdown();

assertEquals(1, cluster.brokers().get(0).config().logDirs().size());
Member

Could you please add a variable for cluster.brokers().get(0)?

        var broker = cluster.brokers().get(0);
        broker.shutdown();

        assertEquals(1, broker.config().logDirs().size());
        String logDir = broker.config().logDirs().get(0);

Utils.closeQuietly(lazyTimeIndex, "timeIndex", LOGGER);
Utils.closeQuietly(log, "log", LOGGER);
Utils.closeQuietly(txnIndex, "txnIndex", LOGGER);
Utils.closeAll(lazyOffsetIndex, lazyTimeIndex, log, txnIndex);
Member

If LogSegment#close now throws an exception, then LogSegments#close might break without closing all segments, right?

Contributor Author

Thanks for pointing that out! I've modified LogSegments#close to ensure we close all log segments even if one of them throws, and added a test in LogSegmentsTest.

In the process, I found that some tests assumed that resources like Indexes and FileRecords may be closed multiple times. With LogSegment#close now propagating exceptions, we need to exit early if the resources have been closed before, otherwise we'd see failures due to FileChannel being closed or mmap being null. I've therefore updated AbstractIndex, TransactionIndex and FileRecords appropriately.
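The early-exit guard described above can be sketched as follows. This is a hypothetical stand-in for the pattern applied to AbstractIndex, TransactionIndex and FileRecords, not the actual Kafka classes:

```java
import java.io.Closeable;
import java.io.IOException;

public class IdempotentResource implements Closeable {
    private boolean closed = false;
    private int releaseCount = 0;

    @Override
    public void close() throws IOException {
        // Exit early if already closed: now that close() propagates
        // exceptions, a second close must not fail on an already-released
        // FileChannel or a null mmap.
        if (closed) return;
        closed = true;
        releaseCount++;
        // ... release the underlying FileChannel / unmap the index here ...
    }

    public int releaseCount() { return releaseCount; }

    public static void main(String[] args) throws IOException {
        IdempotentResource r = new IdempotentResource();
        r.close();
        r.close(); // safe: the second call is a no-op
        System.out.println(r.releaseCount()); // the resource was released once
    }
}
```

Without the guard, the second close would attempt to release resources that are already gone, which is exactly the test failure mode described above.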

@github-actions github-actions bot added clients and removed small Small PRs labels Jun 2, 2025
@gaurav-narula gaurav-narula requested a review from chia7712 June 2, 2025 01:21
Member

@chia7712 chia7712 left a comment

@gaurav-narula thanks for the updates. One small comment is left; PTAL.


doThrow(new IOException("Failure")).when(seg2).close();

try {
Member

Could you please consider using assertThrows?
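The suggested idiom replaces a try/fail/catch block with JUnit 5's `assertThrows`. A dependency-free approximation of its behavior (the real method lives in `org.junit.jupiter.api.Assertions`; this sketch only mirrors its contract):

```java
import java.io.IOException;

public class AssertThrowsSketch {
    interface Executable { void execute() throws Throwable; }

    // Minimal stand-in for JUnit 5's assertThrows: run the body, return the
    // thrown exception if it matches, and fail on no exception or the wrong type.
    static <T extends Throwable> T assertThrows(Class<T> expected, Executable body) {
        try {
            body.execute();
        } catch (Throwable t) {
            if (expected.isInstance(t)) return expected.cast(t);
            throw new AssertionError("unexpected exception type: " + t);
        }
        throw new AssertionError("expected " + expected.getSimpleName()
                + " but nothing was thrown");
    }

    public static void main(String[] args) {
        IOException e = assertThrows(IOException.class,
                () -> { throw new IOException("Failure"); });
        System.out.println(e.getMessage()); // prints "Failure"
    }
}
```

In the test itself, the mocked `seg2.close()` throwing would be wrapped as `assertThrows(IOException.class, () -> segments.close())`, keeping the assertion and the failure message in one place.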

Contributor Author

Addressed with ce22ba9

@chia7712
Member

chia7712 commented Jun 5, 2025

* What went wrong:
Execution failed for task ':storage:spotlessJavaCheck'.
> The following files had format violations:
      src/test/java/org/apache/kafka/storage/internals/log/LogSegmentsTest.java
          @@ -20,6 +20,7 @@
           import·org.apache.kafka.common.utils.Time;
           import·org.apache.kafka.common.utils.Utils;
           import·org.apache.kafka.test.TestUtils;
          +
           import·org.junit.jupiter.api.AfterEach;
           import·org.junit.jupiter.api.BeforeEach;
           import·org.junit.jupiter.api.Test;
  Run './gradlew :storage:spotlessApply' to fix these violations.

@gaurav-narula could you please fix the build error?

@gaurav-narula
Contributor Author

@chia7712 should be good now

Contributor

@junrao junrao left a comment

@gaurav-narula : Thanks for the updated PR. LGTM

@chia7712 chia7712 merged commit edd0efd into apache:trunk Jun 10, 2025
24 checks passed
chia7712 pushed a commit that referenced this pull request Jun 10, 2025
@@ -751,10 +751,7 @@ public Optional<FileRecords.TimestampAndOffset> findOffsetByTimestamp(long times
public void close() throws IOException {
if (maxTimestampAndOffsetSoFar != TimestampOffset.UNKNOWN)
Utils.swallow(LOGGER, Level.WARN, "maybeAppend", () -> timeIndex().maybeAppend(maxTimestampSoFar(), shallowOffsetOfMaxTimestampSoFar(), true));
Member

I think writing the max timestamp has a similar issue: after restarting, the last entry is assumed to be the max timestamp. Hence, the index needs to be rebuilt if the "true" max timestamp is not stored correctly.

I opened KAFKA-19428 to track it.

gaurav-narula added a commit to gaurav-narula/kafka that referenced this pull request Jun 30, 2025
chia7712 pushed a commit to chia7712/kafka that referenced this pull request Jul 4, 2025
gaurav-narula added a commit to gaurav-narula/kafka that referenced this pull request Jul 4, 2025
Labels
ci-approved clients core Kafka Broker storage Pull requests that target the storage module