
KAFKA-19221: Propagate IOException on LogSegment#close #19607


Merged: 5 commits, Jun 10, 2025

Conversation

gaurav-narula
Contributor

@gaurav-narula gaurav-narula commented Apr 30, 2025

Log segment closure results in right-sizing the segment on disk along
with the associated index files.

This is especially important for TimeIndexes, where a failure to
right-size may eventually cause log roll failures, leading to under-replication
and log cleaner failures.

This change uses `Utils.closeAll`, which propagates exceptions, resulting
in an "unclean" shutdown. That in turn causes the broker to attempt to
recover the log segment and the index on the next startup, thereby avoiding
the failures described above.
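The behavioral difference can be sketched in plain Java. The class below is a hand-rolled approximation of the `Utils.closeAll` contract (close every resource, then rethrow the first failure with the rest attached as suppressed); it is illustrative, not Kafka's actual implementation:

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseAllSketch {
    // Approximation of the closeAll contract: every resource is closed,
    // and the first IOException is rethrown with later ones suppressed.
    // closeQuietly, by contrast, would log and swallow the failure.
    static void closeAll(Closeable... closeables) throws IOException {
        IOException first = null;
        for (Closeable c : closeables) {
            try {
                c.close();
            } catch (IOException e) {
                if (first == null) first = e;
                else first.addSuppressed(e);
            }
        }
        if (first != null) throw first;
    }

    public static void main(String[] args) {
        Closeable ok = () -> { };
        Closeable bad = () -> { throw new IOException("flush failed"); };
        try {
            closeAll(ok, bad, ok); // all three are closed before the throw
        } catch (IOException e) {
            System.out.println("propagated: " + e.getMessage());
        }
    }
}
```

Because the exception now escapes, shutdown is recorded as unclean and segment recovery runs on the next startup.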

Reviewers: Omnia Ibrahim [email protected], Jun Rao
[email protected], Chia-Ping Tsai [email protected]

@github-actions github-actions bot added triage PRs from the community core Kafka Broker storage Pull requests that target the storage module labels Apr 30, 2025
@gaurav-narula
Contributor Author

gaurav-narula commented Apr 30, 2025

CC: @soarez @OmniaGM


github-actions bot commented May 8, 2025

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

@gaurav-narula
Contributor Author

CC: @chia7712 can you please take a look?


Contributor

@OmniaGM OmniaGM left a comment

Nice improvement. This looks like a small change. @ijuma & @chia7712 can one of you please review this change?

@ijuma ijuma requested a review from junrao May 28, 2025 22:53
@github-actions github-actions bot removed needs-attention triage PRs from the community labels May 29, 2025
Contributor

@junrao junrao left a comment

@gaurav-narula : Thanks for the PR. Left a couple of comments.

@@ -57,6 +58,70 @@ public LogManagerIntegrationTest(ClusterInstance cluster) {
this.cluster = cluster;
}

@ClusterTest(types = {Type.KRAFT}, brokers = 3)
Contributor

Could we just use 1 broker in the test?

assertTrue(timeIndexFile.exists());
assertTrue(timeIndexFile.setReadOnly());

cluster.brokers().get(0).shutdown();
Contributor

Could we verify that the cleanShutdown file is not written?
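A check along the lines the reviewer suggests could look like the sketch below. It assumes the broker's clean-shutdown marker is a file named `.kafka_cleanshutdown` inside the log directory (the name used by the broker's clean-shutdown handling; the helper class and directory here are illustrative, not the test's actual code):

```java
import java.io.File;

public class CleanShutdownCheck {
    // Illustrative helper: after an unclean close, the broker should not
    // have written the clean-shutdown marker into the log directory.
    static boolean cleanShutdownMarkerExists(String logDir) {
        return new File(logDir, ".kafka_cleanshutdown").exists();
    }

    public static void main(String[] args) {
        // A directory that was never cleanly shut down has no marker.
        System.out.println(cleanShutdownMarkerExists("/tmp/no-such-log-dir"));
    }
}
```

In the integration test this would be asserted against the broker's configured log directory after the failed shutdown.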

@gaurav-narula
Contributor Author

Thanks for the review @junrao. I've addressed your comments with 2677f90 and rebased against trunk.

Member

@chia7712 chia7712 left a comment

@gaurav-narula thanks for this fix.


cluster.brokers().get(0).shutdown();

assertEquals(1, cluster.brokers().get(0).config().logDirs().size());
Member

Could you please add a variable for cluster.brokers().get(0)?

        var broker = cluster.brokers().get(0);
        broker.shutdown();

        assertEquals(1, broker.config().logDirs().size());
        String logDir = broker.config().logDirs().get(0);

Utils.closeQuietly(lazyTimeIndex, "timeIndex", LOGGER);
Utils.closeQuietly(log, "log", LOGGER);
Utils.closeQuietly(txnIndex, "txnIndex", LOGGER);
Utils.closeAll(lazyOffsetIndex, lazyTimeIndex, log, txnIndex);
Member

If LogSegment#close now throws an exception, then LogSegments#close might break without closing all segments, right?

Contributor Author

Thanks for pointing that out! I've modified LogSegments#close to ensure we close all log segments even if one of them throws, and added a test in LogSegmentsTest.

In the process, I found that some tests assumed that resources like Indexes and FileRecords may be closed multiple times. With LogSegment#close now propagating exceptions, we need to exit early if the resources have been closed before, otherwise we'd see failures due to FileChannel being closed or mmap being null. I've therefore updated AbstractIndex, TransactionIndex and FileRecords appropriately.
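The early-exit guard described above can be sketched as follows. This is a hypothetical stand-in for the pattern applied to AbstractIndex, TransactionIndex and FileRecords, not the actual Kafka classes:

```java
import java.io.Closeable;
import java.io.IOException;

public class IdempotentResource implements Closeable {
    private boolean closed = false;
    private int releaseCount = 0;

    @Override
    public void close() throws IOException {
        // Exit early if already closed: now that close() propagates
        // exceptions, a second close must not fail on an already-released
        // FileChannel or a null mmap.
        if (closed) return;
        closed = true;
        releaseCount++;
        // ... release the underlying FileChannel / unmap the index here ...
    }

    public int releaseCount() { return releaseCount; }

    public static void main(String[] args) throws IOException {
        IdempotentResource r = new IdempotentResource();
        r.close();
        r.close(); // safe: the second call is a no-op
        System.out.println(r.releaseCount()); // the resource was released once
    }
}
```

Without the guard, the second close would attempt to release resources that are already gone, which is exactly the test failure mode described above.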

@github-actions github-actions bot added clients and removed small Small PRs labels Jun 2, 2025
@gaurav-narula gaurav-narula requested a review from chia7712 June 2, 2025 01:21
Member

@chia7712 chia7712 left a comment

@gaurav-narula thanks for the updates. One small comment is left; PTAL.


doThrow(new IOException("Failure")).when(seg2).close();

try {
Member

Could you please consider using assertThrows?
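The suggested idiom replaces a try/fail/catch block with JUnit 5's `assertThrows`. A dependency-free approximation of its behavior (the real method lives in `org.junit.jupiter.api.Assertions`; this sketch only mirrors its contract):

```java
import java.io.IOException;

public class AssertThrowsSketch {
    interface Executable { void execute() throws Throwable; }

    // Minimal stand-in for JUnit 5's assertThrows: run the body, return the
    // thrown exception if it matches, and fail on no exception or the wrong type.
    static <T extends Throwable> T assertThrows(Class<T> expected, Executable body) {
        try {
            body.execute();
        } catch (Throwable t) {
            if (expected.isInstance(t)) return expected.cast(t);
            throw new AssertionError("unexpected exception type: " + t);
        }
        throw new AssertionError("expected " + expected.getSimpleName()
                + " but nothing was thrown");
    }

    public static void main(String[] args) {
        IOException e = assertThrows(IOException.class,
                () -> { throw new IOException("Failure"); });
        System.out.println(e.getMessage()); // prints "Failure"
    }
}
```

In the test itself, the mocked `seg2.close()` throwing would be wrapped as `assertThrows(IOException.class, () -> segments.close())`, keeping the assertion and the failure message in one place.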

Contributor Author

Addressed with ce22ba9

@chia7712
Member

chia7712 commented Jun 5, 2025

* What went wrong:
Execution failed for task ':storage:spotlessJavaCheck'.
> The following files had format violations:
      src/test/java/org/apache/kafka/storage/internals/log/LogSegmentsTest.java
          @@ -20,6 +20,7 @@
           import·org.apache.kafka.common.utils.Time;
           import·org.apache.kafka.common.utils.Utils;
           import·org.apache.kafka.test.TestUtils;
          +
           import·org.junit.jupiter.api.AfterEach;
           import·org.junit.jupiter.api.BeforeEach;
           import·org.junit.jupiter.api.Test;
  Run './gradlew :storage:spotlessApply' to fix these violations.

@gaurav-narula could you please fix the build error?

@gaurav-narula
Contributor Author

@chia7712 should be good now

Contributor

@junrao junrao left a comment

@gaurav-narula : Thanks for the updated PR. LGTM

@chia7712 chia7712 merged commit edd0efd into apache:trunk Jun 10, 2025
24 checks passed
chia7712 pushed a commit that referenced this pull request Jun 10, 2025
@@ -751,10 +751,7 @@ public Optional<FileRecords.TimestampAndOffset> findOffsetByTimestamp(long times
public void close() throws IOException {
if (maxTimestampAndOffsetSoFar != TimestampOffset.UNKNOWN)
Utils.swallow(LOGGER, Level.WARN, "maybeAppend", () -> timeIndex().maybeAppend(maxTimestampSoFar(), shallowOffsetOfMaxTimestampSoFar(), true));
Member

I think writing the max timestamp has a similar issue: after restarting, the last entry is assumed to be the max timestamp. Hence, the index needs to be rebuilt if the "true" max timestamp is not stored correctly.

I opened KAFKA-19428 to track it.

gaurav-narula added a commit to gaurav-narula/kafka that referenced this pull request Jun 30, 2025
chia7712 pushed a commit to chia7712/kafka that referenced this pull request Jul 4, 2025
gaurav-narula added a commit to gaurav-narula/kafka that referenced this pull request Jul 4, 2025
Labels
ci-approved clients core Kafka Broker storage Pull requests that target the storage module