KAFKA-15599: Move KafkaMetadataLog to the raft module #19762
Conversation
There are some concerns regarding this change, though we haven't made…

Where has this been discussed?

In KIP-595 we wrote:

We are only trying to pave the way for Kafka's normal partition replication instead of making it SDK exportable. I may have thought too much about it, but the idea behind it still deserves consideration; for example, etcd-raft (https://github.com/etcd-io/raft) also follows a similar principle of not including a concrete log implementation.

I'm not sure I interpret that statement from KIP-595 as an intent to keep the…
```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

public class KafkaRaftLog implements ReplicatedLog {
```
The JIRA ticket also talked about renaming ReplicatedLog to RaftLog. Do we want to do that separately, or do we think it's not a good idea? cc @jsancio
I think RaftLog is a nicer name. I can do it in this PR.
Done
dengziming left a comment
Thank you for this PR, I left some comments, PTAL.
```java
            deleted = deletedSegments != 0 || !forgottenSnapshots.isEmpty();
        }
    }
    removeSnapshots(forgottenSnapshots, reason);
```
Why did we move this inside the synchronized block?
You're right, this can be moved outside the synchronized block.
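A rough sketch of the suggested shape, collecting the forgotten snapshots under the lock but deleting the files outside it (the forgottenSnapshotsBelow() helper is hypothetical, just to make the fragment read):

```java
TreeMap<OffsetAndEpoch, Optional<FileRawSnapshotReader>> forgottenSnapshots;
boolean deleted;
synchronized (snapshots) {
    // Update shared state while holding the lock...
    forgottenSnapshots = forgottenSnapshotsBelow(logStartSnapshotId); // hypothetical helper
    deleted = deletedSegments != 0 || !forgottenSnapshots.isEmpty();
}
// ...but perform the (potentially slow) snapshot file deletion outside of it.
removeSnapshots(forgottenSnapshots, reason);
```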
```java
        )
    );
} catch (IOException ioe) {
    throw new UncheckedIOException(ioe);
```
Maybe it's a newcomer's question: is UncheckedIOException better than IOException?
UncheckedIOException is unchecked while IOException is checked. However, it's risky to make this change: you have to verify that every caller handles the new exception, and the compiler will not help you. The way we did it for other classes was to not change exception types during the conversion from Scala to Java, and to file a separate JIRA for investigating this.
The reason I used UncheckedIOException is that none of the method definitions in RaftLog are marked as throwing exceptions, and none of the calling logic, mostly in KafkaRaftClient, handles checked exceptions.
So any exception thrown will be passed back up the call stack to KafkaRaftClientDriver.doWork(), which catches Throwable.
Perhaps it's better to wrap this in KafkaException to distinguish it from other UncheckedIOExceptions? Ditto in a few other places.
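For illustration, a minimal self-contained sketch contrasting the two wrapping styles (the helper is hypothetical, not code from the PR):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class IoWrappingSketch {
    // Hypothetical helper showing the two wrapping options discussed above.
    static long sizeOrThrow(Path path) {
        try {
            return Files.size(path);
        } catch (IOException ioe) {
            // Convention used elsewhere in the raft module:
            throw new UncheckedIOException(ioe);
            // Alternative suggested here, easier for callers to tell apart:
            // throw new org.apache.kafka.common.KafkaException(ioe);
        }
    }
}
```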
```java
@Override
public Optional<OffsetAndEpoch> earliestSnapshotId() {
    synchronized (snapshots) {
        try {
```
ditto
Updated to match latestSnapshotId()
```java
try {
    OffsetAndEpoch epoch = snapshots.lastKey();
    return epoch == null ? Optional.empty() : Optional.of(epoch);
} catch (NoSuchElementException e) {
```
How about changing this NoSuchElementException-catching code to an if (snapshots.isEmpty()) check?
Good idea
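A minimal sketch of the rewrite for the method in the diff above, assuming snapshots is the TreeMap guarded by the lock (note that TreeMap.lastKey() throws NoSuchElementException rather than returning null on an empty map, so the null check above was dead code anyway):

```java
synchronized (snapshots) {
    return snapshots.isEmpty() ? Optional.empty() : Optional.of(snapshots.lastKey());
}
```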
```java
for (OffsetAndEpoch key : snapshots.keySet()) {
    Optional<RawSnapshotReader> snapshotReader = readSnapshot(key);
    snapshotReader.ifPresent(fileRawSnapshotReader -> {
        snapshots.put(key, Optional.of((FileRawSnapshotReader) snapshotReader.get()));
```
Why should we put it here?
Good catch, it shouldn't be there. The call to readSnapshot() will have opened the reader and updated snapshots already.
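Presumably the loop then reduces to forcing each reader open and letting readSnapshot() do the caching:

```java
for (OffsetAndEpoch key : snapshots.keySet()) {
    // readSnapshot() opens the reader and stores it in the snapshots map as a side effect.
    readSnapshot(key);
}
```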
```java
        int nodeId) throws IOException {
    Properties props = new Properties();
    props.setProperty(TopicConfig.MAX_MESSAGE_BYTES_CONFIG, String.valueOf(config.maxBatchSizeInBytes()));
    props.setProperty(TopicConfig.SEGMENT_BYTES_CONFIG, String.valueOf(config.logSegmentBytes()));
```
Just a reminder — we recently introduced some changes in KafkaMetadataLog (see bcda92b#diff-b332f85b04775c821226b6f704e91d51f9647f29ba73dace65b99cf36f6b9cea)
Thanks for the heads-up!
dengziming left a comment
All lgtm.
junrao left a comment
@mimaison : Thanks for the PR. Left a few comments.
```java
this.snapshots = snapshots;
this.topicPartition = topicPartition;
this.config = config;
this.logIdent = "[MetadataLog partition=" + topicPartition + ", nodeId=" + nodeId + "] ";
```
Should we change MetadataLog to RaftLog?
```java
} else {
    props.setProperty(TopicConfig.SEGMENT_BYTES_CONFIG, String.valueOf(config.logSegmentBytes()));
}
props.setProperty(TopicConfig.FILE_DELETE_DELAY_MS_CONFIG, String.valueOf(ServerLogConfigs.LOG_DELETE_DELAY_MS_DEFAULT));
```
It seems that we are missing the following.

```scala
props.setProperty(TopicConfig.SEGMENT_MS_CONFIG, config.logSegmentMillis.toString)
```
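Presumably the Java equivalent in the new class would be something like this (the logSegmentMillis() accessor name is assumed from the Scala original):

```java
props.setProperty(TopicConfig.SEGMENT_MS_CONFIG, String.valueOf(config.logSegmentMillis()));
```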
```java
        nodeId
    );

    // When recovering, truncate fully if the latest snapshot is after the log end offset. This can happen to a follower
```
Why was the following validation removed?

```scala
if (defaultLogConfig.segmentSize() < config.logSegmentBytes()) {
  metadataLog.error(s"Overriding ${MetadataLogConfig.INTERNAL_METADATA_LOG_SEGMENT_BYTES_CONFIG} is only supported for testing. Setting " +
    s"this value too low may lead to an inability to write batches of metadata records.")
}
```
```java
for (SnapshotPath snapshotPath : snapshotsToDelete) {
    Files.deleteIfExists(snapshotPath.path());
    LOG.info("Deleted unneeded snapshot file with path {}", snapshotPath);
```
Should we add the partition-level ident?
```java
    String reason(OffsetAndEpoch snapshotId);
}

static class RetentionMsBreach implements SnapshotDeletionReason {
```
Should we change it to a record? Ditto for RetentionSizeBreach.
I changed all the SnapshotDeletionReason implementations to records.
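For illustration, a record removes the field and constructor boilerplate; a hedged sketch in which the field and message are made up, not the PR's actual code:

```java
record RetentionMsBreach(long retentionMs) implements SnapshotDeletionReason {
    @Override
    public String reason(OffsetAndEpoch snapshotId) {
        return "Deleting snapshot " + snapshotId + " because it breached retention.ms=" + retentionMs;
    }
}
```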
```java
@Test
public void testConfig() throws IOException {
    Properties props = new Properties();
    props.put("process.roles", "broker");
```
Should we move KRaftConfigs to the raft module so that we could use the defined config names?
```java
mockTime.sleep(config.internalDeleteDelayMillis());
// Assert that the log dir doesn't contain any older snapshots
Files
    .walk(logDir, 1)
```
Could this be merged with the previous line?
```java
OffsetAndEpoch snapshotId1 = new OffsetAndEpoch(1000, 1);
RawSnapshotWriter snapshot = log.createNewSnapshotUnchecked(snapshotId1).get();
append(snapshot, 500);
snapshot.freeze();
```
Should we close the snapshot as before? Ditto below.
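For example, with try-with-resources (assuming RawSnapshotWriter extends AutoCloseable, which the question about closing suggests):

```java
try (RawSnapshotWriter snapshot = log.createNewSnapshotUnchecked(snapshotId1).get()) {
    append(snapshot, 500);
    snapshot.freeze();
}
```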
```java
}

@Test
public void testSegmentsLessThanLatestSnapshot() throws IOException {
```
Why did we skip testSegmentMsConfigIsSetInMetadataLog() in the old test?
This test was added after I opened the PR. I must have missed it while rebasing.
```java
 * @param log The log
 * @return true if the topic partition should not exist on the broker, false otherwise.
 */
public static boolean isStrayReplica(TopicsImage newTopicsImage, int brokerId, Optional<Uuid> topicId, int partitionId, String log) {
```
Should this be put in metadata.util?
The reasons I put it here are that:
- it's a single method related to TopicsImage
- the TopicsImageTest class has utility methods that helped simplify the test logic
Why make this static if the first argument is TopicsImage? This looks a lot like an instance method where the `this` reference is the TopicsImage.
It was static mostly for historical reasons, as I just copied it from LogManager. But you're right: since the first argument is a TopicsImage, the new method I added is no longer static.
junrao left a comment
@mimaison : Thanks for the updated PR. A few more comments. Also, there are compilation errors.
```java
    FileRawSnapshotReader fileReader = FileRawSnapshotReader.open(log.dir().toPath(), snapshotId);
    snapshots.put(snapshotId, Optional.of(fileReader));
    return Optional.of(fileReader);
} catch (UncheckedIOException e) {
```
The following path is only for NoSuchFileException. For other IOExceptions, we should propagate to the caller.
Which method do you see throwing NoSuchFileException?
According to https://stackoverflow.com/questions/63638250/nio-channels-filechannel-open-threw-nosuchfileexception, FileChannel.open() could throw NoSuchFileException.
Ah right, going down the whole call stack, I see that UnixFileSystemProvider uses UnixException.translateToIOException(), which maps ENOENT to NoSuchFileException.
I wonder if the existing code is broken: FileRawSnapshotReader.open() wraps the IOException into an UncheckedIOException, so the current catch block should never be entered.
I made the change and added a test case checking NoSuchFileException. However, I can't find a way to reliably force another type of IOException to check that those are thrown.
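A sketch of the resulting handling, with names taken from the diff above (java.nio.file.NoSuchFileException is a subclass of IOException, so it can appear as the cause of an UncheckedIOException):

```java
try {
    FileRawSnapshotReader fileReader = FileRawSnapshotReader.open(log.dir().toPath(), snapshotId);
    snapshots.put(snapshotId, Optional.of(fileReader));
    return Optional.of(fileReader);
} catch (UncheckedIOException e) {
    if (e.getCause() instanceof NoSuchFileException) {
        // The snapshot file does not exist, so there is no reader to return.
        return Optional.empty();
    }
    // Propagate any other I/O failure to the caller.
    throw e;
}
```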
```java
        )
    );
} catch (IOException ioe) {
    throw new KafkaException(ioe);
```
I noticed that existing code like Snapshots and FileSnapshotReader converts IOException to UncheckedIOException. It's probably better to follow that convention for now.
Yes, the whole raft module pretty much only uses UncheckedIOException.
```java
public void testIsStrayReplicaWithEmptyImage() {
    TopicsImage image = topicsImage(List.of());
    List<TopicIdPartition> onDisk = List.of(FOO_0, FOO_1, BAR_0, BAR_1, BAZ_0);
    assertTrue(onDisk.stream().allMatch(log ->
```
It seems that this hasn't been addressed?
```java
mockTime.sleep(config.internalDeleteDelayMillis());
// Assert that the log dir doesn't contain any older snapshots
Files.walk(logDir, 1)
    .map(Snapshots::parse)
```
This indentation is different from the one in line 628.
```java
// Simulate log cleanup that advances the LSO
log.log().maybeIncrementLogStartOffset(snapshotId.offset() - 1, LogStartOffsetIncrementReason.SegmentDeletion);

assertEquals(Optional.empty(), log.createNewSnapshot(new OffsetAndEpoch(snapshotId.offset() - 2, snapshotId.epoch())));
```
Should we close the created snapshot?
We expect createNewSnapshot() to return Optional.empty(), so there should be nothing to close.
```java
log.updateHighWatermark(new LogOffsetMetadata(numberOfRecords));
createNewSnapshot(log, snapshotId);

assertEquals(Optional.empty(), log.createNewSnapshot(snapshotId),
```
Should we close the created snapshot?
We expect createNewSnapshot() to return Optional.empty(), so there should be nothing to close.
junrao left a comment
@mimaison : Thanks for the updated PR. Just a comment based on Jose's feedback. Also, could you resolve the conflicts?
jsancio left a comment
Thanks for the changes. I still need to review the new KafkaRaftLog files, but here are some early comments.
I am thinking that we should make this internal by moving the implementation to o.a.k.r.internals.KafkaRaftLog.java. Same for the accompanying test suite file.
Moved
```xml
<allow class="org.apache.kafka.common.compress.Compression" exact-match="true" />
<allow pkg="org.apache.kafka.common.config" />
<allow pkg="org.apache.kafka.common.feature" />
<allow pkg="org.apache.kafka.common.internals" />
```
Note that to me this means the common module is not organized correctly if the raft module needs types in the "internals" namespace. The same comment applies to the change below, which also includes o.a.k.s.internals.log.
internals simply means it should not be used outside the Kafka repo. This is more explicit than the uncommon "check the javadoc" approach for determining whether a class is public or internal.
```java
if (topicId.isEmpty()) {
    // Missing topic ID could result from storage failure or unclean shutdown after topic creation but before flushing
    // data to the `partition.metadata` file. And before appending data to the log, the `partition.metadata` is always
    // flushed to disk. So if the topic ID is missing, it mostly means no data was appended, and we can treat this as
    // a stray log.
    LOG.info("The topicId does not exist in {}, treat it as a stray log.", log);
    return true;
}
```
How about letting the log manager make this decision? In Kafka 4.0 all topics must have a topic id.
```java
PartitionRegistration partition = newTopicsImage.getPartition(topicId.get(), partitionId);
if (partition == null) {
    LOG.info("Found stray log dir {}: the topicId {} does not exist in the metadata image.", log, topicId);
    return true;
} else {
    List<Integer> replicas = Arrays.stream(partition.replicas).boxed().toList();
    if (!replicas.contains(brokerId)) {
        LOG.info("Found stray log dir {}: the current replica assignment {} does not contain the local brokerId {}.",
            log, replicas.stream().map(String::valueOf).collect(Collectors.joining(", ", "[", "]")), brokerId);
        return true;
    } else {
        return false;
    }
}
}
```
To me this should be a functionality of the log manager. Maybe the TopicsImage method should just return all of the topic ids that don't exist in the given image and broker.
This would allow you to remove the added static logger.
```diff
 logManager.startup(
   metadataCache.getAllTopics().asScala,
-  isStray = log => JLogManager.isStrayKraftReplica(brokerId, newImage.topics(), log)
+  isStray = log => TopicsImage.isStrayReplica(newImage.topics(), brokerId, log.topicId(), log.topicPartition().partition(), log.toString)
```
Minor, but to me stray partitions are in the log manager, not in the topics image; the log manager has partition entries that are not in the latest topics image.
In some sense the log manager understands the topics image and makes sure that they match. The topics image doesn't know anything about "stray partitions" or the log manager.
If you still want to move the functionality to TopicsImage, maybe make it a method (not static) like Stream<TopicIdPartition> deletedPartitionsForReplica(int brokerId, Stream<TopicIdPartition>).
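A hedged sketch of that suggestion as an instance method on TopicsImage (the body is illustrative; the TopicIdPartition accessors and the replica check are assumptions based on the diffs above):

```java
// Filter the on-disk partitions down to those this broker should no longer host.
public Stream<TopicIdPartition> deletedPartitionsForReplica(int brokerId, Stream<TopicIdPartition> onDisk) {
    return onDisk.filter(tip -> {
        PartitionRegistration partition = getPartition(tip.topicId(), tip.partition());
        return partition == null || Arrays.stream(partition.replicas).noneMatch(replica -> replica == brokerId);
    });
}
```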
I introduced a more generic method in TopicsImage called partitionReplicas(), and moved the logic specific to stray partitions back to LogManager.
@jsancio Can you take another look? Thanks
junrao left a comment
@mimaison : Thanks for the updated PR. A few more minor comments.
```java
if (reader == null) {
    return Optional.empty();
} else if (reader.isPresent()) {
    return Optional.of(reader.get());
```
We can just return reader here?
I'm rewrapping it here as the generic types do not match: reader is an Optional<FileRawSnapshotReader> while we need to return an Optional<RawSnapshotReader>.
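Since Java generics are invariant, an Optional<FileRawSnapshotReader> is not assignable to Optional<RawSnapshotReader> even though the element type is; an alternative to unwrapping and rewrapping is an upcast inside map():

```java
// Upcast each element; Optional.map re-wraps with the wider type parameter.
return reader.map(r -> (RawSnapshotReader) r);
```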
```diff
   metadataCache.getAllTopics().asScala,
-  isStray = log => JLogManager.isStrayKraftReplica(brokerId, newImage.topics(), log)
+  isStray = log => {
+    if (log.topicId().isEmpty) {
```
Could we keep this part in JLogManager.isStrayReplica()?
The reason I kept this here is that I wanted to keep the else block, since it uses newImage, which we're trying not to use in JLogManager.
```diff
-    // flushed to disk. So if the topic ID is missing, it mostly means no data was appended, and we can treat this as
-    // a stray log.
-    LOG.info("The topicId does not exist in {}, treat it as a stray log.", log);
+public static boolean isStrayReplica(List<Integer> replicas, int brokerId, Uuid topicId, UnifiedLog log) {
```
There is no need to pass in topicId since it comes from log.
Good call. Fixed
Rebased on trunk to resolve conflicts
- Rename KafkaMetadataLog to KafkaRaftLog
- Make raft depend on storage
- Remove the storage dependency on metadata, as this would otherwise create a cyclic dependency