[KAFKA-15580] First attempt at UncleanRecoveryManager #19468

josefk31 · 2025-04-14T17:11:26Z

Basic architecture for system to manage UncleanRecovery by fetching
log information from brokers.

The idea is to improve durability by electing the leader with longest during unclean recovery instead of just picking a random unfenced broker.

There will be one additional thread - RecoverySendThread which handles NIO between controller and broker - sending requests. LogInfoResponseReceived events will be written to the controllers queue. These events don't write anything until the last response is received - meaning that all needed information is collected to make accurate determinations of log length.

Responses to requests are processed in the controllers event queue by a RecoveryManager class.
It keeps track of the # of outstanding requests and decides whether and how to retry failed requests. RecoveryManager builds up a "LogInfoStore" object with information about log length of various replicas.

Either after all expected requests are received or a timeout; the RecoveryManager will begin to run leadership elections for batches of TopicPartitions (respecting the max batch size configuration). These elections will have access to the store which contains enriched log information to assist them in making leadership decisions.

Most of the tracking is performed in a class called "StateMachine" within RecoveryManager. This is deliberate so that eventually the RecoveryManager can supervise multiple unclean election requests at the same time.

Created GetReplicaLogInfoRequest/GetReplicaLogInfoResponse
Created RecoveryManager class to handle sending and receiving
requests
Added configs to switch on unclean recovery
Added ability to elect with longest logs in PartitionChangeBuilder
Added RecoverySenderThread for sending

add GetReplicaLogInfo Request and Response add GetReplicaLogInfo to apikeys.java add GetReplicaLogInfo Request and Response fix GetReplicaLogInfoResponse add test for GetReplicaLogInfo in RequestResponseTest.java add GetReplicaLogInfoResponse add latestVersionUnstable Conflicts: clients/src/main/java/org/apache/kafka/common/protocol/ApiKeys.java clients/src/main/java/org/apache/kafka/common/requests/AbstractRequest.java clients/src/main/java/org/apache/kafka/common/requests/AbstractResponse.java clients/src/test/java/org/apache/kafka/common/requests/RequestResponseTest.java core/src/main/scala/kafka/network/RequestConvertToJson.scala cherry-pick of PR on top of origin/trunk.

Basic architecture for system to manage UncleanRecovery by fetching log information from brokers. The idea is to improve durability by electing the leader with longest during unclean recovery instead of just picking a random unfenced broker. There will be one additional thread - RecoverySendThread which handles NIO between controller and broker - sending requests. LogInfoResponseReceived events will be written to the controllers queue. These events don't write anything until the last response is received - meaning that all needed information is collected to make accurate determinations of log length. Responses to requests are processed in the controllers event queue by a RecoveryManager class. It keeps track of the # of outstanding requests and decides whether and how to retry failed requests. RecoveryManager builds up a "LogInfoStore" object with information about log length of various replicas. Either after all expected requests are received or a timeout; the RecoveryManager will begin to run leadership elections for batches of TopicPartitions (respecting the max batch size configuration). These elections will have access to the store which contains enriched log information to assist them in making leadership decisions. Most of the tracking is performed in a class called "StateMachine" within RecoveryManager. This is deliberate so that eventually the RecoveryManager can supervise multiple unclean election requests at the same time. - Created GetReplicaLogInfoRequest/GetReplicaLogInfoResponse - Created RecoveryManager class to handle sending and receiving requests - Added configs to switch on unclean recovery - Added ability to elect with longest logs in PartitionChangeBuilder - Added RecoverySenderThread for sending

github-actions · 2025-04-22T03:21:43Z

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

mannoopj and others added 2 commits April 14, 2025 12:57

github-actions bot added triage PRs from the community core Kafka Broker kraft storage Pull requests that target the storage module clients labels Apr 14, 2025

github-actions bot added the needs-attention label Apr 22, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[KAFKA-15580] First attempt at UncleanRecoveryManager #19468

[KAFKA-15580] First attempt at UncleanRecoveryManager #19468

Uh oh!

josefk31 commented Apr 14, 2025

Uh oh!

github-actions bot commented Apr 22, 2025

Uh oh!

Uh oh!

[KAFKA-15580] First attempt at UncleanRecoveryManager #19468

Are you sure you want to change the base?

[KAFKA-15580] First attempt at UncleanRecoveryManager #19468

Uh oh!

Conversation

josefk31 commented Apr 14, 2025

Uh oh!

github-actions bot commented Apr 22, 2025

Uh oh!

Uh oh!