[fix][broker] Prevent stale replicator pending reads after termination #25767

Open
Denovo1998 wants to merge 1 commit into apache:master from Denovo1998:prevent_stale_replicator_pending_reads_after_termination

@Denovo1998
Contributor

Fixes #xyz

Main Issue: #xyz

PIP: #xyz

Motivation

This is a follow-up to #25625 for the replication read-failure path related to #25097.

#25625 completes a replicator InFlightTask when a managed-ledger read fails, so retryable read failures do not leave stale pending-read state behind. However, a race remains when the read-failure callback arrives after the replicator has already left the Started state, for example during termination or a producer restart. In that case, readEntriesFailed returns before completing the failed InFlightTask, so its entries stay null and hasPendingRead() keeps treating the old read as active.

Modifications

  • Complete failed InFlightTask contexts before checking whether the replicator is still in the Started state.
  • Keep the cleanup defensive by only handling InFlightTask contexts whose entries have not already been set.
  • Remove duplicated failed-task completion from the later retry/error branches in readEntriesFailed.
  • Add a regression test that starts a real replication read, blocks the managed-ledger read failure, terminates the replicator through the normal lifecycle, releases the failure callback, and verifies the pending-read state is cleared.
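
The ordering change described above can be illustrated with a minimal, self-contained sketch. This is not the actual Pulsar code: ReplicatorSketch, its State enum, the completeBeforeStateCheck flag, startRead(), and completeAsFailed() are hypothetical stand-ins, and representing a completed failed read as an empty entries list is an assumption made only for illustration. The sketch models the scenario from the regression test: a read is issued, the replicator is terminated, and the failure callback arrives late.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-in for a replicator in-flight read task.
final class InFlightTask {
    // null means the read was issued but has not completed yet.
    private List<String> entries;

    boolean isPending() {
        return entries == null;
    }

    // Assumption: a failed read is completed by recording an empty result.
    void completeAsFailed() {
        if (entries == null) { // defensive: leave already-completed tasks alone
            entries = Collections.emptyList();
        }
    }
}

// Hypothetical model of the replicator's read-failure path.
final class ReplicatorSketch {
    enum State { STARTED, TERMINATED }

    private final List<InFlightTask> inFlightTasks = new ArrayList<>();
    private final boolean completeBeforeStateCheck;
    private State state = State.STARTED;

    ReplicatorSketch(boolean completeBeforeStateCheck) {
        this.completeBeforeStateCheck = completeBeforeStateCheck;
    }

    // Registers a new pending read and returns it as the callback context.
    InFlightTask startRead() {
        InFlightTask task = new InFlightTask();
        inFlightTasks.add(task);
        return task;
    }

    void terminate() {
        state = State.TERMINATED;
    }

    boolean hasPendingRead() {
        return inFlightTasks.stream().anyMatch(InFlightTask::isPending);
    }

    void readEntriesFailed(Object ctx) {
        if (completeBeforeStateCheck) {
            completeIfInFlightTask(ctx); // new ordering: clean up first
        }
        if (state != State.STARTED) {
            return; // old ordering left the task pending on this early return
        }
        if (!completeBeforeStateCheck) {
            completeIfInFlightTask(ctx); // old ordering: clean up only while Started
        }
        // Retry scheduling would follow here in a real implementation.
    }

    private static void completeIfInFlightTask(Object ctx) {
        if (ctx instanceof InFlightTask) {
            ((InFlightTask) ctx).completeAsFailed();
        }
    }

    public static void main(String[] args) {
        for (boolean fixed : new boolean[] {false, true}) {
            ReplicatorSketch replicator = new ReplicatorSketch(fixed);
            Object ctx = replicator.startRead();
            replicator.terminate();            // replicator leaves Started
            replicator.readEntriesFailed(ctx); // failure callback arrives late
            System.out.println("completeBeforeStateCheck=" + fixed
                    + " hasPendingRead=" + replicator.hasPendingRead());
        }
    }
}
```

With the flag set to false, the model reproduces the stale state: hasPendingRead() still returns true after termination. With the flag set to true, the late callback clears the task regardless of the replicator state.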

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment
