Skip to content

ErrConsumerLeadershipChanged does not trigger new pull request, causing consumer to stall #2063

@arkadiuss

Description

@arkadiuss

Observed behavior

When a JetStream consumer receives ErrConsumerLeadershipChanged (status code 409, "leadership change"), the client resets pending message counts but does not send a new pull request. This leaves the consumer in a state where:

  1. The connection is healthy
  2. The subscription is valid
  3. Messages are pending on the server (NumPending > 0)
  4. The consumer is not receiving any messages

The consumer remains stalled indefinitely until the application is restarted.

In our production with JetStream cluster, leadership changes occur quite often which let us to spot the issue. For us this caused a few message processing delays.

Expected behavior

When ErrConsumerLeadershipChanged is received, the client should immediately send a new pull request to resume message delivery, similar to how ErrNoHeartbeat is handled.

Current code:

if errors.Is(msgErr, ErrConsumerLeadershipChanged) {
    s.pending.msgCount = 0
    s.pending.byteCount = 0
}
return nil  // Does NOT trigger new pull request

Expected behavior (similar to ErrNoHeartbeat handling at line 394-414):

if errors.Is(msgErr, ErrConsumerLeadershipChanged) {
    s.pending.msgCount = 0
    s.pending.byteCount = 0
    // Should trigger a new pull request here
    s.fetchNext <- &pullRequest{
        Expires:       s.consumeOpts.Expires,
        Batch:         s.consumeOpts.MaxMessages,
        MaxBytes:      s.consumeOpts.MaxBytes,
        Heartbeat:     s.consumeOpts.Heartbeat,
        // ... other fields
    }
    if s.hbMonitor != nil {
        s.hbMonitor.Reset(2 * s.consumeOpts.Heartbeat)
    }
}

Server and client version

  • nats.go: v1.51.0
  • nats-server: v2.12.4, clustered JetStream, 3 replicas, file storage

Host environment

  • Linux/amd64, Go 1.26.2
  • GKE, consumers run as long-lived pods

Steps to reproduce

  1. 3-node JetStream cluster, stream + durable pull consumer (R=3).
  2. Client calls consumer.Consume(handler) with default Heartbeat.
  3. Publish a steady stream of messages.
  4. Force a leadership change: nats consumer cluster step-down STREAM CONSUMER.
  5. Observe: handler stops being called; consumer.Info().NumPending climbs; no errors on the connection.

Triggers in production: cluster scale events, partial network blips - leadership changes here happen without a client-side disconnect, so reconnection logic doesn't help.

Metadata

Metadata

Assignees

Labels

defectSuspected defect such as a bug or regression

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions