@tilakraj94 tilakraj94 commented Jan 2, 2026

This PR addresses scenarios where the stream state and consumer state diverge. Such discrepancies can occur in real-world situations, for example after a server or VM crash. While we don’t currently have deterministic steps to reproduce this issue, it’s reasonable to expect cases where a consumer’s sequence ends up ahead of the stream’s sequence.

In customer environments, especially where teams cannot directly interact with or manually intervene in systems, the system should be able to recover automatically. In these situations, the consumer state itself is not corrupted but may reference an invalid sequence. This change allows the consumer to fall back to the correct stream sequence so it can continue functioning, rather than getting stuck and stopping message consumption.

This PR is open for discussion, and feedback is welcome on whether these changes are appropriate, should be adjusted, or belong elsewhere.

Signed-off-by: Tilak Raj [email protected]

@tilakraj94 tilakraj94 requested a review from a team as a code owner January 2, 2026 14:08
@tilakraj94 tilakraj94 changed the title from "reset consumer state to stream state" to "reset invalid consumer state to stream state on startup" Jan 2, 2026
@MauriceVanVeen
Member

Is this with sync: always configured or not, given a file-based non-replicated (R1) stream?

If it's not configured, it could be that the stream and consumer were properly in sync, a VM crash followed, and some of the stream's messages at the tail were reverted while the consumer state remained ahead.

Fully resetting the consumer state to a recalculated starting sequence is not the correct thing to do (it might only work for a WorkQueue, but I don't think it's fully correct there either). For example, imagine a stream with 100 messages, all of which have been sent to the consumer and are pending, so the next sequence is o.sseq=101. Now assume the stream reverts to containing only 40 messages. Due to the full reset, ALL messages that are available in the stream (the first 40) will be redelivered, even if they were previously acknowledged.
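
To put numbers on this scenario, here is a small Go sketch. It is an illustration only: the function name and the ack floor of 30 used in the first call are hypothetical and are not taken from the server code or from the report. It counts how many already-acknowledged messages a full reset would deliver again.

```go
package main

import "fmt"

// redeliveredAfterFullReset returns how many previously acknowledged messages
// would be delivered again if the consumer restarted from firstSeq after the
// stream was cut back to lastSeq. ackFloor is the highest stream sequence up
// to which everything had been acknowledged. (Hypothetical helper.)
func redeliveredAfterFullReset(ackFloor, firstSeq, lastSeq uint64) uint64 {
	// Only acknowledged messages that still exist in the stream can be redelivered.
	hi := ackFloor
	if lastSeq < hi {
		hi = lastSeq
	}
	if hi < firstSeq {
		return 0
	}
	return hi - firstSeq + 1
}

func main() {
	// Stream held 1..100 and reverted to 1..40; assumed ack floor of 30.
	fmt.Println(redeliveredAfterFullReset(30, 1, 40)) // prints 30
	// Everything up to 100 was acknowledged before the revert.
	fmt.Println(redeliveredAfterFullReset(100, 1, 40)) // prints 40
}
```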

Instead, the state should be looked at and compared to the highest stream sequence:

  • should reset all state to the stream sequence if the ack floor is higher; the ack floor should then equal the stream sequence (only in this case can the delivery sequences, o.pending, and o.rdc be reset fully)
  • should remove any entries in o.pending and o.rdc that are above the stream sequence
  • the starting sequence (o.sseq) and delivery sequence (o.dseq) need to be updated:
    • to equal the ack floor if o.pending is empty
    • otherwise, o.sseq = found+1, where found is the highest sequence in o.pending (the map key is the sequence)
    • same thing for o.dseq, but using the p.Sequence stored in the o.pending map
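
The rules above could be sketched roughly as follows. This is not the server's code: ConsumerState, Pending, and reconcile are simplified, hypothetical stand-ins for the fields named in the list (o.sseq, o.dseq, o.pending, o.rdc). Two readings are assumptions of the sketch: "equal the ack floor" is taken to mean "continue right after the ack floor" (floor+1), since o.sseq is described as the next sequence, and in the full-reset case the delivery sequence simply continues after the delivery-side ack floor.

```go
package main

import "fmt"

// Pending is a hypothetical pending entry; Sequence is the delivery sequence.
type Pending struct {
	Sequence uint64
}

// ConsumerState is a simplified, hypothetical model of the consumer state.
type ConsumerState struct {
	AckFloorStream   uint64              // highest acked stream sequence (floor)
	AckFloorDelivery uint64              // highest acked delivery sequence (floor)
	Sseq             uint64              // next stream sequence to deliver
	Dseq             uint64              // next delivery sequence to assign
	Pending          map[uint64]*Pending // keyed by stream sequence
	Rdc              map[uint64]uint64   // redelivery counts, keyed by stream sequence
}

// reconcile adjusts s so that it no longer references sequences above lastSeq,
// the highest sequence actually present in the stream.
func reconcile(s *ConsumerState, lastSeq uint64) {
	// Rule 1: ack floor ahead of the stream, so pull it back and drop all tracking.
	if s.AckFloorStream > lastSeq {
		s.AckFloorStream = lastSeq
		s.Pending = map[uint64]*Pending{}
		s.Rdc = map[uint64]uint64{}
	}
	// Rule 2: drop entries that point past the end of the stream.
	for seq := range s.Pending {
		if seq > lastSeq {
			delete(s.Pending, seq)
		}
	}
	for seq := range s.Rdc {
		if seq > lastSeq {
			delete(s.Rdc, seq)
		}
	}
	// Rule 3: recompute the next stream and delivery sequences.
	if len(s.Pending) == 0 {
		s.Sseq = s.AckFloorStream + 1   // assumption: "next" is floor+1
		s.Dseq = s.AckFloorDelivery + 1 // assumption: same on the delivery side
		return
	}
	var maxStream, maxDelivery uint64
	for seq, p := range s.Pending {
		if seq > maxStream {
			maxStream = seq
		}
		if p.Sequence > maxDelivery {
			maxDelivery = p.Sequence
		}
	}
	s.Sseq = maxStream + 1
	s.Dseq = maxDelivery + 1
}

func main() {
	// 100 messages delivered and pending, nothing acked; stream reverts to 40.
	s := &ConsumerState{Pending: map[uint64]*Pending{}, Rdc: map[uint64]uint64{}}
	for i := uint64(1); i <= 100; i++ {
		s.Pending[i] = &Pending{Sequence: i}
	}
	s.Sseq, s.Dseq = 101, 101
	reconcile(s, 40)
	fmt.Printf("sseq=%d dseq=%d pending=%d\n", s.Sseq, s.Dseq, len(s.Pending))
	// prints: sseq=41 dseq=41 pending=40
}
```

With this shape, the first 40 messages stay pending rather than being treated as never delivered, and a newly published message at sequence 41 is the next one the consumer would pick up.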

I'd recommend starting with a test that only uses normal JetStream APIs to get the system into a state like the one described above, then shut down the server, remove index.db, truncate a .blk file, and restart the server. Publishing a new message should then be accepted, and the message should be consumable by the consumer.

@MauriceVanVeen
Member

I'd also recommend opening an issue and discussing there first before opening a PR like this. Otherwise you're spending time on code that might not be suitable, and we need multiple code review cycles instead of discussing the desired approach first.

See also the contributing guide: CONTRIBUTING.md

@tilakraj94
Author

Hi @MauriceVanVeen, thank you for your patience and time. I opened this PR primarily for discussion purposes. I initially wanted to create an issue or proposal in the repository, but wasn’t able to do that at the same time. I agree that we should first discuss the approach and then move on to implementation.

To that end, I’ve opened a proposal here #7693, and we can use it to discuss the approach or any other related points.
