v2.12.5: Consumer loss regression - why is this release not yanked? #7967
Michael, I'm sorry for the impact this had on your team. This is exactly the kind of situation we work hard to prevent, and this time we didn't.
The root cause was a narrow but real interaction: asynchronous snapshots at the meta layer, combined with asset updates, combined with a server restart within a specific time window. It's a complex path we didn't have test coverage for, and that's on us.
Once we identified it, we moved quickly. We updated Synadia Cloud and all of our managed deployments, reached out directly to self-hosted customers, and added a warning to the v2.12.5 release notes which you can see here. There is a simple flag that disables the affected behavior, which served as an effective workaround while the fix was prepared.
On the question of yanking the release, I understand why that feels like the right call, and I don't dismiss the instinct. With an ecosystem as large as NATS, pulling a release carries its own risks for the many operators already running it. Given that a clean workaround existed and notifications were out, we made the judgment call to leave it published with clear guidance. I recognize not everyone will agree with that decision.
We will always strive to do better, and appreciate the feedback.
=derek
Hi @derekcollison, it was a special case, but to get better: Regards
v2.12.5 caused us real damage in production. We lost consumers in a clustered JetStream setup, which led to data loss in payment processing workflows. It took our team significant hours of debugging and recovery at a customer site before we traced it back to this release.
Why is v2.12.5 still published? The regression is confirmed, a fix is coming — yet the release is still up for anyone to pull. A config workaround is not an acceptable response for a bug that silently drops consumers. Please yank this release immediately.
How did this ship? Consumer state in clustered stream updates is not an edge case — it's a core guarantee. If the test matrix doesn't cover this, that needs to change. A brief post-mortem would go a long way to restore trust.
We rely on NATS for business-critical infrastructure. Right now, that trust is shaken.