@sonnes commented Dec 4, 2025

Fix "Local: Erroneous state" panic during Kafka rebalancing

Problem

When a Kafka rebalance occurs during batch processing (especially with concurrency > 1), goroutines attempt to commit offsets for partitions they no longer own. librdkafka returns "Local: Erroneous state" errors for those commits, and the errors bubble up as panics.

┌─────────────────────────────────────────────────────────────────────────┐
│                        REBALANCE RACE CONDITION                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Pod A              Pod B              Pod C (new)                     │
│   ┌─────┐            ┌─────┐            ┌─────┐                         │
│   │ P0  │            │ P1  │            │     │  ← Spawns               │
│   └──┬──┘            └──┬──┘            └──┬──┘                         │
│      │                  │                  │                            │
│      ▼                  ▼                  │                            │
│  ┌────────┐        ┌────────┐              │                            │
│  │ Long   │        │ Long   │              │                            │
│  │ Batch  │        │ Batch  │              ▼                            │
│  │Process │        │Process │     ┌─────────────────┐                   │
│  └───┬────┘        └───┬────┘     │   REBALANCE     │                   │
│      │                 │          │   TRIGGERED     │                   │
│      │ ◄───────────────┼──────────┤                 │                   │
│      │    Partitions   │          │ P0, P1 REVOKED! │                   │
│      │    Revoked      │          └─────────────────┘                   │
│      │                 │                                                │
│      ▼                 ▼                                                │
│  ┌────────────┐   ┌────────────┐                                        │
│  │  Commit    │   │  Commit    │   ← Partition no longer owned!         │
│  │  Offset    │   │  Offset    │                                        │
│  └─────┬──────┘   └─────┬──────┘                                        │
│        │                │                                               │
│        ▼                ▼                                               │
│   ╔════════════════════════════════╗                                    │
│   ║  💥 Local: Erroneous state 💥  ║                                    │
│   ║     (librdkafka error)         ║                                    │
│   ╚════════════════════════════════╝                                    │
│        │                                                                │
│        ▼                                                                │
│   ┌─────────────┐                                                       │
│   │   PANIC()   │  → Container exits                                    │
│   └─────────────┘                                                       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Root Cause

"Local: Erroneous state" is a client-side librdkafka error that occurs when the client attempts a Kafka operation while in an invalid or transitional state (e.g., after partitions were revoked during a rebalance).

With batch concurrency > 1, multiple goroutines process messages simultaneously. When a rebalance starts, they all try to store or commit offsets at nearly the same time, while the partition assignment is still changing. The client cannot handle these overlapping operations during a rebalance: calls that land before the revocation succeed, and the rest return the error, which then surfaces as a panic.

Solution

  • Track active partitions via a rebalance callback (onPartitionsAssigned / onPartitionsRevoked)
  • Skip offset commits for partitions that have been revoked
  • Filter batch offsets to only include partitions still assigned before calling StoreOffsets
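A minimal sketch of the tracking and filtering idea follows. It is not the code from this PR: `topicPartition` is a simplified, hypothetical stand-in for the client library's partition type, `partitionTracker` and `filterActive` are illustrative names, and the mutex-guarded set is an assumption about how `activePartitions` could be kept safe across goroutines.

```go
package main

import (
	"fmt"
	"sync"
)

// topicPartition is a hypothetical, simplified stand-in for the client
// library's partition type.
type topicPartition struct {
	Topic     string
	Partition int32
	Offset    int64
}

// partitionKey identifies a partition independently of any offset.
type partitionKey struct {
	topic     string
	partition int32
}

// partitionTracker records which partitions this consumer currently owns.
type partitionTracker struct {
	mu     sync.RWMutex
	active map[partitionKey]struct{}
}

func newPartitionTracker() *partitionTracker {
	return &partitionTracker{active: make(map[partitionKey]struct{})}
}

// onPartitionsAssigned marks partitions as owned (assumed to be called
// from the rebalance callback).
func (t *partitionTracker) onPartitionsAssigned(parts []topicPartition) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, p := range parts {
		t.active[partitionKey{p.Topic, p.Partition}] = struct{}{}
	}
}

// onPartitionsRevoked removes partitions that were taken away.
func (t *partitionTracker) onPartitionsRevoked(parts []topicPartition) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, p := range parts {
		delete(t.active, partitionKey{p.Topic, p.Partition})
	}
}

// isPartitionActive reports whether the partition is still owned.
func (t *partitionTracker) isPartitionActive(topic string, partition int32) bool {
	t.mu.RLock()
	defer t.mu.RUnlock()
	_, ok := t.active[partitionKey{topic, partition}]
	return ok
}

// filterActive keeps only offsets whose partition is still owned. The read
// lock is held for the whole pass so the result reflects one snapshot.
func (t *partitionTracker) filterActive(offsets []topicPartition) []topicPartition {
	t.mu.RLock()
	defer t.mu.RUnlock()
	kept := make([]topicPartition, 0, len(offsets))
	for _, o := range offsets {
		if _, ok := t.active[partitionKey{o.Topic, o.Partition}]; ok {
			kept = append(kept, o)
		}
	}
	return kept
}

func main() {
	tracker := newPartitionTracker()
	tracker.onPartitionsAssigned([]topicPartition{
		{Topic: "orders", Partition: 0},
		{Topic: "orders", Partition: 1},
	})

	// A rebalance revokes partition 1 while a batch is in flight.
	tracker.onPartitionsRevoked([]topicPartition{{Topic: "orders", Partition: 1}})

	batch := []topicPartition{
		{Topic: "orders", Partition: 0, Offset: 42},
		{Topic: "orders", Partition: 1, Offset: 17},
	}
	fmt.Println(tracker.filterActive(batch)) // prints "[{orders 0 42}]"
}
```

In this sketch the check and the later store call are separate steps, so a revocation could still land between them. The filter narrows that window; closing it fully would require the store call to run under the same lock or to tolerate the error.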

Changes

| File | Change |
| --- | --- |
| consumer.go | Added activePartitions map, rebalance callback, isPartitionActive() check in storeMessage() |
| batch_consumer.go | Same partition tracking + filters storeBatch() to only commit offsets for active partitions |
| *_test.go | Added tests for rebalance scenarios |

@sonnes force-pushed the fix/track-rebalancing branch from c63aa6f to 8f04407 on December 4, 2025 08:57
@codecov-commenter

Codecov Report

❌ Patch coverage is 96.19048% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.24%. Comparing base (6ce2a14) to head (8f04407).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| xkafka/batch_consumer.go | 96.29% | 1 Missing and 1 partial ⚠️ |
| xkafka/consumer.go | 95.83% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #67      +/-   ##
==========================================
+ Coverage   83.30%   84.24%   +0.93%     
==========================================
  Files          60       60              
  Lines        2810     2316     -494     
==========================================
- Hits         2341     1951     -390     
+ Misses        456      351     -105     
- Partials       13       14       +1     
