I simplified it into the following scenario:
3 brokers. 1 topic with 4 partitions. 2 consumer instances consuming that topic. Brokers, partitions and consumers are all indexed from 0.
When c0 (consumer instance 0) calls utils.go#dividePartitionsBetweenConsumers(), the partition leaders are:
- p0 on b0
- p1 on b1
- p2 on b2
- p3 on b0
After sorting (by leader, then by partition id), partitions looks like:
[
{p0 b0 xxxx} // {partition leader address}
{p3 b0 xxxx}
{p1 b1 xxxx}
{p2 b2 xxxx}
]
So c0 gets p0 and p3 as its myPartitions (to claim).
Then p0 somehow changes its leader to b2. The leaders are now:
- p0 on b2
- p1 on b1
- p2 on b2
- p3 on b0
Then another consumer instance, c1, calls utils.go#dividePartitionsBetweenConsumers().
After sorting (by leader, then by partition id), partitions looks like:
[
{p3 b0 xxxx}
{p1 b1 xxxx}
{p0 b2 xxxx}
{p2 b2 xxxx}
]
So c1 gets p0 and p2 as its myPartitions (to claim).
As a result, c0 tries to claim p0 and p3 while c1 tries to claim p0 and p2:
- c0 and c1 both fight for p0, and c0 wins (as it tried to claim it first).
- But no one tries to claim p1.
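The scenario above can be reproduced with a minimal sketch. It is not the actual utils.go code: the partitionLeader struct is a simplified, hypothetical stand-in (integer broker ids instead of addresses), and claimFor is a hypothetical helper that assumes each consumer takes a contiguous, equal-sized chunk of the sorted list, which matches the outcome described above.

```go
package main

import (
	"fmt"
	"sort"
)

// partitionLeader is a simplified, hypothetical stand-in for the entries
// sorted in utils.go: a partition id and the broker id of its leader.
type partitionLeader struct {
	id     int
	leader int
}

type partitionLeaders []partitionLeader

func (pls partitionLeaders) Len() int      { return len(pls) }
func (pls partitionLeaders) Swap(i, j int) { pls[i], pls[j] = pls[j], pls[i] }

// Less uses the same ordering as the issue: by leader, then by partition id.
func (pls partitionLeaders) Less(i, j int) bool {
	return pls[i].leader < pls[j].leader || (pls[i].leader == pls[j].leader && pls[i].id < pls[j].id)
}

// claimFor sorts a copy of pls and returns the partition ids in the chunk
// belonging to consumer index c out of n consumers. It assumes (for this
// sketch only) that len(pls) is divisible by n.
func claimFor(pls partitionLeaders, c, n int) []int {
	sorted := append(partitionLeaders(nil), pls...)
	sort.Sort(sorted)
	size := len(sorted) / n
	ids := []int{}
	for _, pl := range sorted[c*size : (c+1)*size] {
		ids = append(ids, pl.id)
	}
	return ids
}

func main() {
	// Leaders as seen by c0, before p0 moves.
	before := partitionLeaders{{0, 0}, {1, 1}, {2, 2}, {3, 0}}
	// Leaders as seen by c1, after p0 moves to b2.
	after := partitionLeaders{{0, 2}, {1, 1}, {2, 2}, {3, 0}}

	fmt.Println("c0 claims:", claimFor(before, 0, 2)) // c0 claims: [0 3]
	fmt.Println("c1 claims:", claimFor(after, 1, 2))  // c1 claims: [0 2]
}
```

Both consumers end up with p0 in their lists and neither has p1, because each one computed its chunk from a differently ordered list.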
In utils.go, we sort the partitionLeaders by leader first, then by partition id:
func (pls partitionLeaders) Less(i, j int) bool {
	return pls[i].leader < pls[j].leader || (pls[i].leader == pls[j].leader && pls[i].id < pls[j].id)
}
When a leader changes between two calls of dividePartitionsBetweenConsumers(), the sorted order changes, so the two consumers divide up different lists. I believe the root cause is here.
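To illustrate why the sort key matters, here is a sketch of the same division with an ordering that ignores the leader and uses only the partition id. This is an assumption for comparison, not a fix confirmed anywhere in the project, and it likely gives up whatever benefit grouping by leader was meant to provide. The partitionLeader struct and claimByID helper are hypothetical, as is the contiguous equal-sized chunking.

```go
package main

import (
	"fmt"
	"sort"
)

// partitionLeader is a simplified, hypothetical stand-in: a partition id
// and the broker id of its current leader.
type partitionLeader struct {
	id     int
	leader int
}

// claimByID sorts a copy of pls by partition id only (the leader is
// ignored) and returns the ids in the chunk for consumer index c out of
// n consumers. It assumes len(pls) is divisible by n.
func claimByID(pls []partitionLeader, c, n int) []int {
	sorted := append([]partitionLeader(nil), pls...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].id < sorted[j].id })
	size := len(sorted) / n
	ids := []int{}
	for _, pl := range sorted[c*size : (c+1)*size] {
		ids = append(ids, pl.id)
	}
	return ids
}

func main() {
	// Same two snapshots as in the scenario: p0 moves from b0 to b2.
	before := []partitionLeader{{0, 0}, {1, 1}, {2, 2}, {3, 0}}
	after := []partitionLeader{{0, 2}, {1, 1}, {2, 2}, {3, 0}}

	fmt.Println("c0 claims:", claimByID(before, 0, 2)) // c0 claims: [0 1]
	fmt.Println("c1 claims:", claimByID(after, 1, 2))  // c1 claims: [2 3]
}
```

With an order that does not depend on the leader, c0 and c1 compute disjoint chunks that together cover all four partitions, even though they ran on either side of the leader change.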