Segment Rebalance Race

I've identified a race in segment movement which causes temporary segment unavailability (and thus partial query results).

Segment load/drop callbacks are racing with [prepareCurrentServers](https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/server/coordinator/duty/PrepareBalancerAndLoadQueues.java#L139-L147). 

Consider the following scenario: Coordinator is moving segment `S` from server `A` to server `B`. `S` has 1x replication.

Basically, the load/drop callbacks can happen between the start/end of this function in a way where you get a 

```
SegmentReplicaCount{requiredAndLoadable=1, required=1, loaded=2, loadedNonHistorical=0, loading=0, dropping=0, movingTo=0, movingFrom=0}. 
```

Essentially:
```
T0[Coordinator]: enter prepareCurrentServers()
T1[Coordinator]: Server[B] completed request[MOVE_TO] on segment[S] with status[SUCCESS]
T2[Coordinator]: Dropping segment [S] from server[A]
T3[A]: Completely removing segment[S] in [30,000]ms.
T4[Coordinator]: Server[A] completed request[DROP] on segment[S] with status[SUCCESS].
T5[Coordinator]: exit prepareCurrentServers()
T6[Coordinator]: enter prepareCluster()
T7[Coordinator]: exit prepareCluster()
T8[Coordinator]: enter initReplicaCounts()
T9[Coordinator]: exit initReplicaCounts()
T10[Coordinator]: Segment S replica count is SegmentReplicaCount{requiredAndLoadable=1, required=1, loaded=2, loadedNonHistorical=0, loading=0, dropping=0, movingTo=0, movingFrom=0}
T11[Coordinator]: Dropping segment [S] from server[B]
```

I think what's happening is that the server loop in `prepareCurrentServers()` reads the servers in a state where LOAD has persisted in the server view (2x loaded) but the DROP has not materialized yet in the view. This causes `loaded=2`. Then, I thought the in-flight DROP (since it hasn't materialized in the view) would get picked up in the `queuedSegments` load queue peons (and show up as `dropping=1`), but I think since the DROP callback returns – and clears the entry from the old `queuedSegments` load queue peons – before `prepareCluster()` has a chance to copy over the queued action to new `queuedSegments`, we lose that important bit of information. Hence, you are left in a weird state with a "valid" queue but an invalid load state. In other words, I think we need to somehow synchronize callbacks with this `prepareCurrentServers()` and `prepareCluster()`.

### Affected Version

All recent Druid versions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Segment Rebalance Race #18764

Affected Version

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Segment Rebalance Race #18764

Description

Affected Version

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions