Description
Describe the problem
On a couple of occasions recently, we saw a large restore failing due to errors during the scatter phase. This resulted in an unbalanced distribution of the newly split-out ranges. As the restore progressed, some of the nodes' disks reached their capacity and the restore paused.
The scatters failed due to concurrent changes in the range descriptors, initiated by the replicate queue. At the same time, the replicate queue was encountering the same errors, which points to contention between the replicate queue and scatter. Because replication changes are multi-step (e.g. adding a learner, promoting it to a voter, etc.), and each step involves a descriptor change, it's easy to see how two independent sources of replication changes can step on each other's toes.
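The contention described above can be illustrated with a minimal sketch. It assumes a descriptor update behaves like a compare-and-swap on a generation counter; the types and names here (`rangeDescriptor`, `rangeState`, `update`) are hypothetical and are not the kvserver API.

```go
package main

import (
	"errors"
	"fmt"
)

// rangeDescriptor is a simplified, hypothetical stand-in for a range
// descriptor: a generation counter plus the current replica set.
type rangeDescriptor struct {
	generation int
	replicas   []string
}

var errDescriptorChanged = errors.New("descriptor changed by a concurrent writer")

// rangeState holds the committed descriptor for a single range.
type rangeState struct {
	desc rangeDescriptor
}

// update applies one replication step, but only if the caller's snapshot
// still matches the committed generation (a compare-and-swap).
func (r *rangeState) update(expected rangeDescriptor, replicas []string) (rangeDescriptor, error) {
	if r.desc.generation != expected.generation {
		return rangeDescriptor{}, errDescriptorChanged
	}
	r.desc = rangeDescriptor{generation: expected.generation + 1, replicas: replicas}
	return r.desc, nil
}

func main() {
	r := &rangeState{desc: rangeDescriptor{generation: 1, replicas: []string{"n1", "n2", "n3"}}}

	// Both actors read the same descriptor before either makes a change.
	scatterView := r.desc
	queueView := r.desc

	// The replicate queue completes its first step (adding a learner on n4).
	_, err := r.update(queueView, []string{"n1", "n2", "n3", "n4(learner)"})
	fmt.Println("replicate queue step 1:", err)

	// Scatter then attempts its own first step against a stale descriptor.
	_, err = r.update(scatterView, []string{"n1", "n2", "n3", "n5(learner)"})
	fmt.Println("scatter step 1:", err)
}
```

Whichever actor commits first invalidates the other's snapshot, so with multi-step changes either side can fail partway through.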
To avoid this race, all replication and lease changes need to acquire an allocator token.
cockroach/pkg/kv/kvserver/allocator/plan/token.go
Lines 17 to 31 in c83c57d
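As a rough illustration of the idea, a token of this kind can be modeled as a non-blocking, single-holder lock. The sketch below is an assumption-laden simplification: the `token` type and its `tryAcquire` and `release` methods are hypothetical names, not the code in `token.go`.

```go
package main

import (
	"fmt"
	"sync"
)

// token is a hypothetical, simplified allocator token: at most one
// named holder at a time, and acquisition never blocks.
type token struct {
	mu     sync.Mutex
	holder string // empty when the token is free
}

// tryAcquire takes the token for name, or returns an error naming the
// current holder instead of waiting.
func (t *token) tryAcquire(name string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.holder != "" {
		return fmt.Errorf("allocator token held by %s", t.holder)
	}
	t.holder = name
	return nil
}

// release frees the token so another actor can acquire it.
func (t *token) release() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.holder = ""
}

func main() {
	var tok token
	fmt.Println(tok.tryAcquire("replicate queue")) // token is free, so this succeeds
	fmt.Println(tok.tryAcquire("scatter"))         // fails while the queue holds it
	tok.release()
	fmt.Println(tok.tryAcquire("scatter")) // succeeds after release
}
```

Failing fast rather than blocking lets the caller back off and retry, instead of stalling behind a multi-step change that is already in flight.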
The replicate queue already does so, in `process`:
cockroach/pkg/kv/kvserver/replicate_queue.go
Lines 633 to 640 in aecc555
However, `adminScatter` does not. It invokes `processOneChange` directly, circumventing the allocator token acquisition in `process`, which internally calls `processOneChange`:
cockroach/pkg/kv/kvserver/replica_command.go
Lines 4199 to 4201 in e1318fa
In fact, `adminScatter` does obtain the allocator token later on, when scattering leases, but it needs to do so for the replica scatter phase above as well.
To Reproduce
For a unit test repro, see #144580.
We should repro this in a roachtest as well; e.g. running a large restore or dropping a large table and then restoring it.
Expected behavior
`AdminScatter` should obtain an allocator token for the replica scattering phase, similarly to how it does for the lease scattering phase.
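One possible shape of the expected behavior is sketched below, assuming a one-slot token with non-blocking acquisition. All names (`token`, `withToken`, `scatter`) are hypothetical and do not reflect the actual `adminScatter` code. Whether the token should be held per phase, as here, or across both phases is left open.

```go
package main

import (
	"errors"
	"fmt"
)

var errTokenHeld = errors.New("allocator token held by another actor")

// token is a hypothetical one-slot token; a buffered channel of
// capacity 1 gives non-blocking acquire semantics.
type token chan struct{}

// tryAcquire reports whether the token was free and is now held.
func (t token) tryAcquire() bool {
	select {
	case t <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees the token; it must only be called by the holder.
func (t token) release() { <-t }

// withToken runs one phase only while holding the token.
func withToken(t token, phase func() error) error {
	if !t.tryAcquire() {
		return errTokenHeld
	}
	defer t.release()
	return phase()
}

// scatter guards the replica phase with the token, the same way the
// lease phase is guarded.
func scatter(t token, scatterReplicas, scatterLeases func() error) error {
	if err := withToken(t, scatterReplicas); err != nil {
		return fmt.Errorf("replica scatter: %w", err)
	}
	if err := withToken(t, scatterLeases); err != nil {
		return fmt.Errorf("lease scatter: %w", err)
	}
	return nil
}

func main() {
	tok := make(token, 1)
	noop := func() error { return nil }

	fmt.Println(scatter(tok, noop, noop)) // token is free, both phases run

	tok.tryAcquire()                      // simulate the replicate queue holding the token
	fmt.Println(scatter(tok, noop, noop)) // replica phase backs off instead of racing
}
```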
Environment:
This has likely been an issue for a while, but allocator tokens were introduced in 24.2 (#119410).
Additional context
We've seen this recently in a large-customer escalation as well as in DRT large-scale testing.
Jira issue: CRDB-49435