kvserver: ensure AdminScatter obtains an allocator token #144579

Describe the problem

On a couple of occasions recently, we saw a large restore failing due to errors during the scatter phase. This resulted in an unbalanced distribution of the newly split-out ranges. As the restore progressed, some of the nodes' disks reached their capacity and the restore paused.

The scatters failed due to concurrent changes to the range descriptors, initiated by the replicate queue. At the same time, the replicate queue was hitting the same errors, hinting at contention between the replicate queue and scatter. Because replication changes are multi-step (e.g. adding a learner, promoting it to a voter, etc.), and each step involves a descriptor change, it's easy to see how two independent sources of replication changes can step on each other's toes.

To avoid this race, all replication and lease changes need to acquire an allocator token.

// AllocatorToken is a token which provides mutual exclusion for allocator
// execution. When the token is acquired, other acquirers will fail until
// release.
//
// The leaseholder replica should acquire an allocator token before beginning
// replica or lease changes on a range. After the changes have
// failed/succeeded, the token should be released. The goal is to limit the
// amount of concurrent reshuffling activity of a range.
type AllocatorToken struct {
	mu struct {
		syncutil.Mutex
		acquired     bool
		acquiredName string
	}
}
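
For illustration, here is a minimal sketch of what acquisition and release on such a token might look like. Only TryAcquire(ctx, name) appears in the excerpts in this issue; the method bodies and the Release signature below are assumptions, not the actual implementation.

// TryAcquire (sketch): non-blocking acquisition that fails immediately if
// another caller, identified by acquiredName, already holds the token.
func (a *AllocatorToken) TryAcquire(ctx context.Context, name string) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.mu.acquired {
		return errors.Errorf("allocator token is held by %s", a.mu.acquiredName)
	}
	a.mu.acquired = true
	a.mu.acquiredName = name
	return nil
}

// Release (sketch): hand the token back so other acquirers can succeed.
func (a *AllocatorToken) Release(ctx context.Context) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.mu.acquired = false
	a.mu.acquiredName = ""
}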

The replicate queue already does so, in process:

func (rq *replicateQueue) process(
	ctx context.Context, repl *Replica, confReader spanconfig.StoreReader,
) (processed bool, err error) {
	if tokenErr := repl.allocatorToken.TryAcquire(ctx, rq.name); tokenErr != nil {
		log.KvDistribution.VEventf(ctx,
			1, "unable to acquire allocator token to process range: %v", tokenErr)
		return false, tokenErr
	}

However, adminScatter does not. It invokes processOneChange directly, bypassing process and the allocator token acquisition there:

_, err = rq.processOneChange(
	ctx, r, desc, conf, true /* scatter */, false, /* dryRun */
)

In fact, adminScatter does obtain the allocator token later on, when scattering leases, but it needs to do so for the replica scatter phase above as well.
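
One possible shape of the fix, sketched here for illustration (the "scatter" name string, the exact placement inside adminScatter, and the Release call are assumptions based on the excerpts above, not the actual change):

// Sketch: bracket the replica-scatter change with the allocator token,
// mirroring replicateQueue.process, so scatter and the replicate queue
// cannot run replication changes on the same range concurrently.
if tokenErr := r.allocatorToken.TryAcquire(ctx, "scatter"); tokenErr != nil {
	log.KvDistribution.VEventf(ctx, 1,
		"unable to acquire allocator token to scatter range: %v", tokenErr)
	return tokenErr
}
_, err = rq.processOneChange(
	ctx, r, desc, conf, true /* scatter */, false, /* dryRun */
)
r.allocatorToken.Release(ctx)

Whether to return, retry, or skip the range when the token is already held is a separate design choice; the important part is that the replica-scatter phase and the replicate queue can no longer reshuffle the same range at the same time.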

To Reproduce

For a unit test repro, see #144580.
We should reproduce this in a roachtest as well, e.g. by running a large restore, or by dropping a large table and then restoring it.

Expected behavior
AdminScatter should obtain an allocator token for the replica scattering phase, similarly to how it does for the lease scattering phase.

Environment:
This has likely been an issue for a while; allocator tokens themselves were only introduced in 24.2 (#119410).

Additional context
We've seen this recently in a large-customer escalation as well as in DRT large-scale testing.

Jira issue: CRDB-49435

Labels

A-kv-distribution: Relating to rebalancing and leasing.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs.
T-kv: KV Team.
branch-master: Failures and bugs on the master branch.