kvserver: ensure AdminScatter obtains an allocator token #144579

Describe the problem

On a couple of occasions recently, we saw a large restore failing due to errors during the scatter phase. This resulted in an unbalanced distribution of the newly split-out ranges. As the restore progressed, some of the nodes' disks reached their capacity and the restore paused.

The scatters failed due to concurrent changes to the range descriptors, initiated by the replicate queue. At the same time, the replicate queue was hitting the same errors, hinting at contention between the replicate queue and scatter. Because replication changes are multi-step (e.g. adding a learner, promoting it to a voter, etc.), and each step involves a descriptor change, it's easy to see how two independent sources of replication changes can step on each other's toes.

To avoid this race, all replication and lease changes need to acquire an allocator token.

// AllocatorToken is a token which provides mutual exclusion for allocator
// execution. When the token is acquired, other acquirers will fail until
// release.
//
// The leaseholder replica should acquire an allocator token before beginning
// replica or lease changes on a range. After the changes have
// failed/succeeded, the token should be released. The goal is to limit the
// amount of concurrent reshuffling activity of a range.
type AllocatorToken struct {
	mu struct {
		syncutil.Mutex
		acquired     bool
		acquiredName string
	}
}
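
For illustration, here is a minimal sketch of what acquisition and release on such a token might look like. Only TryAcquire(ctx, name) appears in the excerpts in this issue; the method bodies and the Release signature below are assumptions, not the actual implementation.

// TryAcquire (sketch): non-blocking acquisition that fails immediately if
// another caller, identified by acquiredName, already holds the token.
func (a *AllocatorToken) TryAcquire(ctx context.Context, name string) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.mu.acquired {
		return errors.Errorf("allocator token is held by %s", a.mu.acquiredName)
	}
	a.mu.acquired = true
	a.mu.acquiredName = name
	return nil
}

// Release (sketch): hand the token back so other acquirers can succeed.
func (a *AllocatorToken) Release(ctx context.Context) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.mu.acquired = false
	a.mu.acquiredName = ""
}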

The replicate queue already does so, in process:

func (rq *replicateQueue) process(
	ctx context.Context, repl *Replica, confReader spanconfig.StoreReader,
) (processed bool, err error) {
	if tokenErr := repl.allocatorToken.TryAcquire(ctx, rq.name); tokenErr != nil {
		log.KvDistribution.VEventf(ctx,
			1, "unable to acquire allocator token to process range: %v", tokenErr)
		return false, tokenErr
	}

However, adminScatter does not. It invokes processOneChange directly, bypassing process and the allocator token acquisition there:

_, err = rq.processOneChange(
	ctx, r, desc, conf, true /* scatter */, false, /* dryRun */
)

In fact, adminScatter does obtain the allocator token later on, when scattering leases, but it needs to do so for the replica scatter phase above as well.
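
One possible shape of the fix, sketched here for illustration (the "scatter" name string, the exact placement inside adminScatter, and the Release call are assumptions based on the excerpts above, not the actual change):

// Sketch: bracket the replica-scatter change with the allocator token,
// mirroring replicateQueue.process, so scatter and the replicate queue
// cannot run replication changes on the same range concurrently.
if tokenErr := r.allocatorToken.TryAcquire(ctx, "scatter"); tokenErr != nil {
	log.KvDistribution.VEventf(ctx, 1,
		"unable to acquire allocator token to scatter range: %v", tokenErr)
	return tokenErr
}
_, err = rq.processOneChange(
	ctx, r, desc, conf, true /* scatter */, false, /* dryRun */
)
r.allocatorToken.Release(ctx)

Whether to return, retry, or skip the range when the token is already held is a separate design choice; the important part is that the replica-scatter phase and the replicate queue can no longer reshuffle the same range at the same time.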

To Reproduce

For a unit test repro, see #144580.
We should reproduce this in a roachtest as well, e.g. by running a large restore, or by dropping a large table and then restoring it.

Expected behavior
AdminScatter should obtain an allocator token for the replica scattering phase, similarly to how it does for the lease scattering phase.

Environment:
This has likely been an issue for a while; allocator tokens themselves were only introduced in 24.2 (#119410).

Additional context
We've seen this recently in a large-customer escalation as well as in DRT large-scale testing.

Jira issue: CRDB-49435

Labels

A-kv-distribution: Relating to rebalancing and leasing.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs.
T-kv: KV Team.
branch-master: Failures and bugs on the master branch.