101 changes: 101 additions & 0 deletions A113-pick-first-weighted-shuffling.md
@@ -0,0 +1,101 @@
A113: pick_first: Weighted Random Shuffling
----
* Author(s): Alex Polcyn (@apolcyn)
* Approver: Mark Roth (@markdroth), Eric Anderson (@ejona86), Doug Fawley (@dfawley), Easwar Swaminathan (@easwars)
* Status: In Review
* Implemented in: <language, ...>
* Last updated: Jan 26, 2026
* Discussion at: https://groups.google.com/g/grpc-io/c/iCsweGDmUU4

## Abstract

Support weighted random shuffling in the pick first LB policy.

## Background

The pick first LB policy currently supports random shuffling of the address list. A primary purpose
of this feature is load balancing; however, it does not take locality or endpoint weights (when
present) into account. This can lead to skewed load distribution and hotspots when the load
balancing control plane delivers varied weights and expects them to be followed.


### Related Proposals:
* [A62](https://github.com/grpc/proposal/blob/master/A62-pick-first.md): pick_first: sticky TRANSIENT_FAILURE and address order randomization
* [A42](https://github.com/grpc/proposal/blob/master/A42-xds-ring-hash-lb-policy.md): xDS Ring Hash LB Policy

## Proposal

### Changes within Pick First

Modify the behavior of pick_first when the `shuffle_address_list` option is set so that it
performs a weighted random shuffle *based on per-endpoint weights*. To do this, we will
use the [Weighted Random Sampling](https://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf) algorithm
proposed by Efraimidis and Spirakis:

1) Assign a key to each endpoint, `u ^ (1 / weight)`, where `u` is a uniform random number in `(0, 1)` and `weight`
is the weight of the endpoint (as present in a weight attribute). Default `weight` to 1 if no weight attribute is
present.

2) Sort endpoints by key in *descending* order.

Note: the paper suggests `u` be in `(0, 1)` *exclusive*. A random number of exactly zero or one
yields a key of zero or one regardless of the weight, so the weight is effectively dropped for that
draw. Zero also technically does not transform to the exponential distribution that we are trying
to create. However, load balancing skew introduced by such edge cases is unlikely to be noticeable, so
implementations are free to include these bounds as long as doing so does not cause other problems
(e.g. crashes).
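
The two steps above can be sketched as follows. This is a minimal illustration, not the planned implementation: the `Endpoint` struct and the `WeightedShuffle` function are hypothetical names, and the sketch assumes the weight has already been read from the endpoint's weight attribute (or defaulted to 1).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Hypothetical endpoint type, for illustration only.
struct Endpoint {
  std::string address;
  uint32_t weight;  // Use 1 when no weight attribute is present.
};

// Returns `endpoints` in weighted random order: each endpoint is assigned
// the key u^(1/weight) and the list is sorted by key, largest first.
template <typename Rng>
std::vector<Endpoint> WeightedShuffle(const std::vector<Endpoint>& endpoints,
                                      Rng& rng) {
  // Draws from [0, 1); the proposal notes that including a bound is tolerable.
  std::uniform_real_distribution<double> uniform(0.0, 1.0);
  std::vector<std::pair<double, Endpoint>> keyed;
  keyed.reserve(endpoints.size());
  for (const Endpoint& endpoint : endpoints) {
    const double key = std::pow(uniform(rng), 1.0 / endpoint.weight);
    keyed.emplace_back(key, endpoint);
  }
  std::sort(keyed.begin(), keyed.end(),
            [](const std::pair<double, Endpoint>& a,
               const std::pair<double, Endpoint>& b) {
              return a.first > b.first;  // Descending by key.
            });
  std::vector<Endpoint> shuffled;
  shuffled.reserve(keyed.size());
  for (auto& entry : keyed) shuffled.push_back(std::move(entry.second));
  return shuffled;
}
```

With this scheme, the probability that a given endpoint ends up first in the list is its weight divided by the sum of all weights, which is the property pick_first needs since it connects to addresses in list order.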


### CDS LB Policy changes: Computing Endpoint Weights

In xDS, there are both locality weights and endpoint weights. The load balancing control plane
expects the client to *first* pick a locality and *second* pick an endpoint within that locality. The
overall probability distribution expressed by the per-endpoint weights must reflect this. As such, we
need to normalize locality weights within each priority and endpoint weights within each locality; the
final weight provided to `pick_first` should be the product of the two normalized weights (i.e. the
probability of the logical AND of the two selection events).

The CDS LB policy currently calculates per-endpoint weight attributes. It will continue to do so, but
we need to fix the mechanics: an endpoint's final weight should be the product of its *normalized* locality
weight and *normalized* endpoint weight, rather than the product of the raw weights.
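
Written out, the intended final weight looks like the following. The symbols are introduced here for illustration only: $W_\ell$ is the raw weight of locality $\ell$ in priority $p$, and $w_e$ is the raw weight of endpoint $e$ in locality $\ell$.

```math
w_{\text{final}}(e) = \frac{W_\ell}{\sum_{\ell' \in p} W_{\ell'}} \cdot \frac{w_e}{\sum_{e' \in \ell} w_{e'}}
```

As a worked example, take two localities of equal raw weight 1, where locality A has one endpoint and locality B has two endpoints, all with raw endpoint weight 1. The raw product is 1 for every endpoint, so each would receive 1/3 of the load. With normalization, A's endpoint gets $\tfrac{1}{2} \cdot 1 = \tfrac{1}{2}$ and each of B's endpoints gets $\tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}$, which matches picking the locality first and the endpoint second.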

Note: as a side effect this will fix per-endpoint weights in Ring Hash LB, which
[currently](https://github.com/grpc/proposal/blob/master/A42-xds-ring-hash-lb-policy.md#change-child-policy-config-generation-in-xds_cluster_resolver-policy) are a product of the initial *raw* locality and endpoint weights.
This "fix" will not require any changes within Ring Hash LB itself.

We can continue to represent weights as integers if we represent their normalized values in
fixed point UQ1.31 format. The math is as follows (credit to @ejona86):
Member

If you're going to use this notation, then it is UQ1.31. Q1.31 is signed.

Contributor Author

thanks fixed


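```
// To normalize a raw weight into UQ1.31 (ONE represents 1.0).
// The unsigned literal avoids signed overflow in the shift.
const uint32_t ONE = 1u << 31;
weight = (uint32_t) ((uint64_t) weight * ONE / weight_sum);

// To multiply the normalized locality and endpoint weights for an endpoint:
weight = (uint32_t) (((uint64_t) locality_weight * weight) >> 31);
// Keep every endpoint selectable: never let a weight round down to zero.
if (weight == 0) weight = 1;
```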

Note: currently we round down (truncate), and then round up to 1 if the result is zero.
We *could* use more accurate rounding schemes. However, rounding down
is simple and should provide enough precision for load balancing
purposes. For example, we only round down to zero if the product of
two normalized weight probabilities is less than `2 ^ -31`; this kind
of error is unlikely to cause noticeable skew in load balancing.
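
Putting the normalization and multiplication together for all localities of one priority might look like the sketch below. The `Locality` struct and the function names are hypothetical, and the sketch assumes every weight sum is nonzero and fits in a uint32.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical input type, for illustration only.
struct Locality {
  uint32_t weight;                         // Raw locality weight.
  std::vector<uint32_t> endpoint_weights;  // Raw endpoint weights.
};

// 1.0 in UQ1.31.
constexpr uint32_t kOne = uint32_t{1} << 31;

// Normalizes `weight` against `weight_sum`, returning a UQ1.31 value.
// Assumes weight_sum > 0 and weight <= weight_sum.
uint32_t Normalize(uint32_t weight, uint64_t weight_sum) {
  return static_cast<uint32_t>(static_cast<uint64_t>(weight) * kOne /
                               weight_sum);
}

// Computes the final per-endpoint weights for the localities of a single
// priority. Results are flattened in locality order.
std::vector<uint32_t> ComputeEndpointWeights(
    const std::vector<Locality>& localities) {
  uint64_t locality_sum = 0;
  for (const Locality& locality : localities) locality_sum += locality.weight;
  std::vector<uint32_t> result;
  for (const Locality& locality : localities) {
    const uint32_t locality_weight = Normalize(locality.weight, locality_sum);
    uint64_t endpoint_sum = 0;
    for (uint32_t w : locality.endpoint_weights) endpoint_sum += w;
    for (uint32_t w : locality.endpoint_weights) {
      uint32_t weight = Normalize(w, endpoint_sum);
      // UQ1.31 multiply: the 64-bit product carries 62 fractional bits.
      weight = static_cast<uint32_t>(
          (static_cast<uint64_t>(locality_weight) * weight) >> 31);
      if (weight == 0) weight = 1;  // Never round a weight down to zero.
      result.push_back(weight);
    }
  }
  return result;
}
```

For two localities of raw weight 1, the first with one endpoint and the second with two endpoints (all of raw weight 1), this returns `2^30, 2^29, 2^29`, i.e. 1/2, 1/4 and 1/4 in UQ1.31.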

### Temporary environment variable protection

CDS LB policy and Pick First LB policy behavior changes will be guarded by `GRPC_EXPERIMENTAL_PF_WEIGHTED_SHUFFLING`.

Barring unexpected issues, this should be enabled by default.
Member

"After testing"? (Although unit testing isn't enough...) Or "after the implementation is shown to be stable"? I feel like we've written something like that before, but I don't recall where.

Contributor Author

done


## Rationale

* CDS LB policy changes are needed to generate correct weight distributions, not only for Pick First but
also for Ring Hash.
* The fixed point UQ1.31 format has predictable bounds on precision and allows us to continue representing
weights as integers. Note that our math assumes the sum of weights within a grouping does not exceed the
maximum uint32 value, which is mandated by the xDS protocol.
Member

Somewhere we should say we are preserving the "sum is a uint32" property for the internal weights (after the multiplication). That can be in the CDS section, or in this section. But we should call it out as a thing we are doing on purpose and desire. (It could also be used as an explanation for why we are using unsigned UQ1.31 instead of UQ32)

Contributor Author

good point, done


## Implementation

TBD