Commit 1423c95

A114: WRR Support for Custom Backend Metrics
1 parent d926ac4 commit 1423c95

1 file changed

+167
-0
lines changed

@@ -0,0 +1,167 @@
1+
# A114: WRR Support for Custom Backend Metrics
2+
3+
- Author(s): sauravzg
4+
- Approver: markdroth
5+
- Status: In review
6+
- Implemented in:
7+
- Last updated: 2026-01-30
8+
- Discussion at:
9+
10+
## Abstract
11+
12+
This proposal updates the client-side `weighted_round_robin` (WRR) load balancing policy to support customizable utilization metrics. It adds a new configuration field `metric_names_for_computing_utilization` to the WRR LB policy config. This allows users to specify which backend metrics should be used to compute endpoint weights, enabling the use of custom metrics (via ORCA named metrics) instead of relying solely on the default `application_utilization` or `cpu_utilization`.
13+
14+
## Background
15+
16+
The existing `weighted_round_robin` policy (defined in [gRFC A58][A58]) calculates endpoint weights based on standard metrics provided by the backend via ORCA (Open Request Cost Aggregation) load reports. Specifically, it uses `application_utilization` if available, and falls back to `cpu_utilization`.
17+
18+
However, services may want to drive load balancing decisions based on other resources, such as memory utilization, queue depth, or custom application-defined metrics. The [xDS Custom Backend Metrics][A51] specification (ORCA) supports reporting arbitrary named metrics, and xDS has updated its WRR implementation to allow selecting these metrics for utilization calculation.
19+
20+
To support advanced load balancing scenarios, gRPC's WRR policy needs to support this flexibility.
21+
22+
### Related Proposals
23+
24+
- [A58: Client-Side Weighted Round Robin LB Policy][A58]
25+
- [A51: Custom Backend Metrics][A51]
26+
27+
## Proposal
28+
29+
### Service Config Update
30+
31+
We will add a new field `metric_names_for_computing_utilization` to the [`WeightedRoundRobinLbConfig`][WeightedRoundRobinLbConfigProto] message in the Service Config.
32+
33+
```protobuf
message WeightedRoundRobinLbConfig {
  // ... existing fields ...

  // A list of metric names to use for computing utilization.
  //
  // By default, endpoint weight is computed based on the 'application_utilization'
  // field reported by the endpoint.
  //
  // If 'application_utilization' is not set, then utilization will instead be
  // computed by taking the max of the values of the metrics specified here.
  //
  // For map fields in the ORCA proto, the string will be of the form
  // "<map_field_name>.<map_key>". For example, the string "named_metrics.foo"
  // will mean to look for the key "foo" in the ORCA "named_metrics" field.
  //
  // If none of the specified metrics are present in the load report, then
  // 'cpu_utilization' is used instead.
  repeated string metric_names_for_computing_utilization = 7;
}
```
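
As an illustration of how a client might set this field, the following Python sketch builds a JSON service config that selects the WRR policy. It assumes the field is exposed under its standard proto3 JSON name, `metricNamesForComputingUtilization`; the metric names `named_metrics.queue_depth_ratio` and `mem_utilization` are example choices only, not values required by this proposal.

```python
import json

# Hypothetical service config selecting WRR with custom utilization metrics.
# The lowerCamelCase key assumes the standard proto3 JSON mapping of
# metric_names_for_computing_utilization.
service_config = {
    "loadBalancingConfig": [
        {
            "weighted_round_robin": {
                "metricNamesForComputingUtilization": [
                    "named_metrics.queue_depth_ratio",  # key in the ORCA named_metrics map
                    "mem_utilization",                  # top-level ORCA field
                ],
            }
        }
    ]
}

# Serialized form, as it could be passed to a channel or returned by a resolver.
service_config_json = json.dumps(service_config, indent=2)
```

With this config, utilization would be the larger of the two listed metrics whenever the backend does not report `application_utilization`.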
54+
55+
### Weight Calculation Logic
56+
57+
The weight calculation logic in the WRR policy will be updated to determine the `utilization` value from the [`OrcaLoadReport`][OrcaLoadReportProto] as follows:
58+
59+
1. **Check `application_utilization`**: If the backend reports `application_utilization` (value > 0), use it.
60+
2. **Check Custom Metrics**: If `application_utilization` is not reported (or is 0), and `metric_names_for_computing_utilization` is configured:
61+
- Iterate through the specified metric names.
62+
- **Resolve Metric Value**:
63+
- If the name is of the format `field.key` (e.g., `named_metrics.foo`), look up the map field `field` and retrieve the value for `key`.
64+
- If the name is a simple field name (e.g., `cpu_utilization`, `mem_utilization`), look up the `field`.
65+
- **Compute Max**: Track the maximum value among all successfully resolved, finite metrics.
66+
- If a max value is found, use it as the `utilization`.
67+
3. **Fallback to `cpu_utilization`**: If neither of the above yields a value, use `cpu_utilization` (utilizing the existing logic).
68+
69+
#### Pseudocode
70+
71+
```
function GetUtilization(report, configured_metrics):
  # 1. Prefer application_utilization if present (0 is treated as unset)
  if report.application_utilization > 0:
    return report.application_utilization

  # 2. Check custom metrics
  max_util = null

  for metric_name in configured_metrics:
    value = null

    if metric_name contains ".":
      # Map lookup (e.g. "named_metrics.foo" -> map="named_metrics", key="foo")
      map_name, key = split_on_first_dot(metric_name)
      if report has map field map_name and report[map_name] contains key:
        value = report[map_name][key]
    else:
      # Top-level field lookup (e.g. "mem_utilization")
      if report has field metric_name:
        value = report[metric_name]

    # Only consider values that were resolved and are not NaN
    if value is not null and not is_nan(value):
      if max_util is null or value > max_util:
        max_util = value

  if max_util is not null:
    return max_util

  # 3. Fallback
  return report.cpu_utilization
```
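
For readers who prefer something executable, the selection logic can be sketched in Python as below. This is a minimal sketch, not any gRPC implementation: `LoadReport` is a hypothetical stand-in for the ORCA load report, and the set of supported scalar and map fields is an assumption made for the example. Lookups go through explicit accessor tables rather than reflection.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Optional


@dataclass
class LoadReport:
    """Hypothetical stand-in for an ORCA load report (not a real gRPC type)."""
    application_utilization: float = 0.0
    cpu_utilization: float = 0.0
    mem_utilization: float = 0.0
    utilization: Dict[str, float] = field(default_factory=dict)
    named_metrics: Dict[str, float] = field(default_factory=dict)


# Explicit accessors for the fields this sketch chooses to support.
SCALAR_GETTERS: Dict[str, Callable[[LoadReport], float]] = {
    "application_utilization": lambda r: r.application_utilization,
    "cpu_utilization": lambda r: r.cpu_utilization,
    "mem_utilization": lambda r: r.mem_utilization,
}
MAP_GETTERS: Dict[str, Callable[[LoadReport], Dict[str, float]]] = {
    "utilization": lambda r: r.utilization,
    "named_metrics": lambda r: r.named_metrics,
}


def resolve_metric(report: LoadReport, name: str) -> Optional[float]:
    """Return the metric value, or None if the name cannot be resolved."""
    map_name, dot, key = name.partition(".")  # split on the first dot only
    if dot:
        getter = MAP_GETTERS.get(map_name)
        return getter(report).get(key) if getter else None
    scalar = SCALAR_GETTERS.get(name)
    return scalar(report) if scalar else None


def get_utilization(report: LoadReport, configured_metrics: Iterable[str]) -> float:
    # 1. Prefer application_utilization when reported (0 means unset).
    if report.application_utilization > 0:
        return report.application_utilization
    # 2. Take the max over resolved, non-NaN custom metrics.
    resolved = (resolve_metric(report, n) for n in configured_metrics)
    values = [v for v in resolved if v is not None and not math.isnan(v)]
    if values:
        return max(values)
    # 3. Fall back to cpu_utilization.
    return report.cpu_utilization
```

For example, a report with `cpu_utilization=0.3`, `mem_utilization=0.7`, and `named_metrics={"foo": 0.5}` yields `0.7` when both `named_metrics.foo` and `mem_utilization` are configured, and `0.3` when no metric names are configured.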
104+
105+
#### Implementation Notes
106+
107+
Since `OrcaLoadReport` is often exposed as a language-specific proxy object rather than a raw Protobuf message (e.g., in Java and C++), implementations are **not** expected to use Protobuf reflection to look up arbitrary fields. Instead, implementations should manually handle:
108+
109+
- **Map Fields**: Lookups in ORCA map fields such as `named_metrics` (e.g., `named_metrics.foo`).
110+
- **Standard Fields**: Explicit lookups for known fields (e.g., `cpu_utilization`, `mem_utilization`, `application_utilization`).
111+
112+
Support for any new standard fields added to `OrcaLoadReport` in the future will require explicit code changes in the implementation. This behavior is consistent with the [current Envoy implementation](https://github.com/envoyproxy/envoy/blob/35749578db375f5fe8ac5dd293cb7c4efb689611/source/common/orca/orca_load_metrics.cc#L47-L73).
113+
114+
- **Map Fields**: For strings like `named_metrics.foo`, the implementation must verify `named_metrics` is a valid map field and look up the key `foo`. The string is split on the first dot: `foo.bar.baz` looks up key `bar.baz` in map `foo`.
115+
- **Scalar Fields**: For strings like `mem_utilization`, the implementation must verify that the field is one it supports and retrieve its double value.
116+
- **Normalization**: The WRR policy does not normalize the reported metrics. The application is responsible for reporting correctly normalized utilization values.
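
The split-on-first-dot rule for metric names can be illustrated with a short Python sketch. The helper name `parse_metric_name` is hypothetical and used only for this example.

```python
from typing import Optional, Tuple


def parse_metric_name(name: str) -> Tuple[Optional[str], str]:
    """Return (map_field, key) for "map.key" names, or (None, name) for scalar names."""
    map_field, dot, key = name.partition(".")  # partition splits on the first dot only
    if dot:
        return map_field, key
    return None, name
```

Under this rule, `foo.bar.baz` resolves to key `bar.baz` in map `foo`, while `mem_utilization` has no map component and is treated as a scalar field name.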
117+
118+
#### Validity and Edge Cases
119+
120+
- **NaN Values**: As shown above, `NaN` values in reports are explicitly ignored to prevent undefined behavior in weight calculations.
121+
- **Bounds Checks**: Before it is used to compute a weight, the selected `utilization` is subject to the standard validation logic from [gRFC A58][A58] (e.g., ensuring the value is positive). Because a proto3 scalar field cannot distinguish an unset value from 0, we treat 0 as an invalid (missing) value, consistent with [gRFC A58][A58].
122+
123+
The rest of the weight calculation formula (using QPS, EPS, and penalty) from [gRFC A58][A58] remains unchanged.
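
To show where the selected utilization ends up, here is a minimal Python sketch of the weight formula as described in gRFC A58, `weight = qps / (utilization + eps / qps * error_utilization_penalty)`. The function name is hypothetical, and returning `None` for an unusable report is a choice made for this sketch rather than behavior defined here.

```python
from typing import Optional


def compute_weight(qps: float, eps: float, utilization: float,
                   error_utilization_penalty: float) -> Optional[float]:
    """Sketch of the A58 weight formula; None means the report yields no usable weight."""
    # A58 requires positive QPS and utilization; 0 is treated as missing.
    if qps <= 0 or utilization <= 0:
        return None
    penalty = eps / qps * error_utilization_penalty
    return qps / (utilization + penalty)
```

With `qps=100`, `eps=10`, `utilization=0.4`, and a penalty multiplier of `1.0`, the denominator is `0.4 + 0.1`, giving a weight of about `200`. Only the `utilization` input changes under this proposal.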
124+
125+
### xDS Integration
126+
127+
We will support the `metric_names_for_computing_utilization` field in the xDS [`ClientSideWeightedRoundRobin`][ClientSideWeightedRoundRobinProto] policy.
128+
129+
When converting the xDS configuration to the gRPC Service Config `WeightedRoundRobinLbConfig`, the `metric_names_for_computing_utilization` field should be copied over directly.
130+
131+
### Temporary environment variable protection
132+
133+
The features described in this proposal will be guarded by the environment variable `GRPC_EXPERIMENTAL_WRR_CUSTOM_METRICS`, which defaults to `false`.
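
One way to picture the xDS conversion together with the environment variable guard is the Python sketch below. It makes several assumptions not stated in this proposal: configs are modeled as plain dicts, only the value `true` (case-insensitive) enables the feature, and the new field is dropped when the feature is disabled. The function names are hypothetical.

```python
import os
from typing import Mapping

ENV_VAR = "GRPC_EXPERIMENTAL_WRR_CUSTOM_METRICS"
FIELD = "metric_names_for_computing_utilization"


def custom_metrics_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    # Assumption: only "true" (case-insensitive) enables the feature; unset means false.
    return environ.get(ENV_VAR, "false").strip().lower() == "true"


def convert_xds_wrr_config(xds_policy: dict,
                           environ: Mapping[str, str] = os.environ) -> dict:
    """Copy an xDS WRR policy (modeled as a dict) into a WRR LB config dict."""
    # Existing fields pass through unchanged in this sketch.
    lb_config = {k: v for k, v in xds_policy.items() if k != FIELD}
    # The new field is copied over directly, but only when the guard is enabled.
    if custom_metrics_enabled(environ) and FIELD in xds_policy:
        lb_config[FIELD] = list(xds_policy[FIELD])
    return lb_config
```

Passing the environment as a parameter keeps the sketch easy to exercise without mutating process state.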
134+
135+
## Rationale
136+
137+
- **Consistency with Envoy**: This design mirrors the corresponding feature in Envoy, ensuring consistent behavior for xDS-controlled clients.
138+
- **Flexibility**: Allows users to define load balancing weights based on the actual bottleneck resource of their application (e.g., memory-bound services).
139+
- **Backward Compatibility**: The default behavior (using `application_utilization` or `cpu_utilization`) remains unchanged if the new field is not configured.
140+
141+
## Implementation
142+
143+
This will be implemented in C++, Java, and Go.
144+
145+
### C++
146+
147+
- **xDS Integration**: Update [`ClientSideWeightedRoundRobinLbPolicyConfigFactory`](https://github.com/grpc/grpc/blob/f7f13023412c1a589af7558eb0b9f8f664a76431/src/core/xds/grpc/xds_lb_policy_registry.cc#L68) to copy over the new field from the xDS configuration.
148+
- **Config**: Update [`WeightedRoundRobinLbConfig`](https://github.com/grpc/grpc/blob/f7f13023412c1a589af7558eb0b9f8f664a76431/src/core/load_balancing/weighted_round_robin/weighted_round_robin.cc#L133) to include `metric_names_for_computing_utilization`.
149+
- **Weight Calculation Logic**: Update callers of [`EndpointWeight::MaybeUpdateWeight`](https://github.com/grpc/grpc/blob/f7f13023412c1a589af7558eb0b9f8f664a76431/src/core/load_balancing/weighted_round_robin/weighted_round_robin.cc#L212) to implement the new utilization selection logic.
150+
151+
### Java
152+
153+
- **xDS Integration**: Update [`convertWeightedRoundRobinConfig`](https://github.com/grpc/grpc-java/blob/a9f73f4c0aa5617aa2b6ae6ac805693915899b6a/xds/src/main/java/io/grpc/xds/LoadBalancerConfigFactory.java#L286) to copy over the field from the xDS configuration.
154+
- **Config**: Update [`WeightedRoundRobinLoadBalancerConfig`](https://github.com/grpc/grpc-java/blob/a9f73f4c0aa5617aa2b6ae6ac805693915899b6a/xds/src/main/java/io/grpc/xds/WeightedRoundRobinLoadBalancer.java#L716) to include `metric_names_for_computing_utilization`.
155+
- **Weight Calculation Logic**: Update [`OrcaPerRequestUtil`](https://github.com/grpc/grpc-java/blob/a9f73f4c0aa5617aa2b6ae6ac805693915899b6a/xds/src/main/java/io/grpc/xds/WeightedRoundRobinLoadBalancer.java#L364) to implement the new utilization selection logic.
156+
157+
### Go
158+
159+
- **xDS Integration**: Update the configuration struct and conversion logic in [`converter.go`](https://github.com/grpc/grpc-go/blob/c05cfb3693bf18086810e671a6f9c05f296e0183/internal/xds/xdsclient/xdslbregistry/converter/converter.go#L241) to copy over the new field from the xDS configuration.
160+
- **Config**: Update the configuration struct in [`config.go`](https://github.com/grpc/grpc-go/blob/7985bb44d26ecbeb8950996d028e38a0de08070b/balancer/weightedroundrobin/config.go#L26) to include `metric_names_for_computing_utilization`.
161+
- **Weight Calculation Logic**: Update the weight update function in [`balancer.go`](https://github.com/grpc/grpc-go/blob/7985bb44d26ecbeb8950996d028e38a0de08070b/balancer/weightedroundrobin/balancer.go#L546) to implement the new utilization selection logic.
162+
163+
[A58]: A58-client-side-weighted-round-robin-lb-policy.md
164+
[A51]: A51-custom-backend-metrics.md
165+
[ClientSideWeightedRoundRobinProto]: https://github.com/envoyproxy/envoy/blob/7242d5ad170523d7936849e596d261e3502c3886/api/envoy/extensions/load_balancing_policies/client_side_weighted_round_robin/v3/client_side_weighted_round_robin.proto#L86
166+
[WeightedRoundRobinLbConfigProto]: https://github.com/grpc/grpc-proto/blob/ec99424f3b7dae9db194f848b4cea52ecfae07af/grpc/service_config/service_config.proto#L205
167+
[OrcaLoadReportProto]: https://github.com/cncf/xds/blob/0feb69152e9f7e8a45c8a3cfe8c7dd93bca3512f/xds/data/orca/v3/orca_load_report.proto#L15
