@AgraVator
Contributor

@AgraVator AgraVator commented Oct 1, 2025

Try to fix b/448552373 by changing the second check for in-flight RPCs from expected ± threshold to expected - QPS.

test run

@AgraVator AgraVator requested a review from a team as a code owner October 1, 2025 08:17
@eshitachandwani
Member

Nit: Since this repo is open source, we shouldn’t include internal bug links. Could you replace it with just the correct bug number? I think this one might be a duplicate bug rather than the intended metabug.

@pawbhard
Contributor

pawbhard commented Oct 1, 2025

Nit: Since this repo is open source, we shouldn’t include internal bug links. Could you replace it with just the correct bug number? I think this one might be a duplicate bug rather than the intended metabug.

+1 use b/

eshitachandwani previously approved these changes Oct 7, 2025
@arjan-bal
Contributor

Please wait before submitting this. I'm not 100% sure this is the right fix. I'm going through the test and the circuit breaking docs to see if the test needs to be changed instead.

@arjan-bal
Contributor

arjan-bal commented Oct 7, 2025

From my understanding of the test, the client is configured to send 100 QPS to the servers.

default_test_server, rpc="UnaryCall,EmptyCall", qps=_QPS

The server is configured to block on receiving the calls and a deadline of 20 seconds is set on the RPCs. This means that the server should block for 20 seconds on an RPC before failing it with a DEADLINE_EXCEEDED status.

with self.subTest("11_configure_client_with_keep_open"):
    test_client.update_config.configure(
        rpc_types=grpc_testing.RPC_TYPES_BOTH_CALLS,
        metadata={
            (
                grpc_testing.RPC_TYPE_UNARY_CALL,
                "rpc-behavior",
                "keep-open",
            ),
            (
                grpc_testing.RPC_TYPE_EMPTY_CALL,
                "rpc-behavior",
                "keep-open",
            ),
        },
        timeout_sec=20,
    )

Initially, before the client has received the circuit breaking config, it makes a large number of concurrent requests. An example from the test logs:

I1006 09:10:54.669732 125114108895232 xds_k8s_testcase.py:890] [psm-grpc-client-64545cd99d-thnn8] << Received LoadBalancerAccumulatedStatsResponse:
- method: UNARY_CALL
  rpcs_started: 2464
  result:
    (0, OK): 468
    (4, DEADLINE_EXCEEDED): 4
- method: EMPTY_CALL
  rpcs_started: 2464
  result:
    (0, OK): 468
    (4, DEADLINE_EXCEEDED): 4

I1006 09:10:54.669906 125114108895232 xds_k8s_testcase.py:899] [psm-grpc-client-64545cd99d-thnn8] << UNARY_CALL RPCs in flight: 1992, expected 500 ±5%
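
The in-flight figure in the last log line is consistent with "RPCs started minus RPCs that already finished" for the UNARY_CALL stats shown above. A minimal sketch of that arithmetic, assuming this is how the count is derived; the helper name `rpcs_in_flight` is hypothetical and not taken from the test framework:

```python
def rpcs_in_flight(rpcs_started, results):
    """In-flight RPCs = started RPCs minus RPCs that already have a result."""
    return rpcs_started - sum(results.values())


# UNARY_CALL numbers copied from the log excerpt above.
unary_results = {"OK": 468, "DEADLINE_EXCEEDED": 4}
unary_in_flight = rpcs_in_flight(2464, unary_results)
print(unary_in_flight)  # 1992, matching "UNARY_CALL RPCs in flight: 1992"
```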

After the client receives the circuit breaking config, it will ensure there are at most 500 UnaryCall and 1000 EmptyCall requests in-flight. In the logs, we can see the number of in-flight calls reducing with time.

I0929 20:58:33.484530 128673922162688 xds_k8s_testcase.py:899] [psm-grpc-client-7746d59849-tm852] << EMPTY_CALL RPCs in flight: 1630, expected 1000 ±5%
I0929 20:58:43.495065 128673922162688 grpc.py:79] [psm-grpc-client-7746d59849-tm852:8079] >> RPC LoadBalancerStatsService.GetClientAccumulatedStats(request=LoadBalancerAccumulatedStatsRequest({}), timeout=600, wait_for_ready=True)
I0929 20:58:43.537494 128673922162688 xds_k8s_testcase.py:890] [psm-grpc-client-7746d59849-tm852] << Received LoadBalancerAccumulatedStatsResponse:
- method: EMPTY_CALL
  rpcs_started: 10630
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 2194
    (14, UNAVAILABLE): 7921
- method: UNARY_CALL
  rpcs_started: 11098
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 500
    (14, UNAVAILABLE): 10083

I0929 20:58:43.537702 128673922162688 xds_k8s_testcase.py:899] [psm-grpc-client-7746d59849-tm852] << EMPTY_CALL RPCs in flight: 968, expected 1000 ±5%
I0929 20:58:43.537857 128673922162688 xds_k8s_testcase.py:868] Will check again in 5 seconds to verify that RPC count is steady
I0929 20:58:48.543253 128673922162688 grpc.py:79] [psm-grpc-client-7746d59849-tm852:8079] >> RPC LoadBalancerStatsService.GetClientAccumulatedStats(request=LoadBalancerAccumulatedStatsRequest({}), timeout=600, wait_for_ready=True)
I0929 20:58:48.584075 128673922162688 xds_k8s_testcase.py:890] [psm-grpc-client-7746d59849-tm852] << Received LoadBalancerAccumulatedStatsResponse:
- method: EMPTY_CALL
  rpcs_started: 11039
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 2603
    (14, UNAVAILABLE): 7921
- method: UNARY_CALL
  rpcs_started: 11454
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 530
    (14, UNAVAILABLE): 10409

The test then checks once that the in-flight requests are within 500±5% and 1000±5% respectively.

The problem

At t=20 seconds, the 100 RPCs started at t=0 will fail as their deadlines expire. At the same time, the client will start 100 more RPCs that may succeed. If the test driver queries the in-flight RPCs at this time, it will see an RPC count between 900 and 1000 for EmptyCall and between 400 and 500 for UnaryCall. Assuming the rate at which RPCs start equals the rate at which they time out, we'll see 950 EmptyCalls and 450 UnaryCalls on average.

The same situation will happen at t=21, 22...30 and t=20+30, 21+30...30+30.

TL;DR the steady state in the test is cyclic.
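
The dip described above can be reproduced with a toy simulation. This is only a sketch under simplifying assumptions that are not in the test itself: time advances in whole seconds, the circuit breaker is active from t=0, every admitted RPC stays open until its deadline, and expirations in a given second are processed before new RPCs start. The function name and parameters are hypothetical.

```python
from collections import deque


def simulate_in_flight(qps, max_requests, deadline_sec, duration_sec):
    """Return (min, max) in-flight counts seen after the cap is first reached."""
    open_batches = deque()  # (expiry_time, count) per batch of blocked RPCs
    in_flight = 0
    reached_cap = False
    samples = []
    for t in range(duration_sec):
        # RPCs whose deadline has passed fail with DEADLINE_EXCEEDED.
        while open_batches and open_batches[0][0] <= t:
            in_flight -= open_batches.popleft()[1]
        if reached_cap:
            samples.append(in_flight)  # a query landing before the new starts
        # The client attempts qps new RPCs; the circuit breaker admits only
        # what fits under max_requests, and the rest fail immediately.
        admitted = min(qps, max_requests - in_flight)
        if admitted > 0:
            open_batches.append((t + deadline_sec, admitted))
            in_flight += admitted
        if in_flight == max_requests:
            reached_cap = True
        if reached_cap:
            samples.append(in_flight)  # a query landing after the new starts
    return min(samples), max(samples)


unary_bounds = simulate_in_flight(qps=100, max_requests=500, deadline_sec=20, duration_sec=120)
empty_bounds = simulate_in_flight(qps=100, max_requests=1000, deadline_sec=20, duration_sec=120)
print(unary_bounds)  # (400, 500)
print(empty_bounds)  # (900, 1000)
```

Under these assumptions the count never exceeds the limit but can momentarily drop by up to one second's worth of RPCs, depending on when the query lands.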

@arjan-bal
Contributor

arjan-bal commented Oct 7, 2025

One way to fix the test would be to change the two assertions as follows:

  1. In the first check, verify RPCs are within [threshold - 5%, threshold]; even 1% may work. Note that there is no +5% because circuit breaking must not allow more RPCs than the threshold.
  2. In the second check, verify RPCs are within [threshold - QPS, threshold].
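
The two ranges proposed above could be computed roughly as follows. This is a hedged sketch, not the code from the PR: the function names, the `allowed_shortfall_percent` parameter, and the `in_range` helper are all hypothetical.

```python
def first_check_range(num_rpcs, allowed_shortfall_percent=5):
    """Range for the first check: [threshold - N%, threshold], no upper slack."""
    min_rpcs = int(num_rpcs * (1 - allowed_shortfall_percent / 100))
    return min_rpcs, num_rpcs


def second_check_range(num_rpcs, qps):
    """Range for the second check: [threshold - QPS, threshold]."""
    return max(0, num_rpcs - qps), num_rpcs


def in_range(observed, bounds):
    low, high = bounds
    return low <= observed <= high


# 968 in-flight EmptyCall RPCs against a limit of 1000 at 100 QPS.
print(in_range(968, second_check_range(1000, 100)))  # True
# Anything above the limit fails, because there is no upper tolerance.
print(in_range(501, first_check_range(500)))  # False
```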

@eshitachandwani eshitachandwani dismissed their stale review January 8, 2026 18:16

Need to change the approach

@AgraVator AgraVator force-pushed the increase-error-threshold-circuit-breaking-test branch from 31d84d2 to 0e6a6ed January 12, 2026 05:52
@AgraVator AgraVator changed the title from "increase error threshold of diff in circuit_breaking_test" to "Alter the second check for inflight RPCs from expected +/- threshold to expected - QPS in circuit_breaking_test" Jan 14, 2026
Contributor

@arjan-bal arjan-bal left a comment

The changes look good, but please add more comments as the behaviour is non-trivial.

self._checkRpcsInFlight(
    test_client, rpc_type, num_rpcs, threshold_percent
)
# In the second check, verify RPCs are within [threshold - QPS, threshold].
Contributor

It will be better to describe the "why" instead of the "what" in the comment here. It may not be immediately apparent to readers. It will also help explain why the qps argument is required.

f"[{max(0, num_rpcs - qps)}, {num_rpcs}]"
),
)
first_min = int(num_rpcs * (1 - threshold_percent / 100))
Contributor

It would be helpful to add a comment explaining why we don't define the max value as int(num_rpcs * (1 + threshold_percent / 100)). This is mainly because circuit_breaking (the only consumer) requires that the RPC count strictly not exceed the provided QPS.

@AgraVator AgraVator requested a review from arjan-bal January 14, 2026 08:46
@AgraVator AgraVator requested a review from sergiitk January 19, 2026 08:38
Member

@sergiitk sergiitk left a comment

Note: we need to update PR title and description to match the changed logic

  f"Timeout waiting for test client {test_client.hostname} to"
- f"report {num_rpcs} pending calls ±{threshold_percent}%"
+ f"report {num_rpcs} pending calls in range "
+ f"[{int(num_rpcs * (1 - steady_state_min_threshold_percent / 100))}, "
Member

Instead, calculate first_min above the retryer and use the variable. Not only is it easier to read, it'll also make it much clearer that this is what this retryer is for.

Suggested change
- f"[{int(num_rpcs * (1 - steady_state_min_threshold_percent / 100))}, "
+ f"[{first_min}, "

Comment on lines 887 to 888
steady_state_min_threshold_percent: int = 5,
min_tolerance_delta_after_steady_state: int = 100,
Member

Three things:

  1. steady_state_min_threshold_percent and min_tolerance_delta_after_steady_state: note the order of the words. The first one follows "stage-tolerance_type-unit", the second one is "tolerance_type-stage" (and no unit). Let's make it consistently "stage-tolerance_type-unit".
  2. min_tolerance_delta_after_steady_state: we need to include the unit name here. For example, steady_state_min_threshold_percent makes it clear the unit is "percent".
  3. steady_state_min_threshold_percent: I don't think the word "threshold" applies here now that we don't use it to define the "max" end of the range. Gemini: "threshold range is defining the acceptable minimum and maximum points for something to function, trigger, or be considered valid, like a temperature band for thermoregulation or a specific voltage for a transistor."

I'm thinking something like

Suggested change
- steady_state_min_threshold_percent: int = 5,
- min_tolerance_delta_after_steady_state: int = 100,
+ steady_state_allowed_shortfall_percent: int = 5,
+ after_steady_state_allowed_shortfall_count: int = 100,

test_client,
rpc_type=grpc_testing.RPC_TYPE_EMPTY_CALL,
num_rpcs=_INITIAL_EMPTY_MAX_REQUESTS,
min_tolerance_delta_after_steady_state=_QPS,
Member

Since we don't want to repeat the comment multiple times, let's make a local variable after_steady_state_shortfall above self.subTest("12_client_reaches_target_steady_state"), assign it to _QPS, move the comment above it, and reuse the variable in all subtests.
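
A rough illustration of that shape, kept self-contained so it runs outside the test suite. `allowed_range` is a hypothetical stand-in for the real assertion helper, `_INITIAL_UNARY_MAX_REQUESTS` is an assumed constant name, and the values 100, 500, and 1000 are the QPS and limits mentioned earlier in this conversation:

```python
_QPS = 100
_INITIAL_UNARY_MAX_REQUESTS = 500  # assumed name, mirrors the EMPTY constant
_INITIAL_EMPTY_MAX_REQUESTS = 1000


def allowed_range(num_rpcs, after_steady_state_allowed_shortfall_count):
    """Hypothetical helper: bounds for the check after steady state."""
    low = max(0, num_rpcs - after_steady_state_allowed_shortfall_count)
    return low, num_rpcs


# RPCs time out and restart in per-second batches, so a query can land just
# after a batch expired and before its replacement started. Allow the count
# to fall short of the limit by up to one second's worth of RPCs.
after_steady_state_shortfall = _QPS

unary_range = allowed_range(_INITIAL_UNARY_MAX_REQUESTS, after_steady_state_shortfall)
empty_range = allowed_range(_INITIAL_EMPTY_MAX_REQUESTS, after_steady_state_shortfall)
print(unary_range, empty_range)
```

Keeping the explanation next to a single variable means every subtest that passes it along inherits the same rationale.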

Comment on lines 201 to 202
# circuit_breaking_test requires that the RPC count strictly
# not exceed the provided tolerance.
Member

I think something got lost in the explanation when the comment was moved.

This should cover "In the second check" comment from this commit 9d1c1b6

@AgraVator AgraVator changed the title from "Alter the second check for inflight RPCs from expected +/- threshold to expected - QPS in circuit_breaking_test" to "Alter the second check for inflight RPCs from expected +/- threshold to [expected - QPS, expected] in circuit_breaking_test" Jan 23, 2026
Member

@sergiitk sergiitk left a comment

Please make sure to re-run the tests.

Comment on lines +905 to +907
first_min = int(
    num_rpcs * (1 - steady_state_allowed_shortfall_percent / 100)
)
Member

no need to repeat this, it's already defined on line 893.

Suggested change
- first_min = int(
-     num_rpcs * (1 - steady_state_allowed_shortfall_percent / 100)
- )
