Alter the second check for inflight RPCs from expected +/- threshold to [expected - QPS, expected] in circuit_breaking_test #207

AgraVator · 2025-10-01T08:17:18Z

try to fix b/448552373 by altering the second check for inflight RPCs from expected +/- threshold to expected - QPS

eshitachandwani · 2025-10-01T14:21:42Z

Nit: Since this repo is open source, we shouldn’t include internal bug links. Could you replace it with just the correct bug number? I think this one might be a duplicate bug rather than the intended metabug.

pawbhard · 2025-10-01T20:47:51Z

+1 use b/

arjan-bal · 2025-10-07T07:01:52Z

Please wait before submitting this. I'm not 100% sure this is the right fix. I'm going through the test and the circuit breaking docs to see if the test needs to be changed instead.

arjan-bal · 2025-10-07T08:25:41Z

From my understanding of the test, the client is configured to send 100 QPS to the servers.

psm-interop/tests/circuit_breaking_test.py

Line 167 in c3d3bf5

default_test_server, rpc="UnaryCall,EmptyCall", qps=_QPS

The server is configured to block on receiving the calls and a deadline of 20 seconds is set on the RPCs. This means that the server should block for 20 seconds on an RPC before failing it with a DEADLINE_EXCEEDED status.

psm-interop/tests/circuit_breaking_test.py

Lines 178 to 194 in c3d3bf5

    
    with self.subTest("11_configure_client_with_keep_open"):
        test_client.update_config.configure(
            rpc_types=grpc_testing.RPC_TYPES_BOTH_CALLS,
            metadata={
                (
                    grpc_testing.RPC_TYPE_UNARY_CALL,
                    "rpc-behavior",
                    "keep-open",
                ),
                (
                    grpc_testing.RPC_TYPE_EMPTY_CALL,
                    "rpc-behavior",
                    "keep-open",
                ),
            },
            timeout_sec=20,
        )

Initially, before the client has received the circuit breaking config, it makes a large number of concurrent requests. An example from the test logs:

I1006 09:10:54.669732 125114108895232 xds_k8s_testcase.py:890] [psm-grpc-client-64545cd99d-thnn8] << Received LoadBalancerAccumulatedStatsResponse:
- method: UNARY_CALL
  rpcs_started: 2464
  result:
    (0, OK): 468
    (4, DEADLINE_EXCEEDED): 4
- method: EMPTY_CALL
  rpcs_started: 2464
  result:
    (0, OK): 468
    (4, DEADLINE_EXCEEDED): 4

I1006 09:10:54.669906 125114108895232 xds_k8s_testcase.py:899] [psm-grpc-client-64545cd99d-thnn8] << UNARY_CALL RPCs in flight: 1992, expected 500 ±5%

After the client receives the circuit breaking config, it will ensure there are at most 500 UnaryCall and 1000 EmptyCall requests in-flight. In the logs, we can see the number of in-flight calls reducing with time.

I0929 20:58:33.484530 128673922162688 xds_k8s_testcase.py:899] [psm-grpc-client-7746d59849-tm852] << EMPTY_CALL RPCs in flight: 1630, expected 1000 ±5%
I0929 20:58:43.495065 128673922162688 grpc.py:79] [psm-grpc-client-7746d59849-tm852:8079] >> RPC LoadBalancerStatsService.GetClientAccumulatedStats(request=LoadBalancerAccumulatedStatsRequest({}), timeout=600, wait_for_ready=True)
I0929 20:58:43.537494 128673922162688 xds_k8s_testcase.py:890] [psm-grpc-client-7746d59849-tm852] << Received LoadBalancerAccumulatedStatsResponse:
- method: EMPTY_CALL
  rpcs_started: 10630
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 2194
    (14, UNAVAILABLE): 7921
- method: UNARY_CALL
  rpcs_started: 11098
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 500
    (14, UNAVAILABLE): 10083

I0929 20:58:43.537702 128673922162688 xds_k8s_testcase.py:899] [psm-grpc-client-7746d59849-tm852] << EMPTY_CALL RPCs in flight: 968, expected 1000 ±5%
I0929 20:58:43.537857 128673922162688 xds_k8s_testcase.py:868] Will check again in 5 seconds to verify that RPC count is steady
I0929 20:58:48.543253 128673922162688 grpc.py:79] [psm-grpc-client-7746d59849-tm852:8079] >> RPC LoadBalancerStatsService.GetClientAccumulatedStats(request=LoadBalancerAccumulatedStatsRequest({}), timeout=600, wait_for_ready=True)
I0929 20:58:48.584075 128673922162688 xds_k8s_testcase.py:890] [psm-grpc-client-7746d59849-tm852] << Received LoadBalancerAccumulatedStatsResponse:
- method: EMPTY_CALL
  rpcs_started: 11039
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 2603
    (14, UNAVAILABLE): 7921
- method: UNARY_CALL
  rpcs_started: 11454
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 530
    (14, UNAVAILABLE): 10409

The test then checks once that the in-flight requests are within 500±5% and 1000±5% respectively.

The problem

At t=20 seconds, the 100 RPCs started at t=0 will fail as their deadlines expire. At the same time, the client will start 100 more RPCs that may succeed. If the test driver were to query the in-flight RPCs at this time, it would see an RPC count between 900 and 1000 for EmptyCall and between 400 and 500 for UnaryCall. Assuming the rate of RPCs starting is equal to the rate of RPCs timing out, we'll see 950 EmptyCalls and 450 UnaryCalls on average.

The same situation will happen at t=21, 22...30 and t=20+30, 21+30...30+30.

TL;DR the steady state in the test is cyclic.
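The cycle described above can be reproduced with a toy model. The sketch below is not the test framework's code. It assumes that RPCs start in one batch of `qps` per second, that each admitted RPC stays open until its deadline, and that circuit breaking simply caps concurrent RPCs at `max_requests`. The function name and its parameters are made up for illustration.

```python
from collections import deque


def simulate_in_flight_range(max_requests, qps, deadline_sec, duration_sec):
    """Return (min, max) in-flight RPCs seen after the cap is first reached.

    Toy model (an assumption, not the real client): once per second the
    client tries to start `qps` RPCs, the circuit breaker admits only as many
    as fit under `max_requests`, and every admitted RPC stays open until it
    times out exactly `deadline_sec` seconds later.
    """
    batches = deque()  # (start_time, admitted_count), oldest first
    in_flight = 0
    cap_reached = False
    samples = []

    for now in range(duration_sec):
        # RPCs whose deadline has passed fail with DEADLINE_EXCEEDED.
        while batches and batches[0][0] + deadline_sec <= now:
            in_flight -= batches.popleft()[1]
        if cap_reached:
            samples.append(in_flight)  # a stats query landing right after expiry

        # The circuit breaker rejects whatever does not fit under the cap.
        admitted = min(qps, max_requests - in_flight)
        if admitted > 0:
            batches.append((now, admitted))
            in_flight += admitted
        if in_flight == max_requests:
            cap_reached = True
        if cap_reached:
            samples.append(in_flight)  # a stats query landing after the refill

    return min(samples), max(samples)


print(simulate_in_flight_range(1000, 100, 20, 60))  # EmptyCall limit -> (900, 1000)
print(simulate_in_flight_range(500, 100, 20, 60))   # UnaryCall limit -> (400, 500)
```

Under these assumptions the sampled count never exceeds the limit and dips by at most one second's worth of RPCs, which agrees with the 900-1000 and 400-500 ranges estimated above.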

arjan-bal · 2025-10-07T09:02:18Z

One way to fix the test would be to change the two assertions as follows:

In the first check, verify that the in-flight RPCs are within [threshold - 5%, threshold]; even 1% may work. Note that there is no +5%, because circuit breaking must not allow more RPCs than the threshold.
In the second check, verify that the in-flight RPCs are within [threshold - QPS, threshold].
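The two proposed assertions can be sketched as simple range checks. This is only an illustration: the helper names below are hypothetical and are not the functions in framework/xds_k8s_testcase.py. The formulas follow the proposal above, and clamping the lower bound of the second range at zero is an extra assumption.

```python
def first_check_range(threshold, shortfall_percent=5):
    """Range for the first check: [threshold - N%, threshold], no upper slack."""
    return int(threshold * (1 - shortfall_percent / 100)), threshold


def second_check_range(threshold, qps):
    """Range for the second check: [threshold - QPS, threshold]."""
    return max(0, threshold - qps), threshold


def in_range(in_flight, bounds):
    """True if the in-flight count lies within the inclusive bounds."""
    low, high = bounds
    return low <= in_flight <= high


# Values from the discussion: limits of 500 and 1000 RPCs, 100 QPS.
print(first_check_range(1000))            # (950, 1000)
print(second_check_range(1000, qps=100))  # (900, 1000)
print(in_range(968, first_check_range(1000)))             # True
print(in_range(1001, second_check_range(1000, qps=100)))  # False: above the limit
```

Both ranges share the same upper bound, so a count above the circuit breaking limit fails either check regardless of how much shortfall is tolerated.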

Need to change the approach

arjan-bal

The changes look good, but please add more comments as the behaviour is non-trivial.

arjan-bal · 2026-01-14T08:13:04Z

framework/xds_k8s_testcase.py

-        self._checkRpcsInFlight(
-            test_client, rpc_type, num_rpcs, threshold_percent
-        )
+        # In the second check, verify RPCs are within [threshold - QPS, threshold].


It will be better to describe the "why" instead of the "what" in the comment here. It may not be immediately apparent to readers. It will also help explain why the qps argument is required.

arjan-bal · 2026-01-14T08:16:38Z

framework/xds_k8s_testcase.py

+                f"[{max(0, num_rpcs - qps)}, {num_rpcs}]"
            ),
        )
+        first_min = int(num_rpcs * (1 - threshold_percent / 100))


It would be helpful to add a comment explaining why we don't define the max value as int(num_rpcs * (1 + threshold_percent / 100)). This is mainly because circuit_breaking (the only consumer) requires that the RPC count strictly not exceed the provided QPS.

framework/xds_k8s_testcase.py

sergiitk

Note: we need to update PR title and description to match the changed logic

sergiitk · 2026-01-22T17:00:58Z

framework/xds_k8s_testcase.py

                f"Timeout waiting for test client {test_client.hostname} to"
-                f"report {num_rpcs} pending calls ±{threshold_percent}%"
+                f"report {num_rpcs} pending calls in range "
+                f"[{int(num_rpcs * (1 - steady_state_min_threshold_percent / 100))}, "


Instead, calculate first_min above the retryer, and use the variable. Not only is it easier to read, it'll make it much clearer that this is what the retryer is for.

Suggested change

-                f"[{int(num_rpcs * (1 - steady_state_min_threshold_percent / 100))}, "
+                f"[{first_min}, "

sergiitk · 2026-01-22T17:58:48Z

framework/xds_k8s_testcase.py

+        steady_state_min_threshold_percent: int = 5,
+        min_tolerance_delta_after_steady_state: int = 100,


Two things:

steady_state_min_threshold_percent and min_tolerance_delta_after_steady_state: note the order of the words. The first one follows "stage-tolerance_type-unit", the second one is "tolerance_type-stage" (and has no unit). Let's make them consistent: "stage-tolerance_type-unit".

min_tolerance_delta_after_steady_state: we need to include the unit name here. E.g., steady_state_min_threshold_percent makes it clear the unit is "percent".

steady_state_min_threshold_percent: I don't think the word "threshold" applies here now that we don't use it to define the "max" end of the range. Gemini: "threshold range is defining the acceptable minimum and maximum points for something to function, trigger, or be considered valid, like a temperature band for thermoregulation or a specific voltage for a transistor."

I'm thinking something like

Suggested change

-        steady_state_min_threshold_percent: int = 5,
-        min_tolerance_delta_after_steady_state: int = 100,
+        steady_state_allowed_shortfall_percent: int = 5,
+        after_steady_state_allowed_shortfall_count: int = 100,

sergiitk · 2026-01-22T18:03:10Z

tests/circuit_breaking_test.py

                test_client,
                rpc_type=grpc_testing.RPC_TYPE_EMPTY_CALL,
                num_rpcs=_INITIAL_EMPTY_MAX_REQUESTS,
+                min_tolerance_delta_after_steady_state=_QPS,


Since we don't want to repeat the comment multiple times, let's make a local variable after_steady_state_shortfall above self.subTest("12_client_reaches_target_steady_state"), assign it to _QPS, move the comment above it, and reuse the variable in all subtests.

sergiitk · 2026-01-22T18:06:04Z

tests/circuit_breaking_test.py

+                # circuit_breaking_test requires that the RPC count strictly
+                # not exceed the provided tolerance.


I think something got lost in the explanation when the comment was moved.

This should cover the "In the second check" comment from commit 9d1c1b6.

sergiitk

Please make sure to re-run the tests.

sergiitk · 2026-01-23T19:00:05Z

framework/xds_k8s_testcase.py

+        first_min = int(
+            num_rpcs * (1 - steady_state_allowed_shortfall_percent / 100)
+        )


no need to repeat this, it's already defined on line 893.

Suggested change

-        first_min = int(
-            num_rpcs * (1 - steady_state_allowed_shortfall_percent / 100)
-        )

AgraVator requested a review from a team as a code owner October 1, 2025 08:17

AgraVator requested review from arjan-bal, pawbhard and sergiitk October 1, 2025 08:17

pawbhard approved these changes Oct 1, 2025

eshitachandwani previously approved these changes Oct 7, 2025

AgraVator added 2 commits January 12, 2026 10:54

increase error threshold of diff in circuit_breaking_test

ed33df0

change logic to test for in flight RPCs b/w x - QPS, QPS when stable from x - threshold, x

0e6a6ed

AgraVator force-pushed the increase-error-threshold-circuit-breaking-test branch from 31d84d2 to 0e6a6ed Compare January 12, 2026 05:52

AgraVator and others added 3 commits January 12, 2026 11:28

revert the threshold increase

dfbb270

fix formatting

363630a

Merge branch 'main' into increase-error-threshold-circuit-breaking-test

35cd5ef

AgraVator changed the title ~~increase error threshold of diff in circuit_breaking_test~~ Alter the second check for inflight RPCs from expected +/- threshold to expected - QPS in circuit_breaking_test Jan 14, 2026

arjan-bal approved these changes Jan 14, 2026

enchance comments

9d1c1b6

AgraVator requested a review from arjan-bal January 14, 2026 08:46

format comments

6d5e64e

AgraVator requested a review from murgatroid99 January 14, 2026 16:22

murgatroid99 approved these changes Jan 14, 2026

sergiitk requested changes Jan 17, 2026

AgraVator and others added 2 commits January 19, 2026 14:06

suggested changes

cb19c77

Merge branch 'main' into increase-error-threshold-circuit-breaking-test

3033347

AgraVator requested a review from sergiitk January 19, 2026 08:38

sergiitk requested changes Jan 22, 2026

suggested changes

95f479a

AgraVator changed the title ~~Alter the second check for inflight RPCs from expected +/- threshold to expected - QPS in circuit_breaking_test~~ Alter the second check for inflight RPCs from expected +/- threshold to [expected - QPS, expected] in circuit_breaking_test Jan 23, 2026

sergiitk approved these changes Jan 23, 2026
