-
Notifications
You must be signed in to change notification settings - Fork 33
Alter the second check for inflight RPCs from expected +/- threshold to [expected - QPS, expected] in circuit_breaking_test #207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Nit: Since this repo is open source, we shouldn’t include internal bug links. Could you replace it with just the correct bug number? I think this one might be a duplicate bug rather than the intended metabug. |
+1 use b/ |
|
Please wait before submitting this. I'm not 100% sure this is the right fix. I'm going the test and circuit breaking docs to see if the test needs be changed instead. |
|
From my understanding of the test, the client is configured to send 100 QPS to the servers. psm-interop/tests/circuit_breaking_test.py Line 167 in c3d3bf5
The server is configured to block on receiving the calls and a deadline of 20 seconds is set on the RPCs. This means that the server should block for 20 seconds on an RPC before failing it with a DEADLINE_EXCEEDED status. psm-interop/tests/circuit_breaking_test.py Lines 178 to 194 in c3d3bf5
Initially, when the circuit breaking config is not received by the client, it makes a large number of concurrent requests, example from the test logs: After the client receives the circuit breaking config, it will ensure there are at most 500 UnaryCall and 1000 EmptyCall requests in-flight. In the logs, we can see the number of in-flight calls reducing with time. The test then checks once that the in-flight requests are within 500±5% and 1000±5% respectively. The problemAt t=20 seconds, the 100 RPCs started at t=0 will fail as their deadlines will expire. At the same time, the client will start 100 more RPCs that may succeed. If the test driver were to query the in-flight RPCs at this time, it will see the RPC count b/w 900-1000 EmptyCall for and 400-500 for UnaryCall. Assuming the rate of RPC starting is equal to rate of RPCs timing out, we'll see 950 EmptyCalls and 450 UnaryCalls on average. The same situation will happen at t=21, 22...30 and t=20+30, 21+30...30+30. TL;DR the steady state in the test is cyclic. |
|
One way to fix the test would be to change the two assertions as follows:
|
31d84d2 to
0e6a6ed
Compare
arjan-bal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good, but please add more comments as the behaviour is non-trivial.
framework/xds_k8s_testcase.py
Outdated
| self._checkRpcsInFlight( | ||
| test_client, rpc_type, num_rpcs, threshold_percent | ||
| ) | ||
| # In the second check, verify RPCs are within [threshold - QPS, threshold]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be better to describe the "why" instead of the "what" in the comment here. It may not be immediately apparent to readers. It will also help explain why the qps argument is required.
framework/xds_k8s_testcase.py
Outdated
| f"[{max(0, num_rpcs - qps)}, {num_rpcs}]" | ||
| ), | ||
| ) | ||
| first_min = int(num_rpcs * (1 - threshold_percent / 100)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be helpful to add a comment explaining why we don't define the max value as int(num_rpcs * (1 + threshold_percent / 100)). This is mainly because circuit_breaking (the only consumer) requires that the RPC count strictly not exceed the provided QPS.
sergiitk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: we need to update PR title and description to match the changed logic
framework/xds_k8s_testcase.py
Outdated
| f"Timeout waiting for test client {test_client.hostname} to" | ||
| f"report {num_rpcs} pending calls ±{threshold_percent}%" | ||
| f"report {num_rpcs} pending calls in range " | ||
| f"[{int(num_rpcs * (1 - steady_state_min_threshold_percent / 100))}, " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead, calculate the first_min above the retryer, and use the variable. Not only it's easier to ready, it'll make it much clearer this is what this retryer is for.
| f"[{int(num_rpcs * (1 - steady_state_min_threshold_percent / 100))}, " | |
| f"[{first_min}, " |
framework/xds_k8s_testcase.py
Outdated
| steady_state_min_threshold_percent: int = 5, | ||
| min_tolerance_delta_after_steady_state: int = 100, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things:
steady_state_min_threshold_percentandmin_tolerance_delta_after_steady_state: note the order of the words. The first one starts with "stage-tolerance_type-unit", the second one is "tolerance_type-stage" (and no unit). Let's make it consistent "stage-tolerance_type-unit".min_tolerance_delta_after_steady_state: we need to include the unit name here. F.e.steady_state_min_threshold_percentmakes it clear the unit is "percent".steady_state_min_threshold_percent- I don't think the word "threshold" applies here now that we don't use it to define the "max" end of the range. Gemini: "threshold range is defining the acceptable minimum and maximum points for something to function, trigger, or be considered valid, like a temperature band for thermoregulation or a specific voltage for a transistor. "
I'm thinking something like
| steady_state_min_threshold_percent: int = 5, | |
| min_tolerance_delta_after_steady_state: int = 100, | |
| steady_state_allowed_shortfall_percent: int = 5, | |
| after_steady_state_allowed_shortfall_count: int = 100, |
tests/circuit_breaking_test.py
Outdated
| test_client, | ||
| rpc_type=grpc_testing.RPC_TYPE_EMPTY_CALL, | ||
| num_rpcs=_INITIAL_EMPTY_MAX_REQUESTS, | ||
| min_tolerance_delta_after_steady_state=_QPS, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we don't want to repeat the comment multiple times, let's make a local variable after_steady_state_shortfall above self.subTest("12_client_reaches_target_steady_state, assign it to _QPS, move the comment above it, and reuse the variable in all subtests.
tests/circuit_breaking_test.py
Outdated
| # circuit_breaking_test requires that the RPC count strictly | ||
| # not exceed the provided tolerance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think something got lost in the explanation when the comment was moved.
This should cover "In the second check" comment from this commit 9d1c1b6
sergiitk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make sure to re-run the tests.
| first_min = int( | ||
| num_rpcs * (1 - steady_state_allowed_shortfall_percent / 100) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to repeat this, it's already defined on line 893.
| first_min = int( | |
| num_rpcs * (1 - steady_state_allowed_shortfall_percent / 100) | |
| ) |
try to fix b/448552373 by altering the second check for inflight RPCs from expected +/- threshold to expected - QPS
test run