Fixing TestServerlessScaleToZero test #269
Conversation
…ents utility method Signed-off-by: Brett Thompson <196701379+brettmthompson@users.noreply.github.com>
Reviewer's Guide by Sourcery

This pull request fixes a flaky test by adding a new helper method to wait for deployments to be created. The existing test logic immediately checked for deployments after an update, which could fail if the deployment wasn't instantly available. The new method uses a timeout sampler to wait for the expected number of deployments matching a specific label selector.

No diagrams generated as the changes look simple and do not need a visual representation.
Hey @brettmthompson - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider if this `wait_for_deployments` pattern could be generalized for other resource types to avoid potential future duplication of waiting logic.
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
…eployment_replicas func to handle the same bug Signed-off-by: Brett Thompson <196701379+brettmthompson@users.noreply.github.com>
Walkthrough

The changes refactor the logic for waiting on inference deployment replicas in both the test and utility code. The test now utilizes the `wait_for_inference_deployment_replicas` utility function.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test as Test Case
    participant Utils as utilities/infra.py
    participant K8s as Kubernetes API
    Test->>Utils: wait_for_inference_deployment_replicas(isvc, expected_num_deployments, labels, deployed, timeout)
    Utils->>K8s: Query deployments with label selector
    K8s-->>Utils: Return list of deployments
    Utils->>Utils: Check if expected number of deployments found
    Utils->>K8s: Wait for replicas in each deployment (with timeout)
    K8s-->>Utils: Deployment status
    Utils-->>Test: Return deployments or raise error
```
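The flow in the diagram can be sketched in plain Python. Everything below (the helper name, error classes, fake cluster, and timings) is illustrative only, not the repo's actual `TimeoutSampler`/`TimeoutWatch` API:

```python
import time


class ResourceNotFoundError(Exception):
    """Raised when fewer deployments than expected exist after the timeout."""


class ResourceNotUniqueError(Exception):
    """Raised when more deployments than expected exist after the timeout."""


def wait_for_deployments(list_deployments, expected, timeout=2.0, interval=0.1):
    """Poll until the expected number of deployments exists.

    Unlike a generic retry decorator, this re-raises the *specific* last
    error on timeout instead of a generic TimeoutExpiredError.
    """
    deadline = time.monotonic() + timeout
    last_error = None
    while time.monotonic() < deadline:
        found = list_deployments()
        if len(found) == expected:
            return found
        if len(found) < expected:
            last_error = ResourceNotFoundError(f"{len(found)} of {expected} deployments found")
        else:
            last_error = ResourceNotUniqueError(f"expected {expected}, found {len(found)}")
        time.sleep(interval)
    raise last_error


# Fake cluster: the third deployment only appears after a short delay,
# mimicking the race the PR fixes.
start = time.monotonic()
def list_deployments():
    return ["d1", "d2", "d3"] if time.monotonic() - start > 0.3 else ["d1", "d2"]

print(wait_for_deployments(list_deployments, expected=3))  # → ['d1', 'd2', 'd3']
```

The immediate-check version of the test corresponds to calling `list_deployments()` once and asserting on the result, which fails whenever the new deployment has not yet been created.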
📜 Recent review details

Configuration used: CodeRabbit UI

📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
🧰 Additional context used

🧬 Code Graph Analysis (1)

utilities/infra.py (3)
🔇 Additional comments (12)
✨ Finishing Touches
Actionable comments posted: 1
🧹 Nitpick comments (3)
utilities/infra.py (3)
188-191: Typo & clarification in doc-string
labels (str): Comma seperated list …
- Spelling: separated
- The expected format is actually a Kubernetes selector string that is appended to the auto-generated ISVC selector. Consider wording it as:
labels (str): Additional label selector(s) (comma-separated list of key=value).

This avoids confusion about whether spaces are allowed.
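The suggested wording can be made concrete with a small sketch. `build_label_selector` and the label values below are hypothetical illustrations, not the repo's `create_isvc_label_selector_str`:

```python
def build_label_selector(base_selector: str, extra_labels: dict[str, str]) -> str:
    """Append extra key=value pairs to a Kubernetes label selector string.

    Kubernetes expects the form 'key1=value1,key2=value2' with no spaces
    around the commas.
    """
    extra = ",".join(f"{key}={value}" for key, value in sorted(extra_labels.items()))
    return f"{base_selector},{extra}" if extra else base_selector


print(build_label_selector(
    "serving.kserve.io/inferenceservice=my-isvc",
    {"serving.knative.dev/configurationGeneration": "3"},
))
# → serving.kserve.io/inferenceservice=my-isvc,serving.knative.dev/configurationGeneration=3
```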
213-225: Early break condition may starve retry logic

The loop breaks only when `len(deployment_list) == expected_num_deployments`. If the cluster overshoots (e.g. 2 deployments while you expect 1), the sampler keeps running until timeout even though we already know the state is invalid. This wastes the remaining timeout budget and may hide the real issue (e.g. an extra rollout). Consider short-circuiting both over- and under-shoot cases:
```diff
-            if len(deployment_list) == expected_num_deployments:
-                break
+            current = len(deployment_list)
+            if current == expected_num_deployments:
+                break
+            if current > expected_num_deployments:
+                raise ResourceNotUniqueError(
+                    f"Too many predictor deployments found in namespace {ns}. "
+                    f"Expected {expected_num_deployments}, found {current}"
+                )
```

That will fail fast instead of idling for the whole timeout.
232-238: Propagate `ResourceNotUniqueError`/`ResourceNotFoundError` as root cause

`TimeoutExpiredError` is re-raised when `e.last_exp is not None`, losing the higher-level context (missing/extra deployments). It would be clearer to wrap the original exception:

```python
except TimeoutExpiredError as e:
    raise TimeoutExpiredError(
        f"Timed out waiting for {expected_num_deployments} deployment(s) "
        f"in namespace {ns} (last error: {e.last_exp})"
    ) from e
```

Callers then only need to handle one exception type and still receive the detailed root cause.
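The effect of chaining with `from e` can be demonstrated in a self-contained sketch; the exception classes here are minimal stand-ins for the real ones, assumed to expose a `last_exp` attribute like the sampler's error does:

```python
class TimeoutExpiredError(Exception):
    """Stand-in for the timeout error raised by the sampler."""
    def __init__(self, message, last_exp=None):
        super().__init__(message)
        self.last_exp = last_exp


class ResourceNotFoundError(Exception):
    """Stand-in for the specific error recorded while sampling."""


def wait_with_context():
    # Simulate the sampler timing out while a more specific error was pending.
    try:
        raise TimeoutExpiredError(
            "sampling timed out",
            last_exp=ResourceNotFoundError("0 of 1 deployments found"),
        )
    except TimeoutExpiredError as e:
        # 'from e' keeps the original exception reachable via __cause__,
        # so callers see both the summary and the root cause.
        raise TimeoutExpiredError(
            f"Timed out waiting for deployments (last error: {e.last_exp})"
        ) from e


try:
    wait_with_context()
except TimeoutExpiredError as err:
    print(isinstance(err.__cause__, TimeoutExpiredError))  # → True
    print("last error" in str(err))                        # → True
```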
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tests/model_serving/model_server/serverless/test_scale_to_zero.py (2 hunks)
utilities/infra.py (5 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
tests/model_serving/model_server/serverless/test_scale_to_zero.py (3)
utilities/infra.py (1)
wait_for_inference_deployment_replicas (172-257)

tests/conftest.py (1)

admin_client (51-52)

tests/model_serving/model_server/serverless/conftest.py (1)

inference_service_patched_replicas (22-36)
utilities/infra.py (2)
utilities/constants.py (2)

Timeout (189-196)
KServeDeploymentType (6-9)

utilities/general.py (1)

create_isvc_label_selector_str (143-173)
🔇 Additional comments (1)
utilities/infra.py (1)
141-149: 🛠️ Refactor suggestion

Make new `timeout` arg backward-compatible & update call-sites

`wait_for_replicas_in_deployment()` now takes a `timeout` parameter, but every existing caller outside this file still passes only the previous two positional arguments. Unless you audited the whole repository, those calls will raise `TypeError: wait_for_replicas_in_deployment() takes 2 positional arguments but 3 were given.`

If you want the new arg to be optional without touching all callers immediately, keep the new parameter keyword-only:
```diff
-def wait_for_replicas_in_deployment(deployment: Deployment, replicas: int, timeout: int = Timeout.TIMEOUT_2MIN) -> None:
+def wait_for_replicas_in_deployment(
+    deployment: Deployment,
+    replicas: int,
+    *,
+    timeout: int = Timeout.TIMEOUT_2MIN,
+) -> None:
```

That way legacy positional invocations continue to work while new code can override the timeout when needed.
Likely an incorrect or invalid review comment.
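The keyword-only behaviour the suggestion relies on can be checked with a minimal sketch, using a plain string in place of the `Deployment` object and a hard-coded default in place of `Timeout.TIMEOUT_2MIN`:

```python
def wait_for_replicas_in_deployment(deployment: str, replicas: int, *, timeout: int = 120) -> str:
    """Sketch of the reworked signature: everything after '*' is keyword-only,
    so legacy two-argument positional calls keep working unchanged."""
    return f"waiting for {replicas} replica(s) of {deployment} (timeout={timeout}s)"


# Legacy call sites pass only the two original positional arguments.
print(wait_for_replicas_in_deployment("predictor", 1))
# → waiting for 1 replica(s) of predictor (timeout=120s)

# New code can opt in to a custom timeout, but only by keyword.
print(wait_for_replicas_in_deployment("predictor", 1, timeout=300))
# → waiting for 1 replica(s) of predictor (timeout=300s)

# Passing the timeout positionally is rejected, which catches accidental misuse.
try:
    wait_for_replicas_in_deployment("predictor", 1, 300)
except TypeError:
    print("positional timeout rejected")  # → positional timeout rejected
```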
…t param for Deployment.Get() calls Signed-off-by: Brett Thompson <196701379+brettmthompson@users.noreply.github.com>
/verified
dbasunag
left a comment
@brettmthompson I am good with the PR, can you please see if you can get https://github.com/opendatahub-io/opendatahub-tests/pull/269/files#r2080184023 resolved. Either way, I will not be holding it for this one.
Status of building tag latest: success.
* fixing TestServerlessScaleToZero test and adding new wait_for_deployments utility method
* removing wait_for_deployments func and reworking wait_for_inference_deployment_replicas func to handle the same bug
* adding new UnexpectedResourceCountError and now using dyn_client input param for Deployment.Get() calls

Signed-off-by: Brett Thompson <196701379+brettmthompson@users.noreply.github.com>
Description
Currently the `TestServerlessScaleToZero` test will fail at the final step, `test_serverless_pods_after_scale_to_one_replica`. The reason for this failure is that this step immediately retrieves the deployments after the inference service is edited, not giving enough time for the new deployment to be created. So when the deployments are retrieved, only the first 2 will be returned, causing the label check for `serving.knative.dev/configurationGeneration=3` to never return true.

To resolve this issue, I have updated the `wait_for_inference_deployments_replicas` function in the following ways:

- The deployment retrieval is now wrapped in a `TimeoutSampler`. This allows time for deployments to be created. The same `ResourceNotUniqueError` or `ResourceNotFoundError` exceptions will be raised if an unexpected number of deployments is returned after timeout, with a bit more detail in the error message.
- Using the `TimeoutWatch` class to distribute the timeout across the sub functions called.
- Previously the `wait_for_replicas_in_deployment` method would be called, but only the first element from the list of deployments would ever be passed to this function, regardless of which deployment we were currently iterating over in the list. I have updated this to pass the current deployment being iterated over.
- Added a `labels` string input parameter that allows for deployments to be filtered on custom labels in addition to the inference service label.
- Added a `deployed` boolean input parameter. This allows for deployments that are expected to have 0 replicas to be checked using this method.

Note
I am using the `TimeoutSampler` class here and not the `retry` decorator for the following reasons:

- The timeout can be driven dynamically from the `TimeoutWatch` class.
- The specific exceptions (`ResourceNotUniqueError` and `ResourceNotFoundError`) can be raised directly. If a retry decorator is used, the top level exception raised will always be of type `TimeoutExpiredError`.

How Has This Been Tested?
Ran the `TestServerlessScaleToZero` test locally and it is now successful.

Merge criteria:
Summary by Sourcery
Fix flaky `TestServerlessScaleToZero` test by waiting for the deployment to be ready after scaling.

Tests:

- Update the `wait_for_inference_deployments_replicas` utility function to poll for expected deployments within a timeout.
- Update `test_serverless_pods_after_scale_to_one_replica` to call `wait_for_inference_deployments_replicas` before checking deployment state.
Summary by CodeRabbit