
feat: add option to run tests even if cluster sanity checks failed#289

Closed
lugi0 wants to merge 0 commits into opendatahub-io:main from lugi0:main

Conversation

@lugi0
Contributor

@lugi0 lugi0 commented May 5, 2025

Description

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • New Features
    • Added command-line options to customize cluster sanity checks during test runs, including skipping checks, skipping specific resource checks, and continuing tests with warnings on failures.
  • Bug Fixes
    • Fixed a typo in the help text for the upgrade deployment modes option.
  • Refactor
    • Enhanced logging and error handling for cluster sanity checks to improve test execution feedback.

@lugi0 lugi0 self-assigned this May 5, 2025
@lugi0 lugi0 requested a review from a team as a code owner May 5, 2025 15:32
@coderabbitai
Contributor

coderabbitai bot commented May 5, 2025

Walkthrough

The changes introduce a new pytest command-line option, --cluster-sanity-continue-on-failure, allowing tests to continue even if cluster sanity checks fail, with appropriate warnings and JUnit XML property updates. The help text for the --upgrade-deployment-modes option is corrected from "Coma-separated" to "Comma-separated." The verify_cluster_sanity function is refactored to handle three cluster sanity-related options: skipping all checks, skipping only RHOAI-specific checks, and continuing on failure. Logging and JUnit XML reporting are enhanced to reflect these behaviors, while the default behavior of aborting on failure remains unchanged.
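As an illustrative sketch of what the walkthrough describes — not the repository's exact code — the new flag could be registered in conftest.py through pytest's standard pytest_addoption hook:

```python
# Sketch only; the actual conftest.py in this PR may differ in wording and defaults.
def pytest_addoption(parser):
    parser.addoption(
        "--cluster-sanity-continue-on-failure",
        action="store_true",
        default=False,
        help="Continue the test run with a warning when cluster sanity checks fail.",
    )
```

With `action="store_true"` the option is a boolean switch, so the default behavior (abort on sanity failure) is preserved unless the flag is passed explicitly.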

Changes

File(s) Change Summary
conftest.py Fixed typo in help text for --upgrade-deployment-modes. Added new CLI option --cluster-sanity-continue-on-failure to pytest.
utilities/infra.py Refactored verify_cluster_sanity to handle new CLI options: --cluster-sanity-skip-check, --cluster-sanity-skip-rhoai-check, and --cluster-sanity-continue-on-failure. Enhanced logging, error handling, and JUnit XML reporting to support skipping and continuation behaviors on cluster sanity check failures.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Pytest
    participant verify_cluster_sanity
    participant Logger
    participant JUnitXML

    User->>Pytest: Run tests with CLI options
    Pytest->>verify_cluster_sanity: Call with options
    verify_cluster_sanity->>Logger: Log check start
    alt --cluster-sanity-skip-check
        verify_cluster_sanity->>Logger: Log skipping all checks
        verify_cluster_sanity->>JUnitXML: Set skipped property
        verify_cluster_sanity-->>Pytest: Return
    else --cluster-sanity-skip-rhoai-check
        verify_cluster_sanity->>Logger: Log skipping RHOAI checks
        verify_cluster_sanity->>JUnitXML: Set skipped property
        verify_cluster_sanity->>Logger: Run node health and schedulability checks
    else Checks run
        verify_cluster_sanity->>Logger: Run all checks including RHOAI
    end
    alt Checks fail
        alt --cluster-sanity-continue-on-failure
            verify_cluster_sanity->>Logger: Log warning, continue tests
            verify_cluster_sanity->>JUnitXML: Set failure with continue property
            verify_cluster_sanity-->>Pytest: Return
        else Default behavior
            verify_cluster_sanity->>Logger: Log error, exit tests
            verify_cluster_sanity->>JUnitXML: Set failure with exit code 99
            verify_cluster_sanity->>Pytest: Exit test run
        end
    else Checks pass
        verify_cluster_sanity->>Logger: Log success
        verify_cluster_sanity-->>Pytest: Return
    end
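The branching shown in the diagram can be sketched in plain Python. The function shape, option keys, and check callables below are assumptions for illustration, not the actual utilities/infra.py code; the JUnit property names and the exit code 99 are taken from this PR.

```python
import warnings

def verify_cluster_sanity(options, run_node_checks, run_rhoai_checks,
                          junitxml_property=None):
    """Sketch of the flow above. `options` holds the three CLI flags;
    the check callables (hypothetical names) raise on failure."""
    if options.get("skip_check"):
        # --cluster-sanity-skip-check: skip everything
        if junitxml_property:
            junitxml_property(name="cluster_sanity_check_skipped", value=True)
        return "skipped"
    try:
        run_node_checks()  # node health and schedulability
        if options.get("skip_rhoai_check"):
            # --cluster-sanity-skip-rhoai-check: node checks only
            if junitxml_property:
                junitxml_property(name="rhoai_check_skipped", value=True)
        else:
            run_rhoai_checks()  # RHOAI-specific resource checks
    except Exception as exc:
        if options.get("continue_on_failure"):
            warnings.warn(f"Cluster sanity check failed, continuing: {exc}")
            if junitxml_property:
                junitxml_property(name="cluster_sanity_check_failed", value=True)
                junitxml_property(name="cluster_sanity_forced_continue", value=True)
            return "continued"
        raise SystemExit(99)  # default: abort the run, as in the diagram
    return "passed"
```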

Poem

A rabbit hopped through code so neat,
Tweaked a typo, made it sweet.
Cluster checks now heed your say—
Skip or warn, or halt the day.
With options new, the tests can flow,
As carrots in the garden grow! 🥕


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aba9503 and e6c1d83.

📒 Files selected for processing (1)
  • utilities/infra.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • utilities/infra.py
✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
utilities/infra.py (1)

914-931: Well-implemented continue-on-failure behavior.

The implementation correctly handles the different behaviors based on the --cluster-sanity-continue-on-failure flag, with appropriate logging and JUnit XML property updates.

There's a TODO comment at line 930. Consider either implementing the file output for Jenkins reporting or creating a separate ticket to track this item:

- # TODO: Write to file to easily report the failure in jenkins
+ # TODO: Write to file to easily report the failure in jenkins (tracked in separate ticket #XXX)
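If the TODO is eventually implemented, one minimal approach would be a small helper along these lines (the helper name and file name are hypothetical, not part of this PR):

```python
from pathlib import Path

def record_sanity_failure(message, out_file="cluster-sanity-failure.log"):
    # Hypothetical helper: persist the failure reason so a Jenkins
    # post-build step can pick up the file and surface it in the job summary.
    Path(out_file).write_text(f"{message}\n")
```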
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eda4aeb and aba9503.

📒 Files selected for processing (2)
  • conftest.py (2 hunks)
  • utilities/infra.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
utilities/infra.py (2)
tests/conftest.py (3)
  • nodes (496-497)
  • dsci_resource (334-335)
  • dsc_resource (339-340)
utilities/exceptions.py (1)
  • ResourceNotReadyError (99-100)
🔇 Additional comments (8)
conftest.py (2)

112-112: Fixed typo in help text.

Corrected "Coma-separated" to "Comma-separated" in the help text.


133-139: Well-structured new CLI option for continuing tests on cluster sanity failure.

This new flag provides a clean way to optionally continue test execution even when cluster sanity checks fail, enhancing flexibility for test scenarios while maintaining the default behavior of failing fast.
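Taken together with the two existing skip flags, the precedence the review and walkthrough describe (skip everything, else skip RHOAI checks only, else warn-and-continue, else fail fast) could be expressed as a small helper. The option names come from this PR; the function itself is a hypothetical sketch:

```python
def resolve_sanity_behavior(config):
    """Map the three sanity-related CLI flags to one behavior keyword.
    `config` is anything exposing pytest's getoption(name) interface."""
    if config.getoption("--cluster-sanity-skip-check"):
        return "skip-all"
    if config.getoption("--cluster-sanity-skip-rhoai-check"):
        return "skip-rhoai"
    if config.getoption("--cluster-sanity-continue-on-failure"):
        return "warn-and-continue"
    return "fail-fast"
```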

utilities/infra.py (6)

874-877: Clear documentation of the new behavior.

The updated docstring accurately describes the cluster sanity check behavior and the effect of the new flag.


887-892: Good use of constants for CLI options.

Using constants for the command-line options improves code readability and maintainability, reducing the risk of typos when referencing these options throughout the code.


894-897: Clean implementation of the skip-check functionality.

The code correctly handles the case when --cluster-sanity-skip-check is specified, logging a warning and returning early to skip all checks.


900-900: Added helpful log message to indicate check start.

Adding this log message improves debuggability by clearly showing when the cluster sanity check begins.


903-905: Good implementation of selective RHOAI check skipping.

The code effectively implements the ability to skip only RHOAI-specific resource checks while still performing the basic node checks.


908-909: Added success log message.

This log message helps with debugging by clearly indicating when the sanity check has successfully completed.

@github-actions

github-actions bot commented May 5, 2025

The following are automatically added/executed:

  • PR size label.
  • Run pre-commit
  • Run tox
  • Add PR author as the PR assignee

Available user actions:

  • To mark a PR as WIP, add /wip in a comment. To remove it, comment /wip cancel on the PR.
  • To block merging of a PR, add /hold in a comment. To un-block merging, comment /hold cancel on the PR.
  • To mark a PR as approved, add /lgtm in a comment. To remove, add /lgtm cancel.
    The lgtm label is removed on each new commit push.
  • To mark a PR as verified, comment /verified on the PR; to un-verify, comment /verified cancel on the PR.
    The verified label is removed on each new commit push.
  • To cherry-pick a merged PR, comment /cherry-pick <target_branch_name> on the PR. If <target_branch_name> is valid
    and the current PR is merged, a cherry-picked PR will be created and linked to the current PR.
Supported labels

{'/verified', '/lgtm', '/wip', '/hold'}

try:
    LOGGER.info("Check cluster sanity.")

    LOGGER.info("Running cluster sanity check...")
Collaborator


I think we should not skip here. If the nodes are not schedulable or healthy, it is a cluster problem.

Contributor Author


This is different from what was being discussed in slack. I do agree with you, but can we get an ack to proceed before implementing the change?

Comment on lines +918 to +921
if junitxml_property:
    junitxml_property(name="cluster_sanity_check_failed", value=True)  # type: ignore[call-arg]
    junitxml_property(name="cluster_sanity_forced_continue", value=True)  # type: ignore[call-arg]
# Return normally from the fixture setup
Collaborator


I believe this is for test result reporting. If cluster sanity fails and we are continuing with tests, we should not mark the tests as failure because of sanity failure. They should be marked failed only if the test fails.

Contributor Author


This is only marking those two specific properties, not the tests as failures

Collaborator


Then the associated test would be marked as failed though (e.g. first test of model registry). Do you want that?


3 participants