Skip to content

Fix wait_for_running_pods flakyness#834

Merged
vsibirsk merged 4 commits intoRedHatQE:mainfrom
hmeir:fix-flaky-wait-pod-running
May 18, 2025
Merged

Fix wait_for_running_pods flakyness#834
vsibirsk merged 4 commits intoRedHatQE:mainfrom
hmeir:fix-flaky-wait-pod-running

Conversation

@hmeir
Copy link
Copy Markdown
Contributor

@hmeir hmeir commented Apr 27, 2025

Short description:

pod respinned and its name changes, so the pod's list contains an already deleted pod name, which may lead to a failure.
The reason is that pod.instance being called for deleted pod, and the sampler getting stuck.

The solution - wrap it with the try-except block, and get the pod list with each func call

Which issue(s) this PR fixes:

Gating tests flaky across multiple versions.

Special notes for reviewer:
jira-ticket:

Summary by CodeRabbit

  • Refactor
    • Improved error handling and logic flow in pod status checks for more robust monitoring.
    • Enhanced the process for detecting non-running pods by dynamically retrieving the pod list during status checks.
    • Updated logging to provide clearer information about pod statuses and warnings when issues are detected.

@hmeir
Copy link
Copy Markdown
Contributor Author

hmeir commented Apr 27, 2025

/wip

@ghost
Copy link
Copy Markdown

ghost commented Apr 27, 2025

Report bugs in Issues

The following are automatically added:

  • Add reviewers from OWNER file (in the root of the repository) under reviewers section.
  • Set PR size label.
  • New issue is created for the PR. (Closed when PR is merged/closed)
  • Run pre-commit if .pre-commit-config.yaml exists in the repo.

Available user actions:

  • To mark PR as WIP comment /wip to the PR, To remove it from the PR comment /wip cancel to the PR.
  • To block merging of PR comment /hold, To un-block merging of PR comment /hold cancel.
  • To mark PR as verified comment /verified to the PR, to un-verify comment /verified cancel to the PR.
    verified label removed on each new commit push.
  • To cherry pick a merged PR comment /cherry-pick <target branch to cherry-pick to> in the PR.
    • Multiple target branches can be cherry-picked, separated by spaces. (/cherry-pick branch1 branch2)
    • Cherry-pick will be started when PR is merged
  • To build and push container image command /build-and-push-container in the PR (tag will be the PR number).
    • You can add extra args to the Podman build command
      • Example: /build-and-push-container --build-arg OPENSHIFT_PYTHON_WRAPPER_COMMIT=<commit_hash>
  • To add a label by comment use /<label name>, to remove, use /<label name> cancel
  • To assign reviewers based on OWNERS file use /assign-reviewers
  • To check if PR can be merged use /check-can-merge
  • to assign reviewer to PR use /assign-reviewer @<reviewer>

PR will be approved when the following conditions are met:

  • /approve from one of the approvers.
  • Minimum number of required /lgtm (2) is met.
Approvers and Reviewers
  • Approvers:

    • dbasunag
    • dbasunag
    • dshchedr
    • dshchedr
    • myakove
    • vsibirsk
    • vsibirsk
  • Reviewers:

    • RoniKishner
    • RoniKishner
    • dbasunag
    • dbasunag
    • dshchedr
    • dshchedr
    • geetikakay
    • vsibirsk
    • vsibirsk
Supported /retest check runs
  • /retest tox: Retest tox
  • /retest build-container: Retest build-container
  • /retest all: Retest all
Supported labels
  • hold
  • verified
  • wip
  • lgtm
  • approve

@ghost ghost added the size/XS label Apr 27, 2025
@ghost ghost requested a review from dshchedr April 27, 2025 13:12
@ghost ghost added the branch-main label Apr 27, 2025
@ghost ghost requested review from RoniKishner, dbasunag, geetikakay and vsibirsk April 27, 2025 13:12
@dbasunag1 dbasunag1 added the wip label Apr 27, 2025
@dbasunag1 dbasunag1 changed the title Fix wait_for_running_pods flakyness WIP: Fix wait_for_running_pods flakyness Apr 27, 2025
@hmeir
Copy link
Copy Markdown
Contributor Author

hmeir commented Apr 27, 2025

/build-and-push-container

@dbasunag1
Copy link
Copy Markdown

New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-834 published

@hmeir
Copy link
Copy Markdown
Contributor Author

hmeir commented Apr 27, 2025

/wip cancel

@hmeir
Copy link
Copy Markdown
Contributor Author

hmeir commented Apr 27, 2025

/verified

@ghost ghost removed the wip label Apr 27, 2025
@ghost ghost changed the title WIP: Fix wait_for_running_pods flakyness Fix wait_for_running_pods flakyness Apr 27, 2025
@ghost ghost added the commented-Ahmad-Hafe label Apr 28, 2025
@hmeir
Copy link
Copy Markdown
Contributor Author

hmeir commented May 8, 2025

/build-and-push-container

@ghost
Copy link
Copy Markdown

ghost commented May 8, 2025

New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-834 published

@hmeir
Copy link
Copy Markdown
Contributor Author

hmeir commented May 9, 2025

/verified

dshchedr
dshchedr previously approved these changes May 9, 2025
@@ -275,30 +275,30 @@ def wait_for_pods_deletion(pods):

def get_pod_container_error_status(pod):
pod_instance_status = pod.instance.status
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be wrapped with try-execpt. if the pod does not exit, this will fail.

Copy link
Copy Markdown
Contributor Author

@hmeir hmeir May 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the pod doesn't exist we will hit NotFoundError in get_not_running_pods

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_pod_container_error_status should handle this regardless of get_not_running_pods so if someone calls this function it fails gracefully

# pod that was spinned up in place of pod marked for deletion, reaches healthy state before end
# of this check
if pod_instance.metadata.get("deletionTimestamp") or pod_instance.status.phase not in (
elif pod_instance.metadata.get("deletionTimestamp") or pod_instance.status.phase not in (
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it would make sense to reverse the conditions - i.e first check pod's metadata and phase and only if does not meet that condition iterate over the containers.

else:
current_check = 0
if sample:
LOGGER.info(f"All pods: {[pod.name for pod in sample]}")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this log add extra valut -i.e pod names only?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its good for debugging.
But yeah I guess we dont need that for 99% of the cases.

Deleting

@hmeir
Copy link
Copy Markdown
Contributor Author

hmeir commented May 12, 2025

/verified

@rnetser
Copy link
Copy Markdown
Collaborator

rnetser commented May 13, 2025

/approve

hmeir added 4 commits May 14, 2025 03:09
Signed-off-by: Harel Meir <hmeir@redhat.com>
Signed-off-by: Harel Meir <hmeir@redhat.com>
Signed-off-by: Harel Meir <hmeir@redhat.com>
Signed-off-by: Harel Meir <hmeir@redhat.com>
@hmeir
Copy link
Copy Markdown
Contributor Author

hmeir commented May 14, 2025

/build-and-push-container

@ghost
Copy link
Copy Markdown

ghost commented May 14, 2025

New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-834 published

@hmeir
Copy link
Copy Markdown
Contributor Author

hmeir commented May 14, 2025

/verified

Copy link
Copy Markdown
Collaborator

@rnetser rnetser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@rnetser
Copy link
Copy Markdown
Collaborator

rnetser commented May 14, 2025

/lgtm

@vsibirsk
Copy link
Copy Markdown
Collaborator

/approve

@vsibirsk
Copy link
Copy Markdown
Collaborator

/retest build-container

@ghost
Copy link
Copy Markdown

ghost commented May 18, 2025

Successfully removed PR tag: quay.io/openshift-cnv/openshift-virtualization-tests:pr-834.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.