Fix wait_for_running_pods flakiness #834
Conversation
/wip
/build-and-push-container
New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-834 published
/wip cancel
/verified
/build-and-push-container
New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-834 published
/verified
utilities/infra.py
Outdated
@@ -275,30 +275,30 @@ def wait_for_pods_deletion(pods):

def get_pod_container_error_status(pod):
    pod_instance_status = pod.instance.status
Should be wrapped with try-except; if the pod does not exist, this will fail.
If the pod doesn't exist, we will hit NotFoundError in get_not_running_pods.
get_pod_container_error_status should handle this regardless of get_not_running_pods, so that if someone calls this function directly it fails gracefully.
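A minimal sketch of the try-except wrapping suggested here, so the helper fails gracefully when the pod was already deleted. `ResourceNotFoundError` and the dict-shaped status are assumptions standing in for the real client's exception and resource types:

```python
class ResourceNotFoundError(Exception):
    """Stand-in for the client library's not-found error (assumption)."""


def get_pod_container_error_status(pod):
    """Return the first container's waiting reason, or None.

    Fetching `pod.instance` is wrapped in try-except so a pod deleted
    mid-check returns None instead of raising.
    """
    try:
        pod_instance_status = pod.instance.status
    except ResourceNotFoundError:
        # Pod vanished between listing and inspection: treat as no error.
        return None
    for container_status in pod_instance_status.get("containerStatuses", []):
        waiting = container_status.get("state", {}).get("waiting")
        if waiting and waiting.get("reason"):
            return waiting["reason"]
    return None
```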
utilities/infra.py
Outdated
# pod that was spinned up in place of pod marked for deletion, reaches healthy state before end
# of this check
- if pod_instance.metadata.get("deletionTimestamp") or pod_instance.status.phase not in (
+ elif pod_instance.metadata.get("deletionTimestamp") or pod_instance.status.phase not in (
I think it would make sense to reverse the conditions, i.e. first check the pod's metadata and phase, and only if it does not meet that condition iterate over the containers.
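A sketch of the condition ordering suggested above: check the cheap pod-level fields first and only fall through to the per-container loop when they pass. The function name and the dict-shaped pod instance are illustrative assumptions, not the repo's actual code:

```python
def is_pod_unhealthy(pod_instance):
    """Return True if the pod should be treated as not healthy."""
    # Pod-level checks first: a deletion timestamp or an unexpected
    # phase already marks the pod unhealthy, no container walk needed.
    if pod_instance["metadata"].get("deletionTimestamp") or pod_instance["status"][
        "phase"
    ] not in ("Running", "Succeeded"):
        return True
    # Only when the pod-level checks pass, inspect individual containers.
    for container in pod_instance["status"].get("containerStatuses", []):
        if container.get("state", {}).get("waiting"):
            return True
    return False
```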
utilities/infra.py
Outdated
else:
    current_check = 0
if sample:
    LOGGER.info(f"All pods: {[pod.name for pod in sample]}")
Does this log add extra value, i.e. pod names only?
It's good for debugging.
But yeah, I guess we don't need that for 99% of the cases.
Deleting.
/verified
/approve
Signed-off-by: Harel Meir <hmeir@redhat.com>
/build-and-push-container
New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-834 published
/verified
/lgtm
/approve
/retest build-container
Successfully removed PR tag: quay.io/openshift-cnv/openshift-virtualization-tests:pr-834.
Short description:
A pod gets respun and its name changes, so the pod list contains an already-deleted pod name, which may lead to a failure.
The reason is that pod.instance is called for a deleted pod, and the sampler gets stuck.
The solution: wrap the call in a try-except block, and fetch the pod list on each function call.
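The two fixes in the description can be sketched together: skip pods that disappear mid-check, and re-list pods on every polling iteration so renamed replacement pods are picked up. The helper names, the plain polling loop standing in for the project's sampler, and the dict-shaped status are all assumptions for illustration:

```python
import time


class ResourceNotFoundError(Exception):
    """Stand-in for the client library's not-found error (assumption)."""


def get_not_running_pods(pods):
    """Collect pods that are not Running, skipping deleted pods gracefully."""
    not_running = []
    for pod in pods:
        try:
            phase = pod.instance.status["phase"]
        except ResourceNotFoundError:
            # Pod was deleted between listing and inspection; its
            # replacement shows up in the next freshly fetched list.
            continue
        if phase != "Running":
            not_running.append(pod)
    return not_running


def wait_for_running_pods(list_pods, timeout=5, interval=0.1):
    """Poll until all pods are Running, re-listing pods on every check."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Fresh list each iteration: respun pods get new names, so a
        # static list would keep pointing at already-deleted pods.
        if not get_not_running_pods(pods=list_pods()):
            return True
        time.sleep(interval)
    raise TimeoutError("pods did not all reach Running in time")
```

Passing a `list_pods` callable instead of a pre-built list is what keeps the sampler from getting stuck on a stale pod name.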
Which issue(s) this PR fixes:
Gating tests are flaky across multiple versions.
Special notes for reviewer:
jira-ticket: