Enrich failure handling #1065
Signed-off-by: limengxuan <[email protected]>
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6.5.0 to 6.6.0. - [Release notes](https://github.com/docker/build-push-action/releases) - [Commits](docker/build-push-action@v6.5.0...v6.6.0) --- updated-dependencies: - dependency-name: docker/build-push-action dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: rongfu.leng <[email protected]>
Signed-off-by: wawa0210 <[email protected]>
Signed-off-by: wawa0210 <[email protected]>
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6.6.0 to 6.6.1. - [Release notes](https://github.com/docker/build-push-action/releases) - [Commits](docker/build-push-action@v6.6.0...v6.6.1) --- updated-dependencies: - dependency-name: docker/build-push-action dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: william-wang <[email protected]>
Signed-off-by: wawa0210 <[email protected]>
Signed-off-by: wawa0210 <[email protected]>
* fix: fix duplicate resource keys in configmap * fix: Update incorrect component names in monitorservice
Bumps [github.com/opencontainers/runc](https://github.com/opencontainers/runc) from 1.1.2 to 1.1.12. - [Release notes](https://github.com/opencontainers/runc/releases) - [Changelog](https://github.com/opencontainers/runc/blob/main/CHANGELOG.md) - [Commits](opencontainers/runc@v1.1.2...v1.1.12) --- updated-dependencies: - dependency-name: github.com/opencontainers/runc dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: rongfu.leng <[email protected]>
Signed-off-by: wawa0210 <[email protected]>
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 2 to 3. - [Release notes](https://github.com/github/codeql-action/releases) - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md) - [Commits](github/codeql-action@v2...v3) --- updated-dependencies: - dependency-name: github/codeql-action dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]>
Bumps [actions/setup-go](https://github.com/actions/setup-go) from 4 to 5. - [Release notes](https://github.com/actions/setup-go/releases) - [Commits](actions/setup-go@v4...v5) --- updated-dependencies: - dependency-name: actions/setup-go dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: rongfu.leng <[email protected]>
Signed-off-by: rongfu.leng <[email protected]>
…ect-HAMi#458) Signed-off-by: zoyopei <[email protected]>
Signed-off-by: rongfu.leng <[email protected]>
Signed-off-by: wawa0210 <[email protected]>
Signed-off-by: wawa0210 <[email protected]>
Signed-off-by: wawa0210 <[email protected]>
Signed-off-by: wawa0210 <[email protected]>
Bumps [github.com/opencontainers/runc](https://github.com/opencontainers/runc) from 1.1.12 to 1.1.14. - [Release notes](https://github.com/opencontainers/runc/releases) - [Changelog](https://github.com/opencontainers/runc/blob/main/CHANGELOG.md) - [Commits](opencontainers/runc@v1.1.12...v1.1.14) --- updated-dependencies: - dependency-name: github.com/opencontainers/runc dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]>
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6.6.1 to 6.7.0. - [Release notes](https://github.com/docker/build-push-action/releases) - [Commits](docker/build-push-action@v6.6.1...v6.7.0) --- updated-dependencies: - dependency-name: docker/build-push-action dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: wawa0210 <[email protected]>
…AMi#963) * feat: Add support for profiling via net/http/pprof package Signed-off-by: Shouren Yang <[email protected]> * feat: Add how-to-profiling-scheduler docs Signed-off-by: Shouren Yang <[email protected]> * feat: rename the --enable-profiling flag to --profiling and merge pprof routes to http server Signed-off-by: Shouren Yang <[email protected]> --------- Signed-off-by: Shouren Yang <[email protected]>
* update enflame devices Signed-off-by: limengxuan <[email protected]>
Signed-off-by: 王然 <[email protected]>
Signed-off-by: yxxhero <[email protected]>
Signed-off-by: yxxhero <[email protected]>
Signed-off-by: yxxhero <[email protected]>
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6.15.0 to 6.16.0. - [Release notes](https://github.com/docker/build-push-action/releases) - [Commits](docker/build-push-action@v6.15.0...v6.16.0) --- updated-dependencies: - dependency-name: docker/build-push-action dependency-version: 6.16.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]>
Project-HAMi#1023) Signed-off-by: wangmin <[email protected]> Co-authored-by: wangmin <[email protected]>
…i#1021) * feat: Support for using RuntimeClass with nvidia devices Signed-off-by: 王然 <[email protected]> * docs: runtimeClassName Signed-off-by: 王然 <[email protected]> * feat: reset hasResource logic Signed-off-by: 王然 <[email protected]> --------- Signed-off-by: 王然 <[email protected]>
…1020) Signed-off-by: wangmin <[email protected]> Co-authored-by: wangmin <[email protected]>
…t after ConfigMap modification (Project-HAMi#1022) Signed-off-by: 王然 <[email protected]>
(Project-HAMi#1012) Signed-off-by: ouyangluwei(riseunion) <[email protected]> Co-authored-by: ouyangluwei(riseunion) <[email protected]>
add new ai accelerator GCU S60 made by https://www.enflame-tech.com Signed-off-by: winston-zhang-orz <[email protected]>
* update cambricon devices Signed-off-by: limengxuan <[email protected]> * update Signed-off-by: limengxuan <[email protected]> * update Signed-off-by: limengxuan <[email protected]> * update Signed-off-by: limengxuan <[email protected]> --------- Signed-off-by: limengxuan <[email protected]> Signed-off-by: limengxuan <[email protected]>
…roject-HAMi#1031) Fix scheduler metrics can not be accessed when using master branch of HAMi Signed-off-by: limengxuan <[email protected]>
Signed-off-by: rongfu.leng <[email protected]>
…roject-HAMi#938) * Separate options from client to make the responsibility more clear. Remove the magic number in the main function and define it as a constant. Signed-off-by: yangshiqi <[email protected]> * fix merge bugs and add testcase. remove some comments to try e2e Signed-off-by: yangshiqi <[email protected]> * debug for e2e Signed-off-by: yangshiqi <[email protected]> * fix e2e error Signed-off-by: yangshiqi <[email protected]> --------- Signed-off-by: yangshiqi <[email protected]> Co-authored-by: yangshiqi <[email protected]>
Signed-off-by: rongfu.leng <[email protected]>
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6.16.0 to 6.17.0. - [Release notes](https://github.com/docker/build-push-action/releases) - [Commits](docker/build-push-action@v6.16.0...v6.17.0) --- updated-dependencies: - dependency-name: docker/build-push-action dependency-version: 6.17.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]>
…roject-HAMi#1056) Signed-off-by: Shouren Yang <[email protected]>
Signed-off-by: Shouren Yang <[email protected]>
Signed-off-by: wawa0210 <[email protected]>
Signed-off-by: wen.rui <[email protected]>
38c2ac4 to 8aa69f2
Codecov Report❌ Patch coverage is
Pull Request Overview
This PR enhances test failure diagnostics by adding utilities to fetch and log pod details across namespaces, adjusts the pod‐running wait interval for GPU workloads, and integrates detailed pod checks after any test failure.
- Increased the polling interval in `WaitForPodRunning` from 5s to 30s.
- Introduced `GetNamespaceList`, `GetPodLogs`, and `CheckPodDetails` in `test/utils/pod.go`.
- Updated `AfterEach` in `test/e2e/pod/test_pod.go` to call `CheckPodDetails` on failures and removed a debug `fmt.Printf`.
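To illustrate what the polling-interval change affects, here is a minimal sketch of an interval-based wait loop using only the standard library. It is not the code in `WaitForPodRunning`; `pollUntil`, `podRunningAfter`, and `waitDemo` are hypothetical names, and the fake condition stands in for a real pod status check.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollUntil calls condition immediately and then once per interval until it
// reports done, returns an error, or ctx expires. It returns the number of
// checks made. Hypothetical helper, not the project's implementation.
func pollUntil(ctx context.Context, interval time.Duration, condition func() (bool, error)) (int, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	attempts := 0
	for {
		attempts++
		done, err := condition()
		if err != nil {
			return attempts, err
		}
		if done {
			return attempts, nil
		}
		select {
		case <-ctx.Done():
			return attempts, ctx.Err()
		case <-ticker.C:
		}
	}
}

// podRunningAfter returns a fake condition that reports success on the nth check.
func podRunningAfter(n int) func() (bool, error) {
	checks := 0
	return func() (bool, error) {
		checks++
		return checks >= n, nil
	}
}

// waitDemo wires the pieces together; durations are given in milliseconds.
func waitDemo(succeedOn, intervalMs, timeoutMs int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	return pollUntil(ctx, time.Duration(intervalMs)*time.Millisecond, podRunningAfter(succeedOn))
}

func main() {
	attempts, err := waitDemo(3, 5, 1000)
	fmt.Println("checks:", attempts, "err:", err)

	// A condition that never succeeds ends with the context's deadline error.
	_, err = waitDemo(1_000_000, 5, 30)
	fmt.Println("timed out:", err != nil)
}
```

A longer interval means fewer status checks against the API server, at the cost of noticing a running pod later; whether 30s suits a given GPU workload is a judgment call for the test author.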
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test/utils/pod.go | Added new failure-handling helpers, updated polling interval, and imported I/O packages. |
| test/e2e/pod/test_pod.go | Call CheckPodDetails on test failures and remove leftover fmt.Printf debug statement. |
Comments suppressed due to low confidence (1)
test/utils/pod.go:96
- Use the passed-in context `ctx` instead of `context.TODO()` to allow cancellation and deadlines to propagate correctly.
pod, err := clientSet.CoreV1().Pods(namespace).Get(context.TODO(), podName, metav1.GetOptions{})
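The difference the suppressed comment points at can be sketched without a cluster. In the example below, `getPod` is a hypothetical stand-in for an API call that honors its context; `fetchIgnoringCtx` mirrors the flagged pattern and `fetchWithCtx` forwards the caller's context. None of these names come from the PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// getPod simulates an API call: it answers after latency, or gives up as
// soon as ctx is done. Hypothetical stand-in for a clientset Get call.
func getPod(ctx context.Context, name string, latency time.Duration) (string, error) {
	select {
	case <-time.After(latency):
		return name, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// fetchIgnoringCtx drops the caller's context, as in the flagged line.
func fetchIgnoringCtx(_ context.Context, name string, latency time.Duration) (string, error) {
	return getPod(context.TODO(), name, latency)
}

// fetchWithCtx forwards the caller's context so cancellation propagates.
func fetchWithCtx(ctx context.Context, name string, latency time.Duration) (string, error) {
	return getPod(ctx, name, latency)
}

// cancelledFetch runs both variants under an already-cancelled context and
// reports whether each call observed the cancellation.
func cancelledFetch() (ignoredSawCancel, forwardedSawCancel bool) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	_, errIgnored := fetchIgnoringCtx(ctx, "gpu-pod", 20*time.Millisecond)
	_, errForwarded := fetchWithCtx(ctx, "gpu-pod", 20*time.Millisecond)
	return errors.Is(errIgnored, context.Canceled), errors.Is(errForwarded, context.Canceled)
}

func main() {
	ignored, forwarded := cancelledFetch()
	fmt.Println("context.TODO() variant saw cancellation:", ignored)
	fmt.Println("forwarded ctx variant saw cancellation:", forwarded)
}
```

With `context.TODO()` the call runs to completion even though the caller has already given up, which is why forwarding `ctx` matters in tests that enforce a timeout.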
events, err := GetPodEvents(clientSet, ns, pod.Name)
if err != nil {
	klog.Errorf("Failed to get events for %s/%s: %v", ns, pod.Name, err)
	return
Copilot AI, May 21, 2025
Returning here stops logging details for the remaining pods. Consider using `continue` to proceed to the next pod so that every failure is logged.
Suggested change: replace `return` with `continue`.
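The effect of `continue` versus `return` in a diagnostics loop can be shown with a small self-contained sketch. `fetchFunc`, `collectDetails`, and `fakeFetch` are hypothetical names standing in for helpers such as `GetPodEvents` or `GetPodLogs`; this is not the body of `CheckPodDetails`.

```go
package main

import (
	"errors"
	"fmt"
)

// fetchFunc is a hypothetical stand-in for a per-pod lookup such as
// fetching events or logs.
type fetchFunc func(pod string) (string, error)

// collectDetails gathers details for every pod. A failed lookup is recorded
// and skipped so that the remaining pods are still inspected.
func collectDetails(pods []string, fetch fetchFunc) (details map[string]string, failed []string) {
	details = make(map[string]string)
	for _, pod := range pods {
		out, err := fetch(pod)
		if err != nil {
			fmt.Printf("failed to get details for %s: %v\n", pod, err)
			failed = append(failed, pod)
			continue // a return here would skip every remaining pod
		}
		details[pod] = out
	}
	return details, failed
}

// fakeFetch fails for one pod to exercise the error path.
func fakeFetch(pod string) (string, error) {
	if pod == "pod-b" {
		return "", errors.New("events unavailable")
	}
	return "details for " + pod, nil
}

func main() {
	details, failed := collectDetails([]string{"pod-a", "pod-b", "pod-c"}, fakeFetch)
	fmt.Println("collected:", len(details), "failed:", failed)
}
```

Here `pod-c` is still inspected even though `pod-b` failed; with `return` in place of `continue`, the loop would stop at `pod-b`.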
logs, err := GetPodLogs(clientSet, ns, pod.Name)
if err != nil {
	klog.Errorf("Failed to get logs for %s/%s: %v", ns, pod.Name, err)
	return
Copilot AI, May 21, 2025
As with events, use `continue` instead of `return` so that other pods are still checked and logged.
Suggested change: replace `return` with `continue`.
}

klog.Infof("Show logs for %s/%s:", ns, pod.Name)
klog.Infof(logs)
Copilot AI, May 21, 2025
[nitpick] Passing raw logs as the format string to `Infof` means any `%` sequences in them can be misread as formatting verbs. Use `klog.Info(logs)` or `klog.Infof("%s", logs)` instead.
Suggested change: replace `klog.Infof(logs)` with `klog.Infof("%s", logs)`.
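The pitfall in this nitpick can be reproduced with the standard `fmt` package, assuming `klog.Infof` follows the same Printf-style formatting rules. `unsafeFormat` and `safeFormat` are hypothetical helpers written for this sketch, and the sample log line is made up.

```go
package main

import "fmt"

// unsafeFormat treats msg itself as the format string, as Infof(logs) would.
// The call goes through a function value because this is deliberately wrong
// usage that static checkers may flag when written as a direct call.
func unsafeFormat(msg string) string {
	sprintf := fmt.Sprintf
	return sprintf(msg)
}

// safeFormat passes msg as an argument, so it is copied through verbatim.
func safeFormat(msg string) string {
	return fmt.Sprintf("%s", msg)
}

func main() {
	// A log line that happens to contain a '%' followed by a verb letter.
	logs := "download 50%done"

	fmt.Println("as format string:", unsafeFormat(logs))
	fmt.Println("as argument:     ", safeFormat(logs))
}
```

Pod logs are arbitrary text, so a stray `%d` or `%s` in them is enough to corrupt the output in the first form, while the second form leaves the text untouched.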
		return false, nil
	})
}
Copilot AI, May 21, 2025
[nitpick] Add a doc comment to describe the purpose and behavior of this public function for better maintainability.
Suggested change:
// GetNamespaceList retrieves a list of all namespaces in the Kubernetes cluster.
// It takes a Kubernetes clientset as input and returns a slice of namespace names
// or an error if the operation fails.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Improves failure handling in the e2e tests.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Error logs:
Does this PR introduce a user-facing change?: