feat(scheduler): add node nouse gpuuuid function #1206
base: master
Signed-off-by: yintong.huang <[email protected]>
* update documents Signed-off-by: limengxuan <[email protected]>
Signed-off-by: elrondwong <[email protected]>
add-nodelock-ut Signed-off-by: learner0810 <[email protected]>
Signed-off-by: penguin <[email protected]>
* update nodelock for mig instance & add document for mig monitor Signed-off-by: limengxuan <[email protected]>
* update documents for config Signed-off-by: limengxuan <[email protected]>
* Setting devicePlugin.compatWithCPUManager=true will set PASS_DEVICE_SPECS=true as an environment variable. Signed-off-by: 张 驰 <[email protected]> * Change the parameter compatWithCPUManager for setting the PASS_DEVICE_SPECS ENV to passDeviceSpecsEnabled, and set the default value to true. Signed-off-by: 张 驰 <[email protected]> --------- Signed-off-by: 张 驰 <[email protected]>
Signed-off-by: learner0810 <[email protected]>
…ler. (Project-HAMi#746) Signed-off-by: chaunceyjiang <[email protected]>
…t-HAMi#735) Signed-off-by: elrondwong <[email protected]>
Signed-off-by: bin <[email protected]>
Signed-off-by: elrondwong <[email protected]>
Signed-off-by: bin <[email protected]>
* add star history to readme, fix typos and add more contributors and maintainers. Signed-off-by: yangshiqi <[email protected]> * add spaces Signed-off-by: yangshiqi <[email protected]> --------- Signed-off-by: yangshiqi <[email protected]>
Signed-off-by: Fengyang <[email protected]>
Signed-off-by: windsonsea <[email protected]>
Signed-off-by: KubeKyrie <[email protected]>
Signed-off-by: yxxhero <[email protected]>
Signed-off-by: bin <[email protected]>
…t-HAMi#767) Signed-off-by: chaunceyjiang <[email protected]>
Signed-off-by: KubeKyrie <[email protected]>
Signed-off-by: KubeKyrie <[email protected]>
Signed-off-by: learner0810 <[email protected]>
Signed-off-by: chaunceyjiang <[email protected]>
Signed-off-by: wen.rui <[email protected]>
Signed-off-by: Rei1010 <[email protected]> Signed-off-by: wen.rui <[email protected]>
Signed-off-by: jinye <[email protected]>
Signed-off-by: jinye <[email protected]>
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6.10.0 to 6.11.0. - [Release notes](https://github.com/docker/build-push-action/releases) - [Commits](docker/build-push-action@v6.10.0...v6.11.0) --- updated-dependencies: - dependency-name: docker/build-push-action dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]>
* update version Signed-off-by: limengxuan <[email protected]> * * update Signed-off-by: limengxuan <[email protected]> --------- Signed-off-by: limengxuan <[email protected]> Signed-off-by: limengxuan <[email protected]> * Update hami-core version to fix (Project-HAMi#1256) * update libvgpu Signed-off-by: limengxuan <[email protected]> * update version Signed-off-by: limengxuan <[email protected]> * update_hami_core Signed-off-by: limengxuan <[email protected]> * Update hami version to 2.6.1, and add a helm template validation command and values documentation to the Makefile. Signed-off-by: clcc2019 <[email protected]> * Update charts/hami/README.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: song duan <[email protected]> * Update charts/hami/README.md Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: song duan <[email protected]> * Remove the helm template command from the Makefile to simplify the build process. Signed-off-by: song duan <[email protected]> * docs: translate Chinese content to English in chart README - Translate devicePlugin.deviceSplitCount description from Chinese to English - Translate devicePlugin.migStrategy description from Chinese to English - Translate devicePlugin.disablecorelimit description from Chinese to English - Ensure all parameter descriptions are now in English for consistency Signed-off-by: clcc2019 <[email protected]> --------- Signed-off-by: limengxuan <[email protected]> Signed-off-by: limengxuan <[email protected]> Signed-off-by: clcc2019 <[email protected]> Signed-off-by: song duan <[email protected]> Co-authored-by: limengxuan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ct-HAMi#1285) Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.74.2 to 1.75.0. - [Release notes](https://github.com/grpc/grpc-go/releases) - [Commits](grpc/grpc-go@v1.74.2...v1.75.0) --- updated-dependencies: - dependency-name: google.golang.org/grpc dependency-version: 1.75.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Please rebase the code to pass the CI.
Signed-off-by: Lei Guo <[email protected]> Co-authored-by: Lei Guo <[email protected]>
…roject-HAMi#1296) * remake utils --------- Signed-off-by: limengxuan <[email protected]>
Signed-off-by: james <[email protected]>
…ject-HAMi#1299) * fix#1050 Signed-off-by: Jifei Wang <[email protected]> * fmt Signed-off-by: Jifei Wang <[email protected]> * resolve confilict Signed-off-by: Jifei Wang <[email protected]> --------- Signed-off-by: Jifei Wang <[email protected]>
* clear and correct ascend device name Signed-off-by: Jifei Wang <[email protected]> * recover changes Signed-off-by: Jifei Wang <[email protected]> * fix: fix golangci-lint error (Project-HAMi#1319) Signed-off-by: james <[email protected]> --------- Signed-off-by: Jifei Wang <[email protected]> Signed-off-by: james <[email protected]> Co-authored-by: james <[email protected]>
* docs: update ascend910b-support docs Signed-off-by: james <[email protected]> * refactor: add blank Signed-off-by: james <[email protected]> --------- Signed-off-by: james <[email protected]>
Bumps [actions/stale](https://github.com/actions/stale) from 9 to 10. - [Release notes](https://github.com/actions/stale/releases) - [Changelog](https://github.com/actions/stale/blob/main/CHANGELOG.md) - [Commits](actions/stale@v9...v10) --- updated-dependencies: - dependency-name: actions/stale dependency-version: '10' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…oject-HAMi#1324) Bumps [aquasecurity/trivy-action](https://github.com/aquasecurity/trivy-action) from 0.32.0 to 0.33.1. - [Release notes](https://github.com/aquasecurity/trivy-action/releases) - [Commits](aquasecurity/trivy-action@0.32.0...0.33.1) --- updated-dependencies: - dependency-name: aquasecurity/trivy-action dependency-version: 0.33.1 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…roject-HAMi#1305) Bumps [github.com/stretchr/testify](https://github.com/stretchr/testify) from 1.10.0 to 1.11.1. - [Release notes](https://github.com/stretchr/testify/releases) - [Commits](stretchr/testify@v1.10.0...v1.11.1) --- updated-dependencies: - dependency-name: github.com/stretchr/testify dependency-version: 1.11.1 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: ZhengW22 <[email protected]>
Signed-off-by: ZhengW22 <[email protected]>
Signed-off-by: ZhengW22 <[email protected]>
Signed-off-by: ZhengW22 <[email protected]>
Signed-off-by: ZhengW22 <[email protected]>
…." This reverts commit 8ba1840. Signed-off-by: ZhengW22 <[email protected]>
Signed-off-by: ZhengW22 <[email protected]>
Signed-off-by: ZhengW22 <[email protected]>
New changes are detected. LGTM label has been removed.
Signed-off-by: ZhengW22 <[email protected]>
@Shouren @archlitchi Hello, I have rebased the code and fixed some problems.
@Shouren Hello, could you help review the new code?
/gemini review
Code Review
This pull request introduces a useful feature to disable specific GPUs on a node via annotations. The overall approach of filtering out disabled devices from the scheduler's node cache is sound. However, the description of the implementation in the PR body is inaccurate. It states that disabled GPUs are marked as fully utilized, but the code actually removes them from the node's device list. Please update the description to reflect the actual implementation.
I've found a couple of critical bugs. One is that UUIDs from annotations are not correctly trimmed, which would cause the feature to fail with whitespace in the annotation value. Another is a reference to a constant that doesn't seem to be defined, which will cause a compilation failure. I've also provided some suggestions for refactoring to improve code clarity and for adding a test case to improve coverage. Please review the comments.
iluvatar.IluvatarNoUseUUID: {iluvatar.IluvatarGPUDevice},
enflame.EnflameNoUseUUID: {enflame.EnflameGPUDevice},
mthreads.MthreadsNoUseUUID: {mthreads.MthreadsGPUDevice},
metax.MetaxNoUseUUID: {metax.MetaxGPUDevice, metax.MetaxSGPUDevice},
The constant metax.MetaxNoUseUUID is used here, but it does not appear to be defined in the metax package based on the provided files. This will likely cause a compilation error. Please ensure this constant is defined, for example in pkg/device/metax/device.go or a similar file, with a value like "metax-tech.com/nouse-gpuuuid".
if vendorWithDisableGPUUUIDMap[vendor] == nil {
	newVendorMap := make(map[string]bool)
	newVendorMap[disableGPUUUID] = true
	vendorWithDisableGPUUUIDMap[vendor] = newVendorMap
} else {
	vendorWithDisableGPUUUIDMap[vendor][disableGPUUUID] = true
}
There's a bug in this section. The disableGPUUUID variable, which may contain leading/trailing whitespace, is used as a map key. The trimmed version id should be used instead to ensure correct matching. Additionally, the map initialization logic can be simplified to be more idiomatic.
if vendorWithDisableGPUUUIDMap[vendor] == nil {
	vendorWithDisableGPUUUIDMap[vendor] = make(map[string]bool)
}
vendorWithDisableGPUUUIDMap[vendor][id] = true

for _, d := range nodeInfo.Devices {
	removeFlag := false
	if disableGPUUUIDMap, ok := vendorWithDisableGPUUUIDMap[d.DeviceVendor]; ok {
		if ok := disableGPUUUIDMap[d.ID]; ok {
			klog.V(5).Infof("Disable gpu uuid is : %s", d.ID)
			removeFlag = true
		}
	}
	if !removeFlag {
		tmp = append(tmp, d)
	}
}
This loop for filtering devices can be simplified for better readability. Using a continue statement would make the logic clearer and avoid the removeFlag variable. This also resolves the shadowing of the ok variable, which is a good practice to avoid.
for _, d := range nodeInfo.Devices {
	if disableGPUUUIDMap, ok := vendorWithDisableGPUUUIDMap[d.DeviceVendor]; ok {
		if _, disabled := disableGPUUUIDMap[d.ID]; disabled {
			klog.V(5).Infof("Disabling gpu with uuid: %s", d.ID)
			continue
		}
	}
	tmp = append(tmp, d)
}

	}
}

func Test_rmDeviceByNodeAnnotation(t *testing.T) {
The test suite for rmDeviceByNodeAnnotation is good, but it would be beneficial to add a test case that verifies that UUIDs with leading/trailing whitespace in the annotation value are handled correctly. This would ensure the strings.TrimSpace logic is effective and prevent regressions.
For example:
{
	name: "Test remove device with whitespace in annotation",
	args: args{
		nodeInfo: &device.NodeInfo{
			Node:    &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{nvidia.GPUNoUseUUID: " " + id1 + " "}}},
			Devices: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
		},
	},
	want: []device.DeviceInfo{},
},
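Taken together, the review suggestions above (trim each UUID before using it as a map key, initialize the per-vendor map lazily, and skip disabled devices with `continue`) can be sketched as a small standalone program. This is only an illustration under stated assumptions, not the PR's actual code: the `DeviceInfo` struct is a simplified stand-in, `parseDisabledUUIDs` and `filterDevices` are hypothetical helper names, the annotation key and vendor name are placeholders, and the comma separator for multiple UUIDs is assumed.

```go
package main

import (
	"fmt"
	"strings"
)

// DeviceInfo is a simplified, hypothetical stand-in for the scheduler's
// device record; only the two fields used by the filter are modeled.
type DeviceInfo struct {
	DeviceVendor string
	ID           string
}

// parseDisabledUUIDs builds a vendor -> set-of-disabled-UUIDs map from node
// annotations. keyToVendors maps an annotation key to the vendor names it
// applies to. The comma separator is an assumption of this sketch.
func parseDisabledUUIDs(annotations map[string]string, keyToVendors map[string][]string) map[string]map[string]bool {
	disabled := make(map[string]map[string]bool)
	for key, vendors := range keyToVendors {
		value, ok := annotations[key]
		if !ok {
			continue
		}
		for _, raw := range strings.Split(value, ",") {
			// Use the trimmed ID as the key so " GPU-aaa " still matches "GPU-aaa".
			id := strings.TrimSpace(raw)
			if id == "" {
				continue
			}
			for _, vendor := range vendors {
				if disabled[vendor] == nil {
					disabled[vendor] = make(map[string]bool)
				}
				disabled[vendor][id] = true
			}
		}
	}
	return disabled
}

// filterDevices returns the devices whose UUID is not disabled for their vendor.
func filterDevices(devices []DeviceInfo, disabled map[string]map[string]bool) []DeviceInfo {
	kept := make([]DeviceInfo, 0, len(devices))
	for _, d := range devices {
		// Indexing a missing (nil) inner map yields false, so no extra check is needed.
		if disabled[d.DeviceVendor][d.ID] {
			continue
		}
		kept = append(kept, d)
	}
	return kept
}

func main() {
	// Placeholder annotation key and vendor name, not taken from the project.
	annotations := map[string]string{"example.com/nouse-gpuuuid": " GPU-aaa , GPU-ccc"}
	keyToVendors := map[string][]string{"example.com/nouse-gpuuuid": {"ExampleGPU"}}
	devices := []DeviceInfo{
		{DeviceVendor: "ExampleGPU", ID: "GPU-aaa"},
		{DeviceVendor: "ExampleGPU", ID: "GPU-bbb"},
	}
	fmt.Println(filterDevices(devices, parseDisabledUUIDs(annotations, keyToVendors)))
}
```

Separating parsing from filtering keeps the whitespace handling in one place, so a table-driven test like the one suggested above only needs to vary the annotation string.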
@ZhengW22 Could you please take a look at the critical comments from gemini-code-assist?
What type of PR is this?
What this PR does / why we need it:
This PR adds the capability to disable GPUs at the node level by applying annotations to nodes. GPUs matching the specified UUIDs will no longer be allocated to any pods.
The implementation works by setting the used count of the corresponding node GPUs to their maximum capacity when calculating nodeUsage, effectively occupying those resources. This approach maintains compatibility with scheduling logic for different types of GPU cards.
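As a rough illustration of the approach as worded in this description (marking a disabled GPU as fully used rather than removing it), consider the sketch below. Note that the code review above says the submitted code removes the devices from the node's device list instead, so this sketch illustrates only the description, not the merged behavior. The `DeviceUsage` struct, its fields, and the helpers `markDisabledAsFull` and `hasCapacity` are hypothetical names introduced for this example.

```go
package main

import "fmt"

// DeviceUsage is a hypothetical, simplified usage record for one GPU.
type DeviceUsage struct {
	ID       string
	Count    int32 // maximum number of tasks that may share the device
	Used     int32 // tasks currently assigned
	Totalmem int32 // total device memory
	Usedmem  int32 // device memory currently assigned
}

// markDisabledAsFull sets the used counters of every disabled device to its
// capacity, so a capacity check treats the device as having nothing left.
func markDisabledAsFull(devices []DeviceUsage, disabled map[string]bool) {
	for i := range devices {
		if !disabled[devices[i].ID] {
			continue
		}
		devices[i].Used = devices[i].Count
		devices[i].Usedmem = devices[i].Totalmem
	}
}

// hasCapacity reports whether a device can still accept one more task.
func hasCapacity(d DeviceUsage) bool {
	return d.Used < d.Count && d.Usedmem < d.Totalmem
}

func main() {
	devices := []DeviceUsage{
		{ID: "GPU-aaa", Count: 10, Totalmem: 16384},
		{ID: "GPU-bbb", Count: 10, Totalmem: 16384},
	}
	markDisabledAsFull(devices, map[string]bool{"GPU-aaa": true})
	for _, d := range devices {
		fmt.Println(d.ID, hasCapacity(d))
	}
}
```

The trade-off between the two designs is visibility: a device marked as full still appears in the node's device list, whereas a removed device disappears from the scheduler's view entirely.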
Which issue(s) this PR fixes:
No.
Special notes for your reviewer:
No.
Does this PR introduce a user-facing change?:
No.