Conversation

@ZhengW22

What type of PR is this?

What this PR does / why we need it:
This PR adds the capability to disable GPUs at the node level by applying annotations to nodes. GPUs matching the specified UUIDs will no longer be allocated to any pods.

The implementation works by setting the used count of the corresponding node GPUs to their maximum capacity when calculating nodeUsage, effectively occupying those resources. This approach maintains compatibility with scheduling logic for different types of GPU cards.
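
As a rough illustration of the approach described above, the sketch below marks each disabled GPU as fully used so that a scheduler sees no free capacity on it. The `DeviceUsage` type, its fields, and the `occupyDisabled` helper are hypothetical names chosen for this example; they are not taken from the PR's code.

```go
package main

import "fmt"

// DeviceUsage is a hypothetical, simplified view of one GPU's usage on a node.
type DeviceUsage struct {
	ID        string
	Count     int32 // maximum number of tasks that may share the device
	Used      int32 // number of sharing slots currently taken
	Totalmem  int32 // total device memory
	Usedmem   int32 // device memory currently allocated
	Totalcore int32 // total compute share (for example, 100 percent)
	Usedcores int32 // compute share currently allocated
}

// occupyDisabled sets every usage counter of a disabled device to its
// capacity, so no further request can fit on that device.
func occupyDisabled(devs []DeviceUsage, disabled map[string]bool) {
	for i := range devs {
		if !disabled[devs[i].ID] {
			continue
		}
		devs[i].Used = devs[i].Count
		devs[i].Usedmem = devs[i].Totalmem
		devs[i].Usedcores = devs[i].Totalcore
	}
}

func main() {
	devs := []DeviceUsage{
		{ID: "GPU-aaa", Count: 10, Totalmem: 16384, Totalcore: 100},
		{ID: "GPU-bbb", Count: 10, Totalmem: 16384, Totalcore: 100},
	}
	occupyDisabled(devs, map[string]bool{"GPU-aaa": true})
	for _, d := range devs {
		fmt.Printf("%s used=%d/%d mem=%d/%d\n", d.ID, d.Used, d.Count, d.Usedmem, d.Totalmem)
	}
}
```

Because the device stays in the list and only its counters change, any vendor-specific fit check that compares requested against free resources would reject it without extra branching.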

Which issue(s) this PR fixes:
No.

Special notes for your reviewer:
No.

Does this PR introduce a user-facing change?:
No.

yt-huang and others added 30 commits December 22, 2024 10:21
add-nodelock-ut

Signed-off-by: learner0810 <[email protected]>
* update nodelock for mig instance & add document for mig monitor

Signed-off-by: limengxuan <[email protected]>
* update documents for config

Signed-off-by: limengxuan <[email protected]>
* Setting devicePlugin.compatWithCPUManager=true will set PASS_DEVICE_SPECS=true as an environment variable.

Signed-off-by: 张 驰 <[email protected]>

* Change the parameter compatWithCPUManager for setting the PASS_DEVICE_SPECS ENV to passDeviceSpecsEnabled, and set the default value to true.

Signed-off-by: 张 驰 <[email protected]>

---------

Signed-off-by: 张 驰 <[email protected]>
* add star history to readme, fix typos and add more contributors and maintainers.

Signed-off-by: yangshiqi <[email protected]>

* add spaces

Signed-off-by: yangshiqi <[email protected]>

---------

Signed-off-by: yangshiqi <[email protected]>
Signed-off-by: KubeKyrie <[email protected]>
Signed-off-by: KubeKyrie <[email protected]>
Signed-off-by: Rei1010 <[email protected]>
Signed-off-by: wen.rui <[email protected]>
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6.10.0 to 6.11.0.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@v6.10.0...v6.11.0)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
clcc2019 and others added 2 commits August 25, 2025 04:28
* update version

Signed-off-by: limengxuan <[email protected]>

* * update

Signed-off-by: limengxuan <[email protected]>

---------

Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>

* Update hami-core version to fix (Project-HAMi#1256)


* update libvgpu

Signed-off-by: limengxuan <[email protected]>

* update version

Signed-off-by: limengxuan <[email protected]>

* update_hami_core

Signed-off-by: limengxuan <[email protected]>

* Update hami version to 2.6.1, and add a helm template validation command and values documentation to the Makefile.

Signed-off-by: clcc2019 <[email protected]>

* Update charts/hami/README.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: song duan <[email protected]>

* Update charts/hami/README.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: song duan <[email protected]>

* Remove the helm template command from the Makefile to simplify the build process.

Signed-off-by: song duan <[email protected]>

* docs: translate Chinese content to English in chart README

- Translate devicePlugin.deviceSplitCount description from Chinese to English
- Translate devicePlugin.migStrategy description from Chinese to English
- Translate devicePlugin.disablecorelimit description from Chinese to English
- Ensure all parameter descriptions are now in English for consistency

Signed-off-by: clcc2019 <[email protected]>

---------

Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>
Signed-off-by: clcc2019 <[email protected]>
Signed-off-by: song duan <[email protected]>
Co-authored-by: limengxuan <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ct-HAMi#1285)

Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.74.2 to 1.75.0.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](grpc/grpc-go@v1.74.2...v1.75.0)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.75.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@archlitchi
Member

Please rebase the code so that the CI passes.

Kyrie336 and others added 17 commits August 26, 2025 03:50
…ject-HAMi#1299)

* fix#1050

Signed-off-by: Jifei Wang <[email protected]>

* fmt

Signed-off-by: Jifei Wang <[email protected]>

* resolve conflict

Signed-off-by: Jifei Wang <[email protected]>

---------

Signed-off-by: Jifei Wang <[email protected]>
* clear and correct ascend device name

Signed-off-by: Jifei Wang <[email protected]>

* recover changes

Signed-off-by: Jifei Wang <[email protected]>

* fix: fix golangci-lint error (Project-HAMi#1319)

Signed-off-by: james <[email protected]>

---------

Signed-off-by: Jifei Wang <[email protected]>
Signed-off-by: james <[email protected]>
Co-authored-by: james <[email protected]>
* docs: update ascend910b-support docs

Signed-off-by: james <[email protected]>

* refactor: add blank

Signed-off-by: james <[email protected]>

---------

Signed-off-by: james <[email protected]>
Bumps [actions/stale](https://github.com/actions/stale) from 9 to 10.
- [Release notes](https://github.com/actions/stale/releases)
- [Changelog](https://github.com/actions/stale/blob/main/CHANGELOG.md)
- [Commits](actions/stale@v9...v10)

---
updated-dependencies:
- dependency-name: actions/stale
  dependency-version: '10'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…oject-HAMi#1324)

Bumps [aquasecurity/trivy-action](https://github.com/aquasecurity/trivy-action) from 0.32.0 to 0.33.1.
- [Release notes](https://github.com/aquasecurity/trivy-action/releases)
- [Commits](aquasecurity/trivy-action@0.32.0...0.33.1)

---
updated-dependencies:
- dependency-name: aquasecurity/trivy-action
  dependency-version: 0.33.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…roject-HAMi#1305)

Bumps [github.com/stretchr/testify](https://github.com/stretchr/testify) from 1.10.0 to 1.11.1.
- [Release notes](https://github.com/stretchr/testify/releases)
- [Commits](stretchr/testify@v1.10.0...v1.11.1)

---
updated-dependencies:
- dependency-name: github.com/stretchr/testify
  dependency-version: 1.11.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: ZhengW22 <[email protected]>
Signed-off-by: ZhengW22 <[email protected]>
Signed-off-by: ZhengW22 <[email protected]>
Signed-off-by: ZhengW22 <[email protected]>
@hami-robot
Contributor

hami-robot bot commented Sep 10, 2025

New changes are detected. LGTM label has been removed.

@ZhengW22
Author

@Shouren @archlitchi Hello, I have rebased the code and fixed some problems.

@ZhengW22
Author

@Shouren Hello, could you help review the new code?

@wawa0210
Member

wawa0210 commented Dec 8, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a useful feature to disable specific GPUs on a node via annotations. The overall approach of filtering out disabled devices from the scheduler's node cache is sound. However, the description of the implementation in the PR body is inaccurate. It states that disabled GPUs are marked as fully utilized, but the code actually removes them from the node's device list. Please update the description to reflect the actual implementation.

I've found a couple of critical bugs. One is that UUIDs from annotations are not correctly trimmed, which would cause the feature to fail with whitespace in the annotation value. Another is a reference to a constant that doesn't seem to be defined, which will cause a compilation failure. I've also provided some suggestions for refactoring to improve code clarity and for adding a test case to improve coverage. Please review the comments.

iluvatar.IluvatarNoUseUUID: {iluvatar.IluvatarGPUDevice},
enflame.EnflameNoUseUUID: {enflame.EnflameGPUDevice},
mthreads.MthreadsNoUseUUID: {mthreads.MthreadsGPUDevice},
metax.MetaxNoUseUUID: {metax.MetaxGPUDevice, metax.MetaxSGPUDevice},
Contributor

critical

The constant metax.MetaxNoUseUUID is used here, but it does not appear to be defined in the metax package based on the provided files. This will likely cause a compilation error. Please ensure this constant is defined, for example in pkg/device/metax/device.go or a similar file, with a value like "metax-tech.com/nouse-gpuuuid".
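
A minimal sketch of what such a definition and its lookup table might look like follows. The annotation value is the one suggested in the comment above and should be treated as an assumption; the device type strings and the `vendorsFor` helper are placeholders for illustration, not the values used in the `metax` package.

```go
package main

import "fmt"

const (
	// MetaxNoUseUUID is the node annotation key listing disabled GPU UUIDs.
	// The value is the one proposed in the review, not a confirmed constant.
	MetaxNoUseUUID = "metax-tech.com/nouse-gpuuuid"

	// Placeholder device type names, chosen only for this example.
	MetaxGPUDevice  = "Metax-GPU"
	MetaxSGPUDevice = "Metax-SGPU"
)

// annotationToVendors maps an annotation key to the device types it affects.
var annotationToVendors = map[string][]string{
	MetaxNoUseUUID: {MetaxGPUDevice, MetaxSGPUDevice},
}

// vendorsFor returns the device types governed by an annotation key,
// or nil when the key is not registered.
func vendorsFor(key string) []string {
	return annotationToVendors[key]
}

func main() {
	fmt.Println(vendorsFor(MetaxNoUseUUID))
	fmt.Println(len(vendorsFor("example.com/unknown")))
}
```

If the constant is missing, a composite literal like `annotationToVendors` fails at compile time with an undefined-identifier error, which is the failure the comment warns about.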

Comment on lines +108 to +114
if vendorWithDisableGPUUUIDMap[vendor] == nil {
    newVendorMap := make(map[string]bool)
    newVendorMap[disableGPUUUID] = true
    vendorWithDisableGPUUUIDMap[vendor] = newVendorMap
} else {
    vendorWithDisableGPUUUIDMap[vendor][disableGPUUUID] = true
}
Contributor

critical

There's a bug in this section. The disableGPUUUID variable, which may contain leading/trailing whitespace, is used as a map key. The trimmed version id should be used instead to ensure correct matching. Additionally, the map initialization logic can be simplified to be more idiomatic.

    if vendorWithDisableGPUUUIDMap[vendor] == nil {
        vendorWithDisableGPUUUIDMap[vendor] = make(map[string]bool)
    }
    vendorWithDisableGPUUUIDMap[vendor][id] = true
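
To show the trimming fix in context, here is a self-contained sketch of how the per-vendor set could be built from node annotations. It assumes the annotation value is a comma-separated list of UUIDs; `buildDisabledMap`, the annotation key, and the vendor name are hypothetical and used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// buildDisabledMap returns vendor -> set of disabled UUIDs.
// annotations holds the node's annotations; keyToVendors maps each
// annotation key to the vendors whose devices it disables.
func buildDisabledMap(annotations map[string]string, keyToVendors map[string][]string) map[string]map[string]bool {
	result := make(map[string]map[string]bool)
	for key, vendors := range keyToVendors {
		value, ok := annotations[key]
		if !ok {
			continue
		}
		// Assumption: UUIDs are separated by commas.
		for _, raw := range strings.Split(value, ",") {
			// Use the trimmed ID as the key, never the raw token.
			id := strings.TrimSpace(raw)
			if id == "" {
				continue // skip empty tokens such as a trailing comma
			}
			for _, vendor := range vendors {
				if result[vendor] == nil {
					result[vendor] = make(map[string]bool)
				}
				result[vendor][id] = true
			}
		}
	}
	return result
}

func main() {
	annotations := map[string]string{"example.com/nouse-gpuuuid": " GPU-aaa , GPU-bbb,"}
	keyToVendors := map[string][]string{"example.com/nouse-gpuuuid": {"ExampleGPU"}}
	m := buildDisabledMap(annotations, keyToVendors)
	fmt.Println(m["ExampleGPU"]["GPU-aaa"], m["ExampleGPU"]["GPU-bbb"], len(m["ExampleGPU"]))
}
```

Keying the set by the raw token would store `" GPU-aaa "` with its spaces, so a later lookup by the device's real ID would miss and the GPU would remain schedulable.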

Comment on lines +125 to +136
for _, d := range nodeInfo.Devices {
    removeFlag := false
    if disableGPUUUIDMap, ok := vendorWithDisableGPUUUIDMap[d.DeviceVendor]; ok {
        if ok := disableGPUUUIDMap[d.ID]; ok {
            klog.V(5).Infof("Disable gpu uuid is : %s", d.ID)
            removeFlag = true
        }
    }
    if !removeFlag {
        tmp = append(tmp, d)
    }
}
Contributor

medium

This loop for filtering devices can be simplified for better readability. Using a continue statement would make the logic clearer and avoid the removeFlag variable. This also resolves the shadowing of the ok variable, which is a good practice to avoid.

    for _, d := range nodeInfo.Devices {
        if disableGPUUUIDMap, ok := vendorWithDisableGPUUUIDMap[d.DeviceVendor]; ok {
            if _, disabled := disableGPUUUIDMap[d.ID]; disabled {
                klog.V(5).Infof("Disabling gpu with uuid: %s", d.ID)
                continue
            }
        }
        tmp = append(tmp, d)
    }
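
The filtering step can also be tried in isolation. The sketch below is a standalone variant under simplified assumptions: `DeviceInfo` is reduced to two fields and `filterDisabled` is a hypothetical helper, not the PR's function. It indexes the nested map directly, which works because reading from a nil or missing inner map in Go yields the zero value `false`.

```go
package main

import "fmt"

// DeviceInfo is a simplified stand-in for the scheduler's device record.
type DeviceInfo struct {
	ID           string
	DeviceVendor string
}

// filterDisabled returns the devices whose (vendor, ID) pair is not marked
// as disabled. The input slice is left unmodified.
func filterDisabled(devices []DeviceInfo, disabled map[string]map[string]bool) []DeviceInfo {
	kept := make([]DeviceInfo, 0, len(devices))
	for _, d := range devices {
		// A missing vendor or ID reads as false, so no existence check is needed.
		if disabled[d.DeviceVendor][d.ID] {
			continue
		}
		kept = append(kept, d)
	}
	return kept
}

func main() {
	devices := []DeviceInfo{
		{ID: "GPU-aaa", DeviceVendor: "ExampleGPU"},
		{ID: "GPU-bbb", DeviceVendor: "ExampleGPU"},
		{ID: "GPU-aaa", DeviceVendor: "OtherGPU"},
	}
	disabled := map[string]map[string]bool{"ExampleGPU": {"GPU-aaa": true}}
	for _, d := range filterDisabled(devices, disabled) {
		fmt.Println(d.DeviceVendor, d.ID)
	}
}
```

One design point to weigh: the comma-ok form `_, disabled := m[id]` tests whether the key is present, while the plain index `m[id]` tests the stored value. The two agree only as long as the map never stores `false`.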

}
}

func Test_rmDeviceByNodeAnnotation(t *testing.T) {
Contributor

medium

The test suite for rmDeviceByNodeAnnotation is good, but it would be beneficial to add a test case that verifies that UUIDs with leading/trailing whitespace in the annotation value are handled correctly. This would ensure the strings.TrimSpace logic is effective and prevent regressions.

For example:

{
    name: "Test remove device with whitespace in annotation",
    args: args{
        nodeInfo: &device.NodeInfo{
            Node:    &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{nvidia.GPUNoUseUUID: " " + id1 + " "}}},
            Devices: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
        },
    },
    want: []device.DeviceInfo{},
},

@wawa0210
Member

@ZhengW22 Could you please take a look at the critical comments from gemini-code-assist?
