Skip to content

Conversation

@qiangwei1983
Copy link
Contributor

What type of PR is this?
/kind bug

What this PR does / why we need it:
Fix the issue Iluvatar device scheduling policy binpack and spread are reversed

Which issue(s) this PR fixes:
Fixes #1589

lixd and others added 30 commits June 23, 2025 15:59
…t-HAMi#1138)

* fix: override node socre failure for kunlun Project-HAMi#1137

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

* Merge branch 'master' into bug-fix

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

* fix: update

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

---------

Signed-off-by: ouyangluwei(riseunion) <[email protected]>
Co-authored-by: ouyangluwei(riseunion) <[email protected]>
Signed-off-by: ouyangluwei(riseunion) <[email protected]>
Co-authored-by: ouyangluwei(riseunion) <[email protected]>
* fix: Multi-node scoring nodes are inaccurate

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

---------

Signed-off-by: ouyangluwei(riseunion) <[email protected]>
Co-authored-by: ouyangluwei(riseunion) <[email protected]>
Signed-off-by: ouyangluwei(riseunion) <[email protected]>
Co-authored-by: ouyangluwei(riseunion) <[email protected]>
…scheduler roles + a namespace-scoped role for leader election

Signed-off-by: antvirf <[email protected]>
…-in-release-changelog

feat: Add new labels in .github/release.yml
…perms

feat(scheduler-role): use a scoped-down role for scheduler
Signed-off-by: Jifei wang <[email protected]>

update vendor
…oject-HAMi#1161)

Bumps [aquasecurity/trivy-action](https://github.com/aquasecurity/trivy-action) from 0.31.0 to 0.32.0.
- [Release notes](https://github.com/aquasecurity/trivy-action/releases)
- [Commits](aquasecurity/trivy-action@0.31.0...0.32.0)

---
updated-dependencies:
- dependency-name: aquasecurity/trivy-action
  dependency-version: 0.32.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* feat(helm): optionally disable admission webhook

This simplifies the deployment considerably and makes HAMi less
intrusive inclusters where only a minority of workloads actually require
GPU scheduling.

Signed-off-by: antvirf <[email protected]>

* fix(chart.yaml): keep original version from master

Signed-off-by: antvirf <[email protected]>

* docs(helm): comment to explain impact of disabling admissionWebhook

Signed-off-by: antvirf <[email protected]>

---------

Signed-off-by: antvirf <[email protected]>
Signed-off-by: Yunlu Wen <[email protected]>
Co-authored-by: Yunlu Wen <[email protected]>
Fix e2e CI

Signed-off-by: limengxuan <[email protected]>
…vidia

Add basic test cases for enflame,hygon, metax, mthreads, nvidia module to verify Fit function.
Includes positive and negative test scenarios.

Signed-off-by: wangmin <[email protected]>

Co-authored-by: wangmin <[email protected]>
DSFans2014 and others added 13 commits January 12, 2026 03:36
…here are multiple containers in one pod (Project-HAMi#1579)

* fix: fix resource quota

Signed-off-by: james <[email protected]>

* fix: fix test case

Signed-off-by: james <[email protected]>

---------

Signed-off-by: james <[email protected]>
* add modernize check

Signed-off-by: dongjiang1989 <[email protected]>

* add unittest case

Signed-off-by: dongjiang1989 <[email protected]>

---------

Signed-off-by: dongjiang1989 <[email protected]>
…oject-HAMi#1584)

Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.27.3 to 2.27.5.
- [Release notes](https://github.com/onsi/ginkgo/releases)
- [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md)
- [Commits](onsi/ginkgo@v2.27.3...v2.27.5)

---
updated-dependencies:
- dependency-name: github.com/onsi/ginkgo/v2
  dependency-version: 2.27.5
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…Mi#1586)

Bumps [golang.org/x/term](https://github.com/golang/term) from 0.38.0 to 0.39.0.
- [Commits](golang/term@v0.38.0...v0.39.0)

---
updated-dependencies:
- dependency-name: golang.org/x/term
  dependency-version: 0.39.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* update

Signed-off-by: limengxuan <[email protected]>

* update

Signed-off-by: limengxuan <[email protected]>

* update chart for CDI

Signed-off-by: limengxuan <[email protected]>

* update docs

Signed-off-by: limengxuan <[email protected]>

* Update docs/config_cn.md

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update docs/config.md

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* update documents

Signed-off-by: limengxuan <[email protected]>

* fix kunlunxin vxpu issue

Signed-off-by: limengxuan <[email protected]>

* Update pkg/device/kunlun/vdevice.go

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* update

Signed-off-by: limengxuan <[email protected]>

* update Makefile for helm package

Signed-off-by: limengxuan <[email protected]>

* update discord invitation

Signed-off-by: limengxuan <[email protected]>

* Update README.md

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update README_cn.md

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update README_ja.md

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Signed-off-by: limengxuan <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…i#1587)

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.48.0 to 0.49.0.
- [Commits](golang/net@v0.48.0...v0.49.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-version: 0.49.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ct-HAMi#1585)

Bumps [github.com/onsi/gomega](https://github.com/onsi/gomega) from 1.38.3 to 1.39.0.
- [Release notes](https://github.com/onsi/gomega/releases)
- [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md)
- [Commits](onsi/gomega@v1.38.3...v1.39.0)

---
updated-dependencies:
- dependency-name: github.com/onsi/gomega
  dependency-version: 1.39.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ject-HAMi#1592)

Bumps [github.com/sirupsen/logrus](https://github.com/sirupsen/logrus) from 1.9.3 to 1.9.4.
- [Release notes](https://github.com/sirupsen/logrus/releases)
- [Changelog](https://github.com/sirupsen/logrus/blob/master/CHANGELOG.md)
- [Commits](sirupsen/logrus@v1.9.3...v1.9.4)

---
updated-dependencies:
- dependency-name: github.com/sirupsen/logrus
  dependency-version: 1.9.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…AMi#1588)

Bumps [golang.org/x/tools](https://github.com/golang/tools) from 0.40.0 to 0.41.0.
- [Release notes](https://github.com/golang/tools/releases)
- [Commits](golang/tools@v0.40.0...v0.41.0)

---
updated-dependencies:
- dependency-name: golang.org/x/tools
  dependency-version: 0.41.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@hami-robot hami-robot bot added kind/bug Something isn't working dco-signoff: yes labels Jan 19, 2026
@hami-robot hami-robot bot requested review from lengrongfu and wawa0210 January 19, 2026 02:04
@hami-robot
Copy link
Contributor

hami-robot bot commented Jan 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: qiangwei1983
Once this PR has been reviewed and has the lgtm label, please assign shouren for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @qiangwei1983, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a bug in the Iluvatar device scheduling mechanism where the binpack and spread policies were effectively reversed. The fix involves a targeted modification to the device iteration order within the Fit function, ensuring that devices are processed in the correct sequence to align with the intended scheduling behavior and resolve the reported functional discrepancy.

Highlights

  • Iluvatar Device Scheduling Policy Fix: Corrected the logic for Iluvatar device scheduling policies by reversing the iteration order when fitting devices. This change addresses the issue where binpack and spread policies were inadvertently swapped, ensuring they now function as intended.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@hami-robot hami-robot bot added the size/XS label Jan 19, 2026
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses an issue where the Iluvatar device scheduling policies for binpack and spread were reversed. The fix involves reversing the loop iteration order in the Fit function, which is a correct approach to solve the described problem. However, the unit tests have not been updated to account for this change in behavior, which will cause them to fail. I have left a comment detailing the required changes in the test file.

tmpDevs = make(map[string]device.ContainerDevices)
reason := make(map[string]int)
for i := range devices {
for i := len(devices) - 1; i >= 0; i-- {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This change correctly reverses the iteration order to fix the scheduling policy issue. However, it appears the unit tests have not been updated to reflect this change.

Specifically, Test_Fit in pkg/device/iluvatar/device_test.go has a "fit success" test case that provides two identical devices, dev-0 and dev-1, and expects dev-0 to be selected. With this change, the loop will iterate in reverse, so dev-1 will be selected first, causing the test to fail.

Please update the test case to expect dev-1 to be selected to ensure the tests pass and correctly reflect the new behavior.

@archlitchi
Copy link
Member

please update the unit_test to pass the CI

tmpDevs = make(map[string]device.ContainerDevices)
reason := make(map[string]int)
for i := range devices {
for i := len(devices) - 1; i >= 0; i-- {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiangwei1983 I think it it better to use slices.Backward for traversing in reverse order with descending indices

@codecov
Copy link

codecov bot commented Jan 19, 2026

Codecov Report

❌ Patch coverage is 40.66852% with 1065 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ce-plugin/nvidiadevice/nvinternal/plugin/server.go 16.01% 169 Missing and 4 partials ⚠️
...evice-plugin/nvidiadevice/nvinternal/rm/rm_mock.go 0.00% 130 Missing ⚠️
...vice-plugin/nvidiadevice/nvinternal/plugin/util.go 22.58% 93 Missing and 3 partials ⚠️
pkg/device/awsneuron/device.go 76.36% 61 Missing and 8 partials ⚠️
...g/device-plugin/nvidiadevice/nvinternal/cdi/cdi.go 0.00% 68 Missing ⚠️
...-plugin/nvidiadevice/nvinternal/plugin/register.go 0.00% 66 Missing ⚠️
...e-plugin/nvidiadevice/nvinternal/plugin/factory.go 0.00% 48 Missing ⚠️
pkg/device/ascend/device.go 74.19% 42 Missing and 6 partials ⚠️
...device-plugin/nvidiadevice/nvinternal/imex/imex.go 0.00% 35 Missing ⚠️
pkg/device/amd/device.go 75.86% 31 Missing and 4 partials ⚠️
... and 19 more
Flag Coverage Δ
unittests 51.09% <40.66%> (-10.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...ice-plugin/nvidiadevice/nvinternal/info/version.go 0.00% <ø> (ø)
...device-plugin/nvidiadevice/nvinternal/rm/helper.go 100.00% <100.00%> (ø)
pkg/device/common/common.go 53.33% <ø> (ø)
pkg/device/devices.go 81.12% <ø> (+7.47%) ⬆️
pkg/device/enflame/config.go 100.00% <ø> (ø)
pkg/device/enflame/device.go 83.25% <ø> (+5.16%) ⬆️
pkg/device/enflame/gcu.go 90.35% <ø> (ø)
pkg/device/hygon/device.go 91.22% <ø> (-2.15%) ⬇️
pkg/device/iluvatar/device.go 59.29% <ø> (-26.94%) ⬇️
pkg/device/kunlun/config.go 0.00% <ø> (ø)
... and 60 more

... and 5 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@archlitchi
Copy link
Member

@ouyangluwei163 do you recall why we iterate iluvatar device in forward order? please reply here if you do

@ouyangluwei163
Copy link
Contributor

#933

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Scheduler Policy Bug] hami.io/gpu-scheduler-policy (spread/binpack) functionality is reversed