Skip to content

Conversation

@dongjiang1989
Copy link
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add promtheus podMonitor in helm-charts

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

ouyangluwei163 and others added 30 commits June 24, 2025 17:48
Signed-off-by: ouyangluwei(riseunion) <[email protected]>
Co-authored-by: ouyangluwei(riseunion) <[email protected]>
* fix: Multi-node scoring nodes are inaccurate

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <[email protected]>

---------

Signed-off-by: ouyangluwei(riseunion) <[email protected]>
Co-authored-by: ouyangluwei(riseunion) <[email protected]>
Signed-off-by: ouyangluwei(riseunion) <[email protected]>
Co-authored-by: ouyangluwei(riseunion) <[email protected]>
…scheduler roles + a namespace-scoped role for leader election

Signed-off-by: antvirf <[email protected]>
…-in-release-changelog

feat: Add new labels in .github/release.yml
…perms

feat(scheduler-role): use a scoped-down role for scheduler
Signed-off-by: Jifei wang <[email protected]>

update vendor
…oject-HAMi#1161)

Bumps [aquasecurity/trivy-action](https://github.com/aquasecurity/trivy-action) from 0.31.0 to 0.32.0.
- [Release notes](https://github.com/aquasecurity/trivy-action/releases)
- [Commits](aquasecurity/trivy-action@0.31.0...0.32.0)

---
updated-dependencies:
- dependency-name: aquasecurity/trivy-action
  dependency-version: 0.32.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* feat(helm): optionally disable admission webhook

This simplifies the deployment considerably and makes HAMi less
intrusive inclusters where only a minority of workloads actually require
GPU scheduling.

Signed-off-by: antvirf <[email protected]>

* fix(chart.yaml): keep original version from master

Signed-off-by: antvirf <[email protected]>

* docs(helm): comment to explain impact of disabling admissionWebhook

Signed-off-by: antvirf <[email protected]>

---------

Signed-off-by: antvirf <[email protected]>
Signed-off-by: Yunlu Wen <[email protected]>
Co-authored-by: Yunlu Wen <[email protected]>
Fix e2e CI

Signed-off-by: limengxuan <[email protected]>
…vidia

Add basic test cases for enflame,hygon, metax, mthreads, nvidia module to verify Fit function.
Includes positive and negative test scenarios.

Signed-off-by: wangmin <[email protected]>

Co-authored-by: wangmin <[email protected]>
Add basic test cases for cambricon module to verify Fit function.
Includes positive and negative test scenarios.

Co-authored-by: wangmin <[email protected]>
Add basic test cases for Ascend module to verify Fit function.
Includes positive and negative test scenarios.

Co-authored-by: wangmin <[email protected]>
…roject-HAMi#1186)

Bumps [github.com/fsnotify/fsnotify](https://github.com/fsnotify/fsnotify) from 1.7.0 to 1.9.0.
- [Release notes](https://github.com/fsnotify/fsnotify/releases)
- [Changelog](https://github.com/fsnotify/fsnotify/blob/main/CHANGELOG.md)
- [Commits](fsnotify/fsnotify@v1.7.0...v1.9.0)

---
updated-dependencies:
- dependency-name: github.com/fsnotify/fsnotify
  dependency-version: 1.9.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
IsQiao and others added 16 commits January 12, 2026 06:47
…here are multiple containers in one pod (Project-HAMi#1579)

* fix: fix resource quota

Signed-off-by: james <[email protected]>

* fix: fix test case

Signed-off-by: james <[email protected]>

---------

Signed-off-by: james <[email protected]>
* add modernize check

Signed-off-by: dongjiang1989 <[email protected]>

* add unittest case

Signed-off-by: dongjiang1989 <[email protected]>

---------

Signed-off-by: dongjiang1989 <[email protected]>
…oject-HAMi#1584)

Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.27.3 to 2.27.5.
- [Release notes](https://github.com/onsi/ginkgo/releases)
- [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md)
- [Commits](onsi/ginkgo@v2.27.3...v2.27.5)

---
updated-dependencies:
- dependency-name: github.com/onsi/ginkgo/v2
  dependency-version: 2.27.5
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…Mi#1586)

Bumps [golang.org/x/term](https://github.com/golang/term) from 0.38.0 to 0.39.0.
- [Commits](golang/term@v0.38.0...v0.39.0)

---
updated-dependencies:
- dependency-name: golang.org/x/term
  dependency-version: 0.39.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* update

Signed-off-by: limengxuan <[email protected]>

* update

Signed-off-by: limengxuan <[email protected]>

* update chart for CDI

Signed-off-by: limengxuan <[email protected]>

* update docs

Signed-off-by: limengxuan <[email protected]>

* Update docs/config_cn.md

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update docs/config.md

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* update documents

Signed-off-by: limengxuan <[email protected]>

* fix kunlunxin vxpu issue

Signed-off-by: limengxuan <[email protected]>

* Update pkg/device/kunlun/vdevice.go

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* update

Signed-off-by: limengxuan <[email protected]>

* update Makefile for helm package

Signed-off-by: limengxuan <[email protected]>

* update discord invitation

Signed-off-by: limengxuan <[email protected]>

* Update README.md

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update README_cn.md

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update README_ja.md

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Signed-off-by: limengxuan <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…i#1587)

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.48.0 to 0.49.0.
- [Commits](golang/net@v0.48.0...v0.49.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-version: 0.49.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ct-HAMi#1585)

Bumps [github.com/onsi/gomega](https://github.com/onsi/gomega) from 1.38.3 to 1.39.0.
- [Release notes](https://github.com/onsi/gomega/releases)
- [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md)
- [Commits](onsi/gomega@v1.38.3...v1.39.0)

---
updated-dependencies:
- dependency-name: github.com/onsi/gomega
  dependency-version: 1.39.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ject-HAMi#1592)

Bumps [github.com/sirupsen/logrus](https://github.com/sirupsen/logrus) from 1.9.3 to 1.9.4.
- [Release notes](https://github.com/sirupsen/logrus/releases)
- [Changelog](https://github.com/sirupsen/logrus/blob/master/CHANGELOG.md)
- [Commits](sirupsen/logrus@v1.9.3...v1.9.4)

---
updated-dependencies:
- dependency-name: github.com/sirupsen/logrus
  dependency-version: 1.9.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…AMi#1588)

Bumps [golang.org/x/tools](https://github.com/golang/tools) from 0.40.0 to 0.41.0.
- [Release notes](https://github.com/golang/tools/releases)
- [Commits](golang/tools@v0.40.0...v0.41.0)

---
updated-dependencies:
- dependency-name: golang.org/x/tools
  dependency-version: 0.41.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* add hami build info metrics and version print

Signed-off-by: dongjiang1989 <[email protected]>

* add unittest case update ut cover

Signed-off-by: dongjiang1989 <[email protected]>

---------

Signed-off-by: dongjiang1989 <[email protected]>
When a node is deleted, the overviewstatus and cachedstatus maps
were not being cleaned up, causing metrics to still report data
for removed nodes. This fix adds cleanup logic in onDelNode to
remove the node from nodeManager and both status maps.

Fixes Project-HAMi#1595

Signed-off-by: lifeng <[email protected]>
Co-authored-by: Claude Opus 4.5 <[email protected]>
* update

Signed-off-by: limengxuan <[email protected]>

* update

Signed-off-by: limengxuan <[email protected]>

* update chart for CDI

Signed-off-by: limengxuan <[email protected]>

* update docs

Signed-off-by: limengxuan <[email protected]>

* Update docs/config_cn.md

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update docs/config.md

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* update documents

Signed-off-by: limengxuan <[email protected]>

* fix kunlunxin vxpu issue

Signed-off-by: limengxuan <[email protected]>

* Update pkg/device/kunlun/vdevice.go

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* update

Signed-off-by: limengxuan <[email protected]>

* update Makefile for helm package

Signed-off-by: limengxuan <[email protected]>

* update discord invitation

Signed-off-by: limengxuan <[email protected]>

* Update README.md

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update README_cn.md

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update README_ja.md

Signed-off-by: limengxuan <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* update version

Signed-off-by: limengxuan <[email protected]>

---------

Signed-off-by: limengxuan <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: dongjiang1989 <[email protected]>
@hami-robot hami-robot bot requested a review from FouoF January 19, 2026 13:32
@hami-robot
Copy link
Contributor

hami-robot bot commented Jan 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dongjiang1989
Once this PR has been reviewed and has the lgtm label, please assign wawa0210 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot bot requested a review from ouyangluwei163 January 19, 2026 13:32
@hami-robot hami-robot bot added the size/S label Jan 19, 2026
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @dongjiang1989, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a PodMonitor resource to the hami-vgpu Helm chart, enabling Prometheus to monitor the scheduler component. A new configuration option is added to control the deployment of the PodMonitor.

Highlights

  • PodMonitor Resource: Adds a PodMonitor resource definition to the Helm chart for monitoring the hami-vgpu scheduler using Prometheus. This allows Prometheus to automatically discover and collect metrics from the scheduler pods.
  • Prometheus Configuration: Introduces a prometheus.enabled flag in the values.yaml file to control whether the PodMonitor resource is deployed. It is disabled by default.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a Prometheus PodMonitor for the hami-scheduler, which is a great addition for observability. The implementation is mostly correct, but I've identified a potential issue with the metrics port name that could prevent scraping. Additionally, I've suggested an improvement to make the PodMonitor more easily discoverable by Prometheus through a dedicated labels configuration in values.yaml. These changes will enhance the robustness and usability of the new monitoring feature.

spec:
podMetricsEndpoints:
- path: /metrics
port: metrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The podMetricsEndpoints configuration specifies port: metrics. However, the corresponding service definition in charts/hami/templates/scheduler/service.yaml defines the monitoring port with the name monitor. For the PodMonitor to correctly discover and scrape the metrics endpoint, the port name must match the name defined in the Pod's specification. Assuming the Pod spec follows the service's naming convention, this should be monitor to ensure metrics are collected.

      port: monitor

Comment on lines +11 to +13
{{- with .Values.global.labels }}
{{- toYaml . | nindent 4 }}
{{- end }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To make the PodMonitor discoverable by the Prometheus Operator, it's good practice to allow users to specify custom labels. This change uses a new .Values.prometheus.labels value (suggested in a separate comment on values.yaml) to apply these labels. While global.labels can also be used, a dedicated prometheus.labels is more explicit and aligned with the new prometheus values section.

    {{- with .Values.prometheus.labels }}
    {{- toYaml . | nindent 4 }}
    {{- end }}
    {{- with .Values.global.labels }}
    {{- toYaml . | nindent 4 }}
    {{- end }}

- iluvatar.ai/MR-V50.vMem

prometheus:
enabled: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To allow users to easily add labels for Prometheus Operator discovery of the PodMonitor, it's a good practice to provide a dedicated labels field under the prometheus section. This makes the chart more configurable and user-friendly. This new value should then be used in charts/hami/templates/scheduler/monitor.yaml to apply the labels to the PodMonitor resource.

  enabled: false
  labels: {}

@codecov
Copy link

codecov bot commented Jan 19, 2026

Codecov Report

❌ Patch coverage is 40.66852% with 1065 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ce-plugin/nvidiadevice/nvinternal/plugin/server.go 16.01% 169 Missing and 4 partials ⚠️
...evice-plugin/nvidiadevice/nvinternal/rm/rm_mock.go 0.00% 130 Missing ⚠️
...vice-plugin/nvidiadevice/nvinternal/plugin/util.go 22.58% 93 Missing and 3 partials ⚠️
pkg/device/awsneuron/device.go 76.36% 61 Missing and 8 partials ⚠️
...g/device-plugin/nvidiadevice/nvinternal/cdi/cdi.go 0.00% 68 Missing ⚠️
...-plugin/nvidiadevice/nvinternal/plugin/register.go 0.00% 66 Missing ⚠️
...e-plugin/nvidiadevice/nvinternal/plugin/factory.go 0.00% 48 Missing ⚠️
pkg/device/ascend/device.go 74.19% 42 Missing and 6 partials ⚠️
...device-plugin/nvidiadevice/nvinternal/imex/imex.go 0.00% 35 Missing ⚠️
pkg/device/amd/device.go 75.86% 31 Missing and 4 partials ⚠️
... and 19 more
Flag Coverage Δ
unittests 51.16% <40.66%> (-9.99%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...ice-plugin/nvidiadevice/nvinternal/info/version.go 0.00% <ø> (ø)
...device-plugin/nvidiadevice/nvinternal/rm/helper.go 100.00% <100.00%> (ø)
pkg/device/common/common.go 53.33% <ø> (ø)
pkg/device/devices.go 81.12% <ø> (+7.47%) ⬆️
pkg/device/enflame/config.go 100.00% <ø> (ø)
pkg/device/enflame/device.go 83.25% <ø> (+5.16%) ⬆️
pkg/device/enflame/gcu.go 90.35% <ø> (ø)
pkg/device/hygon/device.go 91.22% <ø> (-2.15%) ⬇️
pkg/device/iluvatar/device.go 59.29% <ø> (-26.94%) ⬇️
pkg/device/kunlun/config.go 0.00% <ø> (ø)
... and 62 more

... and 5 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@FouoF
Copy link
Contributor

FouoF commented Jan 20, 2026

I think add a serviceMonitor is better than podMonitor for the scheduler already create service to expose metrics.
#1255 and #1486 did something about it.

@dongjiang1989
Copy link
Contributor Author

I think add a serviceMonitor is better than podMonitor for the scheduler already create service to expose metrics. #1255 and #1486 did something about it.

Thanks @FouoF
A ServiceMonitor must depend on a Service; however, for components like the scheduler, it is not necessary to create a Service resource solely for metrics. A PodMonitor serves the same purpose.

@archlitchi
Copy link
Member

archlitchi commented Jan 20, 2026

could you sync again with master branch, i've just rebase and force pushed several commits to minimize the repo size.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.