feat: add Prometheus PodMonitor in helm-charts #1600
base: master
Signed-off-by: ouyangluwei(riseunion) <[email protected]> Co-authored-by: ouyangluwei(riseunion) <[email protected]>
Signed-off-by: antvirf <[email protected]>
Signed-off-by: calvin chen <[email protected]>
…roject-HAMi#1041) Signed-off-by: ghostloda <[email protected]>
* fix: Multi-node scoring nodes are inaccurate Signed-off-by: ouyangluwei(riseunion) <[email protected]> * fix: ut Signed-off-by: ouyangluwei(riseunion) <[email protected]> * fix: ut Signed-off-by: ouyangluwei(riseunion) <[email protected]> * fix: ut Signed-off-by: ouyangluwei(riseunion) <[email protected]> * fix: ut Signed-off-by: ouyangluwei(riseunion) <[email protected]> --------- Signed-off-by: ouyangluwei(riseunion) <[email protected]> Co-authored-by: ouyangluwei(riseunion) <[email protected]>
Signed-off-by: ouyangluwei(riseunion) <[email protected]> Co-authored-by: ouyangluwei(riseunion) <[email protected]>
…scheduler roles + a namespace-scoped role for leader election Signed-off-by: antvirf <[email protected]>
Signed-off-by: antvirf <[email protected]>
…-in-release-changelog feat: Add new labels in .github/release.yml
…perms feat(scheduler-role): use a scoped-down role for scheduler
Signed-off-by: Jifei wang <[email protected]> update vendor
…oject-HAMi#1161) Bumps [aquasecurity/trivy-action](https://github.com/aquasecurity/trivy-action) from 0.31.0 to 0.32.0. - [Release notes](https://github.com/aquasecurity/trivy-action/releases) - [Commits](aquasecurity/trivy-action@0.31.0...0.32.0) --- updated-dependencies: - dependency-name: aquasecurity/trivy-action dependency-version: 0.32.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* feat(helm): optionally disable admission webhook This simplifies the deployment considerably and makes HAMi less intrusive inclusters where only a minority of workloads actually require GPU scheduling. Signed-off-by: antvirf <[email protected]> * fix(chart.yaml): keep original version from master Signed-off-by: antvirf <[email protected]> * docs(helm): comment to explain impact of disabling admissionWebhook Signed-off-by: antvirf <[email protected]> --------- Signed-off-by: antvirf <[email protected]>
Signed-off-by: Yunlu Wen <[email protected]> Co-authored-by: Yunlu Wen <[email protected]>
Fix e2e CI Signed-off-by: limengxuan <[email protected]>
Signed-off-by: Shouren Yang <[email protected]>
Signed-off-by: Shouren Yang <[email protected]>
Signed-off-by: Shouren Yang <[email protected]>
Signed-off-by: Shouren Yang <[email protected]>
…5.0 to 1.17.8 (Project-HAMi#1170) Signed-off-by: Shouren Yang <[email protected]>
…t-HAMi#1183) Signed-off-by: Shouren Yang <[email protected]>
…Mi#1189) Signed-off-by: Shouren Yang <[email protected]>
Signed-off-by: Jifei Wang <[email protected]>
…vidia Add basic test cases for enflame,hygon, metax, mthreads, nvidia module to verify Fit function. Includes positive and negative test scenarios. Signed-off-by: wangmin <[email protected]> Co-authored-by: wangmin <[email protected]>
Signed-off-by: chaunceyjiang <[email protected]>
Signed-off-by: Jifei Wang <[email protected]>
…HAMi#1092) Signed-off-by: Shouren Yang <[email protected]>
Add basic test cases for cambricon module to verify Fit function. Includes positive and negative test scenarios. Co-authored-by: wangmin <[email protected]>
Add basic test cases for Ascend module to verify Fit function. Includes positive and negative test scenarios. Co-authored-by: wangmin <[email protected]>
…roject-HAMi#1186) Bumps [github.com/fsnotify/fsnotify](https://github.com/fsnotify/fsnotify) from 1.7.0 to 1.9.0. - [Release notes](https://github.com/fsnotify/fsnotify/releases) - [Changelog](https://github.com/fsnotify/fsnotify/blob/main/CHANGELOG.md) - [Commits](fsnotify/fsnotify@v1.7.0...v1.9.0) --- updated-dependencies: - dependency-name: github.com/fsnotify/fsnotify dependency-version: 1.9.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: qinxiaowen <[email protected]>
…here are multiple containers in one pod (Project-HAMi#1579) * fix: fix resource quota Signed-off-by: james <[email protected]> * fix: fix test case Signed-off-by: james <[email protected]> --------- Signed-off-by: james <[email protected]>
Signed-off-by: Jifei Wang <[email protected]>
* add modernize check Signed-off-by: dongjiang1989 <[email protected]> * add unittest case Signed-off-by: dongjiang1989 <[email protected]> --------- Signed-off-by: dongjiang1989 <[email protected]>
…oject-HAMi#1584) Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.27.3 to 2.27.5. - [Release notes](https://github.com/onsi/ginkgo/releases) - [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md) - [Commits](onsi/ginkgo@v2.27.3...v2.27.5) --- updated-dependencies: - dependency-name: github.com/onsi/ginkgo/v2 dependency-version: 2.27.5 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…Mi#1586) Bumps [golang.org/x/term](https://github.com/golang/term) from 0.38.0 to 0.39.0. - [Commits](golang/term@v0.38.0...v0.39.0) --- updated-dependencies: - dependency-name: golang.org/x/term dependency-version: 0.39.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* update Signed-off-by: limengxuan <[email protected]> * update Signed-off-by: limengxuan <[email protected]> * update chart for CDI Signed-off-by: limengxuan <[email protected]> * update docs Signed-off-by: limengxuan <[email protected]> * Update docs/config_cn.md Signed-off-by: limengxuan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update docs/config.md Signed-off-by: limengxuan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * update documents Signed-off-by: limengxuan <[email protected]> * fix kunlunxin vxpu issue Signed-off-by: limengxuan <[email protected]> * Update pkg/device/kunlun/vdevice.go Signed-off-by: limengxuan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * update Signed-off-by: limengxuan <[email protected]> * update Makefile for helm package Signed-off-by: limengxuan <[email protected]> * update discord invitation Signed-off-by: limengxuan <[email protected]> * Update README.md Signed-off-by: limengxuan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update README_cn.md Signed-off-by: limengxuan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update README_ja.md Signed-off-by: limengxuan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Signed-off-by: limengxuan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…i#1587) Bumps [golang.org/x/net](https://github.com/golang/net) from 0.48.0 to 0.49.0. - [Commits](golang/net@v0.48.0...v0.49.0) --- updated-dependencies: - dependency-name: golang.org/x/net dependency-version: 0.49.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ct-HAMi#1585) Bumps [github.com/onsi/gomega](https://github.com/onsi/gomega) from 1.38.3 to 1.39.0. - [Release notes](https://github.com/onsi/gomega/releases) - [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md) - [Commits](onsi/gomega@v1.38.3...v1.39.0) --- updated-dependencies: - dependency-name: github.com/onsi/gomega dependency-version: 1.39.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ject-HAMi#1592) Bumps [github.com/sirupsen/logrus](https://github.com/sirupsen/logrus) from 1.9.3 to 1.9.4. - [Release notes](https://github.com/sirupsen/logrus/releases) - [Changelog](https://github.com/sirupsen/logrus/blob/master/CHANGELOG.md) - [Commits](sirupsen/logrus@v1.9.3...v1.9.4) --- updated-dependencies: - dependency-name: github.com/sirupsen/logrus dependency-version: 1.9.4 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…AMi#1588) Bumps [golang.org/x/tools](https://github.com/golang/tools) from 0.40.0 to 0.41.0. - [Release notes](https://github.com/golang/tools/releases) - [Commits](golang/tools@v0.40.0...v0.41.0) --- updated-dependencies: - dependency-name: golang.org/x/tools dependency-version: 0.41.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: james <[email protected]>
* add hami build info metrics and version print Signed-off-by: dongjiang1989 <[email protected]> * add unittest case update ut cover Signed-off-by: dongjiang1989 <[email protected]> --------- Signed-off-by: dongjiang1989 <[email protected]>
When a node is deleted, the overviewstatus and cachedstatus maps were not being cleaned up, causing metrics to still report data for removed nodes. This fix adds cleanup logic in onDelNode to remove the node from nodeManager and both status maps. Fixes Project-HAMi#1595 Signed-off-by: lifeng <[email protected]> Co-authored-by: Claude Opus 4.5 <[email protected]>
* update Signed-off-by: limengxuan <[email protected]> * update Signed-off-by: limengxuan <[email protected]> * update chart for CDI Signed-off-by: limengxuan <[email protected]> * update docs Signed-off-by: limengxuan <[email protected]> * Update docs/config_cn.md Signed-off-by: limengxuan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update docs/config.md Signed-off-by: limengxuan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * update documents Signed-off-by: limengxuan <[email protected]> * fix kunlunxin vxpu issue Signed-off-by: limengxuan <[email protected]> * Update pkg/device/kunlun/vdevice.go Signed-off-by: limengxuan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * update Signed-off-by: limengxuan <[email protected]> * update Makefile for helm package Signed-off-by: limengxuan <[email protected]> * update discord invitation Signed-off-by: limengxuan <[email protected]> * Update README.md Signed-off-by: limengxuan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update README_cn.md Signed-off-by: limengxuan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update README_ja.md Signed-off-by: limengxuan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * update version Signed-off-by: limengxuan <[email protected]> --------- Signed-off-by: limengxuan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: dongjiang1989 <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: dongjiang1989
The full list of commands accepted by this bot can be found here.
Code Review
This pull request introduces a Prometheus PodMonitor for the hami-scheduler, which is a great addition for observability. The implementation is mostly correct, but I've identified a potential issue with the metrics port name that could prevent scraping. Additionally, I've suggested an improvement to make the PodMonitor more easily discoverable by Prometheus through a dedicated labels configuration in values.yaml. These changes will enhance the robustness and usability of the new monitoring feature.
spec:
  podMetricsEndpoints:
    - path: /metrics
      port: metrics
The podMetricsEndpoints configuration specifies port: metrics. However, the corresponding service definition in charts/hami/templates/scheduler/service.yaml defines the monitoring port with the name monitor. For the PodMonitor to correctly discover and scrape the metrics endpoint, the port name must match the name defined in the Pod's specification. Assuming the Pod spec follows the service's naming convention, this should be monitor to ensure metrics are collected.
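The name-matching requirement described above can be sketched as a pair of manifests. This is an illustration only, not the chart's actual output: the resource names, the label, the image, and the port number are hypothetical, and the port name `monitor` is assumed to follow the Service's naming, as the comment itself assumes.

```yaml
# Hypothetical Pod fragment: a PodMonitor matches on the container port NAME.
apiVersion: v1
kind: Pod
metadata:
  name: example-scheduler                              # hypothetical name
  labels:
    app.kubernetes.io/component: example-scheduler     # hypothetical label
spec:
  containers:
    - name: scheduler
      image: example.invalid/scheduler:latest          # placeholder image
      ports:
        - name: monitor          # assumed to mirror the Service's port name
          containerPort: 9395    # hypothetical port number
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: example-scheduler
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: example-scheduler   # selects the Pod above
  podMetricsEndpoints:
    - path: /metrics
      port: monitor              # must equal the container port name above
```

If the two names differ, the PodMonitor is still accepted by the API server, but no scrape target is generated for the Pod, so the mismatch tends to surface only as missing metrics.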
port: monitor

{{- with .Values.global.labels }}
{{- toYaml . | nindent 4 }}
{{- end }}
To make the PodMonitor discoverable by the Prometheus Operator, it's good practice to allow users to specify custom labels. This change uses a new .Values.prometheus.labels value (suggested in a separate comment on values.yaml) to apply these labels. While global.labels can also be used, a dedicated prometheus.labels is more explicit and aligned with the new prometheus values section.
{{- with .Values.prometheus.labels }}
{{- toYaml . | nindent 4 }}
{{- end }}
{{- with .Values.global.labels }}
{{- toYaml . | nindent 4 }}
{{- end }}

- iluvatar.ai/MR-V50.vMem

prometheus:
  enabled: false
To allow users to easily add labels for Prometheus Operator discovery of the PodMonitor, it's a good practice to provide a dedicated labels field under the prometheus section. This makes the chart more configurable and user-friendly. This new value should then be used in charts/hami/templates/scheduler/monitor.yaml to apply the labels to the PodMonitor resource.
enabled: false
labels: {}
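Taken together, the review suggestions might combine into a template along the following lines. This is a rough sketch rather than the contents of charts/hami/templates/scheduler/monitor.yaml: the `prometheus.enabled` gate, the resource name, and the component label are assumptions made for illustration.

```yaml
{{- /* Sketch only: the enabled gate, resource name, and component label are assumptions. */ -}}
{{- if .Values.prometheus.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: example-scheduler-monitor                      # hypothetical name
  labels:
    app.kubernetes.io/component: example-scheduler     # hypothetical fixed label
    {{- with .Values.prometheus.labels }}
    {{- toYaml . | nindent 4 }}
    {{- end }}
    {{- with .Values.global.labels }}
    {{- toYaml . | nindent 4 }}
    {{- end }}
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: example-scheduler   # hypothetical selector
  podMetricsEndpoints:
    - path: /metrics
      port: monitor                                    # port name proposed in the review
{{- end }}
```

With a layout like this, a user whose Prometheus instance selects PodMonitors by label could pass something like `--set prometheus.enabled=true --set prometheus.labels.release=<name>` to `helm template` or `helm install`. The `release` key is only an example of a selector label. Note that a key set in both `prometheus.labels` and `global.labels` would be rendered twice, so the two maps should not overlap.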
Thanks @FouoF
Could you sync with the master branch again? I've just rebased and force-pushed several commits to minimize the repo size.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Add a Prometheus PodMonitor to the helm-charts
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: