-
Notifications
You must be signed in to change notification settings - Fork 1.2k
feat: support vnpu for Ascend910 series #4717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support vnpu for Ascend910 series #4717
Conversation
|
Welcome @DSFans2014! It looks like this is your first PR to volcano-sh/volcano 🎉 |
Summary of ChangesHello @DSFans2014, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Volcano scheduler's capabilities by adding robust support for the HAMi Ascend device plugin. The primary goal is to overcome existing limitations in NPU scheduling, particularly for heterogeneous setups and virtualization scenarios involving Ascend 910 series devices. The changes introduce new API structures and logic to manage Ascend devices, track their usage, and apply intelligent scheduling policies, ensuring optimal resource allocation for NPU-intensive workloads. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces support for the HAMi Ascend device plugin, enabling scheduling for heterogeneous configurations and 910 series virtualization. The changes include new data structures for Ascend devices, logic for device discovery, filtering, scoring, and allocation. New configuration options and tests are also added. The implementation is comprehensive, but there are several areas for improvement regarding correctness, code quality, and adherence to Go conventions. Key issues include duplicated license headers, incorrect logic in device fitting, bugs in resource request generation and annotation decoding, and inconsistent naming.
archlitchi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please sign-off your commit to pass the DCO check
aa06eb9 to
0123bb1
Compare
|
Could you also raise an issue and relate this PR to it? So I could add this to volcano release 1.14 project and easy to track the progress? |
#4718 issue has been opened |
|
/ok-to-test |
|
please add a document about how to use this feature |
|
also, maybe you need to add a configMap file to this repo 'https://github.com/Project-HAMi/ascend-device-plugin', so the deviceshare plugin can get the vnpu templates |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR integrates HAMI Ascend device plugin support into Volcano scheduler, enabling scheduling and resource management for Ascend NPU devices (including Ascend 310P and 910 series). The implementation provides device sharing capabilities similar to existing GPU support.
Key Changes
- Added new
AscendVNPUEnableconfiguration flag and device registration mechanism - Implemented
AscendDevicestype with device allocation, scoring, and filtering capabilities - Extended node resource tracking to support multiple Ascend device types dynamically
- Added utility functions for pod/node annotation patching and device info encoding/decoding
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/scheduler/plugins/deviceshare/deviceshare.go | Added Ascend VNPU enable flag, device registration logic, and updated predicate/scoring for Ascend devices |
| pkg/scheduler/api/shared_device_pool.go | Added RegisterDevice function and AscendDevices interface compliance |
| pkg/scheduler/api/node_info.go | Extended node initialization to dynamically create Ascend devices and track their resources |
| pkg/scheduler/api/devices/util.go | Added helper functions for node/pod operations and device-related constants |
| pkg/scheduler/api/devices/device_info.go | Added core data structures for device management and encoding/decoding utilities |
| pkg/scheduler/api/devices/config/vnpu.go | Defined VNPU configuration structure with templates |
| pkg/scheduler/api/devices/config/config.go | Added VNPU configuration to main config with default Ascend310P settings |
| pkg/scheduler/api/devices/ascend/device_info.go | Implemented complete Ascend device management including allocation, scoring, and topology-aware selection |
| pkg/scheduler/api/devices/ascend/device_info_test.go | Added unit tests for memory trimming and device fitting logic |
| docs/user-guide/how_to_use_hami_ascend_device_pulgin.md | Added user guide for HAMI Ascend device plugin installation and usage |
Comments suppressed due to low confidence (1)
pkg/scheduler/plugins/deviceshare/deviceshare.go:213
- The comment "TODO When a pod requests a device of the current type, but the current node does not have such a device, an error is thrown" is incomplete. The condition
dev.HasDeviceRequest(task.Pod)was removed from the check, but this may cause the error message to be shown even for pods that don't request the device. The logic should verify if the pod actually requests this device type before throwing an error.
// TODO When a pod requests a device of the current type, but the current node does not have such a device, an error is thrown
if dev == nil {
predicateStatus = append(predicateStatus, &api.Status{
Code: devices.Unschedulable,
Reason: "node not initialized with device" + val,
Plugin: PluginName,
})
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
I've create a new user-guide for both Mindcluster and HAMi, let's wait for @JackyTYang to fill his part |
70fa2b2 to
6f6c3ee
Compare
|
Thanks for your contribution, currently, there are too many commits, could you clean and squash them into only one commit? Or several two-three clean commits @DSFans2014 |
bab5654 to
e9ae2a5
Compare
e9ae2a5 to
e978169
Compare
now there is only one commit, thank you @JesseStutler |
Signed-off-by: james <[email protected]>
9f5d144 to
24ef99d
Compare
|
Looks good overall. Since this PR isn't closely related to Volcano's main scheduling workflow, I didn't review it in extreme detail. I trust the Hami team's engineers to have verified it thoroughly. BTW, I tend to prefer keeping device-related code in separate repositories rather than the main repo. Perhaps we can work on this moving forward. :) /approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hwdef The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Yes, we are working on providing a DRA-driver for both 'ascend' and 'nvidia' which is compatible with volcano, this way it doesn't need to modify the main repository |
Signed-off-by: james <[email protected]>
bf7dd63 to
6119cb4
Compare
JesseStutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
Thanks for the review @JesseStutler. There's also a related doc PR volcano-sh/website#430, please take a look when you have time |
#4656 already supports scheduling for
Ascend NPU 310P, but it cannot handle heterogeneous configurations and 910 series virtualization. Therefore, we have currently implemented support for the HAMi Ascend device plugin.close #4718