Conversation

@DSFans2014
Contributor

@DSFans2014 DSFans2014 commented Nov 6, 2025

#4656 already supports scheduling for Ascend NPU 310P, but it cannot handle heterogeneous configurations or virtualization of the 910 series. This PR therefore adds support for the HAMi Ascend device plugin.
Closes #4718

@volcano-sh-bot
Contributor

Welcome @DSFans2014! It looks like this is your first PR to volcano-sh/volcano 🎉

@volcano-sh-bot volcano-sh-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 6, 2025
@gemini-code-assist

Summary of Changes

Hello @DSFans2014, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Volcano scheduler's capabilities by adding robust support for the HAMi Ascend device plugin. The primary goal is to overcome existing limitations in NPU scheduling, particularly for heterogeneous setups and virtualization scenarios involving Ascend 910 series devices. The changes introduce new API structures and logic to manage Ascend devices, track their usage, and apply intelligent scheduling policies, ensuring optimal resource allocation for NPU-intensive workloads.

Highlights

  • HAMi Ascend Device Plugin Support: Introduces comprehensive support for the HAMi Ascend device plugin, enabling the Volcano scheduler to manage Ascend NPUs more effectively.
  • Enhanced NPU Scheduling: Addresses previous limitations with "Ascend NPU 310P" scheduling, specifically improving handling for heterogeneous configurations and virtualization of Ascend 910 series devices.
  • New Ascend Device API: Adds new Go files (device_info.go, device_info_test.go, config/vnpu.go) to define data structures and logic for Ascend device information, resource requests, and scheduling policies.
  • Scheduler Integration: Integrates Ascend device management into the Volcano scheduler's core components, including node information, shared device pool, and the device sharing plugin, allowing for proper filtering, scoring, and allocation.
  • Configuration Updates: Modifies the device configuration to include a VNPUs field, enabling flexible definition of virtual NPU types and their properties.
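
The VNPUs configuration mentioned in the last highlight can be pictured roughly as below. This is only a minimal sketch: the type and field names (`VNPUConfig`, `Template`, `MemoryMB`, `AICore`), the template sizes, and the "round the request up to the smallest fitting template" rule are all assumptions made for illustration; the actual definitions are in config/vnpu.go in the PR and may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// Template is a hypothetical virtual-NPU slice definition (illustrative only).
type Template struct {
	Name     string
	MemoryMB int64
	AICore   int32
}

// VNPUConfig is a hypothetical entry of a VNPUs list for one chip type.
type VNPUConfig struct {
	ChipName  string
	Templates []Template
}

// pickTemplate returns the smallest template whose memory covers requestMB.
// The boolean is false when no template is large enough.
func pickTemplate(cfg VNPUConfig, requestMB int64) (Template, bool) {
	// Sort a copy so the caller's slice order is left untouched.
	sorted := append([]Template(nil), cfg.Templates...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].MemoryMB < sorted[j].MemoryMB })
	for _, tpl := range sorted {
		if tpl.MemoryMB >= requestMB {
			return tpl, true
		}
	}
	return Template{}, false
}

func main() {
	cfg := VNPUConfig{
		ChipName: "example-chip", // placeholder, not a real chip name
		Templates: []Template{
			{Name: "large", MemoryMB: 16384, AICore: 8},
			{Name: "small", MemoryMB: 4096, AICore: 2},
			{Name: "medium", MemoryMB: 8192, AICore: 4},
		},
	}
	if tpl, ok := pickTemplate(cfg, 5000); ok {
		fmt.Println(tpl.Name, tpl.MemoryMB) // prints "medium 8192"
	}
}
```

Keeping the templates as data rather than code is what would let a ConfigMap supply them to the scheduler without a rebuild.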


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the HAMi Ascend device plugin, enabling scheduling for heterogeneous configurations and 910 series virtualization. The changes include new data structures for Ascend devices, logic for device discovery, filtering, scoring, and allocation. New configuration options and tests are also added. The implementation is comprehensive, but there are several areas for improvement regarding correctness, code quality, and adherence to Go conventions. Key issues include duplicated license headers, incorrect logic in device fitting, bugs in resource request generation and annotation decoding, and inconsistent naming.

Contributor

@archlitchi archlitchi left a comment


please sign-off your commit to pass the DCO check

@DSFans2014 DSFans2014 force-pushed the feat/support-hami-ascend-device branch 2 times, most recently from aa06eb9 to 0123bb1 Compare November 6, 2025 09:28
@JesseStutler
Member

Could you also open an issue and link this PR to it, so that I can add it to the Volcano release 1.14 project and track its progress easily?

@DSFans2014
Contributor Author

#4718 issue has been opened

@archlitchi
Contributor

/ok-to-test

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Nov 7, 2025
@archlitchi
Contributor

please add a document about how to use this feature

@archlitchi
Contributor

also, maybe you need to add a configMap file to this repo 'https://github.com/Project-HAMi/ascend-device-plugin', so the deviceshare plugin can get the vnpu templates

@DSFans2014
Contributor Author

Project-HAMi/ascend-device-plugin#38

@JesseStutler JesseStutler requested a review from Copilot November 8, 2025 14:58
Contributor

Copilot AI left a comment


Pull Request Overview

This PR integrates HAMi Ascend device plugin support into the Volcano scheduler, enabling scheduling and resource management for Ascend NPU devices (including the Ascend 310P and 910 series). The implementation provides device sharing capabilities similar to the existing GPU support.

Key Changes

  • Added new AscendVNPUEnable configuration flag and device registration mechanism
  • Implemented AscendDevices type with device allocation, scoring, and filtering capabilities
  • Extended node resource tracking to support multiple Ascend device types dynamically
  • Added utility functions for pod/node annotation patching and device info encoding/decoding
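
To make the last bullet concrete, here is a minimal sketch of what encoding device assignments into an annotation string and decoding them back might look like. The `DeviceUsage` type and the `uuid,type,mem;...` wire format are assumptions invented for this illustration, not the format the PR or HAMi actually uses. The decoder validates the field count of every entry, since malformed annotations are an easy source of bugs.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// DeviceUsage is a hypothetical record of one device slice given to a container.
type DeviceUsage struct {
	UUID     string
	Type     string
	MemoryMB int64
}

// encodeDevices serializes usages as "uuid,type,mem" entries joined by ";".
// This wire format is an assumption made for the sketch.
func encodeDevices(devs []DeviceUsage) string {
	parts := make([]string, 0, len(devs))
	for _, d := range devs {
		entry := strings.Join([]string{d.UUID, d.Type, strconv.FormatInt(d.MemoryMB, 10)}, ",")
		parts = append(parts, entry)
	}
	return strings.Join(parts, ";")
}

// decodeDevices parses the format produced by encodeDevices and rejects
// entries that do not have exactly three fields or a numeric memory value.
func decodeDevices(s string) ([]DeviceUsage, error) {
	if s == "" {
		return nil, nil
	}
	var out []DeviceUsage
	for _, entry := range strings.Split(s, ";") {
		fields := strings.Split(entry, ",")
		if len(fields) != 3 {
			return nil, fmt.Errorf("malformed device entry %q", entry)
		}
		mem, err := strconv.ParseInt(fields[2], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("bad memory value in %q: %w", entry, err)
		}
		out = append(out, DeviceUsage{UUID: fields[0], Type: fields[1], MemoryMB: mem})
	}
	return out, nil
}

func main() {
	in := []DeviceUsage{{UUID: "npu-0", Type: "example", MemoryMB: 4096}}
	s := encodeDevices(in)
	back, err := decodeDevices(s)
	fmt.Println(s, back, err)
}
```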

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| pkg/scheduler/plugins/deviceshare/deviceshare.go | Added Ascend VNPU enable flag, device registration logic, and updated predicate/scoring for Ascend devices |
| pkg/scheduler/api/shared_device_pool.go | Added RegisterDevice function and AscendDevices interface compliance |
| pkg/scheduler/api/node_info.go | Extended node initialization to dynamically create Ascend devices and track their resources |
| pkg/scheduler/api/devices/util.go | Added helper functions for node/pod operations and device-related constants |
| pkg/scheduler/api/devices/device_info.go | Added core data structures for device management and encoding/decoding utilities |
| pkg/scheduler/api/devices/config/vnpu.go | Defined VNPU configuration structure with templates |
| pkg/scheduler/api/devices/config/config.go | Added VNPU configuration to main config with default Ascend310P settings |
| pkg/scheduler/api/devices/ascend/device_info.go | Implemented complete Ascend device management including allocation, scoring, and topology-aware selection |
| pkg/scheduler/api/devices/ascend/device_info_test.go | Added unit tests for memory trimming and device fitting logic |
| docs/user-guide/how_to_use_hami_ascend_device_pulgin.md | Added user guide for HAMI Ascend device plugin installation and usage |
Comments suppressed due to low confidence (1)

pkg/scheduler/plugins/deviceshare/deviceshare.go:213

  • The comment "TODO When a pod requests a device of the current type, but the current node does not have such a device, an error is thrown" is incomplete. The condition dev.HasDeviceRequest(task.Pod) was removed from the check, but this may cause the error message to be shown even for pods that don't request the device. The logic should verify if the pod actually requests this device type before throwing an error.
					// TODO When a pod requests a device of the current type, but the current node does not have such a device, an error is thrown
					if dev == nil {
						predicateStatus = append(predicateStatus, &api.Status{
							Code:   devices.Unschedulable,
							Reason: "node not initialized with device" + val,
							Plugin: PluginName,
						})
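
One possible shape for the guard this comment asks for is sketched below. It is not the PR's code: `Pod`, `Device`, `Status`, `podRequests`, `checkNodeDevice`, and the resource name `example.com/npu` are simplified stand-ins invented for the example. The point is that the "does the pod request this device type" check is made against the pod alone, so it can still run when the node has no device object.

```go
package main

import "fmt"

// Pod is a stand-in for *v1.Pod that keeps only resource requests.
type Pod struct {
	Name     string
	Requests map[string]int64 // resource name -> requested quantity
}

// Device is a hypothetical, trimmed-down device interface.
type Device interface {
	HasDeviceRequest(p *Pod) bool
}

// Status mirrors the shape of a predicate status for illustration only.
type Status struct {
	Code   int
	Reason string
}

const Unschedulable = 1 // placeholder status code

// podRequests reports whether the pod asks for the named resource.
// It needs no device object, so it is safe to call when the node has none.
func podRequests(p *Pod, resource string) bool {
	return p.Requests[resource] > 0
}

// checkNodeDevice returns a failure status only when the pod requests the
// resource and the node has no matching device; otherwise it returns nil.
func checkNodeDevice(p *Pod, dev Device, resource string) *Status {
	if dev == nil && podRequests(p, resource) {
		return &Status{Code: Unschedulable, Reason: "node not initialized with device " + resource}
	}
	return nil // further fit checks on dev are omitted in this sketch
}

func main() {
	const res = "example.com/npu" // hypothetical resource name
	needs := &Pod{Name: "train", Requests: map[string]int64{res: 1}}
	plain := &Pod{Name: "web", Requests: map[string]int64{}}
	fmt.Println(checkNodeDevice(needs, nil, res) != nil) // true
	fmt.Println(checkNodeDevice(plain, nil, res) != nil) // false
}
```

With this shape, pods that never ask for the device pass through silently on nodes that lack it, and only genuine mismatches produce the "node not initialized" reason.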


@archlitchi
Contributor

I've created a new user guide for both Mindcluster and HAMi; let's wait for @JackyTYang to fill in his part.

@DSFans2014 DSFans2014 changed the title feat: support hami ascend device plugin feat: support vnpu for Ascend910 series Nov 11, 2025
@JesseStutler
Member

JesseStutler commented Nov 18, 2025

Thanks for your contribution. There are currently too many commits; could you clean them up and squash them into a single commit, or into two or three clean commits? @DSFans2014

@DSFans2014
Contributor Author

now there is only one commit, thank you @JesseStutler

@hwdef
Member

hwdef commented Nov 20, 2025

Looks good overall. Since this PR isn't closely related to Volcano's main scheduling workflow, I didn't review it in extreme detail. I trust the HAMi team's engineers to have verified it thoroughly.

BTW, I tend to prefer keeping device-related code in separate repositories rather than the main repo. Perhaps we can work on this moving forward. :)

/approve
Thanks

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hwdef


@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 20, 2025
@archlitchi
Contributor

Looks good overall. Since this PR isn't closely related to Volcano's main scheduling workflow, I didn't review it in extreme detail. I trust the Hami team's engineers to have verified it thoroughly.

BTW, I tend to prefer keeping device-related code in separate repositories rather than the main repo. Perhaps we can work on this moving forward. :)

/approve Thanks

Yes, we are working on a DRA driver for both 'ascend' and 'nvidia' that is compatible with Volcano, so that the main repository no longer needs to be modified.

Signed-off-by: james <[email protected]>
@DSFans2014 DSFans2014 force-pushed the feat/support-hami-ascend-device branch from bf7dd63 to 6119cb4 Compare November 21, 2025 02:53
Member

@JesseStutler JesseStutler left a comment


/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 25, 2025
@volcano-sh-bot volcano-sh-bot merged commit 97bb785 into volcano-sh:master Nov 25, 2025
21 checks passed
@DSFans2014 DSFans2014 deleted the feat/support-hami-ascend-device branch November 25, 2025 02:07
@DSFans2014
Contributor Author

Thanks for the review @JesseStutler. There's also a related doc PR volcano-sh/website#430, please take a look when you have time

Development

Successfully merging this pull request may close these issues.

support hami ascend device plugin

7 participants