feat(trainer): Support NPU labels in TrainJob device#336
feat(trainer): Support NPU labels in TrainJob device#336sujalshah-bit wants to merge 1 commit intokubeflow:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow SDK! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
There was a problem hiding this comment.
Pull request overview
Adds NPU extended-resource detection to the Kubernetes backend so TrainJob device metadata can be derived from container resource limits when an NPU resource key is present.
Changes:
- Introduces an NPU-related resource constant in
constants.py. - Extends
get_container_devices()to detect*/npuresource keys and fail fast on multiple NPU types. - Adds unit tests covering single-NPU and multiple-NPU-key scenarios.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| kubeflow/trainer/constants/constants.py | Adds an NPU-related constant for resource identification. |
| kubeflow/trainer/backends/kubernetes/utils.py | Adds NPU resource key detection and validation in device extraction. |
| kubeflow/trainer/backends/kubernetes/utils_test.py | Adds test cases validating NPU detection and multi-key failure behavior. |
| # The label for NPU in the container resources. | ||
| NPU_LABEL = "/npu" | ||
|
|
There was a problem hiding this comment.
NPU_LABEL is defined as the suffix string "/npu" (not a full resource label like the other *_LABEL constants) and is currently unused, which is likely to confuse future callers; rename it to something like NPU_RESOURCE_SUFFIX/NPU_SUFFIX (or document it explicitly as a suffix) and update usages accordingly.
| mig_key = mig_keys[0] | ||
| device = mig_key.split("/")[1] | ||
| device_count = resources.limits[mig_key].actual_instance | ||
| elif npu_keys := [k for k in resources.limits if k.endswith("/npu")]: |
There was a problem hiding this comment.
The NPU detection uses a hard-coded string literal "/npu" even though an NPU constant was added in constants; use the constant here (and/or rename it to a suffix constant) so the matching rule is defined in one place and can’t drift.
| elif npu_keys := [k for k in resources.limits if k.endswith("/npu")]: | |
| elif npu_keys := [k for k in resources.limits if k.endswith(constants.NPU_SUFFIX)]: |
Implement support for NPU resource labels in resource limit validation, resolving the existing TODO to support additional accelerator types. Signed-off-by: Sujal Shah <sujalshah28092004@gmail.com>
f4fa8fe to
6975af2
Compare
|
@andreyvelich and @kramaranya, gentle reminder for the review :) |
| # Potentially, we can use the trainer.kubeflow.org/device label from the runtime or | ||
| # node types. | ||
| # TODO (andreyvelich): Support other resource labels (e.g. NPUs). | ||
| # TODO (andreyvelich): Support other resource labels. |
There was a problem hiding this comment.
You can remove this TODO statement.
| # TODO (andreyvelich): Support other resource labels. |
| if len(npu_keys) > 1: | ||
| raise ValueError(f"Multiple NPU resource types are not supported yet: {npu_keys}") |
| if len(npu_keys) > 1: | ||
| raise ValueError(f"Multiple NPU resource types are not supported yet: {npu_keys}") | ||
|
|
||
| npu_key = npu_keys[0] |
There was a problem hiding this comment.
Why are you taking the first item? I thought NPU resources are set similarly to TPU and GPU
| config={ | ||
| "resources": models.IoK8sApiCoreV1ResourceRequirements( | ||
| limits={ | ||
| "huawei.com/npu": models.IoK8sApimachineryPkgApiResourceQuantity(2), |
There was a problem hiding this comment.
Remove vendors from unit tests
| "huawei.com/npu": models.IoK8sApimachineryPkgApiResourceQuantity(2), | |
| "example.com/npu": models.IoK8sApimachineryPkgApiResourceQuantity(2), |
| expected_output=("npu", "2.0"), | ||
| ), | ||
| TestCase( | ||
| name="multiple NPU resource types are not supported", |
There was a problem hiding this comment.
This should not be the case.
| TPU_LABEL = "google.com/tpu" | ||
|
|
||
| # The Suffix label for NPU in the container resources. | ||
| NPU_LABEL_SUFFIX = "/npu" |
There was a problem hiding this comment.
I thought the resource name is: huawei.com/Ascend910.
@danish9039 Do you know details about it?
Ref: #264
|
/ok-to-test |
What this PR does / why we need it:
Add support for NPU resource labels in container resource limits,
resolving the existing TODO to support additional accelerator types
(e.g., NPUs).
/npusuffix matchingFixes NONE
Checklist: