Skip to content

Optimize Fit-in-device logic to make it device-specific #1097

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Jun 11, 2025

Conversation

archlitchi
Copy link
Member

/kind feature

Right now, a unified fitInCertainDevice is responsible for all devices, which is a very bad design, it causes many unnecessary logic among different types of device.

We need to optimize that by make fit logic device-specific, implemented by device interface.So We did several adjustment to device API, such as

Add interface Fit
Remove interface CheckType, CheckUUID, CustomFilterRule. Because these logics can be implemented in Fit directly

archlitchi and others added 23 commits March 18, 2025 18:07
Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>
@archlitchi archlitchi changed the title Optimize Fit-in-device logic to make it device-dependent Optimize Fit-in-device logic to make it device-specific May 30, 2025
Copy link

codecov bot commented May 30, 2025

Codecov Report

Attention: Patch coverage is 5.16605% with 771 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pkg/device/mthreads/device.go 3.19% 91 Missing ⚠️
pkg/device/nvidia/device.go 3.19% 91 Missing ⚠️
pkg/device/metax/device.go 3.33% 87 Missing ⚠️
pkg/device/metax/sdevice.go 0.00% 85 Missing ⚠️
pkg/device/ascend/device.go 2.35% 83 Missing ⚠️
pkg/device/cambricon/device.go 2.35% 83 Missing ⚠️
pkg/device/enflame/device.go 2.35% 83 Missing ⚠️
pkg/device/hygon/device.go 2.35% 83 Missing ⚠️
pkg/device/iluvatar/device.go 2.35% 83 Missing ⚠️
pkg/monitor/nvidia/cudevshr.go 0.00% 2 Missing ⚠️
Flag Coverage Δ
unittests 52.95% <5.16%> (-10.19%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
pkg/device/common/common.go 100.00% <100.00%> (ø)
pkg/device/devices.go 73.65% <ø> (ø)
pkg/scheduler/policy/node_policy.go 100.00% <100.00%> (ø)
pkg/scheduler/score.go 94.21% <100.00%> (-2.69%) ⬇️
pkg/monitor/nvidia/cudevshr.go 0.00% <0.00%> (ø)
pkg/device/ascend/device.go 60.45% <2.35%> (-26.43%) ⬇️
pkg/device/cambricon/device.go 57.49% <2.35%> (-22.22%) ⬇️
pkg/device/enflame/device.go 50.44% <2.35%> (-27.64%) ⬇️
pkg/device/hygon/device.go 63.00% <2.35%> (-30.37%) ⬇️
pkg/device/iluvatar/device.go 54.58% <2.35%> (-31.65%) ⬇️
... and 4 more

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@archlitchi
Copy link
Member Author

/assign @lengrongfu @Shouren

@Shouren Shouren self-requested a review June 3, 2025 02:20
@Shouren Shouren added the kind/enhancement New feature or request label Jun 3, 2025
@Shouren
Copy link
Collaborator

Shouren commented Jun 3, 2025

@archlitchi After a quick review of this pr, i find there are lots of duplicated code in the current implementation, especially in the Fit func of each device. I have not compared every implementation of device in detail, but i think it is better to create some helper functions in pkg/util or pkg/device/util to deal with common processing logic for each reasons.

Copy link
Collaborator

@Shouren Shouren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For devices in pkg/device, the CheckType and CheckUUID functions should start with a lower case which keeps up with modification to customFilterRule function.

found, numa := nv.CheckType(annos, *dev, k)
if !found {
reason[common.CardTypeMismatch]++
klog.V(5).InfoS(common.CardTypeMismatch, "pod", klog.KObj(pod), "device", dev.ID, dev.Type, k.Type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logger level we can unified debug is 4. Other log levels are similar.

@@ -290,3 +287,99 @@ func (dev *Devices) AddResourceUsage(n *util.DeviceUsage, ctr *util.ContainerDev
n.Usedmem += ctr.Usedmem
return nil
}

func (npu *Devices) Fit(devices []*util.DeviceUsage, request util.ContainerDeviceRequest, annos map[string]string, pod *corev1.Pod, allocated *util.PodDevices) (bool, map[string]util.ContainerDevices, string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function l found every device type there is a similar logic, is there a better implementation of this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel the same way, Fine-grained interfaces are more friendly to subsequent scheduler expansion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's our next move, we need to remove redundant logic in fit for each devices

@archlitchi
Copy link
Member Author

@archlitchi After a quick review of this pr, i find there are lots of duplicated code in the current implementation, especially in the Fit func of each device. I have not compared every implementation of device in detail, but i think it is better to create some helper functions in pkg/util or pkg/device/util to deal with common processing logic for each reasons.

yes, we need refine some logics here, for now, i just copy original fitInCertain method to every devices

Copy link
Contributor

hami-robott bot commented Jun 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: limengxuan <[email protected]>
@Shouren
Copy link
Collaborator

Shouren commented Jun 11, 2025

/lgtm

@lengrongfu
Copy link
Member

/lgtm

@archlitchi archlitchi merged commit c6de50b into Project-HAMi:master Jun 11, 2025
12 of 15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants