Merged
Commits
57 commits
88c4a87
feat(gpu): add static policy implementation for GPU resource management
luomingmeng Jun 20, 2025
893df52
refactor(topology): skip add zone node which is not a child of socket…
luomingmeng Jun 20, 2025
56f77c0
feat(gpu): enhance GPU memory allocation with NUMA awareness
luomingmeng Jul 25, 2025
3db0ed6
refactor(cpu): remove unused preferredHintIndexes variable
luomingmeng Jul 25, 2025
42eb792
feat(resource-plugin): add associated device allocation support
luomingmeng Jul 25, 2025
6ca1e52
refactor(qrm-plugins): embed UnimplementedResourcePluginServer in pol…
luomingmeng Jul 25, 2025
3307f1a
feat(gpu): add associated device topology hints support
luomingmeng Jul 31, 2025
24ae049
fix typo and add logs
luomingmeng Aug 4, 2025
198368f
refactor(gpu): remove redundant non-numa-affinity gpu allocation logic
luomingmeng Aug 4, 2025
14f7089
feat(gpu): optimize GPU allocation by preferring NUMA nodes with most…
luomingmeng Sep 22, 2025
920bd75
feat: refactor code into resource plugins and custom device plugins
JustinChengLZ Oct 1, 2025
6d86b6e
chore: add unit tests
JustinChengLZ Oct 2, 2025
ebfd8ad
feat: introduce rdma state and allow states to share within gpu sub-p…
JustinChengLZ Oct 10, 2025
252a155
feat: refactor state to only be in one file
JustinChengLZ Oct 13, 2025
2f9f454
feat: implement rdma custom device plugin and implement logic for acc…
JustinChengLZ Oct 15, 2025
af3c994
feat: implement allocation of accompany resource first before device
JustinChengLZ Oct 18, 2025
535de01
Update gpu_plugin.go
JustinChengLZ Oct 21, 2025
a65a290
refactor(gpu): restructure device plugin and resource management
luomingmeng Oct 20, 2025
5603201
refactor: remove unused GenerateDummyGPUTopology function
luomingmeng Oct 21, 2025
d9d6a71
feat(gpu): implement strategy-based GPU allocation framework
luomingmeng Oct 21, 2025
687ce5d
refactor(gpu-strategy): reorganize gpu allocation strategy components
luomingmeng Oct 22, 2025
f3b0078
refactor(gpu-strategy): make strategy fields private and add accessors
luomingmeng Oct 22, 2025
9d44a1c
feat(device): add device affinity group support
luomingmeng Oct 22, 2025
c78ec70
feat: develop device affinity binding and filtering strategies
JustinChengLZ Oct 23, 2025
b0ce5ee
feat: implement binding strategy to prioritise device affinity during…
JustinChengLZ Oct 27, 2025
eacb837
refactor(gpu): restructure GPU strategy and state management
luomingmeng Oct 29, 2025
545399f
chore: rebase katalyst-api
JustinChengLZ Oct 29, 2025
3f60208
chore: fix unit test, format and lint issues
JustinChengLZ Oct 29, 2025
e9062c8
fix: maintain affinity subgroup sequence in larger affinity groups
JustinChengLZ Nov 3, 2025
95d4ee4
refactor: simplify code by deleting redundant parameters and refactor…
JustinChengLZ Nov 3, 2025
2d67aeb
refactor: make allocation recursive to simplify logic
JustinChengLZ Nov 4, 2025
9f6d529
fix(gpu): handle NUMA node edge cases and improve logging
luomingmeng Nov 5, 2025
02a8e5e
feat(gpu): add canonical strategy implementation and refactor gpu mem…
luomingmeng Nov 5, 2025
a5d73bf
refactor(gpu/strategy): optimize device affinity allocation algorithm
luomingmeng Nov 5, 2025
667f89b
refactor: simplify the grouping of device affinity
JustinChengLZ Nov 5, 2025
791f66a
fix: handling of nil device req
JustinChengLZ Nov 6, 2025
692a317
chore: add unit tests
JustinChengLZ Nov 10, 2025
061d079
refactor(gpumemory): move nil device request check after qos validation
luomingmeng Nov 11, 2025
2640936
fix(gpu): skip zero requests in GetGPUCount and optimize logging
luomingmeng Nov 11, 2025
f47adad
feat(gpumemory): add numa binding check and health status filter
luomingmeng Nov 14, 2025
e153426
fix(gpumemory): handle unhealthy devices and correct capacity values
luomingmeng Nov 17, 2025
8911497
refactor(qrm): remove unused state file directory fields
luomingmeng Nov 17, 2025
76b6dfa
fix(gpumemory): handle numa topology not ready case gracefully
luomingmeng Nov 20, 2025
802642a
chore: add context to interface methods
JustinChengLZ Nov 24, 2025
9d5b6ac
chore: add unit tests
JustinChengLZ Nov 28, 2025
273ebac
feat(state): enhance gpu state management using refactored migration …
JustinChengLZ Nov 28, 2025
81715d1
feat(gpu): add device name tracking and allocation filtering
luomingmeng Dec 18, 2025
3f73c62
feat(cnr): report gpu device topology to cnr
JustinChengLZ Dec 22, 2025
e18d558
fix: do not store state when getting topology hints
JustinChengLZ Dec 26, 2025
e3e6918
fix: corner case bug
JustinChengLZ Jan 5, 2026
569a8eb
fix: remove dependency on kubelet checkpoint file
JustinChengLZ Jan 6, 2026
c2944f4
fix: state nil handling and name change of custom device plugin inter…
JustinChengLZ Jan 7, 2026
00a0b17
fix: do not report CNR with missing information
JustinChengLZ Jan 20, 2026
0246bbc
feat: implement watching of topology when it changes
JustinChengLZ Feb 10, 2026
8712243
feat(gpu): implement lazy state initialization in static policy
luomingmeng Feb 14, 2026
8d6bede
refactor: optimize logic for file watching and periodic resync
JustinChengLZ Feb 16, 2026
bdd283d
fix: breaking unit tests
JustinChengLZ Feb 26, 2026
67 changes: 67 additions & 0 deletions cmd/katalyst-agent/app/agent/qrm/gpu_plugin.go
@@ -0,0 +1,67 @@
/*
Copyright 2022 The Katalyst Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package qrm

import (
	"fmt"
	"strings"
	"sync"

	"github.com/kubewharf/katalyst-core/cmd/katalyst-agent/app/agent"
	phconsts "github.com/kubewharf/katalyst-core/pkg/agent/utilcomponent/periodicalhandler/consts"
	"github.com/kubewharf/katalyst-core/pkg/config"
)

const (
	QRMPluginNameGPU = "qrm_gpu_plugin"
)

var QRMGPUPluginPeriodicalHandlerGroupName = strings.Join([]string{
	QRMPluginNameGPU,
	phconsts.PeriodicalHandlersGroupNameSuffix,
}, phconsts.GroupNameSeparator)

// gpuPolicyInitializers stores the init functions for gpu resource plugin policies
var gpuPolicyInitializers sync.Map

// RegisterGPUPolicyInitializer is used to register user-defined resource plugin init functions
func RegisterGPUPolicyInitializer(name string, initFunc agent.InitFunc) {
	gpuPolicyInitializers.Store(name, initFunc)
}

// getGPUPolicyInitializers returns the registered policy init functions, keyed by policy name
func getGPUPolicyInitializers() map[string]agent.InitFunc {
	agents := make(map[string]agent.InitFunc)
	gpuPolicyInitializers.Range(func(key, value interface{}) bool {
		agents[key.(string)] = value.(agent.InitFunc)
		return true
	})
	return agents
}

// InitQRMGPUPlugins initializes the gpu QRM plugins
func InitQRMGPUPlugins(agentCtx *agent.GenericContext, conf *config.Configuration, extraConf interface{}, agentName string) (bool, agent.Component, error) {
	initializers := getGPUPolicyInitializers()
	policyName := conf.GPUQRMPluginConfig.PolicyName

	initFunc, ok := initializers[policyName]
	if !ok {
		return false, agent.ComponentStub{}, fmt.Errorf("invalid policy name %v for gpu resource plugin", policyName)
	}

	return initFunc(agentCtx, conf, extraConf, agentName)
}
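The registration flow in this file can be sketched in isolation as below. This is a minimal, stdlib-only illustration and not the code from this PR: `InitFunc` is a simplified, hypothetical stand-in for `agent.InitFunc`, and `RegisterPolicyInitializer`, `registeredPolicyNames` and `InitPlugin` are made-up names. Unlike `getGPUPolicyInitializers`, the sketch looks the policy up with `sync.Map.Load` instead of copying the registry into a map first.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// InitFunc is a simplified, hypothetical stand-in for agent.InitFunc.
type InitFunc func(agentName string) (bool, error)

// policyInitializers maps a policy name to its init function.
var policyInitializers sync.Map

// RegisterPolicyInitializer stores initFunc under name, replacing any earlier entry.
func RegisterPolicyInitializer(name string, initFunc InitFunc) {
	policyInitializers.Store(name, initFunc)
}

// registeredPolicyNames returns the registered policy names in sorted order.
func registeredPolicyNames() []string {
	var names []string
	policyInitializers.Range(func(key, _ interface{}) bool {
		names = append(names, key.(string))
		return true
	})
	sort.Strings(names)
	return names
}

// InitPlugin looks up policyName and runs its init function.
func InitPlugin(policyName, agentName string) (bool, error) {
	value, ok := policyInitializers.Load(policyName)
	if !ok {
		return false, fmt.Errorf("invalid policy name %q, registered: %v", policyName, registeredPolicyNames())
	}
	return value.(InitFunc)(agentName)
}

func init() {
	// A policy package would typically register itself from its own init(),
	// which is why the agent imports such packages with a blank identifier.
	RegisterPolicyInitializer("static", func(agentName string) (bool, error) {
		return true, nil
	})
}

func main() {
	ok, err := InitPlugin("static", "qrm_gpu_plugin")
	fmt.Println(ok, err)

	_, err = InitPlugin("dynamic", "qrm_gpu_plugin")
	fmt.Println(err)
}
```

Listing the registered names in the error is an addition of the sketch; it can make a misconfigured policy name easier to diagnose.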
2 changes: 2 additions & 0 deletions cmd/katalyst-agent/app/enableagents.go
@@ -24,6 +24,7 @@ import (
 	"github.com/kubewharf/katalyst-core/cmd/katalyst-agent/app/agent"
 	"github.com/kubewharf/katalyst-core/cmd/katalyst-agent/app/agent/qrm"
 	_ "github.com/kubewharf/katalyst-core/pkg/agent/qrm-plugins/cpu"
+	_ "github.com/kubewharf/katalyst-core/pkg/agent/qrm-plugins/gpu"
 	_ "github.com/kubewharf/katalyst-core/pkg/agent/qrm-plugins/io"
 	_ "github.com/kubewharf/katalyst-core/pkg/agent/qrm-plugins/memory"
 	_ "github.com/kubewharf/katalyst-core/pkg/agent/qrm-plugins/network"
@@ -57,6 +58,7 @@ func init() {
 	agentInitializers.Store(qrm.QRMPluginNameMemory, AgentStarter{Init: qrm.InitQRMMemoryPlugins})
 	agentInitializers.Store(qrm.QRMPluginNameNetwork, AgentStarter{Init: qrm.InitQRMNetworkPlugins})
 	agentInitializers.Store(qrm.QRMPluginNameIO, AgentStarter{Init: qrm.InitQRMIOPlugins})
+	agentInitializers.Store(qrm.QRMPluginNameGPU, AgentStarter{Init: qrm.InitQRMGPUPlugins})
 }
 
 // RegisterAgentInitializer is used to register user-defined agents
75 changes: 75 additions & 0 deletions cmd/katalyst-agent/app/options/qrm/gpu_plugin.go
@@ -0,0 +1,75 @@
/*
Copyright 2022 The Katalyst Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package qrm

import (
	"k8s.io/apimachinery/pkg/api/resource"
	cliflag "k8s.io/component-base/cli/flag"

	"github.com/kubewharf/katalyst-core/cmd/katalyst-agent/app/options/qrm/gpustrategy"
	qrmconfig "github.com/kubewharf/katalyst-core/pkg/config/agent/qrm"
)

type GPUOptions struct {
	PolicyName                 string
	GPUDeviceNames             []string
	GPUMemoryAllocatablePerGPU string
	SkipGPUStateCorruption     bool
	RDMADeviceNames            []string

	GPUStrategyOptions *gpustrategy.GPUStrategyOptions
}

func NewGPUOptions() *GPUOptions {
	return &GPUOptions{
		PolicyName:                 "static",
		GPUDeviceNames:             []string{"nvidia.com/gpu"},
		GPUMemoryAllocatablePerGPU: "100",
		RDMADeviceNames:            []string{},
		GPUStrategyOptions:         gpustrategy.NewGPUStrategyOptions(),
	}
}

func (o *GPUOptions) AddFlags(fss *cliflag.NamedFlagSets) {
	fs := fss.FlagSet("gpu_resource_plugin")

	fs.StringVar(&o.PolicyName, "gpu-resource-plugin-policy",
		o.PolicyName, "The policy gpu resource plugin should use")
	fs.StringSliceVar(&o.GPUDeviceNames, "gpu-resource-names", o.GPUDeviceNames, "The names of the GPU resources")
	fs.StringVar(&o.GPUMemoryAllocatablePerGPU, "gpu-memory-allocatable-per-gpu",
		o.GPUMemoryAllocatablePerGPU, "The total memory allocatable for each GPU, parsed as a Kubernetes resource quantity, e.g. 100")
	fs.BoolVar(&o.SkipGPUStateCorruption, "skip-gpu-state-corruption",
		o.SkipGPUStateCorruption, "Skip gpu state corruption; intended for use after state properties have been updated")
	fs.StringSliceVar(&o.RDMADeviceNames, "rdma-resource-names", o.RDMADeviceNames, "The names of the RDMA resources")
	o.GPUStrategyOptions.AddFlags(fss)
}

func (o *GPUOptions) ApplyTo(conf *qrmconfig.GPUQRMPluginConfig) error {
	conf.PolicyName = o.PolicyName
	conf.GPUDeviceNames = o.GPUDeviceNames
	gpuMemory, err := resource.ParseQuantity(o.GPUMemoryAllocatablePerGPU)
	if err != nil {
		return err
	}
	conf.GPUMemoryAllocatablePerGPU = gpuMemory
	conf.SkipGPUStateCorruption = o.SkipGPUStateCorruption
	conf.RDMADeviceNames = o.RDMADeviceNames
	if err := o.GPUStrategyOptions.ApplyTo(conf.GPUStrategyConfig); err != nil {
		return err
	}
	return nil
}
59 changes: 59 additions & 0 deletions cmd/katalyst-agent/app/options/qrm/gpustrategy/allocate.go
@@ -0,0 +1,59 @@
/*
Copyright 2022 The Katalyst Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package gpustrategy

import (
	"strings"

	cliflag "k8s.io/component-base/cli/flag"

	"github.com/kubewharf/katalyst-core/pkg/config/agent/qrm/gpustrategy"
)

type AllocateStrategyOptions struct {
	CustomFilteringStrategies map[string]string
	CustomSortingStrategy     map[string]string
	CustomBindingStrategy     map[string]string
	CustomAllocationStrategy  map[string]string
}

func NewGPUAllocateStrategyOptions() *AllocateStrategyOptions {
	return &AllocateStrategyOptions{}
}

func (o *AllocateStrategyOptions) AddFlags(fss *cliflag.NamedFlagSets) {
	fs := fss.FlagSet("allocate_strategy")
	fs.StringToStringVar(&o.CustomFilteringStrategies, "gpu-allocate-custom-filtering-strategies",
		o.CustomFilteringStrategies, "The filtering strategies for each resource, e.g. gpu=filtering1/filtering2")
	fs.StringToStringVar(&o.CustomSortingStrategy, "gpu-allocate-custom-sorting-strategy", o.CustomSortingStrategy, "The sorting strategy for each resource")
	fs.StringToStringVar(&o.CustomBindingStrategy, "gpu-allocate-custom-binding-strategy", o.CustomBindingStrategy, "The binding strategy for each resource")
	fs.StringToStringVar(&o.CustomAllocationStrategy, "gpu-allocate-custom-allocation-strategy", o.CustomAllocationStrategy, "The allocation strategy for each resource")
}

func (o *AllocateStrategyOptions) ApplyTo(c *gpustrategy.AllocateStrategyConfig) error {
	// Each value is a "/"-separated list of filtering strategy names.
	// c.CustomFilteringStrategies is assumed to be initialized by the config constructor;
	// writing to a nil map would panic.
	for resourceName, strategies := range o.CustomFilteringStrategies {
		filteringStrategies := strings.Split(strategies, "/")
		c.CustomFilteringStrategies[resourceName] = append(c.CustomFilteringStrategies[resourceName], filteringStrategies...)
	}

	c.CustomSortingStrategy = o.CustomSortingStrategy
	c.CustomBindingStrategy = o.CustomBindingStrategy
	c.CustomAllocationStrategy = o.CustomAllocationStrategy
	return nil
}
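As a rough illustration of the filtering-strategy handling in `ApplyTo`, the sketch below splits a per-resource string into an ordered list of strategy names. It assumes the flag value has already been parsed into a `map[string]string` (pflag's string-to-string flags take `key=value` pairs). `parseFilteringStrategies` is a hypothetical helper, not part of this PR, and the strategy names are the placeholders from the flag help text. Unlike `ApplyTo`, it allocates its own result map and drops empty segments.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFilteringStrategies splits each "/"-separated value into an ordered
// list of strategy names. Empty segments are dropped, which is a choice made
// for this sketch rather than behavior taken from the PR.
func parseFilteringStrategies(raw map[string]string) map[string][]string {
	parsed := make(map[string][]string, len(raw))
	for resourceName, strategies := range raw {
		for _, name := range strings.Split(strategies, "/") {
			if name = strings.TrimSpace(name); name != "" {
				parsed[resourceName] = append(parsed[resourceName], name)
			}
		}
	}
	return parsed
}

func main() {
	// Roughly what a flag value such as gpu=filtering1/filtering2 becomes
	// once the flag library has split it on "=".
	raw := map[string]string{"gpu": "filtering1/filtering2"}
	fmt.Println(parseFilteringStrategies(raw))
}
```

Building a fresh map sidesteps the nil-map write that the in-place version would hit if the target config map were never initialized.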
44 changes: 44 additions & 0 deletions cmd/katalyst-agent/app/options/qrm/gpustrategy/strategy_base.go
@@ -0,0 +1,44 @@
/*
Copyright 2022 The Katalyst Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package gpustrategy

import (
	cliflag "k8s.io/component-base/cli/flag"

	"github.com/kubewharf/katalyst-core/pkg/config/agent/qrm/gpustrategy"
)

type GPUStrategyOptions struct {
	*AllocateStrategyOptions
}

func NewGPUStrategyOptions() *GPUStrategyOptions {
	return &GPUStrategyOptions{
		AllocateStrategyOptions: NewGPUAllocateStrategyOptions(),
	}
}

func (o *GPUStrategyOptions) AddFlags(fss *cliflag.NamedFlagSets) {
	o.AllocateStrategyOptions.AddFlags(fss)
}

func (o *GPUStrategyOptions) ApplyTo(conf *gpustrategy.GPUStrategyConfig) error {
	if err := o.AllocateStrategyOptions.ApplyTo(conf.AllocateStrategyConfig); err != nil {
		return err
	}
	return nil
}
6 changes: 6 additions & 0 deletions cmd/katalyst-agent/app/options/qrm/qrm_base.go
@@ -88,6 +88,7 @@ type QRMPluginsOptions struct {
 	MemoryOptions *MemoryOptions
 	NetworkOptions *NetworkOptions
 	IOOptions *IOOptions
+	GPUOptions *GPUOptions
 }
 
 func NewQRMPluginsOptions() *QRMPluginsOptions {
@@ -96,6 +97,7 @@ func NewQRMPluginsOptions() *QRMPluginsOptions {
 		MemoryOptions: NewMemoryOptions(),
 		NetworkOptions: NewNetworkOptions(),
 		IOOptions: NewIOOptions(),
+		GPUOptions: NewGPUOptions(),
 	}
 }
 
@@ -104,6 +106,7 @@ func (o *QRMPluginsOptions) AddFlags(fss *cliflag.NamedFlagSets) {
 	o.MemoryOptions.AddFlags(fss)
 	o.NetworkOptions.AddFlags(fss)
 	o.IOOptions.AddFlags(fss)
+	o.GPUOptions.AddFlags(fss)
 }
 
 func (o *QRMPluginsOptions) ApplyTo(conf *qrmconfig.QRMPluginsConfiguration) error {
@@ -119,5 +122,8 @@ func (o *QRMPluginsOptions) ApplyTo(conf *qrmconfig.QRMPluginsConfiguration) err
 	if err := o.IOOptions.ApplyTo(conf.IOQRMPluginConfig); err != nil {
 		return err
 	}
+	if err := o.GPUOptions.ApplyTo(conf.GPUQRMPluginConfig); err != nil {
+		return err
+	}
 	return nil
 }
4 changes: 2 additions & 2 deletions go.mod
@@ -16,6 +16,7 @@ require (
 	github.com/golang/mock v1.6.0
 	github.com/golang/protobuf v1.5.3
 	github.com/google/cadvisor v0.44.2
+	github.com/google/go-cmp v0.5.9
 	github.com/google/uuid v1.3.0
 	github.com/h2non/gock v1.2.0
 	github.com/klauspost/cpuid/v2 v2.2.6
@@ -100,7 +101,6 @@ require (
 	github.com/godbus/dbus/v5 v5.0.6 // indirect
 	github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect
 	github.com/google/gnostic v0.6.9 // indirect
-	github.com/google/go-cmp v0.5.9 // indirect
 	github.com/google/gofuzz v1.2.0 // indirect
 	github.com/gopherjs/gopherjs v0.0.0-20200217142428-fce0ec30dd00 // indirect
 	github.com/grpc-ecosystem/go-grpc-prometheus v1.2.0 // indirect
@@ -196,7 +196,7 @@ replace (
 	k8s.io/kube-proxy => k8s.io/kube-proxy v0.24.6
 	k8s.io/kube-scheduler => k8s.io/kube-scheduler v0.24.6
 	k8s.io/kubectl => k8s.io/kubectl v0.24.6
-	k8s.io/kubelet => github.com/kubewharf/kubelet v1.24.6-kubewharf.9
+	k8s.io/kubelet => github.com/kubewharf/kubelet v1.24.6-kubewharf-pre.1
 	k8s.io/kubernetes => k8s.io/kubernetes v1.24.6
 	k8s.io/legacy-cloud-providers => k8s.io/legacy-cloud-providers v0.24.6
 	k8s.io/metrics => k8s.io/metrics v0.24.6
4 changes: 2 additions & 2 deletions go.sum
@@ -576,8 +576,8 @@ github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY=
 github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE=
 github.com/kubewharf/katalyst-api v0.5.9-0.20260108125536-85e136f5902c h1:ohKHA5TOlW9487menKnKH2M14LeIq1xQ1yW4xp8x9o8=
 github.com/kubewharf/katalyst-api v0.5.9-0.20260108125536-85e136f5902c/go.mod h1:BZMVGVl3EP0eCn5xsDgV41/gjYkoh43abIYxrB10e3k=
-github.com/kubewharf/kubelet v1.24.6-kubewharf.9 h1:jOTYZt7h/J7I8xQMKMUcJjKf5UFBv37jHWvNp5VRFGc=
-github.com/kubewharf/kubelet v1.24.6-kubewharf.9/go.mod h1:MxbSZUx3wXztFneeelwWWlX7NAAStJ6expqq7gY2J3c=
+github.com/kubewharf/kubelet v1.24.6-kubewharf-pre.1 h1:pzU37yZWrOBosNX+Laay9Ess0Bff/rsWanBxbdXnHnM=
+github.com/kubewharf/kubelet v1.24.6-kubewharf-pre.1/go.mod h1:MxbSZUx3wXztFneeelwWWlX7NAAStJ6expqq7gY2J3c=
 github.com/kyoh86/exportloopref v0.1.7/go.mod h1:h1rDl2Kdj97+Kwh4gdz3ujE7XHmH51Q0lUiZ1z4NLj8=
 github.com/lib/pq v1.0.0/go.mod h1:5WUZQaWbwv1U+lTReE5YruASi9Al49XbQIvNi/34Woo=
 github.com/libopenstorage/openstorage v1.0.0/go.mod h1:Sp1sIObHjat1BeXhfMqLZ14wnOzEhNx2YQedreMcUyc=