Motivation
The autoscaler currently only supports Deployment as a scaling target. All workload access — replica counts, GPU extraction, vLLM arg parsing, pod ownership resolution — is hardcoded to *appsv1.Deployment. This prevents scaling multi-node inference workloads that use LeaderWorkerSet (leaderworkerset.x-k8s.io/v1), the standard Kubernetes API for leader-worker patterns commonly used in tensor-parallel and pipeline-parallel vLLM deployments.
The ScaleTargetRef field in the VA CRD already accepts any Kind (it uses autoscalingv1.CrossVersionObjectReference), but the internal implementation ignores the Kind and always fetches a Deployment.
Note: The ScaleFromZero engine (internal/engines/scalefromzero/engine.go) already supports arbitrary Kinds via RESTMapper + unstructured client. This pattern should be extended to the saturation engine.
Current State: Deployment-Hardcoded Locations
| # | Component | File | Line | Issue |
|---|-----------|------|------|-------|
| 1 | `GetDeploymentWithBackoff()` | `internal/utils/utils.go` | 89 | Typed to `*appsv1.Deployment` parameter |
| 2 | `getDeploymentGPUsPerReplica()` | `internal/engines/saturation/engine.go` | 562 | Accesses `deploy.Spec.Template.Spec.Containers` |
| 3 | `ParseVLLMArgs()` | `internal/engines/analyzers/saturation_v2/deployment_parser.go` | 55 | Accesses `deploy.Spec.Template.Spec.Containers` |
| 4 | `GetCurrentDeploymentReplicas()` | `internal/actuator/actuator.go` | 28 | Reads `deploy.Status.Replicas` and `deploy.Spec.Replicas` |
| 5 | `findDeploymentForPod()` | `internal/collector/source/pod_va_mapper.go` | 88 | Hardcoded `owner.Kind != "Deployment"` string check |
| 6 | Deployment maps | `engine.go`, `engine_v2.go`, `replica_metrics.go` | various | `map[string]*appsv1.Deployment` throughout |
| 7 | Indexer Kind default | `internal/controller/indexers/indexers.go` | 45 | Defaults unknown Kinds to `apps/v1` |
LeaderWorkerSet vs Deployment
| Aspect | Deployment | LeaderWorkerSet |
|--------|------------|-----------------|
| API group | `apps/v1` | `leaderworkerset.x-k8s.io/v1` |
| Pod template | `spec.template` | `spec.leaderWorkerTemplate.workerTemplate` (required) + optional `spec.leaderWorkerTemplate.leaderTemplate` (`*PodTemplateSpec`, defaults to workerTemplate when nil) |
| Replica field | `spec.replicas` | `spec.replicas` (number of groups) |
| Group size | N/A (1 pod = 1 replica) | `spec.leaderWorkerTemplate.size` (total pods per group: 1 leader + Size-1 workers) |
| Status replicas | `status.replicas` | `status.replicas` (number of ready groups) |
| Pod ownership | Pod → ReplicaSet → Deployment | Pod → StatefulSet → LeaderWorkerSet (leader) or Pod → LeaderWorkerSet (workers) |
| GPU resources | On container in pod template | On containers in both leader and worker templates — leader also does compute |
| vLLM args | In container args/env | In leader container args/env (leader starts vLLM API server + Ray head; workers join as Ray workers) |
| vLLM metrics | From all pods | From leader pod (port 8080, serves OpenAI API + exposes `/metrics`) |
| Scale subresource | Supported | Supported |
Important: Leader Also Does Compute
In vLLM multi-node deployments, both leader and worker pods run vLLM and require GPUs. The leader is not just a coordinator — it runs the vLLM API server, acts as the Ray head node, and participates in tensor-parallel/pipeline-parallel computation alongside workers.
Example from the vLLM LWS deployment guide:
- Leader: 8x NVIDIA GPUs, runs `vllm.entrypoints.openai.api_server`, serves port 8080
- Workers: 8x NVIDIA GPUs each, join as Ray worker nodes
- Both have identical GPU/memory resource requests
Total GPUs per replica group = leader_GPUs + (Size - 1) × worker_GPUs.
In practice, leader and worker GPU counts are typically identical for TP/PP workloads.
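The group-total formula can be written as a small helper; the name `totalGPUsPerReplica` and its plain-int signature are illustrative, not the autoscaler's actual API:

```go
package main

import "fmt"

// totalGPUsPerReplica computes the GPU count for one replica group:
// the leader's GPUs plus (size-1) workers, each with workerGPUs.
// For a plain Deployment, size == 1 and only the leader term applies.
func totalGPUsPerReplica(leaderGPUs, workerGPUs, size int) int {
	if size < 1 {
		size = 1 // guard against an unset/invalid group size
	}
	return leaderGPUs + (size-1)*workerGPUs
}

func main() {
	// An LWS group of size 4 with 8 GPUs on every pod: 8 + 3*8 = 32.
	fmt.Println(totalGPUsPerReplica(8, 8, 4)) // → 32
	// A Deployment-style single-pod replica: just the leader's GPUs.
	fmt.Println(totalGPUsPerReplica(8, 8, 1)) // → 8
}
```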
Proposed Design
ScaleTargetAccessor interface
Introduce a ScaleTargetAccessor interface that provides a uniform API to extract scaling-relevant information from any supported workload kind:
```go
// ScaleTargetAccessor provides a uniform interface to extract scaling-relevant
// information from any supported scale target kind (Deployment, LeaderWorkerSet).
type ScaleTargetAccessor interface {
	// GetReplicas returns current spec replicas.
	GetReplicas() *int32

	// GetStatusReplicas returns status replicas (actual running).
	GetStatusReplicas() int32

	// GetLeaderPodTemplateSpec returns the pod template for the leader/primary pod.
	// For Deployment: the single pod template.
	// For LWS: the leader template (falls back to worker template if not set).
	// Use this for: vLLM args extraction (leader starts the API server),
	// metrics port discovery, pod label matching.
	GetLeaderPodTemplateSpec() corev1.PodTemplateSpec

	// GetWorkerPodTemplateSpec returns the pod template for worker pods.
	// For Deployment: same as GetLeaderPodTemplateSpec() (single template).
	// For LWS: the worker template.
	// Use this for: GPU resource extraction when workers differ from leader.
	GetWorkerPodTemplateSpec() corev1.PodTemplateSpec

	// GetTotalGPUsPerReplica returns total GPU count across all pods in a replica.
	// For Deployment: GPUs from the single pod template.
	// For LWS: leader_GPUs + (Size - 1) * worker_GPUs.
	GetTotalGPUsPerReplica() int

	// GetGroupSize returns the number of pods per replica.
	// For Deployment: always 1.
	// For LWS: spec.leaderWorkerTemplate.size (1 leader + N-1 workers).
	GetGroupSize() int32

	// GetObject returns the underlying client.Object for K8s operations.
	GetObject() client.Object
}
```
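To make the interface's semantics concrete, here is a compile-checked toy model of the two planned implementations. `PodTemplate` and the trimmed method set are stand-ins for the real `corev1`/`client` types, so this sketch builds without k8s dependencies; it is not the actual `DeploymentAccessor`/`LWSAccessor` code:

```go
package main

import "fmt"

// PodTemplate is a simplified stand-in for corev1.PodTemplateSpec,
// reduced to the one field this sketch needs.
type PodTemplate struct{ GPUsPerPod int }

// scaleTargetAccessor mirrors the proposed interface, trimmed of the
// replica and client.Object plumbing.
type scaleTargetAccessor interface {
	GetLeaderPodTemplateSpec() PodTemplate
	GetWorkerPodTemplateSpec() PodTemplate
	GetTotalGPUsPerReplica() int
	GetGroupSize() int32
}

// deploymentAccessor: one pod per replica, a single template.
type deploymentAccessor struct{ template PodTemplate }

func (d deploymentAccessor) GetLeaderPodTemplateSpec() PodTemplate { return d.template }
func (d deploymentAccessor) GetWorkerPodTemplateSpec() PodTemplate { return d.template }
func (d deploymentAccessor) GetGroupSize() int32                   { return 1 }
func (d deploymentAccessor) GetTotalGPUsPerReplica() int           { return d.template.GPUsPerPod }

// lwsAccessor: size pods per group; leaderTemplate may be nil, in which
// case it falls back to the worker template (as in the LWS API).
type lwsAccessor struct {
	leaderTemplate *PodTemplate
	workerTemplate PodTemplate
	size           int32
}

func (l lwsAccessor) GetLeaderPodTemplateSpec() PodTemplate {
	if l.leaderTemplate != nil {
		return *l.leaderTemplate
	}
	return l.workerTemplate
}
func (l lwsAccessor) GetWorkerPodTemplateSpec() PodTemplate { return l.workerTemplate }
func (l lwsAccessor) GetGroupSize() int32                   { return l.size }
func (l lwsAccessor) GetTotalGPUsPerReplica() int {
	return l.GetLeaderPodTemplateSpec().GPUsPerPod + int(l.size-1)*l.workerTemplate.GPUsPerPod
}

func main() {
	// Callers only see the interface, so engine code stays kind-agnostic.
	var a scaleTargetAccessor = lwsAccessor{workerTemplate: PodTemplate{GPUsPerPod: 8}, size: 4}
	fmt.Println(a.GetTotalGPUsPerReplica(), a.GetGroupSize()) // → 32 4
}
```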
Changes Required
1. Add LWS API dependency
```shell
go get sigs.k8s.io/lws@latest
```
Register LWS scheme in `cmd/main.go`:
```go
import lwsv1 "sigs.k8s.io/lws/api/leaderworkerset/v1"

func init() {
	utilruntime.Must(lwsv1.AddToScheme(scheme))
}
```
2. Create ScaleTargetAccessor package
New package: `internal/utils/scaletarget/`

| File | Contents |
|------|----------|
| `accessor.go` | `ScaleTargetAccessor` interface definition |
| `deployment.go` | `DeploymentAccessor` implementation |
| `lws.go` | `LWSAccessor` implementation |
| `fetch.go` | `FetchScaleTarget()` factory function |
| `accessor_test.go` | Unit tests for both implementations |
3. Refactor callers to use ScaleTargetAccessor
| Before | After |
|--------|-------|
| `getDeploymentGPUsPerReplica(deploy)` | `accessor.GetTotalGPUsPerReplica()` |
| `ParseVLLMArgs(deploy *appsv1.Deployment)` | `ParseVLLMArgs(podTemplate corev1.PodTemplateSpec)` — use `accessor.GetLeaderPodTemplateSpec()` (leader runs the vLLM API server with `--tensor-parallel-size`, `--model`, etc.) |
| `GetCurrentDeploymentReplicas(va)` | `accessor.GetStatusReplicas()` |
| `deployments map[string]*appsv1.Deployment` | `scaleTargets map[string]ScaleTargetAccessor` |
| `utils.GetDeploymentWithBackoff(...)` | `scaletarget.FetchScaleTarget(ctx, c, kind, name, ns)` |
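A minimal sketch of how a `FetchScaleTarget`-style factory could dispatch on Kind. The trimmed `accessor` interface and the hardcoded group size are stand-ins; the real function would take the context, client, name, and namespace, fetch the object, and build the accessor from it:

```go
package main

import "fmt"

// accessor is a trimmed stand-in for the proposed ScaleTargetAccessor.
type accessor interface{ GetGroupSize() int32 }

type deploymentAccessor struct{}

func (deploymentAccessor) GetGroupSize() int32 { return 1 }

type lwsAccessor struct{ size int32 }

func (l lwsAccessor) GetGroupSize() int32 { return l.size }

// fetchScaleTarget chooses an implementation by Kind and rejects
// anything it does not support with a clear error.
func fetchScaleTarget(kind string) (accessor, error) {
	switch kind {
	case "Deployment", "": // empty Kind keeps today's Deployment default
		return deploymentAccessor{}, nil
	case "LeaderWorkerSet":
		// size would come from the fetched object's leaderWorkerTemplate.size
		return lwsAccessor{size: 4}, nil
	default:
		return nil, fmt.Errorf("unsupported scale target kind %q", kind)
	}
}

func main() {
	a, _ := fetchScaleTarget("LeaderWorkerSet")
	fmt.Println(a.GetGroupSize()) // → 4
	if _, err := fetchScaleTarget("DaemonSet"); err != nil {
		fmt.Println(err)
	}
}
```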
4. Fix pod ownership chain
`internal/collector/source/pod_va_mapper.go:88`:
```go
// Before:
if rsOwner.Kind != "Deployment" {
	return ""
}

// After: support multiple ownership chains
// Deployment: Pod → ReplicaSet → Deployment
// LWS leader: Pod → StatefulSet → LeaderWorkerSet
// LWS worker: Pod → LeaderWorkerSet (direct)
```
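The multi-chain resolution could look like the following sketch. `ownerRef` and `scaleTargetKindForPod` are simplified illustrations; the real code would walk `metav1.OwnerReference`s, fetching the intermediate object to read its owner:

```go
package main

import "fmt"

// ownerRef is a simplified stand-in for metav1.OwnerReference.
type ownerRef struct{ Kind string }

// scaleTargetKindForPod resolves which scale-target kind a pod belongs to,
// given its immediate owner and, when that owner is an intermediate object
// (ReplicaSet or StatefulSet), the owner's own owner.
func scaleTargetKindForPod(podOwner ownerRef, ownerOwner *ownerRef) string {
	switch podOwner.Kind {
	case "ReplicaSet": // Deployment chain: Pod → ReplicaSet → Deployment
		if ownerOwner != nil && ownerOwner.Kind == "Deployment" {
			return "Deployment"
		}
	case "StatefulSet": // LWS leader chain: Pod → StatefulSet → LeaderWorkerSet
		if ownerOwner != nil && ownerOwner.Kind == "LeaderWorkerSet" {
			return "LeaderWorkerSet"
		}
	case "LeaderWorkerSet": // LWS worker chain: Pod → LeaderWorkerSet
		return "LeaderWorkerSet"
	}
	return "" // unsupported owner chain
}

func main() {
	fmt.Println(scaleTargetKindForPod(ownerRef{"ReplicaSet"}, &ownerRef{"Deployment"})) // → Deployment
	fmt.Println(scaleTargetKindForPod(ownerRef{"LeaderWorkerSet"}, nil))                // → LeaderWorkerSet
}
```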
5. Update indexer
`internal/controller/indexers/indexers.go`:
```go
switch ref.Kind {
case "Deployment":
	ref.APIVersion = "apps/v1"
case "LeaderWorkerSet":
	ref.APIVersion = "leaderworkerset.x-k8s.io/v1"
default:
	ref.APIVersion = "apps/v1"
}
```
6. Update deployment_parser.go
Change `ParseVLLMArgs` to accept `corev1.PodTemplateSpec` instead of `*appsv1.Deployment`:
```go
// Before:
func ParseVLLMArgs(deploy *appsv1.Deployment) VLLMEngineParams {
	for _, container := range deploy.Spec.Template.Spec.Containers { ... }
}

// After:
func ParseVLLMArgs(podTemplate corev1.PodTemplateSpec) VLLMEngineParams {
	for _, container := range podTemplate.Spec.Containers { ... }
}
```
Callers use `accessor.GetLeaderPodTemplateSpec()` to provide the template, because the leader pod starts the vLLM API server with `--tensor-parallel-size`, `--model`, `--max-num-seqs`, and other engine parameters. Workers join as Ray worker nodes and inherit their configuration from the leader.
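As a sketch of the per-flag extraction such a parser performs, assuming vLLM's usual `--flag value` and `--flag=value` forms (`flagValue` is a hypothetical helper, not the parser's real API):

```go
package main

import (
	"fmt"
	"strings"
)

// flagValue extracts the value of a CLI flag from a container's args,
// handling both "--flag value" and "--flag=value" forms. It returns the
// value and whether the flag was present.
func flagValue(args []string, flag string) (string, bool) {
	for i, a := range args {
		if a == flag && i+1 < len(args) {
			return args[i+1], true
		}
		if strings.HasPrefix(a, flag+"=") {
			return strings.TrimPrefix(a, flag+"="), true
		}
	}
	return "", false
}

func main() {
	// Args as they might appear on the leader container of an LWS group.
	args := []string{"--model", "meta-llama/Llama-3-70B", "--tensor-parallel-size=8"}
	tp, _ := flagValue(args, "--tensor-parallel-size")
	model, _ := flagValue(args, "--model")
	fmt.Println(model, tp) // → meta-llama/Llama-3-70B 8
}
```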
Example VA for LWS
```yaml
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-70b-tp8
  labels:
    inference.optimization/acceleratorName: "H100"
spec:
  scaleTargetRef:
    kind: LeaderWorkerSet
    name: llama-70b-tp8-lws
    apiVersion: leaderworkerset.x-k8s.io/v1
  modelID: "llama-70b"
  variantCost: "80.0"
```
Backward Compatibility
- Existing VAs with `kind: Deployment` work unchanged — `DeploymentAccessor` preserves current behavior
- The CRD `ScaleTargetRef` already accepts any Kind; no schema change needed
- The refactor is purely internal; no user-visible API changes
- LWS scheme registration is additive (does not affect Deployment handling)
- If LWS CRDs are not installed in the cluster, VAs with `kind: LeaderWorkerSet` will fail at fetch time with a clear error