
Support LeaderWorkerSet (LWS) as Scaling Target #811

@ev-shindin

Description

Motivation

The autoscaler currently only supports Deployment as a scaling target. All workload access — replica counts, GPU extraction, vLLM arg parsing, pod ownership resolution — is hardcoded to *appsv1.Deployment. This prevents scaling multi-node inference workloads that use LeaderWorkerSet (leaderworkerset.x-k8s.io/v1), the standard Kubernetes API for leader-worker patterns commonly used in tensor-parallel and pipeline-parallel vLLM deployments.

The ScaleTargetRef field in the VA CRD already accepts any Kind (it uses autoscalingv1.CrossVersionObjectReference), but the internal implementation ignores the Kind and always fetches a Deployment.

Note: The ScaleFromZero engine (internal/engines/scalefromzero/engine.go) already supports arbitrary Kinds via RESTMapper + unstructured client. This pattern should be extended to the saturation engine.

Current State: Deployment-Hardcoded Locations

| # | Component | File | Line | Issue |
|---|-----------|------|------|-------|
| 1 | `GetDeploymentWithBackoff()` | `internal/utils/utils.go` | 89 | Typed to `*appsv1.Deployment` parameter |
| 2 | `getDeploymentGPUsPerReplica()` | `internal/engines/saturation/engine.go` | 562 | Accesses `deploy.Spec.Template.Spec.Containers` |
| 3 | `ParseVLLMArgs()` | `internal/engines/analyzers/saturation_v2/deployment_parser.go` | 55 | Accesses `deploy.Spec.Template.Spec.Containers` |
| 4 | `GetCurrentDeploymentReplicas()` | `internal/actuator/actuator.go` | 28 | Reads `deploy.Status.Replicas` and `deploy.Spec.Replicas` |
| 5 | `findDeploymentForPod()` | `internal/collector/source/pod_va_mapper.go` | 88 | Hardcoded `owner.Kind != "Deployment"` string check |
| 6 | Deployment maps | `engine.go`, `engine_v2.go`, `replica_metrics.go` | various | `map[string]*appsv1.Deployment` throughout |
| 7 | Indexer Kind default | `internal/controller/indexers/indexers.go` | 45 | Defaults unknown Kinds to `apps/v1` |

LeaderWorkerSet vs Deployment

| Aspect | Deployment | LeaderWorkerSet |
|--------|------------|-----------------|
| API group | `apps/v1` | `leaderworkerset.x-k8s.io/v1` |
| Pod template | `spec.template` | `spec.leaderWorkerTemplate.workerTemplate` (required) plus optional `spec.leaderWorkerTemplate.leaderTemplate` (`*PodTemplateSpec`, defaults to the worker template when nil) |
| Replica field | `spec.replicas` | `spec.replicas` (number of groups) |
| Group size | N/A (1 pod = 1 replica) | `spec.leaderWorkerTemplate.size` (total pods per group: 1 leader + Size-1 workers) |
| Status replicas | `status.replicas` | `status.replicas` (number of ready groups) |
| Pod ownership | Pod → ReplicaSet → Deployment | Pod → StatefulSet → LeaderWorkerSet (leader) or Pod → LeaderWorkerSet (workers) |
| GPU resources | On container in pod template | On containers in both leader and worker templates; the leader also does compute |
| vLLM args | In container args/env | In leader container args/env (leader starts vLLM API server + Ray head; workers join as Ray workers) |
| vLLM metrics | From all pods | From leader pod (port 8080, serves OpenAI API + exposes `/metrics`) |
| Scale subresource | Supported | Supported |

Important: Leader Also Does Compute

In vLLM multi-node deployments, both leader and worker pods run vLLM and require GPUs. The leader is not just a coordinator — it runs the vLLM API server, acts as the Ray head node, and participates in tensor-parallel/pipeline-parallel computation alongside workers.

Example from vLLM LWS deployment guide:

  • Leader: 8x NVIDIA GPUs, runs vllm.entrypoints.openai.api_server, serves port 8080
  • Workers: 8x NVIDIA GPUs each, join as Ray worker nodes
  • Both have identical GPU/memory resource requests

Total GPUs per replica group = leader_GPUs + (Size - 1) × worker_GPUs.
In practice, leader and worker GPU counts are typically identical for TP/PP workloads.
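The arithmetic an LWS-aware accessor needs is small; a minimal sketch (the function name is illustrative, not from the codebase):

```go
package main

import "fmt"

// totalGPUsPerGroup computes the GPU count for one replica group:
// the leader's GPUs plus (size-1) workers' GPUs. For a Deployment the
// same formula holds with size = 1.
func totalGPUsPerGroup(leaderGPUs, workerGPUs, size int) int {
	if size < 1 {
		size = 1
	}
	return leaderGPUs + (size-1)*workerGPUs
}

func main() {
	// 8 GPUs on the leader and on each of 3 workers (size = 4).
	fmt.Println(totalGPUsPerGroup(8, 8, 4)) // → 32
}
```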

Proposed Design

ScaleTargetAccessor interface

Introduce a ScaleTargetAccessor interface that provides a uniform API to extract scaling-relevant information from any supported workload kind:

```go
// ScaleTargetAccessor provides a uniform interface to extract scaling-relevant
// information from any supported scale target kind (Deployment, LeaderWorkerSet).
type ScaleTargetAccessor interface {
    // GetReplicas returns current spec replicas.
    GetReplicas() *int32
    // GetStatusReplicas returns status replicas (actual running).
    GetStatusReplicas() int32
    // GetLeaderPodTemplateSpec returns the pod template for the leader/primary pod.
    // For Deployment: the single pod template.
    // For LWS: the leader template (falls back to worker template if not set).
    // Use this for: vLLM args extraction (leader starts the API server),
    // metrics port discovery, pod label matching.
    GetLeaderPodTemplateSpec() corev1.PodTemplateSpec
    // GetWorkerPodTemplateSpec returns the pod template for worker pods.
    // For Deployment: same as GetLeaderPodTemplateSpec() (single template).
    // For LWS: the worker template.
    // Use this for: GPU resource extraction when workers differ from leader.
    GetWorkerPodTemplateSpec() corev1.PodTemplateSpec
    // GetTotalGPUsPerReplica returns total GPU count across all pods in a replica.
    // For Deployment: GPUs from the single pod template.
    // For LWS: leader_GPUs + (Size - 1) * worker_GPUs.
    GetTotalGPUsPerReplica() int
    // GetGroupSize returns the number of pods per replica.
    // For Deployment: always 1.
    // For LWS: spec.leaderWorkerTemplate.size (1 leader + N-1 workers).
    GetGroupSize() int32
    // GetObject returns the underlying client.Object for K8s operations.
    GetObject() client.Object
}
```
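The leader-template fallback in `GetLeaderPodTemplateSpec` is the one piece of LWS-specific logic worth pinning down. A self-contained sketch using simplified stand-in types (`PodTemplate` and `lwsTemplates` stand in for `corev1.PodTemplateSpec` and `spec.leaderWorkerTemplate`):

```go
package main

import "fmt"

// PodTemplate is a simplified stand-in for corev1.PodTemplateSpec.
type PodTemplate struct{ Name string }

// lwsTemplates is a simplified stand-in for spec.leaderWorkerTemplate.
type lwsTemplates struct {
	LeaderTemplate *PodTemplate // optional; nil means "same as workers"
	WorkerTemplate PodTemplate  // required
}

// GetLeaderPodTemplateSpec mirrors the documented fallback: when no
// leader template is set, the worker template applies to the leader too.
func (t lwsTemplates) GetLeaderPodTemplateSpec() PodTemplate {
	if t.LeaderTemplate != nil {
		return *t.LeaderTemplate
	}
	return t.WorkerTemplate
}

func main() {
	noLeader := lwsTemplates{WorkerTemplate: PodTemplate{Name: "worker"}}
	fmt.Println(noLeader.GetLeaderPodTemplateSpec().Name) // → worker
}
```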

Changes Required

1. Add LWS API dependency

```shell
go get sigs.k8s.io/lws@latest
```

Register LWS scheme in cmd/main.go:

```go
import lwsv1 "sigs.k8s.io/lws/api/leaderworkerset/v1"

func init() {
    utilruntime.Must(lwsv1.AddToScheme(scheme))
}
```

2. Create ScaleTargetAccessor package

New package: internal/utils/scaletarget/

| File | Contents |
|------|----------|
| `accessor.go` | `ScaleTargetAccessor` interface definition |
| `deployment.go` | `DeploymentAccessor` implementation |
| `lws.go` | `LWSAccessor` implementation |
| `fetch.go` | `FetchScaleTarget()` factory function |
| `accessor_test.go` | Unit tests for both implementations |
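The shape of the `FetchScaleTarget()` dispatch can be sketched without the Kubernetes client machinery. `Accessor`, `newAccessor`, and the two concrete types below are illustrative stand-ins; the real factory would fetch the typed object from the API server before wrapping it:

```go
package main

import "fmt"

// Accessor is a stand-in for ScaleTargetAccessor, reduced to one method.
type Accessor interface{ GetGroupSize() int32 }

type deploymentAccessor struct{}

// A Deployment replica is always a single pod.
func (deploymentAccessor) GetGroupSize() int32 { return 1 }

type lwsAccessor struct{ size int32 }

// An LWS replica is a group of spec.leaderWorkerTemplate.size pods.
func (a lwsAccessor) GetGroupSize() int32 { return a.size }

// newAccessor shows the dispatch a FetchScaleTarget factory would do;
// kind comes from the VA's scaleTargetRef.kind.
func newAccessor(kind string, groupSize int32) (Accessor, error) {
	switch kind {
	case "Deployment":
		return deploymentAccessor{}, nil
	case "LeaderWorkerSet":
		return lwsAccessor{size: groupSize}, nil
	default:
		return nil, fmt.Errorf("unsupported scale target kind %q", kind)
	}
}

func main() {
	a, _ := newAccessor("LeaderWorkerSet", 4)
	fmt.Println(a.GetGroupSize()) // → 4
}
```

Unknown kinds return an error instead of silently falling back to Deployment semantics, matching the "clear error at fetch time" behavior described under Backward Compatibility.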

3. Refactor callers to use ScaleTargetAccessor

| Before | After |
|--------|-------|
| `getDeploymentGPUsPerReplica(deploy)` | `accessor.GetTotalGPUsPerReplica()` |
| `ParseVLLMArgs(deploy *appsv1.Deployment)` | `ParseVLLMArgs(podTemplate corev1.PodTemplateSpec)`; callers pass `accessor.GetLeaderPodTemplateSpec()` (the leader runs the vLLM API server with `--tensor-parallel-size`, `--model`, etc.) |
| `GetCurrentDeploymentReplicas(va)` | `accessor.GetStatusReplicas()` |
| `deployments map[string]*appsv1.Deployment` | `scaleTargets map[string]ScaleTargetAccessor` |
| `utils.GetDeploymentWithBackoff(...)` | `scaletarget.FetchScaleTarget(ctx, c, kind, name, ns)` |

4. Fix pod ownership chain

internal/collector/source/pod_va_mapper.go:88:

```go
// Before:
if rsOwner.Kind != "Deployment" {
    return ""
}

// After: support multiple ownership chains
// Deployment: Pod → ReplicaSet → Deployment
// LWS leader: Pod → StatefulSet → LeaderWorkerSet
// LWS worker: Pod → LeaderWorkerSet (direct)
```
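One way to structure the three-chain resolution, sketched with a simplified `OwnerRef` stand-in for `metav1.OwnerReference` and an injected `lookup` callback in place of an API-server get (all names here are illustrative):

```go
package main

import "fmt"

// OwnerRef is a simplified stand-in for metav1.OwnerReference.
type OwnerRef struct{ Kind, Name string }

// resolveScaleTarget walks a pod's controller-owner chain to its scale
// target. lookup resolves an intermediate owner's own controller reference.
func resolveScaleTarget(podOwner OwnerRef, lookup func(OwnerRef) (OwnerRef, bool)) (OwnerRef, bool) {
	switch podOwner.Kind {
	case "LeaderWorkerSet":
		// LWS worker pods are owned by the LeaderWorkerSet directly.
		return podOwner, true
	case "ReplicaSet", "StatefulSet":
		// One more hop: ReplicaSet → Deployment, StatefulSet → LeaderWorkerSet.
		parent, ok := lookup(podOwner)
		if ok && (parent.Kind == "Deployment" || parent.Kind == "LeaderWorkerSet") {
			return parent, true
		}
	}
	return OwnerRef{}, false
}

func main() {
	// Fake lookup standing in for fetching the StatefulSet's owner.
	lookup := func(o OwnerRef) (OwnerRef, bool) {
		return OwnerRef{Kind: "LeaderWorkerSet", Name: "llama-70b-tp8-lws"}, true
	}
	target, _ := resolveScaleTarget(OwnerRef{Kind: "StatefulSet", Name: "llama-70b-tp8-lws"}, lookup)
	fmt.Println(target.Kind) // → LeaderWorkerSet
}
```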

5. Update indexer

internal/controller/indexers/indexers.go:

```go
switch ref.Kind {
case "Deployment":
    ref.APIVersion = "apps/v1"
case "LeaderWorkerSet":
    ref.APIVersion = "leaderworkerset.x-k8s.io/v1"
default:
    ref.APIVersion = "apps/v1"
}
```

6. Update deployment_parser.go

Change ParseVLLMArgs to accept corev1.PodTemplateSpec instead of *appsv1.Deployment:

```go
// Before:
func ParseVLLMArgs(deploy *appsv1.Deployment) VLLMEngineParams {
    for _, container := range deploy.Spec.Template.Spec.Containers { ... }
}

// After:
func ParseVLLMArgs(podTemplate corev1.PodTemplateSpec) VLLMEngineParams {
    for _, container := range podTemplate.Spec.Containers { ... }
}
```

Callers use accessor.GetLeaderPodTemplateSpec() to provide the template, because the
leader pod starts the vLLM API server with --tensor-parallel-size, --model,
--max-num-seqs, and other engine parameters. Workers join as Ray worker nodes and
inherit their configuration from the leader.
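The flag lookup inside the parser boils down to scanning container args for `--flag value` and `--flag=value` forms. A self-contained sketch (the `argValue` helper is illustrative, not from the codebase; flag names are from the vLLM CLI):

```go
package main

import (
	"fmt"
	"strings"
)

// argValue finds "--flag value" or "--flag=value" in a container's args,
// the shape ParseVLLMArgs would scan in the leader pod template.
func argValue(args []string, flag string) (string, bool) {
	for i, a := range args {
		if a == flag && i+1 < len(args) {
			return args[i+1], true // space-separated form
		}
		if strings.HasPrefix(a, flag+"=") {
			return strings.TrimPrefix(a, flag+"="), true // equals form
		}
	}
	return "", false
}

func main() {
	args := []string{"--model", "llama-70b", "--tensor-parallel-size=8"}
	tp, _ := argValue(args, "--tensor-parallel-size")
	fmt.Println(tp) // → 8
}
```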

Example VA for LWS

```yaml
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-70b-tp8
  labels:
    inference.optimization/acceleratorName: "H100"
spec:
  scaleTargetRef:
    kind: LeaderWorkerSet
    name: llama-70b-tp8-lws
    apiVersion: leaderworkerset.x-k8s.io/v1
  modelID: "llama-70b"
  variantCost: "80.0"
```

Backward Compatibility

  • Existing VAs with kind: Deployment work unchanged — DeploymentAccessor preserves current behavior
  • The CRD ScaleTargetRef already accepts any Kind; no schema change needed
  • The refactor is purely internal; no user-visible API changes
  • LWS scheme registration is additive (does not affect Deployment handling)
  • If LWS CRDs are not installed in the cluster, VAs with kind: LeaderWorkerSet will fail at fetch time with a clear error

Metadata

Labels: enhancement (New feature or request), triage/accepted (indicates the issue is ready to be actively worked on)
