WIP: feat: add LoRA adapter support for ModelDeployment CRD#84

Draft
sozercan wants to merge 85 commits into main from lora
Conversation

@sozercan sozercan commented Feb 23, 2026

Member

Merge after #73

LoRA Adapter Support

Adds a unified LoRA adapter abstraction to the ModelDeployment CRD, with support across all three providers (KAITO, KubeRay, Dynamo).

Changes

  • CRD: spec.adapters[] with name and source (hf:// URI scheme)
  • InferenceProviderConfig: loraSupport capability for auto-selection filtering
  • Webhook: rejects adapters on llamacpp deployments, requires unique adapter names, and requires the hf:// scheme for sources
  • Controller: provider auto-selection filters by LoRA support; an InferenceObjective is created per adapter for gateway routing
  • KAITO: maps adapters → inference.adapters on Workspace CRD
  • KubeRay: injects --enable-lora + --lora-modules into VLLM_ENGINE_ARGS
  • Dynamo: --enable-lora, LoRA env vars, DynamoModel CRDs, init container for HF download, modelRef for endpoint discovery, updated to 0.9.0 runtime images
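
The three webhook rules in the list above can be sketched as a single validation function. This is a minimal illustration, not the PR's actual webhook code: the `Adapter` type, the `engine` string parameter, and the error wording are assumptions made for the example; only the rules themselves (llamacpp blocking, unique names, hf:// scheme) come from the description.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Adapter mirrors the two fields the description names for spec.adapters[].
// The Go type and field names are assumptions for this sketch.
type Adapter struct {
	Name   string
	Source string
}

// validateAdapters applies the three rules from the description.
// engine is a hypothetical identifier for the serving engine.
func validateAdapters(engine string, adapters []Adapter) error {
	if len(adapters) == 0 {
		return nil // nothing to validate
	}
	if engine == "llamacpp" {
		return errors.New("adapters are not supported with the llamacpp engine")
	}
	seen := make(map[string]struct{}, len(adapters))
	for i, a := range adapters {
		if _, dup := seen[a.Name]; dup {
			return fmt.Errorf("adapters[%d]: duplicate name %q", i, a.Name)
		}
		seen[a.Name] = struct{}{}
		if !strings.HasPrefix(a.Source, "hf://") {
			return fmt.Errorf("adapters[%d]: source %q must use the hf:// scheme", i, a.Source)
		}
	}
	return nil
}
```

Returning on the first violation keeps the sketch short; a real admission webhook would more likely collect all field errors before rejecting the object.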

Testing

  • Unit tests for all provider transformers and webhook validation
  • E2E tested on AKS GPU cluster with Dynamo provider + unsloth/Qwen3-0.6B + lucylq/qwen3_06B_lora_math adapter

Known Issues

  • The Dynamo DynamoModel operator has a race condition: it tries to load the LoRA adapter before vLLM finishes initializing. Workaround: delete and recreate the DynamoModel after the worker is ready.
  • Dynamo hf:// download via DynamoModel is async and may silently fail. file:// local path loading works reliably after init container pre-downloads.

TODO

  • Fix Dynamo race condition (wait for worker readiness before creating DynamoModel)
  • Consider direct /v1/loras API call from provider controller instead of DynamoModel CRD for hf:// sources
  • Test with KubeRay provider
  • Test with KAITO provider (needs KAITO preset model with LoRA support)

sozercan and others added 30 commits February 13, 2026 14:23
Create docs/gateway.md covering architecture, prerequisites, compatible
gateway implementations, setup steps, configuration options (auto-detection,
explicit flags, per-deployment overrides), usage examples (curl and Python),
and troubleshooting.

Update docs/architecture.md with a Gateway API Integration section and
link to the new guide.

Update README.md with a Gateway API Integration highlight and doc link.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… routing

Add support for the Gateway API Inference Extension (inference.networking.k8s.io/v1)
to provide a single unified inference gateway endpoint across all providers. When
Gateway API CRDs are detected in the cluster, the controller automatically creates
InferencePool and HTTPRoute resources for each ModelDeployment.

Controller changes:
- Add gateway-api and gateway-api-inference-extension Go dependencies
- Add GatewaySpec (spec.gateway) and GatewayStatus to ModelDeployment CRD
- Implement gateway reconciler for InferencePool and HTTPRoute lifecycle
- Add gateway auto-detection with CRD availability caching
- Support explicit --gateway-name/--gateway-namespace flags
- Add RBAC for inferencepools, httproutes, and gateways
- Inject kubeairunway.ai/model-deployment label in all providers (KAITO, Dynamo, KubeRay)

Backend/frontend changes:
- Add GET /gateway/status and GET /gateway/models API routes
- Add gateway status to deployment detail responses
- Add GatewayStatus, GatewayInfo, GatewayModelInfo shared types
- Add gateway API client methods in frontend

Tests and docs:
- Add gateway reconciler tests (11 tests) and detection tests (7 tests)
- Add docs/gateway.md with architecture, setup, and usage guide
- Update docs/architecture.md, crd-reference.md, controller-architecture.md, api.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ndpoint

- Fix backend API group from inference.networking.x-k8s.io/v1alpha2
  to inference.networking.k8s.io/v1 to match upstream stable API
- Add required EndpointPickerRef to InferencePool with configurable
  --epp-service-name and --epp-service-port controller flags
- Resolve gateway endpoint from Gateway.status.addresses instead of
  constructing invalid DNS name
- Add Istio setup notes and EPP configuration docs to gateway.md
- Add test for endpoint resolution from Gateway status

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Probe the model server's /v1/models endpoint to resolve the actual
served model name when no explicit spec.gateway.modelName or
spec.model.servedName is set. This fixes gateway routing for
baked-in model images where the served name differs from spec.model.id.

Resolution priority:
1. spec.gateway.modelName (explicit override)
2. spec.model.servedName (user-specified)
3. Auto-discovered from /v1/models on running server
4. spec.model.id (fallback)
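
The priority chain above amounts to "first non-empty value wins". A minimal sketch follows, assuming the four candidates have already been gathered into a struct; the `nameSources` type and this signature are hypothetical, and the real `resolveModelName` in the controller may look different.

```go
package main

// nameSources holds the four candidates in the order the commit lists them.
// The struct itself is an assumption for this sketch.
type nameSources struct {
	GatewayModelName string // spec.gateway.modelName (explicit override)
	ServedName       string // spec.model.servedName (user-specified)
	Discovered       string // from probing /v1/models; empty if the probe failed
	ModelID          string // spec.model.id (fallback)
}

// resolveModelName returns the first non-empty candidate, falling back to
// the model ID when nothing else is set.
func resolveModelName(s nameSources) string {
	for _, candidate := range []string{s.GatewayModelName, s.ServedName, s.Discovered} {
		if candidate != "" {
			return candidate
		}
	}
	return s.ModelID
}
```

Treating a failed probe as an empty string means an unreachable server degrades to the `spec.model.id` fallback rather than surfacing an error.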

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add tests for resolveModelName priority chain: explicit override,
  served name, unreachable server fallback, no endpoint fallback
- Update gateway.md with model name resolution section documenting
  the 4-level priority chain including auto-discovery
- Fix stale comment in modeldeployment_types.go

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ady=False

- cleanupGatewayResources now sets GatewayReady condition to False
  so conditions stay consistent when gateway resources are removed
- When deployment leaves Running phase (Failed, Terminating, etc.),
  gateway resources are cleaned up if they previously existed
- Add test for phase transition cleanup and condition verification

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fail fast at startup if only one of --gateway-name/--gateway-namespace
  is set, preventing silent fallback to auto-detection
- Add 60s TTL for negative CRD detection results so gateway integration
  self-enables if CRDs are installed after controller startup. Positive
  results remain cached permanently.
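
The asymmetric caching described here (negative results expire, positive results do not) might look roughly like the following. The `crdDetector` type, the injected `probe` function, and the `unixClock` helper are all hypothetical names for illustration; only the 60s TTL and the permanent positive cache come from the commit message.

```go
package main

import (
	"sync"
	"time"
)

// negativeTTL matches the 60s stated in the commit message.
const negativeTTL = 60 * time.Second

// crdDetector caches the result of a CRD availability probe.
type crdDetector struct {
	mu        sync.Mutex
	probe     func() bool      // hypothetical: asks the API server whether the CRDs exist
	now       func() time.Time // injectable clock
	found     bool
	checked   bool
	checkedAt time.Time
}

func newCRDDetector(probe func() bool, now func() time.Time) *crdDetector {
	if now == nil {
		now = time.Now
	}
	return &crdDetector{probe: probe, now: now}
}

// Available reports whether the CRDs are installed. A positive result is
// cached permanently; a negative result is re-probed once negativeTTL has elapsed.
func (d *crdDetector) Available() bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.found {
		return true
	}
	if d.checked && d.now().Sub(d.checkedAt) < negativeTTL {
		return false
	}
	d.found = d.probe()
	d.checked = true
	d.checkedAt = d.now()
	return d.found
}

// unixClock returns a clock that reads *sec as Unix seconds; useful in tests.
func unixClock(sec *int64) func() time.Time {
	return func() time.Time { return time.Unix(*sec, 0) }
}
```

Injecting the clock keeps the TTL behaviour testable without sleeping.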

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tests the full Gateway API Inference Extension integration:
- Installs Gateway API CRDs, Inference Extension CRDs, and Istio
- Creates Gateway resource and deploys a CPU model
- Verifies InferencePool created with correct selector and EPP ref
- Verifies HTTPRoute created with correct backend ref
- Verifies model name auto-discovery from /v1/models
- Tests actual inference routing through the Istio gateway
- Tests gateway disable and resource cleanup

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The gateway reconciliation may need an extra reconcile cycle after
the deployment transitions to Running phase. Add a 30-attempt
retry loop with 5s intervals instead of checking once.
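
A bounded retry of this shape can be sketched as below. The `retry` helper and its signature are assumptions for illustration; the commit does not show how the e2e test implements the loop.

```go
package main

import (
	"fmt"
	"time"
)

// retry calls check up to attempts times, sleeping interval between tries.
// It returns nil as soon as check reports success.
func retry(attempts int, interval time.Duration, check func() bool) error {
	for i := 0; i < attempts; i++ {
		if check() {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(interval) // no sleep after the final attempt
		}
	}
	return fmt.Errorf("condition not met after %d attempts", attempts)
}
```

With the values from the commit message, the call would be `retry(30, 5*time.Second, check)`, which waits at most about 145 seconds (29 sleeps) plus the time spent in the checks themselves.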

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Set model.id in test fixture so fallback model name is non-empty
- Replace gateway-routed inference test with direct service test
  (gateway routing requires EPP which isn't deployed in e2e)
- Keep gateway resource verification (InferencePool, HTTPRoute,
  status, conditions) as the GAIE integration test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The auto-discovery probes /v1/models on the model service, but
status.endpoint.port may contain the container port (e.g. 5000)
while the service exposes port 80. Look up the actual service port
first, falling back to status.endpoint.port if unavailable.

This specifically fixes aikit/llamacpp models where KAITO reports
container port 5000 but the service maps 80→5000.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The controller needs permission to read Services to look up the
actual service port for model name auto-discovery. Without this,
the probe used the container port (e.g. 5000) instead of the
service port (80), causing discovery to fail.

Also adds resolveServicePort() which looks up the service's HTTP
port, preferring ports named 'http' or on 80/8080.
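
A simplified sketch of that port preference follows. The `servicePort` struct is a pared-down stand-in for the Kubernetes `ServicePort` type, and both the ordering between "named http" and "80/8080" and the first-port fallback are assumptions; the commit only says those ports are preferred. `probePort` is a hypothetical wrapper showing the fallback to `status.endpoint.port`.

```go
package main

// servicePort is a simplified stand-in for a Kubernetes ServicePort.
type servicePort struct {
	Name string
	Port int32
}

// resolveServicePort picks the port most likely to serve HTTP: a port named
// "http" first, then 80 or 8080, then the first port listed (assumed order).
// ok is false when the service exposes no ports.
func resolveServicePort(ports []servicePort) (port int32, ok bool) {
	for _, p := range ports {
		if p.Name == "http" {
			return p.Port, true
		}
	}
	for _, p := range ports {
		if p.Port == 80 || p.Port == 8080 {
			return p.Port, true
		}
	}
	if len(ports) > 0 {
		return ports[0].Port, true
	}
	return 0, false
}

// probePort returns the service port when one can be resolved and otherwise
// falls back to the port reported in status.endpoint.port.
func probePort(ports []servicePort, endpointPort int32) int32 {
	if p, ok := resolveServicePort(ports); ok {
		return p
	}
	return endpointPort
}
```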

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Install the upstream inferencepool helm chart to deploy the EPP
(Endpoint Picker Proxy), then test actual inference routing through
the Istio gateway instead of direct service port-forward.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The controller now automatically creates the Endpoint Picker Proxy
(EPP) deployment, service, RBAC, and config when gateway integration
is enabled. Users no longer need to install the EPP separately.

Resources created per ModelDeployment:
- ServiceAccount, Role, RoleBinding for EPP RBAC
- ConfigMap with default plugins config
- Deployment running the upstream EPP image
- Service exposing gRPC port 9002

All resources are owned by the ModelDeployment and cleaned up
automatically. EPP image is configurable via --epp-image flag.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The controller needs pods get/watch/list and leases create/get/update
permissions on its own service account to avoid RBAC escalation errors
when creating the EPP Role (Kubernetes prevents granting permissions
the creator doesn't hold).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The HTTPRoute may be created in the same reconcile cycle as the
verification step runs. Add a retry loop to wait for it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pods created by providers may not have the kubeairunway.ai/model-deployment
label. The controller now discovers pods via the model service's selector
and patches the label onto them, provider-agnostically.

Also adds pod patch RBAC and fixes EPP log label in e2e debug.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…writes)

The EPP watches these experimental resources even when unused.
Without RBAC for them, the cache sync fails and health check
returns NOT_SERVING.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The controller needs the same permissions it grants to the EPP Role,
otherwise Kubernetes blocks the Role creation as RBAC escalation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The controller deploys the EPP (Deployment + Service + RBAC), but
Istio-specific wiring (DestinationRule with h2c upgrade) is BYO.
Apply it directly in the e2e test since this is implementation-specific.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Kind doesn't support LoadBalancer, so the Gateway never becomes
Programmed. Use networking.istio.io/service-type: NodePort annotation
to get a NodePort service that works in Kind.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port-forwarding to the gateway pod bypasses ext_proc. Use the
NodePort service endpoint instead, accessing the node's internal IP.
Also remove exclude-from-external-load-balancers label on Kind node.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
InferencePool targetPorts routes directly to pods, so it needs the
container port (e.g. 5000), not the service port (e.g. 80). Look up
the service's targetPort to get the actual container port.
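
The service-port to container-port lookup can be sketched as follows. The `servicePort` struct is a simplified stand-in: in the real Kubernetes API `targetPort` is an int-or-string that may also name a container port, which this sketch ignores. `containerPortFor` is a hypothetical helper name.

```go
package main

// servicePort is a simplified stand-in for a Kubernetes ServicePort with a
// numeric targetPort only.
type servicePort struct {
	Port       int32
	TargetPort int32 // 0 means unset
}

// containerPortFor returns the pod-side port behind svcPort. If the service
// has no matching entry, svcPort is returned unchanged.
func containerPortFor(ports []servicePort, svcPort int32) int32 {
	for _, p := range ports {
		if p.Port != svcPort {
			continue
		}
		if p.TargetPort != 0 {
			return p.TargetPort
		}
		return p.Port // Kubernetes defaults targetPort to port when unset
	}
	return svcPort
}
```

For the aikit/llamacpp case mentioned earlier in this PR (service 80 mapped to container 5000), `containerPortFor(ports, 80)` yields 5000, which is the value the InferencePool needs.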

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
sozercan and others added 27 commits February 19, 2026 22:27
The validation fix for extensionManager.backendResources without
hooks may only be on main. Try the latest dev build.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Traffic routing through the gateway requires either:
- Envoy AI Gateway controller (for backendResources support)
- Istio with working ext_proc/mTLS (connection_termination in Kind)

Neither works in a basic Kind cluster. The e2e tests verify all
controller-side logic comprehensively. Traffic routing was validated
manually on AKS.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Revert from Envoy Gateway to Istio. Add cloud-provider-kind to
provide LoadBalancer IP assignment in Kind, which should fix the
Gateway Programmed=Unknown issue. Also restores the traffic routing
test using the Gateway's LoadBalancer IP directly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cloud-provider-kind provides LoadBalancer IP, Gateway is Programmed,
but Istio's ext_proc can't connect to EPP without mTLS. Enable
sidecar injection on default namespace so EPP gets Istio proxy.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Explicitly tell Istio sidecar to intercept port 9002 for ext_proc
gRPC traffic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
With enableAutoMtls=false, the gateway proxy should connect to
the EPP using plaintext gRPC without mTLS. No sidecar needed on
the EPP pod. The ext_proc cluster should use h2c based on the
service port name (grpc-ext-proc) and appProtocol.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per upstream GAIE chart (inferencepool/templates/istio.yaml), Istio
needs tls.mode=SIMPLE with insecureSkipVerify=true to connect to
the EPP. The previous h2UpgradePolicy approach was wrong.

Also adds cloud-provider-kind for LoadBalancer IP in Kind.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When httpRouteRef is set, the controller skips auto-creating the
HTTPRoute and uses the user-provided one. This enables custom routing
logic like LoRA adapter selection, traffic splitting across model
versions, and custom payload processors.

The controller still auto-creates InferencePool + EPP regardless.
Cleanup also respects httpRouteRef — won't delete user-provided routes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per Gateway API conventions, readiness shouldn't be a single bool.
The GatewayReady condition with reason/message already captures this
with proper granularity. Users should check the condition or refer
to Gateway API resource status directly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If gateway reconciliation fails with a CRD-not-found error
(e.g. CRDs were removed), refresh the detection cache so
subsequent reconciles skip gateway integration gracefully.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pin Gateway API Inference Extension CRDs to v1.3.1 instead of
latest. Update Go module dependency to match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Log a warning when multiple Gateways are labeled with
kubeairunway.ai/inference-gateway=true, suggesting gatewayRef
for explicit selection. Uses the first labeled one.
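
The selection rule (first labeled Gateway wins, with a warning when there is more than one) could be sketched like this. The `gateway` struct, the `pickGateway` name, and the injected `warnf` logger are assumptions for the example; only the label key and the gatewayRef hint come from the commit message.

```go
package main

// inferenceGatewayLabel is the label key named in the commit message.
const inferenceGatewayLabel = "kubeairunway.ai/inference-gateway"

// gateway is a simplified stand-in for a Gateway API Gateway object.
type gateway struct {
	Namespace string
	Name      string
	Labels    map[string]string
}

// pickGateway returns the first Gateway labeled as the inference gateway.
// When several are labeled it still uses the first one, but reports the
// ambiguity through warnf (a hypothetical logging hook).
func pickGateway(gateways []gateway, warnf func(format string, args ...any)) (gateway, bool) {
	var labeled []gateway
	for _, g := range gateways {
		if g.Labels[inferenceGatewayLabel] == "true" {
			labeled = append(labeled, g)
		}
	}
	if len(labeled) == 0 {
		return gateway{}, false
	}
	if len(labeled) > 1 {
		warnf("%d Gateways carry %s=true; using %s/%s, set gatewayRef to select one explicitly",
			len(labeled), inferenceGatewayLabel, labeled[0].Namespace, labeled[0].Name)
	}
	return labeled[0], true
}
```

Note that "first" depends on list order, so the result is only as stable as the order in which the caller enumerates Gateways.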

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BBR (Body-Based Router) is a separate deployment needed only for
multi-model setups. Updated architecture diagram, added BBR section
with helm install instructions pinned to v1.3.1, and clarified
that single-model setups don't need BBR.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Install the upstream body-based-routing helm chart with Istio
provider in the e2e test. Validates the full GAIE stack.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
For multi-model setups with BBR, each HTTPRoute needs a header
match on X-Gateway-Base-Model-Name to route to the correct
InferencePool. BBR sets this header from the request body's
model field.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Define GAIE_VERSION in Makefile (v1.3.1) and DefaultGAIEVersion
constant in gateway package. EPP image tag defaults to this version
in both cmd/main.go and gateway_reconciler.go.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The header match (X-Gateway-Base-Model-Name) only works when BBR
is deployed. Add a fallback PathPrefix / match so single-model
setups work without BBR. With BBR, the header match takes priority.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ypes

1. Remove duplicate DeploymentConfig interface (incompatible properties
   broke TypeScript build — pre-existing issue also on main)
2. Derive gateway model readiness from GatewayReady condition instead
   of removed status.gateway.ready field
3. Restore shared/types/aikit.ts re-export file and barrel export

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add --enable-lora engine arg when adapters are specified
- Add loraEnvVars helper for Dynamo LoRA env vars (DYN_LORA_ENABLED,
  DYN_SYSTEM_ENABLED, DYN_SYSTEM_PORT, DYN_LORA_PATH)
- Inject LoRA env vars into aggregated, prefill, and decode workers
- Add reconcileAdapters to create/update DynamoModel CRDs per adapter
- Add cleanupOrphanedDynamoModels for adapter lifecycle management
- Add DynamoModel cleanup on ModelDeployment deletion
- Add RBAC marker for DynamoModel resources
- Set LoRASupport: true in provider capabilities

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add LoRAAdapterSpec and AdapterStatus types to ModelDeployment
- Add LoRASupport capability to InferenceProviderConfig
- Webhook validation: block llamacpp+adapters, unique names, hf:// scheme
- Provider auto-selection filters by LoRA support
- KAITO: map adapters to inference.adapters on Workspace
- KubeRay: inject --enable-lora + --lora-modules into VLLM_ENGINE_ARGS
- Dynamo: --enable-lora, LoRA env vars, DynamoModel CRDs, init container
  for HF adapter download, modelRef for endpoint discovery
- Gateway: auto-create InferenceObjective per adapter
- Update Dynamo runtime images to 0.9.0
- Add unit tests for all providers and webhook
- Add docs/lora-adapters.md user guide
- Add sample YAML with chess LoRA adapter
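
For the KubeRay item above, a minimal sketch of building the vLLM LoRA flags from the adapter list is shown below. `--enable-lora` and `--lora-modules name=path` are real vLLM flags, but the `Adapter` type, the `loraEngineArgs` helper, and in particular the choice to strip the `hf://` prefix to obtain the module path are assumptions; the PR does not show how the provider resolves sources.

```go
package main

import "strings"

// Adapter mirrors the name and source fields of spec.adapters[].
// The Go type is an assumption for this sketch.
type Adapter struct {
	Name   string
	Source string
}

// loraEngineArgs returns the vLLM flags for the given adapters, or nil when
// there are none so callers leave the engine args untouched.
func loraEngineArgs(adapters []Adapter) []string {
	if len(adapters) == 0 {
		return nil
	}
	args := []string{"--enable-lora", "--lora-modules"}
	for _, a := range adapters {
		// Assumption: the repository ID left after stripping hf:// is a
		// path the engine can load.
		path := strings.TrimPrefix(a.Source, "hf://")
		args = append(args, a.Name+"="+path)
	}
	return args
}
```

How the resulting slice is merged into `VLLM_ENGINE_ARGS` (joining, quoting, ordering relative to existing flags) is left to the caller here, since the commit does not describe that format.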
@sozercan sozercan linked an issue Feb 24, 2026 that may be closed by this pull request
@sozercan sozercan closed this Apr 28, 2026
@sozercan sozercan reopened this Apr 28, 2026
Development

Successfully merging this pull request may close these issues.

support for lora adapters

1 participant