
Add support to install by kustomize #179

Open
avinashsingh77 wants to merge 5 commits into llm-d-incubation:main from avinashsingh77:helm-to-kustomize

Conversation

@avinashsingh77

This PR introduces Kustomize as an alternative installation method for llm-d-modelservice, providing users with a declarative, composable deployment approach alongside the existing Helm charts.

  • Multi-accelerator support: Nvidia, Intel (XE/i915/Gaudi), AMD, and Google TPU configurations
  • Composable components: 6 optional features including multinode (LeaderWorkerSet), monitoring (Prometheus), P/D disaggregation, DRA, and FMA
  • 8 ready-to-use examples: From basic single-node to advanced multi-node and disaggregated deployments
  • Full feature parity with Helm: All existing capabilities available with documentation.

@Gregory-Pereira
Contributor

For context for maintainers: this PR exists to aggregate feedback on the potential migration and to resolve questions about if and how it should work, rather than being code to merge into this repo -- it will eventually land in the main repo.

Contributor

@Gregory-Pereira left a comment

I don't have time to do the full review right now, but I've called out some things to start. My overall objection at this point is that there are too many configuration overlays. Another pattern we could consider is having one modelserver directory per guide, and then varying only on the hardware accelerator. We could move monitoring into base as described below, which would also let us get rid of the single- vs. multi-node split, because guides explicitly have multi-node or non-multi-node deployments for that pattern. We could also make DRA dependent on the accelerator: for Nvidia or AMD GPUs we can typically use the k8s device plugin system, and I've only really ever seen Intel devices go through DRA. Take that point with a grain of salt, though, because I am definitely no DRA expert.

The point I am making is that we need to aggregate more of these overlays into more "whole" deployments. We can try to group things per guide: their purpose is to demonstrate patterns within inference, which I think is not being shown here. The project is aimed at providing "guides" / "well-lit paths", which are fleshed-out examples pre-tuned to work in production; it is the user's responsibility to walk back up the path and build their own to suit their use case. Hope this context framing helps inform the design here.
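Concretely, the per-guide pattern suggested above might look something like this. This is a hypothetical sketch: the `guides/` and `accelerators/` paths and the component names are illustrative, not taken from the PR:

```yaml
# hypothetical: one directory per guide, with variation only by accelerator
# guides/pd-disaggregation/nvidia/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base                    # guide-specific base, monitoring included
components:
  - ../../accelerators/nvidia  # only the hardware varies per overlay
```

Each guide directory carries a complete, "whole" deployment, and sibling subdirectories differ only in which accelerator component they pull in.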

Contributor

Very cool, I haven't seen FMA used yet.

@github-actions

This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the lifecycle/stale label.

Co-authored-by: Greg Pereira <grpereir@redhat.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
@avinashsingh77
Author

avinashsingh77 commented Mar 2, 2026

@Gregory-Pereira Does it look better now?

Changes:

  1. Base layer — Monitoring (PodMonitors) moved into base/monitoring/, always included. Added env: [] to all base resource templates for accelerator patch compatibility.
  2. Accelerators — DRA moved from components/dra/ to sibling directories accelerators/nvidia-dra/ and accelerators/amd-dra/, each referencing their parent accelerator.
  3. Guide-aligned overlays — 8 overlays created, mapping 1:1 with llm-d guides:

| Overlay | Accelerator Variants |
| --- | --- |
| inference-scheduling/ | nvidia, amd, intel-xpu, intel-gaudi, google-tpu |
| pd-disaggregation/ | nvidia, google-tpu, intel-xpu |
| wide-ep-lws/ | nvidia |
| workload-autoscaling/ | nvidia |
| simulated-accelerators/ | (CPU only) |
| tiered-prefix-cache/ | nvidia |
| precise-prefix-cache-aware/ | nvidia, intel-xpu |
| predicted-latency-based-scheduling/ | nvidia (placeholder) |
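As a sketch of how one row of the table composes, an overlay might reference the shared base plus one accelerator component. The paths below are illustrative, not necessarily the PR's actual layout:

```yaml
# overlays/inference-scheduling/nvidia/kustomization.yaml (hypothetical paths)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../base                  # includes base/monitoring/, always on
components:
  - ../../../accelerators/nvidia   # swap for amd, intel-xpu, intel-gaudi, ...
```

An overlay like this would be rendered with `kustomize build` or applied directly with `kubectl apply -k`.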

Questions/Notes:

  1. For now I have added a placeholder with an empty overlay directory for predicted-latency-based-scheduling.

@github-actions

This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the lifecycle/stale label.

@nicole-lihui

nicole-lihui commented Mar 27, 2026

Hi, really interesting PR 👍

Just wanted to share some thoughts from our side and hear how others think about this.

We’ve been using Helm to deploy llm-d-modelservice (e.g., PD disaggregation), and the experience has been quite smooth. But as mentioned in Issue #850 (llm-d/llm-d), the LLM space evolves very quickly, and Kustomize does have an advantage when it comes to flexibility and adapting to new features.

From our perspective, these two approaches serve different goals:

  • Helm → more production-oriented (stability, versioning)
  • Kustomize → more flexible (fast iteration, experimentation)

That said, fully replacing Helm with Kustomize, or maintaining both long-term, could introduce quite a bit of overhead.

Another practical challenge we’ve seen is that as we combine different models, GPUs, and configs, the number of “well-lit paths” grows very quickly. In practice, the overall architecture (PD disaggregation + Gateway API + GAIE + modelservice) is already quite stable — most of the variation comes from:

  • vLLM startup commands
  • env configs
  • resource tuning

So internally we’ve been exploring a middle-ground approach:

👉 keep Helm templates, but customize via values-level patching, instead of patching raw Kubernetes YAML

Something like:

well-lit-paths/
  base/
    single-gpu-values.yaml
    pd-disaggregation-values.yaml
    tiered-prefix-cache-values.yaml
  nvidia/
    h200/
      single-patch-values.yaml
      pd-disaggregation-patch-values.yaml

This avoids writing a lot of low-level Kustomize patches like:

- op: remove
  path: /spec/template/spec/containers/0/resources/limits/amd.com~1gpu

which can be quite hard to manage — especially for more complex things like vLLM server commands.

A sample single-patch-values.yaml might look like:

llm-d-modelservice:
    prefill:
      create: false
    decode:
      create: true
      replicas: 2
      parallelism:
        tensor: 8
        data: 1
      containers:
        - name: "vllm"
          image:
            registry: harbor.wxyungu.com
            repository: vllm/vllm-openai
            tag: v0.17.0
          modelCommand: vllmServe
          args:
            - "--trust-remote-code"
            - "--gpu_memory_utilization=0.90"
            - "--enable-expert-parallel"
            - "--enable-chunked-prefill"
            - "--enable-prefix-caching"
            - "--enable-auto-tool-choice"
            - "--tool-call-parser=kimi_k2"
            - "--reasoning-parser=kimi_k2"
            - "--mm-encoder-tp-mode=data"
            - "--compilation_config.pass_config.fuse_allreduce_rms=true"

Curious how others think about this tradeoff between flexibility and usability 👀
