
Add support to install by kustomize #179

Open
avinashsingh77 wants to merge 5 commits into llm-d-incubation:main from avinashsingh77:helm-to-kustomize

Conversation

@avinashsingh77

This PR introduces Kustomize as an alternative installation method for llm-d-modelservice, providing users with a declarative, composable deployment approach alongside the existing Helm charts.

  • Multi-accelerator support: Nvidia, Intel (XE/i915/Gaudi), AMD, and Google TPU configurations
  • Composable components: 6 optional features including multinode (LeaderWorkerSet), monitoring (Prometheus), P/D disaggregation, DRA, and FMA
  • 8 ready-to-use examples: From basic single-node to advanced multi-node and disaggregated deployments
  • Full feature parity with Helm: All existing capabilities available with documentation.

@Gregory-Pereira
Contributor

For context for maintainers: this PR exists to aggregate feedback on the potential migration and to resolve questions about if and how it should work, rather than being code to merge into this repo -- it will eventually land in the main repo.

Contributor

@Gregory-Pereira left a comment

I don't have time to do the full review right now, but I've called out some things to start. My overall objection at this point is that there are too many configuration overlays. Another pattern we could consider is having one modelserver directory per guide, and then varying only on the hardware accelerator. We could move monitoring into base as described below, which would also let us get rid of the single- vs. multi-node split, because guides explicitly have multi-node or non-multi-node deployments for that pattern. We could also make DRA dependent on the accelerator: for Nvidia or AMD GPUs we can typically use the k8s device plugin system, and I've only really ever seen Intel devices go through DRA. Take that point with a grain of salt, though, because I am definitely no DRA expert.

The point I am making is that we need to aggregate more of these overlays into more "whole" deployments. We can try to group things per guide: their purpose is to demonstrate patterns within inference, which I think is not being shown here. The project is aimed at providing "guides" / "well-lit paths", which are fleshed-out examples pre-tuned to work in production; it is the user's responsibility to walk back up the path and build their own to suit their use case. Hope this context framing helps inform the design here.
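Concretely, the per-guide pattern suggested above might look something like this. This is a hypothetical sketch: the `guides/` and `accelerators/` paths and the component names are illustrative, not taken from the PR:

```yaml
# hypothetical: one directory per guide, with variation only by accelerator
# guides/pd-disaggregation/nvidia/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base                    # guide-specific base, monitoring included
components:
  - ../../accelerators/nvidia  # only the hardware varies per overlay
```

Each guide directory carries a complete, "whole" deployment, and sibling subdirectories differ only in which accelerator component they pull in.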

Contributor

Very cool, I haven't seen FMA used yet.

@github-actions

This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the lifecycle/stale label.

Co-authored-by: Greg Pereira <grpereir@redhat.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
@avinashsingh77
Author

avinashsingh77 commented Mar 2, 2026

@Gregory-Pereira Does it look better now?

Changes:

  1. Base layer — Monitoring (PodMonitors) moved into base/monitoring/, always included. Added env: [] to all base resource templates for accelerator patch compatibility.
  2. Accelerators — DRA moved from components/dra/ to sibling directories accelerators/nvidia-dra/ and accelerators/amd-dra/, each referencing their parent accelerator.
  3. Guide-aligned overlays — 8 overlays created, mapping 1:1 with llm-d guides:

| Overlay | Accelerator Variants |
| --- | --- |
| inference-scheduling/ | nvidia, amd, intel-xpu, intel-gaudi, google-tpu |
| pd-disaggregation/ | nvidia, google-tpu, intel-xpu |
| wide-ep-lws/ | nvidia |
| workload-autoscaling/ | nvidia |
| simulated-accelerators/ | (CPU only) |
| tiered-prefix-cache/ | nvidia |
| precise-prefix-cache-aware/ | nvidia, intel-xpu |
| predicted-latency-based-scheduling/ | nvidia (placeholder) |
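As a sketch of how one row of the table composes, an overlay might reference the shared base plus one accelerator component. The paths below are illustrative, not necessarily the PR's actual layout:

```yaml
# overlays/inference-scheduling/nvidia/kustomization.yaml (hypothetical paths)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../base                  # includes base/monitoring/, always on
components:
  - ../../../accelerators/nvidia   # swap for amd, intel-xpu, intel-gaudi, ...
```

An overlay like this would be rendered with `kustomize build` or applied directly with `kubectl apply -k`.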

Questions/Notes:

  1. For now I have added a placeholder with an empty overlay directory for predicted-latency-based-scheduling.

@github-actions

This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the lifecycle/stale label.

@nicole-lihui

nicole-lihui commented Mar 27, 2026

Hi, really interesting PR 👍

Just wanted to share some thoughts from our side and hear how others think about this.

We’ve been using Helm to deploy llm-d-modelservice (e.g., PD disaggregation), and the experience has been quite smooth. But as mentioned in Issue #850 (llm-d/llm-d), the LLM space evolves very quickly, and Kustomize does have an advantage when it comes to flexibility and adapting to new features.

From our perspective, these two approaches serve different goals:

  • Helm → more production-oriented (stability, versioning)
  • Kustomize → more flexible (fast iteration, experimentation)

That said, fully replacing Helm with Kustomize, or maintaining both long-term, could introduce quite a bit of overhead.

Another practical challenge we’ve seen is that as we combine different models, GPUs, and configs, the number of “well-lit paths” grows very quickly. In practice, the overall architecture (PD disaggregation + Gateway API + GAIE + modelservice) is already quite stable — most of the variation comes from:

  • vLLM startup commands
  • env configs
  • resource tuning

So internally we’ve been exploring a middle-ground approach:

👉 keep Helm templates, but customize via values-level patching, instead of patching raw Kubernetes YAML

Something like:

well-lit-paths/
  base/
    single-gpu-values.yaml
    pd-disaggregation-values.yaml
    tiered-prefix-cache-values.yaml
  nvidia/
    h200/
      single-patch-values.yaml
      pd-disaggregation-patch-values.yaml

This avoids writing a lot of low-level Kustomize patches like:

- op: remove
  path: /spec/template/spec/containers/0/resources/limits/amd.com~1gpu

which can be quite hard to manage — especially for more complex things like vLLM server commands.

A sample single-patch-values.yaml might look like:

llm-d-modelservice:
    prefill:
      create: false
    decode:
      create: true
      replicas: 2
      parallelism:
        tensor: 8
        data: 1
      containers:
        - name: "vllm"
          image:
            registry: harbor.wxyungu.com
            repository: vllm/vllm-openai
            tag: v0.17.0
          modelCommand: vllmServe
          args:
            - "--trust-remote-code"
            - "--gpu_memory_utilization=0.90"
            - "--enable-expert-parallel"
            - "--enable-chunked-prefill"
            - "--enable-prefix-caching"
            - "--enable-auto-tool-choice"
            - "--tool-call-parser=kimi_k2"
            - "--reasoning-parser=kimi_k2"
            - "--mm-encoder-tp-mode=data"
            - "--compilation_config.pass_config.fuse_allreduce_rms=true"

Curious how others think about this tradeoff between flexibility and usability 👀
