
Add support to install by kustomize #179

Open
avinashsingh77 wants to merge 5 commits into llm-d-incubation:main from avinashsingh77:helm-to-kustomize

Conversation

@avinashsingh77

This PR introduces Kustomize as an alternative installation method for llm-d-modelservice, providing users with a declarative, composable deployment approach alongside the existing Helm charts.

  • Multi-accelerator support: Nvidia, Intel (XE/i915/Gaudi), AMD, and Google TPU configurations
  • Composable components: 6 optional features including multinode (LeaderWorkerSet), monitoring (Prometheus), P/D disaggregation, DRA, and FMA
  • 8 ready-to-use examples: From basic single-node to advanced multi-node and disaggregated deployments
  • Full feature parity with Helm: all existing capabilities are available, with documentation.
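As an illustration of the composable approach, an overlay can compose the shared base with opt-in components and an accelerator patch. The directory and file names below are illustrative assumptions, not necessarily this PR's actual layout:

```yaml
# kustomization.yaml for a hypothetical overlay; all paths are illustrative
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# shared base manifests
resources:
  - ../../base

# opt-in features, packaged as Kustomize components
components:
  - ../../components/multinode   # LeaderWorkerSet
  - ../../components/monitoring  # Prometheus PodMonitors

# accelerator-specific resource requests/limits
patches:
  - path: nvidia-resources.yaml
    target:
      kind: Deployment
```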

@Gregory-Pereira
Contributor

For context for maintainers: this PR exists to aggregate feedback on the potential migration and to resolve questions about whether and how it should work, rather than as code to be merged in this repo -- it will eventually land in the main repo.

@Gregory-Pereira (Contributor) left a comment

I don't have time to do the full review right now, but I've called out some things to start. My overall objection at this point is that there are too many configuration overlays. Another pattern we could consider is having one modelserver directory per guide, and then varying only by hardware accelerator. We could move monitoring into base as described below; that would also let us get rid of the single- vs. multi-node split, because each guide explicitly has a multi-node or non-multi-node deployment for its pattern. We could also make DRA depend on the accelerator: for Nvidia or AMD GPUs we can typically use the k8s device plugin system, and I've only really seen Intel devices go through DRA. Take that point with a grain of salt, though, because I am definitely no DRA expert.

The point I'm making is that we need to aggregate more of these overlays into more "whole" deployments. We can try to group things per guide -- their purpose is to demonstrate patterns within inference, which I don't think is being shown here. The project is aimed at providing "guides" / "well-lit paths": fleshed-out examples, pre-tuned to work in production; it is the user's responsibility to walk back up the path and build their own variant to suit their use case. I hope this context framing helps inform the design here.
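Concretely, the guide-per-directory pattern I have in mind might look roughly like this (names are hypothetical):

```
overlays/
├── inference-scheduling/        # one directory per guide
│   ├── base/                    # guide-specific base, monitoring included
│   ├── nvidia/                  # variation only by hardware accelerator
│   ├── amd/
│   └── intel-gaudi/
└── pd-disaggregation/
    ├── base/
    └── nvidia/
```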

A contributor left a comment:

Very cool! I haven't seen FMA used yet.

@github-actions

This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the lifecycle/stale label.

Co-authored-by: Greg Pereira <grpereir@redhat.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
@avinashsingh77
Author

avinashsingh77 commented Mar 2, 2026

@Gregory-Pereira Does it look better now?

Changes:

  1. Base layer — Monitoring (PodMonitors) moved into base/monitoring/, always included. Added env: [] to all base resource templates for accelerator patch compatibility.
  2. Accelerators — DRA moved from components/dra/ to sibling directories accelerators/nvidia-dra/ and accelerators/amd-dra/, each referencing its parent accelerator.
  3. Guide-aligned overlays — 8 overlays created, mapping 1:1 with llm-d guides:

| Overlay | Accelerator Variants |
| --- | --- |
| inference-scheduling/ | nvidia, amd, intel-xpu, intel-gaudi, google-tpu |
| pd-disaggregation/ | nvidia, google-tpu, intel-xpu |
| wide-ep-lws/ | nvidia |
| workload-autoscaling/ | nvidia |
| simulated-accelerators/ | (CPU only) |
| tiered-prefix-cache/ | nvidia |
| precise-prefix-cache-aware/ | nvidia, intel-xpu |
| predicted-latency-based-scheduling/ | nvidia (placeholder) |
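To illustrate why the env: [] addition in the base matters: a JSON 6902 "add" to a container's env list fails when the key is absent, so with the empty list present in base, an accelerator directory can append variables with a patch along these lines (resource and variable names here are illustrative, not taken from the PR):

```yaml
# accelerators/nvidia/kustomization.yaml fragment (illustrative names)
patches:
  - target:
      kind: Deployment
      name: modelservice
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/env/-
        value:
          name: NVIDIA_VISIBLE_DEVICES
          value: "all"
```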

Questions/Notes:

  1. For now I have added a placeholder with an empty overlay directory for predicted-latency-based-scheduling.
