feat: add dynamo-platform and dynamo-crds for AI inference serving #83

Merged
yuanchen8911 merged 3 commits into NVIDIA:main from yuanchen8911:feat/add-dynamo-platform
Feb 10, 2026

@yuanchen8911
Contributor

Summary

  • Add NVIDIA Dynamo inference platform (dynamo-platform + dynamo-crds) as eidos components
  • Create inference recipe overlay for inference-serving workloads
  • Add to kind overlay for local testing

Components

  • dynamo-crds (v0.8.1): 6 CRDs — DynamoGraphDeployment, DynamoModel, DynamoWorkerMetadata, etc.
  • dynamo-platform (v0.8.1): Operator + etcd + NATS with Kubernetes-native service discovery

Configuration

  • kai-scheduler sub-chart disabled (managed as a separate eidos component; see #80, "feat: add kai-scheduler component for gang scheduling")
  • grove disabled (enable for multinode inference)
  • Prometheus endpoint configured for existing kube-prometheus-stack
  • Kubernetes-native service discovery (no external etcd dependency for discovery)
  • Kind overlay with reduced etcd resources

CNCF AI Conformance

Addresses inference-related requirements:

  • AI inference serving: Dynamo provides OpenAI-compatible endpoints, KV-cache-aware routing, disaggregated prefill/decode
  • AI service metrics: Prometheus-compatible /metrics endpoints, PodMonitor auto-creation, Grafana dashboards

Dependencies

  • #80 (kai-scheduler)

Test plan

  • make test passes
  • eidos recipe --service kind generates recipe with 12 components
  • eidos bundle generates correct Chart.yaml and values.yaml
  • helm dependency update downloads dynamo-crds, dynamo-platform, kai-scheduler
  • helm upgrade --install on Kind cluster — all 36 pods healthy:
    eidos-stack-dynamo-operator-controller-manager   2/2   Running
    eidos-stack-etcd-0                               1/1   Running
    eidos-stack-nats-0                               2/2   Running
    kai-operator                                     1/1   Running
    kai-scheduler-default                            1/1   Running
    
  • Dynamo CRDs installed (7 CRDs)
  • KAI queues created (default + dynamo)
  • cert-manager startupapicheck passes

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner February 9, 2026 21:17
@yuanchen8911 yuanchen8911 changed the title feat: add dynamo-platform and dynamo-crds for AI inference serving feat: add dynamo-platform and dynamo-crds for AI inference serving (WIP) Feb 9, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/add-dynamo-platform branch 5 times, most recently from b1d03c7 to c0c6bef Compare February 10, 2026 00:18
@yuanchen8911 yuanchen8911 changed the title feat: add dynamo-platform and dynamo-crds for AI inference serving (WIP) feat: add dynamo-platform and dynamo-crds for AI inference serving Feb 10, 2026
@yuanchen8911 yuanchen8911 changed the title feat: add dynamo-platform and dynamo-crds for AI inference serving feat: add dynamo-platform and dynamo-crds for AI inference serving (WIP) Feb 10, 2026
@yuanchen8911 yuanchen8911 removed the request for review from mchmarny February 10, 2026 00:31
Member

@mchmarny mchmarny left a comment

Another one of these that will need a proper rebase:

  • Conflict on inference.yaml (PRs #87 and #80)
  • Hardcoded wrong Prometheus namespace: dynamo-platform/values.yaml sets prometheusEndpoint: "http://kube-prometheus-prometheus.eidos-stack.svc.cluster.local:9090", but the namespace should not be eidos-stack
  • Missing defaultNamespace on all three registry entries
  • Kind overlay adds inference components unconditionally
  • dynamo-crds/values.yaml is comment-only: no actual YAML keys, same pattern as PR #87's kgateway-crds
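
For the Prometheus namespace item, a minimal sketch of how such an endpoint could be checked. It relies only on the standard Kubernetes Service DNS form, <service>.<namespace>.svc.cluster.local. The helper names service_url and namespace_of are hypothetical and not part of eidos or Dynamo, and the monitoring namespace is taken from the fix reported later in this thread.

```python
from urllib.parse import urlsplit


def service_url(service: str, namespace: str, port: int, scheme: str = "http") -> str:
    """Build an in-cluster URL for a Kubernetes Service (hypothetical helper)."""
    return f"{scheme}://{service}.{namespace}.svc.cluster.local:{port}"


def namespace_of(url: str) -> str:
    """Return the namespace label of a <service>.<namespace>.svc... URL."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Expect at least <service>.<namespace>.svc
    if len(labels) < 3 or labels[2] != "svc":
        raise ValueError(f"not a namespaced Service DNS name: {host!r}")
    return labels[1]


# Endpoint quoted in the review comment above.
flagged = "http://kube-prometheus-prometheus.eidos-stack.svc.cluster.local:9090"
# Assumption: kube-prometheus-stack runs in the "monitoring" namespace.
expected_namespace = "monitoring"

if namespace_of(flagged) != expected_namespace:
    print("namespace mismatch:", namespace_of(flagged))
    print("suggested endpoint:",
          service_url("kube-prometheus-prometheus", expected_namespace, 9090))
```

A check like this could run in CI against rendered values so that a hardcoded namespace is caught before review.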

@yuanchen8911 yuanchen8911 force-pushed the feat/add-dynamo-platform branch 3 times, most recently from 64ffef1 to d4bd931 Compare February 10, 2026 21:58
@yuanchen8911 yuanchen8911 changed the title feat: add dynamo-platform and dynamo-crds for AI inference serving (WIP) feat: add dynamo-platform and dynamo-crds for AI inference serving Feb 10, 2026
Add NVIDIA Dynamo inference platform as eidos components for AI
inference workloads. Dynamo provides OpenAI-compatible endpoints,
KV-cache-aware routing, disaggregated prefill/decode, and SLA-driven
autoscaling.

Components:
- dynamo-crds (v0.8.1): CRDs for DynamoGraphDeployment, DynamoModel, etc.
- dynamo-platform (v0.8.1): Operator + etcd + NATS with Kubernetes-native
  service discovery and kube-prometheus-stack integration for metrics

Changes:
- Add dynamo-crds, dynamo-platform, kai-scheduler to registry with
  defaultNamespace
- Create inference overlay scoped to intent: inference
- Set base: inference in h100-ubuntu-inference for explicit dependency
- Remove inference components from kind overlay (scoped to intent only)
- Disable kai-scheduler and grove sub-charts in dynamo-platform (managed
  as separate eidos components)
- Set fullnameOverride on dynamo-operator, dynamo-etcd, dynamo-nats

Depends on: #80 (kai-scheduler)

Signed-off-by: Yuan Chen <yuanchen8911@gmail.com>
@yuanchen8911 yuanchen8911 force-pushed the feat/add-dynamo-platform branch from d4bd931 to eefb0a9 Compare February 10, 2026 22:05
@yuanchen8911
Contributor Author

yuanchen8911 commented Feb 10, 2026

Another one of these that will need a proper rebase:

  │                    Comment                    │                     Fix                      │
  ├───────────────────────────────────────────────┼──────────────────────────────────────────────┤
  │ Conflict on inference.yaml                    │ Resolved during rebase                       │
  ├───────────────────────────────────────────────┼──────────────────────────────────────────────┤
  │ Prometheus namespace eidos-stack → monitoring │ Fixed                                        │
  ├───────────────────────────────────────────────┼──────────────────────────────────────────────┤
  │ Missing defaultNamespace                      │ Already fixed (dynamo-system, kai-scheduler) │
  ├───────────────────────────────────────────────┼──────────────────────────────────────────────┤
  │ Kind overlay unconditional                    │ Already fixed (removed)                      │
  ├───────────────────────────────────────────────┼──────────────────────────────────────────────┤
  │ dynamo-crds comment-only values               │ Added enabled: true                          │
  └───────────────────────────────────────────────┴──────────────────────────────────────────────┘
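
For the comment-only values item, a rough sketch of how such files could be flagged. This is a stdlib-only line heuristic, not a YAML parser, and has_yaml_content is a hypothetical name rather than anything in the eidos tooling.

```python
def has_yaml_content(text: str) -> bool:
    """Return True if any line is more than whitespace, a comment, or a document marker."""
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines, full-line comments, and YAML document markers.
        if not stripped or stripped.startswith("#") or stripped in ("---", "..."):
            continue
        return True
    return False


# A values file holding only comments, as flagged in the review.
comment_only = "# dynamo-crds has no configurable values\n\n# see upstream chart\n"
# The same file after adding a real key, as in the fix above.
with_key = "# dynamo-crds has no configurable values\nenabled: true\n"

print(has_yaml_content(comment_only))
print(has_yaml_content(with_key))
```

The heuristic treats any non-comment line as content, so it would not catch a file that contains only an empty mapping such as {}.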

@yuanchen8911 yuanchen8911 requested a review from dims February 10, 2026 22:39
Member

@mchmarny mchmarny left a comment

/lgtm

@yuanchen8911 yuanchen8911 merged commit 725a19e into NVIDIA:main Feb 10, 2026
7 checks passed
2 participants