Commit 216858e

Refactor DRA implementation (#185)

Authored by kalantar and ryojsb.

* breaking change: replace dra
* added new example
* review feedback
* run make generate
* update readme
* ensure accelerator.dra is a boolean in the jsonschema
* update examples/README.md

Signed-off-by: Michael Kalantar <kalantar@us.ibm.com>
Signed-off-by: ryojsb <ryoh80213@gmail.com>
Co-authored-by: ryojsb <ryoh80213@gmail.com>

1 parent c94bd89, commit 216858e
26 files changed: +1314 −777 lines

README.md (58 additions, 1 deletion)

````diff
@@ -9,7 +9,7 @@ TL;DR:
 Active scenarios supported:
 - P/D disaggregation
 - Multi-node inference, utilizing data parallelism
-- One pod per DP rank
+- Dynamic Resource Allocation (DRA) for flexible accelerator management
 
 Integration with `llm-d` components:
 - Quickstart guide in `llm-d-infra` depends on ModelService
@@ -90,7 +90,12 @@ Below are the values you can set.
 | `decode.parallelism.workers` | Number of workers over which data parallelism is implemented | int | 1 |
 | `decode.acceleratorTypes.labelKey` | Key of label on node that identifies the hosted GPU type | string | N/A |
 | `decode.acceleratorTypes.labelValue` | Value of label on node that identifies type of hosted GPU | string | N/A |
+| `decode.resourceClaims` | List of non-accelerator ResourceClaims to create and attach to decode pods | List | [] |
 | `prefill` | Same fields supported in `decode` | See above | See above |
+| `prefill.resourceClaims` | List of non-accelerator ResourceClaims to create and attach to prefill pods | List | [] |
+| `accelerator.type` | Accelerator type (nvidia, intel-gaudi, intel-i915, intel-xe, amd, google) | string | N/A |
+| `accelerator.dra` | Enable Dynamic Resource Allocation (DRA) for accelerators. When true, uses ResourceClaimTemplates instead of device plugins | bool | `false` |
+| `accelerator.resourceClaimTemplates` | Map of accelerator types to ResourceClaimTemplate definitions for DRA mode | map | See values.yaml |
 | `extraObjects` | Additional Kubernetes objects to be deployed alongside the main application | List | [] |
 
 ### Accelerator Resource Configuration
@@ -126,6 +131,58 @@ decode:
 
 This is useful for accelerators like TPUs where tensor parallelism does not equal the number of accelerators.
 
+### Dynamic Resource Allocation (DRA)
+
+The chart supports Kubernetes Dynamic Resource Allocation for flexible accelerator management. Enable DRA mode with `accelerator.dra: true`.
+
+**DRA vs Device Plugin Mode:**
+
+| Aspect | Device Plugin (default) | DRA Mode (`accelerator.dra: true`) |
+|--------|------------------------|-----------------------------------|
+| Accelerator allocation | Via `resources.limits` (e.g., `nvidia.com/gpu: 4`) | Via ResourceClaims and ResourceClaimTemplates |
+| Device count | Manual or auto-calculated | Auto-calculated from parallelism settings |
+| Flexibility | Standard device plugin constraints | Advanced selection criteria via claim templates |
+| Non-accelerator resources | Specified in `resources.limits/requests` | Specified in `resources.limits/requests` (pass-through) |
+
+**Example - DRA Mode:**
+```yaml
+accelerator:
+  type: intel-gaudi
+  dra: true  # Enable DRA
+  resourceClaimTemplates:
+    intel-gaudi:
+      name: gaudi-claim-template
+      class: gaudi.intel.com
+      match: "exactly"
+      count: 2  # Optional override; auto-calculated from parallelism if omitted
+
+decode:
+  parallelism:
+    tensor: 2
+    dataLocal: 1
+  containers:
+  - name: vllm
+    resources:
+      limits:
+        cpu: "4"  # Non-accelerator resources work normally
+        memory: "16Gi"
+      requests:
+        cpu: "2"
+        memory: "8Gi"
+      claims:  # Optional: add non-accelerator claims
+      - name: custom-resource-claim
+  resourceClaims:  # Define non-accelerator claims here
+  - name: custom-resource-claim
+    resourceClaimTemplateName: my-custom-template
+```
+
+**Key Points:**
+- When `accelerator.dra: true`, do NOT specify accelerator resources in `resources.limits` (e.g., don't use `nvidia.com/gpu`)
+- Accelerator allocation is handled automatically via claims
+- Device count is auto-calculated as `parallelism.tensor * parallelism.dataLocal` unless explicitly overridden in `resourceClaimTemplates[].count`
+- CPU, memory, and other non-accelerator resources are specified normally in `resources.limits/requests`
+- User-defined claims for non-accelerator resources (e.g., RDMA, custom devices) can be added via `resourceClaims` and referenced in `resources.claims`
+
 ## Contribute
 
 We welcome contributions to llm-d-modelservice! Please see our [Contributing Guide](CONTRIBUTING.md) for detailed information on how to contribute to this project, including guidelines for submitting issues, pull requests, and development setup.
````
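For the DRA example values in the README above, the chart's helpers would render roughly the following fields into the decode pod spec. This is an illustrative sketch, not verbatim chart output; note that both the pod-level claim name and the referenced template name resolve to the `name` given under `resourceClaimTemplates` (here `gaudi-claim-template`).

```yaml
# Sketch of the rendered decode pod fields (assumes the chart's default naming).
spec:
  resourceClaims:
  - name: gaudi-claim-template           # accelerator claim, added automatically
    resourceClaimTemplateName: gaudi-claim-template
  - name: custom-resource-claim          # user-defined claim, passed through
    resourceClaimTemplateName: my-custom-template
  containers:
  - name: vllm
    resources:
      limits:
        cpu: "4"
        memory: "16Gi"
      requests:
        cpu: "2"
        memory: "8Gi"
      claims:
      - name: gaudi-claim-template       # merged in by the chart
      - name: custom-resource-claim      # from the user's resources.claims
```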

charts/llm-d-modelservice/Chart.yaml (2 additions, 2 deletions)

```diff
@@ -12,8 +12,8 @@ description: A Helm chart for ModelService in llm-d
 type: application
 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
-# Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: "v0.3.18"
+# Versions are expected to follow Semantic Versioning (https://semver.org/)
+version: "v0.4.0"
 # This is the version number of the application being deployed. This version number should be
 # incremented each time you make changes to the application. Versions are not expected to
 # follow Semantic Versioning. They should reflect the version the application is using.
```
charts/llm-d-modelservice/templates/_dra.tpl (0 additions, 45 deletions)

This file was deleted.

Lines changed: 106 additions & 0 deletions

```
{{/*
DRA (Dynamic Resource Allocation) Helper Functions
*/}}

{{/* Check if DRA is enabled */}}
{{- define "llm-d-modelservice.draEnabled" -}}
{{- if .Values.accelerator.dra -}}
true
{{- else -}}
false
{{- end -}}
{{- end }}

{{/* Get accelerator type */}}
{{- define "llm-d-modelservice.acceleratorType" -}}
{{- .Values.accelerator.type | default "nvidia" -}}
{{- end }}

{{/* Get accelerator claim name based on type */}}
{{- define "llm-d-modelservice.acceleratorClaimName" -}}
{{- $acceleratorType := include "llm-d-modelservice.acceleratorType" . -}}
{{- if hasKey .Values.accelerator.resourceClaimTemplates $acceleratorType -}}
{{- $template := index .Values.accelerator.resourceClaimTemplates $acceleratorType -}}
{{- $template.name | default (printf "%s-claim" $acceleratorType) -}}
{{- else -}}
{{- printf "%s-claim" $acceleratorType -}}
{{- end -}}
{{- end }}

{{/* Get accelerator claim template name */}}
{{- define "llm-d-modelservice.acceleratorClaimTemplateName" -}}
{{- $acceleratorType := include "llm-d-modelservice.acceleratorType" . -}}
{{- if hasKey .Values.accelerator.resourceClaimTemplates $acceleratorType -}}
{{- $template := index .Values.accelerator.resourceClaimTemplates $acceleratorType -}}
{{- $template.name | default (printf "%s-claim-template" $acceleratorType) -}}
{{- else -}}
{{- printf "%s-claim-template" $acceleratorType -}}
{{- end -}}
{{- end }}

{{/* Get DRA claim count (auto-calculate from parallelism if not set) */}}
{{- define "llm-d-modelservice.draClaimCount" -}}
{{- $acceleratorType := include "llm-d-modelservice.acceleratorType" . -}}
{{- $count := 1 -}}
{{- if hasKey .Values.accelerator.resourceClaimTemplates $acceleratorType -}}
{{- $template := index .Values.accelerator.resourceClaimTemplates $acceleratorType -}}
{{- if hasKey $template "count" -}}
{{- $count = $template.count -}}
{{- else -}}
{{- /* Auto-calculate from parallelism */}}
{{- $count = int (include "llm-d-modelservice.numGpuPerWorker" .parallelism) -}}
{{- end -}}
{{- else -}}
{{- $count = int (include "llm-d-modelservice.numGpuPerWorker" .parallelism) -}}
{{- end -}}
{{- $count -}}
{{- end }}

{{/* Generate pod-level resourceClaims (merges accelerator + user-defined claims) */}}
{{- define "llm-d-modelservice.podResourceClaims" -}}
{{- $claims := list -}}
{{- $draEnabled := eq (include "llm-d-modelservice.draEnabled" .) "true" -}}
{{- if $draEnabled -}}
{{- $claimName := include "llm-d-modelservice.acceleratorClaimName" . -}}
{{- $templateName := include "llm-d-modelservice.acceleratorClaimTemplateName" . -}}
{{- $claims = append $claims (dict "name" $claimName "resourceClaimTemplateName" $templateName) -}}
{{- end -}}
{{- if .pdSpec.resourceClaims -}}
{{- $claims = concat $claims .pdSpec.resourceClaims -}}
{{- end -}}
{{- if $claims -}}
resourceClaims:
{{- toYaml $claims | nindent 2 }}
{{- end -}}
{{- end }}

{{/* Generate container-level resource claims (merges accelerator + user-defined claims) */}}
{{- define "llm-d-modelservice.containerResourceClaims" -}}
{{- $claims := list -}}
{{- $draEnabled := eq (include "llm-d-modelservice.draEnabled" .) "true" -}}
{{- if $draEnabled -}}
{{- $claimName := include "llm-d-modelservice.acceleratorClaimName" . -}}
{{- $claims = append $claims (dict "name" $claimName) -}}
{{- end -}}
{{- if and .resources .resources.claims -}}
{{- if kindIs "slice" .resources.claims -}}
{{- $claims = concat $claims .resources.claims -}}
{{- else -}}
{{- fail "resources.claims must be a list of objects with 'name' field, e.g., [{\"name\": \"claim-name\"}]" -}}
{{- end -}}
{{- end -}}
{{- if $claims -}}
claims:
{{- toYaml $claims | nindent 2 }}
{{- end -}}
{{- end }}

{{/* Get DRA ResourceClaimTemplate configuration for the current accelerator type */}}
{{- define "llm-d-modelservice.draResourceClaimTemplateConfig" -}}
{{- $acceleratorType := include "llm-d-modelservice.acceleratorType" . -}}
{{- $config := dict -}}
{{- if hasKey .Values.accelerator.resourceClaimTemplates $acceleratorType -}}
{{- $config = index .Values.accelerator.resourceClaimTemplates $acceleratorType -}}
{{- end -}}
{{- $config | toJson -}}
{{- end }}
```
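As a worked example of how these helpers resolve (a sketch; it assumes `numGpuPerWorker`, defined in `_helpers.tpl`, multiplies `parallelism.tensor` by `parallelism.dataLocal` — the values and accelerator type below are hypothetical):

```yaml
# Hypothetical input values, for illustration only:
accelerator:
  type: amd
  dra: true
  resourceClaimTemplates:
    amd:
      class: gpu.amd.com   # no `name` or `count` set
# With decode.parallelism.tensor = 4 and dataLocal = 2, the helpers resolve to:
#   llm-d-modelservice.acceleratorClaimName         -> "amd-claim"          (default "%s-claim")
#   llm-d-modelservice.acceleratorClaimTemplateName -> "amd-claim-template" (default "%s-claim-template")
#   llm-d-modelservice.draClaimCount                -> 8                    (auto-calculated from parallelism)
```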

charts/llm-d-modelservice/templates/_helpers.tpl (37 additions, 24 deletions)

```diff
@@ -272,34 +272,52 @@ nvidia.com/gpu
 
 {{/* P/D deployment container resources */}}
 {{- define "llm-d-modelservice.resources" -}}
-{{- $numGpus := int (include "llm-d-modelservice.numGpuPerWorker" .parallelism) -}}
-{{- $acceleratorResource := include "llm-d-modelservice.acceleratorResource" . -}}
 {{- $limits := dict }}
 {{- if and .resources .resources.limits }}
-{{- $limits = deepCopy .resources.limits }}
-{{- end }}
-{{- if and (ge (int $numGpus) 1) (ne $acceleratorResource "") }}
-{{- /* Respect user's explicit accelerator setting; only auto-fill if not set */}}
-{{- /* This allows TPUs where tensor_parallelism != num_accelerators (e.g., TP=8 needs 4 TPUs) */}}
-{{- if not (hasKey $limits $acceleratorResource) }}
-{{- $limits = mergeOverwrite $limits (dict $acceleratorResource (toString $numGpus)) }}
-{{- end }}
+{{- $limits = deepCopy .resources.limits }}
 {{- end }}
 {{- $requests := dict }}
 {{- if and .resources .resources.requests }}
-{{- $requests = deepCopy .resources.requests }}
-{{- end }}
-{{- if and (ge (int $numGpus) 1) (ne $acceleratorResource "") }}
-{{- /* Respect user's explicit accelerator setting; only auto-fill if not set */}}
-{{- if not (hasKey $requests $acceleratorResource) }}
-{{- $requests = mergeOverwrite $requests (dict $acceleratorResource (toString $numGpus)) }}
-{{- end }}
+{{- $requests = deepCopy .resources.requests }}
 {{- end }}
+{{- $draEnabled := eq (include "llm-d-modelservice.draEnabled" .) "true" -}}
+{{- if $draEnabled -}}
+{{- /* DRA mode: pass through user-defined limits/requests as-is, add claims */}}
+{{- /* Users should not include accelerator resources in limits when DRA is enabled */}}
+resources:
+  limits:
+    {{- toYaml $limits | nindent 4 }}
+  requests:
+    {{- toYaml $requests | nindent 4 }}
+  {{- include "llm-d-modelservice.containerResourceClaims" . | nindent 2 }}
+{{- else -}}
+{{- /* Device Plugin mode: existing logic */}}
+{{- $numGpus := int (include "llm-d-modelservice.numGpuPerWorker" .parallelism) -}}
+{{- $acceleratorResource := include "llm-d-modelservice.acceleratorResource" . -}}
+{{- if and (ge (int $numGpus) 1) (ne $acceleratorResource "") }}
+{{- /* Respect user's explicit accelerator setting; only auto-fill if not set */}}
+{{- /* This allows TPUs where tensor_parallelism != num_accelerators (e.g., TP=8 needs 4 TPUs) */}}
+{{- if not (hasKey $limits $acceleratorResource) }}
+{{- $limits = mergeOverwrite $limits (dict $acceleratorResource (toString $numGpus)) }}
+{{- end }}
+{{- end }}
+{{- if and (ge (int $numGpus) 1) (ne $acceleratorResource "") }}
+{{- /* Respect user's explicit accelerator setting; only auto-fill if not set */}}
+{{- if not (hasKey $requests $acceleratorResource) }}
+{{- $requests = mergeOverwrite $requests (dict $acceleratorResource (toString $numGpus)) }}
+{{- end }}
+{{- end }}
 resources:
   limits:
     {{- toYaml $limits | nindent 4 }}
   requests:
     {{- toYaml $requests | nindent 4 }}
+  {{- /* Include user-defined claims even in Device Plugin mode */}}
+  {{- if and .resources .resources.claims }}
+  claims:
+    {{- toYaml .resources.claims | nindent 4 }}
+  {{- end }}
+{{- end -}}
 {{- end }}
 
 {{/* prefill name */}}
@@ -416,9 +434,8 @@ context is a pdSpec
 {{- if $hasModelVolume }}
 {{ include "llm-d-modelservice.mountModelVolumeVolumes" .Values.modelArtifacts | nindent 4}}
 {{- end -}}
-{{- if .Values.dra.enabled -}}
-{{- (include "llm-d-modelservice.draResourceClaims" (dict "Values" .Values)) | nindent 2 }}
-{{- end -}}
+{{- /* Add resourceClaims for DRA (new and old API) */}}
+{{- include "llm-d-modelservice.podResourceClaims" . | nindent 2 }}
 {{- end }}
 
 {{/*
@@ -477,11 +494,7 @@ context is a dict with helm root context plus:
 startupProbe:
   {{- toYaml . | nindent 4 }}
 {{- end }}
-{{- if .Values.dra.enabled }}
-{{- (include "llm-d-modelservice.draResources" (dict "resources" .container.resources "parallelism" .parallelism "container" .container "Values" .Values)) | nindent 2 }}
-{{- else }}
 {{- (include "llm-d-modelservice.resources" (dict "resources" .container.resources "parallelism" .parallelism "container" .container "Values" .Values)) | nindent 2 }}
-{{- end }}
 {{- include "llm-d-modelservice.mountModelVolumeVolumeMounts" (dict "container" .container "Values" .Values) | nindent 2 }}
 {{- /* DEPRECATED; use extraConfig.workingDir instead */ -}}
 {{- with .container.workingDir }}
```
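To illustrate the Device Plugin branch of `llm-d-modelservice.resources` above (a sketch; it assumes the resolved `acceleratorResource` is `nvidia.com/gpu` and `numGpuPerWorker` evaluates to 4):

```yaml
# User-provided container resources:
resources:
  limits:
    cpu: "8"
# Rendered output: the accelerator resource is auto-filled into limits and
# requests only because the user did not set it explicitly:
#   resources:
#     limits:
#       cpu: "8"
#       nvidia.com/gpu: "4"
#     requests:
#       nvidia.com/gpu: "4"
# An explicit user value (e.g., nvidia.com/gpu: "2") would be kept as-is,
# which is what makes TPU-style setups (TP != accelerator count) possible.
```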
Lines changed: 33 additions & 4 deletions

```diff
@@ -1,10 +1,39 @@
-{{- if .Values.dra.enabled -}}
+{{- $draEnabled := eq (include "llm-d-modelservice.draEnabled" .) "true" -}}
+{{- if $draEnabled -}}
+{{- $acceleratorType := include "llm-d-modelservice.acceleratorType" . -}}
+{{- $templateName := include "llm-d-modelservice.acceleratorClaimTemplateName" . -}}
+{{- $configJson := include "llm-d-modelservice.draResourceClaimTemplateConfig" . -}}
+{{- $config := $configJson | fromJson -}}
+{{- if $config -}}
+{{- /* Calculate count from parallelism if not explicitly set */}}
+{{- $count := 1 -}}
+{{- if hasKey $config "count" -}}
+{{- $count = $config.count -}}
+{{- else -}}
+{{- /* Auto-calculate from decode parallelism (use decode as default) */}}
+{{- $count = int (include "llm-d-modelservice.numGpuPerWorker" .Values.decode.parallelism) -}}
+{{- end -}}
+{{- $class := $config.class | default (printf "gpu.%s.com" $acceleratorType) -}}
+{{- $match := $config.match | default "exactly" -}}
+{{- $selectors := $config.selectors | default list -}}
+---
 apiVersion: resource.k8s.io/v1
 kind: ResourceClaimTemplate
 metadata:
-  name: {{ .Values.dra.type }}-resource-claim-template
+  name: {{ $templateName }}
+  labels:
+    {{- include "llm-d-modelservice.labels" . | nindent 4 }}
 spec:
   spec:
     devices:
-      {{- (include "llm-d-modelservice.draResourceClaimDeviceClaim" (dict "Values" .Values)) | nindent 6 }}
-{{- end}}
+      requests:
+      - name: {{ $acceleratorType }}
+        {{ $match }}:
+          deviceClassName: {{ $class }}
+          count: {{ $count }}
+          {{- if $selectors }}
+          selectors:
+            {{- toYaml $selectors | nindent 10 }}
+          {{- end }}
+{{- end -}}
+{{- end -}}
```
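For the README's intel-gaudi example values, this template would render roughly the following manifest (a sketch, with the chart's standard labels omitted for brevity):

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gaudi-claim-template      # from resourceClaimTemplates.intel-gaudi.name
spec:
  spec:
    devices:
      requests:
      - name: intel-gaudi         # the accelerator type
        exactly:                  # the configured `match` mode
          deviceClassName: gaudi.intel.com
          count: 2                # explicit override from the example values
```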
