Commit c835798

feat: install mock-device-plugin (#1534)
* feat: add mock device plugin
  Signed-off-by: james <open4pd@4paradigm.com>
* fix: fix format
  Signed-off-by: james <open4pd@4paradigm.com>
* fix: fix comment
  Signed-off-by: james <open4pd@4paradigm.com>
* feat: add memoryfactor
  Signed-off-by: james <open4pd@4paradigm.com>
* fix: fix test case
  Signed-off-by: james <open4pd@4paradigm.com>
---------
Signed-off-by: james <open4pd@4paradigm.com>
1 parent b8d7aa6 commit c835798

16 files changed

Lines changed: 319 additions & 11 deletions

charts/hami/templates/_helpers.tpl

Lines changed: 11 additions & 0 deletions
@@ -48,6 +48,13 @@ The app name for DevicePlugin
 {{- printf "%s-device-plugin" ( include "hami-vgpu.fullname" . ) | trunc 63 | trimSuffix "-" -}}
 {{- end -}}
 
+{{/*
+The app name for MockDevicePlugin
+*/}}
+{{- define "hami-vgpu.mock-device-plugin" -}}
+{{- printf "%s-mock-device-plugin" ( include "hami-vgpu.fullname" . ) | trunc 63 | trimSuffix "-" -}}
+{{- end -}}
+
 {{/*
 The tls secret name for Scheduler
 */}}
@@ -123,6 +130,10 @@ app.kubernetes.io/instance: {{ .Release.Name }}
 {{ include "common.images.image" (dict "imageRoot" .Values.devicePlugin.image "global" .Values.global "tag" .Values.global.imageTag) }}
 {{- end -}}
 
+{{- define "hami.mockDevicePlugin.image" -}}
+{{ include "common.images.image" (dict "imageRoot" .Values.mockDevicePlugin.image "global" .Values.global "tag" .Values.mockDevicePlugin.tag) }}
+{{- end -}}
+
 {{- define "hami.devicePlugin.monitor.image" -}}
 {{ include "common.images.image" (dict "imageRoot" .Values.devicePlugin.monitor.image "global" .Values.global "tag" .Values.global.imageTag) }}
 {{- end -}}
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+{{- if .Values.mockDevicePlugin.enabled }}
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: {{ include "hami-vgpu.mock-device-plugin" . }}
+  namespace: {{ include "hami-vgpu.namespace" . }}
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/component: hami-mock-device-plugin
+      {{- include "hami-vgpu.selectorLabels" . | nindent 6 }}
+  template:
+    metadata:
+      annotations:
+        scheduler.alpha.kubernetes.io/critical-pod: ""
+      labels:
+        app.kubernetes.io/component: hami-mock-device-plugin
+        {{- include "hami-vgpu.selectorLabels" . | nindent 8 }}
+    spec:
+      serviceAccountName: {{ include "hami-vgpu.mock-device-plugin" . }}
+      tolerations:
+        - key: CriticalAddonsOnly
+          operator: Exists
+      containers:
+        - image: {{ include "hami.mockDevicePlugin.image" . }}
+          imagePullPolicy: {{ .Values.mockDevicePlugin.image.pullPolicy }}
+          name: hami-mock-dp-cntr
+          env:
+            - name: NODE_NAME
+              valueFrom:
+                fieldRef:
+                  fieldPath: spec.nodeName
+          command:
+            - ./k8s-device-plugin
+            - -v=5
+            - --device-config-file=/device-config.yaml
+          volumeMounts:
+            - name: dp
+              mountPath: /var/lib/kubelet/device-plugins
+            - name: sys
+              mountPath: /sys
+            - name: device-config
+              mountPath: /device-config.yaml
+              subPath: device-config.yaml
+      volumes:
+        - name: dp
+          hostPath:
+            path: /var/lib/kubelet/device-plugins
+        - name: sys
+          hostPath:
+            path: /sys
+        - name: device-config
+          configMap:
+            name: {{ include "hami-vgpu.scheduler" . }}-device
+{{- end -}}

charts/hami/templates/scheduler/clusterrole.yaml

Lines changed: 11 additions & 1 deletion
@@ -21,4 +21,14 @@ rules:
   - apiGroups: [""]
     resources: ["resourcequotas"]
     verbs: ["get", "list", "watch"]
-
+---
+{{- if .Values.mockDevicePlugin.enabled }}
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: {{ include "hami-vgpu.mock-device-plugin" . }}
+rules:
+  - apiGroups: [""]
+    resources: ["nodes"]
+    verbs: ["get", "update", "list", "patch"]
+{{- end -}}

charts/hami/templates/scheduler/device-configmap.yaml

Lines changed: 8 additions & 0 deletions
@@ -21,6 +21,7 @@ data:
       defaultMemory: 0
       defaultCores: 0
       defaultGPUNum: 1
+      memoryFactor: 1
       deviceSplitCount: {{ .Values.devicePlugin.deviceSplitCount }}
       deviceMemoryScaling: {{ .Values.devicePlugin.deviceMemoryScaling }}
       deviceCoreScaling: {{ .Values.devicePlugin.deviceCoreScaling }}
@@ -235,6 +236,7 @@ data:
       resourceCountName: {{ .Values.dcuResourceName }}
       resourceMemoryName: {{ .Values.dcuResourceMem }}
       resourceCoreName: {{ .Values.dcuResourceCores }}
+      memoryFactor: 1
     metax:
       resourceCountName: "metax-tech.com/gpu"
       resourceVCountName: {{ .Values.metaxResourceName }}
@@ -286,6 +288,7 @@ data:
       resourceMemoryName: huawei.com/Ascend910A-memory
       memoryAllocatable: 32768
       memoryCapacity: 32768
+      memoryFactor: 1
       aiCore: 30
       templates:
         - name: vir02
@@ -306,6 +309,7 @@ data:
       resourceMemoryName: huawei.com/Ascend910B2-memory
       memoryAllocatable: 65536
       memoryCapacity: 65536
+      memoryFactor: 1
       aiCore: 24
       aiCPU: 6
       templates:
@@ -327,6 +331,7 @@ data:
       resourceMemoryName: huawei.com/Ascend910B3-memory
       memoryAllocatable: 65536
       memoryCapacity: 65536
+      memoryFactor: 1
       aiCore: 20
       aiCPU: 7
       templates:
@@ -344,6 +349,7 @@ data:
       resourceMemoryName: huawei.com/Ascend910B4-1-memory
       memoryAllocatable: 65536
       memoryCapacity: 65536
+      memoryFactor: 1
       aiCore: 20
       aiCPU: 7
       templates:
@@ -365,6 +371,7 @@ data:
       resourceMemoryName: huawei.com/Ascend910B4-memory
       memoryAllocatable: 32768
       memoryCapacity: 32768
+      memoryFactor: 1
       aiCore: 20
       aiCPU: 7
       templates:
@@ -382,6 +389,7 @@ data:
       resourceMemoryName: huawei.com/Ascend310P-memory
       memoryAllocatable: 21527
       memoryCapacity: 24576
+      memoryFactor: 1
       aiCore: 8
       aiCPU: 7
       templates:

charts/hami/templates/scheduler/rolebinding.yaml

Lines changed: 15 additions & 0 deletions
@@ -14,3 +14,18 @@ subjects:
   - kind: ServiceAccount
     name: {{ include "hami-vgpu.scheduler" . }}
     namespace: {{ include "hami-vgpu.namespace" . }}
+---
+{{- if .Values.mockDevicePlugin.enabled }}
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: {{ include "hami-vgpu.mock-device-plugin" . }}
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: {{ include "hami-vgpu.mock-device-plugin" . }}
+subjects:
+  - kind: ServiceAccount
+    name: {{ include "hami-vgpu.mock-device-plugin" . }}
+    namespace: {{ include "hami-vgpu.namespace" . }}
+{{- end -}}

charts/hami/templates/scheduler/serviceaccount.yaml

Lines changed: 8 additions & 0 deletions
@@ -6,3 +6,11 @@ metadata:
   labels:
     app.kubernetes.io/component: "hami-scheduler"
     {{- include "hami-vgpu.labels" . | nindent 4 }}
+---
+{{- if .Values.mockDevicePlugin.enabled }}
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: {{ include "hami-vgpu.mock-device-plugin" . }}
+  namespace: {{ include "hami-vgpu.namespace" . }}
+{{- end -}}

charts/hami/values.yaml

Lines changed: 19 additions & 0 deletions
@@ -374,6 +374,25 @@ devicePlugin:
   # cpu: 100m
   # memory: 100Mi
 
+mockDevicePlugin:
+  enabled: false
+  image:
+    registry: "docker.io"
+    repository: "projecthami/mock-device-plugin"
+    tag: "0.1.0"
+    ## Specify a imagePullPolicy
+    ## Defaults to 'Always' if image tag is 'latest', else set to 'IfNotPresent'
+    ## ref: https://kubernetes.io/docs/user-guide/images/#pre-pulling-images
+    ##
+    pullPolicy: IfNotPresent
+    ## Optionally specify an array of imagePullSecrets.
+    ## Secrets must be manually created in the namespace.
+    ## Example:
+    ## pullSecrets:
+    ## - myRegistryKeySecretName
+    ##
+    pullSecrets: []
+
 devices:
   amd:
     customresources:

docs/config.md

Lines changed: 2 additions & 0 deletions
@@ -31,6 +31,8 @@ You can update these configurations using one of the following methods:
   Note: When a container requests `nvidia.com/gpu` and its GPU memory reservation is exclusive (for example `nvidia.com/gpumem-percentage` is 100, or memory fields are omitted so `nvidia.defaultMem` remains 0 and defaults to 100%), and the pod spec does not set `nvidia.com/gpucores`, HAMi defaults `nvidia.com/gpucores` to 100 during admission. Non-exclusive memory requests or pods that already set `nvidia.com/gpucores` remain unchanged.
 * `nvidia.defaultGPUNum`:
   Integer type, by default: equals 1, if configuration value is 0, then the configuration value will not take effect and will be filtered. when a user does not set nvidia.com/gpu this key in pod resource, webhook should check nvidia.com/gpumem、resource-mem-percentage、nvidia.com/gpucores this three key, anyone a key having value, webhook should add nvidia.com/gpu key and this default value to resources limits map.
+* `nvidia.memoryFactor`:
+  Integer type, by default: equals 1. During resource requests, the actual value of `nvidia.com/gpumem` will be multiplied by this factor. If `mock-device-plugin` is deployed, the actual value `nvidia.com/gpumem` in `node.status.capacity` will also be amplified by the corresponding multiple.
 * `nvidia.resourceCountName`:
   String type, vgpu number resource name, default: "nvidia.com/gpu"
 * `nvidia.resourceMemoryName`:
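
The `nvidia.memoryFactor` entry states that the requested `nvidia.com/gpumem` value is multiplied by the factor. One consequence, sketched below in Go, is that a user who wants a given amount of memory has to request fewer units, rounded up. This is only an arithmetic illustration under that reading of the documentation: the helper `unitsFor` is hypothetical and does not exist in HAMi, and treating a factor of 1 or less as "no scaling" is an assumption.

```go
package main

import "fmt"

// unitsFor returns the smallest request value whose scaled result
// (request * factor) is at least `want`. Hypothetical helper, shown only
// to illustrate the arithmetic implied by memoryFactor.
func unitsFor(want, factor int64) int64 {
	if factor <= 1 {
		// Assumption: a factor of 1 (the default) or less means no scaling.
		return want
	}
	// Ceiling division so that the scaled request never falls short.
	return (want + factor - 1) / factor
}

func main() {
	const factor = 4
	for _, want := range []int64{1000, 1001} {
		units := unitsFor(want, factor)
		fmt.Printf("want=%d request=%d scaled=%d\n", want, units, units*factor)
	}
}
```

With a factor of 4, a target of 1000 maps to a request of 250, while a target of 1001 rounds up to 251 and is accounted as 1004.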

docs/config_cn.md

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,8 @@
 * `nvidia.defaultGPUNum`
   Integer type, default 1. If set to 0, the setting does not take effect. When a Pod's resources do not set the nvidia.com/gpu key, the webhook checks the nvidia.com/gpumem,
   resource-mem-percentage, and nvidia.com/gpucores keys; if any of the three has a value, the webhook adds the nvidia.com/gpu key with this default value to the resource limits.
+* `nvidia.memoryFactor`:
+  Integer type, default 1. When resources are requested, the actual value of `nvidia.com/gpumem` is multiplied by this factor. If `mock-device-plugin` is deployed, the actual value in `node.status.capacity` is also multiplied by the same factor.
 * `nvidia.resourceCountName`
   String type, the resource name used to request a number of vGPUs, default: "nvidia.com/gpu"
 * `nvidia.resourceMemoryName`

pkg/device/ascend/device.go

Lines changed: 5 additions & 0 deletions
@@ -263,6 +263,11 @@ func (dev *Devices) GenerateResourceRequests(ctr *corev1.Container) device.Conta
 		if ok {
 			memnums, ok := mem.AsInt64()
 			if ok {
+				if dev.config.MemoryFactor > 1 {
+					rawMemnums := memnums
+					memnums = memnums * int64(dev.config.MemoryFactor)
+					klog.V(4).Infof("Update Ascend memory request. before %d, after %d, factor %d", rawMemnums, memnums, dev.config.MemoryFactor)
+				}
 				m, _ := dev.trimMemory(memnums)
 				memnum = int(m)
 			}
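
The Ascend change only scales the request when `MemoryFactor` is greater than 1, so the default of 1 (and an unset value of 0) leaves existing requests untouched. The standalone Go sketch below isolates that guard. The function `scaleMemoryRequest` is hypothetical, and the `int32` type for the factor is an assumption; the diff only shows that the field is converted with `int64(...)`.

```go
package main

import "fmt"

// scaleMemoryRequest mirrors the guard added in GenerateResourceRequests:
// the request is multiplied only when the factor is greater than 1.
// Hypothetical helper; the int32 factor type is assumed, not taken from HAMi.
func scaleMemoryRequest(memnums int64, memoryFactor int32) int64 {
	if memoryFactor > 1 {
		return memnums * int64(memoryFactor)
	}
	// Factors of 0 (unset) or 1 (default) leave the request as-is.
	return memnums
}

func main() {
	for _, factor := range []int32{0, 1, 4} {
		fmt.Printf("factor=%d before=%d after=%d\n",
			factor, 1024, scaleMemoryRequest(1024, factor))
	}
}
```

In the commit itself the scaled value is then passed to `dev.trimMemory`, which this sketch does not model.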
