Description
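In HAMi 2.5.0, the Device_utilization_desc_of_container metric reports abnormal compute-utilization data.
Symptoms: utilization monitoring for the physical GPUs on the host is normal, and nvidia-smi inside the container also shows normal compute utilization, but the compute utilization reported for the virtual GPUs by the metrics endpoint is abnormal.
Task YAML: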
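To compare the metrics endpoint against nvidia-smi, it can help to pull only the affected series out of the scrape output. The following is a minimal sketch, not part of the original report: it assumes the monitor exposes standard Prometheus text-format output, and the URL is passed on the command line because the monitor's address and port depend on the deployment. The helper names `filter_metric` and `fetch_metrics` are hypothetical.

```python
import sys
import urllib.request

# Metric name taken from the issue title.
METRIC = "Device_utilization_desc_of_container"


def filter_metric(text, metric=METRIC):
    """Return the sample lines for `metric` from Prometheus text-format output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == metric:
            samples.append(line)
    return samples


def fetch_metrics(url, timeout=5.0):
    """Download the raw metrics page (URL is deployment-specific, an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Usage: python check_metric.py http://<node-ip>:<monitor-port>/metrics
    for sample in filter_metric(fetch_metrics(sys.argv[1])):
        print(sample)
```

Running this while a workload is active in the container and comparing the printed values with nvidia-smi inside the same container should show whether the discrepancy is in the exported series itself.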
apiVersion: apps/v1
kind: Deployment
metadata:
  name: linc-deepseek-r1-32b-pod
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      name: linc-deepseek-r1-32b-pod
  template:
    metadata:
      labels:
        name: linc-deepseek-r1-32b-pod
    spec:
      schedulerName: default-scheduler
      nodeSelector:
        "kubernetes.io/hostname": "k8s-10-15-12-12"
      hostPID: true
      containers:
        - name: linc-deepseek-r1-32b-pod
          image: easzlab.io.local:5000/library/magicllm_deploy_xinfer:test2.2
          imagePullPolicy: IfNotPresent
          command: ["bash", "-c"]
          args: ["while true; do sleep 30; done;"]
          resources:
            requests:
              cpu: '4'
              memory: 62Gi
              nvidia.com/gpu: 8
              nvidia.com/gpumem: 79000
              nvidia.com/gpucores: 100
            limits:
              cpu: '4'
              memory: 62Gi
              nvidia.com/gpu: 8
              nvidia.com/gpumem: 79000
              nvidia.com/gpucores: 100
Environment:
- HAMi version: 2.5.0
- NVIDIA driver: 560.28.03 (CUDA 12.6)
- Kernel version from uname -a: Linux k8s-10-15-12-12 6.8.0-52-generic # 53~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jan 15 19:18:46 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
- GPU: NVIDIA H100 80G