HAMi 2.5.0: Device_utilization_desc_of_container compute-utilization monitoring data is abnormal #890

Open
@mxhyxym

Description

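In HAMi 2.5.0, the compute-utilization monitoring data reported by Device_utilization_desc_of_container is abnormal.
Symptoms: the monitoring data for the physical GPUs on the host is normal, and the compute utilization shown by nvidia-smi inside the container is also normal, but the compute utilization of the virtual GPUs returned by the metrics endpoint is abnormal.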
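To make the comparison above easier to reproduce, here is a minimal sketch that pulls the Device_utilization_desc_of_container samples out of a Prometheus text-format metrics dump so they can be checked against nvidia-smi. Everything besides the metric name is an assumption for illustration: the label names (`deviceuuid`, `podname`), the values in `SAMPLE`, and the idea that a "normal" value lies between 0 and 100 are made up here and are not taken from this report or confirmed against HAMi's exporter.

```python
import re

METRIC = "Device_utilization_desc_of_container"

# One sample line in the Prometheus text format: name{labels} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Hypothetical dump: label names and values are invented for this example.
SAMPLE = """\
# TYPE Device_utilization_desc_of_container gauge
Device_utilization_desc_of_container{deviceuuid="GPU-0",podname="example-pod"} 37
Device_utilization_desc_of_container{deviceuuid="GPU-1",podname="example-pod"} 4294967295
other_metric{deviceuuid="GPU-0"} 1
"""


def parse_samples(text, metric=METRIC):
    """Return a list of (labels_dict, value) pairs for one metric name."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        samples.append((labels, float(m.group("value"))))
    return samples


def out_of_range(samples, low=0.0, high=100.0):
    """Flag samples outside [low, high]; the 0-100 range is an assumption."""
    return [(labels, value) for labels, value in samples
            if not (low <= value <= high)]


if __name__ == "__main__":
    for labels, value in out_of_range(parse_samples(SAMPLE)):
        print(labels, value)
```

To run it against a real node, replace `SAMPLE` with the text returned by the node's metrics endpoint (for example, the saved output of a `curl` against that endpoint) and compare the remaining values with what nvidia-smi reports inside the container.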

Workload YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: linc-deepseek-r1-32b-pod
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      name: linc-deepseek-r1-32b-pod
  template:
    metadata:
      labels:
        name: linc-deepseek-r1-32b-pod
    spec:
      schedulerName: default-scheduler
      nodeSelector:
        "kubernetes.io/hostname": "k8s-10-15-12-12"
      hostPID: true
      containers:
      - name: linc-deepseek-r1-32b-pod
        image: easzlab.io.local:5000/library/magicllm_deploy_xinfer:test2.2
        imagePullPolicy: IfNotPresent
        command: ["bash", "-c"] 
        args: ["while true; do sleep 30; done;"]
        resources:
          requests:
            cpu: '4'
            memory: 62Gi
            nvidia.com/gpu: 8
            nvidia.com/gpumem: 79000
            nvidia.com/gpucores: 100
          limits:
            cpu: '4'
            memory: 62Gi
            nvidia.com/gpu: 8
            nvidia.com/gpumem: 79000
            nvidia.com/gpucores: 100

Environment:

  • HAMi version: 2.5.0
  • NVIDIA driver: 560.28.03 (CUDA 12.6)
  • Kernel version from uname -a: Linux k8s-10-15-12-12 6.8.0-52-generic # 53~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jan 15 19:18:46 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
  • GPU: NVIDIA H100 80G

Labels: kind/bug