
Cambricon MLU370: on a node with multiple cards, HAMi schedules pods onto only one physical card #946

Open
@DanceKiddle

Description


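Background:

A single k8s node has 8 physical MLU370 cards, all with smlu enabled. HAMi version is v2.5.0; k8s version is 1.23.

After starting 4 tasks, all pods were found running on the physical card with id 7. The pods run normally, and cnmon on the host shows that 4 smlu instances were created for card 7.

Issue 1: At this point, in the data retrieved from 31993/metrics, every allocated value is 0: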

GPUDeviceCoreAllocated{deviceidx="6",deviceuuid="100.10.15.224-cambricon-mlu-6",nodeid="100.10.15.224",zone="vGPU"} 0
GPUDeviceCoreAllocated{deviceidx="7",deviceuuid="100.10.15.224-cambricon-mlu-7",nodeid="100.10.15.224",zone="vGPU"} 0
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="6",deviceuuid="100.10.15.224-cambricon-mlu-6",nodeid="100.10.15.224",zone="vGPU"} 0
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="7",deviceuuid="100.10.15.224-cambricon-mlu-7",nodeid="100.10.15.224",zone="vGPU"} 0

These metrics should show the allocated amounts.
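For anyone trying to reproduce this, a minimal Python sketch for checking the exported values is shown here. It only parses the text lines quoted above; the helper names `parse_metrics` and `allocated_by_device` are made up for illustration and are not part of HAMi. In practice the text would come from the 31993/metrics endpoint rather than the embedded sample.

```python
import re

# Sample lines copied verbatim from the metrics output quoted in this issue.
SAMPLE = '''
GPUDeviceCoreAllocated{deviceidx="6",deviceuuid="100.10.15.224-cambricon-mlu-6",nodeid="100.10.15.224",zone="vGPU"} 0
GPUDeviceCoreAllocated{deviceidx="7",deviceuuid="100.10.15.224-cambricon-mlu-7",nodeid="100.10.15.224",zone="vGPU"} 0
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="6",deviceuuid="100.10.15.224-cambricon-mlu-6",nodeid="100.10.15.224",zone="vGPU"} 0
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="7",deviceuuid="100.10.15.224-cambricon-mlu-7",nodeid="100.10.15.224",zone="vGPU"} 0
'''

# name{label="value",...} number
LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse labelled metric lines into (name, labels, value) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blanks and HELP/TYPE comments
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines without labels; not needed for this check
        labels = dict(LABEL_RE.findall(m.group('labels')))
        rows.append((m.group('name'), labels, float(m.group('value'))))
    return rows


def allocated_by_device(rows):
    """Group the two *Allocated metrics by the deviceidx label."""
    out = {}
    for name, labels, value in rows:
        if name in ('GPUDeviceCoreAllocated', 'GPUDeviceMemoryAllocated'):
            out.setdefault(labels.get('deviceidx'), {})[name] = value
    return out


if __name__ == '__main__':
    for idx, values in sorted(allocated_by_device(parse_metrics(SAMPLE)).items()):
        print(idx, values)
```

With 4 smlu instances on card 7, non-zero values would be expected for `deviceidx="7"`; the sample shows 0 for both metrics.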

Issue 2:

When more pods were added afterwards, they could not be scheduled. The scheduler reports:

Allocate failed due toerror: code = Unknown desc = fourPDcexist profile 2 for device 7 but its remain 0 is invaild, which is unexpected

If card 7 has insufficient resources, the pod should be scheduled onto another physical card.
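The expected behaviour can be illustrated with a small Python sketch. This is not HAMi's scheduler code: the `Card` type, the `free_slots` field, and `pick_card` are hypothetical names, and the per-card capacity used in the usage note is an assumption chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Card:
    index: int
    free_slots: int  # hypothetical: how many more smlu instances the card can host


def pick_card(cards, needed=1):
    """Return the index of a card that can still host `needed` instances.

    Exhausted cards are skipped instead of causing a failure. Among the
    remaining cards, the one with the fewest free slots is preferred
    (a binpack-style policy, assumed here). Returns None when no card fits.
    """
    candidates = [c for c in cards if c.free_slots >= needed]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c.free_slots).index
```

For example, with cards 0 to 6 each holding some free slots and card 7 at `free_slots=0`, `pick_card` returns one of cards 0 to 6 rather than failing on card 7.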


Labels: kind/bug
