
Inaccurate GPU Memory Usage Reporting #952

Open
@zf930530

Description


What happened:
When executing nvidia-smi inside a Pod, the reported GPU memory usage is lower than the value shown when running the same command directly on the host.

For example, nvidia-smi inside the Pod reports GPU memory usage of 35345 MiB, while nvidia-smi on the host shows 36336 MiB for the corresponding process under Processes, a gap of 991 MiB.

This discrepancy makes the memory accounting inaccurate and can result in GPU memory overallocation: Pods launched later may fail to allocate sufficient memory and hit OOM (Out of Memory) errors.
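To reproduce the comparison, here is a minimal sketch (not part of the original report) that queries device-level and per-process GPU memory through NVML using the pynvml bindings. Running the same script on the host and inside the Pod makes the gap visible without parsing nvidia-smi output. The device index 0 and the availability of the pynvml package are assumptions; adjust to your setup.

```python
# Minimal sketch; assumes the `pynvml` package is installed and GPU index 0 is the shared device.
# Run the same script on the host and inside the Pod and compare the numbers NVML reports.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Device-level view: this is what the nvidia-smi "Memory-Usage" column summarizes.
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"device used:  {mem.used  / 1024**2:.0f} MiB")
    print(f"device total: {mem.total / 1024**2:.0f} MiB")

    # Per-process view: this is what the nvidia-smi "Processes" table shows on the host.
    for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        used = proc.usedGpuMemory  # may be None if the value is unavailable
        used_mib = "N/A" if used is None else f"{used / 1024**2:.0f} MiB"
        print(f"pid {proc.pid}: {used_mib}")
finally:
    pynvml.nvmlShutdown()
```

Comparing the per-process figure on the host with the usage reported inside the Pod corresponds to the 36336 MiB vs 35345 MiB difference described above.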

What you expected to happen:
The GPU memory usage reported inside the Pod should match the value reported on the host, so that memory limits are enforced accurately and Pods started later on a shared GPU are not left unable to allocate memory.

Environment:

  • HAMi version: v2.5.0 or latest
  • nvidia driver or other AI device driver version: 535/570 (on V100/A100 GPUs)

Labels: kind/bug