
Topological alignment between GPUs and NICs in DRA (exposing pci device topology as device attribute?) #213

Open
@everpeace


I understand DRA will finally be promoted to beta in v1.32 🎉 Thank you very much to all the contributors for your hard work on standardizing flexible device scheduling and implementing NVIDIA's DRA driver.

Do you plan to expose intra-node topology as device attributes? In particular, the distances between GPU<->GPU and GPU<->NIC or HCA (I imagine information equivalent to nvidia-smi topo -m)? Or would you consider providing an extension point for adding user-defined device attributes in this DRA driver?

I imagine the following use cases for optimizing training performance:

  • Single Node Multi GPUs:
    • a user wants 1 pod with 2 GPUs that are connected to each other via NVLink (NV# in nvidia-smi topo -m)
      → discussed in NVLINK Aware Scheduling #214
  • Multi Node Multi GPUs:
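
As a rough illustration of the single-node use case above and of the GPU<->NIC distance question, the sketch below models the kind of information nvidia-smi topo -m reports and selects devices from it. Everything in it is an assumption made for illustration: the TOPOLOGY values are made up, the helper names (link, nvlink_pairs, closest_gpu) are hypothetical, and none of this is an existing DRA or driver API.

```python
from itertools import combinations

# Made-up link labels between devices on one node, using the legend of
# `nvidia-smi topo -m`: NV# = NVLink, PIX/PXB/PHB/NODE/SYS = PCIe/NUMA paths.
TOPOLOGY = {
    ("GPU0", "GPU1"): "NV2",
    ("GPU0", "GPU2"): "SYS",
    ("GPU0", "GPU3"): "SYS",
    ("GPU1", "GPU2"): "SYS",
    ("GPU1", "GPU3"): "SYS",
    ("GPU2", "GPU3"): "NV2",
    ("GPU0", "NIC0"): "PIX",
    ("GPU1", "NIC0"): "PXB",
    ("GPU2", "NIC0"): "SYS",
    ("GPU3", "NIC0"): "SYS",
}

# Assumed ordering of non-NVLink labels, closest first.
PCIE_RANK = {"PIX": 0, "PXB": 1, "PHB": 2, "NODE": 3, "SYS": 4}

GPUS = ["GPU0", "GPU1", "GPU2", "GPU3"]


def link(a, b):
    """Return the link label between two devices, regardless of order."""
    return TOPOLOGY.get((a, b)) or TOPOLOGY.get((b, a))


def nvlink_pairs(gpus):
    """Return all GPU pairs whose link label starts with 'NV' (NVLink)."""
    return [
        (a, b)
        for a, b in combinations(gpus, 2)
        if (link(a, b) or "").startswith("NV")
    ]


def closest_gpu(nic, gpus):
    """Return the GPU with the lowest-ranked (closest) path to the NIC."""
    return min(gpus, key=lambda g: PCIE_RANK[link(g, nic)])


print(nvlink_pairs(GPUS))
print(closest_gpu("NIC0", GPUS))
```

If the driver published such link labels as device attributes, a scheduler-side constraint could in principle express the same selection declaratively instead of in code like this.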

Thanks in advance.
