I understand DRA will finally promote to Beta in v1.32 🎉 Thank you very much, contributors, for your hard work standardizing flexible device scheduling and implementing NVIDIA's dra-driver.
Do you have a plan to expose intra-node topology as device attributes? Especially distances between GPU<->GPU and GPU<->NIC or HCA (I imagine the equivalent of the `nvidia-smi topo -m` information)? Or, do you have a plan to provide some extension point for adding user-defined device attributes in this dra-driver?
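For illustration, here is a rough sketch of how I imagine such topology information could appear in a ResourceSlice published by the driver. The attribute names (`nvlinkPeers`, `nearestNic`) and values are purely hypothetical, invented by me for this example; only the ResourceSlice/attribute mechanism itself exists in DRA:

```yaml
# Hypothetical sketch only: the topology attributes below are invented for
# illustration and are not something the NVIDIA dra-driver publishes today.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
spec:
  driver: gpu.nvidia.com
  nodeName: node-a
  pool:
    name: node-a
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    basic:
      attributes:
        uuid:
          string: "GPU-xxxxxxxx"
        # Hypothetical: GPUs reachable via NVLink (NV# in `nvidia-smi topo -m`)
        nvlinkPeers:
          string: "gpu-1,gpu-2,gpu-3"
        # Hypothetical: NIC/HCA behind the same PCIe switch (PIX in `nvidia-smi topo -m`)
        nearestNic:
          string: "mlx5_0"
```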
I imagine the use cases below for optimizing training performance:
- Single Node Multi GPUs: a user wants to have 1 pod with 2 GPUs that are connected to each other via NVLink (`NV#` in `nvidia-smi topo -m`)
  → discussed in NVLINK Aware Scheduling #214
- Multi Node Multi GPUs: a user wants to have N pods with 4 GPUs each, where every GPU has an adjacent NIC or HCA (`PIX` in `nvidia-smi topo -m`), in a specific zone (achieved by a node selector); see the claim sketch after this list
  - probably, this needs integration with CNI and network device plugins (e.g. https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin)
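For the multi-node case, I imagine a user expressing the constraint with a CEL selector in a ResourceClaim, something like the following. Again, the `nearestNic` attribute and the `"mlx5_0"` value are purely hypothetical; only the ResourceClaim/CEL selector mechanism itself exists in DRA:

```yaml
# Hypothetical sketch: the topology attribute referenced here does not exist yet;
# this only shows how a user could consume such an attribute via a CEL selector.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: gpus-near-nic
spec:
  devices:
    requests:
    - name: gpus
      deviceClassName: gpu.nvidia.com
      allocationMode: ExactCount
      count: 4
      selectors:
      - cel:
          # Select only GPUs that report an adjacent NIC/HCA (hypothetical attribute)
          expression: device.attributes["gpu.nvidia.com"].nearestNic == "mlx5_0"
```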
Thanks in advance.