Description
I was running some benchmarks with torch-ucc, using xccl for collectives, and noticed very poor performance compared to NCCL. See the numbers here: https://gist.github.com/froody/a86a5b2c5d9f46aedba7e930f4b4e08d
It's possible this is due to a misconfiguration; I built xccl with CUDA and UCX support, but without SHARP or VMC support. My question is: is xccl expected to properly utilize NVLink when it is available (in this case, on a DGX-1 doing an all-reduce across all 8 GPUs)?
I also noticed while running the benchmarks that CPU utilization was very high for all workers, which seems to be due to high-frequency polling.
Also, as you can see in the output, ucc fails when trying to reduce a 2 GB tensor, whereas nccl only fails when trying to reduce an 8 GB tensor. This could be indicative of a memory leak somewhere.
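For context, the failing step is a large all_reduce inside a size sweep. A minimal sketch of what that loop looks like is below (the sizes, dtype, and timing logic are my assumptions, and it presumes an already-initialized process group; the exact script is linked under the repro steps):

```python
# Rough sketch of the size sweep, not the exact gist code.
import time
import torch
import torch.distributed as dist

def allreduce_sweep(device, start_power=10, max_power=34):
    # Double the message size each step, from ~1 KB up to several GB.
    for p in range(start_power, max_power):
        nbytes = 1 << p
        t = torch.ones(nbytes // 4, dtype=torch.float32, device=device)
        torch.cuda.synchronize()
        start = time.time()
        dist.all_reduce(t)  # ucc fails around the 2 GB message, nccl around 8 GB
        torch.cuda.synchronize()
        print(f"{nbytes} bytes: {time.time() - start:.6f} s")
```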
Repro steps:
Run the benchmark here: https://gist.github.com/froody/01ed6ce8d6ab72bd868431d793591379
Use BACKEND=ucc or BACKEND=nccl to select the backend.
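The backend selection presumably boils down to something like the following sketch (the init_method and LOCAL_RANK handling are assumptions; it relies on importing torch_ucc to register the "ucc" backend with torch.distributed):

```python
# Rough sketch of backend selection, not the exact gist code.
import os
import torch
import torch.distributed as dist

backend = os.environ.get("BACKEND", "nccl")
if backend == "ucc":
    import torch_ucc  # noqa: F401  (importing registers the "ucc" process group backend)

dist.init_process_group(backend=backend, init_method="env://")
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
```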
Hardware: DGX-1, driver version 418.116.00
CUDA: 10.1
PyTorch: 1.6.0
UCX: 1.9.0
torch-ucc: a277d7da24ae6e8a40bda658d0f0d4e06fcadb8b
xccl: 2e97986