Commit 6832c84

feat(gpu): add gpu.nccl.* metrics to metadata.csv (#23962)
Adds 5 new GPU NCCL metrics emitted by the Datadog Agent NCCL check (pkg/collector/corechecks/gpu/nccl) under the gpu.nccl.* namespace (migrated from the legacy nccl.collective.* prefix):

- gpu.nccl.collective.algo_bandwidth_gbps - GB/s algorithmic bandwidth per rank
- gpu.nccl.collective.bus_bandwidth_gbps - GB/s bus bandwidth per rank
- gpu.nccl.collective.exec_time_us - µs execution time per rank
- gpu.nccl.collective.msg_size_bytes - bytes message size per rank
- gpu.nccl.rank.seconds_since_last_event - seconds since last event (hang detection)

Inserted alphabetically between gpu.memory.temperature and gpu.nvlink.ber.effective.
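The bandwidth, execution-time, and message-size metrics are related quantities. The sketch below shows one common way they can be tied together, following the convention documented for the nccl-tests benchmarks (algorithmic bandwidth is size divided by time, and all-reduce bus bandwidth scales that by 2(n-1)/n). Whether the Agent check computes its values this way is an assumption; the function names and the all-reduce-only scope are hypothetical and chosen for illustration.

```python
def algo_bandwidth_gbps(msg_size_bytes: float, exec_time_us: float) -> float:
    """Algorithmic bandwidth in GB/s (1 GB = 1e9 bytes).

    bytes per microsecond equals MB/s, so dividing by 1e3 gives GB/s.
    """
    if exec_time_us <= 0:
        raise ValueError("exec_time_us must be positive")
    return msg_size_bytes / exec_time_us / 1e3


def allreduce_bus_bandwidth_gbps(algo_gbps: float, n_ranks: int) -> float:
    """Bus bandwidth for an all-reduce, using the nccl-tests factor 2(n-1)/n.

    Other collectives use different factors; this helper covers all-reduce only.
    """
    if n_ranks < 2:
        raise ValueError("n_ranks must be at least 2")
    return algo_gbps * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    algo = algo_bandwidth_gbps(msg_size_bytes=1_000_000_000, exec_time_us=20_000)
    bus = allreduce_bus_bandwidth_gbps(algo, n_ranks=8)
    print(f"algo={algo:.1f} GB/s bus={bus:.1f} GB/s")
```

Bus bandwidth is the figure usually compared against hardware link speed, because the correction factor removes the dependence on the number of ranks.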
1 parent 28830fd commit 6832c84

1 file changed

Lines changed: 5 additions & 0 deletions

gpu/metadata.csv

@@ -43,6 +43,11 @@ gpu.memory.free,gauge,16,byte,,Unallocated device memory (in bytes).,0,gpu,memor
 gpu.memory.limit,gauge,16,byte,,"Total device memory (framebuffer). This is always the device-level memory limit; the `pid` and `container_id` tags are present to enable per-process and per-container utilization formulas, but the value itself does not change.",0,gpu,memory.limit,,
 gpu.memory.reserved,gauge,16,byte,,Device memory (in bytes) reserved for system use (driver or firmware).,0,gpu,memory.reserved,,
 gpu.memory.temperature,gauge,16,degree celsius,,Temperature of the memory chip,0,gpu,memory.temperature,,
+gpu.nccl.collective.algo_bandwidth_gbps,gauge,16,gigabyte,second,Algorithmic bandwidth of a collective operation per rank,0,gpu,nccl.collective.algo_bandwidth_gbps,,
+gpu.nccl.collective.bus_bandwidth_gbps,gauge,16,gigabyte,second,Bus bandwidth of a collective operation per rank,0,gpu,nccl.collective.bus_bandwidth_gbps,,
+gpu.nccl.collective.exec_time_us,gauge,16,microsecond,,Execution time of a collective operation per rank,0,gpu,nccl.collective.exec_time_us,,
+gpu.nccl.collective.msg_size_bytes,gauge,16,byte,,Message size of a collective operation per rank,0,gpu,nccl.collective.msg_size_bytes,,
+gpu.nccl.rank.seconds_since_last_event,gauge,16,second,,Seconds since the last NCCL event was received for a rank. Used for hang detection.,0,gpu,nccl.rank.seconds_since_last_event,,
 gpu.nvlink.ber.effective,gauge,16,,,NVLink effective error counter total for all links (errors not corrected by FEC/recovery mechanisms).,0,gpu,nvlink.ber.effective,,
 gpu.nvlink.ber.symbol,gauge,16,,,Symbol bit error rate for all NVLINK links,0,gpu,nvlink.ber.symbol,,
 gpu.nvlink.count.active,gauge,16,,,Number of active nvlinks for the device,0,gpu,nvlink.count.active,,
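
The rows in this hunk follow two conventions that are easy to break by hand: they are sorted by metric name, and the short name is the metric name without the integration prefix. A minimal sketch of a check for both follows. The column names are an assumption, since the header row is not part of this diff, and `parse_rows` and `find_problems` are hypothetical helpers rather than part of any Datadog tooling.

```python
import csv
import io

# Assumed header: the diff does not show the header row of metadata.csv.
COLUMNS = [
    "metric_name", "metric_type", "interval", "unit_name", "per_unit_name",
    "description", "orientation", "integration", "short_name",
    "curated_metric", "sample_tags",
]

# Rows copied from the hunk above (one context row on each side of the additions).
ROWS = """\
gpu.memory.temperature,gauge,16,degree celsius,,Temperature of the memory chip,0,gpu,memory.temperature,,
gpu.nccl.collective.algo_bandwidth_gbps,gauge,16,gigabyte,second,Algorithmic bandwidth of a collective operation per rank,0,gpu,nccl.collective.algo_bandwidth_gbps,,
gpu.nccl.collective.bus_bandwidth_gbps,gauge,16,gigabyte,second,Bus bandwidth of a collective operation per rank,0,gpu,nccl.collective.bus_bandwidth_gbps,,
gpu.nccl.collective.exec_time_us,gauge,16,microsecond,,Execution time of a collective operation per rank,0,gpu,nccl.collective.exec_time_us,,
gpu.nccl.collective.msg_size_bytes,gauge,16,byte,,Message size of a collective operation per rank,0,gpu,nccl.collective.msg_size_bytes,,
gpu.nccl.rank.seconds_since_last_event,gauge,16,second,,Seconds since the last NCCL event was received for a rank. Used for hang detection.,0,gpu,nccl.rank.seconds_since_last_event,,
gpu.nvlink.ber.effective,gauge,16,,,NVLink effective error counter total for all links (errors not corrected by FEC/recovery mechanisms).,0,gpu,nvlink.ber.effective,,
"""


def parse_rows(text):
    """Parse CSV text into dicts, rejecting rows with an unexpected field count."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}: {fields[:1]}")
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def find_problems(rows):
    """Return a list of human-readable problems; an empty list means the rows pass."""
    problems = []
    names = [row["metric_name"] for row in rows]
    if names != sorted(names):
        problems.append("metric names are not in alphabetical order")
    for row in rows:
        prefix = row["integration"] + "."
        name = row["metric_name"]
        expected = name[len(prefix):] if name.startswith(prefix) else name
        if row["short_name"] != expected:
            problems.append(f"{name}: short_name should be {expected}")
    return problems


if __name__ == "__main__":
    print(find_problems(parse_rows(ROWS)) or "ok")
```

Using the `csv` module rather than splitting on commas matters here because some descriptions in the file are quoted and contain commas.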
