refactor(nvml-mock): consolidate NVLink bandwidth to bandwidth_per_link_mbps by giuliocalzo · Pull Request #405 · NVIDIA/k8s-test-infra

giuliocalzo · 2026-06-17T15:50:02Z

Summary

NVLinkConfig exposed two overlapping bandwidth fields: bandwidth_per_link_gbps (whole GB/s) and bandwidth_per_link_mbps (MB/s, so 53125 means 53.125 GB/s). The engine preferred _mbps and only fell back to _gbps, so the whole-GB/s field was dead weight: it can't express NVLink5's 53.125 GB/s, and it only added noise on gb200/gb300.

This consolidates to a single canonical bandwidth_per_link_mbps:

- Remove BandwidthPerLinkGBPS from NVLinkConfig and simplify the bandwidth selection in topology.go.
- Convert the _gbps-only profiles to _mbps and drop the redundant _gbps: 53 fallback line from gb200/gb300 (they already set _mbps: 53125).
- Update both the Helm profiles/ and the pkg/gpu/mocknvml/configs/ mirror, plus docs/configuration.md.
- Update unit tests and regenerate Helm snapshots.
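
The selection change described above can be sketched roughly as follows. This is a minimal illustration, not the code in topology.go: only BandwidthPerLinkGBPS is named in this PR, so the BandwidthPerLinkMBPS field name, both struct shapes, the function names, and the default-value parameter are assumptions made for the example.

```go
package main

import "fmt"

// legacyNVLinkConfig is a hypothetical stand-in for the config before this PR.
// Only BandwidthPerLinkGBPS is named in the PR; the other field name is assumed.
type legacyNVLinkConfig struct {
	BandwidthPerLinkGBPS int // whole GB/s (removed by this PR)
	BandwidthPerLinkMBPS int // MB/s, e.g. 53125 for 53.125 GB/s
}

// nvLinkConfig is a hypothetical stand-in for the config after this PR.
type nvLinkConfig struct {
	BandwidthPerLinkMBPS int // the single canonical field
}

// legacyBandwidthMBps mirrors the described old behaviour:
// prefer _mbps, fall back to _gbps, else use a caller-supplied default.
func legacyBandwidthMBps(cfg legacyNVLinkConfig, def int) int {
	switch {
	case cfg.BandwidthPerLinkMBPS > 0:
		return cfg.BandwidthPerLinkMBPS
	case cfg.BandwidthPerLinkGBPS > 0:
		return cfg.BandwidthPerLinkGBPS * 1000
	default:
		return def
	}
}

// bandwidthMBps is the simplified form: with the fallback gone,
// the three-way switch collapses to a single if.
func bandwidthMBps(cfg nvLinkConfig, def int) int {
	if cfg.BandwidthPerLinkMBPS > 0 {
		return cfg.BandwidthPerLinkMBPS
	}
	return def
}

func main() {
	const def = 25000 // hypothetical default, for illustration only

	// A whole-GB/s value can only approximate NVLink5's 53.125 GB/s.
	fmt.Println(legacyBandwidthMBps(legacyNVLinkConfig{BandwidthPerLinkGBPS: 53}, def))
	// The MB/s field carries the exact rate.
	fmt.Println(bandwidthMBps(nvLinkConfig{BandwidthPerLinkMBPS: 53125}, def))
}
```

In this sketch the _gbps-only path yields 53000 where the _mbps field yields 53125, which is the precision gap the summary refers to.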

Realistic per-link speeds

While consolidating, the a100/h100/b200 profiles turned out to use the bidirectional per-link marketing figure (50/50/100 GB/s) rather than the unidirectional per-link line rate that nvidia-smi nvlink -s actually reports. The field is documented with the latter semantics, and gb200/gb300 already follow them. Corrected to:

| Profile | NVLink gen | Links | `bandwidth_per_link_mbps` | Aggregate (per-link × 2 × links) |
| --- | --- | --- | --- | --- |
| a100 | 3.0 | 12 | `25781` (25.781 GB/s) | ~600 GB/s bidir |
| h100 | 4.0 | 18 | `26562` (26.562 GB/s) | ~900 GB/s bidir |
| b200 | 5.0 | 18 | `53125` (53.125 GB/s) | ~1.8 TB/s bidir |
| gb200 | 5.0 | 18 | `53125` (unchanged) | ~1.8 TB/s bidir |
| gb300 | 5.0 | 18 | `53125` (unchanged) | ~1.8 TB/s bidir |

These match the documented nvidia-smi nvlink -s values (e.g. GV100/A100 25.781 GB/s, H100 26.562 GB/s). The docs example was bumped to the NVLink4 rate accordingly.
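
The aggregate column can be checked with a few lines of arithmetic. The sketch below is only an illustration of the per-link × 2 × links formula using the values from the table; the profile struct and function name are hypothetical and not taken from the repository.

```go
package main

import "fmt"

// profile holds just the two inputs the aggregate formula needs.
// This type is hypothetical and exists only for the example.
type profile struct {
	name        string
	links       int
	perLinkMBps int // unidirectional per-link rate in MB/s
}

// aggregateBidirMBps applies per-link × 2 (directions) × links.
func aggregateBidirMBps(p profile) int {
	return p.perLinkMBps * 2 * p.links
}

func main() {
	profiles := []profile{
		{"a100", 12, 25781},
		{"h100", 18, 26562},
		{"b200", 18, 53125},
	}
	for _, p := range profiles {
		agg := aggregateBidirMBps(p)
		fmt.Printf("%-5s %d MB/s (%.1f GB/s)\n", p.name, agg, float64(agg)/1000)
	}
}
```

The results are 618744, 956232, and 1912500 MB/s, a few percent above the rounded marketing aggregates of ~600 GB/s, ~900 GB/s, and ~1.8 TB/s.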

Closes #404. Deferred from #387 to keep that PR focused.

Test plan

- `go build ./...`
- `go test ./pkg/gpu/mocknvml/...`
- `golangci-lint run` (0 issues)
- `helm unittest deployments/nvml-mock/helm/nvml-mock`

ArangoGutierrez

The NVLink consolidation itself looks clean — bandwidth math checks out (per-link x2 x links reproduces the bidir aggregates) and the switch-to-if in topology.go is correct now that the fallback is gone. One thing to sort before merge: commit 09caf03 (the SIGPIPE fix) is identical to all of #409 — same edits to the same three validate-*.sh files. Whichever lands second will conflict or no-op, and it's an unrelated e2e fix in an NVLink refactor. Could you drop 09caf03 here and let #409 carry it? Happy to approve once that's split out.

ArangoGutierrez · 2026-06-24T05:56:12Z

#409 has been merged, so this one needs a rebase before merge.

copy-pr-bot · 2026-06-24T06:03:57Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

…nk_mbps

NVLinkConfig exposed two overlapping bandwidth fields (bandwidth_per_link_gbps and bandwidth_per_link_mbps), where the engine preferred _mbps and only fell back to _gbps. The whole-GB/s field could not express NVLink5's 53.125 GB/s and was dead weight on gb200/gb300.

Collapse to a single canonical bandwidth_per_link_mbps:

- Remove BandwidthPerLinkGBPS from NVLinkConfig and simplify the engine bandwidth selection in topology.go.
- Convert the _gbps-only profiles to _mbps (a100/h100 50->50000, b200 100->100000) and drop the redundant _gbps fallback on gb200/gb300.
- Update docs/configuration.md and the configs/ mirror.
- Update unit tests and regenerate Helm snapshots.

Closes NVIDIA#404

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

The a100/h100/b200 profiles previously used the bidirectional-per-link marketing figure (50/50/100 GB/s) rather than the per-link unidirectional line rate that `nvidia-smi nvlink -s` actually reports. Align them with the gb200/gb300 convention:

- a100 (NVLink3): 25781 Mbps (25.781 GB/s/link)
- h100 (NVLink4): 26562 Mbps (26.562 GB/s/link)
- b200 (NVLink5): 53125 Mbps (53.125 GB/s/link, same silicon as gb200)

Per-link x 2 x links still reproduces the marketed bidirectional aggregates (~600 GB/s, ~900 GB/s, ~1.8 TB/s).

Update the docs example to the NVLink4 rate and regenerate Helm snapshots.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

giuliocalzo · 2026-06-25T11:31:32Z

@ArangoGutierrez rebased onto the latest main (now includes #409 and #413). As requested, the duplicate SIGPIPE commit (09caf03) is dropped — #409 owns that fix, so this PR is back to just the two NVLink commits:

- refactor(nvml-mock): consolidate NVLink bandwidth to bandwidth_per_link_mbps
- fix(nvml-mock): use realistic per-link NVLink speeds in profiles

CI is fully green ✅ (all 30 checks pass, including the complete nvml-mock-e2e matrix). Could you re-review and merge when you have a moment? Thanks!

giuliocalzo requested a review from ArangoGutierrez as a code owner June 17, 2026 15:50

giuliocalzo marked this pull request as draft June 20, 2026 11:22

giuliocalzo force-pushed the nvml-mock/consolidate-nvlink-bandwidth-mbps branch from 7080c57 to b164364 Compare June 22, 2026 07:30

giuliocalzo marked this pull request as ready for review June 22, 2026 09:20

ArangoGutierrez mentioned this pull request Jun 22, 2026

mocknvml: implement per-process GetProcessUtilization + per-device processes override #381

Open

ArangoGutierrez requested changes Jun 22, 2026

Comment thread tests/e2e/validate-iblinkinfo.sh

ArangoGutierrez mentioned this pull request Jun 22, 2026

fix(e2e): avoid SIGPIPE false-negative in IB validation scripts #409

Merged

2 tasks

giuliocalzo changed the base branch from main to arc June 24, 2026 06:04

giuliocalzo changed the base branch from arc to main June 24, 2026 06:04

giuliocalzo force-pushed the nvml-mock/consolidate-nvlink-bandwidth-mbps branch from b74c0a7 to 8670f2b Compare June 24, 2026 06:06

giuliocalzo added 2 commits June 25, 2026 13:09

giuliocalzo force-pushed the nvml-mock/consolidate-nvlink-bandwidth-mbps branch from 8670f2b to 6d73669 Compare June 25, 2026 11:15

giuliocalzo requested a review from ArangoGutierrez June 25, 2026 11:30

giuliocalzo wants to merge 2 commits into NVIDIA:main from giuliocalzo:nvml-mock/consolidate-nvlink-bandwidth-mbps
