
refactor(nvml-mock): consolidate NVLink bandwidth to bandwidth_per_link_mbps#405

Open
giuliocalzo wants to merge 2 commits into
NVIDIA:mainfrom
giuliocalzo:nvml-mock/consolidate-nvlink-bandwidth-mbps

Conversation

@giuliocalzo

@giuliocalzo giuliocalzo commented Jun 17, 2026

Contributor

Summary

NVLinkConfig exposed two overlapping bandwidth fields: bandwidth_per_link_gbps (whole GB/s) and bandwidth_per_link_mbps (MB/s despite the mbps suffix, since 53125 corresponds to 53.125 GB/s). The engine preferred _mbps and only fell back to _gbps, so the whole-GB/s field was dead weight: it cannot express NVLink5's 53.125 GB/s, and on gb200/gb300 it only added noise.

This consolidates to a single canonical bandwidth_per_link_mbps:

  • Remove BandwidthPerLinkGBPS from NVLinkConfig and simplify the bandwidth selection in topology.go.
  • Convert the _gbps-only profiles to _mbps and drop the redundant _gbps: 53 fallback line from gb200/gb300 (they already set _mbps: 53125).
  • Update both the Helm profiles/ and the pkg/gpu/mocknvml/configs/ mirror, plus docs/configuration.md.
  • Update unit tests and regenerate Helm snapshots.
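
As a rough illustration of the first bullet, the selection logic might shrink from a two-way switch to a single if. This is a sketch under stated assumptions, not the code in topology.go: the Go field name BandwidthPerLinkMBPS, both helper functions, and the 0 result for an unset field are hypothetical.

```go
package main

import "fmt"

// NVLinkConfig is a trimmed-down, hypothetical stand-in for the real struct.
// Only the bandwidth field is modeled, and its Go name is an assumption.
type NVLinkConfig struct {
	BandwidthPerLinkMBPS int // value of bandwidth_per_link_mbps
}

// legacyPerLinkMBps mirrors the old two-field behaviour described above:
// prefer the _mbps value, otherwise fall back to whole GB/s scaled by 1000.
// Returning 0 when neither is set is an assumption of this sketch.
func legacyPerLinkMBps(mbps, gbps int) int {
	switch {
	case mbps > 0:
		return mbps
	case gbps > 0:
		return gbps * 1000
	default:
		return 0
	}
}

// perLinkMBps is the single-field selection: one check, no fallback.
func perLinkMBps(cfg NVLinkConfig) int {
	if cfg.BandwidthPerLinkMBPS > 0 {
		return cfg.BandwidthPerLinkMBPS
	}
	return 0
}

func main() {
	// gb200/gb300 carried both fields; the _gbps: 53 line never took effect.
	fmt.Println(legacyPerLinkMBps(53125, 53))
	// On its own, whole GB/s could only yield 53000, never 53125.
	fmt.Println(legacyPerLinkMBps(0, 53))
	// With a single field there is nothing left to choose between.
	fmt.Println(perLinkMBps(NVLinkConfig{BandwidthPerLinkMBPS: 53125}))
}
```

The second call shows why the whole-GB/s field could not stay as the only source of truth: scaling 53 by 1000 loses the fractional 0.125 GB/s.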

Realistic per-link speeds

While consolidating, the a100/h100/b200 profiles were found to use the bidirectional-per-link marketing figure (50/50/100 GB/s) instead of the per-link unidirectional line rate that nvidia-smi nvlink -s actually reports (the semantics the field documents and that gb200/gb300 already follow). Corrected to:

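| Profile | NVLink gen | Links | bandwidth_per_link_mbps | Aggregate (per-link × 2 × links) |
|---------|------------|-------|-------------------------|----------------------------------|
| a100    | 3.0        | 12    | 25781 (25.781 GB/s)     | ~600 GB/s bidir                  |
| h100    | 4.0        | 18    | 26562 (26.562 GB/s)     | ~900 GB/s bidir                  |
| b200    | 5.0        | 18    | 53125 (53.125 GB/s)     | ~1.8 TB/s bidir                  |
| gb200   | 5.0        | 18    | 53125 (unchanged)       | ~1.8 TB/s bidir                  |
| gb300   | 5.0        | 18    | 53125 (unchanged)       | ~1.8 TB/s bidir                  |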

These match the documented nvidia-smi nvlink -s values (e.g. GV100/A100 25.781 GB/s, H100 26.562 GB/s). The docs example was bumped to the NVLink4 rate accordingly.
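
The aggregate column can be checked with a few lines of arithmetic. The sketch below recomputes per-link × 2 × links from the table values; the profile struct and the aggregateBidirMBps helper are hypothetical names for illustration and do not come from the repository.

```go
package main

import "fmt"

// profile holds only the values needed for the check.
// The type is hypothetical; the numbers are taken from the table above.
type profile struct {
	name        string
	links       int
	perLinkMBps int // bandwidth_per_link_mbps
}

// aggregateBidirMBps applies per-link x 2 x links: the unidirectional
// per-link rate, doubled for both directions, summed over all links.
func aggregateBidirMBps(p profile) int {
	return p.perLinkMBps * 2 * p.links
}

func main() {
	profiles := []profile{
		{"a100", 12, 25781},
		{"h100", 18, 26562},
		{"b200", 18, 53125},
	}
	for _, p := range profiles {
		gbps := float64(aggregateBidirMBps(p)) / 1000.0
		fmt.Printf("%-5s %9.3f GB/s bidir\n", p.name, gbps)
	}
}
```

The computed totals are about 619, 956, and 1912.5 GB/s, so they land slightly above the rounded marketing figures of 600 GB/s, 900 GB/s, and 1.8 TB/s rather than matching them exactly.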

Closes #404. Deferred from #387 to keep that PR focused.

Test plan

  • go build ./...
  • go test ./pkg/gpu/mocknvml/...
  • golangci-lint run (0 issues)
  • helm unittest deployments/nvml-mock/helm/nvml-mock

@giuliocalzo giuliocalzo marked this pull request as draft June 20, 2026 11:22
@giuliocalzo giuliocalzo force-pushed the nvml-mock/consolidate-nvlink-bandwidth-mbps branch from 7080c57 to b164364 Compare June 22, 2026 07:30
@giuliocalzo giuliocalzo marked this pull request as ready for review June 22, 2026 09:20

@ArangoGutierrez ArangoGutierrez left a comment

Collaborator

The NVLink consolidation itself looks clean — bandwidth math checks out (per-link x2 x links reproduces the bidir aggregates) and the switch-to-if in topology.go is correct now that the fallback is gone. One thing to sort before merge: commit 09caf03 (the SIGPIPE fix) is identical to all of #409 — same edits to the same three validate-*.sh files. Whichever lands second will conflict or no-op, and it's an unrelated e2e fix in an NVLink refactor. Could you drop 09caf03 here and let #409 carry it? Happy to approve once that's split out.

Comment thread tests/e2e/validate-iblinkinfo.sh
@ArangoGutierrez

Collaborator

#409 has been merged, so this PR now needs a rebase before it can merge.

@copy-pr-bot

copy-pr-bot Bot commented Jun 24, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@giuliocalzo giuliocalzo changed the base branch from main to arc June 24, 2026 06:04
@giuliocalzo giuliocalzo changed the base branch from arc to main June 24, 2026 06:04
@giuliocalzo giuliocalzo force-pushed the nvml-mock/consolidate-nvlink-bandwidth-mbps branch from b74c0a7 to 8670f2b Compare June 24, 2026 06:06
…nk_mbps

NVLinkConfig exposed two overlapping bandwidth fields
(bandwidth_per_link_gbps and bandwidth_per_link_mbps), where the engine
preferred _mbps and only fell back to _gbps. The whole-GB/s field could
not express NVLink5's 53.125 GB/s and was dead weight on gb200/gb300.

Collapse to a single canonical bandwidth_per_link_mbps:
- Remove BandwidthPerLinkGBPS from NVLinkConfig and simplify the engine
  bandwidth selection in topology.go.
- Convert the _gbps-only profiles to _mbps (a100/h100 50->50000,
  b200 100->100000) and drop the redundant _gbps fallback on gb200/gb300.
- Update docs/configuration.md and the configs/ mirror.
- Update unit tests and regenerate Helm snapshots.

Closes NVIDIA#404

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>

fix(nvml-mock): use realistic per-link NVLink speeds in profiles

The a100/h100/b200 profiles previously used the bidirectional-per-link
marketing figure (50/50/100 GB/s) rather than the per-link unidirectional
line rate that `nvidia-smi nvlink -s` actually reports. Align them with the
gb200/gb300 convention:

- a100 (NVLink3): 25781 Mbps (25.781 GB/s/link)
- h100 (NVLink4): 26562 Mbps (26.562 GB/s/link)
- b200 (NVLink5): 53125 Mbps (53.125 GB/s/link, same silicon as gb200)

Per-link x 2 x links still reproduces the marketed bidirectional aggregates
(~600 GB/s, ~900 GB/s, ~1.8 TB/s). Update the docs example to the NVLink4
rate and regenerate Helm snapshots.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
@giuliocalzo giuliocalzo force-pushed the nvml-mock/consolidate-nvlink-bandwidth-mbps branch from 8670f2b to 6d73669 Compare June 25, 2026 11:15
@giuliocalzo

Contributor Author

@ArangoGutierrez rebased onto the latest main (now includes #409 and #413). As requested, the duplicate SIGPIPE commit (09caf03) is dropped — #409 owns that fix, so this PR is back to just the two NVLink commits:

  • refactor(nvml-mock): consolidate NVLink bandwidth to bandwidth_per_link_mbps
  • fix(nvml-mock): use realistic per-link NVLink speeds in profiles

CI is fully green ✅ (all 30 checks pass, including the complete nvml-mock-e2e matrix). Could you re-review and merge when you have a moment? Thanks!

Development

Successfully merging this pull request may close these issues.

nvml-mock: consolidate NVLink bandwidth config to a single bandwidth_per_link_mbps field

2 participants