This file provides guidance to Codex, Cursor, Copilot, and other coding agents when working with code in this repository.
Topograph discovers the physical network topology of a cluster (NVLink domains, InfiniBand/Ethernet switch fabric, cloud rack topology) and exposes it to workload schedulers — Slurm, Kubernetes, and Slurm-on-Kubernetes (Slinky). It has five runtime components:
- API Server: receives `/v1/generate` requests, aggregates bursts, dispatches to a Provider
- Node Observer: Kubernetes-only; watches node status changes and triggers regeneration
- Node Data Broker: Kubernetes-only DaemonSet; collects per-node attributes (NVLink clique IDs, etc.) as node annotations
- Provider: per-environment adapter that queries a topology source (CSP API, NetQ, `ibnetdiscover`, DRA labels) and returns a canonical representation
- Engine: per-scheduler translator that writes the canonical representation out as `topology.conf`, Kubernetes node labels, or a Slinky ConfigMap
Providers differ by environment. The canonical topology.Graph is stable. Engines only translate — they do not discover.
This separation is load-bearing. If you find yourself reading the fabric in an engine, or emitting scheduler-specific output from a provider, stop and reconsider.
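As a rough illustration of that boundary, the sketch below wires a provider to an engine through a shared graph type. Every name in it (`Graph`, `Provider`, `Engine`, `staticProvider`, `switchListEngine`, `Generate`) is a simplified, hypothetical stand-in rather than the real types in `pkg/topology`, `pkg/providers`, or `pkg/engines`, and the output format is only loosely modeled on `topology.conf`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Graph is a hypothetical, simplified stand-in for the canonical topology.Graph:
// it maps each leaf switch to the compute nodes attached to it.
type Graph struct {
	Leaves map[string][]string
}

// Provider discovers topology and knows nothing about schedulers.
type Provider interface {
	Discover() (*Graph, error)
}

// Engine translates a Graph and never queries the fabric.
type Engine interface {
	Translate(g *Graph) string
}

// staticProvider returns a fixed graph, standing in for a CSP API or fabric query.
type staticProvider struct{}

func (staticProvider) Discover() (*Graph, error) {
	return &Graph{Leaves: map[string][]string{
		"leaf1": {"node1", "node2"},
		"leaf2": {"node3"},
	}}, nil
}

// switchListEngine renders one line per switch (illustrative format only).
type switchListEngine struct{}

func (switchListEngine) Translate(g *Graph) string {
	names := make([]string, 0, len(g.Leaves))
	for name := range g.Leaves {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random; sort for stable output
	var b strings.Builder
	for _, name := range names {
		fmt.Fprintf(&b, "SwitchName=%s Nodes=%s\n", name, strings.Join(g.Leaves[name], ","))
	}
	return b.String()
}

// Generate connects the two halves; the Graph is the only thing they share.
func Generate(p Provider, e Engine) (string, error) {
	g, err := p.Discover()
	if err != nil {
		return "", err
	}
	return e.Translate(g), nil
}

func main() {
	out, err := Generate(staticProvider{}, switchListEngine{})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Because the engine only sees a `*Graph`, swapping `staticProvider` for another provider requires no engine change, which is the property the invariant protects.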
cmd/ # Four entry points: topograph, node-observer, node-data-broker-initc
pkg/
providers/ # One directory per provider: aws, gcp, oci, nebius, netq, dra, infiniband, lambdai, cw, test
engines/ # One directory per engine: k8s, slinky, slurm
topology/ # Canonical Graph, Vertex tree, and topology constants (DO NOT CHANGE CASUALLY)
registry/ # Central NamedLoader wiring for providers + engines
translate/ # topology.conf and block/tree generation shared by engines
server/ # HTTP server and request aggregator
node_observer/ # Kubernetes Node watcher
ib/ # InfiniBand fabric discovery helpers
config/ # Config file parser
metrics/ # Prometheus metrics
models/ # Go types and loader for YAML simulation models (the YAML files live in tests/models/)
test/ # Cross-package test helpers
internal/ # Shared utilities not part of the public API
cluset, component, config, exec, files, httperr, httpreq, k8s, version
charts/topograph/ # Helm chart (with node-data-broker subchart)
docs/ # Public-facing docs — overview.md, architecture.md, api.md + providers/, engines/, reference/ subdirectories
tests/models/ # YAML simulation fixtures
tests/charts/ # Helm golden outputs for chart values fixtures
config/ # Sample topograph-config.yaml
scripts/ # Build scripts (deb, rpm, SSL, clean)
localdev/ # Developer-local workspace — not tracked; personal scratch files
These structures propagate across every provider and engine. Changing them in a single PR usually means the PR is too broad.
| Surface | Why it's load-bearing |
|---|---|
| `pkg/topology/`: `Graph`, the `Vertex` tree, and topology constants | Every provider returns it; every engine consumes it. A shape change ripples to all of them. |
| Helm `global.provider.name` / `global.engine.name` / `topologyNodeLabels` | External contract for operators deploying Topograph. |
| The four default label keys `network.topology.nvidia.com/{accelerator,leaf,spine,core}` | Consumed by downstream projects (KAI Scheduler, NVSentinel, Kueue). |
- Go 1.25.9 (see `go.mod`); newer minor versions are fine, older versions will not build
- make
- golangci-lint: `brew install golangci-lint` or via `go install`
- docker: only for container image builds and the IB variant
git clone https://github.com/NVIDIA/topograph.git
cd topograph
make build # produces bin/topograph, bin/node-observer, bin/node-data-broker-initc

Cross-compile with `make build-linux-amd64`, `make build-darwin-arm64`, etc.
make qualify # runs fmt, vet, lint, and test in sequence — pre-push aggregator
make fmt # go fmt ./...
make vet # go vet ./...
make lint # golangci-lint run (only flags new issues vs. main)
make test # go test -race -coverprofile=coverage.out ./...
make chart-test # helm chart smoke + golden tests (see scripts/chart-test.sh)
make chart-test-update-golden # refresh tests/charts/*.golden.yaml (review before commit)
make coverage # human-readable per-package summary

Run `make qualify` before pushing. The individual targets are available if you want to run a single check during iteration. Run `make chart-test` when you change `charts/topograph/` or its subcharts; CI runs it on every workflow trigger.
From codecov.yml:
- Project coverage: 60% target, 5% threshold for drops
- Patch coverage: 50% target, 5% threshold
Coverage checks run on pull requests. A drop below target with no matching uplift in the touched files will fail the Codecov check.
- `.github/workflows/go.yml`: build, test, lint, and Helm chart tests (`make chart-test`) on every push and PR
- `.github/workflows/docker.yml`: container image build (manual trigger)
- `.github/workflows/docker-ib.yml`: InfiniBand-variant container (manual trigger)
- `.github/workflows/helm-release.yaml`: Helm chart release (manual trigger)
- Binaries: `deb` and `rpm` packages via `make deb` / `make rpm` (consumed by Slurm users)
- Container images: `ghcr.io/nvidia/topograph` (consumed by Kubernetes users)
- Helm chart: `charts/topograph/` (with `node-data-broker` subchart)
- `go fmt ./...` is authoritative; do not hand-format
- `golangci-lint` runs in CI with `--new-from-rev` so only new issues block; fix warnings in code you touch
- Copyright header on every new Go file: `Copyright (c) <year>, NVIDIA CORPORATION. All rights reserved.` followed by the Apache 2.0 boilerplate matching existing files
The contract lives in pkg/providers/providers.go:
type Provider interface {
GenerateTopologyConfig(
ctx context.Context,
pageSize *int,
instances []topology.ComputeInstances,
) (*topology.Graph, *httperr.Error)
}

A provider returns a `*topology.Graph` of the discovered topology. `Tiers` is the root of the switch hierarchy; `Domains` is a `topology.DomainMap` mapping accelerator/block domains to hosts, with each finalized domain carrying the enumerated ID used by block-topology output. Leaf vertices are compute nodes; interior tier vertices are switches. Return `*httperr.Error` so the API server can propagate the correct HTTP status code; a plain `error` is not acceptable at this boundary.
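To make the typed-error requirement concrete, here is a minimal sketch of a provider-style function that maps upstream failures to status-carrying errors. `HTTPError`, `Graph`, `queryFabric`, and `GenerateTopology` are hypothetical stand-ins defined locally so the example compiles on its own; the real `httperr.Error` and `topology.Graph` may have different fields and constructors, and the status codes chosen here are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// HTTPError is a hypothetical stand-in for httperr.Error:
// an error value that also carries an HTTP status code.
type HTTPError struct {
	Code    int
	Message string
}

func (e *HTTPError) Error() string {
	return fmt.Sprintf("%d %s: %s", e.Code, http.StatusText(e.Code), e.Message)
}

// Graph is a simplified stand-in for topology.Graph.
type Graph struct{ Nodes []string }

// errUnauthorized models a credential failure from the topology source.
var errUnauthorized = errors.New("credentials rejected")

// queryFabric is a hypothetical upstream call that returns a plain error.
func queryFabric(instances []string) ([]string, error) {
	for _, name := range instances {
		if name == "forbidden" {
			return nil, errUnauthorized
		}
	}
	return instances, nil
}

// GenerateTopology converts plain errors into typed ones at the boundary,
// so the caller can pick the HTTP status without parsing messages.
func GenerateTopology(instances []string) (*Graph, *HTTPError) {
	if len(instances) == 0 {
		return nil, &HTTPError{Code: http.StatusBadRequest, Message: "no compute instances supplied"}
	}
	nodes, err := queryFabric(instances)
	switch {
	case errors.Is(err, errUnauthorized):
		return nil, &HTTPError{Code: http.StatusUnauthorized, Message: err.Error()}
	case err != nil:
		return nil, &HTTPError{Code: http.StatusBadGateway, Message: err.Error()}
	}
	return &Graph{Nodes: nodes}, nil
}

func main() {
	if _, herr := GenerateTopology(nil); herr != nil {
		fmt.Println(herr)
	}
	g, herr := GenerateTopology([]string{"node1", "node2"})
	fmt.Println(g.Nodes, herr == nil)
}
```

Returning the concrete pointer type rather than the `error` interface also sidesteps the Go pitfall where a nil pointer wrapped in an interface compares as non-nil.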
- Create `pkg/providers/<name>/` with at minimum `provider.go` and `provider_test.go`
- Expose a `NamedLoader` function with signature `func NamedLoader() (string, providers.Loader)`; this is how the registry wires the provider
- Register in `pkg/registry/registry.go` by adding `<name>.NamedLoader` to the `providers.NewRegistry(...)` call list
- Add `docs/providers/<name>.md` following the shape of `aws.md` / `netq.md` (prerequisites, credentials, parameters, how it works, verification)
- Update `docs/overview.md`: add the provider to the "Currently supported providers" list and the "Choosing a Provider" scenario table
- If the provider has a simulated variant for testing, export a second `NamedLoaderSim` and register it alongside (see `aws`, `gcp`, `oci`, `lambdai`)
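The registration steps above can be sketched as a small name-to-loader registry. This is an assumption-laden illustration: only the `func NamedLoader() (string, providers.Loader)` shape and the `NewRegistry(...)` call list come from this document, while the `Provider` and `Loader` definitions, the `Registry` type, its `Get` and `Names` methods, and the `foo` provider are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Provider and Loader are hypothetical, simplified stand-ins for the
// types in pkg/providers; the real Loader signature may differ.
type Provider interface{ Name() string }
type Loader func(params map[string]any) (Provider, error)

// NamedLoader pairs a provider name with its Loader.
type NamedLoader func() (string, Loader)

// Registry maps provider names to loaders.
type Registry map[string]Loader

// NewRegistry calls each NamedLoader once and indexes the result by name.
func NewRegistry(loaders ...NamedLoader) Registry {
	r := make(Registry, len(loaders))
	for _, nl := range loaders {
		name, loader := nl()
		r[name] = loader
	}
	return r
}

// Get returns the loader for name, or an error for an unregistered provider.
func (r Registry) Get(name string) (Loader, error) {
	loader, ok := r[name]
	if !ok {
		return nil, fmt.Errorf("unsupported provider %q", name)
	}
	return loader, nil
}

// Names lists registered providers in sorted order.
func (r Registry) Names() []string {
	names := make([]string, 0, len(r))
	for name := range r {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// fooProvider is a hypothetical provider used only for this example.
type fooProvider struct{}

func (fooProvider) Name() string { return "foo" }

// FooNamedLoader plays the role of <name>.NamedLoader.
func FooNamedLoader() (string, Loader) {
	return "foo", func(map[string]any) (Provider, error) { return fooProvider{}, nil }
}

func main() {
	reg := NewRegistry(FooNamedLoader)
	fmt.Println(reg.Names())
	if _, err := reg.Get("bar"); err != nil {
		fmt.Println(err) // an unregistered provider cannot be loaded
	}
}
```

The failed lookup for `bar` mirrors what happens when a provider directory exists but was never added to the registry call list.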
Engines are much rarer (three exist: slurm, k8s, slinky). Follow the same registry pattern but register in engines.NewRegistry(...). Coordinate with maintainers before starting — adding an engine implies a new output format that every provider's output must be translatable into.
| Don't | Because |
|---|---|
| Read the fabric inside an engine | Engines only translate; discovery belongs in providers |
| Emit scheduler-specific output from a provider | Same invariant in reverse |
| Change pkg/topology/Vertex fields without discussion | Every provider and engine depends on the shape |
| Add a new provider in `pkg/providers/<name>/` without also updating `pkg/registry/registry.go` | Orphaned code; provider will not be loadable |
| Modify an AGENTS.md-described surface (new Makefile target, top-level directory, chart template, invariant) without updating `AGENTS.md` + `.claude/CLAUDE.md` in the same PR | Drift between the code and its agent-facing description; the next contributor / agent reads stale guidance |
| Skip DCO sign-off to "fix later" | The DCO bot will block the PR; rebase with --signoff is always available |
| Use plain `error` at the provider interface boundary | Must be `*httperr.Error` so the API server returns the correct HTTP status |
| Enable both `ingress.enabled` and `gatewayAPI.enabled` in the same Helm release | Mutually exclusive; deploying both routing resources against the same Service is almost always a misconfiguration. Enforced by `charts/topograph/templates/_validation.tpl`. |
| Add implementation-specific annotations, CRDs, or extensions to `charts/topograph/templates/httproute.yaml` | The default HTTPRoute must use only standard `gateway.networking.k8s.io/v1` fields so it renders and functions against any conformant Gateway API implementation. Implementation-specific examples (kgateway TrafficPolicy, etc.) belong in `values.k8s.gateway-api-example.yaml` as separate attached resources, not in the chart's default template. |
Label keys written by the Kubernetes and Slinky engines are documented in docs/reference/node-labels.md. Do not invent new keys in provider or engine code — values flow through the canonical graph; keys are configured via Helm topologyNodeLabels.
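A minimal sketch of that split between configured keys and graph-derived values follows. The four default keys are the ones named in this document; the `LabelKeys` and `NodePlacement` types, the `NodeLabels` helper, and the sample values are hypothetical and do not reflect how the engines are actually structured.

```go
package main

import "fmt"

// LabelKeys holds the configurable label key for each topology layer.
type LabelKeys struct {
	Accelerator, Leaf, Spine, Core string
}

// DefaultLabelKeys returns the four default keys listed in this document.
func DefaultLabelKeys() LabelKeys {
	const prefix = "network.topology.nvidia.com/"
	return LabelKeys{
		Accelerator: prefix + "accelerator",
		Leaf:        prefix + "leaf",
		Spine:       prefix + "spine",
		Core:        prefix + "core",
	}
}

// NodePlacement is hypothetical: the per-layer values for one node,
// as they would be read from the canonical graph.
type NodePlacement struct {
	Accelerator, Leaf, Spine, Core string
}

// NodeLabels pairs configured keys with discovered values.
// Layers with no key or no value are skipped rather than labeled empty.
func NodeLabels(keys LabelKeys, p NodePlacement) map[string]string {
	labels := map[string]string{}
	set := func(key, value string) {
		if key != "" && value != "" {
			labels[key] = value
		}
	}
	set(keys.Accelerator, p.Accelerator)
	set(keys.Leaf, p.Leaf)
	set(keys.Spine, p.Spine)
	set(keys.Core, p.Core)
	return labels
}

func main() {
	placement := NodePlacement{Accelerator: "nvl-1", Leaf: "sw-leaf-1", Spine: "sw-spine-1"}
	for key, value := range NodeLabels(DefaultLabelKeys(), placement) {
		fmt.Printf("%s=%s\n", key, value)
	}
}
```

Keeping the keys in one configurable struct means an operator override changes the key set in a single place, while provider code only ever supplies values.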
Use a prefix that matches the change type: feat/, fix/, docs/, chore/, refactor/, test/. Example: docs/agents-md, feat/crusoe-provider.
Conventional Commits format:
type(scope): short description
optional body
Signed-off-by: Your Name <you@example.com>
Type must be one of: feat, fix, docs, chore, refactor, style, perf, test, build, ci.
Every commit must carry a Signed-off-by: trailer. There is no .github/dco.yml exemption on this repo — NVIDIA org membership does not bypass the DCO bot here. Two ways to add it:
git commit -s -m "feat(provider/foo): add Foo provider" # adds trailer
git commit -s -S -m "..." # sign-off + GPG sign

If a PR arrives without sign-off, rebase the branch to add it:

git rebase --signoff upstream/main
git push --force-with-lease

To enable GPG signing, configure once:

git config --global user.signingkey <key-id>
git config --global commit.gpgsign true

Signed commits get a Verified badge on GitHub. The GPG public key must be uploaded to your GitHub account.
If you discover what appears to be a security vulnerability while working in this codebase — unauthenticated code path, exposed credential, injection vulnerability, privilege-escalation path, dependency with a known CVE, or similar — do not file a public GitHub issue or include it in a public PR description. Surface it privately to the maintainer, who can route it through the NVIDIA PSIRT channels documented in SECURITY.md (psirt@nvidia.com and the submission form; not GitHub).
Every PR should be evaluated for documentation impact before pre-push qualification. The following changes imply specific doc updates in the same PR:
| Change | Docs update required |
|---|---|
| New / changed / removed provider | docs/providers/<name>.md + docs/overview.md provider list + "Choosing a Provider" scenario table |
| New / changed / removed engine | docs/engines/<engine>.md |
| New / changed chart template (Ingress, HTTPRoute, NetworkPolicy, ServiceMonitor, etc.) | docs/engines/k8s.md "Exposing the Topograph API" section |
| New / changed chart values schema | charts/topograph/values.yaml comments, NOTES.txt output, and any docs that reference the values |
| New / changed label or annotation key | docs/reference/node-labels.md |
| New / changed API endpoint, request parameter, or response field | docs/api.md |
| New / changed config schema (`topograph-config.yaml` fields, defaults, validation) | `docs/api.md` |
| New invariant or "do not change without discussion" surface | AGENTS.md + .claude/CLAUDE.md in the same PR |
| New Makefile target, top-level directory, or repository-layout change described by the repository map | AGENTS.md + .claude/CLAUDE.md in the same PR |
If a change falls outside these categories, it still warrants a moment's review for collateral doc drift.
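One way to apply the table during review is to map each changed path to the docs it probably affects. The sketch below is a hypothetical helper: the path-to-change-type mapping is an assumption (for example, that a file under `pkg/providers/<name>/` counts as a provider change), and it covers only the rows that can be inferred from a path alone.

```go
package main

import (
	"fmt"
	"strings"
)

// DocsToReview suggests docs to check for one changed path.
// The mapping from path to change type is an assumption of this sketch.
func DocsToReview(path string) []string {
	parts := strings.Split(path, "/")
	switch {
	// pkg/providers/<name>/<file>: provider change
	case len(parts) >= 4 && parts[0] == "pkg" && parts[1] == "providers":
		return []string{"docs/providers/" + parts[2] + ".md", "docs/overview.md"}
	// pkg/engines/<engine>/<file>: engine change
	case len(parts) >= 4 && parts[0] == "pkg" && parts[1] == "engines":
		return []string{"docs/engines/" + parts[2] + ".md"}
	// chart templates: API exposure docs
	case strings.HasPrefix(path, "charts/topograph/templates/"):
		return []string{"docs/engines/k8s.md"}
	// Makefile targets are described in the agent-facing files
	case path == "Makefile":
		return []string{"AGENTS.md", ".claude/CLAUDE.md"}
	}
	return nil // no automatic suggestion; still review for doc drift
}

func main() {
	for _, path := range []string{
		"pkg/providers/aws/provider.go",
		"charts/topograph/templates/httproute.yaml",
		"pkg/server/server.go",
	} {
		fmt.Println(path, "->", DocsToReview(path))
	}
}
```

A nil result means only that the path matched no rule here; rows such as API or label-key changes still need a manual look.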
When filing a PR (gh pr create or the GitHub UI), .github/PULL_REQUEST_TEMPLATE.md auto-populates the body with a Description section and a Checklist. Fill in the Description and tick the checklist items as completed — do not delete or replace the template wholesale.
- `make qualify` passes (runs fmt, vet, lint, test)
- New or changed public behavior is covered by a test
- Documentation impact evaluated per the table above; applicable doc updates are included in this PR
- `pkg/topology/` changes were discussed in an issue first
- Every commit has a DCO sign-off
- All CI checks must be green before merge (Go build/test/lint, Codecov, DCO)
- Reviewers look for: adherence to the provider/engine boundary, test coverage on new code paths, doc updates when contract changes
- Breaking changes to the config schema, label keys, or `Vertex` shape are rejected unless discussed in an issue first
Read docs/ before asking. Provider-specific questions usually have answers in docs/providers/<name>.md. Label semantics are in docs/reference/node-labels.md. The scenario-to-provider mapping is in the "Choosing a Provider" table in docs/overview.md. API endpoints and config schema live in docs/api.md.