Add check_ib_counters: IB port error counter monitoring via sysfs #119

Open
gustcol wants to merge 3 commits into facebookresearch:main from gustcol:feature/check-ib-counters

@gustcol (Contributor) commented Mar 31, 2026

Summary

Implements the first item from #103 — an IB port error counter health check that reads directly from sysfs (/sys/class/infiniband/*/ports/*/counters/).

This complements check_iblink (link presence validation) by detecting runtime fabric degradation that silently impacts distributed training performance. A single degraded IB port can cause 5-30% throughput regressions in NCCL collectives (AllReduce, AllGather) with no obvious GPU/storage cause.

What it does

  • Discovers all IB device/port pairs on the node
  • Reads 12 error counters: SymbolErrorCounter, LinkErrorRecoveryCounter, LinkDownedCounter, PortRcvErrors, PortRcvRemotePhysicalErrors, PortRcvSwitchRelayErrors, PortXmitDiscards, PortXmitConstraintErrors, PortRcvConstraintErrors, LocalLinkIntegrityErrors, ExcessiveBufferOverrunErrors, VL15Dropped
  • Reads 4 throughput counters: PortXmitData, PortRcvData, PortXmitPkts, PortRcvPkts
  • Alerts when aggregate error count exceeds configurable thresholds (--warn-threshold=0, --crit-threshold=100)
  • Emits Nagios-format metrics for all counters (integrates with existing telemetry/sink infrastructure)
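
The discovery, read, and metric-emission steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names (`discover_ports`, `read_counters`, `format_perfdata`) are hypothetical, the lowercase file names in the usage note (`symbol_error`, `link_downed`) assume the kernel's usual sysfs naming for these counters, and the `label=value` output assumes Nagios-style performance data without units or thresholds.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

# Default sysfs location named in the PR summary.
SYSFS_IB_ROOT = Path("/sys/class/infiniband")


def discover_ports(root: Path = SYSFS_IB_ROOT) -> List[Tuple[str, str]]:
    """Return sorted (device, port) pairs found under <root>/*/ports/*."""
    if not root.is_dir():
        return []  # no IB stack on this node
    return sorted(
        (port_dir.parent.parent.name, port_dir.name)
        for port_dir in root.glob("*/ports/*")
        if port_dir.is_dir()
    )


def read_counters(
    root: Path, device: str, port: str, names: Iterable[str]
) -> Dict[str, int]:
    """Read the named counter files for one port, skipping unreadable ones."""
    counters_dir = root / device / "ports" / port / "counters"
    values: Dict[str, int] = {}
    for name in names:
        try:
            values[name] = int((counters_dir / name).read_text().strip())
        except (OSError, ValueError):
            # Missing file, permission error, or non-integer content.
            continue
    return values


def format_perfdata(device: str, port: str, values: Dict[str, int]) -> str:
    """Render counters as space-separated label=value pairs (assumed format)."""
    return " ".join(
        f"{device}_{port}_{name}={value}" for name, value in sorted(values.items())
    )
```

On a node with IB hardware, something like `read_counters(SYSFS_IB_ROOT, "mlx5_0", "1", ["symbol_error", "link_downed"])` would return whichever of those files exist and parse as integers. The device name `mlx5_0` is only an example.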

Architecture

Follows existing health check patterns exactly:

  • IBCountersCheck Protocol (extends CheckEnv) with IBCountersCheckImpl dataclass
  • Pure process_ib_counters() function for threshold evaluation
  • HealthCheckRuntime with killswitch (disable_check_ib_counters)
  • sysfs reads (same approach as check_iblink, no subprocess dependency)
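
As a rough sketch of the Protocol-plus-dataclass pattern with a pure evaluation function: the names below (`CounterSource`, `StaticCounterSource`, `evaluate_error_total`, `Status`, `Evaluation`) are hypothetical stand-ins rather than the `IBCountersCheck` or `process_ib_counters()` from this PR, and the strict greater-than comparison is an assumption based on the word "exceeds" in the description.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Mapping, Protocol


class Status(IntEnum):
    """Nagios-style exit codes."""
    OK = 0
    WARN = 1
    CRITICAL = 2


class CounterSource(Protocol):
    """Anything that can supply error counters (hypothetical interface)."""
    def read_error_counters(self) -> Mapping[str, int]: ...


@dataclass
class StaticCounterSource:
    """Test double that satisfies CounterSource without touching sysfs."""
    counters: Mapping[str, int]

    def read_error_counters(self) -> Mapping[str, int]:
        return self.counters


@dataclass(frozen=True)
class Evaluation:
    status: Status
    total: int
    message: str


def evaluate_error_total(
    counters: Mapping[str, int],
    warn_threshold: int = 0,
    crit_threshold: int = 100,
) -> Evaluation:
    """Pure function: sum the error counters and compare against thresholds."""
    total = sum(counters.values())
    if total > crit_threshold:  # assumed: "exceeds" means strictly greater
        status = Status.CRITICAL
    elif total > warn_threshold:
        status = Status.WARN
    else:
        status = Status.OK
    return Evaluation(status, total, f"{status.name}: {total} aggregate IB errors")


def run_check(source: CounterSource) -> Evaluation:
    """Glue: read counters from any source, then evaluate them."""
    return evaluate_error_total(source.read_error_counters())
```

Keeping the threshold logic free of I/O means it can be unit-tested with plain dictionaries, while the Protocol lets tests substitute a static source for the sysfs reader.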

Files changed

| File | Change |
| --- | --- |
| gcm/health_checks/checks/check_ib_counters.py | New health check implementation |
| gcm/tests/health_checks_tests/test_check_ib_counters.py | 14 tests (7 unit + 5 integration + 2 threshold) |
| gcm/schemas/health_check/health_check_name.py | Add CHECK_IB_COUNTERS enum |
| gcm/monitoring/features/feature_definitions/health_checks_features.py | Add killswitch |
| gcm/monitoring/features/gen/generated_features_healthchecksfeatures.py | Generated killswitch getter |
| gcm/health_checks/checks/__init__.py | Export registration |
| gcm/health_checks/cli/health_checks.py | CLI entrypoint registration |

Test plan

  • 14/14 new tests pass
  • 428/431 existing health check tests pass (3 pre-existing failures in test_check_processor.py)
  • Manual validation on a node with IB hardware

Read InfiniBand port error and throughput counters from sysfs
(/sys/class/infiniband/*/ports/*/counters/) and alert when aggregate
error counts exceed configurable thresholds.

This complements check_iblink (link presence) by detecting runtime
fabric degradation that silently impacts distributed training
performance (NCCL AllReduce, FSDP).

Tracks 12 error counters (SymbolErrorCounter, LinkDownedCounter,
PortRcvErrors, etc.) and 4 throughput counters with Nagios-format
metrics output. Configurable --warn-threshold and --crit-threshold.

Implements: facebookresearch#103
@github-actions

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes: a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.
