Skip to content

Migrate health checks to HealthCheckRuntime (#102)#102

Merged
luccabb merged 1 commit intomainfrom
health_rt
Mar 28, 2026
Merged

Migrate health checks to HealthCheckRuntime (#102)#102
luccabb merged 1 commit intomainfrom
health_rt

Conversation

@luccabb
Copy link
Copy Markdown
Member

@luccabb luccabb commented Mar 20, 2026

Summary:
moves healthchecks boilerplate to the HealthCheckRuntime context manager, this removes the repeated setup across every check.

Test Plan:
runs this shell file before and after changes: https://gist.github.com/luccabb/d0c7b0d2ea4c4fe0925af4eb6001cee0

diffing{before,after}.txt files shows they're the same, which proves this PR shouldn't impact healthchecks output

Reviewed By: calebho

Differential Revision: D97423071

Pulled By: luccabb

@meta-cla meta-cla Bot added the cla signed label Mar 20, 2026
@github-actions
Copy link
Copy Markdown

CI Commands

The following CI workflows run automatically on every push and pull request:

Workflow What it runs
GPU Cluster Monitoring Python CI lint, tests, typecheck, format, deb build, pyoxidizer builds
Go packages CI shelper tests, format, lint

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

Command Description Requires approval?
/metaci tests Runs Meta internal integration tests (pytest) Yes — a maintainer must trigger the command and approve the deployment request
/metaci integration tests Same as above (alias) Yes

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

@luccabb luccabb changed the title Health rt Migrate remaining health checks to HealthCheckRuntime Mar 20, 2026
@luccabb luccabb changed the title Migrate remaining health checks to HealthCheckRuntime Migrate health checks to HealthCheckRuntime Mar 20, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Mar 20, 2026

@luccabb has imported this pull request. If you are a Meta employee, you can view this in D97423071.

@luccabb luccabb marked this pull request as ready for review March 20, 2026 04:51
@luccabb
Copy link
Copy Markdown
Member Author

luccabb commented Mar 20, 2026

@claude

@claude
Copy link
Copy Markdown

claude Bot commented Mar 20, 2026

Claude finished @luccabb's task in 3m 34s —— View job


PR Review: Migrate health checks to HealthCheckRuntime

  • Read CLAUDE.md and gather context
  • Review HealthCheckRuntime implementation
  • Sample changed health check files
  • Analyze diff patterns and consistency
  • Review test changes
  • Post final review

This is a solid, well-scoped refactor. The overall approach is correct and the implementation is clean. A few things worth flagging:


Strengths

  • Massive DRY improvement — net -2092 lines by eliminating identical boilerplate from 24 check files
  • Correct exception handlingHealthCheckRuntime.__exit__ returns False, letting SystemExit propagate correctly through TelemetryContext and OutputContext (both of which already handle SystemExit)
  • Killswitch path is safe — explicitly calling self._stack.__exit__(None, None, None) before sys.exit() inside __enter__ is the right thing to do, since Python doesn't call __exit__ when __enter__ raises
  • Good test coverage for the runtimetest_runtime.py covers killswitch cleanup, finish(), GPU node ID failure, and context lifecycle
  • Dependency injection — the injectable get_hostname, get_gpu_node_id, etc. make the runtime testable without real infrastructure

Issues

Test coverage regression (test_check_ssh_certs.py:271)

The test previously verified the log message content:

assert expected[2] in caplog.text

This assertion is dropped in the PR. The expected values encode the exact error message format (e.g. "exit code 3: CHECK_SSH_CERTS - Error 1 running...") which would catch regressions in message construction. Now only the exit code is asserted. The underlying issue is that check_ssh_certs now calls rt.finish() as an early-exit guard (without a preceding rt.logger.info(...) for the final message), so there's nothing in the log to assert against. Worth either restoring the check via result.output / stdout capture, or adding a rt.logger.info() before each rt.finish() call. Fix this →

Reduced startup log verbosity (runtime.py:97-103)

The old check-specific log lines included operation parameters useful for debugging, e.g.:

check-airstore flash-array-credential-count: cluster: X, node: Y, type: Z, expected-count-ge: 10

The new generic log only captures name, cluster, node, type. Check-specific params like expected-count-ge, governor, proc_freq, etc. are silently dropped. This is a minor observability regression. Consider logging them at the call site before entering the with block if they're useful for debugging in production.

check_syslogs.py placement change (informational)

Several commands in check_syslogs.py (e.g. link_flaps, xid) had sys.exit(exit_code.value) outside the old ExitStack block. In the new code rt.finish() is inside the with HealthCheckRuntime() block. This is functionally equivalent — OutputContext.__exit__ checks exc_type != SystemExit specifically, and TelemetryContext.__exit__ runs unconditionally — but worth noting in case you're debugging timing differences in telemetry flushing.


Nits

  • The _make_runtime() helper in test_runtime.py sets telemetry_context_factory and output_context_factory via attribute assignment after construction rather than as constructor args. Since these are dataclass fields, they could be passed directly at construction time for clarity, though the current approach works.

Overall this is a clean, well-executed refactor. The main thing I'd want addressed before merge is the test coverage regression in test_check_ssh_certs.py.

Summary:
moves healthchecks boilerplate to the HealthCheckRuntime context manager, this removes the repeated setup across every check.


Test Plan:
runs this shell file before and after changes: https://gist.github.com/luccabb/d0c7b0d2ea4c4fe0925af4eb6001cee0

diffing`{before,after}.txt` files shows they're the same, which proves this PR shouldn't impact healthchecks output

Reviewed By: calebho

Differential Revision: D97423071

Pulled By: luccabb
@meta-codesync meta-codesync Bot changed the title Migrate health checks to HealthCheckRuntime Migrate health checks to HealthCheckRuntime (#102) Mar 27, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Mar 27, 2026

@luccabb has exported this pull request. If you are a Meta employee, you can view the originating Diff in D97423071.

@luccabb luccabb merged commit 490cc4e into main Mar 28, 2026
45 checks passed
@luccabb luccabb deleted the health_rt branch March 28, 2026 00:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant