Add scaffold tool for creating new health checks [2/2] by gustcol · Pull Request #78 · facebookresearch/gcm

gustcol · 2026-03-01T11:41:52Z

Summary

Ref: #75
Depends on: #77

Introduce bin/create_new_health_check.py, a scaffold tool that automates the creation of new health checks by generating all required files and registering the check across all touchpoints
Generated checks use the HealthCheckRuntime context manager from Add HealthCheckRuntime context manager for shared boilerplate [1/2] #77
Supports --group flag for grouped checks (@click.group()) and --dry-run for previewing changes
Idempotent — safe to re-run without duplicating entries
Updates the "Adding New Health Check" documentation with a Quick Start section

Stacked PR series: [1/2] Runtime helper (#77) → [2/2] Scaffold tool (this PR)

What the tool does

$ python bin/create_new_health_check.py check_ntp_sync --dry-run
[dry-run] Would create: gcm/health_checks/checks/check_ntp_sync.py
[dry-run] Would create: gcm/tests/health_checks_tests/test_check_ntp_sync.py
[dry-run] Would create: website/docs/GCM_Health_Checks/health_checks/check-ntp-sync.md
[dry-run] Would add import to gcm/health_checks/checks/__init__.py
[dry-run] Would add __all__ entry to gcm/health_checks/checks/__init__.py
[dry-run] Would add checks.check_ntp_sync to gcm/health_checks/cli/health_checks.py
[dry-run] Would add CHECK_NTP_SYNC to gcm/schemas/health_check/health_check_name.py
[dry-run] Would add disable_check_ntp_sync to gcm/monitoring/features/.../health_checks_features.py

Action	Target	Insert Order
Create	`gcm/health_checks/checks/check_{name}.py`	—
Create	`gcm/tests/health_checks_tests/test_check_{name}.py`	—
Create	`website/docs/.../check-{name}.md`	—
Modify	`checks/__init__.py`	Alphabetical import + append `__all__`
Modify	`cli/health_checks.py`	Append to `list_of_checks`
Modify	`health_check_name.py`	Alphabetical enum entry
Modify	`health_checks_features.py`	Alphabetical `disable_` field

Test plan

nox -s tests -- gcm/tests/test_create_health_check.py — 15 tests covering validation, file generation, registration, idempotency, dry-run, grouped checks
python bin/create_new_health_check.py check_test_example --dry-run — integration smoke test
nox -s lint
nox -s format
Verify Docusaurus website builds with updated docs

…plate Extract the ~30 lines of repeated setup code (logger init, GPU node ID detection, derived cluster resolution, TelemetryContext + OutputContext nesting, killswitch check) into a reusable HealthCheckRuntime dataclass context manager. This reduces per-subcommand boilerplate from ~30 lines to ~5 lines. The helper is purely additive — existing checks continue to work unchanged. New checks can use `with HealthCheckRuntime(...) as rt:` instead of manually wiring up the setup ceremony. Includes comprehensive tests covering field initialization, killswitch behavior, context manager nesting, GPU node ID failure handling, and the finish() convenience method. Refs: facebookresearch#75

github-actions · 2026-03-01T11:42:00Z

CI Commands

The following CI workflows run automatically on every push and pull request:

Workflow	What it runs
GPU Cluster Monitoring Python CI	lint, tests, typecheck, format, deb build, pyoxidizer builds
Go packages CI	shelper tests, format, lint

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

Command	Description	Requires approval?
`/metaci tests`	Runs Meta internal integration tests (pytest)	Yes — a maintainer must trigger the command and approve the deployment request
`/metaci integration tests`	Same as above (alias)	Yes

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

Apply ufmt formatting and fix mypy errors in test helper by using explicit typed parameters instead of **kwargs dict unpacking.

Introduce bin/create_new_health_check.py that automates the creation of new health checks by generating all required files (check module, test skeleton, documentation stub) and registering the check across all touchpoints (checks/__init__.py, CLI entry point, HealthCheckName enum, killswitch feature flag). The tool supports single-command and grouped-command checks, has a dry-run mode, and is idempotent (safe to re-run). Generated checks use the new HealthCheckRuntime context manager from the previous commit. Also updates the "Adding New Health Check" guide with a Quick Start section pointing to the scaffold tool. Refs: facebookresearch#75

- Replace len() slice with constant to avoid E203 whitespace-before-colon - Remove placeholder-less f-strings (F541) - Remove unused StringIO import (F401)

Add type narrowing asserts for importlib spec/loader (which return Optional types), type the dynamically-loaded scaffold module as Any to allow attribute access, and annotate the scaffold_env fixture return type as Iterator[Path].

luccabb · 2026-03-01T20:57:20Z

+```bash
+python bin/generate_features.py
+ufmt format gcm
+```


run these from create_new_health_check.py?

Good idea — the scaffold tool now runs generate_features.py and ufmt format gcm automatically as a post-scaffold step (skipped in --dry-run mode). Updated the docs to remove the manual instructions.

The scaffold tool now runs generate_features.py and ufmt format gcm as a post-scaffold step, so users no longer need to run them manually. Updated docs to reflect the automated workflow.

- Remove unused import sys from generated templates (would fail lint) - Remove unused check_name parameter from run_post_scaffold - Add guard for silent no-op when CLI anchor string is not found - Fix dry-run inconsistency in update_init __all__ block - Remove tautological assertion in test_update_init_idempotent - Add test for missing CLI anchor warning - Update Quick Start docs to list all remaining manual steps

luccabb

@gustcol if you could move the base branch to #77 instead of main

luccabb · 2026-03-12T01:22:05Z

@claude

claude · 2026-03-12T01:22:21Z

Claude finished @luccabb's task in 3m 5s —— View job

Updating PR Base Branch

Gather context
Analyze what's needed to change the base branch

Summary

The request is to retarget PR #78 so it points to PR #77's branch instead of main. This would make the diff show only the scaffold tool changes (not the runtime helper changes from #77).

What I found: The current branch feature/health-check-scaffold is already structured correctly — the runtime commits (8bc8cb6, a076093) sit below the scaffold commits (b22f798 through fe67af4), so no rebasing is needed. Only the GitHub base branch setting needs to change.

However, I'm unable to complete this automatically because:

curl / network requests require approval that hasn't been granted
gh CLI requires approval that hasn't been granted
I cannot change the PR base branch without an authenticated GitHub API call

How to fix this manually

Option 1 — GitHub UI:

Go to Add scaffold tool for creating new health checks [2/2] #78
Click Edit next to the PR title
Change the base dropdown from main to PR Add HealthCheckRuntime context manager for shared boilerplate [1/2] #77's branch (e.g. feature/health-check-runtime)
Click Change base

Option 2 — gh CLI (from your local machine):

# First find PR #77's branch
gh pr view 77 --json headRefName -q .headRefName

# Then update the base branch
gh pr edit 78 --base <pr77-branch-name>

No code changes are needed — the git history is already correct for a stacked PR.

…ealth-check-scaffold

luccabb · 2026-03-19T08:17:46Z

@claude

claude · 2026-03-19T08:18:01Z

Claude finished @luccabb's task in 2m 14s —— View job

Code Review

Gather context
Read changed files
Post findings

Overall this is a solid scaffold tool — the design is clean, dry-run/idempotency behavior is correct, and the test coverage is thorough. A few issues worth addressing before merge:

Issues

1. update_init() — fragile ]\n anchor for __all__ insertion (bin/create_new_health_check.py:619)

new_content = content.replace("]\n", f"{all_entry}]\n", 1)

str.replace(..., 1) replaces the first ]\n in the whole file. If any list/tuple closes with ]\n before the __all__ block (e.g., a type annotation or helper tuple), the entry gets inserted in the wrong place and the file is silently corrupted. Consider using a more specific anchor like ' "check_' to find the __all__ block boundary, or search for the literal __all__ = [ and locate the matching close. Fix this →

2. update_enum() / update_features() — fallback appends to end of file (bin/create_new_health_check.py:700-704, 750-752)

else:
    # No existing CHECK_ entries; append before end of class.
    content = content.rstrip("\n") + "\n" + new_line

The comment says "append before end of class" but the code appends to the end of the file. If the file has anything after the class body (trailing comment, blank line, or additional code), the generated entry lands outside the class definition, producing invalid Python. This edge case is unlikely today but worth hardening. Fix this →

3. update_cli() — no alphabetical ordering (bin/create_new_health_check.py:648-649)

new_content = content.replace("]\n\nfor check", f"{entry}]\n\nfor check", 1)

All other registrations (imports, enum, features) insert alphabetically. The CLI list just appends before the closing bracket. Inconsistency means the CLI list will gradually drift out of order, which is surprising given the PR description's table explicitly notes this is an "Append" operation. Worth either documenting the intentional difference or making it alphabetical for consistency.

4. validate_name() — calls sys.exit() directly (bin/create_new_health_check.py:443)

sys.exit() inside a library function is a code smell — it couples the validation logic to process exit, making it harder to reuse or test cleanly. The tests correctly catch SystemExit, but raising ValueError here and calling sys.exit() only in main() would be cleaner.

Minor observations

_render() is elegant — avoids str.format() collision with template braces like {'attr_1': 'value1'} in doc templates. Worth a brief inline comment explaining why it exists.
Module-level import in test file (gcm/tests/test_create_health_check.py:22-28): The scaffold module is executed at collection time. If create_new_health_check.py ever has a top-level import error, the entire test module fails to collect with a confusing error. Consider wrapping in a pytest.importorskip() or a try/except with a pytest.skip().
run_post_scaffold() subprocess calls (bin/create_new_health_check.py:772-814): No capture_output or text=True — stdout/stderr from generate_features.py and ufmt pass through directly to the terminal, which is actually desirable for a CLI tool. ✓
Test coverage is excellent — 15 tests covering validation, file generation, registration (init/CLI/enum/features), idempotency, dry-run, grouped checks, and warning paths. The fixture pattern with importlib.util loading the script is clever.
Doc update is clean — the Quick Start section is concise and the manual steps are preserved for users who need fine-grained control.

luccabb · 2026-03-19T08:59:54Z

@claude

claude · 2026-03-19T09:00:11Z

Claude finished @luccabb's task in 2m 44s —— View job

Code Review — Follow-up

Gather context
Read changed files
Compare against previous review findings

All four issues flagged in the previous review have been addressed in subsequent commits. Here's the status:

Previous Issue	Status
`update_init()` fragile `]\n` anchor	✅ Fixed — now uses `content.find("]", all_start)` anchored to `__all__ = [` position
`update_enum()` / `update_features()` EOF fallback	✅ Fixed — both now use `class_match` to locate the class body before inserting
`update_cli()` no alphabetical ordering	✅ Fixed — `entry_pattern` regex finds existing entries and inserts in order; fallback only runs when no entries exist
`validate_name()` calls `sys.exit()` directly	✅ Fixed — raises `ValueError`; `sys.exit(1)` is correctly confined to `main()`

Remaining minor observations

1. __all__ entries are not inserted alphabetically (bin/create_new_health_check.py:632)

new_content = content[:close_bracket] + all_entry + content[close_bracket:]

This appends the new entry right before the closing ] without alphabetical ordering. The import block is inserted alphabetically (lines 584–596), so the __all__ list will gradually drift out of sync with the imports above it. This is low-severity since __all__ order has no semantic effect, but worth noting for consistency.

2. Module-level script execution in test file (gcm/tests/test_create_health_check.py:22-28)

spec = importlib.util.spec_from_file_location(...)
scaffold: Any = importlib.util.module_from_spec(spec)
spec.loader.exec_module(scaffold)  # runs at collection time

If create_new_health_check.py ever has a top-level import error, the entire test module silently fails to collect. Wrapping in a try/except with pytest.skip() would give a cleaner failure message.

Overall

The scaffold tool is solid — the core logic is correct, idempotency works, dry-run is properly gated, and the 15-test suite covers the important paths. Ready to merge once the base branch change (from main → feature/health-check-runtime) is confirmed.

gustcol requested review from A-Kokolis, calebho, giongto35, jj10306 and luccabb as code owners March 1, 2026 11:41

meta-cla Bot added the cla signed label Mar 1, 2026

gustcol mentioned this pull request Mar 1, 2026

[rfc] Simplifying the New Health Check #75

Open

gustcol added 3 commits March 1, 2026 12:51

Fix formatting (ufmt) and type annotations (mypy)

a076093

Apply ufmt formatting and fix mypy errors in test helper by using explicit typed parameters instead of **kwargs dict unpacking.

Fix lint (flake8) and formatting (ufmt) issues

4110400

- Replace len() slice with constant to avoid E203 whitespace-before-colon - Remove placeholder-less f-strings (F541) - Remove unused StringIO import (F401)

gustcol force-pushed the feature/health-check-scaffold branch from 158959c to 4110400 Compare March 1, 2026 11:58

Fix mypy type errors in scaffold test module

cacc3a0

Add type narrowing asserts for importlib spec/loader (which return Optional types), type the dynamically-loaded scaffold module as Any to allow attribute access, and annotate the scaffold_env fixture return type as Iterator[Path].

luccabb requested changes Mar 1, 2026

View reviewed changes

gustcol added 2 commits March 1, 2026 22:02

Run generate_features and ufmt format automatically after scaffolding

9781e65

The scaffold tool now runs generate_features.py and ufmt format gcm as a post-scaffold step, so users no longer need to run them manually. Updated docs to reflect the automated workflow.

luccabb requested changes Mar 12, 2026

View reviewed changes

luccabb added 3 commits March 18, 2026 22:34

Merge branch 'main' of github.com:facebookresearch/gcm into feature/h…

aeab20e

…ealth-check-scaffold

merge fix

08f904b

add timeout and better support for group cmds

6c72d45

fixes

dd0f06a

claude fixes

cd6a6c5

luccabb approved these changes Mar 19, 2026

View reviewed changes

luccabb merged commit 9d39a9f into facebookresearch:main Mar 19, 2026
37 checks passed

luccabb deleted the feature/health-check-scaffold branch March 19, 2026 09:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add scaffold tool for creating new health checks [2/2]#78

Add scaffold tool for creating new health checks [2/2]#78
luccabb merged 12 commits intofacebookresearch:mainfrom
gustcol:feature/health-check-scaffold

gustcol commented Mar 1, 2026

Uh oh!

github-actions Bot commented Mar 1, 2026

Uh oh!

luccabb Mar 1, 2026

Uh oh!

gustcol Mar 1, 2026

Uh oh!

luccabb left a comment

Uh oh!

luccabb commented Mar 12, 2026

Uh oh!

claude Bot commented Mar 12, 2026 •

edited

Loading

Uh oh!

luccabb commented Mar 19, 2026

Uh oh!

claude Bot commented Mar 19, 2026 •

edited

Loading

Uh oh!

luccabb commented Mar 19, 2026

Uh oh!

claude Bot commented Mar 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gustcol commented Mar 1, 2026

Summary

What the tool does

Test plan

Uh oh!

github-actions Bot commented Mar 1, 2026

CI Commands

Uh oh!

luccabb Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

gustcol Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

luccabb left a comment

Choose a reason for hiding this comment

Uh oh!

luccabb commented Mar 12, 2026

Uh oh!

claude Bot commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Updating PR Base Branch

Summary

How to fix this manually

Uh oh!

luccabb commented Mar 19, 2026

Uh oh!

claude Bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Issues

Minor observations

Uh oh!

luccabb commented Mar 19, 2026

Uh oh!

claude Bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — Follow-up

Remaining minor observations

Overall

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented Mar 12, 2026 •

edited

Loading

claude Bot commented Mar 19, 2026 •

edited

Loading

claude Bot commented Mar 19, 2026 •

edited

Loading