Skip to content

Add scaffold tool for creating new health checks [2/2]#78

Merged
luccabb merged 12 commits intofacebookresearch:mainfrom
gustcol:feature/health-check-scaffold
Mar 19, 2026
Merged

Add scaffold tool for creating new health checks [2/2]#78
luccabb merged 12 commits intofacebookresearch:mainfrom
gustcol:feature/health-check-scaffold

Conversation

@gustcol
Copy link
Copy Markdown
Contributor

@gustcol gustcol commented Mar 1, 2026

Summary

Ref: #75
Depends on: #77

  • Introduce bin/create_new_health_check.py, a scaffold tool that automates the creation of new health checks by generating all required files and registering the check across all touchpoints
  • Generated checks use the HealthCheckRuntime context manager from Add HealthCheckRuntime context manager for shared boilerplate [1/2] #77
  • Supports --group flag for grouped checks (@click.group()) and --dry-run for previewing changes
  • Idempotent — safe to re-run without duplicating entries
  • Updates the "Adding New Health Check" documentation with a Quick Start section

Stacked PR series: [1/2] Runtime helper (#77) → [2/2] Scaffold tool (this PR)

What the tool does

$ python bin/create_new_health_check.py check_ntp_sync --dry-run
[dry-run] Would create: gcm/health_checks/checks/check_ntp_sync.py
[dry-run] Would create: gcm/tests/health_checks_tests/test_check_ntp_sync.py
[dry-run] Would create: website/docs/GCM_Health_Checks/health_checks/check-ntp-sync.md
[dry-run] Would add import to gcm/health_checks/checks/__init__.py
[dry-run] Would add __all__ entry to gcm/health_checks/checks/__init__.py
[dry-run] Would add checks.check_ntp_sync to gcm/health_checks/cli/health_checks.py
[dry-run] Would add CHECK_NTP_SYNC to gcm/schemas/health_check/health_check_name.py
[dry-run] Would add disable_check_ntp_sync to gcm/monitoring/features/.../health_checks_features.py
Action Target Insert Order
Create gcm/health_checks/checks/check_{name}.py
Create gcm/tests/health_checks_tests/test_check_{name}.py
Create website/docs/.../check-{name}.md
Modify checks/__init__.py Alphabetical import + append __all__
Modify cli/health_checks.py Append to list_of_checks
Modify health_check_name.py Alphabetical enum entry
Modify health_checks_features.py Alphabetical disable_ field

Test plan

  • nox -s tests -- gcm/tests/test_create_health_check.py — 15 tests covering validation, file generation, registration, idempotency, dry-run, grouped checks
  • python bin/create_new_health_check.py check_test_example --dry-run — integration smoke test
  • nox -s lint
  • nox -s format
  • Verify Docusaurus website builds with updated docs

…plate

Extract the ~30 lines of repeated setup code (logger init, GPU node ID
detection, derived cluster resolution, TelemetryContext + OutputContext
nesting, killswitch check) into a reusable HealthCheckRuntime dataclass
context manager. This reduces per-subcommand boilerplate from ~30 lines
to ~5 lines.

The helper is purely additive — existing checks continue to work
unchanged. New checks can use `with HealthCheckRuntime(...) as rt:`
instead of manually wiring up the setup ceremony.

Includes comprehensive tests covering field initialization, killswitch
behavior, context manager nesting, GPU node ID failure handling, and
the finish() convenience method.

Refs: facebookresearch#75
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 1, 2026

CI Commands

The following CI workflows run automatically on every push and pull request:

Workflow What it runs
GPU Cluster Monitoring Python CI lint, tests, typecheck, format, deb build, pyoxidizer builds
Go packages CI shelper tests, format, lint

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

Command Description Requires approval?
/metaci tests Runs Meta internal integration tests (pytest) Yes — a maintainer must trigger the command and approve the deployment request
/metaci integration tests Same as above (alias) Yes

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

gustcol added 3 commits March 1, 2026 12:51
Apply ufmt formatting and fix mypy errors in test helper
by using explicit typed parameters instead of **kwargs dict
unpacking.
Introduce bin/create_new_health_check.py that automates the creation
of new health checks by generating all required files (check module,
test skeleton, documentation stub) and registering the check across
all touchpoints (checks/__init__.py, CLI entry point, HealthCheckName
enum, killswitch feature flag).

The tool supports single-command and grouped-command checks, has a
dry-run mode, and is idempotent (safe to re-run). Generated checks
use the new HealthCheckRuntime context manager from the previous
commit.

Also updates the "Adding New Health Check" guide with a Quick Start
section pointing to the scaffold tool.

Refs: facebookresearch#75
- Replace len() slice with constant to avoid E203 whitespace-before-colon
- Remove placeholder-less f-strings (F541)
- Remove unused StringIO import (F401)
@gustcol gustcol force-pushed the feature/health-check-scaffold branch from 158959c to 4110400 Compare March 1, 2026 11:58
Add type narrowing asserts for importlib spec/loader (which return
Optional types), type the dynamically-loaded scaffold module as Any
to allow attribute access, and annotate the scaffold_env fixture
return type as Iterator[Path].
```bash
python bin/generate_features.py
ufmt format gcm
```
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run these from create_new_health_check.py?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea — the scaffold tool now runs generate_features.py and ufmt format gcm automatically as a post-scaffold step (skipped in --dry-run mode). Updated the docs to remove the manual instructions.

gustcol added 2 commits March 1, 2026 22:02
The scaffold tool now runs generate_features.py and ufmt format gcm
as a post-scaffold step, so users no longer need to run them manually.
Updated docs to reflect the automated workflow.
- Remove unused import sys from generated templates (would fail lint)
- Remove unused check_name parameter from run_post_scaffold
- Add guard for silent no-op when CLI anchor string is not found
- Fix dry-run inconsistency in update_init __all__ block
- Remove tautological assertion in test_update_init_idempotent
- Add test for missing CLI anchor warning
- Update Quick Start docs to list all remaining manual steps
Copy link
Copy Markdown
Member

@luccabb luccabb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gustcol if you could move the base branch to #77 instead of main

@luccabb
Copy link
Copy Markdown
Member

luccabb commented Mar 12, 2026

@claude

@claude
Copy link
Copy Markdown

claude Bot commented Mar 12, 2026

Claude finished @luccabb's task in 3m 5s —— View job


Updating PR Base Branch

  • Gather context
  • Analyze what's needed to change the base branch

Summary

The request is to retarget PR #78 so it points to PR #77's branch instead of main. This would make the diff show only the scaffold tool changes (not the runtime helper changes from #77).

What I found: The current branch feature/health-check-scaffold is already structured correctly — the runtime commits (8bc8cb6, a076093) sit below the scaffold commits (b22f798 through fe67af4), so no rebasing is needed. Only the GitHub base branch setting needs to change.

However, I'm unable to complete this automatically because:

  • curl / network requests require approval that hasn't been granted
  • gh CLI requires approval that hasn't been granted
  • I cannot change the PR base branch without an authenticated GitHub API call

How to fix this manually

Option 1 — GitHub UI:

  1. Go to Add scaffold tool for creating new health checks [2/2] #78
  2. Click Edit next to the PR title
  3. Change the base dropdown from main to PR Add HealthCheckRuntime context manager for shared boilerplate [1/2] #77's branch (e.g. feature/health-check-runtime)
  4. Click Change base

Option 2 — gh CLI (from your local machine):

# First find PR #77's branch
gh pr view 77 --json headRefName -q .headRefName

# Then update the base branch
gh pr edit 78 --base <pr77-branch-name>

No code changes are needed — the git history is already correct for a stacked PR.

@luccabb
Copy link
Copy Markdown
Member

luccabb commented Mar 19, 2026

@claude

@claude
Copy link
Copy Markdown

claude Bot commented Mar 19, 2026

Claude finished @luccabb's task in 2m 14s —— View job


Code Review

  • Gather context
  • Read changed files
  • Post findings

Overall this is a solid scaffold tool — the design is clean, dry-run/idempotency behavior is correct, and the test coverage is thorough. A few issues worth addressing before merge:

Issues

1. update_init() — fragile ]\n anchor for __all__ insertion (bin/create_new_health_check.py:619)

new_content = content.replace("]\n", f"{all_entry}]\n", 1)

str.replace(..., 1) replaces the first ]\n in the whole file. If any list/tuple closes with ]\n before the __all__ block (e.g., a type annotation or helper tuple), the entry gets inserted in the wrong place and the file is silently corrupted. Consider using a more specific anchor like ' "check_' to find the __all__ block boundary, or search for the literal __all__ = [ and locate the matching close. Fix this →

2. update_enum() / update_features() — fallback appends to end of file (bin/create_new_health_check.py:700-704, 750-752)

else:
    # No existing CHECK_ entries; append before end of class.
    content = content.rstrip("\n") + "\n" + new_line

The comment says "append before end of class" but the code appends to the end of the file. If the file has anything after the class body (trailing comment, blank line, or additional code), the generated entry lands outside the class definition, producing invalid Python. This edge case is unlikely today but worth hardening. Fix this →

3. update_cli() — no alphabetical ordering (bin/create_new_health_check.py:648-649)

new_content = content.replace("]\n\nfor check", f"{entry}]\n\nfor check", 1)

All other registrations (imports, enum, features) insert alphabetically. The CLI list just appends before the closing bracket. Inconsistency means the CLI list will gradually drift out of order, which is surprising given the PR description's table explicitly notes this is an "Append" operation. Worth either documenting the intentional difference or making it alphabetical for consistency.

4. validate_name() — calls sys.exit() directly (bin/create_new_health_check.py:443)

sys.exit() inside a library function is a code smell — it couples the validation logic to process exit, making it harder to reuse or test cleanly. The tests correctly catch SystemExit, but raising ValueError here and calling sys.exit() only in main() would be cleaner.

Minor observations

  • _render() is elegant — avoids str.format() collision with template braces like {'attr_1': 'value1'} in doc templates. Worth a brief inline comment explaining why it exists.
  • Module-level import in test file (gcm/tests/test_create_health_check.py:22-28): The scaffold module is executed at collection time. If create_new_health_check.py ever has a top-level import error, the entire test module fails to collect with a confusing error. Consider wrapping in a pytest.importorskip() or a try/except with a pytest.skip().
  • run_post_scaffold() subprocess calls (bin/create_new_health_check.py:772-814): No capture_output or text=True — stdout/stderr from generate_features.py and ufmt pass through directly to the terminal, which is actually desirable for a CLI tool. ✓
  • Test coverage is excellent — 15 tests covering validation, file generation, registration (init/CLI/enum/features), idempotency, dry-run, grouped checks, and warning paths. The fixture pattern with importlib.util loading the script is clever.
  • Doc update is clean — the Quick Start section is concise and the manual steps are preserved for users who need fine-grained control.

@luccabb
Copy link
Copy Markdown
Member

luccabb commented Mar 19, 2026

@claude

@claude
Copy link
Copy Markdown

claude Bot commented Mar 19, 2026

Claude finished @luccabb's task in 2m 44s —— View job


Code Review — Follow-up

  • Gather context
  • Read changed files
  • Compare against previous review findings

All four issues flagged in the previous review have been addressed in subsequent commits. Here's the status:

Previous Issue Status
update_init() fragile ]\n anchor ✅ Fixed — now uses content.find("]", all_start) anchored to __all__ = [ position
update_enum() / update_features() EOF fallback ✅ Fixed — both now use class_match to locate the class body before inserting
update_cli() no alphabetical ordering ✅ Fixed — entry_pattern regex finds existing entries and inserts in order; fallback only runs when no entries exist
validate_name() calls sys.exit() directly ✅ Fixed — raises ValueError; sys.exit(1) is correctly confined to main()

Remaining minor observations

1. __all__ entries are not inserted alphabetically (bin/create_new_health_check.py:632)

new_content = content[:close_bracket] + all_entry + content[close_bracket:]

This appends the new entry right before the closing ] without alphabetical ordering. The import block is inserted alphabetically (lines 584–596), so the __all__ list will gradually drift out of sync with the imports above it. This is low-severity since __all__ order has no semantic effect, but worth noting for consistency.

2. Module-level script execution in test file (gcm/tests/test_create_health_check.py:22-28)

spec = importlib.util.spec_from_file_location(...)
scaffold: Any = importlib.util.module_from_spec(spec)
spec.loader.exec_module(scaffold)  # runs at collection time

If create_new_health_check.py ever has a top-level import error, the entire test module silently fails to collect. Wrapping in a try/except with pytest.skip() would give a cleaner failure message.

Overall

The scaffold tool is solid — the core logic is correct, idempotency works, dry-run is properly gated, and the 15-test suite covers the important paths. Ready to merge once the base branch change (from mainfeature/health-check-runtime) is confirmed.

@luccabb luccabb merged commit 9d39a9f into facebookresearch:main Mar 19, 2026
37 checks passed
@luccabb luccabb deleted the feature/health-check-scaffold branch March 19, 2026 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants