Skip to content

scripts: make metrics gate() thread-safe#14301

Merged
matanl-starkware merged 1 commit into
mainfrom
matanl/prod-metrics-thread-safe
Jun 3, 2026
Merged

scripts: make metrics gate() thread-safe#14301
matanl-starkware merged 1 commit into
mainfrom
matanl/prod-metrics-thread-safe

Conversation

@matanl-starkware

Copy link
Copy Markdown
Collaborator

Guard signal.signal registration to the main thread (it raises ValueError on
worker threads), and track live kubectl port-forward processes in a lock-guarded
registry exposing terminate_all_port_forwards for best-effort cleanup from a
main-thread signal handler. Enables gate() to run under parallel waits.

Co-Authored-By: Claude Opus 4.8 (1M context) noreply@anthropic.com

@cursor

cursor Bot commented Jun 3, 2026

Copy link
Copy Markdown

PR Summary

Low Risk
Prod scripting/metrics wait tooling only; no auth, data, or runtime service paths.

Overview
Makes MetricConditionGater.gate() safe to run from worker threads during parallel metric waits.

Adds a lock-protected registry of live kubectl port-forward processes and terminate_all_port_forwards() so a main-thread SIGINT/SIGTERM handler can tear down all forwards when per-thread signal.signal is unavailable. gate() registers each forward on start and removes it in _terminate_port_forward_process. Signal handlers are installed only on the main thread; worker threads rely on the orchestrator’s handler plus the existing finally cleanup.

Replaces the startup assert with a check that treats an early exit as a clean return when _shutdown_requested is set (interrupt during startup), otherwise raises RuntimeError. Status messages use print_colored.

Reviewed by Cursor Bugbot for commit a69f19c. Bugbot is set up for automated code reviews on this repo. Configure here.

@reviewable-StarkWare

Copy link
Copy Markdown

This change is Reviewable

@ron-starkware ron-starkware left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ron-starkware reviewed 1 file and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on matanl-starkware).


scripts/prod/metrics_lib.py line 168 at r1 (raw file):

            with _active_port_forwards_lock:
                _active_port_forwards.add(pf_process)
            print("Waiting for forwarding to start")

Should this be changed to print_colored instead so that it doesn't write to the shared stdout?

Suggestion:

print_colored("Waiting for forwarding to start")

scripts/prod/metrics_lib.py line 178 at r1 (raw file):

            print(
                f"Forwarding started (from local port {self.local_port} to {self.pod}:{self.metrics_port})"
            )

Should this be changed to print_colored instead so that it doesn't write to the shared stdout?

Suggestion:

            print_colored(
                f"Forwarding started (from local port {self.local_port} to {self.pod}:{self.metrics_port})"
            )

@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from 878e911 to fcc2da4 Compare June 3, 2026 08:04
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-helper branch from 893cb64 to a92e79a Compare June 3, 2026 08:04
@matanl-starkware

Copy link
Copy Markdown
Collaborator Author

Addressed: both bare print(...) calls in gate() ("Waiting for forwarding to start" and "Forwarding started ...") now use print_colored(...) so they honor the per-node output buffer under parallel waits.

Comment thread scripts/prod/metrics_lib.py

@ron-starkware ron-starkware left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ron-starkware reviewed 1 file and all commit messages, and resolved 2 discussions.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on matanl-starkware).

@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from fcc2da4 to 8d5a423 Compare June 3, 2026 08:29
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-helper branch from a92e79a to 04009a0 Compare June 3, 2026 08:29
@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from 8d5a423 to 41e48cc Compare June 3, 2026 08:36
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-helper branch from 04009a0 to 029ce2e Compare June 3, 2026 08:36
@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from 41e48cc to 699ce08 Compare June 3, 2026 11:56
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-helper branch from 029ce2e to 463f2bc Compare June 3, 2026 11:56
@graphite-app graphite-app Bot changed the base branch from matanl/prod-parallel-helper to graphite-base/14301 June 3, 2026 12:19
@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from 699ce08 to 32948e4 Compare June 3, 2026 12:19
@matanl-starkware matanl-starkware changed the base branch from graphite-base/14301 to matanl/prod-parallel-helper June 3, 2026 12:19
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-helper branch from a185a20 to fc6f37a Compare June 3, 2026 12:22
@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from 32948e4 to 137d975 Compare June 3, 2026 12:22
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-helper branch from fc6f37a to fdefe46 Compare June 3, 2026 12:33
@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from 137d975 to 0270f55 Compare June 3, 2026 12:33
Comment thread scripts/prod/metrics_lib.py
@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from 0270f55 to ed98ef3 Compare June 3, 2026 13:50
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-helper branch from fdefe46 to df94c18 Compare June 3, 2026 13:50
@matanl-starkware matanl-starkware changed the base branch from matanl/prod-parallel-helper to main June 3, 2026 13:59
@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from ed98ef3 to fc1a2f9 Compare June 3, 2026 14:00
Guard signal.signal registration to the main thread (it raises ValueError on
worker threads), and track live kubectl port-forward processes in a lock-guarded
registry exposing terminate_all_port_forwards for best-effort cleanup from a
main-thread signal handler. Enables gate() to run under parallel waits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from fc1a2f9 to a69f19c Compare June 3, 2026 14:16
@matanl-starkware

Copy link
Copy Markdown
Collaborator Author

@ron-starkware — could you re-approve this one when you get a chance? 🙏

Context: #14300 (bottom of the stack) merged, so I rebased the rest onto main. The rebase reset Reviewable's status check, so it needs a fresh LGTM on this revision.

I also addressed the new Cursor Bugbot finding on this push: in gate() on a worker thread, a Ctrl-C handler calling terminate_all_port_forwards during the 3s startup sleep could trip the assert pf_process.poll() is None (teardown-only). It now records a shutdown flag and gate() returns cleanly in that case instead of asserting; a genuine startup failure raises a descriptive RuntimeError instead of a bare assert.

Heads-up on cadence: because the repo squash-merges and these are separate stacked PRs, each time I merge one and gt sync, the next PR rebases and its Reviewable check resets — so I'll request re-approval one at a time (#14302 after this merges, etc.). Appreciate the patience!

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit a69f19c. Configure here.

Comment thread scripts/prod/metrics_lib.py
Comment thread scripts/prod/metrics_lib.py

@ron-starkware ron-starkware left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ron-starkware reviewed 1 file and all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on matanl-starkware).

@matanl-starkware matanl-starkware left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matanl-starkware resolved 2 discussions.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on matanl-starkware).

@matanl-starkware matanl-starkware added this pull request to the merge queue Jun 3, 2026
Merged via the queue into main with commit c9f229d Jun 3, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants