mesh-admin: surface in-flight handler execution (Rust + Python) (#4127)

Open
shayne-fletcher wants to merge 2 commits into `meta-pytorch:main` from `shayne-fletcher:export-D107192488`


shayne-fletcher (Contributor) commented Jun 2, 2026

Summary:

mesh-admin reports the Rust actor loop's state, but a Python actor's endpoint work runs detached from that loop: `PythonActor::handle` hands the message off and returns, so `cell.status()` reads `idle` while Python user code is still running (the TUI showed `Status: idle` for a busy actor). This adds a general `execution` introspection plane that both runtimes feed, so mesh-admin reports in-flight handler work truthfully without changing dispatch semantics. Lifecycle `status` remains a separate plane.
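The mismatch can be reproduced with a toy model. This is a plain-asyncio sketch, not Monarch code: `ToyCell`, `handle`, and `observe` are hypothetical names, and `status` here tracks only the loop, the way `cell.status()` does. Because the user work is spawned as a detached task, the loop reports `idle` while one unit of work is still in flight.

```python
import asyncio


class ToyCell:
    """Hypothetical stand-in for an actor cell whose `status` tracks only
    the actor loop, not detached user work."""

    def __init__(self) -> None:
        self.status = "idle"
        self.detached: set = set()

    async def handle(self, work) -> None:
        # The loop marks itself busy, hands the message off, and returns.
        self.status = "processing"
        task = asyncio.create_task(work())  # user code keeps running here
        self.detached.add(task)
        task.add_done_callback(self.detached.discard)
        self.status = "idle"  # the loop is done, but the work is not


async def observe() -> tuple:
    cell = ToyCell()
    await cell.handle(lambda: asyncio.sleep(0.05))
    snapshot = (cell.status, len(cell.detached))
    await asyncio.gather(*cell.detached)  # let the detached work finish
    return snapshot


print(asyncio.run(observe()))  # ('idle', 1)
```

Loop status alone cannot see the detached task, which is why the PR adds a separate plane fed by the runtime that actually runs the user code.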

The `execution` block is always present on `NodeProperties::Actor` and in the HTTP DTO (with a count of 0 when idle). It carries `active_handler_count` (the full live total across all invocations), `total_handler_names`, `oldest_active_handler`/`oldest_active_since`, and `active_handlers[]`: entries of `{name, active_count, oldest_active_since}`, aggregated by endpoint name, sorted oldest-first, and capped, with an `active_handlers_truncated` flag. A core-owned `ExecutionRegistry` on `InstanceCellState` (a per-cell `DashMap<token, {name, started_at}>` plus an `AtomicU64`) owns storage and aggregation; `finished` is idempotent. Composition is by kind: a cell with the registry installed self-reports (Python); otherwise the snapshot is derived from `ActorStatus::Processing` (Rust, count 0 or 1).
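To make the storage, aggregation, and composition rules concrete, here is a Python model of them, under stated assumptions. It is not the Rust `ExecutionRegistry`: a lock-guarded `dict` stands in for the `DashMap`, `itertools.count` stands in for the `AtomicU64` token counter, and the cap of 8, the `summarize`/`execution_view` helpers, and the reading of `total_handler_names` as "distinct names before capping" are all assumptions for illustration.

```python
import itertools
import threading
import time
from typing import Callable, Dict, List, Optional, Tuple


class ExecutionRegistry:
    """Per-cell registry of in-flight handler invocations (sketch)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._tokens = itertools.count(1)  # stands in for the AtomicU64
        self._active: Dict[int, Tuple[str, float]] = {}  # token -> (name, started_at)
        self._lock = threading.Lock()

    def started(self, name: str) -> int:
        with self._lock:
            token = next(self._tokens)
            self._active[token] = (name, self._clock())
        return token

    def finished(self, token: int) -> None:
        # Idempotent: finishing an unknown or already-finished token is a no-op.
        with self._lock:
            self._active.pop(token, None)

    def snapshot(self, cap: int = 8) -> dict:
        with self._lock:
            entries = list(self._active.values())
        # Aggregate by endpoint name: live count plus oldest start time.
        groups: Dict[str, List] = {}
        for name, started_at in entries:
            group = groups.setdefault(name, [0, started_at])
            group[0] += 1
            group[1] = min(group[1], started_at)
        rows = [
            {"name": n, "active_count": c, "oldest_active_since": t}
            for n, (c, t) in groups.items()
        ]
        rows.sort(key=lambda r: r["oldest_active_since"])  # oldest first
        return summarize(rows, active_count=len(entries), cap=cap)


def summarize(rows: List[dict], active_count: int, cap: int) -> dict:
    oldest = rows[0] if rows else None
    return {
        "active_handler_count": active_count,  # full total, never capped
        "total_handler_names": len(rows),  # assumed: distinct names before capping
        "oldest_active_handler": oldest["name"] if oldest else None,
        "oldest_active_since": oldest["oldest_active_since"] if oldest else None,
        "active_handlers": rows[:cap],
        "active_handlers_truncated": len(rows) > cap,
    }


def execution_view(registry: Optional[ExecutionRegistry],
                   processing: Optional[Tuple[str, float]]) -> dict:
    """Composition by kind: a cell with a registry self-reports; otherwise
    derive a 0-or-1 view from a hypothetical (handler, since) pair that
    stands in for ActorStatus::Processing."""
    if registry is not None:
        return registry.snapshot()
    rows = []
    if processing is not None:
        name, since = processing
        rows = [{"name": name, "active_count": 1, "oldest_active_since": since}]
    return summarize(rows, active_count=len(rows), cap=8)
```

Note that `active_handler_count` is computed from all entries before the per-name rows are capped, so truncating `active_handlers` never hides work from the headline count.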

Python actors feed the registry through new `PyInstance._execution_started`/`_execution_finished` hooks, which `_Actor.handle` brackets around the user-method call in a `try`/`finally`. The registry is installed eagerly at the top of `PythonActor::init`, so a live Python actor never falls back to the raw `Processing` path. The TUI renders an `Execution` section when `active_handler_count > 0`.
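The `try`/`finally` bracketing is what keeps the registry accurate when user code raises or is cancelled. The sketch below models it in plain Python. `RecordingInstance` is a hypothetical stand-in for `PyInstance`: the hook names come from the description, but their token-returning signatures and the `handle` helper are assumptions.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


class RecordingInstance:
    """Hypothetical stand-in for PyInstance. Hook names follow the
    description; the token-based signatures are assumed."""

    def __init__(self) -> None:
        self.active: Dict[int, str] = {}
        self._next_token = 0

    def _execution_started(self, name: str) -> int:
        self._next_token += 1
        self.active[self._next_token] = name
        return self._next_token

    def _execution_finished(self, token: int) -> None:
        self.active.pop(token, None)


async def handle(instance: RecordingInstance, name: str,
                 method: Callable[..., Awaitable[Any]], *args: Any) -> Any:
    token = instance._execution_started(name)
    try:
        return await method(*args)
    finally:
        # Runs on return, exception, or cancellation, so no entry can leak.
        instance._execution_finished(token)
```

Because `finished` runs in `finally`, a handler that raises still removes its entry; an entry is present only while the user method is actually executing.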

Two operator-facing observability helpers ship alongside the surface, with no core, DTO, or TUI behavior change. The first is `python/examples/execution_demo.py`, a run-and-watch dining-philosophers workload with real `think`/`eat` endpoints and a central `ForkManager`, so the `execution` surface can be watched live in the TUI without driving stdin: browse a philosopher to see `think`/`eat` turnover, or the fork manager to see `acquire xN` contention. The second adds `logger.info` flight-recorder lines inside the handler bodies of `execution_workload` (and the demo), so the TUI flight-recorder pane shows recent activity for the observed actors instead of "No events".
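The shape of that workload, and why the fork manager shows several concurrent `acquire` calls, can be modeled in plain asyncio. This is a sketch under assumptions, not the contents of `execution_demo.py`: it uses no Monarch actor APIs, and `philosopher`, `dine`, the `waiting` counter, the logger name, and the both-forks-at-once rule are illustrative choices. The `logger.info` calls use the standard `logging` module to mimic flight-recorder lines.

```python
import asyncio
import logging
from typing import List, Tuple

logger = logging.getLogger("execution_demo_sketch")


class ForkManager:
    """Central arbiter: a seat takes both neighboring forks at once or waits."""

    def __init__(self, seats: int) -> None:
        self.seats = seats
        self.taken = [False] * seats
        self.cond = asyncio.Condition()
        self.waiting = 0  # concurrent `acquire` calls in flight ("acquire xN")

    def _forks(self, seat: int) -> Tuple[int, int]:
        return seat, (seat + 1) % self.seats

    async def acquire(self, seat: int) -> None:
        left, right = self._forks(seat)
        async with self.cond:
            self.waiting += 1
            logger.info("acquire seat=%d waiting=%d", seat, self.waiting)
            await self.cond.wait_for(lambda: not (self.taken[left] or self.taken[right]))
            self.taken[left] = self.taken[right] = True
            self.waiting -= 1

    async def release(self, seat: int) -> None:
        left, right = self._forks(seat)
        async with self.cond:
            self.taken[left] = self.taken[right] = False
            self.cond.notify_all()


async def philosopher(seat: int, forks: ForkManager, meals: int, eaten: List[int]) -> None:
    for _ in range(meals):
        logger.info("think seat=%d", seat)  # flight-recorder style line
        await asyncio.sleep(0.001)
        await forks.acquire(seat)
        try:
            logger.info("eat seat=%d", seat)
            eaten.append(seat)
            await asyncio.sleep(0.001)
        finally:
            await forks.release(seat)


async def dine(seats: int = 5, meals: int = 3) -> Tuple[List[int], List[bool]]:
    forks = ForkManager(seats)  # created inside the running event loop
    eaten: List[int] = []
    await asyncio.gather(*(philosopher(s, forks, meals, eaten) for s in range(seats)))
    return eaten, forks.taken
```

Taking both forks atomically under one condition variable avoids the classic circular-wait deadlock, and every blocked philosopher is parked inside `acquire`, which is what makes contention visible as multiple in-flight invocations of one endpoint.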

Differential Revision: D107192488

Summary:
Pull Request resolved: meta-pytorch#4114

The detail pane shows terse field names (`queue depth`, `rss`, `sessions stalled`, …) whose meanings are easy to forget. This adds a `?` key that opens a static help glossary in the detail pane: a one-line meaning for each non-obvious field of the selected node kind (root/host/proc/actor/error), plus a subdued note line for the fields that carry caveats (proc/actor `queue depth`, actor `buffered`).

The glossary is a synchronous modal, not an `ActiveJob`: `?` sets `show_help`, any key dismisses it before normal key handling runs, and `Ctrl-C` still quits. Rendering gives the help overlay precedence over `app.overlay` and the node detail, and it never mutates the topology or detail cache. The block title carries the node kind (e.g. ` ? actor help `). The idle footer gains a `?: help` hint (`?: 帮助` in zh); the glossary body itself is English-only in this pass.

Added as invariant TUI-22 in the `lib.rs` registry.
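The key-dispatch and render-precedence rules for the glossary modal can be modeled in a few lines. This is a Python sketch of the ordering only, not the Rust TUI code: `App`, `on_key`, `block_title`, the `"ctrl-c"` key string, and the placement of `app.overlay` above node detail are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class App:
    """Hypothetical TUI state holding only what the help modal touches."""
    show_help: bool = False
    should_quit: bool = False
    overlay: Optional[str] = None  # stands in for `app.overlay`
    handled: List[str] = field(default_factory=list)  # keys normal handling saw


def on_key(app: App, key: str) -> None:
    if key == "ctrl-c":  # Ctrl-C quits even while the glossary is open
        app.should_quit = True
        return
    if app.show_help:  # any other key only dismisses the glossary
        app.show_help = False
        return
    if key == "?":  # synchronous modal: a flag flip, no background job
        app.show_help = True
        return
    app.handled.append(key)  # normal key handling would run here


def block_title(app: App, node_kind: str) -> str:
    # Help wins over the overlay, which (assumed) wins over node detail.
    if app.show_help:
        return f" ? {node_kind} help "
    if app.overlay is not None:
        return app.overlay
    return f" {node_kind} "
```

The dismissing key is consumed rather than forwarded, so pressing `j` to close the glossary does not also move the selection.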

Differential Revision: D106889787
meta-cla bot added the CLA Signed label Jun 2, 2026
meta-codesync bot commented Jun 2, 2026

@shayne-fletcher has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107192488.

meta-codesync bot changed the title from "mesh-admin: surface in-flight handler execution (Rust + Python)" to "mesh-admin: surface in-flight handler execution (Rust + Python) (#4127)" Jun 2, 2026

Labels: CLA Signed, fb-exported, meta-exported
