mesh-admin: surface in-flight handler execution (Rust + Python) (#4127)
Open
shayne-fletcher wants to merge 2 commits into
Summary: Pull Request resolved: meta-pytorch#4114

the detail pane shows terse field names (`queue depth`, `rss`, `sessions stalled`, …) whose meanings are easy to forget. this adds a `?` key that opens a static help glossary in the detail pane: a one-line meaning per non-obvious field for the selected node kind (root/host/proc/actor/error), with a subdued note line for the fields that carry caveats (proc/actor `queue depth`, actor `buffered`).

the glossary is a synchronous modal, not an `ActiveJob`: `?` sets `show_help`, any key dismisses it before normal key handling, and `Ctrl-C` still quits. rendering gives the help overlay precedence over `app.overlay` and node detail, and it never mutates the topology or detail cache. the block title carries the node kind (e.g. ` ? actor help `). the idle footer gains a `?: help` hint (`?: 帮助` in zh); the glossary body itself is english-only this pass. added as invariant TUI-22 in the `lib.rs` registry.

Differential Revision: D106889787
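The modal key flow described above can be sketched as follows. This is only an illustrative model in Python, not the TUI's actual Rust code: `show_help` and `overlay` come from the summary, while `App`, `on_key`, `render_target`, `block_title`, and the key strings are hypothetical names. It also assumes that the key which dismisses the glossary is consumed rather than passed on to normal handling.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class App:
    """Hypothetical stand-in for the TUI state; only `show_help` and
    `overlay` are named in the summary."""
    show_help: bool = False
    running: bool = True
    overlay: Optional[str] = None
    selected_kind: str = "actor"
    handled: List[str] = field(default_factory=list)  # keys seen by normal handling


def on_key(app: App, key: str) -> None:
    # Ctrl-C quits whether or not the glossary is open.
    if key == "ctrl-c":
        app.running = False
        return
    # While the glossary is open, any key dismisses it before normal
    # key handling (assumption: the dismissing key is consumed).
    if app.show_help:
        app.show_help = False
        return
    if key == "?":
        app.show_help = True
        return
    app.handled.append(key)


def render_target(app: App) -> str:
    # Help takes precedence over the overlay, which takes precedence
    # over the node detail view. Rendering reads state only.
    if app.show_help:
        return "help"
    if app.overlay is not None:
        return "overlay"
    return "detail"


def block_title(app: App) -> str:
    # The block title carries the selected node kind.
    return f" ? {app.selected_kind} help "
```

Because the modal is a plain flag checked at the top of key handling, it needs no job lifecycle and cannot race with background work, which matches the "synchronous modal, not an `ActiveJob`" framing.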
@shayne-fletcher has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107192488.
…-pytorch#4127)

Summary: Pull Request resolved: meta-pytorch#4127

mesh-admin reports the Rust actor-loop's state, but a Python actor's endpoint work runs detached from that loop -- `PythonActor::handle` hands the message off and returns, so `cell.status()` reads `idle` while Python user code is actively running (the TUI showed `Status: idle` for a busy actor). this adds a general `execution` introspection plane that both runtimes feed, so mesh-admin reports in-flight handler work truthfully without changing dispatch semantics; lifecycle `status` stays a separate plane.

the `execution` block is always present on `NodeProperties::Actor` and the HTTP DTO (count 0 when idle) and carries `active_handler_count` (the full live total across all invocations), `total_handler_names`, `oldest_active_handler`/`oldest_active_since`, and `active_handlers[]` (`{name, active_count, oldest_active_since}`, aggregated by endpoint name, sorted oldest-first, capped with an `active_handlers_truncated` flag). a core-owned `ExecutionRegistry` (a per-cell `DashMap<token, {name, started_at}>` plus an `AtomicU64`) on `InstanceCellState` owns storage and aggregation; `finished` is idempotent. composition is by kind: a cell with the registry installed self-reports (Python), otherwise the snapshot is derived from `ActorStatus::Processing` (Rust, count 0 or 1).

Python actors feed the registry through new `PyInstance._execution_started`/`_execution_finished` hooks that `_Actor.handle` brackets around the user-method call in a `try`/`finally`; the registry is eager-installed at the top of `PythonActor::init` so a live Python actor never falls back to the raw `Processing` path. the TUI renders an `Execution` section when `active_handler_count > 0`.
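A minimal sketch of the registry and the `try`/`finally` bracketing described above, modelled in Python for brevity. The real `ExecutionRegistry` is Rust (a `DashMap` plus an `AtomicU64`); here a lock-guarded dict and a counter stand in for them. Several things are assumptions rather than facts from this PR: that the `AtomicU64` is the token counter, the `started`/`finished`/`snapshot` method names, the `handle` helper, the cap value, and reading `total_handler_names` as the number of distinct endpoint names.

```python
import itertools
import threading
import time

MAX_ACTIVE_HANDLERS = 8  # hypothetical cap; the PR does not state the real limit


class ExecutionRegistry:
    """Illustrative per-cell registry of in-flight handler invocations."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tokens = itertools.count(1)  # assumed role of the AtomicU64
        self._active = {}                  # token -> (name, started_at)

    def started(self, name, now=None):
        started_at = time.monotonic() if now is None else now
        with self._lock:
            token = next(self._tokens)
            self._active[token] = (name, started_at)
        return token

    def finished(self, token):
        # Idempotent: finishing an unknown or already-finished token is a no-op.
        with self._lock:
            self._active.pop(token, None)

    def snapshot(self, cap=MAX_ACTIVE_HANDLERS):
        with self._lock:
            entries = list(self._active.values())
        # Aggregate by endpoint name: invocation count and oldest start time.
        groups = {}
        for name, started_at in entries:
            count, oldest = groups.get(name, (0, started_at))
            groups[name] = (count + 1, min(oldest, started_at))
        handlers = sorted(
            (
                {"name": n, "active_count": c, "oldest_active_since": t}
                for n, (c, t) in groups.items()
            ),
            key=lambda h: h["oldest_active_since"],  # oldest first
        )
        first = handlers[0] if handlers else None
        return {
            # Full live total, unaffected by the cap below.
            "active_handler_count": len(entries),
            # Assumption: number of distinct endpoint names.
            "total_handler_names": len(handlers),
            "oldest_active_handler": first["name"] if first else None,
            "oldest_active_since": first["oldest_active_since"] if first else None,
            "active_handlers": handlers[:cap],
            "active_handlers_truncated": len(handlers) > cap,
        }


def handle(registry, name, method, *args):
    """Bracket a user-method call so the entry is removed even on error."""
    token = registry.started(name)
    try:
        return method(*args)
    finally:
        registry.finished(token)
```

Keying entries by token rather than by endpoint name is what lets concurrent invocations of the same endpoint be counted separately and still be aggregated by name at snapshot time.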
two operator-facing observability helpers ship alongside the surface (no core/DTO/TUI behavior change): `python/examples/execution_demo.py`, a run-and-watch dining-philosophers workload with real `think`/`eat` endpoints and a central `ForkManager`, so the `execution` surface can be watched live in the TUI without driving stdin -- browse a philosopher for `think`/`eat` turnover, the fork manager for `acquire xN` contention; and `logger.info` flight-recorder lines inside the handler bodies of `execution_workload` (and the demo) so the TUI flight-recorder pane shows recent activity for the observed actors instead of "No events".

Differential Revision: D107192488
e873107 to
658079a