Hash telemetry message port IDs from PortId#4126
Open
thedavekwon wants to merge 6 commits into
Open
Conversation
Summary: Add `TelemetryActor` as the host-local telemetry collector: a narrow Python control/query shell over the Rust `DatabaseScanner`. Hot telemetry frames bypass Python entirely (producers write framed Arrow IPC over `telemetry.sock`, Rust socket ingest decodes); Python only coordinates activation and exposes the actor endpoints. `activate()` is the bind decision: it pre-registers trace/entity/snapshot schemas, attempts a non-destructive `_start_socket_ingest`, and on success holds the resulting scanner. Multiple `TelemetryActor` candidates may target the same `/tmp/monarch_<apply_id>/telemetry.sock` (collocated jobs like `ProcessJob`, sidecar refreshes); the non-destructive bind is the serialization point, so only one becomes the live collector and the rest stay inert with no scanner. Activation is idempotent on success (`_scanner is not None` gates the fast path) and retries on subsequent calls if the prior attempt was skipped or failed. Endpoints: `activate`, `table_names`, `schema_for`, `apply_retention`, `store_pyspy_dump`, `set_worker_collectors`, and `scan` (local store first, then flat fan-out over `_worker_collectors`, which is always empty in this commit and gets populated by the sidecar in a follow-up step). The actor uses a 0o700 socket directory under `/tmp` for per-user isolation against co-tenants on shared hosts. Adds the supporting PyO3 helpers `_register_trace_entity_schemas` and `_set_unix_socket_sink_path` in `monarch_distributed_telemetry::lib` alongside the existing `_start_socket_ingest`, with matching `.pyi` stubs. Differential Revision: D106565535
Summary: Two improvements to `start_dashboard` that make it robust under contexts the in-process dashboard doesn't usually hit but the upcoming sidecar will. (1) Resolve the React build dir via `os.path.dirname(monarch.monarch_dashboard.__file__)` instead of `importlib.resources.files(...)`. The latter returns a `MultiplexedPath` / zip-path inside Buck PARs whose `str(...)` form is not what `os.path.isdir` / Flask's `static_folder` expect, so frontend static files 404 when the dashboard runs from a PAR. `__file__` yields a real filesystem path in both pip-installed and PAR-bundled layouts. (2) Bind the HTTP server via `werkzeug.serving.make_server(host, port, app)` and return `server.server_port` instead of `app.run(port=port)` and the requested port. With `make_server`, a `port=0` request is bound to an OS-assigned ephemeral port and the resolved port is read back from the server object, so callers get back a real URL instead of `http://localhost:0`. Drops the pre-bind socket-availability probe (TOCTOU dance) because `make_server` raises `OSError` synchronously on bind failure. Differential Revision: D106687761
Summary: Adds the telemetry sidecar process: a `ProcessGuard`-launched standalone PAR worker that hosts a `TelemetryActor`, activates the local Unix socket sink, and serves the dashboard `/api/query` route. The worker is bundled as a separate `python_binary` PAR and resolved at launch via `importlib.resources.as_file` + copy-to-stable-path; this is necessary because `sys.executable` in PAR-deployed contexts (including buck test PARs) is a wrapper script that doesn't honor `python -m foo.bar`. Moves socket-ingest startup onto Monarch's shared Tokio runtime so `_start_socket_ingest` works when called from a PyO3 thread (no ambient runtime). Differential Revision: D106718874
Summary: Actor telemetry IDs should use runtime actor identity, not routable actor addresses (changed in D102822047) This updates actor rows, sender IDs, receiver IDs, and actor status IDs to hash `ActorId` while keeping `ActorAddr` strings only for display/debug fields. This prevents location changes from changing the telemetry actor join key. Differential Revision: D107156872
Summary: ActorMeshId values are only unique within a ProcMesh, so actor mesh telemetry IDs must include both `ProcMeshId` and `ActorMeshId`. This showed up with the sidecar design because entity memtables are now per host: singleton actor mesh IDs from multiple proc meshes on the same host land in the same table, so hashing only the actor mesh name makes those singleton rows collide. Previously, this inconsistency was swollen. Add `telemetry_actor_mesh_id` as the shared derivation helper, and use it for actor mesh creation, sent-message telemetry, actor rows, and the bootstrap client mesh so `sent_messages.actor_mesh_id`, `meshes.id`, and `actors.mesh_id` stay joinable. Caveat: this does not fully address potential collisions for `ProcMeshId::singleton` values themselves; proc mesh telemetry IDs may also need parent scoping if multiple singleton proc meshes can share a host-local memtable. Differential Revision: D107178870
Summary: `messages.port_id` was using only the destination port index, which collides across receiver actors that expose the same endpoint. Stamp `TELEMETRY_PORT_ID_HASH` with `hash_to_u64(dest.id())` so received-message telemetry uses the full destination `PortId`, matching the actor-scoped identity stored in the message destination. Differential Revision: D107180548
Contributor
|
@thedavekwon has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107180548. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
messages.port_idwas using only the destination port index, which collides across receiver actors that expose the same endpoint. StampTELEMETRY_PORT_ID_HASHwithhash_to_u64(dest.id())so received-message telemetry uses the full destinationPortId, matching the actor-scoped identity stored in the message destination.Differential Revision: D107180548