Skip to content

Revisit mesh telemetry IDs#4125

Open
thedavekwon wants to merge 5 commits into
meta-pytorch:mainfrom
thedavekwon:export-D107178870
Open

Revisit mesh telemetry IDs#4125
thedavekwon wants to merge 5 commits into
meta-pytorch:mainfrom
thedavekwon:export-D107178870

Conversation

@thedavekwon
Copy link
Copy Markdown
Contributor

@thedavekwon thedavekwon commented Jun 1, 2026

Summary:
ActorMeshId values are only unique within a ProcMesh, so actor mesh telemetry IDs must include both ProcMeshId and ActorMeshId. This showed up with the sidecar design because entity memtables are now per host: singleton actor mesh IDs from multiple proc meshes on the same host land in the same table, so hashing only the actor mesh name makes those singleton rows collide. Previously, this inconsistency was hidden. Add telemetry_actor_mesh_id as the shared derivation helper, and use it for actor mesh creation, sent-message telemetry, actor rows, and the bootstrap client mesh so sent_messages.actor_mesh_id, meshes.id, and actors.mesh_id stay joinable. Also align host and proc mesh row IDs to hash typed HostMeshId and ProcMeshId values, so meshes.parent_mesh_id uses the same derivation as the parent meshes.id row.

Caveat: this does not fully address potential collisions for ProcMeshId::singleton values themselves; proc mesh telemetry IDs may also need parent scoping if multiple singleton proc meshes can share a host-local memtable.

Differential Revision: D107178870

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 1, 2026
@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync Bot commented Jun 1, 2026

@thedavekwon has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107178870.

Summary:

Add `TelemetryActor` as the host-local telemetry collector: a narrow Python control/query shell over the Rust `DatabaseScanner`. Hot telemetry frames bypass Python entirely (producers write framed Arrow IPC over `telemetry.sock`, Rust socket ingest decodes); Python only coordinates activation and exposes the actor endpoints. `activate()` is the bind decision: it pre-registers trace/entity/snapshot schemas, attempts a non-destructive `_start_socket_ingest`, and on success holds the resulting scanner. Multiple `TelemetryActor` candidates may target the same `/tmp/monarch_<apply_id>/telemetry.sock` (collocated jobs like `ProcessJob`, sidecar refreshes); the non-destructive bind is the serialization point, so only one becomes the live collector and the rest stay inert with no scanner. Activation is idempotent on success (`_scanner is not None` gates the fast path) and retries on subsequent calls if the prior attempt was skipped or failed. Endpoints: `activate`, `table_names`, `schema_for`, `apply_retention`, `store_pyspy_dump`, `set_worker_collectors`, and `scan` (local store first, then flat fan-out over `_worker_collectors`, which is always empty in this commit and gets populated by the sidecar in a follow-up step). The actor uses a 0o700 socket directory under `/tmp` for per-user isolation against co-tenants on shared hosts. Adds the supporting PyO3 helpers `_register_trace_entity_schemas` and `_set_unix_socket_sink_path` in `monarch_distributed_telemetry::lib` alongside the existing `_start_socket_ingest`, with matching `.pyi` stubs.

Differential Revision: D106565535
…a-pytorch#4087)

Summary:

Two improvements to `start_dashboard` that make it robust under contexts the in-process dashboard doesn't usually hit but the upcoming sidecar will. (1) Resolve the React build dir via `os.path.dirname(monarch.monarch_dashboard.__file__)` instead of `importlib.resources.files(...)`. The latter returns a `MultiplexedPath` / zip-path inside Buck PARs whose `str(...)` form is not what `os.path.isdir` / Flask's `static_folder` expect, so frontend static files 404 when the dashboard runs from a PAR. `__file__` yields a real filesystem path in both pip-installed and PAR-bundled layouts. (2) Bind the HTTP server via `werkzeug.serving.make_server(host, port, app)` and return `server.server_port` instead of `app.run(port=port)` and the requested port. With `make_server`, a `port=0` request is bound to an OS-assigned ephemeral port and the resolved port is read back from the server object, so callers get back a real URL instead of `http://localhost:0`. Drops the pre-bind socket-availability probe (TOCTOU dance) because `make_server` raises `OSError` synchronously on bind failure.

Differential Revision: D106687761
Summary:

Adds the telemetry sidecar process: a `ProcessGuard`-launched standalone PAR worker that hosts a `TelemetryActor`, activates the local Unix socket sink, and serves the dashboard `/api/query` route. The worker is bundled as a separate `python_binary` PAR and resolved at launch via `importlib.resources.as_file` + copy-to-stable-path; this is necessary because `sys.executable` in PAR-deployed contexts (including buck test PARs) is a wrapper script that doesn't honor `python -m foo.bar`. Moves socket-ingest startup onto Monarch's shared Tokio runtime so `_start_socket_ingest` works when called from a PyO3 thread (no ambient runtime).

Differential Revision: D106718874
Summary:

Actor telemetry IDs should use runtime actor identity, not routable actor addresses (changed in D102822047) This updates actor rows, sender IDs, receiver IDs, and actor status IDs to hash `ActorId` while keeping `ActorAddr` strings only for display/debug fields. This prevents location changes from changing the telemetry actor join key.

Reviewed By: mariusae

Differential Revision: D107156872
Summary:
ActorMeshId values are only unique within a ProcMesh, so actor mesh telemetry IDs must include both `ProcMeshId` and `ActorMeshId`. This showed up with the sidecar design because entity memtables are now per host: singleton actor mesh IDs from multiple proc meshes on the same host land in the same table, so hashing only the actor mesh name makes those singleton rows collide. Previously, this inconsistency was hidden. Add `telemetry_actor_mesh_id` as the shared derivation helper, and use it for actor mesh creation, sent-message telemetry, actor rows, and the bootstrap client mesh so `sent_messages.actor_mesh_id`, `meshes.id`, and `actors.mesh_id` stay joinable. Also align host and proc mesh row IDs to hash typed `HostMeshId` and `ProcMeshId` values, so `meshes.parent_mesh_id` uses the same derivation as the parent `meshes.id` row.

Caveat: this does not fully address potential collisions for `ProcMeshId::singleton` values themselves; proc mesh telemetry IDs may also need parent scoping if multiple singleton proc meshes can share a host-local memtable.

Differential Revision: D107178870
@meta-codesync meta-codesync Bot changed the title Revisit actor mesh telemetry IDs Revisit mesh telemetry IDs Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant