Module: cube_harness.analyze
Gradio-based web UI for exploring experiment outputs. Browse agents → tasks → seeds, step through trajectories, inspect observations (screenshots, AXTree, HTML, reward), view agent reasoning, and compare runs across experiments.
make xray # Makefile target
# or
uv run python -m cube_harness.analyze.xray --results-dir <path>

Holds all mutable viewer state. Captured by Gradio handler closures. Not a serializable model: it is UI-only state that lives for the duration of a viewer session.
Key fields:
- trajectories: list[Trajectory], the currently loaded set
- current_trajectory, step: navigation cursor
- _storages: list[FileStorage], one per loaded experiment dir
- _traj_storages: list[FileStorage], index-aligned with trajectories
- _exp_tags: timestamp tag per storage (for disambiguation)
- _bg_loading_done / _bg_gen: background loading coordination
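As a rough illustration of how such a state object can be shared with handler closures, here is a minimal sketch. The class name ViewerState, the helper make_next_step_handler, and the field types are assumptions for illustration only; the field names come from the list above, and Any stands in for the real Trajectory and FileStorage types.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ViewerState:  # hypothetical name; the real class may be named differently
    trajectories: list[Any] = field(default_factory=list)
    current_trajectory: Optional[Any] = None
    step: int = 0
    _storages: list[Any] = field(default_factory=list)
    _traj_storages: list[Any] = field(default_factory=list)
    _exp_tags: list[str] = field(default_factory=list)
    _bg_loading_done: bool = False  # assumed to be a plain flag
    _bg_gen: int = 0  # assumed to be an integer generation counter


def make_next_step_handler(
    state: ViewerState, n_steps: Callable[[Any], int]
) -> Callable[[], int]:
    """Return a handler that closes over `state` and advances the cursor."""

    def on_next() -> int:
        if state.current_trajectory is None:
            return state.step
        last = max(n_steps(state.current_trajectory) - 1, 0)
        # Clamp so the cursor never moves past the final step.
        state.step = min(state.step + 1, last)
        return state.step

    return on_next
```

Because the handler mutates the captured object in place, every closure built from the same state sees the same cursor, which is why the state does not need to be serializable.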
CLI-style inspection helpers used by the viewer and exported for ad-hoc scripts.
Formatting and data-extraction helpers (HTML rendering, trace fragments, step
summaries), plus _promote_ghost_episodes(exp_dir) — best-effort sweep run on
every UI refresh:
- RUNNING + ray (or no exp_status) → promote when the per-episode heartbeat is older than GHOST_TIMEOUT (the should_sweep_running_to_stale predicate).
- RUNNING + sequential + driver_dead → promote immediately (the driver IS the worker; both are dead).
- QUEUED + driver_dead → promote (no worker will ever pick it up if the scheduler is gone). QUEUED is never promoted when the driver is alive: in a large parallel batch, tasks legitimately wait hours for a slot.
The "is the driver alive?" decision lives with the type it queries: see
is_driver_alive(exp_status, exp_dir, *, timeout_s) in
cube_harness.experiment_status for the mode-aware logic. Same shape as
should_sweep_running_to_stale for episode statuses — predicate over the
status object, callable from any consumer (viewer, monitoring, reports).
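The three promotion rules listed earlier can be condensed into one decision function. The following is a minimal sketch, not the module's code: Status, should_promote, the mode strings, and the GHOST_TIMEOUT_S value are illustrative assumptions, and the real logic lives in _promote_ghost_episodes, should_sweep_running_to_stale, and is_driver_alive. Leaving a RUNNING sequential episode alone while its driver is alive is also an assumption, since the rules do not state that case explicitly.

```python
from enum import Enum
from typing import Optional

GHOST_TIMEOUT_S = 600.0  # placeholder; the real GHOST_TIMEOUT value is not given here


class Status(str, Enum):  # stand-in for the real episode status type
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    STALE = "STALE"


def should_promote(
    status: Status,
    mode: Optional[str],  # "ray", "sequential", or None when there is no exp_status
    heartbeat_age_s: float,
    driver_alive: bool,
) -> bool:
    """Return True if an in-flight episode should be rewritten as STALE."""
    if status is Status.RUNNING:
        if mode == "sequential":
            # The driver is the worker, so a dead driver means a dead episode.
            return not driver_alive
        # Ray mode, or no experiment status: rely on the per-episode heartbeat.
        return heartbeat_age_s > GHOST_TIMEOUT_S
    if status is Status.QUEUED:
        # Queued work may wait for hours; only a dead driver makes it a ghost.
        return not driver_alive
    return False
```

Keeping the decision a pure function of status, mode, heartbeat age, and driver liveness makes each rule testable without touching the filesystem.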
A "UI step" is one environment observation paired with the agent action that follows it. Navigation moves between environment steps. For UI step N:
- Shows the Nth EnvironmentOutput (screenshot, axtree, reward, etc.)
- Shows the AgentOutput that immediately follows it (actions, LLM call, thoughts)
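The pairing described above can be sketched as follows. This assumes a trajectory is a flat list that interleaves environment and agent outputs; the two dataclasses are simplified stand-ins for the real EnvironmentOutput and AgentOutput types, and pair_ui_steps is a hypothetical helper, not part of the module.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class EnvironmentOutput:  # stand-in: the real type also carries screenshot, axtree, etc.
    reward: float = 0.0


@dataclass
class AgentOutput:  # stand-in: the real type also carries actions and the LLM call
    thoughts: str = ""


Step = Union[EnvironmentOutput, AgentOutput]


def pair_ui_steps(
    steps: list[Step],
) -> list[tuple[EnvironmentOutput, Optional[AgentOutput]]]:
    """Build one UI step per environment output, with the agent output after it."""
    pairs = []
    for i, item in enumerate(steps):
        if not isinstance(item, EnvironmentOutput):
            continue
        nxt = steps[i + 1] if i + 1 < len(steps) else None
        # The final observation usually has no agent output following it.
        pairs.append((item, nxt if isinstance(nxt, AgentOutput) else None))
    return pairs
```

With this shape, UI step N is simply pairs[N], and the last UI step may carry None in place of an agent output.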
- Read-only for trajectory data: the viewer never modifies trajectories, logs, or configs. The single exception is _promote_ghost_episodes writing STALE into status.json files for in-flight episodes whose driver is provably dead (see xray_utils above). This is gated by experiment_status.json so the viewer cannot accidentally kill live work.
- Handles V2 (episodes/) and V1 (jsonl) layouts via FileStorage.
- Background loading: a worker thread populates trajectories incrementally; stale threads self-abort by comparing _bg_gen.
- Displays _missing=True stub trajectories (planned but never ran) distinctly.
- Injects _failure_text from failure.txt into metadata when a trajectory has no end_time, so failed episodes show their stack trace in the UI.
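The background-loading item in the list above relies on a generation counter so that an outdated worker stops on its own. Here is a minimal sketch of that pattern under stated assumptions: LoaderState, start_background_load, load_one, and the lock are hypothetical, and only the names trajectories, _bg_gen, and _bg_loading_done come from the description above.

```python
import threading
from typing import Callable, Iterable


class LoaderState:  # hypothetical holder for the fields named above
    def __init__(self) -> None:
        self.trajectories: list[object] = []
        self._bg_gen = 0
        self._bg_loading_done = False
        self._lock = threading.Lock()  # assumption: a lock guards shared fields


def start_background_load(
    state: LoaderState,
    sources: Iterable[str],
    load_one: Callable[[str], object],
) -> threading.Thread:
    """Start a loader thread; any older loader aborts once it sees a newer generation."""
    with state._lock:
        state._bg_gen += 1
        my_gen = state._bg_gen
        state.trajectories = []
        state._bg_loading_done = False

    def worker() -> None:
        for src in sources:
            traj = load_one(src)
            with state._lock:
                if state._bg_gen != my_gen:
                    return  # a newer load has started; drop this result and stop
                state.trajectories.append(traj)
        with state._lock:
            if state._bg_gen == my_gen:
                state._bg_loading_done = True

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

Checking the generation right before each append means a slow load_one call from an old request can never leak results into the newly selected set.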
- Gradio state is per-tab. Closing and reopening the browser resets the view; the server keeps running.
- Large trajectories (thousands of steps) are loaded lazily — switching trajectories may have noticeable latency on first open.
- The viewer caches step deserialization in-memory per session; very long sessions with many open trajectories can grow memory use.