feat(signals): emit scout run started + reaped lifecycle events #65034
Adds two scout-owned analytics events so the full run lifecycle is observable from events alone (no warehouse-sync lag):
- signals_scout_run_started: fired once the TaskRun + bridge row exist and the run has cleared the reap + single-flight guards, so it counts only runs that actually start. Pairs with signals_scout_run_finished (joined on run_id) for throughput and stall detection: a started with no finished is a run that died before finalize.
- signals_scout_run_reaped: fired when _self_heal_stale_runs reaps a stranded orphan. A reaped run never reaches the finalize path, so it emits no run_finished event and was previously visible only in the logs. Carries status_before + age_seconds so a routine one-off is distinguishable from the worker-death / mass-stall shape (e.g. the 06-16 fleet freeze).
Both best-effort, keyed on the team to match the existing run_finished event. 3 tests added; full scout-harness suite green (26).
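The "started with no finished" pairing described above can be sketched in a few lines. This is an illustration only: the flat event dicts (an `event` name plus a `run_id`) and the helper name `lifecycle_gaps` are assumptions made for the example, not the shape the PR or the warehouse actually uses.

```python
from typing import Iterable, Mapping


def lifecycle_gaps(events: Iterable[Mapping[str, str]]) -> dict[str, set[str]]:
    """Group run_ids by lifecycle gap (hypothetical helper, not from the PR)."""
    started: set[str] = set()
    finished: set[str] = set()
    reaped: set[str] = set()
    for e in events:
        if e["event"] == "signals_scout_run_started":
            started.add(e["run_id"])
        elif e["event"] == "signals_scout_run_finished":
            finished.add(e["run_id"])
        elif e["event"] == "signals_scout_run_reaped":
            reaped.add(e["run_id"])
    # started minus finished: runs still in flight plus runs that died before finalize.
    unfinished = started - finished
    return {"unfinished": unfinished, "reaped": unfinished & reaped}


sample = [
    {"event": "signals_scout_run_started", "run_id": "r1"},
    {"event": "signals_scout_run_finished", "run_id": "r1"},
    {"event": "signals_scout_run_started", "run_id": "r2"},
    {"event": "signals_scout_run_reaped", "run_id": "r2"},
    {"event": "signals_scout_run_started", "run_id": "r3"},
]
gaps = lifecycle_gaps(sample)
```

In this sample, `r2` is a strand the reaper already cleared and `r3` is either still running or stalled, which is the distinction the reaped event is meant to make visible.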
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1d8eaabb79
Make the stale-run reap a compare-and-set: a conditional UPDATE off QUEUED/IN_PROGRESS lets exactly one concurrent trigger win the transition, so a single stranded run can't double-count in the signals_scout_run_reaped worker-death/mass-stall signal. Also drop a date-specific operational reference from a code comment.
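A minimal sketch of the compare-and-set idea, using an in-memory SQLite table so it runs on its own. The table name `task_run`, the target status `FAILED`, and the helper `reap_if_still_stale` are assumptions made for illustration; the actual reaper works against the project's own models and may differ.

```python
import sqlite3

STALE_STATUSES = ("QUEUED", "IN_PROGRESS")


def reap_if_still_stale(conn: sqlite3.Connection, task_run_id: int) -> bool:
    """Return True only for the caller that actually performed the transition."""
    cur = conn.execute(
        # Conditional UPDATE: matches only while the row is still in a stale status.
        "UPDATE task_run SET status = 'FAILED' "
        "WHERE id = ? AND status IN (?, ?)",
        (task_run_id, *STALE_STATUSES),
    )
    conn.commit()
    # rowcount is 1 for the winner and 0 for any trigger that lost the race.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_run (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO task_run VALUES (1, 'IN_PROGRESS')")

first = reap_if_still_stale(conn, 1)   # row was IN_PROGRESS, so this call wins
second = reap_if_still_stale(conn, 1)  # row is already FAILED, so this call loses
```

Emitting the reaped event only when the function returns True is what keeps a single stranded run from being counted twice.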
Analytics lifecycle events are best-effort (all try/except-wrapped), the CAS race fix is strictly safer than before, and both bot-flagged issues (private incident context, double-emit race) were resolved in the follow-up commit visible in this diff. No data model, API, or dependency changes.
New commits pushed (delta classified label_absent) — stamphog approval dismissed; re-review running automatically.
Purely additive analytics instrumentation — no data model, API, or dependency changes. The CAS fix for the concurrent-reap double-emit is correct, all capture calls are best-effort (try/except), and both bot-flagged issues were resolved in the follow-up commit included in this diff.
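The best-effort pattern the reviews refer to could look roughly like the sketch below. The wrapper name `capture_best_effort`, the injected `capture` callable, and the `team_<id>` distinct-id format are all assumptions for the example, not the PR's actual code.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def capture_best_effort(
    capture: Callable[..., None],
    *,
    team_id: int,
    event: str,
    properties: dict[str, Any],
) -> bool:
    """Emit one analytics event; never let a capture failure reach the caller."""
    try:
        # Keyed on the team (the id format here is hypothetical).
        capture(distinct_id=f"team_{team_id}", event=event, properties=properties)
        return True
    except Exception:
        # Log and swallow: telemetry must not block the run or the reap.
        logger.exception("failed to capture %s", event)
        return False
```

Returning a boolean instead of raising lets the caller carry on with the run or the reap whether or not the event was delivered.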
Problem
We can't currently detect a stranded scout run from events alone. The dogfood fleet has exactly one outstanding observability gap (issue 09): a worker dying mid-run strands a `TaskRun` at `IN_PROGRESS`, which wedges that `(team, skill)` lane. The reaper that auto-clears these (#65028) just merged, but it ships no telemetry, so the strand is still invisible except for a Loki log line, and the masked 06-16 fleet freeze (~half the project-2 lanes dead for 4 days) went undetected because `last_run_at` kept advancing.
The two surfaces we have for scout-run health both miss the strand:
- `$ai_generation`: a stranded run produces no generation, so it's invisible except as diluted aggregate volume.
- `postgres_signals_signalscoutrun` ⋈ `system.task_runs`: this base syncs on a cadence (lags), and the scout warehouse views are events-first anyway.
We have `signals_scout_run_finished` but no `started` event and nothing for the reap, so throughput, stall, and worker-death can't be derived from events.
Changes
Adds two scout-owned analytics events so the full run lifecycle is event-derived with no warehouse-sync lag, alongside the existing `signals_scout_run_finished`:

| Event | Fires when | Properties |
| --- | --- | --- |
| `signals_scout_run_started` | The TaskRun + bridge row exist and the guards have passed (`on_task_run_created` hook) | `skill_name`, `skill_version`, `scout_config_id`, `run_id`, `task_run_id` |
| `signals_scout_run_reaped` | `_self_heal_stale_runs` reaps a stranded orphan | `skill_name`, `run_id`, `task_run_id`, `status_before`, `age_seconds`, `stale_cutoff_seconds` |

What this unlocks, all event-derived:
- `started` minus `finished` (joined on `run_id`) is the in-flight + stalled set; a `started` with no `finished` is a run that died before finalize.
- `started` fires only for runs that actually start (a skipped dispatch emits nothing), so it's the signal `last_run_at` fails to be: `last_run_at` advances on skipped dispatches too.
- A reaped run never reaches finalize, so it emits no `finished`; `signals_scout_run_reaped` is the strand's only event. A rising count is the 06-16 shape, caught within a tick of the cutoff rather than days late.
Both captures are best-effort (a failure never blocks the run or the reap) and keyed on the team to match `signals_scout_run_finished`.
How did you test this code?
I'm an agent (Claude Code). Automated tests only — no manual testing claimed.
Extended `test_scout_harness.py` and ran the full file (26 passed):
- `test_successful_run_captures_run_started_event`: a successful run emits `signals_scout_run_started` with the right team/skill/config/run/task_run identity.
- `test_stale_run_reap_captures_run_reaped_event`: reaping a stranded orphan emits `signals_scout_run_reaped` with `status_before`, `age_seconds`, and `stale_cutoff_seconds`.
- `test_successful_run_captures_run_finished_event` now expects both lifecycle captures in order (`started` then `finished`).
`ruff check` and `ruff format` clean.
Note: `test_scout_harness.py` can't be collected in isolation due to a pre-existing circular import (reproduces on clean master); ran via the temporal pre-import that CI's full-suite collection uses.
Automatic notifications
🤖 Agent context
Autonomy: Human-driven (agent-assisted)
Andy asked whether the scout fleet had enough observability to detect and alert on stale runs after the issue-09 reaper (#65028) merged, and steered toward event-derived metrics (warehouse tables lag) — specifically asking for a run-started event so throughput is derivable.
Investigation used the dogfooding skills (`/phs scouts-dogfooding`, `signals-alerts`, `signals-dwh`) plus reading the merged reaper. Confirmed live: 20 runs still stranded `in_progress` (the 06-16 freeze), the reaper merged but not yet deployed to the worker, and that `signals_scout_run_finished` (15k/14d) is already flowing but unused by any alert. Chose to emit `started` at the `on_task_run_created` hook (only point where both `run_id` and `task_run_id` exist and the guards have passed) and to add the `reaped` event in the reaper rather than reuse the generic `task_run_failed` it already fires, so the strand is a first-class, low-cardinality signal.
Deliberately out of scope: the alerts themselves (created via MCP on the dogfood project, documented in `signals-alerts`), and changing the `last_run_at` stamp-on-dispatch behavior. The reaper already neutralizes most of its masking, and these events key on real run rows instead.