feat(signals): emit scout run started + reaped lifecycle events by andrewm4894 · Pull Request #65034 · PostHog/posthog

andrewm4894 · 2026-06-20T17:29:16Z

Problem

We can't currently detect a stranded scout run from events alone. The dogfood fleet has exactly one outstanding observability gap (issue 09): a worker dying mid-run strands a TaskRun at IN_PROGRESS, which wedges that (team, skill) lane. The reaper that auto-clears these (#65028) just merged, but it ships no telemetry, so the strand is still invisible apart from a Loki log line. The masked 06-16 fleet freeze (~half the project-2 lanes dead for 4 days) went undetected because last_run_at kept advancing.

The two surfaces we have for scout-run health both miss the strand:

The 4 existing scout-fleet alerts derive from $ai_generation — a stranded run produces no generation, so it's invisible except as diluted aggregate volume.
The only place a run shows up as currently stuck is postgres_signals_signalscoutrun ⋈ system.task_runs, but that base syncs on a cadence (and therefore lags), and the scout warehouse views are events-first anyway.

We have signals_scout_run_finished but no started event and nothing for the reap, so throughput, stall, and worker-death can't be derived from events.

Changes

Adds two scout-owned analytics events so the full run lifecycle is event-derived with no warehouse-sync lag, alongside the existing signals_scout_run_finished:

Event	When	Key props
`signals_scout_run_started`	TaskRun + bridge row exist and the run has cleared the reap + single-flight guards (the `on_task_run_created` hook)	`skill_name`, `skill_version`, `scout_config_id`, `run_id`, `task_run_id`
`signals_scout_run_reaped`	`_self_heal_stale_runs` reaps a stranded orphan	`skill_name`, `run_id`, `task_run_id`, `status_before`, `age_seconds`, `stale_cutoff_seconds`

What this unlocks, all event-derived:

Throughput / stall: started minus finished (joined on run_id) is the in-flight + stalled set; a started with no finished is a run that died before finalize.
Trustworthy "did this lane run": started fires only for runs that actually start (a skipped dispatch emits nothing), so it's the signal last_run_at fails to be — last_run_at advances on skipped dispatches too.
Worker-death / mass-stall: a reaped run never reaches the finalize path, so it emits no finished; signals_scout_run_reaped is the strand's only event. A rising count is the 06-16 shape, caught within a tick of the cutoff rather than days late.
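
The "started minus finished" derivation above can be sketched as a set difference on run_id. This is a minimal illustration, not the warehouse query: the Event dataclass and the in_flight_or_stalled helper are hypothetical names, and only the event names and the run_id join key come from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a captured analytics event."""

    name: str
    run_id: str


def in_flight_or_stalled(events: list[Event]) -> set[str]:
    """Return run_ids that have a started event but no finished event."""
    started = {e.run_id for e in events if e.name == "signals_scout_run_started"}
    finished = {e.run_id for e in events if e.name == "signals_scout_run_finished"}
    return started - finished


events = [
    Event("signals_scout_run_started", "run-a"),
    Event("signals_scout_run_finished", "run-a"),
    Event("signals_scout_run_started", "run-b"),  # no finished: in flight or stalled
    Event("signals_scout_run_reaped", "run-b"),  # the strand's only terminal event
]
stalled = in_flight_or_stalled(events)
```

A run in this set that also has a signals_scout_run_reaped event is a confirmed strand; one without is either still running or not yet past the stale cutoff.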

Both captures are best-effort (a failure never blocks the run or the reap) and keyed on the team to match signals_scout_run_finished.
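
As a rough sketch of what "best-effort" means here: the capture call is wrapped so that an analytics failure is logged and swallowed rather than propagated into the run or the reap. The capture_best_effort helper, the injected capture callable, and the team_{id} distinct ID format are all assumptions for illustration; the PR's actual capture call is not shown on this page.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def capture_best_effort(
    capture: Callable[..., None],
    event: str,
    team_id: int,
    properties: dict[str, Any],
) -> bool:
    """Try to emit an event; never raise. Returns True if the capture succeeded."""
    try:
        # Keyed on the team (the distinct_id format is an assumption).
        capture(event=event, distinct_id=f"team_{team_id}", properties=properties)
        return True
    except Exception:
        # Log and swallow so the caller's run or reap is never blocked.
        logger.exception("failed to capture %s", event)
        return False


sent: list[dict[str, Any]] = []


def ok_capture(**kwargs: Any) -> None:
    sent.append(kwargs)


def broken_capture(**kwargs: Any) -> None:
    raise RuntimeError("analytics backend unavailable")


props = {"skill_name": "example-skill", "run_id": "run-1", "task_run_id": "tr-1"}
ok_result = capture_best_effort(ok_capture, "signals_scout_run_started", 2, props)
failed_result = capture_best_effort(broken_capture, "signals_scout_run_started", 2, props)
```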

How did you test this code?

I'm an agent (Claude Code). Automated tests only — no manual testing claimed.

Extended test_scout_harness.py and ran the full file (26 passed):

test_successful_run_captures_run_started_event — a successful run emits signals_scout_run_started with the right team/skill/config/run/task_run identity.
test_stale_run_reap_captures_run_reaped_event — reaping a stranded orphan emits signals_scout_run_reaped with status_before, age_seconds, and stale_cutoff_seconds.
Updated test_successful_run_captures_run_finished_event to expect both lifecycle captures in order (started then finished).
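
For readers unfamiliar with the pattern, an ordering assertion like the one in the updated test can be sketched with unittest.mock. The run_scout function below is a hypothetical stand-in for the harness and the positional capture signature is an assumption; this is not the code in test_scout_harness.py.

```python
from unittest.mock import MagicMock


def run_scout(capture) -> None:
    """Hypothetical stand-in for a successful scout run."""
    capture("signals_scout_run_started", {"run_id": "run-1"})
    # ... the run's actual work would happen here ...
    capture("signals_scout_run_finished", {"run_id": "run-1"})


capture = MagicMock()
run_scout(capture)

# call_args_list preserves call order, so the lifecycle order can be asserted directly.
captured_names = [c.args[0] for c in capture.call_args_list]
```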

ruff check and ruff format clean.

Note: test_scout_harness.py can't be collected in isolation due to a pre-existing circular import (reproduces on clean master); ran via the temporal pre-import that CI's full-suite collection uses.

Automatic notifications

Publish to changelog?
Alert Sales and Marketing teams?

🤖 Agent context

Autonomy: Human-driven (agent-assisted)

Andy asked whether the scout fleet had enough observability to detect and alert on stale runs after the issue-09 reaper (#65028) merged, and steered toward event-derived metrics (warehouse tables lag) — specifically asking for a run-started event so throughput is derivable.

Investigation used the dogfooding skills (/phs scouts-dogfooding, signals-alerts, signals-dwh) plus reading the merged reaper. Confirmed live: 20 runs still stranded in_progress (the 06-16 freeze), the reaper merged but not yet deployed to the worker, and that signals_scout_run_finished (15k/14d) is already flowing but unused by any alert. Chose to emit started at the on_task_run_created hook (only point where both run_id and task_run_id exist and the guards have passed) and to add the reaped event in the reaper rather than reuse the generic task_run_failed it already fires, so the strand is a first-class, low-cardinality signal.

Deliberately out of scope: the alerts themselves (created via MCP on the dogfood project, documented in signals-alerts), and changing the last_run_at stamp-on-dispatch behavior — the reaper already neutralizes most of its masking, and these events key on real run rows instead.

Adds two scout-owned analytics events so the full run lifecycle is observable from events alone (no warehouse-sync lag):

signals_scout_run_started: fired once the TaskRun + bridge row exist and the run has cleared the reap + single-flight guards, so it counts only runs that actually start. Pairs with signals_scout_run_finished (joined on run_id) for throughput and stall detection: a started with no finished is a run that died before finalize.
signals_scout_run_reaped: fired when _self_heal_stale_runs reaps a stranded orphan. A reaped run never reaches the finalize path, so it emits no run_finished event and was previously visible only in the logs. Carries status_before + age_seconds so a routine one-off is distinguishable from the worker-death / mass-stall shape (e.g. the 06-16 fleet freeze).

Both best-effort, keyed on the team to match the existing run_finished event. 3 tests added; full scout-harness suite green (26).

assign-reviewers-posthog · 2026-06-20T17:29:35Z

👀 Auto-assigned reviewers

These soft owners were skipped because they only have minor changes here. Nothing blocks merge, so self-assign if you'd like a look:

@PostHog/team-devex (AGENTS.md)

Soft owners come from CODEOWNERS-soft and each product's product.yaml. Generated files and lockfiles are ignored when deciding ownership.

greptile-apps · 2026-06-20T17:33:03Z

Reviews (1): Last reviewed commit: "feat(signals): emit scout run started + ..."

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d8eaabb79


Make the stale-run reap a compare-and-set: a conditional UPDATE off QUEUED/IN_PROGRESS lets exactly one concurrent trigger win the transition, so a single stranded run can't double-count in the signals_scout_run_reaped worker-death/mass-stall signal. Also drop a date-specific operational reference from a code comment.
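
The compare-and-set idea can be illustrated with a conditional UPDATE against an in-memory SQLite table. This is a sketch under stated assumptions, not the change in runner.py: the task_run table, the try_reap helper, and the FAILED target status are hypothetical, and only the QUEUED/IN_PROGRESS precondition comes from the commit description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_run (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO task_run VALUES ('tr-1', 'IN_PROGRESS')")
conn.commit()


def try_reap(conn: sqlite3.Connection, task_run_id: str) -> bool:
    """Move a run off QUEUED/IN_PROGRESS; return True only if this call made the transition."""
    cur = conn.execute(
        # The status precondition is the "compare"; the SET is the "set".
        "UPDATE task_run SET status = 'FAILED' "
        "WHERE id = ? AND status IN ('QUEUED', 'IN_PROGRESS')",
        (task_run_id,),
    )
    conn.commit()
    # Exactly one updated row means this caller won and may emit the reaped event.
    return cur.rowcount == 1


first = try_reap(conn, "tr-1")   # wins the transition
second = try_reap(conn, "tr-1")  # finds no matching row, so emits nothing
```

In Django the same shape is usually a filtered QuerySet.update(), which returns the number of rows matched; whether the PR uses that form is not shown here.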

github-actions

Analytics lifecycle events are best-effort (all try/except-wrapped), the CAS race fix is strictly safer than before, and both bot-flagged issues (private incident context, double-emit race) were resolved in the follow-up commit visible in this diff. No data model, API, or dependency changes.

New commits pushed (delta classified label_absent) — stamphog approval dismissed; re-review running automatically.

github-actions

Purely additive analytics instrumentation — no data model, API, or dependency changes. The CAS fix for the concurrent-reap double-emit is correct, all capture calls are best-effort (try/except), and both bot-flagged issues were resolved in the follow-up commit included in this diff.

deployment-status-posthog · 2026-06-20T20:41:06Z

Deploy status

Environment	Status	Deployed At	Workflow
dev	✅ Deployed	2026-06-20 20:41 UTC	Run
prod-us	✅ Deployed	2026-06-20 20:52 UTC	Run
prod-eu	✅ Deployed	2026-06-20 21:47 UTC	Run

andrewm4894 self-assigned this Jun 20, 2026

assign-reviewers-posthog Bot requested a review from a team June 20, 2026 17:29

andrewm4894 removed the request for review from a team June 20, 2026 17:30

chatgpt-codex-connector Bot reviewed Jun 20, 2026


Comment thread products/signals/backend/scout_harness/runner.py Outdated

Comment thread products/signals/backend/scout_harness/runner.py

andrewm4894 added the stamphog label (Request AI review from stamphog) Jun 20, 2026

andrewm4894 enabled auto-merge (squash) June 20, 2026 17:53

github-actions Bot previously approved these changes Jun 20, 2026


andrewm4894 added and removed the stamphog label (Request AI review from stamphog) Jun 20, 2026

github-actions Bot approved these changes Jun 20, 2026


andrewm4894 merged commit e5d0cb8 into master Jun 20, 2026
401 of 456 checks passed

andrewm4894 deleted the observability/scout-reap-telemetry branch June 20, 2026 19:56

andrewm4894 merged 2 commits into master from observability/scout-reap-telemetry
