feat(signals): emit scout run started + reaped lifecycle events #65034

Merged
andrewm4894 merged 2 commits into master from observability/scout-reap-telemetry on Jun 20, 2026

@andrewm4894

Member

Problem

We can't currently detect a stranded scout run from events alone. The dogfood fleet has exactly one outstanding observability gap (issue 09): a worker dying mid-run strands a TaskRun at IN_PROGRESS, which wedges that (team, skill) lane. The reaper that auto-clears these (#65028) just merged, but it ships no telemetry, so the strand is still invisible except for a Loki log line. The masked 06-16 fleet freeze (~half the project-2 lanes dead for 4 days) went undetected because last_run_at kept advancing.

The two surfaces we have for scout-run health both miss the strand:

  • The 4 existing scout-fleet alerts derive from $ai_generation — a stranded run produces no generation, so it's invisible except as diluted aggregate volume.
  • The only place a run shows up as currently stuck is postgres_signals_signalscoutrunsystem.task_runs, but that base syncs on a cadence and therefore lags, and the scout warehouse views are events-first anyway.

We have signals_scout_run_finished but no started event and nothing for the reap, so throughput, stall, and worker-death can't be derived from events.

Changes

Adds two scout-owned analytics events so the full run lifecycle is event-derived with no warehouse-sync lag, alongside the existing signals_scout_run_finished:

| Event | When | Key props |
| --- | --- | --- |
| signals_scout_run_started | TaskRun + bridge row exist and the run has cleared the reap + single-flight guards (the on_task_run_created hook) | skill_name, skill_version, scout_config_id, run_id, task_run_id |
| signals_scout_run_reaped | _self_heal_stale_runs reaps a stranded orphan | skill_name, run_id, task_run_id, status_before, age_seconds, stale_cutoff_seconds |
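
As a rough sketch of the two payloads listed above, the key props could be written down as typed dicts. Only the event and property names come from this PR; the class names, the value types, and the `missing_props` helper are assumptions for illustration.

```python
from typing import TypedDict


class ScoutRunStartedProps(TypedDict):
    """Key props of signals_scout_run_started (value types are assumed)."""

    skill_name: str
    skill_version: str
    scout_config_id: str
    run_id: str
    task_run_id: str


class ScoutRunReapedProps(TypedDict):
    """Key props of signals_scout_run_reaped (value types are assumed)."""

    skill_name: str
    run_id: str
    task_run_id: str
    status_before: str
    age_seconds: float
    stale_cutoff_seconds: float


def missing_props(props: dict, shape: type) -> set[str]:
    """Return the keys declared on `shape` that are absent from `props`."""
    return set(shape.__annotations__) - set(props)
```

A helper like `missing_props` could back a test assertion that a captured event carries every expected key, without pinning the exact values.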

What this unlocks, all event-derived:

  • Throughput / stall: started minus finished (joined on run_id) is the in-flight + stalled set; a started with no finished is a run that died before finalize.
  • Trustworthy "did this lane run": started fires only for runs that actually start (a skipped dispatch emits nothing), so it's the signal last_run_at fails to be — last_run_at advances on skipped dispatches too.
  • Worker-death / mass-stall: a reaped run never reaches the finalize path, so it emits no finished; signals_scout_run_reaped is the strand's only event. A rising count is the 06-16 shape, caught within a tick of the cutoff rather than days late.
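
The started-minus-finished derivation above can be sketched in plain Python. The event dict shape (an "event" name plus a "properties" mapping) and the function name are assumptions for illustration; in practice this would more likely be a query over the events table.

```python
from collections.abc import Iterable


def unfinished_run_ids(events: Iterable[dict]) -> set[str]:
    """Return run_ids that have a started event but no finished event.

    The result is the in-flight + stalled set: runs still executing plus
    runs that died before finalize.
    """
    started: set[str] = set()
    finished: set[str] = set()
    for event in events:
        run_id = event.get("properties", {}).get("run_id")
        if run_id is None:
            continue  # ignore events that carry no run identity
        if event.get("event") == "signals_scout_run_started":
            started.add(run_id)
        elif event.get("event") == "signals_scout_run_finished":
            finished.add(run_id)
    return started - finished
```

Separating in-flight from stalled would additionally need the started timestamp compared against a cutoff, which this sketch leaves out.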

Both captures are best-effort (a failure never blocks the run or the reap) and keyed on the team to match signals_scout_run_finished.
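
A minimal sketch of what "best-effort" means here, assuming a generic capture callable. The wrapper name, the `capture` signature, and the `team_<id>` distinct id format are hypothetical and not taken from the actual runner code.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def capture_best_effort(
    capture: Callable[..., Any], event: str, team_id: int, properties: dict
) -> bool:
    """Try to emit an analytics event; never let a failure reach the caller.

    Returns True if the capture call completed and False if it raised.
    """
    try:
        # Hypothetical keying: one distinct id per team.
        capture(event=event, distinct_id=f"team_{team_id}", properties=properties)
        return True
    except Exception:
        # Log and swallow so the run (or the reap) proceeds regardless.
        logger.exception("failed to capture %s", event)
        return False
```

The trade-off is that a capture outage silently drops lifecycle events, so gaps in the event stream would need to be read with that in mind.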

How did you test this code?

I'm an agent (Claude Code). Automated tests only — no manual testing claimed.

Extended test_scout_harness.py and ran the full file (26 passed):

  • test_successful_run_captures_run_started_event — a successful run emits signals_scout_run_started with the right team/skill/config/run/task_run identity.
  • test_stale_run_reap_captures_run_reaped_event — reaping a stranded orphan emits signals_scout_run_reaped with status_before, age_seconds, and stale_cutoff_seconds.
  • Updated test_successful_run_captures_run_finished_event to expect both lifecycle captures in order (started then finished).
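
For readers unfamiliar with the pattern, an ordering assertion like the one in the last bullet can be sketched with unittest.mock. Here `run_scout` is a hypothetical stand-in for the real harness entry point, and the positional capture signature is an assumption.

```python
from unittest.mock import MagicMock


def run_scout(capture, run_id: str) -> None:
    """Hypothetical stand-in for a successful scout run."""
    capture("signals_scout_run_started", {"run_id": run_id})
    # ... the actual run would happen here ...
    capture("signals_scout_run_finished", {"run_id": run_id})


def captured_event_names(capture: MagicMock) -> list[str]:
    """Return the event names passed to the mocked capture, in call order."""
    return [recorded.args[0] for recorded in capture.call_args_list]


def check_lifecycle_order() -> list[str]:
    """Run the stand-in against a mock and return the captured order."""
    capture = MagicMock()
    run_scout(capture, "run-1")
    return captured_event_names(capture)
```

Comparing the full ordered list, rather than checking each event separately, is what catches a finished event firing before started.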

ruff check and ruff format clean.

Note: test_scout_harness.py can't be collected in isolation due to a pre-existing circular import (reproduces on clean master); ran via the temporal pre-import that CI's full-suite collection uses.

Automatic notifications

  • Publish to changelog?
  • Alert Sales and Marketing teams?

🤖 Agent context

Autonomy: Human-driven (agent-assisted)

Andy asked whether the scout fleet had enough observability to detect and alert on stale runs after the issue-09 reaper (#65028) merged, and steered toward event-derived metrics (warehouse tables lag) — specifically asking for a run-started event so throughput is derivable.

Investigation used the dogfooding skills (/phs scouts-dogfooding, signals-alerts, signals-dwh) plus reading the merged reaper. Confirmed live: 20 runs still stranded in_progress (the 06-16 freeze), the reaper merged but not yet deployed to the worker, and that signals_scout_run_finished (15k/14d) is already flowing but unused by any alert. Chose to emit started at the on_task_run_created hook (only point where both run_id and task_run_id exist and the guards have passed) and to add the reaped event in the reaper rather than reuse the generic task_run_failed it already fires, so the strand is a first-class, low-cardinality signal.

Deliberately out of scope: the alerts themselves (created via MCP on the dogfood project, documented in signals-alerts), and changing the last_run_at stamp-on-dispatch behavior — the reaper already neutralizes most of its masking, and these events key on real run rows instead.

Adds two scout-owned analytics events so the full run lifecycle is observable
from events alone (no warehouse-sync lag):

- signals_scout_run_started: fired once the TaskRun + bridge row exist and the
  run has cleared the reap + single-flight guards, so it counts only runs that
  actually start. Pairs with signals_scout_run_finished (joined on run_id) for
  throughput and stall detection: a started with no finished is a run that died
  before finalize.
- signals_scout_run_reaped: fired when _self_heal_stale_runs reaps a stranded
  orphan. A reaped run never reaches the finalize path, so it emits no
  run_finished event and was previously visible only in the logs. Carries
  status_before + age_seconds so a routine one-off is distinguishable from the
  worker-death / mass-stall shape (e.g. the 06-16 fleet freeze).

Both best-effort, keyed on the team to match the existing run_finished event.
3 tests added; full scout-harness suite green (26).
@andrewm4894 andrewm4894 self-assigned this Jun 20, 2026
@assign-reviewers-posthog assign-reviewers-posthog Bot requested a review from a team June 20, 2026 17:29
@assign-reviewers-posthog

👀 Auto-assigned reviewers

These soft owners were skipped because they only have minor changes here. Nothing blocks merge, so self-assign if you'd like a look:

  • @PostHog/team-devex (AGENTS.md)

Soft owners come from CODEOWNERS-soft and each product's product.yaml. Generated files and lockfiles are ignored when deciding ownership.

@andrewm4894 andrewm4894 removed the request for review from a team June 20, 2026 17:30
@greptile-apps

greptile-apps Bot commented Jun 20, 2026

Contributor

Reviews (1): Last reviewed commit: "feat(signals): emit scout run started + ..." | Re-trigger Greptile

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d8eaabb79

Comment thread products/signals/backend/scout_harness/runner.py Outdated
Comment thread products/signals/backend/scout_harness/runner.py
Make the stale-run reap a compare-and-set: a conditional UPDATE off
QUEUED/IN_PROGRESS lets exactly one concurrent trigger win the
transition, so a single stranded run can't double-count in the
signals_scout_run_reaped worker-death/mass-stall signal. Also drop a
date-specific operational reference from a code comment.
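
The compare-and-set idea in this commit can be illustrated with a small sqlite3 sketch. The table name, the schema, and the FAILED target status are assumptions for the demo and not the actual TaskRun model.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory table holding one stranded run (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task_run (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute("INSERT INTO task_run VALUES ('tr-1', 'IN_PROGRESS')")
    conn.commit()
    return conn


def try_reap(conn: sqlite3.Connection, task_run_id: str) -> bool:
    """Attempt the reap as a conditional UPDATE.

    The row changes only if it is still QUEUED or IN_PROGRESS, so of several
    callers racing on the same run exactly one sees rowcount == 1. Only that
    caller should go on to emit signals_scout_run_reaped.
    """
    cursor = conn.execute(
        "UPDATE task_run SET status = 'FAILED' "
        "WHERE id = ? AND status IN ('QUEUED', 'IN_PROGRESS')",
        (task_run_id,),
    )
    conn.commit()
    return cursor.rowcount == 1
```

In Django the same shape is usually a filtered `QuerySet.update()`, which returns the number of rows matched; whether the runner uses exactly that call is not shown on this page.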
@andrewm4894 andrewm4894 added the stamphog Request AI review from stamphog label Jun 20, 2026
@andrewm4894 andrewm4894 enabled auto-merge (squash) June 20, 2026 17:53
github-actions Bot previously approved these changes Jun 20, 2026

@github-actions github-actions Bot left a comment

Contributor

Analytics lifecycle events are best-effort (all try/except-wrapped), the CAS race fix is strictly safer than before, and both bot-flagged issues (private incident context, double-emit race) were resolved in the follow-up commit visible in this diff. No data model, API, or dependency changes.

@github-actions github-actions Bot dismissed their stale review June 20, 2026 17:55

New commits pushed (delta classified label_absent) — stamphog approval dismissed; re-review running automatically.

@andrewm4894 andrewm4894 added and removed the stamphog label (Request AI review from stamphog) Jun 20, 2026

@github-actions github-actions Bot left a comment

Contributor

Purely additive analytics instrumentation — no data model, API, or dependency changes. The CAS fix for the concurrent-reap double-emit is correct, all capture calls are best-effort (try/except), and both bot-flagged issues were resolved in the follow-up commit included in this diff.

@andrewm4894 andrewm4894 merged commit e5d0cb8 into master Jun 20, 2026
401 of 456 checks passed
@andrewm4894 andrewm4894 deleted the observability/scout-reap-telemetry branch June 20, 2026 19:56
@deployment-status-posthog

deployment-status-posthog Bot commented Jun 20, 2026

Deploy status

| Environment | Status | Deployed At | Workflow |
| --- | --- | --- | --- |
| dev | ✅ Deployed | 2026-06-20 20:41 UTC | Run |
| prod-us | ✅ Deployed | 2026-06-20 20:52 UTC | Run |
| prod-eu | ✅ Deployed | 2026-06-20 21:47 UTC | Run |

Labels

stamphog Request AI review from stamphog
