Summary
Optimize the agent learning / reflection pipeline so it can run frequently
without burning orchestrator-tier inference. Introduce a dedicated cheap summarizer LLM for reflection + transcript-ingest synthesis, compress
tool-call history before reflection sees it, and scope reflections to the
orchestrator agent only.
Follow-up to the heuristic transcript-ingest pipeline shipped in #1406 — that PR keeps the seam clean for an LLM-driven extractor to plug
in without touching callers; this issue covers the "real" LLM path.
Problem
The current learning pipeline has three pressure points:
1. Reflections need a model, but not the orchestrator's. Reflection / ingest is going to fire often (per session-memory threshold crossing, on transcript close, on segment close). Running it on the same high-tier model the orchestrator uses is wasteful — most of what reflection produces is short, structured summaries that a cheap model handles fine. A separate summarizer tier keeps the hot path expensive and the cold path cheap.
2. Tool-call history is the dominant token cost. Reflection over a raw transcript drags every tool call's full output through the summarizer, which (a) inflates cost and (b) blows past the summarizer's context window. We need a compression / concatenation pass that collapses tool calls into per-tool digests (count, success rate, key outputs) before reflection sees them.
3. Reflection should be orchestrator-only. Today the hooks fire on any agent that crosses the threshold, including sub-agents and specialists. Sub-agent transcripts are short-lived, scoped to a single delegation, and almost never carry durable user context worth surfacing in future chats. Restricting reflection to the orchestrator removes a class of low-signal extractions and matches the user's mental model of "what the assistant remembers across chats."
Constraints worth calling out:
- Summarizer context window is significantly smaller than the orchestrator's, so any pre-summarizer compression has to be aggressive enough to fit a multi-turn transcript into the smaller window.
- Reflection / ingest must remain background-first — the orchestrator turn's user-visible latency must not regress.
- The LLM path must be opt-out-able so users without a configured summarizer fall back to the existing heuristic path from #1406.
Solution (optional)
Sketch — happy to iterate:
New `SummarizerProvider` trait + config knob. Pluggable model,
separate from the orchestrator provider. Cloud default + Ollama
fallback. Carries its own context-window cap so callers can budget.
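A minimal sketch of what that trait could look like; the `async_trait` dependency, the method names, and the config fields are illustrative assumptions, not existing APIs:

```rust
use async_trait::async_trait;

/// Hypothetical config knob; field names are illustrative.
#[derive(Debug, Clone, serde::Deserialize)]
pub struct SummarizerConfig {
    /// e.g. a cheap cloud model by default, or a local Ollama model as fallback
    pub model: String,
    /// The summarizer's own context-window cap, so callers can budget compression.
    pub context_window_tokens: usize,
}

/// Pluggable summarizer tier, separate from the orchestrator provider.
#[async_trait]
pub trait SummarizerProvider: Send + Sync {
    /// Cheap, short-form completion used by reflection / transcript-ingest synthesis.
    async fn summarize(&self, prompt: &str) -> anyhow::Result<String>;

    /// Exposed so the digest layer can size its pre-summarizer compression.
    fn context_window_tokens(&self) -> usize;
}
```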
Tool-call digest layer. Before any reflection / ingest LLM call, collapse tool messages into a `ToolCallDigest { name, count, success_rate, p95_duration_ms, sample_inputs, sample_outputs }` shape. Drop raw outputs past a small per-tool cap.
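A sketch of the digest pass under assumed transcript types; `ToolCallRecord`, the per-tool sample cap, and the truncation length are placeholders, not the real message shapes:

```rust
use std::collections::HashMap;

/// Per-tool digest that replaces raw tool messages before the summarizer sees them.
#[derive(Debug, Default)]
pub struct ToolCallDigest {
    pub name: String,
    pub count: u32,
    pub success_rate: f32,
    pub p95_duration_ms: u64,
    pub sample_inputs: Vec<String>,
    pub sample_outputs: Vec<String>,
}

/// Stand-in for whatever the transcript actually stores per tool call.
pub struct ToolCallRecord {
    pub tool_name: String,
    pub input: String,
    pub output: String,
    pub success: bool,
    pub duration_ms: u64,
}

const SAMPLES_PER_TOOL: usize = 3; // hypothetical per-tool sample cap
const SAMPLE_CHARS: usize = 256;   // hypothetical truncation length

/// Collapse raw tool-call history into per-tool digests, dropping raw
/// outputs past the per-tool cap so the result fits a small context window.
pub fn digest_tool_calls(calls: &[ToolCallRecord]) -> Vec<ToolCallDigest> {
    // (digest-in-progress, durations, success count) per tool name
    let mut by_tool: HashMap<&str, (ToolCallDigest, Vec<u64>, u32)> = HashMap::new();
    for call in calls {
        let (digest, durations, successes) =
            by_tool.entry(call.tool_name.as_str()).or_insert_with(|| {
                (
                    ToolCallDigest { name: call.tool_name.clone(), ..Default::default() },
                    Vec::new(),
                    0,
                )
            });
        digest.count += 1;
        durations.push(call.duration_ms);
        if call.success {
            *successes += 1;
        }
        if digest.sample_inputs.len() < SAMPLES_PER_TOOL {
            digest.sample_inputs.push(call.input.chars().take(SAMPLE_CHARS).collect());
            digest.sample_outputs.push(call.output.chars().take(SAMPLE_CHARS).collect());
        }
    }
    by_tool
        .into_values()
        .map(|(mut digest, mut durations, successes)| {
            durations.sort_unstable();
            let p95_idx = durations.len().saturating_sub(1) * 95 / 100;
            digest.p95_duration_ms = durations.get(p95_idx).copied().unwrap_or(0);
            digest.success_rate = successes as f32 / digest.count as f32;
            digest
        })
        .collect()
}
```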
Orchestrator-only gating. Add an `is_orchestrator()` (or `agent_role`) check on the reflection / ingest hooks; skip silently for sub-agents. Keep the existing turn-level `ReflectionHook` for the orchestrator path.
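A sketch of the gate; `AgentRole` and the entry point are assumptions about the harness, shown only to pin down the skip-silently behavior:

```rust
/// Hypothetical role marker carried on the agent descriptor.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentRole {
    Orchestrator,
    SubAgent,
    Specialist,
}

impl AgentRole {
    pub fn is_orchestrator(self) -> bool {
        matches!(self, AgentRole::Orchestrator)
    }
}

/// Gate on the reflection / ingest hooks: only the orchestrator reflects.
pub fn maybe_spawn_reflection(role: AgentRole, spawn: impl FnOnce()) {
    if !role.is_orchestrator() {
        // Sub-agent / specialist transcripts are short-lived and low-signal;
        // skip without logging noise.
        return;
    }
    spawn();
}
```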
Two-stage extract for transcript-ingest. Stage 1 (heuristic, what #1406 shipped) generates candidates. Stage 2 (summarizer) merges near-duplicate candidates, scores importance, and writes the merged output back to `conversation_memory` / `conversation_reflections`.
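Roughly how the two stages could compose, reusing the `SummarizerProvider` sketch above; `MemoryCandidate`, the prompt, and the response parsing are illustrative, and the writes back to `conversation_memory` / `conversation_reflections` are elided:

```rust
/// Assumed shape of a stage-1 candidate from the #1406 heuristic extractor.
pub struct MemoryCandidate {
    pub text: String,
    pub importance: f32,
}

/// Two-stage ingest: heuristic candidates first, then a cheap summarizer
/// pass that merges near-duplicates and re-scores importance.
pub async fn ingest_transcript(
    transcript: &str,
    heuristic: impl Fn(&str) -> Vec<MemoryCandidate>, // stage 1 (#1406 path)
    summarizer: &dyn SummarizerProvider,              // stage 2 (this issue)
) -> anyhow::Result<Vec<MemoryCandidate>> {
    let candidates = heuristic(transcript);
    if candidates.is_empty() {
        return Ok(Vec::new());
    }
    // Prompt and response parsing are illustrative only; real importance
    // scores would come back from the model, not a placeholder constant.
    let prompt = format!(
        "Merge near-duplicate memory candidates; output one merged memory per line:\n{}",
        candidates.iter().map(|c| c.text.as_str()).collect::<Vec<_>>().join("\n"),
    );
    let merged = summarizer.summarize(&prompt).await?;
    Ok(merged
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(|line| MemoryCandidate { text: line.to_string(), importance: 0.5 })
        .collect())
}
```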
Telemetry: surface summarizer cost / latency / context-fill alongside
the existing reflection metrics so we can tune the trigger thresholds.
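One possible per-call metrics record to sit alongside the existing reflection metrics; every field name here is an assumption:

```rust
/// Per-call summarizer telemetry, emitted next to the existing reflection metrics.
#[derive(Debug, serde::Serialize)]
pub struct SummarizerCallMetrics {
    pub cost_usd: f64,
    pub latency_ms: u64,
    /// Fraction of the summarizer's context window the prompt consumed.
    pub context_fill: f32,
    /// How many raw tool calls the digest pass compressed away.
    pub tool_calls_compressed: u32,
}
```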
Acceptance criteria
- Summarizer provider integration — a dedicated cheap summarizer is configurable per workspace, distinct from the orchestrator provider, and used by the reflection + transcript-ingest paths.
- Tool-call compression — reflection / ingest never feeds raw multi-call tool history to the summarizer; calls are collapsed into per-tool digests that fit the summarizer's smaller context window.
- Orchestrator-only reflections — sub-agents and specialists no longer trigger reflection / transcript ingest; only the user-facing orchestrator does.
- Background-first preserved — orchestrator turn latency does not regress; summarizer calls run on the same fire-and-forget surface used by `spawn_session_memory_extraction` and `spawn_transcript_ingestion` (see the sketch after this list).
- Heuristic fallback preserved — the heuristic path from #1406 remains the source of truth when no summarizer is configured, so reflection doesn't silently break for offline users.
- Telemetry — summarizer cost, latency, context-fill, and how many tool calls were compressed in each pass are surfaced alongside the existing reflection metrics.
- Coverage — new behavior is exercised under the coverage gate (Vitest + cargo-llvm-cov, enforced by `.github/workflows/coverage.yml`).
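For the background-first criterion, a minimal sketch of the fire-and-forget shape, assuming a tokio runtime like the existing spawn_* helpers presumably use; nothing is awaited on the user-visible turn path:

```rust
use std::sync::Arc;

/// Fire-and-forget reflection: spawned and immediately detached, so a slow
/// or failing summarizer can never add latency to the orchestrator's turn.
pub fn spawn_reflection_task(
    summarizer: Arc<dyn SummarizerProvider>,
    digest_prompt: String,
) {
    tokio::spawn(async move {
        if let Err(err) = summarizer.summarize(&digest_prompt).await {
            // Log and drop; background reflection must fail silently.
            tracing::warn!(?err, "background reflection failed");
        }
    });
}
```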
Related modules:
- `src/openhuman/learning/reflection.rs`
- `src/openhuman/learning/transcript_ingest/` (the seam to extend with an LLM-driven extractor)
- `src/openhuman/agent/harness/session/turn.rs::spawn_session_memory_extraction`
- `src/openhuman/context/session_memory.rs`