Summary
Optimize the agent learning / reflection pipeline so it can run frequently
without burning orchestrator-tier inference. Introduce a dedicated cheap summarizer LLM for reflection + transcript-ingest synthesis, compress
tool-call history before reflection sees it, and scope reflections to the
orchestrator agent only.
Follow-up to the heuristic transcript-ingest pipeline shipped in #1406 — that PR keeps the seam clean for an LLM-driven extractor to plug
in without touching callers; this issue covers the "real" LLM path.
Problem
The current learning pipeline has three pressure points:
1. Reflections need a model, but not the orchestrator's. Reflection / ingest is going to fire often (per session-memory threshold crossing, on transcript close, on segment close). Running it on the same high-tier model the orchestrator uses is wasteful — most of what reflection produces is short, structured summaries that a cheap model handles fine. A separate summarizer tier keeps the hot path expensive and the cold path cheap.
2. Tool-call history is the dominant token cost. Reflection over a raw transcript drags every tool call's full output through the summarizer, which (a) inflates cost and (b) blows past the summarizer's context window. We need a compression / concatenation pass that collapses tool calls into per-tool digests (count, success rate, key outputs) before reflection sees them.
3. Reflection should be orchestrator-only. Today the hooks fire on any agent that crosses the threshold, including sub-agents and specialists. Sub-agent transcripts are short-lived, scoped to a single delegation, and almost never carry durable user context worth surfacing in future chats. Restricting reflection to the orchestrator removes a class of low-signal extractions and matches the user's mental model of "what the assistant remembers across chats."
Constraints worth calling out:
- Summarizer context window is significantly smaller than the orchestrator's, so any pre-summarizer compression has to be aggressive enough to fit a multi-turn transcript into the smaller window.
- Reflection / ingest must remain background-first — the orchestrator turn's user-visible latency must not regress.
- The LLM path must be opt-out-able so users without a configured summarizer fall back to the existing heuristic path from #1406.
Solution (optional)
Sketch — happy to iterate:
New `SummarizerProvider` trait + config knob. Pluggable model,
separate from the orchestrator provider. Cloud default + Ollama
fallback. Carries its own context-window cap so callers can budget.
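A minimal sketch of what that trait could look like; the `async_trait` dependency, the method names, and the config fields are illustrative assumptions, not existing APIs:

```rust
use async_trait::async_trait;

/// Hypothetical config knob; field names are illustrative.
#[derive(Debug, Clone, serde::Deserialize)]
pub struct SummarizerConfig {
    /// e.g. a cheap cloud model by default, or a local Ollama model as fallback
    pub model: String,
    /// The summarizer's own context-window cap, so callers can budget compression.
    pub context_window_tokens: usize,
}

/// Pluggable summarizer tier, separate from the orchestrator provider.
#[async_trait]
pub trait SummarizerProvider: Send + Sync {
    /// Cheap, short-form completion used by reflection / transcript-ingest synthesis.
    async fn summarize(&self, prompt: &str) -> anyhow::Result<String>;

    /// Exposed so the digest layer can size its pre-summarizer compression.
    fn context_window_tokens(&self) -> usize;
}
```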
Tool-call digest layer. Before any reflection / ingest LLM call, collapse tool messages into a `ToolCallDigest { name, count, success_rate, p95_duration_ms, sample_inputs, sample_outputs }` shape. Drop raw outputs past a small per-tool cap.
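A sketch of the digest pass under assumed transcript types; `ToolCallRecord`, the per-tool sample cap, and the truncation length are placeholders, not the real message shapes:

```rust
use std::collections::HashMap;

/// Per-tool digest that replaces raw tool messages before the summarizer sees them.
#[derive(Debug, Default)]
pub struct ToolCallDigest {
    pub name: String,
    pub count: u32,
    pub success_rate: f32,
    pub p95_duration_ms: u64,
    pub sample_inputs: Vec<String>,
    pub sample_outputs: Vec<String>,
}

/// Stand-in for whatever the transcript actually stores per tool call.
pub struct ToolCallRecord {
    pub tool_name: String,
    pub input: String,
    pub output: String,
    pub success: bool,
    pub duration_ms: u64,
}

const SAMPLES_PER_TOOL: usize = 3; // hypothetical per-tool sample cap
const SAMPLE_CHARS: usize = 256;   // hypothetical truncation length

/// Collapse raw tool-call history into per-tool digests, dropping raw
/// outputs past the per-tool cap so the result fits a small context window.
pub fn digest_tool_calls(calls: &[ToolCallRecord]) -> Vec<ToolCallDigest> {
    // (digest-in-progress, durations, success count) per tool name
    let mut by_tool: HashMap<&str, (ToolCallDigest, Vec<u64>, u32)> = HashMap::new();
    for call in calls {
        let (digest, durations, successes) =
            by_tool.entry(call.tool_name.as_str()).or_insert_with(|| {
                (
                    ToolCallDigest { name: call.tool_name.clone(), ..Default::default() },
                    Vec::new(),
                    0,
                )
            });
        digest.count += 1;
        durations.push(call.duration_ms);
        if call.success {
            *successes += 1;
        }
        if digest.sample_inputs.len() < SAMPLES_PER_TOOL {
            digest.sample_inputs.push(call.input.chars().take(SAMPLE_CHARS).collect());
            digest.sample_outputs.push(call.output.chars().take(SAMPLE_CHARS).collect());
        }
    }
    by_tool
        .into_values()
        .map(|(mut digest, mut durations, successes)| {
            durations.sort_unstable();
            let p95_idx = durations.len().saturating_sub(1) * 95 / 100;
            digest.p95_duration_ms = durations.get(p95_idx).copied().unwrap_or(0);
            digest.success_rate = successes as f32 / digest.count as f32;
            digest
        })
        .collect()
}
```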
Orchestrator-only gating. Add an `is_orchestrator()` (or `agent_role`) check on the reflection / ingest hooks; skip silently for sub-agents. Keep the existing turn-level `ReflectionHook` for the orchestrator path.
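A sketch of the gate; `AgentRole` and the entry point are assumptions about the harness, shown only to pin down the skip-silently behavior:

```rust
/// Hypothetical role marker carried on the agent descriptor.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentRole {
    Orchestrator,
    SubAgent,
    Specialist,
}

impl AgentRole {
    pub fn is_orchestrator(self) -> bool {
        matches!(self, AgentRole::Orchestrator)
    }
}

/// Gate on the reflection / ingest hooks: only the orchestrator reflects.
pub fn maybe_spawn_reflection(role: AgentRole, spawn: impl FnOnce()) {
    if !role.is_orchestrator() {
        // Sub-agent / specialist transcripts are short-lived and low-signal;
        // skip without logging noise.
        return;
    }
    spawn();
}
```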
Two-stage extract for transcript-ingest. Stage 1 (heuristic, what #1406 shipped) generates candidates. Stage 2 (summarizer) merges near-duplicate candidates, scores importance, and writes the merged output back to `conversation_memory` / `conversation_reflections`.
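Roughly how the two stages could compose, reusing the `SummarizerProvider` sketch above; `MemoryCandidate`, the prompt, and the response parsing are illustrative, and the writes back to `conversation_memory` / `conversation_reflections` are elided:

```rust
/// Assumed shape of a stage-1 candidate from the #1406 heuristic extractor.
pub struct MemoryCandidate {
    pub text: String,
    pub importance: f32,
}

/// Two-stage ingest: heuristic candidates first, then a cheap summarizer
/// pass that merges near-duplicates and re-scores importance.
pub async fn ingest_transcript(
    transcript: &str,
    heuristic: impl Fn(&str) -> Vec<MemoryCandidate>, // stage 1 (#1406 path)
    summarizer: &dyn SummarizerProvider,              // stage 2 (this issue)
) -> anyhow::Result<Vec<MemoryCandidate>> {
    let candidates = heuristic(transcript);
    if candidates.is_empty() {
        return Ok(Vec::new());
    }
    // Prompt and response parsing are illustrative only; real importance
    // scores would come back from the model, not a placeholder constant.
    let prompt = format!(
        "Merge near-duplicate memory candidates; output one merged memory per line:\n{}",
        candidates.iter().map(|c| c.text.as_str()).collect::<Vec<_>>().join("\n"),
    );
    let merged = summarizer.summarize(&prompt).await?;
    Ok(merged
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(|line| MemoryCandidate { text: line.to_string(), importance: 0.5 })
        .collect())
}
```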
Telemetry: surface summarizer cost / latency / context-fill alongside
the existing reflection metrics so we can tune the trigger thresholds.
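One possible per-call metrics record to sit alongside the existing reflection metrics; every field name here is an assumption:

```rust
/// Per-call summarizer telemetry, emitted next to the existing reflection metrics.
#[derive(Debug, serde::Serialize)]
pub struct SummarizerCallMetrics {
    pub cost_usd: f64,
    pub latency_ms: u64,
    /// Fraction of the summarizer's context window the prompt consumed.
    pub context_fill: f32,
    /// How many raw tool calls the digest pass compressed away.
    pub tool_calls_compressed: u32,
}
```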
Acceptance criteria
- Summarizer provider integration — a dedicated cheap summarizer is configurable per workspace, distinct from the orchestrator provider, and used by the reflection + transcript-ingest paths.
- Tool-call compression — reflection / ingest never feeds raw multi-call tool history to the summarizer; calls are collapsed into per-tool digests that fit the summarizer's smaller context window.
- Orchestrator-only reflections — sub-agents and specialists no longer trigger reflection / transcript ingest; only the user-facing orchestrator does.
- Background-first preserved — orchestrator turn latency does not regress; summarizer calls run on the same fire-and-forget surface used by `spawn_session_memory_extraction` and `spawn_transcript_ingestion` (see the sketch after this list).
- Heuristic fallback preserved — the heuristic path from #1406 remains the source of truth when no summarizer is configured, so reflection doesn't silently break for offline users.
- Telemetry — summarizer cost, latency, context-fill, and how many tool calls were compressed in each pass are surfaced alongside the existing reflection metrics.
- Coverage — new behavior is exercised under the coverage gate (Vitest + cargo-llvm-cov, enforced by `.github/workflows/coverage.yml`).
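For the background-first criterion, a minimal sketch of the fire-and-forget shape, assuming a tokio runtime like the existing spawn_* helpers presumably use; nothing is awaited on the user-visible turn path:

```rust
use std::sync::Arc;

/// Fire-and-forget reflection: spawned and immediately detached, so a slow
/// or failing summarizer can never add latency to the orchestrator's turn.
pub fn spawn_reflection_task(
    summarizer: Arc<dyn SummarizerProvider>,
    digest_prompt: String,
) {
    tokio::spawn(async move {
        if let Err(err) = summarizer.summarize(&digest_prompt).await {
            // Log and drop; background reflection must fail silently.
            tracing::warn!(?err, "background reflection failed");
        }
    });
}
```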
Related modules:
- `src/openhuman/learning/reflection.rs`
- `src/openhuman/learning/transcript_ingest/` (the seam to extend with an LLM-driven extractor)
- `src/openhuman/agent/harness/session/turn.rs::spawn_session_memory_extraction`
- `src/openhuman/context/session_memory.rs`