Description
Implement agentic reflection / continuous learning — ingest agent traces, derive improvement signal, and feed it back into the agent’s inputs (prompts, eval sets, eventually skills and ontology).
Use case. A B2B campaign agent — generates personalised outreach emails for telco B2B sales — that improves itself by reflecting on its own outputs.
Setup
Input data (synthetic, telco B2B context).
- ~10 customer profiles (industry, size, current products, account tenure).
- ~5 products to promote.
- ~20 eval cases = (customer, product) pairs with ground-truth annotations.
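One possible shape for the synthetic inputs, as a minimal sketch. The field names (`industry`, `size`, `current_products`, `tenure_months`, `expected_points`), the `case-001` ID format, and the "skip products the customer already owns" rule are assumptions for illustration; the ticket only fixes the categories of data and the rough counts.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class CustomerProfile:
    customer_id: str
    industry: str
    size: str                          # e.g. "SMB", "mid-market", "enterprise"
    current_products: tuple[str, ...]  # SKUs the customer already has
    tenure_months: int


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    customer: CustomerProfile
    product_sku: str
    # Ground-truth annotation (hypothetical format): points a good email should cover.
    expected_points: tuple[str, ...] = ()


def build_eval_cases(customers, skus, limit=20):
    """Pair customers with products they do not already own, capped at `limit`."""
    cases = []
    for customer, sku in product(customers, skus):
        if sku in customer.current_products:
            continue  # assumed rule: never promote a product the customer already has
        cases.append(EvalCase(f"case-{len(cases) + 1:03d}", customer, sku))
        if len(cases) == limit:
            break
    return cases
```

Because the pairs are taken in order, the cap favours the first customers in the list; a real seed script would probably sample pairs so that all profiles are represented.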
The agent.
- Input: customer profile + product to promote.
- Output: personalised outreach email (subject + body).
- Runtime config:
  - System prompt versioned in Langfuse.
  - One skill file on disk (e.g. b2b-email-style.md) — brand voice / personalization / CTA guidance.
- Seeded mediocre on purpose (minimal prompt, vague skill file) so the reflection loop has a clear target.
Scorers.
- LLM-as-judge: writing quality, personalization, groundedness.
- Heuristic: subject present, length range, no hallucinated SKUs, CTA present.
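The heuristic scorers listed above could be sketched roughly as follows. The SKU format (`SKU-` followed by digits), the CTA phrase list, and the 50 to 200 word range are all assumptions made for this example; the ticket does not specify them.

```python
import re
from dataclasses import dataclass


@dataclass
class Email:
    subject: str
    body: str


# Assumed SKU format and CTA phrases; neither is defined in the ticket.
SKU_PATTERN = re.compile(r"\bSKU-\d+\b")
CTA_PHRASES = ("book a call", "schedule a demo", "reply to this email", "get in touch")


def heuristic_scores(email, known_skus, min_words=50, max_words=200):
    """Return one boolean per heuristic check for a single generated email."""
    word_count = len(email.body.split())
    mentioned_skus = set(SKU_PATTERN.findall(email.body))
    body_lower = email.body.lower()
    return {
        "subject_present": bool(email.subject.strip()),
        "length_in_range": min_words <= word_count <= max_words,
        # Every SKU mentioned in the body must exist in the product catalogue.
        "no_hallucinated_skus": mentioned_skus <= set(known_skus),
        "cta_present": any(phrase in body_lower for phrase in CTA_PHRASES),
    }
```

Returning booleans per check keeps the per-case scoreboard easy to read; an aggregate score can be derived as the fraction of checks passed.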
Pipelines
agent_run — run the agent on the eval set, emit traces to Langfuse.
evaluation — fetch traces, run scorers, write per-case + aggregate scores.
reflection — meta-agent reads traces + scores + current prompt + current skill file. Produces:
- a narrative summary (what failed, what's being changed, why),
- a new system prompt,
- an updated skill file,
- new eval cases derived from failures.
apply — on confirmation, applies the three proposed changes (prompt → Langfuse, skill → disk, eval cases → Langfuse evaluation dataset).
Re-run agent_run + evaluation with the new prompt/skill to show measurable improvement.
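As a rough sketch of the hand-off between reflection and apply: the reflection output can be carried as one object, and apply can refuse to act without confirmation. `ReflectionProposal`, `apply_proposal`, and the two callables `push_prompt` and `push_eval_cases` are hypothetical names; the callables stand in for whatever Langfuse client calls the implementation ends up using, and no Langfuse API is assumed here.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ReflectionProposal:
    summary: str               # narrative: what failed, what changes, why
    new_system_prompt: str
    new_skill_markdown: str
    new_eval_cases: list       # e.g. list of dicts derived from failures


def apply_proposal(proposal, skill_path, push_prompt, push_eval_cases, confirmed):
    """Apply the three proposed changes, but only after explicit confirmation.

    `push_prompt` and `push_eval_cases` are injected callables (assumed
    wrappers around the prompt store and the evaluation dataset).
    """
    if not confirmed:
        return False
    push_prompt(proposal.new_system_prompt)                                      # prompt -> Langfuse
    Path(skill_path).write_text(proposal.new_skill_markdown, encoding="utf-8")   # skill -> disk
    push_eval_cases(proposal.new_eval_cases)                                     # eval cases -> dataset
    return True
```

Injecting the two writers keeps the function testable offline, which may matter for a demo that has to run without network access.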
UI flow
Single-page Streamlit dashboard. Three clearly marked steps, gated by state (idle → run_1_done → reflected → applied → run_2_done) so the demo can't be clicked out of order.
┌──────────────────────────────────────────────┐
│ B2B Campaign Agent — Reflection Demo │
├──────────────────────────────────────────────┤
│ Step 1: Generate emails │
│ [ ▶ Run Agent ] Score: — │
├──────────────────────────────────────────────┤
│ [ Pipeline | Emails | Scoreboard | Langfuse]│
├──────────────────────────────────────────────┤
│ Step 2: Reflect │
│ [ ▶ Run Reflection ] │
│ Summary: identified | fixed | reasons │
│ Prompt diff | Skill diff | New eval cases │
│ [ ✓ Approve & Apply ] │
├──────────────────────────────────────────────┤
│ Step 3: Re-run & compare │
│ [ ▶ Re-run Agent ] │
│ Before: 5.2/10 → After: 8.4/10 │
│ (side-by-side sample emails) │
└──────────────────────────────────────────────┘
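The state gating described above can be modelled as a small linear state machine. This is a sketch only: the action names (`run_agent`, `run_reflection`, `approve_and_apply`, `rerun_agent`) are assumptions chosen to match the four buttons in the wireframe.

```python
from enum import Enum


class DemoState(str, Enum):
    IDLE = "idle"
    RUN_1_DONE = "run_1_done"
    REFLECTED = "reflected"
    APPLIED = "applied"
    RUN_2_DONE = "run_2_done"


# Each action is allowed in exactly one state and moves the demo to the next one.
TRANSITIONS = {
    "run_agent": (DemoState.IDLE, DemoState.RUN_1_DONE),
    "run_reflection": (DemoState.RUN_1_DONE, DemoState.REFLECTED),
    "approve_and_apply": (DemoState.REFLECTED, DemoState.APPLIED),
    "rerun_agent": (DemoState.APPLIED, DemoState.RUN_2_DONE),
}


def is_enabled(action, state):
    """True if the button for `action` should be clickable in `state`."""
    return TRANSITIONS[action][0] == state


def advance(action, state):
    """Return the next state, or raise if the action is out of order."""
    required, next_state = TRANSITIONS[action]
    if state != required:
        raise ValueError(
            f"{action!r} requires state {required.value!r}, got {state.value!r}"
        )
    return next_state
```

In the dashboard, the current state would presumably live in `st.session_state`, with `is_enabled` driving the `disabled` argument of each `st.button`.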
Per-step behaviour:
- Step 1. Triggers agent_run + evaluation via subprocess. Tabs populate as the pipeline runs:
  - Pipeline — Kedro-Viz iframed (live node highlighting if the runs view cooperates).
  - Emails — sample agent outputs.
  - Scoreboard — per-case + aggregate metrics.
  - Langfuse — Streamlit panels built with the Langfuse SDK (recent traces, scores) with "Open in Langfuse →" links. Tab-switch to the Langfuse UI mid-demo for the credibility moment. Fallback: if the embedded panels can't be built cleanly, show the Langfuse UI directly instead.
- Step 2. Triggers reflection. First shows a plain-language summary generated by the meta-agent:
  - Identified — failure patterns it found (e.g. "ignored company size in 8/20 cases").
  - Fixed — what it changed in the prompt, skill file, and eval set.
  - Reasons — why each change addresses the identified issue.
  Below the summary, three diffs side-by-side: prompt (Langfuse), skill markdown (disk), proposed new eval cases (table). A single "Approve & Apply" button runs apply.
- Step 3. Triggers agent_run + evaluation again. Side-by-side before/after metrics and sample emails, with the Step 2 summary kept visible above so the audience can map each identified issue to the visible improvement. Failures from run 1 are now permanent regression cases.
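For the prompt and skill-file diffs in Step 2, the standard library's `difflib` is likely sufficient. A minimal sketch, where the helper name `text_diff` and the "(current)" / "(proposed)" labels are assumptions:

```python
import difflib


def text_diff(old, new, label):
    """Return a unified diff between the current and proposed text.

    Returns an empty string when the two texts are identical.
    """
    diff_lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=f"{label} (current)",
        tofile=f"{label} (proposed)",
        lineterm="",  # lines were split without newlines, so add none back
    )
    return "\n".join(diff_lines)
```

The resulting string could be shown with something like `st.code(diff, language="diff")`; the new eval cases are tabular rather than textual, so they are better rendered as a table than as a diff.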
End result
- Aggregate score visibly improves between run 1 and run 2 (e.g. ~5/10 → ~8/10).
- Three human-readable diffs (prompt, skill file, new eval cases) shown in the dashboard.
- Same 20 cases produce visibly better emails after one reflection cycle.
- New failures captured as regression cases for the next iteration.
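The before/after comparison can be computed from per-case scores keyed by case ID. The following is a sketch under the assumption that each run yields a `dict` of case ID to numeric score on the same scale; `compare_runs` is a hypothetical helper name.

```python
from statistics import mean


def compare_runs(before, after):
    """Compare two runs on the cases they share.

    `before` and `after` map case_id -> score (assumed 0-10 scale).
    Cases present in only one run (e.g. newly added regression cases)
    are excluded so the two means are computed over the same set.
    """
    shared = sorted(before.keys() & after.keys())
    if not shared:
        raise ValueError("the two runs have no eval cases in common")
    deltas = {case_id: after[case_id] - before[case_id] for case_id in shared}
    return {
        "before_mean": mean(before[c] for c in shared),
        "after_mean": mean(after[c] for c in shared),
        "improved": [c for c in shared if deltas[c] > 0],
        "regressed": [c for c in shared if deltas[c] < 0],
    }
```

Restricting the comparison to shared cases matters here because apply adds new eval cases between the two runs; including them in only the second mean would make the headline numbers incomparable.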
Closing line for the demo: the same loop trivially extends to cross-sell, pricing, or any other agent — different agent, same pipelines.
Out of scope for this demo
- Real telco data (use synthetic).
- Ontology / graph updates (can be added at the next iteration).
- Production-grade error handling, multi-tenancy, deployment.