Skip to content

Implement agentic reflection use case #161

@ElenaKhaustova

Description

@ElenaKhaustova

Description

Implement agentic reflection / continuous learning — ingest agent traces, derive improvement signal, and feed it back into the agent’s inputs (prompts, eval sets, eventually skills and ontology).

Use case. A B2B campaign agent — generates personalised outreach emails for telco B2B sales — that improves itself by reflecting on its own outputs.

Setup

Input data (synthetic, telco B2B context).

  • ~10 customer profiles (industry, size, current products, account tenure).
  • ~5 products to promote.
  • ~20 eval cases = (customer, product) pairs with ground-truth annotations.

The agent.

  • Input: customer profile + product to promote.
  • Output: personalised outreach email (subject + body).
  • Runtime config:
    • System prompt versioned in Langfuse.
    • One skill file on disk (e.g. b2b-email-style.md) — brand voice / personalization / CTA guidance.
  • Seeded mediocre on purpose (minimal prompt, vague skill file) so the reflection loop has a clear target.

Scorers.

  • LLM-as-judge: writing quality, personalization, groundedness.
  • Heuristic: subject present, length range, no hallucinated SKUs, CTA present.

Pipelines

  1. agent_run — run the agent on the eval set, emit traces to Langfuse.
  2. evaluation — fetch traces, run scorers, write per-case + aggregate scores.
  3. reflection — meta-agent reads traces + scores + current prompt + current skill file. Produces:
    • a narrative summary (what failed, what's being changed, why),
    • a new system prompt,
    • an updated skill file,
    • new eval cases derived from failures.
  4. apply — applies the three on confirmation (prompt → Langfuse, skill → disk, eval cases → Langfuse evaluation dataset).

Re-run agent_run + evaluation with the new prompt/skill to show measurable improvement.

UI flow

Single-page Streamlit dashboard. Three clearly-marked steps, gated by state (idle → run_1_done → reflected → applied → run_2_done) so the demo can't be clicked out of order.

┌──────────────────────────────────────────────┐
│  B2B Campaign Agent — Reflection Demo        │
├──────────────────────────────────────────────┤
│  Step 1: Generate emails                     │
│  [ ▶ Run Agent ]   Score: —                  │
├──────────────────────────────────────────────┤
│  [ Pipeline | Emails | Scoreboard | Langfuse]│
├──────────────────────────────────────────────┤
│  Step 2: Reflect                             │
│  [ ▶ Run Reflection ]                        │
│  Summary: identified | fixed | reasons       │
│  Prompt diff | Skill diff | New eval cases   │
│  [ ✓ Approve & Apply ]                       │
├──────────────────────────────────────────────┤
│  Step 3: Re-run & compare                    │
│  [ ▶ Re-run Agent ]                          │
│  Before: 5.2/10  →  After: 8.4/10            │
│  (side-by-side sample emails)                │
└──────────────────────────────────────────────┘

Per-step behaviour:

  • Step 1. Triggers agent_run + evaluation via subprocess. Tabs populate as the pipeline runs:

    • Pipeline — Kedro-Viz iframed (live node highlighting if the runs view cooperates).
    • Emails — sample agent outputs.
    • Scoreboard — per-case + aggregate metrics.
    • Langfuse — Streamlit panels from the Langfuse SDK (recent traces, scores) with "Open in Langfuse →" links. Tab-switch to the Langfuse UI mid-demo for the credibility moment. Note: if we don't find the way to do it nicely we can use Langfuse UI.
  • Step 2. Triggers reflection. First shows a plain-language summary generated by the meta-agent:

    • Identified — failure patterns it found (e.g. "ignored company size in 8/20 cases").
    • Fixed — what it changed in the prompt, skill file, and eval set.
    • Reasons — why each change addresses the identified issue.

    Below the summary, three diffs side-by-side: prompt (Langfuse), skill markdown (disk), proposed new eval cases (table). Single "Approve & Apply" button runs apply.

  • Step 3. Triggers agent_run + evaluation again. Side-by-side before/after metrics and sample emails, with the Step 2 summary kept visible above so the audience can map each identified issue to the visible improvement. Failures from run 1 are now permanent regression cases.

End result

  • Aggregate score visibly improves between run 1 and run 2 (e.g. ~5/10 → ~8/10).
  • Three human-readable diffs (prompt, skill file, new eval cases) shown in the dashboard.
  • Same 20 cases produce visibly better emails after one reflection cycle.
  • New failures captured as regression cases for the next iteration.

Closing line for the demo: the same loop trivially extends to cross-sell, pricing, or any other agent — different agent, same pipelines.

Out of scope for this demo

  • Real telco data (use synthetic).
  • Ontology / graph updates (can be added at the next iteration).
  • Production-grade error handling, multi-tenancy, deployment.

Metadata

Metadata

Labels

No labels
No labels

Type

No fields configured for Parent.

Projects

Status
In Review

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions