Implement agentic reflection use case

## Description

Implement agentic reflection / continuous learning — ingest agent traces, derive improvement signal, and feed it back into the agent’s inputs (prompts, eval sets, eventually skills and ontology).

**Use case.** A B2B campaign agent — generates personalised outreach emails for telco B2B sales — that improves itself by reflecting on its own outputs.

## Setup

**Input data (synthetic, telco B2B context).**

- ~10 customer profiles (industry, size, current products, account tenure).
- ~5 products to promote.
- ~20 eval cases = (customer, product) pairs with ground-truth annotations.
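
A minimal sketch of how the synthetic inputs could be modelled. The class names, field names, and the rule of skipping products a customer already holds are all assumptions for illustration; the issue does not define a schema.

```python
from dataclasses import dataclass, field
from itertools import product as cartesian


@dataclass(frozen=True)
class CustomerProfile:
    customer_id: str
    industry: str
    size: str                          # e.g. "SME" or "enterprise" (assumed buckets)
    current_products: tuple[str, ...]  # SKUs the customer already holds
    tenure_months: int


@dataclass(frozen=True)
class Product:
    sku: str
    name: str


@dataclass(frozen=True)
class EvalCase:
    customer: CustomerProfile
    product: Product
    # Ground-truth annotations; the keys are placeholders, not a fixed schema.
    annotations: dict[str, str] = field(default_factory=dict)


def build_eval_cases(customers, products, limit=20):
    """Pair customers with products they do not already hold, up to `limit`."""
    cases = []
    for customer, prod in cartesian(customers, products):
        if prod.sku in customer.current_products:
            continue  # assumed rule: never promote a product the customer owns
        cases.append(EvalCase(customer, prod))
        if len(cases) == limit:
            break
    return cases
```

With ~10 customers and ~5 products this yields up to 50 candidate pairs, so the `limit` argument is what trims the set to ~20.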

**The agent.**

- Input: customer profile + product to promote.
- Output: personalised outreach email (subject + body).
- Runtime config:
  - System prompt versioned in Langfuse.
  - One skill file on disk (e.g. `b2b-email-style.md`) — brand voice / personalisation / CTA guidance.
- Seeded mediocre on purpose (minimal prompt, vague skill file) so the reflection loop has a clear target.

**Scorers.**

- LLM-as-judge: writing quality, personalisation, groundedness.
- Heuristic: subject present, length range, no hallucinated SKUs, CTA present.
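
The four heuristic checks could look roughly like the sketch below. The SKU pattern, the CTA phrases, and the word-count bounds are assumptions chosen for illustration; the real values would come from the product catalogue and the skill file.

```python
import re

# Assumed CTA phrases and SKU format (e.g. "FIB-100"); neither is specified in the issue.
CTA_PATTERNS = ("book a call", "schedule", "reply to this email", "get in touch")
SKU_RE = re.compile(r"\b[A-Z]{2,5}-\d{2,5}\b")


def heuristic_scores(subject, body, known_skus, min_words=50, max_words=200):
    """Return one boolean per heuristic check for a single generated email."""
    word_count = len(body.split())
    mentioned_skus = set(SKU_RE.findall(body))
    lowered = body.lower()
    return {
        "subject_present": bool(subject.strip()),
        "length_in_range": min_words <= word_count <= max_words,
        # Every SKU-shaped token in the body must exist in the catalogue.
        "no_hallucinated_skus": mentioned_skus <= set(known_skus),
        "cta_present": any(phrase in lowered for phrase in CTA_PATTERNS),
    }
```

Returning booleans per check, rather than one blended number, keeps the failure reason visible to the reflection step.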

## Pipelines

1. `agent_run` — run the agent on the eval set, emit traces to Langfuse.
2. `evaluation` — fetch traces, run scorers, write per-case + aggregate scores.
3. `reflection` — meta-agent reads traces + scores + current prompt + current skill file. Produces:
   - a narrative summary (what failed, what's being changed, why),
   - a new system prompt,
   - an updated skill file,
   - new eval cases derived from failures.
4. `apply` — on confirmation, applies the three artifacts from `reflection` (the narrative summary is display-only): prompt → Langfuse, skill → disk, eval cases → Langfuse evaluation dataset.

Re-run `agent_run` + `evaluation` with the new prompt/skill to show measurable improvement.
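
A sketch of the hand-off between `reflection` and `apply`, assuming the meta-agent's output is collected into a single object. `ReflectionProposal` and the three writer callables are hypothetical; in the real pipeline the writers would wrap the Langfuse client and the file system.

```python
from dataclasses import dataclass


@dataclass
class ReflectionProposal:
    summary: str               # narrative shown in the dashboard, never applied
    new_prompt: str
    new_skill_file: str
    new_eval_cases: list[dict]


def apply_proposal(proposal, confirmed, write_prompt, write_skill, add_eval_cases):
    """Apply the three artifacts only after explicit confirmation.

    The writer callables are injected so the sketch stays independent of
    any particular Langfuse or storage API.
    """
    if not confirmed:
        return []
    write_prompt(proposal.new_prompt)
    write_skill(proposal.new_skill_file)
    add_eval_cases(proposal.new_eval_cases)
    return ["prompt", "skill", "eval_cases"]
```

Injecting the writers also makes the step easy to dry-run in tests before pointing it at a live Langfuse project.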

## UI flow

Single-page Streamlit dashboard. Three clearly marked steps, gated by state (`idle → run_1_done → reflected → applied → run_2_done`) so the demo can't be clicked out of order.

```
┌──────────────────────────────────────────────┐
│  B2B Campaign Agent — Reflection Demo        │
├──────────────────────────────────────────────┤
│  Step 1: Generate emails                     │
│  [ ▶ Run Agent ]   Score: —                  │
├──────────────────────────────────────────────┤
│  [ Pipeline | Emails | Scoreboard | Langfuse]│
├──────────────────────────────────────────────┤
│  Step 2: Reflect                             │
│  [ ▶ Run Reflection ]                        │
│  Summary: identified | fixed | reasons       │
│  Prompt diff | Skill diff | New eval cases   │
│  [ ✓ Approve & Apply ]                       │
├──────────────────────────────────────────────┤
│  Step 3: Re-run & compare                    │
│  [ ▶ Re-run Agent ]                          │
│  Before: 5.2/10  →  After: 8.4/10            │
│  (side-by-side sample emails)                │
└──────────────────────────────────────────────┘
```
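
The state gating could be sketched as a small transition table. The state values come from the sequence above; the action names and helper functions are assumptions.

```python
from enum import Enum


class DemoState(str, Enum):
    IDLE = "idle"
    RUN_1_DONE = "run_1_done"
    REFLECTED = "reflected"
    APPLIED = "applied"
    RUN_2_DONE = "run_2_done"


# action name -> (state required to run it, state it leads to)
ACTIONS = {
    "run_agent": (DemoState.IDLE, DemoState.RUN_1_DONE),
    "run_reflection": (DemoState.RUN_1_DONE, DemoState.REFLECTED),
    "approve_and_apply": (DemoState.REFLECTED, DemoState.APPLIED),
    "rerun_agent": (DemoState.APPLIED, DemoState.RUN_2_DONE),
}


def can_run(state, action):
    """True only when the button for `action` should be enabled."""
    return ACTIONS[action][0] == state


def advance(state, action):
    """Return the next state, refusing out-of-order clicks."""
    if not can_run(state, action):
        raise ValueError(f"{action!r} is not allowed in state {state.value!r}")
    return ACTIONS[action][1]
```

In the dashboard, the current state would live in `st.session_state`, and each button could pass `disabled=not can_run(state, action)` so that only the next step is clickable.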

**Per-step behaviour:**

- **Step 1.** Triggers `agent_run` + `evaluation` via subprocess. Tabs populate as the pipeline runs:
  - *Pipeline* — Kedro-Viz iframed (live node highlighting if the runs view cooperates).
  - *Emails* — sample agent outputs.
  - *Scoreboard* — per-case + aggregate metrics.
  - *Langfuse* — Streamlit panels from the Langfuse SDK (recent traces, scores) with "Open in Langfuse →" links. Tab-switch to the Langfuse UI mid-demo for the credibility moment. **Note:** if the panels can't be embedded cleanly, fall back to the Langfuse UI itself.
- **Step 2.** Triggers `reflection`. First shows a **plain-language summary** generated by the meta-agent:
  - *Identified* — failure patterns it found (e.g. "ignored company size in 8/20 cases").
  - *Fixed* — what it changed in the prompt, skill file, and eval set.
  - *Reasons* — why each change addresses the identified issue.

  Below the summary, three diffs side-by-side: prompt (Langfuse), skill markdown (disk), proposed new eval cases (table). Single "Approve & Apply" button runs `apply`.
- **Step 3.** Triggers `agent_run` + `evaluation` again. Side-by-side before/after metrics and sample emails, with the Step 2 summary kept visible above so the audience can map each identified issue to the visible improvement. Failures from run 1 are now permanent regression cases.
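
Triggering the pipelines via subprocess, as Steps 1 and 3 do, might look like the sketch below. It assumes the pipelines are registered in a Kedro project under the names used in this issue and that the dashboard is launched from the project root; the helper names are hypothetical.

```python
import subprocess


def pipeline_command(name):
    """Build the CLI call for one named pipeline (assumes a Kedro project)."""
    return ["kedro", "run", "--pipeline", name]


def run_pipelines(names, runner=subprocess.run):
    """Run pipelines in order, stopping at the first non-zero exit code.

    `runner` is injectable so the dashboard logic can be tested without
    actually spawning processes.
    """
    for name in names:
        runner(pipeline_command(name), check=True)
```

Steps 1 and 3 would both call `run_pipelines(["agent_run", "evaluation"])`; Step 2 and the approve button would pass `["reflection"]` and `["apply"]` respectively.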

## End result

- Aggregate score visibly improves between run 1 and run 2 (e.g. ~5/10 → ~8/10).
- Three human-readable diffs (prompt, skill file, new eval cases) shown in the dashboard.
- Same 20 cases produce visibly better emails after one reflection cycle.
- New failures captured as regression cases for the next iteration.
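
One way to compute the before/after comparison and pick out new regression cases, assuming per-case scores are available as `{case_id: score}` mappings on a 0 to 10 scale. The function name and the `fail_below` threshold are assumptions.

```python
from statistics import mean


def compare_runs(before, after, fail_below=6.0):
    """Summarise two runs over the cases they share.

    `before` and `after` map case IDs to aggregate scores. Cases that are
    still below `fail_below` after reflection are returned as candidates
    for the next iteration's regression set.
    """
    shared = sorted(before.keys() & after.keys())
    return {
        "before": round(mean(before[c] for c in shared), 2),
        "after": round(mean(after[c] for c in shared), 2),
        "still_failing": [c for c in shared if after[c] < fail_below],
    }
```

Restricting the comparison to shared case IDs keeps the headline numbers comparable even after `apply` has added new eval cases to the dataset.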

Closing line for the demo: *the same loop trivially extends to cross-sell, pricing, or any other agent — different agent, same pipelines.*

## Out of scope for this demo

- Real telco data (use synthetic).
- Ontology / graph updates (can be added at the next iteration).
- Production-grade error handling, multi-tenancy, deployment.
