feat(managed_agents): add multiagent and outcomes cookbooks #599

Merged
markn-ant merged 1 commit into main from mnowicki/managed-agents-multiagent-outcomes
May 6, 2026

@markn-ant

Contributor

Adds two Claude Managed Agents cookbooks:

  • CMA_coordinate_specialist_team.ipynb — Heterogeneous team via the multiagent coordinator config: a coordinator runs three specialists (web-search researcher, file-reading librarian, rules-based pricer) with scoped toolsets to assemble a sales proposal. Covers the multiagent field, the thread_created / thread_message_received event types, and why per-role tool scoping matters.
  • CMA_verify_with_outcome_grader.ipynb — Build a grade-and-revise loop with Outcomes: a writer drafts a cited research brief, a stateless grader fetches every URL and checks every quote against a rubric, and feedback drives revisions until the brief passes. Covers user.define_outcome, the span.outcome_evaluation_* events, and how to write a rubric the grader can act on.

Plus managed_agents/README.md table rows and registry.yaml entries.

- CMA_coordinate_specialist_team.ipynb: heterogeneous team via the
  multiagent coordinator config — coordinator runs three specialists
  with scoped toolsets to assemble a sales proposal
- CMA_verify_with_outcome_grader.ipynb: grade-and-revise loop with
  Outcomes — writer drafts a cited brief, grader checks it against a
  rubric, feedback drives revisions until it passes; includes a
  rubric-writing section with failure modes and a six-principle table
- Update managed_agents/README.md and registry.yaml
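
The two event types the coordinator notebook covers pair up naturally: each `session.thread_created` marks a specialist starting, and each `agent.thread_message_received` marks its report arriving. Below is a minimal offline sketch of tracking which specialists are still outstanding. It uses plain dictionaries in place of SDK event objects; the dict shape and the `pending_specialists` helper are assumptions for illustration, while the field names `agent_name` and `from_agent_name` follow the notebook's streaming loop.

```python
def pending_specialists(events):
    """Hypothetical helper: names of specialists spawned but not yet reported."""
    pending = []
    for ev in events:
        if ev["type"] == "session.thread_created":
            # A specialist thread was started by the coordinator.
            pending.append(ev["agent_name"])
        elif ev["type"] == "agent.thread_message_received":
            # A specialist reported back; it is no longer outstanding.
            if ev["from_agent_name"] in pending:
                pending.remove(ev["from_agent_name"])
    return pending


# Event order taken from the notebook's recorded output, cut off after the picker spawns.
log = [
    {"type": "session.thread_created", "agent_name": "prospect_researcher"},
    {"type": "session.thread_created", "agent_name": "pricing_modeler"},
    {"type": "agent.thread_message_received", "from_agent_name": "prospect_researcher"},
    {"type": "session.thread_created", "agent_name": "case_study_picker"},
]
print(pending_specialists(log))  # ['pricing_modeler', 'case_study_picker']
```

Keeping the bookkeeping separate from the stream makes it easy to replay a saved event log when debugging hand-offs.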
@github-actions

github-actions Bot commented May 6, 2026

Notebook Changes

This PR modifies the following notebooks:

📓 managed_agents/CMA_coordinate_specialist_team.ipynb

nbdiff /dev/null managed_agents/CMA_coordinate_specialist_team.ipynb (0c0645a7439d3d426dc51a5e47a70258aa8afe04)
--- /dev/null  2026-05-06 15:12:20.725363
+++ managed_agents/CMA_coordinate_specialist_team.ipynb (0c0645a7439d3d426dc51a5e47a70258aa8afe04)  (no timestamp)
## added /cells:
+  markdown cell:
+    source:
+      # Building a Sales Proposal with a Heterogeneous Agent Team
+      
+      We'll use Claude Managed Agents and the multi-agent coordinator pattern to automate sales-proposal writing for a fictional company called Northstar, which sells a workflow-automation platform to mid-market operations teams.
+      
+      Right now, their reps build a tailored proposal for each prospect: research what companies in the prospect's segment typically prioritize, pull two relevant case studies from a library of a few hundred, model pricing from an internal rules sheet, and assemble it into a two-page document. Each step draws on a different source and a different kind of judgment.
+      
+      We'll have a coordinator agent run three specialists to do this. A researcher uses web search to find what's typical for the prospect's segment. A librarian (the case-study picker) reads the case-study library and picks the two best matches. A pricing modeler sees only the rules file, the seat count, and the usage tier. The coordinator sequences them and writes the proposal.
+  markdown cell:
+    source:
+      ## 1. Set up the client
+      
+      First, let's install the SDK and set up the Anthropic client. The multiagent config and event types are part of the Managed Agents beta.
+  code cell:
+    source:
+      %%capture
+      %pip install -q anthropic python-dotenv
+  code cell:
+    source:
+      import os
+      
+      import anthropic
+      from dotenv import load_dotenv
+      
+      load_dotenv()
+      
+      BETAS = ["managed-agents-2026-04-01"]
+      MODEL = os.environ.get("COOKBOOK_MODEL", "claude-opus-4-6")
+      client = anthropic.Anthropic()
+  markdown cell:
+    source:
+      ## 2. Define three specialist subagents
+      
+      Next, we'll create the three teammates. Each one gets its own system prompt, its own output shape, and only the tools it needs for its job. The researcher gets web search, the case-study picker can only read the local library, and the pricing modeler just sees `pricing_rules.md`, the seat count, and the usage tier. Scoping tools per role keeps the pricing modeler from pulling a competitor's number off the web and keeps the full case-study library out of the coordinator's context.
+  code cell:
+    source:
+      def make_agent(name, description, system, tools):
+          """Create a specialist agent and return its ID for the coordinator's roster."""
+          agent = client.beta.agents.create(
+              name=name,
+              description=description,
+              model=MODEL,
+              system=system,
+              tools=tools,
+              betas=BETAS,
+          )
+          print(f"{name}: {agent.id}")
+          return agent.id
+      
+      
+      prospect_researcher = make_agent(
+          "prospect_researcher",
+          "Researches what companies in a given industry segment and size tier typically prioritize.",
+          """Given a prospect's industry and size, use web search to find:
+      - What companies in that segment typically list as strategic priorities
+      - Recent trends or pressures in that industry
+      - Common operational pain points at that scale
+      Return via send_to_parent: {"priorities": [...], "recent_moves": [...], "pain_points": [...], "sources": [...]}""",
+          [
+              {
+                  "type": "agent_toolset_20260401",
+                  "configs": [{"name": "web_search"}, {"name": "web_fetch"}],
+              }
+          ],
+      )
+      
+      case_study_picker = make_agent(
+          "case_study_picker",
+          "Selects the two most relevant case studies from the library for a given prospect profile.",
+          """The case study library is in /mnt/user-data/case_studies/. Each file is one customer story.
+      You will be given a prospect's industry, size, and top priorities. Read the library, score each study on relevance, and pick the two best matches.
+      Return via send_to_parent: {"picks": [{"file": ..., "customer": ..., "why_relevant": ...}, ...]}""",
+          [{"type": "agent_toolset_20260401"}],
+      )
+      
+      pricing_modeler = make_agent(
+          "pricing_modeler",
+          "Builds two or three pricing options for a prospect based on seat count and expected usage.",
+          """Pricing rules are in /mnt/user-data/pricing_rules.md. Given a prospect's estimated seat count and usage tier, build:
+      - a conservative option (annual commit, lower per-seat)
+      - a flexible option (monthly, higher per-seat)
+      - if seat count > 500, an enterprise option with a platform fee
+      Show the first-year total for each. Return via send_to_parent: {"options": [{"name": ..., "structure": ..., "year_one_total": ...}, ...]}""",
+          [{"type": "agent_toolset_20260401"}],
+      )
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          prospect_researcher: agent_011Cahy5YQwAN99heAeSpPob
+          case_study_picker: agent_011Cahy5ZKWFqEAgPNZZxPvM
+      output 1:
+        output_type: stream
+        name: stdout
+        text:
+          pricing_modeler: agent_011Cahy5a5eTdwuzZabFgu6K
+  markdown cell:
+    source:
+      ## 3. Give the team something to work with
+      
+      The case-study picker needs a library to choose from. We'll give it seven short case studies across healthcare, manufacturing, logistics, retail, fintech, and public sector, so you can see it pick the two that fit our prospect.
+  code cell:
+    source:
+      CASE_STUDIES = [
+          {
+              "slug": "stclair_health",
+              "title": "St. Clair Health",
+              "industry": "regional hospital network",
+              "employees": 6200,
+              "summary": """Challenge: credentialing and prior-auth workflows spread across 11 systems.
+      Result with Northstar: consolidated to 3 automated workflows; prior-auth turnaround down 58%; $1.9M annual labor savings.""",
+          },
+          {
+              "slug": "blueridge_health_plan",
+              "title": "BlueRidge Health Plan",
+              "industry": "regional payer",
+              "employees": 2800,
+              "summary": """Challenge: claims-adjudication exceptions queued in email; 19% required manual rework.
+      Result with Northstar: exception routing automated end-to-end; rework rate down to 6%; 11-day faster average claim resolution.""",
+          },
+          {
+              "slug": "calder_mfg",
+              "title": "Calder Manufacturing",
+              "industry": "industrial",
+              "employees": 3100,
+              "summary": """Challenge: purchase-order approvals averaging 9 days.
+      Result with Northstar: PO cycle time cut to 2.1 days; 14% reduction in maverick spend.""",
+          },
+          {
+              "slug": "northwind",
+              "title": "Northwind Logistics",
+              "industry": "3PL",
+              "employees": 4400,
+              "summary": """Challenge: carrier-onboarding paperwork took 3 weeks per carrier.
+      Result with Northstar: onboarding down to 4 days; 22% more carriers activated in Q1.""",
+          },
+          {
+              "slug": "harborview_retail",
+              "title": "Harborview Retail Group",
+              "industry": "specialty retail",
+              "employees": 5600,
+              "summary": """Challenge: store-level inventory exceptions handled by regional managers over Slack and spreadsheets.
+      Result with Northstar: exception triage automated across 140 stores; stockout incidents down 31%.""",
+          },
+          {
+              "slug": "aperture_fintech",
+              "title": "Aperture Payments",
+              "industry": "fintech",
+              "employees": 1900,
+              "summary": """Challenge: KYC and merchant-onboarding reviews averaging 6 business days.
+      Result with Northstar: review SLA cut to 36 hours; onboarding throughput up 2.4x with the same team.""",
+          },
+          {
+              "slug": "summit_county",
+              "title": "Summit County Government",
+              "industry": "public sector",
+              "employees": 3700,
+              "summary": """Challenge: building-permit applications routed through five departments by paper packet.
+      Result with Northstar: single digital intake with parallel department review; median permit time 41 to 17 days.""",
+          },
+      ]
+  markdown cell:
+    source:
+      ### Product and pricing collateral
+      
+      We'll also provide the product one-pager that the coordinator reads when writing the "How we help" section, and the pricing rules file that the modeler uses to build options.
+  code cell:
+    source:
+      PRODUCT = """# Northstar Platform — One-Pager
+      Northstar is a workflow automation platform for mid-market operations teams.
+      Core capabilities: visual process builder, 200+ SaaS connectors, role-based approvals, SOC 2 Type II.
+      Typical results: 40-60% reduction in manual ticket handling, 3-week time-to-first-workflow."""
+      
+      PRICING = """# Pricing Rules (internal)
+      - Per-seat list: $65/mo (monthly) or $52/mo (annual commit).
+      - Usage tiers: light = 1.0x, standard = 1.15x, heavy = 1.30x multiplier on per-seat.
+      - Enterprise (>500 seats): add $48,000/yr platform fee, per-seat drops to $44/mo annual.
+      - All options include onboarding; enterprise includes a named CSM."""
+  markdown cell:
+    source:
+      ### Wire up the coordinator and start a session
+      
+      Now let's create an environment, upload the nine files, and create the coordinator with its `multiagent` roster of three specialists. Each entry is a full agent with its own model, prompt, and toolset, so you could mix model tiers per role.
+  code cell:
+    source:
+      env = client.beta.environments.create(
+          name="proposal-meridian",
+          config={"type": "anthropic_cloud", "networking": {"type": "unrestricted"}},
+      )
+      
+      resources = []
+      
+      
+      def mount(path, content):
+          """Upload content as a text file and register it to be mounted at path."""
+          f = client.beta.files.upload(
+              file=(os.path.basename(path), content.encode(), "text/plain")
+          )
+          resources.append({"type": "file", "file_id": f.id, "mount_path": path})
+      
+      
+      for cs in CASE_STUDIES:
+          body = f"# {cs['title']} ({cs['industry']}, {cs['employees']:,} employees)\n{cs['summary']}"
+          mount(f"/mnt/user-data/case_studies/{cs['slug']}.md", body)
+      mount("/mnt/user-data/product_one_pager.md", PRODUCT)
+      mount("/mnt/user-data/pricing_rules.md", PRICING)
+      
+      coordinator = client.beta.agents.create(
+          name="Proposal Writer",
+          model=MODEL,
+          system="""You assemble tailored sales proposals.
+      Given a prospect name and basic profile:
+      1. Send the prospect's industry and size to prospect_researcher.
+      2. Send the prospect's industry, size, and (once the researcher reports back) their priorities to case_study_picker.
+      3. Send the seat count and usage tier to pricing_modeler.
+      4. Read /mnt/user-data/product_one_pager.md, then write /mnt/session/outputs/proposal.md with sections:
+         Executive summary (tied to their priorities), How we help (from the one-pager),
+         Proof (the two case studies), Investment (the pricing options), Next steps.
+      Keep it to two pages.""",
+          tools=[{"type": "agent_toolset_20260401"}],
+          multiagent={
+              "type": "coordinator",
+              "agents": [prospect_researcher, case_study_picker, pricing_modeler],
+          },
+          betas=BETAS,
+      )
+      
+      session = client.beta.sessions.create(
+          agent={"type": "agent", "id": coordinator.id, "version": coordinator.version},
+          environment_id=env.id,
+          resources=resources,
+          title="Proposal: Meridian Health",
+          betas=BETAS,
+      )
+      print(f"Session {session.id} ready with {len(resources)} files mounted")
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Session sesn_011Cahy5wq6Jyk32n2NwgR88 ready with 9 files mounted
+  markdown cell:
+    source:
+      ## 4. Kick off the proposal
+      
+      Let's send the prospect profile and watch the coordinator work. It will start the researcher and the pricing modeler in parallel, then run the case-study picker once the researcher's findings come back, since the picker needs those priorities to score relevance.
+  code cell:
+    source:
+      PROSPECT = {
+          "name": "Meridian Health",
+          "industry": "regional healthcare system",
+          "employees": 8500,
+          "estimated_seats": 600,
+          "usage_tier": "heavy",
+      }
+      
+      client.beta.sessions.events.send(
+          session.id,
+          betas=BETAS,
+          events=[
+              {
+                  "type": "user.message",
+                  "content": [
+                      {
+                          "type": "text",
+                          "text": f"Build a proposal for {PROSPECT['name']}, a {PROSPECT['industry']} with "
+                          f"~{PROSPECT['employees']} employees. Estimate {PROSPECT['estimated_seats']} seats "
+                          f"at {PROSPECT['usage_tier']} usage. Write to /mnt/session/outputs/proposal.md.",
+                      }
+                  ],
+              }
+          ],
+      )
+      
+      with client.beta.sessions.events.stream(session.id, betas=BETAS) as stream:
+          for ev in stream:
+              if ev.type == "session.thread_created":
+                  # The coordinator started a specialist in its own thread.
+                  print(f"[spawn] {ev.agent_name}")
+              elif ev.type == "agent.thread_message_received":
+                  # A specialist reported back to the coordinator.
+                  print(f"[report] {ev.from_agent_name} returned")
+              elif ev.type == "session.status_idle":
+                  # The session has gone idle, so stop streaming.
+                  print("[done]")
+                  break
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          [spawn] prospect_researcher
+      output 1:
+        output_type: stream
+        name: stdout
+        text:
+          [spawn] pricing_modeler
+      output 2:
+        output_type: stream
+        name: stdout
+        text:
+          [report] prospect_researcher returned
+      output 3:
+        output_type: stream
+        name: stdout
+        text:
+          [spawn] case_study_picker
+      output 4:
+        output_type: stream
+        name: stdout
+        text:
+          [report] pricing_modeler returned
+      output 5:
+        output_type: stream
+        name: stdout
+        text:
+          [report] case_study_picker returned
+      output 6:
+        output_type: stream
+        name: stdout
+        text:
+          [done]
+  markdown cell:
+    source:
+      ### What each teammate sent back
+      
+      Before we look at the assembled proposal, let's print the three raw `send_to_parent` payloads. Each subagent ran in its own context with only its own tools, so the three reports look quite different from one another.
+  code cell:
+    source:
+      def text_of(content):
+          return "".join(b.text for b in content if b.type == "text")
+      
+      
+      for ev in client.beta.sessions.events.list(session.id, limit=1000, betas=BETAS):
+          if ev.type == "agent.thread_message_received":
+              body = text_of(ev.content)
+              print(f"━━━ send_to_parent from {ev.from_agent_name} ({len(body)} chars) ━━━")
+              print(body[:1200] + (f"\n…[{len(body) - 1200} more chars]" if len(body) > 1200 else ""))
+              print()
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          ━━━ send_to_parent from prospect_researcher (9580 chars) ━━━
+          Here is the structured research for tailoring a sales proposal to Meridian Health (regional healthcare system, ~8,500 employees):
+          
+          ---
+          
+          ## 1. TOP 3–5 STRATEGIC PRIORITIES FOR REGIONAL HEALTHCARE SYSTEMS IN 2025–2026
+          
+          **Priority 1: Financial Resilience & Cost Containment**
+          - Regional health systems are under intense financial pressure from reimbursement shortfalls, labor cost inflation, and policy changes (e.g., Medicaid cuts, 340B reforms under the One Big Beautiful Bill Act). Per Plante Moran: "Medicaid changes could create tens of millions in revenue shortfalls for some systems." Cash flow acceleration, revenue cycle optimization, and cost control without compromising care quality are top-of-mind.
+          - EY notes: "Reimbursement, labor shortage and cost inflation headwinds are compounded by the need to invest in leading clinical programs." Regional health systems specifically are "signaling the strategic importance of increasing scale in the face of accelerating costs."
+          
+          **Priority 2: Workforce Retention, Burnout Reduction & Staffing Optimization**
+          - The AHA's 2026 Workforce Scan highlights: "Burnout, vacancies and administrative burden remain top concerns for health care leaders acro
+          …[8380 more chars]
+          
+          ━━━ send_to_parent from pricing_modeler (2440 chars) ━━━
+          {"options": [
+            {
+              "name": "Flexible — Monthly",
+              "structure": "Month-to-month billing, no annual commitment. 600 seats at $84.50/seat/month (list $65.00 × 1.30 heavy-usage multiplier). Includes: onboarding program, visual process builder, 200+ SaaS connectors, role-based approvals, SOC 2 Type II compliance, standard email/chat support (business-hours SLA). No volume discount applied; full flexibility to scale seats up or down monthly.",
+              "year_one_total": "$608,400"
+            },
+            {
+              "name": "Standard — Annual Commit",
+              "structure": "12-month annual commitment, billed annually or quarterly. 600 seats at $67.60/seat/month ($52.00 annual-commit base × 1.30 heavy-usage multiplier). Includes everything in Flexible plus: 20% per-seat savings vs. monthly ($121,680 first-year savings), priority support queue, onboarding program. Annual commitment locks in rate for the term.",
+              "year_one_total": "$486,720"
+            },
+            {
+              "name": "Enterprise — Annual Commit + Platform",
+              "structure": "12-month annual commitment with Enterprise platform tier. 600 seats at $57.20/seat/month ($44.00 enterprise base × 1.30 heavy-usage multiplier) plus $48,000/year platform fee. Includes everything i
+          …[1240 more chars]
+          
+          ━━━ send_to_parent from case_study_picker (3470 chars) ━━━
+          {"picks": [
+            {
+              "file": "/mnt/session/uploads/mnt/user-data/case_studies/stclair_health.md",
+              "customer": "St. Clair Health — regional hospital network, 6,200 employees",
+              "key_results": "Consolidated credentialing and prior-auth workflows from 11 disconnected systems into 3 automated workflows; prior-auth turnaround reduced 58%; $1.9M annual labor savings.",
+              "why_relevant": "St. Clair is the strongest match across nearly every Meridian priority: (1) Same industry and org type — a regional hospital network of comparable size (6,200 vs. 8,500 employees), making it a highly credible peer reference. (2) System consolidation — collapsed 11 systems into 3 automated workflows, directly mirroring Meridian's need to consolidate ~1,500 disconnected applications and improve interoperability. (3) Cost containment — $1.9M annual labor savings speaks directly to Meridian's financial resilience priority and makes a concrete ROI case. (4) Admin burden reduction — automating credentialing and prior-auth are classic high-burden healthcare admin tasks, addressing the workforce burnout and retention priority. (5) Prior-auth turnaround improvement (58%) ties to revenue cycle acceleration
+          …[2270 more chars]
+          
+  markdown cell:
+    source:
+      ## 5. Read the proposal
+      
+      Finally, let's render the assembled proposal. The coordinator wrote it to `proposal.md` with the `write` tool, so we'll find that event in the log and display it.
+  code cell:
+    source:
+      from IPython.display import Markdown, display
+      
+      for ev in client.beta.sessions.events.list(session.id, limit=1000, betas=BETAS):
+          if (
+              ev.type == "agent.tool_use"
+              and ev.name == "write"
+              and ev.input["file_path"].endswith("proposal.md")
+          ):
+              display(Markdown(ev.input["content"]))
+              break
+    outputs:
+      output 0:
+        output_type: display_data
+        data:
+          text/markdown:
+            # Northstar Platform — Proposal for Meridian Health
+            
+            **Prepared for:** Meridian Health Leadership | **Date:** May 2026 | **Prepared by:** Northstar Sales Team
+            
+            ---
+            
+            ## Executive Summary
+            
+            Meridian Health is navigating the same pressures facing every regional health system today: tightening reimbursements, a workforce stretched thin by administrative burden, and a compliance landscape that just got harder with the 2026 HIPAA Security Rule overhaul. Meanwhile, hundreds of disconnected applications across your sites create manual hand-offs, slow down revenue cycle processes, and pull clinical and operations staff away from higher-value work.
+            
+            Northstar is a workflow automation platform built for exactly this moment. We help operations teams consolidate fragmented systems into streamlined, auditable workflows — reducing costs, accelerating cycle times, and giving staff hours back in their day. For a system of Meridian's size and complexity, the impact is measurable within weeks, not quarters.
+            
+            ---
+            
+            ## How We Help
+            
+            **Northstar Platform** is a workflow automation platform purpose-built for mid-market operations teams.
+            
+            | Capability | What It Means for Meridian |
+            |---|---|
+            | **Visual Process Builder** | Operations leads design and modify workflows without writing code — no IT bottleneck for every change. |
+            | **200+ SaaS Connectors** | Connect your EHR, billing, scheduling, credentialing, and communication tools into unified, automated flows. |
+            | **Role-Based Approvals** | Route decisions to the right person at the right level — critical for clinical and financial sign-offs. |
+            | **SOC 2 Type II Certified** | Enterprise-grade security posture aligned with HIPAA requirements and the incoming 2026 rule changes. |
+            
+            **Typical results across our customer base:** 40–60% reduction in manual ticket handling, with teams launching their first production workflow within 3 weeks of kickoff.
+            
+            ---
+            
+            ## Proof — Results from Healthcare Organizations Like Yours
+            
+            ### St. Clair Health — Regional Hospital Network, 6,200 Employees
+            
+            St. Clair Health used Northstar to consolidate credentialing and prior-authorization workflows from **11 fragmented systems into 3 automated end-to-end processes**.
+            
+            - **58% faster** prior-auth turnaround
+            - **$1.9M annual labor savings**
+            
+            St. Clair's profile — a regional hospital network managing complex, compliance-sensitive operations across multiple sites — closely mirrors Meridian's environment. Their results demonstrate what's possible when disconnected manual processes are replaced with governed, automated workflows.
+            
+            ### BlueRidge Health Plan — Regional Health Plan, 2,800 Employees
+            
+            BlueRidge deployed Northstar to automate claims-adjudication exception routing that had relied on email queues and heavy manual rework.
+            
+            - Manual rework rate dropped from **19% → 6%**
+            - Average claim resolution accelerated by **11 days**
+            
+            Faster claim resolution on the payer side translates directly to faster reimbursement for provider organizations like Meridian — and the admin-burden reduction story resonates across both sides of the healthcare ecosystem.
+            
+            ---
+            
+            ## Investment — Three Options for 600 Seats, Heavy Usage
+            
+            | | **Flexible (Monthly)** | **Standard (Annual)** | **Enterprise (Annual)** ★ |
+            |---|---|---|---|
+            | **Per seat/month** | $84.50 | $67.60 | $57.20 |
+            | **Platform fee** | — | — | $48,000/yr |
+            | **Year-one total** | **$608,400** | **$486,720** | **$459,840** |
+            | **Savings vs. monthly** | — | $121,680 (20%) | $148,560 (24%) |
+            | **Support** | Email/chat (business hrs) | Priority queue | Dedicated + named CSM |
+            | **Onboarding** | Standard | Standard | Enterprise workshops + workflow design |
+            | **Commitment** | None | 12-month | 12-month |
+            
+            > **★ Recommended:** The Enterprise tier is the best fit for a deployment of Meridian's scale. Despite the platform fee, the per-seat savings make it the lowest total cost — while adding a named Customer Success Manager to guide a complex, multi-site rollout across your 8,500-employee organization. For context, St. Clair Health runs on this tier.
+            
+            ---
+            
+            ## Next Steps
+            
+            1. **Discovery workshop (60 min)** — Map Meridian's top 2–3 workflow pain points with our solutions team.
+            2. **Pilot scope agreement** — Select one high-impact workflow (e.g., prior-auth, credentialing, or claims exception routing) for a 3-week proof-of-value.
+            3. **Business case review** — We'll model projected savings specific to Meridian's volume and current cost structure.
+            4. **Security & compliance review** — Our team will walk through SOC 2 documentation, BAA execution, and HIPAA alignment with your IT/security stakeholders.
+            
+            **Ready to start?** Let's schedule the discovery workshop. We'll come prepared with a draft workflow map based on what we've learned from St. Clair and other regional health systems.
+            
+            ---
+            
+            *Northstar Platform — Automate the work around the work.*
+          text/plain: <IPython.core.display.Markdown object>
+  markdown cell:
+    source:
+      ## Why three subagents instead of one
+      
+      A single agent with web search, the case-study library, and the pricing rules could write this proposal, so why split it up? Scoping each role to its own tools means the pricing modeler can't pull a competitor's list price off the web, because it only has the rules file. The case-study picker reads seven files here, but in production it would read hundreds, and that volume stays in the subagent's context instead of the coordinator's. And the coordinator gets to decide the order and the hand-offs without doing any of the specialist work itself.
+      
+      For more on multi-agent coordination, see the [Managed Agents documentation](https://platform.claude.com/docs/en/managed-agents/multi-agent).
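
The pricing modeler's year-one totals can be reproduced by hand from the rules in the notebook's `PRICING` string. This is a minimal sketch, assuming the usage multiplier applies to the per-seat rate only and the platform fee is added once per year. The helper name `year_one_total` and the plan keys are illustrative and not part of the notebook.

```python
# Rates and multipliers copied from the notebook's PRICING string.
PER_SEAT = {"monthly": 65.0, "annual": 52.0, "enterprise": 44.0}  # USD per seat per month
USAGE_MULTIPLIER = {"light": 1.0, "standard": 1.15, "heavy": 1.30}
PLATFORM_FEE = 48_000  # USD per year, enterprise only


def year_one_total(plan: str, seats: int, usage: str) -> int:
    """Hypothetical helper: first-year cost in whole dollars for one plan."""
    if plan == "enterprise" and seats <= 500:
        # The rules sheet reserves enterprise pricing for more than 500 seats.
        raise ValueError("enterprise pricing requires more than 500 seats")
    per_seat = PER_SEAT[plan] * USAGE_MULTIPLIER[usage]
    total = per_seat * seats * 12
    if plan == "enterprise":
        total += PLATFORM_FEE
    # Round to whole dollars to absorb floating-point noise in the multiplier.
    return round(total)


# Meridian Health's profile from the notebook: 600 seats at heavy usage.
totals = {plan: year_one_total(plan, 600, "heavy") for plan in PER_SEAT}
print(totals)  # {'monthly': 608400, 'annual': 486720, 'enterprise': 459840}
```

These figures match the totals in the pricing modeler's payload and in the proposal's Investment table ($608,400, $486,720, and $459,840).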

📓 managed_agents/CMA_verify_with_outcome_grader.ipynb

nbdiff /dev/null managed_agents/CMA_verify_with_outcome_grader.ipynb (0c0645a7439d3d426dc51a5e47a70258aa8afe04)
--- /dev/null  2026-05-06 15:12:20.725363
+++ managed_agents/CMA_verify_with_outcome_grader.ipynb (0c0645a7439d3d426dc51a5e47a70258aa8afe04)  (no timestamp)
## added /cells:
+  markdown cell:
+    source:
+      # Outcomes: Agents that verify their own work
+      
+      Agents are good at producing things that *look* done. Ask one for a cited research brief and you'll get a tidy document with footnotes. Look closer and there's usually room to improve: a topic gets thin coverage, a quote drifts from the source, a citation leans on a press release instead of the original filing. Catching those takes a manual review loop: you read the output, spot what's off, and prompt again. And most of what you say in those rounds is feedback you could have written down before the agent started.
+      
+      [Outcomes](https://platform.claude.com/docs/en/managed-agents/define-outcomes) in Claude Managed Agents gives the session a second agent whose only job is to check. You write a **rubric** that spells out what "done" looks like and how to verify it, and the platform provisions a **grader** in its own context window. The grader can't see the writer's reasoning and has no idea what shortcuts were taken. After each writer turn it rereads the artifact against the rubric and either passes it or hands back a per-criterion list of gaps. The writer revises and the loop runs again, up to a cap you set.
+      
+      In this guide you'll watch that loop run end to end. The writer drafts a one-page brief on EV fast-charging economics with verbatim quotes from its sources. The grader independently fetches every cited URL, searches each page for the quoted string, checks the quote actually supports the claim, and scores coverage against a seven-item checklist. You'll see it catch a real problem (a press-release exhibit cited where a 10-K was required) and you'll see the writer go find the right document.
+      
+      ## What you'll learn
+      
+      - Write a rubric the grader can act on
+      - Start a session with an Outcome and let the agent work toward it
+      - Follow the grade-and-revise loop and read the grader's feedback
+      - Recognize when Outcomes is the right tool
+  markdown cell:
+    source:
+      ## 1. Set up the environment
+      
+      First, let's install the SDK and set up the Anthropic client. The `user.define_outcome` event and the `span.outcome_evaluation_*` event types are part of the Managed Agents beta, so every call below passes the beta flag.
+  code cell:
+    source:
+      %%capture
+      %pip install -q anthropic python-dotenv
+  code cell:
+    source:
+      import os
+      import re
+      import time
+      
+      import anthropic
+      from dotenv import load_dotenv
+      
+      load_dotenv()
+      
+      BETAS = ["managed-agents-2026-04-01"]
+      MODEL = os.environ.get("COOKBOOK_MODEL", "claude-sonnet-4-6")
+      client = anthropic.Anthropic()
+  markdown cell:
+    source:
+      ## 2. Create the writer and start a session
+      
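+      Next, we'll create an environment, the writer agent, and a session. The environment allows unrestricted networking so the web search and fetch tools can reach the sources being cited. The writer's system prompt asks it to cite no more than six sources, each with a short verbatim quote so the grader has something concrete to check.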
+  code cell:
+    source:
+      env = client.beta.environments.create(
+          name="research-brief",
+          config={"type": "anthropic_cloud", "networking": {"type": "unrestricted"}},
+      )
+      
+      writer = client.beta.agents.create(
+          name="Research Analyst",
+          model=MODEL,
+          system="""You are a research analyst. You write one-page business briefs.
+      
+      Cite every factual claim with an inline footnote [n]. End the brief with a Sources section in this exact format, one entry per line:
+      
+      [n] "verbatim quote from the page, 25 words or fewer" - Title - URL
+      
+      Only cite pages you actually fetched and read. The quote must be copied character-for-character from the page. Cite no more than 6 sources total. Pick the strongest; do not pad. Save the brief to /mnt/session/outputs/brief.md.""",
+          tools=[
+              {
+                  "type": "agent_toolset_20260401",
+                  "configs": [
+                      {"name": "web_search"},
+                      {"name": "web_fetch"},
+                      {"name": "read"},
+                      {"name": "write"},
+                  ],
+              }
+          ],
+          betas=BETAS,
+      )
+      
+      session = client.beta.sessions.create(
+          agent={"type": "agent", "id": writer.id, "version": writer.version},
+          environment_id=env.id,
+          title="Brief: EV fast-charging unit economics",
+          betas=BETAS,
+      )
+      print(f"Session {session.id}")
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Session sesn_011CakRYQjc4NMXdRy4qKZy5
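The Sources format in the writer's system prompt is regular enough to lint locally before any grading happens. The sketch below is illustrative only and is not part of the cookbook: `lint_sources`, `SOURCE_LINE`, and the two constants are hypothetical names, and the regex assumes titles and URLs are separated by a spaced hyphen exactly as the prompt specifies.

```python
import re

# Matches one Sources entry: [n] "quote" - Title - URL
# (the format comes from the writer's system prompt above).
SOURCE_LINE = re.compile(r'^\[(\d+)\]\s+"(.+)"\s+-\s+(.+)\s+-\s+(https?://\S+)$')

MAX_SOURCES = 6       # "Cite no more than 6 sources total"
MAX_QUOTE_WORDS = 25  # "25 words or fewer"


def lint_sources(brief_md: str) -> list[str]:
    """Return a list of format problems found in the brief's Sources entries."""
    problems = []
    # Sources entries are the lines that start with a bracketed number.
    entries = [ln.strip() for ln in brief_md.splitlines() if re.match(r"^\[\d+\]\s", ln.strip())]
    if len(entries) > MAX_SOURCES:
        problems.append(f"{len(entries)} sources cited; the limit is {MAX_SOURCES}")
    for entry in entries:
        m = SOURCE_LINE.match(entry)
        if m is None:
            problems.append(f"malformed entry: {entry[:40]}")
            continue
        n, quote, _title, _url = m.groups()
        words = len(quote.split())
        if words > MAX_QUOTE_WORDS:
            problems.append(f"[{n}] quote is {words} words; the limit is {MAX_QUOTE_WORDS}")
    return problems
```

A check like this only covers format. Whether the URL is live and the quote really appears on the page still needs a fetch, which is the grader's job.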
+  markdown cell:
+    source:
+      ## 3. Write a rubric the grader can act on
+      
+      The `user.define_outcome` event hands the session two things: a **description** of the deliverable, which the writer reads, and a **rubric**, which the grader reads. After each writer turn the platform spins up a fresh grader with the same model and tools as the writer, gives it the rubric, and lets it inspect the artifact. The grader returns a per-criterion verdict, and if anything fails, the explanation goes straight back to the writer for the next revision.
+      
+      The rubric is the only lever you have on the grader, and how it's worded determines whether the grader actually checks anything. The default failure mode is a grader that approves everything. A rubric that says *"check that the brief covers demand charges"* lets the grader skim the brief, see a paragraph about demand charges, and write a sentence confirming it's there. It can do all of that without opening a single source. Most first drafts pass on that basis and the loop never runs. A rubric that says *"open the brief, find the demand-charges section, and confirm it states a $/kW figure or a % of operating cost"* forces the grader to produce evidence. That grader is the one that catches a press release masquerading as a 10-K.
+      
+      A few things worth doing in every rubric:
+      
+      | Principle | In practice |
+      |---|---|
+      | **Make each criterion checkable** | The task says "named-operator economics." The rubric pins it down: GAAP net loss from a 10-K or 10-Q, cited to the filing on `sec.gov`. The rubric should always be more specific than the task. |
+      | **Make the grader earn `satisfied`** | Require concrete evidence (a fetched page, a traced formula, a `file:line` reference) before the grader passes anything. A grader that's too strict costs you an extra loop. One that's too lenient ends the loop with the bad version still in place. |
+      | **Describe the goal, not the steps** | A rubric that prescribes a specific command fails silently when it isn't available, and the check you cared about never runs. Define what counts as proof; the grader has the writer's full toolset and can find it. |
+      | **Anticipate the writer's shortcuts** | "Do NOT corroborate via mirrors, reposts, or search snippets." Without that line, a dead source gets swapped for a scraper page and the grader passes it. |
+      | **Mandate the feedback format** | The grader's explanation is the writer's only signal. Ask for a one-line scoreboard, then one bullet per failure with what's wrong and what to do. |
+      | **Tell the grader what to ignore** | Without a no-fire list, the grader thrashes on style nits, pre-existing issues, and scope creep. Spell out what's out of bounds and have it self-check each finding before raising it. |
+      
+      > **Don't have a rubric yet?** Hand Claude a known-good example of the artifact and ask it to analyze what makes it good, then turn that analysis into criteria. Deriving criteria from a worked example reliably beats writing them from a blank page.
+      
+      Here's the task and rubric for our brief. Read them side by side and notice how much more specific the rubric is. Each place it adds detail the task didn't have is a check the grader can actually run.
+  code cell:
+    source:
+      TASK = """Write a brief on the unit economics of public DC fast charging in the United States.
+      The brief should cover:
+        1. Capex range
+        2. Demand charges
+        3. Utilization breakeven
+        4. Subsidy programs
+        5. Named-operator economics
+        6. A contrarian or skeptical source
+        7. Hardware vs install cost split
+      """
+      
+      
+      RUBRIC = """
+      You are reviewing a research brief at /mnt/session/outputs/brief.md against a coverage checklist and verifying its citations. The writer was told the seven topics to cover; this rubric defines what counts as sufficient coverage for each topic, and how to verify citations.
+      
+      COVERAGE CHECKLIST. Each item has a specific area:
+        1. Capex range: a dollar range for installed cost per DC fast-charging stall or station.
+        2. Demand charges: quantified impact on opex (a $/kW figure or a % of operating cost).
+        3. Utilization breakeven: a breakeven or target utilization threshold (% or kWh/day).
+        4. Subsidy programs: NEVI or another public funding program, named.
+        5. Named operator: the GAAP net income or net loss from a specific public charging operator's most recent 10-K or 10-Q, and the citation for it must be the SEC filing itself (sec.gov), not a press release, earnings-call recap, or news article.
+        6. Contrarian source: at least one cited source whose thesis is that the economics are unfavorable or structurally challenged.
+        7. Cost split: a hardware vs soft-cost (install, permitting, grid) breakdown or ratio.
+      
+      CITATION CHECK. For every [n] entry in the Sources section:
+        a. LIVE: Fetch the URL with web_fetch. Mark LIVE only if web_fetch returns the readable page directly. Mark DEAD if 404, parked, login-walled, paywalled, returns a bot-block/403, or renders only via JavaScript. Do NOT corroborate via mirrors, reposts, or search snippets; the cited URL itself must fetch.
+        b. VERBATIM: Search the fetched page for the quoted string. Mark QUOTE_MATCH if the exact string appears (treat curly vs straight quotes as equivalent); NOT_FOUND otherwise.
+        c. SUPPORTS CLAIM: Mark SUPPORTS_CLAIM if the quoted passage actually backs the claim it's cited on in the brief; UNSUPPORTED if it's tangential, contradicts the claim, or is just a general statement of fact.
+      
+      OUTPUT FORMAT:
+      
+      Line 1: Coverage N/7. Citations M/K verified.
+      
+      Then, for each failed item in the coverage checklist, create a new bullet, name the item and say what specific bar it failed in one sentence max per bullet. For example: "Item 3 Utilization breakeven - MISSING. <what's missing>".
+      
+      Then, for each failed citation, create a new bullet with the format: "[n] domain - REASON. <what's wrong and what to do>". One sentence max per bullet. For example: "[3] evgo.com - DEAD. The URL returns a 403 error and appears to be behind a bot block. No mirrors or reposts; the cited URL itself must fetch."
+      """
+      
+      client.beta.sessions.events.send(
+          session.id,
+          betas=BETAS,
+          events=[
+              {
+                  "type": "user.define_outcome",
+                  "description": TASK,
+                  "rubric": {"type": "text", "content": RUBRIC},
+                  "max_iterations": 5,
+              },
+          ],
+      )
+    outputs:
+      output 0:
+        output_type: execute_result
+        execution_count: 4
+        data:
+          text/plain: BetaManagedAgentsSendSessionEvents(data=[BetaManagedAgentsUserDefineOutcomeEvent(id='sevt_011CakRYTKNkXJN7eh4kRs1r', description='Write a brief on the unit economics of public DC fast charging in the United States.\nThe brief should cover:\n  1. Capex range\n  2. Demand charges\n  3. Utilization breakeven\n  4. Subsidy programs\n  5. Named-operator economics\n  6. A contrarian or skeptical source\n  7. Hardware vs install cost split\n', max_iterations=5, outcome_id='outc_011CakRYTKNkJ83CBexcz13e', processed_at=datetime.datetime(2026, 5, 6, 1, 10, 30, 512206, tzinfo=TzInfo(0)), rubric=BetaManagedAgentsTextRubric(content='\nYou are reviewing a research brief at /mnt/session/outputs/brief.md against a coverage checklist and verifying its citations. The writer was told the seven topics to cover; this rubric defines what counts as sufficient coverage for each topic, and how to verify citations.\n\nCOVERAGE CHECKLIST. Each item has a specific area:\n  1. Capex range: a dollar range for installed cost per DC fast-charging stall or station.\n  2. Demand charges: quantified impact on opex (a $/kW figure or a % of operating cost).\n  3. Utilization breakeven: a breakeven or target utilization threshold (% or kWh/day).\n  4. Subsidy programs: NEVI or another public funding program, named.\n  5. Named operator: the GAAP net income or net loss from a specific public charging operator\'s most recent 10-K or 10-Q, and the citation for it must be the SEC filing itself (sec.gov), not a press release, earnings-call recap, or news article.\n  6. Contrarian source: at least one cited source whose thesis is that the economics are unfavorable or structurally challenged.\n  7. Cost split: a hardware vs soft-cost (install, permitting, grid) breakdown or ratio.\n\nCITATION CHECK. For every [n] entry in the Sources section:\n  a. LIVE: Fetch the URL with web_fetch. Mark LIVE only if web_fetch returns the readable page directly. 
Mark DEAD if 404, parked, login-walled, paywalled, returns a bot-block/403, or renders only via JavaScript. Do NOT corroborate via mirrors, reposts, or search snippets; the cited URL itself must fetch.\n  b. VERBATIM: Search the fetched page for the quoted string. Mark QUOTE_MATCH if the exact string appears (treat curly vs straight quotes as equivalent); NOT_FOUND otherwise.\n  c. SUPPORTS CLAIM: Mark SUPPORTS_CLAIM if the quoted passage actually backs the claim it\'s cited on in the brief; UNSUPPORTED if it\'s tangential, contradicts the claim, or is just a general statement of fact.\n\nOUTPUT FORMAT: \n\nLine 1: Coverage N/7. Citations M/K verified.\n\nThen, for each failed item in the coverage checklist, create a new bullet, name the item and say what specific bar it failed in one sentence max per bullet. For example: "Item 3 Utilization breakeven - MISSING. <what\'s missing>".\n\nThen, for each failed citation, create a new bullet with the format: "[n] domain - REASON. <what\'s wrong and what to do>". One sentence max per bullet. For example: "[3] evgo.com - DEAD. The URL returns a 403 error and appears to be behind a bot block. No mirrors or reposts; the cited URL itself must fetch."\n', type='text'), type='user.define_outcome')])
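The rubric's VERBATIM rule (exact string match, with curly and straight quotes treated as equivalent) can be pinned down in a few lines of plain Python. This is only an illustration of the rule as written: the grader is a model working with `web_fetch`, not this function, and `quote_verdict` is a hypothetical helper.

```python
# Map curly quotes to their straight equivalents, per the rubric's VERBATIM rule.
_QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def quote_verdict(page_text: str, quote: str) -> str:
    """Return QUOTE_MATCH if the quote appears exactly in the page, else NOT_FOUND."""
    page = page_text.translate(_QUOTE_MAP)
    needle = quote.translate(_QUOTE_MAP)
    # An empty quote never counts as a match.
    return "QUOTE_MATCH" if needle and needle in page else "NOT_FOUND"
```

Note that nothing else is normalized: a changed word, dropped punctuation, or different whitespace all yield `NOT_FOUND`, which is the strictness the rubric asks for.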
+  markdown cell:
+    source:
+      A few notes on the call above:
+      
+      - **The description and rubric have separate jobs.** The description tells the writer what to make. The rubric tells the grader how to check it. They should agree on the artifact's location and format; if they contradict each other (say the description asks for inline output and the rubric grades a file at `/mnt/session/outputs/`), the loop returns `failed` instead of thrashing.
+      - **`max_iterations` defaults to 3 (max 20).** We set it to 5 since this rubric is strict and the writer needs room. If every run hits the cap with the grader finding the same kind of issue each time, the writer can't act on the feedback and you're paying for iterations that don't converge.
+      - **Inline rubric vs. file rubric.** Inline text is fine for a notebook. In production you'd upload the rubric once via the Files API and pass `rubric: {"type": "file", "file_id": ...}` so it's reusable across sessions and reviewable like code.
+      
+      **Why not just put the rubric in the system prompt?** You can, and it helps the writer aim better. But a writer that knows the criteria is still grading its own work. It will say it passed whenever it believes it did, and it will not go back and refetch a URL it already cited or notice that the quote it remembers is slightly different from the quote on the page. The grader has no choice but to do those checks. It opens with a fresh context window and nothing but the rubric and the artifact, and the platform does not let the loop continue until it produces a verdict on every criterion. You cannot get that kind of separation from a single prompt, however well you write it.
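The inline and file rubric shapes described above differ only in the `rubric` field, so one small builder can cover both. This is a sketch under stated assumptions: `build_outcome_event` is a hypothetical helper, the lower bound of 1 on `max_iterations` is an assumption (the notes above only state the default of 3 and the maximum of 20), and uploading the rubric file is left out.

```python
def build_outcome_event(description, *, rubric_text=None, rubric_file_id=None, max_iterations=3):
    """Build a user.define_outcome event dict with an inline or file rubric."""
    # Exactly one rubric source must be given.
    if (rubric_text is None) == (rubric_file_id is None):
        raise ValueError("pass exactly one of rubric_text or rubric_file_id")
    # Upper bound of 20 is from the notes above; the lower bound of 1 is assumed.
    if not 1 <= max_iterations <= 20:
        raise ValueError("max_iterations must be between 1 and 20")
    if rubric_text is not None:
        rubric = {"type": "text", "content": rubric_text}
    else:
        rubric = {"type": "file", "file_id": rubric_file_id}
    return {
        "type": "user.define_outcome",
        "description": description,
        "rubric": rubric,
        "max_iterations": max_iterations,
    }
```

The result can be passed as `events=[build_outcome_event(TASK, rubric_text=RUBRIC, max_iterations=5)]` to the same `client.beta.sessions.events.send` call used earlier.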
+  markdown cell:
+    source:
+      ## 4. Watch the review loop
+      
+      Let's stream events and render each phase as it happens. We'll print a banner when the writer finishes a draft and show the grader's feedback after each evaluation.
+  markdown cell:
+    source:
+      Two display helpers for the loop below: `banner` draws a labeled divider, and `render_feedback` strips the boilerplate the server wraps around each grader explanation.
+  code cell:
+    source:
+      from IPython.display import Markdown, display
+      
+      HR = "━" * 46
+      
+      
+      def banner(label, tag=""):
+          display(Markdown(f"**{HR}**  \n**{label}** &nbsp; {tag}"))
+      
+      
+      def render_feedback(fb: str):
+          # Strip the server's per-criterion wrapper and trailer.
+          s = re.sub(
+              r"^An independent grader found.*?:\n\n- .*?\((?:partially |not )?met\): ",
+              "",
+              fb,
+              count=1,
+              flags=re.S,
+          )
+          s = re.sub(r"\n\nPlease revise your work.*$", "", s, flags=re.S)
+          display(Markdown(s))
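To see what the two regular expressions in `render_feedback` remove, it helps to run them on a plain string outside the notebook. The sketch below applies the same substitutions and returns the result instead of displaying it. `strip_wrapper` is a hypothetical name, and the wrapper wording in `sample` is invented to fit the patterns; the server's real text may differ.

```python
import re


def strip_wrapper(fb: str) -> str:
    """Apply render_feedback's two substitutions and return the remaining text."""
    # Drop the leading wrapper up to and including the "(met / partially met / not met): " tag.
    s = re.sub(
        r"^An independent grader found.*?:\n\n- .*?\((?:partially |not )?met\): ",
        "",
        fb,
        count=1,
        flags=re.S,
    )
    # Drop the trailing "Please revise your work..." instruction, if present.
    return re.sub(r"\n\nPlease revise your work.*$", "", s, flags=re.S)


# Hypothetical wrapper text, invented to fit the patterns above.
sample = (
    "An independent grader found gaps:\n\n"
    "- Brief meets rubric (not met): Item 5 cites a press release.\n\n"
    "Please revise your work and resubmit."
)
print(strip_wrapper(sample))
```

Text that does not match either pattern passes through unchanged, so the helper is safe to call on explanations that arrive without the wrapper.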
+  code cell:
+    source:
+      TERMINAL = {"satisfied", "max_iterations_reached", "failed", "interrupted"}
+      t0, res, iters = time.time(), None, 0
+      n_search, last_len = 0, 0
+      
+      with client.beta.sessions.events.stream(session.id, betas=BETAS) as stream:
+          for ev in stream:
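+              if ev.type == "agent.tool_use":
+                  # Count research calls made since the last evaluation.
+                  if ev.name in ("web_search", "web_fetch"):
+                      n_search += 1
+                  # Track the size of the latest full write of brief.md. `edit` calls are
+                  # not counted, so last_len can lag after in-place edits.
+                  if ev.name == "write" and str(ev.input.get("file_path", "")).endswith("brief.md"):
+                      last_len = len(ev.input["content"])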
+              elif ev.type == "span.outcome_evaluation_start":
+                  banner("writer · " + ("draft" if iters == 0 else f"revision {iters}"))
+                  display(
+                      Markdown(f"searched/fetched {n_search}× · wrote `brief.md` ({last_len:,} chars)")
+                  )
+                  n_search = 0
+              elif ev.type == "span.outcome_evaluation_end":
+                  res = ev.result
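+                  # Show the actual result so terminal states (failed, interrupted,
+                  # max_iterations_reached) are not mislabeled as needs_revision.
+                  banner(
+                      f"grader · pass {iters}",
+                      "✓ satisfied" if res == "satisfied" else f"⟳ {res}",
+                  )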
+                  render_feedback(ev.explanation)
+                  iters += 1
+                  if res in TERMINAL:
+                      break
+      
+      m, s = divmod(int(time.time() - t0), 60)
+      display(Markdown(f"**done:** {res} after {iters} pass{'es' if iters != 1 else ''} · {m}m {s:02d}s"))
+    outputs:
+      output 0:
+        output_type: display_data
+        data:
+          text/markdown:
+            **━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━**  
+            **writer · draft** &nbsp; 
+          text/plain: <IPython.core.display.Markdown object>
+      output 1:
+        output_type: display_data
+        data:
+          text/markdown: searched/fetched 18× · wrote `brief.md` (7,607 chars)
+          text/plain: <IPython.core.display.Markdown object>
+      output 2:
+        output_type: display_data
+        data:
+          text/markdown:
+            **━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━**  
+            **grader · pass 0** &nbsp; ⟳ needs_revision
+          text/plain: <IPython.core.display.Markdown object>
+      output 3:
+        output_type: display_data
+        data:
+          text/markdown: The brief covers 5 of 7 required topics adequately. Item 2 (Demand charges) fails the quantification bar: the brief describes demand charges qualitatively as the 'single largest operational wildcard' but never states a $/kW figure (McKinsey's footnote of $20/kW is never quoted in the text) or a % of operating cost. Item 5 (Named operator) fails the citation requirement: EVgo's Q1 2024 net loss of $28.2M is cited to evchargingstations.com [6], a third-party news article, not an SEC filing from sec.gov as the rubric explicitly requires. All 6 citations are LIVE, and all 6 quoted strings match the fetched pages and support the claims they are attached to — but [6] is the wrong source type for the Named Operator criterion.
+          text/plain: <IPython.core.display.Markdown object>
+      output 4:
+        output_type: display_data
+        data:
+          text/markdown:
+            **━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━**  
+            **writer · revision 1** &nbsp; 
+          text/plain: <IPython.core.display.Markdown object>
+      output 5:
+        output_type: display_data
+        data:
+          text/markdown: searched/fetched 6× · wrote `brief.md` (7,989 chars)
+          text/plain: <IPython.core.display.Markdown object>
+      output 6:
+        output_type: display_data
+        data:
+          text/markdown:
+            **━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━**  
+            **grader · pass 1** &nbsp; ⟳ needs_revision
+          text/plain: <IPython.core.display.Markdown object>
+      output 7:
+        output_type: display_data
+        data:
+          text/markdown: Six of seven coverage items are fully met: (1) capex range with hardware $45K–$100K and install $40K–$150K per charger; (2) demand charges quantified at $20/kW and $9,600/month for a 480kW station; (3) utilization breakeven at 20% vs. 7.5% national average; (4) NEVI Formula Program named with $5B allocation; (6) contrarian source—Great Plains Institute 'nearly all DCFC scenarios lose money' fully cited and live; (7) hardware vs. install cost split clearly broken out. All six citations fetch live, every quoted string matches exactly, and all citations support the claims they are attached to (6/6 verified). Item 5 (Named operator) fails the rubric's specific requirement: the rubric demands GAAP net income/loss from a 10-K or 10-Q, and the citation must be the SEC filing itself—not a press release. The brief cites EVgo's FY2024 GAAP net loss of $126.7M, but the citation [6] is an 8-K EX-99.1 (earnings press release exhibit), explicitly labelled '(8-K EX-99.1)' in the Sources section. The rubric specifically excludes press releases, and an 8-K EX-99.1 is precisely that—a press release filed with the SEC—not a 10-K or 10-Q annual/quarterly report.
+          text/plain: <IPython.core.display.Markdown object>
+      output 8:
+        output_type: display_data
+        data:
+          text/markdown:
+            **━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━**  
+            **writer · revision 2** &nbsp; 
+          text/plain: <IPython.core.display.Markdown object>
+      output 9:
+        output_type: display_data
+        data:
+          text/markdown: searched/fetched 7× · wrote `brief.md` (7,989 chars)
+          text/plain: <IPython.core.display.Markdown object>
+      output 10:
+        output_type: display_data
+        data:
+          text/markdown:
+            **━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━**  
+            **grader · pass 2** &nbsp; ✓ satisfied
+          text/plain: <IPython.core.display.Markdown object>
+      output 11:
+        output_type: display_data
+        data:
+          text/markdown: All 7 coverage items are fully met: (1) Capex range gives hardware $45K-$100K+ and installation $40K-$150K+ per charger, plus NEVI average $915,420/station. (2) Demand charges quantified at $20/kW (explicit $/kW figure) with $9,600/month example. (3) Utilization breakeven at 20% or $0.53/kWh threshold explicitly modeled. (4) NEVI and IRA Section 30C named as subsidy programs. (5) EVgo GAAP net loss of $126.7M for FY2024 cited from the SEC filing itself (sec.gov XBRL R4.htm), not a press release. (6) Great Plains Institute contrarian source cited with thesis that 'nearly all DCFC scenarios lose money.' (7) Hardware vs. install cost split explicitly broken out. All 6 citations are LIVE (URLs return readable pages), all quoted strings appear verbatim in the fetched pages, and all support the claims they are cited on.
+          text/plain: <IPython.core.display.Markdown object>
+      output 12:
+        output_type: display_data
+        data:
+          text/markdown: **done:** satisfied after 3 passes · 12m 56s
+          text/plain: <IPython.core.display.Markdown object>
+  markdown cell:
+    source:
+      ### What just happened
+      
+      The loop ran for three grading passes.
+      
+      The first draft covered five of seven items. The grader sent it back for two misses: demand charges were described qualitatively with no dollars-per-kW figure, and the named-operator section cited a third-party news article rather than an SEC filing. The writer added the $20/kW number and went to sec.gov for EVgo's FY2024 net loss.
+      
+      The second pass cleared six of seven. The sec.gov document the writer cited was an 8-K Exhibit 99.1, the earnings press release as filed, not the 10-K or 10-Q the rubric asks for. The grader noted that the Sources entry was labelled "(8-K EX-99.1)", identified it as a press-release exhibit, and rejected it on exactly that distinction. On the third pass the writer cited the Consolidated Statements of Operations from EVgo's 10-K on EDGAR and the grader cleared it: 7/7 coverage, with all six citations live, quote-matched, and supporting their claims.
+      
+      Both rejections happened because the rubric drew a line the grader could check. "A $/kW figure or a % of operating cost" and "the SEC filing itself, not a press release" are decisions the rubric author made; the task itself ("cover demand charges," "named-operator economics") would have waved both through. The second rejection is the one to study. The writer found a `sec.gov` URL, which on its face satisfies the task, and the grader still bounced it because the rubric distinguished an 8-K press-release exhibit from a 10-K. Without that distinction the loop ends one pass early with a worse brief.
+      
+      **Note:** The trace above is from the original cookbook run. Search results and model output vary between runs, so your drafts, feedback, and pass count will likely differ.
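The rubric's OUTPUT FORMAT asks for a first line of the form `Coverage N/7. Citations M/K verified.` If you want to track progress across passes programmatically, that line is easy to parse. This is a minimal sketch with a hypothetical helper name. It returns `None` when the line is absent, because the explanations shown in the trace above are prose summaries that do not include it.

```python
import re

# The scoreboard shape requested by the rubric's OUTPUT FORMAT section.
SCOREBOARD = re.compile(r"Coverage\s+(\d+)/(\d+)\.\s+Citations\s+(\d+)/(\d+)\s+verified")


def parse_scoreboard(explanation: str):
    """Return (coverage_met, coverage_total, cites_ok, cites_total), or None if absent."""
    m = SCOREBOARD.search(explanation)
    return tuple(int(g) for g in m.groups()) if m else None
```

Treat a `None` result as "no scoreboard available" rather than as a failure, and fall back to reading the explanation text.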
+  markdown cell:
+    source:
+      ## 5. Read the final brief
+      
+      Finally, let's pull the last version of `brief.md` the writer produced.
+  code cell:
+    source:
+      # Reconstruct the final brief from the event log: full content on `write`,
+      # then apply each `edit` (old_string -> new_string) in order.
+      content = ""
+      for ev in client.beta.sessions.events.list(session.id, limit=1000, betas=BETAS):
+          if ev.type != "agent.tool_use" or "brief.md" not in str(ev.input.get("file_path", "")):
+              continue
+          if ev.name == "write":
+              content = ev.input["content"]
+          elif ev.name == "edit":
+              content = content.replace(ev.input["old_string"], ev.input["new_string"], 1)
+      
+      display(Markdown(content))
+    outputs:
+      output 0:
+        output_type: display_data
+        data:
+          text/markdown:
+            # Unit Economics of Public DC Fast Charging in the United States
+            
+            *Research Brief — May 2026*
+            
+            ---
+            
+            ## 1. Capex Range
+            
+            Public DCFC is capital-intensive at every tier. A single 150–350 kW dispenser carries hardware costs of $45,000 to over $100,000, while installation alone—trenching, conduit, transformer work, concrete, permitting, and commissioning—adds $40,000 to over $150,000 per charger; grid-interconnection upgrades can push total project costs into the millions [1]. Real-world NEVI program data corroborate these figures: Paren's analysis of 330 winning state award applications shows an average total project cost of $915,420 per station, with a mean per-port figure of $192,614 [3]. Site variance is extreme—from a $132,480 Wisconsin highway stop to a $3.6 million Hawaii battery-storage installation [3].
+            
+            ## 2. Demand Charges
+            
+            Demand charges are the defining structural cost risk in DCFC economics. Utilities bill commercial customers on the peak 15-minute power draw in any billing cycle; a multi-vehicle simultaneous charge at a 150–350 kW station can produce a punishing spike that prices the entire month. NEVI corridor sites must deliver a minimum of 600 kW aggregate, compounding the exposure [2]. McKinsey modeled demand charges at $20 per kilowatt, with a four-charger station's 480 kW peak demand generating roughly $9,600 per month in demand costs alone—a figure that can dwarf energy charges at low utilization [1]. At some low-utilization sites, demand charges can account for over 90% of the total monthly electricity bill because the charge is levied on a single peak occurrence, not on aggregate energy consumed [4]. Some operators turn to battery storage or software-based peak-shaving; PG&E reported that load-management technologies reduced station power draw by more than 50%, saving $30,000–$200,000 per site in avoided infrastructure costs [2].
+            
+            ## 3. Utilization Breakeven
+            
+            McKinsey modeled a prototypical four-charger, 150 kW California station at 15% utilization (≈7 sessions/day): annual revenues of $265,000–$285,000 at $0.45/kWh, but operating costs of $220,000–$250,000 plus roughly $90,000 in annual capex depreciation, yielding an EBIT loss of $40,000–$50,000 [1]. The operator reaches breakeven only by raising utilization to 20% or lifting price to $0.53/kWh [1]. The 2022 national average utilization was approximately 7.5%—roughly half the modeled breakeven threshold [1]. Ancillary retail revenue narrows the gap: the same model finds that $12,000 in in-store spend could alone tip the station to breakeven [1].
+            
+            ## 4. Subsidy Programs
+            
+            Three federal mechanisms dominate:
+            
+            - **NEVI Formula Program**: $5 billion allocated to states, D.C., and Puerto Rico from FY 2022 through 2026, covering up to 80% of eligible project costs for corridor DCFC every 50 miles [5]. An additional $2.5 billion Charging and Fueling Infrastructure discretionary grant program targets communities and corridors [5]. Note: NEVI disbursements were paused by executive order in January 2025 and remain subject to active litigation.
+            - **IRA Section 30C Tax Credit**: 30% of project cost, capped at $100,000 per unit, for equipment installed in census tracts with a poverty rate ≥20% or median family income <80% of state median, through December 2032 [5].
+            - **Utility Make-Ready Programs**: Several large utilities (PG&E, SDG&E, Xcel) have introduced subscription-style tariffs or demand-limiter provisions that reduce or restructure demand charges for DCFC operators [2].
+            
+            McKinsey estimates that subsidy receipt converts a $40,000–$50,000 annual EBIT loss into a $25,000–$30,000 profit at the same 15% utilization, driven largely by reduced capex depreciation [1].
+            
+            ## 5. Named-Operator Economics
+            
+            Electrify America, EVgo, and Tesla collectively hold approximately 80% of the public DCFC market [1]. EVgo, which publishes the most granular public financials, reported full-year 2024 revenue of $256.8 million—a 60% increase year-over-year—but still posted a net loss of $126.7 million on a GAAP basis [6]. The company ended 2024 with approximately 4,080 stalls in operation, and in December 2024 closed a DOE loan guarantee of up to $1.25 billion to fund approximately 7,500 additional stalls over the next five years [6]. Management guided for Adjusted EBITDA breakeven in 2025 [6]. EVgo's prefabrication deployment model is separately projected to cut station construction costs by 15% and installation timelines by 50% [3].
+            
+            ## 6. Contrarian / Skeptical View
+            
+            The Great Plains Institute's foundational research is blunt: "Today's economics and the average electric utility rates mean that nearly all DCFC scenarios lose money" [4]. The authors frame it as a chicken-and-egg problem—stations lose money until EV adoption creates more customers, but adoption lags without stations. A counter-skeptical note from FreeWire's Chip Silverman warns that the proposed remedy of eliminating demand charges simply shifts the problem: "Waiving demand charges for fast chargers will shift costs to all ratepayers, including non-EV drivers," with Massachusetts projecting $84–$131 million in forgone utility revenue [2]. This critique suggests that policy shortcuts may create rate-equity problems while masking the industry's fundamental utilization deficit.
+            
+            ## 7. Hardware vs. Installation Cost Split
+            
+            McKinsey's range—$45,000–$100,000+ hardware versus $40,000–$150,000+ installation—implies that at complex sites, civil and electrical work equals or exceeds the cost of the charger itself [1]. Paren's NEVI data dramatize the spread: the cheapest real-world deployment ran $30,942 per port (Tesla, Wisconsin, simple site) while the costliest reached $895,132 per port (Hawaii, battery storage, remote grid) [3]—a 29× range driven almost entirely by infrastructure conditions. The practical heuristic: a $25,000–$75,000 dispenser routinely becomes a $150,000–$250,000 project once transformer upgrades, trenching, bollards, permitting, and network commissioning are included. Hardware costs are falling as competition among manufacturers grows; installation costs, tied to local labor markets and utility interconnection queues, are proving far stickier.
+            
+            ---
+            
+            ## Sources
+            
+            [1] "A 150 to 350kW DCFC charging unit can cost anywhere from $45,000 to over $100,000, and installation costs can range from $40,000 to over $150,000." - Can public EV fast-charging stations be profitable in the United States? - https://www.mckinsey.com/features/mckinsey-center-for-future-mobility/our-insights/can-public-ev-fast-charging-stations-be-profitable-in-the-united-states
+            
+            [2] "Waiving demand charges for fast chargers will shift costs to all ratepayers, including non-EV drivers." - Fast charging, high costs: Eliminating demand charges won't solve the problem - https://www.utilitydive.com/news/eliminating-demand-charges-wont-solve-EV-station-problems/689395/
+            
+            [3] "the average total NEVI project cost is $915,420. The median total cost is $802,267" - NEVI DC Fast Charging Station Total Project Cost Averages $915,000 - https://www.paren.app/blog/nevi-dc-fast-charging-station-total-project-cost-averages-915-000
+            
+            [4] "Today's economics and the average electric utility rates mean that nearly all DCFC scenarios lose money" - 'Nearly all' high voltage EV charging stations lose money: Report - https://www.utilitydive.com/news/nearly-all-high-voltage-ev-charging-stations-lose-money-report/561026/
+            
+            [5] "apportions a total of $5 billion to States, D.C., and Puerto Rico over five years, from Fiscal Year 2022 through 2026" - Federal Funding Programs - https://www.transportation.gov/rural/ev/toolkit/ev-infrastructure-funding-and-financing/federal-funding-programs
+            
+            [6] "Consolidated Statements of Operations - USD ($)$ in Thousands | 12 Months Ended Dec. 31, 2024 | Dec. 31, 2023" - EVgo Inc. Form 10-K, Consolidated Statements of Operations (XBRL), Period: 2024-12-31 - https://www.sec.gov/Archives/edgar/data/1821159/000155837025002400/R4.htm
+          text/plain: <IPython.core.display.Markdown object>
+  markdown cell:
+    source:
+      ## What you learned
+      
+      - **Outcomes fits when you can write down what good looks like.** Attention-to-detail and exhaustive-coverage tasks are the clearest fit; subjective quality works too if the rubric pins down the bar.
+      - **The rubric has to make the grader produce evidence.** Without that, it approves whatever it's shown.
+      - **The grader is independent and stateless.** It runs in its own context window so the writer can't talk it into anything, and a fresh one re-checks the whole artifact every iteration.
+      - **One outcome at a time, but you can chain them.** Once a loop terminates, the session is conversational again and the next `user.define_outcome` starts a new one.
+      
+      For more, see the [Outcomes documentation](https://platform.claude.com/docs/en/managed-agents/define-outcomes).

Generated by nbdime

@github-actions

github-actions Bot commented May 6, 2026

Model check

Validated model references in the changed files against the current public models list.

Findings

managed_agents/CMA_coordinate_specialist_team.ipynb — uses claude-opus-4-6 as default

MODEL = os.environ.get("COOKBOOK_MODEL", "claude-opus-4-6")

claude-opus-4-6 is now in the legacy models section of the docs. The current latest Opus model is claude-opus-4-7 (released as the most capable generally available model, with a step-change improvement in agentic coding over 4.6). Consider updating the default to claude-opus-4-7 for new cookbooks.

Note: the project's `CLAUDE.md` lists `claude-opus-4-6` as the current Opus, but the canonical docs now list `claude-opus-4-7` as the latest. `CLAUDE.md` may itself be stale.

managed_agents/CMA_verify_with_outcome_grader.ipynb — ✓ uses claude-sonnet-4-6, the current latest Sonnet alias.

managed_agents/README.md — ✓ no model references.

Other checks

  • No deprecated model IDs found (no claude-sonnet-4-0 / claude-opus-4-0 references).
  • No internal/non-public model names found.
  • No dated model IDs used (both notebooks use the non-dated alias form, per CLAUDE.md guidance).
  • Both notebooks correctly source the model from `COOKBOOK_MODEL` env var with a sensible default.
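
As a rough illustration of the dated-ID check listed above, the sketch below flags model IDs that end in an eight-digit date suffix. The regex and the helper name `find_dated_model_ids` are assumptions made for this example; they are not the checker this workflow actually runs.

```python
import re

# Assumption: a "dated" model ID ends in an 8-digit date, e.g. -20250514.
DATED_ID = re.compile(r"\bclaude-[a-z0-9-]+-\d{8}\b")


def find_dated_model_ids(text: str) -> list[str]:
    """Return every model ID in `text` that carries a date suffix."""
    return DATED_ID.findall(text)


# The alias form used by the notebooks has no date suffix, so nothing is flagged.
alias_line = 'MODEL = os.environ.get("COOKBOOK_MODEL", "claude-opus-4-6")'
print(find_dated_model_ids(alias_line))

# A dated ID (example string only) is reported.
print(find_dated_model_ids('MODEL = "claude-sonnet-4-20250514"'))
```

A check like this only recognizes the suffix pattern; it cannot tell whether an alias is current or legacy, which still requires comparing against the published models list.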

Recommendation

Update the default in `CMA_coordinate_specialist_team.ipynb` from `claude-opus-4-6` to `claude-opus-4-7` (and re-run the notebook to refresh outputs if the agent IDs / sample outputs were tied to the older model).

@github-actions

github-actions Bot commented May 6, 2026

Link Review

Reviewed links in:

  • managed_agents/README.md
  • managed_agents/CMA_coordinate_specialist_team.ipynb
  • managed_agents/CMA_verify_with_outcome_grader.ipynb

✅ Valid links

Anthropic documentation (HTTPS, current platform.claude.com domain):

  • CMA_coordinate_specialist_team.ipynb: Managed Agents documentation — verified, page exists and matches the cited topic (Multiagent sessions).
  • CMA_verify_with_outcome_grader.ipynb: Outcomes documentation (cited twice) — verified, page exists and matches the cited topic (Define outcomes).

Internal relative paths in README.md: All 13 referenced files exist in managed_agents/ (the nine CMA_*.ipynb notebooks, data_analyst_agent.ipynb, slack_data_bot.ipynb, sre_incident_responder.ipynb, and example_data/OVERVIEW.md).

External research citations in CMA_verify_with_outcome_grader.ipynb output cells (these are part of the agent's recorded brief, not authored prose, but worth noting they're all HTTPS and on reputable sources):

  • mckinsey.com — McKinsey Center for Future Mobility insight
  • utilitydive.com — two industry articles
  • paren.app — EV charging analytics blog
  • transportation.gov — US DOT federal funding programs
  • sec.gov — EVgo Form 10-K XBRL filing on EDGAR

⚠️ Notes

None requiring action. The external citations in the outcome-grader notebook are recorded model output from the demo run and were checked by the grader at runtime as part of the cookbook's demonstration; they are not authoring statements that need to be kept evergreen.

❌ Broken or problematic links

None.

All links use HTTPS, internal links use relative paths, external links point to stable reputable sources, and the Anthropic docs links resolve to current pages.

@github-actions github-actions Bot left a comment

PR Review

Recommendation: COMMENT

Summary

Adds two well-crafted Managed Agents cookbooks: a multiagent coordinator demo (heterogeneous specialist team for sales-proposal generation) and an Outcomes/grader demo (writer + stateless grader loop with rubric-driven revision). Plus matching managed_agents/README.md and registry.yaml entries.

Pedagogy is excellent — rubric-design tables, "Why not just put the rubric in the system prompt?" sidebar, and post-run analysis sections are model examples for the cookbook. No critical issues; the items below are improvements rather than blockers.

Actionable Feedback (5 items)

  • managed_agents/CMA_coordinate_specialist_team.ipynb (end) and managed_agents/CMA_verify_with_outcome_grader.ipynb (end) — Add cleanup cells that call client.beta.sessions.archive(...) and client.beta.environments.delete(...). Other CMA cookbooks tear down sessions/environments to avoid accumulating live billed resources; a learner running these repeatedly will leave a trail of proposal-meridian / research-brief environments behind.
  • CMA_coordinate_specialist_team.ipynb (in cell with display(Markdown(ev.input["content"]))) — Use ev.input.get("file_path", "") defensively, matching the verify notebook's equivalent. Direct ["file_path"] access raises KeyError if a future SDK ever emits a write tool use without file_path.
  • managed_agents/README.md and registry.yaml — Both new entries paraphrase the cookbook descriptions slightly differently. Keeping the README table cell text and registry.yaml description field identical (or sourced from one canonical string) prevents drift between the website and the README.
  • General — Neither notebook nor the README mentions that the managed-agents-2026-04-01 beta requires allowlisted access. A user outside the beta will get a confusing 4xx. One sentence in the README "Getting started" or in each notebook's setup section would save real debugging.
  • CMA_coordinate_specialist_team.ipynb (in cell with PROSPECT = {...}) — Minor: ~{PROSPECT['employees']} employees formats as ~8500. Use :, to render 8,500 so it reads consistently with the case-study summaries (6,200, 2,800).
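
The two small coordinator-notebook fixes above (defensive key access and thousands separators) can be sketched as follows. The values in `tool_input` and `PROSPECT` are hypothetical stand-ins; only the key names and the figure 8500 come from the feedback items.

```python
# Hypothetical stand-ins for the notebook's values; only the key names
# ("content", "file_path", "employees") come from the feedback above.
tool_input = {"content": "# Proposal draft"}  # deliberately has no "file_path" key
PROSPECT = {"employees": 8500}

# dict.get() returns the fallback instead of raising KeyError on a missing key.
file_path = tool_input.get("file_path", "")

# The "," format spec inserts thousands separators.
headcount = f"~{PROSPECT['employees']:,} employees"

print(repr(file_path))  # ''
print(headcount)        # ~8,500 employees
```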

Detailed Review

Code Quality

  • Both notebooks correctly use dotenv.load_dotenv() plus os.environ.get("COOKBOOK_MODEL", ...) for model selection.
  • Models use the non-dated aliases (claude-opus-4-6, claude-sonnet-4-6) per CLAUDE.md.
  • The two notebooks pick different default models (Opus for coordinator, Sonnet for writer/grader). That is plausibly intentional — the coordinator orchestrates while the writer/grader are simpler — but a one-line note explaining why would help readers calibrate their own choices.
  • Both cookbooks reimplement an inline streaming loop instead of using utilities.stream_until_end_turn, because the coordinator emits thread_created/thread_message_received and the Outcomes loop emits span.outcome_evaluation_* events. That's the right call, but a one-sentence note (mirroring the gate-notebook commentary in README.md) would clarify why.
  • The make_agent helper in the coordinator cookbook lacks type hints; not load-bearing, just a project-style nit.
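
To illustrate why an inline loop is needed, here is a minimal sketch of routing the extra event types by name. The event type names come from this review; the dict-shaped events, the `classify` and `summarize` helpers, and the `_example` suffix are assumptions for illustration and do not reflect the SDK's actual event objects.

```python
from collections import Counter
from typing import Iterable

# Event type names are taken from the review; everything else is hypothetical.
THREAD_EVENTS = {"thread_created", "thread_message_received"}
OUTCOME_PREFIX = "span.outcome_evaluation_"


def classify(event_type: str) -> str:
    """Bucket an event type into thread, outcome, or other."""
    if event_type in THREAD_EVENTS:
        return "thread"
    if event_type.startswith(OUTCOME_PREFIX):
        return "outcome"
    return "other"


def summarize(events: Iterable[dict]) -> Counter:
    """Count events per bucket; a real loop would render each one instead."""
    return Counter(classify(e["type"]) for e in events)


# Made-up event stream; the "_example" and "some.other" names are invented.
sample = [
    {"type": "thread_created"},
    {"type": "thread_message_received"},
    {"type": "span.outcome_evaluation_example"},
    {"type": "some.other_event"},
]
print(summarize(sample))
```

A generic helper that only waits for the end of a turn would fall through to the "other" bucket for the first three events, which is the gap the inline loops fill.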

Security

  • No hardcoded keys, no os.environ assignment of credentials, no shell-injection-prone strings. Beta header is correctly listed via BETAS.
  • Mounted resources (/mnt/user-data/...) are scoped to the session's environment; nothing leaks to the host.

Suggestions

  • Coordinate notebook prints diagnostic glyphs (━━━); verify notebook uses / . Both render fine in Jupyter; just be aware of Windows console caveats outside Jupyter.
  • import re and import time in the verify notebook are both used (render_feedback, elapsed-time math). No dead imports.
  • Registry categories (Agent Patterns, Tools, Evals) match existing taxonomy. Both authors markn-ant and gaganb-ant are present in authors.yaml.

Positive Notes

  • The verify notebook's "What you'll learn" → rubric-design table → live trace → "What just happened" arc is exactly the structure these explainers should follow. Catching the 8-K Exhibit 99.1 vs. 10-K distinction in pass 2 is a great concrete teaching moment.
  • The coordinator notebook's send_to_parent payload printout (showing each subagent's raw return) is a clean way to make the multi-agent boundary visible before showing the assembled artifact.
  • "Why three subagents instead of one" closing section in the coordinator notebook directly addresses the obvious reader question.

@markn-ant markn-ant requested a review from PedramNavid May 6, 2026 15:18
@markn-ant markn-ant merged commit d28edf1 into main May 6, 2026
8 of 9 checks passed
@markn-ant markn-ant deleted the mnowicki/managed-agents-multiagent-outcomes branch May 6, 2026 16:00