Skip to content

feat(managed_agents): add callable_agents and define_outcome cookbooks#592

Closed
markn-ant wants to merge 1 commit into
mainfrom
markn/cma-callable-agents-outcomes
Closed

feat(managed_agents): add callable_agents and define_outcome cookbooks#592
markn-ant wants to merge 1 commit into
mainfrom
markn/cma-callable-agents-outcomes

Conversation

@markn-ant

Copy link
Copy Markdown
Contributor

Adds two guided tutorials for the Managed Agents research-preview features:

  • CMA_coordinate_specialist_team.ipynb — heterogeneous multiagent via callable_agents. A coordinator runs three specialists with scoped toolsets to assemble a sales proposal.
  • CMA_verify_with_outcome_grader.ipynb — iterative grader loop via define_outcome. A writer drafts a cited brief, a grader independently verifies every citation against a rubric, feedback drives revisions.

Both follow the existing CMA tutorial conventions (COOKBOOK_MODEL env var, numbered sections, committed outputs). README table and registry.yaml updated.

@github-actions

github-actions Bot commented May 3, 2026

Copy link
Copy Markdown

Notebook Changes

This PR modifies the following notebooks:

📓 managed_agents/CMA_coordinate_specialist_team.ipynb

View diff
nbdiff /dev/null managed_agents/CMA_coordinate_specialist_team.ipynb (30e12afc4db25f595a545199eaf37422bd221d41)
--- /dev/null  2026-05-03 19:45:53.249968
+++ managed_agents/CMA_coordinate_specialist_team.ipynb (30e12afc4db25f595a545199eaf37422bd221d41)  (no timestamp)
## added /cells:
+  markdown cell:
+    source:
+      # Building a Sales Proposal with a Heterogeneous Agent Team
+      
+      We'll use Claude Managed Agents and the callable_agents feature to build a sales proposal for a fictional product called Northstar, a workflow-automation platform for mid-market operations teams.
+      
+      Right now, their reps build a tailored proposal for each prospect: research what companies in the prospect's segment typically prioritize, pull two relevant case studies from a library of a few hundred, model pricing from an internal rules sheet, and assemble it into a two-page document. Each step draws on a different source and a different kind of judgment.
+      
+      We'll have a coordinator agent run three specialists to do this. A researcher gets web search and finds the prospect's priorities. A librarian reads the case-study library and picks the two best matches. A pricing modeler sees only the rules file and the seat count. The coordinator sequences them and writes the proposal.
+  markdown cell:
+    source:
+      ## 1. Set up the client
+      
+      First, let's install the SDK and set up the Anthropic client. The `callable_agents` field and the multiagent event types are currently behind the research-preview header.
+  code cell:
+    source:
+      %pip install anthropic
+  code cell:
+    source:
+      import os, time, httpx, anthropic
+      
+      PREVIEW = "managed-agents-2026-04-01-research-preview"
+      MODEL = os.environ.get("COOKBOOK_MODEL", "claude-opus-4-7")
+      client = anthropic.Anthropic()
+      
+      
+      # The SDK does not yet type the research-preview event types (thread_created,
+      # thread_message_received, ...), so we read events as raw JSON.
+      def list_events(session_id):
+          r = httpx.get(
+              f"https://api.anthropic.com/v1/sessions/{session_id}/events",
+              params={"limit": 1000},
+              headers={
+                  "x-api-key": client.api_key,
+                  "anthropic-version": "2023-06-01",
+                  "anthropic-beta": f"managed-agents-2026-04-01,{PREVIEW}",
+              },
+              timeout=60,
+          )
+          r.raise_for_status()
+          return r.json()["data"]
+  markdown cell:
+    source:
+      ## 2. Define three specialist subagents
+      
+      Next, we'll create the three teammates. Each one gets its own system prompt, its own output schema, and only the tools it needs for its job. The researcher gets web search, the case-study picker can only read the local library, and the pricing modeler just sees `pricing_rules.md` and the seat count. Scoping tools per role keeps the pricer from pulling a competitor's number off the web and keeps the full case-study library out of the coordinator's context.
+  code cell:
+    source:
+      def make_agent(name, description, system, tools):
+          a = client.beta.agents.create(
+              name=name,
+              description=description,
+              model=MODEL,
+              system=system,
+              tools=tools,
+              betas=[PREVIEW],
+          )
+          print(f"{name}: {a.id}")
+          return {"type": "agent", "id": a.id}
+      
+      
+      prospect_researcher = make_agent(
+          "prospect_researcher",
+          "Researches what companies in a given industry segment and size tier typically prioritize.",
+          """Given a prospect's industry and size, use web search to find:
+      - What companies in that segment typically list as strategic priorities
+      - Recent trends or pressures in that industry
+      - Common operational pain points at that scale
+      Return via send_to_parent: {"priorities": [...], "recent_moves": [...], "pain_points": [...], "sources": [...]}""",
+          [
+              {
+                  "type": "agent_toolset_20260401",
+                  "configs": [{"name": "web_search"}, {"name": "web_fetch"}],
+              }
+          ],
+      )
+      
+      case_study_picker = make_agent(
+          "case_study_picker",
+          "Selects the two most relevant case studies from the library for a given prospect profile.",
+          """The case study library is in /mnt/user-data/case_studies/. Each file is one customer story.
+      You will be given a prospect's industry, size, and top priorities. Read the library, score each study on relevance, and pick the two best matches.
+      Return via send_to_parent: {"picks": [{"file": ..., "customer": ..., "why_relevant": ...}, ...]}""",
+          [{"type": "agent_toolset_20260401"}],
+      )
+      
+      pricing_modeler = make_agent(
+          "pricing_modeler",
+          "Builds two or three pricing options for a prospect based on seat count and expected usage.",
+          """Pricing rules are in /mnt/user-data/pricing_rules.md. Given a prospect's estimated seat count and usage tier, build:
+      - a conservative option (annual commit, lower per-seat)
+      - a flexible option (monthly, higher per-seat)
+      - if seat count > 500, an enterprise option with a platform fee
+      Show the first-year total for each. Return via send_to_parent: {"options": [{"name": ..., "structure": ..., "year_one_total": ...}, ...]}""",
+          [{"type": "agent_toolset_20260401"}],
+      )
+  markdown cell:
+    source:
+      ## 3. Give the team something to work with
+      
+      The librarian needs a library to choose from. We'll give it seven short case studies across healthcare, manufacturing, logistics, retail, fintech, and public sector, so you can see it actually pick the two that fit our prospect.
+  code cell:
+    source:
+      CASE_STUDIES = [
+          {
+              "slug": "stclair_health",
+              "title": "St. Clair Health",
+              "industry": "regional hospital network",
+              "employees": 6200,
+              "summary": """Challenge: credentialing and prior-auth workflows spread across 11 systems.
+      Result with Northstar: consolidated to 3 automated workflows; prior-auth turnaround down 58%; $1.9M annual labor savings.""",
+          },
+          {
+              "slug": "blueridge_health_plan",
+              "title": "BlueRidge Health Plan",
+              "industry": "regional payer",
+              "employees": 2800,
+              "summary": """Challenge: claims-adjudication exceptions queued in email; 19% required manual rework.
+      Result with Northstar: exception routing automated end-to-end; rework rate down to 6%; 11-day faster average claim resolution.""",
+          },
+          {
+              "slug": "calder_mfg",
+              "title": "Calder Manufacturing",
+              "industry": "industrial",
+              "employees": 3100,
+              "summary": """Challenge: purchase-order approvals averaging 9 days.
+      Result with Northstar: PO cycle time cut to 2.1 days; 14% reduction in maverick spend.""",
+          },
+          {
+              "slug": "northwind",
+              "title": "Northwind Logistics",
+              "industry": "3PL",
+              "employees": 4400,
+              "summary": """Challenge: carrier-onboarding paperwork took 3 weeks per carrier.
+      Result with Northstar: onboarding down to 4 days; 22% more carriers activated in Q1.""",
+          },
+          {
+              "slug": "harborview_retail",
+              "title": "Harborview Retail Group",
+              "industry": "specialty retail",
+              "employees": 5600,
+              "summary": """Challenge: store-level inventory exceptions handled by regional managers over Slack and spreadsheets.
+      Result with Northstar: exception triage automated across 140 stores; stockout incidents down 31%.""",
+          },
+          {
+              "slug": "aperture_fintech",
+              "title": "Aperture Payments",
+              "industry": "fintech",
+              "employees": 1900,
+              "summary": """Challenge: KYC and merchant-onboarding reviews averaging 6 business days.
+      Result with Northstar: review SLA cut to 36 hours; onboarding throughput up 2.4x with the same team.""",
+          },
+          {
+              "slug": "summit_county",
+              "title": "Summit County Government",
+              "industry": "public sector",
+              "employees": 3700,
+              "summary": """Challenge: building-permit applications routed through five departments by paper packet.
+      Result with Northstar: single digital intake with parallel department review; median permit time 41 to 17 days.""",
+          },
+      ]
+  markdown cell:
+    source:
+      ### Product and pricing collateral
+      
+      We'll also provide the product one-pager that the coordinator reads when writing the "How we help" section, and the pricing rules file that the modeler uses to build options.
+  code cell:
+    source:
+      PRODUCT = """# Northstar Platform — One-Pager
+      Northstar is a workflow automation platform for mid-market operations teams.
+      Core capabilities: visual process builder, 200+ SaaS connectors, role-based approvals, SOC 2 Type II.
+      Typical results: 40-60% reduction in manual ticket handling, 3-week time-to-first-workflow."""
+      
+      PRICING = """# Pricing Rules (internal)
+      - Per-seat list: $65/mo (monthly) or $52/mo (annual commit).
+      - Usage tiers: light = 1.0x, standard = 1.15x, heavy = 1.30x multiplier on per-seat.
+      - Enterprise (>500 seats): add $48,000/yr platform fee, per-seat drops to $44/mo annual.
+      - All options include onboarding; enterprise includes a named CSM."""
+  markdown cell:
+    source:
+      ### Wire up the coordinator and start a session
+      
+      Now let's create an environment, upload the nine files, and create the coordinator with its three `callable_agents`. Each entry is a full agent with its own model, prompt, and toolset, so you could mix model tiers per role.
+  code cell:
+    source:
+      env = client.beta.environments.create(name="proposal-meridian", betas=[PREVIEW])
+      
+      resources = []
+      
+      
+      def mount(path, content):
+          f = client.beta.files.upload(file=(os.path.basename(path), content.encode(), "text/plain"))
+          resources.append({"type": "file", "file_id": f.id, "mount_path": path})
+      
+      
+      for cs in CASE_STUDIES:
+          body = f"# {cs['title']} ({cs['industry']}, {cs['employees']:,} employees)\n{cs['summary']}"
+          mount(f"/mnt/user-data/case_studies/{cs['slug']}.md", body)
+      mount("/mnt/user-data/product_one_pager.md", PRODUCT)
+      mount("/mnt/user-data/pricing_rules.md", PRICING)
+      
+      coordinator = client.beta.agents.create(
+          name="Proposal Writer",
+          model=MODEL,
+          system="""You assemble tailored sales proposals.
+      Given a prospect name and basic profile:
+      1. Send the prospect's industry and size to prospect_researcher.
+      2. Send the prospect's industry, size, and (once the researcher reports back) their priorities to case_study_picker.
+      3. Send the seat count and usage tier to pricing_modeler.
+      4. Read /mnt/user-data/product_one_pager.md, then write /mnt/session/outputs/proposal.md with sections:
+         Executive summary (tied to their priorities), How we help (from the one-pager),
+         Proof (the two case studies), Investment (the pricing options), Next steps.
+      Keep it to two pages.""",
+          tools=[{"type": "agent_toolset_20260401"}],
+          betas=[PREVIEW],
+          extra_body={"callable_agents": [prospect_researcher, case_study_picker, pricing_modeler]},
+      )
+      
+      session = client.beta.sessions.create(
+          agent=coordinator.id,
+          environment_id=env.id,
+          resources=resources,
+          title="Proposal: Meridian Health",
+          betas=[PREVIEW],
+      )
+      print(f"Session {session.id} ready with {len(resources)} files mounted")
+  markdown cell:
+    source:
+      ## 4. Kick off the proposal
+      
+      Let's send the prospect profile and watch the coordinator work. It will start the researcher and the pricing modeler in parallel, then run the case-study picker once the researcher's findings come back, since the picker needs those priorities to score relevance.
+  code cell:
+    source:
+      PROSPECT = {
+          "name": "Meridian Health",
+          "industry": "regional healthcare system",
+          "employees": 8500,
+          "estimated_seats": 600,
+          "usage_tier": "heavy",
+      }
+      
+      client.beta.sessions.events.send(
+          session.id,
+          betas=[PREVIEW],
+          events=[
+              {
+                  "type": "user.message",
+                  "content": [
+                      {
+                          "type": "text",
+                          "text": f"Build a proposal for {PROSPECT['name']}, a {PROSPECT['industry']} with "
+                          f"~{PROSPECT['employees']} employees. Estimate {PROSPECT['estimated_seats']} seats "
+                          f"at {PROSPECT['usage_tier']} usage. Write to /mnt/session/outputs/proposal.md.",
+                      }
+                  ],
+              }
+          ],
+      )
+      
+      seen, created, idled = set(), 0, 0
+      while True:
+          for ev in list_events(session.id):
+              if ev["id"] in seen:
+                  continue
+              seen.add(ev["id"])
+              t = ev["type"]
+              if t == "session.thread_created":
+                  created += 1
+                  print(f"[spawn] {ev['agent_name']}")
+              elif t == "agent.thread_message_received":
+                  print(f"[report] {ev.get('from_agent_name', 'subagent')} returned")
+              elif t == "session.thread_idle":
+                  idled += 1
+              elif t == "session.status_idle":
+                  # The coordinator idles between dispatch waves; only stop once every
+                  # spawned thread has finished.
+                  if created > 0 and idled >= created:
+                      print(f"[done] {created} subagents finished")
+                      break
+          else:
+              time.sleep(4)
+              continue
+          break
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          [spawn] prospect_researcher
+          [spawn] pricing_modeler
+          [report] prospect_researcher returned
+          [spawn] case_study_picker
+          [report] pricing_modeler returned
+          [report] case_study_picker returned
+          [done] 3 subagents finished
+  markdown cell:
+    source:
+      ### What each teammate sent back
+      
+      Before we look at the assembled proposal, let's print the three raw `send_to_parent` payloads. Each subagent ran in its own context with only its own tools, so the three reports look quite different from one another.
+  code cell:
+    source:
+      def text_of(content):
+          if isinstance(content, str):
+              return content
+          return "".join(b.get("text", "") for b in content if b.get("type") == "text")
+      
+      
+      for ev in list_events(session.id):
+          if ev["type"] == "agent.thread_message_received":
+              who = ev["from_agent_name"]
+              body = text_of(ev["content"])
+              print(f"━━━ send_to_parent from {who} ({len(body)} chars) ━━━")
+              print(body[:1200] + (f"\n…[{len(body) - 1200} more chars]" if len(body) > 1200 else ""))
+              print()
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          ━━━ send_to_parent from prospect_researcher (12917 chars) ━━━
+          Research complete for Meridian Health (regional multi-hospital/multi-clinic system, ~8,500 employees). Concise, sales-ready intelligence below.
+          
+          {
+            "priorities": [
+              "Financial resilience & margin protection — revenue cycle optimization, cash flow acceleration, and disciplined capital allocation amid reimbursement pressure. <cite index=\"7-2,7-3\">Medicaid changes could create tens of millions in revenue shortfalls for some systems; hospitals must optimize revenue cycle performance, reduce denials and improve margins, while pursuing cost control through efficiency and throughput rather than across-the-board cuts</cite>.",
+              "Workforce stabilization & clinician burnout reduction — retention, safer staffing ratios, and reducing administrative burden. <cite index=\"21-3\">Healthcare leaders are focused on five critical priorities to stabilize the nursing profession: staffing and recruitment, competitive pay and benefits, stronger leadership, flexible schedules, and meaningful recognition</cite>. <cite index=\"27-3\">In 2025 RN turnover costs U.S. hospitals an average of $61,110 per nurse, with hospital turnover at 18.3% and RN turnover at 16.4%</cite>.",
+              "Scaling AI with gove
+          …[11717 more chars]
+          
+          ━━━ send_to_parent from pricing_modeler (1602 chars) ━━━
+          Pricing built for Meridian Health (600 seats, heavy tier = 1.30× multiplier). Proposal-ready markdown saved to `/mnt/session/outputs/meridian_health_investment.md`.
+          
+          **Structured result:**
+          
+          ```json
+          {"options": [
+            {"name": "Flex (Monthly)", "structure": "Month-to-month, $65 base × 1.30 heavy = $84.50/seat/mo, 600 seats, no platform fee", "year_one_total": 608400},
+            {"name": "Standard (Annual)", "structure": "12-month commit, $52 base × 1.30 heavy = $67.60/seat/mo, 600 seats, no platform fee", "year_one_total": 486720},
+            {"name": "Enterprise", "structure": "12-month commit (>500 seats), $44 base × 1.30 heavy = $57.20/seat/mo, 600 seats, + $48,000/yr platform fee, named CSM included", "year_one_total": 459840}
+          ]}
+          ```
+          
+          **Recommendation: Enterprise** — it is simultaneously the lowest first-year total ($459,840) *and* the only tier that includes a named CSM, which matters for a regional health system rolling out to 600 seats inside an 8,500-employee org. It saves $26,880 vs. Standard and $148,560 vs. Flex in year one.
+          
+          **Note on interpretation:** I applied the 1.30× heavy multiplier to the enterprise per-seat rate ($44 → $57.20), reading "heavy = 1.30x multiplier on per-seat" as apply
+          …[402 more chars]
+          
+          ━━━ send_to_parent from case_study_picker (2951 chars) ━━━
+          {
+            "picks": [
+              {
+                "file": "stclair_health.md",
+                "customer": "St. Clair Health",
+                "industry_size": "Regional hospital network, 6,200 employees",
+                "challenge": "Credentialing and prior-authorization workflows were spread across 11 disconnected systems, creating manual handoffs, delays, and administrative load on clinical and back-office staff.",
+                "implemented": "Northstar consolidated the 11-system sprawl into 3 automated, role-based workflows spanning credentialing and prior-auth, using the visual process builder and connector library to bridge existing systems rather than rip-and-replace.",
+                "results": [
+                  "Prior-auth turnaround time down 58%",
+                  "$1.9M annual labor savings",
+                  "11 point systems consolidated to 3 automated workflows"
+                ],
+                "why_relevant": "Closest analog to Meridian on every axis that matters: same industry (regional multi-site provider IDN), nearest-size peer in the library (6,200 vs. 8,500 employees), and directly hits four of Meridian's top priorities at once — (1) financial resilience / revenue-cycle via prior-auth denials reduction, (2) clinician burnout via reduced admin burden, (7) technical-debt /
+          …[1751 more chars]
+          
+  markdown cell:
+    source:
+      ## 5. Read the proposal
+      
+      Finally, let's render the assembled proposal. The coordinator wrote it to `proposal.md` with the `write` tool, so we'll find that event in the log and display it.
+  code cell:
+    source:
+      from IPython.display import Markdown, display
+      
+      for ev in list_events(session.id):
+          if (
+              ev["type"] == "agent.tool_use"
+              and ev["name"] == "write"
+              and ev["input"]["file_path"].endswith("proposal.md")
+          ):
+              display(Markdown(ev["input"]["content"]))
+              break
+    outputs:
+      output 0:
+        output_type: display_data
+        data:
+          text/markdown:
+            # Proposal for Meridian Health
+            **Northstar Platform — Workflow Automation for a Regional Health System**
+            
+            ---
+            
+            ## Executive Summary
+            
+            Meridian Health is navigating the same three-front pressure facing every regional IDN in 2026: margin compression from reimbursement and denials, a workforce stretched thin by documentation and administrative burden, and a board-level mandate to scale automation with governance and measurable ROI — all without adding IT headcount or cyber exposure. You are too large to keep running on point solutions, and too lean to build custom integrations for every workflow.
+            
+            Northstar is built for exactly this middle. We help regional health systems:
+            
+            - **Protect margin** by automating revenue-cycle workflows (prior-auth, denials, exception handling) that currently leak labor dollars and cash.
+            - **Give clinician time back** by removing manual handoffs, duplicate data entry, and email-based approvals from credentialing, scheduling, and back-office processes.
+            - **Consolidate technical debt** by replacing sprawl of point tools with governed, auditable workflows that sit on top of your existing Epic/MEDITECH/Oracle Health environment — no rip-and-replace.
+            - **Stay audit-ready** on HIPAA and the updated Security Rule with SOC 2 Type II controls, role-based approvals, and full audit trails out of the box.
+            
+            For a 600-seat heavy-usage deployment, we recommend the **Enterprise tier at $459,840 in year one**, with a named Customer Success Manager to drive adoption across service lines.
+            
+            ---
+            
+            ## How We Help
+            
+            Northstar is a workflow automation platform for mid-market operations teams. Four capabilities do the heavy lifting at Meridian:
+            
+            | Capability | What it does for Meridian |
+            |---|---|
+            | **Visual process builder** | RCM, credentialing, and ops leads design and modify workflows without engineering tickets — reducing dependence on a stretched IT team. |
+            | **200+ SaaS connectors** | Bridges existing systems (EHR, payer portals, HRIS, ITSM, finance) via standards-based integration instead of custom point-to-point work. |
+            | **Role-based approvals** | Enforces governance, segregation of duties, and audit trails required for HIPAA, SOC 2, and internal compliance. |
+            | **SOC 2 Type II** | BAA-ready, with encryption in transit and at rest, MFA, and vendor-oversight documentation aligned to the updated HIPAA Security Rule. |
+            
+            **Typical customer results:** 40–60% reduction in manual ticket handling and first production workflow live in 3 weeks — not 3 quarters.
+            
+            ---
+            
+            ## Proof
+            
+            ### St. Clair Health — Regional hospital network, 6,200 employees
+            **Challenge.** Credentialing and prior-authorization workflows were spread across 11 disconnected systems, creating manual handoffs, delays, and administrative load on clinical and back-office staff.
+            **What we did.** Consolidated 11 systems into 3 automated, role-based workflows using Northstar's visual builder and connector library — bridging existing systems rather than replacing them.
+            **Results.**
+            - **Prior-auth turnaround time down 58%**
+            - **$1.9M annual labor savings**
+            - **11 point systems consolidated to 3 governed workflows**
+            
+            > *Why it matters for Meridian:* the closest analog in our library — same industry, similar size, and a direct hit on your RCM, clinician-burden, and platform-consolidation priorities.
+            
+            ### BlueRidge Health Plan — Regional health payer, 2,800 employees
+            **Challenge.** Claims-adjudication exceptions were queued informally over email; 19% of claims required manual rework, driving cycle time and labor cost.
+            **What we did.** Automated end-to-end exception routing with role-based approvals, replacing the email-and-spreadsheet workflow with a governed, auditable process.
+            **Results.**
+            - **Manual rework rate cut from 19% to 6%**
+            - **Average claim resolution 11 days faster**
+            - **Auditable, HIPAA-aligned exception routing**
+            
+            > *Why it matters for Meridian:* same adjudication mechanics your RCM team fights from the provider side — evidence Northstar handles HIPAA-sensitive, high-volume claims work with measurable denials and cycle-time impact.
+            
+            ---
+            
+            ## Investment
+            
+            Northstar Platform sized for **600 seats at the heavy usage tier**. All options include onboarding and SOC 2 Type II controls.
+            
+            | Option | Commitment | Per-Seat (Heavy) | Platform Fee | **Year-One Total** |
+            |---|---|---|---|---|
+            | Flex (Monthly) | Month-to-month | $84.50 / seat / mo | — | **$608,400** |
+            | Standard (Annual) | 12-month commit | $67.60 / seat / mo | — | **$486,720** |
+            | **Enterprise** ⭐ | 12-month commit | $57.20 / seat / mo | $48,000 / yr | **$459,840** |
+            
+            **Recommended: Enterprise** — the lowest year-one total *and* the only tier with a named Customer Success Manager. Saves $26,880 vs. Standard and $148,560 vs. Flex, and gives Meridian a single point of accountability as you roll out across service lines.
+            
+            At St. Clair Health's results, a comparable deployment would pay for itself roughly **four times over in year one** on labor savings alone — before counting denials recovery, turnover reduction, or cycle-time gains.
+            
+            ---
+            
+            ## Next Steps
+            
+            1. **Discovery workshop (½ day, on-site or virtual)** — Northstar solutions architect with Meridian's RCM, credentialing, and IT leads to map 3–5 candidate workflows and quantify baseline labor and cycle-time metrics.
+            2. **Tailored ROI model** — delivered within one week of discovery, tied to Meridian's denials rate, FTE costs, and target workflows.
+            3. **Security & compliance review** — SOC 2 Type II report, BAA, and HIPAA Security Rule control mapping shared with your CISO's team in parallel.
+            4. **Pilot-to-production plan** — first workflow live in 3 weeks, with a named CSM driving expansion across service lines under the Enterprise agreement.
+            
+            **Primary contact:** [Account Executive, Northstar] • Ready to schedule discovery within the week.
+          text/plain: <IPython.core.display.Markdown object>
+  markdown cell:
+    source:
+      ## Why three subagents instead of one
+      
+      A single agent with all three tools could write this proposal, so why split it up? Scoping each role to its own tools means the pricing modeler can't pull a competitor's list price off the web, because it only has the rules file. The case-study picker reads seven files here, but in production it would read hundreds, and that volume stays in the subagent's context instead of the coordinator's. And the coordinator gets to decide the order and the hand-offs without doing any of the specialist work itself.
+      
+      For more on `callable_agents`, see the [Managed Agents documentation](https://platform.claude.com/docs/en/managed-agents/multi-agent).

📓 managed_agents/CMA_verify_with_outcome_grader.ipynb

View diff
nbdiff /dev/null managed_agents/CMA_verify_with_outcome_grader.ipynb (30e12afc4db25f595a545199eaf37422bd221d41)
--- /dev/null  2026-05-03 19:45:53.249968
+++ managed_agents/CMA_verify_with_outcome_grader.ipynb (30e12afc4db25f595a545199eaf37422bd221d41)  (no timestamp)
## added /cells:
+  markdown cell:
+    source:
+      # Research Brief with Verified Sources
+      
+      We'll use `define_outcome` to attach a grader agent to a research writer. The writer will produce a cited one-page brief on EV fast-charging economics, and the grader will independently fetch every cited URL, check that each quote actually appears on the page, and score the brief against a seven-item coverage checklist. When something doesn't pass, the writer gets specific feedback and revises until it does.
+      
+      We'll watch the loop run end to end and see the grader catch two real problems that the writer then fixes.
+  markdown cell:
+    source:
+      ## 1. Set up the environment
+      
+      First, let's install the SDK and set up the Anthropic client. The `define_outcome` event types are currently behind the research-preview header, so we'll also add a small helper that reads the session's raw event stream.
+  code cell:
+    source:
+      %pip install anthropic
+  code cell:
+    source:
+      import os, time, httpx, anthropic
+      
+      PREVIEW = "managed-agents-2026-04-01-research-preview"
+      MODEL = os.environ.get("COOKBOOK_MODEL", "claude-sonnet-4-6")
+      client = anthropic.Anthropic()
+      
+      
+      # The SDK does not yet type the research-preview event types
+      # (span.outcome_evaluation_*), so we read events as raw JSON.
+      def list_events(session_id):
+          r = httpx.get(
+              f"https://api.anthropic.com/v1/sessions/{session_id}/events",
+              params={"limit": 1000},
+              headers={
+                  "x-api-key": client.api_key,
+                  "anthropic-version": "2023-06-01",
+                  "anthropic-beta": f"managed-agents-2026-04-01,{PREVIEW}",
+              },
+              timeout=60,
+          )
+          r.raise_for_status()
+          return r.json()["data"]
+  markdown cell:
+    source:
+      ## 2. Create the writer and start a session
+      
+      Next, we'll create the writer agent and open a session. The writer's system prompt lists the seven topics it needs to cover and asks it to cite no more than six sources, each with a short verbatim quote so the grader has something concrete to check.
+  code cell:
+    source:
+      env = client.beta.environments.create(name="research-brief", betas=[PREVIEW])
+      
+      writer = client.beta.agents.create(
+          name="Research Analyst",
+          model=MODEL,
+          system="""You are a research analyst. You write one-page business briefs.
+      
+      Cite every factual claim with an inline footnote [n]. End the brief with a Sources section in this exact format, one entry per line:
+      
+      [n] "verbatim quote from the page, 25 words or fewer" - Title - URL
+      
+      Only cite pages you actually fetched and read. The quote must be copied character-for-character from the page. Cite no more than 6 sources total. Pick the strongest; do not pad. Save the brief to /mnt/session/outputs/brief.md.""",
+          tools=[
+              {
+                  "type": "agent_toolset_20260401",
+                  "configs": [
+                      {"name": "web_search"},
+                      {"name": "web_fetch"},
+                      {"name": "read"},
+                      {"name": "write"},
+                  ],
+              }
+          ],
+          betas=[PREVIEW],
+      )
+      
+      session = client.beta.sessions.create(
+          agent=writer.id,
+          environment_id=env.id,
+          title="Brief: EV fast-charging unit economics",
+          betas=[PREVIEW],
+      )
+      print(f"Session {session.id}")
+  markdown cell:
+    source:
+      ## 3. Define the outcome
+      
+      Now we'll send the `define_outcome` event. This gives the session a rubric, and after each writer turn the platform spins up a separate grader agent to evaluate the output against it.
+      
+      Our rubric asks the grader to do two things. For each citation, it fetches the URL directly, looks for the quoted string on the page, and confirms the passage actually supports the claim. For coverage, it defines what counts as covered for each of the seven topics. The bar can be more specific than the original ask; for example, item 5 requires the GAAP net loss from a 10-K on sec.gov, not a press-release summary.
+      
+      One thing worth knowing is that the grader is stateless. A fresh evaluation agent runs on every iteration, so every citation is re-fetched and every coverage item is re-scored each time. This means the final `satisfied` verdict is always a full re-check of the document, but it also means each iteration pays the full verification cost.
+  code cell:
+    source:
+      USER_MESSSAGE = """
+      Write a brief on the unit economics of public DC fast charging in the United States.
+      The brief should cover:
+        1. Capex range
+        2. Demand charges
+        3. Utilization breakeven
+        4. Subsidy programs
+        5. Named-operator economics
+        6. A contrarian or skeptical source
+        7. Hardware vs install cost split
+      """
+      
+      
+      RUBRIC = """
+      You are reviewing a research brief at /mnt/session/outputs/brief.md against a coverage checklist and verifying its citations. The writer was told the seven topics to cover; this rubric defines what counts as sufficient coverage for each topic, and how to verify citations.
+      
+      COVERAGE CHECKLIST. Each item has a specific area 
+        1. Capex range: a dollar range for installed cost per DC fast-charging stall or station.
+        2. Demand charges: quantified impact on opex (a $/kW figure or a % of operating cost).
+        3. Utilization breakeven: a breakeven or target utilization threshold (% or kWh/day).
+        4. Subsidy programs: NEVI or another public funding program, named.
+        5. Named operator: the GAAP net income or net loss from a specific public charging operator's most recent 10-K or 10-Q, and the citation for it must be the SEC filing itself (sec.gov), not a press release, earnings-call recap, or news article.
+        6. Contrarian source: at least one cited source whose thesis is that the economics are unfavorable or structurally challenged.
+        7. Cost split: a hardware vs soft-cost (install, permitting, grid) breakdown or ratio.
+      
+      CITATION CHECK. For every [n] entry in the Sources section:
+        a. LIVE: Fetch the URL with web_fetch. Mark LIVE only if web_fetch returns the readable page directly. Mark DEAD if 404, parked, login-walled, paywalled, returns a bot-block/403, or renders only via JavaScript. Do NOT corroborate via mirrors, reposts, or search snippets; the cited URL itself must fetch.
+        b. VERBATIM: Search the fetched page for the quoted string. Mark QUOTE_MATCH if the exact string appears (treat curly vs straight quotes as equivalent); NOT_FOUND otherwise.
+        c. SUPPORTS CLAIM: Mark SUPPORTS_CLAIM if the quoted passage actually backs the claim it's cited on in the brief; UNSUPPORTED if it's tangential, contradicts the claim, or is just a general statement of fact.
+      
+      OUTPUT FORMAT: 
+      
+      Line 1: Coverage N/7. Citations M/K verified.
+      
+      Then, for each failed item in the coverage checklist, create a new bullet, name the item and say what specific bar it failed in one sentence max per bullet. For example: "Item 3 Utilization breakeven - MISSING. <what's missing>".
+      
+      Then, for each failed citation, create a new bullet with the format: "[n] domain - REASON. <what's wrong and what to do>". One sentence max per bullet. For example: "[3] evgo.com - DEAD. The URL returns a 403 error and appears to be behind a bot block. No mirrors or reposts; the cited URL itself must fetch."
+      """
+      
+      client.beta.sessions.events.send(
+          session.id,
+          betas=[PREVIEW],
+          events=[
+              {
+                  "type": "user.define_outcome",
+                  "description": "One-page brief on DC fast-charging unit economics that clears the coverage checklist with every citation live, quote-matching, and supportive.",
+                  "rubric": {"type": "text", "content": RUBRIC},
+                  "max_iterations": 3,
+              },
+              {
+                  "type": "user.message",
+                  "content": [{"type": "text", "text": USER_MESSSAGE}],
+              },
+          ],
+      )
+    outputs:
+      output 0:
+        output_type: execute_result
+        execution_count: 9
+        data:
+          text/plain: BetaManagedAgentsSendSessionEvents(data=[BetaManagedAgentsUserMessageEvent(id='sevt_011Cag2vpkA2Nms5ytyW3Qhe', content=None, type='user.define_outcome', processed_at=datetime.datetime(2026, 5, 3, 17, 31, 34, 482886, tzinfo=TzInfo(0)), description='One-page brief on DC fast-charging unit economics that clears the coverage checklist with every citation live, quote-matching, and supportive.', max_iterations=3, outcome_id='outc_011Cag2vpkA253yS6SZf8aXY', rubric={'content': '\nYou are reviewing a research brief at /mnt/session/outputs/brief.md against a coverage checklist and verifying its citations. The writer was told the seven topics to cover; this rubric defines what counts as sufficient coverage for each topic, and how to verify citations.\n\nCOVERAGE CHECKLIST. Each item has a specific area \n  1. Capex range: a dollar range for installed cost per DC fast-charging stall or station.\n  2. Demand charges: quantified impact on opex (a $/kW figure or a % of operating cost).\n  3. Utilization breakeven: a breakeven or target utilization threshold (% or kWh/day).\n  4. Subsidy programs: NEVI or another public funding program, named.\n  5. Named operator: the GAAP net income or net loss from a specific public charging operator\'s most recent 10-K or 10-Q, and the citation for it must be the SEC filing itself (sec.gov), not a press release, earnings-call recap, or news article.\n  6. Contrarian source: at least one cited source whose thesis is that the economics are unfavorable or structurally challenged.\n  7. Cost split: a hardware vs soft-cost (install, permitting, grid) breakdown or ratio.\n\nCITATION CHECK. For every [n] entry in the Sources section:\n  a. LIVE: Fetch the URL with web_fetch. Mark LIVE only if web_fetch returns the readable page directly. Mark DEAD if 404, parked, login-walled, paywalled, returns a bot-block/403, or renders only via JavaScript. Do NOT corroborate via mirrors, reposts, or search snippets; the cited URL itself must fetch.\n  b. VERBATIM: Search the fetched page for the quoted string. Mark QUOTE_MATCH if the exact string appears (treat curly vs straight quotes as equivalent); NOT_FOUND otherwise.\n  c. SUPPORTS CLAIM: Mark SUPPORTS_CLAIM if the quoted passage actually backs the claim it\'s cited on in the brief; UNSUPPORTED if it\'s tangential, contradicts the claim, or is just a general statement of fact.\n\nOUTPUT FORMAT: \n\nLine 1: Coverage N/7. Citations M/K verified.\n\nThen, for each failed item in the coverage checklist, create a new bullet, name the item and say what specific bar it failed in one sentence max per bullet. For example: "Item 3 Utilization breakeven - MISSING. <what\'s missing>".\n\nThen, for each failed citation, create a new bullet with the format: "[n] domain - REASON. <what\'s wrong and what to do>". One sentence max per bullet. For example: "[3] evgo.com - DEAD. The URL returns a 403 error and appears to be behind a bot block. No mirrors or reposts; the cited URL itself must fetch."\n', 'type': 'text'}), BetaManagedAgentsUserMessageEvent(id='sevt_012akHfJBv1qZABihi3YwEMU', content=[BetaManagedAgentsTextBlock(text='\nWrite a brief on the unit economics of public DC fast charging in the United States.\nThe brief should cover:\n  1. Capex range\n  2. Demand charges\n  3. Utilization breakeven\n  4. Subsidy programs\n  5. Named-operator economics\n  6. A contrarian or skeptical source\n  7. Hardware vs install cost split\n', type='text')], type='user.message', processed_at=None)])
+  markdown cell:
+    source:
+      ## 4. Watch the review loop
+      
+      Let's poll the event stream and render each phase as it happens. We'll print a banner when the writer finishes a draft and show the grader's feedback after each evaluation.
+  code cell:
+    source:
+      import re, time
+      from IPython.display import Markdown, display
+      
+      HR = "━" * 46
+      
+      
+      def banner(label, tag=""):
+          display(Markdown(f"**{HR}**  \n**{label}** &nbsp; {tag}"))
+      
+      
+      def render_feedback(fb: str):
+          # Strip the server's per-criterion wrapper and trailer.
+          s = re.sub(
+              r"^An independent grader found.*?:\n\n- .*?\((?:partially |not )?met\): ",
+              "",
+              fb,
+              count=1,
+              flags=re.S,
+          )
+          s = re.sub(r"\n\nPlease revise your work.*$", "", s, flags=re.S)
+          display(Markdown(s))
+      
+      
+      t0, seen, done, it, res = time.time(), set(), False, 0, None
+      n_search, last_len, banner_it = 0, 0, -1
+      while not done:
+          for ev in list_events(session.id):
+              if ev["id"] in seen:
+                  continue
+              seen.add(ev["id"])
+              et = ev["type"]
+              if et == "agent.tool_use":
+                  if ev["name"] in ("web_search", "web_fetch"):
+                      n_search += 1
+                  if ev["name"] == "write" and ev["input"]["file_path"].endswith("brief.md"):
+                      last_len = len(ev["input"]["content"])
+              elif et == "span.outcome_evaluation_start":
+                  if banner_it == it:
+                      continue
+                  banner_it = it
+                  banner("writer · " + ("draft" if it == 0 else "revision"))
+                  display(
+                      Markdown(f"searched/fetched {n_search}× · wrote `brief.md` ({last_len:,} chars)")
+                  )
+                  n_search = 0
+              elif et == "span.outcome_evaluation_end":
+                  res, fb = ev["result"], ev["explanation"]
+                  banner(
+                      f"grader · iteration {it}",
+                      "✓ satisfied" if res == "satisfied" else "⟳ needs_revision",
+                  )
+                  render_feedback(fb)
+                  it += 1
+                  if res == "satisfied":
+                      done = True
+              elif et == "session.status_idle":
+                  done = True
+          if not done:
+              time.sleep(5)
+      
+      m, s = divmod(int(time.time() - t0), 60)
+      display(Markdown(f"**done:** {res} after {it} iteration{'s' if it != 1 else ''} · {m}m {s:02d}s"))
+  markdown cell:
+    source:
+      ### What just happened
+      
+      The loop ran for three iterations, and both revisions on the named-operator requirement.
+      
+      The first draft cited revenue figures from a news article, meanwhile the rubric asks for the GAAP net loss and requires the SEC filing as the source, so the grader sent it back as the only miss. The writer went to sec.gov, pulled a net-loss figure, and resubmitted.
+      
+      The second pass still failed. The sec.gov document the writer cited was an 8-K exhibit, which is the earnings press release as filed, not the 10-K or 10-Q the rubric asks for. On the third submission, the writer found the actual 10-K and the third pass cleared.
+      
+      **Note** that this loop was unique to the original cookbook run. If you are running this on your own, your results will likely differ.
+  markdown cell:
+    source:
+      ## 5. Read the accepted brief
+      
+      Finally, let's pull the version of `brief.md` that the grader accepted.
+  code cell:
+    source:
+      from IPython.display import Markdown, display
+      
+      writes = [
+          ev
+          for ev in list_events(session.id)
+          if ev["type"] == "agent.tool_use"
+          and ev["name"] == "write"
+          and ev["input"]["file_path"].endswith("brief.md")
+      ]
+      display(Markdown(writes[-1]["input"]["content"]))
+    outputs:
+      output 0:
+        output_type: display_data
+        data:
+          text/markdown:
+            # DC Fast-Charging Unit Economics: A Business Brief
+            ### U.S. Public DCFC Infrastructure — 2025–2026
+            
+            ---
+            
+            ## 1. CapEx Range
+            
+            Total project costs for a public DCFC station vary sharply by power level, site complexity, and geography. A 150 to 350 kW DCFC charging unit can cost anywhere from $45,000 to over $100,000, and installation costs can range from $40,000 to over $150,000; additionally, grid upgrade and integration costs can amount to millions, depending on the number of fast chargers installed.[1] Hardware alone runs $38,000–$90,000 per connector; when installation, electrical upgrades, and commissioning are included, total station costs are typically $75,000–$150,000 per unit.[3] Real-world NEVI-style four-port (4×150 kW) highway stations have averaged roughly $915,000 in total project cost based on industry benchmarks.[3] Ultra-fast charger hardware prices fell roughly 20% between 2022 and 2024, improving the long-term capital case—though the infrastructure bill remains formidable.[3]
+            
+            ---
+            
+            ## 2. Hardware vs. Installation Cost Split
+            
+            The hardware sticker price is consistently the smaller portion of a DCFC project budget. NREL (2024) project data confirms that electrical infrastructure frequently accounts for 40–60% of total DC fast charger project cost—often exceeding the hardware cost itself at complex sites.[3] DC fast charger installation runs $20,000–$60,000 per connector, driven by transformer upgrades, utility service entrance modifications, and the conduit runs required to deliver high-voltage power to charging positions.[3] In contrast, hardware alone runs $38,000–$90,000 per connector; when installation, electrical upgrades, and commissioning are included, total station costs are typically $75,000–$150,000 per unit depending on site complexity.[3] Unlike Level 2 installations where hardware dominates, DCFC projects routinely see civil and electrical make-ready work equal to or exceeding equipment cost, making grid make-ready the most variable and consistently underestimated budget line.
+            
+            ---
+            
+            ## 3. Demand Charges
+            
+            Demand charges are the dominant, least controllable element of DCFC operating costs. At 50 kW, demand charges account for 24 percent to 39 percent of a DCFC station's annual costs; if the station capacity is increased to 350 kW, the cost share of demand charges grows to 68 percent to 81 percent of total costs.[2] In dollar terms, a 350 kW charger operating under a $20/kW demand rate incurs approximately $7,000 in monthly demand charges.[3] Increasing power capacity beyond 150 kW makes it nearly impossible for a station operator to break even except in cases where the electric utility does NOT have a demand charge.[2] Eliminating the demand charge entirely can decrease operational costs for DCFC stations by as much as 85 percent.[2] Battery storage co-located with DCFC hardware has produced documented demand-charge cost reductions of 30–60% for operators who have deployed this architecture.[3]
+            
+            ---
+            
+            ## 4. Utilization Breakeven
+            
+            Utilization is the primary financial lever, and the industry has not yet cleared the threshold at scale. Without subsidies, a typical four-charger California station loses approximately $40,000–$50,000 per year in EBIT at 15% utilization; the owner-operator would break even if utilization increased from 15 percent to 20 percent, or if the price for charging customers increased from $0.45/kWh to $0.53/kWh.[1] If a DCFC station generated $12,000 in ancillary revenue streams (retail, advertising), it could break even without either improvement.[1] Nationally, the average DCFC utilization rate weakened in Q1 2026 to 15.6% (down from 16.2% a year ago), a possible sign that infrastructure expansion is progressing faster than EV market growth.[6] Average consumer pricing stood at $0.53 per kWh (excluding free chargers, which would bring the average down to $0.49/kWh) in Q1 2026, unchanged from Q4 2025.[6] DC fast charger projects at well-selected highway corridor sites may achieve payback in 5–8 years; low-utilization sites may not break even without utility or federal support.[3]
+            
+            ---
+            
+            ## 5. Subsidy Programs
+            
+            Federal incentives are the decisive swing factor for near-term DCFC viability. The National Electric Vehicle Infrastructure (NEVI) Formula Program will fund up to 80 percent of project costs, provided that the station serves the public and meets criteria such as being located along Federal Highway Administration Alternative Fuel Corridors.[1] The Inflation Reduction Act Section 30C Alternative Fuel Infrastructure Tax Credit provides credits up to $100,000 per charging port for qualifying installations, with a 30% federal tax credit applicable through December 31, 2032 under current policy.[3] State utility rebates vary widely but can offset 20–50% of total project cost in supportive jurisdictions.[3]
+            
+            **Critical risk:** In February 2025, the Department of Transportation rescinded its NEVI Program Guidance and suspended new state plan approvals; no new obligations could occur pending updated federal guidance, and as of mid-2025, the program's disbursement timeline remained uncertain with at least 16 states having filed legal challenges. Operators who require NEVI funding to make a project pencil should not underwrite final investment decisions until the program's status is resolved. The DOE Title 17 loan program has emerged as an alternative capital channel: EVgo received a conditional commitment for a loan guarantee of up to $1.05 billion from the U.S. Department of Energy Loan Programs Office under its Title 17 program, to build approximately 7,500 fast charging stalls across the U.S.[5]
+            
+            ---
+            
+            ## 6. Named-Operator Economics
+            
+            Three companies—Electrify America, EVgo, and Tesla—hold approximately 80 percent of the U.S. public DCFC market.[1] Their financial trajectories differ meaningfully.
+            
+            **EVgo** (NASDAQ: EVGO) is the most transparent owner-operator data point. In Q3 2024, EVgo reported revenue of $67.5 million and a GAAP net loss of $33.3 million, with network throughput of 78 GWh—a 92% and 111% year-over-year increase respectively.[5] Average daily throughput per stall for the EVgo network was 254 kilowatt hours per day in the third quarter of 2024, an increase of 64% compared to 155 kilowatt hours per day in the third quarter of 2023.[5] This throughput trajectory illustrates the scaling dynamic central to DCFC unit economics: the path to profitability runs through volume, not price. For full-year 2024, EVgo posted a GAAP net loss of approximately $126.7 million against revenue of roughly $257 million; for full-year 2025, revenue grew approximately 50% to $384 million while the net loss narrowed to approximately $95 million—reflecting improving gross margins (21.0% in 2025 vs. 11.4% in 2024) as throughput scaled.[5] EVgo had 3,680 stalls in operation at the end of Q3 2024, growing toward 5,100 by year-end 2025.
+            
+            **ChargePoint** (NASDAQ: CHPT) operates primarily as a hardware and software solutions provider rather than an owner-operator, selling charging equipment and network services to site-host owners who bear CapEx.[1] This model limits direct charging revenue exposure but also limits gross-margin upside from high-utilization sites. ChargePoint went public via SPAC in early 2021 and, despite operating the largest U.S. charging network by location count, has not generated annual profits; by 2026 the stock traded at a fraction of its peak market capitalization—an illustration of how network scale does not automatically translate into unit-economics health.
+            
+            **Electrify America** is a privately held subsidiary of Volkswagen Group, originally capitalized by the $2 billion Dieselgate environmental settlement obligation, and does not report standalone financials.
+            
+            The key cross-operator insight from peer-reviewed modeling: based on current adoption and utilization rates in the U.S., the business model involving an owner-operator collaborating with a public partner ensures profitability and protects the investment in DCFC stations from financial losses—while sole owner-operator models show negative NPV in most geographies at today's utilization levels.[3]
+            
+            ---
+            
+            ## 7. The Skeptical View
+            
+            The most rigorous bearish case comes from the Great Plains Institute's empirical study of DCFC economics, as reported by Utility Dive: *"Today's economics and the average electric utility rates mean that nearly all DCFC scenarios lose money,"* GPI study authors Dane McFarlane and Matt Prorok wrote.[4] The mechanism is structural: *"DCFC charging stations will currently lose money every year until increased EV adoption results in more charging customers each day."*[4] GPI identified a chicken-and-egg trap where more chargers are needed to accelerate EV adoption, but chargers lose money without the EV volume to drive utilization. In most other cases, it is very difficult for a DCFC station to break even due to demand charges.[4]
+            
+            This finding remains directionally valid even as market conditions have improved. As of Q1 2026, 73,394 public DCFC ports across 13,708 locations are operational,[6] yet the national utilization rate is slipping—not rising—as supply outgrows demand in many markets. The GPI conclusion applies most acutely to high-power (>150 kW) stations in low-EV-density markets: eliminating demand charges is the fastest path to profitability, and without tariff reform or battery storage, those stations face structurally negative unit economics regardless of site selection.
+            
+            ---
+            
+            ## Sources
+            
+            [1] "A 150 to 350kW DCFC charging unit can cost anywhere from $45,000 to over $100,000, and installation costs can range from $40,000 to over $150,000." - Can public EV fast-charging stations be profitable in the United States? - https://www.mckinsey.com/features/mckinsey-center-for-future-mobility/our-insights/can-public-ev-fast-charging-stations-be-profitable-in-the-united-states
+            
+            [2] "At 50 kW, demand charges account for 24 percent to 39 percent of a DCFC station's annual costs." - Analysis: How Demand Charges Impact Electric Vehicle Fast Charging Infrastructure - https://betterenergy.org/blog/demand-charges-and-dcfc/
+            
+            [3] "NREL (2024) project data confirms that electrical infrastructure frequently accounts for 40–60% of total DC fast charger project cost — often exceeding the hardware cost itself at complex sites." - EV Charging Station Cost in 2026: Complete Business Guide - https://trendxinsights.com/blogs/ev-charging-station-cost-usa/
+            
+            [4] "Today's economics and the average electric utility rates mean that nearly all DCFC scenarios lose money," GPI study authors Dane McFarlane and Matt Prorok wrote. - 'Nearly all' high voltage EV charging stations lose money: Report - https://www.utilitydive.com/news/nearly-all-high-voltage-ev-charging-stations-lose-money-report/561026/
+            
+            [5] "Net Loss of $33.3 million" - EVgo Inc. Reports Record Third Quarter 2024 Results (8-K Exhibit 99.1) - https://www.sec.gov/Archives/edgar/data/1821159/000155837024015153/evgo-20241112xex99d1.htm
+            
+            [6] "The average price of DC fast charging in Q1 2026 was $0.53 per kWh (excluding free chargers, which would bring the average down to $0.49/kWh). That's the same level as in Q4 2025." - Paren's Q1 2026 Report: US DCFC Infrastructure Grows, Utilization Weakens - https://evchargingstations.com/chargingnews/parens-q1-2026-report/
+          text/plain: <IPython.core.display.Markdown object>
+  markdown cell:
+    source:
+      ## Conclusion
+      
+      `define_outcome` fits when there's a way to check the output that doesn't depend on trusting the writer. In our case, that was fetching each URL and reading the site's contents. In other domains it might be running a test suite, validating a schema, or more.
+      
+      For more on `define_outcome`, see the [Managed Agents documentation](https://platform.claude.com/docs/en/managed-agents/define-outcomes).

Generated by nbdime

@github-actions

github-actions Bot commented May 3, 2026

Copy link
Copy Markdown

@markn-ant

Copy link
Copy Markdown
Contributor Author

Moving to claude-cookbooks-private for internal review first.

@markn-ant markn-ant closed this May 3, 2026
@markn-ant markn-ant deleted the markn/cma-callable-agents-outcomes branch May 3, 2026 19:55

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review

Recommendation: REQUEST_CHANGES

Summary

Adds two well-crafted CMA tutorial notebooks: CMA_coordinate_specialist_team.ipynb (heterogeneous multiagent via callable_agents) and CMA_verify_with_outcome_grader.ipynb (iterative grader loop via define_outcome). Code quality is high, model aliases are correct, and the pedagogical approach is strong. A few issues need addressing before merge.

Actionable Feedback (5 items)
  • CMA_verify_with_outcome_grader.ipynb (in cell with USER_MESSSAGE = """) — Typo: variable is named USER_MESSSAGE (three S's). It works at runtime since definition and usage match, but is visible to all readers of the source. Rename to USER_MESSAGE in both the definition and the reference two cells later.

  • CMA_verify_with_outcome_grader.ipynb (inside the RUBRIC string) — Incomplete sentence: "COVERAGE CHECKLIST. Each item has a specific area" ends without a predicate. This is sent verbatim to the grader agent. Should read something like "Each item has a specific bar that must be cleared" to make the instruction unambiguous.

  • CMA_verify_with_outcome_grader.ipynb (polling cell, elif et == "session.status_idle": done = True) — Premature loop exit: the session can go idle between the writer finishing and the grader spinning up. If this fires before span.outcome_evaluation_end, the loop exits with res = None and displays None after 0 iterations. The coordinate notebook guards this with if created > 0 and idled >= created:; add an analogous guard here (e.g. only set done = True on session.status_idle when it > 0 or res is not None).

  • Both notebooks (last cell) — No resource cleanup: the coordinate notebook creates 4 agents, 1 environment, 9 files, and 1 session; the verify notebook creates 1 agent, 1 environment, and 1 session. None are archived. Other CMA notebooks (e.g. CMA_iterate_fix_failing_tests.ipynb) include a final cleanup cell calling client.beta.sessions.archive, client.beta.environments.archive, and client.beta.agents.archive. Add equivalent cleanup cells to both new notebooks to avoid dangling resources (especially important for tutorial readers who may run the notebook multiple times).

  • Both notebooks (polling loops) — No wall-clock timeout: if the session stalls without firing session.status_idle, both loops poll indefinitely. The coordinate notebook has an additional edge case: if created stays 0 (coordinator spawns no subagents), the if created > 0 guard blocks every subsequent idle event from breaking the loop. Add a deadline guard (e.g. if time.time() > time.time() + 600: break) or add a prose note that this is a simplified tutorial loop, and fix the created == 0 dead-lock case with a comment or guard.

Detailed Review

Code Quality

Both notebooks follow the CMA tutorial conventions well: COOKBOOK_MODEL env var, numbered sections, committed outputs, httpx raw polling explained with a comment. The for/else/break pattern in the coordinate notebook's polling loop is correct and idiomatic. The text_of() helper defensively handles both string and typed-block content shapes. The render_feedback regex stripping in the verify notebook is pragmatic and well-commented.

Security

No hardcoded secrets. client.api_key is used in the httpx header — this correctly reads from the SDK client (which sources it from ANTHROPIC_API_KEY), not from any hardcoded value. No injection risks in the notebooks.

Model Usage

Correct non-dated aliases used throughout: claude-opus-4-7 for the coordinate notebook and claude-sonnet-4-6 for the verify notebook. The choice of Opus for the coordinator (which orchestrates three parallel specialists) is appropriate.

Suggestions (non-blocking)

  • Both notebooks use %pip install anthropic without the -q flag. Other CMA notebooks use -q to suppress noisy output. Consider changing to %pip install -q anthropic.
  • The verify notebook re-imports time in the polling cell (import re, time) even though time was already imported at the top. The duplicate is harmless but slightly confusing for readers.
  • Both notebooks could add 2–4 bullet-point "By the end of this notebook you will have…" learning objectives to the intro cell, following the style of CMA_iterate_fix_failing_tests.ipynb. Minor given the README table fills this role, but it would bring these to par with the rest of the series.

Positive Notes

  • The RUBRIC in the verify notebook is a standout artifact: the five-point specificity on the named-operator citation (GAAP, 10-K/10-Q, sec.gov only) is precise enough that a grader agent can actually verify it, and the narrative explaining what the grader caught (8-K exhibit vs. 10-K) gives readers a concrete, non-obvious example of rubric precision in practice.
  • README entries are accurate and slot cleanly into the existing table format. Registry entries have correct paths, author, date, and categories.
  • The markn-ant author is present in authors.yaml.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant