Merged
46 changes: 9 additions & 37 deletions managed_agents/CMA_explore_unfamiliar_codebase.ipynb
@@ -7,28 +7,14 @@
"source": [
"# Explore: grounding in an unfamiliar codebase\n",
"\n",
"This notebook drops the agent into a repository it's never seen\n",
"before and asks it to figure out the real architecture. The\n",
"filesystem is the agent's only workspace, and only the files it\n",
"chooses to read end up in its context window, so exploration with\n",
"`ls`, `grep`, and `read` is how it builds up a mental model.\n",
"This notebook drops the agent into a repository it's never seen before and asks it to figure out the real architecture. The filesystem is the agent's only workspace, and only the files it chooses to read end up in its context window, so exploration with `ls`, `grep`, and `read` is how it builds up a mental model.\n",
"\n",
"The interesting part is a trap we've planted in the fixture.\n",
"`ARCHITECTURE.md` describes a layout that the code no longer\n",
"follows, so an agent that trusts the docs without checking the\n",
"code will confidently give the wrong answer. Grounding, in this\n",
"context, means verifying what you read against what's actually\n",
"there rather than treating documentation as authoritative.\n",
"The interesting part is a trap we've planted in the fixture. `ARCHITECTURE.md` describes a layout that the code no longer follows, so an agent that trusts the docs without checking the code will confidently give the wrong answer. Grounding, in this context, means verifying what you read against what's actually there rather than treating documentation as authoritative.\n",
"\n",
"What this teaches beyond the iterate notebook:\n",
"\n",
"- **Exploration before action.** A good agent reads enough of the\n",
" tree to understand it, then answers, not the other way around.\n",
"- **Adding resources mid-session.** The sidebar at the end shows\n",
" how to push more files into a running session via\n",
" `sessions.resources.add` rather than re-creating the session.\n",
" Useful when exploration uncovers something the agent should\n",
" look at next."
"- **Exploration before action.** A good agent reads enough of the tree to understand it, then answers, not the other way around.\n",
"- **Adding resources mid-session.** The sidebar at the end shows how to push more files into a running session via `sessions.resources.add` rather than re-creating the session. Useful when exploration uncovers something the agent should look at next."
]
},
{
@@ -60,10 +46,7 @@
"source": [
"## 1. Generate the repo fixture\n",
"\n",
"The repo is small enough that we build it in memory with a helper\n",
"rather than keeping a disk fixture alongside the notebook. The\n",
"helper plants a `services/` microservices layout and a stale\n",
"`ARCHITECTURE.md` that still describes the old monolithic layout."
"The repo is small enough that we build it in memory with a helper rather than keeping a disk fixture alongside the notebook. The helper plants a `services/` microservices layout and a stale `ARCHITECTURE.md` that still describes the old monolithic layout."
]
},
{
@@ -133,9 +116,7 @@
"source": [
"## 3. Explore and watch for the stale-doc trap\n",
"\n",
"A grounded answer mentions the real `services/` layout and flags\n",
"`ARCHITECTURE.md` as out of date. An ungrounded answer parrots\n",
"the monolith layout the stale doc describes."
"A grounded answer mentions the real `services/` layout and flags `ARCHITECTURE.md` as out of date. An ungrounded answer parrots the monolith layout the stale doc describes."
]
},
{
@@ -176,9 +157,7 @@
"source": [
"## 4. Read back the agent's notes\n",
"\n",
"The agent was told to keep notes in `/tmp/NOTES.md` as it worked.\n",
"Printing that file is a useful way to see how its understanding\n",
"of the codebase developed during exploration."
"The agent was told to keep notes in `/tmp/NOTES.md` as it worked. Printing that file is a useful way to see how its understanding of the codebase developed during exploration."
]
},
{
@@ -207,16 +186,9 @@
"source": [
"## Sidebar: adding more context to a running session\n",
"\n",
"The `resources=` argument on `sessions.create` is the most common\n",
"way to mount files, but the API also exposes a\n",
"`/v1/sessions/<id>/resources` sub-resource for managing mounts on\n",
"an existing session. This is useful here: if exploration uncovers\n",
"a question that needs additional context (a config file, a\n",
"changelog, an external schema), you can drop it in without\n",
"tearing down the session.\n",
"The `resources=` argument on `sessions.create` is the most common way to mount files, but the API also exposes a `/v1/sessions/<id>/resources` sub-resource for managing mounts on an existing session. This is useful here: if exploration uncovers a question that needs additional context (a config file, a changelog, an external schema), you can drop it in without tearing down the session.\n",
"\n",
"The pattern is the same upload-then-attach loop you already know,\n",
"just split across two calls instead of one:"
"The pattern is the same upload-then-attach loop you already know, just split across two calls instead of one:"
]
},
{
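The grounding check that section 3 of the explore notebook describes can be illustrated outside the agent: compare the directories a document claims against what is on disk. This is a minimal sketch under stated assumptions, not the notebook's fixture helper. The names `stale_doc_claims` and `build_demo_repo`, the `services/billing` and `services/auth` subdirectories, and the `app/` and `lib/` directories standing in for the old monolithic layout are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def stale_doc_claims(repo_root: Path, claimed_dirs: List[str]) -> List[str]:
    """Return the claimed directories that do not exist under repo_root."""
    return [d for d in claimed_dirs if not (repo_root / d).is_dir()]


def build_demo_repo(root: Path) -> None:
    """Create a tiny services/ layout plus a doc describing a layout that is gone.

    Hypothetical stand-in for the notebook's in-memory fixture.
    """
    (root / "services" / "billing").mkdir(parents=True)
    (root / "services" / "auth").mkdir(parents=True)
    (root / "ARCHITECTURE.md").write_text("Code lives in app/ and lib/.\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_demo_repo(root)
    # Directories the stale doc mentions, plus the one that really exists.
    missing = stale_doc_claims(root, ["app", "lib", "services"])

print(missing)
```

A non-empty result is the signal that the document and the tree disagree, which is the situation the notebook expects a grounded agent to flag rather than repeat.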
81 changes: 13 additions & 68 deletions managed_agents/CMA_gate_human_in_the_loop.ipynb
@@ -7,54 +7,24 @@
"source": [
"# Gate: human-in-the-loop with custom tools\n",
"\n",
"Many workflows sit in the gap between \"fully automate\" and\n",
"\"always ask a human.\" Expense approval is a classic example: the\n",
"agent can handle the clear cases on its own, but it should know\n",
"when to escalate ambiguous ones for human review. Calibration\n",
"matters here, an agent that escalates everything is exhausting\n",
"to work with, and an agent that escalates nothing is dangerous.\n",
"Many workflows sit in the gap between \"fully automate\" and \"always ask a human.\" Expense approval is a classic example: the agent can handle the clear cases on its own, but it should know when to escalate ambiguous ones for human review. Calibration matters here, an agent that escalates everything is exhausting to work with, and an agent that escalates nothing is dangerous.\n",
"\n",
"This notebook builds an expense approver around two **custom\n",
"tools**: `decide()` for clear-cut cases and `escalate()` for\n",
"ambiguous ones. Both round-trip through your application, which\n",
"is where you either log the outcome (decide) or put it in front\n",
"of a reviewer (escalate).\n",
"This notebook builds an expense approver around two **custom tools**: `decide()` for clear-cut cases and `escalate()` for ambiguous ones. Both round-trip through your application, which is where you either log the outcome (decide) or put it in front of a reviewer (escalate).\n",
"\n",
"## What custom tools are\n",
"\n",
"Up until now the cookbook has used the built-in `agent_toolset`\n",
"(bash, read, write, etc.), all of which run inside the sandbox\n",
"container. **Custom tools** are different: when the agent calls\n",
"one, the session pauses and emits an `agent.custom_tool_use`\n",
"event, your application sees the call, runs whatever code you\n",
"want, and POSTs back a `user.custom_tool_result` event. The\n",
"session resumes with that result in the agent's context.\n",
"Up until now the cookbook has used the built-in `agent_toolset` (bash, read, write, etc.), all of which run inside the sandbox container. **Custom tools** are different: when the agent calls one, the session pauses and emits an `agent.custom_tool_use` event, your application sees the call, runs whatever code you want, and POSTs back a `user.custom_tool_result` event. The session resumes with that result in the agent's context.\n",
"\n",
"This is the right shape for two situations:\n",
"\n",
"1. **The data lives somewhere the sandbox can't reach.** Anything\n",
" behind your own network boundary. The agent calls back into\n",
" your application via the round-trip.\n",
"2. **You want a human in the loop, or your own audit and approval\n",
" layer in front of every call.** That's what this notebook does:\n",
" `decide` and `escalate` aren't just \"tools\" in the abstract,\n",
" they're the seam where your business logic and human reviewers\n",
" take over from the agent.\n",
"1. **The data lives somewhere the sandbox can't reach.** Anything behind your own network boundary. The agent calls back into your application via the round-trip.\n",
"2. **You want a human in the loop, or your own audit and approval layer in front of every call.** That's what this notebook does: `decide` and `escalate` aren't just \"tools\" in the abstract, they're the seam where your business logic and human reviewers take over from the agent.\n",
"\n",
"(The other extension patterns, MCP toolsets and `resources=`\n",
"repo mounts, are covered in the operate notebook and the orchestrate notebook\n",
"respectively.)\n",
"(The other extension patterns, MCP toolsets and `resources=` repo mounts, are covered in the operate notebook and the orchestrate notebook respectively.)\n",
"\n",
"The notebook has two parts. Part A drives the session by\n",
"streaming events locally and responding to each custom tool call\n",
"as it arrives, convenient during development because everything\n",
"happens in one process and you can see the behavior live. Part B\n",
"is a short pointer to the production webhook pattern, which is\n",
"walked through end-to-end in the operate notebook.\n",
"The notebook has two parts. Part A drives the session by streaming events locally and responding to each custom tool call as it arrives, convenient during development because everything happens in one process and you can see the behavior live. Part B is a short pointer to the production webhook pattern, which is walked through end-to-end in the operate notebook.\n",
"\n",
"The fixture lives in `example_data/gate/` and contains a\n",
"`policy.yaml` plus twelve receipts that exercise every branch of\n",
"the policy."
"The fixture lives in `example_data/gate/` and contains a `policy.yaml` plus twelve receipts that exercise every branch of the policy."
]
},
{
@@ -112,17 +82,9 @@
"source": [
"## 2. Define the agent with two custom tools\n",
"\n",
"Custom tools are declared in the same `tools=` array as the\n",
"built-in toolset, with `\"type\": \"custom\"` and a JSON schema for\n",
"the input. Each declaration tells the model what the tool is for\n",
"(`description`), what to call it with (`input_schema`), and what\n",
"its name is. The agent decides when to call them; your code\n",
"decides what they do when called.\n",
"Custom tools are declared in the same `tools=` array as the built-in toolset, with `\"type\": \"custom\"` and a JSON schema for the input. Each declaration tells the model what the tool is for (`description`), what to call it with (`input_schema`), and what its name is. The agent decides when to call them; your code decides what they do when called.\n",
"\n",
"Here we keep the built-in `agent_toolset_20260401` enabled too,\n",
"so the agent can read the policy file and the receipts inline.\n",
"`decide` and `escalate` are the two custom tools that make every\n",
"decision a round-trip."
"Here we keep the built-in `agent_toolset_20260401` enabled too, so the agent can read the policy file and the receipts inline. `decide` and `escalate` are the two custom tools that make every decision a round-trip."
]
},
{
@@ -210,12 +172,7 @@
"source": [
"## Part A: streaming locally during development\n",
"\n",
"The simplest way to drive a custom-tool agent is to stream the\n",
"session's events and react to each tool call as it arrives.\n",
"`decide` calls get logged and `escalate` calls get a simulated\n",
"human decision inline. In production you would queue the\n",
"escalation and have a real reviewer come back to it later, which\n",
"is what the operate notebook covers."
"The simplest way to drive a custom-tool agent is to stream the session's events and react to each tool call as it arrives. `decide` calls get logged and `escalate` calls get a simulated human decision inline. In production you would queue the escalation and have a real reviewer come back to it later, which is what the operate notebook covers."
]
},
{
@@ -335,21 +292,9 @@
"source": [
"## Part B: webhooks for production\n",
"\n",
"The local streaming pattern works fine during development, but\n",
"it holds an HTTP connection open while humans think, which\n",
"doesn't scale well. The production pattern instead registers a\n",
"webhook in the Console that fires on `session.status_idled`,\n",
"which is the signal that the agent is either done or waiting on\n",
"a tool result. Your server inspects the events, puts any pending\n",
"escalation in front of a reviewer, and POSTs the\n",
"`user.custom_tool_result` back whenever the human finishes, no\n",
"long-lived connection on your side.\n",
"The local streaming pattern works fine during development, but it holds an HTTP connection open while humans think, which doesn't scale well. The production pattern instead registers a webhook in the Console that fires on `session.status_idled`, which is the signal that the agent is either done or waiting on a tool result. Your server inspects the events, puts any pending escalation in front of a reviewer, and POSTs the `user.custom_tool_result` back whenever the human finishes, no long-lived connection on your side.\n",
"\n",
"The operate notebook walks through the full webhook setup end-to-end:\n",
"Console registration, HMAC signature verification, the FastAPI\n",
"handler, and the round-trip back to `events.send`. The code that\n",
"responds to the agent is identical to Part A above; only the\n",
"trigger changes (webhook push instead of streaming pull)."
"The operate notebook walks through the full webhook setup end-to-end: Console registration, HMAC signature verification, the FastAPI handler, and the round-trip back to `events.send`. The code that responds to the agent is identical to Part A above; only the trigger changes (webhook push instead of streaming pull)."
]
}
],
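The custom-tool round-trip that the gate notebook describes (an `agent.custom_tool_use` event comes in, application code runs, a `user.custom_tool_result` goes back) can be sketched as a plain dispatch function. This is an offline illustration under stated assumptions, not the SDK's API: only the two event type strings come from the notebook text. The `id`, `name`, `input`, `custom_tool_use_id`, and `content` field names, the `agent.message` event type, the receipt IDs, and the handler functions are hypothetical.

```python
from typing import Callable, Dict, Optional


def handle_decide(args: dict) -> str:
    """Clear-cut case: record the outcome (here, just format a log line)."""
    return f"logged {args['receipt_id']}: {args['decision']}"


def handle_escalate(args: dict) -> str:
    """Ambiguous case: stand-in for a human reviewer's decision."""
    return f"reviewer approved {args['receipt_id']}"


HANDLERS: Dict[str, Callable[[dict], str]] = {
    "decide": handle_decide,
    "escalate": handle_escalate,
}


def respond(event: dict) -> Optional[dict]:
    """Build the reply for a custom tool call, or None for any other event."""
    if event.get("type") != "agent.custom_tool_use":
        return None
    handler = HANDLERS[event["name"]]
    return {
        "type": "user.custom_tool_result",
        "custom_tool_use_id": event["id"],  # hypothetical field name
        "content": handler(event["input"]),
    }


# Simulated event stream; shapes are assumptions for illustration only.
events = [
    {"type": "agent.message", "text": "Reading policy.yaml"},
    {"type": "agent.custom_tool_use", "id": "tu_1", "name": "decide",
     "input": {"receipt_id": "r-01", "decision": "approve"}},
    {"type": "agent.custom_tool_use", "id": "tu_2", "name": "escalate",
     "input": {"receipt_id": "r-07", "reason": "missing itemization"}},
]
results = [respond(e) for e in events]
```

Keeping the reply logic in one function is what lets the same code sit behind either trigger the notebook mentions: a local streaming loop during development or a webhook handler in production.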