|
17 | 17 | "\n", |
18 | 18 | "Frontier multimodal models make a different workflow possible. Instead of training a task-specific model on many paired examples for one layout domain, we can test whether a reasoning-capable model like GPT-5.5 can solve more of the task directly when given enough scaffolding: clear visual context, an operating procedure, constrained objects, a structured output format, and evals that identify where the result breaks.\n", |
19 | 19 | "\n", |
20 | | - "The use case is office layout generation from an empty floorplan. Given an empty floorplan image, an office-layout SOP, and a furniture catalog, the model has to produce a structured plan for where furniture should go and why. In this project, grounding does not only mean pointing to the right region of an image. It means the model’s planning decisions have to remain tied to the geometry and constraints visible in the source image. If the source drawing contains a wall, doorway, partition, room boundary, or circulation path, the generated plan has to respect that structure rather than produce something that merely looks plausible.\n", |
| 20 | + "The use case is office layout generation from an empty floorplan. Given an empty floorplan image, an office-layout SOP, and a furniture catalog, GPT-5.5 has to produce a structured plan for where furniture should go and why. The idea also grew out of a real office problem: headcount in our New York office was growing fast enough that “could we fit a few more desks somewhere?” started to sound like a reasonable thing to ask a model.\n",
21 | 21 | "\n", |
22 | 22 | "We use an office floorplan as the test case because the failures are easy to inspect. GPT-5.5 receives an empty floorplan, a reusable layout SOP, and a small furniture catalog, then produces a structured layout spec. The spec identifies spaces, infers room roles, applies numeric rules, places furniture, and records assumptions about the source drawing.\n", |
23 | 23 | "\n", |
24 | | - "The experiment is built around evaluating that spec, not judging a rendered image alone. Each run can be checked for mechanical validity, semantic grounding, and similarity to a human-authored reference when one exists. This lets us separate failure modes that a visual mockup would collapse together: invalid geometry, wrong room interpretation, incorrect furniture program, weak source grounding, and render drift." |
| 24 | + "The experiment is built around evaluating that spec, not judging a rendered image alone. Each run can be checked for mechanical validity, semantic grounding, and similarity to a human-authored reference when one exists. This lets us separate failure modes that a visual mockup would collapse together: invalid geometry, wrong room interpretation, incorrect furniture program, and weak source grounding." |
25 | 25 | ] |
26 | 26 | }, |
27 | 27 | { |
|
31 | 31 | "source": [ |
32 | 32 | "## Task setup: from floorplan to layout spec\n", |
33 | 33 | "\n", |
34 | | - "The generation step separates the problem into three inputs: 1) visual evidence from the floorplan, 2) planning policy from the SOP, and 3) physical scale from the furniture catalog.\n", |
| 34 | + "The generation step separates the problem into three inputs: 1) visual evidence from the floorplan, 2) planning policy from the SOP, and 3) object dimensions from the furniture catalog.\n", |
35 | 35 | "\n", |
36 | 36 | "The visible evidence is the empty floorplan image, `empty.png`. The model infers walls, doors, stairs, restrooms, partitions, openings, and usable room shapes from the image itself. It does not see the filled reference image or the gold JSON during generation; those artifacts are held back for evaluation.\n", |
37 | 37 | "\n", |
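The three-way split above can be sketched as a single multimodal request that bundles visual evidence, planning policy, and object dimensions. This is a hypothetical shape, not the notebook's actual generation code: the function name, field names, and the `layout_spec_json` format label are illustrative assumptions.

```python
import base64
import json

def build_generation_request(floorplan_png: bytes, sop_text: str, catalog: dict) -> dict:
    """Bundle the three inputs into one request payload (illustrative shape)."""
    image_b64 = base64.b64encode(floorplan_png).decode("ascii")
    return {
        "inputs": [
            {"type": "image", "data": image_b64},           # visual evidence: empty.png
            {"type": "text", "data": sop_text},             # planning policy: the SOP
            {"type": "text", "data": json.dumps(catalog)},  # object dimensions: the catalog
        ],
        # Hypothetical label for "return a structured layout spec, not prose".
        "response_format": "layout_spec_json",
    }
```

Keeping the inputs separate like this makes it easy to ablate one of them later, for example rerunning with the SOP removed to see how much of the layout quality it carries.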
|
77 | 77 | " ]\n", |
78 | 78 | "}\n", |
79 | 79 | "```\n", |
80 | | - "The furniture catalog supplies physical scale. It turns labels like `workstation`, `reception_desk`, or `storage_cabinet` into objects with dimensions and simple placement priors, so the model has to reason about whether objects actually fit.\n", |
| 80 | + "The furniture catalog supplies object dimensions. It turns labels like `workstation`, `reception_desk`, or `storage_cabinet` into objects with dimensions and simple placement priors, so the model has to reason about whether objects actually fit.\n", |
81 | 81 | "\n", |
82 | 82 | "```json\n", |
83 | 83 | "{\n", |
|
129 | 129 | "This is the only pass/fail gate. It checks whether the generated layout is mechanically valid against `constraints.json` and the source floorplan. The check covers hard spatial constraints such as source-wall collisions, object overlap, placements staying inside declared zones, and other geometry errors that would make the plan unusable regardless of how plausible the render looks.\n",
130 | 130 | "\n", |
131 | 131 | "2. **grounded_program — semantic grounding score.**\n", |
132 | | - "This check uses GPT-5.5 to judge whether the layout is faithful to the visible floorplan and the brief. It looks at whether the right kinds of rooms were recognized, whether major spaces were furnished appropriately, and whether the zoning interpretation is grounded in the drawing rather than invented. This is a score, not a hard reject rule, because semantic interpretation is harder to reduce to a single deterministic check.\n", |
| 132 | + "This score combines deterministic room-coverage checks with a GPT-5.5 judgment about whether the layout is faithful to the visible floorplan and the brief. It looks at whether the right kinds of rooms were recognized, whether major spaces were furnished appropriately, and whether the zoning interpretation is grounded in the drawing rather than invented. This is a score, not a hard reject rule, because semantic interpretation is harder to reduce to a single deterministic check.\n", |
133 | 133 | "\n", |
134 | 134 | "3. **reference_similarity — deterministic gold-spec score.**\n", |
135 | 135 | "When a golden JSON spec exists, this check compares the generated layout to the reference using zone agreement, furniture-program agreement, and tolerant placement proximity. This gives us a benchmark-style signal without making the human reference the only acceptable layout. It is useful for comparing candidates, but it is not the hard gate.\n", |
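The deterministic core of the first gate can be sketched with plain axis-aligned geometry. This is a minimal illustration under assumed conventions (rectangles as `(x, y, width, height)` in metres, placements carrying `id`, `zone`, and `rect` fields), not the harness's actual validator.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height) in metres

def overlaps(a: Rect, b: Rect) -> bool:
    """True when two axis-aligned rectangles intersect with positive area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def inside(item: Rect, zone: Rect) -> bool:
    """True when `item` lies entirely within `zone`."""
    ix, iy, iw, ih = item
    zx, zy, zw, zh = zone
    return ix >= zx and iy >= zy and ix + iw <= zx + zw and iy + ih <= zy + zh

def hard_validity(placements: List[dict], zones: Dict[str, Rect], walls: List[Rect]) -> List[str]:
    """Collect hard-constraint violations; an empty list means the gate passes."""
    errors = []
    for i, p in enumerate(placements):
        if not inside(p["rect"], zones[p["zone"]]):
            errors.append(f"{p['id']}: outside declared zone {p['zone']}")
        if any(overlaps(p["rect"], w) for w in walls):
            errors.append(f"{p['id']}: collides with a source wall")
        for q in placements[i + 1:]:
            if overlaps(p["rect"], q["rect"]):
                errors.append(f"{p['id']}: overlaps {q['id']}")
    return errors
```

Because every violation is a named string rather than a bare boolean, a failing run produces a repair-ready error list instead of a single reject.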
|
237 | 237 | "source": [ |
238 | 238 | "### Run 15 - A valid end-to-end case clarified the division of labor\n", |
239 | 239 | "\n", |
240 | | - "In the strongest later case, the remaining issues were mostly local geometry problems rather than broad semantic confusion. The candidate reached `completed_valid` after one model repair for a zone-bounds issue, while deterministic geometry cleared residual desk and storage collisions through repair passes, orthogonal-rotation search, local repair, and a small joint-repack fallback.\n", |
| 240 | + "In the strongest later case, the run reached physical validity, though some semantic disagreement with the reference room program remained. The candidate reached `completed_valid` after one model repair for a zone-bounds issue, while the deterministic geometry stage cleared residual desk and storage collisions through repair passes, an orthogonal-rotation search, local repair, and a small joint-repack fallback.\n",
241 | 241 | "\n", |
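The orthogonal-rotation step mentioned above can be sketched as follows. This is a hedged illustration of the idea, not the notebook's repair code; the helper names and rectangle convention are assumptions, and for axis-aligned footprints only two of the four rotations are distinct (0°/180° and 90°/270° share a footprint).

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)

def collides(a: Rect, b: Rect) -> bool:
    """Axis-aligned rectangle intersection with positive area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def orthogonal_rotation_search(item: Rect, obstacles: List[Rect]) -> Optional[Rect]:
    """Return the first axis-aligned rotation of `item` that clears all obstacles."""
    x, y, w, h = item
    # 0°/180° keep (w, h); 90°/270° swap to (h, w).
    for cw, ch in ((w, h), (h, w)):
        candidate = (x, y, cw, ch)
        if not any(collides(candidate, obs) for obs in obstacles):
            return candidate
    return None  # fall through to local repair or the joint-repack fallback
```

Returning `None` rather than raising keeps the step composable: the caller can chain it ahead of the more expensive local-repair and joint-repack stages.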
242 | 242 | "<p align=\"center\">\n", |
243 | 243 | " <img src=\"images/grounded_spatial_run_15_trace_viewer.png\" alt=\"Run 15: valid rendered floorplan and trace viewer\" width=\"90%\">\n", |
|