|
17 | 17 | "\n", |
18 | 18 | "Frontier multimodal models make a different workflow possible. Instead of training a task-specific model on many paired examples for one layout domain, we can test whether a reasoning-capable model like GPT-5.5 can solve more of the task directly when given enough scaffolding: clear visual context, an operating procedure, constrained objects, a structured output format, and evals that identify where the result breaks.\n", |
19 | 19 | "\n", |
20 | | - "The use case is office layout generation from an empty floorplan. Given an empty floorplan image, an office-layout SOP, and a furniture catalog, the model has to produce a structured plan for where furniture should go and why. In this project, grounding does not only mean pointing to the right region of an image. It means the model’s planning decisions have to remain tied to the geometry and constraints visible in the source image. If the source drawing contains a wall, doorway, partition, room boundary, or circulation path, the generated plan has to respect that structure rather than produce something that merely looks plausible.\n", |
| 20 | + "The use case is office layout generation from an empty floorplan. Given an empty floorplan image, an office-layout SOP, and a furniture catalog, GPT-5.5 has to produce a structured plan for where furniture should go and why. The idea also grew out of a real office problem: headcount in our New York office was growing fast enough that “could we fit a few more desks somewhere?” started to sound like a reasonable thing to ask a model.\n",
21 | 21 | "\n", |
22 | 22 | "We use an office floorplan as the test case because the failures are easy to inspect. GPT-5.5 receives an empty floorplan, a reusable layout SOP, and a small furniture catalog, then produces a structured layout spec. The spec identifies spaces, infers room roles, applies numeric rules, places furniture, and records assumptions about the source drawing.\n", |
23 | 23 | "\n", |
24 | | - "The experiment is built around evaluating that spec, not judging a rendered image alone. Each run can be checked for mechanical validity, semantic grounding, and similarity to a human-authored reference when one exists. This lets us separate failure modes that a visual mockup would collapse together: invalid geometry, wrong room interpretation, incorrect furniture program, weak source grounding, and render drift." |
| 24 | + "The experiment is built around evaluating that spec, not judging a rendered image alone. Each run can be checked for mechanical validity, semantic grounding, and similarity to a human-authored reference when one exists. This lets us separate failure modes that a visual mockup would collapse together: invalid geometry, wrong room interpretation, incorrect furniture program, and weak source grounding." |
25 | 25 | ] |
26 | 26 | }, |
27 | 27 | { |
|
31 | 31 | "source": [ |
32 | 32 | "## Task setup: from floorplan to layout spec\n", |
33 | 33 | "\n", |
34 | | - "The generation step separates the problem into three inputs: 1) visual evidence from the floorplan, 2) planning policy from the SOP, and 3) physical scale from the furniture catalog.\n", |
| 34 | + "The generation step separates the problem into three inputs: 1) visual evidence from the floorplan, 2) planning policy from the SOP, and 3) object dimensions from the furniture catalog.\n", |
35 | 35 | "\n", |
36 | 36 | "The visible evidence is the empty floorplan image, `empty.png`. The model infers walls, doors, stairs, restrooms, partitions, openings, and usable room shapes from the image itself. It does not see the filled reference image or the gold JSON during generation; those artifacts are held back for evaluation.\n", |
37 | 37 | "\n", |
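The three-way split above can be sketched as a single multimodal request that bundles visual evidence, planning policy, and object dimensions. This is a hypothetical shape, not the notebook's actual generation code: the function name, field names, and the `layout_spec_json` format label are illustrative assumptions.

```python
import base64
import json

def build_generation_request(floorplan_png: bytes, sop_text: str, catalog: dict) -> dict:
    """Bundle the three inputs into one request payload (illustrative shape)."""
    image_b64 = base64.b64encode(floorplan_png).decode("ascii")
    return {
        "inputs": [
            {"type": "image", "data": image_b64},           # visual evidence: empty.png
            {"type": "text", "data": sop_text},             # planning policy: the SOP
            {"type": "text", "data": json.dumps(catalog)},  # object dimensions: the catalog
        ],
        # Hypothetical label for "return a structured layout spec, not prose".
        "response_format": "layout_spec_json",
    }
```

Keeping the inputs separate like this makes it easy to ablate one of them later, for example rerunning with the SOP removed to see how much of the layout quality it carries.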
|
77 | 77 | " ]\n", |
78 | 78 | "}\n", |
79 | 79 | "```\n", |
80 | | - "The furniture catalog supplies physical scale. It turns labels like `workstation`, `reception_desk`, or `storage_cabinet` into objects with dimensions and simple placement priors, so the model has to reason about whether objects actually fit.\n", |
| 80 | + "The furniture catalog supplies object dimensions. It turns labels like `workstation`, `reception_desk`, or `storage_cabinet` into objects with dimensions and simple placement priors, so the model has to reason about whether objects actually fit.\n", |
81 | 81 | "\n", |
82 | 82 | "```json\n", |
83 | 83 | "{\n", |
|
129 | 129 | "This is the only pass/fail gate. It checks whether the generated layout is mechanically valid against `constraints.json` and the source floorplan. The check covers hard spatial constraints such as source-wall collisions, object overlap, placements staying inside declared zones, and other geometry errors that would make the plan unusable regardless of how plausible the render looks.\n",
130 | 130 | "\n", |
131 | 131 | "2. **grounded_program — semantic grounding score.**\n", |
132 | | - "This check uses GPT-5.5 to judge whether the layout is faithful to the visible floorplan and the brief. It looks at whether the right kinds of rooms were recognized, whether major spaces were furnished appropriately, and whether the zoning interpretation is grounded in the drawing rather than invented. This is a score, not a hard reject rule, because semantic interpretation is harder to reduce to a single deterministic check.\n", |
| 132 | + "This score combines deterministic room-coverage checks with a GPT-5.5 judgment about whether the layout is faithful to the visible floorplan and the brief. It looks at whether the right kinds of rooms were recognized, whether major spaces were furnished appropriately, and whether the zoning interpretation is grounded in the drawing rather than invented. This is a score, not a hard reject rule, because semantic interpretation is harder to reduce to a single deterministic check.\n", |
133 | 133 | "\n", |
134 | 134 | "3. **reference_similarity — deterministic gold-spec score.**\n", |
135 | 135 | "When a golden JSON spec exists, this check compares the generated layout to the reference using zone agreement, furniture-program agreement, and tolerant placement proximity. This gives us a benchmark-style signal without making the human reference the only acceptable layout. It is useful for comparing candidates, but it is not the hard gate.\n", |
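The deterministic core of the first gate can be sketched with plain axis-aligned geometry. This is a minimal illustration under assumed conventions (rectangles as `(x, y, width, height)` in metres, placements carrying `id`, `zone`, and `rect` fields), not the harness's actual validator.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height) in metres

def overlaps(a: Rect, b: Rect) -> bool:
    """True when two axis-aligned rectangles intersect with positive area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def inside(item: Rect, zone: Rect) -> bool:
    """True when `item` lies entirely within `zone`."""
    ix, iy, iw, ih = item
    zx, zy, zw, zh = zone
    return ix >= zx and iy >= zy and ix + iw <= zx + zw and iy + ih <= zy + zh

def hard_validity(placements: List[dict], zones: Dict[str, Rect], walls: List[Rect]) -> List[str]:
    """Collect hard-constraint violations; an empty list means the gate passes."""
    errors = []
    for i, p in enumerate(placements):
        if not inside(p["rect"], zones[p["zone"]]):
            errors.append(f"{p['id']}: outside declared zone {p['zone']}")
        if any(overlaps(p["rect"], w) for w in walls):
            errors.append(f"{p['id']}: collides with a source wall")
        for q in placements[i + 1:]:
            if overlaps(p["rect"], q["rect"]):
                errors.append(f"{p['id']}: overlaps {q['id']}")
    return errors
```

Because every violation is a named string rather than a bare boolean, a failing run produces a repair-ready error list instead of a single reject.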
|
237 | 237 | "source": [ |
238 | 238 | "### Run 15 - A valid end-to-end case clarified the division of labor\n", |
239 | 239 | "\n", |
240 | | - "In the strongest later case, the remaining issues were mostly local geometry problems rather than broad semantic confusion. The candidate reached `completed_valid` after one model repair for a zone-bounds issue, while deterministic geometry cleared residual desk and storage collisions through repair passes, orthogonal-rotation search, local repair, and a small joint-repack fallback.\n", |
| 240 | + "In the strongest later case, the run reached physical validity, though some semantic disagreement with the reference room program remained. The candidate reached `completed_valid` after one model repair for a zone-bounds issue, while the deterministic geometry stage cleared residual desk and storage collisions through repair passes, an orthogonal-rotation search, local repair, and a small joint-repack fallback.\n",
241 | 241 | "\n", |
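The orthogonal-rotation step mentioned above can be sketched as follows. This is a hedged illustration of the idea, not the notebook's repair code; the helper names and rectangle convention are assumptions, and for axis-aligned footprints only two of the four rotations are distinct (0°/180° and 90°/270° share a footprint).

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)

def collides(a: Rect, b: Rect) -> bool:
    """Axis-aligned rectangle intersection with positive area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def orthogonal_rotation_search(item: Rect, obstacles: List[Rect]) -> Optional[Rect]:
    """Return the first axis-aligned rotation of `item` that clears all obstacles."""
    x, y, w, h = item
    # 0°/180° keep (w, h); 90°/270° swap to (h, w).
    for cw, ch in ((w, h), (h, w)):
        candidate = (x, y, cw, ch)
        if not any(collides(candidate, obs) for obs in obstacles):
            return candidate
    return None  # fall through to local repair or the joint-repack fallback
```

Returning `None` rather than raising keeps the step composable: the caller can chain it ahead of the more expensive local-repair and joint-repack stages.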
242 | 242 | "<p align=\"center\">\n", |
243 | 243 | " <img src=\"images/grounded_spatial_run_15_trace_viewer.png\" alt=\"Run 15: valid rendered floorplan and trace viewer\" width=\"90%\">\n", |
|