# [Proposal] Consider adding a low-cost scaffold review workflow based on run logs

> Note: this write-up is based on a prior Chinese discussion and is being shared here, per the original framing, as a "GPT-5.4 Pro proposal" for discussion.

## Background

As projects like `humanize` grow into real-world agent scaffolds, more contributors will naturally add new features, roles, and workflows. A related challenge is that it becomes harder to tell whether those changes are actually improving the system.

[…]

If a recommendation cannot yet be written in this format, it may be better treated as an observation rather than an immediate action item.

## Workflow details

### 0. Prepare the inputs

Each run should keep at least these fields:

- `session_id`
- `scaffold_version`
- `model_version`
- `task_id`
- `task_slice`
- `repo / language / task_size`
- `events[]`: plan, search, read, edit, test, review, refine, stop, handoff
- `artifacts`: diff, test results, review comments
- `outcome`: success / failure / false finish / human takeover
- `cost`: tokens, latency, number of turns
- `budget`: the token / time budget given to the agent for that run

There are also two useful static inputs:

- **current scaffold spec**: planner, search, builder, reviewer, refiner, stop policy, escalation rule, memory policy
- **task taxonomy**: for example `small-fix / multifile / debug / review-only / resume / refactor`

The most important point is: **logs need to be connected to scaffold versions**. Otherwise reviewers can describe symptoms, but they cannot attribute them.
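
If it helps to make this concrete, a single run record could look roughly like the sketch below. The type is purely illustrative: the field names mirror the list above, and `dict` stands in for whatever nested structure the logs actually use.

```python
from typing import Literal, TypedDict


class RunRecord(TypedDict):
    # identity and versioning: without scaffold_version, symptoms cannot be attributed
    session_id: str
    scaffold_version: str
    model_version: str
    # task context
    task_id: str
    task_slice: str      # e.g. "small-fix", "multifile", "debug"
    repo: str
    language: str
    task_size: str
    # what happened during the run
    events: list[dict]   # plan / search / read / edit / test / review / refine / stop / handoff
    artifacts: dict      # diff, test results, review comments
    outcome: Literal["success", "failure", "false_finish", "human_takeover"]
    # cost and budget
    cost: dict           # tokens, latency, number of turns
    budget: dict         # token / time budget given to the agent for this run
```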

### 1. Clean, redact, and normalize

It probably makes sense not to feed raw logs directly into a strong model.

First do three things:

- redact sensitive data such as secrets, paths, customer data, and internal URLs
- normalize different agent / event formats into a single schema
- segment the data by session, task, and review loop

A natural output artifact could be `normalized_sessions.jsonl`.
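
A rough sketch of what this stage could do, assuming the raw logs arrive as JSON lines; the redaction patterns, event fields, and file layout below are placeholders, not a real schema:

```python
import json
import re
from pathlib import Path

# Placeholder patterns only; a real deployment needs a much longer list
# (cloud keys, internal hostnames, customer identifiers, ...).
REDACTIONS = [
    (re.compile(r"(?i)(api[_-]?key|token|secret)\s*[:=]\s*\S+"), r"\1=<redacted>"),
    (re.compile(r"https?://[\w.-]*internal[\w./-]*"), "<internal-url>"),
]


def redact(text: str) -> str:
    for pattern, repl in REDACTIONS:
        text = pattern.sub(repl, text)
    return text


def normalize_event(raw_event: dict) -> dict:
    """Map one agent-specific event into the shared schema."""
    return {
        "type": raw_event.get("type") or raw_event.get("action", "unknown"),
        "turn": raw_event.get("turn"),
        "content": redact(str(raw_event.get("content", ""))),
    }


def build_normalized_sessions(raw_path: Path, out_path: Path) -> None:
    """Read raw per-session JSON lines, redact and normalize, write normalized_sessions.jsonl."""
    with raw_path.open() as src, out_path.open("w") as dst:
        for line in src:
            session = json.loads(line)
            session["events"] = [normalize_event(e) for e in session.get("events", [])]
            dst.write(json.dumps(session) + "\n")
```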

### 2. Run a cheap metric pre-screen

This layer does not need a strong model. Programmatic analysis is likely enough.

Some useful metrics might be:

**Fit**

- success rate by task slice
- `tokens_per_success` by slice
- human takeover rate by slice

**Flow**

- `time_to_first_read`
- `time_to_first_useful_edit`
- `search_steps_before_first_edit`
- `review_loop_count`

**Friction**

- repeated reads of the same file
- repeated execution of the same failing command
- similar unproductive searches
- diff rollback / rewrite count

**Feedback**

- `false_finish_rate`
- rate of “claimed done, but tests still failed”
- rate of critical issues only found in the second review round
- frequency of repeated failure patterns

At this stage the main question is simple:

**what actually got worse this week, and what was just task-mix drift?**
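
A few of these numbers are simple enough to compute straight from `normalized_sessions.jsonl`, with no model involved. A minimal sketch, reusing the field names from the run-record sketch above:

```python
import json
from collections import defaultdict
from pathlib import Path


def prescreen(path: Path) -> dict:
    """Per-slice success rate, tokens_per_success, and false_finish_rate from normalized logs."""
    stats = defaultdict(lambda: {"runs": 0, "successes": 0, "tokens": 0, "false_finishes": 0})
    for line in path.open():
        run = json.loads(line)
        s = stats[run["task_slice"]]
        s["runs"] += 1
        s["tokens"] += run["cost"].get("tokens", 0)
        if run["outcome"] == "success":
            s["successes"] += 1
        elif run["outcome"] == "false_finish":
            s["false_finishes"] += 1
    return {
        slice_name: {
            "success_rate": s["successes"] / s["runs"],
            "tokens_per_success": s["tokens"] / s["successes"] if s["successes"] else None,
            "false_finish_rate": s["false_finishes"] / s["runs"],
        }
        for slice_name, s in stats.items()
    }
```

Comparing this report week over week per slice, rather than in aggregate, is what keeps task-mix drift from masquerading as a scaffold regression.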

### 3. Use stratified sampling instead of reading all logs

This seems like the key cost-control step.

Rather than letting a reviewer consume all logs, sample only **representative cases**. One useful first pass might be six buckets:

- cheap success
- expensive success
- cheap failure
- expensive failure
- false finish
- human takeover

Then do a second layer of sampling by `task_slice`, for example:

- single-file fixes
- multi-file changes
- debugging / test repair
- resume-from-partial
- tasks with heavy reviewer / refiner involvement

A weekly sample of roughly 24–40 sessions may already be enough. Coverage across categories likely matters more than raw volume.
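
One way to implement the two-layer sampling, assuming a simple token threshold separates "cheap" from "expensive" runs; the threshold, the one-per-cell cap, and the field names are all illustrative:

```python
import random
from collections import defaultdict


def bucket(run: dict, expensive_tokens: int = 200_000) -> str:
    """Assign a run to one of the six first-pass buckets."""
    if run["outcome"] == "false_finish":
        return "false_finish"
    if run["outcome"] == "human_takeover":
        return "human_takeover"
    cost = "expensive" if run["cost"].get("tokens", 0) > expensive_tokens else "cheap"
    result = "success" if run["outcome"] == "success" else "failure"
    return f"{cost}_{result}"


def stratified_sample(runs: list, per_cell: int = 1, seed: int = 0) -> list:
    """Sample up to per_cell runs for each (bucket, task_slice) cell."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for run in runs:
        cells[(bucket(run), run["task_slice"])].append(run)
    sample = []
    for cell_runs in cells.values():
        sample.extend(rng.sample(cell_runs, min(per_cell, len(cell_runs))))
    return sample
```

With six buckets crossed with five slices, one run per populated cell already lands in the 24–40 range.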

### 4. Generate a Trace Card for each sampled session

A cheap model, ideally a local one, could first generate a short card for each session.

Each `Trace Card` could include:

- what the task was
- which scaffold stages were used
- where the run started to drift
- which actions created value
- which actions were pure waste
- whether verification was sufficient
- the most likely failure tags
- short evidence references

For example:

```yaml
session_id: xxx
task_slice: multifile-debug
outcome: false_finish
summary: >
  The agent located module A quickly, but repeated search on module B six times;
  the reviewer pointed out missing test coverage, but the refiner did not add new verification;
  the stop rule triggered too early.
failure_tags:
  - SEARCH_THRASH
  - VERIFICATION_GAP
  - EARLY_STOP
evidence:
  - turn_14: repeated grep on same path
  - turn_27: reviewer requests missing edge-case test
  - turn_31: stop without rerun of failing suite
```

This step feels especially important because the stronger reviewer should ideally read **Trace Cards + metrics + scaffold spec**, not raw long logs.
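
One possible shape for the generation step, where `generate_with_cheap_model` is an explicit placeholder for whatever small or local model is available (not a real API), and the prompt wording is only a starting point:

```python
import json

TRACE_CARD_PROMPT = """You are summarizing one agent session for a scaffold review.
From the normalized events below, produce a YAML Trace Card with:
task_slice, outcome, a short summary (3 sentences max), failure_tags, and evidence (turn references).
Use only failure_tags from this list: {taxonomy}.

Session events:
{events}
"""


def generate_with_cheap_model(prompt: str) -> str:
    # Placeholder: call whatever cheap or local model the project settles on.
    raise NotImplementedError


def make_trace_card(session: dict, taxonomy: list) -> str:
    prompt = TRACE_CARD_PROMPT.format(
        taxonomy=", ".join(taxonomy),
        events=json.dumps(session.get("events", []), indent=2)[:20_000],  # crude prompt-size cap
    )
    return generate_with_cheap_model(prompt)
```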

### 5. Session Reviewer: score one session at a time

At this layer, the reviewer only looks at a single session and does not try to draw global conclusions.

It does two things:

**A. Score the session with a rubric**

- Fit: was this scaffold too heavy or too light for the task?
- Flow: were the handoffs smooth?
- Friction: was there obvious mechanical waste?
- Feedback: were errors detected and corrected?
- Governance: were stop / escalate / review permissions placed appropriately?

**B. Apply failure taxonomy tags**

A fixed taxonomy might include labels like:

- `OVER_PLANNING`
- `UNDER_PLANNING`
- `SEARCH_THRASH`
- `CONTEXT_AMNESIA`
- `VERIFICATION_GAP`
- `REVIEW_THEATER`
- `EARLY_STOP`
- `LATE_ESCALATION`
- `TOOL_MISMATCH`
- `BAD_ROLE_BOUNDARY`

One useful guardrail here is that the Session Reviewer should not jump straight to “change the prompt.” It should stay focused on symptoms and evidence.
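
To keep that output machine-checkable, the per-session result could be as small as the sketch below; the 1–5 scale and field names are arbitrary choices, not part of the proposal:

```python
from dataclasses import dataclass, field

FAILURE_TAXONOMY = {
    "OVER_PLANNING", "UNDER_PLANNING", "SEARCH_THRASH", "CONTEXT_AMNESIA",
    "VERIFICATION_GAP", "REVIEW_THEATER", "EARLY_STOP", "LATE_ESCALATION",
    "TOOL_MISMATCH", "BAD_ROLE_BOUNDARY",
}


@dataclass
class SessionReview:
    session_id: str
    # rubric scores, e.g. 1 (poor) to 5 (good)
    fit: int
    flow: int
    friction: int
    feedback: int
    governance: int
    failure_tags: list = field(default_factory=list)
    evidence: list = field(default_factory=list)  # turn references, not prose

    def __post_init__(self) -> None:
        unknown = set(self.failure_tags) - FAILURE_TAXONOMY
        if unknown:
            raise ValueError(f"tags outside the fixed taxonomy: {unknown}")
```

Keeping the taxonomy closed is deliberate: if every week invents new tags, the cross-session aggregation in the next layer stops working.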

### 6. Scaffold Reviewer: diagnose mechanisms across sessions

This is the core layer.

Instead of looking at one session, it looks across:

- weekly metrics
- distribution by slice
- 24–40 Trace Cards
- the current scaffold spec
- the previous review report

Then it produces three kinds of output.

**1. Repeated patterns**

For example:

- small tasks still go through the full build-review-refine path and waste cycles
- the reviewer often flags insufficient verification, but the refiner cannot actually add verification
- search works reasonably well on multi-file tasks, but the stop policy is too aggressive

**2. Attribution to scaffold components**

For example:

- the issue is not necessarily weak model capability, but loss of actionable requirements in the `reviewer -> refiner` handoff
- the issue is not necessarily poor search, but over-fragmented planning that breaks context apart
- the issue is not necessarily that the reviewer adds no value, but that trivial tasks should not always invoke the reviewer

**3. No more than 3 change proposals**

Each proposal should include five fields:

- which module to change
- which failure mode it addresses
- which metric it aims to improve
- possible side effects
- how to falsify it cheaply

For example:

```yaml
proposal:
  title: "Skip reviewer for single-file fixes"
  target_module: review_trigger_policy
  expected_gain:
    - lower tokens_per_success on small-fix
    - lower latency
  risk:
    - miss subtle regression on edge cases
  falsification_test:
    - A/B on small-fix slice for 1 week
    - guardrail: false_finish_rate must not increase > 1pp
```
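
A lightweight guard could then reject reports that exceed the three-proposal cap or leave fields out; the required keys below simply mirror the YAML example:

```python
REQUIRED_FIELDS = {"title", "target_module", "expected_gain", "risk", "falsification_test"}


def check_proposals(proposals: list) -> list:
    """Return human-readable problems rather than raising, so the report can still be filed."""
    problems = []
    if len(proposals) > 3:
        problems.append(f"{len(proposals)} proposals submitted; the cap is 3")
    for i, p in enumerate(proposals):
        missing = REQUIRED_FIELDS - p.keys()
        if missing:
            problems.append(f"proposal {i} ({p.get('title', 'untitled')}) is missing: {sorted(missing)}")
    return problems
```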

### 7. Counter Reviewer: a structured dissent pass

This layer feels important because AI reviewers can otherwise sound persuasive while still drifting toward weak recommendations.

The Counter Reviewer does one job:

**try to refute the previous reviewer’s conclusion.**

In particular, it checks for confounders such as:

- task distribution changed, not scaffold quality
- model version changed, not scaffold quality
- infra / sandbox / timeout noise
- weak labels causing model-capability problems to be mistaken for scaffold problems

Only recommendations that still stand after this pass should move into the action list.
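
Some of these confounder checks can be partly mechanical. A sketch of the first one: compare this week's task mix against last week's before trusting any week-over-week metric movement (the 10% threshold is only an illustration):

```python
from collections import Counter


def task_mix_shift(last_week: list, this_week: list) -> float:
    """Total variation distance between the two weeks' task_slice distributions (0 = identical, 1 = disjoint)."""
    def dist(runs: list) -> dict:
        counts = Counter(r["task_slice"] for r in runs)
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}

    a, b = dist(last_week), dist(this_week)
    return 0.5 * sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))


# Example guardrail: if more than ~10% of the task mass moved between slices,
# treat metric movements as possibly explained by task-mix drift, not by the scaffold.
# if task_mix_shift(last_week, this_week) > 0.10: ...
```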

### 8. Human Triage: maintainers choose a small number of experiments

This layer probably does not need a large review meeting.

The goal is simple: **pick only 1–2 experiments at a time.**

The meeting output could be limited to:

- what not to change this week
- what to change this week
- how success will be judged

It may be better not to let the AI propose six changes at once, because then it becomes hard to know which one actually worked.

### 9. Run experiments and close the loop

Each accepted proposal can be turned into a small experiment with:

- the target task slice
- the affected scaffold module
- primary metrics
- guardrail metrics
- experiment duration
- whether a gradual rollout / shadow run is needed

After the experiment, write the result back into the review archive:

- was the proposal confirmed or falsified?
- did a new failure mode appear?
- does the taxonomy need to expand?

That is what would allow the reviewer workflow to gradually learn the system, instead of starting from zero every week.
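
One possible shape for those experiment records, kept deliberately small; every name and threshold here is a placeholder:

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    proposal_title: str
    task_slice: str                 # the target task slice
    scaffold_module: str            # the affected scaffold module
    primary_metric: str             # e.g. "tokens_per_success"
    guardrail_metric: str           # e.g. "false_finish_rate"
    guardrail_max_increase: float   # e.g. 0.01 for "must not increase > 1pp"
    duration_days: int
    shadow_run: bool = False


def evaluate(exp: Experiment, baseline: dict, treatment: dict) -> dict:
    """Compare control and experiment arms; the returned dict is what goes back into the review archive."""
    guardrail_delta = treatment[exp.guardrail_metric] - baseline[exp.guardrail_metric]
    return {
        "proposal": exp.proposal_title,
        "primary_delta": treatment[exp.primary_metric] - baseline[exp.primary_metric],
        "guardrail_breached": guardrail_delta > exp.guardrail_max_increase,
    }
```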

## Why this might be worth discussing

I think this workflow could potentially help `humanize` in a few ways: