# [Proposal] Consider adding a low-cost scaffold review workflow based on run logs

> Note: this write-up is based on a prior Chinese discussion and is being shared here, per the original framing, as a "GPT-5.4 Pro proposal" for discussion.

## Background

As projects like `humanize` grow into real-world agent scaffolds, more contributors will naturally add new features, roles, and workflows. A related challenge is that it becomes harder to tell whether those changes are actually improving the system.

If a recommendation cannot yet be written in this format, it may be better treated as an observation rather than an immediate action item.
## Workflow details

### 0. Prepare the inputs

Each run should keep at least these fields:

- `session_id`
- `scaffold_version`
- `model_version`
- `task_id`
- `task_slice`
- `repo / language / task_size`
- `events[]`: plan, search, read, edit, test, review, refine, stop, handoff
- `artifacts`: diff, test results, review comments
- `outcome`: success / failure / false finish / human takeover
- `cost`: tokens, latency, number of turns
- `budget`: the token / time budget given to the agent for that run

There are also two useful static inputs:

- **current scaffold spec**: planner, search, builder, reviewer, refiner, stop policy, escalation rule, memory policy
- **task taxonomy**: for example `small-fix / multifile / debug / review-only / resume / refactor`

The most important point is: **logs need to be connected to scaffold versions**. Otherwise reviewers can describe symptoms, but they cannot attribute them.
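
To make this concrete, here is a minimal sketch of one run record; all identifiers and numbers below are hypothetical and only meant to show the shape:

```yaml
# Hypothetical run record; all identifiers and numbers are illustrative.
session_id: s-0412
scaffold_version: humanize-0.7.2   # ties the log to a specific scaffold build
model_version: model-2025-01       # placeholder model identifier
task_id: issue-1234
task_slice: small-fix
repo: example/webapp
language: python
task_size: small
events:                            # ordered event stream for the run
  - {turn: 1, type: plan}
  - {turn: 2, type: search}
  - {turn: 4, type: edit}
  - {turn: 5, type: test}
  - {turn: 6, type: stop}
artifacts:
  diff: artifacts/s-0412.diff
  test_results: artifacts/s-0412-tests.txt
  review_comments: []
outcome: success                   # success / failure / false_finish / human_takeover
cost: {tokens: 18400, latency_s: 210, turns: 6}
budget: {tokens: 80000, time_s: 1800}
```
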
### 1. Clean, redact, and normalize

It probably makes sense not to feed raw logs directly into a strong model.

First do three things:

- redact sensitive data such as secrets, paths, customer data, and internal URLs
- normalize different agent / event formats into a single schema
- segment the data by session, task, and review loop

A natural output artifact could be `normalized_sessions.jsonl`.
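
As a sketch under the same assumptions, one normalized record could look like the following (rendered as YAML here for readability; in `normalized_sessions.jsonl` each record would be a single JSON line):

```yaml
# Hypothetical normalized record; schema and field names are illustrative.
session_id: s-0412
task_id: issue-1234
review_loop: 1                     # which review/refine cycle this event belongs to
turn: 5
actor: builder                     # role name normalized across agent formats
event_type: test                   # plan/search/read/edit/test/review/refine/stop/handoff
redactions: [secret, internal_url] # what was scrubbed from the raw payload
payload_summary: "ran test suite on auth module, 2 failures"
```
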
### 2. Run a cheap metric pre-screen

This layer does not need a strong model. Programmatic analysis is likely enough.

Some useful metrics might be:

**Fit**

- success rate by task slice
- `tokens_per_success` by slice
- human takeover rate by slice

**Flow**

- `time_to_first_read`
- `time_to_first_useful_edit`
- `search_steps_before_first_edit`
- `review_loop_count`

**Friction**

- repeated reads of the same file
- repeated execution of the same failing command
- similar unproductive searches
- diff rollback / rewrite count

**Feedback**

- `false_finish_rate`
- rate of “claimed done, but tests still failed”
- rate of critical issues only found in the second review round
- frequency of repeated failure patterns

At this stage the main question is simple:

**what actually got worse this week, and what was just task-mix drift?**
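
A weekly pre-screen output could stay as compact as this sketch; every number below is invented for illustration:

```yaml
# Hypothetical weekly pre-screen summary; every number is invented.
week: 2025-W07
scaffold_version: humanize-0.7.2
fit:
  small-fix: {success_rate: 0.91, tokens_per_success: 21000, takeover_rate: 0.02}
  multifile: {success_rate: 0.64, tokens_per_success: 88000, takeover_rate: 0.11}
flow:
  multifile: {time_to_first_useful_edit_s: 340, search_steps_before_first_edit: 9}
friction:
  repeated_file_reads: 41          # total across all sampled sessions this week
  repeated_failing_commands: 12
feedback:
  false_finish_rate: 0.07          # vs 0.04 last week -> candidate for sampling
flags:
  - "false_finish_rate worsened on multifile; check task-mix drift before blaming the scaffold"
```
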
### 3. Use stratified sampling instead of reading all logs

This seems like the key cost-control step.

Rather than letting a reviewer consume all logs, sample only **representative cases**. One useful first pass might be six buckets:

- cheap success
- expensive success
- cheap failure
- expensive failure
- false finish
- human takeover

Then do a second layer of sampling by `task_slice`, for example:

- single-file fixes
- multi-file changes
- debugging / test repair
- resume-from-partial
- tasks with heavy reviewer / refiner involvement

A weekly sample of roughly 24–40 sessions may already be enough. Coverage across categories likely matters more than raw volume.
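
Encoded as a small sampling config, the two layers might look like this sketch (the quota split is just one plausible choice):

```yaml
# Hypothetical weekly sampling plan: 30 sessions, coverage over volume.
weekly_sample:
  per_outcome_bucket:              # first layer: the six buckets above
    cheap_success: 4
    expensive_success: 5
    cheap_failure: 4
    expensive_failure: 5
    false_finish: 7                # oversample the most diagnostic bucket
    human_takeover: 5
  by_task_slice:                   # second layer: minimum coverage per slice
    min_per_slice: 2
    slices: [single-file, multi-file, debug, resume-from-partial, heavy-review]
```
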
### 4. Generate a Trace Card for each sampled session

A cheap model, ideally a local one, could first generate a short card for each session.

Each `Trace Card` could include:

- what the task was
- which scaffold stages were used
- where the run started to drift
- which actions created value
- which actions were pure waste
- whether verification was sufficient
- the most likely failure tags
- short evidence references

For example:

```yaml
session_id: xxx
task_slice: multifile-debug
outcome: false_finish
summary: >
  The agent located module A quickly, but repeated the same search on module B six times;
  the reviewer pointed out missing test coverage, but the refiner did not add new verification;
  the stop rule triggered too early.
failure_tags:
  - SEARCH_THRASH
  - VERIFICATION_GAP
  - EARLY_STOP
evidence:
  - turn_14: repeated grep on same path
  - turn_27: reviewer requests missing edge-case test
  - turn_31: stop without rerun of failing suite
```

This step feels especially important because the stronger reviewer should ideally read **Trace Cards + metrics + scaffold spec**, not long raw logs.
### 5. Session Reviewer: score one session at a time

At this layer, the reviewer only looks at a single session and does not try to draw global conclusions.

It does two things:

**A. Score the session with a rubric**

- Fit: was this scaffold too heavy or too light for the task?
- Flow: were the handoffs smooth?
- Friction: was there obvious mechanical waste?
- Feedback: were errors detected and corrected?
- Governance: were stop / escalate / review permissions placed appropriately?

**B. Apply failure taxonomy tags**

A fixed taxonomy might include labels like:

- `OVER_PLANNING`
- `UNDER_PLANNING`
- `SEARCH_THRASH`
- `CONTEXT_AMNESIA`
- `VERIFICATION_GAP`
- `REVIEW_THEATER`
- `EARLY_STOP`
- `LATE_ESCALATION`
- `TOOL_MISMATCH`
- `BAD_ROLE_BOUNDARY`

One useful guardrail here is that the Session Reviewer should not jump straight to “change the prompt.” It should stay focused on symptoms and evidence.
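
Combining A and B, a single session review could come back as a record like this sketch (scores, tags, and evidence invented; a 1-5 rubric scale is assumed):

```yaml
# Hypothetical Session Reviewer output; a 1-5 rubric scale is assumed.
session_id: s-0412
rubric:
  fit: 2                           # full pipeline invoked for a one-line fix
  flow: 4
  friction: 2                      # same file read five times
  feedback: 3
  governance: 4
failure_tags: [OVER_PLANNING, SEARCH_THRASH]
evidence:
  - turn_03: full plan generated for a single-line change
  - turn_09: third search with a near-identical query
# Symptoms and evidence only; no prompt-change suggestions at this layer.
```
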
### 6. Scaffold Reviewer: diagnose mechanisms across sessions

This is the core layer.

Instead of looking at one session, it looks across:

- weekly metrics
- distribution by slice
- 24–40 Trace Cards
- the current scaffold spec
- the previous review report

Then it produces three kinds of output.

**1. Repeated patterns**

For example:

- small tasks still go through the full build-review-refine path and waste cycles
- the reviewer often flags insufficient verification, but the refiner cannot actually add verification
- search works reasonably well on multi-file tasks, but the stop policy is too aggressive

**2. Attribution to scaffold components**

For example:

- the issue is not necessarily weak model capability, but loss of actionable requirements in the `reviewer -> refiner` handoff
- the issue is not necessarily poor search, but over-fragmented planning that breaks context apart
- the issue is not necessarily that the reviewer adds no value, but that trivial tasks should not always invoke the reviewer

**3. No more than 3 change proposals**

Each proposal should include five fields:

- which module to change
- which failure mode it addresses
- which metric it aims to improve
- possible side effects
- how to falsify it cheaply

For example:

```yaml
proposal:
  title: "Skip reviewer for single-file fixes"
  target_module: review_trigger_policy
  expected_gain:
    - lower tokens_per_success on small-fix
    - lower latency
  risk:
    - miss subtle regression on edge cases
  falsification_test:
    - A/B on small-fix slice for 1 week
    - guardrail: false_finish_rate must not increase > 1pp
```
### 7. Counter Reviewer: a structured dissent pass

This layer feels important because AI reviewers can otherwise sound persuasive while still drifting toward weak recommendations.

The Counter Reviewer does one job:

**try to refute the previous reviewer’s conclusion.**

In particular, it checks for confounders such as:

- task distribution changed, not scaffold quality
- model version changed, not scaffold quality
- infra / sandbox / timeout noise
- weak labels causing model-capability problems to be mistaken for scaffold problems

Only recommendations that still stand after this pass should move into the action list.
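
Its output could be one verdict per recommendation, for example (all content invented for illustration):

```yaml
# Hypothetical Counter Reviewer verdict for the step-6 example proposal.
proposal: "Skip reviewer for single-file fixes"
verdict: weakened                  # one of: upheld / weakened / refuted
confounders_checked:
  task_mix_shift: "small-fix share rose from 30% to 45% this week"
  model_version_change: none
  infra_noise: "two sessions hit sandbox timeouts and were excluded"
recommendation: "keep the proposal, but recompute metrics on a fixed task mix first"
```
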
### 8. Human Triage: maintainers choose a small number of experiments

This layer probably does not need a large review meeting.

The goal is simple: **pick only 1–2 experiments at a time.**

The meeting output could be limited to:

- what not to change this week
- what to change this week
- how success will be judged

It may be better not to let the AI propose six changes at once, because then it becomes hard to know which one actually worked.
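
The entire meeting output might fit into one short record, as in this sketch (owner and thresholds are placeholders):

```yaml
# Hypothetical weekly triage record; owner and thresholds are placeholders.
week: 2025-W07
do_not_change:
  - search_policy                  # suspicious, but evidence still confounded
change:
  - proposal: "Skip reviewer for single-file fixes"
    owner: maintainer-a
success_criteria:
  - "tokens_per_success on small-fix drops by >= 15%"
  - "false_finish_rate rises by no more than 1pp"
```
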
### 9. Run experiments and close the loop

Each accepted proposal can be turned into a small experiment with:

- the target task slice
- the affected scaffold module
- primary metrics
- guardrail metrics
- experiment duration
- whether gradual rollout / shadow run is needed
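
Concretely, the fields above might combine with the step-6 example proposal into a spec like this sketch (identifiers and arm names are hypothetical):

```yaml
# Hypothetical experiment spec derived from the step-6 example proposal.
experiment:
  id: exp-007
  task_slice: small-fix
  module: review_trigger_policy
  arms:
    control: always_invoke_reviewer
    treatment: skip_reviewer_single_file
  primary_metrics: [tokens_per_success, latency]
  guardrail_metrics: [false_finish_rate]
  duration_days: 7
  rollout: shadow_first            # shadow run before taking real traffic
```
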
After the experiment, write the result back into the review archive:

- was the proposal confirmed or falsified?
- did a new failure mode appear?
- does the taxonomy need to expand?

That is what would allow the reviewer workflow to gradually learn the system, instead of starting from zero every week.
## Why this might be worth discussing

I think this workflow could potentially help `humanize` in a few ways: