
Commit 7cbf80b

author zenus committed
Add bilingual issue drafts for scaffold review proposal
1 parent e1e00df commit 7cbf80b

2 files changed: 284 additions & 0 deletions

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
# [Proposal] Add a low-cost scaffold review workflow based on run logs

## Background

As projects like `humanize` grow into real-world agent scaffolds, more contributors will naturally add new features, roles, and workflows. The hard part is not making changes; it is evaluating whether those changes actually improve the system.

One natural idea is to build a CI-like check that runs scaffold changes against "real" development workloads. But there are two practical issues:

1. It is hard to choose workloads that are genuinely representative.
2. If the workloads are large and realistic, the token cost becomes too high for frequent evaluation.

I think there is a useful reframing here: instead of treating scaffold changes purely as "prompt/agent capability tweaks," we can evaluate them as an **organizational design** problem.

In other words, the key question is not just "is this scaffold more sophisticated?" but:

- Does it fit the actual task distribution?
- Does it improve information flow and decision flow?
- Does it reduce coordination friction such as repeated search, repeated review, and repeated trial-and-error?
- Does it help the system surface failures earlier and reuse successful patterns more reliably?

I would summarize this evaluation lens into four dimensions:

- **Fit**: does the scaffold match the real task mix?
- **Flow**: are information flow, decision flow, and handoffs working well?
- **Friction**: where are we wasting effort through loops, queues, or duplicate work?
- **Feedback**: are failures caught early, and are wins made reusable?

The benefit of this framing is that it does not require a giant "real benchmark" every time. It lets us use existing run logs as evidence to continuously diagnose whether the scaffold design is improving or regressing.
## Proposed change

I would suggest adding a **low-cost periodic scaffold review workflow** on top of the existing logging / trace system, starting with a lightweight v1.

### 1. Make runs traceable to scaffold versions

Each run should retain at least:

- `session_id`
- `scaffold_version`
- `model_version`
- `task_id`
- `task_slice`
- `budget`
- `events[]` (for example plan / search / read / edit / test / review / stop / handoff)
- `outcome` (success / failure / false finish / human takeover)
- `artifacts` (diff / test result / review comments)

The most important point is: **logs must be attributable to a specific scaffold version**. Otherwise the analysis can describe symptoms, but not attribute them to a concrete change.
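As a concrete illustration, here is a minimal sketch of what such a run record could look like. The field names beyond the list above (the nested event shape, the default values) are assumptions for illustration, not an existing `humanize` schema.

```python
from dataclasses import dataclass, field


@dataclass
class RunEvent:
    # One step taken by the agent: "plan", "search", "read", "edit",
    # "test", "review", "stop", or "handoff".
    kind: str
    # Free-form payload: file touched, command run, tokens spent, etc.
    detail: dict = field(default_factory=dict)


@dataclass
class RunRecord:
    session_id: str
    scaffold_version: str     # commit/tag of the scaffold config used for this run
    model_version: str
    task_id: str
    task_slice: str           # e.g. "small-fix", "feature", "refactor"
    budget: int               # token (or step) budget granted to the run
    events: list[RunEvent] = field(default_factory=list)
    outcome: str = "failure"  # "success" | "failure" | "false_finish" | "human_takeover"
    artifacts: dict = field(default_factory=dict)  # diff, test result, review comments
```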
### 2. Run cheap metric screening daily

Do not send full logs to a strong model by default. First run programmatic metrics over all runs, for example:

- `success@budget`
- `tokens_per_success`
- `false_finish_rate`
- `human_takeover_rate`
- `search_steps_before_first_edit`
- `review_loop_count`
- repeated reads of the same file / repeated execution of the same failing command

The goal here is not to generate recommendations yet. It is to answer: **did the scaffold actually get worse, or did the task mix change?**
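A sketch of what this daily screening could look like, continuing the `RunRecord` sketch from section 1. The token accounting inside `event.detail` and the grouping by `scaffold_version` are illustrative assumptions, not an agreed format.

```python
from collections import defaultdict


def daily_metrics(runs):
    """Cheap programmatic screening over RunRecord objects, grouped by scaffold_version."""
    by_version = defaultdict(list)
    for run in runs:
        by_version[run.scaffold_version].append(run)

    report = {}
    for version, group in by_version.items():
        successes = [r for r in group if r.outcome == "success"]
        # Aggregate token spend across the whole group, then amortize per success.
        tokens = sum(e.detail.get("tokens", 0) for r in group for e in r.events)
        report[version] = {
            "success@budget": len(successes) / len(group),
            "false_finish_rate": sum(r.outcome == "false_finish" for r in group) / len(group),
            "human_takeover_rate": sum(r.outcome == "human_takeover" for r in group) / len(group),
            "tokens_per_success": tokens / max(len(successes), 1),
        }
    return report
```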
### 3. Sample weekly instead of reviewing all raw logs

To control cost, do stratified sampling over outcomes and task types, for example 20–40 sessions covering:

- cheap successes
- expensive successes
- cheap failures
- expensive failures
- false finishes
- human takeovers

This is much cheaper and usually more stable than feeding an entire week of raw logs into a model.
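One way to implement the weekly stratified sample, again assuming the `RunRecord` sketch above. The cost threshold and per-bucket quota are arbitrary placeholders to be tuned.

```python
import random

COST_THRESHOLD = 50_000  # tokens; arbitrary cut between "cheap" and "expensive"


def bucket_of(run) -> str:
    tokens = sum(e.detail.get("tokens", 0) for e in run.events)
    if run.outcome in ("false_finish", "human_takeover"):
        return run.outcome
    cost = "cheap" if tokens <= COST_THRESHOLD else "expensive"
    return f"{cost}_{run.outcome}"  # e.g. "cheap_success", "expensive_failure"


def weekly_sample(runs, per_bucket: int = 5, seed: int = 0) -> list:
    rng = random.Random(seed)
    buckets = {}
    for run in runs:
        buckets.setdefault(bucket_of(run), []).append(run)
    sample = []
    for sessions in buckets.values():
        sample.extend(rng.sample(sessions, min(per_bucket, len(sessions))))
    return sample  # roughly 20-40 sessions with 6 buckets at 5 per bucket
```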
### 4. Generate Trace Cards before higher-level review

Use a cheap or local model to compress each sampled session into a structured `Trace Card`, keeping only:

- what the task was
- which scaffold phases were used
- where the run started to drift
- which actions added value
- which actions were pure waste
- whether verification was sufficient
- the most likely failure tag
- short evidence references

Then let a stronger model review only:

- metric summaries
- Trace Cards
- the current scaffold spec
- the previous review report

instead of full raw logs.
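A possible `Trace Card` shape, mirroring the bullet list above. The field names and value conventions are assumptions to be settled when the actual schema is defined.

```python
from dataclasses import dataclass, field


@dataclass
class TraceCard:
    session_id: str
    task_summary: str              # what the task was, in one or two sentences
    phases_used: list[str]         # which scaffold phases the run went through
    drift_point: str               # where the run started to drift, or "" if it did not
    high_value_actions: list[str]  # actions that clearly moved the task forward
    wasted_actions: list[str]      # actions that were pure waste
    verification_sufficient: bool  # was the final state actually verified
    failure_tag: str               # most likely tag from the failure taxonomy, or "none"
    evidence_refs: list[str] = field(default_factory=list)  # event ids / log offsets
```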
### 5. Constrain review output into falsifiable experiment proposals

Each weekly review should produce at most 1–3 proposed changes, and every proposal should map explicitly to:

- one failure mode
- one scaffold module
- one expected improvement metric
- one low-cost falsification test

For example:

- skip reviewer for `small-fix`
- target module: `review_trigger_policy`
- expected gain: lower `tokens_per_success` and latency
- risk: missing subtle regressions
- validation: one-week A/B test with `false_finish_rate` as guardrail

If a recommendation cannot be written in this format, it is probably still an observation rather than an actionable change.
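To make this constraint mechanical rather than a writing convention, the review output could be required to parse into a record like the following. The field names simply mirror the four required mappings above and are not an existing format.

```python
from dataclasses import dataclass


@dataclass
class ExperimentProposal:
    failure_mode: str        # e.g. "unnecessary review loops on small fixes"
    scaffold_module: str     # e.g. "review_trigger_policy"
    expected_metric: str     # e.g. "tokens_per_success"
    expected_direction: str  # "down" or "up"
    validation: str          # e.g. "one-week A/B, guardrail: false_finish_rate"


def is_actionable(p: ExperimentProposal) -> bool:
    # If any field is empty, the item is still an observation, not a proposal.
    return all([p.failure_mode, p.scaffold_module, p.expected_metric,
                p.expected_direction, p.validation])
```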
## Why this seems useful

I think this workflow would help `humanize` in four ways:

1. **It evaluates the whole scaffold, not just model capability.**
2. **It scales better as more contributors propose changes.**
3. **It controls token cost by reviewing compressed evidence instead of raw logs.**
4. **It creates a tighter learning loop by turning suggestions into small experiments.**
## A minimal first version

If this needs to start small, I would begin with just three things:

1. add `scaffold_version`, `task_slice`, `outcome`, `budget`, and `events` to the log schema;
2. add a script or workflow that generates `weekly_scaffold_review.md`;
3. define a minimal `failure taxonomy` and `Trace Card` schema.

That alone would already move the discussion from subjective impressions toward low-cost, evidence-based scaffold diagnosis.
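For item 3, a first-pass failure taxonomy could be as small as an enum like this one. The tags are examples drawn from the failure modes discussed above, not a finished list.

```python
from enum import Enum


class FailureTag(str, Enum):
    WRONG_PLAN = "wrong_plan"              # planned a change that cannot solve the task
    SEARCH_LOOP = "search_loop"            # repeated search/read without converging
    EDIT_THRASH = "edit_thrash"            # repeated edits reverting each other
    INSUFFICIENT_VERIFICATION = "insufficient_verification"  # finished without running tests
    FALSE_FINISH = "false_finish"          # claimed success that later failed review
    BUDGET_EXHAUSTED = "budget_exhausted"  # ran out of tokens/steps before completion
```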
If this direction sounds useful, I would be happy to help sketch a more concrete v1, such as:

- a `Trace Card` schema
- a first-pass `failure taxonomy`
- a `weekly_scaffold_review.md` template
- a constrained reviewer prompt structure
Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
# [Proposal] Add a low-cost scaffold review workflow based on run logs

## Background

As real-project agent scaffolds like `humanize` keep growing, more and more contributors will try to add new features, role splits, or workflows. The problem is that whether these changes actually improve the system is often hard to evaluate.

A natural direction is to put such changes into a CI-like evaluation pipeline and run comparisons on some "real development scenarios". But there are two practical problems here:

1. It is hard to pick genuinely representative "real workloads";
2. If larger real tasks are run directly, the token cost becomes too high to execute frequently.

My recent thought is that scaffold changes should perhaps not be seen only as "prompt/agent capability optimization", but as a kind of **organizational design** problem.

In other words, what we really want to know is not "does this scaffold look more sophisticated", but:

- does it better match the current task distribution;
- does it make information flow and decision flow smoother;
- does it reduce coordination friction such as repeated search, repeated review, and repeated trial-and-error;
- does it expose failures earlier and make successful experience easier to reuse.

I think this evaluation framework can be compressed into four dimensions:

- **Fit**: does the scaffold match the real task distribution;
- **Flow**: are information flow / decision flow / handoffs smooth;
- **Friction**: where is the system idling, queuing, or duplicating work;
- **Feedback**: are failures detected in time, and does experience get captured for reuse.

The advantage of this perspective is that it does not require a huge "real benchmark" every time we judge the scaffold; instead, it lets us extract evidence from existing run logs and continuously diagnose whether the system design is sound.
## Proposed change

I suggest adding a **low-cost periodic scaffold review workflow** on top of the existing `humanize` logging / trace infrastructure, starting with a very lightweight v1:

### 1. First, link run logs to scaffold versions

Each run should keep at least these fields:

- `session_id`
- `scaffold_version`
- `model_version`
- `task_id`
- `task_slice`
- `budget`
- `events[]` (e.g. plan / search / read / edit / test / review / stop / handoff)
- `outcome` (success / failure / false finish / human takeover)
- `artifacts` (diff / test result / review comments)

The most critical point: **logs must map to a specific scaffold version**. Otherwise, later analysis can only complain about symptoms and cannot attribute them to a concrete change.
### 2. Run only cheap metric pre-screening daily

Do not feed the full logs directly to a large model; first run programmatic statistics, for example:

- `success@budget`
- `tokens_per_success`
- `false_finish_rate`
- `human_takeover_rate`
- `search_steps_before_first_edit`
- `review_loop_count`
- counts of repeatedly reading the same file / repeatedly running the same failing command

The goal of this step is not to produce suggestions, but to locate the issue first: **did the scaffold actually get worse, or did the task distribution change?**
### 3. Sample weekly instead of reading all raw logs

To control token consumption, do stratified sampling by outcome and task type, for example sampling 20–40 sessions that cover:

- cheap successes
- expensive successes
- cheap failures
- expensive failures
- false finishes
- human takeovers

This is cheaper than stuffing a whole week of logs into the model, and usually more stable as well.
### 4. Generate Trace Cards first, then do the higher-level review

First use a cheap or local model to compress each session into a structured `Trace Card`, keeping only:

- what the task was
- which scaffold phases the run went through
- at which step it started to drift
- which actions were high value
- which actions were pure waste
- whether verification was sufficient
- the most likely failure tag
- references to evidence snippets

Then let the strong model look only at:

- the metric summary
- the Trace Cards
- the current scaffold spec
- the previous round's review result

rather than reading the long raw logs directly.
### 5. Constrain review output to "falsifiable experiment proposals"

Each week should produce at most 1–3 change proposals, and every proposal must explicitly map to:

- one failure mode
- one scaffold module
- one expected improvement metric
- one low-cost falsification experiment

For example:

- skip the reviewer for `small-fix` tasks
- target module: `review_trigger_policy`
- expected gain: lower `tokens_per_success` and latency
- risk: missing edge-case regressions
- validation: a one-week A/B test, with no significant increase in `false_finish_rate` as the guardrail

If a proposal cannot be written in this format, it is closer to an "observation" and should not yet enter the action list.
## Why this is worth doing

I think the value of this workflow for `humanize` lies in:

1. **Closer to real system evolution**: it measures the organizational effect of the whole scaffold, not just model capability;
2. **More scalable**: as contributors multiply, feature changes can be evaluated more systematically for real gains;
3. **More token-efficient**: the full logs only go through programmatic statistics, and the strong model only sees the compressed evidence pack;
4. **Easier to close the loop**: only a small number of experiments are taken on each week, gradually verifying which scaffold designs actually work.
## A minimal version that can land

To start from a very small change, I suggest landing three things first:

1. add `scaffold_version`, `task_slice`, `outcome`, `budget`, and `events` to the logs;
2. add a script or workflow that generates `weekly_scaffold_review.md` every week;
3. agree on one minimal version of the `failure taxonomy` and `Trace Card` schema.

This alone would already move "is this scaffold change worth it" one step from subjective discussion toward low-cost, evidence-based organizational diagnosis.
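As an illustration of item 2, here is a minimal sketch of the script that could assemble `weekly_scaffold_review.md`. The section layout and the input shapes (metric dicts, trace-card dicts, proposal dicts) are assumptions, not an agreed template.

```python
from pathlib import Path


def write_weekly_review(metrics: dict, trace_cards: list, proposals: list,
                        out_path: str = "weekly_scaffold_review.md") -> None:
    lines = ["# Weekly scaffold review", "", "## Metric summary"]
    for version, values in metrics.items():
        # One line per scaffold version with its screened metrics.
        lines.append(f"- `{version}`: " + ", ".join(f"{k}={v:.3f}" for k, v in values.items()))

    lines += ["", "## Sampled trace cards"]
    for card in trace_cards:
        lines.append(f"- {card['session_id']}: {card.get('failure_tag', 'none')} - {card['task_summary']}")

    lines += ["", "## Proposed experiments (max 3)"]
    for p in proposals[:3]:
        lines.append(f"- {p['failure_mode']} -> {p['scaffold_module']} "
                     f"(metric: {p['expected_metric']}, validation: {p['validation']})")

    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```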
If the maintainers find this direction valuable, I would be happy to follow up with a more concrete v1 draft, for example:

- a `Trace Card` schema
- a first version of the `failure taxonomy`
- a `weekly_scaffold_review.md` template
- structural constraints for the reviewer prompt
