
Commit 822f696

Author: zenus (committed)
Soften tone of bilingual issue drafts
1 parent 7cbf80b commit 822f696

2 files changed

Lines changed: 48 additions & 48 deletions

File tree

.humanize/drafts/humanize-org-issue-en.md

Lines changed: 25 additions & 25 deletions
@@ -1,35 +1,35 @@
-# [Proposal] Add a low-cost scaffold review workflow based on run logs
+# [Proposal] Consider adding a low-cost scaffold review workflow based on run logs

 ## Background

-As projects like `humanize` grow into real-world agent scaffolds, more contributors will naturally add new features, roles, and workflows. The hard part is not making changes — it is evaluating whether those changes actually improve the system.
+As projects like `humanize` grow into real-world agent scaffolds, more contributors will naturally add new features, roles, and workflows. A related challenge is that it becomes harder to tell whether those changes are actually improving the system.

-One natural idea is to build a CI-like check that runs scaffold changes against “real” development workloads. But there are two practical issues:
+One natural idea is to add a CI-like check that compares scaffold changes on “real” development workloads. But there are two practical issues:

 1. It is hard to choose workloads that are genuinely representative.
-2. If the workloads are large and realistic, the token cost becomes too high for frequent evaluation.
+2. If the workloads are large and realistic, the token cost can become too high for frequent evaluation.

-I think there is a useful reframing here: instead of treating scaffold changes purely as “prompt/agent capability tweaks,” we can evaluate them as an **organizational design** problem.
+I have been wondering whether scaffold changes could be framed not only as “prompt/agent capability tweaks,” but also as an **organizational design** problem.

-In other words, the key question is not just “is this scaffold more sophisticated?” but:
+In other words, the question may not just be “is this scaffold more sophisticated?”, but also:

 - Does it fit the actual task distribution?
 - Does it improve information flow and decision flow?
 - Does it reduce coordination friction such as repeated search, repeated review, and repeated trial-and-error?
 - Does it help the system surface failures earlier and reuse successful patterns more reliably?

-I would summarize this evaluation lens into four dimensions:
+If I compress that evaluation lens a bit, it seems to fall into four dimensions:

 - **Fit**: does the scaffold match the real task mix?
 - **Flow**: are information flow, decision flow, and handoffs working well?
 - **Friction**: where are we wasting effort through loops, queues, or duplicate work?
 - **Feedback**: are failures caught early, and are wins made reusable?

-The benefit of this framing is that it does not require a giant “real benchmark” every time. It lets us use existing run logs as evidence to continuously diagnose whether the scaffold design is improving or regressing.
+One benefit of this framing is that it does not require a giant “real benchmark” every time. It allows us to use existing run logs as evidence and continuously observe whether the scaffold design is moving in a good direction.

-## Proposed change
+## A possible direction

-I would suggest adding a **low-cost periodic scaffold review workflow** on top of the existing logging / trace system, starting with a lightweight v1.
+If this seems useful, I would like to suggest adding a **low-cost periodic scaffold review workflow** on top of the existing logging / trace system, starting with a lightweight v1.

 ### 1. Make runs traceable to scaffold versions

@@ -45,11 +45,11 @@ Each run should retain at least:
 - `outcome` (success / failure / false finish / human takeover)
 - `artifacts` (diff / test result / review comments)

-The most important point is: **logs must be attributable to a specific scaffold version**. Otherwise the analysis can describe symptoms, but not attribute them to a concrete change.
+The most important point, in my view, is that **logs should ideally be attributable to a specific scaffold version**. Otherwise the analysis may describe symptoms, but it becomes much harder to attribute them to a concrete change.
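
To make the attribution point concrete, here is a minimal sketch of what an attributable run record could look like; every field name below is an illustrative assumption, not an existing `humanize` schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

# Hypothetical run-log record; field names are illustrative, not an existing schema.
@dataclass
class RunRecord:
    run_id: str
    scaffold_version: str       # e.g. the git SHA or tag of the scaffold configuration
    task_slice: str             # e.g. "bugfix", "refactor", "greenfield"
    outcome: str                # "success" | "failure" | "false_finish" | "human_takeover"
    budget: dict[str, int]      # allocated budget, e.g. {"tokens": 200_000}
    tokens_used: int = 0        # actual spend, handy for the cheap metrics below
    events: list[dict[str, Any]] = field(default_factory=list)  # tool calls, reviews, retries
    artifacts: list[str] = field(default_factory=list)          # paths to diffs, test results

example = RunRecord(
    run_id="run-0001",
    scaffold_version="abc1234",   # placeholder for whatever identifies the scaffold version
    task_slice="bugfix",
    outcome="false_finish",
    budget={"tokens": 200_000},
    tokens_used=120_000,
)
```

With `scaffold_version` present on every record, the later screening and sampling steps can always group results by the change that produced them.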

 ### 2. Run cheap metric screening daily

-Do not send full logs to a strong model by default. First run programmatic metrics over all runs, for example:
+Instead of sending full logs to a strong model by default, first run programmatic metrics over all runs, for example:

 - `success@budget`
 - `tokens_per_success`
@@ -59,11 +59,11 @@ Do not send full logs to a strong model by default. First run programmatic metrics over all runs, for example:
 - `review_loop_count`
 - repeated reads of the same file / repeated execution of the same failing command

-The goal here is not to generate recommendations yet. It is to answer: **did the scaffold actually get worse, or did the task mix change?**
+The goal here is not necessarily to generate recommendations immediately. It is first to help answer: **did the scaffold actually get worse, or did the task mix change?**
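
As a rough sketch of what this screening could compute, assuming run records shaped roughly like the hypothetical record above and serialized as dicts (the metric definitions themselves are assumptions, not an agreed spec):

```python
from collections import Counter

# Cheap daily screening over run-log dicts; metric definitions are illustrative assumptions.
def success_at_budget(runs: list[dict], token_budget: int) -> float:
    """Fraction of all runs that succeeded while staying within the token budget."""
    if not runs:
        return 0.0
    good = [r for r in runs
            if r.get("outcome") == "success" and r.get("tokens_used", 0) <= token_budget]
    return len(good) / len(runs)

def tokens_per_success(runs: list[dict]) -> float:
    """Total tokens spent across all runs divided by the number of successes."""
    total = sum(r.get("tokens_used", 0) for r in runs)
    wins = sum(1 for r in runs if r.get("outcome") == "success")
    return total / wins if wins else float("inf")

def review_loop_count(run: dict) -> int:
    """How many times a review event sent the work back for another attempt."""
    return sum(1 for e in run.get("events", []) if e.get("type") == "review_rejected")

def max_repeated_file_reads(run: dict) -> int:
    """Largest number of times any single file was read within one run."""
    reads = Counter(e.get("path") for e in run.get("events", []) if e.get("type") == "file_read")
    return max(reads.values(), default=0)
```

Because these are plain aggregations, they can run nightly over every log without involving a model at all.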

 ### 3. Sample weekly instead of reviewing all raw logs

-To control cost, do stratified sampling over outcomes and task types — for example 20–40 sessions covering:
+To control cost, it may be enough to do stratified sampling over outcomes and task types — for example 20–40 sessions covering:

 - cheap successes
 - expensive successes
@@ -72,11 +72,11 @@ To control cost, do stratified sampling over outcomes and task types — for example 20–40 sessions covering:
 - false finishes
 - human takeovers

-This is much cheaper and usually more stable than feeding an entire week of raw logs into a model.
+This is usually much cheaper than feeding an entire week of raw logs into a model, and it may also lead to a more stable review rhythm.
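
One possible (assumed) way to draw that sample, keeping per-stratum sizes small enough to stay in the 20–40 session range:

```python
import random
from collections import defaultdict

# Stratified weekly sample over outcome buckets; strata and sizes are illustrative.
def weekly_sample(runs: list[dict], per_stratum: int = 6, seed: int = 0) -> list[dict]:
    buckets: dict[str, list[dict]] = defaultdict(list)
    for r in runs:
        if r.get("outcome") == "success":
            # Split successes by cost so cheap and expensive wins are both represented.
            key = "cheap_success" if r.get("tokens_used", 0) <= 50_000 else "expensive_success"
        else:
            key = r.get("outcome", "unknown")  # failure / false_finish / human_takeover
        buckets[key].append(r)

    rng = random.Random(seed)
    sample: list[dict] = []
    for bucket in buckets.values():
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample  # roughly 30 sessions when all five strata are populated at 6 each
```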

 ### 4. Generate Trace Cards before higher-level review

-Use a cheap or local model to compress each sampled session into a structured `Trace Card`, keeping only:
+A cheap or local model could first compress each sampled session into a structured `Trace Card`, keeping only:

 - what the task was
 - which scaffold phases were used
@@ -87,7 +87,7 @@ Use a cheap or local model to compress each sampled session into a structured `Trace Card`, keeping only:
 - the most likely failure tag
 - short evidence references
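
A minimal sketch of what such a `Trace Card` could look like as a schema; the exact field set is an open question, and any fields beyond the list above are assumptions:

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical Trace Card produced by a cheap model from one sampled session.
@dataclass
class TraceCard:
    run_id: str
    task_summary: str                   # what the task was, in one or two sentences
    scaffold_phases: list[str]          # which scaffold phases were used, in order
    outcome: str                        # success / failure / false_finish / human_takeover
    budget_spent: dict[str, int]        # e.g. {"tokens": 80_000, "review_loops": 3}
    failure_tag: str | None = None      # the most likely failure tag from the taxonomy
    evidence: list[str] = field(default_factory=list)  # short pointers into the raw log
```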

-Then let a stronger model review only:
+Then a stronger model would review only:

 - metric summaries
 - Trace Cards
@@ -96,9 +96,9 @@ Then let a stronger model review only:

 instead of full raw logs.

-### 5. Constrain review output into falsifiable experiment proposals
+### 5. Keep review output close to falsifiable experiment proposals

-Each weekly review should produce at most 1–3 proposed changes, and every proposal should map explicitly to:
+If this workflow were adopted, I think it could be helpful for each weekly review to produce at most 1–3 proposed changes, and for each proposal to map as explicitly as possible to:

 - one failure mode
 - one scaffold module
@@ -113,11 +113,11 @@ For example:
 - risk: missing subtle regressions
 - validation: one-week A/B test with `false_finish_rate` as guardrail
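
If it helps, each proposal could even be kept as a small structured record so the mapping stays explicit; the keys below mirror the parts of the format visible in this draft and are otherwise assumptions:

```python
# Illustrative experiment-proposal record; keys mirror the mapping described above.
proposal = {
    "failure_mode": "false finishes on multi-file changes",
    "scaffold_module": "review phase",
    "metric": "false_finish_rate",
    "risk": "missing subtle regressions",
    "validation": "one-week A/B test with false_finish_rate as guardrail",
}
```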

-If a recommendation cannot be written in this format, it is probably still an observation rather than an actionable change.
+If a recommendation cannot yet be written in this format, it may be better treated as an observation rather than an immediate action item.

-## Why this seems useful
+## Why this might be worth discussing

-I think this workflow would help `humanize` in four ways:
+I think this workflow could potentially help `humanize` in a few ways:

 1. **It evaluates the whole scaffold, not just model capability.**
 2. **It scales better as more contributors propose changes.**
@@ -126,15 +126,15 @@ I think this workflow would help `humanize` in four ways:

 ## A minimal first version

-If this should start small, I would begin with just three things:
+If this should start small, I would suggest beginning with just three things:

 1. add `scaffold_version`, `task_slice`, `outcome`, `budget`, and `events` to the log schema;
 2. add a script or workflow that generates `weekly_scaffold_review.md`;
 3. define a minimal `failure taxonomy` and `Trace Card` schema.
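
For item 2 above, the script could start as something this small; the file layout, field names, and report sections are all assumptions:

```python
import json
from pathlib import Path

# Minimal weekly-review skeleton: load run logs, compute a couple of counts,
# and write a markdown report. Paths and report layout are illustrative.
def load_runs(log_dir: Path) -> list[dict]:
    return [json.loads(p.read_text()) for p in sorted(log_dir.glob("*.json"))]

def write_weekly_review(log_dir: Path, out_path: Path) -> None:
    runs = load_runs(log_dir)
    successes = [r for r in runs if r.get("outcome") == "success"]
    versions = sorted({r.get("scaffold_version", "unknown") for r in runs})
    lines = [
        "# Weekly scaffold review",
        "",
        f"- runs this week: {len(runs)}",
        f"- successes: {len(successes)}",
        f"- scaffold versions seen: {', '.join(versions)}",
        "",
        "## Sampled sessions / Trace Cards",
        "",
        "(filled in by the sampling and Trace Card steps)",
    ]
    out_path.write_text("\n".join(lines) + "\n")

if __name__ == "__main__":
    write_weekly_review(Path("logs/runs"), Path("weekly_scaffold_review.md"))
```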

-That alone would already move the discussion from subjective impressions toward low-cost, evidence-based scaffold diagnosis.
+Even that alone could already move the discussion from subjective impressions toward low-cost, evidence-based scaffold diagnosis.

-If this direction sounds useful, I would be happy to help sketch a more concrete v1, such as:
+If the maintainers think this direction is worthwhile, I would also be happy to help sketch a more concrete v1, such as:

 - a `Trace Card` schema
 - a first-pass `failure taxonomy`
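
On the last bullet, a first-pass `failure taxonomy` could start as nothing more than a short tagged list; the tags and descriptions below are purely illustrative:

```python
# Purely illustrative first-pass failure taxonomy; tags and descriptions are assumptions.
FAILURE_TAXONOMY: dict[str, str] = {
    "wrong_plan": "the scaffold committed to an approach that could not satisfy the task",
    "lost_context": "relevant files or earlier decisions were not carried into later phases",
    "false_finish": "the run was declared done while tests or requirements still failed",
    "review_loop": "work bounced between implementation and review without converging",
    "budget_exhausted": "the token or time budget ran out before a usable result",
    "human_takeover": "a human had to step in to finish or redirect the task",
}
```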

.humanize/drafts/humanize-org-issue-zh.md

Lines changed: 23 additions & 23 deletions
@@ -1,35 +1,35 @@
-# [Proposal] Add a low-cost scaffold review workflow based on run logs
+# [Proposal] Consider adding a low-cost scaffold review workflow based on run logs

 ## Background

-As real-project agent scaffolds like `humanize` gradually grow, more and more contributors will try to add new features, role splits, or workflows. The problem is that it is often hard to evaluate whether these changes genuinely improve the system.
+As real-project agent scaffolds like `humanize` gradually grow, more and more contributors will try to add new features, role splits, or workflows. A problem that follows is that it is often not easy to judge whether these changes have really improved the system.

-A very natural direction is to put such changes into a CI-like evaluation pipeline and run comparisons on some “real development scenarios”. But there are two practical problems here:
+A very natural direction is to put such changes into a CI-like evaluation pipeline and compare them on some “real development scenarios”. However, there are two practical problems here:

 1. It is hard to pick genuinely representative “real workloads”;
-2. If we directly run fairly large real tasks, the token cost is very high, making frequent runs difficult.
+2. If we directly run fairly large real tasks, the token cost is fairly high and not well suited to frequent runs.

-I recently had an idea: perhaps we should not treat scaffold changes only as “prompt/agent capability optimization”, but should treat them as an **organizational design** problem.
+Recently I have been thinking that perhaps scaffold changes do not have to be seen only as “prompt/agent capability optimization”; they could also be evaluated as an **organizational design** problem.

-In other words, what we really want to know is not “does this scaffold look more complex”, but:
+In other words, what we want to know is perhaps not only “does this scaffold look more complex”, but:

 - whether it better matches the current task distribution;
 - whether it makes information flow and decision flow smoother;
 - whether it reduces coordination friction such as repeated search, repeated review, and repeated trial and error;
 - whether it surfaces failures earlier and makes successful experience easier to reuse.

-I think this evaluation framework can be compressed into four dimensions:
+If we compress this evaluation lens a bit further, I think it lands on four dimensions:

 - **Fit**: whether the scaffold matches the real task distribution;
 - **Flow**: whether information flow / decision flow / handoffs are smooth;
 - **Friction**: where things are idling, queuing, or duplicating work;
 - **Feedback**: whether failures are caught in time and whether experience gets consolidated.

-The benefit of this perspective is that it does not require a huge “real benchmark” every time to judge the scaffold; it lets us extract evidence from existing run logs and continuously diagnose whether the system design is sound.
+One benefit of this perspective is that it does not require a huge “real benchmark” every time to judge the scaffold; it lets us extract evidence from existing run logs and continuously observe whether the system design is sound.

-## Proposed change
+## A possible direction for change

-I suggest adding a **low-cost periodic scaffold review workflow** on top of `humanize`'s existing logging/trace system, prioritizing a very lightweight v1.
+If this direction has any reference value, I would like to suggest trying to add a **low-cost periodic scaffold review workflow** on top of `humanize`'s existing logging/trace system, starting from a lightweight v1.

 ### 1. First link run logs to scaffold versions

@@ -45,7 +45,7 @@
 - `outcome` (success / failure / false finish / human takeover)
 - `artifacts` (diff / test result / review comments)

-The most critical point is that **logs must map to a specific scaffold version**; otherwise later analysis can only complain about symptoms and cannot attribute them to a concrete change.
+What I think matters most here is that **logs should ideally map to a specific scaffold version**; otherwise later analysis may only describe symptoms, and it becomes rather hard to attribute them to a concrete change.

 ### 2. Only run cheap metric screening each day

@@ -59,7 +59,7 @@
 - `review_loop_count`
 - how many times the same file was re-read / the same failing command was re-run

-The purpose of this step is not to give recommendations, but first to pin down **whether the scaffold actually got worse or the task distribution changed**.
+The purpose of this step is not necessarily to give recommendations right away, but first to help answer **whether the scaffold actually got worse, or just the task distribution changed**.

 ### 3. Only sample each week, instead of reading all raw logs

@@ -72,11 +72,11 @@
 - false finish
 - human takeover

-This is cheaper than “stuffing a whole week of logs into the model”, and usually more stable.
+This is usually cheaper than “stuffing a whole week of logs into the model”, and it also makes it easier to keep a steady review rhythm.

 ### 4. Generate Trace Cards first, then do the higher-level review

-First use a cheap or local model to compress each session into a structured `Trace Card`, keeping only:
+We could first use a cheap or local model to compress each session into a structured `Trace Card`, keeping only:

 - what the task was
 - which scaffold phases it went through
@@ -96,9 +96,9 @@

 instead of reading the long raw logs directly.

-### 5. Review output must be constrained into “falsifiable experiment proposals”
+### 5. Constrain review output, as much as possible, into “falsifiable experiment proposals”

-I suggest producing at most 1–3 proposed changes per week, and every proposal must map explicitly to:
+If this workflow is actually adopted later, I would lean toward each weekly review producing at most 1–3 proposed changes, with each proposal mapping as explicitly as possible to:

 - one failure mode
 - one scaffold module
@@ -113,28 +113,28 @@
 - risk: missing edge-case regressions
 - validation: a one-week A/B test, with the guardrail that `false_finish_rate` does not rise significantly

-If a recommendation cannot be written in this format, it is more of an “observation” and should not yet go onto the action list.
+If a recommendation cannot yet be written in this format, it is perhaps better kept as an “observation” rather than going onto the action list right away.

-## Why this is worth doing
+## Why I think this is worth discussing

-I think the value of this workflow for `humanize` lies in:
+I think this workflow may have a few potential benefits for `humanize`:

 1. **Closer to how the real system evolves**: it measures the organizational effect of the whole scaffold, not just model capability;
 2. **More scalable**: as contributors grow in number, feature changes can be evaluated more systematically for whether they really bring gains;
 3. **More token-efficient**: the full logs only go through programmatic statistics, and the strong model only sees the compressed evidence pack;
-4. **Easier to close the loop**: only a small number of experiments are claimed each week, gradually verifying which scaffold designs are really effective.
+4. **Easier to close the loop**: only a small number of experiments are claimed each week, gradually verifying which scaffold designs are genuinely effective.

 ## A minimal version that can land

-If we want to start from a very small change, I suggest landing three things first:
+If we want to start from a very small change, I would suggest filling in three things first:

 1. fill in `scaffold_version`, `task_slice`, `outcome`, `budget`, and `events` in the logs;
 2. add a script or workflow that generates `weekly_scaffold_review.md` every week;
 3. agree on one minimal version of the `failure taxonomy` and `Trace Card` schema.

-That would already push “whether the scaffold is worth changing” one step from subjective discussion toward “low-cost, evidence-based organizational diagnosis”.
+That might already push “whether the scaffold is worth changing” one step from subjective discussion toward “low-cost, evidence-based organizational diagnosis”.

-If the maintainers think this direction is valuable, I am happy to follow up with a more concrete v1 draft, for example:
+If the maintainers think this direction is meaningful, I would also be happy to follow up with a more concrete v1 draft, for example:

 - a `Trace Card` schema
 - a first version of the `failure taxonomy`
