Draft: `.humanize/drafts/humanize-org-issue-en.md`

# [Proposal] Consider adding a low-cost scaffold review workflow based on run logs

## Background

As projects like `humanize` grow into real-world agent scaffolds, more contributors will naturally add new features, roles, and workflows. A related challenge is that it becomes harder to tell whether those changes are actually improving the system.

One natural idea is to add a CI-like check that compares scaffold changes against “real” development workloads. But there are two practical issues:

1. It is hard to choose workloads that are genuinely representative.
2. If the workloads are large and realistic, the token cost can become too high for frequent evaluation.

I have been wondering whether scaffold changes could be framed not only as “prompt/agent capability tweaks”, but also as an **organizational design** problem.

In other words, the question may not just be “is this scaffold more sophisticated?”, but also:

- Does it fit the actual task distribution?
- Does it improve information flow and decision flow?
- Does it reduce coordination friction such as repeated search, repeated review, and repeated trial-and-error?
- Does it help the system surface failures earlier and reuse successful patterns more reliably?

If I compress that evaluation lens a bit, it seems to fall into four dimensions:

- **Fit**: does the scaffold match the real task mix?
- **Flow**: are information flow, decision flow, and handoffs working well?
- **Friction**: where are we wasting effort through loops, queues, or duplicate work?
- **Feedback**: are failures caught early, and are wins made reusable?

One benefit of this framing is that it does not require a giant “real benchmark” every time. It allows us to use existing run logs as evidence and continuously observe whether the scaffold design is moving in a good direction.

## A possible direction

If this seems useful, I would like to suggest adding a **low-cost periodic scaffold review workflow** on top of the existing logging / trace system, starting with a lightweight v1.

### 1. Make runs traceable to scaffold versions

Each run should retain at least:

- `outcome` (success / failure / false finish / human takeover)
- `artifacts` (diff / test result / review comments)

The most important point, in my view, is that **logs should ideally be attributable to a specific scaffold version**. Otherwise the analysis may describe symptoms, but it becomes much harder to attribute them to a concrete change.

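To make the attribution requirement concrete, here is a minimal sketch of what a per-run record could look like. The field names `scaffold_version`, `task_slice`, `outcome`, `budget`, `events`, and `artifacts` come from this proposal; everything else (the `run_id` field, the concrete types, JSONL as the storage format) is only an illustrative assumption, not a claim about the current `humanize` log schema.

```python
# Illustrative sketch only: one run = one JSON line in a runs.jsonl file.
# Field names follow the proposal; run_id and the exact types are assumptions.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class RunRecord:
    run_id: str                  # hypothetical unique id for one session
    scaffold_version: str        # e.g. git SHA or tag of the scaffold configuration
    task_slice: str              # e.g. "bugfix", "refactor", "new-feature"
    outcome: str                 # "success" | "failure" | "false_finish" | "human_takeover"
    budget: dict[str, int]       # e.g. {"tokens": 120_000, "wall_clock_s": 900}
    events: list[dict[str, Any]] = field(default_factory=list)  # tool calls, reviews, retries
    artifacts: dict[str, str] = field(default_factory=dict)     # diff, test output, review comments
```
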
### 2. Run cheap metric screening daily

Instead of sending full logs to a strong model by default, first run programmatic metrics over all runs, for example:

- `success@budget`
- `tokens_per_success`
- `review_loop_count`
- repeated reads of the same file / repeated execution of the same failing command

The goal here is not necessarily to generate recommendations immediately. It is first to help answer: **did the scaffold actually get worse, or did the task mix change?**

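As a rough illustration of how cheap this pass could be, the sketch below computes a few of the metrics above directly from JSONL run records shaped like the example in section 1. The metric definitions (how `success@budget` is counted, what counts as a repeated failing command, the event type names) are my assumptions, not established conventions in `humanize`.

```python
# Cheap daily screening pass: no model calls, just counting over run records.
import json
from collections import Counter, defaultdict

def screen(log_path: str, token_budget: int = 150_000) -> dict:
    with open(log_path, encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]

    by_version: dict[str, list[dict]] = defaultdict(list)
    for r in runs:
        by_version[r["scaffold_version"]].append(r)

    report = {}
    for version, vruns in by_version.items():
        successes = [r for r in vruns if r["outcome"] == "success"]
        report[version] = {
            "runs": len(vruns),
            # success@budget: share of runs that succeeded within the token budget
            "success@budget": sum(
                1 for r in successes if r["budget"]["tokens"] <= token_budget
            ) / len(vruns),
            # tokens_per_success: total tokens spent divided by number of successes
            "tokens_per_success": (
                sum(r["budget"]["tokens"] for r in vruns) / len(successes)
                if successes else float("inf")
            ),
            # review loops per run, assuming a "review" event type exists
            "avg_review_loops": sum(
                sum(1 for e in r["events"] if e.get("type") == "review") for r in vruns
            ) / len(vruns),
            # repeated execution of the same failing command
            "repeated_failing_commands": sum(
                count - 1
                for r in vruns
                for count in Counter(
                    e.get("command", "")
                    for e in r["events"]
                    if e.get("type") == "exec" and e.get("exit_code", 0) != 0
                ).values()
                if count > 1
            ),
            # task mix, to help separate "scaffold regressed" from "tasks got harder"
            "task_mix": dict(Counter(r["task_slice"] for r in vruns)),
        }
    return report
```

Because the report is grouped by `scaffold_version` and includes the task mix, it can already help separate “the scaffold regressed” from “the task mix changed”.
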
### 3. Sample weekly instead of reviewing all raw logs

To control cost, it may be enough to do stratified sampling over outcomes and task types, for example 20–40 sessions covering:

- cheap successes
- expensive successes
- false finishes
- human takeovers

This is usually much cheaper than feeding an entire week of raw logs into a model, and it may also lead to a more stable review rhythm.

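The sampling step could stay very small. The sketch below assumes the strata listed above; the token threshold used to split cheap from expensive successes, and the per-stratum count, are arbitrary illustrative values.

```python
# Weekly stratified sample over outcomes; strata mirror the list above.
import random

def weekly_sample(runs: list[dict], per_stratum: int = 6, expensive_tokens: int = 80_000) -> list[dict]:
    def stratum(run: dict) -> str:
        if run["outcome"] == "success":
            return "expensive_success" if run["budget"]["tokens"] > expensive_tokens else "cheap_success"
        return run["outcome"]  # "failure", "false_finish", "human_takeover", ...

    buckets: dict[str, list[dict]] = {}
    for run in runs:
        buckets.setdefault(stratum(run), []).append(run)

    sample: list[dict] = []
    for _, bucket in sorted(buckets.items()):
        sample.extend(random.sample(bucket, min(per_stratum, len(bucket))))
    return sample  # with ~5 strata this lands in the 20-40 session range
```
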
### 4. Generate Trace Cards before higher-level review

A cheap or local model could first compress each sampled session into a structured `Trace Card`, keeping only:

- what the task was
- which scaffold phases were used
- the most likely failure tag
- short evidence references

Then a stronger model would review only:

- metric summaries
- Trace Cards

instead of full raw logs.

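For discussion, here is one possible `Trace Card` shape. The fields mirror the bullets above; the `summarize` argument is a stand-in for whatever cheap or local model call the project would actually use, and its output shape is assumed, so this is a sketch of the data flow rather than a concrete implementation.

```python
# One possible Trace Card shape; fields mirror the bullets above.
from dataclasses import dataclass

@dataclass
class TraceCard:
    run_id: str
    task: str                    # what the task was
    scaffold_phases: list[str]   # which scaffold phases were used
    failure_tag: str             # most likely tag from the failure taxonomy
    evidence_refs: list[str]     # short pointers into the raw log, not the log itself

def build_trace_card(run: dict, summarize) -> TraceCard:
    """`summarize` stands in for the cheap/local model call; its output shape is assumed."""
    summary = summarize(run["events"])  # assumed to return phases, a failure tag, and evidence
    return TraceCard(
        run_id=run["run_id"],
        task=run["task_slice"],
        scaffold_phases=summary.get("phases", []),
        failure_tag=summary.get("failure_tag", "none"),
        evidence_refs=summary.get("evidence", [])[:3],  # keep only a few short references
    )
```
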
### 5. Keep review output close to falsifiable experiment proposals

If this workflow were adopted, each weekly review would ideally produce at most 1–3 proposed changes, each mapping as explicitly as possible to:

- one failure mode
- one scaffold module
For example:

- risk: missing subtle regressions
- validation: one-week A/B test with `false_finish_rate` as guardrail

If a recommendation cannot yet be written in this format, it may be better treated as an observation rather than an immediate action item.

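To make “falsifiable” slightly more concrete, a proposal could be required to fill in one small record like the sketch below before it counts as an action item. The field names are assumptions based on the mapping above; the risk and validation values are taken from the example in this issue, and the other example values are invented purely for illustration.

```python
# A change only counts as an experiment proposal if every field can be filled in;
# otherwise it stays an observation.
REQUIRED_FIELDS = ("failure_mode", "scaffold_module", "change", "risk", "validation")

example_proposal = {
    "failure_mode": "false_finish",                    # illustrative value
    "scaffold_module": "review phase",                 # illustrative value
    "change": "require tests to pass before an agent may declare completion",  # illustrative value
    "risk": "missing subtle regressions",              # from the example above
    "validation": "one-week A/B test with false_finish_rate as guardrail",     # from the example above
}

def classify(item: dict) -> str:
    missing = [f for f in REQUIRED_FIELDS if not item.get(f)]
    return "experiment_proposal" if not missing else "observation"
```
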
## Why this might be worth discussing

I think this workflow could potentially help `humanize` in a few ways:

1. **It evaluates the whole scaffold, not just model capability.**
2. **It scales better as more contributors propose changes.**

## A minimal first version

If this needs to start small, I would suggest beginning with just three things:

1. add `scaffold_version`, `task_slice`, `outcome`, `budget`, and `events` to the log schema;
2. add a script or workflow that generates `weekly_scaffold_review.md`;
3. define a minimal `failure taxonomy` and `Trace Card` schema.

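As a sketch of item 2, the weekly report generator could be little more than formatting: it would take the metric summary and the sampled Trace Cards from the earlier sketches and write `weekly_scaffold_review.md`, never touching raw logs. The layout below is only a guess at what a useful first report might contain.

```python
# Sketch of the weekly report generator: formats pre-computed metrics and
# Trace Cards into a small markdown report; it never reads raw logs.
from pathlib import Path

def write_weekly_review(metrics: dict, cards: list, path: str = "weekly_scaffold_review.md") -> None:
    lines = ["# Weekly scaffold review", "", "## Metric summary", ""]
    for version, m in metrics.items():
        lines.append(
            f"- `{version}`: success@budget={m['success@budget']:.2f}, "
            f"tokens_per_success={m['tokens_per_success']:.0f}"
        )
    lines += ["", "## Sampled Trace Cards", ""]
    for card in cards:
        lines.append(
            f"- {card.run_id}: {card.task} -> {card.failure_tag} "
            f"(phases: {', '.join(card.scaffold_phases)})"
        )
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```
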
Even that alone could already move the discussion from subjective impressions toward low-cost, evidence-based scaffold diagnosis.

If the maintainers think this direction is worthwhile, I would also be happy to help sketch a more concrete v1, such as: