Draft: `.humanize/drafts/humanize-org-issue-en.md`

# [Proposal] Consider adding a low-cost scaffold review workflow based on run logs

## Background

As projects like `humanize` grow into real-world agent scaffolds, more contributors will naturally add new features, roles, and workflows. A related challenge is that it becomes harder to tell whether those changes are actually improving the system.

One natural idea is to add a CI-like check that compares scaffold changes against “real” development workloads. But there are two practical issues:

1. It is hard to choose workloads that are genuinely representative.
2. If the workloads are large and realistic, the token cost can become too high for frequent evaluation.

I have been wondering whether scaffold changes could be framed not only as “prompt/agent capability tweaks”, but also as an **organizational design** problem.

In other words, the question may not just be “is this scaffold more sophisticated?”, but also:

- Does it fit the actual task distribution?
- Does it improve information flow and decision flow?
- Does it reduce coordination friction such as repeated search, repeated review, and repeated trial-and-error?
- Does it help the system surface failures earlier and reuse successful patterns more reliably?

If I compress that evaluation lens a bit, it seems to fall into four dimensions:

- **Fit**: does the scaffold match the real task mix?
- **Flow**: are information flow, decision flow, and handoffs working well?
- **Friction**: where are we wasting effort through loops, queues, or duplicate work?
- **Feedback**: are failures caught early, and are wins made reusable?

One benefit of this framing is that it does not require a giant “real benchmark” every time. It allows us to use existing run logs as evidence and continuously observe whether the scaffold design is moving in a good direction.

## A possible direction

If this seems useful, I would like to suggest adding a **low-cost periodic scaffold review workflow** on top of the existing logging / trace system, starting with a lightweight v1.

### 1. Make runs traceable to scaffold versions

Each run should retain at least:

- `outcome` (success / failure / false finish / human takeover)
- `artifacts` (diff / test result / review comments)

The most important point, in my view, is that **logs should ideally be attributable to a specific scaffold version**. Otherwise the analysis may describe symptoms, but it becomes much harder to attribute them to a concrete change.

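To make the attribution requirement concrete, here is a minimal sketch of what a per-run record could look like. The field names `scaffold_version`, `task_slice`, `outcome`, `budget`, `events`, and `artifacts` come from this proposal; everything else (the `run_id` field, the concrete types, JSONL as the storage format) is only an illustrative assumption, not a claim about the current `humanize` log schema.

```python
# Illustrative sketch only: one run = one JSON line in a runs.jsonl file.
# Field names follow the proposal; run_id and the exact types are assumptions.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class RunRecord:
    run_id: str                  # hypothetical unique id for one session
    scaffold_version: str        # e.g. git SHA or tag of the scaffold configuration
    task_slice: str              # e.g. "bugfix", "refactor", "new-feature"
    outcome: str                 # "success" | "failure" | "false_finish" | "human_takeover"
    budget: dict[str, int]       # e.g. {"tokens": 120_000, "wall_clock_s": 900}
    events: list[dict[str, Any]] = field(default_factory=list)  # tool calls, reviews, retries
    artifacts: dict[str, str] = field(default_factory=dict)     # diff, test output, review comments
```
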
### 2. Run cheap metric screening daily

Instead of sending full logs to a strong model by default, first run programmatic metrics over all runs, for example:

- `success@budget`
- `tokens_per_success`
- `review_loop_count`
- repeated reads of the same file / repeated execution of the same failing command

The goal here is not necessarily to generate recommendations immediately. It is first to help answer: **did the scaffold actually get worse, or did the task mix change?**

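As a rough illustration of how cheap this pass could be, the sketch below computes a few of the metrics above directly from JSONL run records shaped like the example in section 1. The metric definitions (how `success@budget` is counted, what counts as a repeated failing command, the event type names) are my assumptions, not established conventions in `humanize`.

```python
# Cheap daily screening pass: no model calls, just counting over run records.
import json
from collections import Counter, defaultdict

def screen(log_path: str, token_budget: int = 150_000) -> dict:
    with open(log_path, encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]

    by_version: dict[str, list[dict]] = defaultdict(list)
    for r in runs:
        by_version[r["scaffold_version"]].append(r)

    report = {}
    for version, vruns in by_version.items():
        successes = [r for r in vruns if r["outcome"] == "success"]
        report[version] = {
            "runs": len(vruns),
            # success@budget: share of runs that succeeded within the token budget
            "success@budget": sum(
                1 for r in successes if r["budget"]["tokens"] <= token_budget
            ) / len(vruns),
            # tokens_per_success: total tokens spent divided by number of successes
            "tokens_per_success": (
                sum(r["budget"]["tokens"] for r in vruns) / len(successes)
                if successes else float("inf")
            ),
            # review loops per run, assuming a "review" event type exists
            "avg_review_loops": sum(
                sum(1 for e in r["events"] if e.get("type") == "review") for r in vruns
            ) / len(vruns),
            # repeated execution of the same failing command
            "repeated_failing_commands": sum(
                count - 1
                for r in vruns
                for count in Counter(
                    e.get("command", "")
                    for e in r["events"]
                    if e.get("type") == "exec" and e.get("exit_code", 0) != 0
                ).values()
                if count > 1
            ),
            # task mix, to help separate "scaffold regressed" from "tasks got harder"
            "task_mix": dict(Counter(r["task_slice"] for r in vruns)),
        }
    return report
```

Because the report is grouped by `scaffold_version` and includes the task mix, it can already help separate “the scaffold regressed” from “the task mix changed”.
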
### 3. Sample weekly instead of reviewing all raw logs

To control cost, it may be enough to do stratified sampling over outcomes and task types, for example 20–40 sessions covering:

- cheap successes
- expensive successes
- false finishes
- human takeovers

This is usually much cheaper than feeding an entire week of raw logs into a model, and it may also lead to a more stable review rhythm.

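The sampling step could stay very small. The sketch below assumes the strata listed above; the token threshold used to split cheap from expensive successes, and the per-stratum count, are arbitrary illustrative values.

```python
# Weekly stratified sample over outcomes; strata mirror the list above.
import random

def weekly_sample(runs: list[dict], per_stratum: int = 6, expensive_tokens: int = 80_000) -> list[dict]:
    def stratum(run: dict) -> str:
        if run["outcome"] == "success":
            return "expensive_success" if run["budget"]["tokens"] > expensive_tokens else "cheap_success"
        return run["outcome"]  # "failure", "false_finish", "human_takeover", ...

    buckets: dict[str, list[dict]] = {}
    for run in runs:
        buckets.setdefault(stratum(run), []).append(run)

    sample: list[dict] = []
    for _, bucket in sorted(buckets.items()):
        sample.extend(random.sample(bucket, min(per_stratum, len(bucket))))
    return sample  # with ~5 strata this lands in the 20-40 session range
```
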
### 4. Generate Trace Cards before higher-level review

A cheap or local model could first compress each sampled session into a structured `Trace Card`, keeping only:

- what the task was
- which scaffold phases were used
- the most likely failure tag
- short evidence references

Then a stronger model would review only:

- metric summaries
- Trace Cards

instead of full raw logs.

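For discussion, here is one possible `Trace Card` shape. The fields mirror the bullets above; the `summarize` argument is a stand-in for whatever cheap or local model call the project would actually use, and its output shape is assumed, so this is a sketch of the data flow rather than a concrete implementation.

```python
# One possible Trace Card shape; fields mirror the bullets above.
from dataclasses import dataclass

@dataclass
class TraceCard:
    run_id: str
    task: str                    # what the task was
    scaffold_phases: list[str]   # which scaffold phases were used
    failure_tag: str             # most likely tag from the failure taxonomy
    evidence_refs: list[str]     # short pointers into the raw log, not the log itself

def build_trace_card(run: dict, summarize) -> TraceCard:
    """`summarize` stands in for the cheap/local model call; its output shape is assumed."""
    summary = summarize(run["events"])  # assumed to return phases, a failure tag, and evidence
    return TraceCard(
        run_id=run["run_id"],
        task=run["task_slice"],
        scaffold_phases=summary.get("phases", []),
        failure_tag=summary.get("failure_tag", "none"),
        evidence_refs=summary.get("evidence", [])[:3],  # keep only a few short references
    )
```
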
### 5. Keep review output close to falsifiable experiment proposals

If this workflow were adopted, each weekly review would ideally produce at most 1–3 proposed changes, each mapping as explicitly as possible to:

- one failure mode
- one scaffold module
For example:

- risk: missing subtle regressions
- validation: one-week A/B test with `false_finish_rate` as guardrail

If a recommendation cannot yet be written in this format, it may be better treated as an observation rather than an immediate action item.

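To make “falsifiable” slightly more concrete, a proposal could be required to fill in one small record like the sketch below before it counts as an action item. The field names are assumptions based on the mapping above; the risk and validation values are taken from the example in this issue, and the other example values are invented purely for illustration.

```python
# A change only counts as an experiment proposal if every field can be filled in;
# otherwise it stays an observation.
REQUIRED_FIELDS = ("failure_mode", "scaffold_module", "change", "risk", "validation")

example_proposal = {
    "failure_mode": "false_finish",                    # illustrative value
    "scaffold_module": "review phase",                 # illustrative value
    "change": "require tests to pass before an agent may declare completion",  # illustrative value
    "risk": "missing subtle regressions",              # from the example above
    "validation": "one-week A/B test with false_finish_rate as guardrail",     # from the example above
}

def classify(item: dict) -> str:
    missing = [f for f in REQUIRED_FIELDS if not item.get(f)]
    return "experiment_proposal" if not missing else "observation"
```
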
## Why this might be worth discussing

I think this workflow could potentially help `humanize` in a few ways:

1. **It evaluates the whole scaffold, not just model capability.**
2. **It scales better as more contributors propose changes.**

## A minimal first version

If this needs to start small, I would suggest beginning with just three things:

1. add `scaffold_version`, `task_slice`, `outcome`, `budget`, and `events` to the log schema;
2. add a script or workflow that generates `weekly_scaffold_review.md`;
3. define a minimal `failure taxonomy` and `Trace Card` schema.

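As a sketch of item 2, the weekly report generator could be little more than formatting: it would take the metric summary and the sampled Trace Cards from the earlier sketches and write `weekly_scaffold_review.md`, never touching raw logs. The layout below is only a guess at what a useful first report might contain.

```python
# Sketch of the weekly report generator: formats pre-computed metrics and
# Trace Cards into a small markdown report; it never reads raw logs.
from pathlib import Path

def write_weekly_review(metrics: dict, cards: list, path: str = "weekly_scaffold_review.md") -> None:
    lines = ["# Weekly scaffold review", "", "## Metric summary", ""]
    for version, m in metrics.items():
        lines.append(
            f"- `{version}`: success@budget={m['success@budget']:.2f}, "
            f"tokens_per_success={m['tokens_per_success']:.0f}"
        )
    lines += ["", "## Sampled Trace Cards", ""]
    for card in cards:
        lines.append(
            f"- {card.run_id}: {card.task} -> {card.failure_tag} "
            f"(phases: {', '.join(card.scaffold_phases)})"
        )
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```
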
Even that alone could already move the discussion from subjective impressions toward low-cost, evidence-based scaffold diagnosis.

If the maintainers think this direction is worthwhile, I would also be happy to help sketch a more concrete v1, such as: