
Commit 0605997

张靖恒 and claude committed
Calibrate cost estimates from experiment 2 measurements
n=5 real Polymarket descriptions, Gemini 2.0 Flash V2 strict prompt:
- 5/5 schema_ok, 5/5 grounding_ok
- avg input 552 tok, avg output 397 tok
- actual: $0.000214/call (vs $0.000090 prior estimate, 2.4× higher)
- output is bigger than predicted because verbatim_text transcript inflates it ~2.6× over the "short JSON" assumption

Total run revised: $0.52 → $0.87. Still trivial.

Also drops the head-to-head budget line: V2 strict prompt with verbatim grounding nailed 5/5 on first try, so no prompt-tuning phase needed before scale-up. Prior $0.02 line removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 8d598a5 commit 0605997
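The $0.000214/call figure in the commit message can be reproduced from the average token counts and the Gemini 2.0 Flash prices listed in the changed file ($0.10 per 1M input tokens, $0.40 per 1M output tokens). The following is a minimal sketch of that arithmetic; the function and constant names are hypothetical and do not come from the repository.

```python
# Hypothetical helper: reproduces the per-call cost arithmetic only.
# Prices are USD per 1M tokens, as listed for Gemini 2.0 Flash in the changed file.
INPUT_PRICE_PER_M = 0.10
OUTPUT_PRICE_PER_M = 0.40


def cost_per_call(input_tokens: float, output_tokens: float) -> float:
    """Return the USD cost of one call given its input and output token counts."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Averages reported for the n=5 calibration run.
measured = cost_per_call(552, 397)

# The prior estimate ($0.00009) and the assumed ~150 output tokens are quoted
# from the removed lines of the diff, not recomputed here.
cost_ratio = measured / 0.00009
output_ratio = 397 / 150

print(f"measured cost per call: ${measured:.6f}")
print(f"vs prior estimate: {cost_ratio:.1f}x, output size: {output_ratio:.1f}x")
```

Output tokens account for about three quarters of the measured cost, which is consistent with the commit's explanation that the verbatim transcript drives the increase.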

1 file changed

Lines changed: 21 additions & 17 deletions

docs/references/dash-ocr-production-patterns.md

@@ -168,32 +168,36 @@ Three design choices that are load-bearing — flag before changing:
 ### 3.1 Model unit prices (OpenRouter public pricing)

-| Model | Price ($/1M tokens, input + output) | T2 per-call estimate |
+| Model | Price ($/1M tokens, input + output) | T2 per-call calibrated value |
 |---|---|---|
-| Gemini 2.0 Flash | $0.10 + $0.40 | ~$0.00009 |
-| Qwen 2.5-72B | $0.15 + $0.40 | ~$0.00015 |
-| Claude Haiku 4.5 | $0.80 + $4 | ~$0.0008 |
-| Claude Sonnet 4.6 | $3 + $15 | ~$0.003 |
-| DeepSeek V3 | $0.27 + $1.10 | ~$0.0003 |
+| **Gemini 2.0 Flash** | **$0.10 + $0.40** | **$0.000214** ✅ measured |
+| Qwen 2.5-72B | $0.15 + $0.40 | ~$0.0003 (estimate) |
+| Claude Haiku 4.5 | $0.80 + $4 | ~$0.0015 (estimate) |
+| Claude Sonnet 4.6 | $3 + $15 | ~$0.005 (estimate) |
+| DeepSeek V3 | $0.27 + $1.10 | ~$0.0006 (estimate) |

-**T2 per-call** estimation basis: input ~500 tokens (resolution criteria + prompt), output ~150 tokens (verbatim + clauses JSON).
+**T2 per-call measured basis** ([`reports/experiment-openrouter-calibration-2026-05-12.md`](../../reports/experiment-openrouter-calibration-2026-05-12.md), n=5):
+- input averages 552 tokens (system prompt + description)
+- output averages 397 tokens (verbatim_text + clauses JSON; the verbatim text makes up most of it)
+- measured cost: $0.000214/call, 2.4× the earlier $0.00009 estimate, because the verbatim-grounding output is ~2.6× larger than the assumed "short structured JSON"
+- calibration result: schema_ok = 5/5, grounding_ok = 5/5; the prompt needs no further tuning

-### 3.2 T2 + T4 total budget (OpenRouter routing)
+### 3.2 T2 + T4 total budget (OpenRouter routing, calibrated)

 | Step | Calls | Unit price | Subtotal |
 |---|---|---|---|
-| T2 V2 main extraction (Gemini Flash) | 2000 | $0.00009 | **$0.18** |
-| T2 V1 fallback (10% silent-empty, **same Gemini Flash + V1 permissive prompt**) | 200 | $0.00009 | **$0.02** |
-| T2 prompt tuning head-to-head (3 models × 20 markets, tuning phase only) | 60 | avg $0.0003 | **$0.02** |
+| T2 V2 main extraction (Gemini Flash) | 2000 | $0.000214 ✅ measured | **$0.43** |
+| T2 V1 fallback (10% silent-empty, **same Gemini Flash + V1 permissive prompt**) | 200 | $0.000214 | **$0.04** |
+| T2 prompt tuning head-to-head (already passed 5/5, **no head-to-head needed**) | 0 | | **$0** |
 | T3 embedding (OpenAI text-embedding-3-small) | ~10M tokens | $0.00002/1k | **$0.20** |
-| T3 LLM candidate verification (Gemini 2.0 Flash) | 500 | $0.00009 | **$0.05** |
-| T4 corpus (structurally derived, no LLM calls) | 0 | | **$0** |
-| T4 judge ensemble (3 models × 100 cases) | 300 | avg $0.00015 | **$0.05** |
-| **Total for one full run** | | | **~$0.52** |
+| T3 LLM candidate verification (Gemini 2.0 Flash) | 500 | $0.000214 | **$0.11** |
+| T4 corpus (structurally derived, no LLM calls; PR #7 already derived 10,122 mutex pairs) | 0 | | **$0** |
+| T4 judge ensemble (3 models × 100 cases, estimated at the Gemini unit price) | 300 | avg $0.0003 | **$0.09** |
+| **Total for one full run** | | | **~$0.87** |

-**vs. the original $17 estimate**: 33× cheaper.
+**vs. the original $17 estimate**: still 20× cheaper.

-Running twice a month ≈ $1.20. Zero cost sensitivity.
+Running twice a month ≈ $1.74. Still zero cost sensitivity.

---
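
The revised total of ~$0.87 in section 3.2 follows from the call counts and unit prices in the updated table. The following is a rough sketch of that sum, assuming each subtotal is rounded to cents before adding, as the table appears to do. The variable names are hypothetical and are not taken from the repository.

```python
# Hypothetical recomputation of the updated section 3.2 budget table.
GEMINI_CALL = 0.000214  # measured USD per call for Gemini 2.0 Flash

# (step, billable units, USD per unit), copied from the "+" rows of the diff.
steps = [
    ("T2 V2 main extraction", 2000, GEMINI_CALL),
    ("T2 V1 fallback", 200, GEMINI_CALL),
    ("T2 prompt tuning head-to-head", 0, 0.0),
    ("T3 embedding", 10_000, 0.00002),  # ~10M tokens, billed per 1k tokens
    ("T3 LLM candidate verification", 500, GEMINI_CALL),
    ("T4 corpus", 0, 0.0),
    ("T4 judge ensemble", 300, 0.0003),
]

# Round each subtotal to cents, then sum.
subtotals = {name: round(units * price, 2) for name, units, price in steps}
total = round(sum(subtotals.values()), 2)

monthly = round(2 * total, 2)      # two full runs per month
vs_original = round(17 / total)    # compared with the original $17 estimate

for name, value in subtotals.items():
    print(f"{name:32s} ${value:.2f}")
print(f"total ${total:.2f}, monthly ${monthly:.2f}, about {vs_original}x cheaper than $17")
```

Summing the unrounded subtotals gives the same total to the cent, so the rounding order does not affect the headline figure.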