Add OpenRouter calibration experiment results

张靖恒 · claude · 张靖恒 · commit 83b8a5a91603 · 2026-05-12T14:51:47.000+08:00
n=5 random Polymarket markets, Gemini 2.0 Flash via OpenRouter, V2 strict prompt with verbatim grounding.

Results:
- schema_ok: 5/5 (100%)
- grounding_ok: 5/5 (100%); the verbatim_text substring check passes on every clause across all 5 markets
- 2 to 5 clauses extracted per market, with diverse types (deadline / tiebreaker / source)
- ambiguity_score between 0.2 and 0.3 on every market
- Avg latency 3.2s/call

Cost calibration:
- Actual: $0.000214/call ($0.001070 total / 5 calls)
- PR #6 estimate: $0.000090/call
- Off by 2.4×: output averages ~397 tokens/call (estimate was 150), because verbatim_text transcription inflates output
- 2000-market T2 projected: $0.43 (vs $0.18 in PR #6)

Conclusion: the V2 strict prompt + Gemini Flash combination worked out of the box on all 5 sampled Polymarket descriptions. No prompt tweaks needed before T2 scale-up. The PR #6 budget needs an update, but total cost is still trivial.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
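
The cost calibration figures can be re-derived from the numbers in the report. The sketch below is a sanity check only: it makes no API calls, the variable names are illustrative rather than taken from the experiment script, and every constant is copied from the report or from this message.

```python
# Per-market (input_tokens, output_tokens) pairs copied from section 3 of the report.
TOKENS = [(565, 355), (459, 278), (553, 384), (566, 357), (617, 611)]

TOTAL_COST_USD = 0.001070         # measured spend across the 5 calibration calls
ESTIMATE_PER_CALL_USD = 0.000090  # PR #6 estimate
T2_MARKETS = 2000                 # planned T2 run size

calls = len(TOKENS)
total_in = sum(i for i, _ in TOKENS)
total_out = sum(o for _, o in TOKENS)

actual_per_call = TOTAL_COST_USD / calls
ratio = actual_per_call / ESTIMATE_PER_CALL_USD
projected_actual = actual_per_call * T2_MARKETS
projected_estimate = ESTIMATE_PER_CALL_USD * T2_MARKETS

print(f"tokens: {total_in} in, {total_out} out")
print(f"avg per call: {total_in / calls:.0f} in, {total_out / calls:.0f} out")
print(f"actual: ${actual_per_call:.6f}/call ({ratio:.1f}x the estimate)")
print(f"T2 projection: ${projected_actual:.2f} vs ${projected_estimate:.2f} estimated")
```

The per-market token counts sum to the reported totals (2760 in, 1985 out), and the ratio and projections round to the 2.4×, $0.43, and $0.18 quoted above.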
diff --git a/reports/experiment-openrouter-calibration-2026-05-12.md b/reports/experiment-openrouter-calibration-2026-05-12.md
@@ -0,0 +1,54 @@
+# OpenRouter Gemini Flash Calibration Experiment Report (2026-05-12T06:50:07.649967+00:00)
+
+**Model**: `google/gemini-2.0-flash-001`
+**Sample**: n=5 markets, random.seed=42
+**Prompt**: V2 strict (verbatim grounding + injection defense)
+
+## 1. Overall metrics
+
+- schema_ok: 5/5 (100%)
+- grounding_ok: 5/5 (100%)
+- Total input tokens: 2760
+- Total output tokens: 1985
+- Avg input/call: 552
+- Avg output/call: 397
+- Avg latency: 3.2s
+
+## 2. Actual cost vs. estimate
+
+- Actual: $0.001070 / 5 calls = **$0.000214/call**
+- PR #6 estimate: $0.000090/call
+- Gap: about 2.4× the estimate; the PR #6 memo needs correcting
+- Projected full T2 run over 2000 markets: $0.43
+
+## 3. Per-market details
+
+### Market 1: `630772` — "Will the Democrats win the Maine Senate race in 2026?"
+- description_len: 852
+- schema_ok: True, grounding_ok: True, clauses: 2, ambiguity: 0.2
+- tokens: 565 in + 355 out, elapsed: 3.6s
+- sample clause: type=`deadline`, source_substring=`2026 midterm Maine U.S. Senate election...`
+
+### Market 2: `679652` — "Will Christina Loren Clement be the Republican nominee for Senate in Georgia?"
+- description_len: 399
+- schema_ok: True, grounding_ok: True, clauses: 2, ambiguity: 0.3
+- tokens: 459 in + 278 out, elapsed: 2.4s
+- sample clause: type=`tiebreaker`, source_substring=`If no 2026 Georgia Republican Senate Primary takes place, this market will resol...`
+
+### Market 3: `582320` — "Will Rennes finish in the top 4 of the Ligue 1 2025–26 standings?"
+- description_len: 661
+- schema_ok: True, grounding_ok: True, clauses: 3, ambiguity: 0.2
+- tokens: 553 in + 384 out, elapsed: 2.9s
+- sample clause: type=`deadline`, source_substring=`October 1, 2026...`
+
+### Market 4: `630845` — "Will the Republicans win the New Hampshire Senate race in 2026?"
+- description_len: 860
+- schema_ok: True, grounding_ok: True, clauses: 2, ambiguity: 0.3
+- tokens: 566 in + 357 out, elapsed: 2.9s
+- sample clause: type=`deadline`, source_substring=`2026 midterm New Hampshire U.S. Senate election...`
+
+### Market 5: `681026` — "Will DNM (DEK) win the most seats at the Cyprus House of Representatives electio"
+- description_len: 1107
+- schema_ok: True, grounding_ok: True, clauses: 5, ambiguity: 0.2
+- tokens: 617 in + 611 out, elapsed: 4.0s
+- sample clause: type=`deadline`, source_substring=`May 24, 2026...`
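
The `grounding_ok` flag reported for each market is described in the commit message as a `verbatim_text` substring check. A minimal sketch of such a check follows, assuming each extracted clause is a dict with a `verbatim_text` field; the function name, clause shape, and sample description are hypothetical and are not taken from the experiment code.

```python
def grounding_ok(description: str, clauses: list[dict]) -> bool:
    """Return True only if every clause quotes the description verbatim.

    Assumed clause shape: {"type": ..., "verbatim_text": ...}.
    Note that an empty clause list passes trivially, because all([]) is True.
    """
    return all(
        isinstance(clause.get("verbatim_text"), str)
        and clause["verbatim_text"] != ""
        and clause["verbatim_text"] in description  # exact, case-sensitive match
        for clause in clauses
    )


# Made-up description for illustration only; not a real Polymarket market.
description = "This market resolves to Yes if the deadline of October 1, 2026 is met."

quoted = [{"type": "deadline", "verbatim_text": "October 1, 2026"}]
paraphrased = [{"type": "deadline", "verbatim_text": "1 October 2026"}]

print(grounding_ok(description, quoted))       # True: exact substring
print(grounding_ok(description, paraphrased))  # False: reworded date
```

An exact substring match is deliberately strict: any paraphrase or reformatting by the model fails the check. That strictness is consistent with the commit message's explanation that verbatim transcription inflates output tokens.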