Commit 83b8a5a

张靖恒 and claude committed
Add OpenRouter calibration experiment results
n=5 random Polymarket markets, Gemini 2.0 Flash via OpenRouter, V2 strict prompt with verbatim grounding.

Results:
- schema_ok: 5/5 (100%)
- grounding_ok: 5/5 (100%); the verbatim_text substring check passes on every clause across all 5 markets
- 2-5 clauses extracted per market, with diverse types (deadline / tiebreaker / source)
- ambiguity_score between 0.2 and 0.3 per market
- Avg latency 3.2s/call

Cost calibration:
- Actual: $0.000214/call ($0.001070 / 5 calls)
- PR #6 estimate: $0.000090/call
- Off by 2.4×: output tokens average ~397 (estimate was 150), because verbatim_text transcription inflates output
- 2000-market T2 projected: $0.43 (vs $0.18 in PR #6)

Conclusion: the V2 strict prompt + Gemini Flash combination works out-of-the-box on real Polymarket descriptions. No prompt tweaks are needed before the T2 scale-up. The PR #6 budget needs an update, but the total cost is still trivial.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 79077aa commit 83b8a5a

1 file changed

Lines changed: 54 additions & 0 deletions

@@ -0,0 +1,54 @@
# OpenRouter Gemini Flash Calibration Experiment Report (2026-05-12T06:50:07.649967+00:00)

**Model**: `google/gemini-2.0-flash-001`
**Sample**: n=5 markets, random.seed=42
**Prompt**: V2 strict (verbatim grounding + injection defense)
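
The report does not show how verbatim grounding is verified; the commit message describes it as a `verbatim_text` substring check on every clause. A minimal sketch of such a check, assuming each extracted clause is a dict with a `verbatim_text` field. The clause shape, the `is_grounded` helper, and the example description are all hypothetical and not taken from the experiment:

```python
def is_grounded(description: str, clauses: list[dict]) -> bool:
    """Return True only if every clause quotes the description verbatim.

    Note: an empty clause list passes trivially, so a real pipeline would
    probably want a separate check that at least one clause was extracted.
    """
    return all(
        isinstance(c.get("verbatim_text"), str)
        and c["verbatim_text"] != ""
        and c["verbatim_text"] in description
        for c in clauses
    )


# Illustrative input, not taken from the experiment.
description = "This market resolves to Yes if the event occurs by May 24, 2026."
good = [{"type": "deadline", "verbatim_text": "by May 24, 2026"}]
bad = [{"type": "deadline", "verbatim_text": "before May 24th, 2026"}]

print(is_grounded(description, good))  # True
print(is_grounded(description, bad))   # False
```

An exact substring test is deliberately strict: any paraphrase or normalization by the model (for example, rewriting a date) fails the check, which is the behavior a "verbatim" requirement implies.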

## 1. Overall metrics

- schema_ok: 5/5 (100%)
- grounding_ok: 5/5 (100%)
- Total input tokens: 2760
- Total output tokens: 1985
- Average input tokens/call: 552
- Average output tokens/call: 397
- Average latency: 3.2s
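
The token totals above are enough to reproduce the reported spend once per-token prices are known. The report does not state them; the sketch below assumes $0.10 per million input tokens and $0.40 per million output tokens, which reproduces the $0.001070 total given in section 2. Both prices are assumptions and should be checked against current OpenRouter pricing for the model:

```python
# Assumed prices in USD per million tokens (not stated in the report).
INPUT_PRICE_PER_M = 0.10
OUTPUT_PRICE_PER_M = 0.40

# Totals reported in section 1.
total_input_tokens = 2760
total_output_tokens = 1985
n_calls = 5

input_cost = total_input_tokens * INPUT_PRICE_PER_M / 1_000_000
output_cost = total_output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
total_cost = input_cost + output_cost

print(f"input:  ${input_cost:.6f}")
print(f"output: ${output_cost:.6f}")
print(f"total:  ${total_cost:.6f}  (${total_cost / n_calls:.6f}/call)")
```

Under these assumed prices, output tokens account for roughly three quarters of the spend, so an underestimate of output length dominates any error in the cost estimate.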

## 2. Actual cost vs. estimate

- Actual: $0.001070 / 5 calls = **$0.000214/call**
- PR #6 estimate: $0.000090/call
- Gap: actual cost is about 2.4× the estimate; the memo needs to be corrected
- Projected full T2 run over 2000 markets: $0.43
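
The figures in this section follow from simple arithmetic, which the sketch below reproduces from the reported numbers. Variable names are illustrative, and the projection assumes the 5-market sample is representative of the 2000-market run, which the report does not establish:

```python
actual_total = 0.001070        # USD spent on the 5 calibration calls
n_calls = 5
estimated_per_call = 0.000090  # PR #6 estimate, USD per call
n_markets = 2000               # planned T2 run

actual_per_call = actual_total / n_calls
ratio = actual_per_call / estimated_per_call

print(f"actual per call:  ${actual_per_call:.6f}")
print(f"actual/estimate:  {ratio:.1f}x")
print(f"T2 projection:    ${actual_per_call * n_markets:.2f}")
print(f"PR #6 projection: ${estimated_per_call * n_markets:.2f}")
```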

## 3. Per-market details

### Market 1: `630772` - "Will the Democrats win the Maine Senate race in 2026?"
- description_len: 852
- schema_ok: True, grounding_ok: True, clauses: 2, ambiguity: 0.2
- tokens: 565 in + 355 out, elapsed: 3.6s
- sample clause: type=`deadline`, source_substring=`2026 midterm Maine U.S. Senate election...`

### Market 2: `679652` - "Will Christina Loren Clement be the Republican nominee for Senate in Georgia?"
- description_len: 399
- schema_ok: True, grounding_ok: True, clauses: 2, ambiguity: 0.3
- tokens: 459 in + 278 out, elapsed: 2.4s
- sample clause: type=`tiebreaker`, source_substring=`If no 2026 Georgia Republican Senate Primary takes place, this market will resol...`

### Market 3: `582320` - "Will Rennes finish in the top 4 of the Ligue 1 2025–26 standings?"
- description_len: 661
- schema_ok: True, grounding_ok: True, clauses: 3, ambiguity: 0.2
- tokens: 553 in + 384 out, elapsed: 2.9s
- sample clause: type=`deadline`, source_substring=`October 1, 2026...`

### Market 4: `630845` - "Will the Republicans win the New Hampshire Senate race in 2026?"
- description_len: 860
- schema_ok: True, grounding_ok: True, clauses: 2, ambiguity: 0.3
- tokens: 566 in + 357 out, elapsed: 2.9s
- sample clause: type=`deadline`, source_substring=`2026 midterm New Hampshire U.S. Senate election...`

### Market 5: `681026` - "Will DNM (DEK) win the most seats at the Cyprus House of Representatives electio"
- description_len: 1107
- schema_ok: True, grounding_ok: True, clauses: 5, ambiguity: 0.2
- tokens: 617 in + 611 out, elapsed: 4.0s
- sample clause: type=`deadline`, source_substring=`May 24, 2026...`
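
As a consistency check, the aggregate metrics in section 1 can be recomputed from the per-market rows above. This is a sketch only; the tuple layout is an illustrative choice, not the experiment's actual data structure:

```python
from statistics import mean

# (market_id, input_tokens, output_tokens, elapsed_s, clauses, ambiguity)
# Values copied from the per-market details in this report.
markets = [
    ("630772", 565, 355, 3.6, 2, 0.2),
    ("679652", 459, 278, 2.4, 2, 0.3),
    ("582320", 553, 384, 2.9, 3, 0.2),
    ("630845", 566, 357, 2.9, 2, 0.3),
    ("681026", 617, 611, 4.0, 5, 0.2),
]

total_in = sum(m[1] for m in markets)
total_out = sum(m[2] for m in markets)

print(f"input tokens:   {total_in} total, {total_in / len(markets):.0f}/call")
print(f"output tokens:  {total_out} total, {total_out / len(markets):.0f}/call")
print(f"mean latency:   {mean(m[3] for m in markets):.1f}s")
print(f"mean clauses:   {mean(m[4] for m in markets):.1f}")
print(f"mean ambiguity: {mean(m[5] for m in markets):.2f}")
```

The recomputed totals (2760 input, 1985 output) and averages (552 and 397 tokens per call, 3.2s latency) agree with the reported overall metrics. Market 5, with the longest description and 5 clauses, produces nearly as many output tokens as input tokens.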
