Commit 266b388

Add strict LLM provider benchmarking
1 parent 9778cb1 commit 266b388

17 files changed

Lines changed: 1909 additions & 12 deletions

.env.example

Lines changed: 7 additions & 1 deletion
@@ -11,6 +11,13 @@ OPENAI_CHAT_STREAM=true
 OPENAI_CHAT_RESPONSE_FORMAT=json_object
 OPENAI_TEMPERATURE=0

+# Secondary provider, tried after primary and before backup.
+OPENAI_SECONDARY_API_KEY=
+OPENAI_SECONDARY_BASE_URL=
+OPENAI_SECONDARY_API_MODE=chat
+OPENAI_SECONDARY_MODEL=
+OPENAI_SECONDARY_TIMEOUT=60
+
 # Backup provider.
 OPENAI_BACKUP_API_KEY=
 OPENAI_BACKUP_BASE_URL=
@@ -50,4 +57,3 @@ POLYMARKET_CHAIN_ID=137
 POLYMARKET_CLOB_API_KEY=
 POLYMARKET_CLOB_API_SECRET=
 POLYMARKET_CLOB_PASSPHRASE=
-
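
The comment in the new block states the intended try-order: primary, then secondary, then backup. A minimal sketch of how such settings could be read into an ordered chain follows. `ProviderConfig` and `load_provider_chain` are hypothetical names, not part of this commit. The sketch assumes the primary tier uses the un-prefixed `OPENAI_*` names, that `OPENAI_TIMEOUT` exists, and that a tier with an empty API key counts as unconfigured. The fallback tier is left out because its position in the chain is not stated here.

```python
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class ProviderConfig:
    # Hypothetical container; field names mirror the env variable suffixes.
    name: str
    api_key: str
    base_url: Optional[str]
    api_mode: str
    model: Optional[str]
    timeout: float


# Try-order taken from the .env.example comment: primary, secondary, backup.
_TIERS = (
    ("primary", "OPENAI_"),
    ("secondary", "OPENAI_SECONDARY_"),
    ("backup", "OPENAI_BACKUP_"),
)


def load_provider_chain(env: Mapping[str, str]) -> List[ProviderConfig]:
    """Return the configured providers in the order they would be tried."""
    chain = []
    for name, prefix in _TIERS:
        api_key = (env.get(prefix + "API_KEY") or "").strip()
        if not api_key:
            continue  # assumption: an empty key means the tier is disabled
        chain.append(
            ProviderConfig(
                name=name,
                api_key=api_key,
                base_url=env.get(prefix + "BASE_URL") or None,
                api_mode=env.get(prefix + "API_MODE") or "responses",
                model=env.get(prefix + "MODEL") or None,
                timeout=float(env.get(prefix + "TIMEOUT") or 60),
            )
        )
    return chain
```

Passing `os.environ` as `env` would read the live configuration; passing a plain dict keeps the function easy to test.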

docs/security.md

Lines changed: 1 addition & 1 deletion
@@ -9,6 +9,7 @@
 ## Variables that deserve attention

 - `OPENAI_API_KEY`
+- `OPENAI_SECONDARY_API_KEY`
 - `OPENAI_BACKUP_API_KEY`
 - `OPENAI_FALLBACK_API_KEY`
 - `ODDPOOL_API_KEY`
@@ -26,4 +27,3 @@ rg -n "sk-[A-Za-z0-9_-]{12,}|PRIVATE_KEY|API_KEY|PASS_PHRASE|passphrase|secret"
 ```

 If you add a new secret-bearing setting later, update `.gitignore`, `.env.example`, and the docs together.
-

poly_strategy/cli.py

Lines changed: 5 additions & 5 deletions
@@ -1761,7 +1761,7 @@ def _build_parser() -> argparse.ArgumentParser:
     discover.add_argument("--model", help="OpenAI model name; defaults to OPENAI_MODEL")
     discover.add_argument("--fallback-model", help="OpenAI model name for retrying remaining failed batches")
     discover.add_argument("--base-url", help="OpenAI-compatible base URL; defaults to OPENAI_BASE_URL or OpenAI")
-    discover.add_argument("--api-mode", choices=["responses", "chat"], help="OpenAI-compatible API mode; defaults to OPENAI_API_MODE or responses")
+    discover.add_argument("--api-mode", choices=["responses", "chat", "messages"], help="OpenAI-compatible API mode; defaults to OPENAI_API_MODE or responses")
     discover.add_argument("--batch-size", type=int, default=10, help="markets per LLM discovery batch")
     discover.add_argument("--min-confidence", type=float, default=0.95, help="minimum candidate confidence")
     discover.add_argument("--max-markets", type=int, help="limit input markets for a small run")
@@ -1824,7 +1824,7 @@ def _build_parser() -> argparse.ArgumentParser:
     verify_groups.add_argument("--report-out", help="optional JSON verification report path")
     verify_groups.add_argument("--model", help="OpenAI model name; defaults to OPENAI_MODEL")
     verify_groups.add_argument("--base-url", help="OpenAI-compatible base URL; defaults to OPENAI_BASE_URL or OpenAI")
-    verify_groups.add_argument("--api-mode", choices=["responses", "chat"], help="OpenAI-compatible API mode; defaults to OPENAI_API_MODE or responses")
+    verify_groups.add_argument("--api-mode", choices=["responses", "chat", "messages"], help="OpenAI-compatible API mode; defaults to OPENAI_API_MODE or responses")
     verify_groups.add_argument("--min-net-edge", type=float, default=0.002, help="minimum diagnostic net edge to verify")
     verify_groups.add_argument("--top", type=int, default=10, help="maximum diagnostic groups to verify")
     verify_groups.add_argument("--min-confidence", type=float, default=0.95, help="minimum verification confidence")
@@ -1921,13 +1921,13 @@ def _build_parser() -> argparse.ArgumentParser:
     verify_cross.add_argument("--verified-only", action="store_true", help="write only verified same-binary signals")
     verify_cross.add_argument("--model", help="OpenAI model name; defaults to OPENAI_MODEL")
     verify_cross.add_argument("--base-url", help="OpenAI-compatible base URL; defaults to OPENAI_BASE_URL or OpenAI")
-    verify_cross.add_argument("--api-mode", choices=["responses", "chat"], help="OpenAI-compatible API mode; defaults to OPENAI_API_MODE or responses")
+    verify_cross.add_argument("--api-mode", choices=["responses", "chat", "messages"], help="OpenAI-compatible API mode; defaults to OPENAI_API_MODE or responses")
     verify_cross.add_argument("--backup-model", help="backup model; defaults to OPENAI_BACKUP_MODEL")
     verify_cross.add_argument("--backup-base-url", help="backup OpenAI-compatible base URL; defaults to OPENAI_BACKUP_BASE_URL")
-    verify_cross.add_argument("--backup-api-mode", choices=["responses", "chat"], help="backup API mode; defaults to OPENAI_BACKUP_API_MODE")
+    verify_cross.add_argument("--backup-api-mode", choices=["responses", "chat", "messages"], help="backup API mode; defaults to OPENAI_BACKUP_API_MODE")
     verify_cross.add_argument("--fallback-model", help="fallback model; defaults to OPENAI_FALLBACK_MODEL")
     verify_cross.add_argument("--fallback-base-url", help="fallback OpenAI-compatible base URL; defaults to OPENAI_FALLBACK_BASE_URL")
-    verify_cross.add_argument("--fallback-api-mode", choices=["responses", "chat"], help="fallback API mode; defaults to OPENAI_FALLBACK_API_MODE")
+    verify_cross.add_argument("--fallback-api-mode", choices=["responses", "chat", "messages"], help="fallback API mode; defaults to OPENAI_FALLBACK_API_MODE")
     verify_cross.add_argument("--timeout", type=float, default=60.0, help="HTTP timeout in seconds")
     verify_cross.add_argument("--backup-timeout", type=float, help="backup provider HTTP timeout; defaults to OPENAI_BACKUP_TIMEOUT or --timeout")
     verify_cross.add_argument("--fallback-timeout", type=float, help="fallback provider HTTP timeout; defaults to OPENAI_FALLBACK_TIMEOUT or --timeout")
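
One consequence of listing only three `choices` is worth spelling out. argparse validates a flag against `choices` before any application code sees it, so aliases such as `anthropic` (which `_normalize_api_mode` accepts in `openai_rules.py`) would presumably be usable only through the environment variables, not on the command line. The sketch below illustrates this with a stand-in parser; `build_demo_parser` and `accepts` are hypothetical names, and only the `--api-mode` flag is modelled.

```python
import argparse

API_MODES = ["responses", "chat", "messages"]


def build_demo_parser() -> argparse.ArgumentParser:
    # Stand-in for the real parser; only --api-mode is reproduced here.
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--api-mode", choices=API_MODES, help="API mode")
    return parser


def accepts(value: str) -> bool:
    """Return True if argparse accepts the value for --api-mode."""
    try:
        build_demo_parser().parse_args(["--api-mode", value])
    except SystemExit:
        # argparse prints a usage error and exits on an invalid choice.
        return False
    return True


if __name__ == "__main__":
    print(accepts("messages"))
    print(accepts("anthropic"))
```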

poly_strategy/openai_rules.py

Lines changed: 64 additions & 3 deletions
@@ -24,6 +24,8 @@ def _normalize_api_mode(api_mode: Optional[str]) -> str:
         return "responses"
     if value in {"chat", "chat_completions", "chat-completions", "chatcompletions"}:
         return "chat"
+    if value in {"messages", "message", "anthropic", "anthropic_messages", "anthropic-messages"}:
+        return "messages"
     raise OpenAIConfigError(f"unsupported OPENAI_API_MODE: {api_mode!r}")


@@ -204,7 +206,14 @@ def __init__(
             if self.proxy
             else None
         )
-        self._transport = transport or (self._post_chat_completions if self.api_mode == "chat" else self._post_responses)
+        if transport is not None:
+            self._transport = transport
+        elif self.api_mode == "chat":
+            self._transport = self._post_chat_completions
+        elif self.api_mode == "messages":
+            self._transport = self._post_messages
+        else:
+            self._transport = self._post_responses

     def build_payload(self, markets: Iterable[MarketText]) -> dict:
         return self._build_payload(
@@ -217,7 +226,7 @@ def build_payload(self, markets: Iterable[MarketText]) -> dict:
     def _build_payload(self, markets: Iterable[MarketText], system_prompt: str, schema_name: str, schema: dict) -> dict:
         market_rows = market_texts_to_prompt_rows(list(markets))
         prompt_text = json.dumps({"markets": market_rows}, ensure_ascii=True, sort_keys=True)
-        if self.api_mode == "chat":
+        if self.api_mode in {"chat", "messages"}:
             if schema_name == "polymarket_relation_discovery":
                 chat_system_prompt, chat_user_prompt = _relation_chat_prompts(prompt_text)
             else:
@@ -244,6 +253,17 @@ def _build_payload(self, markets: Iterable[MarketText], system_prompt: str, sche
                     "Never return verification sources, market summaries, safe_items, or a top-level markets key.\n\n"
                     f"Input markets JSON:\n{prompt_text}"
                 )
+            if self.api_mode == "messages":
+                payload = {
+                    "model": self.model,
+                    "system": chat_system_prompt,
+                    "messages": [{"role": "user", "content": chat_user_prompt}],
+                }
+                if self.max_output_tokens is not None:
+                    payload["max_tokens"] = self.max_output_tokens
+                if self.temperature is not None:
+                    payload["temperature"] = self.temperature
+                return payload
             payload = {
                 "model": self.model,
                 "messages": [
@@ -336,6 +356,23 @@ def _post_chat_completions(self, payload: dict, timeout: float) -> dict:
                 return _parse_chat_stream_response(response)
             return json.loads(response.read().decode("utf-8"))

+    def _post_messages(self, payload: dict, timeout: float) -> dict:
+        request = Request(
+            _messages_url(self.base_url),
+            data=json.dumps(payload).encode("utf-8"),
+            headers={
+                "authorization": f"Bearer {self.api_key}",
+                "x-api-key": self.api_key,
+                "anthropic-version": "2023-06-01",
+                "content-type": "application/json",
+                "accept": "application/json",
+                "user-agent": "poly-strategy/0.1",
+            },
+            method="POST",
+        )
+        with self._open_request(request, timeout) as response:
+            return json.loads(response.read().decode("utf-8"))
+
     def _open_request(self, request: Request, timeout: float):
         if self._opener is not None:
             return self._opener.open(request, timeout=timeout)
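
The `messages` branch differs from the chat branch mainly in where the system prompt goes: it becomes a top-level `system` field rather than the first entry of `messages`. The standalone sketch below contrasts the two shapes. `build_messages_payload` and `build_chat_payload` are hypothetical helpers, and the chat shape is the conventional one assumed here, since the hunk above cuts off before showing it. Note also that Anthropic's Messages API documents `max_tokens` as required, so a strict gateway might reject a payload built with `max_output_tokens=None`.

```python
from typing import Optional


def build_messages_payload(
    model: str,
    system_prompt: str,
    user_prompt: str,
    max_output_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
) -> dict:
    """Messages-style body: the system prompt is a top-level field."""
    payload = {
        "model": model,
        "system": system_prompt,
        "messages": [{"role": "user", "content": user_prompt}],
    }
    # Optional fields are only sent when configured, as in the diff.
    if max_output_tokens is not None:
        payload["max_tokens"] = max_output_tokens
    if temperature is not None:
        payload["temperature"] = temperature
    return payload


def build_chat_payload(model: str, system_prompt: str, user_prompt: str) -> dict:
    """Chat-completions-style body (assumed shape): system prompt is a message."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }
```
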
@@ -389,7 +426,7 @@ class OpenAICrossPlatformVerifierClient(OpenAIRuleDiscoveryClient):
     def build_payload(self, matches: Iterable[dict]) -> dict:
         rows = [_cross_platform_prompt_row(match) for match in matches]
         prompt_text = json.dumps({"matches": rows}, ensure_ascii=True, sort_keys=True)
-        if self.api_mode == "chat":
+        if self.api_mode in {"chat", "messages"}:
             required_keys = ", ".join(_CROSS_PLATFORM_RESPONSE_SCHEMA.get("required", []))
             output_contract = _chat_output_contract("polymarket_kalshi_cross_platform_verification")
             output_instruction = _chat_output_instruction("polymarket_kalshi_cross_platform_verification")
@@ -414,6 +451,17 @@ def build_payload(self, matches: Iterable[dict]) -> dict:
                 "Return only one JSON object matching the schema; no markdown, no prose.\n\n"
                 f"Input matches JSON:\n{prompt_text}"
             )
+            if self.api_mode == "messages":
+                payload = {
+                    "model": self.model,
+                    "system": chat_system_prompt,
+                    "messages": [{"role": "user", "content": chat_user_prompt}],
+                }
+                if self.max_output_tokens is not None:
+                    payload["max_tokens"] = self.max_output_tokens
+                if self.temperature is not None:
+                    payload["temperature"] = self.temperature
+                return payload
             payload = {
                 "model": self.model,
                 "messages": [
@@ -644,6 +692,15 @@ def _chat_completions_url(base_url: str) -> str:
     return f"{normalized}/v1/chat/completions"


+def _messages_url(base_url: str) -> str:
+    normalized = base_url.rstrip("/")
+    if normalized.endswith("/messages"):
+        return normalized
+    if normalized.endswith("/v1"):
+        return f"{normalized}/messages"
+    return f"{normalized}/v1/messages"
+
+
 def _parse_chat_stream_response(response) -> dict:
     content_parts = []
     for data in _iter_sse_data(response):
@@ -730,6 +787,10 @@ def _extract_output_text(response: dict) -> str:
     if isinstance(output_text, str) and output_text:
         return output_text

+    content_text = _content_value_to_text(response.get("content"))
+    if content_text:
+        return content_text
+
     output = response.get("output")
     if isinstance(output, list):
         for item in output:
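
The new branch reads a top-level `content` field, which is where a Messages-style response carries its text, typically as a list of blocks such as `{"type": "text", "text": "..."}`. The body of `_content_value_to_text` is not part of this diff, so the sketch below is only a guess at the kind of flattening it might perform; `content_blocks_to_text` is a hypothetical name.

```python
def content_blocks_to_text(content) -> str:
    """Flatten a Messages-style `content` value into one string.

    Hypothetical stand-in for `_content_value_to_text`, whose real
    implementation is not shown in this commit.
    """
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts = []
        for block in content:
            if isinstance(block, str):
                parts.append(block)
            elif isinstance(block, dict) and isinstance(block.get("text"), str):
                parts.append(block["text"])  # blocks without text are skipped
        return "".join(parts)
    return ""


if __name__ == "__main__":
    response = {"content": [{"type": "text", "text": '{"relations": []}'}]}
    print(content_blocks_to_text(response.get("content")))
```

Returning an empty string for a missing or unrecognized `content` value lets the caller fall through to the `output` handling that follows in `_extract_output_text`.
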
Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
+# LLM complex-scenario recognition: summary of findings (2026-05-13)
+
+## Conclusion
+
+Under the strict complex-scenario tests, usability differs widely across the three providers.
+
+## Strongest candidates
+
+1. `windhub / doubao-seed-1-8-251228 / messages`
+   - pass recall: `7/8`
+   - perfect: `6/8`
+   - avg recall: `0.95`
+   - median latency: `76.47s`
+   - Strongest semantics, but too slow for a high-frequency primary path.
+
+2. `windhub / deepseek-v3-2-251201 / messages`
+   - pass recall: `8/8`
+   - perfect: `3/8`
+   - avg recall: `0.91`
+   - median latency: `44.28s`
+   - The more balanced candidate for the primary path.
+
+3. `secondary / gemini-2.5-flash-nothinking / messages`
+   - pass recall: `7/8`
+   - perfect: `4/8`
+   - avg recall: `0.91`
+   - median latency: `9.20s`
+   - Fastest, but the production-path smoke test returned `HTTP 554`, so it is not suitable as the current default automatic backup.
+
+4. `elysiver / longcat-flash-chat / chat`
+   - pass recall: `8/8`
+   - perfect: `4/8`
+   - avg recall: `0.94`
+   - median latency: `20.08s`
+   - The most stable option on elysiver; balances speed and semantics.
+
+5. `elysiver / qwen3-max / messages`
+   - pass recall: `7/8`
+   - perfect: `3/8`
+   - avg recall: `0.87`
+   - median latency: `41.96s`
+   - Strong semantics, but slow.
+
+6. `secondary / gemini-3.1-pro-preview / chat_stream`
+   - pass recall: `6/8`
+   - perfect: `1/8`
+   - avg recall: `0.81`
+   - median latency: `9.71s`
+   - Weaker semantics than `gemini-2.5-flash-nothinking/messages`, but it passed the production CLI smoke test, so it suits the role of current default secondary backup.
+
+## Paths not recommended
+
+- `gpt-5.5-web-auto/messages` is blocked outright by moderation on elysiver.
+- `gemini-2.5-pro`, `gemini-3-flash-preview`, and `glm-5` return many 554s or are otherwise unstable on secondary.
+- `42-mini` and `42-pro` show low recall on complex semantics on elysiver.
+- `deepseek-v4-flash*` returns 504 almost every time on elysiver and is not worth further investment.
+
+## Practical recommendations
+
+- Primary path: `windhub/deepseek-v3-2-251201/messages`
+- High-semantics mode: `windhub/doubao-seed-1-8-251228/messages`
+- Low-latency semantic candidate: `secondary/gemini-2.5-flash-nothinking/messages`
+- Current default secondary backup: `secondary/gemini-3.1-pro-preview/chat`
+- Third backup: `elysiver/longcat-flash-chat/chat`
+
+## Production-path smoke tests
+
+- `windhub/deepseek-v3-2-251201/messages`: passed; found `1` implication in the 2-market threshold sample.
+- `secondary/gemini-2.5-flash-nothinking/messages`: failed; returned `HTTP 554`.
+- `secondary/gemini-3.1-pro-preview/chat`: passed; found `1` implication in the 2-market threshold sample.
+- `elysiver/longcat-flash-chat/chat`: passed; found `1` implication in the 2-market threshold sample.
+
+## Notes
+
+- The ranking here looks first at `perfect` and `pass recall`, then at `avg recall`, and only last at latency.
+- If the deployment leans toward high-frequency use, prefer a primary-path plus backup-path combination over chasing the highest recall alone.
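
The ranking rule in the notes can be read as a lexicographic sort. The sketch below applies one such reading (`perfect`, then `pass recall`, then `avg recall`, all descending, with median latency ascending as the final tie-breaker) to the six candidates listed in the document. The tuple layout and the `rank_key` helper are illustrative assumptions, and a strict sort of this kind may order the candidates differently from the numbered list.

```python
# (path, perfect out of 8, pass recall out of 8, avg recall, median latency in s)
# Figures are copied from the summary document in this commit.
CANDIDATES = [
    ("windhub/doubao-seed-1-8-251228/messages", 6, 7, 0.95, 76.47),
    ("windhub/deepseek-v3-2-251201/messages", 3, 8, 0.91, 44.28),
    ("secondary/gemini-2.5-flash-nothinking/messages", 4, 7, 0.91, 9.20),
    ("elysiver/longcat-flash-chat/chat", 4, 8, 0.94, 20.08),
    ("elysiver/qwen3-max/messages", 3, 7, 0.87, 41.96),
    ("secondary/gemini-3.1-pro-preview/chat_stream", 1, 6, 0.81, 9.71),
]


def rank_key(row):
    """Sort key: higher quality metrics first, lower latency last."""
    _path, perfect, pass_recall, avg_recall, latency = row
    return (-perfect, -pass_recall, -avg_recall, latency)


RANKED = sorted(CANDIDATES, key=rank_key)

if __name__ == "__main__":
    for position, row in enumerate(RANKED, start=1):
        print(position, row[0])
```

A purely metric-driven key like this ignores the smoke-test outcomes, which the recommendations above also take into account.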
