Commit e1f2e17

Add external fact signal pipeline spec
1 parent 6a2ebff commit e1f2e17

1 file changed

Lines changed: 398 additions & 0 deletions

@@ -0,0 +1,398 @@
# External Fact Signal Pipeline Design

Date: 2026-05-15
Audience: PR discussion with Soli22de
Status: research/design spec, not an implementation plan yet

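## Discussion Prompt for Soli22de

This time the discussion is not about continuing the maker/cross-platform arb work, but about switching to an external fact signal direction:

- Start from Polymarket's active markets and resolution rules, and work out which external fact each market is actually waiting on.
- Then watch only a small number of trusted sources, such as official/API/RSS sources like SEC EDGAR, Federal Register, and CourtListener.
- When a new document is published, use an LLM to extract the unstructured content into structured facts.
- Then map the facts back to `market_id` and emit the `external_signal` rows the project can already consume.
- Finally, produce a shadow report to check whether the fact really preceded the price change, and whether any tradable margin remains after spread/depth.

The key point is not to build a web-wide crawler, and not to go straight to trading. What the first phase needs to prove is whether this pipeline can reliably find new facts that "others cannot easily structure, but the market cares about".
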
## Goal

Build a research pipeline that finds newly published, messy external facts and converts them into structured Polymarket signals.

The thesis is not "crawl more websites". The thesis is:

1. Start from live Polymarket markets and their resolution conditions.
2. Infer what external fact would actually resolve or materially move each market.
3. Watch trusted source families where those facts can appear first.
4. Use an LLM to turn raw/unstructured documents into structured facts.
5. Map facts back to market IDs and measure whether price moved after the fact appeared.

If this works, the edge comes from organizing unstructured public information faster and more consistently than other participants, not from generic arbitrage.

## Why This Exists

Recent research has weakened the case for the pure maker/cross-platform arbitrage direction. Existing infrastructure can collect Polymarket markets, rules, order books, watchlists, external signal rows, realtime logs, and paper reports, but the current `external_signal` input is mostly a generic ingestion hook.

The next promising direction is an information pipeline:

- A market asks a question.
- The market's rules imply a trusted resolution source or fact type.
- A new external document is published.
- The LLM extracts the relevant fact into a strict schema.
- The system tests whether that fact arrived before a tradable price adjustment.

This should be implemented as shadow research first. No live trading is in scope.

## Current Repo Context

Useful existing pieces:

- `poly_strategy/external_signals.py` normalizes external scanner payloads into `external_signal` NDJSON.
- `poly_strategy/watchlist.py` boosts markets that appear in `external_signal` rows and keeps related neg-risk groups together.
- `scripts/refresh_external_signals.sh` can refresh `data/external-signals.ndjson` and rebuild the discovery watchlist.
- `scripts/run_realtime_monitor.sh` already accepts `EXTERNAL_SIGNALS=data/external-signals.ndjson` when building the realtime watchlist.
- `scripts/background_manager.sh` can periodically run external-signal refreshes under tmux-style orchestration.
- Existing LLM/rule-discovery work can be reused to understand market rules and resolution text.

The gap: there is no source-specific pipeline that says, for a given market, which external fact to monitor, where to monitor it, how to extract it, and how to validate whether it created tradable edge.

## Recommended Approach

Use a targeted source-connector approach, not broad crawling.

Broad crawling is expensive and noisy. It creates too much content for the LLM, too many false positives, and too many site-specific maintenance problems. It also makes it hard to prove that a signal arrived before the market moved.

Recommended first version:

1. Classify markets into external fact needs.
2. Enable only a few trusted source families with APIs/RSS/search endpoints.
3. Store raw documents and extracted facts separately.
4. Map facts to markets with deterministic filters plus LLM verification.
5. Run a forward-return report before considering any trading logic.

## Source Families for MVP

These source families are good first candidates because they are public, structured enough to collect, and often contain market-moving facts.

| Source family | Use case | Why it is worth testing | Official docs |
|---|---|---|---|
| SEC EDGAR | company filings, 8-K, 10-K, 10-Q, ownership, company facts | official corporate disclosures can resolve or move business/company markets | https://www.sec.gov/search-filings/edgar-application-programming-interfaces |
| Federal Register | US agency rules, notices, executive/regulatory actions | official government publication stream; useful for regulation/policy markets | https://www.federalregister.gov/reader-aids/developer-resources/rest-api |
| CourtListener | US legal opinions/dockets when available | court outcomes can resolve legal/case markets | https://www.courtlistener.com/help/api/rest/ |
| GDELT | discovery only, not final truth source | can help discover global news/doc mentions, but should not be treated as authoritative resolution evidence | https://www.gdeltproject.org/data.html |

Important: GDELT/news-like sources should be discovery inputs only unless the market rules explicitly allow them as resolution sources. For resolution-sensitive markets, prefer official documents.

## Non-Goals

- Do not build a crawler that scrapes every website.
- Do not scrape social media firehoses in the first PR.
- Do not turn signals into live orders.
- Do not add secrets, API keys, or paid data dependencies to tracked files.
- Do not revive the maker/cross-platform arbitrage thesis in this task.
- Do not assume every LLM match is a tradable edge; validation must prove it.

## Data Flow

```text
Polymarket Gamma / existing market cache
  -> market fact-needs classifier
  -> source-specific queries
  -> raw external documents NDJSON
  -> LLM fact extractor
  -> market/fact mapper
  -> external_signal NDJSON
  -> watchlist priority boost + realtime monitor
  -> forward-return signal report
```

The `external_signal` file remains the integration boundary with current watchlist/realtime code. New work should happen before that boundary and after it in reporting.

## Component 1: Market Fact-Needs Classifier

Purpose: convert each active market into a small set of factual monitoring needs.

Input:

- market ID / condition ID / tokens
- title/question
- description and rules text
- end date
- category/tags
- any known resolution source URL/text from existing rule extraction

Output row schema:

```json
{
  "type": "market_fact_need",
  "schema_version": 1,
  "market_id": "...",
  "condition_id": "...",
  "question": "...",
  "event_type": "regulatory_notice|sec_filing|court_decision|election|economic_release|sports_result|other",
  "entities": [
    {"name": "...", "kind": "company|agency|court|person|country|team|other", "aliases": ["..."]}
  ],
  "needed_fact": "Plain-English fact that would resolve or materially update the market.",
  "deadline_utc": "2026-...",
  "preferred_source_family": "sec_edgar|federal_register|courtlistener|gdelt_discovery|manual",
  "source_query": {"query": "...", "filters": {"...": "..."}},
  "resolution_source_hint": "...",
  "automation_eligible": true,
  "confidence": 0.0,
  "notes": "..."
}
```

Rules:

- If the resolution rules name a source, prefer that source.
- If no reliable source can be inferred, mark `automation_eligible=false`.
- Do not send all markets to every connector. The classifier must narrow the source family first.
- Keep a reason/evidence field (for example `notes` or `resolution_source_hint` in the schema above) so manual review can audit why a market was routed to a source.

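The routing rules above can be sketched as a small rule-based function. This is only an illustration under stated assumptions: `SOURCE_HINTS`, `infer_source_family`, and `route_fact_need` are hypothetical names that do not exist in `poly_strategy`, and the keyword hints are placeholders for whatever the existing rule extraction provides.

```python
from typing import Optional

# Hypothetical keyword hints per source family; a real classifier would
# likely combine the existing rule extraction with an LLM pass.
SOURCE_HINTS = {
    "sec_edgar": ("sec.gov", "edgar", "8-k", "10-k", "10-q"),
    "federal_register": ("federalregister.gov", "federal register"),
    "courtlistener": ("courtlistener.com", "court opinion", "docket"),
}


def infer_source_family(rules_text: str) -> Optional[str]:
    """Return the first source family whose hint appears in the rules text."""
    lowered = rules_text.lower()
    for family, hints in SOURCE_HINTS.items():
        if any(hint in lowered for hint in hints):
            return family
    return None


def route_fact_need(market_id: str, question: str, rules_text: str) -> dict:
    """Build a partial market_fact_need row with an auditable routing reason."""
    family = infer_source_family(rules_text)
    return {
        "type": "market_fact_need",
        "schema_version": 1,
        "market_id": market_id,
        "question": question,
        # No reliable source inferred: fall back to manual review.
        "preferred_source_family": family or "manual",
        "automation_eligible": family is not None,
        "notes": (
            f"rules text matched a {family} hint"
            if family
            else "no trusted source named in rules text"
        ),
    }
```

Only markets with `automation_eligible` set to true would be handed to a connector, which keeps the "narrow the source family first" rule enforceable in one place.
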
## Component 2: Source Connectors

Purpose: collect raw documents from a small set of source families using official APIs where possible.

Suggested CLI shape:

```bash
python3 -m poly_strategy.cli collect-external-docs \
  --fact-needs data/market-fact-needs.ndjson \
  --source sec_edgar \
  --out data/external-raw-docs.ndjson
```

Initial connectors:

- `sec_edgar`: query company submissions/company facts and recent filings for relevant company entities.
- `federal_register`: query recent documents by agency/search term/date window.
- `courtlistener`: query opinions/dockets by court/case/entity where the market rules make this useful.
- `gdelt_discovery`: optional discovery-only connector for mentions, not final evidence.

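As an illustration of the `sec_edgar` connector, the sketch below only builds request objects and performs no network I/O. The helper names are hypothetical. The URL shapes follow the public `data.sec.gov` submissions and company-facts endpoints, which key on a zero-padded 10-digit CIK, and the `User-Agent` header reflects the SEC's request that automated clients identify themselves; both should be checked against the official docs linked in the table above before implementation.

```python
import urllib.request

SEC_DATA_BASE = "https://data.sec.gov"


def edgar_submissions_url(cik: int) -> str:
    """Filing history for one company; the CIK is zero-padded to 10 digits."""
    return f"{SEC_DATA_BASE}/submissions/CIK{cik:010d}.json"


def edgar_company_facts_url(cik: int) -> str:
    """XBRL company facts for one company."""
    return f"{SEC_DATA_BASE}/api/xbrl/companyfacts/CIK{cik:010d}.json"


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Attach a descriptive User-Agent so the client is identifiable."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Example: prepare (but do not send) a request. The contact string is a placeholder.
request = build_request(
    edgar_submissions_url(320193),
    "poly-strategy-research contact@example.com",
)
```

Keeping URL construction separate from fetching makes the connector testable with fixtures and without live network access.
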
Raw document schema:

```json
{
  "type": "raw_external_doc",
  "schema_version": 1,
  "source_family": "sec_edgar|federal_register|courtlistener|gdelt_discovery",
  "source_id": "stable-source-document-id",
  "url": "https://...",
  "published_at": "2026-...Z",
  "retrieved_at": "2026-...Z",
  "title": "...",
  "entities_hint": ["..."],
  "market_ids_hint": ["..."],
  "body_text": "short normalized text or extracted relevant sections",
  "raw": {"source_specific": "payload"}
}
```

Connector requirements:

- Deduplicate by `source_family + source_id`.
- Preserve `published_at` and `retrieved_at`; validation depends on time ordering.
- Store enough evidence text for LLM extraction, but avoid dumping huge documents when a relevant section can be selected.
- Fail soft per source; one broken connector should not stop other connectors.
- Respect public API rate limits and identify the client where required.

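The dedupe and fail-soft requirements above could look roughly like the following. It is a minimal sketch, assuming each connector is exposed as a zero-argument callable that yields `raw_external_doc` dicts; `doc_key`, `load_seen_keys`, and `collect` are hypothetical names, not existing repo functions.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, Set, Tuple


def doc_key(doc: dict) -> Tuple[str, str]:
    """Dedupe key: source_family + source_id."""
    return (doc["source_family"], doc["source_id"])


def load_seen_keys(path: Path) -> Set[Tuple[str, str]]:
    """Read keys already present in the append-only NDJSON file."""
    seen: Set[Tuple[str, str]] = set()
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    seen.add(doc_key(json.loads(line)))
    return seen


def collect(connectors: Dict[str, Callable[[], Iterable[dict]]], out_path: Path) -> dict:
    """Run every connector, append unseen docs, and record per-source errors."""
    seen = load_seen_keys(out_path)
    stats = {}
    with out_path.open("a", encoding="utf-8") as out:
        for name, fetch in connectors.items():
            written = 0
            try:
                for doc in fetch():
                    key = doc_key(doc)
                    if key in seen:
                        continue  # already stored in this or an earlier run
                    seen.add(key)
                    out.write(json.dumps(doc, ensure_ascii=False) + "\n")
                    written += 1
                stats[name] = {"written": written, "error": None}
            except Exception as exc:  # fail soft: keep going with other sources
                stats[name] = {"written": written, "error": repr(exc)}
    return stats
```

The returned `stats` dict gives the report a per-source count and error string without any connector failure aborting the run.
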
## Component 3: LLM Fact Extractor

Purpose: convert raw documents into strict, auditable structured facts.

Suggested CLI shape:

```bash
python3 -m poly_strategy.cli extract-structured-facts \
  --raw-docs data/external-raw-docs.ndjson \
  --fact-needs data/market-fact-needs.ndjson \
  --out data/structured-facts.ndjson
```

Structured fact schema:

```json
{
  "type": "structured_fact",
  "schema_version": 1,
  "source_family": "sec_edgar",
  "source_id": "...",
  "url": "https://...",
  "published_at": "2026-...Z",
  "extracted_at": "2026-...Z",
  "canonical_fact": "One sentence factual claim.",
  "entities": [{"name": "...", "kind": "..."}],
  "event_type": "...",
  "trigger_status": "direct_trigger|material_update|weak_signal|unrelated",
  "evidence_text": "short quote or excerpt supporting the fact",
  "candidate_market_ids": ["..."],
  "confidence": 0.0,
  "risk_flags": ["ambiguous_entity", "stale_document", "not_resolution_source"]
}
```

LLM requirements:

- Use structured JSON output only.
- Include a short evidence excerpt and URL for every non-`unrelated` fact.
- Prefer recall in extraction, then precision in mapping.
- Mark ambiguity instead of forcing a match.
- The extractor must not create a trade recommendation.

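One way to enforce these requirements is to validate every LLM response before it is written to `data/structured-facts.ndjson`. The sketch below is an assumption about how that check might look: `validate_structured_fact` and the `FORBIDDEN_KEYS` deny-list are hypothetical, and only a subset of the schema is checked.

```python
import json
from typing import List, Optional, Tuple

TRIGGER_STATUSES = {"direct_trigger", "material_update", "weak_signal", "unrelated"}
# Hypothetical deny-list: keys that would indicate a trade recommendation.
FORBIDDEN_KEYS = {"recommendation", "order", "position_size"}


def validate_structured_fact(raw_output: str) -> Tuple[Optional[dict], List[str]]:
    """Parse one LLM response; return (fact, errors), with fact=None on failure."""
    try:
        fact = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(fact, dict):
        return None, ["top-level value must be an object"]

    errors: List[str] = []
    if fact.get("type") != "structured_fact":
        errors.append("type must be 'structured_fact'")
    status = fact.get("trigger_status")
    if status not in TRIGGER_STATUSES:
        errors.append("unknown trigger_status")
    confidence = fact.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        errors.append("confidence must be a number in [0, 1]")
    if status != "unrelated":
        # Every non-unrelated fact needs an evidence excerpt and a URL.
        for field in ("evidence_text", "url"):
            if not str(fact.get(field) or "").strip():
                errors.append(f"{field} is required for non-unrelated facts")
    leaked = FORBIDDEN_KEYS & set(fact)
    if leaked:
        errors.append(f"trade-recommendation keys present: {sorted(leaked)}")
    return (fact if not errors else None), errors
```

Rejected responses can be logged with their error list, which also gives the manual audit a view of how often the model drifts from the schema.
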
## Component 4: Market/Fact Mapper

Purpose: decide whether a structured fact applies to one or more markets and convert high-confidence matches into existing `external_signal` rows.

Mapping should use two layers:

1. Deterministic filters: entity aliases, source family, date window, market status, resolution source hints.
2. LLM verification: compare `canonical_fact + evidence_text` against the market question/rules.

External signal output should keep compatibility with `poly_strategy/external_signals.py`, but add useful fields under `raw`:

```json
{
  "type": "external_signal",
  "schema_version": 1,
  "source": "external_fact_pipeline",
  "source_id": "sec_edgar:...:market_id",
  "ts": "2026-...Z",
  "kind": "external_fact",
  "event_title": "...",
  "quoted_edge": null,
  "quoted_roi": null,
  "quoted_depth": null,
  "legs": [
    {"venue": "polymarket", "market_id": "...", "token": null, "side": "watch"}
  ],
  "raw": {
    "source_family": "sec_edgar",
    "url": "https://...",
    "published_at": "2026-...Z",
    "canonical_fact": "...",
    "trigger_status": "direct_trigger",
    "evidence_text": "...",
    "mapping_confidence": 0.0,
    "expected_direction": "yes|no|unknown",
    "risk_flags": []
  }
}
```

`expected_direction` may be useful for evaluation, but it must be optional. If direction is uncertain, emit `unknown` and let reporting separate directional from non-directional signals.

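The deterministic layer and the conversion to an `external_signal` row might be sketched as follows. The function names are hypothetical, the LLM verification layer is not shown (its result is passed in as `mapping_confidence`), and using the market question as `event_title` is an assumption rather than something the schema specifies.

```python
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp that may end in 'Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def passes_deterministic_filters(fact: dict, need: dict) -> bool:
    """Cheap checks run before any LLM verification call."""
    if fact["source_family"] != need["preferred_source_family"]:
        return False
    if parse_utc(fact["published_at"]) > parse_utc(need["deadline_utc"]):
        return False  # published after the market deadline
    fact_names = {e["name"].lower() for e in fact["entities"]}
    need_names = set()
    for entity in need["entities"]:
        need_names.add(entity["name"].lower())
        need_names.update(alias.lower() for alias in entity.get("aliases", []))
    return bool(fact_names & need_names)


def to_external_signal(fact: dict, need: dict, mapping_confidence: float,
                       now: datetime, expected_direction: str = "unknown") -> dict:
    """Convert one verified fact/market pair into an external_signal row."""
    return {
        "type": "external_signal",
        "schema_version": 1,
        "source": "external_fact_pipeline",
        "source_id": f'{fact["source_family"]}:{fact["source_id"]}:{need["market_id"]}',
        "ts": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "kind": "external_fact",
        "event_title": need["question"],  # assumption: reuse the market question
        "quoted_edge": None,
        "quoted_roi": None,
        "quoted_depth": None,
        "legs": [{"venue": "polymarket", "market_id": need["market_id"],
                  "token": None, "side": "watch"}],
        "raw": {
            "source_family": fact["source_family"],
            "url": fact["url"],
            "published_at": fact["published_at"],
            "canonical_fact": fact["canonical_fact"],
            "trigger_status": fact["trigger_status"],
            "evidence_text": fact["evidence_text"],
            "mapping_confidence": mapping_confidence,
            "expected_direction": expected_direction,
            "risk_flags": fact.get("risk_flags", []),
        },
    }
```

Running the deterministic filters first keeps LLM verification calls limited to pairs that already agree on source family, date window, and entity.
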
## Component 5: Forward-Return Signal Report

Purpose: prove or disprove edge before any trading work.

Suggested CLI shape:

```bash
python3 -m poly_strategy.cli external-fact-signal-report \
  --signals data/external-signals.ndjson \
  --books data/realtime-books.ndjson \
  --out reports/external-fact-signal-report-YYYY-MM-DD.md
```

Metrics:

- Number of active markets classified.
- Number of fact needs by source family.
- Number of raw documents collected by source family.
- Number of structured facts by trigger status.
- Number of mapped signals by source family and market type.
- Manual audit precision on a sample of mapped signals.
- Latency from `published_at` to `retrieved_at`, and from `retrieved_at` to `external_signal.ts`.
- Forward mid-price movement at 5m, 30m, 2h, 24h.
- Directional hit rate where `expected_direction` is known.
- Simulated tradability after spread/depth, not just mid-price movement.

Validation rules:

- Count a signal only if the raw document's `published_at` is before the measured market move.
- Separate "fact arrived after price moved" from "fact arrived before price moved".
- Do not claim edge from stale documents.
- Report spread-adjusted and depth-limited results separately from raw mid-price movement.

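To make the difference between mid-price movement and spread-adjusted movement concrete, here is a minimal sketch. It assumes order-book snapshots for the YES token are available as sorted `(unix_ts, best_bid, best_ask)` tuples, which is not necessarily the layout of `data/realtime-books.ndjson`; it ignores depth entirely, and the names are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

Snapshot = Tuple[float, float, float]  # (unix_ts, best_bid, best_ask), YES token
HORIZONS_S = {"5m": 300, "30m": 1800, "2h": 7200, "24h": 86400}


def snapshot_at(books: List[Snapshot], ts: float) -> Optional[Snapshot]:
    """Latest snapshot at or before ts; books must be sorted by timestamp."""
    idx = bisect_right([b[0] for b in books], ts)
    return books[idx - 1] if idx else None


def forward_returns(books: List[Snapshot], signal_ts: float,
                    direction: str) -> Dict[str, dict]:
    """Mid-price move and spread-adjusted move per horizon after a signal."""
    start = snapshot_at(books, signal_ts)
    results: Dict[str, dict] = {}
    if start is None:
        return results
    _, bid0, ask0 = start
    for label, horizon in HORIZONS_S.items():
        if books[-1][0] < signal_ts + horizon:
            continue  # book data does not cover this horizon yet
        _, bid1, ask1 = snapshot_at(books, signal_ts + horizon)
        row = {
            "mid_move": (bid1 + ask1) / 2 - (bid0 + ask0) / 2,
            "spread_adjusted": None,  # stays None for non-directional signals
        }
        if direction == "yes":
            row["spread_adjusted"] = bid1 - ask0  # buy at ask, exit at bid
        elif direction == "no":
            row["spread_adjusted"] = bid0 - ask1  # mirror trade via the complement
        results[label] = row
    return results
```

For example, with a book of 0.40/0.42 at the signal and 0.50/0.52 five minutes later, the mid moves by about 0.10 while a `yes` trade that crosses the spread both ways captures about 0.08. A depth-limited variant would additionally cap size at the quantity available at those prices.
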
## Acceptance Criteria for First PR Series

A good first implementation should satisfy all of these before we call the direction promising:

- Produces `market_fact_need` rows for a representative active-market sample.
- Routes markets to at least three source families: SEC EDGAR, Federal Register, and CourtListener.
- Writes deduped `raw_external_doc` rows with source URL, `published_at`, and `retrieved_at`.
- Writes `structured_fact` rows with evidence text and confidence.
- Emits compatible `external_signal` rows for mapped Polymarket markets.
- Existing watchlist/realtime flow can consume the emitted `external_signal` file without code rewrites.
- Produces a Markdown/JSON report that separates coverage, precision, timeliness, price movement, and tradability.
- Includes tests for schema validation, deduplication, source routing, and mapper behavior.

## Kill Criteria

Stop or redesign the direction if the shadow run shows any of these:

- Fewer than 10 high-confidence mapped signals per week after source routing is working.
- Manual audit precision below 70% for mapped `direct_trigger` or `material_update` rows.
- Most signals are retrieved after the relevant market has already moved.
- Forward movement exists only in mid-price but disappears after spread/depth assumptions.
- Source maintenance cost is high because too many custom scrapers are required.

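The quantitative kill criteria can be checked mechanically at the end of each shadow week. The sketch below is illustrative only: `ShadowRunMetrics` and `kill_reasons` are hypothetical, "most signals" is interpreted as more than half, and the maintenance-cost criterion is left to human judgment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShadowRunMetrics:
    weekly_high_conf_signals: int
    audited_correct: int         # audited direct_trigger/material_update rows judged correct
    audited_total: int
    signals_before_move: int     # signals retrieved before the market moved
    signals_total: int
    mean_mid_move: float         # average forward mid-price move
    mean_spread_adjusted: float  # same move after spread/depth assumptions


def kill_reasons(m: ShadowRunMetrics) -> List[str]:
    """Return the kill criteria that the shadow run currently trips."""
    reasons = []
    if m.weekly_high_conf_signals < 10:
        reasons.append("fewer than 10 high-confidence mapped signals per week")
    if m.audited_total and m.audited_correct / m.audited_total < 0.70:
        reasons.append("manual audit precision below 70%")
    if m.signals_total and m.signals_before_move / m.signals_total < 0.5:
        reasons.append("most signals retrieved after the market moved")
    if m.mean_mid_move > 0 and m.mean_spread_adjusted <= 0:
        reasons.append("movement disappears after spread/depth assumptions")
    return reasons
```

An empty list means none of the numeric criteria fired; it does not by itself establish that the direction is promising.
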
## Suggested PR Split

PR 1: market fact-needs classifier and report

- Add `market_fact_need` schema and CLI.
- Use existing market/rule caches.
- No external collection yet.
- Output a report showing which markets are automatable and which source family they need.

PR 2: official source connectors

- Add SEC EDGAR, Federal Register, and CourtListener collectors.
- Write `raw_external_doc` NDJSON.
- Include dedupe, timestamps, and rate-limit handling.
- Keep GDELT optional/discovery-only if added.

PR 3: LLM fact extractor and mapper

- Add structured fact extraction.
- Add deterministic + LLM market mapping.
- Emit compatible `external_signal` rows.

PR 4: outcome evaluation

- Join emitted signals to realtime/order-book snapshots.
- Produce forward-return and tradability reports.
- Include a manual audit sample export.

This split is preferred because each PR can be reviewed independently and can fail without blocking the others.

## Open Questions for Soli22de

- Is PR 1 enough as the first contribution, or should PR 1 include one source connector for end-to-end proof?
- Which market categories in current Gamma data look most suitable for SEC/Federal Register/CourtListener routing?
- Should the extractor use the current LLM profile stack, or should it define a separate cheaper/high-recall profile?
- How large should the manual audit sample be for the first shadow run?
- Should `expected_direction` be required only for `direct_trigger`, or optional for all signal rows?

## Implementation Notes

- Keep all new runtime outputs under `data/` or `reports/`; do not commit generated data.
- Prefer append-only NDJSON for raw docs, facts, and signals so tmux/background loops can run safely.
- Add compaction for new large NDJSON files only after the first connector proves useful.
- Use existing CLI style in `poly_strategy/cli.py` and existing test style under `tests/`.
- Keep source adapters isolated from LLM extraction so we can test them without provider keys.
- Add deterministic fixtures for unit tests; do not depend on live network in tests.

## Definition of Done

The direction is ready for deeper implementation when we have:

1. A working shadow pipeline from market fact need to external signal.
2. At least one week of reports showing coverage, precision, timeliness, and tradability.
3. Clear evidence that the pipeline finds facts the current monitor would not otherwise prioritize.
4. A decision on whether to expand sources, improve mapping, or stop the direction.
