|
| 1 | +# External Fact Signal Pipeline Design |
| 2 | + |
| 3 | +Date: 2026-05-15 |
| 4 | +Audience: PR discussion with Soli22de |
| 5 | +Status: research/design spec, not an implementation plan yet |
| 6 | + |
| 7 | +## Discussion Prompt for Soli22de |
| 8 | + |
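| 9 | +What I want to discuss this time is not continuing with maker/cross-platform arb, but switching to an external fact signal direction: |
| 10 | + |
| 11 | +- Start from Polymarket's active markets and resolution rules, and determine which external fact each market is actually waiting on. |
| 12 | +- Then watch only a few trusted sources, such as official/API/RSS sources like SEC EDGAR, Federal Register, and CourtListener. |
| 13 | +- After a new document is published, use an LLM to extract the unstructured content into structured facts. |
| 14 | +- Then map the facts back to market_id and produce the `external_signal` that the project can already consume. |
| 15 | +- Finally, produce a shadow report to check whether the fact really preceded the price change and whether any tradable margin remains after spread/depth. |
| 16 | + |
| 17 | +The key is not to build a whole-web crawler and not to go straight to trading. What the first phase must prove is whether this pipeline can reliably find new facts that are "hard for others to structure, but that the market cares about". |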
| 18 | + |
| 19 | +## Goal |
| 20 | + |
| 21 | +Build a research pipeline that finds newly published, messy external facts and converts them into structured Polymarket signals. |
| 22 | + |
| 23 | +The thesis is not "crawl more websites". The thesis is: |
| 24 | + |
| 25 | +1. Start from live Polymarket markets and their resolution conditions. |
| 26 | +2. Infer what external fact would actually resolve or materially move each market. |
| 27 | +3. Watch trusted source families where those facts can appear first. |
| 28 | +4. Use an LLM to turn raw/unstructured documents into structured facts. |
| 29 | +5. Map facts back to market IDs and measure whether price moved after the fact appeared. |
| 30 | + |
| 31 | +If this works, the edge comes from organizing unstructured public information faster and more consistently than other participants, not from generic arbitrage. |
| 32 | + |
| 33 | +## Why This Exists |
| 34 | + |
| 35 | +Recent research has weakened the case for the pure maker/cross-platform arbitrage direction. Existing infrastructure can collect Polymarket markets, rules, order books, watchlists, external signal rows, realtime logs, and paper reports, but the current `external_signal` input is mostly a generic ingestion hook. |
| 36 | + |
| 37 | +The next promising direction is an information pipeline: |
| 38 | + |
| 39 | +- Market asks a question. |
| 40 | +- The market's rules imply a trusted resolution source or fact type. |
| 41 | +- A new external document is published. |
| 42 | +- The LLM extracts the relevant fact into a strict schema. |
| 43 | +- The system tests whether that fact arrived before a tradable price adjustment. |
| 44 | + |
| 45 | +This should be implemented as shadow research first. No live trading is in scope. |
| 46 | + |
| 47 | +## Current Repo Context |
| 48 | + |
| 49 | +Useful existing pieces: |
| 50 | + |
| 51 | +- `poly_strategy/external_signals.py` normalizes external scanner payloads into `external_signal` NDJSON. |
| 52 | +- `poly_strategy/watchlist.py` boosts markets that appear in `external_signal` rows and keeps related neg-risk groups together. |
| 53 | +- `scripts/refresh_external_signals.sh` can refresh `data/external-signals.ndjson` and rebuild the discovery watchlist. |
| 54 | +- `scripts/run_realtime_monitor.sh` already accepts `EXTERNAL_SIGNALS=data/external-signals.ndjson` when building the realtime watchlist. |
| 55 | +- `scripts/background_manager.sh` can periodically run external-signal refreshes under tmux-style orchestration. |
| 56 | +- Existing LLM/rule-discovery work can be reused to understand market rules and resolution text. |
| 57 | + |
| 58 | +The gap: there is no source-specific pipeline that says, for a given market, which external fact to monitor, where to monitor it, how to extract it, and how to validate whether it created tradable edge. |
| 59 | + |
| 60 | +## Recommended Approach |
| 61 | + |
| 62 | +Use a targeted source-connector approach, not broad crawling. |
| 63 | + |
| 64 | +Broad crawling is expensive and noisy. It creates too much content for the LLM, too many false positives, and too many site-specific maintenance problems. It also makes it hard to prove that a signal arrived before the market moved. |
| 65 | + |
| 66 | +Recommended first version: |
| 67 | + |
| 68 | +1. Classify markets into external fact needs. |
| 69 | +2. Enable only a few trusted source families with APIs/RSS/search endpoints. |
| 70 | +3. Store raw documents and extracted facts separately. |
| 71 | +4. Map facts to markets with deterministic filters plus LLM verification. |
| 72 | +5. Run a forward-return report before considering any trading logic. |
| 73 | + |
| 74 | +## Source Families for MVP |
| 75 | + |
| 76 | +These source families are good first candidates because they are public, structured enough to collect, and often contain market-moving facts. |
| 77 | + |
| 78 | +| Source family | Use case | Why it is worth testing | Official docs | |
| 79 | +|---|---|---|---| |
| 80 | +| SEC EDGAR | company filings, 8-K, 10-K, 10-Q, ownership, company facts | official corporate disclosures can resolve or move business/company markets | https://www.sec.gov/search-filings/edgar-application-programming-interfaces | |
| 81 | +| Federal Register | US agency rules, notices, executive/regulatory actions | official government publication stream; useful for regulation/policy markets | https://www.federalregister.gov/reader-aids/developer-resources/rest-api | |
| 82 | +| CourtListener | US legal opinions/dockets when available | court outcomes can resolve legal/case markets | https://www.courtlistener.com/help/api/rest/ | |
| 83 | +| GDELT | discovery only, not final truth source | can help discover global news/doc mentions, but should not be treated as authoritative resolution evidence | https://www.gdeltproject.org/data.html | |
| 84 | + |
| 85 | +Important: GDELT/news-like sources should be discovery inputs only unless the market rules explicitly allow them as resolution sources. For resolution-sensitive markets, prefer official documents. |
| 86 | + |
| 87 | +## Non-Goals |
| 88 | + |
| 89 | +- Do not build a crawler that scrapes every website. |
| 90 | +- Do not scrape social media firehoses in the first PR. |
| 91 | +- Do not turn signals into live orders. |
| 92 | +- Do not add secrets, API keys, or paid data dependencies to tracked files. |
| 93 | +- Do not revive the maker/cross-platform arbitrage thesis in this task. |
| 94 | +- Do not assume every LLM match is a tradable edge; validation must prove it. |
| 95 | + |
| 96 | +## Data Flow |
| 97 | + |
| 98 | +```text |
| 99 | +Polymarket Gamma / existing market cache |
| 100 | + -> market fact-needs classifier |
| 101 | + -> source-specific queries |
| 102 | + -> raw external documents NDJSON |
| 103 | + -> LLM fact extractor |
| 104 | + -> market/fact mapper |
| 105 | + -> external_signal NDJSON |
| 106 | + -> watchlist priority boost + realtime monitor |
| 107 | + -> forward-return signal report |
| 108 | +``` |
| 109 | + |
| 110 | +The `external_signal` file remains the integration boundary with current watchlist/realtime code. New work should happen upstream of that boundary (classification, collection, extraction, mapping) or downstream of it in reporting. |
| 111 | + |
| 112 | +## Component 1: Market Fact-Needs Classifier |
| 113 | + |
| 114 | +Purpose: convert each active market into a small set of factual monitoring needs. |
| 115 | + |
| 116 | +Input: |
| 117 | + |
| 118 | +- market ID / condition ID / tokens |
| 119 | +- title/question |
| 120 | +- description and rules text |
| 121 | +- end date |
| 122 | +- category/tags |
| 123 | +- any known resolution source URL/text from existing rule extraction |
| 124 | + |
| 125 | +Output row schema: |
| 126 | + |
| 127 | +```json |
| 128 | +{ |
| 129 | + "type": "market_fact_need", |
| 130 | + "schema_version": 1, |
| 131 | + "market_id": "...", |
| 132 | + "condition_id": "...", |
| 133 | + "question": "...", |
| 134 | + "event_type": "regulatory_notice|sec_filing|court_decision|election|economic_release|sports_result|other", |
| 135 | + "entities": [ |
| 136 | + {"name": "...", "kind": "company|agency|court|person|country|team|other", "aliases": ["..."]} |
| 137 | + ], |
| 138 | + "needed_fact": "Plain-English fact that would resolve or materially update the market.", |
| 139 | + "deadline_utc": "2026-...", |
| 140 | + "preferred_source_family": "sec_edgar|federal_register|courtlistener|gdelt_discovery|manual", |
| 141 | + "source_query": {"query": "...", "filters": {"...": "..."}}, |
| 142 | + "resolution_source_hint": "...", |
| 143 | + "automation_eligible": true, |
| 144 | + "confidence": 0.0, |
| 145 | + "notes": "..." |
| 146 | +} |
| 147 | +``` |
| 148 | + |
| 149 | +Rules: |
| 150 | + |
| 151 | +- If the resolution rules name a source, prefer that source. |
| 152 | +- If no reliable source can be inferred, mark `automation_eligible=false`. |
| 153 | +- Do not send all markets to every connector. The classifier must narrow the source family first. |
| 154 | +- Keep a reason/evidence field so manual review can audit why a market was routed to a source. |
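
The routing rules above can be sketched as a small gate in front of the connectors. This is a minimal illustration, assuming rows follow the `market_fact_need` schema; the function name `route_fact_need`, the `MIN_CONFIDENCE` cutoff, and the choice to never route `manual` rows are hypothetical and not existing repo code.

```python
from typing import Optional

# Source families listed in the schema's preferred_source_family field.
KNOWN_FAMILIES = {
    "sec_edgar", "federal_register", "courtlistener", "gdelt_discovery", "manual",
}
MIN_CONFIDENCE = 0.5  # hypothetical cutoff, to be tuned during the shadow run


def route_fact_need(row: dict) -> Optional[str]:
    """Return the single connector that should see this market, or None."""
    if row.get("type") != "market_fact_need":
        return None
    # Markets without a reliable inferred source are never automated.
    if not row.get("automation_eligible", False):
        return None
    family = row.get("preferred_source_family")
    # "manual" rows stay with human review and are not sent to a connector.
    if family not in KNOWN_FAMILIES or family == "manual":
        return None
    if float(row.get("confidence", 0.0)) < MIN_CONFIDENCE:
        return None
    return family
```

Returning at most one family per row keeps the "do not send all markets to every connector" rule explicit in code rather than in connector configuration.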
| 155 | + |
| 156 | +## Component 2: Source Connectors |
| 157 | + |
| 158 | +Purpose: collect raw documents from a small set of source families using official APIs where possible. |
| 159 | + |
| 160 | +Suggested CLI shape: |
| 161 | + |
| 162 | +```bash |
| 163 | +python3 -m poly_strategy.cli collect-external-docs \ |
| 164 | + --fact-needs data/market-fact-needs.ndjson \ |
| 165 | + --source sec_edgar \ |
| 166 | + --out data/external-raw-docs.ndjson |
| 167 | +``` |
| 168 | + |
| 169 | +Initial connectors: |
| 170 | + |
| 171 | +- `sec_edgar`: query company submissions/company facts and recent filings for relevant company entities. |
| 172 | +- `federal_register`: query recent documents by agency/search term/date window. |
| 173 | +- `courtlistener`: query opinions/dockets by court/case/entity where the market rules make this useful. |
| 174 | +- `gdelt_discovery`: optional discovery-only connector for mentions, not final evidence. |
| 175 | + |
| 176 | +Raw document schema: |
| 177 | + |
| 178 | +```json |
| 179 | +{ |
| 180 | + "type": "raw_external_doc", |
| 181 | + "schema_version": 1, |
| 182 | + "source_family": "sec_edgar|federal_register|courtlistener|gdelt_discovery", |
| 183 | + "source_id": "stable-source-document-id", |
| 184 | + "url": "https://...", |
| 185 | + "published_at": "2026-...Z", |
| 186 | + "retrieved_at": "2026-...Z", |
| 187 | + "title": "...", |
| 188 | + "entities_hint": ["..."], |
| 189 | + "market_ids_hint": ["..."], |
| 190 | + "body_text": "short normalized text or extracted relevant sections", |
| 191 | + "raw": {"source_specific": "payload"} |
| 192 | +} |
| 193 | +``` |
| 194 | + |
| 195 | +Connector requirements: |
| 196 | + |
| 197 | +- Deduplicate by `source_family + source_id`. |
| 198 | +- Preserve `published_at` and `retrieved_at`; validation depends on time ordering. |
| 199 | +- Store enough evidence text for LLM extraction, but avoid dumping huge documents when a relevant section can be selected. |
| 200 | +- Fail soft per source; one broken connector should not stop other connectors. |
| 201 | +- Respect public API rate limits and identify the client where required. |
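
A minimal sketch of the `source_family + source_id` deduplication requirement, assuming raw docs are appended to an NDJSON file. The helper names `load_seen_keys` and `dedupe_raw_docs` are hypothetical; rows missing either key are dropped here as an assumption, since they cannot be deduplicated safely.

```python
import json
from typing import Iterable, Iterator, Set, Tuple

Key = Tuple[str, str]


def load_seen_keys(ndjson_lines: Iterable[str]) -> Set[Key]:
    """Collect (source_family, source_id) keys already present in the output file."""
    seen: Set[Key] = set()
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        seen.add((row.get("source_family"), row.get("source_id")))
    return seen


def dedupe_raw_docs(rows: Iterable[dict], seen: Set[Key]) -> Iterator[dict]:
    """Yield only rows whose key has not been seen; updates `seen` in place."""
    for row in rows:
        key = (row.get("source_family"), row.get("source_id"))
        if None in key or key in seen:
            continue
        seen.add(key)
        yield row
```

A connector would call `load_seen_keys` once on the existing file, then append only what `dedupe_raw_docs` yields, which keeps the file append-only.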
| 202 | + |
| 203 | +## Component 3: LLM Fact Extractor |
| 204 | + |
| 205 | +Purpose: convert raw documents into strict, auditable structured facts. |
| 206 | + |
| 207 | +Suggested CLI shape: |
| 208 | + |
| 209 | +```bash |
| 210 | +python3 -m poly_strategy.cli extract-structured-facts \ |
| 211 | + --raw-docs data/external-raw-docs.ndjson \ |
| 212 | + --fact-needs data/market-fact-needs.ndjson \ |
| 213 | + --out data/structured-facts.ndjson |
| 214 | +``` |
| 215 | + |
| 216 | +Structured fact schema: |
| 217 | + |
| 218 | +```json |
| 219 | +{ |
| 220 | + "type": "structured_fact", |
| 221 | + "schema_version": 1, |
| 222 | + "source_family": "sec_edgar", |
| 223 | + "source_id": "...", |
| 224 | + "url": "https://...", |
| 225 | + "published_at": "2026-...Z", |
| 226 | + "extracted_at": "2026-...Z", |
| 227 | + "canonical_fact": "One sentence factual claim.", |
| 228 | + "entities": [{"name": "...", "kind": "..."}], |
| 229 | + "event_type": "...", |
| 230 | + "trigger_status": "direct_trigger|material_update|weak_signal|unrelated", |
| 231 | + "evidence_text": "short quote or excerpt supporting the fact", |
| 232 | + "candidate_market_ids": ["..."], |
| 233 | + "confidence": 0.0, |
| 234 | + "risk_flags": ["ambiguous_entity", "stale_document", "not_resolution_source"] |
| 235 | +} |
| 236 | +``` |
| 237 | + |
| 238 | +LLM requirements: |
| 239 | + |
| 240 | +- Use structured JSON output only. |
| 241 | +- Include a short evidence excerpt and URL for every non-`unrelated` fact. |
| 242 | +- Prefer recall in extraction, then precision in mapping. |
| 243 | +- Mark ambiguity instead of forcing a match. |
| 244 | +- The extractor must not create a trade recommendation. |
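
The evidence requirement above lends itself to a post-extraction check that runs without any LLM provider. The sketch below assumes rows follow the `structured_fact` schema; `validate_structured_fact` is a hypothetical helper, and the `[0, 1]` range for `confidence` is an assumption because the schema only shows `0.0` as a placeholder.

```python
from typing import List

TRIGGER_STATUSES = {"direct_trigger", "material_update", "weak_signal", "unrelated"}


def validate_structured_fact(row: dict) -> List[str]:
    """Return a list of problems; an empty list means the row is acceptable."""
    errors: List[str] = []
    if row.get("type") != "structured_fact":
        errors.append("type must be structured_fact")

    status = row.get("trigger_status")
    if status not in TRIGGER_STATUSES:
        errors.append("unknown trigger_status")
    elif status != "unrelated":
        # Every non-unrelated fact must be auditable from its excerpt and URL.
        if not row.get("evidence_text"):
            errors.append("missing evidence_text")
        if not row.get("url"):
            errors.append("missing url")

    conf = row.get("confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0.0 <= conf <= 1.0:  # assumed range
        errors.append("confidence must be a number in [0, 1]")
    return errors
```

Rows that fail validation could be written to a reject file rather than dropped, so the manual audit can see what the extractor got wrong.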
| 245 | + |
| 246 | +## Component 4: Market/Fact Mapper |
| 247 | + |
| 248 | +Purpose: decide whether a structured fact applies to one or more markets and convert high-confidence matches into existing `external_signal` rows. |
| 249 | + |
| 250 | +Mapping should use two layers: |
| 251 | + |
| 252 | +1. Deterministic filters: entity aliases, source family, date window, market status, resolution source hints. |
| 253 | +2. LLM verification: compare `canonical_fact + evidence_text` against the market question/rules. |
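
The deterministic layer can be sketched as a cheap pre-filter that runs before any LLM verification call. This is an illustration only: `passes_deterministic_filters` is a hypothetical name, it covers just source family, date window, and entity aliases, and it assumes ISO 8601 timestamps (naive values are treated as UTC).

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' or missing offset means UTC."""
    ts = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def passes_deterministic_filters(fact: dict, need: dict) -> bool:
    """True if a structured_fact is worth sending to LLM verification for this need."""
    # Source family must match the family the classifier routed the market to.
    if fact.get("source_family") != need.get("preferred_source_family"):
        return False
    # Facts published after the market deadline cannot move or resolve it.
    if parse_ts(fact["published_at"]) > parse_ts(need["deadline_utc"]):
        return False
    # At least one fact entity must match a market entity name or alias.
    wanted = set()
    for entity in need.get("entities", []):
        wanted.add(entity["name"].casefold())
        wanted.update(alias.casefold() for alias in entity.get("aliases", []))
    found = {entity["name"].casefold() for entity in fact.get("entities", [])}
    return bool(wanted & found)
```

Exact case-insensitive matching is deliberately strict; fuzzy matching is left to the LLM verification layer, where ambiguity can be flagged instead of silently accepted.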
| 254 | + |
| 255 | +External signal output should keep compatibility with `poly_strategy/external_signals.py`, but add useful fields under `raw`: |
| 256 | + |
| 257 | +```json |
| 258 | +{ |
| 259 | + "type": "external_signal", |
| 260 | + "schema_version": 1, |
| 261 | + "source": "external_fact_pipeline", |
| 262 | + "source_id": "sec_edgar:...:market_id", |
| 263 | + "ts": "2026-...Z", |
| 264 | + "kind": "external_fact", |
| 265 | + "event_title": "...", |
| 266 | + "quoted_edge": null, |
| 267 | + "quoted_roi": null, |
| 268 | + "quoted_depth": null, |
| 269 | + "legs": [ |
| 270 | + {"venue": "polymarket", "market_id": "...", "token": null, "side": "watch"} |
| 271 | + ], |
| 272 | + "raw": { |
| 273 | + "source_family": "sec_edgar", |
| 274 | + "url": "https://...", |
| 275 | + "published_at": "2026-...Z", |
| 276 | + "canonical_fact": "...", |
| 277 | + "trigger_status": "direct_trigger", |
| 278 | + "evidence_text": "...", |
| 279 | + "mapping_confidence": 0.0, |
| 280 | + "expected_direction": "yes|no|unknown", |
| 281 | + "risk_flags": [] |
| 282 | + } |
| 283 | +} |
| 284 | +``` |
| 285 | + |
| 286 | +`expected_direction` may be useful for evaluation, but it must be optional. If direction is uncertain, emit `unknown` and let reporting separate directional from non-directional signals. |
| 287 | + |
| 288 | +## Component 5: Forward-Return Signal Report |
| 289 | + |
| 290 | +Purpose: prove or disprove edge before any trading work. |
| 291 | + |
| 292 | +Suggested CLI shape: |
| 293 | + |
| 294 | +```bash |
| 295 | +python3 -m poly_strategy.cli external-fact-signal-report \ |
| 296 | + --signals data/external-signals.ndjson \ |
| 297 | + --books data/realtime-books.ndjson \ |
| 298 | + --out reports/external-fact-signal-report-YYYY-MM-DD.md |
| 299 | +``` |
| 300 | + |
| 301 | +Metrics: |
| 302 | + |
| 303 | +- Number of active markets classified. |
| 304 | +- Number of fact needs by source family. |
| 305 | +- Number of raw documents collected by source family. |
| 306 | +- Number of structured facts by trigger status. |
| 307 | +- Number of mapped signals by source family and market type. |
| 308 | +- Manual audit precision on a sample of mapped signals. |
| 309 | +- Latency from `published_at` to `retrieved_at`, and from `retrieved_at` to `external_signal.ts`. |
| 310 | +- Forward mid-price movement at 5m, 30m, 2h, 24h. |
| 311 | +- Directional hit rate where `expected_direction` is known. |
| 312 | +- Simulated tradability after spread/depth, not just mid-price movement. |
| 313 | + |
| 314 | +Validation rules: |
| 315 | + |
| 316 | +- Count a signal only if the raw document's `published_at` is before the measured market move. |
| 317 | +- Separate "fact arrived after price moved" from "fact arrived before price moved". |
| 318 | +- Do not claim edge from stale documents. |
| 319 | +- Report spread-adjusted and depth-limited results separately from raw mid-price movement. |
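
A rough sketch of how forward mid-price movement and a spread-adjusted result could be computed for one signal. The input shape (parallel lists of epoch-second timestamps and mid prices) is an assumption, not the actual layout of `data/realtime-books.ndjson`; both function names are hypothetical, and depth limits are not modeled here.

```python
from bisect import bisect_left, bisect_right
from typing import Optional, Sequence


def forward_mid_move(
    book_ts: Sequence[float], mids: Sequence[float], signal_ts: float, horizon_s: float
) -> Optional[float]:
    """Mid at the first snapshot at or after signal_ts + horizon_s, minus mid at
    the last snapshot at or before signal_ts. book_ts must be sorted ascending.
    Returns None when either snapshot is missing."""
    start = bisect_right(book_ts, signal_ts) - 1
    end = bisect_left(book_ts, signal_ts + horizon_s)
    if start < 0 or end >= len(book_ts):
        return None
    return mids[end] - mids[start]


def spread_adjusted_move(
    mid_move: float, direction: int, entry_spread: float, exit_spread: float
) -> float:
    """Per-share result of trading in `direction` (+1 for yes, -1 for no),
    assuming half the quoted spread is paid on entry and again on exit."""
    cost = entry_spread / 2 + exit_spread / 2
    return direction * mid_move - cost
```

For example, a 0.12 mid move in the expected direction with spreads of 0.02 at entry and 0.04 at exit leaves about 0.09 after costs, which is the figure the report should show next to the raw mid move.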
| 320 | + |
| 321 | +## Acceptance Criteria for First PR Series |
| 322 | + |
| 323 | +A good first implementation should satisfy all of these before we call the direction promising: |
| 324 | + |
| 325 | +- Produces `market_fact_need` rows for a representative active-market sample. |
| 326 | +- Routes markets to at least three source families: SEC EDGAR, Federal Register, and CourtListener. |
| 327 | +- Writes deduped `raw_external_doc` rows with source URL, `published_at`, and `retrieved_at`. |
| 328 | +- Writes `structured_fact` rows with evidence text and confidence. |
| 329 | +- Emits compatible `external_signal` rows for mapped Polymarket markets. |
| 330 | +- Existing watchlist/realtime flow can consume the emitted `external_signal` file without code rewrites. |
| 331 | +- Produces a Markdown/JSON report that separates coverage, precision, timeliness, price movement, and tradability. |
| 332 | +- Includes tests for schema validation, deduplication, source routing, and mapper behavior. |
| 333 | + |
| 334 | +## Kill Criteria |
| 335 | + |
| 336 | +Stop or redesign the direction if the shadow run shows any of these: |
| 337 | + |
| 338 | +- Fewer than 10 high-confidence mapped signals per week after source routing is working. |
| 339 | +- Manual audit precision below 70% for mapped `direct_trigger` or `material_update` rows. |
| 340 | +- Most signals are retrieved after the relevant market has already moved. |
| 341 | +- Forward movement exists only in mid-price but disappears after spread/depth assumptions. |
| 342 | +- Source maintenance cost is high because too many custom scrapers are required. |
| 343 | + |
| 344 | +## Suggested PR Split |
| 345 | + |
| 346 | +PR 1: market fact-needs classifier and report |
| 347 | + |
| 348 | +- Add `market_fact_need` schema and CLI. |
| 349 | +- Use existing market/rule caches. |
| 350 | +- No external collection yet. |
| 351 | +- Output a report showing which markets are automatable and which source family they need. |
| 352 | + |
| 353 | +PR 2: official source connectors |
| 354 | + |
| 355 | +- Add SEC EDGAR, Federal Register, and CourtListener collectors. |
| 356 | +- Write `raw_external_doc` NDJSON. |
| 357 | +- Include dedupe, timestamps, and rate-limit handling. |
| 358 | +- Keep GDELT optional/discovery-only if added. |
| 359 | + |
| 360 | +PR 3: LLM fact extractor and mapper |
| 361 | + |
| 362 | +- Add structured fact extraction. |
| 363 | +- Add deterministic + LLM market mapping. |
| 364 | +- Emit compatible `external_signal` rows. |
| 365 | + |
| 366 | +PR 4: outcome evaluation |
| 367 | + |
| 368 | +- Join emitted signals to realtime/order-book snapshots. |
| 369 | +- Produce forward-return and tradability reports. |
| 370 | +- Include a manual audit sample export. |
| 371 | + |
| 372 | +This split is preferred because each PR can be reviewed independently and can fail without blocking the others. |
| 373 | + |
| 374 | +## Open Questions for Soli22de |
| 375 | + |
| 376 | +- Is PR 1 enough as the first contribution, or should PR 1 include one source connector for end-to-end proof? |
| 377 | +- Which market categories in current Gamma data look most suitable for SEC/Federal Register/CourtListener routing? |
| 378 | +- Should the extractor use the current LLM profile stack, or should it define a separate cheaper/high-recall profile? |
| 379 | +- How large should the manual audit sample be for the first shadow run? |
| 380 | +- Should `expected_direction` be required only for `direct_trigger`, or optional for all signal rows? |
| 381 | + |
| 382 | +## Implementation Notes |
| 383 | + |
| 384 | +- Keep all new runtime outputs under `data/` or `reports/`; do not commit generated data. |
| 385 | +- Prefer append-only NDJSON for raw docs, facts, and signals so tmux/background loops can run safely. |
| 386 | +- Add compaction for new large NDJSON files only after the first connector proves useful. |
| 387 | +- Use existing CLI style in `poly_strategy/cli.py` and existing test style under `tests/`. |
| 388 | +- Keep source adapters isolated from LLM extraction so we can test them without provider keys. |
| 389 | +- Add deterministic fixtures for unit tests; do not depend on live network in tests. |
| 390 | + |
| 391 | +## Definition of Done |
| 392 | + |
| 393 | +The direction is ready for deeper implementation when we have: |
| 394 | + |
| 395 | +1. A working shadow pipeline from market fact need to external signal. |
| 396 | +2. At least one week of reports showing coverage, precision, timeliness, and tradability. |
| 397 | +3. Clear evidence that the pipeline finds facts the current monitor would not otherwise prioritize. |
| 398 | +4. A decision on whether to expand sources, improve mapping, or stop the direction. |