
Arb-persistence study + James Bond depth verdict (94/yr ceiling)#9

Merged
WW-shan merged 13 commits into
WW-shan:mainfrom
Soli22de:experiment/2026-05-12-gamma-baseline
May 13, 2026

Conversation

@Soli22de
Collaborator

TL;DR

14-day backfill plus a real CLOB depth check on the long-tail explicit_other arb candidate (James Bond). Verdict: technically real, commercially dead. Max ~$3.78/event × ~4 events per 14 days ≈ $394/yr theoretical ceiling, and gas, monitoring, and withdrawal costs eat most or all of that.
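The ceiling arithmetic above can be reproduced with a few lines. This is a minimal sketch that assumes simple linear scaling from the 14-day window to 365 days, with no compounding and no costs; the function name is illustrative and not taken from the repo.

```python
DAYS_PER_YEAR = 365


def annualized_ceiling(profit_per_event: float, events: int, window_days: int) -> float:
    """Scale profit observed in one window to a full year (linear, cost-free)."""
    return profit_per_event * events * DAYS_PER_YEAR / window_days


# Figures from the TL;DR: ~$3.78 per event, ~4 events in the 14-day window.
ceiling = annualized_ceiling(profit_per_event=3.78, events=4, window_days=14)
print(f"theoretical ceiling: ${ceiling:.0f}/yr")
```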

Binary D-vs-R sub-classifier built; backfill shows 98 "events" but most are forward-fill artifacts on illiquid markets. Live snapshot loop is now running on Soli22de's machine to accumulate honest bestAsk data — final binary verdict in 1-2 weeks.

Two read-first files

What's in the diff

Infrastructure (~700 lines):

  • scripts/snapshot_gamma.py — every-15-min Gamma collector, writes per-snapshot markets.ndjson + classified groups.ndjson with edge_after_fee precomputed
  • run_snapshot_loop.ps1 — persistent loop runner (Windows PowerShell)
  • scripts/backfill_prices_history.py — reconstructs 14 days of synthetic snapshots via CLOB /prices-history mid prices (with caveats baked into the docstring)
  • scripts/analyze_arb_events.py — detects contiguous edge events, computes persistence_minutes, applies pre-locked GO/KILL thresholds
  • scripts/verify_james_bond_book.py — real CLOB /book depth check + fill simulation at sizes [10, 30, 50, 80, 100, 150]
  • scripts/analyze_binary_refined.py — sub-classifies 2-member groups into dvr/yes_no/pseudo (catches the Aston Villa vs Freiburg false-binary trap)
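To make "contiguous edge events" and persistence_minutes concrete, here is a minimal sketch of run detection over a 15-minute snapshot series. It is an illustration under assumptions, not the code in scripts/analyze_arb_events.py: the `EdgeEvent` type, the `detect_events` name, and the rule that a missing snapshot ends an event are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

SNAPSHOT_MINUTES = 15  # collector cadence stated in this PR


@dataclass
class EdgeEvent:
    start_index: int
    end_index: int  # inclusive
    peak_edge: float

    @property
    def persistence_minutes(self) -> int:
        # Assumption: each snapshot in the run counts as one full interval.
        return (self.end_index - self.start_index + 1) * SNAPSHOT_MINUTES


def detect_events(edges: Iterable[Optional[float]], threshold: float = 0.02) -> List[EdgeEvent]:
    """Group consecutive snapshots with edge >= threshold into events.

    A missing snapshot (None) or a sub-threshold edge closes the current event.
    """
    events: List[EdgeEvent] = []
    start: Optional[int] = None
    peak = 0.0
    last = -1
    for i, edge in enumerate(edges):
        last = i
        if edge is not None and edge >= threshold:
            if start is None:
                start, peak = i, edge
            else:
                peak = max(peak, edge)
        elif start is not None:
            events.append(EdgeEvent(start, i - 1, peak))
            start = None
    if start is not None:  # series ended while an event was still open
        events.append(EdgeEvent(start, last, peak))
    return events


series = [0.0, 0.03, 0.05, None, 0.02, 0.01]
for ev in detect_events(series):
    print(ev.start_index, ev.end_index, ev.peak_edge, ev.persistence_minutes)
```

Treating a gap as the end of an event is the conservative choice: it avoids stitching two separate opportunities together across missing data.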

Reports (~1200 lines, 4 files):

  • reports/arb-persistence-2026-05-12.md
  • reports/james-bond-book-validation-2026-05-12.md
  • reports/binary-refined-2026-05-12.md
  • reports/research-summary-2026-05-13.md

Plus yesterday's leftover commit (separate, 0df9dee): experiment 7 v2 refined + 4-model GLM extraction results.

Key findings

| Question | Answer |
| --- | --- |
| Does explicit_other long-tail arb exist? | Yes, only on James Bond. 30–60hr persistence per event. |
| What's the realistic profit ceiling? | ~$394/yr at theoretical peak, $0–200/yr after gas. Dead at scale. |
| Does binary D-vs-R arb exist? | Unclear. Backfill says 98 events but forward-fill on illiquid markets produces fake persistence. Need live bestAsk over 7-14 days to verify. |
| Is /prices-history backfill useful in general? | For liquid markets yes, for long-tail no. Worth knowing. |

Open questions for review

  1. Pivot direction: research-summary §5.3 lists 4 candidate next theses (high-liquidity short-half-life / cross-platform / market-making / quit). Which do you want to talk through?
  2. Live loop length: I set it to 15-min cadence indefinitely. Stop after 14 days, or keep going?
  3. Kill list items (research-summary §5.4) — do you agree we should stop polishing T2/T3 LLM pipelines until we have a thesis whose bottleneck is description-reading rather than orderbook depth?

Test plan

This is mostly a research PR, not a code-shipping one. To verify the infra:

  • python -u scripts/snapshot_gamma.py --pages 6 produces data/snapshots/<date>/<HH-MM>/{markets,groups,meta}.ndjson
  • python -u scripts/analyze_arb_events.py runs cleanly on at least one snapshot
  • python -u scripts/verify_james_bond_book.py queries CLOB and produces the depth report (depends on James Bond markets still being open)
  • Read the two research-summary files and the depth-validation report

🤖 Generated with Claude Code

张靖恒 and others added 8 commits May 12, 2026 14:18
Experiment 1+3 (Gamma distribution + structural ground truth):
- Pulled n=2000 active markets from gamma-api.polymarket.com (4 pages
  × 500, ~3.5s total)
- vol24hr P10/P50/P90 = $0 / $40 / $18,333
- Liquidity P10/P50/P90 = $787 / $10,138 / $221,690
- Spread P10/P50/P90 = 0.001 / 0.01 / 0.10 (present in raw Gamma —
  PR WW-shan#4 spec was wrong to exclude this)
- 14-90d-to-resolution band = 693 markets (35%) — target range OK
- Derived 10,122 mutex pairs from 171 neg-risk groups → T4 $0 corpus
  validated

Implications:
- Q1 thresholds in PR WW-shan#3 are way too high; data-driven values in report
- Q4 T4 corpus problem disappears (10k+ pairs from structure alone)
- PR WW-shan#4 spec needs amendment for spread availability + dead-tier
  rephrasing (liquidity, not volume, as P10 boundary)

Raw NDJSON under data/experiments/ is gitignored; only script + report
committed.

Experiment 2 (OpenRouter calibration script, ~$0.001 total):
- One-shot validation that Gemini Flash V2 strict prompt actually
  produces schema-conforming JSON with verbatim grounding
- Requires OPENROUTER_API_KEY at runtime; not run yet
- Will calibrate the $0.00009/call estimate in PR WW-shan#6

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
n=5 random Polymarket markets, Gemini 2.0 Flash via OpenRouter,
V2 strict prompt with verbatim grounding.

Results:
- schema_ok: 5/5 (100%)
- grounding_ok: 5/5 (100%) — verbatim_text substring check passes
  on every clause across all 5 markets
- Avg 2-5 clauses extracted per market, types diverse
  (deadline / tiebreaker / source)
- Avg ambiguity_score 0.2-0.3
- Avg latency 3.2s/call
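The grounding check described above amounts to a substring test per clause. A minimal sketch, assuming each extracted clause is a dict with a `verbatim_text` field (the field name appears in this commit message; the function name and the `type` key are illustrative):

```python
from typing import Dict, List


def grounding_ok(description: str, clauses: List[Dict[str, str]]) -> bool:
    """True if every clause's verbatim_text is a non-empty substring of the description.

    Note: an empty clause list passes vacuously; callers may want to reject it separately.
    """
    for clause in clauses:
        text = clause.get("verbatim_text")
        if not isinstance(text, str) or not text or text not in description:
            return False
    return True


description = (
    "This market resolves to Yes if X happens by December 31, 2026. "
    "The resolution source is the official announcement."
)
grounded = [
    {"type": "deadline", "verbatim_text": "by December 31, 2026"},
    {"type": "source", "verbatim_text": "The resolution source is the official announcement."},
]
paraphrased = [{"type": "deadline", "verbatim_text": "before the end of 2026"}]

print(grounding_ok(description, grounded))
print(grounding_ok(description, paraphrased))
```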

Cost calibration:
- Actual: $0.000214/call ($0.001070 total over 5 calls)
- PR WW-shan#6 estimate: $0.000090/call
- Off by 2.4× — output tokens are ~397 avg (estimate was 150),
  because verbatim_text transcription inflates output
- 2000-market T2 projected: $0.43 (vs $0.18 in PR WW-shan#6)
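The calibration numbers above follow from three divisions and multiplications. A short sketch, assuming $0.001070 is the total spend for the 5-call run:

```python
TOTAL_COST_USD = 0.001070     # assumed to be the total for the 5-call run
CALLS = 5
ESTIMATE_PER_CALL = 0.000090  # earlier per-call estimate
MARKETS = 2000                # planned T2 scale

actual_per_call = TOTAL_COST_USD / CALLS
ratio = actual_per_call / ESTIMATE_PER_CALL
projected = actual_per_call * MARKETS

print(f"actual ${actual_per_call:.6f}/call, {ratio:.1f}x the estimate, "
      f"${projected:.2f} for {MARKETS} markets")
```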

Conclusion: V2 strict prompt + Gemini Flash combination works
out-of-the-box on real Polymarket descriptions. No prompt
tweaks needed before T2 scale-up. PR WW-shan#6 budget needs an update
but total cost still trivial.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three improvements over experiment 2:
- 4 models head-to-head via OpenRouter (Gemini Flash, DeepSeek V3,
  GPT-4o-mini, Llama 3.3 70B) instead of just Gemini Flash. Tests
  the assumption that Flash — originally chosen in dash-ocr for
  image OCR — is also best for pure text extraction.
- n=30 stratified by description length (10 short + 10 medium +
  10 long) instead of n=5 random
- Full per-call NDJSON dump (raw response, parsed JSON, tokens,
  cost, schema/grounding checks) so we can read actual clause
  text afterward — experiment 2 lost this data

Hard cost cap: $0.50 (rough est is $0.04 for 120 calls).

Auto-metrics report shows per-model schema/grounding/clause-count.
The actionable/structural/trivial qualitative judgment will be
done in a separate pass that reads the NDJSON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-paid elysiver.h-e.top endpoint hosts GLM models including glm-5
and glm-4.6 (and glm-5.1 but quota-exhausted today). Cost = $0 to us
(user-paid, quota-limited), latency 15-25s/call.

Changes:
- PROVIDERS dict abstracts OpenAI-compatible endpoints (openrouter,
  elysiver); each model entry declares its provider.
- load_env_file() reads .env for OPENAI_BACKUP_API_KEY (gitignored).
- call_model() now routes per-model and looks up keys per-provider.
- Browser-like User-Agent header — Cloudflare on elysiver returns 1010
  to urllib's default UA; verified Mozilla/5.0 UA passes the check.
- New --out-name arg so partial reruns (e.g. only the new GLM models)
  don't overwrite earlier 4-model results.
- Added glm-5 and glm-4.6 model entries (cost=0 placeholder).

Smoke-tested: both GLM models complete full V2 prompt against real
Polymarket descriptions, schema_ok and grounding_ok, extract substantive
clauses (exclusion / numeric_threshold types).

windhub.cc primary endpoint remains unusable: harder Cloudflare JS
challenge that browser UA alone doesn't pass.

Also commits the previously-run 4-model report file
(experiment-multi-model-extraction-2026-05-12.md) which captured
the OpenRouter Gemini Flash / DeepSeek V3 / GPT-4o-mini / Llama 3.3
70B baseline used for judging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Experiment 7 (refined v2): classifies neg-risk groups into
explicit_other / binary / open_set tiers using catch-all keyword markers.
Replaces broken size>=8 heuristic that flagged Nobel-style open candidate
sets as exhaustive. 06:13 UTC snapshot found 1 strict candidate (James
Bond) + 7 binary candidates.
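As a rough illustration of keyword-based tiering, the sketch below classifies a group from its outcome labels. The marker list, the precedence (catch-all first, then group size), and the function name are assumptions for illustration; the real markers used in experiment 7 are not shown in this commit message.

```python
import re
from typing import List

# Hypothetical catch-all markers; the actual keyword list may differ.
CATCH_ALL = re.compile(r"\b(other|none of the above|someone else|anyone else)\b", re.IGNORECASE)


def classify_group(outcome_labels: List[str]) -> str:
    """Assign a neg-risk group to explicit_other / binary / open_set."""
    if any(CATCH_ALL.search(label) for label in outcome_labels):
        return "explicit_other"  # a catch-all outcome makes the set exhaustive
    if len(outcome_labels) == 2:
        return "binary"
    return "open_set"  # candidate list with no catch-all, so not provably exhaustive


print(classify_group(["Actor A", "Actor B", "Other"]))
print(classify_group(["Democrat", "Republican"]))
print(classify_group(["Candidate A", "Candidate B", "Candidate C"]))
```

Keying on a catch-all label rather than group size is what keeps a large but open candidate list from being mistaken for an exhaustive one.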

GLM-4.6 / GLM-5 results: 29/29 vs 16/30 successful calls respectively on
elysiver-routed free endpoints. GLM-4.6 emerges as a $0 alternative to
DeepSeek V3 for T2 resolution-clause extraction.

Research summary (Day 1) captures the day's verdicts for classmate WW
review: thesis live, James Bond needs CLOB depth verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cale

Infrastructure:
- snapshot_gamma.py: 15-min Gamma snapshot collector, writes per-snapshot
  markets.ndjson + classified groups.ndjson with edge_after_fee precomputed
- run_snapshot_loop.ps1: persistent loop runner (PowerShell)
- backfill_prices_history.py: reconstructs 14 days of synthetic snapshots
  via CLOB /prices-history (mid-price approximation, with caveats)
- analyze_arb_events.py: detects contiguous edge events, computes
  persistence_minutes, applies pre-locked GO/KILL thresholds
- verify_james_bond_book.py: real CLOB /book depth check + fill simulation
- analyze_binary_refined.py: sub-classifies 2-member groups into
  dvr/yes_no/pseudo, runs per-subtier event detection

Findings (14-day window, see research-summary-2026-05-13.md for the
full writeup):
- explicit_other tier: 4 events, all on James Bond group, 30-60hr
  persistence, mid-edge +9% to +18%. CLOB depth check shows max
  per-event profit ~$3.78 (at 80-unit basket), breakeven at ~120 units.
  Annualized ceiling ~$394 before gas/withdrawal. Commercially dead.
- binary dvr tier (D vs R races): 98 events at +2% threshold, but most
  are forward-fill artifacts on illiquid markets (9-34 distinct prices
  across 1140 snapshots = ~one trade per 30hr). Live bestAsk required
  to verify.
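The forward-fill artifact is easy to reproduce. In the sketch below, a handful of trades is carried forward onto a 15-minute grid, which yields long flat runs that look like persistence. The function names and the sample trades are illustrative; only the 1140-snapshot count, the 9-price lower bound, and the 15-minute cadence come from this PR.

```python
from typing import Dict, List, Optional, Tuple

SNAPSHOT_MINUTES = 15


def forward_fill(trades: Dict[int, float], n_snapshots: int) -> List[Optional[float]]:
    """Carry the last traded price forward onto every snapshot slot."""
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    for i in range(n_snapshots):
        last = trades.get(i, last)
        filled.append(last)
    return filled


def staleness(series: List[Optional[float]]) -> Tuple[int, float]:
    """Return (distinct prices, snapshots per distinct price)."""
    observed = [p for p in series if p is not None]
    distinct = len(set(observed))
    return distinct, len(observed) / max(distinct, 1)


def hours_per_distinct_price(n_snapshots: int, distinct: int) -> float:
    return n_snapshots * SNAPSHOT_MINUTES / 60 / distinct


# Three hypothetical trades spread over the window.
series = forward_fill({0: 0.40, 500: 0.42, 900: 0.41}, n_snapshots=1140)
print(staleness(series))
print(round(hours_per_distinct_price(1140, 9), 1))  # low end of the 9-34 range
```

A series with so few distinct values cannot support a persistence claim, because most of the "duration" is just the gap between trades.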

Conclusion: live snapshot loop continues running to accumulate honest
bestAsk time series; explicit_other thesis killed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live-only retest of the binary tier (--live-only flag on both
analyzers) confirmed the 22-hour persistence seen in backfill was
forward-fill artifact. Real bestAsk-based events flash for 1 snapshot
(15 min) and disappear. A few D/R races (WV/TN/SC Senate/Gov) showed
persistent low-edge floors of 2-5% across 14 hours of live data.

Generalized verify_book script (scripts/verify_group_book.py, takes
--group-id-prefix arg) and ran depth check on the most stable
candidate (SC Governor D/R, +2.55% sustained edge, $4,377 min_liq):

  Marginal edge:  +2.55%
  50u basket:     +$0.52 profit
  200u basket:    +$0.31 (breakeven approaching)
  500u basket:    -$21 (negative)

The Republican-side bestAsk of 0.91 has only 3.9 units of depth (about
$3.5 fillable). Edge collapses on the first non-trivial fill.

Both thesis branches now have definitive verdicts:
  - explicit_other (James Bond): $3.78/event max
  - binary D-vs-R (SC Gov sample): $0.52/event max

Both killed by the same structural fact: long-tail Polymarket books
hold $5-80 of depth at best ask. The "persistent edge" is real but
is the unfilled cost of nobody bothering to take $5 of order flow.

Action: live snapshot loop stopped (no point accumulating data when
both branches are dead). Existing 1.3GB local data kept as baseline.
Pivot direction TBD with WW via PR WW-shan#9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Soli22de
Collaborator Author

Day 3 follow-up — both thesis branches now confirmed dead at scale

Added commit 3cf624c to the PR with Day 3 findings (script + 2 reports + amended research-summary-2026-05-13.md §3.6–3.8 and §5).

TL;DR: ran analyze_binary_refined.py --live-only against ~14 hours of pure live bestAsk data. Backfill's "22hr persistence" was indeed forward-fill artifact (median live persistence: 15 min = 1 snapshot). But a few D/R races showed persistent small-edge floors (WV/TN/SC Senate/Gov, +2-5% for many hours), so I generalized the depth-check script and ran it on the most stable candidate.

SC Governor D/R depth check (was the best-looking live candidate, +2.55% sustained for 14hrs, $4,377 min_liq):

| Basket size | Edge $ | Edge % |
| --- | --- | --- |
| 1u (marginal) | +$0.026 | +2.55% |
| 50u | +$0.52 | +1.04% |
| 200u | +$0.31 | +0.15% (breakeven) |
| 500u | -$21 | -4.29% |
| 1000u | -$70 | -7.01% |

Killer: Republican-side bestAsk = 0.91 with only 3.9 units of depth ($3.5 fillable). A single $4 trade closes the edge.
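The mechanism can be sketched as a walk up each leg's ask ladder. This is a simplified illustration, not verify_group_book.py: it ignores fees and gas, and every book level below is hypothetical except the 0.91 price with 3.9 units of depth quoted above.

```python
from typing import List, Tuple

Level = Tuple[float, float]  # (ask price, size in units)


def walk_asks(asks: List[Level], size: float) -> Tuple[float, float]:
    """Return (filled_units, total_cost) for buying up to `size` units."""
    filled = 0.0
    cost = 0.0
    for price, available in sorted(asks):  # cheapest level first
        take = min(available, size - filled)
        if take <= 0:
            break
        filled += take
        cost += take * price
    return filled, cost


def basket_edge(books: List[List[Level]], size: float) -> Tuple[float, float]:
    """Buy the same number of units on every leg; each complete basket pays $1.

    Returns (baskets actually completed, profit in dollars before fees and gas).
    """
    fillable = min(walk_asks(book, size)[0] for book in books)
    total_cost = sum(walk_asks(book, fillable)[1] for book in books)
    return fillable, fillable * 1.0 - total_cost


republican = [(0.91, 3.9), (0.95, 100.0)]  # second level is hypothetical
democrat = [(0.065, 500.0)]                # hypothetical
for size in (1, 50, 1000):
    units, profit = basket_edge([republican, democrat], size)
    print(f"{size:>5}u requested, {units:.1f}u filled, profit ${profit:+.2f}")
```

With only 3.9 units at the best ask, any basket above that size pays the next level's price on most of its units, which is why the marginal edge says little about tradable profit.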

Both thesis branches:

| Branch | Best per-event profit | Verdict |
| --- | --- | --- |
| explicit_other (James Bond) | $3.78 | Dead at scale |
| binary D-vs-R (SC Gov etc) | $0.52 | Dead at scale |

Same structural cause: Polymarket long-tail books carry $5–80 of depth at best ask. The "persistent edges" are the price of nobody bothering with $5 of order flow.

Actions taken:

  • ✅ Stopped live snapshot loop (no point accumulating when both branches dead). 1.3GB local data kept as baseline.
  • ✅ Generalized depth checker → scripts/verify_group_book.py --group-id <prefix>
  • ✅ Pushed Day 3 commit to this PR

Three questions for you (full version in research-summary §8):

  1. Agree on stopping the loop? (already stopped, can restart cheaply)
  2. Of the 4 candidate next-theses in §5.2, which to scope first? My weak preference is B (market-making) — same data, different lens, asks "if persistent edge is the cost-of-being-the-only-bidder, can we BE the bidder?"
  3. Worth writing a formal docs/thesis-postmortem.md so the next person who looks at long-tail Polymarket arb doesn't repeat the path?

After user pushed back on the "thesis dead" verdict ("I don't buy it"), built two
maker-strategy simulators:

  v1 (mid-touch, simulate_maker_basket.py):
    - 3.15M tick points across 157 tokens over 14 days
    - For each (group, day, markup): did mid touch (bestAsk - markup)?
    - Result: $15,546/yr theoretical across 72 groups @ $100 basket
    - Known weakness: mid touching != trade at that price

  v2 (trade tape, simulate_maker_basket_v2.py):
    - Real Polymarket /trades tape, 48,030 raw trades over 14 days
    - Filtered to SELL Yes trades (the type that would hit a maker bid)
    - Only 1,602 / 48,030 (3.3%) qualified
    - Result: $918/yr theoretical = 17x reduction from v1

After realistic adjustments (queue priority, D/R correlation,
partial-fill hedging cost, gas): $200-500/yr @ $100 basket, scaling to
~$2-5k/yr @ $1000 basket with $144k capital tied up.

Updated research-summary §3.9: I overclaimed "thesis dead" from 2
taker-only depth snapshots. Real verdict: TAKER dead, MAKER alive at
hobby scale. User's skepticism was correct; my single-perspective
testing was insufficient methodology.

Verdict matrix:
  - Taker basket arb:                    $0-200/yr     killed
  - Maker basket arb (v1 mid-sim):       $15k optimistic phantom
  - Maker basket arb (v2 trade tape):    $200-500/yr   honest

Live loop remains stopped. Existing data + scripts kept for paper
trading next phase if pursued.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Soli22de
Collaborator Author

Correction: I was wrong yesterday — MAKER thesis lives at $200-500/yr

Soli22de pushed back on the "thesis dead" verdict, which led to actually building the maker-strategy backtest I should have built before declaring the whole thing dead. Two simulators added in commit 4d1e2a6:

v1 (mid-touch, scripts/simulate_maker_basket.py)

  • 3.15M mid-price tick points across 157 tokens over 14 days
  • Fill proxy: did mid touch (bestAsk - markup) on day d?
  • Result: $15,546/yr theoretical across 72 groups @ $100 basket
  • Caveat written in: mid touching != trade at that price

v2 (trade tape, scripts/simulate_maker_basket_v2.py)

  • Real data-api.polymarket.com/trades tape, 48,030 raw trades
  • Filtered to SELL-Yes trades in window: 1,602 (3.3%)
  • Per (group, day, markup): did any SELL-Yes trade hit at price ≤ target?
  • Result: $918/yr theoretical — 17x reduction from v1

| Metric | v1 (mid-touch) | v2 (trade tape) |
| --- | --- | --- |
| Total daily $ | $42.59 | $2.51 |
| Annualized | $15,546 | $918 |
| Groups w/ positive expected | 49/72 | 17/72 |
| Best per-group | Kansas Gov $4.87/day | Iowa Senate $1.00/day |

Realistic adjustment

After queue priority (×0.6), D/R correlation (×0.7), partial-fill hedging (-10%), Polygon gas (-20%): $200-500/yr at $100 basket, ~$2-5k/yr at $1000 basket.
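A quick check of how the listed haircuts land in the quoted range. This sketch assumes the four adjustments compose multiplicatively and reads "-10%" and "-20%" as ×0.9 and ×0.8; the PR does not state how they were combined.

```python
from typing import Iterable


def apply_haircuts(theoretical: float, multipliers: Iterable[float]) -> float:
    """Apply each adjustment factor in turn to a theoretical annual figure."""
    adjusted = theoretical
    for factor in multipliers:
        adjusted *= factor
    return adjusted


V2_THEORETICAL = 918.0  # $/yr from the trade-tape simulator
HAIRCUTS = {
    "queue priority": 0.6,
    "D/R correlation": 0.7,
    "partial-fill hedging": 0.9,  # assumed reading of -10%
    "Polygon gas": 0.8,           # assumed reading of -20%
}

realistic = apply_haircuts(V2_THEORETICAL, HAIRCUTS.values())
print(f"${realistic:.0f}/yr at a $100 basket")
```

Under these assumptions the point estimate is roughly $278/yr, inside the $200-500 range.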

Corrected verdict matrix

| Strategy | Real $/yr | Status |
| --- | --- | --- |
| Taker basket arb | $0-200 | Dead at scale (verified) |
| Maker mid-sim (v1) | $15k phantom | Methodology was wrong |
| Maker trade-tape (v2) | $200-500 @ $100 | Honest, hobby scale |

What I learned

I overclaimed "thesis dead" from 2 taker-only depth snapshots. Robust testing needs at least: multiple strategy perspectives (taker/maker/hold), multiple snapshots in time, realistic fill models (trade tape > mid-touch), and multiple capital sizes.

Soli22de's "I don't buy it" pushback is the only reason this PR has a defensible final verdict instead of a wrong one.

Three questions, revised

  1. Want to take the next step — paper-trade 2-3 top v2 candidates (Iowa Senate, Georgia Senate, Illinois Senate D/R) for 1-2 weeks to validate the ~$1/day-per-group claim with real fills?
  2. Accept hobby scale ($2-5k/yr at $1k basket size) or pivot to cross-platform / HFT-lane theses?
  3. Should I extend the trade-tape simulator to cover non-D/R structures (initiative referenda, sports specials) — could uncover other tradeable groups we missed?

WW-shan and others added 2 commits May 13, 2026 14:28
…cherry-pick

WW review of PR WW-shan#9 caught 4 bugs in the v1/v2 maker simulations and
verify_group_book.py:

  1. Income not capped by realized trade size — formula
     `fill_rate * avg_edge * intended_basket` assumed every fill captured
     the full $100. With avg trade sizes of 3-9 units, that overstates by
     5-20x.

  2. Maker target could cross bestAsk — `max(t, bestBid+0.001)` for narrow
     spreads could produce target = bestAsk (crossing/taker order, not maker).

  3. verify_group_book.py partial-fill cost wrong —
     `cost = avg_px * size` should be `avg_px * filled`. Inflated negative
     edge numbers when book ran out.

  4. avg_min_leg_sell_size was logged as a caveat but never folded into
     the main income formula.

Fixes:
- scripts/simulate_maker_basket_v2.py: per-day actual fill =
  min(intended, min-over-legs of qualifying-trade-size at price <= target);
  income = sum(edge_per_unit * actual_units) / window_days; target strictly
  clamped below bestAsk; skip markup levels with no valid maker zone.
- scripts/verify_group_book.py: compute basket cost/fee/edge at actual
  fillable units (not intended size); flag CAPPED rows where book runs out.
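The size cap in the first fix can be sketched as follows. This is an illustration of the stated formula, not the code in simulate_maker_basket_v2.py; the data layout (one `(edge_per_unit, per-leg qualifying sizes)` tuple per day) and the sample numbers are assumptions.

```python
from typing import List, Sequence, Tuple

DayObservation = Tuple[float, Sequence[float]]  # (edge per unit, qualifying size per leg)


def capped_units(intended: float, leg_sizes: Sequence[float]) -> float:
    """Units actually fillable: limited by the thinnest leg and by the intended basket."""
    if not leg_sizes:
        return 0.0
    return max(0.0, min(intended, min(leg_sizes)))


def daily_income(days: List[DayObservation], intended: float) -> float:
    """income = sum(edge_per_unit * actual_units) / window_days"""
    if not days:
        return 0.0
    total = sum(edge * capped_units(intended, sizes) for edge, sizes in days)
    return total / len(days)


# Hypothetical 3-day window for a two-leg group with a 100-unit intended basket.
window = [
    (0.02, [5.0, 9.0]),      # thin tape: only 5 units fill
    (0.02, [0.0, 40.0]),     # one leg never trades: nothing fills
    (0.01, [200.0, 150.0]),  # deep tape: capped at the intended 100 units
]
print(round(daily_income(window, intended=100.0), 4))
```

Taking the minimum across legs matters because a basket is only hedged when every leg fills; a deep tape on one side does not help if the other side never trades.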

Reran both:
- v3 (size-capped): naive across 72 groups = -$263/yr;
  cherry-pick 18 positive-edge groups = +$117/yr @ $100 basket
- SC Gov taker @ 200u = +$2.02 (max), capped at 1304u for larger sizes

Research summary §3.10 + §3.11 supersede §3.9. TL;DR updated.

Honest final verdict: long-tail D-vs-R spread is market friction, not
alpha. A retail trader cannot net a positive return after fee + queue + gas.
The thesis isn't dead — it was never alpha to begin with, only looked
like alpha because of methodological errors at multiple layers.

Thanks to WW for the rigorous review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Soli22de
Collaborator Author

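@WW-shan All 4 of your points are correct. Fixed and pushed as commit d222906. Here are the results up front so you don't waste time redoing v2:

What was fixed

| Bug you raised | Where fixed | How |
| --- | --- | --- |
| Income not capped by realized trade size | simulate_maker_basket_v2.py:241-280 | Daily fill = min(intended_basket, min over legs of total SELL-Yes size at price ≤ target); total daily $ = sum(edge_per_unit * actual_units) / window_days |
| Maker target could cross bestAsk | simulate_maker_basket_v2.py:236-260 | t = min(t, bestAsk - 0.001) forces the maker zone; markup levels with spread < 0.002 are skipped |
| Taker partial-fill cost = avg_px * size | verify_group_book.py:167-194 | Rewritten: find max_fillable first, then recompute cost/fee/edge with actual_units = min(intended, max_fillable); CAPPED rows are flagged with ⚠️ |
| avg_min_leg_sell_size not folded into the main formula | Merged into fix 1 | |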
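The no-crossing rule from the second row can be sketched as a small clamp. The 0.001 tick and the 0.002 minimum spread come from the table; the function name, the rounding, and keeping the original `bestBid + 0.001` floor are assumptions for illustration.

```python
from typing import Optional

TICK = 0.001


def maker_target(raw_target: float, best_bid: float, best_ask: float) -> Optional[float]:
    """Clamp a bid target strictly inside the spread, or return None if there is no maker zone."""
    spread = round(best_ask - best_bid, 6)  # round away float noise before comparing
    if spread < 2 * TICK:
        return None  # no price sits strictly between bid and ask: skip this markup
    target = max(raw_target, best_bid + TICK)  # improve on the current best bid
    target = min(target, best_ask - TICK)      # never reach the ask (that would be a taker order)
    return round(target, 3)


print(maker_target(0.95, 0.90, 0.91))    # clamped below the ask
print(maker_target(0.85, 0.90, 0.91))    # lifted just above the bid
print(maker_target(0.50, 0.910, 0.911))  # spread too narrow
```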

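New numbers after the fixes (v3)

| Version | Method | Annualized |
| --- | --- | --- |
| v1 mid-touch | Wrong (mid is not a fill) | +$15,546 |
| v2 size-uncapped | Wrong (the bug you caught) | +$918 |
| v3 size-capped + no-crossing | Fixed | −$263 naive / +$117 cherry-pick |

Naive deployment across all 72 groups = an average loss of $263/yr. Trading only the 18 positive-expectation groups = +$117/yr @ $100 basket.

A closer look at the markup table:

| Markup | Avg fill rate | Avg edge/unit | Total daily $ |
| --- | --- | --- | --- |
| $0.005 | 5.9% | -1.27% | -$2.09 |
| $0.010 | 5.8% | -0.38% | -$1.04 |
| $0.020 | 5.7% | +0.24% | -$0.88 |
| $0.030 | 5.6% | +0.66% | -$0.78 |
| $0.050 | 5.4% | +0.90% | -$0.75 |

Total daily $ is negative at every markup. Even at the best markup ($0.05), total = -$0.75/day. Reason: at small markups the avg edge/unit is eaten by the fee, and at large markups the fill rate is too low.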

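Your two calls were spot on

  • You said "a taker sweeping bestAsk in one shot is most likely still dead": confirmed. With the fixed script, SC Gov max profit is a one-off $2.02 @ 200u; anything larger is capped.
  • You said "the maker $200-500 figure can't be trusted; it needs to be recomputed with a cap at real traded size": direct hit. After the fix, naive is negative and cherry-pick is down to $117.

Correction to the key honest conclusion

The PR title/description previously said "maker thesis lives at $200-500/yr". That was wrong. The real verdict after the fixes:

The long-tail D-vs-R spread is not alpha; it is market friction. A retail trader cannot net positive after fee + queue + gas.

The full progression is written up in reports/research-summary-2026-05-13.md §3.10 + §3.11 (superseding the interim §3.9 version).

Further questions for you

  1. Have you run this v3 through your own model? If your relay tests give different results, please post them.
  2. My stance on the 18 cherry-picked positive-expectation groups: they should not be treated as executable alpha, because that is hindsight bias. We have no oracle to tell us in advance which 18. What do you think?
  3. The PR's conclusion is now "the thesis doesn't even reach hobby scale". How do you suggest handling it: (a) merge with this verdict (b) close as "exhausted" (c) keep open for future ref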

@WW-shan WW-shan merged commit 416a24a into WW-shan:main May 13, 2026
1 check passed
WW-shan pushed a commit that referenced this pull request May 15, 2026
Post-mortem of the maker-arb thesis after WW archived PR #9.

Two methodology fixes in v4:
  (A) maker fee was wrongly set equal to the taker fee in v3. Polymarket
      docs and live feeSchedule.takerOnly=True on 100/100 sampled markets
      confirm makers never pay fees. Corrected: maker_fee = 0.
  (B) v3 was 100% in-sample. Added 10/4 train/test split + multi-window
      orchestration to detect window-luck.

Findings progression:
  v3 (in-sample, taker fee):       -$263/yr naive, +$117 cherry
  v4 single window (today):        +$195/yr naive, +$289 cherry OOS
  v4 multi-window (4 x 14d = 56d): naive mean -$183 (sign flips!),
                                    cherry mean +$251 but UNSTABLE

The decisive result: across 4 non-overlapping 14-day windows covering
2026-03-20 to 2026-05-15:
  - 0 of 64 groups have positive OOS in >=3/4 windows
  - 44/64 groups (69%) had zero positive OOS across all 4 windows
  - Even the 2 groups consistently in top-18 by in-sample (Wisconsin,
    Kansas) had positive OOS in only 2/4 and 1/4 windows respectively
  - Naive deploy sign flips: -$1,117 in 3/20-4/03 window, +$239 in
    4/03-4/17 window

Cherry-pick "wins" within each window because we pick this window's
winners; but the winners rotate, so no actionable alpha.
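The stability test behind this verdict can be sketched as a count of positive out-of-sample windows per group. This is an illustration, not aggregate_v4_multi_window.py: the function names are hypothetical and the P&L values are made up so that Wisconsin and Kansas reproduce the 2/4 and 1/4 counts reported above.

```python
from typing import Dict, List


def positive_window_counts(oos_pnl: Dict[str, List[float]]) -> Dict[str, int]:
    """Number of windows in which each group's out-of-sample P&L was positive."""
    return {group: sum(1 for p in pnl if p > 0) for group, pnl in oos_pnl.items()}


def stable_groups(oos_pnl: Dict[str, List[float]], min_positive: int = 3) -> List[str]:
    """Groups that were positive out of sample in at least `min_positive` windows."""
    counts = positive_window_counts(oos_pnl)
    return sorted(group for group, n in counts.items() if n >= min_positive)


# Hypothetical per-window OOS P&L ($) across 4 non-overlapping 14-day windows.
oos = {
    "Wisconsin": [1.2, -0.4, 0.8, -2.0],   # positive in 2/4
    "Kansas": [-0.5, 3.1, -1.0, -0.2],     # positive in 1/4
    "Example C": [-0.1, -0.3, 0.0, -0.9],  # never positive
}
print(positive_window_counts(oos))
print(stable_groups(oos))  # empty list: no group clears the 3-of-4 bar
```

Requiring repeated out-of-sample wins is what separates a persistent edge from picking each window's winners after the fact.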

Files:
  scripts/simulate_maker_basket_v4.py     - corrected fee + IS/OOS split
                                            + --end-date for time-shifting
  scripts/aggregate_v4_multi_window.py    - cross-window stability
  reports/maker-simulation-v4-*-w-*.md    - 4 per-window reports
  reports/maker-simulation-v4-multi-window-2026-05-15.md - the verdict

Note: poly_strategy/maker.py production code already has
fee_rate_assumption=0.0 for maker legs. The fee bug was localized
to my standalone research script, not production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants