feat(factors): add strict_bench mode with mandatory random control#143
Open
Soli22de wants to merge 2 commits into
Open
feat(factors): add strict_bench mode with mandatory random control#143Soli22de wants to merge 2 commits into
Soli22de wants to merge 2 commits into
Conversation
bench_runner.run_bench() labels an alpha "alive" when its raw IC mean beats 0.02, IC-positive ratio beats 0.55, and the IC self-t-stat clears 2. That gate accepts factors whose IC is driven by shared cross- sectional beta (e.g. market or size factor leakage) — the test is benchmarked against zero, not against a same-universe random control, so a row-shuffled version of the same factor will typically clear the gate too. This commit adds an opt-in `run_bench_strict()` (companion module, existing `run_bench()` untouched and back-compat) that: 1. Builds N same-universe random controls per alpha by row-shuffling the factor values (preserves the per-date cross-sectional distribution, destroys the signal->instrument mapping). 2. Computes a per-date paired alpha series = signal_IC - random_IC and evaluates its t-stat instead of the IC self-t-stat. 3. Optionally splits train vs test on a user-supplied date and requires the test-period alpha_t to also clear the threshold before labelling an alpha `confirmed_alive` (otherwise `train_only`). 4. Makes `random_control: bool` keyword-only so callers must opt in or opt out explicitly — passing nothing raises TypeError. This mirrors the rail from Soli22de/Bili_Stock's foundation Backtest constructor, which was hardened after a 9-month audit found that every accidental no-random-control run inflated alpha by 3-8 pp. Categories: confirmed_alive - alpha_t_full > thresh AND (no OOS OR alpha_t_test > thresh) train_only - alpha_t_full > thresh but alpha_t_test < thresh reversed_strict - alpha_t_full <= -thresh noise - alpha_t_full in (-thresh, thresh) OR ic_count < 30 The default threshold is 2.0 to stay numerically comparable with the existing `categorise()` gate. Pass `StrictThresholds(alpha_t_threshold=3.5)` to apply the Harvey-Liu-Zhu (2016) multiple-testing recommendation when running the full 455-alpha zoo. Tests ----- 19 new tests in `agent/tests/factors/test_bench_strict.py` cover: * `_shuffle_within_rows` value-set preservation, NaN respect, seed diff * `compute_random_ic_series` shape / empty input * `alpha_series_paired` index alignment * `t_stat` edge cases (empty, n<2, constant) and a hand-checked positive * `categorise_strict` noise / reversed / confirmed (full-only and OOS) / train_only / min_ic_count gate / Harvey-Liu-Zhu threshold variant * `run_bench_strict` end-to-end schema, OOS split presence, keyword-only rail enforcement (TypeError when positional), explicit `False` opt-out Full factor suite remains green: 988 passed + 19 new = 1007 passed, 1 skipped, 24.8 s. See PR description for the Soli22de/Bili_Stock case study that motivates the change. Signed-off-by: Soli22de <177382421+Soli22de@users.noreply.github.com>
Independent adversarial review (run before upstream maintainer
review) surfaced 10 issues in run_bench_strict. This commit fixes
the high- and medium-impact ones in-place and adds 14 regression
tests. The strict bench's design contract is unchanged; the changes
are correctness and back-compat-only.
Fixes
-----
A1. OOS train/test boundary date no longer double-counts.
.loc[:t] and .loc[t:] are both label-inclusive in pandas, so the
split date previously appeared in both buckets. Replaced with
explicit comparisons:
train = alpha_full[alpha_full.index <= oos_ts]
test = alpha_full[alpha_full.index > oos_ts]
Also added ic_count_train / ic_count_test per row so
categorise_strict can later enforce per-bucket min_ic_count.
A2. compute_random_ic_series now uses an inner join across seeds.
pd.concat(axis=1, join='inner') ensures every retained date is
the mean of *all* available seeds — not a hodgepodge of 1-seed
and 5-seed averages on dates where some seeds dropped out due to
the _MIN_VALID_PER_DATE=5 guard.
A3. on_progress callback is now exception-safe in every branch.
Refactored to a single _fire_progress() helper invoked from both
the empty-IC continue path and the normal end-of-loop path.
A4. _shuffle_within_rows pins ±inf in place like NaN.
Switched the mask from ~np.isnan() to np.isfinite() so an
inf/-inf cell stays at its original position. Defensive against
third-party zoos that bypass _validate_output.
A5. OOS sign-flip is now categorised reversed_strict, not train_only.
Bucket order: t_full >= thr AND t_test <= -thr → reversed_strict
(most diagnostic failure); t_full >= thr AND t_test in noise band
→ train_only (benign decay).
A6. Per-bucket ic_count fields surfaced on each row.
A7. Sorting uses unrounded _ir_raw / _alpha_t_full_raw / _ic_mean_raw
helper keys to keep top-N stable across runs.
C1. Wire schema regression — strict result now carries legacy aliases
alive/reversed/dead/by_theme alongside the strict-specific keys.
Existing dashboards keep rendering without code changes:
alive = confirmed_alive
reversed = reversed_strict
dead = noise + train_only
by_theme is built by a strict-aware variant that emits both the
new four-way and the legacy three-way counts per theme.
C2. _slim payload re-adds formula_latex so the wiki / dashboard top-N
cards keep showing the formula column.
C5. Error envelope is now schema-complete from the start. Every error
path (empty zoo, bad universe, bad forward-returns, bad oos_split)
returns a dict with zeroed counters, empty lists, and the rail
metadata intact — downstream consumers can depend on every key
being present regardless of status.
C6. n_random_seeds=0 is clamped to 1, AND the effective value is
persisted to entry['n_random_seeds'] so the wire response doesn't
lie about the seed count.
Internal sort-helper keys (_ir_raw etc) are stripped from public_rows
before they reach the wire payload — only _category is retained so
external consumers can read the bucket label.
Tests
-----
14 new regression tests in test_bench_strict.py — total now 33
strict + 988 existing = full agent/tests/factors/ suite is 1002
passed, 1 skipped, 24.25 s. Zero regression in the existing factor
tests.
The new tests:
- test_oos_train_test_split_does_not_double_count_boundary (A1)
- test_compute_random_ic_series_inner_joins_seed_dates (A2)
- test_run_bench_strict_on_progress_exception_is_caught (A3)
- test_shuffle_handles_inf_like_nan (A4)
- test_categorise_oos_sign_flip_is_reversed_strict_not_train_only (A5)
- test_categorise_oos_decay_to_noise_band_is_train_only (A5 companion)
- test_run_bench_strict_emits_legacy_alive_dead_reversed_keys (C1)
- test_run_bench_strict_legacy_alive_equals_confirmed_alive (C1)
- test_run_bench_strict_top_lists_include_formula_latex (C2)
- test_run_bench_strict_empty_zoo_returns_schema_with_counters (C5)
- test_run_bench_strict_n_random_seeds_zero_is_clamped (C6)
- test_run_bench_strict_rows_drop_underscore_prefixed_sort_keys (sort key
hygiene)
- test_run_bench_strict_catches_planted_alive_signal (closes the
'integration tests cheat' finding)
- test_run_bench_strict_catches_planted_reversed_signal (same)
Signed-off-by: Soli22de <177382421+Soli22de@users.noreply.github.com>
Author
|
Self-review pass: ran an independent adversarial code review on the High impact (math / API surface)
Medium impact
Low / defensive
Integration tests no longer "cheat"
Tests in total
The strict bench's design contract (keyword-only |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR: feat(factors): add strict_bench mode with mandatory random control
Target repo: HKUDS/Vibe-Trading
Target branch:
mainSource branch:
feat/strict-bench-random-controlSummary
bench_runner.run_bench()currently labels an alpha alive when its raw IC mean exceeds 0.02, its IC-positive ratio exceeds 0.55, and the IC self-t-stat clears 2. The gate is benchmarked against zero, not against a same-universe random control — so a row-shuffled version of the same factor will typically clear it too, because the IC is driven by shared cross-sectional beta (market, size, sector) rather than genuine alpha.This PR adds an opt-in
run_bench_strict()companion that requires a same-universe random-control comparison and an optional train/test OOS split before an alpha graduates toconfirmed_alive. The existingrun_bench()is untouched — strict mode is purely additive.Why this matters
The case study that motivates the gate change is a 9-month A-share audit at Soli22de/Bili_Stock:
categorise()'s raw IC gate at one parameter setting or another.research/foundation/Backtest withrandom_control=True) collapsed every one of them: zero-cost alpha shrank from 5-20% to 0-2%, after the 56 bp round-trip A-share retail cost it went negative on all of them.random_controlrail — IC-only categorisation would never have caught the reversal.n_random_repeatsshrinking the null std, baselines drawn from the wrong universe, etc.). bili_stock'sMissingRandomControlexception was the discovery vector for every one of them.The lesson generalises beyond A-share: Harvey, Liu, Zhu (2016) "...and the Cross-Section of Expected Returns" argue that with the multiple-testing burden of large factor zoos (455 in Vibe-Trading's current shipment), the median |t| threshold for accepting a factor needs to be ~3.5, not 2.0. Strict bench replicates that effect via row-shuffled controls rather than an analytical correction.
The shoe-leather summary: with 455 alphas in the registry, you should expect ~23 to clear |t|>2 by pure chance under a Gaussian null. The current
categorise()gate cannot distinguish those 23 noise-positives from a genuine signal. Strict bench can.What changed
New module
agent/src/factors/bench_runner_strict.py:_shuffle_within_rows(df, *, seed)— cross-sectionally permutes non-NaN values within each row (preserves per-date distribution, destroys signal→instrument mapping).compute_random_ic_series(factor_df, return_df, *, n_seeds=5, base_seed=42)— averages IC series acrossn_seedsrow-shuffled controls.alpha_series_paired(signal_ic, random_ic)— per-date paired alpha on the common date index.t_stat(series)— one-sample t-stat against zero; safe on empty / n<2 / constant input.StrictThresholds(alpha_t_threshold=2.0, min_ic_count=30)— frozen dataclass for the gate parameters.categorise_strict(row, thresholds)— buckets intoconfirmed_alive/train_only/reversed_strict/noise.run_bench_strict(zoo, universe, period, *, random_control, n_random_seeds=5, oos_split=None, thresholds=None, top=20, on_progress=None, registry=None)— the public entry point.New tests
agent/tests/factors/test_bench_strict.py(19 tests):min_ic_countfloor / Harvey-Liu-Zhu 3.5 threshold variant.run_bench_strictschema, OOS split presence, keyword-onlyrandom_controlrail enforcement (TypeError when positional), explicitFalseopt-out behaviour.API summary
random_controlis keyword-only and has no default. Passing it positionally — or omitting it — raisesTypeError. This locks the rail at the signature level, the same way Soli22de/Bili_Stock'sBacktest(random_control=...)constructor does.Test plan
pytest agent/tests/factors/test_bench_strict.py— 19 passed in 0.56 s.pytest agent/tests/factors/— 988 existing + 19 new = 1007 passed, 1 skipped, 24.8 s. Zero regression in the existing factor suite.vibe-trading alpha bench --strict --random-control --oos 2018-12-31).Things this PR deliberately does not do
run_bench()orcategorise()— strict mode is additive. The existing IC-only behaviour is preserved for callers (CLI, Web UI, alpha-zoo skill) that still want the cheaper gate.src.factors.bench_runner_strict.categorise()schema — strict adds new category names rather than overloading the existing four.Discussion points for the reviewer
confirmed_alivevsalivewas deliberate —alivecarries the existing semantics and renaming it would silently break dashboards. Open to e.g.strict_aliveif the maintainer prefers._shuffle_within_rowsvs Gaussian noise: row-shuffle preserves the cross-sectional distribution (mean, std, fat tails) of the original factor, which is fairer thannp.random.normal(...). Empirically this matters more for IC-based bench than for portfolio-return-based backtests.train_onlyrequires the test alpha_t to fail; should it also require sign match? (Currently it doesn't — a sign-flip Test getstrain_only, same as a Test that's just below threshold. Easy to tighten if the maintainer wants.)min_ic_count = 30floor: borrowed directly from bili_stock foundation. May want to expose as aStrictThresholdsfield — happy to add.DCO
The commit carries a
Signed-off-by:trailer perCONTRIBUTING.mdDCO requirements using the GitHub-canonical noreply email of@Soli22de.For the maintainer: this is a 772-line additive change (290 lines of code + 295 lines of tests, plus the rest is docstrings/comments). All existing tests pass. Happy to split or rebase as needed.