feat(factors): add strict_bench mode with mandatory random control by Soli22de · Pull Request #143 · HKUDS/Vibe-Trading

Soli22de · 2026-05-25T21:30:06Z

PR: feat(factors): add strict_bench mode with mandatory random control

Target repo: HKUDS/Vibe-Trading
Target branch: main
Source branch: feat/strict-bench-random-control

Summary

bench_runner.run_bench() currently labels an alpha alive when its raw IC mean exceeds 0.02, its IC-positive ratio exceeds 0.55, and the IC self-t-stat clears 2. The gate is benchmarked against zero, not against a same-universe random control — so a row-shuffled version of the same factor will typically clear it too, because the IC is driven by shared cross-sectional beta (market, size, sector) rather than genuine alpha.

This PR adds an opt-in run_bench_strict() companion that requires a same-universe random-control comparison and an optional train/test OOS split before an alpha graduates to confirmed_alive. The existing run_bench() is untouched — strict mode is purely additive.

Why this matters

The case study that motivates the gate change is a 9-month A-share audit at Soli22de/Bili_Stock:

12 single-factor variants (BM, ROE, momentum, low-vol, BAB, MAX, reversal, ...) all passed categorise()'s raw IC gate at one parameter setting or another.
Adding a same-universe random control (research/foundation/ Backtest with random_control=True) collapsed every one of them: zero-cost alpha shrank from 5-20% to 0-2%, and after the 56 bp round-trip A-share retail cost it went negative on all of them.
A separate low-vol factor scored Train Calmar +1.79 and Test Calmar -0.71 under the same random_control rail — IC-only categorisation would never have caught the reversal.
The same rail caught five engine bugs that all biased alpha upward (event-driven persisting an extra day, n_random_repeats shrinking the null std, baselines drawn from the wrong universe, etc.). bili_stock's MissingRandomControl exception was the discovery vector for every one of them.

The lesson generalises beyond A-share: Harvey, Liu, Zhu (2016) "...and the Cross-Section of Expected Returns" argue that with the multiple-testing burden of large factor zoos (455 in Vibe-Trading's current shipment), the median |t| threshold for accepting a factor needs to be ~3.5, not 2.0. Strict bench replicates that effect via row-shuffled controls rather than an analytical correction.

The shoe-leather summary: with 455 alphas in the registry, you should expect roughly 21 to clear |t|>2 by pure chance under a Gaussian null (about 4.6% of 455, assuming independent tests), or ~23 at the conventional 5% level (|t|>1.96). The current categorise() gate cannot distinguish those noise-positives from a genuine signal. Strict bench can.
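
The expected number of chance passes can be checked with a few lines of standard-library Python. This is a back-of-the-envelope sketch that assumes independent tests and a standard normal null; the helper name two_sided_tail is hypothetical and not part of the PR.

```python
import math


def two_sided_tail(t: float) -> float:
    """P(|Z| > t) for a standard normal Z, via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))


N_ALPHAS = 455  # zoo size quoted in the PR description

# Expected number of pure-noise factors that clear each |t| threshold,
# assuming the 455 tests are independent (an idealisation).
expected_by_threshold = {
    thr: N_ALPHAS * two_sided_tail(thr) for thr in (1.96, 2.0, 3.5)
}

for thr, expected in expected_by_threshold.items():
    print(f"|t| > {thr}: about {expected:.1f} chance passes expected")
```

Under these assumptions, raising the threshold from 2.0 to 3.5 takes the expected number of chance passes from roughly twenty to well below one.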

What changed

New module agent/src/factors/bench_runner_strict.py:

_shuffle_within_rows(df, *, seed) — cross-sectionally permutes non-NaN values within each row (preserves per-date distribution, destroys signal→instrument mapping).
compute_random_ic_series(factor_df, return_df, *, n_seeds=5, base_seed=42) — averages IC series across n_seeds row-shuffled controls.
alpha_series_paired(signal_ic, random_ic) — per-date paired alpha on the common date index.
t_stat(series) — one-sample t-stat against zero; safe on empty / n<2 / constant input.
StrictThresholds(alpha_t_threshold=2.0, min_ic_count=30) — frozen dataclass for the gate parameters.
categorise_strict(row, thresholds) — buckets into confirmed_alive / train_only / reversed_strict / noise.
run_bench_strict(zoo, universe, period, *, random_control, n_random_seeds=5, oos_split=None, thresholds=None, top=20, on_progress=None, registry=None) — the public entry point.
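
To illustrate how the paired alpha series and its t-stat fit together, here is a minimal sketch that uses plain dicts and the standard library instead of pandas. The names paired_alpha and t_stat_sketch, the dict-based inputs, and the choice to return 0.0 on degenerate input are assumptions made for illustration; the module's actual return values for those edge cases are not stated in this PR text.

```python
import math
from statistics import mean, stdev


def paired_alpha(signal_ic: dict, random_ic: dict) -> dict:
    """Per-date signal IC minus random-control IC, on the common dates only."""
    common = sorted(signal_ic.keys() & random_ic.keys())
    return {d: signal_ic[d] - random_ic[d] for d in common}


def t_stat_sketch(values) -> float:
    """One-sample t-stat against zero.

    Returns 0.0 on empty, n<2, or constant input (an assumed convention).
    """
    values = list(values)
    if len(values) < 2:
        return 0.0
    sd = stdev(values)  # sample standard deviation (ddof=1)
    if sd == 0.0:
        return 0.0
    return mean(values) / (sd / math.sqrt(len(values)))


# Toy IC series keyed by date string; the last dates do not overlap.
signal = {"2024-01-02": 0.05, "2024-01-03": 0.03, "2024-01-04": 0.04, "2024-01-05": 0.06}
control = {"2024-01-02": 0.01, "2024-01-03": 0.01, "2024-01-04": 0.02, "2024-01-08": 0.00}

alpha = paired_alpha(signal, control)
alpha_t = t_stat_sketch(alpha.values())
print(len(alpha), round(alpha_t, 3))
```

The point of pairing per date is that beta-driven IC shared by the signal and its shuffled control cancels in the difference, so the t-stat measures what the signal adds beyond the control.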

New tests agent/tests/factors/test_bench_strict.py (19 tests):

Shuffle helpers: row-value-set preservation, NaN respect, seed difference.
Random IC series: shape, empty input handling.
Paired alpha + t_stat: edge cases (empty / n<2 / constant) and a hand-checked positive case.
Strict categorisation: noise corridor / reversed / confirmed (full-only and OOS) / train-only / min_ic_count floor / Harvey-Liu-Zhu 3.5 threshold variant.
Integration: run_bench_strict schema, OOS split presence, keyword-only random_control rail enforcement (TypeError when positional), explicit False opt-out behaviour.

API summary

from src.factors.bench_runner_strict import (
    StrictThresholds,
    run_bench_strict,
)

# Default rail: alpha_t > 2 against same-universe random control,
# no OOS split.
result = run_bench_strict(
    zoo="alpha101",
    universe="csi300",
    period="2018-2025",
    random_control=True,    # keyword-only, MUST be passed explicitly
)
print(result["confirmed_alive"], result["train_only"],
      result["reversed_strict"], result["noise"])

# Harvey-Liu-Zhu (2016) corrected threshold + train/test OOS split.
result = run_bench_strict(
    zoo="gtja191",
    universe="csi300",
    period="2010-2026",
    random_control=True,
    oos_split="2018-12-31",
    thresholds=StrictThresholds(alpha_t_threshold=3.5),
)

random_control is keyword-only and has no default. Passing it positionally — or omitting it — raises TypeError. This locks the rail at the signature level, the same way Soli22de/Bili_Stock's Backtest(random_control=...) constructor does.
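
The signature-level rail relies on standard Python behaviour: a parameter declared after a bare * with no default must be supplied by keyword. The stub below is a hypothetical stand-in that mirrors only the shape of the signature, not the real function body.

```python
def run_bench_strict_stub(zoo, universe, period, *, random_control):
    """Hypothetical stand-in: only the signature shape matches the PR."""
    return {"random_control": bool(random_control)}


def raises_type_error(call) -> bool:
    """Return True if calling `call()` raises TypeError."""
    try:
        call()
    except TypeError:
        return True
    return False


# Omitting the argument fails: required keyword-only argument is missing.
omitted = raises_type_error(
    lambda: run_bench_strict_stub("alpha101", "csi300", "2018-2025")
)

# Passing it positionally fails: too many positional arguments.
positional = raises_type_error(
    lambda: run_bench_strict_stub("alpha101", "csi300", "2018-2025", True)
)

# An explicit opt-out is accepted; the caller has made a visible choice.
explicit = run_bench_strict_stub(
    "alpha101", "csi300", "2018-2025", random_control=False
)

print(omitted, positional, explicit)
```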

Test plan

pytest agent/tests/factors/test_bench_strict.py — 19 passed in 0.56 s.
pytest agent/tests/factors/ — 988 existing + 19 new = 1007 passed, 1 skipped, 24.8 s. Zero regression in the existing factor suite.
Reviewer to confirm the strict CLI surface (separate follow-up PR can wire vibe-trading alpha bench --strict --random-control --oos 2018-12-31).
Reviewer to confirm the W4 compare module wants to consume strict results too.

Things this PR deliberately does not do

Does not touch run_bench() or categorise() — strict mode is additive. The existing IC-only behaviour is preserved for callers (CLI, Web UI, alpha-zoo skill) that still want the cheaper gate.
Does not wire a CLI flag — that should be a separate, smaller follow-up PR so this one stays surgical. The strict entry point is already importable from src.factors.bench_runner_strict.
Does not modify the existing alive/reversed/dead thresholds — they remain at 0.02 / 0.55 / 2.0 as documented.
Does not change the categorise() schema — strict adds new category names rather than overloading the existing four.

Discussion points for the reviewer

Naming: confirmed_alive vs alive was deliberate — alive carries the existing semantics and renaming it would silently break dashboards. Open to e.g. strict_alive if the maintainer prefers.
Default threshold = 2.0 vs 3.5: I went with 2.0 to stay backward-comparable, but 3.5 is the literature-supported default for a 455-alpha zoo. Happy to flip it if the maintainer wants strict-by-default.
_shuffle_within_rows vs Gaussian noise: row-shuffle preserves the cross-sectional distribution (mean, std, fat tails) of the original factor, which is fairer than np.random.normal(...). Empirically this matters more for IC-based bench than for portfolio-return-based backtests.
OOS gate semantics: train_only requires the test alpha_t to fail; should it also require sign match? (Currently it doesn't — a sign-flip Test gets train_only, same as a Test that's just below threshold. Easy to tighten if the maintainer wants.)
min_ic_count = 30 floor: borrowed directly from bili_stock foundation. May want to expose as a StrictThresholds field — happy to add.

DCO

The commit carries a Signed-off-by: trailer per CONTRIBUTING.md DCO requirements using the GitHub-canonical noreply email of @Soli22de.

For the maintainer: this is a 772-line additive change (290 lines of code and 295 lines of tests; the rest is docstrings and comments). All existing tests pass. Happy to split or rebase as needed.

bench_runner.run_bench() labels an alpha "alive" when its raw IC mean beats 0.02, IC-positive ratio beats 0.55, and the IC self-t-stat clears 2. That gate accepts factors whose IC is driven by shared cross-sectional beta (e.g. market or size factor leakage): the test is benchmarked against zero, not against a same-universe random control, so a row-shuffled version of the same factor will typically clear the gate too.

This commit adds an opt-in `run_bench_strict()` (companion module, existing `run_bench()` untouched and back-compat) that:

1. Builds N same-universe random controls per alpha by row-shuffling the factor values (preserves the per-date cross-sectional distribution, destroys the signal->instrument mapping).
2. Computes a per-date paired alpha series = signal_IC - random_IC and evaluates its t-stat instead of the IC self-t-stat.
3. Optionally splits train vs test on a user-supplied date and requires the test-period alpha_t to also clear the threshold before labelling an alpha `confirmed_alive` (otherwise `train_only`).
4. Makes `random_control: bool` keyword-only so callers must opt in or opt out explicitly; passing nothing raises TypeError. This mirrors the rail from Soli22de/Bili_Stock's foundation Backtest constructor, which was hardened after a 9-month audit found that every accidental no-random-control run inflated alpha by 3-8 pp.

Categories:

  confirmed_alive - alpha_t_full > thresh AND (no OOS OR alpha_t_test > thresh)
  train_only      - alpha_t_full > thresh but alpha_t_test < thresh
  reversed_strict - alpha_t_full <= -thresh
  noise           - alpha_t_full in (-thresh, thresh) OR ic_count < 30

The default threshold is 2.0 to stay numerically comparable with the existing `categorise()` gate. Pass `StrictThresholds(alpha_t_threshold=3.5)` to apply the Harvey-Liu-Zhu (2016) multiple-testing recommendation when running the full 455-alpha zoo.

Tests
-----
19 new tests in `agent/tests/factors/test_bench_strict.py` cover:

* `_shuffle_within_rows` value-set preservation, NaN respect, seed diff
* `compute_random_ic_series` shape / empty input
* `alpha_series_paired` index alignment
* `t_stat` edge cases (empty, n<2, constant) and a hand-checked positive
* `categorise_strict` noise / reversed / confirmed (full-only and OOS) / train_only / min_ic_count gate / Harvey-Liu-Zhu threshold variant
* `run_bench_strict` end-to-end schema, OOS split presence, keyword-only rail enforcement (TypeError when positional), explicit `False` opt-out

Full factor suite remains green: 988 passed + 19 new = 1007 passed, 1 skipped, 24.8 s.

See PR description for the Soli22de/Bili_Stock case study that motivates the change.

Signed-off-by: Soli22de <177382421+Soli22de@users.noreply.github.com>

Independent adversarial review (run before upstream maintainer review) surfaced 10 issues in run_bench_strict. This commit fixes the high- and medium-impact ones in-place and adds 14 regression tests. The strict bench's design contract is unchanged; the changes are correctness and back-compat-only.

Fixes
-----
A1. OOS train/test boundary date no longer double-counts. .loc[:t] and .loc[t:] are both label-inclusive in pandas, so the split date previously appeared in both buckets. Replaced with explicit comparisons:

      train = alpha_full[alpha_full.index <= oos_ts]
      test = alpha_full[alpha_full.index > oos_ts]

    Also added ic_count_train / ic_count_test per row so categorise_strict can later enforce per-bucket min_ic_count.

A2. compute_random_ic_series now uses an inner join across seeds. pd.concat(axis=1, join='inner') ensures every retained date is the mean of *all* available seeds, not a hodgepodge of 1-seed and 5-seed averages on dates where some seeds dropped out due to the _MIN_VALID_PER_DATE=5 guard.

A3. on_progress callback is now exception-safe in every branch. Refactored to a single _fire_progress() helper invoked from both the empty-IC continue path and the normal end-of-loop path.

A4. _shuffle_within_rows pins ±inf in place like NaN. Switched the mask from ~np.isnan() to np.isfinite() so an inf/-inf cell stays at its original position. Defensive against third-party zoos that bypass _validate_output.

A5. OOS sign-flip is now categorised reversed_strict, not train_only. Bucket order: t_full >= thr AND t_test <= -thr → reversed_strict (most diagnostic failure); t_full >= thr AND t_test in noise band → train_only (benign decay).

A6. Per-bucket ic_count fields surfaced on each row.

A7. Sorting uses unrounded _ir_raw / _alpha_t_full_raw / _ic_mean_raw helper keys to keep top-N stable across runs.

C1. Wire schema regression: strict result now carries legacy aliases alive/reversed/dead/by_theme alongside the strict-specific keys. Existing dashboards keep rendering without code changes:

      alive = confirmed_alive
      reversed = reversed_strict
      dead = noise + train_only

    by_theme is built by a strict-aware variant that emits both the new four-way and the legacy three-way counts per theme.

C2. _slim payload re-adds formula_latex so the wiki / dashboard top-N cards keep showing the formula column.

C5. Error envelope is now schema-complete from the start. Every error path (empty zoo, bad universe, bad forward-returns, bad oos_split) returns a dict with zeroed counters, empty lists, and the rail metadata intact, so downstream consumers can depend on every key being present regardless of status.

C6. n_random_seeds=0 is clamped to 1, AND the effective value is persisted to entry['n_random_seeds'] so the wire response doesn't lie about the seed count.

Internal sort-helper keys (_ir_raw etc) are stripped from public_rows before they reach the wire payload; only _category is retained so external consumers can read the bucket label.

Tests
-----
14 new regression tests in test_bench_strict.py. Total now 33 strict + 988 existing; full agent/tests/factors/ suite is 1002 passed, 1 skipped, 24.25 s. Zero regression in the existing factor tests.

The new tests:

- test_oos_train_test_split_does_not_double_count_boundary (A1)
- test_compute_random_ic_series_inner_joins_seed_dates (A2)
- test_run_bench_strict_on_progress_exception_is_caught (A3)
- test_shuffle_handles_inf_like_nan (A4)
- test_categorise_oos_sign_flip_is_reversed_strict_not_train_only (A5)
- test_categorise_oos_decay_to_noise_band_is_train_only (A5 companion)
- test_run_bench_strict_emits_legacy_alive_dead_reversed_keys (C1)
- test_run_bench_strict_legacy_alive_equals_confirmed_alive (C1)
- test_run_bench_strict_top_lists_include_formula_latex (C2)
- test_run_bench_strict_empty_zoo_returns_schema_with_counters (C5)
- test_run_bench_strict_n_random_seeds_zero_is_clamped (C6)
- test_run_bench_strict_rows_drop_underscore_prefixed_sort_keys (sort key hygiene)
- test_run_bench_strict_catches_planted_alive_signal (closes the 'integration tests cheat' finding)
- test_run_bench_strict_catches_planted_reversed_signal (same)

Signed-off-by: Soli22de <177382421+Soli22de@users.noreply.github.com>
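
The A2 fix in this commit turns on the difference between the default outer join and join='inner' in pd.concat. The sketch below uses two made-up per-seed IC series (values and dates are illustrative only) to show how the outer join silently produces a single-seed "average" on a date that one seed dropped.

```python
import pandas as pd

dates = pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"])

# Two hypothetical per-seed random-control IC series. Seed B has no value on
# 2024-01-03, as if that date had failed a minimum-valid-count guard.
seed_a = pd.Series([0.01, 0.02, 0.03], index=dates)
seed_b = pd.Series([0.03, 0.00], index=dates[[0, 2]])

# Default outer join: the missing cell becomes NaN and mean() skips it,
# so 2024-01-03 is an average over one seed only.
outer_mean = pd.concat([seed_a, seed_b], axis=1).mean(axis=1)

# Inner join: only dates present for every seed are retained.
inner_mean = pd.concat([seed_a, seed_b], axis=1, join="inner").mean(axis=1)

print(outer_mean)
print(inner_mean)
```

The trade-off is that the inner join shortens the control series, so fewer dates enter the paired alpha; the commit accepts that in exchange for a uniform seed count per date.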

Soli22de · 2026-05-25T21:49:22Z

Self-review pass: ran an independent adversarial code review on the
first commit and pushed a fixup that addresses 10 findings. Categorised
by severity:

High impact (math / API surface)

A1 OOS train/test slice — .loc[:t] and .loc[t:] are both
label-inclusive in pandas, so the split date was being double-counted
across train and test. Replaced with explicit comparisons.
C1 Wire schema regression — the strict result dropped the legacy
alive / reversed / dead / by_theme keys that
alpha_routes._result_for_wire() whitelists, which would have
silently broken existing dashboards. Now emitted as aliases of the
strict buckets:
- alive ← confirmed_alive
- reversed ← reversed_strict
- dead ← noise + train_only
- by_theme built by a strict-aware variant that carries both the
  four-way and the three-way counts per theme.
C2 _slim() was dropping formula_latex, which the wiki / top-N
cards render. Re-added.
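
The A1 boundary issue is easy to reproduce on a three-date toy series. The sketch below contrasts label-based slicing with the explicit comparisons quoted in the fixup commit; the series values are made up, and alpha_full / oos_ts follow the names used in the commit message.

```python
import pandas as pd

idx = pd.to_datetime(["2018-12-28", "2018-12-31", "2019-01-02"])
alpha_full = pd.Series([0.01, 0.02, 0.03], index=idx)
oos_ts = pd.Timestamp("2018-12-31")

# Label-based .loc slices include both endpoints, so the split date
# appears in the train slice and in the test slice.
train_loc = alpha_full.loc[:oos_ts]
test_loc = alpha_full.loc[oos_ts:]

# Explicit comparisons assign the split date to train only.
train = alpha_full[alpha_full.index <= oos_ts]
test = alpha_full[alpha_full.index > oos_ts]

print(len(train_loc) + len(test_loc))  # one more than len(alpha_full)
print(len(train) + len(test))          # equals len(alpha_full)
```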

Medium impact

A2 compute_random_ic_series was using pd.concat(axis=1).mean(axis=1)
which silently degraded to single-seed averages on dates where any
shuffled seed dropped out via the _MIN_VALID_PER_DATE=5 guard.
Switched to join='inner' so every retained date is the mean of
all available seeds uniformly.
A3 on_progress callback was wrapped in a try/except at the
bottom of the loop but called bare in the empty-IC branch. A raising
callback would have killed the rest of the bench. Refactored to a
single _fire_progress() helper used in both paths.
A5 OOS sign-flip was lumped into train_only. A factor whose
train-period alpha lights up at +3σ and inverts to −3σ in OOS is one
of the most diagnostic outcomes, not a benign decay. Now routed to
reversed_strict (with train_only reserved for OOS that decays to
the noise band).
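
The bucket order after the A5 change can be written out as a small decision function. This is a sketch under stated assumptions, not the module's categorise_strict: the function name, the flat argument list, and the handling of values exactly at the threshold are illustrative choices.

```python
from typing import Optional


def categorise_sketch(
    t_full: float,
    t_test: Optional[float],
    ic_count: int,
    thr: float = 2.0,
    min_ic_count: int = 30,
) -> str:
    """Assumed bucket order; t_test is None when no OOS split was requested."""
    if ic_count < min_ic_count:
        return "noise"            # too few IC observations to judge
    if t_full <= -thr:
        return "reversed_strict"  # significantly negative over the full period
    if t_full < thr:
        return "noise"            # inside the noise band
    if t_test is None or t_test >= thr:
        return "confirmed_alive"  # no OOS requested, or OOS also clears
    if t_test <= -thr:
        return "reversed_strict"  # A5: sign flip out of sample
    return "train_only"           # OOS decays into the noise band


print(categorise_sketch(3.0, -3.0, 100))  # sign flip
print(categorise_sketch(3.0, 0.5, 100))   # benign decay
```

Checking the sign flip before falling through to train_only is what separates the two outcomes that the first commit lumped together.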

Low / defensive

A4 _shuffle_within_rows now pins ±inf in place like NaN
(np.isfinite() mask) — defensive against third-party zoos that
bypass Registry._validate_output.
A6 Added per-bucket ic_count_train / ic_count_test on each row.
A7 Sorting uses unrounded _ir_raw / _alpha_t_full_raw /
_ic_mean_raw helper keys; internal-only, stripped from the public
wire rows so they never leak into JSON.
C5 Error envelope now ships every counter / list key pre-zeroed
so downstream consumers can depend on the keys regardless of
status.
C6 n_random_seeds=0 is clamped to 1 and the effective value is
persisted to entry["n_random_seeds"] (was reporting 0 even though
the loop ran).
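
For readers who want to see the A4 behaviour concretely, here is a pure-Python sketch of a within-row shuffle that pins non-finite cells. It works on lists of floats rather than a DataFrame, and the function names are hypothetical; the module itself is described as using an np.isfinite() mask.

```python
import math
import random


def shuffle_row(row, rng):
    """Permute finite values among finite positions; NaN and ±inf stay put."""
    finite_pos = [i for i, v in enumerate(row) if math.isfinite(v)]
    values = [row[i] for i in finite_pos]
    rng.shuffle(values)
    out = list(row)
    for pos, value in zip(finite_pos, values):
        out[pos] = value
    return out


def shuffle_rows_sketch(rows, seed):
    """Shuffle each row (one date's cross-section) independently."""
    rng = random.Random(seed)
    return [shuffle_row(row, rng) for row in rows]


rows = [
    [0.3, float("nan"), -1.2, float("inf"), 0.7],
    [1.0, 2.0, 3.0, 4.0, 5.0],
]
shuffled = shuffle_rows_sketch(rows, seed=42)
print(shuffled)
```

Each row keeps the same set of finite values, so the per-date distribution is preserved while the mapping from value to instrument is scrambled.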

Integration tests no longer "cheat"
The original integration tests asserted schema presence but never
exercised the four StrictCategory outcomes on a planted signal.
Added:

test_run_bench_strict_catches_planted_alive_signal — a synthetic
panel with a baked-in cross-sectional momentum signal; asserts
confirmed_alive == 1.
test_run_bench_strict_catches_planted_reversed_signal — same
panel, inverted-momentum factor; asserts reversed_strict == 1.

Tests in total

Was: 19 strict + 988 existing = 1007 passed.
Now: 33 strict + 988 existing = 1002 passed, 1 skipped, 24.25 s on
agent/tests/factors/. Zero regression in the existing factor
suite.

The strict bench's design contract (keyword-only random_control,
opt-in companion to run_bench(), no modification to existing
behaviour) is unchanged. All changes are correctness or backward
compat.

Soli22de added 2 commits May 26, 2026 05:28

Soli22de wants to merge 2 commits into HKUDS:main from Soli22de:feat/strict-bench-random-control
