Skip to content

feat(factors): add strict_bench mode with mandatory random control#143

Open
Soli22de wants to merge 2 commits into
HKUDS:mainfrom
Soli22de:feat/strict-bench-random-control
Open

feat(factors): add strict_bench mode with mandatory random control#143
Soli22de wants to merge 2 commits into
HKUDS:mainfrom
Soli22de:feat/strict-bench-random-control

Conversation

@Soli22de
Copy link
Copy Markdown

PR: feat(factors): add strict_bench mode with mandatory random control

Target repo: HKUDS/Vibe-Trading
Target branch: main
Source branch: feat/strict-bench-random-control


Summary

bench_runner.run_bench() currently labels an alpha alive when its raw IC mean exceeds 0.02, its IC-positive ratio exceeds 0.55, and the IC self-t-stat clears 2. The gate is benchmarked against zero, not against a same-universe random control — so a row-shuffled version of the same factor will typically clear it too, because the IC is driven by shared cross-sectional beta (market, size, sector) rather than genuine alpha.

This PR adds an opt-in run_bench_strict() companion that requires a same-universe random-control comparison and an optional train/test OOS split before an alpha graduates to confirmed_alive. The existing run_bench() is untouched — strict mode is purely additive.

Why this matters

The case study that motivates the gate change is a 9-month A-share audit at Soli22de/Bili_Stock:

  • 12 single-factor variants (BM, ROE, momentum, low-vol, BAB, MAX, reversal, ...) all passed categorise()'s raw IC gate at one parameter setting or another.
  • Adding a same-universe random control (research/foundation/ Backtest with random_control=True) collapsed every one of them: zero-cost alpha shrank from 5-20% to 0-2%, after the 56 bp round-trip A-share retail cost it went negative on all of them.
  • A separate low-vol factor scored Train Calmar +1.79 and Test Calmar -0.71 under the same random_control rail — IC-only categorisation would never have caught the reversal.
  • The same rail caught five engine bugs that all biased alpha upward (event-driven persisting an extra day, n_random_repeats shrinking the null std, baselines drawn from the wrong universe, etc.). bili_stock's MissingRandomControl exception was the discovery vector for every one of them.

The lesson generalises beyond A-share: Harvey, Liu, Zhu (2016) "...and the Cross-Section of Expected Returns" argue that with the multiple-testing burden of large factor zoos (455 in Vibe-Trading's current shipment), the median |t| threshold for accepting a factor needs to be ~3.5, not 2.0. Strict bench replicates that effect via row-shuffled controls rather than an analytical correction.

The shoe-leather summary: with 455 alphas in the registry, you should expect ~23 to clear |t|>2 by pure chance under a Gaussian null. The current categorise() gate cannot distinguish those 23 noise-positives from a genuine signal. Strict bench can.

What changed

New module agent/src/factors/bench_runner_strict.py:

  • _shuffle_within_rows(df, *, seed) — cross-sectionally permutes non-NaN values within each row (preserves per-date distribution, destroys signal→instrument mapping).
  • compute_random_ic_series(factor_df, return_df, *, n_seeds=5, base_seed=42) — averages IC series across n_seeds row-shuffled controls.
  • alpha_series_paired(signal_ic, random_ic) — per-date paired alpha on the common date index.
  • t_stat(series) — one-sample t-stat against zero; safe on empty / n<2 / constant input.
  • StrictThresholds(alpha_t_threshold=2.0, min_ic_count=30) — frozen dataclass for the gate parameters.
  • categorise_strict(row, thresholds) — buckets into confirmed_alive / train_only / reversed_strict / noise.
  • run_bench_strict(zoo, universe, period, *, random_control, n_random_seeds=5, oos_split=None, thresholds=None, top=20, on_progress=None, registry=None) — the public entry point.

New tests agent/tests/factors/test_bench_strict.py (19 tests):

  • Shuffle helpers: row-value-set preservation, NaN respect, seed difference.
  • Random IC series: shape, empty input handling.
  • Paired alpha + t_stat: edge cases (empty / n<2 / constant) and a hand-checked positive case.
  • Strict categorisation: noise corridor / reversed / confirmed (full-only and OOS) / train-only / min_ic_count floor / Harvey-Liu-Zhu 3.5 threshold variant.
  • Integration: run_bench_strict schema, OOS split presence, keyword-only random_control rail enforcement (TypeError when positional), explicit False opt-out behaviour.

API summary

from src.factors.bench_runner_strict import (
    StrictThresholds,
    run_bench_strict,
)

# Default rail: alpha_t > 2 against same-universe random control,
# no OOS split.
result = run_bench_strict(
    zoo="alpha101",
    universe="csi300",
    period="2018-2025",
    random_control=True,    # keyword-only, MUST be passed explicitly
)
print(result["confirmed_alive"], result["train_only"],
      result["reversed_strict"], result["noise"])

# Harvey-Liu-Zhu (2016) corrected threshold + train/test OOS split.
result = run_bench_strict(
    zoo="gtja191",
    universe="csi300",
    period="2010-2026",
    random_control=True,
    oos_split="2018-12-31",
    thresholds=StrictThresholds(alpha_t_threshold=3.5),
)

random_control is keyword-only and has no default. Passing it positionally — or omitting it — raises TypeError. This locks the rail at the signature level, the same way Soli22de/Bili_Stock's Backtest(random_control=...) constructor does.

Test plan

  • pytest agent/tests/factors/test_bench_strict.py — 19 passed in 0.56 s.
  • pytest agent/tests/factors/ — 988 existing + 19 new = 1007 passed, 1 skipped, 24.8 s. Zero regression in the existing factor suite.
  • Reviewer to confirm the strict CLI surface (separate follow-up PR can wire vibe-trading alpha bench --strict --random-control --oos 2018-12-31).
  • Reviewer to confirm the W4 compare module wants to consume strict results too.

Things this PR deliberately does not do

  • Does not touch run_bench() or categorise() — strict mode is additive. The existing IC-only behaviour is preserved for callers (CLI, Web UI, alpha-zoo skill) that still want the cheaper gate.
  • Does not wire a CLI flag — that should be a separate, smaller follow-up PR so this one stays surgical. The strict entry point is already importable from src.factors.bench_runner_strict.
  • Does not modify the existing alive/reversed/dead thresholds — they remain at 0.02 / 0.55 / 2.0 as documented.
  • Does not change the categorise() schema — strict adds new category names rather than overloading the existing four.

Discussion points for the reviewer

  1. Naming: confirmed_alive vs alive was deliberate — alive carries the existing semantics and renaming it would silently break dashboards. Open to e.g. strict_alive if the maintainer prefers.
  2. Default threshold = 2.0 vs 3.5: I went with 2.0 to stay backward-comparable, but 3.5 is the literature-supported default for a 455-alpha zoo. Happy to flip it if the maintainer wants strict-by-default.
  3. _shuffle_within_rows vs Gaussian noise: row-shuffle preserves the cross-sectional distribution (mean, std, fat tails) of the original factor, which is fairer than np.random.normal(...). Empirically this matters more for IC-based bench than for portfolio-return-based backtests.
  4. OOS gate semantics: train_only requires the test alpha_t to fail; should it also require sign match? (Currently it doesn't — a sign-flip Test gets train_only, same as a Test that's just below threshold. Easy to tighten if the maintainer wants.)
  5. min_ic_count = 30 floor: borrowed directly from bili_stock foundation. May want to expose as a StrictThresholds field — happy to add.

DCO

The commit carries a Signed-off-by: trailer per CONTRIBUTING.md DCO requirements using the GitHub-canonical noreply email of @Soli22de.


For the maintainer: this is a 772-line additive change (290 lines of code + 295 lines of tests, plus the rest is docstrings/comments). All existing tests pass. Happy to split or rebase as needed.

Soli22de added 2 commits May 26, 2026 05:28
bench_runner.run_bench() labels an alpha "alive" when its raw IC mean
beats 0.02, IC-positive ratio beats 0.55, and the IC self-t-stat clears
2.  That gate accepts factors whose IC is driven by shared cross-
sectional beta (e.g. market or size factor leakage) — the test is
benchmarked against zero, not against a same-universe random control,
so a row-shuffled version of the same factor will typically clear the
gate too.

This commit adds an opt-in `run_bench_strict()` (companion module,
existing `run_bench()` untouched and back-compat) that:

1. Builds N same-universe random controls per alpha by row-shuffling the
   factor values (preserves the per-date cross-sectional distribution,
   destroys the signal->instrument mapping).
2. Computes a per-date paired alpha series = signal_IC - random_IC and
   evaluates its t-stat instead of the IC self-t-stat.
3. Optionally splits train vs test on a user-supplied date and requires
   the test-period alpha_t to also clear the threshold before labelling
   an alpha `confirmed_alive` (otherwise `train_only`).
4. Makes `random_control: bool` keyword-only so callers must opt in or
   opt out explicitly — passing nothing raises TypeError. This mirrors
   the rail from Soli22de/Bili_Stock's foundation Backtest constructor,
   which was hardened after a 9-month audit found that every accidental
   no-random-control run inflated alpha by 3-8 pp.

Categories:
  confirmed_alive  - alpha_t_full > thresh AND (no OOS OR alpha_t_test > thresh)
  train_only       - alpha_t_full > thresh but alpha_t_test < thresh
  reversed_strict  - alpha_t_full <= -thresh
  noise            - alpha_t_full in (-thresh, thresh) OR ic_count < 30

The default threshold is 2.0 to stay numerically comparable with the
existing `categorise()` gate. Pass `StrictThresholds(alpha_t_threshold=3.5)`
to apply the Harvey-Liu-Zhu (2016) multiple-testing recommendation when
running the full 455-alpha zoo.

Tests
-----

19 new tests in `agent/tests/factors/test_bench_strict.py` cover:

* `_shuffle_within_rows` value-set preservation, NaN respect, seed diff
* `compute_random_ic_series` shape / empty input
* `alpha_series_paired` index alignment
* `t_stat` edge cases (empty, n<2, constant) and a hand-checked positive
* `categorise_strict` noise / reversed / confirmed (full-only and OOS)
  / train_only / min_ic_count gate / Harvey-Liu-Zhu threshold variant
* `run_bench_strict` end-to-end schema, OOS split presence, keyword-only
  rail enforcement (TypeError when positional), explicit `False` opt-out

Full factor suite remains green: 988 passed + 19 new = 1007 passed,
1 skipped, 24.8 s.

See PR description for the Soli22de/Bili_Stock case study that
motivates the change.

Signed-off-by: Soli22de <177382421+Soli22de@users.noreply.github.com>
Independent adversarial review (run before upstream maintainer
review) surfaced 10 issues in run_bench_strict. This commit fixes
the high- and medium-impact ones in-place and adds 14 regression
tests. The strict bench's design contract is unchanged; the changes
are correctness and back-compat-only.

Fixes
-----

A1. OOS train/test boundary date no longer double-counts.
    .loc[:t] and .loc[t:] are both label-inclusive in pandas, so the
    split date previously appeared in both buckets. Replaced with
    explicit comparisons:
        train = alpha_full[alpha_full.index <= oos_ts]
        test  = alpha_full[alpha_full.index >  oos_ts]
    Also added ic_count_train / ic_count_test per row so
    categorise_strict can later enforce per-bucket min_ic_count.

A2. compute_random_ic_series now uses an inner join across seeds.
    pd.concat(axis=1, join='inner') ensures every retained date is
    the mean of *all* available seeds — not a hodgepodge of 1-seed
    and 5-seed averages on dates where some seeds dropped out due to
    the _MIN_VALID_PER_DATE=5 guard.

A3. on_progress callback is now exception-safe in every branch.
    Refactored to a single _fire_progress() helper invoked from both
    the empty-IC continue path and the normal end-of-loop path.

A4. _shuffle_within_rows pins ±inf in place like NaN.
    Switched the mask from ~np.isnan() to np.isfinite() so an
    inf/-inf cell stays at its original position. Defensive against
    third-party zoos that bypass _validate_output.

A5. OOS sign-flip is now categorised reversed_strict, not train_only.
    Bucket order: t_full >= thr AND t_test <= -thr → reversed_strict
    (most diagnostic failure); t_full >= thr AND t_test in noise band
    → train_only (benign decay).

A6. Per-bucket ic_count fields surfaced on each row.

A7. Sorting uses unrounded _ir_raw / _alpha_t_full_raw / _ic_mean_raw
    helper keys to keep top-N stable across runs.

C1. Wire schema regression — strict result now carries legacy aliases
    alive/reversed/dead/by_theme alongside the strict-specific keys.
    Existing dashboards keep rendering without code changes:
        alive    = confirmed_alive
        reversed = reversed_strict
        dead     = noise + train_only
    by_theme is built by a strict-aware variant that emits both the
    new four-way and the legacy three-way counts per theme.

C2. _slim payload re-adds formula_latex so the wiki / dashboard top-N
    cards keep showing the formula column.

C5. Error envelope is now schema-complete from the start. Every error
    path (empty zoo, bad universe, bad forward-returns, bad oos_split)
    returns a dict with zeroed counters, empty lists, and the rail
    metadata intact — downstream consumers can depend on every key
    being present regardless of status.

C6. n_random_seeds=0 is clamped to 1, AND the effective value is
    persisted to entry['n_random_seeds'] so the wire response doesn't
    lie about the seed count.

Internal sort-helper keys (_ir_raw etc) are stripped from public_rows
before they reach the wire payload — only _category is retained so
external consumers can read the bucket label.

Tests
-----

14 new regression tests in test_bench_strict.py — total now 33
strict + 988 existing = full agent/tests/factors/ suite is 1002
passed, 1 skipped, 24.25 s. Zero regression in the existing factor
tests.

The new tests:
- test_oos_train_test_split_does_not_double_count_boundary (A1)
- test_compute_random_ic_series_inner_joins_seed_dates (A2)
- test_run_bench_strict_on_progress_exception_is_caught (A3)
- test_shuffle_handles_inf_like_nan (A4)
- test_categorise_oos_sign_flip_is_reversed_strict_not_train_only (A5)
- test_categorise_oos_decay_to_noise_band_is_train_only (A5 companion)
- test_run_bench_strict_emits_legacy_alive_dead_reversed_keys (C1)
- test_run_bench_strict_legacy_alive_equals_confirmed_alive (C1)
- test_run_bench_strict_top_lists_include_formula_latex (C2)
- test_run_bench_strict_empty_zoo_returns_schema_with_counters (C5)
- test_run_bench_strict_n_random_seeds_zero_is_clamped (C6)
- test_run_bench_strict_rows_drop_underscore_prefixed_sort_keys (sort key
  hygiene)
- test_run_bench_strict_catches_planted_alive_signal (closes the
  'integration tests cheat' finding)
- test_run_bench_strict_catches_planted_reversed_signal (same)

Signed-off-by: Soli22de <177382421+Soli22de@users.noreply.github.com>
@Soli22de
Copy link
Copy Markdown
Author

Self-review pass: ran an independent adversarial code review on the
first commit and pushed a fixup that addresses 10 findings. Categorised
by severity:

High impact (math / API surface)

  • A1 OOS train/test slice — .loc[:t] and .loc[t:] are both
    label-inclusive in pandas, so the split date was being double-counted
    across train and test. Replaced with explicit comparisons.
  • C1 Wire schema regression — the strict result dropped the legacy
    alive / reversed / dead / by_theme keys that
    alpha_routes._result_for_wire() whitelists, which would have
    silently broken existing dashboards. Now emitted as aliases of the
    strict buckets:
    • aliveconfirmed_alive
    • reversedreversed_strict
    • deadnoise + train_only
    • by_theme built by a strict-aware variant that carries both the
      four-way and the three-way counts per theme.
  • C2 _slim() was dropping formula_latex, which the wiki / top-N
    cards render. Re-added.

Medium impact

  • A2 compute_random_ic_series was using pd.concat(axis=1).mean(axis=1)
    which silently degraded to single-seed averages on dates where any
    shuffled seed dropped out via the _MIN_VALID_PER_DATE=5 guard.
    Switched to join='inner' so every retained date is the mean of
    all available seeds uniformly.
  • A3 on_progress callback was wrapped in a try/except on the
    bottom of the loop but called bare in the empty-IC branch. A raising
    callback would have killed the rest of the bench. Refactored to a
    single _fire_progress() helper used in both paths.
  • A5 OOS sign-flip was lumped into train_only. A factor whose
    train-period alpha lights up at +3σ and inverts to −3σ in OOS is one
    of the most diagnostic outcomes, not a benign decay. Now routed to
    reversed_strict (with train_only reserved for OOS that decays to
    the noise band).

Low / defensive

  • A4 _shuffle_within_rows now pins ±inf in place like NaN
    (np.isfinite() mask) — defensive against third-party zoos that
    bypass Registry._validate_output.
  • A6 Added per-bucket ic_count_train / ic_count_test on each row.
  • A7 Sorting uses unrounded _ir_raw / _alpha_t_full_raw /
    _ic_mean_raw helper keys; internal-only, stripped from the public
    wire rows so they never leak into JSON.
  • C5 Error envelope now ships every counter / list key pre-zeroed
    so downstream consumers can depend on the keys regardless of
    status.
  • C6 n_random_seeds=0 is clamped to 1 and the effective value is
    persisted to entry["n_random_seeds"] (was reporting 0 even though
    the loop ran).

Integration tests no longer "cheat"
The original integration tests asserted schema presence but never
exercised the four StrictCategory outcomes on a planted signal.
Added:

  • test_run_bench_strict_catches_planted_alive_signal — a synthetic
    panel with a baked-in cross-sectional momentum signal; asserts
    confirmed_alive == 1.
  • test_run_bench_strict_catches_planted_reversed_signal — same
    panel, inverted-momentum factor; asserts reversed_strict == 1.

Tests in total

  • Was: 19 strict + 988 existing = 1007 passed.
  • Now: 33 strict + 988 existing = 1002 passed, 1 skipped, 24.25 s on
    agent/tests/factors/. Zero regression in the existing factor
    suite.

The strict bench's design contract (keyword-only random_control,
opt-in companion to run_bench(), no modification to existing
behaviour) is unchanged. All changes are correctness or backward
compat.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant