Automated alpha research pipeline for the WorldQuant BRAIN platform. Built for the International Quant Championship 2026 (IQC).
Instead of clicking "Submit" on the BRAIN web GUI one alpha at a time, this pipeline turns alpha research into a brute-force overnight job: define a handful of expression templates, cross them against thousands of fundamental fields, run the lot through BRAIN's backtester three at a time, and wake up to a ranked, de-correlated basket of submission candidates.
BRAIN exposes ~85,000 data fields and a 5-year backtester behind an HTTP API. The web GUI surfaces one simulation at a time; the API allows three concurrent simulations. A manual workflow caps out at around 10 backtests per day. This pipeline runs ~800 overnight, persists every result, filters by Sharpe / fitness / IS-check status, then prunes correlated alphas down to a basket you can actually submit.
flowchart LR
Fields["wq discover<br/>/data-fields paginator"] --> FieldsCSV[(fields.csv)]
Templates["8 alpha templates<br/>(rank, ts_mean, ts_zscore...)"] --> Grid
FieldsCSV --> Grid
Universes[universes × neutralizations] --> Grid
Grid["grid build<br/>(template × field × universe × neut)"] --> Dispatcher
Dispatcher["wq run<br/>active-futures dispatcher<br/>ThreadPoolExecutor(3)"] -->|3 concurrent<br/>per-thread Session| BRAIN[("WorldQuant BRAIN<br/>HTTP API")]
BRAIN -->|poll Retry-After<br/>IS metrics + checks| Dispatcher
Dispatcher --> DB[("alphas.db<br/>SQLite + WAL<br/>1 row / sim")]
DB --> Survivors["wq survivors<br/>SQL filter<br/>(Sharpe, fitness, IS checks)"]
Survivors --> SurvivorsCSV[(survivors.csv)]
SurvivorsCSV --> Correlate["wq correlate<br/>fetch PnL → diff →<br/>corr matrix → greedy de-dup"]
Correlate --> Basket[(basket.csv)]
Basket --> Submit["manually submit<br/>best 5–10 to BRAIN platform"]
style BRAIN fill:#1a1a2e,color:#fff
style DB fill:#16213e,color:#fff
style Submit fill:#0f3460,color:#fff
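The grid-build step in the flowchart is a plain cross product. A minimal sketch of that idea, assuming a simplified `GridItem` shape, placeholder field names, and illustrative template strings (none of these are taken from the package itself):

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class GridItem:
    """One simulation to run (hypothetical shape, not the package's GridItem)."""
    expression: str
    universe: str
    neutralization: str


def build_grid(templates, fields, universes, neutralizations):
    """Cross every template with every field, universe and neutralization."""
    return [
        GridItem(template.format(field=field), universe, neut)
        for template, field, universe, neut in product(
            templates, fields, universes, neutralizations
        )
    ]


# Illustrative inputs only; real field names would come from fields.csv.
templates = ["rank({field})", "ts_mean({field}, 20)", "ts_zscore({field}, 60)"]
fields = ["field_a", "field_b"]
universes = ["TOP3000"]
neutralizations = ["MARKET", "SUBINDUSTRY"]

grid = build_grid(templates, fields, universes, neutralizations)
print(len(grid))           # 3 * 2 * 1 * 2 = 12
print(grid[0].expression)  # rank(field_a)
```

The grid size is the product of the four input lengths, which is why capping the number of fields is the quickest way to shrink a run.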
git clone <this-repo> && cd wq-alpha-pipeline
python -m venv .venv && source .venv/bin/activate
pip install -e .
cp .env.example .env # fill in WQ_EMAIL + WQ_PASS
# 1. discover available data fields (one-time, ~30s)
wq discover
# 2. run a small grid as smoke test (≈60 sims, ≈25 min at 3-concurrent)
wq run --max-fields 5 --neutralizations MARKET,SUBINDUSTRY
# 3. filter survivors at default thresholds
wq survivors
# 4. correlation-prune to an uncorrelated basket
wq correlate

| Command | What it does |
|---|---|
| `wq discover` | Pulls all available data fields → fields.csv + fields.json. |
| `wq run` | Grid runner: 8 templates × N fields × universes × neutralizations. |
| `wq expressions` | Same dispatcher, but reads raw expressions from a file. |
| `wq survivors` | SQL filter: Sharpe, fitness, margin, structural IS-checks. |
| `wq correlate` | Fetches PnL series, builds correlation matrix, greedy de-dup. |
Each subcommand takes --help for its full options.
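The dispatcher behind `wq run` and `wq expressions` keeps a fixed number of simulations in flight, each worker thread using its own client. The following is a rough sketch of that pattern, not the package's code: `FakeClient`, `get_client`, `run_one` and `dispatch` are hypothetical names, and watchdog timeouts and DB writes are left out.

```python
import threading
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

_local = threading.local()


class FakeClient:
    """Stand-in for the real API client; one instance per worker thread."""

    def simulate(self, expression):
        time.sleep(0.01)  # pretend to poll the backtester
        if "bad" in expression:
            raise ValueError(f"simulation failed: {expression}")
        return {"expression": expression, "sharpe": 1.0}


def get_client():
    # Lazily create a client the first time each worker thread needs one.
    if not hasattr(_local, "client"):
        _local.client = FakeClient()
    return _local.client


def run_one(expression):
    return get_client().simulate(expression)


def dispatch(expressions, max_workers=3):
    """Keep at most max_workers simulations in flight; record every outcome."""
    pending = iter(expressions)
    in_flight = {}  # future -> expression
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            # Top up to max_workers before waiting on anything.
            while len(in_flight) < max_workers:
                expr = next(pending, None)
                if expr is None:
                    break
                in_flight[pool.submit(run_one, expr)] = expr
            if not in_flight:
                break
            done, _ = wait(list(in_flight), return_when=FIRST_COMPLETED)
            for fut in done:
                expr = in_flight.pop(fut)
                try:
                    results.append(fut.result())
                except Exception as exc:  # one failure must not stop the run
                    errors.append((expr, repr(exc)))
    return results, errors


results, errors = dispatch(["rank(a)", "bad(b)", "rank(c)", "rank(d)"])
print(len(results), len(errors))
```

Because each future is inspected individually, a failed simulation ends up in `errors` while the rest of the run continues.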
.env (copy from .env.example):
WQ_EMAIL=your_brain_email
WQ_PASS=your_brain_password
These are HTTP Basic creds for POST /authentication. If your account uses biometric/persona 2FA, the client raises WQBiometricRequired with the URL you need to visit before continuing.
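For illustration, this is roughly how the two `.env` values turn into an HTTP Basic `Authorization` header. It is a standard-library sketch, not the client's actual code; `load_env` and `basic_auth_header` are hypothetical helpers, and no request is sent.

```python
import base64
from pathlib import Path


def load_env(path):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def basic_auth_header(email, password):
    """Build the Authorization header that HTTP Basic auth sends."""
    token = base64.b64encode(f"{email}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Example usage (assumes a .env file in the working directory):
# env = load_env(".env")
# headers = basic_auth_header(env["WQ_EMAIL"], env["WQ_PASS"])
```

Basic auth only base64-encodes the credentials, so `.env` should stay out of version control.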
Default simulation settings (region USA, delay 1, language FASTEXPR, etc.) live in wq_pipeline.client.SimulationSettings and are overridable per-call.
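A per-call override can be pictured as copying the defaults and changing one field. The dataclass below is an illustrative mirror that carries only the three defaults named above; the real `SimulationSettings` may have more fields and a different override mechanism.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class SimulationSettings:
    """Illustrative mirror of the defaults named above; other fields omitted."""
    region: str = "USA"
    delay: int = 1
    language: str = "FASTEXPR"


DEFAULTS = SimulationSettings()

# Per-call override: copy the defaults and change only what differs.
delay0 = replace(DEFAULTS, delay=0)

print(asdict(delay0))  # {'region': 'USA', 'delay': 0, 'language': 'FASTEXPR'}
```

Freezing the dataclass keeps the shared defaults from being mutated by one call and leaking into the next.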
These are the non-obvious choices, with the reasoning. The intent is that anyone reading the code can see why, not just what.
- SQLite instead of Postgres / a job queue. One process writes, occasional readers query. The dataset is single-digit MB even at 10k rows. WAL + `busy_timeout=10000` is enough; anything heavier is overkill.
- Idempotent UPSERT on `(expression_hash, region, universe, delay, decay, neutralization, truncation)`. Re-running the same grid produces zero duplicates and zero wasted API calls. Crash-resume falls out for free.
- Active-futures dispatcher, not `executor.map`. `executor.map` submits every task up front and surfaces an exception only when its result is iterated, at which point the remaining results are lost; the dispatcher loop keeps exactly `max_workers` in flight, has per-future watchdog timeouts, and interleaves DB writes between dispatch and result phases.
- Thread-local `WQClient` per worker. `requests.Session` is not guaranteed to be thread-safe: its cookie jar and adapter pool can break under concurrent access. Each worker thread lazily instantiates its own client (and its own Session) on first use.
- `mark_running` before `executor.submit`. Closes a TOCTOU window where a crash between `submit()` and a post-submit DB write would leave the row `QUEUED` while a sim was already in flight.
- `fcntl.flock` on `alphas.db.lock`. Prevents two `wq run` processes from racing on the same DB. The second process exits cleanly with a clear message.
- PnL diff, not raw cumulative. WQ returns cumulative dollar PnL; correlation between cumulative series mostly reflects shared trend rather than day-to-day co-movement. The pipeline diffs to daily PnL changes before computing correlation.
- `COALESCE(chk_X_result, 'PASS') != 'FAIL'` for IS-check filters. SQL three-valued logic: `NULL != 'FAIL'` is `NULL`, which is falsy in `WHERE`. Without the `COALESCE`, rows with a `NULL` check would silently get filtered out.
- Drop `LOW_SHARPE` / `LOW_FITNESS` from required-checks. Those checks duplicate the explicit `--min-sharpe` / `--min-fitness` thresholds and would re-impose WQ's hardcoded 1.25 / 1.0 limits on top of any custom filter the user picks.
- Positive Sharpe by default, `--allow-inverse` opt-in. A negative-Sharpe alpha is just an inverted positive-Sharpe alpha, but the inversion has to be made explicit by the user, because competition submissions need the right sign on day one.
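Two of the SQLite choices above are easy to check in isolation: the idempotent UPSERT and the `COALESCE` guard against `NULL` checks. The sketch below assumes a cut-down table (three key columns instead of seven) and a hypothetical `chk_x_result` column; it uses an in-memory database, so WAL mode is not enabled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the pipeline uses a file; WAL needs one
conn.execute("PRAGMA busy_timeout = 10000")
conn.execute(
    """
    CREATE TABLE alphas (
        expression_hash TEXT NOT NULL,
        universe        TEXT NOT NULL,
        neutralization  TEXT NOT NULL,
        status          TEXT NOT NULL DEFAULT 'QUEUED',
        chk_x_result    TEXT,  -- hypothetical IS-check column, may be NULL
        UNIQUE (expression_hash, universe, neutralization)
    )
    """
)

# Inserting the same key twice is a no-op the second time.
UPSERT = """
INSERT INTO alphas (expression_hash, universe, neutralization)
VALUES (?, ?, ?)
ON CONFLICT (expression_hash, universe, neutralization) DO NOTHING
"""

for alpha_hash, check in [("h1", "PASS"), ("h2", "FAIL"), ("h3", None)]:
    for _ in range(2):  # simulate re-running the same grid
        conn.execute(UPSERT, (alpha_hash, "TOP3000", "MARKET"))
    conn.execute(
        "UPDATE alphas SET chk_x_result = ? WHERE expression_hash = ?",
        (check, alpha_hash),
    )

total = conn.execute("SELECT COUNT(*) FROM alphas").fetchone()[0]

# NULL != 'FAIL' evaluates to NULL, so the naive filter drops h3.
naive = [r[0] for r in conn.execute(
    "SELECT expression_hash FROM alphas "
    "WHERE chk_x_result != 'FAIL' ORDER BY expression_hash"
)]
# COALESCE treats a missing check as a pass, so h3 survives.
safe = [r[0] for r in conn.execute(
    "SELECT expression_hash FROM alphas "
    "WHERE COALESCE(chk_x_result, 'PASS') != 'FAIL' ORDER BY expression_hash"
)]

print(total)  # 3
print(naive)  # ['h1']
print(safe)   # ['h1', 'h3']
```

The difference between `naive` and `safe` is exactly the set of rows whose check was never recorded.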
- No auto-submitter. `basket.csv` is the hand-off; final submission to BRAIN is manual to avoid ever burning a submission slot accidentally.
- No multi-process advisory lock across machines. `flock` is host-local. One runner per host, one host per DB.
- `align_returns` uses max-length, not date-intersection. Fine for single-universe runs (dates align by construction). For multi-universe baskets, refactor to thread date arrays through.
- No LLM template generator. The brief explicitly chose brute-force over generative templating; out of scope.
- No correlation pruning vs production set. PnL correlation is computed within the candidate set, not against your already-submitted alphas. Add when prod-set is large enough to matter.
- 8 hand-written templates (5 keepers + 3 winner-variants from data). Bigger template set is the obvious next lever; see upstream references like worldquant-miner for a 200+ template library to draw from.
Python 3.11+, requests, sqlite3 (stdlib), numpy, concurrent.futures.ThreadPoolExecutor, fcntl. No async, no ORM, no Docker.
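`fcntl` is in the stack for the single-runner lock on `alphas.db.lock`, which also makes the runner Unix-only. Below is a minimal sketch of a non-blocking exclusive lock, assuming a hypothetical helper `acquire_run_lock` and a temporary directory in place of `data/`:

```python
import fcntl
import os
import tempfile


def acquire_run_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if already held."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle  # keep it open; closing the file releases the lock


lock_path = os.path.join(tempfile.mkdtemp(), "alphas.db.lock")
first = acquire_run_lock(lock_path)
second = acquire_run_lock(lock_path)  # simulates a second `wq run`
print(first is not None, second is None)
```

The second attempt is expected to come back as `None`, which is the point where a runner would print its message and exit. The lock is tied to the open file, so it disappears when the process dies, leaving no stale lock to clean up.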
src/wq_pipeline/ # the package
├── client.py # BRAIN API client (auth, submit, poll, fetch)
├── models.py # Template, GridItem dataclasses
├── db.py # SQLite store, schema, status transitions
├── runner.py # grid runner + active-futures dispatcher
├── survivors.py # SQL filter for IS-passing alphas
├── correlation.py # PnL fetch + greedy de-dup
├── fields.py # /data-fields paginator
├── expressions.py # raw-expression bulk runner
└── cli.py # single entry point with subcommands
examples/ # smoke + sample expressions file
tests/ # pytest unit tests
data/ # gitignored; alphas.db lives here
.github/workflows/ci.yml
MIT — see LICENSE.