Qlib Factor Lab

Qlib Factor Lab is a lightweight research scaffold for A-share factor work. It keeps factor definitions, data-building scripts, single-factor evaluation, event backtests, and model workflow generation in one small Python package.

The supported product surface is signal-only research. Factor Lab produces governed factor signals and theme research candidates; portfolio construction, paper trading, broker adapters, and live execution remain in the repository as historical/experimental modules and are not part of the default workflow.

The current project focuses on:

Formula-style price, volume, turnover, volatility, reversal, and pattern factors.
CSI500 and CSI300 local research datasets built from AkShare.
JoinQuant factor-library migration candidates that can be expressed with local OHLCV and turnover fields.
AI industry-chain theme scans that turn approved signals into research candidates, not investment advice.

Platform Loop

Factor Lab follows one conservative daily operating loop: govern the data first, turn approved factor families into explainable signals, then optionally project those signals into a focused theme universe such as AI chips, semiconductors, and memory. Each run writes auditable signal artifacts under runs/YYYYMMDD/ and theme scan reports under reports/theme_scans/.

Generated market data, Qlib binaries, MLflow records, and backtest reports are intentionally ignored by Git. See docs/data-and-artifacts.md.

For a compact command-by-command example, see docs/factor-research-path.md.

For the unified North-Star blueprint covering data governance, multi-lane autoresearch, stock cards, family-first portfolios, and expert review, see docs/factor-lab-north-star-blueprint.md. The current workbench intentionally exposes a narrower signal-only subset.

Older roadmap documents discuss paper trading and manual-live readiness. Those modules are not the active product scope; the active path is data governance -> factor research -> daily signal -> theme signal.

Project Layout

configs/                 Provider, factor-mining, and model configs
docs/                    Design notes and operating docs
factors/                 Factor registry and candidate-family notes
reports/joinquant_factorlib/
                         Small JoinQuant factor-library snapshots
scripts/                 CLI commands for data, factors, events, models
src/qlib_factor_lab/     Reusable Python package
tests/                   Unit tests that do not require downloaded data

Research Flow

flowchart TD
    A["Build or download local data"] --> B["Check provider config"]
    B --> C["Generate candidate factors"]
    C --> D["Mine IC / Rank IC"]
    D --> E{"Factor type?"}
    E -->|Cross-sectional feature| F["Batch compare and model workflow"]
    E -->|Absolute or pattern trigger| G["Event backtest by percentile bucket"]
    F --> H["Review stability by horizon, year, and market regime"]
    G --> H
    H --> I["Promote robust factors into model or live-trading design"]
    I --> J["Keep generated reports local; commit source and small references"]

The loop is intentionally conservative: use IC/Rank IC for broad triage, event backtests for trigger-like signals, then require horizon, yearly, and market-regime checks before a factor graduates.
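As a rough illustration of the IC / Rank IC triage step, the sketch below computes a per-date Pearson IC and a Rank IC (Pearson correlation of ranks) on a long-format table. The function name `daily_ic` and the column names `datetime`, `factor`, and `fwd_ret` are assumptions made for this example; they are not taken from the package.

```python
import pandas as pd


def daily_ic(panel: pd.DataFrame,
             factor_col: str = "factor",
             label_col: str = "fwd_ret") -> pd.DataFrame:
    """Per-date IC and Rank IC for a long table with a `datetime` column."""
    rows = []
    for date, group in panel.groupby("datetime"):
        pair = group[[factor_col, label_col]].dropna()
        if len(pair) < 3:  # too few names for a meaningful correlation
            continue
        ic = pair[factor_col].corr(pair[label_col])
        # Rank IC: Pearson correlation of the cross-sectional ranks.
        rank_ic = pair[factor_col].rank().corr(pair[label_col].rank())
        rows.append({"datetime": date, "ic": ic, "rank_ic": rank_ic})
    return pd.DataFrame(rows, columns=["datetime", "ic", "rank_ic"]).set_index("datetime")


# Toy data: the factor ranks returns perfectly on day 1 and inversely on day 2.
panel = pd.DataFrame({
    "datetime": ["2024-01-02"] * 4 + ["2024-01-03"] * 4,
    "instrument": ["A", "B", "C", "D"] * 2,
    "factor": [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0],
    "fwd_ret": [0.01, 0.02, 0.03, 0.10, 0.01, 0.02, 0.03, 0.10],
})
ic_table = daily_ic(panel)
summary = ic_table.mean()  # mean IC and mean Rank IC across dates
print(ic_table)
print(summary)
```

In this toy case the Rank IC is +1 on the first date and -1 on the second, so the mean is zero even though each day looks strong in isolation. That is the kind of instability the horizon, yearly, and regime checks are meant to catch.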

Quick Start

Create a local environment:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m pip install -e .

Run the unit tests:

make test

Start the local Streamlit research workbench:

make workbench

The workbench is a read-only local UI for signal research. It opens on 01 AI产业链 (AI industry chain) by default, reads existing artifacts such as autoresearch ledgers, approved factors, daily signals, theme scans, and evidence files, and keeps the product surface focused on research signals. It does not execute trading commands from the browser, and the default navigation does not expose portfolio or paper-trading workflows.

Check the local Qlib environment after data has been downloaded or built:

make check-env

Data Setup

Default provider configs:

configs/provider.yaml                Official Qlib sample data
configs/provider_current.yaml        Current CSI500 AkShare/Qlib data
configs/provider_csi300_current.yaml Current CSI300 AkShare/Qlib data

Download official Qlib CN sample data:

python scripts/download_qlib_data.py

Build current CSI500 data:

python scripts/build_akshare_qlib_data.py \
  --universe csi500 \
  --start 20150101 \
  --end 20260420 \
  --history-source sina \
  --qlib-dir data/qlib/cn_data_current \
  --source-dir data/akshare/source \
  --provider-config configs/provider_current.yaml

Build current CSI300 data:

python scripts/build_akshare_qlib_data.py \
  --universe csi300 \
  --start 20150101 \
  --end 20260420 \
  --history-source sina \
  --qlib-dir data/qlib/cn_data_csi300_current \
  --source-dir data/akshare/source_csi300 \
  --provider-config configs/provider_csi300_current.yaml

AkShare free sources are good enough for local research prototypes, but production research should use a stable vendor feed. Use --limit for smoke tests and --delay/--retries when a source throttles requests.

Build the daily research context used by event risk gates and expert review packets. The research database is intentionally fixed to the CSI300 and CSI500 universes; generated security and event files are filtered to those two pools by default.

python scripts/build_research_context_data.py \
  --as-of-date 2026-04-24 \
  --notice-start 2026-04-01 \
  --notice-end 2026-04-24 \
  --universes csi300 csi500 \
  --security-master-output data/security_master.csv \
  --company-events-output data/company_events.csv

For offline smoke tests, normalize local raw AkShare-like CSV files instead of calling the network:

python scripts/build_research_context_data.py \
  --security-master-source-csv raw/security.csv \
  --notice-source-csv raw/notices.csv \
  --universe-symbols-csv raw/universes.csv

The generated data/security_master.csv and data/company_events.csv feed configs/event_risk.yaml, event_risk_snapshot.csv, the daily risk gate, and the expert review packet.

Check point-in-time data-domain coverage and lane readiness:

make data-governance RUN_DATE=20260420

The report reads configs/data_governance.yaml and writes reports/data_governance_YYYYMMDD.md plus a sibling CSV. Missing non-price data domains are reported as shadow rather than promoted into the main portfolio.

Factor Evaluation

Evaluate one registry factor:

python scripts/eval_factor.py \
  --provider-config configs/provider_current.yaml \
  --factor ret_20 \
  --output reports/factor_ret_20_current.csv

Run batch evaluation:

python scripts/batch_eval_factors.py \
  --provider-config configs/provider_current.yaml \
  --output reports/factor_batch_current.csv

Optional neutralization:

python scripts/eval_factor.py \
  --provider-config configs/provider_current.yaml \
  --factor ret_20 \
  --purify-step mad \
  --purify-step zscore \
  --neutralize-size-proxy \
  --plot \
  --plot-horizon 5

The public Qlib CN sample data has no industry or market-cap fields. The project therefore supports:

--neutralize-size-proxy: cross-sectional neutralization with log(close * volume) as a size/liquidity proxy.
--industry-map path/to/industry.csv: optional custom industry map with instrument,industry columns.
--purify-step mad|zscore|rank: optional daily cross-sectional purification before IC/quantile evaluation. The flag can be repeated.

Factor Purification and Exposure Attribution

The project includes a lightweight AlphaPurify-inspired layer without adding AlphaPurify as a dependency:

qlib_factor_lab.factor_purification: MAD winsorization, z-score standardization, rank standardization, and OLS residual neutralization by daily cross-section.
qlib_factor_lab.exposure_attribution: factor-family, industry, and style exposure reports for a daily signal. Target-portfolio usage is historical and outside the current signal-only workflow.
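The purification steps can be sketched for a single cross-section as follows. This is not the package's code: the helper names are hypothetical, and the clipping rule (median plus or minus three scaled MADs, with the usual 1.4826 consistency constant) is one common convention that the module may or may not follow.

```python
import numpy as np


def mad_winsorize(values, n_mad=3.0):
    """Clip values to median +/- n_mad * 1.4826 * MAD (assumed convention)."""
    values = np.asarray(values, dtype=float)
    median = np.nanmedian(values)
    mad = np.nanmedian(np.abs(values - median))
    half_width = n_mad * 1.4826 * mad
    return np.clip(values, median - half_width, median + half_width)


def zscore(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    std = np.nanstd(values)
    if std == 0:
        return np.zeros_like(values)
    return (values - np.nanmean(values)) / std


raw = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one extreme outlier
clipped = mad_winsorize(raw)   # the outlier is pulled in to the upper bound
standardized = zscore(clipped)
print(clipped)
print(standardized)
```

Applying winsorization before standardization keeps a single extreme value from dominating the mean and standard deviation, which is presumably why the evaluation examples in this README list `mad` before `zscore`.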

Build the daily explainable signal:

make daily-signal RUN_DATE=20260430

Scan the AI chips / semiconductor / memory theme:

make theme-scan THEME_CONFIG=configs/themes/ai_semiconductor.yaml SIGNAL_CSV=runs/20260430/signals.csv

Theme scans produce tiered research candidates with theme, quality, growth, momentum, event, and risk components. They are signals and research prompts, not orders.

Current signal artifacts:

runs/YYYYMMDD/signals.csv: governed daily factor signal.
reports/theme_scans/ai_semiconductor_YYYYMMDD.csv: focused AI industry-chain signal candidates.
reports/theme_scans/ai_semiconductor_YYYYMMDD.md: human-readable theme report.

Portfolio construction, risk gates, paper orders, and execution-performance attribution remain in the repository for prior experiments, but they are not part of the active Factor Lab product loop.

Candidate Mining

Candidate templates live in:

configs/factor_mining.yaml

The current pool includes momentum, reversal, volatility, volume-price, liquidity, divergence, Wangji pattern, and JoinQuant-migrated turnover/emotion/technical factors.

Longer-term autoresearch is organized by configs/autoresearch/lane_space.yaml: expression, pattern/event, emotion/atmosphere, liquidity, risk, shareholder/capital, fundamental, and regime lanes. Missing non-price data lanes must stay shadow or disabled until their point-in-time data governance gates pass.

Generate the candidate table only:

make candidates

Run a 5-day and 20-day CSI500 screen:

make mine-csi500

The result table includes IC, Rank IC, quintile mean returns, long-short return, turnover, and observation counts.
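For readers unfamiliar with the quintile columns, the following sketch shows how quintile mean returns and a long-short spread could be derived for one date. The function name `quintile_returns` and the column names are assumptions for illustration; the mining script may bucket or weight differently.

```python
import pandas as pd


def quintile_returns(cross_section: pd.DataFrame,
                     factor_col: str = "factor",
                     label_col: str = "fwd_ret",
                     n_buckets: int = 5):
    """Mean forward return per factor bucket, plus top-minus-bottom spread."""
    # Rank first so that ties cannot produce duplicate bin edges.
    ranks = cross_section[factor_col].rank(method="first")
    bucket = pd.qcut(ranks, n_buckets, labels=False) + 1  # buckets 1..n
    means = cross_section.groupby(bucket)[label_col].mean()
    long_short = means.loc[n_buckets] - means.loc[1]
    return means, long_short


# Ten instruments whose forward return rises linearly with the factor.
cross_section = pd.DataFrame({
    "factor": [float(i) for i in range(1, 11)],
    "fwd_ret": [0.01 * i for i in range(1, 11)],
})
means, long_short = quintile_returns(cross_section)
print(means)
print(long_short)
```

Here the bottom quintile averages 1.5% and the top quintile 9.5%, giving a long-short spread of 8% for this toy date. A real screen would average these per-date values across the sample.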

Autoresearch

The first controlled autoresearch loop is expression-factor only. It lets an agent edit one candidate YAML while the provider, horizons, purification, neutralization, ledger, and artifact paths stay locked by contract:

make autoresearch-expression

Default inputs:

configs/autoresearch/contracts/csi500_current_v1.yaml
configs/autoresearch/expression_space.yaml
configs/autoresearch/candidates/example_expression.yaml

The loop prints a compact summary block, applies the contract's factor purification steps, writes raw and size-proxy-neutralized evaluation artifacts, and appends a local ledger under reports/autoresearch/. Generated run outputs are ignored by Git.

Summarize the local expression ledger by status:

make autoresearch-ledger

The ledger report groups review, discard_candidate, and crash rows, then shows the top review candidates and common discard/crash reasons.

Run overnight Codex CLI autoresearch without an OpenAI API key:

git switch -c autoresearch/nightly-$(date +%Y%m%d)
tmux new -s factor-night
make autoresearch-codex-loop AUTORESEARCH_CODEX_UNTIL=08:30 AUTORESEARCH_CODEX_ITERATIONS=30

The Codex loop uses the local codex ChatGPT login. Each iteration asks Codex to update only configs/autoresearch/candidates/example_expression.yaml; the runner then commits the candidate, runs the locked oracle, runs the ledger summary, and writes logs under reports/autoresearch/codex_loop/. The runner refuses to run on main or master unless --allow-protected-branch is passed directly to the script.

Run the multi-lane orchestration layer:

make autoresearch-multilane

The first implementation executes expression_price_volume through the existing expression oracle and records other lanes as shadow_skipped, disabled_skipped, or unsupported until their data domains and oracles are ready.

Event Backtests

Use event backtests when a factor is closer to an absolute trigger or pattern score than a pure IC feature:

make event-csi300 FACTOR=arbr_26

Event backtests apply the factor's configured direction before percentile bucketing. For example, a direction: -1 factor treats lower raw values as higher scores, so p95_p100 means the best configured score bucket.
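The direction rule can be illustrated with a small sketch. The helper names and the half-open bucket boundaries (lower bound exclusive, upper bound inclusive) are assumptions made for this example, not a description of `backtest_factor_events.py`.

```python
import numpy as np


def score_percentile(raw, direction=1):
    """Apply the configured direction, then map scores to percentiles in (0, 1]."""
    score = direction * np.asarray(raw, dtype=float)
    # Double argsort gives 0-based ranks; the stable sort keeps ties in input order.
    ranks = score.argsort(kind="stable").argsort()
    return (ranks + 1) / len(score)


def in_bucket(percentile, lower, upper):
    """Membership mask for a bucket such as p95_p100 -> (0.95, 1.00]."""
    return (percentile > lower) & (percentile <= upper)


raw = np.arange(1.0, 21.0)            # 20 raw factor values: 1, 2, ..., 20
pct = score_percentile(raw, direction=-1)
top_bucket = raw[in_bucket(pct, 0.95, 1.00)]
print(top_bucket)
```

With `direction=-1`, the smallest raw value receives the highest score, so the only member of the p95_p100 bucket in this 20-name example is the raw value 1.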

Optional breakout-volume confirmation:

python scripts/backtest_factor_events.py \
  --factor wangji-factor1 \
  --provider-config configs/provider_current.yaml \
  --horizon 20 \
  --confirm-window 3 \
  --confirm-volume-ratio 1.2

Generate a Markdown summary from an event backtest summary CSV:

make summarize-event \
  FACTOR=arbr_26 \
  SUMMARY=reports/factor_arbr_26_event_backtest_summary_csi300.csv \
  SUMMARY_MD=reports/factor_arbr_26_event_backtest_summary_csi300.md

Model Workflow

Render a Qlib Alpha158 + LightGBM workflow config without training:

python scripts/run_lgb_workflow.py \
  --provider-config configs/provider_current.yaml \
  --output configs/qlib_lgb_workflow_current.yaml \
  --dry-run

Run the workflow:

python scripts/run_lgb_workflow.py \
  --provider-config configs/provider_current.yaml \
  --output configs/qlib_lgb_workflow_current.yaml

For current data, the default split is:

train: 2015-01-01 ~ 2021-12-31
valid: 2022-01-01 ~ 2023-12-31
test:  2024-01-01 ~ latest complete local trading day

If the benchmark index binary is missing, the workflow uses the candidate-pool stocks as an equal-weight benchmark proxy.
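An equal-weight benchmark proxy amounts to averaging the daily returns of the candidate-pool stocks that have data on each day. The sketch below shows that idea on a wide return table. The function name, the instrument codes, and the choice to skip missing values are assumptions for illustration rather than the workflow's actual implementation.

```python
import numpy as np
import pandas as pd


def equal_weight_benchmark(returns_wide: pd.DataFrame) -> pd.Series:
    """Daily mean return across instruments, ignoring missing values."""
    return returns_wide.mean(axis=1, skipna=True)


# Rows are trading days, columns are instruments; NaN marks a missing quote.
returns_wide = pd.DataFrame(
    {
        "SH600000": [0.01, -0.02, 0.00],
        "SZ000001": [0.03, np.nan, 0.02],
        "SH600519": [-0.01, 0.04, 0.01],
    },
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)
benchmark = equal_weight_benchmark(returns_wide)
cumulative = (1.0 + benchmark).cumprod()  # compounded benchmark proxy curve
print(benchmark)
print(cumulative)
```

Because the proxy is built from the same pool the model selects from, excess returns against it measure selection within the pool rather than performance against the published index.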

CI

GitHub Actions runs:

python -m unittest discover -s tests

CI does not download market data or run long backtests. Those are local research steps because they depend on data availability, rate limits, and machine storage.

Recommended Workflow

Build or download a local Qlib dataset.
Generate the candidate table from configs/factor_mining.yaml.
Use IC/Rank IC mining for broad factor triage.
Use event backtests for absolute or pattern-like factors.
Promote stable factors into model workflows or a live-trading pipeline design.
