Add Regime Smoothing & Expert Load Balancing for MoE #7126
Closed
Conversation
- Rename python-package/lightgbm/ to python-package/lightgbm_moe/
- Update pyproject.toml: name to lightgbm-moe, description updated
- Update import references in __init__.py, compat.py
- Add LightGBM-MoE description to README.md
Add C++ implementation of Mixture-of-Experts GBDT:
- Add MoE parameters to config.h (mixture_enable, mixture_num_experts, etc.)
- Create MixtureGBDT class with K expert GBDTs + 1 gate GBDT
- Implement EM-style training loop (Forward, E-step, M-step)
- Add mixture boosting type to the factory in boosting.cpp
- Add mixture_gbdt.cpp to CMakeLists.txt
Training algorithm (a toy sketch follows below):
- Forward: compute expert predictions and gate probabilities
- E-step: update responsibilities r_ik based on expert fit and gate
- M-step (experts): train with responsibility-weighted gradients
- M-step (gate): train with argmax responsibilities as pseudo-labels
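A toy, self-contained sketch of this EM-style loop, with scikit-learn models standing in for the expert and gate GBDTs. This is conceptual only: the real MixtureGBDT adds one boosting iteration per M-step in C++, whereas this refits small models each round, and the Gaussian fit term and sigma are assumptions.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Two regimes determined by the first feature, with different target functions
regime = (X[:, 0] > 0).astype(int)
y = np.where(regime == 1, 2.0 * X[:, 1], -2.0 * X[:, 1]) + 0.1 * rng.normal(size=500)

K, sigma = 2, 1.0
experts = [GradientBoostingRegressor(n_estimators=30) for _ in range(K)]
gate = GradientBoostingClassifier(n_estimators=30)

r = rng.dirichlet(np.ones(K), size=len(y))  # initial responsibilities
for _ in range(5):  # EM-style iterations
    # M-step (experts): responsibility-weighted fit
    for k in range(K):
        experts[k].fit(X, y, sample_weight=r[:, k])
    # M-step (gate): argmax responsibilities as pseudo-labels
    gate.fit(X, r.argmax(axis=1))
    # Forward: expert predictions and gate probabilities
    expert_pred = np.column_stack([e.predict(X) for e in experts])
    gate_proba = gate.predict_proba(X)
    # E-step: responsibilities from gate probability x Gaussian fit of each expert
    fit = np.exp(-0.5 * ((y[:, None] - expert_pred) / sigma) ** 2)
    r = gate_proba * fit
    r /= r.sum(axis=1, keepdims=True)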
Add C API and Python bindings for MoE-specific predictions:
- LGBM_BoosterIsMixture: check if model is MoE
- LGBM_BoosterGetNumExperts: get number of experts
- LGBM_BoosterPredictRegime: predict regime (argmax of gate)
- LGBM_BoosterPredictRegimeProba: predict regime probabilities
- LGBM_BoosterPredictExpertPred: predict individual expert outputs
Python Booster methods:
- is_mixture() -> bool
- num_experts() -> int
- predict_regime(data) -> ndarray (n_samples,)
- predict_regime_proba(data) -> ndarray (n_samples, n_experts)
- predict_expert_pred(data) -> ndarray (n_samples, n_experts)
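A short usage sketch of the new Python methods. It assumes the lightgbm_moe package built from this PR is installed; the import name, toy data, and the soft-gating combination at the end are assumptions, not part of the commit.

import numpy as np
import lightgbm_moe as lgb   # import name assumed from the package rename above

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 5)), rng.normal(size=200)
X_test = rng.normal(size=(50, 5))

params = {'boosting': 'mixture', 'mixture_num_experts': 3, 'verbose': -1}
booster = lgb.train(params, lgb.Dataset(X_train, label=y_train), num_boost_round=20)

if booster.is_mixture():
    K = booster.num_experts()                       # 3
    regime = booster.predict_regime(X_test)         # (n_samples,)  argmax of the gate
    proba = booster.predict_regime_proba(X_test)    # (n_samples, K)
    expert = booster.predict_expert_pred(X_test)    # (n_samples, K)
    # With soft gating, the overall prediction is the gate-weighted mix of experts
    y_hat = (proba * expert).sum(axis=1)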
- docs_moe/version01.md: MoE specification document
- docs_claude/: implementation work logs
  - 00_work_plan.md: overall plan and progress
  - 01_python_rebrand.md: Phase 1 details
  - 02_cpp_mixture_gbdt.md: Phase 2 details
  - 03_python_wrapper.md: Phase 3 details
- Fix OpenMP race condition in E-step (private scores variable)
- Fix model loading to properly detect mixture type from file
- Fix predict crash by initializing early_stop in InitPredict
- Fix JSON parsing for loaded_parameter_ in MixtureGBDT
- Add acceptance tests for MoE functionality (15 tests)
- Add debug logging to boosting.cpp for model creation
- Add mathematical formulation (MoE vs standard GBDT)
- Document EM-style training algorithm
- List all key parameters and new prediction APIs
- Include assumptions and limitations
- Add performance benchmark results
- Bilingual documentation (English + Japanese)
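For reference, the documented mixture prediction presumably takes the standard MoE form (a sketch inferred from the Forward description above, not copied from the docs):
y_hat(x) = sum_{k=1..K} g_k(x) * f_k(x)
where f_k are the K expert GBDTs, g_k(x) are the softmax gate probabilities, and standard GBDT is the single-expert case; with hard gating (alpha = 1.0) the sum collapses to the argmax expert.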
Two critical bug fixes for MoE GBDT:
1. Gate training during warmup: MStepGate was being skipped during warmup, preventing the gate from learning the quantile-based initialization. The gate is now always trained.
2. Gate indexing in Forward(): fixed a class-major vs sample-major order mismatch. GetPredictAt returns class-major order (all of class 0, then all of class 1), but Forward() was assuming sample-major order. The indexing is now converted properly (see the sketch after this commit).
These fixes enable hard alpha (1.0) to work correctly, achieving a 13.6% RMSE improvement over standard GBDT on regime-switching data.
Also adds:
- Regime-switching demo script with visualization
- Updated README with benchmark results
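A tiny NumPy illustration of the class-major vs sample-major indexing fix above. The buffer layout is taken from the commit message; the variable names and shapes are hypothetical.

import numpy as np

# Hypothetical shapes: 4 samples, 2 gate classes (one per expert)
num_data, num_class = 4, 2
# GetPredictAt returns a flat class-major buffer: all of class 0, then all of class 1
raw = np.arange(num_data * num_class, dtype=float)
# Wrong (sample-major) indexing previously assumed: raw[i * num_class + k]
# Correct (class-major) indexing: raw[k * num_data + i]
gate_logits = raw.reshape(num_class, num_data).T   # (num_data, num_class) sample-major view
assert gate_logits[1, 0] == raw[0 * num_data + 1]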
Added honest benchmark comparison:
- Synthetic data (clear regime): MoE +13.6% improvement
- Real financial data (weak regime): Standard GBDT wins
Key insight: MoE is not universally better. It excels only when data has clear, separable regime structure where different regimes follow fundamentally different functions.
Fair comparison using:
- Optuna (50 trials each)
- Time-series CV (no leakage)
- Same features for both models
Results on S&P 500 5-day return prediction:
- CV: MoE slightly better (0.02975 vs 0.02998)
- Test: Standard slightly better (0.01793 vs 0.01801)
- Difference minimal (-0.43%), no significant advantage
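A sketch of this kind of tuning setup, assuming Optuna with time-series CV as described. The data, search space, and lightgbm_moe import name are placeholders, not the actual benchmark script.

import numpy as np
import optuna
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
import lightgbm_moe as lgb   # import name assumed

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 10)), rng.normal(size=500)   # placeholder data

def objective(trial):
    params = {
        'boosting': 'mixture',
        'num_leaves': trial.suggest_int('num_leaves', 8, 64),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
        'verbose': -1,
    }
    rmses = []
    for tr, va in TimeSeriesSplit(n_splits=5).split(X):   # time-ordered folds, no leakage
        booster = lgb.train(params, lgb.Dataset(X[tr], label=y[tr]), num_boost_round=100)
        pred = booster.predict(X[va])
        rmses.append(np.sqrt(mean_squared_error(y[va], pred)))
    return float(np.mean(rmses))

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)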
Add classic regime-switching benchmark results:
- Hamilton's GNP (1951-2009): MoE beats Standard GBDT by +5.4%
- VIX Volatility Regime: MoE marginally better (+0.4%)
This validates MoE for datasets with documented regime structure.
- Add 6-benchmark comparison table (3 datasets × few/many features)
- Document when MoE excels (Hamilton GNP +7.5% improvement)
- Add expert differentiation analysis (only Synthetic Few succeeded)
- Document EMA smoothing trade-off (accuracy vs interpretability)
- Add Japanese translation
Key findings:
- MoE excels when the regime IS determinable from the features (X)
- Synthetic data: +10.8% improvement with successful expert differentiation
- Hamilton GNP / VIX: MoE loses; expert collapse occurs
- Added regime confusion matrices showing the expert-regime mapping
Conclusion: MoE is effective when the gate can learn the regime from X, not when the regime is latent (Markov switching) or unobserved.
…rparams)
FAIR comparison: both Standard GBDT and MoE now search:
- num_leaves: 8-64
- learning_rate: 0.01-0.2
Results:
- Synthetic (X→Regime): MoE wins by +8~12% with expert differentiation
- VIX Regime: equal RMSE, but experts differentiate (R0→E0: 85%, R1→E1: 71%)
- Hamilton GNP: Standard wins, MoE -11~24% (latent Markov regime)
Conclusion unchanged: MoE excels when the regime is determinable from X
Full Optuna search for both Standard GBDT and MoE:
- num_leaves, max_depth, learning_rate, min_data_in_leaf
- feature_fraction, bagging_fraction, bagging_freq
- lambda_l1, lambda_l2
- MoE: K, alpha, warmup
Results (EMA OFF):
- Synthetic: MoE +12.8% (regime determinable from X)
- VIX: 0% difference but excellent differentiation (91%/96%)
- Hamilton: MoE -3.0% (latent regime causes collapse)
Added EMA smoothing appendix:
- Hamilton: EMA helps (+6.3% vs -3.0%)
- Synthetic: No-EMA better (+12.8% vs +5.1%)
- Recommendation: use EMA for temporal persistence, No-EMA for instant regime switches
Add 'markov' option to mixture_r_smoothing parameter that enables
Markov-style regime switching where previous gate probabilities
influence current predictions.
Changes:
- config.h: Add 'markov' option to mixture_r_smoothing enum
- mixture_gbdt.h: Add use_markov_, prev_gate_proba_ members and
PredictWithPrevProba, PredictRegimeProbaWithPrevProba methods
- mixture_gbdt.cpp: Implement Markov blending in Forward() and
prediction APIs that accept previous probabilities
- basic.py: Add predict_markov() and predict_regime_proba_markov()
Python methods for time-series inference
Usage:
params = {
    'boosting': 'mixture',
    'mixture_r_smoothing': 'markov',
    'mixture_r_ema_lambda': 0.3,  # blending coefficient
}
model = lgb.train(params, train_data)
y_pred = model.predict_markov(X_test) # Markov-smoothed predictions
The Markov mode blends current gate probabilities with previous:
proba[t] = (1-lambda) * gate_proba[t] + lambda * proba[t-1]
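A minimal NumPy sketch of this blending rule over a sequence of gate probabilities (illustrative only; the C++ code applies the blending inside Forward() and the prediction APIs):

import numpy as np

def markov_blend(gate_proba, lam):
    """gate_proba: (T, K) per-step gate probabilities; lam: blending coefficient."""
    out = np.empty_like(gate_proba)
    out[0] = gate_proba[0]
    for t in range(1, len(gate_proba)):
        out[t] = (1 - lam) * gate_proba[t] + lam * out[t - 1]
    return out

# Example: a sharp regime switch is smoothed toward the previous probabilities
gate = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
print(markov_blend(gate, lam=0.3))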
Comprehensive benchmark comparing EMA and Markov smoothing modes with full hyperparameter optimization (50 trials per model). Tests on Synthetic, Hamilton GNP-like, and VIX datasets. Results show similar performance between EMA and Markov (~0-1.5% difference).
The parameter was renamed to better reflect that it is used for both EMA and Markov smoothing modes, not just EMA.
Files updated:
- include/LightGBM/config.h
- src/boosting/mixture_gbdt.cpp
- src/io/config_auto.cpp
- python-package/lightgbm_moe/basic.py
- docs/Parameters.rst
- README.md
- docs_moe/version01.md
- docs_claude/02_cpp_mixture_gbdt.md
- examples/benchmark_ema_vs_markov.py
…ickiness)
Implemented 5 new regime switching approaches:
- Transition Matrix Learning: K×K soft transition matrix from responsibilities
- Hamilton Filter: Bayesian update combining likelihood and transition prior
- Regime Stickiness: duration-based bonus using log(1 + duration)
- Adaptive Gate: online weight adjustment based on recent performance
- Lagged Features: helper function for y[t-1], y[t-2], X[t-1] features
New C++ parameters:
- mixture_r_smoothing: "none", "ema", "markov", "transition", "hamilton"
- mixture_regime_stickiness: duration-based bonus coefficient
- mixture_transition_prior: Dirichlet smoothing for the transition matrix
New Python methods:
- Booster.predict_adaptive(): online adaptive weighting
- Booster.predict_with_transition(): transition matrix-based prediction
- create_lagged_features(): time-series feature engineering
- estimate_transition_matrix(): soft transition estimation
- compute_regime_duration_stats(): regime analysis
Benchmark results (30 Optuna trials):
- Synthetic: Transition best (RMSE = 4.54)
- Hamilton GNP: Hamilton Filter best (RMSE = 0.31)
…ing_lambda)
Removed the advanced regime smoothing methods that required additional hyperparameters:
- Removed transition mode (required mixture_transition_prior)
- Removed hamilton mode (required mixture_transition_prior)
- Removed stickiness mode (required mixture_regime_stickiness)
Only 3 smoothing modes now remain:
- none: no smoothing
- ema: exponential moving average
- markov: previous gate probability blending
All modes use just mixture_smoothing_lambda (0-1). Removed 418 lines of C++ code and simplified the benchmark script. Python helper functions (create_lagged_features, estimate_transition_matrix, etc.) are preserved for analysis use cases.
Added momentum smoothing mode:
- Considers trend (direction of regime changes)
- extrapolated[i] = r[i-1] + λ*(r[i-1] - r[i-2])
- r_smooth[i] = (1-λ)*r[i] + λ*extrapolated[i] (see the sketch after this commit)
- Useful when regimes have inertia
Removed Python helper functions to keep the codebase simple:
- create_lagged_features()
- estimate_transition_matrix()
- compute_regime_duration_stats()
- predict_adaptive()
- predict_with_transition()
Final smoothing modes (all use the single mixture_smoothing_lambda):
- none: no smoothing
- ema: exponential moving average
- markov: blend with previous gate probabilities
- momentum: EMA with trend extrapolation
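A minimal NumPy sketch of the momentum formula above, applied literally to a sequence of responsibilities (illustrative only; the real implementation may clip or renormalize the result):

import numpy as np

def momentum_smooth(r, lam):
    """r: (T, K) responsibilities; lam: mixture_smoothing_lambda."""
    out = r.copy()
    for i in range(2, len(r)):
        extrapolated = r[i - 1] + lam * (r[i - 1] - r[i - 2])   # trend extrapolation
        out[i] = (1 - lam) * r[i] + lam * extrapolated
    return out

r = np.array([[0.8, 0.2], [0.7, 0.3], [0.6, 0.4], [0.5, 0.5]])
print(momentum_smooth(r, lam=0.5))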
Removed unused remnants from an earlier attempt to implement the Markov mode by augmenting the gate dataset with previous probabilities:
- gate_dataset_, gate_markov_config_, gate_raw_data_ (member variables)
- prev_responsibilities_, momentum_trend_ (unused buffers)
- num_original_features_ (unused)
- CreateGateDataset(), UpdateGateDataset() (never called)
The simpler approach (blending after Forward) was kept instead.
…ling
Updated all stub functions in MixtureGBDT to throw a Log::Fatal error with clear messages instead of silently returning empty/default values. This makes unsupported operations fail loudly rather than producing confusing behavior.
Affected functions:
- GetEvalAt, GetNumPredictAt (validation)
- PredictLeafIndex, PredictLeafIndexByMap
- PredictContrib, PredictContribByMap
- AddValidDataset, DumpModel
- GetLeafValue, SetLeafValue
Implement dynamic expert bias adjustment to prevent expert collapse without auxiliary-loss interference, based on the 2024 paper "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts".
Algorithm:
- Track actual load per expert (mean responsibility)
- Compare with target load (1/K, uniform distribution)
- Adjust bias: bias[k] += η * (target - actual)
- Add bias to gate logits before softmax
This approach:
- Does not interfere with prediction gradients
- Requires no additional hyperparameters
- Maintains balanced expert utilization automatically
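A sketch of the bias update and biased gating described above (illustrative; the learning rate η, default value, and calling convention are assumptions, not the C++ API):

import numpy as np

def update_expert_bias(bias, responsibilities, eta=0.1):
    """bias: (K,); responsibilities: (n, K) from the E-step."""
    K = responsibilities.shape[1]
    actual_load = responsibilities.mean(axis=0)       # observed load per expert
    target_load = 1.0 / K                             # uniform target
    return bias + eta * (target_load - actual_load)   # bias[k] += eta * (target - actual)

def gate_proba_with_bias(gate_logits, bias):
    """Add the per-expert bias to the gate logits before the softmax."""
    z = gate_logits + bias
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)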
Replace the fixed mixture_r_min threshold with mixture_balance_factor, which calculates the minimum usage dynamically as 1 / (factor * K):
- factor=10 (default): K=2 allows 95:5, K=5 allows 98:2
- factor=2: K=2 allows 75:25, K=5 allows 90:10
- Range: 2-10 (lower = more balanced, higher = more imbalance allowed)
Also removes hard clipping in EStep; collapse prevention is now handled entirely by Loss-Free Load Balancing (bias adjustment).
- Add .github/workflows/release.yml for automatic releases on tag push
  - Builds wheels for Linux (x86_64, aarch64), macOS (x86_64, arm64), and Windows
  - Creates a GitHub Release with all artifacts attached
- Update pyproject.toml for independent versioning (v0.1.0+)
  - Use setuptools-scm for automatic versioning from git tags
  - Change Development Status to Beta
  - Update URLs to point to the LightGBM-MoE fork
- Update README with GitHub Releases installation instructions
  - Add Option 1: Install from GitHub Releases (recommended)
  - Keep Option 2: Build from Source
  - Update both English and Japanese sections
- Add benchmark scripts with mixture_balance_factor in the search space
Author
Sorry, wrong repository. This was meant for my fork.
Summary
Adds two major features that improve the stability and performance of the MoE (Mixture-of-Experts) model:
New Features
1. Regime Smoothing
params = {
    'boosting': 'mixture',
    'mixture_r_smoothing': 'ema',  # or 'markov', 'momentum'
    'mixture_smoothing_lambda': 0.5,  # smoothing strength (0-1)
}
2. Expert Load Balancing
Dynamically adjusts a per-expert bias so that the experts are used evenly; it does not interfere with gradients and does not impede training.
params = {
    'mixture_balance_factor': 5,  # min_usage = 1/(factor × K)
}
Benchmark Results (100 trials × 3 datasets)
Conclusion: MoE is effective when the regime can be determined from the features.
Other Changes
Files Changed