Add Regime Smoothing & Expert Load Balancing for MoE #7126
Closed
Conversation
- Rename python-package/lightgbm/ to python-package/lightgbm_moe/
- Update pyproject.toml: name to lightgbm-moe, description updated
- Update import references in __init__.py, compat.py
- Add LightGBM-MoE description to README.md
Add C++ implementation of Mixture-of-Experts GBDT:
- Add MoE parameters to config.h (mixture_enable, mixture_num_experts, etc.)
- Create MixtureGBDT class with K expert GBDTs + 1 gate GBDT
- Implement EM-style training loop (Forward, E-step, M-step)
- Add mixture boosting type to the factory in boosting.cpp
- Add mixture_gbdt.cpp to CMakeLists.txt
Training algorithm (a toy sketch follows below):
- Forward: compute expert predictions and gate probabilities
- E-step: update responsibilities r_ik based on expert fit and gate
- M-step (experts): train with responsibility-weighted gradients
- M-step (gate): train with argmax responsibilities as pseudo-labels
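A toy, self-contained sketch of this EM-style loop, with scikit-learn models standing in for the expert and gate GBDTs. This is conceptual only: the real MixtureGBDT adds one boosting iteration per M-step in C++, whereas this refits small models each round, and the Gaussian fit term and sigma are assumptions.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Two regimes determined by the first feature, with different target functions
regime = (X[:, 0] > 0).astype(int)
y = np.where(regime == 1, 2.0 * X[:, 1], -2.0 * X[:, 1]) + 0.1 * rng.normal(size=500)

K, sigma = 2, 1.0
experts = [GradientBoostingRegressor(n_estimators=30) for _ in range(K)]
gate = GradientBoostingClassifier(n_estimators=30)

r = rng.dirichlet(np.ones(K), size=len(y))  # initial responsibilities
for _ in range(5):  # EM-style iterations
    # M-step (experts): responsibility-weighted fit
    for k in range(K):
        experts[k].fit(X, y, sample_weight=r[:, k])
    # M-step (gate): argmax responsibilities as pseudo-labels
    gate.fit(X, r.argmax(axis=1))
    # Forward: expert predictions and gate probabilities
    expert_pred = np.column_stack([e.predict(X) for e in experts])
    gate_proba = gate.predict_proba(X)
    # E-step: responsibilities from gate probability x Gaussian fit of each expert
    fit = np.exp(-0.5 * ((y[:, None] - expert_pred) / sigma) ** 2)
    r = gate_proba * fit
    r /= r.sum(axis=1, keepdims=True)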
Add C API and Python bindings for MoE-specific predictions:
- LGBM_BoosterIsMixture: check if model is MoE
- LGBM_BoosterGetNumExperts: get number of experts
- LGBM_BoosterPredictRegime: predict regime (argmax of gate)
- LGBM_BoosterPredictRegimeProba: predict regime probabilities
- LGBM_BoosterPredictExpertPred: predict individual expert outputs
Python Booster methods:
- is_mixture() -> bool
- num_experts() -> int
- predict_regime(data) -> ndarray (n_samples,)
- predict_regime_proba(data) -> ndarray (n_samples, n_experts)
- predict_expert_pred(data) -> ndarray (n_samples, n_experts)
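A short usage sketch of the new Python methods. It assumes the lightgbm_moe package built from this PR is installed; the import name, toy data, and the soft-gating combination at the end are assumptions, not part of the commit.

import numpy as np
import lightgbm_moe as lgb   # import name assumed from the package rename above

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 5)), rng.normal(size=200)
X_test = rng.normal(size=(50, 5))

params = {'boosting': 'mixture', 'mixture_num_experts': 3, 'verbose': -1}
booster = lgb.train(params, lgb.Dataset(X_train, label=y_train), num_boost_round=20)

if booster.is_mixture():
    K = booster.num_experts()                       # 3
    regime = booster.predict_regime(X_test)         # (n_samples,)  argmax of the gate
    proba = booster.predict_regime_proba(X_test)    # (n_samples, K)
    expert = booster.predict_expert_pred(X_test)    # (n_samples, K)
    # With soft gating, the overall prediction is the gate-weighted mix of experts
    y_hat = (proba * expert).sum(axis=1)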
- docs_moe/version01.md: MoE specification document
- docs_claude/: implementation work logs
  - 00_work_plan.md: overall plan and progress
  - 01_python_rebrand.md: Phase 1 details
  - 02_cpp_mixture_gbdt.md: Phase 2 details
  - 03_python_wrapper.md: Phase 3 details
- Fix OpenMP race condition in E-step (private scores variable)
- Fix model loading to properly detect mixture type from file
- Fix predict crash by initializing early_stop in InitPredict
- Fix JSON parsing for loaded_parameter_ in MixtureGBDT
- Add acceptance tests for MoE functionality (15 tests)
- Add debug logging to boosting.cpp for model creation
- Add mathematical formulation (MoE vs standard GBDT)
- Document EM-style training algorithm
- List all key parameters and new prediction APIs
- Include assumptions and limitations
- Add performance benchmark results
- Bilingual documentation (English + Japanese)
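For reference, the documented mixture prediction presumably takes the standard MoE form (a sketch inferred from the Forward description above, not copied from the docs):
y_hat(x) = sum_{k=1..K} g_k(x) * f_k(x)
where f_k are the K expert GBDTs, g_k(x) are the softmax gate probabilities, and standard GBDT is the single-expert case; with hard gating (alpha = 1.0) the sum collapses to the argmax expert.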
Two critical bug fixes for MoE GBDT:
1. Gate training during warmup: MStepGate was being skipped during warmup, preventing the gate from learning the quantile-based initialization. The gate is now always trained.
2. Gate indexing in Forward(): fixed a class-major vs sample-major order mismatch. GetPredictAt returns class-major order (all of class 0, then all of class 1), but Forward() was assuming sample-major order. The indexing is now converted properly (see the sketch after this commit).
These fixes enable hard alpha (1.0) to work correctly, achieving a 13.6% RMSE improvement over standard GBDT on regime-switching data.
Also adds:
- Regime-switching demo script with visualization
- Updated README with benchmark results
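A tiny NumPy illustration of the class-major vs sample-major indexing fix above. The buffer layout is taken from the commit message; the variable names and shapes are hypothetical.

import numpy as np

# Hypothetical shapes: 4 samples, 2 gate classes (one per expert)
num_data, num_class = 4, 2
# GetPredictAt returns a flat class-major buffer: all of class 0, then all of class 1
raw = np.arange(num_data * num_class, dtype=float)
# Wrong (sample-major) indexing previously assumed: raw[i * num_class + k]
# Correct (class-major) indexing: raw[k * num_data + i]
gate_logits = raw.reshape(num_class, num_data).T   # (num_data, num_class) sample-major view
assert gate_logits[1, 0] == raw[0 * num_data + 1]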
Added honest benchmark comparison:
- Synthetic data (clear regime): MoE +13.6% improvement
- Real financial data (weak regime): Standard GBDT wins
Key insight: MoE is not universally better. It excels only when data has clear, separable regime structure where different regimes follow fundamentally different functions.
Fair comparison using:
- Optuna (50 trials each)
- Time-series CV (no leakage)
- Same features for both models
Results on S&P 500 5-day return prediction:
- CV: MoE slightly better (0.02975 vs 0.02998)
- Test: Standard slightly better (0.01793 vs 0.01801)
- Difference minimal (-0.43%), no significant advantage
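A sketch of this kind of tuning setup, assuming Optuna with time-series CV as described. The data, search space, and lightgbm_moe import name are placeholders, not the actual benchmark script.

import numpy as np
import optuna
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
import lightgbm_moe as lgb   # import name assumed

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 10)), rng.normal(size=500)   # placeholder data

def objective(trial):
    params = {
        'boosting': 'mixture',
        'num_leaves': trial.suggest_int('num_leaves', 8, 64),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
        'verbose': -1,
    }
    rmses = []
    for tr, va in TimeSeriesSplit(n_splits=5).split(X):   # time-ordered folds, no leakage
        booster = lgb.train(params, lgb.Dataset(X[tr], label=y[tr]), num_boost_round=100)
        pred = booster.predict(X[va])
        rmses.append(np.sqrt(mean_squared_error(y[va], pred)))
    return float(np.mean(rmses))

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)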
Add classic regime-switching benchmark results:
- Hamilton's GNP (1951-2009): MoE beats Standard GBDT by +5.4%
- VIX Volatility Regime: MoE marginally better (+0.4%)
This validates MoE for datasets with documented regime structure.
- Add 6-benchmark comparison table (3 datasets × few/many features)
- Document when MoE excels (Hamilton GNP +7.5% improvement)
- Add expert differentiation analysis (only Synthetic Few succeeded)
- Document EMA smoothing trade-off (accuracy vs interpretability)
- Add Japanese translation
Key findings:
- MoE excels when the regime IS determinable from the features (X)
- Synthetic data: +10.8% improvement with successful expert differentiation
- Hamilton GNP / VIX: MoE loses; expert collapse occurs
- Added regime confusion matrices showing the expert-regime mapping
Conclusion: MoE is effective when the gate can learn the regime from X, not when the regime is latent (Markov switching) or unobserved.
…rparams)
FAIR comparison: both Standard GBDT and MoE now search:
- num_leaves: 8-64
- learning_rate: 0.01-0.2
Results:
- Synthetic (X→Regime): MoE wins by +8~12% with expert differentiation
- VIX Regime: equal RMSE, but experts differentiate (R0→E0: 85%, R1→E1: 71%)
- Hamilton GNP: Standard wins, MoE -11~24% (latent Markov regime)
Conclusion unchanged: MoE excels when the regime is determinable from X
Full Optuna search for both Standard GBDT and MoE:
- num_leaves, max_depth, learning_rate, min_data_in_leaf
- feature_fraction, bagging_fraction, bagging_freq
- lambda_l1, lambda_l2
- MoE: K, alpha, warmup
Results (EMA OFF):
- Synthetic: MoE +12.8% (regime determinable from X)
- VIX: 0% difference but excellent differentiation (91%/96%)
- Hamilton: MoE -3.0% (latent regime causes collapse)
Added EMA smoothing appendix:
- Hamilton: EMA helps (+6.3% vs -3.0%)
- Synthetic: No-EMA better (+12.8% vs +5.1%)
- Recommendation: use EMA for temporal persistence, No-EMA for instant regime switches
Add 'markov' option to mixture_r_smoothing parameter that enables
Markov-style regime switching where previous gate probabilities
influence current predictions.
Changes:
- config.h: Add 'markov' option to mixture_r_smoothing enum
- mixture_gbdt.h: Add use_markov_, prev_gate_proba_ members and
PredictWithPrevProba, PredictRegimeProbaWithPrevProba methods
- mixture_gbdt.cpp: Implement Markov blending in Forward() and
prediction APIs that accept previous probabilities
- basic.py: Add predict_markov() and predict_regime_proba_markov()
Python methods for time-series inference
Usage:
params = {
    'boosting': 'mixture',
    'mixture_r_smoothing': 'markov',
    'mixture_r_ema_lambda': 0.3,  # blending coefficient
}
model = lgb.train(params, train_data)
y_pred = model.predict_markov(X_test) # Markov-smoothed predictions
The Markov mode blends current gate probabilities with previous:
proba[t] = (1-lambda) * gate_proba[t] + lambda * proba[t-1]
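A minimal NumPy sketch of this blending rule over a sequence of gate probabilities (illustrative only; the C++ code applies the blending inside Forward() and the prediction APIs):

import numpy as np

def markov_blend(gate_proba, lam):
    """gate_proba: (T, K) per-step gate probabilities; lam: blending coefficient."""
    out = np.empty_like(gate_proba)
    out[0] = gate_proba[0]
    for t in range(1, len(gate_proba)):
        out[t] = (1 - lam) * gate_proba[t] + lam * out[t - 1]
    return out

# Example: a sharp regime switch is smoothed toward the previous probabilities
gate = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
print(markov_blend(gate, lam=0.3))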
Comprehensive benchmark comparing EMA and Markov smoothing modes with full hyperparameter optimization (50 trials per model). Tests on Synthetic, Hamilton GNP-like, and VIX datasets. Results show similar performance between EMA and Markov (~0-1.5% difference).
The parameter was renamed to better reflect that it is used for both EMA and Markov smoothing modes, not just EMA.
Files updated:
- include/LightGBM/config.h
- src/boosting/mixture_gbdt.cpp
- src/io/config_auto.cpp
- python-package/lightgbm_moe/basic.py
- docs/Parameters.rst
- README.md
- docs_moe/version01.md
- docs_claude/02_cpp_mixture_gbdt.md
- examples/benchmark_ema_vs_markov.py
…ickiness)
Implemented 5 new regime switching approaches:
- Transition Matrix Learning: K×K soft transition matrix from responsibilities
- Hamilton Filter: Bayesian update combining likelihood and transition prior
- Regime Stickiness: duration-based bonus using log(1 + duration)
- Adaptive Gate: online weight adjustment based on recent performance
- Lagged Features: helper function for y[t-1], y[t-2], X[t-1] features
New C++ parameters:
- mixture_r_smoothing: "none", "ema", "markov", "transition", "hamilton"
- mixture_regime_stickiness: duration-based bonus coefficient
- mixture_transition_prior: Dirichlet smoothing for the transition matrix
New Python methods:
- Booster.predict_adaptive(): online adaptive weighting
- Booster.predict_with_transition(): transition matrix-based prediction
- create_lagged_features(): time-series feature engineering
- estimate_transition_matrix(): soft transition estimation
- compute_regime_duration_stats(): regime analysis
Benchmark results (30 Optuna trials):
- Synthetic: Transition best (RMSE = 4.54)
- Hamilton GNP: Hamilton Filter best (RMSE = 0.31)
…ing_lambda)
Removed the advanced regime smoothing methods that required additional hyperparameters:
- Removed transition mode (required mixture_transition_prior)
- Removed hamilton mode (required mixture_transition_prior)
- Removed stickiness mode (required mixture_regime_stickiness)
Only 3 smoothing modes now remain:
- none: no smoothing
- ema: exponential moving average
- markov: previous gate probability blending
All modes use just mixture_smoothing_lambda (0-1). Removed 418 lines of C++ code and simplified the benchmark script. Python helper functions (create_lagged_features, estimate_transition_matrix, etc.) are preserved for analysis use cases.
Added momentum smoothing mode:
- Considers trend (direction of regime changes)
- extrapolated[i] = r[i-1] + λ*(r[i-1] - r[i-2])
- r_smooth[i] = (1-λ)*r[i] + λ*extrapolated[i] (see the sketch after this commit)
- Useful when regimes have inertia
Removed Python helper functions to keep the codebase simple:
- create_lagged_features()
- estimate_transition_matrix()
- compute_regime_duration_stats()
- predict_adaptive()
- predict_with_transition()
Final smoothing modes (all use the single mixture_smoothing_lambda):
- none: no smoothing
- ema: exponential moving average
- markov: blend with previous gate probabilities
- momentum: EMA with trend extrapolation
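A minimal NumPy sketch of the momentum formula above, applied literally to a sequence of responsibilities (illustrative only; the real implementation may clip or renormalize the result):

import numpy as np

def momentum_smooth(r, lam):
    """r: (T, K) responsibilities; lam: mixture_smoothing_lambda."""
    out = r.copy()
    for i in range(2, len(r)):
        extrapolated = r[i - 1] + lam * (r[i - 1] - r[i - 2])   # trend extrapolation
        out[i] = (1 - lam) * r[i] + lam * extrapolated
    return out

r = np.array([[0.8, 0.2], [0.7, 0.3], [0.6, 0.4], [0.5, 0.5]])
print(momentum_smooth(r, lam=0.5))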
Removed unused remnants from an earlier attempt to implement the Markov mode by augmenting the gate dataset with previous probabilities:
- gate_dataset_, gate_markov_config_, gate_raw_data_ (member variables)
- prev_responsibilities_, momentum_trend_ (unused buffers)
- num_original_features_ (unused)
- CreateGateDataset(), UpdateGateDataset() (never called)
The simpler approach (blending after Forward) was kept instead.
…ling
Updated all stub functions in MixtureGBDT to throw a Log::Fatal error with clear messages instead of silently returning empty/default values. This makes unsupported operations fail loudly rather than producing confusing behavior.
Affected functions:
- GetEvalAt, GetNumPredictAt (validation)
- PredictLeafIndex, PredictLeafIndexByMap
- PredictContrib, PredictContribByMap
- AddValidDataset, DumpModel
- GetLeafValue, SetLeafValue
Implement dynamic expert bias adjustment to prevent expert collapse without auxiliary-loss interference, based on the 2024 paper "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts".
Algorithm:
- Track actual load per expert (mean responsibility)
- Compare with target load (1/K, uniform distribution)
- Adjust bias: bias[k] += η * (target - actual)
- Add bias to gate logits before softmax
This approach:
- Does not interfere with prediction gradients
- Requires no additional hyperparameters
- Maintains balanced expert utilization automatically
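A sketch of the bias update and biased gating described above (illustrative; the learning rate η, default value, and calling convention are assumptions, not the C++ API):

import numpy as np

def update_expert_bias(bias, responsibilities, eta=0.1):
    """bias: (K,); responsibilities: (n, K) from the E-step."""
    K = responsibilities.shape[1]
    actual_load = responsibilities.mean(axis=0)       # observed load per expert
    target_load = 1.0 / K                             # uniform target
    return bias + eta * (target_load - actual_load)   # bias[k] += eta * (target - actual)

def gate_proba_with_bias(gate_logits, bias):
    """Add the per-expert bias to the gate logits before the softmax."""
    z = gate_logits + bias
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)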
Replace the fixed mixture_r_min threshold with mixture_balance_factor, which calculates the minimum usage dynamically as 1 / (factor * K):
- factor=10 (default): K=2 allows 95:5, K=5 allows 98:2
- factor=2: K=2 allows 75:25, K=5 allows 90:10
- Range: 2-10 (lower = more balanced, higher = more imbalance allowed)
Also removes hard clipping in EStep; collapse prevention is now handled entirely by Loss-Free Load Balancing (bias adjustment).
- Add .github/workflows/release.yml for automatic releases on tag push
  - Builds wheels for Linux (x86_64, aarch64), macOS (x86_64, arm64), and Windows
  - Creates a GitHub Release with all artifacts attached
- Update pyproject.toml for independent versioning (v0.1.0+)
  - Use setuptools-scm for automatic versioning from git tags
  - Change Development Status to Beta
  - Update URLs to point to the LightGBM-MoE fork
- Update README with GitHub Releases installation instructions
  - Add Option 1: Install from GitHub Releases (recommended)
  - Keep Option 2: Build from Source
  - Update both English and Japanese sections
- Add benchmark scripts with mixture_balance_factor in the search space
Author
Sorry, wrong repository. This was meant for my fork.
Summary
Adds two major features that improve the stability and performance of the MoE (Mixture-of-Experts) model:
New Features
1. Regime Smoothing
params = {
    'boosting': 'mixture',
    'mixture_r_smoothing': 'ema',  # or 'markov', 'momentum'
    'mixture_smoothing_lambda': 0.5,  # smoothing strength (0-1)
}
2. Expert Load Balancing
Dynamically adjusts a per-expert bias so that the experts are used evenly; it does not interfere with gradients and does not impede training.
params = {
    'mixture_balance_factor': 5,  # min_usage = 1/(factor × K)
}
Benchmark Results (100 trials × 3 datasets)
Conclusion: MoE is effective when the regime can be determined from the features.
Other Changes
Files Changed