
Conversation


@kyo219 kyo219 commented Jan 12, 2026

Summary

Adds two major features that improve the stability and performance of MoE (Mixture-of-Experts) models:

  1. Regime Smoothing: four modes that smooth regime switching on time-series data
  2. Loss-Free Load Balancing: dynamic bias adjustment that prevents expert collapse

New Features

  1. Regime Smoothing (mixture_r_smoothing)
| Mode | Description | Use Case |
| --- | --- | --- |
| none | No smoothing (default) | i.i.d. data |
| ema | Exponential moving average | Gradual regime transitions |
| markov | Takes the previous time step's probabilities into account | Persistent regimes |
| momentum | EMA + trend following | Trending regimes |

params = {
    'boosting': 'mixture',
    'mixture_r_smoothing': 'ema',     # or 'markov', 'momentum'
    'mixture_smoothing_lambda': 0.5,  # smoothing strength (0-1)
}

  2. Loss-Free Load Balancing (mixture_balance_factor)

Dynamically adjusts per-expert biases so that experts are used evenly. It does not interfere with the gradients, so it does not hinder training.

params = {
    'mixture_balance_factor': 5,  # min_usage = 1/(factor × K)
}
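
A minimal sketch combining both features (the parameter names are the ones documented above; the import name lightgbm_moe and the value of mixture_num_experts are assumptions based on the python-package rename in this PR, and the exact API may differ):

# Sketch only: regime smoothing and load balancing enabled in one config.
import lightgbm_moe as lgb  # import name assumed from the python-package rename

params = {
    'boosting': 'mixture',
    'mixture_num_experts': 2,         # K experts
    'mixture_r_smoothing': 'ema',
    'mixture_smoothing_lambda': 0.5,
    'mixture_balance_factor': 5,      # min_usage = 1/(5 × 2) = 10% per expert
}
booster = lgb.train(params, train_data)  # train_data: a standard Dataset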

Benchmark Results (100 trials × 3 datasets)

| Dataset | True K | Std RMSE | MoE RMSE | Improvement |
| --- | --- | --- | --- | --- |
| Synthetic (X→Regime) | 2 | 5.2168 | 4.3478 | +16.7% |
| Hamilton GNP | 2 | 0.7379 | 0.7376 | +0.0% |
| VIX Regime | 2 | 0.0118 | 0.0118 | -0.7% |

Conclusion: MoE is effective when the regime can be identified from the features.

Other Changes

  • Added a GitHub Releases auto-publish workflow (tag push → auto release)
  • Introduced independent versioning (v0.1.0+, setuptools-scm)
  • Added benchmark scripts (examples/benchmark_*.py)
  • Added installation instructions and benchmark results to the README

Files Changed

  • src/boosting/mixture_gbdt.cpp/h - Core smoothing & load balancing implementation
  • include/LightGBM/config.h - New parameters
  • python-package/ - Python bindings & version config
  • .github/workflows/release.yml - Auto-release workflow
  • examples/benchmark_*.py - Benchmark scripts

kyo219 and others added 25 commits January 11, 2026 21:32
- Rename python-package/lightgbm/ to python-package/lightgbm_moe/
- Update pyproject.toml: name to lightgbm-moe, description updated
- Update import references in __init__.py, compat.py
- Add LightGBM-MoE description to README.md
Add C++ implementation of Mixture-of-Experts GBDT:
- Add MoE parameters to config.h (mixture_enable, mixture_num_experts, etc.)
- Create MixtureGBDT class with K expert GBDTs + 1 gate GBDT
- Implement EM-style training loop (Forward, E-step, M-step)
- Add mixture boosting type to factory in boosting.cpp
- Add mixture_gbdt.cpp to CMakeLists.txt

Training algorithm:
- Forward: compute expert predictions and gate probabilities
- E-step: update responsibilities r_ik based on expert fit and gate
- M-step Experts: train with responsibility-weighted gradients
- M-step Gate: train with argmax responsibilities as pseudo-labels
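
A toy, self-contained numpy stand-in for this loop (illustration only: constant-value "experts" and a linear "gate" replace the K expert GBDTs and the gate GBDT, and the likelihood/learning-rate choices are assumptions):

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
y = np.where(X[:, 0] > 0, 3.0, -3.0) + rng.normal(scale=0.5, size=500)

K, sigma = 2, 1.0
expert_mu = rng.normal(size=K)    # each toy expert predicts a constant
W = np.zeros((1, K))              # toy gate: linear map X -> regime logits

for it in range(50):
    # Forward: expert predictions and gate probabilities
    preds = np.tile(expert_mu, (len(y), 1))          # (n, K)
    gate_proba = softmax(X @ W)                      # (n, K)

    # E-step: responsibilities ∝ gate probability × Gaussian fit of each expert
    lik = np.exp(-0.5 * ((y[:, None] - preds) / sigma) ** 2)
    r = gate_proba * lik
    r /= r.sum(axis=1, keepdims=True)

    # M-step (experts): responsibility-weighted refit (≈ one weighted boosting round)
    expert_mu = (r * y[:, None]).sum(axis=0) / r.sum(axis=0)

    # M-step (gate): one gradient step on argmax responsibilities as pseudo-labels
    pseudo = np.eye(K)[r.argmax(axis=1)]
    W += 0.5 * X.T @ (pseudo - gate_proba) / len(y)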
Add C API and Python bindings for MoE-specific predictions:
- LGBM_BoosterIsMixture: check if model is MoE
- LGBM_BoosterGetNumExperts: get number of experts
- LGBM_BoosterPredictRegime: predict regime (argmax of gate)
- LGBM_BoosterPredictRegimeProba: predict regime probabilities
- LGBM_BoosterPredictExpertPred: predict individual expert outputs

Python Booster methods:
- is_mixture() -> bool
- num_experts() -> int
- predict_regime(data) -> ndarray (n_samples,)
- predict_regime_proba(data) -> ndarray (n_samples, n_experts)
- predict_expert_pred(data) -> ndarray (n_samples, n_experts)
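
A usage sketch of these methods, assuming a booster already trained with 'boosting': 'mixture' (the soft combination at the end is illustrative only; the booster's own predict() performs the actual combination):

# Inspecting a trained MoE booster with the methods listed above.
if booster.is_mixture():
    K = booster.num_experts()                           # e.g. 2
    regime = booster.predict_regime(X_test)             # (n_samples,) argmax of gate
    proba = booster.predict_regime_proba(X_test)        # (n_samples, K)
    expert_preds = booster.predict_expert_pred(X_test)  # (n_samples, K)
    # Illustrative soft combination of experts weighted by the gate:
    y_soft = (proba * expert_preds).sum(axis=1)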
- docs_moe/version01.md: MoE specification document
- docs_claude/: implementation work logs
  - 00_work_plan.md: overall plan and progress
  - 01_python_rebrand.md: Phase 1 details
  - 02_cpp_mixture_gbdt.md: Phase 2 details
  - 03_python_wrapper.md: Phase 3 details
- Fix OpenMP race condition in E-step (private scores variable)
- Fix model loading to properly detect mixture type from file
- Fix predict crash by initializing early_stop in InitPredict
- Fix JSON parsing for loaded_parameter_ in MixtureGBDT
- Add acceptance tests for MoE functionality (15 tests)
- Add debug logging to boosting.cpp for model creation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Add mathematical formulation (MoE vs standard GBDT)
- Document EM-style training algorithm
- List all key parameters and new prediction APIs
- Include assumptions and limitations
- Add performance benchmark results
- Bilingual documentation (English + Japanese)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Two critical bug fixes for MoE GBDT:

1. Gate training during warmup: MStepGate was being skipped during warmup,
   preventing the gate from learning the quantile-based initialization.
   Now gate is always trained.

2. Gate indexing in Forward(): Fixed class-major vs sample-major order mismatch.
   GetPredictAt returns class-major order (all class 0, then class 1), but
   Forward() was assuming sample-major order. Now properly converts indexing.
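
Illustratively, the two layouts differ in the flat index used for sample i and class k (names below are placeholders, not the actual C++ identifiers):

def class_major_index(i, k, num_samples):
    # GetPredictAt layout: all samples of class 0, then all samples of class 1, ...
    return k * num_samples + i

def sample_major_index(i, k, num_classes):
    # Layout Forward() previously assumed: classes interleaved per sample
    return i * num_classes + k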

These fixes enable hard alpha (1.0) to work correctly, achieving 13.6%
RMSE improvement over standard GBDT on regime-switching data.

Also adds:
- Regime-switching demo script with visualization
- Updated README with benchmark results

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Added honest benchmark comparison:
- Synthetic data (clear regime): MoE +13.6% improvement
- Real financial data (weak regime): Standard GBDT wins

Key insight: MoE is not universally better. It excels only when
data has clear, separable regime structure where different regimes
follow fundamentally different functions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Fair comparison using:
- Optuna (50 trials each)
- Time-series CV (no leakage)
- Same features for both models

Results on S&P 500 5-day return prediction:
- CV: MoE slightly better (0.02975 vs 0.02998)
- Test: Standard slightly better (0.01793 vs 0.01801)
- Difference minimal (-0.43%), no significant advantage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add classic regime-switching benchmark results:
- Hamilton's GNP (1951-2009): MoE beats Standard GBDT by +5.4%
- VIX Volatility Regime: MoE marginally better (+0.4%)

This validates MoE for datasets with documented regime structure.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Add 6-benchmark comparison table (3 datasets × few/many features)
- Document when MoE excels (Hamilton GNP +7.5% improvement)
- Add expert differentiation analysis (only Synthetic Few succeeded)
- Document EMA smoothing trade-off (accuracy vs interpretability)
- Add Japanese translation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Key findings:
- MoE excels when regime IS determinable from features (X)
- Synthetic data: +10.8% improvement with expert differentiation success
- Hamilton GNP / VIX: MoE loses, expert collapse occurs
- Added regime confusion matrices showing expert-regime mapping

Conclusion: MoE is effective when gate can learn regime from X,
not when regime is latent (Markov switching) or unobserved.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…rparams)

FAIR comparison: Both Standard GBDT and MoE now search:
- num_leaves: 8-64
- learning_rate: 0.01-0.2

Results:
- Synthetic (X→Regime): MoE wins +8~12% with expert differentiation
- VIX Regime: Equal RMSE but experts differentiate (R0→E0:85%, R1→E1:71%)
- Hamilton GNP: Standard wins -11~24% (latent Markov regime)

Conclusion unchanged: MoE excels when regime is determinable from X

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Full Optuna search for both Standard GBDT and MoE:
  - num_leaves, max_depth, learning_rate, min_data_in_leaf
  - feature_fraction, bagging_fraction, bagging_freq
  - lambda_l1, lambda_l2
  - MoE: K, alpha, warmup

Results (EMA OFF):
- Synthetic: MoE +12.8% (regime determinable from X)
- VIX: 0% difference but excellent differentiation (91%/96%)
- Hamilton: MoE -3.0% (latent regime causes collapse)

Added EMA smoothing appendix:
- Hamilton: EMA helps (+6.3% vs -3.0%)
- Synthetic: No-EMA better (+12.8% vs +5.1%)
- Recommendation: Use EMA for temporal persistence, No-EMA for instant regime

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add 'markov' option to mixture_r_smoothing parameter that enables
Markov-style regime switching where previous gate probabilities
influence current predictions.

Changes:
- config.h: Add 'markov' option to mixture_r_smoothing enum
- mixture_gbdt.h: Add use_markov_, prev_gate_proba_ members and
  PredictWithPrevProba, PredictRegimeProbaWithPrevProba methods
- mixture_gbdt.cpp: Implement Markov blending in Forward() and
  prediction APIs that accept previous probabilities
- basic.py: Add predict_markov() and predict_regime_proba_markov()
  Python methods for time-series inference

Usage:
  params = {
      'boosting': 'mixture',
      'mixture_r_smoothing': 'markov',
      'mixture_r_ema_lambda': 0.3,  # blending coefficient
  }
  model = lgb.train(params, train_data)
  y_pred = model.predict_markov(X_test)  # Markov-smoothed predictions

The Markov mode blends current gate probabilities with previous:
  proba[t] = (1-lambda) * gate_proba[t] + lambda * proba[t-1]
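
A minimal numpy sketch of that blending rule (illustrative only, not the C++ implementation):

import numpy as np

def markov_blend(gate_proba, lam):
    # gate_proba: (n_samples, K) gate probabilities in time order
    # lam: the blending coefficient (mixture_r_ema_lambda above)
    out = np.empty_like(gate_proba)
    out[0] = gate_proba[0]
    for t in range(1, len(gate_proba)):
        out[t] = (1 - lam) * gate_proba[t] + lam * out[t - 1]
    return out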

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Comprehensive benchmark comparing EMA and Markov smoothing modes with full
hyperparameter optimization (50 trials per model). Tests on Synthetic,
Hamilton GNP-like, and VIX datasets.

Results show similar performance between EMA and Markov (~0-1.5% difference).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
The parameter (mixture_r_ema_lambda → mixture_smoothing_lambda) was renamed to better reflect that it is used for both the EMA
and Markov smoothing modes, not just EMA.

Files updated:
- include/LightGBM/config.h
- src/boosting/mixture_gbdt.cpp
- src/io/config_auto.cpp
- python-package/lightgbm_moe/basic.py
- docs/Parameters.rst
- README.md
- docs_moe/version01.md
- docs_claude/02_cpp_mixture_gbdt.md
- examples/benchmark_ema_vs_markov.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…ickiness)

Implemented 5 new regime switching approaches:
- Transition Matrix Learning: K×K soft transition matrix from responsibilities
- Hamilton Filter: Bayesian update combining likelihood and transition prior
- Regime Stickiness: Duration-based bonus using log(1 + duration)
- Adaptive Gate: Online weight adjustment based on recent performance
- Lagged Features: Helper function for y[t-1], y[t-2], X[t-1] features

New C++ parameters:
- mixture_r_smoothing: "none", "ema", "markov", "transition", "hamilton"
- mixture_regime_stickiness: Duration-based bonus coefficient
- mixture_transition_prior: Dirichlet smoothing for transition matrix

New Python methods:
- Booster.predict_adaptive(): Online adaptive weighting
- Booster.predict_with_transition(): Transition matrix-based prediction
- create_lagged_features(): Time-series feature engineering
- estimate_transition_matrix(): Soft transition estimation
- compute_regime_duration_stats(): Regime analysis

Benchmark results (30 Optuna trials):
- Synthetic: Transition best (RMSE=4.54)
- Hamilton GNP: Hamilton Filter best (RMSE=0.31)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…ing_lambda)

Removed advanced regime smoothing methods that required additional hyperparameters:
- Removed transition mode (required mixture_transition_prior)
- Removed hamilton mode (required mixture_transition_prior)
- Removed stickiness mode (required mixture_regime_stickiness)

Now only 3 smoothing modes remain:
- none: No smoothing
- ema: Exponential moving average
- markov: Previous gate probability blending

All modes use just mixture_smoothing_lambda (0-1).

Removed 418 lines of C++ code and simplified benchmark script.
Python helper functions (create_lagged_features, estimate_transition_matrix, etc.)
are preserved for analysis use cases.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Added momentum smoothing mode:
- Considers trend (direction of regime changes)
- extrapolated[i] = r[i-1] + λ*(r[i-1] - r[i-2])
- r_smooth[i] = (1-λ)*r[i] + λ*extrapolated[i]
- Useful when regimes have inertia
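
A small numpy sketch of those two update lines (illustrative; it follows the formulas literally, using the raw previous responsibilities):

import numpy as np

def momentum_smooth(r, lam):
    # r: (n_samples, K) responsibilities/probabilities in time order
    # lam: mixture_smoothing_lambda
    out = r.copy()
    for i in range(2, len(r)):
        extrapolated = r[i - 1] + lam * (r[i - 1] - r[i - 2])
        out[i] = (1 - lam) * r[i] + lam * extrapolated
    return out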

Removed Python helper functions to keep codebase simple:
- create_lagged_features()
- estimate_transition_matrix()
- compute_regime_duration_stats()
- predict_adaptive()
- predict_with_transition()

Final smoothing modes (all use single mixture_smoothing_lambda):
- none: No smoothing
- ema: Exponential moving average
- markov: Blend with previous gate probabilities
- momentum: EMA with trend extrapolation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Removed unused remnants from earlier attempt to implement Markov mode
by augmenting gate dataset with previous probabilities:

- gate_dataset_, gate_markov_config_, gate_raw_data_ (member vars)
- prev_responsibilities_, momentum_trend_ (unused buffers)
- num_original_features_ (unused)
- CreateGateDataset(), UpdateGateDataset() (never called)

The simpler approach (blending after Forward) was kept instead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…ling

Updated all stub functions in MixtureGBDT to throw Log::Fatal error
with clear messages instead of returning empty/default values silently.
This makes unsupported operations fail loudly rather than producing
confusing behavior.

Affected functions:
- GetEvalAt, GetNumPredictAt (validation)
- PredictLeafIndex, PredictLeafIndexByMap
- PredictContrib, PredictContribByMap
- AddValidDataset, DumpModel
- GetLeafValue, SetLeafValue

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Implement dynamic expert bias adjustment to prevent expert collapse
without auxiliary loss interference. Based on the 2024 paper
"Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts".

Algorithm:
- Track actual load per expert (mean responsibility)
- Compare with target load (1/K uniform distribution)
- Adjust bias: bias[k] += η * (target - actual)
- Add bias to gate logits before softmax

This approach:
- Does not interfere with prediction gradients
- Requires no additional hyperparameters
- Maintains balanced expert utilization automatically
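
A minimal numpy sketch of the bias update described above (illustrative; eta and the names below are placeholders, not the C++ symbols):

import numpy as np

def update_expert_bias(bias, responsibilities, eta=0.01):
    # bias: (K,) per-expert logit bias; responsibilities: (n_samples, K)
    K = responsibilities.shape[1]
    actual = responsibilities.mean(axis=0)   # actual load per expert
    target = np.full(K, 1.0 / K)             # uniform target load
    return bias + eta * (target - actual)    # push under-used experts up

def gate_proba_with_bias(gate_logits, bias):
    z = gate_logits + bias                   # add bias to logits before softmax
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)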

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Replace fixed mixture_r_min threshold with mixture_balance_factor
that calculates minimum usage dynamically: 1 / (factor * K)

- factor=10 (default): K=2 allows 95:5, K=5 allows 98:2
- factor=2: K=2 allows 75:25, K=5 allows 90:10
- Range: 2-10 (lower = more balanced, higher = more imbalance allowed)

Also removes hard clipping in EStep - collapse prevention is now
handled entirely by Loss-Free Load Balancing (bias adjustment).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Add .github/workflows/release.yml for automatic release on tag push
  - Builds wheels for Linux (x86_64, aarch64), macOS (x86_64, arm64), Windows
  - Creates GitHub Release with all artifacts attached

- Update pyproject.toml for independent versioning (v0.1.0+)
  - Use setuptools-scm for automatic version from git tags
  - Change Development Status to Beta
  - Update URLs to point to LightGBM-MoE fork

- Update README with GitHub Releases installation instructions
  - Add Option 1: Install from GitHub Releases (recommended)
  - Keep Option 2: Build from Source
  - Updated both English and Japanese sections

- Add benchmark scripts with mixture_balance_factor in search space

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@kyo219 kyo219 requested a review from StrikerRUS as a code owner January 12, 2026 13:43
@kyo219 kyo219 closed this Jan 12, 2026

kyo219 commented Jan 12, 2026

Sorry, wrong repository. This was meant for my fork.

@kyo219 kyo219 deleted the feature/markov_switching branch January 23, 2026 14:23