
# Post-Training for Qwen1.5-MoE & Qwen3-MoE

This folder contains the code and instructions for reproducing the experiments on Qwen1.5-MoE-A2.7B and Qwen3-30B-A3B-Base described in the DenseMixer blog post. All experiments are run with LLaMA-Factory.


## Table of Contents

- [Environment Setup](#environment-setup)
- [Data Preparation](#data-preparation)
- [Training](#training)
- [Evaluation](#evaluation)
- [References](#references)

## Environment Setup

Refer to `installation.sh` for detailed setup instructions.


## Data Preparation

We use a diverse set of datasets for training and evaluation. Our pre-processed datasets are available on Hugging Face.

For Qwen1.5-MoE-A2.7B, we use GSM (math reasoning), CodeAlpaca (code generation), ESFT-intent (intent understanding), ESFT-law (legal reasoning), ESFT-summary (summarization), and ESFT-translation (translation). For code-generation evaluation, we use MBPP and HumanEval.

For Qwen3-30B-A3B-Base, we use s1 (math reasoning) and nemotron-code (code reasoning). For evaluation, we use challenging math and coding benchmarks that require reasoning abilities.


## Training

We support the following fine-tuning methods. For each method, replace `{dataset_name}` with your target dataset (`gsm`, `codealpaca`, `esft`).

### 1. Frozen Router

**Full Fine-tuning & LoRA**

```shell
cd LLaMA-Factory
export WANDB_API_KEY="YOUR_WANDB_API_KEY"
bash run/qwen1.5/frozen_full/train_{dataset_name}_frozen.sh
bash run/qwen1.5/frozen_lora/train_{dataset_name}.sh
```
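For example, the `{dataset_name}` placeholder can be expanded over all three datasets in a loop (a sketch; the script paths follow the pattern above):

```shell
# Sketch: expand {dataset_name} over the supported datasets.
# Here we only print the resolved script paths; replace `echo` with
# `bash` to actually launch the runs.
for dataset_name in gsm codealpaca esft; do
  echo "run/qwen1.5/frozen_full/train_${dataset_name}_frozen.sh"
done
```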

### 2. Conventional Training

**Full Fine-tuning & LoRA**

```shell
cd LLaMA-Factory
export WANDB_API_KEY="YOUR_WANDB_API_KEY"
bash run/qwen1.5/conventional_full/train_{dataset_name}.sh
bash run/qwen1.5/conventional_lora/train_{dataset_name}_unfrozen.sh

# Qwen3-30B
bash run/qwen3/train_full-code.sh
bash run/qwen3/train_full-math.sh
```

### 3. DenseMixer

To run DenseMixer, one additional setup step is required:

```shell
pip install densemixer
densemixer setup
```
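A quick way to confirm the installation succeeded (assuming the pip package exposes an importable `densemixer` module — an assumption, not documented behavior):

```shell
# Sanity check: the module name `densemixer` is assumed to match the pip package.
python -c "import densemixer; print('densemixer importable')" \
  || echo "densemixer not installed -- run: pip install densemixer"
```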

**Full Fine-tuning & LoRA**

```shell
cd LLaMA-Factory
export WANDB_API_KEY="YOUR_WANDB_API_KEY"
bash run/qwen1.5/densemixer_full/train_{dataset_name}_densemixer.sh
bash run/qwen1.5/densemixer_lora/train_{dataset_name}_densemixer.sh

# Qwen3-30B
bash run/qwen3/train_densemixer-code.sh
bash run/qwen3/train_densemixer-math.sh
```

### 4. ESFT Fine-tuning

Before running ESFT, generate expert configs:

```shell
cd esft_experts_gen/experts
bash get_expert_scores.sh
bash generate_expert_config.sh
```

**ESFT-Gate** (selects experts by average gate score)

```shell
bash run/qwen1.5/esft-gate/train_gsm.sh
bash run/qwen1.5/esft-gate/train_code.sh
bash run/qwen1.5/esft-gate/train_esft.sh
bash run/qwen1.5/esft-gate/train_amthinking_math.sh
```

**ESFT-Token** (selects experts by token selection ratio)

```shell
bash run/qwen1.5/esft-token/train_gsm.sh
bash run/qwen1.5/esft-token/train_code.sh
bash run/qwen1.5/esft-token/train_esft.sh
bash run/qwen1.5/esft-token/train_amthinking_math.sh
```



## Evaluation

Evaluation scripts are in the `eval/` directory. See `eval/README.md` for details and environment setup.

Key arguments to modify for your trained model:

- `--save_dir`
- `--model_name_or_path`
- `--tokenizer_name_or_path`
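A hypothetical invocation might look like the following; `eval_script.py` and the checkpoint paths are placeholders, not the repo's actual entry points (those are listed in `eval/README.md`):

```shell
# Hypothetical example -- the script name and paths are illustrative only.
MODEL=checkpoints/qwen1.5-densemixer-gsm
CMD="python eval_script.py --save_dir results/gsm \
  --model_name_or_path $MODEL --tokenizer_name_or_path $MODEL"
echo "$CMD"   # inspect the command; then run it with: eval "$CMD"
```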

## References