We need to investigate training the Qwen Next 80B-A3B-Thinking model on the RTX Pro machine.
- verl
verl works well with non-MoE models, but for MoE models it does not handle the backward pass correctly.
- LLaMA-Factory
LLaMA-Factory also supports training MoE models. For Qwen Next, only QLoRA can be trained; LoRA and full fine-tuning both run out of memory (OOM). Even with QLoRA, training takes a very long time (related issue: hiyouga/LlamaFactory#9178 (comment)).
I am following the suggestion in hiyouga/LlamaFactory#9178 (comment) to improve Qwen Next training speed, but I still get OOM, even with QLoRA. I will continue to investigate this.
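
For context on the OOM results above, here is a rough back-of-the-envelope sketch of the weight memory involved. It assumes 80e9 total parameters (taken from the "80B" in the model name; for an MoE model all experts must be resident even though only about 3B parameters are active per token) and uses the common rule of thumb of about 16 bytes per parameter for mixed-precision Adam full fine-tuning. The function names are hypothetical, and activations, LoRA adapters, KV cache, and framework overhead are ignored, so real usage will be higher.

```python
# Rough memory estimate; all numbers are approximations, not measurements.

TOTAL_PARAMS = 80e9  # assumption: "80B" total parameters from the model name


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def full_finetune_memory_gb(num_params: float, bytes_per_param: float = 16.0) -> float:
    """Weights + gradients + Adam states, using a ~16 bytes/param rule of thumb."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # LoRA keeps the frozen base model in 16-bit precision.
    print(f"bf16 base weights (LoRA):   ~{weight_memory_gb(TOTAL_PARAMS, 16):.0f} GB")
    # QLoRA keeps the frozen base model in 4-bit precision.
    print(f"4-bit base weights (QLoRA): ~{weight_memory_gb(TOTAL_PARAMS, 4):.0f} GB")
    # Full fine-tuning also needs gradients and optimizer states.
    print(f"full fine-tune (Adam):      ~{full_finetune_memory_gb(TOTAL_PARAMS):.0f} GB")
```

Under these assumptions the frozen base weights alone are roughly 160 GB in bf16 versus roughly 40 GB in 4-bit, which is consistent with QLoRA being the only option that gets as far as training on this machine.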