Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please open a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
Thanks for the cool framework for training draft models. I recently ran into a problem when fine-tuning an existing draft model (lmsys/Qwen3-235B-A22B-EAGLE3) on my domain. I prepared part of the ShareGPT dataset along with the corresponding hidden_states. After placing the already trained model in cache/model/epoch_0 and launching the offline training script, I saw an acc of only 0.00-0.03 for a long time, although I expected a high acc because this model accelerates the target model quite well in various frameworks. After training a full first epoch, the resulting model gave a mean_acceptance_length of ~1.2 in the vLLM framework, i.e. the model degraded.
I printed the model parameters loaded from the .safetensors file and they are correct. I also tried replacing the cached vocab_mapping (t2d, d2t) with the tensors from the trained model. Nothing helps.
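For reference, this is roughly the check I used to confirm the parameters load correctly (a minimal sketch; the checkpoint file name is from my local setup and may differ in yours):

```python
# Minimal sketch of the parameter check described above.
# The checkpoint file name is an assumption from my local setup and may differ.
from safetensors.torch import load_file

state_dict = load_file("cache/model/epoch_0/model.safetensors")

for name in sorted(state_dict):
    # Print basic statistics so obviously broken tensors (all zeros, NaNs) stand out.
    t = state_dict[name].float()
    print(f"{name}: shape={tuple(state_dict[name].shape)} "
          f"mean={t.mean().item():.6f} std={t.std().item():.6f}")
```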
Reproduction
1. Add the lmsys/Qwen3-235B-A22B-EAGLE3 model to the cache/model/epoch_0 directory.
2. Launch scripts/prepare_data.py for some dataset.
3. Launch scripts/prepare_hidden_states.py for the prepared dataset.
4. Launch train_eagle3_offline.py.

From the start of training, acc stays at 0.00. (Before step 4 I also ran the vocab-mapping consistency check sketched below.)
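To make the vocab_mapping check mentioned above concrete, here is the small consistency script I ran before step 4. The cache path and the d2t/t2d key names are assumptions from my local setup and may not match the repository layout exactly:

```python
# Hypothetical consistency check: compare the vocab-mapping cache generated for my
# dataset against the d2t/t2d tensors stored in the trained EAGLE3 checkpoint.
# File paths and key names are assumptions from my local setup, not the repo's exact layout.
import torch
from safetensors.torch import load_file

cache_mapping = torch.load("cache/vocab_mapping/mapping.pt")   # assumed output location of prepare_data.py
ckpt = load_file("cache/model/epoch_0/model.safetensors")      # trained draft checkpoint

for key in ("d2t", "t2d"):
    cached, trained = cache_mapping.get(key), ckpt.get(key)
    if cached is None or trained is None:
        print(f"{key}: missing from one of the sources")
        continue
    # Cast to long before comparing, since these are index-mapping tensors.
    same = (tuple(cached.shape) == tuple(trained.shape)
            and torch.equal(cached.cpu().long(), trained.cpu().long()))
    print(f"{key}: cache shape {tuple(cached.shape)} vs checkpoint shape {tuple(trained.shape)}, identical={same}")
```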
Environment
The latest version of the SpecForge framework.