Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
Hi team, I am trying to train Eagle3 for gpt-oss-120b by following the example at run_gpt_oss_120b_eagle3_sgl_online.sh.
I am using docker.io/lmsysorg/sglang:dev as the base image, and I installed SpecForge by running pip install -e . in the SpecForge git directory.
Training currently fails with the error below. Can someone please help me understand what might be happening here? Thanks!
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/git/SpecForge/scripts/train_eagle3_sgl_online.py", line 775, in <module>
[rank0]: main()
[rank0]: File "/mnt/git/SpecForge/scripts/train_eagle3_sgl_online.py", line 771, in main
[rank0]: trainer.train()
[rank0]: File "/mnt/git/SpecForge/scripts/train_eagle3_sgl_online.py", line 699, in train
[rank0]: data_for_draft = self.target_model.forward(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/git/SpecForge/specforge/modeling/target/sgl_model_wrapper.py", line 253, in forward
[rank0]: hidden_states_list, aux_hidden_states_list = self.extend(reqs)
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/git/SpecForge/specforge/modeling/target/sgl_model_wrapper.py", line 200, in extend
[rank0]: return _extend(
[rank0]: ^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/git/SpecForge/specforge/modeling/target/sgl_model_wrapper.py", line 81, in _extend
[rank0]: batch.prepare_for_extend()
[rank0]: File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/schedule_batch.py", line 1266, in prepare_for_extend
[rank0]: out_cache_loc = self.alloc_token_slots(extend_num_tokens)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/schedule_batch.py", line 988, in alloc_token_slots
[rank0]: f"{self._available_and_evictable_str()}"
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/schedule_batch.py", line 1843, in _available_and_evictable_str
[rank0]: evictable_size = self.tree_cache.evictable_size()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'NoneType' object has no attribute 'evictable_size'
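Reading the traceback, the AttributeError appears to be raised while alloc_token_slots is formatting a message (the failing frame is an f-string fragment), so it may be hiding an earlier failure, possibly a token-slot allocation that could not be satisfied. This is a guess from the traceback, not something I have confirmed in the sglang source. A minimal sketch of how that pattern can mask the underlying error, using hypothetical stand-in classes (Batch and EmptyCache are illustrative names, not sglang's actual code):

```python
class EmptyCache:
    """Hypothetical cache stand-in that has nothing to evict."""

    def evictable_size(self):
        return 0


class Batch:
    """Hypothetical stand-in for a scheduler batch; NOT sglang's real class."""

    def __init__(self, free_slots, tree_cache=None):
        self.free_slots = free_slots
        self.tree_cache = tree_cache  # None when no cache was attached

    def _available_and_evictable_str(self):
        # Raises AttributeError when tree_cache is None.
        evictable = self.tree_cache.evictable_size()
        return f"available={self.free_slots}, evictable={evictable}"

    def alloc_token_slots(self, num_tokens):
        if num_tokens > self.free_slots:
            # The message is built before the raise executes, so an
            # exception inside the f-string replaces the intended error.
            raise RuntimeError(
                f"Out of token slots: need {num_tokens}. "
                f"{self._available_and_evictable_str()}"
            )
        self.free_slots -= num_tokens
        return num_tokens


def try_alloc(batch, num_tokens):
    """Return 'ok' or the name of the exception type that was raised."""
    try:
        batch.alloc_token_slots(num_tokens)
        return "ok"
    except Exception as exc:
        return type(exc).__name__


if __name__ == "__main__":
    print(try_alloc(Batch(free_slots=4), 8))                # cache is None
    print(try_alloc(Batch(free_slots=4, tree_cache=EmptyCache()), 8))
    print(try_alloc(Batch(free_slots=16), 8))
```

If something like this is happening here, the real question would be why the allocation fails with --mem-frac=0.4 and --max-length 8192, and why tree_cache is None on this code path.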
Reproduction
TARGET_MODEL_PATH=/mnt/models/gpt-oss-120b
EXP_PATH=/mnt/git/SpecForge/exp/2025-10-24
NUM_GPUS=8
MAX_LENGTH=8192
CHAT_TEMPLATE=gpt-oss-naive
torchrun \
--standalone \
--nproc_per_node $NUM_GPUS \
scripts/train_eagle3_sgl_online.py \
--target-model-path $TARGET_MODEL_PATH \
--model-path $TARGET_MODEL_PATH \
--draft-model-config ./configs/gpt-oss-120B-eagle3.json \
--train-data-path $EXP_PATH/dataset/all_train.jsonl \
--tp-size $NUM_GPUS \
--output-dir $EXP_PATH/outputs \
--num-epochs 2 \
--batch-size 1 \
--learning-rate 7e-5 \
--draft-attention-backend sdpa \
--draft-global-batch-size 32 \
--max-length $MAX_LENGTH \
--chat-template $CHAT_TEMPLATE \
--cache-dir $EXP_PATH/cache/ \
--mem-frac=0.4 \
--total-steps=800000 \
--warmup-ratio=0.015 \
--dist-timeout=10 \
--save-interval 40000 \
--resume
Environment
I am using docker.io/lmsysorg/sglang:dev as the base image, and I installed SpecForge by running pip install -e . in the SpecForge git directory.
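Because the dev image tag moves over time, the exact package versions inside the container may be needed to reproduce this. A small sketch that prints them using only the standard library; the distribution names in the list are assumptions (in particular, "specforge" is my guess at the name the editable install registers) and can be adjusted:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version string."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # The name may differ from what pip registered, or the
            # package may simply be absent from this environment.
            result[name] = "not installed"
    return result


if __name__ == "__main__":
    # Assumed distribution names; edit to match the environment.
    for name, ver in installed_versions(["sglang", "specforge", "torch"]).items():
        print(f"{name}: {ver}")
```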