Description
Checklist
- 1. I have searched related issues but could not find the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submit lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve it, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please open a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English; otherwise, the issue will be closed.
Describe the bug
I am trying to train EAGLE3 heads for gpt-oss-120b on a single H100 node with NUM_GPUS=8. When I run run_gpt_oss_120b_eagle3_online.sh as is, I get the error below:
[rank2]: server_args = ServerArgs(
[rank2]: ^^^^^^^^^^^
[rank2]: File "<string>", line 275, in __init__
[rank2]: File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 589, in __post_init__
[rank2]: self._handle_model_specific_adjustments()
[rank2]: File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 958, in _handle_model_specific_adjustments
[rank2]: prefill_attn_backend in supported_backends
[rank2]: AssertionError: GptOssForCausalLM requires one of ['triton', 'trtllm_mha', 'fa3', 'fa4'] attention backend, but got the following backends
[rank2]: - Prefill: flashinfer
[rank2]: - Decode: flashinfer
To work around this, I removed the attention_backend key from the kwargs passed to ServerArgs (a rough sketch of that edit is at the end of this section). That resulted in the OOM error below.
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 23.79 GiB. GPU 0 has a total capacity of 79.19 GiB of which 15.15 GiB is free. Including non-PyTorch memory, this process has 64.03 GiB memory in use. Of the allocated memory 59.10 GiB is allocated by PyTorch, and 1.33 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I get the same OOM even if I use --target-model-backend hf.
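For reference, the backend workaround was roughly the following. This is only an illustrative sketch, not the actual SpecForge code: the real kwargs are assembled inside SpecForge's sglang target-model wrapper, and the names and values shown here are placeholders from my setup.

from sglang.srt.server_args import ServerArgs

# Illustrative mirror of the kwargs SpecForge passes to sglang's ServerArgs.
server_kwargs = {
    "model_path": "/mnt/models/gpt-oss-120b",
    "tp_size": 8,
    # "attention_backend": "flashinfer",  # removed: GptOssForCausalLM only accepts
    #                                     # 'triton', 'trtllm_mha', 'fa3', or 'fa4'
}
# Setting one of the supported backends explicitly, e.g. "attention_backend": "triton",
# should also satisfy the assertion instead of dropping the key.
server_args = ServerArgs(**server_kwargs)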
Reproduction
cd /mnt/git/SpecForge/
pip install -r requirements.txt
pip install -e .
EXP_NAME=test-gpt-oss-120b
TARGET_MODEL_PATH=/mnt/models/gpt-oss-120b
EXP_PATH=/mnt/git/SpecForge/exp/$EXP_NAME
NUM_GPUS=8
MAX_LENGTH=8192
CHAT_TEMPLATE=gpt-oss-naive
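# Step 1: build the EAGLE3 dataset cache for the target model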
python scripts/build_eagle3_dataset_cache.py \
--target-model-path $TARGET_MODEL_PATH \
--draft-model-config ./configs/gpt-oss-120B-eagle3.json \
--train-data-path $EXP_PATH/dataset/all_train.jsonl \
--cache-dir $EXP_PATH/cache/ \
--chat-template $CHAT_TEMPLATE \
--max-length $MAX_LENGTH
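# Step 2: launch online EAGLE3 training across all 8 GPUs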
torchrun \
--standalone \
--nproc_per_node $NUM_GPUS \
scripts/train_eagle3.py \
--target-model-path $TARGET_MODEL_PATH \
--draft-model-config ./configs/gpt-oss-120B-eagle3.json \
--train-data-path $EXP_PATH/dataset/all_train.jsonl \
--output-dir $EXP_PATH/outputs \
--tp-size 8 \
--num-epochs 10 \
--batch-size 1 \
--learning-rate 1e-4 \
--max-length $MAX_LENGTH \
--chat-template $CHAT_TEMPLATE \
--cache-dir $EXP_PATH/cache/ \
--target-model-backend sglang \
--dist-timeout 60
Environment
Main branch of https://github.com/sgl-project/SpecForge