
[Bug] Eagle3 training for gpt-oss-120b fails with OOM #326

@gopalsarda

Description


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please open a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English; otherwise, the issue will be closed.

Describe the bug

I am trying to train Eagle3 heads for gpt-oss-120b on a single H100 node with NUM_GPUS=8. When I run run_gpt_oss_120b_eagle3_online.sh as is, I get the following error:

[rank2]:     server_args = ServerArgs(
[rank2]:                   ^^^^^^^^^^^
[rank2]:   File "<string>", line 275, in __init__
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 589, in __post_init__
[rank2]:     self._handle_model_specific_adjustments()
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 958, in _handle_model_specific_adjustments
[rank2]:     prefill_attn_backend in supported_backends
[rank2]: AssertionError: GptOssForCausalLM requires one of ['triton', 'trtllm_mha', 'fa3', 'fa4'] attention backend, but got the following backends
[rank2]: - Prefill: flashinfer
[rank2]: - Decode: flashinfer

To get past the assertion, I removed the attention_backend key from the kwargs passed to ServerArgs. Training then fails with the OOM error shown after the sketch below.
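For reference, a minimal sketch of the alternative, forcing one of the backends the check accepts instead of dropping the key. The kwargs values here are illustrative, not SpecForge's actual ones:

from sglang.srt.server_args import ServerArgs

# Sketch only: switch the backend to one the GptOssForCausalLM check accepts
# ('triton', 'trtllm_mha', 'fa3', or 'fa4') rather than deleting the key.
kwargs = {
    "model_path": "/mnt/models/gpt-oss-120b",  # illustrative
    "tp_size": 8,                              # illustrative
    "attention_backend": "triton",             # was "flashinfer", which is rejected
}
server_args = ServerArgs(**kwargs)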

[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 23.79 GiB. GPU 0 has a total capacity of 79.19 GiB of which 15.15 GiB is free. Including non-PyTorch memory, this process has 64.03 GiB memory in use. Of the allocated memory 59.10 GiB is allocated by PyTorch, and 1.33 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I get the OOM even if I use --target-model-backend hf.
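The allocator message itself suggests trying expandable segments. As a sketch (it only mitigates fragmentation and is unlikely to recover a ~24 GiB allocation on its own), the setting would need to take effect before CUDA is first used:

import os

# Sketch: must be applied before the CUDA caching allocator initializes,
# e.g. exported in the environment of torchrun or set at the very top of
# scripts/train_eagle3.py.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")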

Reproduction

cd /mnt/git/SpecForge/
pip install -r requirements.txt
pip install -e .

EXP_NAME=test-gpt-oss-120b
TARGET_MODEL_PATH=/mnt/models/gpt-oss-120b
EXP_PATH=/mnt/git/SpecForge/exp/$EXP_NAME
NUM_GPUS=8
MAX_LENGTH=8192
CHAT_TEMPLATE=gpt-oss-naive


python scripts/build_eagle3_dataset_cache.py \
    --target-model-path $TARGET_MODEL_PATH \
    --draft-model-config ./configs/gpt-oss-120B-eagle3.json \
    --train-data-path $EXP_PATH/dataset/all_train.jsonl \
    --cache-dir $EXP_PATH/cache/ \
    --chat-template $CHAT_TEMPLATE \
    --max-length $MAX_LENGTH

torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    scripts/train_eagle3.py \
    --target-model-path $TARGET_MODEL_PATH \
    --draft-model-config ./configs/gpt-oss-120B-eagle3.json \
    --train-data-path $EXP_PATH/dataset/all_train.jsonl \
    --output-dir $EXP_PATH/outputs \
    --tp-size 8 \
    --num-epochs 10 \
    --batch-size 1 \
    --learning-rate 1e-4 \
    --max-length $MAX_LENGTH \
    --chat-template $CHAT_TEMPLATE \
    --cache-dir $EXP_PATH/cache/ \
    --target-model-backend sglang \
    --dist-timeout 60

Environment

Main branch of https://github.com/sgl-project/SpecForge
