Description
Checklist
- 1. I have searched related issues but could not find the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submit lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve it, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please open a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English; otherwise, the issue will be closed.
Describe the bug
I am trying to train EAGLE3 heads for gpt-oss-120b on a single H100 node with NUM_GPUS=8. When I run run_gpt_oss_120b_eagle3_online.sh as is, I get the error below:
[rank2]: server_args = ServerArgs(
[rank2]: ^^^^^^^^^^^
[rank2]: File "<string>", line 275, in __init__
[rank2]: File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 589, in __post_init__
[rank2]: self._handle_model_specific_adjustments()
[rank2]: File "/usr/local/lib/python3.12/dist-packages/sglang/srt/server_args.py", line 958, in _handle_model_specific_adjustments
[rank2]: prefill_attn_backend in supported_backends
[rank2]: AssertionError: GptOssForCausalLM requires one of ['triton', 'trtllm_mha', 'fa3', 'fa4'] attention backend, but got the following backends
[rank2]: - Prefill: flashinfer
[rank2]: - Decode: flashinfer
To work around this, I removed the attention_backend key from the kwargs passed to ServerArgs (a rough sketch of that edit is at the end of this section). That resulted in the OOM error below.
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 23.79 GiB. GPU 0 has a total capacity of 79.19 GiB of which 15.15 GiB is free. Including non-PyTorch memory, this process has 64.03 GiB memory in use. Of the allocated memory 59.10 GiB is allocated by PyTorch, and 1.33 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I get the same OOM even if I use --target-model-backend hf.
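For reference, the backend workaround was roughly the following. This is only an illustrative sketch, not the actual SpecForge code: the real kwargs are assembled inside SpecForge's sglang target-model wrapper, and the names and values shown here are placeholders from my setup.

from sglang.srt.server_args import ServerArgs

# Illustrative mirror of the kwargs SpecForge passes to sglang's ServerArgs.
server_kwargs = {
    "model_path": "/mnt/models/gpt-oss-120b",
    "tp_size": 8,
    # "attention_backend": "flashinfer",  # removed: GptOssForCausalLM only accepts
    #                                     # 'triton', 'trtllm_mha', 'fa3', or 'fa4'
}
# Setting one of the supported backends explicitly, e.g. "attention_backend": "triton",
# should also satisfy the assertion instead of dropping the key.
server_args = ServerArgs(**server_kwargs)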
Reproduction
cd /mnt/git/SpecForge/
pip install -r requirements.txt
pip install -e .
EXP_NAME=test-gpt-oss-120b
TARGET_MODEL_PATH=/mnt/models/gpt-oss-120b
EXP_PATH=/mnt/git/SpecForge/exp/$EXP_NAME
NUM_GPUS=8
MAX_LENGTH=8192
CHAT_TEMPLATE=gpt-oss-naive
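# Step 1: build the EAGLE3 dataset cache for the target model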
python scripts/build_eagle3_dataset_cache.py \
--target-model-path $TARGET_MODEL_PATH \
--draft-model-config ./configs/gpt-oss-120B-eagle3.json \
--train-data-path $EXP_PATH/dataset/all_train.jsonl \
--cache-dir $EXP_PATH/cache/ \
--chat-template $CHAT_TEMPLATE \
--max-length $MAX_LENGTH
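# Step 2: launch online EAGLE3 training across all 8 GPUs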
torchrun \
--standalone \
--nproc_per_node $NUM_GPUS \
scripts/train_eagle3.py \
--target-model-path $TARGET_MODEL_PATH \
--draft-model-config ./configs/gpt-oss-120B-eagle3.json \
--train-data-path $EXP_PATH/dataset/all_train.jsonl \
--output-dir $EXP_PATH/outputs \
--tp-size 8 \
--num-epochs 10 \
--batch-size 1 \
--learning-rate 1e-4 \
--max-length $MAX_LENGTH \
--chat-template $CHAT_TEMPLATE \
--cache-dir $EXP_PATH/cache/ \
--target-model-backend sglang \
--dist-timeout 60
Environment
Main branch of https://github.com/sgl-project/SpecForge