Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please open a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
Hi team! I have been trying to train a speculative decoding model for Qwen3-Coder-480B-A35B-Instruct-FP8, and I am hitting the following error:
[rank0]: File "/SpecForge/scripts/train_eagle3_sgl_online.py", line 775, in <module>
[rank0]: main()
[rank0]: File "/SpecForge/scripts/train_eagle3_sgl_online.py", line 771, in main
[rank0]: trainer.train()
[rank0]: File "/SpecForge/scripts/train_eagle3_sgl_online.py", line 699, in train
[rank0]: data_for_draft = self.target_model.forward(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/specforge/modeling/target/sgl_model_wrapper.py", line 253, in forward
[rank0]: hidden_states_list, aux_hidden_states_list = self.extend(reqs)
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/specforge/modeling/target/sgl_model_wrapper.py", line 200, in extend
[rank0]: return _extend(
[rank0]: ^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/specforge/modeling/target/sgl_model_wrapper.py", line 86, in _extend
[rank0]: logits_output, _ = model_runner.forward(forward_batch)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 1752, in forward
[rank0]: output = self._forward_raw(
[rank0]: ^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 1797, in _forward_raw
[rank0]: ret = self.forward_extend(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 1697, in forward_extend
[rank0]: return self.model.forward(
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen3_moe.py", line 654, in forward
[rank0]: hidden_states = self.model(
[rank0]: ^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen2_moe.py", line 492, in forward
[rank0]: hidden_states, residual = layer(
[rank0]: ^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen3_moe.py", line 533, in forward
[rank0]: hidden_states = self.mlp(hidden_states, forward_batch, use_reduce_scatter)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen3_moe.py", line 126, in forward
[rank0]: return self.forward_normal(hidden_states, use_reduce_scatter)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen3_moe.py", line 148, in forward_normal
[rank0]: final_hidden_states = self.experts(hidden_states, topk_output)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/sglang/srt/layers/moe/ep_moe/layer.py", line 140, in forward
[rank0]: return self.forward_deepgemm(hidden_states, topk_output)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/sglang/srt/layers/moe/ep_moe/layer.py", line 301, in forward_deepgemm
[rank0]: deep_gemm_wrapper.grouped_gemm_nt_f8f8bf16_masked(
[rank0]: File "/.sglang/lib/python3.12/site-packages/sglang/srt/layers/quantization/deep_gemm_wrapper/entrypoint.py", line 51, in grouped_gemm_nt_f8f8bf16_masked
[rank0]: with compile_utils.deep_gemm_execution_hook(
[rank0]: File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
[rank0]: return next(self.gen)
[rank0]: ^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/sglang/srt/layers/quantization/deep_gemm_wrapper/compile_utils.py", line 333, in deep_gemm_execution_hook
[rank0]: _maybe_compile_deep_gemm_one_type_all(kernel_type, n, k, num_groups)
[rank0]: File "/.sglang/lib/python3.12/site-packages/sglang/srt/layers/quantization/deep_gemm_wrapper/compile_utils.py", line 298, in _maybe_compile_deep_gemm_one_type_all
[rank0]: thread_map(compile_func, collected_configs, max_workers=_COMPILE_WORKERS)
[rank0]: File "/.sglang/lib/python3.12/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
[rank0]: return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/.sglang/lib/python3.12/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
[rank0]: return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/concurrent/futures/_base.py", line 608, in map
[rank0]: fs = [self.submit(fn, *args) for args in zip(*iterables)]
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/concurrent/futures/thread.py", line 179, in submit
[rank0]: self._adjust_thread_count()
[rank0]: File "/usr/lib/python3.12/concurrent/futures/thread.py", line 202, in _adjust_thread_count
[rank0]: t.start()
[rank0]: File "/usr/lib/python3.12/threading.py", line 992, in start
[rank0]: _start_new_thread(self._bootstrap, ())
[rank0]: RuntimeError: can't start new thread
Also, training starts and then fails every time at around 6%:
Training: 6%|████████████▎ | 63/1000 [02:15<33:31, 2.15s/it]
[rank0]: Traceback (most recent call last):
My system ulimits are:
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 7300382
max locked memory (kbytes, -l) 8192
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1048576
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
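From what I understand, RuntimeError: can't start new thread is raised when the OS refuses to create another thread, so the effective cap may come from a kernel or cgroup limit rather than the shell ulimits above (this is my assumption). Some checks I can run on the node (the cgroup path assumes cgroup v2 inside a container and may differ on other setups):
# Threads currently in use across all processes
ps -eLf | wc -l
# Kernel-wide thread limit
cat /proc/sys/kernel/threads-max
# cgroup v2 PID/thread cap, if the job runs inside a container (path is an assumption)
cat /sys/fs/cgroup/pids.max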
Please help me out with this issue. Thanks in advance :)
Reproduction
torchrun --standalone --nproc_per_node 8 \
/SpecForge/scripts/train_eagle3_sgl_online.py \
--target-model-path Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
--model-path Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
--draft-model-config /SpecForge/configs/qwen3-coder-480B-A35B-instruct-eagle3.json \
--train-data-path /SpecForge/data/apps_train.jsonl \
--eval-data-path /SpecForge/data/apps_eval.jsonl \
--tp-size 8 \
--ep-size 8 \
--output-dir /SpecForge/outputs/qwen3-coder-480B-A35B-eagle3 \
--num-epochs 1 \
--batch-size 1 \
--learning-rate 5e-5 \
--draft-attention-backend flex_attention \
--max-length 2048 \
--chat-template qwen \
--cache-dir /SpecForge/cache \
--mem-frac=0.7 \
--dist-timeout 3600 \
--watchdog-timeout 1800 \
--disable-cuda-graph
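If the per-user process/thread cap turns out to be the problem, this is the launch-time mitigation I am considering (the assumption that this is the limit being hit is mine, not a verified fix):
# Raise the per-user process/thread cap in the launching shell before torchrun
ulimit -u unlimited
# Then re-run the torchrun command above unchanged.
# If the job runs inside Docker, the container's pids cgroup may be the real cap;
# starting the container with --pids-limit=-1 would lift it (assuming that applies here).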
Environment
I am using the source installation:
git clone https://github.com/sgl-project/SpecForge.git
cd SpecForge
pip install -v .
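I can share more environment details if useful, e.g. (assuming the sglang check_env helper is available in this install and that the SpecForge package is installed under the name specforge):
python3 -m sglang.check_env
pip show sglang specforge
nvidia-smi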