[Bug] RuntimeError: can't start new thread #264

@Sayandip170900


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Hi Team! I have been trying to train a speculative decoding model for Qwen3-Coder-480B-A35B-Instruct-FP8, and I am getting the following error:

[rank0]:   File "/SpecForge/scripts/train_eagle3_sgl_online.py", line 775, in <module>
[rank0]:     main()
[rank0]:   File "/SpecForge/scripts/train_eagle3_sgl_online.py", line 771, in main
[rank0]:     trainer.train()
[rank0]:   File "/SpecForge/scripts/train_eagle3_sgl_online.py", line 699, in train
[rank0]:     data_for_draft = self.target_model.forward(
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/specforge/modeling/target/sgl_model_wrapper.py", line 253, in forward
[rank0]:     hidden_states_list, aux_hidden_states_list = self.extend(reqs)
[rank0]:                                                  ^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/specforge/modeling/target/sgl_model_wrapper.py", line 200, in extend
[rank0]:     return _extend(
[rank0]:            ^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/specforge/modeling/target/sgl_model_wrapper.py", line 86, in _extend
[rank0]:     logits_output, _ = model_runner.forward(forward_batch)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 1752, in forward
[rank0]:     output = self._forward_raw(
[rank0]:              ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 1797, in _forward_raw
[rank0]:     ret = self.forward_extend(
[rank0]:           ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 1697, in forward_extend
[rank0]:     return self.model.forward(
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen3_moe.py", line 654, in forward
[rank0]:     hidden_states = self.model(
[rank0]:                     ^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen2_moe.py", line 492, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:                               ^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen3_moe.py", line 533, in forward
[rank0]:     hidden_states = self.mlp(hidden_states, forward_batch, use_reduce_scatter)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen3_moe.py", line 126, in forward
[rank0]:     return self.forward_normal(hidden_states, use_reduce_scatter)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen3_moe.py", line 148, in forward_normal
[rank0]:     final_hidden_states = self.experts(hidden_states, topk_output)
[rank0]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/sglang/srt/layers/moe/ep_moe/layer.py", line 140, in forward
[rank0]:     return self.forward_deepgemm(hidden_states, topk_output)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/sglang/srt/layers/moe/ep_moe/layer.py", line 301, in forward_deepgemm
[rank0]:     deep_gemm_wrapper.grouped_gemm_nt_f8f8bf16_masked(
[rank0]:   File "/.sglang/lib/python3.12/site-packages/sglang/srt/layers/quantization/deep_gemm_wrapper/entrypoint.py", line 51, in grouped_gemm_nt_f8f8bf16_masked
[rank0]:     with compile_utils.deep_gemm_execution_hook(
[rank0]:   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
[rank0]:     return next(self.gen)
[rank0]:            ^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/sglang/srt/layers/quantization/deep_gemm_wrapper/compile_utils.py", line 333, in deep_gemm_execution_hook
[rank0]:     _maybe_compile_deep_gemm_one_type_all(kernel_type, n, k, num_groups)
[rank0]:   File "/.sglang/lib/python3.12/site-packages/sglang/srt/layers/quantization/deep_gemm_wrapper/compile_utils.py", line 298, in _maybe_compile_deep_gemm_one_type_all
[rank0]:     thread_map(compile_func, collected_configs, max_workers=_COMPILE_WORKERS)
[rank0]:   File "/.sglang/lib/python3.12/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
[rank0]:     return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/.sglang/lib/python3.12/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
[rank0]:     return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 608, in map
[rank0]:     fs = [self.submit(fn, *args) for args in zip(*iterables)]
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3.12/concurrent/futures/thread.py", line 179, in submit
[rank0]:     self._adjust_thread_count()
[rank0]:   File "/usr/lib/python3.12/concurrent/futures/thread.py", line 202, in _adjust_thread_count
[rank0]:     t.start()
[rank0]:   File "/usr/lib/python3.12/threading.py", line 992, in start
[rank0]:     _start_new_thread(self._bootstrap, ())
[rank0]: RuntimeError: can't start new thread
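
The last frames of the traceback show the error being raised inside `ThreadPoolExecutor.submit`, which starts worker threads lazily as tasks are submitted, up to `max_workers`. The sketch below uses only the standard library (not sglang) to illustrate that behaviour. The helper name `count_worker_threads` and the values 16 and 4 are made up for illustration; the actual value of `_COMPILE_WORKERS` is not shown in the log.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def count_worker_threads(num_tasks: int, max_workers: int) -> int:
    """Run num_tasks trivial jobs and return how many distinct threads ran them."""
    seen = set()
    lock = threading.Lock()

    def job(_):
        # Record which worker thread picked up this task.
        with lock:
            seen.add(threading.get_ident())
        time.sleep(0.01)

    # Each submit() inside map() may start one more worker thread,
    # but the pool never grows beyond max_workers.
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        list(ex.map(job, range(num_tasks)))
    return len(seen)


if __name__ == "__main__":
    print("worker threads used:", count_worker_threads(16, 4))
```

If the process or container is already near its thread limit when such a pool is created, the `t.start()` call in `_adjust_thread_count` is where `RuntimeError: can't start new thread` would surface, which matches the traceback above.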

Also, training starts, but it fails every time at 6% (step 63 of 1000):

Training:   6%|████████████▎                                                                                                                                                                                       | 63/1000 [02:15<33:31,  2.15s/it]
[rank0]: Traceback (most recent call last):

My ulimits for the system are:

real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 7300382
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 1048576
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
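
Since `max user processes` is already high in the output above, the limit being hit may come from somewhere else, for example a container PID limit or a kernel-wide thread cap. This is only a guess. A minimal sketch, assuming a Linux host, that collects the thread-related limits visible from inside the training process is shown below. The function name `thread_limit_report` is hypothetical, and the cgroup paths are the usual v2 and v1 locations, which may differ or be absent on a given system.

```python
import resource
import threading
from pathlib import Path


def read_first_line(path: str):
    """Return the first line of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().splitlines()[0].strip()
    except (OSError, IndexError):
        return None


def thread_limit_report() -> dict:
    """Collect limits that can affect thread creation (Linux only)."""
    soft_nproc, hard_nproc = resource.getrlimit(resource.RLIMIT_NPROC)
    soft_stack, _ = resource.getrlimit(resource.RLIMIT_STACK)

    # Number of OS threads in this process, from /proc/self/status.
    os_threads = None
    try:
        for line in Path("/proc/self/status").read_text().splitlines():
            if line.startswith("Threads:"):
                os_threads = int(line.split()[1])
    except OSError:
        pass

    return {
        # resource.RLIM_INFINITY (-1) means "unlimited".
        "rlimit_nproc_soft": soft_nproc,
        "rlimit_nproc_hard": hard_nproc,
        "rlimit_stack_soft_bytes": soft_stack,
        "kernel_threads_max": read_first_line("/proc/sys/kernel/threads-max"),
        # Assumed cgroup locations; either may be missing.
        "cgroup_v2_pids_max": read_first_line("/sys/fs/cgroup/pids.max"),
        "cgroup_v1_pids_max": read_first_line("/sys/fs/cgroup/pids/pids.max"),
        "python_threads": threading.active_count(),
        "os_threads": os_threads,
    }


if __name__ == "__main__":
    for key, value in thread_limit_report().items():
        print(f"{key}: {value}")
```

Running this from within the same environment as the `torchrun` job (for instance, just before `trainer.train()`) would show whether any of these values is much lower than the `ulimit -u` setting.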

Please help me out with this issue. Thanks in advance :)

Reproduction

torchrun --standalone --nproc_per_node 8 \
/SpecForge/scripts/train_eagle3_sgl_online.py \
  --target-model-path Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
  --model-path Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
  --draft-model-config /SpecForge/configs/qwen3-coder-480B-A35B-instruct-eagle3.json \
  --train-data-path /SpecForge/data/apps_train.jsonl \
  --eval-data-path /SpecForge/data/apps_eval.jsonl \
  --tp-size 8 \
  --ep-size 8 \
  --output-dir /SpecForge/outputs/qwen3-coder-480B-A35B-eagle3 \
  --num-epochs 1 \
  --batch-size 1 \
  --learning-rate 5e-5 \
  --draft-attention-backend flex_attention \
  --max-length 2048 \
  --chat-template qwen \
  --cache-dir /SpecForge/cache \
  --mem-frac=0.7 \
  --dist-timeout 3600 \
  --watchdog-timeout 1800 \
  --disable-cuda-graph

Environment

I am using the source installation:

git clone https://github.com/sgl-project/SpecForge.git

pip install -v .
