
1.3.0rc15: trtllm-serve hangs in fused_moe/quantization.py during cold-start on GB10 / sm_121 — 1.3.0rc12 works with identical config #14500

@zentradev-rabih

Summary

Loading Nemotron-3-Super-120B-A12B-NVFP4 via trtllm-serve with the same launch args and the same extra_llm_api_options yaml gives different outcomes on 1.3.0rc12 and 1.3.0rc15:

| Image | Cold-start | Outcome |
| --- | --- | --- |
| nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12 | ~4 min | ✅ serves /v1/models |
| nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15 | 18+ min, never serves | ❌ worker spins on NVIDIA-driver ioctls in fused_moe/quantization.py:2769, progress bar stuck at Loading weights concurrently: 0/1288 |

Reverting only the image tag back to rc12 immediately restores normal cold-start and serving.
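
For anyone repeating the rc12 vs rc15 comparison, the cold-start time can be measured the same way for both tags by polling /v1/models until it answers. The following is a minimal sketch, not part of the original report: the helper name wait_until_serving is hypothetical, and the port 8080 in the usage note is taken from the launch command in this issue.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_until_serving(base_url, timeout_s, poll_s=5.0):
    """Poll GET {base_url}/v1/models.

    Returns the number of seconds until the first HTTP 200,
    or None if the endpoint never answered within timeout_s.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return time.monotonic() - start
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            # Server not listening yet (or not ready); keep polling.
            pass
        time.sleep(poll_s)
    return None
```

Started right after docker run, a call such as wait_until_serving("http://127.0.0.1:8080", timeout_s=30 * 60) would return a few hundred seconds for a healthy cold start and None for the hang described here.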

Environment

  • GPU: NVIDIA GB10 (Grace-Blackwell, sm_121, integrated GPU, unified-memory architecture)
  • OS: DGX OS 7 (Ubuntu 24.04 aarch64)
  • Model: Nemotron-3-Super-120B-A12B-NVFP4 (MoE; hybrid Mamba/SSM + attention; NVFP4 quantized)
  • Container image tested: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15
  • Backend: pytorch (AutoDeploy via trtllm-serve)

Repro

Launch command (identical args work on 1.3.0rc12):

docker run --rm \
    --gpus all --ipc host --network host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -e TLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
    -v /path/to/Nemotron-3-Super-120B-A12B-NVFP4:/model:ro \
    -v /path/to/extra-llm-api-config.yml:/extra-llm-api-config.yml:ro \
    nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15 \
    trtllm-serve /model --host 0.0.0.0 --port 8080 \
      --served_model_name nemotron-super-120b-nvfp4 \
      --max_batch_size 8 --tp_size 1 --ep_size 1 \
      --max_num_tokens 8192 --max_seq_len 1048576 \
      --trust_remote_code \
      --reasoning_parser nano-v3 --tool_parser qwen3_coder \
      --extra_llm_api_options /extra-llm-api-config.yml

extra-llm-api-config.yml:

kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.9
  mamba_ssm_cache_dtype: float16
  mamba_ssm_stochastic_rounding: true
  mamba_ssm_philox_rounds: 5
moe_config:
  backend: CUTLASS
cuda_graph_config:
  enable_padding: true
  max_batch_size: 8
enable_chunked_prefill: true

(No speculative decoding / MTP. tp_size=1, ep_size=1.)

Observed behavior

Boot reaches this point and then makes no further visible progress for 18+ minutes:

[TensorRT-LLM] TensorRT LLM version: 1.3.0rc15
…
Loading safetensors weights in parallel: 100%|██████████| 17/17 [00:15<00:00,  1.13it/s]
[TRT-LLM] [W] [_torch] [load_weight_shard] Skipping device transfer from cpu to cuda on integrated GPU to conserve shared memory.
Loading weights concurrently:   0%|          | 0/1288 [00:00<?, ?it/s]
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/fused_moe/quantization.py:2769: UserWarning: TypedStorage is deprecated. […]
  dst_base = dst_w3_w1_weight.storage().data_ptr()

No further log lines, no errors emitted. During the hang:

  • CPU: the mpi4py.futures.server worker holds 99.9 % of one core; accumulated ~15 min of CPU time on a single thread
  • GPU: 0 % utilization throughout (nvidia-smi)
  • Block I/O: stable at 51.4 GB read (all weights already on disk; not I/O bound)
  • Memory: container RSS 19 GB / 119 GB unified; no swap thrashing (vmstat shows si≈0, so=0)
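
The per-thread CPU observation above can be reproduced without extra tooling by reading /proc. This is a rough, Linux-only sketch; the function names are hypothetical, and the field positions follow the proc(5) layout of the stat file.

```python
import os


def thread_cpu_seconds(pid):
    """Return {tid: (thread_name, cpu_seconds)} from /proc/<pid>/task/*/stat."""
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    result = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            with open(f"/proc/{pid}/task/{tid}/stat") as f:
                stat = f.read()
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited between listdir and open
        # The comm field is parenthesised and may contain spaces,
        # so split around the first '(' and the last ')'.
        name = stat[stat.index("(") + 1 : stat.rindex(")")]
        fields = stat[stat.rindex(")") + 2 :].split()
        # fields[0] is 'state' (field 3 in proc(5)); utime and stime
        # are fields 14 and 15, i.e. indexes 11 and 12 here.
        utime, stime = int(fields[11]), int(fields[12])
        result[int(tid)] = (name, (utime + stime) / ticks_per_s)
    return result


def busiest_thread(pid):
    """Return (tid, name, cpu_seconds) for the thread with the most CPU time."""
    threads = thread_cpu_seconds(pid)
    tid = max(threads, key=lambda t: threads[t][1])
    return (tid, *threads[tid])
```

The tid returned by busiest_thread is the one to hand to strace -p, and the pid is what py-spy dump --pid expects.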

strace of the busy worker thread

ioctl(23, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xf0336fffbf50) = 0
ioctl(23, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xf0336fffbf50) = 0
ioctl(23, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xf0336fffbf50) = 0
…thousands of identical calls, all returning 0…

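0x46 (ASCII 'F') is the ioctl magic of the NVIDIA kernel driver (NVRM) character devices; in NVIDIA's open-source kernel module headers, request number 0x2a is NV_ESC_RM_CONTROL, a generic resource-manager control call rather than a memory-specific handshake. All calls return 0, but the caller apparently never observes the condition it is polling for.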

Likely path

The log line "Skipping device transfer from cpu to cuda on integrated GPU to conserve shared memory" comes from the integrated-GPU code path (taken on GB10) that skips the cpu→cuda copy because unified memory makes it redundant. The hang then appears inside the NVFP4 fused-MoE weight-quantization path entered from there (_torch/modules/fused_moe/quantization.py:2769).

Why this looks like an rc15 regression, not a config issue

  • The same yaml and the same flags boot on 1.3.0rc12 in ~4 minutes and serve traffic stably at max_batch_size=8
  • Switching only the container tag rc15 → rc12 restores normal cold-start; no other change required
  • Nothing exotic in the config: no MTP, no TP, no EP, default pytorch backend

Worth noting: 1.3.0rc15 does not hit the sm_121 PTXAS / Triton codegen error that previously blocked 1.3.0rc14 on this same hardware — so the rc15 boot path on GB10 has clearly moved, just not yet to a working state.

What would help

  1. Confirmation whether the GB10 / sm_121 integrated-GPU NVFP4 fused-MoE path is exercised in CI for rc15
  2. Whether moe_config.backend: TRTLLM (vs CUTLASS) is expected to work around it
  3. Whether mamba_ssm_cache_dtype: bfloat16 (vs float16) makes a difference
  4. Whether disabling enable_chunked_prefill changes anything
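
Items 2 to 4 can be tried one change at a time against the yaml from the repro. The sketch below generates one config per variant; it is illustrative only: the variant names and helpers are made up, item 4 is assumed to mean flipping the enable_chunked_prefill yaml key, and whether rc15 accepts each value is exactly what is being asked.

```python
import copy

# The extra-llm-api-config.yml from the repro, as a nested dict.
BASE = {
    "kv_cache_config": {
        "dtype": "fp8",
        "enable_block_reuse": False,
        "free_gpu_memory_fraction": 0.9,
        "mamba_ssm_cache_dtype": "float16",
        "mamba_ssm_stochastic_rounding": True,
        "mamba_ssm_philox_rounds": 5,
    },
    "moe_config": {"backend": "CUTLASS"},
    "cuda_graph_config": {"enable_padding": True, "max_batch_size": 8},
    "enable_chunked_prefill": True,
}

# One single-key change per experiment: (section or None for top level, key, value).
VARIANTS = {
    "moe-trtllm": ("moe_config", "backend", "TRTLLM"),
    "ssm-bf16": ("kv_cache_config", "mamba_ssm_cache_dtype", "bfloat16"),
    "no-chunked-prefill": (None, "enable_chunked_prefill", False),
}


def make_variant(section, key, value):
    """Return a deep copy of BASE with exactly one key changed."""
    cfg = copy.deepcopy(BASE)
    target = cfg if section is None else cfg[section]
    target[key] = value
    return cfg


def to_yaml(cfg, indent=0):
    """Minimal YAML emitter for nested dicts of scalars (enough for this config)."""
    lines = []
    pad = " " * indent
    for key, value in cfg.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 2))
        elif isinstance(value, bool):
            lines.append(f"{pad}{key}: {'true' if value else 'false'}")
        else:
            lines.append(f"{pad}{key}: {value}")
    return "\n".join(lines)


for name, change in VARIANTS.items():
    print(f"# --- {name} ---")
    print(to_yaml(make_variant(*change)))
```

Each printed document can be saved as its own file and mounted in place of /extra-llm-api-config.yml, keeping the launch command otherwise unchanged so that only one variable moves per run.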

Happy to capture py-spy dump, a longer strace, or run additional repro variants — let me know what's most useful.

Metadata

Labels

  • Customized kernels<NV>: Specialized/modified CUDA kernels in TRTLLM for LLM ops, beyond standard TRT. Dev & perf.
  • Pytorch<NV>: Pytorch backend related issues
