Summary
Loading Nemotron-3-Super-120B-A12B-NVFP4 via trtllm-serve with identical launch args and the same extra_llm_api_options YAML behaves differently on 1.3.0rc12 and 1.3.0rc15:
| Image | Cold-start | Outcome |
| --- | --- | --- |
| nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12 | ~4 min | ✅ serves /v1/models |
| nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15 | 18+ min, never serves | ❌ worker spins on NVIDIA-driver ioctls in fused_moe/quantization.py:2769, progress bar stuck at Loading weights concurrently: 0/1288 |
Reverting only the image tag back to rc12 immediately restores normal cold-start and serving.
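The cold-start figures in the table can be measured the same way on both tags with a small poller. The following is a minimal sketch, not part of the original repro: it assumes the server is reachable at http://localhost:8080 (the port from the launch command) and treats an HTTP 200 from /v1/models as "serving". The function names are hypothetical.

```python
import http.client
import time
import urllib.request


def models_endpoint_ready(base_url="http://localhost:8080", timeout_s=5.0):
    """Return True once GET /v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed
        # reply: treat all of these as "not serving yet".
        return False


def time_to_ready(probe, deadline_s, interval_s=5.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True.

    Returns the elapsed seconds, or None if deadline_s passes first.
    clock and sleep are injectable so the loop can be exercised without waiting.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)


# Example call, started right after `docker run`, with a 30-minute ceiling:
#   elapsed = time_to_ready(models_endpoint_ready, deadline_s=30 * 60)
#   print("never served" if elapsed is None else f"served after {elapsed:.0f} s")
```

Using a monotonic clock keeps the measurement unaffected by wall-clock adjustments during a long boot.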
Environment
- GPU: NVIDIA GB10 (Grace-Blackwell, sm_121, integrated GPU, unified-memory architecture)
- OS: DGX OS 7 (Ubuntu 24.04 aarch64)
- Model: Nemotron-3-Super-120B-A12B-NVFP4 (MoE; hybrid Mamba/SSM + attention; NVFP4 quantized)
- Container image tested: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15
- Backend: pytorch (AutoDeploy via trtllm-serve)
Repro
Launch command (identical args work on 1.3.0rc12):
docker run --rm \
--gpus all --ipc host --network host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-e TLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
-v /path/to/Nemotron-3-Super-120B-A12B-NVFP4:/model:ro \
-v /path/to/extra-llm-api-config.yml:/extra-llm-api-config.yml:ro \
nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15 \
trtllm-serve /model --host 0.0.0.0 --port 8080 \
--served_model_name nemotron-super-120b-nvfp4 \
--max_batch_size 8 --tp_size 1 --ep_size 1 \
--max_num_tokens 8192 --max_seq_len 1048576 \
--trust_remote_code \
--reasoning_parser nano-v3 --tool_parser qwen3_coder \
--extra_llm_api_options /extra-llm-api-config.yml
extra-llm-api-config.yml:
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.9
  mamba_ssm_cache_dtype: float16
  mamba_ssm_stochastic_rounding: true
  mamba_ssm_philox_rounds: 5
moe_config:
  backend: CUTLASS
cuda_graph_config:
  enable_padding: true
  max_batch_size: 8
enable_chunked_prefill: true
(No speculative decoding / MTP. tp_size=1, ep_size=1.)
Observed behavior
Boot reaches this point and then makes no further visible progress for at least 18 minutes:
[TensorRT-LLM] TensorRT LLM version: 1.3.0rc15
…
Loading safetensors weights in parallel: 100%|██████████| 17/17 [00:15<00:00, 1.13it/s]
[TRT-LLM] [W] [_torch] [load_weight_shard] Skipping device transfer from cpu to cuda on integrated GPU to conserve shared memory.
Loading weights concurrently: 0%| | 0/1288 [00:00<?, ?it/s]
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/fused_moe/quantization.py:2769: UserWarning: TypedStorage is deprecated. […]
dst_base = dst_w3_w1_weight.storage().data_ptr()
No further log lines, no errors emitted. During the hang:
- CPU: the mpi4py.futures.server worker holds 99.9 % of one core and has accumulated ~15 min of CPU time on a single thread
- GPU: 0 % utilization throughout (nvidia-smi)
- Block I/O: stable at 51.4 GB read (all weights already read from disk, so not I/O bound)
- Memory: container RSS 19 GB / 119 GB unified; no swap thrashing (vmstat shows si≈0, so=0)
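The per-thread CPU figure above can be reproduced without extra tooling by reading /proc. This is a rough sketch for Linux only. The helper names are hypothetical, and the field positions follow the proc(5) description of /proc/[pid]/stat (utime and stime are fields 14 and 15).

```python
import os
from pathlib import Path


def cpu_ticks_from_stat(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The comm field is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the last ')' before counting fields.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so fields 14 and 15 sit at index 11 and 12.
    utime, stime = int(after_comm[11]), int(after_comm[12])
    return utime + stime


def thread_cpu_seconds(pid):
    """Map each thread ID of pid to its accumulated CPU time in seconds."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    result = {}
    for task_dir in Path(f"/proc/{pid}/task").iterdir():
        try:
            line = (task_dir / "stat").read_text()
        except OSError:
            continue  # the thread exited between listing and reading
        result[int(task_dir.name)] = cpu_ticks_from_stat(line) / ticks_per_second
    return result


# Example: find the hottest thread of the worker process (pid is a placeholder).
#   usage = thread_cpu_seconds(pid)
#   tid = max(usage, key=usage.get)
```

The thread ID found this way is the one to pass to strace -p or to look up in a py-spy dump.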
strace of the busy worker thread
ioctl(23, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xf0336fffbf50) = 0
ioctl(23, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xf0336fffbf50) = 0
ioctl(23, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xf0336fffbf50) = 0
…thousands of identical calls, all returning 0…
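0x46 (ASCII 'F') is the ioctl magic used by the NVIDIA RM character devices. 0x2a with a 0x20-byte argument appears to correspond to NV_ESC_RM_CONTROL, the generic RM control call in the open GPU kernel modules, rather than a memory-specific operation. Every call returns 0, yet the caller keeps polling and never observes the condition it is waiting for.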
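For cross-referencing against driver headers, the symbolic _IOC(...) form printed by strace can be packed into the raw request number. The sketch below assumes the generic Linux ioctl encoding from asm-generic/ioctl.h (8-bit nr, 8-bit type, 14-bit size, 2-bit direction), which applies on aarch64. The function names are illustrative.

```python
# Bit layout of an ioctl request number in the generic Linux encoding.
IOC_NRBITS, IOC_TYPEBITS, IOC_SIZEBITS, IOC_DIRBITS = 8, 8, 14, 2
IOC_NRSHIFT = 0
IOC_TYPESHIFT = IOC_NRSHIFT + IOC_NRBITS      # 8
IOC_SIZESHIFT = IOC_TYPESHIFT + IOC_TYPEBITS  # 16
IOC_DIRSHIFT = IOC_SIZESHIFT + IOC_SIZEBITS   # 30
IOC_WRITE, IOC_READ = 1, 2


def ioc(direction, type_, nr, size):
    """Pack the four fields into a request number, like the kernel's _IOC macro."""
    return ((direction << IOC_DIRSHIFT) | (size << IOC_SIZESHIFT)
            | (type_ << IOC_TYPESHIFT) | (nr << IOC_NRSHIFT))


def decode(request):
    """Split a raw request number back into its fields."""
    return {
        "dir": (request >> IOC_DIRSHIFT) & ((1 << IOC_DIRBITS) - 1),
        "size": (request >> IOC_SIZESHIFT) & ((1 << IOC_SIZEBITS) - 1),
        "type": (request >> IOC_TYPESHIFT) & ((1 << IOC_TYPEBITS) - 1),
        "nr": (request >> IOC_NRSHIFT) & ((1 << IOC_NRBITS) - 1),
    }


# The call seen in the trace: read+write, type 0x46, nr 0x2a, 0x20-byte argument.
request = ioc(IOC_READ | IOC_WRITE, 0x46, 0x2A, 0x20)
```

Here request evaluates to 0xC020462A, and chr(0x46) is 'F', the type byte shared by every call in the trace.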
Likely path
The log line Skipping device transfer from cpu to cuda on integrated GPU to conserve shared memory is the GB10-specific code path that avoids the cpu→cuda copy because unified memory makes it redundant. The hang then appears inside the NVFP4 fused-MoE weight-quantization path entered from there (_torch/modules/fused_moe/quantization.py:2769).
Why this looks like an rc15 regression, not a config issue
- The same yaml and the same flags boot on 1.3.0rc12 in ~4 minutes and serve traffic stably at max_batch_size=8
- Switching only the container tag rc15 → rc12 restores normal cold-start; no other change required
- Nothing exotic in the config: no MTP, no TP, no EP, default pytorch backend
Worth noting: 1.3.0rc15 does not hit the sm_121 PTXAS / Triton codegen error that previously blocked 1.3.0rc14 on this same hardware — so the rc15 boot path on GB10 has clearly moved, just not yet to a working state.
What would help
- Confirmation whether the GB10 / sm_121 integrated-GPU NVFP4 fused-MoE path is exercised in CI for rc15
- Whether moe_config.backend: TRTLLM (vs CUTLASS) is expected to work around it
- Whether mamba_ssm_cache_dtype: bfloat16 (vs float16) makes a difference
- Whether --enable_chunked_prefill off changes anything
Happy to capture py-spy dump, a longer strace, or run additional repro variants — let me know what's most useful.
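For the repro variants, one way to keep each run to a single changed setting is to generate the config files from the baseline. This is only a sketch under stated assumptions: the nesting of the baseline keys (in particular the mamba_ssm_* keys under kv_cache_config) is my reading of the config above, the third variant expresses "chunked prefill off" as a YAML key rather than a CLI flag, and the variant names and helpers are hypothetical.

```python
import copy

# Baseline mirroring extra-llm-api-config.yml (nesting is an assumption).
BASE_CONFIG = {
    "kv_cache_config": {
        "dtype": "fp8",
        "enable_block_reuse": False,
        "free_gpu_memory_fraction": 0.9,
        "mamba_ssm_cache_dtype": "float16",
        "mamba_ssm_stochastic_rounding": True,
        "mamba_ssm_philox_rounds": 5,
    },
    "moe_config": {"backend": "CUTLASS"},
    "cuda_graph_config": {"enable_padding": True, "max_batch_size": 8},
    "enable_chunked_prefill": True,
}

# One override per variant: (path to the key, new value).
VARIANTS = {
    "moe-trtllm": (("moe_config", "backend"), "TRTLLM"),
    "ssm-bf16": (("kv_cache_config", "mamba_ssm_cache_dtype"), "bfloat16"),
    "no-chunked-prefill": (("enable_chunked_prefill",), False),
}


def with_override(config, path, value):
    """Return a deep copy of config with the key at path set to value."""
    out = copy.deepcopy(config)
    node = out
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = value
    return out


def render(node, indent=0):
    """Emit nested dicts of scalars as block-style YAML text."""
    lines = []
    pad = "  " * indent
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(render(value, indent + 1).rstrip("\n"))
        elif isinstance(value, bool):
            lines.append(f"{pad}{key}: {'true' if value else 'false'}")
        else:
            lines.append(f"{pad}{key}: {value}")
    return "\n".join(lines) + "\n"


# Example: write one file per variant next to the baseline.
#   for name, (path, value) in VARIANTS.items():
#       with open(f"extra-llm-api-config.{name}.yml", "w") as fh:
#           fh.write(render(with_override(BASE_CONFIG, path, value)))
```

Each generated file differs from the baseline in exactly one key, so any change in cold-start behavior can be attributed to that setting.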