`1.3.0rc15`: trtllm-serve hangs in fused_moe/quantization.py during cold-start on GB10 / sm_121; `1.3.0rc12` works with identical config (#14500)

## Summary

Loading **Nemotron-3-Super-120B-A12B-NVFP4** via `trtllm-serve` with the same launch args and the same `extra_llm_api_options` yaml succeeds on `1.3.0rc12` but hangs on `1.3.0rc15`:

| Image | Cold-start | Outcome |
|---|---|---|
| `nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12` | ~4 min | ✅ serves `/v1/models` |
| `nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15` | 18+ min, never serves | ❌ worker spins on NVIDIA-driver ioctls in `fused_moe/quantization.py:2769`, progress bar stuck at `Loading weights concurrently: 0/1288` |

Reverting only the image tag back to `rc12` immediately restores normal cold-start and serving.

## Environment

- **GPU:** NVIDIA GB10 (Grace-Blackwell, **sm_121**, integrated GPU, unified-memory architecture)
- **OS:** DGX OS 7 (Ubuntu 24.04 aarch64)
- **Model:** `Nemotron-3-Super-120B-A12B-NVFP4` (MoE; hybrid Mamba/SSM + attention; NVFP4 quantized)
- **Container images tested:** `nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15` (hangs) and `nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12` (works)
- **Backend:** `pytorch` (the `trtllm-serve` default; no `--backend` flag is passed)

## Repro

Launch command (identical args work on `1.3.0rc12`):

```bash
docker run --rm \
    --gpus all --ipc host --network host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -e TLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
    -v /path/to/Nemotron-3-Super-120B-A12B-NVFP4:/model:ro \
    -v /path/to/extra-llm-api-config.yml:/extra-llm-api-config.yml:ro \
    nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc15 \
    trtllm-serve /model --host 0.0.0.0 --port 8080 \
      --served_model_name nemotron-super-120b-nvfp4 \
      --max_batch_size 8 --tp_size 1 --ep_size 1 \
      --max_num_tokens 8192 --max_seq_len 1048576 \
      --trust_remote_code \
      --reasoning_parser nano-v3 --tool_parser qwen3_coder \
      --extra_llm_api_options /extra-llm-api-config.yml
```

`extra-llm-api-config.yml`:

```yaml
kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.9
  mamba_ssm_cache_dtype: float16
  mamba_ssm_stochastic_rounding: true
  mamba_ssm_philox_rounds: 5
moe_config:
  backend: CUTLASS
cuda_graph_config:
  enable_padding: true
  max_batch_size: 8
enable_chunked_prefill: true
```

(No speculative decoding / MTP. `tp_size=1`, `ep_size=1`.)

## Observed behavior

Boot reaches this point and then makes no further visible progress for 18+ minutes:

```
[TensorRT-LLM] TensorRT LLM version: 1.3.0rc15
…
Loading safetensors weights in parallel: 100%|██████████| 17/17 [00:15<00:00,  1.13it/s]
[TRT-LLM] [W] [_torch] [load_weight_shard] Skipping device transfer from cpu to cuda on integrated GPU to conserve shared memory.
Loading weights concurrently:   0%|          | 0/1288 [00:00<?, ?it/s]
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/fused_moe/quantization.py:2769: UserWarning: TypedStorage is deprecated. […]
  dst_base = dst_w3_w1_weight.storage().data_ptr()
```

No further log lines appear and no errors are emitted. During the hang:

- **CPU:** the `mpi4py.futures.server` worker holds 99.9 % of one core; accumulated ~15 min of CPU time on a single thread
- **GPU:** 0 % utilization throughout (`nvidia-smi`)
- **Block I/O:** stable at 51.4 GB read (all weights have already been read from disk; not I/O bound)
- **Memory:** container RSS 19 GB / 119 GB unified; no swap thrashing (`vmstat` shows `si≈0`, `so=0`)

### strace of the busy worker thread

```
ioctl(23, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xf0336fffbf50) = 0
ioctl(23, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xf0336fffbf50) = 0
ioctl(23, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xf0336fffbf50) = 0
…thousands of identical calls, all returning 0…
```

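`0x46` (ASCII `F`) is the ioctl magic used by the NVIDIA driver's character devices. `0x2a` with a `0x20`-byte argument matches `NV_ESC_RM_CONTROL` in the open kernel module headers, which is a generic resource-manager control call, so the request number alone does not identify the operation. All calls succeed, but the caller apparently never observes the condition it is polling for.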
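For anyone matching a raw request number against driver headers, here is a minimal sketch of how the `_IOC(...)` tuple printed by strace maps to a single 32-bit value. It assumes the asm-generic ioctl bit layout, which aarch64 Linux uses; the helper names `ioc` and `decode_ioctl` are made up for this illustration.

```python
# Assumed layout (asm-generic/ioctl.h): nr in bits 0-7, type in bits 8-15,
# size in bits 16-29, direction in bits 30-31.
IOC_NRBITS, IOC_TYPEBITS, IOC_SIZEBITS = 8, 8, 14
IOC_NRSHIFT = 0
IOC_TYPESHIFT = IOC_NRSHIFT + IOC_NRBITS      # 8
IOC_SIZESHIFT = IOC_TYPESHIFT + IOC_TYPEBITS  # 16
IOC_DIRSHIFT = IOC_SIZESHIFT + IOC_SIZEBITS   # 30
IOC_WRITE, IOC_READ = 1, 2


def ioc(direction, type_, nr, size):
    """Pack the four fields strace prints into one request number."""
    return ((direction << IOC_DIRSHIFT) | (size << IOC_SIZESHIFT)
            | (type_ << IOC_TYPESHIFT) | (nr << IOC_NRSHIFT))


def decode_ioctl(request):
    """Split a request number back into its fields."""
    direction = (request >> IOC_DIRSHIFT) & 0x3
    return {
        "read": bool(direction & IOC_READ),
        "write": bool(direction & IOC_WRITE),
        "type": (request >> IOC_TYPESHIFT) & 0xFF,
        "nr": (request >> IOC_NRSHIFT) & 0xFF,
        "size": (request >> IOC_SIZESHIFT) & 0x3FFF,
    }


# The call seen in the trace: _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20)
request = ioc(IOC_READ | IOC_WRITE, 0x46, 0x2A, 0x20)
print(hex(request), decode_ioctl(request))
```

The packed value (`0xc020462a`) is the form in which the request usually shows up in other tools and bug reports, which makes it easier to search for.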

### Likely path

The log line `Skipping device transfer from cpu to cuda on integrated GPU to conserve shared memory` comes from the integrated-GPU code path that GB10 takes, which skips the cpu→cuda copy because unified memory makes it redundant. The hang then appears inside the NVFP4 fused-MoE weight-loading code entered from there: the last output before the stall is the warning raised at `_torch/modules/fused_moe/quantization.py:2769`, so the stall is at or shortly after that line.

## Why this looks like an rc15 regression, not a config issue

- The **same yaml and the same flags** boot on `1.3.0rc12` in ~4 minutes and serve traffic stably at `max_batch_size=8`
- Switching only the container tag `rc15 → rc12` restores normal cold-start; no other change required
- Nothing exotic in the config: no MTP, no TP, no EP, default `pytorch` backend

Worth noting: **`1.3.0rc15` does not hit the sm_121 PTXAS / Triton codegen error that previously blocked `1.3.0rc14`** on this same hardware. The rc15 startup path on GB10 therefore gets further than rc14 did, but it still does not reach a serving state.
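To make the tag-swap comparison repeatable, a small timing probe can record how long each image takes to answer on `/v1/models`. This is only a sketch: the function names, the 5-second poll interval, and the 20-minute deadline are assumptions, and the URL simply reuses port 8080 from the launch command.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout_s=5.0):
    """Return True once the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx: server is not ready yet.
        return False


def time_to_ready(probe, deadline_s, interval_s=5.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` until it returns True.

    Returns elapsed seconds on success, or None if `deadline_s` passes first.
    `clock` and `sleep` are injectable so the loop can be exercised offline.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)


# Example (not executed here): start the container, then run
# elapsed = time_to_ready(
#     lambda: http_probe("http://localhost:8080/v1/models"),
#     deadline_s=20 * 60,
# )
# print("never became ready" if elapsed is None else f"ready after {elapsed:.0f}s")
```

Running it once per image tag gives two comparable numbers (or a timeout) instead of a by-eye estimate.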

## What would help

1. Confirmation of whether the GB10 / sm_121 integrated-GPU NVFP4 fused-MoE path is exercised in CI for rc15
2. Whether `moe_config.backend: TRTLLM` (instead of `CUTLASS`) is expected to work around it
3. Whether `mamba_ssm_cache_dtype: bfloat16` (instead of `float16`) makes a difference
4. Whether setting `enable_chunked_prefill: false` in the yaml changes anything
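The three config variants in items 2 to 4 can be written as single-key overrides on the yaml from the repro, so each run changes exactly one thing. The sketch below uses plain dicts; the `merged` helper and the variant names are hypothetical, and whether each override is accepted by rc15 is exactly what is being asked.

```python
import copy

# The extra-llm-api-config.yml from the repro, as a dict.
BASE = {
    "kv_cache_config": {
        "dtype": "fp8",
        "enable_block_reuse": False,
        "free_gpu_memory_fraction": 0.9,
        "mamba_ssm_cache_dtype": "float16",
        "mamba_ssm_stochastic_rounding": True,
        "mamba_ssm_philox_rounds": 5,
    },
    "moe_config": {"backend": "CUTLASS"},
    "cuda_graph_config": {"enable_padding": True, "max_batch_size": 8},
    "enable_chunked_prefill": True,
}


def merged(base, override):
    """Return a deep copy of `base` with `override` applied recursively."""
    out = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merged(out[key], value)
        else:
            out[key] = copy.deepcopy(value)
    return out


# One override per question, so a change in behavior is attributable.
VARIANTS = {
    "moe-trtllm": {"moe_config": {"backend": "TRTLLM"}},
    "ssm-bf16": {"kv_cache_config": {"mamba_ssm_cache_dtype": "bfloat16"}},
    "no-chunked-prefill": {"enable_chunked_prefill": False},
}

configs = {name: merged(BASE, ov) for name, ov in VARIANTS.items()}
for name, cfg in configs.items():
    print(name,
          cfg["moe_config"]["backend"],
          cfg["kv_cache_config"]["mamba_ssm_cache_dtype"],
          cfg["enable_chunked_prefill"])
```

Each resulting dict can be written back to yaml (for example with PyYAML's `yaml.safe_dump`) and mounted in place of the original config file.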

Happy to capture `py-spy dump`, a longer strace, or run additional repro variants — let me know what's most useful.
