[AutoDeploy]: Investigate SuperV3 with MTP not scaling well #14225

### 🚀 The feature, motivation and pitch

At WS=1, AD outperforms PT at every concurrency tested: it is 18% faster at c=1 and near parity (1% ahead) at c=64.
At WS=4, AD falls behind and PT leads by 20–35%. The gap comes from poor WS=1→4 scaling on the AD side: at c=1, PT speeds up 1.52× from WS=1 to WS=4 (4.67 ms → 3.07 ms), while AD gains only 1.04× (3.82 ms → 3.67 ms).
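
For reference, the ratios above can be reproduced from the four c=1 numbers with a small sketch like the one below. It assumes the quoted values are latencies in milliseconds (lower is better); the helper names `speedup` and `ad_advantage` are made up for illustration and are not part of any benchmark script.

```python
# Latencies quoted above at c=1, in milliseconds, keyed by (backend, WS).
# Assumption: lower is better.
latency_ms = {
    ("PT", 1): 4.67,
    ("PT", 4): 3.07,
    ("AD", 1): 3.82,
    ("AD", 4): 3.67,
}


def speedup(backend: str, ws_from: int = 1, ws_to: int = 4) -> float:
    """Latency ratio when going from ws_from to ws_to (higher is better)."""
    return latency_ms[(backend, ws_from)] / latency_ms[(backend, ws_to)]


def ad_advantage(ws: int) -> float:
    """Fraction by which AD latency is lower than PT latency.

    Negative when AD is slower than PT.
    """
    pt = latency_ms[("PT", ws)]
    ad = latency_ms[("AD", ws)]
    return (pt - ad) / pt


for backend in ("PT", "AD"):
    print(f"{backend}: WS=1->4 speedup {speedup(backend):.2f}x")
print(f"AD vs PT at WS=1: {ad_advantage(1):+.1%}")
print(f"AD vs PT at WS=4: {ad_advantage(4):+.1%}")
```

With the numbers above this prints 1.52x for PT, 1.04x for AD, +18.2% at WS=1 and -19.5% at WS=4, which matches the low end of the 20–35% range.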

Investigate and resolve the poor scaling on the AutoDeploy side.
Baseline branch: `nv-auto-deploy:gagam/super-mtp-perf-2-replay` (see #13725)

### Alternatives

_No response_

### Additional context

See SuperV3 MTP ticket #12359.

Scripts, configs, and experiment data:
https://gitlab-master.nvidia.com/ghubaraagam/agent-reports/-/tree/main/260428_superv3_mtp?ref_type=heads

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.
