Problem:
Nemotron 3 Super achieves 5x throughput over the previous Nemotron Super and 7.5x over Qwen3.5-122B, but the repository has no cookbook demonstrating how to actually run high-throughput batch inference. The Advanced Deployment Guide mentions the throughput backend mode for "offline batch jobs" but never demonstrates it.
The community is already building batch workloads organically (3.5M patent classification on a single RTX 5090, bulk code review at 12.5s per file), but there's no official guidance for optimal configuration.
Proposed Solution:
Add usage-cookbook/Nemotron-3-Super/batch_throughput_cookbook.ipynb demonstrating:
- Server configuration for throughput (CUTLASS backend, EP, batch size tuning)
- Offline batch inference with vLLM's LLM class
- Async concurrent requests against an OpenAI-compatible server
- Practical use case: bulk document classification with structured JSON output
- Throughput measurement and latency vs throughput backend comparison
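To make the offline classification item concrete, here is a minimal sketch of what that notebook cell might look like. The label set, the prompt wording, and the helper names (`build_prompt`, `parse_label`, `classify_offline`) are illustrative assumptions, not part of any existing cookbook. The model identifier is left as a parameter, and passing raw prompts rather than a chat template is also an assumption.

```python
import json

# Hypothetical label set, for illustration only.
LABELS = ["contract", "invoice", "patent", "other"]


def build_prompt(document: str, labels=LABELS) -> str:
    """Build a classification prompt that asks for a JSON-only reply."""
    return (
        "Classify the document into exactly one of these labels: "
        + ", ".join(labels)
        + '.\nReply with JSON only, e.g. {"label": "other"}.\n\n'
        + "Document:\n"
        + document
    )


def parse_label(raw: str, labels=LABELS, fallback: str = "other") -> str:
    """Extract the label from a reply, tolerating text around the JSON object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        label = json.loads(raw[start : end + 1]).get("label")
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return fallback
    return label if label in labels else fallback


def classify_offline(model: str, documents: list[str]) -> list[str]:
    """Classify all documents in one offline vLLM batch (requires vLLM and a GPU)."""
    # Imported lazily so the helpers above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)
    params = SamplingParams(temperature=0.0, max_tokens=32)
    # generate() takes the whole prompt list and handles batching internally.
    outputs = llm.generate([build_prompt(d) for d in documents], params)
    return [parse_label(o.outputs[0].text) for o in outputs]
```

Handing the full prompt list to a single `generate()` call lets the engine schedule the batch itself, which is the point of the offline path. Falling back to a default label keeps one malformed reply from aborting a long run.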
The notebook follows the existing vllm_cookbook.ipynb pattern and requires no external API keys.
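For the async and throughput-measurement items, a rough sketch of the pattern follows. It uses only the standard library, and `fake_completion` is a hypothetical stand-in for a real request to the OpenAI-compatible server. The concurrency limit of 32 is an arbitrary placeholder, not a tuned value.

```python
import asyncio
import time


async def fake_completion(prompt: str) -> str:
    """Hypothetical stand-in for one HTTP request to the inference server."""
    await asyncio.sleep(0.01)  # simulate network and generation latency
    return prompt.upper()


async def run_batch(prompts, request_fn=fake_completion, max_concurrency=32):
    """Send all prompts with bounded concurrency; return (results, requests/sec)."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(prompt):
        # At most max_concurrency requests are in flight at any time.
        async with semaphore:
            return await request_fn(prompt)

    start = time.perf_counter()
    # gather() preserves input order, so results line up with prompts.
    results = await asyncio.gather(*(bounded(p) for p in prompts))
    elapsed = time.perf_counter() - start
    throughput = len(prompts) / elapsed if elapsed > 0 else 0.0
    return results, throughput


if __name__ == "__main__":
    results, rps = asyncio.run(run_batch([f"doc {i}" for i in range(100)]))
    print(f"{len(results)} requests at {rps:.1f} requests/sec")
```

In the notebook, `request_fn` would wrap a call to the locally served model. Running the same batch against the latency and throughput backends and comparing the measured requests per second would give the comparison listed above.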
Why now:
With the Nemotron 3 Super launch and GTC next week, community interest in throughput optimization is at its peak. Official guidance would back NVIDIA's throughput claims with reproducible benchmarks.
I'm willing to implement this. Happy to adjust based on feedback.