
Commit 569350e (1 parent 21796cd)

start making updates to perf-overview.md instructions for release 1.2

Signed-off-by: Zachary Patel <22306219+zbpatel@users.noreply.github.com>

1 file changed: +6 -0 lines changed

docs/source/developer-guide/perf-overview.md (6 additions & 0 deletions)
@@ -17,6 +17,8 @@ For DeepSeek R1 performance, please check out our [performance guide](../blogs/B
 
 For more information on benchmarking with `trtllm-bench` see this NVIDIA [blog post](https://developer.nvidia.com/blog/llm-inference-benchmarking-performance-tuning-with-tensorrt-llm/).
 
+For NUMA systems, we recommend consulting the ["CPU Affinity configuration in TensorRT LLM"](../deployment-guide/configuring-cpu-affinity.md) guide to achieve best performance. These options were enabled for relevant tests.
+
 ## Throughput Measurements
 
 The below table shows performance data where a local inference client is fed requests at an high rate / no delay between messages,
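The NUMA guidance added in the hunk above pairs with the `trtllm-bench` workflow the doc already links to. A minimal sketch of running a throughput benchmark pinned to one NUMA node follows; the model name, dataset path, and node index are placeholders, and the exact `trtllm-bench` flags should be checked against `trtllm-bench throughput --help` for your installed release.

```shell
# Sketch only (assumed placeholders): pin both CPU threads and memory
# allocations of the benchmark process to NUMA node 0 so the inference
# client and runtime use node-local memory, as the CPU-affinity guide
# referenced above recommends.
numactl --cpunodebind=0 --membind=0 \
  trtllm-bench --model nvidia/Qwen3-235B-A22B-FP4 \
  throughput --dataset ./synthetic_dataset.jsonl
```

`--cpunodebind` and `--membind` are standard `numactl` options; without them the kernel may schedule threads on one node while pages were allocated on another, which depresses measured throughput.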
@@ -34,14 +36,17 @@ The following GPU variants were used for testing:
 - H100 SXM 80GB (DGX H100)
 - H200 SXM 141GB (DGX H200)
 - B200 180GB (DGX B200)
+- B300 288GB (DGX B300)
 - GB200 192GB (GB200 NVL72)
+- GB300 (GB300 NVL72)
 - RTX 6000 Pro Blackwell Server Edition
 
 Other hardware variants may have different TDP, memory bandwidth, core count, or other features leading to performance differences on these workloads.
 
 ### FP4 Models
 
 ```text
+nvidia/Kimi-K2-Instruct-NVFP4
 nvidia/DeepSeek-R1-0528-NVFP4-v2
 nvidia/Qwen3-235B-A22B-FP4
 nvidia/Qwen3-30B-A3B-FP4
@@ -52,6 +57,7 @@ nvidia/Llama-4-Maverick-17B-128E-Instruct-NVFP4
 ### FP8 Models
 
 ```text
+moonshotai/Kimi-K2-Instruct
 deepseek-ai/DeepSeek-R1-0528
 nvidia/Qwen3-235B-A22B-FP8
 nvidia/Llama-3.3-70B-Instruct-FP8
