GPT-OSS-120B Disaggregated Prefill/Decode

Serves openai/gpt-oss-120b using TensorRT-LLM with disaggregated prefill/decode via Dynamo on GB200 nodes.

Topology

Role	Nodes	GPUs/node	Total GPUs	Parallelism
Prefill	1	1	1	TP1
Decode	1	4	4	TP4
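As a quick sanity check on the table, the sketch below recomputes the GPU totals and confirms that each role's tensor-parallel degree matches its GPU count. It assumes one worker per role that spans all of that role's GPUs; the `Role` class is a hypothetical helper, not part of the recipe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    nodes: int
    gpus_per_node: int
    tp: int  # tensor-parallel degree from the Parallelism column

    @property
    def total_gpus(self) -> int:
        return self.nodes * self.gpus_per_node


# Values copied from the topology table.
ROLES = [
    Role("Prefill", nodes=1, gpus_per_node=1, tp=1),
    Role("Decode", nodes=1, gpus_per_node=4, tp=4),
]

for role in ROLES:
    # Assumption: one worker per role, so TP degree equals the role's GPU count.
    assert role.tp == role.total_gpus, role.name
    print(f"{role.name}: {role.total_gpus} GPU(s), TP{role.tp}")

print("Total GPUs:", sum(r.total_gpus for r in ROLES))
```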

Prerequisites

Dynamo Platform installed (see the Kubernetes Deployment Guide)
Blackwell GPU nodes (GB200 or B200)
HuggingFace token with access to the model

Deploy

Follow the top-level Quick Start to set up the namespace, HuggingFace token secret, and model download. Then:

kubectl apply -f trtllm/disagg/deploy.yaml -n ${NAMESPACE}

Monitor startup (model loading takes ~15–30 minutes depending on storage speed):

kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/part-of=gpt-oss-disagg -w
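For scripted checks instead of watching the pod list by hand, one option is to parse `kubectl get pods -o json` and report pods that are not yet Ready. This is a minimal sketch: `unready_pods` and `fetch_pods` are hypothetical helper names, and the label selector is the one used in the command above.

```python
import json
import subprocess

SELECTOR = "app.kubernetes.io/part-of=gpt-oss-disagg"  # label from the watch command


def unready_pods(pod_list: dict) -> list[str]:
    """Return names of pods whose Ready condition is not "True"."""
    pending = []
    for pod in pod_list.get("items", []):
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            pending.append(pod["metadata"]["name"])
    return pending


def fetch_pods(namespace: str) -> dict:
    """Run kubectl and parse its JSON output (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", SELECTOR, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `unready_pods(fetch_pods(namespace))` in a polling loop returns an empty list once every matching pod reports Ready.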

Test

kubectl port-forward svc/gpt-oss-disagg-frontend 8000:8000 -n ${NAMESPACE} &
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-oss-120b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
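The same request can be sent from Python using only the standard library. This is a sketch under two assumptions: the port-forward above is still running, and the frontend returns an OpenAI-style response body (`choices[0].message.content`). `build_chat_request` and `send_chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt: str, max_tokens: int = 50) -> dict:
    """Build the same JSON body as the curl example."""
    return {
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_chat(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    # Assumption: OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

With the port-forward active, `send_chat(build_chat_request("Hello!"))` should return the generated text.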

Benchmark (optional)

Edit perf.yaml to set your namespace and PVC, then run:

kubectl apply -f trtllm/disagg/perf.yaml -n ${NAMESPACE}
kubectl logs -f -l job-name=gpt-oss-120b-disagg-bench -n ${NAMESPACE}

Key Configuration Notes

Engine Configs

The deploy.yaml includes a ConfigMap with separate engine configurations for prefill and decode workers. Key differences:

Prefill: TP1, max_batch_size=64, free_gpu_memory_fraction=0.8, overlap scheduler disabled
Decode: TP4, max_batch_size=1280, free_gpu_memory_fraction=0.85, overlap scheduler enabled
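To get a feel for what the two free_gpu_memory_fraction values mean, the sketch below computes a KV cache budget per GPU. It assumes the fraction is applied to the GPU memory still free after the model weights are loaded, and the 100 GiB figure is purely illustrative, not a measurement from this deployment.

```python
def kv_cache_budget_gib(free_after_load_gib: float, free_gpu_memory_fraction: float) -> float:
    """Memory reserved for KV cache, given free memory after weights load."""
    if not 0.0 < free_gpu_memory_fraction <= 1.0:
        raise ValueError("free_gpu_memory_fraction must be in (0, 1]")
    return free_after_load_gib * free_gpu_memory_fraction


HYPOTHETICAL_FREE_GIB = 100.0  # illustrative only, not measured

# Fractions taken from the prefill and decode settings listed above.
for role, fraction in (("prefill", 0.8), ("decode", 0.85)):
    budget = kv_cache_budget_gib(HYPOTHETICAL_FREE_GIB, fraction)
    print(f"{role}: {budget:.1f} GiB of KV cache per GPU")
```

Under this assumption, the larger decode fraction leaves less headroom for other allocations in exchange for more KV cache, which fits the much larger decode batch size.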

KV Transfer

Uses a UCX-based cache transceiver (max_tokens_in_buffer=9216) to transfer KV cache between prefill and decode workers.

Quantization

Uses W4A8_MXFP4_MXFP8 quantization via the OVERRIDE_QUANT_ALGO environment variable.
