Complete workflow to deploy SLA-based autoscaling for Dynamo deployments. This guide consolidates all necessary steps into a clear, sequential process.
Important
Prerequisites: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the Dynamo Platform installation.
The SLA Planner automatically scales prefill and decode workers to meet your TTFT (Time To First Token) and ITL (Inter-Token Latency) targets.
The deployment process consists of two mandatory phases:
- Pre-Deployment Profiling (2-4 hours) - Generates performance data
- SLA Planner Deployment (5-10 minutes) - Enables autoscaling
Tip
Fast Profiling with AI Configurator: For TensorRT-LLM users, we provide AI Configurator (AIC), which can complete profiling in 20-30 seconds by simulating performance instead of running real deployments. Support for vLLM and SGLang is coming soon. See the AI Configurator section in the Profiling Guide.
```mermaid
flowchart TD
    A[Start Setup] --> B{Profiling Done?}
    B -->|No| C[Run Profiling<br/>2-4 hours]
    C --> D[Verify Results]
    D --> E[Deploy Planner<br/>5-10 minutes]
    B -->|Yes| E
    E --> F[Test System]
    F --> G[Ready!]

    style A fill:#e1f5fe
    style C fill:#fff3e0
    style E fill:#e8f5e8
    style G fill:#f3e5f5
    style B fill:#fff8e1
```
Before deploying the SLA planner, ensure:
- Dynamo platform installed (see Installation Guide)
- kube-prometheus-stack installed and running. The default Prometheus endpoint assumes the server runs in the `monitoring` namespace. If it is deployed to a different namespace, set `dynamo-operator.dynamo.metrics.prometheusEndpoint="http://prometheus-kube-prometheus-prometheus.<namespace>.svc.cluster.local:9090"`.
- Benchmarking resources set up (see Kubernetes utilities for Dynamo Benchmarking and Profiling). The script will create a `dynamo-pvc` with `ReadWriteMany` access; if your cluster's default `storageClassName` does not allow `ReadWriteMany`, you need to specify a different `storageClassName` in `pvc.yaml` (a sketch follows this list).
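If your default StorageClass lacks `ReadWriteMany`, you can apply a PVC like the following yourself before running the setup. This is a minimal sketch, assuming `NAMESPACE` is exported as in the profiling steps below; the `nfs-client` class name and `10Gi` size are illustrative placeholders, not values from the Dynamo docs.

```bash
# Hypothetical override: replace "nfs-client" with a StorageClass in your
# cluster that supports ReadWriteMany, and size the claim for your artifacts.
cat <<'EOF' | kubectl apply -n "$NAMESPACE" -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamo-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 10Gi
EOF
```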
Deploying the planner starts with running pre-deployment profiling.
Warning
MANDATORY: Pre-deployment profiling must be completed before deploying the SLA planner. This process analyzes your model's performance characteristics to determine optimal tensor parallelism configurations and scaling parameters.
Set up your Kubernetes namespace for profiling (one-time per namespace). If your namespace is already set up, skip this step.
```bash
export NAMESPACE=your-namespace
```

Prerequisites: Ensure all dependencies are installed:
```bash
pip install -r deploy/utils/requirements.txt
```

Use the injector utility to place your DGD manifest into the PVC:
```bash
# Use default disagg.yaml config
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /data/configs/disagg.yaml

# Or use a custom disagg config file
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/configs/disagg.yaml
```

Note: All paths must start with `/data/` for security reasons.
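To confirm the manifest landed in the PVC, one option is a throwaway pod that mounts the same volume. This is a hedged sketch, not part of the official tooling: the `busybox` image and pod name are arbitrary, and the `dynamo-pvc` claim name mirrors the default created above.

```bash
# Spin up a short-lived pod that mounts dynamo-pvc and lists /data/configs,
# then removes itself; verify the claim name matches your cluster.
kubectl run pvc-ls --rm -i --restart=Never -n $NAMESPACE \
  --image=busybox:1.36 \
  --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"pvc-ls","image":"busybox:1.36","command":["ls","-l","/data/configs"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}],"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"dynamo-pvc"}}]}}'
```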
For dense models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_job.yaml`:
```yaml
spec:
  template:
    spec:
      containers:
        - name: profile-sla
          args:
            - --isl
            - "3000" # average ISL is 3000 tokens
            - --osl
            - "150" # average OSL is 150 tokens
            - --ttft
            - "200" # target TTFT is 200ms
            - --itl
            - "20" # target ITL is 20ms
            - --backend
            - <vllm/sglang>
            - --deploy-after-profile
```

For MoE models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_moe_job.yaml` instead.
To automatically deploy the optimized DGD with the planner after profiling, add `--deploy-after-profile` to the profiling job args. It deploys the DGD with the engine configured for the optimized parallelization mapping found for the SLA targets.
Set the container image and config path:

```bash
export DOCKER_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
export DGD_CONFIG_FILE=/data/configs/disagg.yaml
```

Run profiling:
```bash
# for dense models
envsubst < benchmarks/profiler/deploy/profile_sla_job.yaml | kubectl apply -f -

# for MoE models
envsubst < benchmarks/profiler/deploy/profile_sla_moe_job.yaml | kubectl apply -f -

# using aiconfigurator instead of real sweeping (see below for more details)
envsubst < benchmarks/profiler/deploy/profile_sla_aic_job.yaml | kubectl apply -f -
```

Check the job status and logs:

```bash
kubectl get jobs -n $NAMESPACE
kubectl logs job/profile-sla -n $NAMESPACE
```

Note
Time Investment: This profiling process is comprehensive and typically takes 2-4 hours to complete. The script systematically tests multiple tensor parallelism configurations and load conditions to find optimal performance settings.
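Rather than polling manually, one hedged convenience is to block until the job completes and then dump its logs; note that `kubectl wait` exits non-zero on timeout or failure, so size the timeout to your run.

```bash
# Wait for the profiling job to finish (up to 4 hours), then print its logs.
kubectl wait --for=condition=complete job/profile-sla -n $NAMESPACE --timeout=4h
kubectl logs job/profile-sla -n $NAMESPACE
```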
If you want to view the profiling results and performance plots:

```bash
# Download to directory
python3 -m deploy.utils.download_pvc_results --namespace $NAMESPACE --output-dir ./results --folder /data/profiling_results
```

For detailed information about the output structure, performance plots, and how to analyze the results, see the Viewing Profiling Results section in the Profiling Guide.
Verify Success: Look for terminal output like:

```text
Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
...
Final DGD config with planner: {...}
Deploying the optimized DGD with planner...
```
Check that the deployment is running:

```bash
kubectl get pods -n $NAMESPACE
```

Expected pods (all should be 1/1 Running):

```text
vllm-disagg-planner-frontend-*   1/1   Running
vllm-disagg-planner-planner-*    1/1   Running
vllm-disagg-planner-backend-*    1/1   Running
vllm-disagg-planner-prefill-*    1/1   Running
```
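Optionally, you can block until the pods report Ready instead of re-running `kubectl get pods`. The label selector below is an assumption for illustration; check the labels your DGD pods actually carry with `kubectl get pods --show-labels`.

```bash
# Hypothetical selector: substitute the label set on your deployment's pods.
kubectl wait --for=condition=Ready pod -n $NAMESPACE \
  -l app.kubernetes.io/instance=vllm-disagg-planner --timeout=15m
```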
```bash
# Port forward to frontend
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000

# Send a request
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ],
    "stream": true,
    "max_tokens": 30
  }'
```

```bash
# Check planner logs for scaling decisions
kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-planner --tail=10
```

Expected successful output (after streaming requests):
```text
New adjustment interval started!
Observed num_req: X.XXX isl: X.XXX osl: X.XXX
Observed ttft: X.XXXs itl: X.XXXs
Number of prefill workers: 1, number of decode workers: 1
```
Metric availability:

- Basic metrics (request count): Available with any request type
- Latency metrics (TTFT/ITL): Available for both streaming and non-streaming requests
- Scaling decisions: Require sufficient request volume (see the load sketch below)
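The single request above is enough to light up the basic metrics, but scaling decisions need sustained traffic. A rough load sketch, assuming the port-forward from the test step is still running; the burst size and prompt are arbitrary:

```bash
# Fire 20 concurrent streaming requests so the planner observes TTFT/ITL
# across an adjustment interval; raise the count for a real load test.
for i in $(seq 1 20); do
  curl -sN http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"Hello"}],"stream":true,"max_tokens":16}' \
    > /dev/null &
done
wait
```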
Connection Issues:

```bash
# Verify Prometheus is accessible
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090
curl "http://localhost:9090/api/v1/query?query=up"
```

Missing Metrics:
```bash
# Check frontend metrics
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000
curl http://localhost:8000/metrics | grep nv_llm_http_service
```

Worker Issues:
- Large models can take 10+ minutes to initialize
- Check worker logs:

  ```bash
  kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-backend
  ```

- Ensure GPU resources are available for workers (a quick check is sketched below)
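One quick way to confirm schedulable GPUs, using the standard NVIDIA device-plugin resource name:

```bash
# List allocatable GPUs per node; empty or zero means workers cannot schedule.
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```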
Unknown Field subComponentType:
If you encounter the following error when applying the deployment:
```text
Error from server (BadRequest): error when creating "components/backends/vllm/deploy/disagg.yaml": DynamoGraphDeployment in version "v1alpha1" cannot be handled as a DynamoGraphDeployment: strict decoding error: unknown field "spec.services.DecodeWorker.subComponentType", unknown field "spec.services.PrefillWorker.subComponentType"
```

This is because the `subComponentType` field was only added in newer versions of the DynamoGraphDeployment CRD (> 0.5.0). You can upgrade the CRD version by following the instructions here.
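Before upgrading, one hedged way to check whether your installed CRD already knows about the field is to grep its served schema. The CRD name below is an assumption for illustration; find the exact name with `kubectl get crd | grep -i dynamo`.

```bash
# Look for subComponentType in the installed schema; no output means the CRD
# predates the field and needs the upgrade described above.
kubectl get crd dynamographdeployments.nvidia.com -o yaml | grep -i subcomponenttype
```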
- Architecture Details: See SLA-based Planner Architecture for technical details
- Performance Tuning: See Pre-Deployment Profiling Guide for advanced profiling options
- Load Testing: See SLA Planner Load Test for comprehensive testing tools
| Phase | Duration | Purpose | Status Check |
|---|---|---|---|
| Profiling | 2-4 hours | Generate performance data | `kubectl logs job/profile-sla` |
| Deployment | 5-10 minutes | Enable autoscaling | `kubectl get pods` |
| Testing | 5 minutes | Verify functionality | `kubectl logs deployment/planner` |
Tip
Need Help? If you encounter issues, check the troubleshooting section or refer to the detailed guides linked in Next Steps.