⚠️ EXPERIMENTAL BRANCH: This branch contains LLM Router v2, a next-generation routing system with multimodal support. For the production-ready LLM Router v1, please visit the main branch.
LLM Router v2 is currently experimental and not yet backwards compatible with v1. Key differences:
| Feature | v1 (Main Branch) | v2 (Experimental) |
|---|---|---|
| Server Implementation | Rust proxy | NVIDIA NeMo Agent Toolkit (FastAPI) |
| Inference Backend | BERT model + NVIDIA Triton Inference Server | Qwen 1.7B LLM or CLIP + Neural Network |
| Functionality | Classification + Proxying to LLM | Classification only (returns model name) |
| Input Support | Text only | Text + Images (multimodal) |
| Routing Methods | Task or complexity classification | Intent-based or Auto-routing (neural network) |
Future Plans: The intent is to make v2 fully backwards compatible with v1's proxying capabilities, then merge to main and retire the experimental label.
Ever struggled to decide which LLM or Vision-Language Model (VLM) to use for a specific task? In an ideal world the most accurate model would also be the cheapest and fastest, but in practice modern agentic AI systems have to make trade-offs between accuracy, speed, and cost.
This blueprint provides an experimental next-generation router that automates these tradeoffs by analyzing user prompts and identifying optimal models. Given a user prompt (text or multimodal), the router:
- applies one of two routing strategies: intent-based classification or auto-routing (based on a trained neural network)
- analyzes the prompt content, including images if present
- returns the name of the most appropriate LLM or VLM for the task
For example, using intent-based routing:
| User Prompt | Intent Classification | Recommended Model |
|---|---|---|
| "What's in this image?" (with image) | image_understanding | nvidia/nemotron-nano-12b-v2-vl |
| "Solve this complex math problem: ..." | hard_question | gpt-5-chat |
| "Hello, how are you?" | chit_chat | nvidia/nvidia-nemotron-nano-9b-v2 |
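Once an intent label is classified, a routing decision like those in the table above reduces to a dictionary lookup. A minimal sketch (`resolve_model` is a hypothetical helper name; the dictionary mirrors the mapping shipped in the intent router's source):

```python
# Illustrative only: intents and model names mirror the example table; the
# real mapping lives in the intent router's configuration.
MAP_INTENT_TO_PIPELINE = {
    "other": "nvidia/nvidia-nemotron-nano-9b-v2",
    "chit_chat": "nvidia/nvidia-nemotron-nano-9b-v2",
    "hard_question": "gpt-5-chat",
    "image_understanding": "nvidia/nemotron-nano-12b-v2-vl",
    "image_question": "nvidia/nemotron-nano-12b-v2-vl",
    "try_again": "gpt-5-chat",
}

def resolve_model(intent: str) -> str:
    # Unknown intents fall back to the small, cheap default model ("other").
    return MAP_INTENT_TO_PIPELINE.get(intent, MAP_INTENT_TO_PIPELINE["other"])
```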
The key features of the experimental LLM Router v2 are:
- Multimodal Support: Route based on both text and images, optimized for VLMs
- Two Routing Strategies: Intent-based (using Qwen 1.7B) OR auto-routing (using CLIP embeddings + trained neural network)
- OpenAI API compliant: Returns model recommendations via chat completions endpoint
- Flexible: Use pre-configured intent mappings or train custom neural network routers on your own data
This blueprint is pre-configured to route between three complementary models:
| Model | Type | Provider | Use Case |
|---|---|---|---|
| gpt-5-chat | Frontier LLM | Azure OpenAI or OpenAI | Complex reasoning, hard questions |
| nvidia/nemotron-nano-12b-v2-vl | Open VLM | NVIDIA Build API | Multimodal queries, image understanding |
| nvidia/nvidia-nemotron-nano-9b-v2 | Small Open LLM | NVIDIA Build API | Simple text queries, chit chat |
- gpt-5-chat: Can be sourced from Azure OpenAI (default) or standard OpenAI API
- Nemotron models: Configured to use NVIDIA Build API endpoints (hosted), but can also use locally deployed NVIDIA NIMs for on-premise deployment
The three default models are examples only - you can route to any models by (1) updating the intent router's configuration or (2) retraining the auto-router.
The main goal of the LLM router is to intelligently route across frontier and open models to optimize the cost-quality-latency tradeoff.
This experimental blueprint is for:
- AI Engineers and Developers: Developers interested in exploring next-generation routing approaches with multimodal support.
- MLOps Teams: Teams interested in learning-based routing optimization and custom model selection strategies.
- Research Teams: Teams evaluating different routing strategies for production deployment.
- Linux operating systems (Ubuntu 22.04 or later recommended) or macOS
- Docker
- Docker Compose
- For local development: Python 3.12+ and uv package manager
```bash
git clone https://github.com/NVIDIA-AI-Blueprints/llm-router
cd llm-router
git checkout experimental  # or the appropriate v2 branch name
```
NVIDIA Build API key
- Navigate to NVIDIA API Catalog
- Click one of the models, such as nemotron-nano-12b-v2-vl
- Select the "Python" input option
- Click "Get API Key"
- Click "Generate Key" and copy the resulting key (starts with `nvapi-`)
Azure OpenAI API access
This project uses Azure OpenAI for the GPT-5-chat model. You'll need:
- An Azure subscription with Azure OpenAI service enabled
- An Azure OpenAI resource deployed with the `gpt-5-chat` model
- Your Azure OpenAI endpoint URL (format: `https://your-resource-name.openai.azure.com/`)
- Your Azure OpenAI API key (found in the Azure portal under your resource's "Keys and Endpoint")
Set these environment variables:
```bash
export AZURE_OPENAI_ENDPOINT="https://your-resource-name.openai.azure.com/"
export OPENAI_API_KEY="your-azure-openai-api-key"
```
Using regular OpenAI instead: If you prefer to use OpenAI's API instead of Azure OpenAI, you'll need to update:
- `demo/app.py`: Change `call_model_azure_openai()` to use the `OpenAI()` client with `base_url="https://api.openai.com/v1"`, and update the model provider from `"azure_openai"` to `"openai"`
- `demo/env_template.txt` and `demo/.env`: Replace `AZURE_OPENAI_ENDPOINT` with `OPENAI_API_KEY=sk-...`
- `src/nat_sfc_router/training/prepare_hf_data.py`: Replace the `AzureOpenAI` client initialization with the `OpenAI` client
- `2_Embedding_NN_Training.ipynb`: Update cells that reference `AZURE_OPENAI_ENDPOINT` and the `AzureOpenAI` client
- Get your OpenAI API key from the OpenAI Platform
For the Qwen 1.7B model:
| GPU | Family | Memory | # of GPUs (min.) |
|---|---|---|---|
| T4 or newer | Any | 16GB | 1 |
For training and using the auto-router (CLIP + Neural Network):
| Component | GPU Required | Memory | Notes |
|---|---|---|---|
| CLIP Embedding Server | Yes | 8GB+ | NVIDIA NVClip NIM (required for generating embeddings) |
| Neural Network Training | Optional | 4GB+ (if GPU) | Can run on CPU, but GPU accelerates training |
| Neural Network Inference | No | N/A | Router inference runs on CPU |
Note: Training the auto-router requires:
- A running CLIP server (GPU required) to generate embeddings from text and images
- PyTorch for neural network training (GPU optional but recommended for faster training)
- Once trained, the router artifacts can be used for inference on CPU-only systems
After meeting the prerequisites, follow these steps to start a demo chat application that uses the intent-based router and supports multimodal inputs:
Create a `.env` file in the project root:

```bash
# API Keys
OPENAI_API_KEY=sk-your-openai-key-here
NVIDIA_API_KEY=nvapi-your-nvidia-key-here
```

Option A: Intent-Based Router (Default, Recommended for Getting Started)

```bash
docker compose --profile intent up -d --build
```

This starts three services:
- router-backend (port 8001): Main routing service using NVIDIA NeMo Agent Toolkit
- qwen-router (port 8011): Qwen 1.7B model server for intent-based routing
- demo-app (port 7860): Interactive web interface
Option B: Neural Network Router

```bash
docker compose --profile nn up -d --build
```

This starts three services:
- router-backend (port 8001): Main routing service using NVIDIA NeMo Agent Toolkit
- clip-server (port 51000): CLIP embedding server for neural network routing
- demo-app (port 7860): Interactive web interface
Note: You must also update the `objective_fn` in `src/nat_sfc_router/configs/config.yml` to match your chosen profile:

- For the intent-based router: `objective_fn: hf_intent_objective_fn`
- For the neural network router: `objective_fn: nn_objective_fn`

See the demo README for detailed instructions on switching between routing methods.
Open your browser to: http://localhost:7860
Try sending messages with or without images to see routing decisions in real-time.
Bring up Jupyter to explore the routing methods and training pipeline:

```bash
jupyter lab --no-browser --ip 0.0.0.0 --NotebookApp.token=''
```

Open the notebooks:

- `1_IntentRouter_Example.ipynb` - Intent-based routing examples
- `2_Embedding_NN_Training.ipynb` - Train a custom neural network router
- `3_Embedding_NN_Usage.ipynb` - Use the trained neural network router
The experimental LLM Router v2 has three main components:
1. Router Backend - A service built on NVIDIA NeMo Agent Toolkit that exposes a FastAPI endpoint compatible with OpenAI's chat completions API. The router backend analyzes prompts (text and images) and returns the optimal model name. Code is available in `src/nat_sfc_router/`.
2. Routing Models - Two routing strategies are available:
   - Intent-Based Router: Uses the Qwen 1.7B model to match user intents to specific models. Requires the Qwen LLM service running on port 8011.
   - Auto-Router: Uses CLIP embeddings and a trained neural network to predict optimal models based on quality, latency, and cost metrics. Requires the CLIP service running and a trained neural network model.
3. Demo Application - An interactive Gradio web interface that demonstrates the router in action. After receiving a routing decision, the demo app calls the recommended model's API and displays results. Code is available in `demo/`.
Note: Unlike v1, v2 does not proxy requests to downstream LLMs. It only returns model recommendations. The demo app handles the actual API calls to recommended models.
The experimental v2 router provides two distinct routing approaches:
Uses a small LLM like Qwen 1.7B to match user intents to specific models.
Advantages:
- No training required
- Understands semantic intent
- Works out-of-the-box
- Easily configurable via intent mappings
Use Case: When you have clear intent categories (e.g., "visual analysis" → VLM, "code generation" → specialized LLM)
Configuration: See `src/nat_sfc_router/configs/config.yml` and `src/nat_sfc_router/functions/hf_intent_objective_fn.py`
```python
route_config = [
    {
        "name": "hard_question",
        "description": "A question that requires deep reasoning, or complex problem solving, or if the user asks for careful thinking or careful consideration",
    },
    {
        "name": "chit_chat",
        "description": "Any social chit chat, small talk, or casual conversation.",
    },
    {
        "name": "try_again",
        "description": "Only if the user explicitly says the previous answer was incorrect or incomplete.",
    },
    {
        "name": "image_understanding",
        "description": "A question that requires understanding an image.",
    },
    {
        "name": "image_question",
        "description": "A question that requires the assistant to see the user, e.g. a question about their appearance, environment, scene, or surroundings.",
    },
]

MAP_INTENT_TO_PIPELINE = {
    "other": "nvidia/nvidia-nemotron-nano-9b-v2",
    "chit_chat": "nvidia/nvidia-nemotron-nano-9b-v2",
    "hard_question": "gpt-5-chat",
    "image_understanding": "nvidia/nemotron-nano-12b-v2-vl",
    "image_question": "nvidia/nemotron-nano-12b-v2-vl",
    "try_again": "gpt-5-chat",
}
```

Uses CLIP embeddings to encode text/image pairs, then a trained neural network to predict the optimal model.
Advantages:
- Learns from actual usage patterns
- Optimizes for quality, latency, and cost
- Adapts to your specific workload
- Handles multimodal input natively
Use Case: When you have historical data and want data-driven routing decisions
Training Recommended: See the notebooks for the training pipeline:

- `2_Embedding_NN_Training.ipynb` - Training the neural network
- `3_Embedding_NN_Usage.ipynb` - Using the trained router
Note: The GitHub repository includes a pre-trained neural network; the weights are stored in `llm-router/src/nat_sfc_router/training/router_artifacts`. The notebook `2_Embedding_NN_Training.ipynb` retrains the neural network and overwrites those weights. You can run the usage notebook or the demo app with the existing pre-trained network without running the training notebook, or run the training notebook first and then use your own retrained network.
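Conceptually, the auto-router turns the CLIP embedding of a prompt into one score per candidate model and picks the best-scoring one. The toy sketch below uses a plain-Python linear layer with made-up weights; the actual implementation uses PyTorch and the trained artifacts in `router_artifacts`, so treat every name and number here as illustrative:

```python
# Toy sketch (assumption, not the repository's code): score each candidate
# model from an embedding with a single linear layer, then take the argmax.
MODELS = [
    "gpt-5-chat",
    "nvidia/nemotron-nano-12b-v2-vl",
    "nvidia/nvidia-nemotron-nano-9b-v2",
]

def score(embedding: list, weights: list, bias: list) -> list:
    """One linear layer: logits[i] = dot(weights[i], embedding) + bias[i]."""
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weights, bias)]

def pick_model(embedding: list, weights: list, bias: list) -> str:
    """Return the model whose logit is highest for this embedding."""
    logits = score(embedding, weights, bias)
    return MODELS[max(range(len(logits)), key=logits.__getitem__)]
```

The real router additionally folds quality, latency, and cost metrics into its training targets, which is what the notebooks walk through.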
```bash
docker compose up -d --build
```

This starts three services:
- router-backend (port 8001): Main routing service using NVIDIA NeMo Agent Toolkit
- qwen-router (port 8011): Qwen 1.7B model server (for intent-based routing)
- demo-app (port 7860): Interactive Gradio web interface
Access the demo at: http://localhost:7860
The experimental LLM Router v2 is structured around selecting the right model for a given request:
The router backend is built on the NVIDIA NeMo Agent Toolkit and exposes a FastAPI service at http://localhost:8001/sfc_router/chat/completions. The endpoint accepts OpenAI-compatible chat completion requests with multimodal content (text and images) and returns the name of the optimal model.
The router backend is configured via `src/nat_sfc_router/configs/config.yml`:

```yaml
functions:
  healthcheck_fn:
    _type: healthcheck
  hf_intent_objective_fn:
    _type: hf_intent_objective_fn
  nn_objective_fn:
    _type: nn_objective_fn
    model_thresholds:
      'gpt-5-chat': 0.70
      'nvidia/nemotron-nano-12b-v2-vl': 0.75
      'nvidia/nvidia-nemotron-nano-9b-v2': 0.4
    model_costs:
      'gpt-5-chat': 1.0
      'nvidia/nemotron-nano-12b-v2-vl': 0.5
      'nvidia/nvidia-nemotron-nano-9b-v2': 0.3
  sfc_router_fn:
    _type: sfc_router
    objective_fn: hf_intent_objective_fn  # <--- select routing function

workflow:
  _type: sfc_router
  objective_fn: hf_intent_objective_fn  # <--- select routing function
```

The router backend can use one of two strategies, configured by setting the `objective_fn` parameter:
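One plausible way the `model_thresholds` and `model_costs` values could combine is "cheapest model that clears its quality bar". This is an assumption for illustration only, not the repository's actual scoring logic (that lives in the `nn_objective_fn` implementation), and `select_model` is a hypothetical helper name:

```python
# Assumption: per-model quality predictions are filtered by threshold,
# then the cheapest surviving model wins. Values copied from the config.
MODEL_THRESHOLDS = {
    "gpt-5-chat": 0.70,
    "nvidia/nemotron-nano-12b-v2-vl": 0.75,
    "nvidia/nvidia-nemotron-nano-9b-v2": 0.4,
}
MODEL_COSTS = {
    "gpt-5-chat": 1.0,
    "nvidia/nemotron-nano-12b-v2-vl": 0.5,
    "nvidia/nvidia-nemotron-nano-9b-v2": 0.3,
}

def select_model(quality: dict) -> str:
    """Pick the cheapest model whose predicted quality clears its threshold."""
    eligible = [m for m, q in quality.items() if q >= MODEL_THRESHOLDS[m]]
    if not eligible:  # nothing clears its bar: fall back to highest quality
        return max(quality, key=quality.get)
    return min(eligible, key=MODEL_COSTS.get)
```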
1. Intent-Based Routing (`hf_intent_objective_fn`): Uses the Qwen 1.7B model to classify user intents and map them to models. Intent mappings are defined in `src/nat_sfc_router/functions/hf_intent_objective_fn.py`. No training required.
2. Auto-Routing (`nn_objective_fn`): Uses CLIP embeddings and a trained neural network to predict optimal models. Models are stored in `src/nat_sfc_router/training/router_artifacts/` and can be retrained.
The LLM Router v2 is compatible with OpenAI chat completion requests. Unlike v1, the router does not proxy requests to downstream models - it only returns the recommended model name. Here's an example request:
```bash
curl -X POST http://localhost:8001/sfc_router/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Explain quantum computing"
      }
    ],
    "stream": false
  }'
```

Response:
```json
{
  "id": "chatcmpl-1765473022",
  "choices": [{
    "message": {
      "content": "nvidia/nvidia-nemotron-nano-9b-v2",
      "role": "assistant"
    }
  }],
  "model": "hf_intent_objective_fn"
}
```

The selected model name is in `choices[0].message.content`. Your application is responsible for calling the recommended model's API.
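A stdlib-only client sketch of this flow (ask the router for a model name, then call that model yourself) could look like the following; `route` and `recommended_model` are hypothetical helper names, and the endpoint path matches the example above:

```python
# Client-side sketch using only the standard library; minimal error handling.
import json
import urllib.request

ROUTER_URL = "http://localhost:8001/sfc_router/chat/completions"

def recommended_model(router_response: dict) -> str:
    """Pull the model name out of the router's OpenAI-style response."""
    return router_response["choices"][0]["message"]["content"]

def route(messages: list) -> str:
    """POST the messages to the router and return the recommended model name."""
    payload = json.dumps({"messages": messages, "stream": False}).encode()
    req = urllib.request.Request(
        ROUTER_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return recommended_model(json.load(resp))

# Example (requires the router running locally):
# model = route([{"role": "user", "content": "Explain quantum computing"}])
# ...then send the original messages to that model's own API.
```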
The router supports multimodal requests with images encoded as base64 data URLs:
```bash
curl -X POST http://localhost:8001/sfc_router/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {
            "type": "image_url",
            "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}
          }
        ]
      }
    ]
  }'
```

The experimental blueprint includes several resources to help you understand, evaluate, and customize the LLM Router v2:
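Building the base64 data URL by hand is error-prone, so a small helper can produce the multimodal message shape used in the request above (`image_message` is a hypothetical name, not part of the repository):

```python
# Illustrative helper: wrap text plus raw image bytes into an OpenAI-style
# multimodal message with the image inlined as a base64 data URL.
import base64

def image_message(text: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    data_url = f"data:{mime};base64,{base64.b64encode(image_bytes).decode()}"
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }
```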
- Explore the notebooks: Three Jupyter notebooks demonstrate the routing methods and training pipeline:
  - `1_IntentRouter_Example.ipynb` - Intent-based routing examples and configuration
  - `2_Embedding_NN_Training.ipynb` - Train a custom neural network router on your data
  - `3_Embedding_NN_Usage.ipynb` - Use and evaluate trained routers
- Try the demo application: An interactive Gradio web interface in `demo/` demonstrates end-to-end routing and model calling.
- Review the source code: The router implementation is in `src/nat_sfc_router/` with detailed documentation.
- Train a custom router: Follow the notebooks to create a router optimized for your specific use case and workload.
This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
- The LLM Router v2 Blueprint doesn't generate any code that may require sandboxing.
- The LLM Router v2 Blueprint is shared as a reference and is provided "as is". Security in the production environment is the responsibility of the end users deploying it. When deploying in a production environment, have security experts review potential risks and threats; define trust boundaries; implement logging and monitoring; secure the communication channels; integrate AuthN & AuthZ with appropriate access controls; keep the deployment up to date; and ensure the containers and source code are secure and free of known vulnerabilities.
- A frontend that handles AuthN & AuthZ should be in place, as missing AuthN & AuthZ could provide ungated access to the router if it is directly exposed to, e.g., the internet.
- API keys for downstream models (OpenAI, NVIDIA Build) are configured in the demo application's `.env` file. The end users are responsible for safeguarding these credentials.
- The LLM Router doesn't require any privileged access to the system.
- The end users are responsible for ensuring the availability of their deployment.
- The end users are responsible for building the container images and keeping them up to date.
- The end users are responsible for ensuring that OSS packages used by the blueprint are current.
- The logs from the router backend and demo app are printed to standard out and include input prompts and routing decisions for development purposes. The end users are advised to handle logging securely and avoid information leakage for production use cases.
