⚠️ EXPERIMENTAL BRANCH: This branch contains LLM Router v2, a next-generation routing system with multimodal support. For the production-ready LLM Router v1, please visit the main branch.
LLM Router v2 is currently experimental and not yet backwards compatible with v1. Key differences:
| Feature | v1 (Main Branch) | v2 (Experimental) |
|---|---|---|
| Server Implementation | Rust proxy | NVIDIA NeMo Agent Toolkit (FastAPI) |
| Inference Backend | BERT model + NVIDIA Triton Inference Server | Qwen 1.7B LLM or CLIP + Neural Network |
| Functionality | Classification + Proxying to LLM | Classification only (returns model name) |
| Input Support | Text only | Text + Images (multimodal) |
| Routing Methods | Task or complexity classification | Intent-based or Auto-routing (neural network) |
Future Plans: The intent is to make v2 fully backwards compatible with v1's proxying capabilities, then merge to main and retire the experimental label.
Ever struggled to decide which LLM or Vision-Language Model (VLM) to use for a specific task? In an ideal world the most accurate model would also be the cheapest and fastest, but in practice modern agentic AI systems have to make trade-offs between accuracy, speed, and cost.
This blueprint provides an experimental next-generation router that automates these tradeoffs by analyzing user prompts and identifying optimal models. Given a user prompt (text or multimodal), the router:
- applies one of two routing strategies: intent-based classification or auto-routing (based on a trained neural network)
- analyzes the prompt content, including images if present
- returns the name of the most appropriate LLM or VLM for the task
For example, using intent-based routing:
| User Prompt | Intent Classification | Recommended Model |
|---|---|---|
| "What's in this image?" (with image) | image_understanding | nvidia/nemotron-nano-12b-v2-vl |
| "Solve this complex math problem: ..." | hard_question | gpt-5-chat |
| "Hello, how are you?" | chit_chat | nvidia/nvidia-nemotron-nano-9b-v2 |
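Once an intent label is classified, a routing decision like those in the table above reduces to a dictionary lookup. A minimal sketch (`resolve_model` is a hypothetical helper name; the dictionary mirrors the mapping shipped in the intent router's source):

```python
# Illustrative only: intents and model names mirror the example table; the
# real mapping lives in the intent router's configuration.
MAP_INTENT_TO_PIPELINE = {
    "other": "nvidia/nvidia-nemotron-nano-9b-v2",
    "chit_chat": "nvidia/nvidia-nemotron-nano-9b-v2",
    "hard_question": "gpt-5-chat",
    "image_understanding": "nvidia/nemotron-nano-12b-v2-vl",
    "image_question": "nvidia/nemotron-nano-12b-v2-vl",
    "try_again": "gpt-5-chat",
}

def resolve_model(intent: str) -> str:
    # Unknown intents fall back to the small, cheap default model ("other").
    return MAP_INTENT_TO_PIPELINE.get(intent, MAP_INTENT_TO_PIPELINE["other"])
```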
The key features of the experimental LLM Router v2 are:
- Multimodal Support: Route based on both text and images, optimized for VLMs
- Two Routing Strategies: Intent-based (using Qwen 1.7B) OR auto-routing (using CLIP embeddings + trained neural network)
- OpenAI API compliant: Returns model recommendations via chat completions endpoint
- Flexible: Use pre-configured intent mappings or train custom neural network routers on your own data
This blueprint is pre-configured to route between three complementary models:
| Model | Type | Provider | Use Case |
|---|---|---|---|
| gpt-5-chat | Frontier LLM | Azure OpenAI or OpenAI | Complex reasoning, hard questions |
| nvidia/nemotron-nano-12b-v2-vl | Open VLM | NVIDIA Build API | Multimodal queries, image understanding |
| nvidia/nvidia-nemotron-nano-9b-v2 | Small Open LLM | NVIDIA Build API | Simple text queries, chit chat |
- gpt-5-chat: Can be sourced from Azure OpenAI (default) or standard OpenAI API
- Nemotron models: Configured to use NVIDIA Build API endpoints (hosted), but can also use locally deployed NVIDIA NIMs for on-premise deployment
The three default models are examples only - you can route to any models by (1) updating the intent router's configuration or (2) retraining the auto-router.
The main goal of the LLM router is to intelligently route across frontier and open models to optimize the cost-quality-latency tradeoff.
This experimental blueprint is for:
- AI Engineers and Developers: Developers interested in exploring next-generation routing approaches with multimodal support.
- MLOps Teams: Teams interested in learning-based routing optimization and custom model selection strategies.
- Research Teams: Teams evaluating different routing strategies for production deployment.
- Linux operating systems (Ubuntu 22.04 or later recommended) or macOS
- Docker
- Docker Compose
- For local development: Python 3.12+ and uv package manager
```bash
git clone https://github.com/NVIDIA-AI-Blueprints/llm-router
cd llm-router
git checkout experimental  # or the appropriate v2 branch name
```
NVIDIA Build API key
- Navigate to NVIDIA API Catalog
- Click one of the models, such as nemotron-nano-12b-v2-vl
- Select the "Python" input option
- Click "Get API Key"
- Click "Generate Key" and copy the resulting key (starts with `nvapi-`)
Azure OpenAI API access
This project uses Azure OpenAI for the GPT-5-chat model. You'll need:
- An Azure subscription with Azure OpenAI service enabled
- An Azure OpenAI resource deployed with the `gpt-5-chat` model
- Your Azure OpenAI endpoint URL (format: `https://your-resource-name.openai.azure.com/`)
- Your Azure OpenAI API key (found in the Azure portal under your resource's "Keys and Endpoint")
Set these environment variables:
```bash
export AZURE_OPENAI_ENDPOINT="https://your-resource-name.openai.azure.com/"
export OPENAI_API_KEY="your-azure-openai-api-key"
```
Using regular OpenAI instead: If you prefer to use OpenAI's API instead of Azure OpenAI, you'll need to update:
- `demo/app.py`: Change `call_model_azure_openai()` to use the `OpenAI()` client with `base_url="https://api.openai.com/v1"`, and update the model provider from `"azure_openai"` to `"openai"`
- `demo/env_template.txt` and `demo/.env`: Replace `AZURE_OPENAI_ENDPOINT` with `OPENAI_API_KEY=sk-...`
- `src/nat_sfc_router/training/prepare_hf_data.py`: Replace the `AzureOpenAI` client initialization with the `OpenAI` client
- `2_Embedding_NN_Training.ipynb`: Update cells that reference `AZURE_OPENAI_ENDPOINT` and the `AzureOpenAI` client
- Get your OpenAI API key from the OpenAI Platform
For the Qwen 1.7B model:
| GPU | Family | Memory | # of GPUs (min.) |
|---|---|---|---|
| T4 or newer | Any | 16GB | 1 |
For training and using the auto-router (CLIP + Neural Network):
| Component | GPU Required | Memory | Notes |
|---|---|---|---|
| CLIP Embedding Server | Yes | 8GB+ | NVIDIA NVClip NIM (required for generating embeddings) |
| Neural Network Training | Optional | 4GB+ (if GPU) | Can run on CPU, but GPU accelerates training |
| Neural Network Inference | No | N/A | Router inference runs on CPU |
Note: Training the auto-router requires:
- A running CLIP server (GPU required) to generate embeddings from text and images
- PyTorch for neural network training (GPU optional but recommended for faster training)
- Once trained, the router artifacts can be used for inference on CPU-only systems
After meeting the prerequisites, follow these steps to start a demo chat application that uses the intent-based router and supports multimodal inputs:
Create a `.env` file in the project root:

```bash
# API Keys
OPENAI_API_KEY=sk-your-openai-key-here
NVIDIA_API_KEY=nvapi-your-nvidia-key-here
```

Option A: Intent-Based Router (Default, Recommended for Getting Started)

```bash
docker compose --profile intent up -d --build
```

This starts three services:
- router-backend (port 8001): Main routing service using NVIDIA NeMo Agent Toolkit
- qwen-router (port 8011): Qwen 1.7B model server for intent-based routing
- demo-app (port 7860): Interactive web interface
Option B: Neural Network Router

```bash
docker compose --profile nn up -d --build
```

This starts three services:
- router-backend (port 8001): Main routing service using NVIDIA NeMo Agent Toolkit
- clip-server (port 51000): CLIP embedding server for neural network routing
- demo-app (port 7860): Interactive web interface
Note: You must also update the `objective_fn` in `src/nat_sfc_router/configs/config.yml` to match your chosen profile:

- For the intent-based router: `objective_fn: hf_intent_objective_fn`
- For the neural network router: `objective_fn: nn_objective_fn`

See the demo README for detailed instructions on switching between routing methods.
Open your browser to: http://localhost:7860
Try sending messages with or without images to see routing decisions in real-time.
Bring up Jupyter to explore the routing methods and training pipeline:

```bash
jupyter lab --no-browser --ip 0.0.0.0 --NotebookApp.token=''
```

Open the notebooks:

- `1_IntentRouter_Example.ipynb` - Intent-based routing examples
- `2_Embedding_NN_Training.ipynb` - Train a custom neural network router
- `3_Embedding_NN_Usage.ipynb` - Use the trained neural network router
The experimental LLM Router v2 has three main components:
1. Router Backend - A service built on NVIDIA NeMo Agent Toolkit that exposes a FastAPI endpoint compatible with OpenAI's chat completions API. The router backend analyzes prompts (text and images) and returns the optimal model name. Code is available in `src/nat_sfc_router/`.
2. Routing Models - Two routing strategies are available:
   - Intent-Based Router: Uses the Qwen 1.7B model to match user intents to specific models. Requires the Qwen LLM service running on port 8011.
   - Auto-Router: Uses CLIP embeddings and a trained neural network to predict optimal models based on quality, latency, and cost metrics. Requires the CLIP service running and a trained neural network model.
3. Demo Application - An interactive Gradio web interface that demonstrates the router in action. After receiving a routing decision, the demo app calls the recommended model's API and displays results. Code is available in `demo/`.
Note: Unlike v1, v2 does not proxy requests to downstream LLMs. It only returns model recommendations. The demo app handles the actual API calls to recommended models.
The experimental v2 router provides two distinct routing approaches:
Uses a small LLM like Qwen 1.7B to match user intents to specific models.
Advantages:
- No training required
- Understands semantic intent
- Works out-of-the-box
- Easily configurable via intent mappings
Use Case: When you have clear intent categories (e.g., "visual analysis" → VLM, "code generation" → specialized LLM)
Configuration: See `src/nat_sfc_router/configs/config.yml` and `src/nat_sfc_router/functions/hf_intent_objective_fn.py`
```python
route_config = [
    {
        "name": "hard_question",
        "description": "A question that requires deep reasoning, or complex problem solving, or if the user asks for careful thinking or careful consideration",
    },
    {
        "name": "chit_chat",
        "description": "Any social chit chat, small talk, or casual conversation.",
    },
    {
        "name": "try_again",
        "description": "Only if the user explicitly says the previous answer was incorrect or incomplete.",
    },
    {
        "name": "image_understanding",
        "description": "A question that requires understanding an image.",
    },
    {
        "name": "image_question",
        "description": "A question that requires the assistant to see the user, e.g. a question about their appearance, environment, scene, or surroundings.",
    },
]

MAP_INTENT_TO_PIPELINE = {
    "other": "nvidia/nvidia-nemotron-nano-9b-v2",
    "chit_chat": "nvidia/nvidia-nemotron-nano-9b-v2",
    "hard_question": "gpt-5-chat",
    "image_understanding": "nvidia/nemotron-nano-12b-v2-vl",
    "image_question": "nvidia/nemotron-nano-12b-v2-vl",
    "try_again": "gpt-5-chat",
}
```

Uses CLIP embeddings to encode text/image pairs, then a trained neural network to predict the optimal model.
Advantages:
- Learns from actual usage patterns
- Optimizes for quality, latency, and cost
- Adapts to your specific workload
- Handles multimodal input natively
Use Case: When you have historical data and want data-driven routing decisions
Training Recommended: See the notebooks for the training pipeline:

- `2_Embedding_NN_Training.ipynb` - Training the neural network
- `3_Embedding_NN_Usage.ipynb` - Using the trained router
Note: The GitHub repository includes a pre-trained neural network; the weights are stored in `llm-router/src/nat_sfc_router/training/router_artifacts`. The notebook `2_Embedding_NN_Training.ipynb` retrains the neural network and overwrites those weights. You can run the usage notebook or the demo app with the existing pre-trained network without running the training notebook, or run the training notebook first and then use your own retrained network.
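Conceptually, the auto-router turns the CLIP embedding of a prompt into one score per candidate model and picks the best-scoring one. The toy sketch below uses a plain-Python linear layer with made-up weights; the actual implementation uses PyTorch and the trained artifacts in `router_artifacts`, so treat every name and number here as illustrative:

```python
# Toy sketch (assumption, not the repository's code): score each candidate
# model from an embedding with a single linear layer, then take the argmax.
MODELS = [
    "gpt-5-chat",
    "nvidia/nemotron-nano-12b-v2-vl",
    "nvidia/nvidia-nemotron-nano-9b-v2",
]

def score(embedding: list, weights: list, bias: list) -> list:
    """One linear layer: logits[i] = dot(weights[i], embedding) + bias[i]."""
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weights, bias)]

def pick_model(embedding: list, weights: list, bias: list) -> str:
    """Return the model whose logit is highest for this embedding."""
    logits = score(embedding, weights, bias)
    return MODELS[max(range(len(logits)), key=logits.__getitem__)]
```

The real router additionally folds quality, latency, and cost metrics into its training targets, which is what the notebooks walk through.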
```bash
docker compose up -d --build
```

This starts three services:
- router-backend (port 8001): Main routing service using NVIDIA NeMo Agent Toolkit
- qwen-router (port 8011): Qwen 1.7B model server (for intent-based routing)
- demo-app (port 7860): Interactive Gradio web interface
Access the demo at: http://localhost:7860
The experimental LLM Router v2 is structured around selecting the right model for a given request:
The router backend is built on the NVIDIA NeMo Agent Toolkit and exposes a FastAPI service at http://localhost:8001/sfc_router/chat/completions. The endpoint accepts OpenAI-compatible chat completion requests with multimodal content (text and images) and returns the name of the optimal model.
The router backend is configured via `src/nat_sfc_router/configs/config.yml`:

```yaml
functions:
  healthcheck_fn:
    _type: healthcheck
  hf_intent_objective_fn:
    _type: hf_intent_objective_fn
  nn_objective_fn:
    _type: nn_objective_fn
    model_thresholds:
      'gpt-5-chat': 0.70
      'nvidia/nemotron-nano-12b-v2-vl': 0.75
      'nvidia/nvidia-nemotron-nano-9b-v2': 0.4
    model_costs:
      'gpt-5-chat': 1.0
      'nvidia/nemotron-nano-12b-v2-vl': 0.5
      'nvidia/nvidia-nemotron-nano-9b-v2': 0.3
  sfc_router_fn:
    _type: sfc_router
    objective_fn: hf_intent_objective_fn  # <--- select routing function

workflow:
  _type: sfc_router
  objective_fn: hf_intent_objective_fn  # <--- select routing function
```

The router backend can use one of two strategies, configured by setting the `objective_fn` parameter:
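One plausible way the `model_thresholds` and `model_costs` values could combine is "cheapest model that clears its quality bar". This is an assumption for illustration only, not the repository's actual scoring logic (that lives in the `nn_objective_fn` implementation), and `select_model` is a hypothetical helper name:

```python
# Assumption: per-model quality predictions are filtered by threshold,
# then the cheapest surviving model wins. Values copied from the config.
MODEL_THRESHOLDS = {
    "gpt-5-chat": 0.70,
    "nvidia/nemotron-nano-12b-v2-vl": 0.75,
    "nvidia/nvidia-nemotron-nano-9b-v2": 0.4,
}
MODEL_COSTS = {
    "gpt-5-chat": 1.0,
    "nvidia/nemotron-nano-12b-v2-vl": 0.5,
    "nvidia/nvidia-nemotron-nano-9b-v2": 0.3,
}

def select_model(quality: dict) -> str:
    """Pick the cheapest model whose predicted quality clears its threshold."""
    eligible = [m for m, q in quality.items() if q >= MODEL_THRESHOLDS[m]]
    if not eligible:  # nothing clears its bar: fall back to highest quality
        return max(quality, key=quality.get)
    return min(eligible, key=MODEL_COSTS.get)
```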
1. Intent-Based Routing (`hf_intent_objective_fn`): Uses the Qwen 1.7B model to classify user intents and map them to models. Intent mappings are defined in `src/nat_sfc_router/functions/hf_intent_objective_fn.py`. No training required.
2. Auto-Routing (`nn_objective_fn`): Uses CLIP embeddings and a trained neural network to predict optimal models. Models are stored in `src/nat_sfc_router/training/router_artifacts/` and can be retrained.
The LLM Router v2 is compatible with OpenAI chat completion requests. Unlike v1, the router does not proxy requests to downstream models - it only returns the recommended model name. Here's an example request:
```bash
curl -X POST http://localhost:8001/sfc_router/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Explain quantum computing"
      }
    ],
    "stream": false
  }'
```

Response:
```json
{
  "id": "chatcmpl-1765473022",
  "choices": [{
    "message": {
      "content": "nvidia/nvidia-nemotron-nano-9b-v2",
      "role": "assistant"
    }
  }],
  "model": "hf_intent_objective_fn"
}
```

The selected model name is in `choices[0].message.content`. Your application is responsible for calling the recommended model's API.
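A stdlib-only client sketch of this flow (ask the router for a model name, then call that model yourself) could look like the following; `route` and `recommended_model` are hypothetical helper names, and the endpoint path matches the example above:

```python
# Client-side sketch using only the standard library; minimal error handling.
import json
import urllib.request

ROUTER_URL = "http://localhost:8001/sfc_router/chat/completions"

def recommended_model(router_response: dict) -> str:
    """Pull the model name out of the router's OpenAI-style response."""
    return router_response["choices"][0]["message"]["content"]

def route(messages: list) -> str:
    """POST the messages to the router and return the recommended model name."""
    payload = json.dumps({"messages": messages, "stream": False}).encode()
    req = urllib.request.Request(
        ROUTER_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return recommended_model(json.load(resp))

# Example (requires the router running locally):
# model = route([{"role": "user", "content": "Explain quantum computing"}])
# ...then send the original messages to that model's own API.
```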
The router supports multimodal requests with images encoded as base64 data URLs:
```bash
curl -X POST http://localhost:8001/sfc_router/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {
            "type": "image_url",
            "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}
          }
        ]
      }
    ]
  }'
```

The experimental blueprint includes several resources to help you understand, evaluate, and customize the LLM Router v2:
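Building the base64 data URL by hand is error-prone, so a small helper can produce the multimodal message shape used in the request above (`image_message` is a hypothetical name, not part of the repository):

```python
# Illustrative helper: wrap text plus raw image bytes into an OpenAI-style
# multimodal message with the image inlined as a base64 data URL.
import base64

def image_message(text: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    data_url = f"data:{mime};base64,{base64.b64encode(image_bytes).decode()}"
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }
```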
- Explore the notebooks: Three Jupyter notebooks demonstrate the routing methods and training pipeline:
  - `1_IntentRouter_Example.ipynb` - Intent-based routing examples and configuration
  - `2_Embedding_NN_Training.ipynb` - Train a custom neural network router on your data
  - `3_Embedding_NN_Usage.ipynb` - Use and evaluate trained routers
- Try the demo application: An interactive Gradio web interface in `demo/` demonstrates end-to-end routing and model calling.
- Review the source code: The router implementation is in `src/nat_sfc_router/` with detailed documentation.
- Train a custom router: Follow the notebooks to create a router optimized for your specific use case and workload.
This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
- The LLM Router v2 Blueprint doesn't generate any code that may require sandboxing.
- The LLM Router v2 Blueprint is shared as a reference and is provided "as is". Security in the production environment is the responsibility of the end users deploying it. When deploying in a production environment, have security experts review potential risks and threats; define trust boundaries; implement logging and monitoring; secure the communication channels; integrate AuthN & AuthZ with appropriate access controls; keep the deployment up to date; and ensure the containers and source code are secure and free of known vulnerabilities.
- A frontend that handles AuthN & AuthZ should be in place, as missing AuthN & AuthZ could provide ungated access to the router if it is directly exposed to, e.g., the internet.
- API keys for downstream models (OpenAI, NVIDIA Build) are configured in the demo application's `.env` file. The end users are responsible for safeguarding these credentials.
- The LLM Router doesn't require any privileged access to the system.
- The end users are responsible for ensuring the availability of their deployment.
- The end users are responsible for building the container images and keeping them up to date.
- The end users are responsible for ensuring that OSS packages used by the blueprint are current.
- The logs from the router backend and demo app are printed to standard out and include input prompts and routing decisions for development purposes. The end users are advised to handle logging securely and avoid information leakage for production use cases.
