Commit f24fc60
Sponsor: Add AMD Partnership Blog
Signed-off-by: bitliu <[email protected]>
1 parent d7d6a1c commit f24fc60

File tree

2 files changed: +216 −0 lines changed

website/blog/2025-12-10-amd.md

Lines changed: 216 additions & 0 deletions

---
slug: semantic-on-amd
title: "AMD × vLLM Semantic Router: Building Trustworthy, Evolvable Mixture-of-Models AI Infrastructure"
authors: [Xunzhuo]
tags: [amd, ecosystem, partnership, inference, routing]
---
AMD is a long-term technology partner for the **vLLM** community: from accelerating the **vLLM engine** on AMD GPUs and ROCm™ software to now exploring the next layer of the stack with **vLLM Semantic Router (VSR)**.

As AI moves from single models to **Mixture-of-Models (MoM)** architectures, the challenge shifts from "how big is your model" to **how intelligently you orchestrate many models together**.

VSR is the **intelligent brain for multi-model AI**. Together with AMD, we are building this layer to be:

* **Intelligent** – semantic understanding, intent classification, and dynamic decision-making.
* **Trustworthy** – hallucination-aware, risk-sensitive, and observable.
* **Evolvable** – continuously improving via data, feedback, and experimentation.
* **Production-ready on AMD** – from CI/CD → online playgrounds → large-scale production.

<div align="center">
<img src="/img/amd-vsr.png" alt="AMD × vLLM Semantic Router" width="100%"/>
</div>

<!-- truncate -->

---

## Strategic Focus: From Single Models to Multi-Model AI Infrastructure

In a MoM world, an enterprise AI stack typically includes:

* Router SLMs (small language models) that classify, route, and enforce policy.
* Multiple LLMs and domain-specific models (e.g., code, finance, healthcare).
* Tools, RAG pipelines, vector search, and business systems.

Without a robust routing layer, this turns into an opaque and fragile mesh.

The AMD × VSR collaboration aims to make that routing layer a **first-class, GPU-accelerated infrastructure component**, not an ad-hoc script glued between services.

---

## Near- and Mid-Term: Running VSR on AMD GPUs

In the near and mid term, the joint objective is simple and execution-oriented:

> **Deliver a production-grade VSR solution that runs efficiently on AMD GPUs.**

We are building two primary paths.

### 1. vLLM-based SLM / LLM inference on AMD GPUs

On AMD GPUs, using the vLLM engine, we will run:

* **Router SLMs**
  * Task and intent classification
  * Risk scoring and safety gating
  * Tool / workflow selection
* **LLMs and specialized models**
  * General assistants
  * Domain-specific models (e.g., financial, legal, code)

VSR sits above as the decision fabric, consuming:

* semantic similarity and embeddings,
* business metadata and domain tags,
* latency and cost constraints,
* risk and compliance requirements,

to perform **dynamic routing** across these models and endpoints.
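
To make the decision fabric concrete, here is a minimal sketch of how such signals can combine into a single routing decision. The model catalog, weights, and thresholds below are illustrative assumptions, not VSR's actual policy:

```python
# Minimal sketch of signal-driven model selection; the catalog, weights,
# and thresholds are illustrative assumptions, not VSR's actual policy.
from dataclasses import dataclass

@dataclass
class RouteSignals:
    similarity: dict[str, float]  # semantic similarity of the query to each model's domain
    domain_tags: set[str]         # business metadata attached to the request
    latency_budget_ms: int        # end-to-end latency constraint
    risk_level: float             # 0.0 (benign) .. 1.0 (high risk)

# Hypothetical model catalog: name -> (p95 latency in ms, safety rating 0..1)
CATALOG = {
    "general-llm": (800, 0.7),
    "finance-llm": (1200, 0.9),
    "code-slm": (300, 0.6),
}

def route(signals: RouteSignals) -> str:
    candidates = []
    for model, (p95_ms, safety) in CATALOG.items():
        if p95_ms > signals.latency_budget_ms:
            continue  # enforce the latency constraint up front
        if signals.risk_level > 0.8 and safety < 0.8:
            continue  # high-risk traffic only goes to high-safety models
        score = signals.similarity.get(model, 0.0) + 0.2 * safety
        if model.split("-")[0] in signals.domain_tags:
            score += 0.3  # boost models matching the request's domain tags
        candidates.append((score, model))
    if not candidates:
        return "general-llm"  # conservative fallback route
    return max(candidates)[1]

print(route(RouteSignals(
    similarity={"finance-llm": 0.92, "general-llm": 0.55},
    domain_tags={"finance"},
    latency_budget_ms=1500,
    risk_level=0.9,
)))  # -> "finance-llm"
```

The point is not the specific scoring function but that latency, risk, and semantics are all first-class inputs to the same decision.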
AMD GPUs provide the throughput and memory footprint needed to run **router SLMs + multiple LLMs** in the same cluster, supporting high-QPS workloads with stable latency instead of one-off demos.

### 2. Lightweight routing via ONNX binding on AMD GPUs

Not all routing needs a full inference stack.

For ultra-high-frequency and latency-sensitive stages at the “front door” of the system, we are also enabling:

* exporting router SLMs to **ONNX**,
* running them on AMD GPUs through ONNX Runtime / custom bindings,
* forwarding complex generative work to vLLM or other back-end LLMs.
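
As a sketch of this path, the snippet below exports a hypothetical router SLM to ONNX and serves it through ONNX Runtime's ROCm execution provider. The checkpoint name and label semantics are placeholders, and the export flags are one reasonable configuration rather than the project's canonical recipe:

```python
# Illustrative sketch: export a small classifier to ONNX and run it with
# ONNX Runtime on an AMD GPU. The checkpoint and labels are hypothetical.
import torch
import onnxruntime as ort
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "some-org/router-slm"  # hypothetical router SLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

# One-time export to ONNX with dynamic batch/sequence axes.
sample = tokenizer("What is the refund policy?", return_tensors="pt")
torch.onnx.export(
    model,
    (sample["input_ids"], sample["attention_mask"]),
    "router.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
    },
)

# Serve with the ROCm execution provider, falling back to CPU.
session = ort.InferenceSession(
    "router.onnx",
    providers=["ROCMExecutionProvider", "CPUExecutionProvider"],
)
enc = tokenizer("Write a Python quicksort", return_tensors="np")
(logits,) = session.run(
    ["logits"],
    {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]},
)
print(logits.argmax(-1))  # predicted route / intent class id
```

Heavy generative calls then go to vLLM, while this front-door classifier stays cheap enough to run on every request.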
This lightweight path is designed for:

* front-of-funnel traffic classification and triage,
* large-scale policy evaluation and offline experiments,
* enterprises that want to **standardize on AMD GPUs while keeping model providers flexible**.

---

## Frontier Capabilities: Hallucination Detection and Online Learning

Routing is not just “task → model A/B”. To make MoM systems truly usable in production, we are pushing VSR into more advanced territory, with AMD GPUs as the execution engine.

### Hallucination-aware and risk-aware routing

We are developing **hallucination detection and factuality checks** as part of the routing layer itself:

* Dedicated classifiers and NLI-style models to evaluate:
  * factual consistency between query, context, and answer,
  * risk level by domain (e.g., finance, healthcare, legal, compliance).
* Policies that, based on these signals, can:
  * route to retrieval-augmented pipelines,
  * switch to a more conservative model,
  * require additional verification or human review,
  * lower the temperature or adjust generation strategy.
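
A minimal sketch of how such an NLI-style check can drive policy, assuming a public MNLI checkpoint and illustrative thresholds and action names:

```python
# Sketch of an NLI-style hallucination gate in front of a response.
# Threshold values and action names are illustrative assumptions.
from transformers import pipeline

# Any NLI model with entailment/contradiction labels works here; this
# well-known public MNLI checkpoint is just a convenient example.
nli = pipeline("text-classification", model="roberta-large-mnli")

def factuality_action(context: str, answer: str, risk_level: float) -> str:
    """Map an entailment score to a routing policy action."""
    result = nli({"text": context, "text_pair": answer})
    result = result[0] if isinstance(result, list) else result
    entailed = result["label"] == "ENTAILMENT"

    if entailed and result["score"] > 0.9:
        return "serve"                # answer is grounded in the context
    if risk_level > 0.7:
        return "human_review"         # high-risk domains get a person
    if not entailed:
        return "retry_with_rag"       # re-route through retrieval
    return "conservative_model"       # fall back to a safer model

action = factuality_action(
    context="The refund window is 30 days from delivery.",
    answer="You can get a refund within 90 days.",
    risk_level=0.4,
)
print(action)  # likely "retry_with_rag" for this contradiction
```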
AMD GPUs enable these checks to be run **inline** at scale, rather than as slow, offline audits. The result is a **“hallucination-aware router”** that enforces trust boundaries around LLM responses.

### Online learning and adaptive routing

We also view routing as something that should **learn and adapt over time**, not remain hard-coded:

* Using user feedback and task outcomes as signals.
* Tracking production KPIs (resolution rate, time-to-answer, escalation rate, etc.).
* Running continuous experiments to compare routing strategies and model choices.

AMD GPU clusters give us the capacity to:

* replay real or synthetic workloads at scale,
* run controlled A/B tests on routing policies,
* train and update small “router brain” models with fast iteration cycles.
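
One concrete shape this can take is a bandit-style loop that shifts traffic toward routes with better observed outcomes. The sketch below uses epsilon-greedy selection with made-up models and reward weights; it illustrates the feedback loop, not VSR's actual learning machinery:

```python
# Sketch of feedback-driven route selection as an epsilon-greedy bandit.
# Models, reward shaping, and epsilon are illustrative assumptions.
import random
from collections import defaultdict

MODELS = ["general-llm", "finance-llm", "code-slm"]
EPSILON = 0.1  # fraction of traffic used to explore alternatives

counts = defaultdict(int)     # times each model was chosen
rewards = defaultdict(float)  # cumulative feedback per model

def choose_model() -> str:
    if random.random() < EPSILON:
        return random.choice(MODELS)  # explore
    # Exploit: pick the model with the best observed mean reward.
    return max(MODELS, key=lambda m: rewards[m] / counts[m] if counts[m] else 0.0)

def record_feedback(model: str, resolved: bool, latency_s: float) -> None:
    """Turn task outcome plus latency into a scalar reward signal."""
    counts[model] += 1
    rewards[model] += (1.0 if resolved else 0.0) - 0.05 * latency_s

# Simulated feedback loop over replayed traffic.
for _ in range(1000):
    m = choose_model()
    resolved = random.random() < {"general-llm": 0.7, "finance-llm": 0.8, "code-slm": 0.6}[m]
    record_feedback(m, resolved, latency_s=random.uniform(0.2, 2.0))

print({m: round(rewards[m] / max(counts[m], 1), 3) for m in MODELS})
```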
This turns VSR into a **self-optimizing control plane** for multi-model systems, not just a rules engine.

---

## Long-Term Objective 1: Training a Next-Generation Encoder Model on AMD GPUs

As a longer-term goal of this collaboration, we aim to explore training a **next-generation encoder-only model** on AMD GPUs, optimized for semantic routing, retrieval-augmented generation (RAG), and safety classification.

While recent encoder models (e.g., ModernBERT) show strong performance, they remain limited in context length, multilingual coverage, and alignment with emerging long-context attention techniques. This effort focuses on advancing encoder capabilities using AMD hardware, particularly for **long-context, high-throughput representation learning**.

The outcome will be an **open encoder model** designed to integrate with vLLM Semantic Router and modern AI pipelines, strengthening the retrieval and routing layers of AI systems while expanding hardware-diverse training and deployment options for the community and industry.

---

## Long-Term Objective 2: Community Public Beta for vLLM Semantic Router on AMD Infrastructure

As part of this long-term collaboration, each major release of vLLM Semantic Router will be accompanied by a **public beta environment** hosted on **AMD-sponsored infrastructure**, available free of charge to the community.

These public betas will allow users to:

* validate new routing, caching, and safety features,
* gain hands-on experience with Semantic Router running on AMD GPUs,
* provide early feedback that helps improve performance, usability, and system design.

By lowering the barrier to experimentation and validation, this initiative aims to strengthen the vLLM ecosystem, accelerate real-world adoption, and ensure that new Semantic Router capabilities are shaped by community input before broader production deployment.

---

## Long-Term Objective 3: AMD GPU–Powered CI/CD and End-to-End Testbed for VSR OSS

In the long run, we want AMD GPUs to underpin how **VSR as an open-source project is built, validated, and shipped**.

We are designing a GPU-backed **CI/CD and end-to-end testbed** where:

* Router SLMs, LLMs, domain models, retrieval, and tools run together on AMD GPU clusters.
* Multi-domain, multi-risk-level datasets are replayed as traffic.
* Each VSR change runs through an automated evaluation pipeline, including:
  * routing and policy regression tests,
  * A/B comparisons of new vs. previous strategies,
  * stress tests on latency, cost, and scalability,
  * focused suites for hallucination mitigation and compliance behavior.
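
As a flavor of what such a pipeline can check, here is a small, self-contained sketch of a routing regression gate. The golden cases and the inline `route()` stub are invented stand-ins; in the real testbed, the harness would call a live VSR deployment on AMD GPUs:

```python
# Sketch of an automated routing regression check, of the kind a GPU-backed
# CI pipeline could run on every change. Golden cases and the router stub
# below are illustrative; real CI would exercise a deployed VSR instance.
from dataclasses import dataclass

@dataclass
class Decision:
    model: str
    safety_checked: bool

def route(prompt: str) -> Decision:
    """Stand-in for the router under test."""
    if "invest" in prompt or "portfolio" in prompt:
        return Decision("finance-llm", safety_checked=True)
    if "def " in prompt or "function" in prompt:
        return Decision("code-slm", safety_checked=False)
    return Decision("general-llm", safety_checked=False)

GOLDEN = [  # (prompt, expected model, high risk?)
    ("Should I invest my savings in bonds?", "finance-llm", True),
    ("Write a function to reverse a list", "code-slm", False),
    ("Tell me about the Rhine river", "general-llm", False),
]

failures = []
for prompt, expected, high_risk in GOLDEN:
    d = route(prompt)
    if d.model != expected:
        failures.append(f"route mismatch: {prompt!r} -> {d.model}, want {expected}")
    if high_risk and not d.safety_checked:
        failures.append(f"high-risk prompt skipped safety gate: {prompt!r}")

# CI gates the merge on an empty failure list (plus latency/cost suites).
print("PASS" if not failures else "\n".join(failures))
```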
167+
168+
The target state is clear:
169+
170+
> **Every VSR release comes with a reproducible, GPU-driven evaluation report, not just a changelog.**
171+
172+
AMD GPUs, in this model, are not only for serving models; they are the **verification engine for the routing infrastructure itself**.
173+
174+
---
175+
176+
## Long-Term Objective 4: An AMD-Backed Mixture-of-Models Playground
177+
178+
In parallel, we are planning an **online Mixture-of-Models playground** powered by AMD GPUs, open to the community and partners.
179+
180+
This playground will allow users to:
181+
182+
* Experiment with different routing strategies and model topologies under real workloads.
183+
* Observe, in a visual way, how VSR decides:
184+
185+
* which model to call,
186+
* when to retrieve,
187+
* when to apply additional checks or fallbacks.
188+
* Compare **quality, latency, and cost trade-offs** across configurations.
189+
190+
For model vendors, tool builders, and platform providers, this becomes a **neutral, AMD GPU–backed test environment** to:
191+
192+
* integrate their components into a MoM stack,
193+
* benchmark under realistic routing and governance constraints,
194+
* showcase capabilities within a transparent, observable system.
195+
196+
---
197+
198+
## Why This Collaboration Matters
199+
200+
Through the AMD × vLLM Semantic Router collaboration, we are aiming beyond “does this model run on this GPU”.
201+
202+
The joint ambitions are:
203+
204+
* To define a **reference architecture for intelligent, GPU-accelerated routing** on AMD platforms, including:
205+
206+
* vLLM-based inference paths,
207+
* ONNX-based lightweight router paths,
208+
* multi-model coordination and safety enforcement.
209+
* To treat routing as **trusted infrastructure**, supported by:
210+
211+
* GPU-powered CI/CD and end-to-end evaluation,
212+
* hallucination-aware and risk-aware policies,
213+
* online learning and adaptive strategies.
214+
* To provide the ecosystem with a **long-lived, AMD GPU–backed MoM playground** where ideas, models, and routing policies can be tested and evolved in the open.
215+
216+
In short, this is about **co-building trustworthy, evolvable multi-model AI infrastructure**—with AMD GPUs as a core execution and validation layer, and vLLM Semantic Router as the intelligent control plane that makes the entire system understandable, governable, and ready for real workloads.

website/static/img/amd-vsr.png

689 KB