[Doc] Add AMD partnership blog post and sponsors section #798
Open · Xunzhuo wants to merge 1 commit into main from docs/add-amd-partnership-blog (+216 −0)

---
slug: semantic-on-amd
title: "AMD × vLLM Semantic Router: Building Trustworthy, Evolvable Mixture-of-Models AI Infrastructure"
authors: [Xunzhuo]
tags: [amd, ecosystem, partnership, inference, routing]
---

AMD is a long-term technology partner for the **vLLM** community: from accelerating the **vLLM engine** on AMD GPUs and ROCm™ software to now exploring the next layer of the stack with **vLLM Semantic Router (VSR)**.

As AI moves from single models to **Mixture-of-Models (MoM)** architectures, the challenge shifts from "how big is your model" to **how intelligently you orchestrate many models together**.

VSR is the **intelligent brain for multi-model AI**. Together with AMD, we are building this layer to be:

* **Intelligent** – semantic understanding, intent classification, and dynamic decision-making.
* **Trustworthy** – hallucination-aware, risk-sensitive, and observable.
* **Evolvable** – continuously improving via data, feedback, and experimentation.
* **Production-ready on AMD** – from CI/CD → online playgrounds → large-scale production.

<div align="center">
  <img src="/img/amd-vsr.png" alt="AMD × vLLM Semantic Router" width="100%"/>
</div>

<!-- truncate -->

---

## Strategic Focus: From Single Models to Multi-Model AI Infrastructure

In a MoM world, an enterprise AI stack typically includes:

* Router SLMs (small language models) that classify, route, and enforce policy.
* Multiple LLMs and domain-specific models (e.g., code, finance, healthcare).
* Tools, RAG pipelines, vector search, and business systems.

Without a robust routing layer, this turns into an opaque and fragile mesh.

The AMD × VSR collaboration aims to make that routing layer a **first-class, GPU-accelerated infrastructure component**, not an ad-hoc script glued between services.

---

## Near- and Mid-Term: Running VSR on AMD GPUs

In the near and mid term, the joint objective is simple and execution-oriented:

> **Deliver a production-grade VSR solution that runs efficiently on AMD GPUs.**

We are building two primary paths.

### 1. vLLM-based SLM / LLM inference on AMD GPUs

On AMD GPUs, using the vLLM engine, we will run:

* **Router SLMs**

  * Task and intent classification
  * Risk scoring and safety gating
  * Tool / workflow selection
* **LLMs and specialized models**

  * General assistants
  * Domain-specific models (e.g., financial, legal, code)

VSR sits above as the decision fabric, consuming:

* semantic similarity and embeddings,
* business metadata and domain tags,
* latency and cost constraints,
* risk and compliance requirements,

to perform **dynamic routing** across these models and endpoints.

AMD GPUs provide the throughput and memory footprint needed to run **router SLMs + multiple LLMs** in the same cluster, supporting high-QPS workloads with stable latency instead of one-off demos.

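To make this concrete, here is a minimal Python sketch of what such a routing path could look like, assuming a router SLM and two backend LLMs are already served through vLLM's OpenAI-compatible API. The model names, ports, and category labels are illustrative assumptions, not VSR's actual implementation.

```python
# Illustrative routing sketch. Assumes a router SLM and two backend LLMs are
# already served via vLLM's OpenAI-compatible API, e.g.:
#   vllm serve <router-slm-checkpoint>  --port 8001
#   vllm serve <code-llm-checkpoint>    --port 8002
#   vllm serve <general-llm-checkpoint> --port 8003
# Model names, ports, and labels below are hypothetical.
from openai import OpenAI

ROUTER = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

# label -> (endpoint, model name); both hypothetical
BACKENDS = {
    "code":    ("http://localhost:8002/v1", "code-llm"),
    "general": ("http://localhost:8003/v1", "general-llm"),
}


def classify_intent(query: str) -> str:
    """Ask the router SLM for a one-word category label."""
    resp = ROUTER.chat.completions.create(
        model="router-slm",
        messages=[
            {"role": "system",
             "content": "Classify the user query with one word: code or general."},
            {"role": "user", "content": query},
        ],
        max_tokens=4,
        temperature=0.0,
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in BACKENDS else "general"


def route(query: str) -> str:
    """Send the query to the backend selected by the router SLM."""
    base_url, model = BACKENDS[classify_intent(query)]
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content


print(route("Write a Python function that reverses a linked list."))
```

Because the router SLM is small, the classification hop adds little latency relative to generation itself; VSR's real decision logic is far richer, folding in the cost, latency, and risk signals listed above.
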
### 2. Lightweight routing via ONNX binding on AMD GPUs

Not all routing needs a full inference stack.

For ultra-high-frequency and latency-sensitive stages at the “front door” of the system, we are also enabling:

* Exporting router SLMs to **ONNX**,
* Running them on AMD GPUs through ONNX Runtime / custom bindings (see the sketch after this list),
* Forwarding complex generative work to vLLM or other back-end LLMs.

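As a sketch of this lightweight path, a classifier-style router SLM exported to ONNX could be scored through ONNX Runtime's ROCm execution provider. The model file, checkpoint name, input names, and label set here are assumptions for illustration.

```python
# Lightweight ONNX routing sketch. The model file, checkpoint name, input
# names, and label set are hypothetical; the provider names are real.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Prefer the ROCm execution provider on AMD GPUs; fall back to CPU.
session = ort.InferenceSession(
    "router-slm.onnx",
    providers=["ROCMExecutionProvider", "CPUExecutionProvider"],
)
tokenizer = AutoTokenizer.from_pretrained("router-slm")  # hypothetical checkpoint
LABELS = ["code", "finance", "general"]                  # hypothetical label set


def classify(query: str) -> str:
    """One forward pass of the exported router SLM; returns a category label."""
    enc = tokenizer(query, return_tensors="np", truncation=True, max_length=512)
    (logits,) = session.run(None, {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    })
    return LABELS[int(logits[0].argmax())]
```

A classifier this small can sit at the front door and triage every request, forwarding only genuinely generative work to the vLLM-served LLMs.
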
This lightweight path is designed for:

* front-of-funnel traffic classification and triage,
* large-scale policy evaluation and offline experiments,
* enterprises that want to **standardize on AMD GPUs while keeping model providers flexible**.

---

## Frontier Capabilities: Hallucination Detection and Online Learning

Routing is not just “task → model A/B”. To make MoM systems truly usable in production, we are pushing VSR into more advanced territory, with AMD GPUs as the execution engine.

### Hallucination-aware and risk-aware routing

We are developing **hallucination detection and factuality checks** as part of the routing layer itself:

* Dedicated classifiers and NLI-style models to evaluate:

  * factual consistency between query, context, and answer,
  * risk level by domain (e.g., finance, healthcare, legal, compliance).
* Policies that, based on these signals, can (sketched below):

  * route to retrieval-augmented pipelines,
  * switch to a more conservative model,
  * require additional verification or human review,
  * lower the temperature or adjust generation strategy.

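Here is a minimal sketch of such a policy, assuming a factual-consistency score (e.g., from an NLI-style checker) and a domain tag are already computed upstream; the signal names, thresholds, and actions are illustrative assumptions, not VSR's actual policy schema.

```python
# Risk-aware routing policy sketch; signal names, thresholds, and actions
# are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Signals:
    factual_consistency: float  # 0.0 (contradicted) .. 1.0 (entailed), e.g. from an NLI checker
    domain: str                 # e.g. "finance", "healthcare", "general"


HIGH_RISK_DOMAINS = {"finance", "healthcare", "legal", "compliance"}


def decide(s: Signals) -> dict:
    """Map risk signals to one of the routing actions described above."""
    if s.factual_consistency < 0.3:
        # Likely hallucination: ground the answer via retrieval, cool down generation.
        return {"action": "route_to_rag", "temperature": 0.2}
    if s.domain in HIGH_RISK_DOMAINS and s.factual_consistency < 0.7:
        # Plausible but unverified claim in a regulated domain: be conservative.
        return {"action": "conservative_model", "require_review": True}
    return {"action": "accept"}
```
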
AMD GPUs enable these checks to run **inline** at scale, rather than as slow, offline audits. The result is a **“hallucination-aware router”** that enforces trust boundaries around LLM responses.

### Online learning and adaptive routing

We also view routing as something that should **learn and adapt over time**, not remain hard-coded:

* Using user feedback and task outcomes as signals.
* Tracking production KPIs (resolution rate, time-to-answer, escalation rate, etc.).
* Running continuous experiments to compare routing strategies and model choices.

AMD GPU clusters give us the capacity to:

* replay real or synthetic workloads at scale,
* run controlled A/B tests on routing policies,
* train and update small “router brain” models with fast iteration cycles (a minimal sketch of such a feedback loop follows this list).

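One simple form this feedback loop could take is a bandit-style update over candidate models. The candidate names, reward definition, and exploration rate below are assumptions for illustration.

```python
# Adaptive routing sketch: an epsilon-greedy bandit over candidate models.
# Candidate names, the reward definition, and epsilon are assumptions.
import random
from collections import defaultdict

MODELS = ["model-a", "model-b"]  # hypothetical candidates
EPSILON = 0.1                    # fraction of traffic used for exploration

counts = defaultdict(int)        # requests routed to each model
rewards = defaultdict(float)     # cumulative reward per model


def pick_model() -> str:
    """Mostly exploit the best-performing model; occasionally explore."""
    if random.random() < EPSILON:
        return random.choice(MODELS)
    return max(MODELS, key=lambda m: rewards[m] / counts[m] if counts[m] else 0.0)


def record_outcome(model: str, resolved: bool, latency_s: float) -> None:
    """Fold a task outcome (the KPIs above) back into the routing statistics."""
    counts[model] += 1
    rewards[model] += (1.0 if resolved else 0.0) - 0.01 * latency_s
```
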
This turns VSR into a **self-optimizing control plane** for multi-model systems, not just a rules engine.

---

## Long-Term Objective 1: Training a Next-Generation Encoder Model on AMD GPUs

As a longer-term goal of this collaboration, we aim to explore training a **next-generation encoder-only model** on AMD GPUs, optimized for semantic routing, retrieval-augmented generation (RAG), and safety classification.

While recent encoder models (e.g., ModernBERT) show strong performance, they remain limited in context length, multilingual coverage, and alignment with emerging long-context attention techniques. This effort focuses on advancing encoder capabilities using AMD hardware, particularly for **long-context, high-throughput representation learning**.

The outcome will be an **open encoder model** designed to integrate with vLLM Semantic Router and modern AI pipelines, strengthening the retrieval and routing layers of AI systems while expanding hardware-diverse training and deployment options for the community and industry.

---

## Long-Term Objective 2: Community Public Beta for vLLM Semantic Router on AMD Infrastructure

As part of this long-term collaboration, each major release of vLLM Semantic Router will be accompanied by a **public beta environment** hosted on **AMD-sponsored infrastructure**, available free of charge to the community.

These public betas will allow users to:

* validate new routing, caching, and safety features,
* gain hands-on experience with Semantic Router running on AMD GPUs,
* and provide early feedback that helps improve performance, usability, and system design.

By lowering the barrier to experimentation and validation, this initiative aims to strengthen the vLLM ecosystem, accelerate real-world adoption, and ensure that new Semantic Router capabilities are shaped by community input before broader production deployment.

---

## Long-Term Objective 3: AMD GPU–Powered CI/CD and End-to-End Testbed for VSR OSS

In the long run, we want AMD GPUs to underpin how **VSR as an open-source project is built, validated, and shipped**.

We are designing a GPU-backed **CI/CD and end-to-end testbed** where:

* Router SLMs, LLMs, domain models, retrieval, and tools run together on AMD GPU clusters.
* Multi-domain, multi-risk-level datasets are replayed as traffic.
* Each VSR change runs through an automated evaluation pipeline (sketched after this list), including:

  * routing and policy regression tests,
  * A/B comparisons of new vs. previous strategies,
  * stress tests on latency, cost, and scalability,
  * focused suites for hallucination mitigation and compliance behavior.

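As a hint of what one stage of that pipeline could look like, here is a hypothetical routing regression test; the dataset path, routing entry point, and accuracy floor are assumptions, not the project's actual CI suite.

```python
# Routing regression test sketch for the CI pipeline. The dataset path,
# routing entry point, and accuracy floor are hypothetical.
import json

from vsr_client import route_query  # hypothetical routing entry point


def load_cases(path: str = "routing_regression.jsonl"):
    """Each line: {"query": "...", "expected_backend": "..."}."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def test_routing_regression():
    cases = load_cases()
    hits = sum(route_query(c["query"]) == c["expected_backend"] for c in cases)
    accuracy = hits / len(cases)
    # Fail the release pipeline if a change regresses routing accuracy.
    assert accuracy >= 0.95, f"routing accuracy regressed to {accuracy:.2%}"
```
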
The target state is clear:

> **Every VSR release comes with a reproducible, GPU-driven evaluation report, not just a changelog.**

AMD GPUs, in this model, are not only for serving models; they are the **verification engine for the routing infrastructure itself**.

---

## Long-Term Objective 4: An AMD-Backed Mixture-of-Models Playground

In parallel, we are planning an **online Mixture-of-Models playground** powered by AMD GPUs, open to the community and partners.

This playground will allow users to:

* Experiment with different routing strategies and model topologies under real workloads.
* Observe, in a visual way, how VSR decides:

  * which model to call,
  * when to retrieve,
  * when to apply additional checks or fallbacks.
* Compare **quality, latency, and cost trade-offs** across configurations.

For model vendors, tool builders, and platform providers, this becomes a **neutral, AMD GPU–backed test environment** to:

* integrate their components into a MoM stack,
* benchmark under realistic routing and governance constraints,
* showcase capabilities within a transparent, observable system.

---

## Why This Collaboration Matters

Through the AMD × vLLM Semantic Router collaboration, we are aiming beyond “does this model run on this GPU?”.

The joint ambitions are:

* To define a **reference architecture for intelligent, GPU-accelerated routing** on AMD platforms, including:

  * vLLM-based inference paths,
  * ONNX-based lightweight router paths,
  * multi-model coordination and safety enforcement.
* To treat routing as **trusted infrastructure**, supported by:

  * GPU-powered CI/CD and end-to-end evaluation,
  * hallucination-aware and risk-aware policies,
  * online learning and adaptive strategies.
* To provide the ecosystem with a **long-lived, AMD GPU–backed MoM playground** where ideas, models, and routing policies can be tested and evolved in the open.

In short, this is about **co-building trustworthy, evolvable multi-model AI infrastructure**—with AMD GPUs as a core execution and validation layer, and vLLM Semantic Router as the intelligent control plane that makes the entire system understandable, governable, and ready for real workloads.