diff --git a/website/blog/2025-12-10-amd.md b/website/blog/2025-12-10-amd.md
new file mode 100644
index 000000000..dc09eb108
--- /dev/null
+++ b/website/blog/2025-12-10-amd.md
@@ -0,0 +1,216 @@
+---
+slug: semantic-on-amd
+title: "AMD × vLLM Semantic Router: Building Trustworthy, Evolvable Mixture-of-Models AI Infrastructure"
+authors: [Xunzhuo]
+tags: [amd, ecosystem, partnership, inference, routing]
+---
+
+AMD is a long-term technology partner for the **vLLM** community: from accelerating the **vLLM engine** on AMD GPUs and ROCm™ Software to now exploring the next layer of the stack with **vLLM Semantic Router (VSR)**.
+
+As AI moves from single models to **Mixture-of-Models (MoM)** architectures, the challenge shifts from "how big is your model" to **how intelligently you orchestrate many models together**.
+
+VSR is the **intelligent brain for multi-model AI**. Together with AMD, we are building this layer to be:
+
+* **System Intelligence** – semantic understanding, intent classification, and dynamic decision-making.
+* **Trustworthy** – hallucination-aware, risk-sensitive, and observable.
+* **Evolvable** – continuously improving via data, feedback, and experimentation.
+* **Production-ready on AMD** – from CI/CD → online playgrounds → large-scale production.
+
+![AMD × vLLM Semantic Router](/img/amd-vsr.png)
+
+---
+
+## Strategic Focus: From Single Models to Multi-Model AI Infrastructure
+
+In a MoM world, an enterprise AI stack typically includes:
+
+* Router SLMs (small language models) that classify, route, and enforce policy.
+* Multiple LLMs and domain-specific models (e.g., code, finance, healthcare).
+* Tools, RAG pipelines, vector search, and business systems.
+
+Without a robust routing layer, this turns into an opaque and fragile mesh.
+
+The AMD × VSR collaboration aims to make that routing layer a **first-class, GPU-accelerated infrastructure component**, not an ad-hoc script glued between services.
+
+---
+
+## Near- and Mid-Term: Running VSR on AMD GPUs
+
+In the near and mid term, the joint objective is simple and execution-oriented:
+
+> **Deliver a production-grade VSR solution that runs efficiently on AMD GPUs.**
+
+We are building two primary paths.
+
+### 1. vLLM-based SLM / LLM inference on AMD GPUs
+
+On AMD GPUs, using the vLLM engine, we will run:
+
+* **Router SLMs**
+
+  * Task and intent classification
+  * Risk scoring and safety gating
+  * Tool / workflow selection
+* **LLMs and specialized models**
+
+  * General assistants
+  * Domain-specific models (e.g., financial, legal, code)
+
+VSR sits above as the decision fabric, consuming:
+
+* semantic similarity and embeddings,
+* business metadata and domain tags,
+* latency and cost constraints,
+* risk and compliance requirements,
+
+to perform **dynamic routing** across these models and endpoints.
+
+AMD GPUs provide the throughput and memory capacity needed to run **router SLMs + multiple LLMs** in the same cluster, sustaining high-QPS production workloads with stable latency rather than one-off demos.
+
+### 2. Lightweight routing via ONNX binding on AMD GPUs
+
+Not all routing needs a full inference stack.
+
+For ultra-high-frequency, latency-sensitive stages at the “front door” of the system, we are also enabling:
+
+* Exporting router SLMs to **ONNX**,
+* Running them on AMD GPUs through ONNX Runtime / custom bindings,
+* Forwarding complex generative work to vLLM or other back-end LLMs.
+
+This lightweight path is designed for:
+
+* front-of-funnel traffic classification and triage,
+* large-scale policy evaluation and offline experiments,
+* enterprises that want to **standardize on AMD GPUs while keeping model providers flexible**.
+
+---
+
+## Frontier Capabilities: Hallucination Detection and Online Learning
+
+Routing is not just a static mapping of “task → model A or B”. To make MoM systems truly usable in production, we are pushing VSR into more advanced territory, with AMD GPUs as the execution engine.
+
+### Hallucination-aware and risk-aware routing
+
+We are developing **hallucination detection and factuality checks** as part of the routing layer itself:
+
+* Dedicated classifiers and NLI-style models to evaluate:
+
+  * factual consistency between query, context, and answer,
+  * risk level by domain (e.g., finance, healthcare, legal, compliance).
+* Policies that can, based on these signals:
+
+  * route to retrieval-augmented pipelines,
+  * switch to a more conservative model,
+  * require additional verification or human review,
+  * lower the temperature or adjust the generation strategy.
+
+AMD GPUs enable these checks to run **inline** at scale, rather than as slow, offline audits. The result is a **“hallucination-aware router”** that enforces trust boundaries around LLM responses.
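+
+To make the policy side concrete, here is a minimal, illustrative sketch of how such signals could drive a routing decision. Everything in it (the `Signals` fields, the thresholds, and the action names) is a hypothetical stand-in, not the actual VSR API:
+
+```python
+# Hypothetical sketch of a hallucination-aware routing policy.
+# Field names, thresholds, and actions are illustrative, not the VSR API.
+from dataclasses import dataclass
+
+@dataclass
+class Signals:
+    factuality: float   # NLI-style consistency score (0 = contradiction, 1 = entailment)
+    domain_risk: float  # risk weight of the detected domain (finance, health, ...)
+
+def route(signals: Signals) -> dict:
+    """Map trust signals to a routing decision."""
+    if signals.factuality < 0.5:
+        # Weak grounding: send the request through a retrieval-augmented pipeline.
+        return {"action": "rag_pipeline", "temperature": 0.2}
+    if signals.domain_risk > 0.8:
+        # High-stakes domain: conservative model plus human review.
+        return {"action": "conservative_model", "require_review": True}
+    return {"action": "default_model"}
+
+print(route(Signals(factuality=0.3, domain_risk=0.9)))
+# -> {'action': 'rag_pipeline', 'temperature': 0.2}
+```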
+
+### Online learning and adaptive routing
+
+We also view routing as something that should **learn and adapt over time**, not remain hard-coded:
+
+* Using user feedback and task outcomes as signals.
+* Tracking production KPIs (resolution rate, time-to-answer, escalation rate, etc.).
+* Running continuous experiments to compare routing strategies and model choices.
+
+AMD GPU clusters give us the capacity to:
+
+* replay real or synthetic workloads at scale,
+* run controlled A/B tests on routing policies,
+* train and update small “router brain” models with fast iteration cycles.
+
+This turns VSR into a **self-optimizing control plane** for multi-model systems, not just a rules engine.
+
+---
+
+## Long-Term Objective 1: Training a Next-Generation Encoder Model on AMD GPUs
+
+As a longer-term goal of this collaboration, we aim to explore training a **next-generation encoder-only model** on AMD GPUs, optimized for semantic routing, retrieval-augmented generation (RAG), and safety classification.
+
+While recent encoder models (e.g., ModernBERT) show strong performance, they remain limited in context length, multilingual coverage, and alignment with emerging long-context attention techniques. This effort focuses on advancing encoder capabilities using AMD hardware, particularly for **long-context, high-throughput representation learning**.
+
+The outcome will be an **open encoder model** designed to integrate with vLLM Semantic Router and modern AI pipelines, strengthening the retrieval and routing layers of AI systems while expanding hardware-diverse training and deployment options for the community and industry.
+
+---
+
+## Long-Term Objective 2: Community Public Beta for vLLM Semantic Router on AMD Infrastructure
+
+As part of this long-term collaboration, each major release of vLLM Semantic Router will be accompanied by a **public beta environment** hosted on **AMD-sponsored infrastructure**, available free of charge to the community.
+
+These public betas will allow users to:
+
+* validate new routing, caching, and safety features,
+* gain hands-on experience with Semantic Router running on AMD GPUs,
+* provide early feedback that helps improve performance, usability, and system design.
+
+By lowering the barrier to experimentation and validation, this initiative aims to strengthen the vLLM ecosystem, accelerate real-world adoption, and ensure that new Semantic Router capabilities are shaped by community input before broader production deployment.
+
+---
+
+## Long-Term Objective 3: AMD GPU–Powered CI/CD and End-to-End Testbed for VSR OSS
+
+In the long run, we want AMD GPUs to underpin how **VSR as an open-source project is built, validated, and shipped**.
+
+We are designing a GPU-backed **CI/CD and end-to-end testbed** where:
+
+* Router SLMs, LLMs, domain models, retrieval, and tools run together on AMD GPU clusters.
+* Multi-domain, multi-risk-level datasets are replayed as traffic.
+* Each VSR change runs through an automated evaluation pipeline (sketched below), including:
+
+  * routing and policy regression tests,
+  * A/B comparisons of new vs. previous strategies,
+  * stress tests on latency, cost, and scalability,
+  * focused suites for hallucination mitigation and compliance behavior.
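+
+As an illustration of the first item, a routing regression test can be as simple as replaying a golden set of queries and gating the release on agreement with the expected routes. The sketch below is hypothetical: `route_query`, the golden set, and the 95% threshold are stand-ins, not the actual VSR test suite:
+
+```python
+# Hypothetical routing regression test; names and thresholds are illustrative.
+
+GOLDEN_CASES = [
+    ("Refactor this Python function to be iterative.", "code-model"),
+    ("Summarize this quarterly filing for me.", "finance-model"),
+    ("What's a good name for a cat?", "general-model"),
+]
+
+def route_query(text: str) -> str:
+    """Stand-in router: a keyword heuristic in place of the real SLM classifier."""
+    lowered = text.lower()
+    if "python" in lowered or "function" in lowered:
+        return "code-model"
+    if "filing" in lowered or "quarterly" in lowered:
+        return "finance-model"
+    return "general-model"
+
+def test_routing_regressions():
+    # A release gate might require, say, >= 95% agreement with the golden set.
+    hits = sum(route_query(q) == expected for q, expected in GOLDEN_CASES)
+    assert hits / len(GOLDEN_CASES) >= 0.95
+
+if __name__ == "__main__":
+    test_routing_regressions()
+    print("routing golden set: all cases passed")
+```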
+
+The target state is clear:
+
+> **Every VSR release comes with a reproducible, GPU-driven evaluation report, not just a changelog.**
+
+AMD GPUs, in this model, are not only for serving models; they are the **verification engine for the routing infrastructure itself**.
+
+---
+
+## Long-Term Objective 4: An AMD-Backed Mixture-of-Models Playground
+
+In parallel, we are planning an **online Mixture-of-Models playground** powered by AMD GPUs, open to the community and partners.
+
+This playground will allow users to:
+
+* Experiment with different routing strategies and model topologies under real workloads.
+* Observe, in a visual way, how VSR decides:
+
+  * which model to call,
+  * when to retrieve,
+  * when to apply additional checks or fallbacks.
+* Compare **quality, latency, and cost trade-offs** across configurations.
+
+For model vendors, tool builders, and platform providers, this becomes a **neutral, AMD GPU–backed test environment** to:
+
+* integrate their components into a MoM stack,
+* benchmark under realistic routing and governance constraints,
+* showcase capabilities within a transparent, observable system.
+
+---
+
+## Why This Collaboration Matters
+
+Through the AMD × vLLM Semantic Router collaboration, we are aiming beyond “does this model run on this GPU?”.
+
+The joint ambitions are:
+
+* To define a **reference architecture for intelligent, GPU-accelerated routing** on AMD platforms, including:
+
+  * vLLM-based inference paths,
+  * ONNX-based lightweight router paths,
+  * multi-model coordination and safety enforcement.
+* To treat routing as **trusted infrastructure**, supported by:
+
+  * GPU-powered CI/CD and end-to-end evaluation,
+  * hallucination-aware and risk-aware policies,
+  * online learning and adaptive strategies.
+* To provide the ecosystem with a **long-lived, AMD GPU–backed MoM playground** where ideas, models, and routing policies can be tested and evolved in the open.
+
+In short, this is about **co-building trustworthy, evolvable multi-model AI infrastructure**, with AMD GPUs as a core execution and validation layer and vLLM Semantic Router as the intelligent control plane that makes the entire system understandable, governable, and ready for real workloads.
diff --git a/website/static/img/amd-vsr.png b/website/static/img/amd-vsr.png
new file mode 100644
index 000000000..85016ef47
Binary files /dev/null and b/website/static/img/amd-vsr.png differ