ATOM: A Unified High‑Performance Inference Engine for AMD Instinct™ GPUs

Status: Draft – Internal Review Only
Audience: AMD internal stakeholders (pre‑external publication)
Purpose: External‑facing narrative, sanitized for public release, pending approval

Executive Summary

The rapid evolution of generative AI toward agentic workflows, multi-step reasoning, and Mixture-of-Experts (MoE) architectures has placed unprecedented demands on inference infrastructure. Achieving low latency and high throughput at scale now requires tight hardware-software co-design, moving beyond the limitations of loosely coupled, generic open-source components. ATOM is a unified inference engine purpose-built for AMD Instinct™ GPUs. Built on a minimalist, modular architecture centered on AITER routing, ATOM avoids the overhead of monolithic designs to provide a streamlined path from hardware to application.

A Dual-Purpose Architecture

ATOM is designed to bridge the gap between cutting-edge hardware capability and community usability through two strategic roles: • Standalone Optimization Platform: It serves as a "Proving Ground" for rapid iteration, allowing AMD to develop and validate internal system-level optimizations, compiler-driven graph enhancements, and kernel acceleration. • Integrated Plugin Backend: It integrates as a high-performance backend for popular frameworks like vLLM and SGLang, ensuring that the broader AI community can access AMD-specific optimizations without changing their existing workflows.

Driving Production Readiness

By unifying the execution layer, ATOM helps customers move faster from POC to production with stable performance. It brings new hardware features to users immediately, making AMD Instinct GPUs a reliable choice for high-concurrency and MoE workloads

Performance Momentum:

ATOM’s embrace the same minimalist mindset of nano-vllm "minimalist" architecture allows it to evolve at the speed of AMD silicon. By stripping away layers of abstraction found in generic frameworks, ATOM connects the model directly to the metal via the AITER kernel library.

Rapid Iteration & Optimization (MI355X Data) Recent internal benchmarking on the MI355X GPU (Dec 2025 – Jan 2026) demonstrates how this direct optimization translates to speed. Specifically, at a fixed high concurrency (Conc=128), we observe consistent double-digit gains: • GPT-OSS (FP4): Throughput surged by 40.8% in 1K/1K tasks, reflecting the rapid tuning of AITER kernels for the MI355X’s FP4 tensor cores. • DeepSeek R1 (FP8): Achieved a 39.4% throughput increase on long-context (8k/1k) tasks, proving ATOM’s capability to handle complex MoE routing efficiently.
Competitive Context: Closing the Gap with NVIDIA B200 ATOM is the vehicle enabling AMD to challenge the NVIDIA B200. • FP4 Leadership: By unlocking native FP4 support on the MI355X GPU, ATOM is bringing AMD within striking distance of B200 generation speeds on models like DeepSeek R1. • MoE Efficiency: ATOM’s specialized schedulers mitigate expert routing overhead, keeping GPU occupancy high even in sparse MoE models.

ATOM in the AMD AI Software Stack

ATOM is highlighted as the central performance layer coordinating compiler decisions, kernel execution, memory orchestration, and distributed communication.

Layer	Role
Frameworks (PyTorch, vLLM, SGLang)	Model definition, APIs, and serving interfaces
Compiler (AITER / MLIR)	Graph-level optimization, fusion, and scheduling
ATOM	AMD-optimized inference execution and orchestration
Communication (MORI)	Multi-GPU and multi-node communication

------|------| | Frameworks (PyTorch, vLLM, SGLang) | Model definition and serving APIs | | Compiler (AITER / MLIR) | Graph‑level optimization and fusion | | ATOM | AMD‑optimized inference execution engine | | Communication (MORI) | Multi‑GPU and multi‑node communication |

ATOM serves as the central performance layer, coordinating compiler decisions, kernel selection, and distributed execution.

Core Capabilities

1. System-Level Execution Optimization

Full-Graph Execution: Reduces kernel launch overhead and enables compiler-guided scheduling.
Memory and Cache Orchestration: Unified KV cache handling and cross-step data residency to minimize memory traffic.
Advanced Parallelism: Support for tensor, expert (MoE), data, pipeline, and hybrid parallel strategies.

2. Compiler-Driven Graph and Kernel Acceleration

Kernel Fusion: Minimizes micro-kernel overhead and improves GPU occupancy.
Model-Aware Tuning: Automatic specialization based on model structure and runtime characteristics.
Precision Evolution: Native FP8 support with a clear path to FP4 on next-generation architectures.

3. Distributed Inference at Scale

Compute–Communication Overlap: Tight integration with MORI for asynchronous dispatch and combine.
Multi-Node Scalability: Efficient expert parallelism and KV cache streaming.
Dynamic Load Balancing: Adaptive resource allocation for MoE-heavy workloads.

Performance and Validation

ATOM has been validated through AMD’s participation in InferenceMax, demonstrating strong single‑node and multi‑node inference performance on AMD Instinct™ MI355X GPUs for modern reasoning and MoE models.

(Suggested Figures: Throughput vs. concurrency charts, single‑node and multi‑node scaling plots)

Key takeaways:

Competitive or leading throughput for MoE‑heavy reasoning models
Strong scaling efficiency in distributed inference configurations
Consistent performance across a wide range of sequence lengths and concurrency levels

All results are reproducible using the open InferenceMax stack.

Benefits to Customers and Partners

Faster POCs: Reduce onboarding cycles from weeks to days
Predictable Performance: Consistent results across models and frameworks
Lower Engineering Overhead: One optimized backend instead of per‑framework tuning
Future‑Ready Platform: Early access to new precisions, architectures, and inference techniques

Roadmap Overview

Short Term (MI300 / MI355):

Production‑ready full‑graph execution
FP8 optimization and MoE scaling

Mid Term (MI400‑class):

FP4 enablement
Rack‑scale distributed inference

Long Term:

Multi‑modal inference
Integrated reinforcement learning rollout support

Conclusion

ATOM is a foundational component of AMD’s AI inference strategy. By unifying execution across kernels, compilers, and communication layers, ATOM enables AMD Instinct GPUs to deliver production‑grade inference performance for the next generation of AI workloads.

With ATOM, AMD moves beyond fragmented enablement toward a cohesive, scalable inference platform—one that accelerates customer adoption, strengthens the open‑source ecosystem, and positions AMD as a first‑class provider for modern AI inference.

This document is intended for internal review prior to external publication. Performance data and roadmap details are subject to change.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ATOM: A Unified High‑Performance Inference Engine for AMD Instinct™ GPUs

Executive Summary

A Dual-Purpose Architecture

Driving Production Readiness

Performance Momentum:

ATOM in the AMD AI Software Stack

Core Capabilities

1. System-Level Execution Optimization

2. Compiler-Driven Graph and Kernel Acceleration

3. Distributed Inference at Scale

Performance and Validation

Benefits to Customers and Partners

Roadmap Overview

Conclusion

FilesExpand file tree

atom_overview_draft.md

Latest commit

History

atom_overview_draft.md

File metadata and controls

ATOM: A Unified High‑Performance Inference Engine for AMD Instinct™ GPUs

Executive Summary

A Dual-Purpose Architecture

Driving Production Readiness

Performance Momentum:

ATOM in the AMD AI Software Stack

Core Capabilities

1. System-Level Execution Optimization

2. Compiler-Driven Graph and Kernel Acceleration

3. Distributed Inference at Scale

Performance and Validation

Benefits to Customers and Partners

Roadmap Overview

Conclusion