ATOM Architecture Guide

Quick Reference

| Class | Import | Purpose |
| --- | --- | --- |
| `LLMEngine` | `from atom.model_engine.llm_engine import LLMEngine` | User-facing inference API |
| `InputOutputProcessor` | `from atom.model_engine.llm_engine import InputOutputProcessor` | Tokenize/detokenize, TTFT/TPOT stats |
| `CoreManager` | `from atom.model_engine.engine_core_mgr import CoreManager` | Multi-process orchestration via ZMQ |
| `EngineCore` | `from atom.model_engine.engine_core import EngineCore` | Per-process engine loop |
| `DPEngineCoreProc` | `from atom.model_engine.engine_core import DPEngineCoreProc` | Data-parallel engine core variant |
| `ModelRunner` | `from atom.model_engine.model_runner import ModelRunner` | Per-GPU model execution |
| `Scheduler` | `from atom.model_engine.scheduler import Scheduler` | Prefill-first request scheduling |
| `BlockManager` | `from atom.model_engine.block_manager import BlockManager` | KV cache block allocation |
| `Sequence` | `from atom.model_engine.sequence import Sequence` | Request state and token tracking |
| `ForwardContext` | `from atom.utils.forward_context import ForwardContext` | Global forward pass metadata |
| `Config` | `from atom.config import Config` | Master configuration dataclass |

1. System Overview

ATOM (AiTer Optimized Model) is AMD's lightweight LLM inference engine, inspired by vLLM's architecture and built on the AITER kernel library for ROCm/HIP GPUs.

Key design principles:

  • Multi-process architecture -- each engine core runs in its own process, with ZMQ-based IPC connecting the user-facing API to one or more GPU workers.
  • AITER-native execution -- model forward passes use AITER's optimized attention, MoE, sampling, and communication kernels rather than generic PyTorch operators.
  • CUDA graph acceleration -- decode batches are captured into CUDA graphs for replay, eliminating per-step kernel launch overhead.
  • Prefill-first scheduling -- the scheduler prioritizes prompt prefills before decode steps, following vLLM's continuous batching strategy.
  • Speculative decoding -- optional EAGLE/MTP (Multi-Token Prediction) draft models propose tokens that are verified via rejection sampling.
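
A minimal usage sketch of the user-facing LLMEngine listed in the Quick Reference ties these pieces together. The constructor arguments, the way SamplingParams is supplied, and the output format are assumptions here, not the verified API; see llm_engine.py and config.py for the real signatures.

```python
# Minimal usage sketch (hypothetical arguments and output keys).
from atom.config import Config
from atom.model_engine.llm_engine import LLMEngine

config = Config(model="Qwen/Qwen3-8B")      # argument name is an assumption
engine = LLMEngine(config)

params = ...  # a SamplingParams instance (temperature, max_tokens, ...)
outputs = engine.generate(
    ["Explain paged KV caching in one sentence."],  # prompts or token IDs
    params,
)
for out in outputs:
    print(out)  # structured output dict with text and TTFT/TPOT stats (section 3, step 10)
```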

2. Component Architecture

LLMEngine (user-facing API)
├── InputOutputProcessor (tokenize/detokenize, TTFT/TPOT stats)
├── CoreManager (multi-process orchestration via ZMQ)
│   └── EngineCore (one per DP rank, runs in its own process)
│       ├── ModelRunner (per-GPU execution via AsyncIOProcManager)
│       │   ├── Model (Qwen3, Llama, DeepSeek, Mixtral, etc.)
│       │   ├── Sampler / RejectionSampler
│       │   └── EagleProposer (optional MTP draft)
│       └── Scheduler
│           └── BlockManager (KV cache block management)
└── Config (master configuration)

Supported model architectures (registered in support_model_arch_dict, a module-level dict in model_runner.py):

| Architecture key | Implementation |
| --- | --- |
| `Qwen3ForCausalLM` | `atom.models.qwen3.Qwen3ForCausalLM` |
| `Qwen3MoeForCausalLM` | `atom.models.qwen3_moe.Qwen3MoeForCausalLM` |
| `LlamaForCausalLM` | `atom.models.llama.LlamaForCausalLM` |
| `MixtralForCausalLM` | `atom.models.mixtral.MixtralForCausalLM` |
| `DeepseekV3ForCausalLM` | `atom.models.deepseek_v2.DeepseekV2ForCausalLM` |
| `DeepseekV32ForCausalLM` | `atom.models.deepseek_v2.DeepseekV2ForCausalLM` |
| `GptOssForCausalLM` | `atom.models.gpt_oss.GptOssForCausalLM` |
| `Glm4MoeForCausalLM` | `atom.models.glm4_moe.Glm4MoeForCausalLM` |
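
Model resolution is a plain dictionary lookup keyed on the architecture name from the HuggingFace config. A sketch of the idea follows; whether the dict stores classes or import paths, and the surrounding error handling, are assumptions rather than the exact code in model_runner.py.

```python
# Sketch of resolving a model class via support_model_arch_dict.
from atom.model_engine.model_runner import support_model_arch_dict

def resolve_model_cls(hf_config):
    arch = hf_config.architectures[0]        # e.g. "Qwen3ForCausalLM"
    try:
        return support_model_arch_dict[arch]
    except KeyError:
        raise ValueError(f"Unsupported architecture: {arch}") from None
```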

3. Request Lifecycle

A request flows through the system in ten steps:

  1. LLMEngine.add_request() / generate() -- the user submits a list of prompts (strings or pre-tokenized token IDs) together with SamplingParams.

  2. InputOutputProcessor.preprocess() -- each prompt is tokenized via the HuggingFace tokenizer. A Sequence object is created to track the request's state, timing, and block allocation. arrive_time is recorded.

  3. CoreManager.add_request() -- the list of Sequence objects is serialized with pickle and sent over a ZMQ ROUTER socket. When multiple DP ranks are active, requests are distributed round-robin.

  4. EngineCore.process_input_sockets() -- an I/O thread on the EngineCore process receives the serialized data on a ZMQ DEALER socket, deserializes it, and places the sequences into the input_queue.

  5. EngineCore.busy_loop() -- the main execution loop pulls from input_queue via pull_and_process_input_queue(), feeds new sequences into the scheduler, and repeatedly calls _process_engine_step() until all work is done.

  6. Scheduler.schedule() -- implements prefill-first scheduling. Waiting sequences are scheduled for prefill if they fit within max_num_seqs and max_num_batched_tokens and the BlockManager can allocate blocks. If no prefills are pending, running sequences are batched for decode. The scheduler returns a ScheduledBatch and the corresponding sequence map (a sketch of this decision appears below the list).

  7. ModelRunner.forward() -- executes the three-phase forward pass:

    • prepare_model() -- assembles input IDs (handling deferred output from previous steps), builds attention metadata, and gathers sampling temperatures.
    • run_model() -- runs the model forward. Prefill and large batches run eagerly; decode batches replay captured CUDA graphs. Returns logits and hidden states.
    • postprocess() -- samples tokens (or runs rejection sampling for speculative decoding), prepares deferred output via tokenIDProcessor, and optionally proposes draft tokens through EagleProposer.
  8. Scheduler.postprocess() -- appends sampled tokens to each Sequence, records first_token_time, checks stop conditions (EOS, stop token IDs, stop token sequences, max_tokens), and moves finished sequences out of the running queue. The BlockManager deallocates blocks for finished sequences.

  9. Output via ZMQ -- finished sequences are placed on the output_queue. A dedicated output thread serializes them and sends them over a ZMQ PUSH socket back to the CoreManager, which receives them on a PULL socket and places them in outputs_queue.

  10. InputOutputProcessor.postprocess() -- detokenizes completed sequences, computes TTFT (Time To First Token) and TPOT (Time Per Output Token), and returns structured output dictionaries (the timing arithmetic is sketched below).
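
The prefill-first decision from step 6 can be summarized in a few lines. This is a conceptual sketch only: the queue names, the BlockManager method names, and the return value are assumptions, and the real Scheduler.schedule() returns a ScheduledBatch plus a sequence map.

```python
# Conceptual sketch of step 6 (prefill-first scheduling), not the actual code.
def schedule(self):
    scheduled, token_budget = [], self.max_num_batched_tokens
    # Prefill has priority: admit waiting sequences while they fit the
    # sequence/token budgets and the BlockManager can hand out blocks.
    while (self.waiting
           and len(self.running) + len(scheduled) < self.max_num_seqs
           and self.waiting[0].num_prompt_tokens <= token_budget
           and self.block_manager.can_allocate(self.waiting[0])):   # assumed helper
        seq = self.waiting.popleft()
        self.block_manager.allocate(seq)                            # assumed helper
        token_budget -= seq.num_prompt_tokens
        scheduled.append(seq)
    if scheduled:
        return scheduled, True           # prefill batch
    # No prefills pending: batch all running sequences for one decode step.
    return list(self.running), False     # decode batch
```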

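Step 10's TTFT and TPOT reduce to arithmetic over the Sequence timing fields described in section 6. A sketch follows; the exact formulas used by InputOutputProcessor.postprocess(), in particular the TPOT divisor, are assumptions.

```python
# Sketch of per-request timing stats; the divisor convention is an assumption.
def timing_stats(seq):
    ttft = seq.first_token_time - seq.arrive_time
    num_generated = seq.num_tokens - seq.num_prompt_tokens
    tpot = (
        (seq.leave_time - seq.first_token_time) / (num_generated - 1)
        if num_generated > 1 else 0.0
    )
    return {"ttft": ttft, "tpot": tpot}
```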

4. Forward Context Pattern

ATOM uses a module-level global ForwardContext to pass metadata through CUDA graph boundaries without threading it as function parameters.

Core dataclasses (defined in atom/utils/forward_context.py):

  • ForwardContext -- top-level container holding:

    • attn_metadata (AttentionMetaData) -- cumulative sequence lengths, block tables, slot mappings, and backend-specific metadata.
    • context (Context) -- positions, prefill flag, batch size, graph batch size, draft flag.
    • dp_metadata (DPMetadata) -- cross-DP-rank token counts and cumulative sums.
    • spec_decode_metadata (SpecDecodeMetadata) -- draft token IDs, target/bonus logits indices.
    • kv_cache_data (dict[str, KVCacheTensor]) -- per-layer KV cache tensor references.
  • Context -- lightweight struct: positions, is_prefill, batch_size, graph_bs, is_draft.

  • DPMetadata -- data parallel metadata with num_tokens_across_dp() (all-reduce), max_tokens_across_dp, and chunked_sizes() context manager.
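
Laid out as code, the containers above have roughly the following shape. Field names are taken from the descriptions; types, defaults, and anything not listed are assumptions, and the real definitions in atom/utils/forward_context.py carry more detail.

```python
# Rough shape of the forward-context containers (sketch, not the real file).
from dataclasses import dataclass, field

@dataclass
class Context:
    positions: "torch.Tensor"
    is_prefill: bool
    batch_size: int
    graph_bs: int
    is_draft: bool = False

@dataclass
class ForwardContext:
    attn_metadata: "AttentionMetaData"
    context: Context
    dp_metadata: "DPMetadata | None" = None
    spec_decode_metadata: "SpecDecodeMetadata | None" = None
    kv_cache_data: dict[str, "KVCacheTensor"] = field(default_factory=dict)
```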

Global accessors:

| Function | Purpose |
| --- | --- |
| `set_forward_context(attn_metadata, atom_config, context, ...)` | Set the global context before a forward pass |
| `get_forward_context()` | Retrieve the current context (used by attention backends) |
| `reset_forward_context()` | Clear after forward pass completes |
| `set_kv_cache_data(kv_cache_data)` | Register KV cache tensors at initialization |

This pattern enables stateless dispatch: attention backends and model operators call get_forward_context() to access metadata without requiring it as a function parameter, which is critical for CUDA graph compatibility.
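
In practice the pattern looks like the sketch below on both sides of a forward pass. Only the accessor names come from the table above; the wrapper function, the simplified set_forward_context() call (its full argument list is elided with "..." in the table), and the attention stub are illustrative.

```python
from atom.utils.forward_context import (
    get_forward_context, reset_forward_context, set_forward_context,
)

def run_step(model, input_ids, attn_metadata, atom_config, context):
    # Publish metadata globally for the duration of one forward pass.
    set_forward_context(attn_metadata, atom_config, context)
    try:
        return model(input_ids)   # operators read the context internally
    finally:
        reset_forward_context()

def paged_attention(q, k, v):
    # An attention backend pulls metadata from the global context instead of
    # receiving it through the model's call signature (CUDA-graph friendly).
    ctx = get_forward_context()
    attn_metadata = ctx.attn_metadata   # block tables, slot mappings, ...
    ...
```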


5. Multi-Process Architecture

ATOM uses a multi-process design with ZMQ sockets for inter-process communication:

                        ┌──────────────────────────────────┐
                        │          LLMEngine               │
                        │  ┌────────────────────────────┐  │
                        │  │    CoreManager              │  │
                        │  │                             │  │
                        │  │  ROUTER ──────► DEALER      │  │
                        │  │  (input)        (per rank)  │  │
                        │  │                             │  │
                        │  │  PULL ◄─────── PUSH         │  │
                        │  │  (output)      (per rank)   │  │
                        │  └────────────────────────────┘  │
                        └──────────────────────────────────┘
                              │                    ▲
                     pickle   │                    │  pickle
                              ▼                    │
               ┌──────────────────────────────────────┐
               │         EngineCore (Process)         │
               │                                      │
               │  input_queue ──► busy_loop           │
               │                    │                 │
               │  ┌─────────────────▼──────────────┐  │
               │  │  AsyncIOProcManager            │  │
               │  │  ┌──────────────────────────┐  │  │
               │  │  │ ModelRunner (TP rank 0)  │  │  │
               │  │  │ ModelRunner (TP rank 1)  │  │  │
               │  │  │ ...                      │  │  │
               │  │  └──────────────────────────┘  │  │
               │  └────────────────────────────────┘  │
               │                                      │
               │  Scheduler + BlockManager            │
               └──────────────────────────────────────┘

Socket types:

| Socket | Type | Direction | Purpose |
| --- | --- | --- | --- |
| Input | ROUTER (CoreManager) / DEALER (EngineCore) | CoreManager -> EngineCore | Send requests and control commands |
| Output | PUSH (EngineCore) / PULL (CoreManager) | EngineCore -> CoreManager | Return finished sequences and stream outputs |
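
The pairing above maps directly onto pyzmq. The endpoint addresses and the identity handling in the sketch below are assumptions; the real wiring lives in engine_core_mgr.py and engine_core.py.

```python
# CoreManager-side sketch of the two socket pairs (hypothetical endpoints).
import pickle
import zmq

ctx = zmq.Context()

input_socket = ctx.socket(zmq.ROUTER)            # requests out, one DEALER per rank
input_socket.bind("ipc:///tmp/atom_input")
output_socket = ctx.socket(zmq.PULL)             # finished sequences back in
output_socket.bind("ipc:///tmp/atom_output")

def send_to_rank(rank_identity: bytes, seqs):
    # ROUTER addresses a specific DEALER by identity, then ships a pickled payload.
    input_socket.send_multipart([rank_identity, pickle.dumps(seqs)])

def recv_finished():
    return pickle.loads(output_socket.recv())
```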

Process hierarchy:

  • CoreManager spawns one EngineCore process per DP rank using multiprocessing.Process.
  • Each EngineCore creates an AsyncIOProcManager, which in turn spawns one subprocess per TP rank.
  • Each ModelRunner subprocess initializes AITER's distributed environment via init_dist_env(), setting up NCCL communication across TP ranks.

Data-parallel variant (DPEngineCoreProc):

When data_parallel_size > 1, each EngineCore process is a DPEngineCoreProc that synchronizes with other DP ranks via torch.distributed.all_reduce on a Gloo process group. The busy_loop() override ensures all DP ranks stay in lockstep: if one rank has a prefill batch while another does not, the idle rank executes a dummy prefill (dummy_prefill_execution()) to keep NCCL collectives synchronized.
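
A conceptual sketch of the lockstep rule is shown below. The process-group handle, the batch attributes, and the tensor shape are assumptions; only all_reduce on a Gloo group and dummy_prefill_execution() come from the description above.

```python
import torch
import torch.distributed as dist

def step_in_lockstep(self, batch):
    # Each DP rank reports whether it has real prefill work this iteration.
    has_prefill = torch.tensor(
        [1 if batch is not None and batch.is_prefill else 0])
    dist.all_reduce(has_prefill, op=dist.ReduceOp.MAX, group=self.dp_group)

    if has_prefill.item() and (batch is None or not batch.is_prefill):
        # Another rank is prefilling: run a dummy prefill so the NCCL
        # collectives inside the model stay matched across ranks.
        self.dummy_prefill_execution()
    elif batch is not None:
        self._process_engine_step()
```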


6. Sequence Lifecycle

The Sequence class (in atom/model_engine/sequence.py) is the central data structure tracking a single request through the engine.

Key fields:

| Field | Type | Purpose |
| --- | --- | --- |
| `id` | `int` | Auto-incrementing unique identifier |
| `token_ids` | `list[int]` | Full token sequence (prompt + generated) |
| `num_prompt_tokens` | `int` | Length of the original prompt |
| `num_tokens` | `int` (property) | Total length including generated tokens |
| `block_table` | `list[int]` | KV cache block IDs allocated to this sequence |
| `status` | `SequenceStatus` | Current lifecycle state |
| `type` | `SequenceType` | Current execution type |
| `temperature` | `float` | Sampling temperature |
| `max_tokens` | `int` | Maximum completion length |
| `arrive_time` | `float` | Timestamp when request entered the system |
| `first_token_time` | `float` | Timestamp of first generated token (for TTFT) |
| `leave_time` | `float` | Timestamp when request finished (for TPOT) |
| `spec_token_ids` | `list[int]` | Speculative/draft token IDs for MTP |
| `stream_callback` | `Callable` | Optional per-token streaming callback |

Status transitions:

WAITING ──(scheduled for prefill)──► RUNNING ──(stop condition met)──► FINISHED
   ▲                                    │
   └────────(preempted by scheduler)────┘

  • SequenceStatus.WAITING -- queued in the scheduler's waiting deque, awaiting block allocation.
  • SequenceStatus.RUNNING -- actively being processed (prefill or decode).
  • SequenceStatus.FINISHED -- stop condition met (EOS, stop token, stop sequence, or max_tokens). Blocks are deallocated.
  • SequenceStatus.EXIT_ENGINE -- sentinel status used to signal engine shutdown.

Execution types:

  • SequenceType.DUMMY -- initial state before scheduling.
  • SequenceType.PREFILL -- prompt processing phase (all prompt tokens in one batch).
  • SequenceType.DECODE -- autoregressive token generation (one or more tokens per step with MTP).
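
Putting the fields and statuses together, the stop check from step 8 of the request lifecycle might look roughly like the sketch below. Field access follows the table above; eos_token_id, stop_token_ids, the deallocate() call, and the omission of stop token sequences are assumptions.

```python
from atom.model_engine.sequence import SequenceStatus

def check_stop(seq, eos_token_id) -> bool:
    # Sketch of the EOS / stop-token / max_tokens checks (stop sequences omitted).
    num_generated = seq.num_tokens - seq.num_prompt_tokens
    last_token = seq.token_ids[-1]
    return (num_generated >= seq.max_tokens
            or last_token == eos_token_id
            or last_token in getattr(seq, "stop_token_ids", ()))

def finish(self, seq):
    # Finished sequences leave the running queue and give their blocks back.
    seq.status = SequenceStatus.FINISHED
    self.block_manager.deallocate(seq)   # method name is an assumption
    self.running.remove(seq)
```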

Source Files

| File | Description |
| --- | --- |
| `atom/model_engine/llm_engine.py` | `LLMEngine` user-facing API, `InputOutputProcessor` for tokenization/detokenization and TTFT/TPOT statistics |
| `atom/model_engine/engine_core.py` | `EngineCore` main execution loop, `DPEngineCoreProc` data-parallel variant, `EngineCoreRequestType` message protocol |
| `atom/model_engine/engine_core_mgr.py` | `CoreManager` ZMQ orchestration, process launching, round-robin DP dispatch |
| `atom/model_engine/model_runner.py` | `ModelRunner` per-GPU execution (model loading, CUDA graph capture, forward pass), `tokenIDProcessor` deferred output handling |
| `atom/model_engine/scheduler.py` | `Scheduler` prefill-first scheduling, `ScheduledBatch` batch descriptor, `ScheduledBatchOutput` forward results |
| `atom/model_engine/sequence.py` | `Sequence` request state, `SequenceStatus` and `SequenceType` enums |
| `atom/model_engine/block_manager.py` | `BlockManager` KV cache block allocation with optional prefix caching |
| `atom/model_engine/request.py` | `RequestOutput` dataclass for streaming callbacks |
| `atom/model_engine/async_proc.py` | `AsyncIOProcManager` and `AsyncIOProc` for spawning and managing ModelRunner subprocesses |
| `atom/utils/forward_context.py` | `ForwardContext`, `Context`, `DPMetadata`, `SpecDecodeMetadata`, `AttentionMetaData` dataclasses and global accessors |
| `atom/config.py` | `Config` master configuration, `ParallelConfig`, `CompilationConfig`, `LayerQuantConfig`, `QuantizationConfig`, `SpeculativeConfig`, `KVCacheTensor` |