ATOM (AiTer Optimized Model) is a lightweight, vLLM-like inference engine focused on integrating and optimizing AITER kernels.

🚀 Features

  • ROCm Optimized: Built on AMD's ROCm platform with AITER kernels (ASM, CK, Triton)
  • OpenAI-Compatible API: Drop-in server with /v1/chat/completions and /v1/completions endpoints
  • Piecewise torch.compile: 4 compilation levels with CUDA graph capture for low-latency decode
  • Multi-GPU Parallelism: Tensor parallelism (TP), data parallelism (DP), and expert parallelism (EP) with MORI all-to-all
  • Quantization: FP8, MXFP4, INT8, INT4 with auto-detection from HuggingFace configs
  • Speculative Decoding: Multi-Token Prediction (MTP) with EAGLE proposer
  • Prefix Caching: xxhash64-based KV cache block sharing across sequences
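
Prefix caching works by hashing fixed-size blocks of token IDs, chaining each block's hash with its predecessor so identical prefixes map to identical block IDs that can be shared. A minimal sketch of the idea (using the standard library's hashlib in place of xxhash64, and a block size of 4 chosen purely for illustration; ATOM's actual block size and hash are implementation details):

```python
import hashlib

BLOCK_SIZE = 4  # illustrative; real engines typically use 16+ tokens per block

def block_hashes(token_ids):
    """Chain-hash full blocks of tokens; equal prefixes yield equal hashes."""
    hashes = []
    prev = b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        h = hashlib.sha256(prev + str(block).encode()).hexdigest()
        hashes.append(h)
        prev = h.encode()
    return hashes

# Two sequences sharing a 4-token prefix share the first block hash,
# so the KV cache block for that prefix can be reused across both.
a = block_hashes([1, 2, 3, 4, 5, 6, 7, 8])
b = block_hashes([1, 2, 3, 4, 9, 9, 9, 9])
```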

Supported Models

| Model Family | HF Architecture | Dense/MoE | Notes |
|---|---|---|---|
| Llama | LlamaForCausalLM | Dense | Llama 2, Llama 3, Llama 3.1 |
| Qwen3 | Qwen3ForCausalLM | Dense | |
| Qwen3-MoE | Qwen3MoeForCausalLM | MoE | 128 experts, top-8 routing |
| DeepSeek V2/V3 | DeepseekV3ForCausalLM | MoE | MLA attention, MTP speculative decoding |
| Mixtral | MixtralForCausalLM | MoE | 8 experts, top-2 routing |
| GLM-4-MoE | Glm4MoeForCausalLM | MoE | |
| GLM-5 | GlmMoeDsaForCausalLM | MoE | MLA attention, similar to DeepSeek V3.2. See recipe |
| GPT-OSS | GptOssForCausalLM | MoE | Sliding window + attention sinks |
| Kimi-K2 | via --trust-remote-code | MoE | See recipe |
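
The MoE models above use top-k expert routing: a router scores every expert per token and only the k highest-scoring experts run. A pure-Python sketch of softmax top-k routing with renormalized weights (the 8-expert/top-2 shape mirrors Mixtral; the math here is illustrative, not ATOM's kernel):

```python
import math

def route(logits, k):
    """Pick the top-k experts and renormalize their softmax weights to sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # shift by max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(logits)), key=lambda i: -probs[i])[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# One token's router logits over 8 experts, top-2 routing (Mixtral-style shape)
picked = route([0.1, 2.0, -1.0, 0.5, 1.9, 0.0, -0.3, 0.2], k=2)
```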

📋 Requirements

  • AMD GPU with ROCm support
  • Docker

🛠️ Installation

1. Pull Docker Image

docker pull rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0

2. Run Docker Container

docker run -it --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v $HOME:/home/$USER \
  -v /mnt:/mnt \
  -v /data:/data \
  --shm-size=16G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0

3. Clone and Setup

pip install amd-aiter
git clone https://github.com/ROCm/ATOM.git; pip install ./ATOM

📚 Documentation

Full documentation: rocm.github.io/ATOM

Topic Description Guide
Architecture System overview, request lifecycle, component design Architecture Guide
Configuration Config classes, CLI arguments, environment variables Configuration Guide
Model Support Supported models, weight loading, adding new architectures Model Support Guide
Model Operations AITER kernel integration, linear/attention/MoE/norm wrappers Model Ops Guide
Scheduling & KV Cache Batch scheduling, block allocation, prefix caching Scheduling Guide
Compilation torch.compile levels, CUDA graphs, piecewise compilation Compilation Guide
Distributed Tensor/data/expert parallelism, multi-GPU deployment Distributed Guide
Serving & Benchmarks OpenAI API server, benchmarking, profiling, speculative decoding Serving Guide

Deployment Recipes:

💡 Usage

Basic Example

The default optimization level is 3 (piecewise torch.compile with CUDA graphs).

python -m atom.examples.simple_inference --model meta-llama/Meta-Llama-3-8B --kv_cache_dtype fp8

Note: First-time execution may take approximately 10 minutes for model compilation.

Serving

Start an OpenAI-compatible server:

# Single GPU
python -m atom.entrypoints.openai_server --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8

# Multi-GPU with tensor parallelism
python -m atom.entrypoints.openai_server --model deepseek-ai/DeepSeek-R1 --kv_cache_dtype fp8 -tp 8
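
Once a server is up, any OpenAI-compatible client can talk to it. A minimal sketch using only the standard library (the base URL and model name are placeholders; the payload shape follows the OpenAI /v1/chat/completions API):

```python
import json
import urllib.request

def chat_request(base_url, model, prompt, max_tokens=64):
    """Build a /v1/chat/completions request for an OpenAI-compatible server."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("http://localhost:8000", "Qwen/Qwen3-0.6B", "Hello!")
# To send against a running server:
#   resp = json.load(urllib.request.urlopen(req))
#   print(resp["choices"][0]["message"]["content"])
```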

Profiling

Profile offline inference:

python -m atom.examples.profile_offline --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8

With custom input/output lengths:

python -m atom.examples.profile_offline --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8 \
  --random-input --input-length 1024 --output-length 32

Profile a running server:

curl -s -S -X POST http://127.0.0.1:8000/start_profile
# ... run your workload ...
curl -s -S -X POST http://127.0.0.1:8000/stop_profile
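
The two curl calls above can be wrapped in a small context manager so the profiler is always stopped even if the workload raises. A standard-library sketch (the endpoint paths come from the commands above; the `post` hook is an illustrative seam for testing):

```python
import contextlib
import urllib.request

@contextlib.contextmanager
def server_profile(base_url, post=None):
    """Start the server-side profiler on entry, stop it on exit."""
    if post is None:
        post = lambda url: urllib.request.urlopen(
            urllib.request.Request(url, method="POST"))
    post(f"{base_url}/start_profile")
    try:
        yield
    finally:
        post(f"{base_url}/stop_profile")

# Usage against a running server:
#   with server_profile("http://127.0.0.1:8000"):
#       run_workload()
```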

Benchmarking

Run an online throughput benchmark against a running server:

MODEL=deepseek-ai/DeepSeek-R1
ISL=1024
OSL=1024
CONC=128
PORT=8000
RESULT_FILENAME=Deepseek-R1-result

python -m atom.benchmarks.benchmark_serving \
  --model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
  --dataset-name=random \
  --random-input-len=$ISL --random-output-len=$OSL \
  --random-range-ratio 0.8 \
  --num-prompts=$(( $CONC * 10 )) \
  --max-concurrency=$CONC \
  --request-rate=inf --ignore-eos \
  --save-result --percentile-metrics="ttft,tpot,itl,e2el" \
  --result-dir=./ --result-filename=$RESULT_FILENAME.json
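
The --percentile-metrics flag reports latency percentiles for TTFT (time to first token), TPOT (time per output token), ITL (inter-token latency), and E2EL (end-to-end latency). A sketch of how such a percentile can be computed from raw latency samples, using linear interpolation between ranks (benchmark_serving's exact method may differ):

```python
def percentile(samples, p):
    """p-th percentile with linear interpolation between sorted ranks."""
    xs = sorted(samples)
    rank = (p / 100) * (len(xs) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (rank - lo) * (xs[hi] - xs[lo])

# Hypothetical TTFT samples in milliseconds
ttft_ms = [12.0, 15.5, 11.2, 40.3, 13.1]
p50 = percentile(ttft_ms, 50)
p99 = percentile(ttft_ms, 99)
```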

Profile Analysis

ATOM supports automatic trace collection and analysis, which breaks down GPU kernel durations per module for both prefill and decode phases and exports the results to Excel (.xlsx) files.

Step 1: Collect a Trace

Launch the server with --torch-profiler-dir to enable the PyTorch profiler and --mark-trace to insert per-module annotations into the trace. Set TORCHINDUCTOR_COMPILE_THREADS=1 to ensure deterministic compilation order.

TORCHINDUCTOR_COMPILE_THREADS=1 python -m atom.entrypoints.openai_server \
  --model deepseek-ai/DeepSeek-R1 \
  --kv_cache_dtype fp8 -tp 8 \
  --torch-profiler-dir ./trace \
  --mark-trace

After the server processes requests and shuts down, two *.json.gz trace files will be generated in the --torch-profiler-dir directory.

Step 2: Analyze the Trace

Run parse_trace.py on the collected trace file (use the trace file whose name starts with the model name):

python ATOM/tools/parse_trace.py ./trace/model_name_ts_*.json.gz

This produces two Excel files in the current directory:

| Output File | Description |
|---|---|
| prefill_breakdown.xlsx | Per-kernel duration breakdown for one prefill layer |
| decode_breakdown.xlsx | Per-kernel duration breakdown for one decode layer |

Each file contains the columns cpu_module, gpu_kernel, and duration_us, along with per-module sums and values averaged across layers.
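
At its core, the breakdown amounts to grouping GPU kernel events by their CPU-side module annotation and summing durations. A toy sketch over a Chrome-trace-style event list (the real parse_trace.py handles *.json.gz files, layer selection, and Excel export; the field names here are illustrative):

```python
from collections import defaultdict

def kernel_breakdown(events):
    """Sum duration_us of kernel events, grouped by their module annotation."""
    sums = defaultdict(float)
    for ev in events:
        if ev.get("cat") == "kernel":
            sums[ev["args"]["module"]] += ev["dur"]
    return dict(sums)

# Hypothetical trace events; non-kernel events are ignored
trace = [
    {"cat": "kernel", "dur": 120.0, "args": {"module": "attention"}},
    {"cat": "kernel", "dur": 80.0, "args": {"module": "moe"}},
    {"cat": "kernel", "dur": 30.0, "args": {"module": "attention"}},
    {"cat": "cpu_op", "dur": 5.0, "args": {"module": "attention"}},
]
breakdown = kernel_breakdown(trace)
```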

Options:

| Flag | Default | Description |
|---|---|---|
| --layer N | 3 | Target transformer layer index to analyze (0-indexed) |

📊 Performance

Online Serving Throughput

DS R1 Performance

For more information, visit InferenceMAX.

Accuracy Validation

Install lm-eval to test model accuracy:

pip install "lm-eval[api]"

Start a server, then run the evaluation:

python -m atom.entrypoints.openai_server --model meta-llama/Meta-Llama-3-8B --kv_cache_dtype fp8
lm_eval --model local-completions \
        --model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
        --tasks gsm8k \
        --num_fewshot 5
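
gsm8k is scored by exact match on the final numeric answer (references end in "#### <number>"). A sketch of that extraction and scoring logic (illustrative only; lm-eval's own answer filters are more robust):

```python
import re

def final_answer(text):
    """Extract the last number in a completion, gsm8k-style."""
    nums = re.findall(r"-?\d[\d,]*\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else None

def exact_match(prediction, reference):
    """Compare the model's last number against the '#### <answer>' reference."""
    gold = reference.split("####")[-1].strip()
    return final_answer(prediction) == gold

score = exact_match("So the total is 42 apples.", "... #### 42")
```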

Acknowledgements

This project was adapted from nano-vllm.

Support & Reporting Issues

We welcome issues and contributions! Please use the GitHub Issues page to report bugs or request features: https://github.com/ROCm/ATOM/issues