ATOM Configuration Guide

ATOM (AiTer Optimized Model) is AMD's lightweight LLM inference engine built on AITER kernels for ROCm/HIP GPUs. This guide documents every configuration class, CLI flag, and environment variable that controls ATOM's runtime behaviour.


Quick Reference

| Config Class | Primary Purpose |
|---|---|
| Config | Master dataclass -- model path, memory, TP size, scheduler limits, KV cache, profiler, and references to all sub-configs |
| CompilationConfig | Compilation level (0-3), CUDA graph capture sizes, piecewise splitting ops, inductor settings |
| CompilationLevel | Integer constants for the four compilation levels |
| CUDAGraphMode | Enum controlling how CUDA graphs are captured (none / piecewise / full / hybrid) |
| QuantizationConfig | Layer-wise quantization orchestrator: global config, per-layer overrides, exclude lists, layer name remapping |
| LayerQuantConfig | Per-layer quantization parameters: quant type, dtype, dynamic flag, method |
| ParallelConfig | Data-parallel size, rank, master IP/port |
| SpeculativeConfig | Speculative decoding method, draft model, number of speculative tokens |
| KVCacheConfig / KVCacheTensor | Per-layer KV cache tensor descriptors (k/v caches and scales) |
| SamplingParams | Temperature, max tokens, stop strings, ignore-EOS flag |
| EngineArgs | CLI argument parser that builds a Config for LLMEngine |

1. Master Configuration (Config)

Defined in atom/config.py. The root dataclass that the engine consumes.

| Field | Type | Default | Description |
|---|---|---|---|
| model | str | (required) | HuggingFace model name or local path |
| trust_remote_code | bool | False | Trust remote code when loading the model from HuggingFace |
| max_num_batched_tokens | int | 16384 | Maximum number of tokens batched together per scheduler step |
| scheduler_delay_factor | float | 0.0 | Multiplicative delay (factor x previous prompt latency) before scheduling the next prompt |
| max_num_seqs | int | 512 | Maximum number of sequences batched together |
| max_model_len | int \| None | None | Maximum context length; defaults to hf_config.max_position_embeddings (capped by it when set) |
| gpu_memory_utilization | float | 0.9 | Fraction of GPU memory available for KV cache and weights (0.0 -- 1.0) |
| tensor_parallel_size | int | 1 | Number of tensor-parallel GPUs (1 -- 8) |
| enforce_eager | bool | False | Disable compilation and CUDA graphs; run in eager mode |
| parallel_config | ParallelConfig | ParallelConfig() | Data-parallel configuration (see Section 4) |
| kv_cache_block_size | int | 16 | Block size for paged KV cache; must be a multiple of 16 or exactly 1 |
| num_kvcache_blocks | int | -1 | Number of KV cache blocks (-1 = auto) |
| kv_cache_dtype | str | "bf16" | KV cache data type ("bf16" or "fp8") |
| enable_prefix_caching | bool | False | Enable prefix caching to reuse KV blocks across requests sharing the same prefix |
| port | int | 8006 | Engine internal communication port |
| torch_profiler_dir | str \| None | os.getenv("ATOM_TORCH_PROFILER_DIR", None) | Directory for saving PyTorch profiler traces; creates the directory if it does not exist |
| compilation_config | CompilationConfig | CompilationConfig() | Compilation and CUDA graph settings (see Section 2) |
| quant_config | QuantizationConfig | (auto-detected) | Quantization settings; auto-detected from HuggingFace config during __post_init__ via QuantizationConfig(hf_config) (see Section 3) |
| asyncio_mode | bool | False | Enable asyncio-based engine loop |
| load_dummy | bool | False | Skip loading model weights (for benchmarking / testing) |
| enable_expert_parallel | bool | False | Enable Expert Parallelism for MoE models |
| master_addr | str | "127.0.0.1" | Master address for distributed communication |
| graph_bs | Optional[list[int]] | None | Explicit list of batch sizes for CUDA graph capture; derived from compilation_config during init |
| enable_dp_attention | bool | False | Enable data-parallel attention |
| torch_dtype | torch.dtype | (computed) | Inferred from hf_config.torch_dtype; falls back to torch.bfloat16 |
| speculative_config | Optional[SpeculativeConfig] | None | Speculative decoding configuration (see Section 5) |
| bos_token_id | int | -1 | Beginning-of-sequence token ID (-1 = use model default) |
| eos_token_id | int | -1 | End-of-sequence token ID (-1 = use model default) |
| stop_token_ids | list[int] | [] | Additional stop token IDs; populated from GenerationConfig.eos_token_id during init |

Auto-derived fields (set in __post_init__, not user-supplied):

| Field | Type | Description |
|---|---|---|
| hf_config | PretrainedConfig | Loaded automatically via get_hf_config(model) |
| generation_config | GenerationConfig | Loaded automatically via get_generation_config(model) |
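
The value constraints listed in the field table can be checked before an engine is built. The following is a minimal sketch of those documented rules only; `resolve_max_model_len` and `check_config_limits` are hypothetical helper names that do not exist in ATOM, and the choice of `ValueError` is an assumption about how violations might be reported.

```python
from typing import Optional


def resolve_max_model_len(requested: Optional[int],
                          max_position_embeddings: int) -> int:
    """Documented rule: default to the HF limit and never exceed it."""
    if requested is None:
        return max_position_embeddings
    return min(requested, max_position_embeddings)


def check_config_limits(kv_cache_block_size: int = 16,
                        tensor_parallel_size: int = 1,
                        gpu_memory_utilization: float = 0.9) -> None:
    """Raise ValueError when a value falls outside its documented range.

    Hypothetical helper: ATOM's own checks may use different exceptions.
    """
    block_ok = kv_cache_block_size == 1 or (
        kv_cache_block_size > 0 and kv_cache_block_size % 16 == 0
    )
    if not block_ok:
        raise ValueError("kv_cache_block_size must be 1 or a multiple of 16")
    if not 1 <= tensor_parallel_size <= 8:
        raise ValueError("tensor_parallel_size must be between 1 and 8")
    if not 0.0 <= gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be within 0.0 -- 1.0")
```

For example, `resolve_max_model_len(8192, 4096)` yields 4096 because the requested length is capped by the model's position-embedding limit.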

2. Compilation Configuration (CompilationConfig)

Defined in atom/config.py. Controls torch.compile and CUDA graph behaviour.

2.1 Compilation Levels (CompilationLevel)

| Constant | Value | Description |
|---|---|---|
| NO_COMPILATION | 0 | No compilation -- pure eager execution |
| DYNAMO_AS_IS | 1 | Use torch.compile / TorchDynamo as-is |
| DYNAMO_ONCE | 2 | TorchDynamo with a single compilation pass |
| PIECEWISE | 3 | Piecewise compilation with CUDA graph capture (recommended for production) |

2.2 CompilationConfig Fields

| Field | Type | Default | Description |
|---|---|---|---|
| level | int | 0 | Compilation level (see table above); must be 0 -- 3 |
| use_cudagraph | bool | True | Whether to use CUDA graphs |
| cudagraph_capture_sizes | Optional[list[int]] | None | Explicit list of batch sizes for CUDA graph capture; overrides cuda_graph_sizes when set |
| cuda_graph_sizes | list[int] | [] (post-init: [512]) | CUDA graph sizing strategy: 1 value generates [1,2,4,8] + range(16, N+1, 16); multiple values used as-is; empty defaults to [512] |
| debug_dump_path | str | "" | Path to dump debug / compilation information |
| cache_dir | str | "" | Directory for compilation caches |
| use_inductor | bool | True | Enable TorchInductor backend |
| cudagraph_mode | Optional[CUDAGraphMode] | None | CUDA graph capture mode (see below); set to PIECEWISE automatically at level 3 |
| splitting_ops | Optional[list[str]] | None | Ops that split the graph into sub-graphs for piecewise compilation; auto-populated at level 3 with ["aiter.unified_attention_with_output", "aiter.mla_attention"] |
| cudagraph_copy_inputs | bool | False | Copy input tensors into internally managed buffers before CUDA graph replay; only effective in PIECEWISE mode |
| compile_sizes | Optional[list[Union[int, str]]] | None | Sizes to compile for inductor; accepts integers and the string "cudagraph_capture_sizes" |
| inductor_compile_config | dict | {} | Additional configuration passed to the inductor backend |
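
The sizing rule described for cuda_graph_sizes and cudagraph_capture_sizes can be read as a small function. This is a sketch of the rule as documented in the table, not ATOM's implementation; `expand_cuda_graph_sizes` is a hypothetical name.

```python
from typing import Optional


def expand_cuda_graph_sizes(cuda_graph_sizes: list,
                            cudagraph_capture_sizes: Optional[list] = None) -> list:
    """Return the batch sizes to capture, following the documented rule.

    Hypothetical helper that mirrors the table above:
      * an explicit cudagraph_capture_sizes list wins;
      * an empty cuda_graph_sizes falls back to [512];
      * a single value N expands to [1, 2, 4, 8] + range(16, N + 1, 16);
      * several values are used as given.
    """
    if cudagraph_capture_sizes is not None:
        return list(cudagraph_capture_sizes)
    sizes = list(cuda_graph_sizes) or [512]
    if len(sizes) == 1:
        upper = sizes[0]
        return [1, 2, 4, 8] + list(range(16, upper + 1, 16))
    return sizes


print(expand_cuda_graph_sizes([64]))
```

With a single value of 64 the expansion is [1, 2, 4, 8, 16, 32, 48, 64]. Note that the rule as stated always includes 1, 2, 4, and 8, even when the single value is smaller than 8.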

2.3 CUDA Graph Mode (CUDAGraphMode)

| Mode | Value | Description |
|---|---|---|
| NONE | 0 | No CUDA graph capture |
| PIECEWISE | 1 | Piecewise CUDA graphs -- attention ops stay outside the graph for flexibility (default at level 3) |
| FULL | 2 | Full CUDA graph capture for all batches; best for small models / short prompts |
| FULL_DECODE_ONLY | (FULL, NONE) | Full CUDA graphs for decode batches only; mixed prefill-decode runs without graphs (useful in P/D setups) |
| FULL_AND_PIECEWISE | (FULL, PIECEWISE) | Full graphs for decode, piecewise for prefill/mixed -- most performant mode for most models |

Helper methods on CUDAGraphMode:

  • decode_mode() -- returns the mode used for pure decode batches.
  • mixed_mode() -- returns the mode used for mixed prefill-decode batches.
  • requires_piecewise_compilation() -- whether the mode needs piecewise compilation.
  • has_full_cudagraphs() -- whether the mode includes full CUDA graph capture.
  • separate_routine() -- whether decode and mixed batches use different routines.
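
To make the tuple-valued modes and the helper methods concrete, here is a self-contained sketch of an enum with the same member values and method names as listed above. The method bodies are an assumption inferred from the descriptions; ATOM's own class in atom/config.py may differ in detail.

```python
import enum


class CUDAGraphMode(enum.Enum):
    """Stand-in enum; values follow the table above."""
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    # Tuple values read as (decode mode, mixed prefill-decode mode).
    FULL_DECODE_ONLY = (FULL, NONE)
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)

    def separate_routine(self) -> bool:
        # Only the tuple-valued modes treat decode and mixed batches differently.
        return isinstance(self.value, tuple)

    def decode_mode(self) -> "CUDAGraphMode":
        return CUDAGraphMode(self.value[0]) if self.separate_routine() else self

    def mixed_mode(self) -> "CUDAGraphMode":
        return CUDAGraphMode(self.value[1]) if self.separate_routine() else self

    def requires_piecewise_compilation(self) -> bool:
        return CUDAGraphMode.PIECEWISE in (self.decode_mode(), self.mixed_mode())

    def has_full_cudagraphs(self) -> bool:
        return CUDAGraphMode.FULL in (self.decode_mode(), self.mixed_mode())


mode = CUDAGraphMode.FULL_AND_PIECEWISE
print(mode.decode_mode(), mode.mixed_mode())
```

Under these assumptions, FULL_AND_PIECEWISE resolves to FULL for decode batches and PIECEWISE for mixed batches, so it both requires piecewise compilation and has full CUDA graphs.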

3. Quantization Configuration (QuantizationConfig & LayerQuantConfig)

Defined in atom/config.py. The quantization system uses two classes:

  • QuantizationConfig -- the top-level orchestrator that holds a global config, per-layer overrides, and exclusion lists. It is not a dict subclass.
  • LayerQuantConfig(dict) -- a dict subclass that stores the concrete quantization parameters for a single layer (or as the global default).

3.1 LayerQuantConfig Fields

LayerQuantConfig extends dict. Fields are stored and accessed as dictionary keys (e.g., cfg["quant_type"]).

| Key | Type | Default | Description |
|---|---|---|---|
| quant_type | QuantType | QuantType.No | Quantization granularity (see below) |
| quant_dtype | torch.dtype | torch.bfloat16 | Data type for quantized weights |
| is_dynamic | bool | True | Use dynamic quantization (scales computed at runtime) |
| quant_method | str | "" | Quantization method (e.g., "quark", "compressed-tensors") |

3.2 QuantizationConfig Attributes

| Attribute | Type | Description |
|---|---|---|
| torch_dtype | torch.dtype | The model's default dtype (from hf_config.torch_dtype) |
| hf_quant_config | dict \| None | Raw quantization_config dict from HuggingFace config |
| global_quant_config | LayerQuantConfig | Default quantization config applied to all layers |
| layer_quant_config | dict[str, LayerQuantConfig] | Per-layer overrides keyed by layer name pattern (supports fnmatch globs like "*.mlp.*") |
| exclude_layers | list[str] | Layer names excluded from quantization (supports exact match and "re:" regex prefix) |
| quant_method | str | Top-level quantization method name (e.g., "quark", "compressed-tensors") |

Key methods:

| Method | Description |
|---|---|
| get_name() | Returns the quantization method name |
| get_layer_quant_config(layer_name) | Returns the LayerQuantConfig for a layer: checks exclusions first, then per-layer overrides, then falls back to global config |
| should_ignore_layer_quant(layer_name) | Returns True if the layer is in the exclusion list |
| remap_layer_name(hf_config, packed_modules_mapping) | Remaps layer names for packed/fused modules (e.g., q_a_proj -> fused_qkv_a_proj for DeepSeek) |
| compute_hash() | Returns a SHA-256 hash of the quantization config for cache invalidation |
| parse_quark_config_dict(config) | Parses a quark-format config dict into a LayerQuantConfig |

3.3 QuantType Values (from AITER)

| Value | Description |
|---|---|
| QuantType.No | No quantization |
| QuantType.per_Token | Per-token / per-channel quantization |
| QuantType.per_1x128 | Block quantization with group size 128 |
| QuantType.per_1x32 | Block quantization with group size 32 |
| QuantType.per_128x128 | Large 2D block quantization (remapped to per_1x128 in MoE kernels) |
| QuantType.per_Tensor | Per-tensor quantization |

3.4 Supported Quantization Dtypes

| Dtype | AITER Key | Notes |
|---|---|---|
| FP8 (E4M3) | "fp8" | 8-bit floating point |
| MXFP4 | "fp4x2" | Microscaling FP4; forces QuantType.per_1x32 |
| INT8 | "i8" | 8-bit integer |
| INT4 | "i4x2" | 4-bit integer (packed) |

3.5 Auto-Detection from HuggingFace

During Config.__post_init__, ATOM constructs QuantizationConfig(hf_config) which reads hf_config.quantization_config and automatically determines quantization parameters:

For quark models (quant_method == "quark"):

  1. Parses global_quant_config dict via parse_quark_config_dict() to produce the global LayerQuantConfig.
  2. Parses each entry in layer_quant_config dict to produce per-layer overrides.
  3. Reads the "exclude" list for excluded layers.
  4. Within each config dict, weight.qscheme determines quant_type ("per_channel" -> per_Token, "per_tensor" -> per_Tensor, "per_group" -> per_1x32), and weight.dtype determines quant_dtype.
  5. input_tensors.is_dynamic controls dynamic quantization (defaults to True if absent).
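
The quark steps above can be sketched on plain dictionaries. This is an illustration only: `parse_quark_entry` is a hypothetical stand-in for parse_quark_config_dict(), quant types are represented as strings rather than AITER QuantType members, the dtype is passed through unparsed, and the fallback to "No" for an unrecognised qscheme is an assumption.

```python
# qscheme -> quant type mapping as documented in step 4 (names as strings).
QSCHEME_TO_QUANT_TYPE = {
    "per_channel": "per_Token",
    "per_tensor": "per_Tensor",
    "per_group": "per_1x32",
}


def parse_quark_entry(entry: dict) -> dict:
    """Turn one quark-style config dict into LayerQuantConfig-like keys."""
    weight = entry.get("weight") or {}
    inputs = entry.get("input_tensors") or {}
    return {
        # Assumption: unknown or missing qscheme means "no quantization".
        "quant_type": QSCHEME_TO_QUANT_TYPE.get(weight.get("qscheme"), "No"),
        # Raw dtype string; the real parser maps it to a torch dtype.
        "quant_dtype": weight.get("dtype"),
        # Step 5: dynamic quantization defaults to True when absent.
        "is_dynamic": inputs.get("is_dynamic", True),
        "quant_method": "quark",
    }


example = {"weight": {"qscheme": "per_group", "dtype": "fp4"}}
print(parse_quark_entry(example))
```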

For other models (compressed-tensors, etc.):

  1. If quant_method == "compressed-tensors" or channel quantization is detected, sets per_Token.
  2. If weight_block_size or group_size is found: group size 128 maps to per_1x128, group size 32 maps to per_1x32.
  3. Otherwise falls back to per_Tensor.
  4. The dtype is parsed from fields like dtype, weight_dtype, or quant_method looking for fp8, fp4, mxfp4, int8, int4, or num_bits.
  5. If activation_scheme is "static", is_dynamic is set to False.
  6. Excluded layers are read from the "ignore" key.

3.6 Layer-Level Quantization Dispatch

Linear layers, MoE layers, and fused ops call quant_config.get_layer_quant_config(prefix) to obtain the appropriate LayerQuantConfig for their position in the model. This enables mixed-precision quantization where different layers can have different quant types and dtypes (e.g., FP8 for attention, FP4 for MLP).
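
The lookup order (exclusions, then per-layer overrides, then the global config) can be sketched with the standard library. Everything here is a stand-in: configs are plain dicts, the function names are hypothetical, returning an unquantized config for excluded layers is an assumption, and so is the use of `re.match` for "re:" patterns.

```python
import fnmatch
import re

UNQUANTIZED = {"quant_type": "No"}  # assumed result for excluded layers


def is_excluded(layer_name: str, exclude_layers: list) -> bool:
    """Exact match, or regex match when the entry starts with 're:'."""
    for pattern in exclude_layers:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], layer_name):
                return True
        elif pattern == layer_name:
            return True
    return False


def lookup_layer_config(layer_name: str, global_cfg: dict,
                        layer_overrides: dict, exclude_layers: list) -> dict:
    """Exclusions first, then glob overrides, then the global default."""
    if is_excluded(layer_name, exclude_layers):
        return dict(UNQUANTIZED)
    for pattern, cfg in layer_overrides.items():
        if fnmatch.fnmatchcase(layer_name, pattern):
            return cfg
    return global_cfg


fp8 = {"quant_type": "per_Token", "quant_dtype": "fp8"}
fp4 = {"quant_type": "per_1x32", "quant_dtype": "fp4x2"}
overrides = {"*.mlp.*": fp4}
excluded = ["lm_head", "re:.*router.*"]

print(lookup_layer_config("model.layers.0.self_attn.q_proj", fp8, overrides, excluded))
print(lookup_layer_config("model.layers.0.mlp.gate_proj", fp8, overrides, excluded))
```

In this example the attention projection falls back to the global FP8 config, while the MLP projection picks up the FP4 override, which is the mixed-precision case described above.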


4. Parallel Configuration (ParallelConfig)

Defined in atom/config.py. Controls data parallelism. Environment variables (Section 8) override defaults when set.

| Field | Type | Default | Description |
|---|---|---|---|
| data_parallel_size | int | 1 | Number of data-parallel groups; overridden by ATOM_DP_SIZE env var |
| data_parallel_size_local | int | 1 | Number of local data-parallel groups |
| data_parallel_rank | int | 0 | Rank within the data-parallel group; overridden by ATOM_DP_RANK |
| data_parallel_rank_local | Optional[int] | None | Local rank within the data-parallel group (SPMD mode); overridden by ATOM_DP_RANK_LOCAL |
| data_parallel_master_port | int | 29500 | Port used by the data-parallel master for process group initialization |
| data_parallel_base_port | int | get_open_port() | Base port for data-parallel communication (dynamically assigned) |
| data_parallel_master_ip | str | "127.0.0.1" | IP address of the data-parallel master |

Computed properties:

  • world_size -- set during init, equals TP x PP.
  • world_size_across_dp -- world_size * data_parallel_size.
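
As a rough illustration of the environment overrides and the two computed sizes, the sketch below reads ATOM_DP_SIZE and ATOM_DP_RANK with the documented defaults and multiplies out the world sizes. The function names and the `pipeline_parallel_size` parameter are hypothetical; this guide lists no pipeline-parallel field, so it defaults to 1 here.

```python
import os
from typing import Mapping, Optional


def dp_settings_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the data-parallel overrides, falling back to documented defaults."""
    env = os.environ if env is None else env
    return {
        "data_parallel_size": int(env.get("ATOM_DP_SIZE", 1)),
        "data_parallel_rank": int(env.get("ATOM_DP_RANK", 0)),
    }


def world_size_across_dp(tensor_parallel_size: int,
                         data_parallel_size: int,
                         pipeline_parallel_size: int = 1) -> int:
    """world_size = TP x PP; world_size_across_dp = world_size * DP."""
    world_size = tensor_parallel_size * pipeline_parallel_size
    return world_size * data_parallel_size


settings = dp_settings_from_env({"ATOM_DP_SIZE": "4", "ATOM_DP_RANK": "2"})
print(world_size_across_dp(8, settings["data_parallel_size"]))
```

With TP=8 and four data-parallel groups, the total number of ranks across data parallelism is 32.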

5. Speculative Decoding Configuration (SpeculativeConfig)

Defined in atom/config.py. Currently only the Multi-Token Prediction (MTP) method with num_speculative_tokens=1 is supported.

| Field | Type | Default | Description |
|---|---|---|---|
| method | Optional[str] | "" | Speculative decoding method; currently only "mtp" is accepted |
| model | Optional[str] | None | Draft model name or path (typically the same as the target model for MTP) |
| num_speculative_tokens | Optional[int] | None | Number of speculative tokens per iteration; must be 1 |
| draft_model_hf_config | Optional[PretrainedConfig] | None | HuggingFace config for the draft model; auto-loaded from model when None |

Post-init behaviour:

  • Loads draft_model_hf_config from model if not provided.
  • For DeepSeek V3 / MTP models: overrides model_type to "deepseek_mtp", sets n_predict=1 and num_nextn_predict_layers=1, and switches architectures to ["DeepSeekMTPModel"].
  • Config.__post_init__ raises ValueError if num_speculative_tokens != 1.
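
The two restrictions (method must be "mtp", exactly one speculative token) amount to a short validation step. The sketch below is a hypothetical stand-alone check; the source only states that a ValueError is raised for num_speculative_tokens, so raising the same exception for an unsupported method is an assumption.

```python
from typing import Optional


def validate_speculative(method: Optional[str],
                         num_speculative_tokens: Optional[int]) -> None:
    """Reject speculative settings outside the currently supported set."""
    if method != "mtp":
        # Assumption: the exception type for a bad method is not documented.
        raise ValueError(f"unsupported speculative method: {method!r}")
    if num_speculative_tokens != 1:
        raise ValueError("num_speculative_tokens must be 1")


validate_speculative("mtp", 1)  # passes silently
```

On the command line the same combination corresponds to --method mtp --num-speculative-tokens 1 (see Section 7).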

6. Sampling Parameters (SamplingParams)

Defined in atom/sampling_params.py. Passed per-request to control generation.

| Field | Type | Default | Description |
|---|---|---|---|
| temperature | float | 1.0 | Sampling temperature; lower values make output more deterministic |
| max_tokens | int | 64 | Maximum number of tokens to generate |
| ignore_eos | bool | False | Continue generating past the EOS token |
| stop_strings | Optional[list[str]] | None | List of strings that trigger generation to stop |
7. CLI Arguments (EngineArgs)

Defined in atom/model_engine/arg_utils.py. The EngineArgs dataclass exposes all flags via add_cli_args() and converts them into a Config via create_engine().

| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| --model | | str | "Qwen/Qwen3-0.6B" | Model name or path |
| --trust-remote-code | | flag | False | Trust remote code when loading model |
| --tensor-parallel-size | -tp | int | 1 | Tensor parallel size |
| --data-parallel-size | -dp | int | 1 | Data parallel size |
| --enforce-eager | | flag | False | Enforce eager mode execution |
| --enable_prefix_caching | | flag | False | Enable prefix caching |
| --port | | int | 8006 | Engine internal port |
| --kv_cache_dtype | | str | "bf16" | KV cache dtype; choices: bf16, fp8 |
| --block-size | | int | 16 | KV cache block size (maps to kv_cache_block_size) |
| --max-model-len | | int | None | Maximum model context length; defaults to hf_config.max_position_embeddings |
| --cudagraph-capture-sizes | | str | "[1,2,4,8,16,32,48,64,128,256]" | CUDA graph capture sizes as a Python list string |
| --level | | int | 3 | Compilation level (0 -- 3) |
| --load_dummy | | flag | False | Skip loading model weights |
| --enable-expert-parallel | | flag | False | Enable Expert Parallelism (EP MoE) |
| --torch-profiler-dir | | str | None | Directory for torch profiler traces |
| --enable-dp-attention | | flag | False | Enable DP attention |
| --method | | str | None | Speculative method; choices: mtp |
| --num-speculative-tokens | | int | 1 | Number of speculative tokens per iteration |
| --max-num-batched-tokens | | int | 16384 | Maximum number of tokens to batch in the async engine |
| --max-num-seqs | | int | 512 | Maximum number of sequences to batch together |
| --gpu-memory-utilization | | float | 0.9 | Fraction of GPU memory to use (0.0 -- 1.0) |
| --scheduler-delay-factor | | float | 0.0 | Delay factor multiplied by previous prompt latency before scheduling next prompt |

Example:

python -m atom.entrypoint \
    --model deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 8 \
    --level 3 \
    --cudagraph-capture-sizes "[1,2,4,8,16,32,64,128,256]" \
    --kv_cache_dtype fp8 \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 256
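
Because --cudagraph-capture-sizes takes a Python list written as a string, the value has to be parsed before use. One safe way to do that is `ast.literal_eval`, sketched below; `parse_capture_sizes` is a hypothetical helper, and how ATOM itself parses the flag is not stated in this guide.

```python
import ast


def parse_capture_sizes(text: str) -> list:
    """Parse a string such as "[1,2,4,8]" into a list of positive ints."""
    value = ast.literal_eval(text)  # evaluates literals only, never code
    if not isinstance(value, (list, tuple)):
        raise ValueError("expected a list of batch sizes")
    for item in value:
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(item, bool) or not isinstance(item, int) or item <= 0:
            raise ValueError(f"invalid batch size: {item!r}")
    return list(value)


print(parse_capture_sizes("[1,2,4,8,16,32,64,128,256]"))
```

Quoting the list in the shell, as in the example command above, keeps the brackets and commas from being interpreted before they reach the parser.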

8. Environment Variables

8.1 Variables Registered in atom/utils/envs.py

All variables use lazy evaluation. Boolean variables treat "1" as True and anything else (including unset) as False, unless noted otherwise.

| Variable | Type | Default | Description |
|---|---|---|---|
| ATOM_DP_RANK | int | 0 | Data-parallel rank of this process |
| ATOM_DP_RANK_LOCAL | int | 0 | Local data-parallel rank (for SPMD mode) |
| ATOM_DP_SIZE | int | 1 | Total number of data-parallel groups |
| ATOM_DP_MASTER_IP | str | "127.0.0.1" | IP address of the data-parallel master |
| ATOM_DP_MASTER_PORT | int | 29500 | Port of the data-parallel master |
| ATOM_ENFORCE_EAGER | | | Removed. Use CLI flag --enforce-eager instead. |
| ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION | bool | False | Enable QK-norm + RoPE + cache + quant fusion; enable for Qwen3-MoE models |
| ATOM_USE_TRITON_GEMM | bool | False | Use Triton-based GEMM kernels instead of default backends |
| ATOM_USE_TRITON_MXFP4_BMM | bool | False | Use Triton-based MXFP4 batched matrix multiply |
| ATOM_ENABLE_DS_INPUT_RMSNORM_QUANT_FUSION | bool | True | Enable fused input RMSNorm + quantization for DeepSeek models |
| ATOM_ENABLE_DS_QKNORM_QUANT_FUSION | bool | True | Enable fused QK-norm + quantization for DeepSeek models |
| ATOM_ENABLE_ALLREDUCE_RMSNORM_FUSION | bool | True | Enable fused all-reduce + RMSNorm kernel |
| ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_RMSNORM_QUANT | bool | True | Enable AITER Triton fused RMSNorm + quantization for LLaMA models |
| ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_SILU_MUL_QUANT | bool | True | Enable AITER Triton fused SiLU + multiply + quantization for LLaMA models |
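
Lazy evaluation and the "1 means True" convention can be illustrated with a table of getter functions that read the environment only when a value is requested. This is one common way to build such a registry, shown as a sketch; the `_ENV_GETTERS` table and `get_env` function are hypothetical and not taken from atom/utils/envs.py.

```python
import os

# Each entry is evaluated on access, so changes to os.environ made after
# import time are still picked up.
_ENV_GETTERS = {
    "ATOM_DP_SIZE": lambda: int(os.getenv("ATOM_DP_SIZE", "1")),
    # Default False: only the exact string "1" enables the feature.
    "ATOM_USE_TRITON_GEMM": lambda: os.getenv("ATOM_USE_TRITON_GEMM", "0") == "1",
    # Default True: unset behaves like "1"; any other value disables it.
    "ATOM_ENABLE_ALLREDUCE_RMSNORM_FUSION": lambda: os.getenv(
        "ATOM_ENABLE_ALLREDUCE_RMSNORM_FUSION", "1") == "1",
}


def get_env(name: str):
    """Look up a registered variable, reading the environment at call time."""
    return _ENV_GETTERS[name]()


os.environ["ATOM_USE_TRITON_GEMM"] = "1"
print(get_env("ATOM_USE_TRITON_GEMM"))
```

Under this convention, values such as "true" or "yes" do not enable a flag; only "1" does.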

8.2 Additional Environment Variables (Used Outside envs.py)

| Variable | Type | Default | Where Used | Description |
|---|---|---|---|---|
| ATOM_TORCH_PROFILER_DIR | str | None | atom/config.py (Config.torch_profiler_dir) | Directory for PyTorch profiler output; sets the default for Config.torch_profiler_dir |
| ATOM_PROFILER_MORE | str | "0" | atom/model_engine/model_runner.py | Set to "1" to enable detailed profiling (record_shapes, with_stack, profile_memory) |
| HF_TOKEN | str | None | atom/config.py (get_hf_config) | HuggingFace authentication token for gated model downloads |
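
The effect of ATOM_PROFILER_MORE can be pictured as toggling three keyword arguments of `torch.profiler.profile`. The helper below is a hypothetical sketch that builds those keyword arguments without importing torch; how model_runner.py wires them up is not shown in this guide.

```python
import os
from typing import Mapping, Optional


def profiler_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the detail-related profiler options for the current environment."""
    env = os.environ if env is None else env
    detailed = env.get("ATOM_PROFILER_MORE", "0") == "1"
    return {
        "record_shapes": detailed,
        "with_stack": detailed,
        "profile_memory": detailed,
    }


print(profiler_kwargs({"ATOM_PROFILER_MORE": "1"}))
```

The resulting dict could be passed as `torch.profiler.profile(**profiler_kwargs())`. Detailed profiling adds overhead and larger traces, which is presumably why it is opt-in.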

9. Decision Tree -- Choosing a Compilation Level

Start
  |
  v
Is this a debugging / development run?
  |-- Yes --> Level 0 (NO_COMPILATION) or --enforce-eager
  |
  v
Do you need torch.compile but no graph splitting?
  |-- Yes, one-shot compile --> Level 2 (DYNAMO_ONCE)
  |-- Yes, keep Dynamo default --> Level 1 (DYNAMO_AS_IS)
  |
  v
Production inference on ROCm/HIP GPU?
  |-- Yes --> Level 3 (PIECEWISE) [default in EngineArgs]
              - Auto-sets CUDAGraphMode.PIECEWISE
              - Auto-populates splitting_ops for attention ops
              - Pair with --cudagraph-capture-sizes for your batch profile
  |
  v
Need maximum decode throughput?
  |-- Yes --> Level 3 + set cudagraph_mode to FULL_AND_PIECEWISE
              (full graphs for decode, piecewise for prefill)

Rules of thumb:

  • Level 3 is the default for EngineArgs and is recommended for most production workloads.
  • Level 0 / --enforce-eager is useful for debugging, profiling, or when CUDA graphs are incompatible with your model.
  • Match --cudagraph-capture-sizes to your expected batch sizes for optimal memory usage and launch latency.
  • When using --enable-dp-attention or Expert Parallelism (--enable-expert-parallel), level 3 is still recommended.
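
The decision tree can also be written as a small function, which may be convenient in launch scripts. The function and its parameter names are hypothetical; only the returned level numbers come from the tree above.

```python
def choose_compilation_level(debugging: bool = False,
                             plain_compile: bool = False,
                             one_shot: bool = False) -> int:
    """Walk the decision tree and return a value for --level.

    debugging     -> level 0 (NO_COMPILATION)
    plain_compile -> torch.compile without graph splitting:
                     level 2 (DYNAMO_ONCE) if one_shot, else level 1
    otherwise     -> level 3 (PIECEWISE), the EngineArgs default
    """
    if debugging:
        return 0
    if plain_compile:
        return 2 if one_shot else 1
    return 3


print(choose_compilation_level())  # production default
```

Choosing FULL_AND_PIECEWISE for maximum decode throughput is a separate cudagraph_mode setting on top of level 3, so it is not encoded in the return value.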

Source Files

| File | Description |
|---|---|
| atom/config.py | Config, CompilationConfig, CompilationLevel, CUDAGraphMode, LayerQuantConfig, QuantizationConfig, ParallelConfig, SpeculativeConfig, KVCacheTensor, KVCacheConfig, get_hf_config |
| atom/utils/envs.py | All ATOM_* environment variable definitions with lazy evaluation |
| atom/model_engine/arg_utils.py | EngineArgs dataclass and CLI argument parser |
| atom/sampling_params.py | SamplingParams dataclass |
| atom/model_engine/model_runner.py | Uses ATOM_PROFILER_MORE and ATOM_TORCH_PROFILER_DIR for profiling |