ATOM Configuration Guide

ATOM (AiTer Optimized Model) is AMD's lightweight LLM inference engine built on AITER kernels for ROCm/HIP GPUs. This guide documents every configuration class, CLI flag, and environment variable that controls ATOM's runtime behaviour.

Quick Reference

Config Class	Primary Purpose
`Config`	Master dataclass -- model path, memory, TP size, scheduler limits, KV cache, profiler, and references to all sub-configs
`CompilationConfig`	Compilation level (0-3), CUDA graph capture sizes, piecewise splitting ops, inductor settings
`CompilationLevel`	Integer constants for the four compilation levels
`CUDAGraphMode`	Enum controlling how CUDA graphs are captured (none / piecewise / full / hybrid)
`QuantizationConfig`	Layer-wise quantization orchestrator: global config, per-layer overrides, exclude lists, layer name remapping
`LayerQuantConfig`	Per-layer quantization parameters: quant type, dtype, dynamic flag, method
`ParallelConfig`	Data-parallel size, rank, master IP/port
`SpeculativeConfig`	Speculative decoding method, draft model, number of speculative tokens
`KVCacheConfig` / `KVCacheTensor`	Per-layer KV cache tensor descriptors (k/v caches and scales)
`SamplingParams`	Temperature, max tokens, stop strings, ignore-EOS flag
`EngineArgs`	CLI argument parser that builds a `Config` for `LLMEngine`

1. Master Configuration (`Config`)

Defined in atom/config.py. The root dataclass that the engine consumes.

Field	Type	Default	Description
`model`	`str`	(required)	HuggingFace model name or local path
`trust_remote_code`	`bool`	`False`	Trust remote code when loading the model from HuggingFace
`max_num_batched_tokens`	`int`	`16384`	Maximum number of tokens batched together per scheduler step
`scheduler_delay_factor`	`float`	`0.0`	Multiplicative delay (factor x previous prompt latency) before scheduling the next prompt
`max_num_seqs`	`int`	`512`	Maximum number of sequences batched together
`max_model_len`	`int | None`	`None`	Maximum context length; defaults to `hf_config.max_position_embeddings`, and a user-supplied value is capped at that limit
`gpu_memory_utilization`	`float`	`0.9`	Fraction of GPU memory available for KV cache and weights (0.0 -- 1.0)
`tensor_parallel_size`	`int`	`1`	Number of tensor-parallel GPUs (1 -- 8)
`enforce_eager`	`bool`	`False`	Disable compilation and CUDA graphs; run in eager mode
`parallel_config`	`ParallelConfig`	`ParallelConfig()`	Data-parallel configuration (see Section 4)
`kv_cache_block_size`	`int`	`16`	Block size for paged KV cache; must be a multiple of 16 or exactly 1
`num_kvcache_blocks`	`int`	`-1`	Number of KV cache blocks (`-1` = auto)
`kv_cache_dtype`	`str`	`"bf16"`	KV cache data type (`"bf16"` or `"fp8"`)
`enable_prefix_caching`	`bool`	`False`	Enable prefix caching to reuse KV blocks across requests sharing the same prefix
`port`	`int`	`8006`	Engine internal communication port
`torch_profiler_dir`	`str | None`	`os.getenv("ATOM_TORCH_PROFILER_DIR", None)`	Directory for saving PyTorch profiler traces; the directory is created if it does not exist
`compilation_config`	`CompilationConfig`	`CompilationConfig()`	Compilation and CUDA graph settings (see Section 2)
`quant_config`	`QuantizationConfig`	(auto-detected)	Quantization settings; auto-detected from HuggingFace config during `__post_init__` via `QuantizationConfig(hf_config)` (see Section 3)
`asyncio_mode`	`bool`	`False`	Enable asyncio-based engine loop
`load_dummy`	`bool`	`False`	Skip loading model weights (for benchmarking / testing)
`enable_expert_parallel`	`bool`	`False`	Enable Expert Parallelism for MoE models
`master_addr`	`str`	`"127.0.0.1"`	Master address for distributed communication
`graph_bs`	`Optional[list[int]]`	`None`	Explicit list of batch sizes for CUDA graph capture; derived from `compilation_config` during init
`enable_dp_attention`	`bool`	`False`	Enable data-parallel attention
`torch_dtype`	`torch.dtype`	(computed)	Inferred from `hf_config.torch_dtype`; falls back to `torch.bfloat16`
`speculative_config`	`Optional[SpeculativeConfig]`	`None`	Speculative decoding configuration (see Section 5)
`bos_token_id`	`int`	`-1`	Beginning-of-sequence token ID (`-1` = use model default)
`eos_token_id`	`int`	`-1`	End-of-sequence token ID (`-1` = use model default)
`stop_token_ids`	`list[int]`	`[]`	Additional stop token IDs; populated from `GenerationConfig.eos_token_id` during init
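
The range constraints in the table above can be read as a few simple checks. The sketch below is not ATOM's validation code; `resolve_max_model_len` and `check_config_values` are hypothetical helper names, and the use of `ValueError` is an assumption made for illustration.

```python
from typing import Optional


def resolve_max_model_len(user_value: Optional[int], hf_max_positions: int) -> int:
    """Default to the model's limit; cap a user-supplied value at that limit."""
    if user_value is None:
        return hf_max_positions
    return min(user_value, hf_max_positions)


def check_config_values(
    kv_cache_block_size: int = 16,
    gpu_memory_utilization: float = 0.9,
    tensor_parallel_size: int = 1,
) -> None:
    """Raise ValueError when a value falls outside the documented range."""
    # Block size must be exactly 1 or a positive multiple of 16.
    if kv_cache_block_size < 1 or (
        kv_cache_block_size != 1 and kv_cache_block_size % 16 != 0
    ):
        raise ValueError("kv_cache_block_size must be 1 or a multiple of 16")
    if not 0.0 <= gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be within 0.0 and 1.0")
    if not 1 <= tensor_parallel_size <= 8:
        raise ValueError("tensor_parallel_size must be within 1 and 8")
```

For example, `resolve_max_model_len(8192, 4096)` returns `4096`, while `check_config_values(kv_cache_block_size=24)` raises `ValueError`.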

Auto-derived fields (set in __post_init__, not user-supplied):

Field	Type	Description
`hf_config`	`PretrainedConfig`	Loaded automatically via `get_hf_config(model)`
`generation_config`	`GenerationConfig`	Loaded automatically via `get_generation_config(model)`

2. Compilation Configuration (`CompilationConfig`)

Defined in atom/config.py. Controls torch.compile and CUDA graph behaviour.

2.1 Compilation Levels (`CompilationLevel`)

Constant	Value	Description
`NO_COMPILATION`	`0`	No compilation -- pure eager execution
`DYNAMO_AS_IS`	`1`	Use torch.compile / TorchDynamo as-is
`DYNAMO_ONCE`	`2`	TorchDynamo with a single compilation pass
`PIECEWISE`	`3`	Piecewise compilation with CUDA graph capture (recommended for production)

2.2 `CompilationConfig` Fields

Field	Type	Default	Description
`level`	`int`	`0`	Compilation level (see table above); must be 0 -- 3
`use_cudagraph`	`bool`	`True`	Whether to use CUDA graphs
`cudagraph_capture_sizes`	`Optional[list[int]]`	`None`	Explicit list of batch sizes for CUDA graph capture; overrides `cuda_graph_sizes` when set
`cuda_graph_sizes`	`list[int]`	`[]` (post-init: `[512]`)	CUDA graph sizing strategy: 1 value generates `[1,2,4,8] + range(16, N+1, 16)`; multiple values used as-is; empty defaults to `[512]`
`debug_dump_path`	`str`	`""`	Path to dump debug / compilation information
`cache_dir`	`str`	`""`	Directory for compilation caches
`use_inductor`	`bool`	`True`	Enable TorchInductor backend
`cudagraph_mode`	`Optional[CUDAGraphMode]`	`None`	CUDA graph capture mode (see below); set to `PIECEWISE` automatically at level 3
`splitting_ops`	`Optional[list[str]]`	`None`	Ops that split the graph into sub-graphs for piecewise compilation; auto-populated at level 3 with `["aiter.unified_attention_with_output", "aiter.mla_attention"]`
`cudagraph_copy_inputs`	`bool`	`False`	Copy input tensors into internally managed buffers before CUDA graph replay; only effective in PIECEWISE mode
`compile_sizes`	`Optional[list[Union[int, str]]]`	`None`	Sizes to compile for inductor; accepts integers and the string `"cudagraph_capture_sizes"`
`inductor_compile_config`	`dict`	`{}`	Additional configuration passed to the inductor backend
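
The interaction between `cudagraph_capture_sizes` and `cuda_graph_sizes` is easier to follow as code. The function below is a sketch that applies the rules from the table literally; `resolve_capture_sizes` is a hypothetical name, and the real post-init logic in `atom/config.py` may differ in details such as sorting or de-duplication.

```python
from typing import Optional


def resolve_capture_sizes(
    cuda_graph_sizes: list[int],
    cudagraph_capture_sizes: Optional[list[int]] = None,
) -> list[int]:
    """Return the batch sizes to capture, following the documented rules."""
    # An explicit capture list overrides the sizing strategy.
    if cudagraph_capture_sizes is not None:
        return list(cudagraph_capture_sizes)
    # An empty strategy defaults to [512].
    sizes = list(cuda_graph_sizes) or [512]
    if len(sizes) == 1:
        # One value N expands to [1, 2, 4, 8] + range(16, N + 1, 16).
        upper = sizes[0]
        return [1, 2, 4, 8] + list(range(16, upper + 1, 16))
    # Multiple values are used as-is.
    return sizes
```

With this reading, `resolve_capture_sizes([64])` yields `[1, 2, 4, 8, 16, 32, 48, 64]`.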

2.3 CUDA Graph Mode (`CUDAGraphMode`)

Mode	Value	Description
`NONE`	`0`	No CUDA graph capture
`PIECEWISE`	`1`	Piecewise CUDA graphs -- attention ops stay outside the graph for flexibility (default at level 3)
`FULL`	`2`	Full CUDA graph capture for all batches; best for small models / short prompts
`FULL_DECODE_ONLY`	`(FULL, NONE)`	Full CUDA graphs for decode batches only; mixed prefill-decode runs without graphs (useful in P/D setups)
`FULL_AND_PIECEWISE`	`(FULL, PIECEWISE)`	Full graphs for decode, piecewise for prefill/mixed -- most performant mode for most models

Helper methods on CUDAGraphMode:

decode_mode() -- returns the mode used for pure decode batches.
mixed_mode() -- returns the mode used for mixed prefill-decode batches.
requires_piecewise_compilation() -- whether the mode needs piecewise compilation.
has_full_cudagraphs() -- whether the mode includes full CUDA graph capture.
separate_routine() -- whether decode and mixed batches use different routines.
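
To make the helper semantics concrete, here is a minimal stand-in enum that mirrors the value table and the method descriptions above. It is a sketch under the assumption that composite modes are stored as `(decode, mixed)` tuples, as the table suggests; `GraphModeSketch` is a hypothetical name and this is not the class from `atom/config.py`.

```python
import enum


class GraphModeSketch(enum.Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    # Composite modes: (mode for pure decode, mode for mixed prefill-decode).
    FULL_DECODE_ONLY = (FULL, NONE)
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)

    def separate_routine(self) -> bool:
        """True when decode and mixed batches use different modes."""
        return isinstance(self.value, tuple)

    def decode_mode(self) -> "GraphModeSketch":
        return GraphModeSketch(self.value[0]) if self.separate_routine() else self

    def mixed_mode(self) -> "GraphModeSketch":
        return GraphModeSketch(self.value[1]) if self.separate_routine() else self

    def requires_piecewise_compilation(self) -> bool:
        return GraphModeSketch.PIECEWISE in (self.decode_mode(), self.mixed_mode())

    def has_full_cudagraphs(self) -> bool:
        return GraphModeSketch.FULL in (self.decode_mode(), self.mixed_mode())
```

Under these assumptions, `GraphModeSketch.FULL_AND_PIECEWISE.decode_mode()` is `FULL` and its `mixed_mode()` is `PIECEWISE`, so it both has full graphs and requires piecewise compilation.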

3. Quantization Configuration (`QuantizationConfig` & `LayerQuantConfig`)

Defined in atom/config.py. The quantization system uses two classes:

QuantizationConfig -- the top-level orchestrator that holds a global config, per-layer overrides, and exclusion lists. It is not a dict subclass.
LayerQuantConfig(dict) -- a dict subclass that stores the concrete quantization parameters for a single layer (or as the global default).

3.1 `LayerQuantConfig` Fields

LayerQuantConfig extends dict. Fields are stored and accessed as dictionary keys (e.g., cfg["quant_type"]).

Key	Type	Default	Description
`quant_type`	`QuantType`	`QuantType.No`	Quantization granularity (see below)
`quant_dtype`	`torch.dtype`	`torch.bfloat16`	Data type for quantized weights
`is_dynamic`	`bool`	`True`	Use dynamic quantization (scales computed at runtime)
`quant_method`	`str`	`""`	Quantization method (e.g., `"quark"`, `"compressed-tensors"`)

3.2 `QuantizationConfig` Attributes

Attribute	Type	Description
`torch_dtype`	`torch.dtype`	The model's default dtype (from `hf_config.torch_dtype`)
`hf_quant_config`	`dict | None`	Raw `quantization_config` dict from HuggingFace config
`global_quant_config`	`LayerQuantConfig`	Default quantization config applied to all layers
`layer_quant_config`	`dict[str, LayerQuantConfig]`	Per-layer overrides keyed by layer name pattern (supports fnmatch globs like `"*.mlp.*"`)
`exclude_layers`	`list[str]`	Layer names excluded from quantization (supports exact match and `"re:"` regex prefix)
`quant_method`	`str`	Top-level quantization method name (e.g., `"quark"`, `"compressed-tensors"`)

Key methods:

Method	Description
`get_name()`	Returns the quantization method name
`get_layer_quant_config(layer_name)`	Returns the `LayerQuantConfig` for a layer: checks exclusions first, then per-layer overrides, then falls back to global config
`should_ignore_layer_quant(layer_name)`	Returns `True` if the layer is in the exclusion list
`remap_layer_name(hf_config, packed_modules_mapping)`	Remaps layer names for packed/fused modules (e.g., `q_a_proj` → `fused_qkv_a_proj` for DeepSeek)
`compute_hash()`	Returns a SHA-256 hash of the quantization config for cache invalidation
`parse_quark_config_dict(config)`	Parses a quark-format config dict into a `LayerQuantConfig`
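
The lookup order described for `get_layer_quant_config` (exclusions, then per-layer overrides, then the global config) can be sketched with plain dicts. Everything below is illustrative: `lookup_layer_config`, `is_excluded`, and the `NO_QUANT` placeholder are hypothetical, quant types are written as strings instead of AITER `QuantType` members, and the use of `re.search` for `"re:"` entries is an assumption about the matching semantics.

```python
import fnmatch
import re

# Assumed placeholder for "this layer is not quantized".
NO_QUANT = {"quant_type": "No"}


def is_excluded(layer_name: str, exclude_layers: list[str]) -> bool:
    """Exact-name match, or regex match for entries prefixed with 're:'."""
    for entry in exclude_layers:
        if entry.startswith("re:"):
            if re.search(entry[3:], layer_name):
                return True
        elif entry == layer_name:
            return True
    return False


def lookup_layer_config(
    layer_name: str,
    global_cfg: dict,
    layer_cfgs: dict[str, dict],
    exclude_layers: list[str],
) -> dict:
    # 1. Exclusions win over everything else.
    if is_excluded(layer_name, exclude_layers):
        return NO_QUANT
    # 2. The first matching glob pattern supplies a per-layer override.
    for pattern, cfg in layer_cfgs.items():
        if fnmatch.fnmatchcase(layer_name, pattern):
            return cfg
    # 3. Otherwise fall back to the global config.
    return global_cfg
```

A layer that matches both an exclusion and an override pattern is treated as excluded, because the exclusion check runs first.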

3.3 `QuantType` Values (from AITER)

Value	Description
`QuantType.No`	No quantization
`QuantType.per_Token`	Per-token / per-channel quantization
`QuantType.per_1x128`	Block quantization with group size 128
`QuantType.per_1x32`	Block quantization with group size 32
`QuantType.per_128x128`	Large 2D block quantization (remapped to `per_1x128` in MoE kernels)
`QuantType.per_Tensor`	Per-tensor quantization

3.4 Supported Quantization Dtypes

Dtype	AITER Key	Notes
FP8 (E4M3)	`"fp8"`	8-bit floating point
MXFP4	`"fp4x2"`	Microscaling FP4; forces `QuantType.per_1x32`
INT8	`"i8"`	8-bit integer
INT4	`"i4x2"`	4-bit integer (packed)

3.5 Auto-Detection from HuggingFace

During Config.__post_init__, ATOM constructs QuantizationConfig(hf_config) which reads hf_config.quantization_config and automatically determines quantization parameters:

For quark models (quant_method == "quark"):

Parses global_quant_config dict via parse_quark_config_dict() to produce the global LayerQuantConfig.
Parses each entry in layer_quant_config dict to produce per-layer overrides.
Reads the "exclude" list for excluded layers.
Within each config dict, weight.qscheme determines quant_type ("per_channel" → per_Token, "per_tensor" → per_Tensor, "per_group" → per_1x32), and weight.dtype determines quant_dtype.
input_tensors.is_dynamic controls dynamic quantization (defaults to True if absent).
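
The quark mapping rules above can be summarised in a few lines. This is a sketch only: `parse_quark_like` is a hypothetical stand-in for `parse_quark_config_dict()`, the nested `weight` / `input_tensors` layout is inferred from the field names in the list, quant types are returned as strings, and the dtype is passed through untouched instead of being converted to a `torch.dtype`.

```python
QSCHEME_TO_QUANT_TYPE = {
    "per_channel": "per_Token",
    "per_tensor": "per_Tensor",
    "per_group": "per_1x32",
}


def parse_quark_like(config: dict) -> dict:
    """Translate one quark-style config dict into layer quant parameters."""
    weight = config.get("weight") or {}
    inputs = config.get("input_tensors") or {}
    return {
        # weight.qscheme selects the quantization granularity.
        "quant_type": QSCHEME_TO_QUANT_TYPE[weight["qscheme"]],
        # weight.dtype selects the quantized data type (left as a raw string here).
        "quant_dtype": weight.get("dtype"),
        # Dynamic quantization is the default when the flag is absent.
        "is_dynamic": inputs.get("is_dynamic", True),
        "quant_method": "quark",
    }
```

For instance, a config whose `weight.qscheme` is `"per_group"` and that has no `input_tensors` entry maps to `per_1x32` with `is_dynamic` set to `True`.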

For other models (compressed-tensors, etc.):

If quant_method == "compressed-tensors" or channel quantization is detected, sets per_Token.
If weight_block_size or group_size is found: group size 128 maps to per_1x128, group size 32 maps to per_1x32.
Otherwise falls back to per_Tensor.
The dtype is parsed from fields such as dtype, weight_dtype, or quant_method by looking for fp8, fp4, mxfp4, int8, int4, or a num_bits value.
If activation_scheme is "static", is_dynamic is set to False.
Excluded layers are read from the "ignore" key.

3.6 Layer-Level Quantization Dispatch

Linear layers, MoE layers, and fused ops call quant_config.get_layer_quant_config(prefix) to obtain the appropriate LayerQuantConfig for their position in the model. This enables mixed-precision quantization where different layers can have different quant types and dtypes (e.g., FP8 for attention, FP4 for MLP).

4. Parallel Configuration (`ParallelConfig`)

Defined in atom/config.py. Controls data parallelism. Environment variables (Section 8) override defaults when set.

Field	Type	Default	Description
`data_parallel_size`	`int`	`1`	Number of data-parallel groups; overridden by `ATOM_DP_SIZE` env var
`data_parallel_size_local`	`int`	`1`	Number of local data-parallel groups
`data_parallel_rank`	`int`	`0`	Rank within the data-parallel group; overridden by `ATOM_DP_RANK`
`data_parallel_rank_local`	`Optional[int]`	`None`	Local rank within the data-parallel group (SPMD mode); overridden by `ATOM_DP_RANK_LOCAL`
`data_parallel_master_port`	`int`	`29500`	Port used by the data-parallel master for process group initialization
`data_parallel_base_port`	`int`	`get_open_port()`	Base port for data-parallel communication (dynamically assigned)
`data_parallel_master_ip`	`str`	`"127.0.0.1"`	IP address of the data-parallel master

Computed properties:

world_size -- set during init, equals TP x PP.
world_size_across_dp -- world_size * data_parallel_size.
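
As a worked example of the environment overrides and the `world_size_across_dp` arithmetic, consider the sketch below. The helper names are hypothetical, and only the two overrides named in the table (`ATOM_DP_SIZE`, `ATOM_DP_RANK`) are modelled; how `ParallelConfig` actually reads them is not shown in this guide.

```python
from typing import Mapping


def dp_settings(env: Mapping[str, str], size: int = 1, rank: int = 0) -> dict:
    """Apply the documented env-var overrides on top of the given defaults."""
    return {
        "data_parallel_size": int(env.get("ATOM_DP_SIZE", size)),
        "data_parallel_rank": int(env.get("ATOM_DP_RANK", rank)),
    }


def world_size_across_dp(world_size: int, data_parallel_size: int) -> int:
    """Total number of ranks across all data-parallel groups."""
    return world_size * data_parallel_size
```

With `tensor_parallel_size=8` (so `world_size` is 8) and `ATOM_DP_SIZE=4`, `world_size_across_dp(8, 4)` gives 32 ranks in total. In a real process, `dp_settings(os.environ)` would read the live environment.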

5. Speculative Decoding Configuration (`SpeculativeConfig`)

Defined in atom/config.py. Currently only the Multi-Token Prediction (MTP) method with num_speculative_tokens=1 is supported.

Field	Type	Default	Description
`method`	`Optional[str]`	`""`	Speculative decoding method; currently only `"mtp"` is accepted
`model`	`Optional[str]`	`None`	Draft model name or path (typically the same as the target model for MTP)
`num_speculative_tokens`	`Optional[int]`	`None`	Number of speculative tokens per iteration; must be `1`
`draft_model_hf_config`	`Optional[PretrainedConfig]`	`None`	HuggingFace config for the draft model; auto-loaded from `model` when `None`

Post-init behaviour:

Loads draft_model_hf_config from model if not provided.
For DeepSeek V3 / MTP models: overrides model_type to "deepseek_mtp", sets n_predict=1 and num_nextn_predict_layers=1, and switches architectures to ["DeepSeekMTPModel"].
Config.__post_init__ raises ValueError if num_speculative_tokens != 1.
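
The current restrictions amount to two checks, sketched below. `check_speculative` is a hypothetical helper; the guide only states that `Config.__post_init__` raises `ValueError` when `num_speculative_tokens != 1`, so raising `ValueError` for an unsupported method is an assumption.

```python
from typing import Optional


def check_speculative(method: Optional[str], num_speculative_tokens: Optional[int]) -> None:
    """Validate speculative decoding settings against the documented limits."""
    if not method:
        return  # speculative decoding is disabled
    if method != "mtp":
        raise ValueError(f"unsupported speculative method: {method!r}")
    if num_speculative_tokens != 1:
        raise ValueError("num_speculative_tokens must be 1")
```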

6. Sampling Parameters (`SamplingParams`)

Defined in atom/sampling_params.py. Passed per-request to control generation.

Field	Type	Default	Description
`temperature`	`float`	`1.0`	Sampling temperature; lower values make output more deterministic
`max_tokens`	`int`	`64`	Maximum number of tokens to generate
`ignore_eos`	`bool`	`False`	Continue generating past the EOS token
`stop_strings`	`Optional[list[str]]`	`None`	List of strings that trigger generation to stop
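
The table translates directly into a small dataclass. The stand-in below mirrors the documented field names and defaults so that overrides can be illustrated without importing ATOM; `SamplingParamsSketch` is a hypothetical name, not the class in `atom/sampling_params.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplingParamsSketch:
    temperature: float = 1.0
    max_tokens: int = 64
    ignore_eos: bool = False
    stop_strings: Optional[list[str]] = None


# Defaults as documented.
default_params = SamplingParamsSketch()

# A per-request override: lower temperature, longer output, one stop string.
custom_params = SamplingParamsSketch(
    temperature=0.2, max_tokens=256, stop_strings=["\n\n"]
)
```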

7. CLI Arguments (`EngineArgs`)

Defined in atom/model_engine/arg_utils.py. The EngineArgs dataclass exposes all flags via add_cli_args() and converts them into a Config via create_engine().

Flag	Short	Type	Default	Description
`--model`		`str`	`"Qwen/Qwen3-0.6B"`	Model name or path
`--trust-remote-code`		flag	`False`	Trust remote code when loading model
`--tensor-parallel-size`	`-tp`	`int`	`1`	Tensor parallel size
`--data-parallel-size`	`-dp`	`int`	`1`	Data parallel size
`--enforce-eager`		flag	`False`	Enforce eager mode execution
`--enable_prefix_caching`		flag	`False`	Enable prefix caching
`--port`		`int`	`8006`	Engine internal port
`--kv_cache_dtype`		`str`	`"bf16"`	KV cache dtype; choices: `bf16`, `fp8`
`--block-size`		`int`	`16`	KV cache block size (maps to `kv_cache_block_size`)
`--max-model-len`		`int`	`None`	Maximum model context length; defaults to `hf_config.max_position_embeddings`
`--cudagraph-capture-sizes`		`str`	`"[1,2,4,8,16,32,48,64,128,256]"`	CUDA graph capture sizes as a Python list string
`--level`		`int`	`3`	Compilation level (0 -- 3)
`--load_dummy`		flag	`False`	Skip loading model weights
`--enable-expert-parallel`		flag	`False`	Enable Expert Parallelism (EP MoE)
`--torch-profiler-dir`		`str`	`None`	Directory for torch profiler traces
`--enable-dp-attention`		flag	`False`	Enable DP attention
`--method`		`str`	`None`	Speculative method; choices: `mtp`
`--num-speculative-tokens`		`int`	`1`	Number of speculative tokens per iteration
`--max-num-batched-tokens`		`int`	`16384`	Maximum number of tokens to batch in the async engine
`--max-num-seqs`		`int`	`512`	Maximum number of sequences to batch together
`--gpu-memory-utilization`		`float`	`0.9`	Fraction of GPU memory to use (0.0 -- 1.0)
`--scheduler-delay-factor`		`float`	`0.0`	Delay factor multiplied by previous prompt latency before scheduling next prompt

Example:

python -m atom.entrypoint \
    --model deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 8 \
    --level 3 \
    --cudagraph-capture-sizes "[1,2,4,8,16,32,64,128,256]" \
    --kv_cache_dtype fp8 \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 256
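
Because `--cudagraph-capture-sizes` is passed as a Python list string, it has to be converted into a list of integers at some point. The following is a minimal sketch of one way to do that with `argparse` and `ast.literal_eval`; `parse_int_list` is a hypothetical helper, and `EngineArgs` may perform the conversion differently or at a later stage.

```python
import argparse
import ast


def parse_int_list(text: str) -> list[int]:
    """Parse a string such as "[1,2,4,8]" into a list of ints."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError) as exc:
        raise argparse.ArgumentTypeError(f"not a Python literal: {text!r}") from exc
    if not isinstance(value, list) or not all(isinstance(v, int) for v in value):
        raise argparse.ArgumentTypeError(f"expected a list of ints, got {text!r}")
    return value


parser = argparse.ArgumentParser()
parser.add_argument(
    "--cudagraph-capture-sizes",
    type=parse_int_list,
    default=[1, 2, 4, 8, 16, 32, 48, 64, 128, 256],
)
parser.add_argument("--level", type=int, default=3, choices=range(4))

# Parse an explicit argument list instead of sys.argv for demonstration.
args = parser.parse_args(["--cudagraph-capture-sizes", "[1,2,4,8,16]", "--level", "3"])
```

Using `ast.literal_eval` instead of `eval` keeps the parser from executing arbitrary code supplied on the command line.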

8. Environment Variables

8.1 Variables Registered in `atom/utils/envs.py`

All variables use lazy evaluation. Boolean variables treat "1" as True and any other set value as False; when a variable is unset, the default listed in the table applies.

Variable	Type	Default	Description
`ATOM_DP_RANK`	`int`	`0`	Data-parallel rank of this process
`ATOM_DP_RANK_LOCAL`	`int`	`0`	Local data-parallel rank (for SPMD mode)
`ATOM_DP_SIZE`	`int`	`1`	Total number of data-parallel groups
`ATOM_DP_MASTER_IP`	`str`	`"127.0.0.1"`	IP address of the data-parallel master
`ATOM_DP_MASTER_PORT`	`int`	`29500`	Port of the data-parallel master
~~`ATOM_ENFORCE_EAGER`~~			Removed. Use CLI flag `--enforce-eager` instead.
`ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION`	`bool`	`False`	Enable QK-norm + RoPE + cache + quant fusion; enable for Qwen3-MoE models
`ATOM_USE_TRITON_GEMM`	`bool`	`False`	Use Triton-based GEMM kernels instead of default backends
`ATOM_USE_TRITON_MXFP4_BMM`	`bool`	`False`	Use Triton-based MXFP4 batched matrix multiply
`ATOM_ENABLE_DS_INPUT_RMSNORM_QUANT_FUSION`	`bool`	`True`	Enable fused input RMSNorm + quantization for DeepSeek models
`ATOM_ENABLE_DS_QKNORM_QUANT_FUSION`	`bool`	`True`	Enable fused QK-norm + quantization for DeepSeek models
`ATOM_ENABLE_ALLREDUCE_RMSNORM_FUSION`	`bool`	`True`	Enable fused all-reduce + RMSNorm kernel
`ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_RMSNORM_QUANT`	`bool`	`True`	Enable AITER Triton fused RMSNorm + quantization for LLaMA models
`ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_SILU_MUL_QUANT`	`bool`	`True`	Enable AITER Triton fused SiLU + multiply + quantization for LLaMA models
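
Lazy evaluation here means that a variable is read from the environment when it is accessed, not when the module is imported. A common way to get that behaviour is a table of getter callables, sketched below. This is an assumed pattern and not a copy of `atom/utils/envs.py`; `_ENV_GETTERS` and `get_env` are hypothetical names, and the sketch assumes that default-`True` booleans fall back to `True` when unset.

```python
import os
from typing import Any, Callable

# Each entry is evaluated only when requested, so changes made to
# os.environ after import are still picked up.
_ENV_GETTERS: dict[str, Callable[[], Any]] = {
    "ATOM_DP_SIZE": lambda: int(os.getenv("ATOM_DP_SIZE", "1")),
    "ATOM_USE_TRITON_GEMM": lambda: os.getenv("ATOM_USE_TRITON_GEMM", "0") == "1",
    "ATOM_ENABLE_ALLREDUCE_RMSNORM_FUSION": lambda: os.getenv(
        "ATOM_ENABLE_ALLREDUCE_RMSNORM_FUSION", "1"
    )
    == "1",
}


def get_env(name: str) -> Any:
    """Evaluate the named variable against the current environment."""
    return _ENV_GETTERS[name]()
```

With this pattern, setting `ATOM_USE_TRITON_GEMM=1` in the process environment before the first call to `get_env("ATOM_USE_TRITON_GEMM")` is enough for the call to return `True`.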

8.2 Additional Environment Variables (Used Outside `envs.py`)

Variable	Type	Default	Where Used	Description
`ATOM_TORCH_PROFILER_DIR`	`str`	`None`	`atom/config.py` (`Config.torch_profiler_dir`)	Directory for PyTorch profiler output; sets the default for `Config.torch_profiler_dir`
`ATOM_PROFILER_MORE`	`str`	`"0"`	`atom/model_engine/model_runner.py`	Set to `"1"` to enable detailed profiling (`record_shapes`, `with_stack`, `profile_memory`)
`HF_TOKEN`	`str`	`None`	`atom/config.py` (`get_hf_config`)	HuggingFace authentication token for gated model downloads
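
The effect of `ATOM_PROFILER_MORE` can be pictured as toggling three keyword arguments of `torch.profiler.profile`. The helper below is a sketch with a hypothetical name (`profiler_kwargs`); how `model_runner.py` actually wires the flag into the profiler is an assumption.

```python
import os
from typing import Mapping


def profiler_kwargs(env: Mapping[str, str] = os.environ) -> dict:
    """Build the detailed-profiling options from ATOM_PROFILER_MORE."""
    detailed = env.get("ATOM_PROFILER_MORE", "0") == "1"
    return {
        "record_shapes": detailed,
        "with_stack": detailed,
        "profile_memory": detailed,
    }
```

The resulting dict could be passed on as `torch.profiler.profile(**profiler_kwargs())`, since all three keys are existing parameters of that class.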

9. Decision Tree -- Choosing a Compilation Level

Start
  |
  v
Is this a debugging / development run?
  |-- Yes --> Level 0 (NO_COMPILATION) or --enforce-eager
  |
  v
Do you need torch.compile but no graph splitting?
  |-- Yes, one-shot compile --> Level 2 (DYNAMO_ONCE)
  |-- Yes, keep Dynamo default --> Level 1 (DYNAMO_AS_IS)
  |
  v
Production inference on ROCm/HIP GPU?
  |-- Yes --> Level 3 (PIECEWISE) [default in EngineArgs]
              - Auto-sets CUDAGraphMode.PIECEWISE
              - Auto-populates splitting_ops for attention ops
              - Pair with --cudagraph-capture-sizes for your batch profile
  |
  v
Need maximum decode throughput?
  |-- Yes --> Level 3 + set cudagraph_mode to FULL_AND_PIECEWISE
              (full graphs for decode, piecewise for prefill)

Rules of thumb:

Level 3 is the default for EngineArgs and is recommended for most production workloads.
Level 0 / --enforce-eager is useful for debugging, profiling, or when CUDA graphs are incompatible with your model.
Match --cudagraph-capture-sizes to your expected batch sizes for optimal memory usage and launch latency.
When using --enable-dp-attention or Expert Parallelism (--enable-expert-parallel), level 3 is still recommended.

Source Files

File	Description
`atom/config.py`	`Config`, `CompilationConfig`, `CompilationLevel`, `CUDAGraphMode`, `LayerQuantConfig`, `QuantizationConfig`, `ParallelConfig`, `SpeculativeConfig`, `KVCacheTensor`, `KVCacheConfig`, `get_hf_config`
`atom/utils/envs.py`	All `ATOM_*` environment variable definitions with lazy evaluation
`atom/model_engine/arg_utils.py`	`EngineArgs` dataclass and CLI argument parser
`atom/sampling_params.py`	`SamplingParams` dataclass
`atom/model_engine/model_runner.py`	Uses `ATOM_PROFILER_MORE` and `ATOM_TORCH_PROFILER_DIR` for profiling
