Add ATOM_CK_FREE=1 switch for CK-free e2e inference #20
Merged
Single env var that auto-enables all non-CK paths:
- envs.py: Add ATOM_CK_FREE environment variable
- moe.py: Force Triton/FlyDSL MOE when ATOM_CK_FREE=1
- attention_mha.py: Force Triton attention when ATOM_CK_FREE=1
- attention_mla.py: Force Triton MLA decode when ATOM_CK_FREE=1

MLA prefill cache ops (concat_and_cache_mla, fused_qk_rope_concat_and_cache_mla) are handled by AITER-side fallbacks (PyTorch/Triton) that activate automatically when the CK module_cache JIT build fails.
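As a rough illustration of how one switch can fan out to several backend choices, here is a minimal sketch. The helper names (`ck_free_enabled`, `select_backends`), the backend labels, and the exact parsing rule (only the string "1" enables the switch) are assumptions made for illustration; the real logic lives in envs.py, moe.py, attention_mha.py, and attention_mla.py and may differ.

```python
import os


def ck_free_enabled(environ=None):
    """Return True when ATOM_CK_FREE is set to "1".

    Assumption: only the literal string "1" enables the switch.
    """
    env = os.environ if environ is None else environ
    return env.get("ATOM_CK_FREE", "0") == "1"


def select_backends(environ=None):
    """Map the single switch onto per-component backend choices.

    The labels below are hypothetical placeholders, not ATOM identifiers.
    """
    ck_free = ck_free_enabled(environ)
    return {
        "moe": "triton_or_flydsl" if ck_free else "default",
        "mha": "triton" if ck_free else "default",
        "mla_decode": "triton" if ck_free else "default",
    }
```

Passing a plain dict instead of relying on `os.environ` keeps the helper easy to exercise without touching the process environment, for example `select_backends({"ATOM_CK_FREE": "1"})`.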
Tests env var detection, MOE routing, MHA routing, and MLA routing conditions without requiring GPU or model weights. 11 tests total.
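A GPU-free test of a routing condition could look roughly like the sketch below. It is not taken from the PR's test file: `use_triton_mha` is a hypothetical stand-in for the condition in attention_mha.py, and the use of `unittest` with `mock.patch.dict` is an assumption about how such tests might be written.

```python
import os
import unittest
from unittest import mock


def use_triton_mha():
    # Hypothetical stand-in for the routing condition in attention_mha.py.
    return os.environ.get("ATOM_CK_FREE", "0") == "1"


class CkFreeRoutingTest(unittest.TestCase):
    def test_switch_enabled(self):
        # Temporarily set the variable; the original environment is restored on exit.
        with mock.patch.dict(os.environ, {"ATOM_CK_FREE": "1"}):
            self.assertTrue(use_triton_mha())

    def test_switch_unset(self):
        # clear=True empties the environment for the duration of the block.
        with mock.patch.dict(os.environ, {}, clear=True):
            self.assertFalse(use_triton_mha())


def run_suite():
    """Run the two tests above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CkFreeRoutingTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the tests only read an environment variable, they need neither a GPU nor model weights.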
Previously use_triton_attn controlled both the cache update strategy and the paged attention backend. In CK-free builds this forced Triton PA even though ASM PA is CK-free and faster for decode.

Now: always use Triton fused rope+cache (fast, no module_cache JIT) and independently select ASM PA for decode when head_dim=128 and no sliding window.

For fp8 KV cache, fill per-token scale buffers with the uniform per-tensor scale so ASM PA can dequant correctly. Move kv_scale to CUDA at init for graph capture compatibility.

Benchmark (Llama-3.1-8B, 1k/1k, con64):
  v7 Triton PA bf16 KV: 6,255 tok/s
  v9 ASM PA bf16 KV: 6,712 tok/s (+7.3%)
  v9 ASM PA fp8 KV: 6,830 tok/s (+9.2%)
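The decoupled selection and the scale fill described above can be sketched in plain Python. All function names here are hypothetical, representing "no sliding window" as `None` is an assumption, and a Python list stands in for the per-token scale buffer; the actual implementation operates on GPU tensors.

```python
def select_paged_attention(is_decode, head_dim, sliding_window):
    """Pick the paged attention backend independently of the cache update path.

    Assumption: sliding_window is None when no sliding window is configured.
    """
    if is_decode and head_dim == 128 and sliding_window is None:
        return "asm"
    return "triton"


def fill_per_token_scales(num_tokens, per_tensor_scale):
    """Expand one per-tensor scale into a per-token buffer of identical values."""
    return [per_tensor_scale] * num_tokens


def speedup_percent(baseline_tok_s, candidate_tok_s):
    """Relative throughput gain over the baseline, in percent, to one decimal."""
    return round((candidate_tok_s / baseline_tok_s - 1.0) * 100.0, 1)
```

With the benchmark figures above, `speedup_percent(6255, 6712)` gives 7.3 and `speedup_percent(6255, 6830)` gives 9.2, matching the reported gains over the v7 Triton PA baseline.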
Summary
- ATOM_CK_FREE=1 environment variable: a single switch that auto-enables all non-CK paths for e2e inference
- (... fused_qk_norm_rope_cache_quant_shuffle)

Dependencies
Test plan
- ENABLE_CK=0 pip install -e .
- Set ATOM_CK_FREE=1 and run DeepSeek inference (MLA path)
- Set ATOM_CK_FREE=1 and run Llama inference (MHA path)