
Add ATOM_CK_FREE=1 switch for CK-free e2e inference#20

Merged
sunway513 merged 3 commits into main from feat/ck-free-mode
Feb 23, 2026

Conversation


@sunway513 sunway513 commented Feb 23, 2026

Summary

  • Add ATOM_CK_FREE=1 environment variable — a single switch that auto-enables all non-CK paths for e2e inference
  • MOE: force Triton/FlyDSL backends when CK-free
  • MHA attention: force Triton attention path (bypasses fused_qk_norm_rope_cache_quant_shuffle)
  • MLA decode: force Triton MLA decode path
  • MLA prefill cache ops handled by AITER-side fallbacks (see Add CK-free fallbacks for cache ops and RoPE auto-detection aiter#27)
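The single-switch idea above can be sketched as a small helper; the function name below is illustrative, not atom's actual envs.py API:

```python
import os

def ck_free_enabled() -> bool:
    """Return True when the ATOM_CK_FREE=1 switch is set in the environment.

    Hypothetical sketch: the real envs.py may cache or validate the value
    differently; only the env var name comes from the PR.
    """
    return os.environ.get("ATOM_CK_FREE", "0") == "1"
```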

Dependencies

Test plan

  • Build AITER CK-free: ENABLE_CK=0 pip install -e .
  • Set ATOM_CK_FREE=1 and run DeepSeek inference (MLA path)
  • Set ATOM_CK_FREE=1 and run Llama inference (MHA path)
  • Verify all ops route to non-CK backends via log messages

Single env var that auto-enables all non-CK paths:
- envs.py: Add ATOM_CK_FREE environment variable
- moe.py: Force Triton/FlyDSL MOE when ATOM_CK_FREE=1
- attention_mha.py: Force Triton attention when ATOM_CK_FREE=1
- attention_mla.py: Force Triton MLA decode when ATOM_CK_FREE=1
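The three routing changes above share one pattern: when ATOM_CK_FREE=1, any CK-preferring op family falls through to a non-CK backend. A minimal sketch of that pattern, assuming hypothetical function and backend names (only the env var and backend families are from the PR):

```python
import os

def select_backend(op_family: str, preferred: str) -> str:
    """Route an op to a non-CK backend when ATOM_CK_FREE=1 (sketch).

    op_family: one of "moe", "mha", "mla_decode" (illustrative keys).
    preferred: the backend the normal heuristics would pick.
    """
    if os.environ.get("ATOM_CK_FREE", "0") != "1":
        return preferred
    # Assumed CK-free fallbacks per op family; the PR forces Triton
    # (or FlyDSL for MOE) on each of these paths.
    ck_free_fallbacks = {"moe": "triton", "mha": "triton", "mla_decode": "triton"}
    return ck_free_fallbacks.get(op_family, preferred)
```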

MLA prefill cache ops (concat_and_cache_mla, fused_qk_rope_concat_and_cache_mla)
are handled by AITER-side fallbacks (PyTorch/Triton) that activate automatically
when the CK module_cache JIT build fails.
Tests env var detection, MOE routing, MHA routing, and MLA routing
conditions without requiring a GPU or model weights. 11 tests total.
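The AITER-side fallback described above follows a try-the-JIT-build, fall-back-on-failure shape. A self-contained sketch with stand-in functions (none of these names are AITER's real internals):

```python
def _build_ck_module(name: str):
    """Stand-in for the CK module_cache JIT build; raises in an
    ENABLE_CK=0 install, as described in the PR."""
    raise RuntimeError(f"CK disabled: cannot JIT-build {name}")

def _torch_fallback(x):
    """Stand-in for the PyTorch/Triton fallback implementation."""
    return x

def concat_and_cache_mla(x):
    """Dispatch sketch: prefer the CK kernel, activate the fallback
    automatically when the CK JIT build fails."""
    try:
        op = _build_ck_module("module_cache")
    except RuntimeError:
        op = _torch_fallback
    return op(x)
```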
Comment on lines +10 to +11
import importlib


⚠️ [ruff] <F401> reported by reviewdog 🐶
importlib imported but unused

Suggested change: delete the unused import importlib line.

Previously use_triton_attn controlled both the cache update strategy and
the paged attention backend. In CK-free builds this forced Triton PA
even though ASM PA is CK-free and faster for decode.

Now: always use Triton fused rope+cache (fast, no module_cache JIT) and
independently select ASM PA for decode when head_dim=128 and no sliding
window. For fp8 KV cache, fill per-token scale buffers with the uniform
per-tensor scale so ASM PA can dequant correctly. Move kv_scale to CUDA
at init for graph capture compatibility.

Benchmark (Llama-3.1-8B, 1k/1k, con64):
  v7 Triton PA bf16 KV: 6,255 tok/s
  v9 ASM PA bf16 KV:    6,712 tok/s (+7.3%)
  v9 ASM PA fp8 KV:     6,830 tok/s (+9.2%)
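The two decisions in the commit message above can be sketched as standalone helpers; the signatures are assumed for illustration (a plain list stands in for the per-token scale tensor buffer):

```python
def asm_pa_eligible(head_dim: int, sliding_window) -> bool:
    """ASM paged attention is selected for decode only when head_dim=128
    and there is no sliding window, per the commit message."""
    return head_dim == 128 and sliding_window is None

def fill_per_token_scales(num_tokens: int, per_tensor_scale: float) -> list:
    """For an fp8 KV cache, replicate the uniform per-tensor scale into a
    per-token buffer so ASM PA can dequantize correctly (sketch)."""
    return [per_tensor_scale] * num_tokens
```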
@sunway513 sunway513 merged commit cf596f1 into main Feb 23, 2026
6 of 7 checks passed