Add CPU backend #322
Draft
stephen-huan wants to merge 3 commits into jax-ml:main
(This is more of an issue/feature request than a PR, but since I have a working prototype I figured I'd share it.)
Triton now has a CPU backend from triton-cpu, which compiles LLIR to assembly using LLVM. This PR adds support for it by using `jax.pure_callback` to wrap calling Triton kernels from Python (generating an XLA custom call). A proper implementation would add an OpenMP CPU launcher to jaxlib's `gpu_triton.py` akin to `triton_kernels.cc` (unfortunately, the CPU backend doesn't seem to fit neatly into jaxlib's existing Triton abstractions; for example, cuda/rocm appear to be mutually exclusive since they overwrite the same names in `gpu_triton.py`, while cpu can coexist with gpu). I don't have enough familiarity with C++/jaxlib/XLA to make this change myself, hence the feature request.

The motivation for adding a CPU backend is that it's faster than `TRITON_INTERPRET=1` and allows `jax.jit`'ing Triton kernels like on GPU. In addition, it would possibly allow Pallas kernels to be run on CPU without `interpret=True`, which is generally very slow. Pure JAX code can be run on either CPU or GPU with no code modifications, and it'd be nice if this were also true for Triton/Pallas kernels (for debugging/prototyping, but also to run fast on CPU itself).

Known limitations of this PR:
- Since `jax.pure_callback` is used instead of a C++ XLA custom call, kernel launch overhead is relatively high
- It doesn't use `triton_kernel_call` or the MLIR lowering (`triton_kernel_call_lowering`), so the behavior is slightly different
- `zeroed_outputs` doesn't receive meta parameters from Triton configurations

Passes (and is definitely completely overfit to) all tests except for those that count the number of compilations (as it doesn't use the MLIR lowering path) and `test_autotune_with_heuristics`, since Triton evaluates the configuration multiple times.
test_autotune_with_heuristicssince Triton evaluates the configuration multiple times.The first two commits are unrelated fixes to the tests which can be merged, and I've opened #321 with them verbatim.