[Levanter] Add XLA-first Mamba-3 SISO and MIMO TPU kernels #3961
|
Claude finished @dlwh's task in 1m 45s
PR Review: Mamba-3 XLA TPU Kernels
Adds XLA-first SISO and real-valued MIMO Mamba-3 TPU kernels with reference oracles, attention-style APIs, hybrid mode dispatch, and a custom VJP for the SISO path. The implementation follows the spec in
Specification check: A spec exists at
Findings:
No bugs or correctness issues found. The test coverage is thorough: 30+ tests covering SISO/MIMO parity against sequential oracles, direct recurrences, upstream torch references, gradient matching, dtype preservation, JIT compilation smoke tests, attention-style API equivalence, and QK group mapping. |
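The parity-against-sequential-oracle pattern mentioned in the review can be illustrated with a minimal sketch. This is plain Python, not the PR's kernels or tests: the recurrence is a simplified scalar form `h_t = a_t * h_{t-1} + bx_t`, and the names `sequential_oracle`, `chunked_scan`, and `max_abs_diff` are hypothetical, chosen only for illustration.

```python
import random


def sequential_oracle(a, bx, h0=0.0):
    """Reference path: evaluate h_t = a_t * h_{t-1} + bx_t one step at a time."""
    h = h0
    out = []
    for a_t, bx_t in zip(a, bx):
        h = a_t * h + bx_t
        out.append(h)
    return out


def chunked_scan(a, bx, chunk_size, h0=0.0):
    """Same recurrence, evaluated chunk by chunk (simplified illustration).

    Inside a chunk, h_t = prefix_t * h_in + local_t, where prefix_t is the
    cumulative product of decays since the chunk start and local_t is the
    state the chunk would reach from a zero initial state.
    """
    out = []
    h_in = h0
    for start in range(0, len(a), chunk_size):
        prefix = 1.0
        local = 0.0
        for a_t, bx_t in zip(a[start:start + chunk_size],
                             bx[start:start + chunk_size]):
            prefix *= a_t
            local = a_t * local + bx_t
            out.append(prefix * h_in + local)
        # Carry the last state of this chunk into the next one.
        h_in = out[-1]
    return out


def max_abs_diff(xs, ys):
    """Largest elementwise absolute difference between two equal-length lists."""
    return max((abs(x - y) for x, y in zip(xs, ys)), default=0.0)


if __name__ == "__main__":
    random.seed(0)
    T = 37  # deliberately not a multiple of the chunk sizes below
    a = [random.uniform(0.5, 1.0) for _ in range(T)]
    bx = [random.uniform(-1.0, 1.0) for _ in range(T)]
    ref = sequential_oracle(a, bx)
    for chunk_size in (1, 4, 8, 64):
        diff = max_abs_diff(chunked_scan(a, bx, chunk_size), ref)
        print(f"chunk_size={chunk_size}: max abs diff = {diff:.3e}")
```

A parity test in this style asserts that the chunked result stays within a small tolerance of the oracle across several chunk sizes, including ones that do not divide the sequence length.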
|
🤖 Follow-up: the research logbook for this thread now lives on the dedicated research branch
Logbook:
Latest block-level TPU comparison from that logbook:
Representative
That strengthens the current shipping recommendation: the XLA Mamba-3 path is already training-competitive for long-context workloads, with SISO especially strong and MIMO becoming competitive once context is long enough. |
|
🤖 Follow-up on Claude's review: Landed one concrete cleanup in
I also kept the custom-VJP note in code:
The other two review items are not being changed in this PR:
Re-ran:
|
|
@claude review again please |
|
Claude finished @dlwh's task in 2m 12s
PR Re-Review: Mamba-3 XLA TPU Kernels
Full re-review of the current state of this PR (14 files, ~5350 lines). The duplicate validation from the first review has been fixed in
Architecture & Design
The layered decomposition is clean and well-structured:
The separation of concerns is excellent. Each layer has a clear responsibility, and the SSD → Mamba-3 layering avoids code duplication while keeping the MIMO rank extensions isolated.
Correctness
No bugs found. The mathematical structure is sound:
Test Coverage
Thorough, with 30+ tests covering:
Items Worth Noting (not blocking)
Verdict
Looks good to merge. The code is well-structured, mathematically correct, thoroughly tested, and follows the spec closely. The one actionable finding is the double bias application in the SISO attention-style path (item 3 above); this should be fixed before or shortly after merge to avoid a latent bug when non-zero biases are used. |
Add the production Mamba-3 TPU path with XLA-first SISO and real-valued MIMO kernels, hybrid mode defaults, and direct/upstream reference parity coverage. Keep Pallas out of the default path and ship the chunk defaults selected from the TPU sweep.
Part of #3868