Commit 2764147

Add examples of calling FlashInfer from JAX via jax-tvm-ffi (#3092)
## 📌 Description

This PR adds a new example under examples/jax_tvm_ffi/ showing how to call FlashInfer from JAX via jax-tvm-ffi. It also adds examples/README.md to document the examples directory and make the new example easier to discover.

The goal is to provide a minimal reference for users interested in integrating FlashInfer outside of PyTorch, especially in JAX-based workflows.

## 🔍 Related Issues

N/A

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

This PR only adds example code and documentation; there are no changes to core functionality, so no additional tests were added. The examples run successfully end-to-end.

## Summary by CodeRabbit

* **Documentation**
  * Added an Examples overview and detailed per-example guides covering setup, installation, GPU/CUDA prerequisites, compilation/caching behavior, Hugging Face gated-model steps, authentication flows, and troubleshooting for JAX↔TVM FFI workflows.
* **New Features**
  * Added runnable JAX↔TVM FFI examples (notebooks and standalone scripts) demonstrating fused activations/FFN, RoPE, and attention kernels, end-to-end Gemma 3 inference, correctness validations, and latency micro-benchmarks.
1 parent 7a5e604 commit 2764147

7 files changed

Lines changed: 2358 additions & 1 deletion

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -199,3 +199,5 @@ cython_debug/

 # Cursor
 .cursor/
+docs/tutorials/generated/
+docs/sg_execution_times.rst

docs/conf.py

Lines changed: 15 additions & 1 deletion
@@ -38,6 +38,7 @@
     "sphinx.ext.napoleon",
     "sphinx.ext.autosummary",
     "sphinx.ext.mathjax",
+    "sphinx_gallery.gen_gallery",
 ]

 autodoc_default_flags = ["members"]

@@ -47,11 +48,24 @@

 language = "en"

-exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
+exclude_patterns = [
+    "_build",
+    "Thumbs.db",
+    ".DS_Store",
+    "tutorials/jax_tvm_ffi/README.rst",
+]

 # The name of the Pygments (syntax highlighting) style to use.
 pygments_style = "sphinx"

+sphinx_gallery_conf = {
+    "examples_dirs": "tutorials/jax_tvm_ffi",
+    "gallery_dirs": "tutorials/generated/jax_tvm_ffi",
+    "filename_pattern": r".*\.py",
+    "plot_gallery": "False",
+    "download_all_examples": False,
+}
+
 # A list of ignored prefixes for module index sorting.
 # If true, `todo` and `todoList` produce output, else they produce nothing.
 todo_include_todos = False

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ FlashInfer is a library and kernel generator for Large Language Models that prov

    tutorials/recursive_attention
    tutorials/kv_layout
+   tutorials/generated/jax_tvm_ffi/index

 .. toctree::
    :maxdepth: 2

docs/requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -1,5 +1,6 @@
 furo == 2024.8.6
 sphinx == 8.1.3
+sphinx-gallery == 0.19.0
 sphinx-reredirects == 0.1.5
 sphinx-tabs == 3.4.5
 sphinx-toolbox == 3.8.1
docs/tutorials/jax_tvm_ffi/README.rst

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@

FlashInfer on JAX with TVM FFI
==============================

These tutorials show how to call FlashInfer GPU kernels from JAX through the
`jax-tvm-ffi <https://github.com/NVIDIA/jax-tvm-ffi>`_ bridge.

The Sphinx-Gallery ``.py`` files in this directory are the canonical source:

* ``flashinfer_jax_tvm_ffi.py`` explains the core build, register, and call
  pattern for FlashInfer kernels from JAX (sketched after this list).
* ``gemma3_flashinfer_jax.py`` applies the same pattern to Gemma 3 1B Instruct
  inference.
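
As a rough illustration, here is a minimal sketch of that build, register,
and call pattern. The names ``jit.get_silu_and_mul_module`` and
``jax_tvm_ffi.register_tvm_ffi_target`` are hypothetical placeholders, not
confirmed APIs; only ``jax.ffi.ffi_call`` and ``jax.ShapeDtypeStruct`` are
standard JAX. Consult ``flashinfer_jax_tvm_ffi.py`` for the real calls.

.. code-block:: python

   import jax
   import jax.numpy as jnp

   import jax_tvm_ffi          # the JAX <-> TVM FFI bridge
   from flashinfer import jit  # FlashInfer's JIT compilation entry point

   # 1. Build: JIT-compile a FlashInfer kernel into a TVM-FFI-compatible
   #    module. `get_silu_and_mul_module` is a placeholder accessor.
   module = jit.get_silu_and_mul_module()

   # 2. Register: expose the compiled kernel as a named JAX FFI target.
   #    `register_tvm_ffi_target` is a placeholder helper and signature.
   jax_tvm_ffi.register_tvm_ffi_target(
       "flashinfer_silu_and_mul", module.silu_and_mul, platform="CUDA"
   )

   # 3. Call: invoke the registered target from JAX; it composes with jax.jit.
   def silu_and_mul(x: jax.Array) -> jax.Array:
       # The fused kernel computes SiLU(gate) * up over a concatenated
       # last dimension, so the output halves the trailing dimension.
       out = jax.ShapeDtypeStruct((x.shape[0], x.shape[1] // 2), x.dtype)
       return jax.ffi.ffi_call("flashinfer_silu_and_mul", out)(x)

   x = jnp.ones((8, 256), dtype=jnp.bfloat16)
   y = jax.jit(silu_and_mul)(x)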

During the documentation build, Sphinx-Gallery renders these files into HTML
pages and creates downloadable Python and Jupyter notebook versions from the
same source files. Do not edit or commit the generated
``docs/tutorials/generated/jax_tvm_ffi/`` directory; it is produced by
Sphinx-Gallery.

The examples are not executed during the default documentation build because
they require an NVIDIA GPU, CUDA, FlashInfer JIT compilation, and, in the
Gemma 3 case, Hugging Face credentials for a gated model.

Execution requirements
----------------------

To run the tutorials directly, use a CUDA-capable environment with:

* NVIDIA GPU with SM 7.5 or newer.
* CUDA 12.6 or newer.
* Python 3.10 or newer.
* JAX with CUDA support.
* ``flashinfer-python`` and ``jax-tvm-ffi``.

The Gemma 3 tutorial additionally requires:

* ``torch`` CPU wheels for dtype literals used by FlashInfer's JIT API.
* ``safetensors``, ``huggingface_hub``, and ``transformers``.
* Hugging Face access to ``google/gemma-3-1b-it`` and an ``HF_TOKEN``.

For example:

.. code-block:: bash

   pip install 'jax[cuda13]'
   pip install -U flashinfer-python jax-tvm-ffi \
       --no-build-isolation \
       --extra-index-url https://flashinfer.ai/whl/cu130/

   # Additional dependencies for the Gemma 3 tutorial only:
   pip install torch --index-url https://download.pytorch.org/whl/cpu
   pip install safetensors huggingface_hub transformers
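
Before running a tutorial, it can help to confirm the environment is wired
up correctly. A small sanity check (assuming the distribution names match
the pip commands above):

.. code-block:: python

   import importlib.metadata as metadata

   import jax

   # A CUDA build of JAX should list at least one GPU device here.
   print(jax.devices())

   # Confirm the FlashInfer and bridge wheels are installed.
   for dist in ("flashinfer-python", "jax-tvm-ffi"):
       print(dist, metadata.version(dist))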

To build the documentation locally from the repository root:

.. code-block:: bash

   pip install -r docs/requirements.txt
   sphinx-build -b html docs docs/_build/html -j auto

To run a tutorial directly, execute its canonical source file:

.. code-block:: bash

   python docs/tutorials/jax_tvm_ffi/flashinfer_jax_tvm_ffi.py
   python docs/tutorials/jax_tvm_ffi/gemma3_flashinfer_jax.py
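
Because ``google/gemma-3-1b-it`` is gated, the Gemma 3 script assumes you
are already authenticated with Hugging Face. One minimal approach, assuming
the token is exported as ``HF_TOKEN`` (which ``huggingface_hub`` also reads
automatically):

.. code-block:: python

   import os

   from huggingface_hub import login

   # Assumes HF_TOKEN holds a token with access to google/gemma-3-1b-it.
   login(token=os.environ["HF_TOKEN"])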
