@@ -243,20 +243,6 @@ Run the real workflow, if you found these loggings in the running log, it means
243243
244244 By adjusting this factor, users can fine-tune the trade-off between memory efficiency
245245 and performance optimizations.
246- * **--xla_gpu_enable_pipelined_collectives** When using pipeline parallelism,
247- this flag enables overlapping the (i+1)-th layer weight `AllGather` with the
248- i-th layer computation. It also enables overlapping (i+1)-th layer
249- weight `Reduce`/`ReduceScatter` with i-th layer's computation. The default
250- value is False. **There are some bugs when this flag is turned on.**
251- * **--xla_gpu_collective_permute_decomposer_threshold** This flag is useful when
252- performing [GSPMD pipelining](https://arxiv.org/abs/2105.04663). Setting a
253- nonzero threshold decomposes `CollectivePermute`s into
254- `CollectivePermuteReceiveDone` and `CollectivePermuteSendDone` pairs, so that
255- computation can be performed between each corresponding
256- `ReceiveDone`/`SendDone` pair and hence achieve more overlap. By default the
257- threshold is 0 and there is no decomposition. Setting it to a threshold > 0 such
258- as `--xla_gpu_collective_permute_decomposer_threshold=1024` can enable this
259- feature.
260246* **--xla_gpu_all_gather_combine_threshold_bytes**
261247 **--xla_gpu_reduce_scatter_combine_threshold_bytes**
262248 **--xla_gpu_all_reduce_combine_threshold_bytes**
@@ -268,6 +254,227 @@ Run the real workflow, if you found these loggings in the running log, it means
268254 combine at least a Transformer Layer's weight `AllGather`/`ReduceScatter`. By
269255 default, the `combine_threshold_bytes` is set to 256.
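
To make the recommendation above concrete, the sketch below estimates the byte
size of one Transformer layer's weights and prints it as a candidate threshold.
The hidden size, feed-forward width, and bf16 dtype are illustrative
assumptions, not values taken from this document.

```
# A minimal sketch, assuming a hypothetical Transformer layer shape and bf16
# weights; choose threshold values based on your own model.
HIDDEN_SIZE = 8192          # assumed model hidden size
FFN_SIZE = 4 * HIDDEN_SIZE  # assumed feed-forward width
BYTES_PER_PARAM = 2         # bf16

# Q/K/V/output projections plus the two feed-forward matmuls of one layer.
layer_params = 4 * HIDDEN_SIZE * HIDDEN_SIZE + 2 * HIDDEN_SIZE * FFN_SIZE
layer_bytes = layer_params * BYTES_PER_PARAM

print(f"--xla_gpu_all_gather_combine_threshold_bytes={layer_bytes}")
print(f"--xla_gpu_reduce_scatter_combine_threshold_bytes={layer_bytes}")
print(f"--xla_gpu_all_reduce_combine_threshold_bytes={layer_bytes}")
```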
270256
257+ ### Pipeline Parallelism on GPU
258+
259+ XLA implements SPMD-based pipeline parallelism optimizations. This is a scaling technique
260+ where the forward and backward passes are split into multiple pipeline stages.
261+ Each device (or device group) processes the result of the previous
262+ pipeline stage (or the pipeline input) and sends its partial result to the next
263+ stage until the end of the pipeline is reached. This optimization works best
264+ when the latency of the computation is larger than that of the communication.
265+ At compile time, the operations will be rearranged to overlap communication
266+ with computation.
267+
268+ For an optimized schedule, we recommend these XLA flags:
269+ ```
270+ --xla_gpu_enable_latency_hiding_scheduler=true
271+ --xla_gpu_enable_command_buffer=''
272+ --xla_disable_hlo_passes=collective-permute-motion
273+ --xla_gpu_experimental_pipeline_parallelism_opt_level=PIPELINE_PARALLELISM_OPT_LEVEL_ENABLE
274+ ```
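
One common way to supply these flags (a sketch of one option, not the only
one) is through the `XLA_FLAGS` environment variable, set before JAX
initializes its XLA backend:

```
# Minimal sketch: set XLA_FLAGS before JAX creates its backends.
# The '' in the flag list above denotes an empty value, written here as
# "--xla_gpu_enable_command_buffer=".
import os

os.environ["XLA_FLAGS"] = " ".join([
    "--xla_gpu_enable_latency_hiding_scheduler=true",
    "--xla_gpu_enable_command_buffer=",
    "--xla_disable_hlo_passes=collective-permute-motion",
    "--xla_gpu_experimental_pipeline_parallelism_opt_level="
    "PIPELINE_PARALLELISM_OPT_LEVEL_ENABLE",
])

import jax  # import JAX only after XLA_FLAGS is set
```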
275+
276+ The following JAX example demonstrates a pattern where communication operations
277+ are scheduled to overlap with computations. It illustrates how to set up an
278+ optimized pipeline parallelism schedule using 4 GPUs that
279+ form a communication ring (device 0 -> device 1 -> device 2 -> device 3 ->
280+ device 0). We refer to the pattern `0 -> 1 -> 2 -> 3` as the forward edge, and
281+ `3 -> 0` as the back edge.
282+
283+ ```
284+ # Imports and setup
285+ import functools
286+ import jax
287+ from jax import sharding
288+ from jax.experimental import mesh_utils
289+ import jax.numpy as jnp
290+ import jax.random
291+
292+ NUM_DEVICES = 4
293+ NUM_MICROBATCHES = 5
294+ NUM_CIRC_REPEATS = 2
295+ CONTRACTING_DIM_SIZE = 4096
296+ NON_CONTRACTING_DIM_SIZE = 8192
297+ COMPUTE_INTENSITY = 32
298+
299+ # Creates a collective permute for the "forward edge".
300+ # 0->1, 1->2, ... (N-2)->(N-1)
301+ def shift_right(arr):
302+ padding = [[1, 0]] + [[0, 0]] * (arr.ndim - 1)
303+ # Use lax.slice to guarantee the gradient is a pad.
304+ return jax.lax.slice(jnp.pad(arr, padding), [0] * arr.ndim, arr.shape)
305+
306+
307+ # Creates a collective permute for the "back edge".
308+ # (N-1)->0
309+ def cycle_back(arr):
310+ padding = [[0, NUM_DEVICES - 1]] + [[0, 0]] * (arr.ndim - 1)
311+ return jax.lax.slice(
312+ jnp.pad(arr, padding),
313+ [NUM_DEVICES - 1] + [0] * (arr.ndim - 1),
314+ (NUM_DEVICES - 1 + arr.shape[0],) + arr.shape[1:],
315+ )
316+
317+
318+ def select_on_first_device(then_value, else_value):
319+ assert then_value.shape == else_value.shape
320+ is_first_device = jax.lax.broadcasted_iota("int32", then_value.shape, 0) == 0
321+ return jnp.where(is_first_device, then_value, else_value)
322+
323+
324+ def select_on_last_device(then_value, else_value):
325+ assert then_value.shape == else_value.shape
326+ is_last_device = (
327+ jax.lax.broadcasted_iota("int32", then_value.shape, 0) == NUM_DEVICES - 1
328+ )
329+ return jnp.where(is_last_device, then_value, else_value)
330+
331+
332+ def select_on_first_cycle(i, then_value, else_value):
333+ assert then_value.shape == else_value.shape
334+ is_first_cycle = i < NUM_MICROBATCHES
335+ return jnp.where(is_first_cycle, then_value, else_value)
336+
337+
338+ def while_body(carry, i):
339+ """Body of the pipeline while loop."""
340+ weights, input_buffer, output_buffer, fwd_edge_data, bwd_edge_data = carry
341+
342+ # Read input data from input buffer.
343+ input_data = jax.lax.dynamic_slice(
344+ input_buffer,
345+ (0, (i + 0) % NUM_MICROBATCHES, 0, 0),
346+ (NUM_DEVICES, 1, CONTRACTING_DIM_SIZE, NON_CONTRACTING_DIM_SIZE),
347+ )
348+
349+ # Collective permute on the "forward edge" shifts data to the next stage.
350+ fwd_edge_data = shift_right(fwd_edge_data)
351+
352+ # Select compute argument based on device and pipeline cycle.
353+ compute_argument = select_on_first_device(
354+ select_on_first_cycle(i, input_data, bwd_edge_data),
355+ fwd_edge_data,
356+ ).reshape((NUM_DEVICES, CONTRACTING_DIM_SIZE, NON_CONTRACTING_DIM_SIZE))
357+
358+ # A few matmuls to simulate compute.
359+ tmp = compute_argument
360+ for _ in range(COMPUTE_INTENSITY):
361+ tmp = jax.lax.dot_general(weights, tmp, (((2,), (1,)), ((0,), (0,))))
362+ compute_result = tmp.reshape(
363+ (NUM_DEVICES, 1, CONTRACTING_DIM_SIZE, NON_CONTRACTING_DIM_SIZE)
364+ )
365+
366+ # Read data from buffer to pass it to the first device of the pipeline on the
367+ # "back edge".
368+ bwd_edge_data = jax.lax.dynamic_slice(
369+ output_buffer,
370+ (0, (1 + i) % NUM_MICROBATCHES, 0, 0),
371+ (NUM_DEVICES, 1, CONTRACTING_DIM_SIZE, NON_CONTRACTING_DIM_SIZE),
372+ )
373+
374+ # Collective permute on the "back edge" passes data to the first device.
375+ bwd_edge_data = cycle_back(bwd_edge_data)
376+
377+ # Update output buffer. We do this after reading from it to avoid the data
378+ # dependency.
379+ output_buffer = jax.lax.dynamic_update_slice(
380+ output_buffer,
381+ compute_result,
382+ (0, (2 + i) % NUM_MICROBATCHES, 0, 0),
383+ )
384+
385+ fwd_edge_data = compute_result
386+ carry = (
387+ weights,
388+ input_buffer,
389+ output_buffer,
390+ fwd_edge_data,
391+ bwd_edge_data,
392+ )
393+ return carry, i
394+
395+
396+ @functools.partial(jax.jit, static_argnames=["mesh"])
397+ def entry_computation(weights, input_buffer, mesh):
398+
399+ # Init output buffer.
400+ output_buffer = jnp.zeros_like(input_buffer)
401+
402+ # Init dummy data for forward and backward edge passed through the while loop.
403+ dummy_data = jnp.zeros(
404+ shape=(NUM_DEVICES, 1, CONTRACTING_DIM_SIZE, NON_CONTRACTING_DIM_SIZE)
405+ ).astype(jnp.float32)
406+ dummy_data = jax.device_put(
407+ dummy_data,
408+ sharding.NamedSharding(
409+ mesh, sharding.PartitionSpec("the_one_and_only_axis")
410+ ),
411+ )
412+
413+ # Start pipeline.
414+ carry = weights, input_buffer, output_buffer, dummy_data, dummy_data
415+ num_iterations = NUM_CIRC_REPEATS * NUM_MICROBATCHES + NUM_DEVICES - 1
416+ carry, _ = jax.lax.scan(while_body, carry, xs=jnp.arange(num_iterations))
417+ _, _, output_buffer, _, _ = carry
418+
419+ return output_buffer
420+
421+
422+ def main(_):
423+
424+ # Expect constant number of devices.
425+ assert NUM_DEVICES == jax.local_device_count()
426+
427+ # Create mesh.
428+ mesh = sharding.Mesh(
429+ mesh_utils.create_device_mesh([NUM_DEVICES]),
430+ axis_names=["the_one_and_only_axis"],
431+ )
432+
433+ # Init weights.
434+ weights = 1.0 / CONTRACTING_DIM_SIZE
435+ weights = jax.lax.broadcast_in_dim(
436+ weights,
437+ shape=(NUM_DEVICES, CONTRACTING_DIM_SIZE, CONTRACTING_DIM_SIZE),
438+ broadcast_dimensions=(),
439+ )
440+ weights = jax.device_put(
441+ weights,
442+ sharding.NamedSharding(
443+ mesh, sharding.PartitionSpec("the_one_and_only_axis")
444+ ),
445+ )
446+
447+ # Init random input and replicate it across all devices.
448+ random_key = jax.random.key(0)
449+ input_buffer = jax.random.uniform(
450+ random_key,
451+ shape=(
452+ NUM_MICROBATCHES,
453+ CONTRACTING_DIM_SIZE,
454+ NON_CONTRACTING_DIM_SIZE,
455+ ),
456+ )
457+ input_buffer = jax.lax.broadcast_in_dim(
458+ input_buffer,
459+ shape=(
460+ NUM_DEVICES,
461+ NUM_MICROBATCHES,
462+ CONTRACTING_DIM_SIZE,
463+ NON_CONTRACTING_DIM_SIZE,
464+ ),
465+ broadcast_dimensions=[1, 2, 3],
466+ )
467+ input_buffer = jax.device_put(
468+ input_buffer,
469+ sharding.NamedSharding(
470+ mesh, sharding.PartitionSpec("the_one_and_only_axis")
471+ ),
472+ )
473+
474+ # Run computation.
475+ output_buffer = entry_computation(weights, input_buffer, mesh)
476+ print(f"output_buffer = \n{output_buffer}")
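

# The example above defines main() but never calls it. A minimal entry point
# (an addition of this sketch; the original does not show how main is invoked)
# could look like this:
if __name__ == "__main__":
  main(None)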
477+ ```
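
To check how the compiler actually scheduled the collectives, one option (a
sketch that assumes the `weights`, `input_buffer`, and `mesh` objects from
`main` above are in scope) is to look at the optimized HLO of the jitted
`entry_computation`:

```
# Sketch: dump the optimized HLO produced for the pipeline loop.
lowered = entry_computation.lower(weights, input_buffer, mesh)
compiled = lowered.compile()
hlo_text = compiled.as_text()

# Rough check: how many collective-permute instructions survived optimization.
print(hlo_text.count("collective-permute"))
```
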
271478## NCCL flags
272479
273480These Nvidia NCCL flag values may be useful for single-host multi-device