[XLA:GPU] f64 rsqrt is 1 ULP off on Blackwell (SM 12.0a); budget claim kRsqrtF64Budget.gpu.regular=0 contradicted by empirical sweep #40862

@blasphemetheus

Description

Summary

kRsqrtF64Budget.gpu.regular = 0 (declared in xla/codegen/intrinsic/accuracy/accuracy_budget.h) is empirically wrong on Blackwell (NVIDIA RTX 5090, compute capability 12.0a, CUDA 12.9, driver 13.2). Over a 1,014-sample sweep of normal-range f64 inputs, 26.04% of results are 1 ULP off the IEEE-correctly-rounded reference for 1/sqrt(x). Only 73.96% are bit-exact.

Interestingly, 1 / sqrt(x) written out as two separate ops is bit-exact on 100% of the same sample set. The two lowering paths diverge on GPU, which exposes both a budget bug and a pair of simplifier-equivalent forms where users get dramatically different precision depending on which one they write.

Repro

Minimal JAX reproducer (same pattern as xla#40844 but for GPU f64):

import jax, jax.numpy as jnp, numpy as np

jax.config.update("jax_enable_x64", True)  # without this, f64 silently degrades to f32

@jax.jit
def f(x):
    return jax.lax.rsqrt(x)

@jax.jit
def g(x):
    return 1.0 / jnp.sqrt(x)

x = jnp.float64(2.0)
print("rsqrt(2.0)   =", repr(f(x).item()))   # 0.7071067811865476  (1 ULP off)
print("1/sqrt(2.0)  =", repr(g(x).item()))   # 0.7071067811865475  (bit-exact)
print("1/np.sqrt(2) =", repr(1.0 / np.sqrt(2.0)))  # 0.7071067811865475

0.7071067811865475 is the round-to-nearest-even f64 of 1/sqrt(2); jax.lax.rsqrt returns the next f64 up. Same pattern holds for x ∈ {1.5, 3.0, 0.5, 7.0, ...} — any input where the SIMD Newton-Raphson refinement lands on the wrong side of the rounding boundary.
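The 1-ULP claim can be verified from the printed reprs alone by comparing bit patterns; a minimal sketch (the helper name `ulp_distance` is ours, not XLA's):

```python
import struct

def ulp_distance(a: float, b: float) -> int:
    """ULP distance between two finite, same-sign f64s, via their bit patterns."""
    ia = struct.unpack("<q", struct.pack("<d", a))[0]
    ib = struct.unpack("<q", struct.pack("<d", b))[0]
    return abs(ia - ib)

# The two values printed above are adjacent f64s:
print(ulp_distance(0.7071067811865476, 0.7071067811865475))  # 1
```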

Methodology

Swept 1,000 log-uniform f64 inputs over [1e-300, 1e300] plus 14 curated values (including 1.0, 2.0, π, e, 1 + 2^-52). Computed rsqrt(x) and 1/sqrt(x) on the RTX 5090, then compared each result against the reference 1.0 / math.sqrt(x) at the bit level.

Operation        0 ULP           1 ULP          Samples
rsqrt(x) f64     750 (73.96%)    264 (26.04%)   1014
1 / sqrt(x) f64  1014 (100.00%)  0              1014
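A CPU-side sketch of the sweep harness described above (sampling and ULP bucketing only; a real run would substitute a jitted `lax.rsqrt` or `1/jnp.sqrt` on the device under test for the placeholder lambda, and the exact curated-value list here is illustrative):

```python
import math
import struct
from collections import Counter

import numpy as np

def f64_bits(x: float) -> int:
    return struct.unpack("<q", struct.pack("<d", float(x)))[0]

def sweep(impl, xs):
    """Histogram of ULP distances of impl(x) against the 1/math.sqrt(x) reference."""
    hist = Counter()
    for x in xs:
        ref = 1.0 / math.sqrt(x)
        hist[abs(f64_bits(impl(x)) - f64_bits(ref))] += 1
    return hist

rng = np.random.default_rng(0)
# 1,000 log-uniform samples over [1e-300, 1e300], plus a few curated values.
xs = np.concatenate([
    np.exp(rng.uniform(math.log(1e-300), math.log(1e300), size=1000)),
    np.array([1.0, 2.0, math.pi, math.e, 1.0 + 2.0**-52]),
])

# Trivially all-zero here because impl is the reference itself; the table
# above is this histogram with impl = jitted rsqrt on the RTX 5090.
print(sweep(lambda x: 1.0 / math.sqrt(x), xs))
```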

Reproducer scripts (Elixir/EXLA drivers of the same XLA GPU path; the JAX snippet above reproduces single inputs): https://gist.github.com/blasphemetheus/b11c03bbc9361c1f062741a03bbe8af7

  • verify_rsqrt_f64_blackwell_sweep.exs — the 1,014-sample sweep whose results are reported in the table above.
  • verify_blackwell_zero_ulp_claims.exs — broader audit across every 0-ULP GPU claim in kSqrtF64Budget, kRsqrtF64Budget, and the StableHLO correctly-rounded op set (add, sub, mul, div). Only kRsqrtF64Budget.gpu.regular = 0 is contradicted on Blackwell; the others all hold bit-exact (see below).

Additional audit result

Run of verify_blackwell_zero_ulp_claims.exs on the same hardware, scoping every 0-ULP f64 GPU claim:

Op            Claim              Observed                   Verdict
f64 sqrt      ≤0 ULP             0 ULP on 809/809 (100%)    ✓ holds
f64 rsqrt     ≤0 ULP             1 ULP on 233/809 (28.80%)  ✗ FAILS
f32 sqrt      ≤1 ULP             1 ULP on 136/807 (16.85%)  ✓ holds (not bit-exact)
f64 divide    correctly rounded  0 ULP on 400/400           ✓ holds
f64 add       correctly rounded  0 ULP on 400/400           ✓ holds
f64 multiply  correctly rounded  0 ULP on 400/400           ✓ holds
f64 subtract  correctly rounded  0 ULP on 400/400           ✓ holds

So the rsqrt bug is isolated — every other correctly-rounded claim (including f64 sqrt, which is the forward half of 1/sqrt) is honest on Blackwell. This narrows the fix surface to rsqrt itself.

Note also that kSqrtF32Budget.gpu.regular = 1 is consistent with observed behavior; f32 sqrt on Blackwell is not bit-exact, so any attempt to tighten that GPU budget to 0 would regress.

Environment

  • GPU: NVIDIA GeForce RTX 5090
  • Compute capability: 12.0a (Blackwell)
  • Driver: 13.2.0
  • CUDA Runtime: 12.9.0
  • CUDA Toolkit: 12.9.0
  • cuDNN: 9.13.0
  • XLA: revision bundled in elixir-nx/xla v0.10.0 (early-2026 snapshot of openxla/xla main)

Why this matters

StableHLO defines stablehlo.rsqrt as implementation-defined precision, so a 1-ULP result on Blackwell is not itself a spec violation. The bug is that the budget says otherwise: kRsqrtF64Budget.gpu.regular = 0 is a positive assertion that the emitter is bit-exact, and intrinsic_accuracy_test_gpu relies on it. Either:

  1. The test target hasn't been run on Blackwell and the claim was tuned against older architectures (Volta/Ampere/Hopper) where __nv_rsqrt happens to be correctly rounded for a different set of inputs, or
  2. The test target has a narrower input range than this sweep and doesn't hit the ~26% of Blackwell inputs where the Newton-Raphson body lands on the wrong side.

Separately, the fact that 1/sqrt(x) stays bit-exact while rsqrt(x) does not indicates that the f64 divide + sqrt lowering on GPU is not being rewritten to rsqrt by the algebraic simplifier on this path, despite HandleDivide containing divide(A, sqrt(B)) → multiply(A, rsqrt(B)) with no element-type guard. Worth investigating whether that's intentional (different one-use matching? different HLO shape?) or a latent difference between CPU and GPU pipelines that happens to save precision here.
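One way to check which form actually survived the simplifier, without reading the XLA pass pipeline, is to dump the optimized HLO through JAX's AOT API and grep it. A sketch (output is backend-dependent, so treat it as a diagnostic rather than a fixed expectation):

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # f64 requires the x64 flag

@jax.jit
def g(x):
    return 1.0 / jnp.sqrt(x)

# Post-optimization HLO for whatever backend JAX is running on.
hlo = g.lower(jnp.float64(2.0)).compile().as_text()
# If HandleDivide's divide(A, sqrt(B)) -> multiply(A, rsqrt(B)) rewrite fired,
# an "rsqrt" instruction appears in the text; if not, divide + sqrt survive.
print("rsqrt" in hlo, "sqrt" in hlo)
```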

Proposed fix options

  1. Relax the budget to match reality. Change kRsqrtF64Budget.gpu.regular from 0 to 1, consistent with the CPU budget and with what the spec allows. Run intrinsic_accuracy_test_gpu on a Blackwell CI node to confirm.
  2. Fix the GPU emitter so f64 rsqrt uses a path with a provable error bound (e.g., lower to the __nv_sqrt + __nv_frcp_rn composition, the same pattern as the CPU fix in [XLA] Fix f64 rsqrt 1-ULP error in CPU intrinsic and algebraic simplifier #40844). That gives the GPU path the same ≤1 ULP guarantee the CPU path now has; even with the Newton-Raphson refinement removed, the result is still 1 ULP worst-case, so the budget should be relaxed either way.

Option 1 alone is sufficient to unbreak the claim. Option 2 is a real precision improvement for callers who currently hit the Newton-Raphson path.

Labels

GPU (XLA on GPU), bug (Something isn't working)
