Fix a codegen race condition GH-1474

nvtw · nvtw · commit 56019eda8cd3 · 2026-05-27T00:34:46.000-07:00
* Implement MR comments

* More cleanup

* Polishing

* Undo faulty changes

* Add more adjustments so that no corrupted code is generated when one thread fails during compilation (syntax error, etc.)

* Run ruff

* Test: cross-module codegen race with shared @wp.func

Add a regression test for the codegen race fixed by the previous
commit ("Codegen race fix"). The existing tests in this file build
N modules where each module has its OWN ``@wp.kernel`` and no shared
``@wp.func`` -- so concurrent ``adj.build`` calls touch disjoint
adjoint state and the race never fires.

The bug needs a *shared* helper graph: when M modules each call the
same module-level ``@wp.func`` (and transitively a chain of helpers),
every module's ``ModuleBuilder`` re-walks and mutates the same
``Adjoint`` objects. Without ``_codegen_lock`` two threads land in
``adj.build`` concurrently, interleave their writes to ``adj.blocks``
/ ``adj.deferred_static_expressions``, and the emitted .cu sees
mangled function signatures (``var_5 = _race_helper_0(...)``
assigned a ``void`` return, ``adj__race_helper_0`` called with the
wrong arity, etc.). nvrtc then fails the build with a handful of
syntax errors per module.
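
The failure mode described above can be sketched without Warp. This is a minimal illustration, not the Warp implementation: ``SharedHelper``, ``build_module`` and ``run_shared_builds`` are hypothetical names, and ``_lock`` stands in for ``_codegen_lock``. Each "module" rebuilds the same shared object; because the rebuild resets and refills ``blocks``, unserialised threads could see each other's entries, while holding the lock keeps every build self-consistent.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the process-wide codegen lock.
_lock = threading.RLock()


class SharedHelper:
    """Hypothetical stand-in for an Adjoint shared by several modules."""

    def __init__(self):
        self.blocks = []

    def build(self, tag, steps=200):
        # Reset and refill shared state, as a rebuild of a shared helper would.
        self.blocks = []
        for i in range(steps):
            self.blocks.append((tag, i))
        # Copy the state so the caller keeps a stable view of this build.
        return list(self.blocks)


def build_module(helper, tag):
    # Serialise the whole reset-and-refill window on the shared object.
    with _lock:
        return helper.build(tag)


def run_shared_builds(n_modules=8):
    helper = SharedHelper()
    with ThreadPoolExecutor(max_workers=n_modules) as pool:
        # pool.map returns results in submission order.
        return list(pool.map(lambda tag: build_module(helper, tag), range(n_modules)))
```

Removing the ``with _lock`` line makes the outcome timing-dependent, which mirrors why the real race needs retries to reproduce.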

Reproducing the race reliably requires three things:

* a chain of shared helpers (``_race_helper`` -> ``_race_mid`` ->
  ``_race_leaf``) so each module does meaningful shared-adjoint work
  -- a single small helper compiles too fast for threads to
  interleave;
* enough modules under ``force_load`` (``NUM_MODULES = 8``, and
  ``2 * NUM_MODULES`` in the high-concurrency variant);
* a small retry loop (``ATTEMPTS = 4``) -- the race is
  timing-dependent and the first parallel build sometimes wins.
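
The shape of that reproduction loop can be sketched in plain Python. This is an assumption-laden outline rather than the test itself: ``run_attempts`` and ``build_one`` are hypothetical names, and ``build_one`` stands in for whatever builds one module. Each attempt uses a fresh index range so no attempt reuses work from an earlier one, and the first failure is reported with its attempt number.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_MODULES = 8
ATTEMPTS = 4


def run_attempts(build_one, attempts=ATTEMPTS, num_modules=NUM_MODULES):
    """Run several parallel build attempts; raise on the first failure."""
    completed = 0
    for attempt in range(attempts):
        # Fresh indices per attempt, so every attempt does new work.
        offset = attempt * 1000
        indices = [offset + i for i in range(num_modules)]
        with ThreadPoolExecutor(max_workers=num_modules) as pool:
            futures = [pool.submit(build_one, idx) for idx in indices]
            for idx, future in zip(indices, futures):
                try:
                    future.result()
                except Exception as exc:
                    raise AssertionError(
                        f"attempt {attempt}: build {idx} raised {type(exc).__name__}: {exc}"
                    ) from exc
                completed += 1
    return completed
```

A timing-dependent bug only has to fire in one of the attempts for the harness to fail, which is why a small ``ATTEMPTS`` value is enough in practice.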

The test is CUDA-only: the CUDA codegen path emits the device-side
adjoint stub + reverse glue in addition to the forward path, giving
threads more interleaving opportunities. CPU codegen also touches
the shared adjoint state but the window is too small to reproduce
on a modern multi-core box. Skipping when CUDA is unavailable is
acceptable -- the race only ever bit a real user on the CUDA path
(PhoenX singleworld kernels).

Verification: without the lock the high-concurrency variant
consistently fails with NVRTC error 6 on a ``@wp.func`` adjoint
that was clobbered mid-build; with the lock applied both variants
pass in ~33 s on an RTX 3080 laptop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Codegen race fix

Approved-by: Eric Shi <ershi@nvidia.com>

See merge request omniverse/warp!2413
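
The locking pattern used by the fix can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in ``warp/_src/context.py``: ``codegen_lock``, ``shared_state``, ``module_hash``, ``generate``, ``slow_compile`` and ``load_all`` are all hypothetical. The cheap step that touches shared state runs under a re-entrant lock and snapshots what it needs into locals; the expensive step runs after the lock is released, so it still parallelises.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Re-entrant, so a nested acquire on the same thread does not deadlock.
codegen_lock = threading.RLock()
shared_state = {"emitted": []}


def module_hash(name):
    # May be called while the caller already holds the lock.
    with codegen_lock:
        return f"{name}-{len(name)}"


def generate(name):
    with codegen_lock:
        # Cheap step: mutate shared state, then snapshot it into locals.
        shared_state["emitted"] = [f"fn_{name}_{i}" for i in range(3)]
        digest = module_hash(name)  # nested acquire on the same thread
        source = "\n".join(shared_state["emitted"])
        extras = list(shared_state["emitted"])
    return digest, source, extras


def slow_compile(source):
    # Expensive step: uses only the snapshot, so it needs no lock.
    time.sleep(0.01)
    return source.count("\n") + 1


def load(name):
    digest, source, extras = generate(name)
    return digest, slow_compile(source), extras


def load_all(names, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load, names))
```

With a plain ``threading.Lock`` the nested ``module_hash`` call would block forever on the same thread, which is the reason for choosing ``RLock`` in this sketch.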
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -160,6 +160,8 @@
   ([GH-1466](https://github.com/NVIDIA/warp/issues/1466)).
 - Fix tile byte-offset overflow for arrays larger than 2 GiB
   ([GH-1422](https://github.com/NVIDIA/warp/issues/1422)).
+- Fix an intermittent failure when loading modules in parallel with `wp.force_load(max_workers > 1)` that could cause
+  modules sharing a `@wp.func` to fail to compile or load ([GH-1474](https://github.com/NVIDIA/warp/issues/1474)).
 
 ### Documentation
 
diff --git a/warp/_src/context.py b/warp/_src/context.py
@@ -2233,6 +2233,24 @@ def get_unique_kernels(self):
         return self.unique_kernels.values()
 
 
+# Process-wide codegen lock. ``adj.build()`` mutates per-Adjoint state
+# (``adj.blocks``, ``adj.deferred_static_expressions``, etc.) on the
+# *same* Adjoint object whenever multiple modules reference a shared
+# ``@wp.func``. Holding this lock around the ``ModuleBuilder`` +
+# ``builder.codegen()`` window in :meth:`Module._compile` serialises
+# the Python-side codegen so concurrent ``Module.load`` calls (e.g.
+# from :func:`force_load` with ``max_workers > 1``) don't interleave
+# function emissions in the per-module .cu output. The expensive
+# nvrtc / nvcc invocation runs after the lock is released, so loading
+# N modules in parallel still parallelises the compiler step (the
+# dominant cost) -- only the much cheaper codegen serialises.
+#
+# Re-entrant so nested ``ModuleBuilder`` calls (e.g. the dummy build
+# inside :meth:`Module.get_module_hash`) on the same thread don't
+# deadlock.
+_codegen_lock = threading.RLock()
+
+
 class ModuleBuilder:
     def __init__(self, module, options, hasher=None):
         self.functions = {}
@@ -2811,7 +2829,12 @@ def hash_module(self) -> bytes:
         """
         block_dim = self.options["block_dim"]
         options = self.resolve_options(warp.config)
-        self.hashers[block_dim] = ModuleHasher(self._get_live_kernels(), options)
+        # ``ModuleHasher.__init__`` calls ``hash_kernel`` -> ``hash_adjoint``
+        # which reads shared ``@wp.func`` adjoint state; serialise with
+        # ``_codegen_lock`` so concurrent ``Module.load`` callers don't
+        # interleave hash + build mutations on the same Adjoint.
+        with _codegen_lock:
+            self.hashers[block_dim] = ModuleHasher(self._get_live_kernels(), options)
         self.resolved_options[block_dim] = options
         return self.hashers[block_dim].get_hash()
 
@@ -2823,16 +2846,27 @@ def get_module_hash(self, block_dim: int | None = None) -> bytes:
         if block_dim is None:
             block_dim = self.options["block_dim"]
 
-        if self.has_unresolved_static_expressions:
-            options = self.resolve_options(warp.config)
-            builder_options = options | {"output_arch": None}
-            _ = ModuleBuilder(self, builder_options)
-            self.has_unresolved_static_expressions = False
-
-        if block_dim not in self.hashers:
-            options = self.resolve_options(warp.config)
-            self.hashers[block_dim] = ModuleHasher(self._get_live_kernels(), options)
-            self.resolved_options[block_dim] = options
+        # Both branches below mutate shared ``@wp.func`` adjoint state
+        # (``ModuleBuilder`` runs ``adj.build`` to resolve deferred
+        # ``wp.static`` expressions; ``ModuleHasher`` reads the
+        # resulting adjoint blocks via ``hash_adjoint``). The two
+        # operations are stages of one logical "compute module hash"
+        # critical section, so they live in a single ``_codegen_lock``
+        # block. Splitting the lock per stage opens a window where
+        # another thread can re-run ``adj.build`` on a shared helper
+        # and clobber the state this thread is about to hash.
+        if self.has_unresolved_static_expressions or block_dim not in self.hashers:
+            with _codegen_lock:
+                if self.has_unresolved_static_expressions:
+                    options = self.resolve_options(warp.config)
+                    builder_options = options | {"output_arch": None}
+                    _ = ModuleBuilder(self, builder_options)
+                    self.has_unresolved_static_expressions = False
+
+                if block_dim not in self.hashers:
+                    options = self.resolve_options(warp.config)
+                    self.hashers[block_dim] = ModuleHasher(self._get_live_kernels(), options)
+                    self.resolved_options[block_dim] = options
 
         return self.hashers[block_dim].get_hash()
 
@@ -3002,13 +3036,6 @@ def _compile(
         ):
             return False
 
-        # Some of the tile codegen, such as cuFFTDx and cuBLASDx, requires knowledge of the target arch
-        builder = ModuleBuilder(
-            self,
-            options,
-            hasher=self.hashers.get(options["block_dim"], None),
-        )
-
         meta_path = os.path.join(output_dir, self._get_meta_name())
 
         build_dir = os.path.normpath(output_dir) + f"_p{os.getpid()}_t{threading.get_ident()}"
@@ -3025,20 +3052,46 @@ def _compile(
         if opt != 3 and not is_cpu and runtime.toolkit_version is not None and runtime.toolkit_version < (12, 9):
             log_warning("Optimization level other than 3 has no effect on CUDA versions prior to 12.9.", once=True)
 
-        # build CPU
-        if is_cpu:
-            # build
-            try:
-                source_code_path = os.path.join(build_dir, f"{module_name_short}.cpp")
+        # Python codegen window -- LOCKED. See ``_codegen_lock``.
+        # Some of the tile codegen, such as cuFFTDx and cuBLASDx,
+        # requires knowledge of the target arch. Snapshot builder
+        # collections needed by ``build_cuda`` below into locals so
+        # nothing inside the lock is touched after release.
+        #
+        # NOTE: this lock window is intentionally outside the
+        # ``failed_builds`` try/except below. ``ModuleBuilder`` can
+        # legitimately raise from ``adj.build`` (e.g. user kernels
+        # with type mismatches in the type-mismatch error tests); if
+        # we recorded those in ``failed_builds`` the next ``Module.load``
+        # on the same device short-circuits with ``return None`` and
+        # subsequent unrelated kernels in the same module silently
+        # fail to launch. Only the heavy native compile records
+        # ``failed_builds``.
+        with _codegen_lock:
+            builder = ModuleBuilder(
+                self,
+                options,
+                hasher=self.hashers.get(options["block_dim"], None),
+            )
+            if is_cpu:
+                source_code_ext = "cpp"
+                source_str = builder.codegen("cpu")
+            else:
+                source_code_ext = "cu"
+                source_str = builder.codegen("cuda")
+            meta = builder.build_meta()
+            ltoir_values = list(builder.ltoirs.values())
+            fatbin_values = list(builder.fatbins.values())
 
-                # write cpp sources
-                cpp_source = builder.codegen("cpu")
+        source_code_path = os.path.join(build_dir, f"{module_name_short}.{source_code_ext}")
 
-                with open(source_code_path, "w") as cpp_file:
-                    cpp_file.write(cpp_source)
+        with open(source_code_path, "w") as source_file:
+            source_file.write(source_str)
 
-                output_path = os.path.join(build_dir, output_name)
+        output_path = os.path.join(build_dir, output_name)
 
+        try:
+            if is_cpu:
                 # build object code
                 with warp.ScopedTimer(
                     "Compile x86", active=(warp.config.verbose or warp.config.log_level <= warp.LOG_DEBUG)
@@ -3058,28 +3111,7 @@ def _compile(
                         block_dim=options["block_dim"],
                         enable_tiles_in_stack_memory=options["enable_tiles_in_stack_memory"],
                     )
-
-            except Exception as e:
-                if isinstance(e, FileNotFoundError):
-                    _check_and_raise_long_path_error(e)
-
-                self.failed_builds.add(None)
-
-                raise (e)
-
-        else:
-            # build
-            try:
-                source_code_path = os.path.join(build_dir, f"{module_name_short}.cu")
-
-                # write cuda sources
-                cu_source = builder.codegen("cuda")
-
-                with open(source_code_path, "w") as cu_file:
-                    cu_file.write(cu_source)
-
-                output_path = os.path.join(build_dir, output_name)
-
+            else:
                 # generate PTX or CUBIN
                 with warp.ScopedTimer(
                     f"Compile CUDA (arch={options['output_arch']}{arch_suffix}, mode={mode}, block_dim={options['block_dim']})",
@@ -3096,27 +3128,28 @@ def _compile(
                         fuse_fp=options["fuse_fp"],
                         lineinfo=options["lineinfo"],
                         compile_time_trace=options["compile_time_trace"],
-                        ltoirs=builder.ltoirs.values(),
-                        fatbins=builder.fatbins.values(),
+                        ltoirs=ltoir_values,
+                        fatbins=fatbin_values,
                         arch_suffix=arch_suffix,
                         pch_dir=runtime.get_nvrtc_pch_dir(),
                         llvm_cuda=options["llvm_cuda"],
                         use_precompiled_headers=options["use_precompiled_headers"],
                     )
 
-            except Exception as e:
-                if isinstance(e, FileNotFoundError):
-                    _check_and_raise_long_path_error(e)
+        except Exception as e:
+            if isinstance(e, FileNotFoundError):
+                _check_and_raise_long_path_error(e)
 
-                if device:
-                    self.failed_builds.add(device.context)
+            if is_cpu:
+                self.failed_builds.add(None)
+            elif device:
+                self.failed_builds.add(device.context)
 
-                raise (e)
+            raise (e)
 
         # ------------------------------------------------------------
-        # build meta data
+        # write meta data (already produced under ``_codegen_lock``)
 
-        meta = builder.build_meta()
         output_meta_path = os.path.join(build_dir, self._get_meta_name())
 
         with open(output_meta_path, "w") as meta_file:
diff --git a/warp/tests/test_module_parallel_load.py b/warp/tests/test_module_parallel_load.py
@@ -13,6 +13,44 @@
 
 import warp as wp
 
+# Chain of shared ``@wp.func`` helpers used by
+# ``TestParallelLoadSharedHelper`` below. Multiple helpers calling each
+# other transitively maximise the race surface: every kernel build
+# walks the entire chain, so concurrent ``adj.build`` calls have many
+# adjoint-state mutations to interleave with one another.
+
+
+@wp.func
+def _race_leaf(x: float) -> float:
+    a = wp.sin(x) + wp.cos(x)
+    a = a * a + 0.5
+    a = wp.sqrt(wp.abs(a) + 1.0)
+    b = wp.tan(x * 0.5) + wp.exp(-x * x * 0.01)
+    return a + b
+
+
+@wp.func
+def _race_mid(x: float, k: int) -> float:
+    s = float(0.0)
+    for _ in range(4):
+        s = s + _race_leaf(x + s)
+    if k > 0:
+        s = s * float(k)
+    else:
+        s = -s
+    return s
+
+
+@wp.func
+def _race_helper(x: float, idx: int) -> float:
+    y = _race_mid(x, idx)
+    z = float(idx) + 1.0
+    if idx % 2 == 0:
+        y = y + _race_leaf(z)
+    else:
+        y = y - _race_leaf(z)
+    return y
+
 
 def _generate_module_code(index):
     """Generate source code for a module with a simple kernel.
@@ -121,5 +159,127 @@ def test_force_load_single_module(self):
         _assert_modules_loaded_on_cpu(self, modules)
 
 
+def _assert_modules_loaded_on_cuda(test, modules, device):
+    for m in modules:
+        ctx = device.context
+        loaded = any(c == ctx for (c, _block_dim) in m.execs.keys())
+        test.assertTrue(loaded, f"Module {m.name} was not loaded on {device}")
+
+
+@unittest.skipUnless(wp.is_cuda_available(), "CUDA codegen path race needs a CUDA device")
+class TestParallelLoadSharedHelper(unittest.TestCase):
+    """Regression test for the cross-module codegen race on CUDA.
+
+    Each ``ModuleBuilder`` walks the reachable ``@wp.func`` graph and
+    calls ``adj.build`` on each helper, which mutates per-Adjoint state
+    (``adj.blocks``, ``adj.deferred_static_expressions``, ...). When
+    two modules that reference the *same* helper build concurrently
+    (e.g., via ``wp.force_load(max_workers > 1)``), without a lock
+    around the codegen window the threads interleave their writes to
+    the shared adjoint and the emitted .cu file has corrupt sections
+    -- one helper's body emitted inside another, or references to
+    ``var_*`` / ``_idx`` / ``dim`` that were never declared. nvrtc
+    then rejects the file with dozens of syntax errors and
+    ``Module._compile`` raises.
+
+    The race reproduces reliably only on the CUDA codegen path (not
+    CPU) because the CUDA emit walks a longer per-function path: it
+    additionally emits the device-side adjoint stub, snapshots
+    ``adj.blocks[0].body_replay`` for the reverse-mode glue, and
+    reads ``options['enable_backward']`` from the codegen-time
+    global -- giving more interleaving opportunities for two
+    threads building the same shared helper.
+
+    Reproducing the race reliably requires:
+
+    * a non-trivial *call graph* of shared ``@wp.func`` helpers
+      (``_race_helper`` -> ``_race_mid`` -> ``_race_leaf``) so each
+      module's ``ModuleBuilder`` does meaningful shared adjoint work;
+    * enough modules submitted to ``force_load`` so the
+      ``ThreadPoolExecutor`` actually runs several builds
+      concurrently;
+    * a retry loop (``ATTEMPTS``) so we accept a probabilistic
+      reproduction -- the race is timing-dependent.
+
+    With the codegen lock in place every attempt succeeds. Without
+    the lock at least one attempt raises a build error.
+    """
+
+    NUM_MODULES = 8
+    ATTEMPTS = 4
+
+    @staticmethod
+    def _make_kernel(idx: int):
+        # ``module="unique"`` puts every factory output in its own
+        # ``Module`` object; if N kernels share a helper, N separate
+        # modules each try to inline that helper's adjoint at compile
+        # time -- the exact pattern PhoenX hits with its singleworld
+        # factory.
+        @wp.kernel(module="unique")
+        def k(out: wp.array(dtype=float)):
+            tid = wp.tid()
+            x = float(tid) + float(idx)
+            out[tid] = _race_helper(x, idx)
+
+        return k
+
+    def _build_kernels(self, attempt: int):
+        # Vary the index offset per attempt so each invocation builds
+        # a fresh set of unique modules -- after the first attempt the
+        # earlier kernels' modules are already loaded, so subsequent
+        # ``force_load`` calls would short-circuit on the hash check
+        # and never exercise the codegen path again.
+        offset = attempt * 1000
+        kernels = [self._make_kernel(offset + i) for i in range(self.NUM_MODULES)]
+        modules: list = []
+        seen: set = set()
+        for k in kernels:
+            k.module.mark_modified()
+            if id(k.module) in seen:
+                continue
+            seen.add(id(k.module))
+            modules.append(k.module)
+        return modules
+
+    def test_force_load_parallel_with_shared_func(self):
+        """N modules sharing a chain of ``@wp.func`` helpers must load
+        successfully under ``max_workers > 1``. Without the codegen
+        lock at least one of the ``ATTEMPTS`` parallel CUDA builds
+        raises because the shared helpers' adjoints were clobbered
+        mid-build."""
+        device = wp.get_preferred_device()
+        for attempt in range(self.ATTEMPTS):
+            modules = self._build_kernels(attempt)
+            try:
+                wp.force_load(device=device, modules=modules, max_workers=self.NUM_MODULES)
+            except Exception as e:
+                self.fail(
+                    f"attempt {attempt}: parallel build raised {type(e).__name__}: {e}. "
+                    "Check the ``_codegen_lock`` window in warp._src.context.Module._compile."
+                )
+            _assert_modules_loaded_on_cuda(self, modules, device)
+
+    def test_force_load_parallel_with_shared_func_high_concurrency(self):
+        """Same race but with more modules than worker threads, so the
+        ``ThreadPoolExecutor`` queues tasks and reuses workers between
+        builds. ``force_load`` submits exactly ``len(devices) * len(modules)``
+        tasks, so to actually change scheduling vs. the basic test we
+        submit 2x the modules and cap ``max_workers`` below that count
+        -- this forces real queueing/contention rather than a
+        thread-per-task fan-out."""
+        device = wp.get_preferred_device()
+        max_workers = max(2, self.NUM_MODULES // 2)
+        for attempt in range(self.ATTEMPTS):
+            modules = self._build_kernels(attempt + 100) + self._build_kernels(attempt + 200)
+            self.assertGreater(len(modules), max_workers)
+            try:
+                wp.force_load(device=device, modules=modules, max_workers=max_workers)
+            except Exception as e:
+                self.fail(
+                    f"attempt {attempt}: parallel build raised {type(e).__name__}: {e} (high-concurrency variant)."
+                )
+            _assert_modules_loaded_on_cuda(self, modules, device)
+
+
 if __name__ == "__main__":
     unittest.main(verbosity=2)
diff --git a/warp/tests/unittest_suites.py b/warp/tests/unittest_suites.py