Skip to content

Commit 56019ed

Browse files
committed
Fix a codegen race condition GH-1474
* Implement MR comments * More cleanup * Polishing * Undo faulty changes * Add more adjustments such that no corrupted code should get generated when one thread fails during the compilation (syntax error etc) * Run ruff * Test: cross-module codegen race with shared @wp.func Add a regression test for the codegen race fixed by the previous commit ("Codegen race fix"). The existing tests in this file build N modules where each module has its OWN ``@wp.kernel`` and no shared ``@wp.func`` -- so concurrent ``adj.build`` calls touch disjoint adjoint state and the race never fires. The bug needs a *shared* helper graph: when M modules each call the same module-level ``@wp.func`` (and transitively a chain of helpers), every module's ``ModuleBuilder`` re-walks and mutates the same ``Adjoint`` objects. Without ``_codegen_lock`` two threads land in ``adj.build`` concurrently, interleave their writes to ``adj.blocks`` / ``adj.deferred_static_expressions``, and the emitted .cu sees mangled function signatures (``var_5 = _race_helper_0(...)`` assigned a ``void`` return, ``adj__race_helper_0`` called with the wrong arity, etc.). nvrtc then fails the build with a handful of syntax errors per module. Reproducing the race reliably requires three things: * a chain of shared helpers (``_race_helper`` -> ``_race_mid`` -> ``_race_leaf``) so each module does meaningful shared-adjoint work -- a single small helper compiles too fast for threads to interleave; * enough modules under ``force_load`` (``NUM_MODULES = 8``, worker count up to ``2 * NUM_MODULES``); * a small retry loop (``ATTEMPTS = 4``) -- the race is timing-dependent and the first parallel build sometimes wins. The test is CUDA-only: the CUDA codegen path emits the device-side adjoint stub + reverse glue in addition to the forward path, giving threads more interleaving opportunities. CPU codegen also touches the shared adjoint state but the window is too small to reproduce on a modern multi-core box. Skipping when CUDA is unavailable is acceptable -- the race only ever bit a real user on the CUDA path (PhoenX singleworld kernels). Verification: without the lock the high-concurrency variant consistently fails with NVRTC error 6 on a ``@wp.func`` adjoint that was clobbered mid-build; with the lock applied both variants pass in ~33 s on an RTX 3080 laptop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Codegen race fix Approved-by: Eric Shi <ershi@nvidia.com> See merge request omniverse/warp!2413
1 parent 184c2c4 commit 56019ed

4 files changed

Lines changed: 257 additions & 61 deletions

File tree

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -160,6 +160,8 @@
160160
([GH-1466](https://github.com/NVIDIA/warp/issues/1466)).
161161
- Fix tile byte-offset overflow for arrays larger than 2 GiB
162162
([GH-1422](https://github.com/NVIDIA/warp/issues/1422)).
163+
- Fix an intermittent failure when loading modules in parallel with `wp.force_load(max_workers > 1)` that could cause
164+
modules sharing a `@wp.func` to fail to compile or load ([GH-1474](https://github.com/NVIDIA/warp/issues/1474)).
163165

164166
### Documentation
165167

warp/_src/context.py

Lines changed: 93 additions & 60 deletions
Original file line numberDiff line numberDiff line change
@@ -2233,6 +2233,24 @@ def get_unique_kernels(self):
22332233
return self.unique_kernels.values()
22342234

22352235

2236+
# Process-wide codegen lock. ``adj.build()`` mutates per-Adjoint state
2237+
# (``adj.blocks``, ``adj.deferred_static_expressions``, etc.) on the
2238+
# *same* Adjoint object whenever multiple modules reference a shared
2239+
# ``@wp.func``. Holding this lock around the ``ModuleBuilder`` +
2240+
# ``builder.codegen()`` window in :meth:`Module._compile` serialises
2241+
# the Python-side codegen so concurrent ``Module.load`` calls (e.g.
2242+
# from :func:`force_load` with ``max_workers > 1``) don't interleave
2243+
# function emissions in the per-module .cu output. The expensive
2244+
# nvrtc / nvcc invocation runs after the lock is released, so loading
2245+
# N modules in parallel still parallelises the compiler step (the
2246+
# dominant cost) -- only the much cheaper codegen serialises.
2247+
#
2248+
# Re-entrant so nested ``ModuleBuilder`` calls (e.g. the dummy build
2249+
# inside :meth:`Module.get_module_hash`) on the same thread don't
2250+
# deadlock.
2251+
_codegen_lock = threading.RLock()
2252+
2253+
22362254
class ModuleBuilder:
22372255
def __init__(self, module, options, hasher=None):
22382256
self.functions = {}
@@ -2811,7 +2829,12 @@ def hash_module(self) -> bytes:
28112829
"""
28122830
block_dim = self.options["block_dim"]
28132831
options = self.resolve_options(warp.config)
2814-
self.hashers[block_dim] = ModuleHasher(self._get_live_kernels(), options)
2832+
# ``ModuleHasher.__init__`` calls ``hash_kernel`` -> ``hash_adjoint``
2833+
# which reads shared ``@wp.func`` adjoint state; serialise with
2834+
# ``_codegen_lock`` so concurrent ``Module.load`` callers don't
2835+
# interleave hash + build mutations on the same Adjoint.
2836+
with _codegen_lock:
2837+
self.hashers[block_dim] = ModuleHasher(self._get_live_kernels(), options)
28152838
self.resolved_options[block_dim] = options
28162839
return self.hashers[block_dim].get_hash()
28172840

@@ -2823,16 +2846,27 @@ def get_module_hash(self, block_dim: int | None = None) -> bytes:
28232846
if block_dim is None:
28242847
block_dim = self.options["block_dim"]
28252848

2826-
if self.has_unresolved_static_expressions:
2827-
options = self.resolve_options(warp.config)
2828-
builder_options = options | {"output_arch": None}
2829-
_ = ModuleBuilder(self, builder_options)
2830-
self.has_unresolved_static_expressions = False
2831-
2832-
if block_dim not in self.hashers:
2833-
options = self.resolve_options(warp.config)
2834-
self.hashers[block_dim] = ModuleHasher(self._get_live_kernels(), options)
2835-
self.resolved_options[block_dim] = options
2849+
# Both branches below mutate shared ``@wp.func`` adjoint state
2850+
# (``ModuleBuilder`` runs ``adj.build`` to resolve deferred
2851+
# ``wp.static`` expressions; ``ModuleHasher`` reads the
2852+
# resulting adjoint blocks via ``hash_adjoint``). The two
2853+
# operations are stages of one logical "compute module hash"
2854+
# critical section, so they live in a single ``_codegen_lock``
2855+
# block. Splitting the lock per stage opens a window where
2856+
# another thread can re-run ``adj.build`` on a shared helper
2857+
# and clobber the state this thread is about to hash.
2858+
if self.has_unresolved_static_expressions or block_dim not in self.hashers:
2859+
with _codegen_lock:
2860+
if self.has_unresolved_static_expressions:
2861+
options = self.resolve_options(warp.config)
2862+
builder_options = options | {"output_arch": None}
2863+
_ = ModuleBuilder(self, builder_options)
2864+
self.has_unresolved_static_expressions = False
2865+
2866+
if block_dim not in self.hashers:
2867+
options = self.resolve_options(warp.config)
2868+
self.hashers[block_dim] = ModuleHasher(self._get_live_kernels(), options)
2869+
self.resolved_options[block_dim] = options
28362870

28372871
return self.hashers[block_dim].get_hash()
28382872

@@ -3002,13 +3036,6 @@ def _compile(
30023036
):
30033037
return False
30043038

3005-
# Some of the tile codegen, such as cuFFTDx and cuBLASDx, requires knowledge of the target arch
3006-
builder = ModuleBuilder(
3007-
self,
3008-
options,
3009-
hasher=self.hashers.get(options["block_dim"], None),
3010-
)
3011-
30123039
meta_path = os.path.join(output_dir, self._get_meta_name())
30133040

30143041
build_dir = os.path.normpath(output_dir) + f"_p{os.getpid()}_t{threading.get_ident()}"
@@ -3025,20 +3052,46 @@ def _compile(
30253052
if opt != 3 and not is_cpu and runtime.toolkit_version is not None and runtime.toolkit_version < (12, 9):
30263053
log_warning("Optimization level other than 3 has no effect on CUDA versions prior to 12.9.", once=True)
30273054

3028-
# build CPU
3029-
if is_cpu:
3030-
# build
3031-
try:
3032-
source_code_path = os.path.join(build_dir, f"{module_name_short}.cpp")
3055+
# Python codegen window -- LOCKED. See ``_codegen_lock``.
3056+
# Some of the tile codegen, such as cuFFTDx and cuBLASDx,
3057+
# requires knowledge of the target arch. Snapshot builder
3058+
# collections needed by ``build_cuda`` below into locals so
3059+
# nothing inside the lock is touched after release.
3060+
#
3061+
# NOTE: this lock window is intentionally outside the
3062+
# ``failed_builds`` try/except below. ``ModuleBuilder`` can
3063+
# legitimately raise from ``adj.build`` (e.g. user kernels
3064+
# with type mismatches in the type-mismatch error tests); if
3065+
# we recorded those in ``failed_builds`` the next ``Module.load``
3066+
# on the same device short-circuits with ``return None`` and
3067+
# subsequent unrelated kernels in the same module silently
3068+
# fail to launch. Only the heavy native compile records
3069+
# ``failed_builds``.
3070+
with _codegen_lock:
3071+
builder = ModuleBuilder(
3072+
self,
3073+
options,
3074+
hasher=self.hashers.get(options["block_dim"], None),
3075+
)
3076+
if is_cpu:
3077+
source_code_ext = "cpp"
3078+
source_str = builder.codegen("cpu")
3079+
else:
3080+
source_code_ext = "cu"
3081+
source_str = builder.codegen("cuda")
3082+
meta = builder.build_meta()
3083+
ltoir_values = list(builder.ltoirs.values())
3084+
fatbin_values = list(builder.fatbins.values())
30333085

3034-
# write cpp sources
3035-
cpp_source = builder.codegen("cpu")
3086+
source_code_path = os.path.join(build_dir, f"{module_name_short}.{source_code_ext}")
30363087

3037-
with open(source_code_path, "w") as cpp_file:
3038-
cpp_file.write(cpp_source)
3088+
with open(source_code_path, "w") as source_file:
3089+
source_file.write(source_str)
30393090

3040-
output_path = os.path.join(build_dir, output_name)
3091+
output_path = os.path.join(build_dir, output_name)
30413092

3093+
try:
3094+
if is_cpu:
30423095
# build object code
30433096
with warp.ScopedTimer(
30443097
"Compile x86", active=(warp.config.verbose or warp.config.log_level <= warp.LOG_DEBUG)
@@ -3058,28 +3111,7 @@ def _compile(
30583111
block_dim=options["block_dim"],
30593112
enable_tiles_in_stack_memory=options["enable_tiles_in_stack_memory"],
30603113
)
3061-
3062-
except Exception as e:
3063-
if isinstance(e, FileNotFoundError):
3064-
_check_and_raise_long_path_error(e)
3065-
3066-
self.failed_builds.add(None)
3067-
3068-
raise (e)
3069-
3070-
else:
3071-
# build
3072-
try:
3073-
source_code_path = os.path.join(build_dir, f"{module_name_short}.cu")
3074-
3075-
# write cuda sources
3076-
cu_source = builder.codegen("cuda")
3077-
3078-
with open(source_code_path, "w") as cu_file:
3079-
cu_file.write(cu_source)
3080-
3081-
output_path = os.path.join(build_dir, output_name)
3082-
3114+
else:
30833115
# generate PTX or CUBIN
30843116
with warp.ScopedTimer(
30853117
f"Compile CUDA (arch={options['output_arch']}{arch_suffix}, mode={mode}, block_dim={options['block_dim']})",
@@ -3096,27 +3128,28 @@ def _compile(
30963128
fuse_fp=options["fuse_fp"],
30973129
lineinfo=options["lineinfo"],
30983130
compile_time_trace=options["compile_time_trace"],
3099-
ltoirs=builder.ltoirs.values(),
3100-
fatbins=builder.fatbins.values(),
3131+
ltoirs=ltoir_values,
3132+
fatbins=fatbin_values,
31013133
arch_suffix=arch_suffix,
31023134
pch_dir=runtime.get_nvrtc_pch_dir(),
31033135
llvm_cuda=options["llvm_cuda"],
31043136
use_precompiled_headers=options["use_precompiled_headers"],
31053137
)
31063138

3107-
except Exception as e:
3108-
if isinstance(e, FileNotFoundError):
3109-
_check_and_raise_long_path_error(e)
3139+
except Exception as e:
3140+
if isinstance(e, FileNotFoundError):
3141+
_check_and_raise_long_path_error(e)
31103142

3111-
if device:
3112-
self.failed_builds.add(device.context)
3143+
if is_cpu:
3144+
self.failed_builds.add(None)
3145+
elif device:
3146+
self.failed_builds.add(device.context)
31133147

3114-
raise (e)
3148+
raise (e)
31153149

31163150
# ------------------------------------------------------------
3117-
# build meta data
3151+
# write meta data (already produced under ``_codegen_lock``)
31183152

3119-
meta = builder.build_meta()
31203153
output_meta_path = os.path.join(build_dir, self._get_meta_name())
31213154

31223155
with open(output_meta_path, "w") as meta_file:

warp/tests/test_module_parallel_load.py

Lines changed: 160 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,44 @@
1313

1414
import warp as wp
1515

16+
# Chain of shared ``@wp.func`` helpers used by
17+
# ``TestParallelLoadSharedHelper`` below. Multiple helpers calling each
18+
# other transitively maximise the race surface: every kernel build
19+
# walks the entire chain, so concurrent ``adj.build`` calls have many
20+
# adjoint-state mutations to interleave with one another.
21+
22+
23+
@wp.func
24+
def _race_leaf(x: float) -> float:
25+
a = wp.sin(x) + wp.cos(x)
26+
a = a * a + 0.5
27+
a = wp.sqrt(wp.abs(a) + 1.0)
28+
b = wp.tan(x * 0.5) + wp.exp(-x * x * 0.01)
29+
return a + b
30+
31+
32+
@wp.func
33+
def _race_mid(x: float, k: int) -> float:
34+
s = float(0.0)
35+
for _ in range(4):
36+
s = s + _race_leaf(x + s)
37+
if k > 0:
38+
s = s * float(k)
39+
else:
40+
s = -s
41+
return s
42+
43+
44+
@wp.func
45+
def _race_helper(x: float, idx: int) -> float:
46+
y = _race_mid(x, idx)
47+
z = float(idx) + 1.0
48+
if idx % 2 == 0:
49+
y = y + _race_leaf(z)
50+
else:
51+
y = y - _race_leaf(z)
52+
return y
53+
1654

1755
def _generate_module_code(index):
1856
"""Generate source code for a module with a simple kernel.
@@ -121,5 +159,127 @@ def test_force_load_single_module(self):
121159
_assert_modules_loaded_on_cpu(self, modules)
122160

123161

162+
def _assert_modules_loaded_on_cuda(test, modules, device):
163+
for m in modules:
164+
ctx = device.context
165+
loaded = any(c == ctx for (c, _block_dim) in m.execs.keys())
166+
test.assertTrue(loaded, f"Module {m.name} was not loaded on {device}")
167+
168+
169+
@unittest.skipUnless(wp.is_cuda_available(), "CUDA codegen path race needs a CUDA device")
170+
class TestParallelLoadSharedHelper(unittest.TestCase):
171+
"""Regression test for the cross-module codegen race on CUDA.
172+
173+
Each ``ModuleBuilder`` walks the reachable ``@wp.func`` graph and
174+
calls ``adj.build`` on each helper, which mutates per-Adjoint state
175+
(``adj.blocks``, ``adj.deferred_static_expressions``, ...). When
176+
two modules that reference the *same* helper build concurrently
177+
(e.g., via ``wp.force_load(max_workers > 1)``), without a lock
178+
around the codegen window the threads interleave their writes to
179+
the shared adjoint and the emitted .cu file has corrupt sections
180+
-- one helper's body emitted inside another, or references to
181+
``var_*`` / ``_idx`` / ``dim`` that were never declared. nvrtc
182+
then rejects the file with dozens of syntax errors and
183+
``Module._compile`` raises.
184+
185+
The race reproduces reliably only on the CUDA codegen path (not
186+
CPU) because the CUDA emit walks a longer per-function path: it
187+
additionally emits the device-side adjoint stub, snapshots
188+
``adj.blocks[0].body_replay`` for the reverse-mode glue, and
189+
reads ``options['enable_backward']`` from the codegen-time
190+
global -- giving more interleaving opportunities for two
191+
threads building the same shared helper.
192+
193+
Reproducing the race reliably requires:
194+
195+
* a non-trivial *call graph* of shared ``@wp.func`` helpers
196+
(``_race_helper`` -> ``_race_mid`` -> ``_race_leaf``) so each
197+
module's ``ModuleBuilder`` does meaningful shared adjoint work;
198+
* enough modules submitted to ``force_load`` so the
199+
``ThreadPoolExecutor`` actually runs several builds
200+
concurrently;
201+
* a retry loop (``ATTEMPTS``) so we accept a probabilistic
202+
reproduction -- the race is timing-dependent.
203+
204+
With the codegen lock in place every attempt succeeds. Without
205+
the lock at least one attempt raises a build error.
206+
"""
207+
208+
NUM_MODULES = 8
209+
ATTEMPTS = 4
210+
211+
@staticmethod
212+
def _make_kernel(idx: int):
213+
# ``module="unique"`` puts every factory output in its own
214+
# ``Module`` object; if N kernels share a helper, N separate
215+
# modules each try to inline that helper's adjoint at compile
216+
# time -- the exact pattern PhoneX hits with its singleworld
217+
# factory.
218+
@wp.kernel(module="unique")
219+
def k(out: wp.array(dtype=float)):
220+
tid = wp.tid()
221+
x = float(tid) + float(idx)
222+
out[tid] = _race_helper(x, idx)
223+
224+
return k
225+
226+
def _build_kernels(self, attempt: int):
227+
# Vary the spawn count per attempt so each invocation builds
228+
# a fresh set of unique modules -- after the first attempt the
229+
# earlier kernels' modules are already loaded, so subsequent
230+
# ``force_load`` calls would short-circuit on the hash check
231+
# and never exercise the codegen path again.
232+
offset = attempt * 1000
233+
kernels = [self._make_kernel(offset + i) for i in range(self.NUM_MODULES)]
234+
modules: list = []
235+
seen: set = set()
236+
for k in kernels:
237+
k.module.mark_modified()
238+
if id(k.module) in seen:
239+
continue
240+
seen.add(id(k.module))
241+
modules.append(k.module)
242+
return modules
243+
244+
def test_force_load_parallel_with_shared_func(self):
245+
"""N modules sharing a chain of ``@wp.func`` helpers must load
246+
successfully under ``max_workers > 1``. Without the codegen
247+
lock at least one of the ``ATTEMPTS`` parallel CUDA builds
248+
raises because the shared helpers' adjoints were clobbered
249+
mid-build."""
250+
device = wp.get_preferred_device()
251+
for attempt in range(self.ATTEMPTS):
252+
modules = self._build_kernels(attempt)
253+
try:
254+
wp.force_load(device=device, modules=modules, max_workers=self.NUM_MODULES)
255+
except Exception as e:
256+
self.fail(
257+
f"attempt {attempt}: parallel build raised {type(e).__name__}: {e}. "
258+
"Check the ``_codegen_lock`` window in warp._src.context.Module._compile."
259+
)
260+
_assert_modules_loaded_on_cuda(self, modules, device)
261+
262+
def test_force_load_parallel_with_shared_func_high_concurrency(self):
263+
"""Same race but with more modules than worker threads, so the
264+
``ThreadPoolExecutor`` queues tasks and reuses workers between
265+
builds. ``force_load`` submits exactly ``len(devices) * len(modules)``
266+
tasks, so to actually change scheduling vs. the basic test we
267+
submit 2x the modules and cap ``max_workers`` below that count
268+
-- this forces real queueing/contention rather than a
269+
thread-per-task fan-out."""
270+
device = wp.get_preferred_device()
271+
max_workers = max(2, self.NUM_MODULES // 2)
272+
for attempt in range(self.ATTEMPTS):
273+
modules = self._build_kernels(attempt + 100) + self._build_kernels(attempt + 200)
274+
self.assertGreater(len(modules), max_workers)
275+
try:
276+
wp.force_load(device=device, modules=modules, max_workers=max_workers)
277+
except Exception as e:
278+
self.fail(
279+
f"attempt {attempt}: parallel build raised {type(e).__name__}: {e} (high-concurrency variant)."
280+
)
281+
_assert_modules_loaded_on_cuda(self, modules, device)
282+
283+
124284
if __name__ == "__main__":
125285
unittest.main(verbosity=2)

0 commit comments

Comments
 (0)