Commit 56019ed
Fix a codegen race condition GH-1474
* Implement MR comments
* More cleanup
* Polishing
* Undo faulty changes
* Make further adjustments so that no corrupted code is generated when one thread fails during compilation (syntax error, etc.)
* Run ruff
* Test: cross-module codegen race with shared @wp.func
Add a regression test for the codegen race fixed by the previous
commit ("Codegen race fix"). The existing tests in this file build
N modules where each module has its OWN ``@wp.kernel`` and no shared
``@wp.func`` -- so concurrent ``adj.build`` calls touch disjoint
adjoint state and the race never fires.
The bug needs a *shared* helper graph: when M modules each call the
same module-level ``@wp.func`` (and transitively a chain of helpers),
every module's ``ModuleBuilder`` re-walks and mutates the same
``Adjoint`` objects. Without ``_codegen_lock`` two threads land in
``adj.build`` concurrently, interleave their writes to ``adj.blocks``
/ ``adj.deferred_static_expressions``, and the emitted .cu contains
mangled function signatures (``var_5 = _race_helper_0(...)``
assigned a ``void`` return, ``adj__race_helper_0`` called with the
wrong arity, etc.). NVRTC then fails the build with a handful of
syntax errors per module.
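The locking idea above can be illustrated with a stdlib-only toy model. This is a sketch under stated assumptions, not Warp's code: ``ToyAdjoint``, ``_toy_codegen_lock``, ``build_module`` and ``build_all`` are hypothetical names, and the real ``Adjoint`` / ``_codegen_lock`` may be structured differently. It only shows why holding one lock across the whole rebuild of shared state gives every caller a consistent result.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ToyAdjoint:
    """Hypothetical stand-in for a shared adjoint; not Warp's Adjoint class."""

    def __init__(self, name, num_blocks=50):
        self.name = name
        self.num_blocks = num_blocks
        self.blocks = []  # shared scratch state, rebuilt on every build()

    def build(self):
        # Reset and refill the shared state, yielding between steps so that
        # unsynchronized callers would interleave their writes.
        self.blocks = []
        for i in range(self.num_blocks):
            self.blocks.append(f"{self.name}_block_{i}")
            time.sleep(0)
        return list(self.blocks)


# Hypothetical analogue of the lock named in the commit message.
_toy_codegen_lock = threading.Lock()


def build_module(shared_adj):
    # Hold the lock for the entire walk so no other thread can reset
    # or append to the shared state halfway through.
    with _toy_codegen_lock:
        return shared_adj.build()


def build_all(num_modules=8):
    # Every "module" rebuilds the same shared object, as in the bug report.
    shared = ToyAdjoint("helper")
    with ThreadPoolExecutor(max_workers=num_modules) as pool:
        return list(pool.map(build_module, [shared] * num_modules))
```

Removing the ``with _toy_codegen_lock:`` line lets threads reset and refill ``blocks`` at the same time, which is the toy equivalent of the clobbered adjoint state described above.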
Reproducing the race reliably requires three things:
* a chain of shared helpers (``_race_helper`` -> ``_race_mid`` ->
``_race_leaf``) so each module does meaningful shared-adjoint work
-- a single small helper compiles too fast for threads to
interleave;
* enough modules under ``force_load`` (``NUM_MODULES = 8``,
worker count up to ``2 * NUM_MODULES``);
* a small retry loop (``ATTEMPTS = 4``) -- the race is
timing-dependent, so a single parallel build sometimes completes
without triggering it.
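The last two requirements can be sketched as a small stress harness. This is an illustration, not the test added by this commit: ``run_parallel_builds``, ``stress`` and the ``build_one`` callable are hypothetical, the two worker counts are an assumed reading of "up to ``2 * NUM_MODULES``", and a real harness would also have to make each attempt rebuild the modules from scratch.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_MODULES = 8
ATTEMPTS = 4


def run_parallel_builds(build_one, max_workers):
    """Run build_one(i) for every module index; return the exceptions raised."""
    errors = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(build_one, i) for i in range(NUM_MODULES)]
        for fut in futures:
            exc = fut.exception()  # blocks until this future has finished
            if exc is not None:
                errors.append(exc)
    return errors


def stress(build_one):
    # Repeat the parallel build several times: one clean pass proves little
    # when the failure depends on thread timing.
    failures = []
    for attempt in range(ATTEMPTS):
        for workers in (NUM_MODULES, 2 * NUM_MODULES):
            for exc in run_parallel_builds(build_one, workers):
                failures.append((attempt, workers, exc))
    return failures
```

A test built on this would assert that ``stress(...)`` returns an empty list, and would report the collected exceptions otherwise.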
The test is CUDA-only: the CUDA codegen path emits the device-side
adjoint stub + reverse glue in addition to the forward path, giving
threads more interleaving opportunities. CPU codegen also touches
the shared adjoint state but the window is too small to reproduce
on a modern multi-core box. Skipping when CUDA is unavailable is
acceptable -- the race only ever bit a real user on the CUDA path
(PhoenX singleworld kernels).
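Skipping on machines without CUDA can be expressed with a standard ``unittest`` skip decorator. The sketch below is stdlib-only and hypothetical: ``cuda_available`` is a placeholder that always returns ``False``, and ``TestCodegenRace`` / ``run_suite`` are illustrative names. A Warp suite would presumably query the runtime instead (for example via ``wp.is_cuda_available()``), and the actual test may gate on devices differently.

```python
import unittest


def cuda_available():
    # Hypothetical probe; a real suite would ask the GPU runtime
    # rather than return a constant.
    return False


class TestCodegenRace(unittest.TestCase):
    @unittest.skipUnless(cuda_available(), "requires a CUDA device")
    def test_shared_func_race(self):
        # Placeholder body: the parallel build and its assertions go here.
        pass


def run_suite():
    # Run the case programmatically and return the collected result.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCodegenRace)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With the placeholder probe the test is reported as skipped, not failed, so CPU-only CI runs stay green.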
Verification: without the lock the high-concurrency variant
consistently fails with NVRTC error 6 on a ``@wp.func`` adjoint
that was clobbered mid-build; with the lock applied both variants
pass in ~33 s on an RTX 3080 laptop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Codegen race fix
Approved-by: Eric Shi <ershi@nvidia.com>
See merge request omniverse/warp!24131
parent 184c2c4 commit 56019ed
4 files changed
Lines changed: 257 additions & 61 deletions