Fix flaky bolt_thrustjit_test caused by a race condition in LLVM 13's RTDyldObjectLinkingLayer::onObjEmit.#396

Open
frankobe wants to merge 1 commit into bytedance:main from frankobe:fix/jit-onObjEmit-race

frankobe (Collaborator) commented Mar 14, 2026

What problem does this PR solve?

Issue Number: #151

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

Root cause: In LLVM 13's onObjEmit, the execution order is:

  1. R.notifyEmitted() — dispatches a separate thread pool task that unblocks lookup()
  2. notifyObjectLoaded() — EventListener callback
  3. NotifyEmitted callback
  4. R.withResourceKeyDo() — stores MemMgr in the layer's MemMgrs map

After step 1, lookup() returns on the caller thread while steps 2–4 are still running on the pool thread. If cache eviction then calls rt->remove() → handleRemoveResources, it cannot find the MemMgr (not stored yet), so notifyFreeingObject is never called and the memory leaks permanently. The later withResourceKeyDo sees a defunct tracker and errors out.
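The two interleavings can be replayed deterministically with a toy model. This is only an illustrative sketch: `ToyLayer`, `removeResources`, `storeMemMgr`, and the 64-byte size are hypothetical stand-ins, not the LLVM ORC API, and the real race involves two threads rather than two call orders.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

// Toy stand-ins only: none of these types are the LLVM ORC API.
using ResourceKey = std::uintptr_t;

struct ToyLayer {
    std::map<ResourceKey, int> memMgrs;  // key -> bytes owned by its MemMgr
    std::set<ResourceKey> defunct;       // trackers that were already removed
    int freedBytes = 0;
    int leakedBytes = 0;

    // Models handleRemoveResources: it can only free what is already registered.
    void removeResources(ResourceKey key) {
        auto it = memMgrs.find(key);
        if (it != memMgrs.end()) {
            freedBytes += it->second;
            memMgrs.erase(it);
        }
        defunct.insert(key);
    }

    // Models step 4 (withResourceKeyDo): fails once the tracker is defunct.
    bool storeMemMgr(ResourceKey key, int bytes) {
        assert(bytes > 0);
        if (defunct.count(key) != 0) {
            leakedBytes += bytes;  // nobody will ever free this allocation
            return false;
        }
        memMgrs[key] = bytes;
        return true;
    }
};

// Eviction lands between step 1 and step 4: the racy interleaving.
inline ToyLayer evictBeforeStore() {
    ToyLayer layer;
    layer.removeResources(1);  // lookup() has returned, cache evicts the module
    layer.storeMemMgr(1, 64);  // step 4 arrives too late
    return layer;
}

// Eviction lands after step 4: the interleaving the fix enforces.
inline ToyLayer evictAfterStore() {
    ToyLayer layer;
    layer.storeMemMgr(1, 64);
    layer.removeResources(1);
    return layer;
}
```

In the first ordering the bytes are counted as leaked and the store reports failure, which corresponds to the "became defunct" error; in the second they are freed and the map ends up empty.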

Symptoms:

  • ASSERT_LT(jit->GetMemoryUsage(), 2*LIMIT) fails with 2408 vs 2048
  • JIT session error: Resource tracker 0x... became defunct
  • ~RTDyldObjectLinkingLayer asserts MemMgrs.empty() on shutdown

Fix — per-module emit fence:

  • Hook setNotifyEmitted to mark the ResourceKey as pending (step 3)
  • Wrap setDispatchTask to signal completion after T->run() returns (step 5, after step 4)
  • Thread-local state links steps 3 and 5 (same pool thread, same task)
  • CompileModule waits only on its own ResourceKey — not the entire thread pool
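A minimal sketch of how such a per-key fence could look, using only the standard library. `EmitFence`, `TaskLocal`, and `runTask` are hypothetical names chosen for illustration and are not taken from the patch. One deliberate difference from a plain "pending" flag: `wait` blocks until an explicit done mark exists, so a waiter that arrives before step 3 still blocks.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <unordered_set>
#include <utility>

using ResourceKey = std::uintptr_t;  // stand-in for the ORC resource key

// Per-thread note of which key the currently running task emitted.
struct TaskLocal {
    bool hasKey = false;
    ResourceKey key = 0;
};
inline TaskLocal& taskLocal() {
    thread_local TaskLocal state;
    return state;
}

class EmitFence {
public:
    // Step 3: called from the NotifyEmitted hook on the pool thread.
    void markPending(ResourceKey key) {
        assert(!taskLocal().hasKey && "sketch assumes one emit per task");
        taskLocal().hasKey = true;
        taskLocal().key = key;
    }

    // Step 5: called by the dispatch wrapper after the task body returned.
    void taskFinished() {
        TaskLocal& local = taskLocal();
        if (!local.hasKey) return;  // this task emitted nothing
        {
            std::lock_guard<std::mutex> guard(mutex_);
            done_.insert(local.key);
        }
        local.hasKey = false;
        cv_.notify_all();
    }

    bool isDone(ResourceKey key) {
        std::lock_guard<std::mutex> guard(mutex_);
        return done_.count(key) != 0;
    }

    // Blocks until `key` is marked done, then consumes the mark.
    void wait(ResourceKey key) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return done_.count(key) != 0; });
        done_.erase(key);
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::unordered_set<ResourceKey> done_;
};

// Stand-in for the setDispatchTask wrapper: run the task, then signal.
template <class Task>
void runTask(EmitFence& fence, Task&& task) {
    std::forward<Task>(task)();
    fence.taskFinished();
}
```

In this sketch the compiling thread would call `fence.wait(key)` after lookup() returns and before the module becomes eligible for eviction. Because the wait is keyed, unrelated compilations on the same pool are not blocked.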

Additional fixes:

  • Reorder ~ThrustJIT: compile_threads_.wait() before lruCache_.clear() / endSession()
  • shutting_down_ gate for kPerPool fence mode to prevent races during destruction
  • optimize_layer_.add() error handling: return nullptr on failure instead of falling through to lookup()
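Why the destructor order matters can be shown with a small single-threaded model. This is a sketch under stated assumptions: `ToySession`, `enqueueEmit`, and `drain` are hypothetical, the queue stands in for in-flight pool tasks, and the map stands in for the layer's MemMgrs.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

struct ToySession {
    std::vector<std::function<void()>> queued;  // stands in for in-flight pool tasks
    std::map<int, int> registered;              // stands in for the layer's MemMgrs map

    // An emit task that registers its entry when it eventually runs.
    void enqueueEmit(int key) {
        queued.push_back([this, key] { registered[key] = 1; });
    }

    // Models compile_threads_.wait(): run everything still outstanding.
    void drain() {
        for (auto& task : queued) task();
        queued.clear();
    }

    // Returns how many entries survive shutdown; nonzero models the
    // MemMgrs.empty() assertion firing.
    std::size_t shutdown(bool waitFirst) {
        if (waitFirst) drain();
        registered.clear();       // models lruCache_.clear() / endSession()
        if (!waitFirst) drain();  // late tasks land after cleanup
        return registered.size();
    }
};
```

With `waitFirst` set, nothing is left registered after cleanup; without it, the late task re-populates the map after it was cleared.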

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

    Click to view Benchmark Results
    ============================================================================
    ThrustJitBenchmark.cpp                          relative  time/iter   iters/s
    ============================================================================
    SeqCompile_NoFence                                          1.47ms    678.84
    SeqCompile_PerPool                              100.92%     1.46ms    685.08
    SeqCompile_PerModule                            100.91%     1.46ms    684.99
    ----------------------------------------------------------------------------
    SeqCompileEvict_NoFence                                     1.46ms    683.03
    SeqCompileEvict_PerPool                          99.09%     1.48ms    676.79
    SeqCompileEvict_PerModule                        99.35%     1.47ms    678.57
    ----------------------------------------------------------------------------
    ConcCompile_NoFence                                        12.70ms     78.74
    ConcCompile_PerPool                              98.76%    12.86ms     77.77
    ConcCompile_PerModule                           102.56%    12.38ms     80.76
    

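    Per-module fence: <1% overhead in the sequential cases. Under concurrency it measures 2.6% faster than the no-fence baseline and about 3.8% faster than the pool barrier (80.76 vs 77.77 iters/s).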

  • Negative Impact: Explained below (e.g., trade-off for correctness).
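The percentages in the benchmark table can be rechecked from the iters/s column. The helper below is a hypothetical illustration and assumes the "relative" column reports throughput relative to the first row of each group, which matches the numbers shown.

```cpp
#include <cassert>

// Throughput of `candidate` as a percentage of `baseline` (both in iters/s).
inline double relativePercent(double baselineItersPerSec,
                              double candidateItersPerSec) {
    assert(baselineItersPerSec > 0.0);
    return 100.0 * candidateItersPerSec / baselineItersPerSec;
}
```

For example, `relativePercent(78.74, 80.76)` compares ConcCompile_PerModule against ConcCompile_NoFence, and `relativePercent(77.77, 80.76)` compares it against ConcCompile_PerPool. Differences of this size may be within run-to-run noise.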

Release Note

Release Note:
- Fixed flaky JIT test and shutdown crash caused by a race between onObjEmit and cache eviction in LLVM 13's RTDyldObjectLinkingLayer.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)

    Click to view Breaking Changes
    Breaking Changes:
    - Description of the breaking change.
    - Possible solutions or workarounds.
    - Any other relevant information.
    

…, and MemMgrs assertion

In LLVM 13's RTDyldObjectLinkingLayer::onObjEmit, notifyEmitted() dispatches
a separate task that unblocks lookup() before withResourceKeyDo() stores the
MemMgr. If cache eviction calls rt->remove() in that window,
handleRemoveResources cannot find the MemMgr, notifyFreeingObject is never
called, and memory leaks permanently. The later withResourceKeyDo sees a
defunct tracker and errors out.

Symptoms:
- Flaky ASSERT_LT(GetMemoryUsage(), 2*LIMIT) in cacheLimit test
- "JIT session error: Resource tracker became defunct"
- ~RTDyldObjectLinkingLayer asserts MemMgrs.empty() on shutdown

Fix:
- Add per-module emit fence: NotifyEmitted callback marks the ResourceKey as
  pending; setDispatchTask wrapper signals completion after T->run() returns
  (guaranteeing withResourceKeyDo has finished). CompileModule waits only on
  its own key, not the entire thread pool.
- Reorder ~ThrustJIT: wait for thread pool before clearing cache and ending
  session, preventing the same race during shutdown.
- Add shutting_down_ gate to reject CompileModule calls during destruction.
- Fix optimize_layer_.add() error handling: return nullptr on failure instead
  of falling through to lookup() in an inconsistent state.
- Add concurrentEvictionStress test (16 threads x 32 modules, tiny cache).
- Add folly::Benchmark for emit-fence latency comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
frankobe force-pushed the fix/jit-onObjEmit-race branch from 1aa4b28 to adade7b on March 14, 2026 07:00
frankobe requested a review from kexianda on March 14, 2026 07:01
frankobe changed the title from "Fix race in onObjEmit causing flaky JIT test and MemMgrs assertion" to "Fix flaky bolt_thrustjit_test caused by a race condition in LLVM 13's RTDyldObjectLinkingLayer::onObjEmit." on Mar 14, 2026