Fix flaky bolt_thrustjit_test caused by a race condition in LLVM 13's RTDyldObjectLinkingLayer::onObjEmit.#396
Open
frankobe wants to merge 1 commit intobytedance:mainfrom
Open
Conversation
…, and MemMgrs assertion In LLVM 13's RTDyldObjectLinkingLayer::onObjEmit, notifyEmitted() dispatches a separate task that unblocks lookup() before withResourceKeyDo() stores the MemMgr. If cache eviction calls rt->remove() in that window, handleRemoveResources cannot find the MemMgr, notifyFreeingObject is never called, and memory leaks permanently. The later withResourceKeyDo sees a defunct tracker and errors out. Symptoms: - Flaky ASSERT_LT(GetMemoryUsage(), 2*LIMIT) in cacheLimit test - "JIT session error: Resource tracker became defunct" - ~RTDyldObjectLinkingLayer asserts MemMgrs.empty() on shutdown Fix: - Add per-module emit fence: NotifyEmitted callback marks the ResourceKey as pending; setDispatchTask wrapper signals completion after T->run() returns (guaranteeing withResourceKeyDo has finished). CompileModule waits only on its own key, not the entire thread pool. - Reorder ~ThrustJIT: wait for thread pool before clearing cache and ending session, preventing the same race during shutdown. - Add shutting_down_ gate to reject CompileModule calls during destruction. - Fix optimize_layer_.add() error handling: return nullptr on failure instead of falling through to lookup() in an inconsistent state. - Add concurrentEvictionStress test (16 threads x 32 modules, tiny cache). - Add folly::Benchmark for emit-fence latency comparison. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1aa4b28 to
adade7b
Compare
bolt_thrustjit_test caused by a race condition in LLVM 13's RTDyldObjectLinkingLayer::onObjEmit.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem does this PR solve?
Issue Number: #151
Type of Change
Description
Root cause: In LLVM 13's
onObjEmit, the execution order is:R.notifyEmitted()— dispatches a separate thread pool task that unblockslookup()notifyObjectLoaded()— EventListener callbackNotifyEmittedcallbackR.withResourceKeyDo()— stores MemMgr in the layer'sMemMgrsmapAfter step 1,
lookup()returns on the caller thread while steps 2–4 are still running on the pool thread. If cache eviction then callsrt->remove()→handleRemoveResources, it cannot find the MemMgr (not stored yet), sonotifyFreeingObjectis never called and memory leaks permanently. The laterwithResourceKeyDosees a defunct tracker and errors out.Symptoms:
ASSERT_LT(jit->GetMemoryUsage(), 2*LIMIT)fails with2408 vs 2048JIT session error: Resource tracker 0x... became defunct~RTDyldObjectLinkingLayerassertsMemMgrs.empty()on shutdownFix — per-module emit fence:
setNotifyEmittedto mark the ResourceKey as pending (step 3)setDispatchTaskto signal completion afterT->run()returns (step 5, after step 4)CompileModulewaits only on its own ResourceKey — not the entire thread poolAdditional fixes:
~ThrustJIT:compile_threads_.wait()beforelruCache_.clear()/endSession()shutting_down_gate forkPerPoolfence mode to prevent races during destructionoptimize_layer_.add()error handling: returnnullptron failure instead of falling through tolookup()Performance Impact
No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).
Positive Impact: I have run benchmarks.
Click to view Benchmark Results
Per-module fence: <1% overhead sequential, 2.6% faster than pool barrier under concurrency.
Negative Impact: Explained below (e.g., trade-off for correctness).
Release Note
Checklist (For Author)
Breaking Changes
No
Yes (Description: ...)
Click to view Breaking Changes