Fix dirty address list OOM with halted SMP CPUs #189
Open
garybeihl wants to merge 1 commit into renode:master from
When one CPU is halted (e.g., SMP boot with nosmp), the halted CPU never fetches dirty addresses. The shared per-architecture list grew unbounded (134M+ entries observed) because TryReduceBroadcastedDirtyAddresses could not advance past the halted CPU's unread index.

Replace the shared list + index tracking with per-consumer HashSets. AppendDirtyAddresses now adds pages directly to each same-architecture consumer's set, skipping halted CPUs and marking them for a full TLB flush on resume. GetNewDirtyAddressesForCore returns null when a full flush is needed, which TranslationCPU handles by calling TlibInvalidateTranslationCache.

This eliminates the unbounded memory growth and the O(n) RemoveRange operations that were also a performance bottleneck.

Signed-off-by: Gary Beihl <garybeihl@microsoft.com>
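The per-consumer design described in the commit message can be sketched as a small toy model. This is not the Renode C# code: the class name, the `register` and `set_halted` helpers, and the choice to skip the CPU that produced the dirty pages are all assumptions made for illustration. Only the overall behavior (per-consumer sets, skipping halted CPUs, returning a null value to request a full flush) comes from the description above.

```python
class DirtyAddressBroadcaster:
    """Toy model of per-consumer dirty-page sets (hypothetical names)."""

    def __init__(self):
        self._arch = {}                 # cpu_id -> architecture name
        self._pending = {}              # cpu_id -> set of dirty page addresses
        self._halted = set()            # cpu_ids that are currently halted
        self._needs_full_flush = set()  # cpu_ids that missed updates while halted

    def register(self, cpu_id, arch):
        self._arch[cpu_id] = arch
        self._pending[cpu_id] = set()

    def set_halted(self, cpu_id, halted):
        if halted:
            self._halted.add(cpu_id)
        else:
            self._halted.discard(cpu_id)

    def append_dirty_addresses(self, source_id, pages):
        arch = self._arch[source_id]
        for cpu_id, cpu_arch in self._arch.items():
            # Assumption: the producing CPU does not need its own pages back.
            if cpu_id == source_id or cpu_arch != arch:
                continue
            if cpu_id in self._halted:
                # Queue nothing for a halted CPU; request a full flush instead.
                self._needs_full_flush.add(cpu_id)
                self._pending[cpu_id].clear()
            else:
                # A set deduplicates repeated writes to the same page.
                self._pending[cpu_id].update(pages)

    def get_new_dirty_addresses_for_core(self, cpu_id):
        """Return the pending pages, or None when a full flush is required."""
        if cpu_id in self._needs_full_flush:
            self._needs_full_flush.discard(cpu_id)
            self._pending[cpu_id].clear()
            return None
        pages = sorted(self._pending[cpu_id])
        self._pending[cpu_id].clear()
        return pages
```

In this model a halted consumer costs constant memory no matter how many pages the running CPU dirties. A caller that receives `None` would invalidate its whole translation cache, which is the role the commit message assigns to TlibInvalidateTranslationCache.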
Summary
When one CPU is halted (e.g., booting with `nosmp`), its dirty address index never advances, preventing `TryReduceBroadcastedDirtyAddresses` from trimming the shared list. With a running CPU continuously dirtying pages, the list grows unbounded (134M+ entries observed), eventually causing OOM.

The new design adds pages directly to each same-architecture consumer's HashSet, skipping halted CPUs. On resume, the skipped CPU gets a full TLB flush via `TlibInvalidateTranslationCache` instead of replaying millions of stale entries.

Discovered while booting Linux on an AST2600 (dual Cortex-A7) with `nosmp maxcpus=1`.

Test plan

- Boot with `nosmp maxcpus=1`: no OOM after extended runtime
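The failure mode this test plan guards against can be reproduced in a toy model of the previous design. This is a sketch built only from the description in this pull request: a shared list plus one read index per consumer, trimmed up to the smallest index. The class and function names are hypothetical and the real Renode bookkeeping may differ.

```python
class SharedDirtyList:
    """Toy model of a shared dirty-address list with per-consumer read indices."""

    def __init__(self, consumer_ids):
        self.entries = []
        self.read_index = {cpu_id: 0 for cpu_id in consumer_ids}

    def append(self, pages):
        self.entries.extend(pages)

    def fetch(self, cpu_id):
        # A running CPU reads everything it has not seen yet.
        start = self.read_index[cpu_id]
        self.read_index[cpu_id] = len(self.entries)
        return self.entries[start:]

    def try_reduce(self):
        # Only the prefix that every consumer has already read can be dropped.
        consumed = min(self.read_index.values())
        if consumed == 0:
            return False
        del self.entries[:consumed]  # O(n) removal, similar in cost to RemoveRange
        for cpu_id in self.read_index:
            self.read_index[cpu_id] -= consumed
        return True


def simulate(rounds, halted_cpu=None):
    """Return the list length after `rounds` dirty pages with two consumers."""
    shared = SharedDirtyList([0, 1])
    for i in range(rounds):
        shared.append([i * 4096])
        for cpu_id in (0, 1):
            if cpu_id != halted_cpu:
                shared.fetch(cpu_id)
        shared.try_reduce()
    return len(shared.entries)


print(simulate(1000))                # both CPUs running: list stays trimmed
print(simulate(1000, halted_cpu=1))  # CPU 1 halted: nothing is ever trimmed
```

With both consumers fetching, the list is emptied every round. With one consumer halted, its index stays at zero, so the minimum never advances and the list keeps one entry for every page dirtied.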