Fix dirty address list OOM with halted SMP CPUs #189

Open
garybeihl wants to merge 1 commit into renode:master from garybeihl:fix-dirty-address-oom

@garybeihl

Summary

  • Fix unbounded memory growth when one CPU is halted (e.g., SMP boot with nosmp)
  • Replace shared list + index tracking with per-consumer HashSets
  • Skip halted CPUs during dirty address broadcast, mark them for full TLB flush on resume

When one CPU is halted, its dirty address index never advances, which prevents TryReduceBroadcastedDirtyAddresses from trimming the shared list. While a running CPU keeps dirtying pages, the list grows without bound (134M+ entries observed) and eventually exhausts memory (OOM).
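
The failure mode can be reproduced with a toy model. The sketch below is not the Renode code: `SharedDirtyList`, `try_reduce`, and `simulate` are hypothetical names, and it assumes only what is described above, namely one shared list, one read index per CPU, and trimming limited to entries that every CPU has already read.

```python
class SharedDirtyList:
    """Toy model (hypothetical): one shared list plus a read index per CPU."""

    def __init__(self, cpu_ids):
        self.addresses = []
        self.read_index = {cpu: 0 for cpu in cpu_ids}

    def append(self, pages):
        self.addresses.extend(pages)

    def fetch(self, cpu):
        # Return everything this CPU has not seen yet and advance its index.
        start = self.read_index[cpu]
        new_pages = self.addresses[start:]
        self.read_index[cpu] = len(self.addresses)
        return new_pages

    def try_reduce(self):
        # Only the prefix that every CPU has read can be dropped.
        consumed = min(self.read_index.values())
        if consumed > 0:
            del self.addresses[:consumed]
            for cpu in self.read_index:
                self.read_index[cpu] -= consumed
        return consumed


def simulate(rounds, halted_cpu=None):
    """Dirty one page per round; return the final length of the shared list."""
    shared = SharedDirtyList(cpu_ids=[0, 1])
    for i in range(rounds):
        shared.append([i * 0x1000])
        for cpu in (0, 1):
            if cpu != halted_cpu:  # a halted CPU never fetches
                shared.fetch(cpu)
        shared.try_reduce()
    return len(shared.addresses)
```

With both CPUs running, `simulate(1000)` ends with an empty list. With `halted_cpu=1`, the minimum read index stays at zero, nothing is ever trimmed, and the list holds one entry per round.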

The new design adds pages directly to each same-architecture consumer's HashSet, skipping halted CPUs. On resume, the skipped CPU gets a full TLB flush via TlibInvalidateTranslationCache instead of replaying millions of stale entries.
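
A minimal sketch of the per-consumer design, written as a toy model rather than the actual patch. The class and method names (`DirtyAddressBroadcaster`, `register`, `set_halted`) are assumptions made for illustration, and so is the choice to deliver pages to every same-architecture consumer without excluding the CPU that dirtied them.

```python
class DirtyAddressBroadcaster:
    """Toy model (hypothetical): one pending set per consumer CPU."""

    def __init__(self):
        self.arch = {}                 # cpu id -> architecture name
        self.pending = {}              # cpu id -> set of dirty page addresses
        self.halted = set()            # cpu ids that are currently halted
        self.needs_full_flush = set()  # cpu ids that missed at least one broadcast

    def register(self, cpu, arch):
        self.arch[cpu] = arch
        self.pending[cpu] = set()

    def set_halted(self, cpu, halted):
        if halted:
            self.halted.add(cpu)
        else:
            self.halted.discard(cpu)

    def append_dirty_addresses(self, source_arch, pages):
        for cpu, arch in self.arch.items():
            if arch != source_arch:
                continue
            if cpu in self.halted:
                # Store nothing for a halted CPU; remember that it must flush.
                self.needs_full_flush.add(cpu)
                self.pending[cpu].clear()
                continue
            self.pending[cpu].update(pages)

    def get_new_dirty_addresses(self, cpu):
        # None means "individual pages were not tracked, do a full flush".
        if cpu in self.needs_full_flush:
            self.needs_full_flush.discard(cpu)
            self.pending[cpu].clear()
            return None
        pages = self.pending[cpu]
        self.pending[cpu] = set()
        return pages
```

In this model a halted CPU costs a single flag no matter how many pages are dirtied while it is stopped, and a set also collapses repeated writes to the same page into one entry.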

Discovered while booting Linux on an AST2600 (dual Cortex-A7) with nosmp maxcpus=1.

Test plan

  • Boot Linux SMP kernel with nosmp maxcpus=1 — no OOM after extended runtime
  • Boot Linux SMP kernel normally — both CPUs share dirty addresses correctly
  • Halt/resume a CPU — resumed CPU gets full TLB flush

When one CPU is halted (e.g., SMP boot with nosmp), the halted CPU
never fetches dirty addresses. The shared per-architecture list grew
unbounded (134M+ entries observed) because TryReduceBroadcastedDirtyAddresses
could not advance past the halted CPU's unread index.

Replace the shared list + index tracking with per-consumer HashSets.
AppendDirtyAddresses now adds pages directly to each same-architecture
consumer's set, skipping halted CPUs and marking them for a full TLB
flush on resume. GetNewDirtyAddressesForCore returns null when a full
flush is needed, which TranslationCPU handles by calling
TlibInvalidateTranslationCache.
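
On the consumer side, the null return acts as a sentinel. The sketch below is an illustration only: `apply_dirty_addresses` and its three callbacks are hypothetical stand-ins, and what the real TranslationCPU does with a non-null result is not stated here, so per-page invalidation is an assumption.

```python
def apply_dirty_addresses(fetch_new_dirty, invalidate_page, invalidate_all):
    """Toy consumer (hypothetical): None from the fetch means a full flush."""
    pages = fetch_new_dirty()
    if pages is None:
        # Stand-in for the TlibInvalidateTranslationCache call described above.
        invalidate_all()
        return "full"
    for page in pages:
        # Assumed behavior: invalidate only what was reported dirty.
        invalidate_page(page)
    return "partial"
```

An empty set and `None` mean different things in this sketch: the first is "nothing changed", the second is "changes were not tracked".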

This eliminates the unbounded memory growth and the O(n) RemoveRange
operations that were also a performance bottleneck.

Signed-off-by: Gary Beihl <garybeihl@microsoft.com>
@CLAassistant
CLAassistant commented Mar 19, 2026

CLA assistant check
All committers have signed the CLA.