optionally use MADV_GUARD_INSTALL for large allocation guard pages#341
Open
thomasbuilds wants to merge 1 commit into
Contributor
This produces a regression for this program. When CONFIG_GUARD_PAGES_USE_MADVISE is false, the program runs normally, but when CONFIG_GUARD_PAGES_USE_MADVISE is true, the malloc after mlockall(MCL_FUTURE | MCL_ONFAULT); fails with errno=22 (EINVAL).
Force-pushed from 9e3e3a6 to f54ee16, then from ded5838 to 35a0009.
Add CONFIG_GUARD_PAGES_USE_MADVISE (default false) to install the guard regions of large allocations with MADV_GUARD_INSTALL (Linux 6.13+) inside a single read-write mapping instead of as separate PROT_NONE mappings, keeping each large allocation to one VMA instead of three. The single-VMA property is preserved through allocate_pages(), allocate_pages_aligned(), the region quarantine and the in-place realloc shrink so it holds under allocation churn, including under CONFIG_LABEL_MEMORY where the quarantined region is named as a whole to avoid splitting the VMA.

Guard install zaps any existing pages in the range, so the quarantine still purges data and frees resident memory with a single system call, the same count as the PROT_NONE remap it replaces; allocation costs one extra system call (mmap + 2 madvise instead of mmap + mprotect).

Kernel support is probed on a fresh mapping at runtime and cached. Guard installation is best-effort: any madvise failure falls back to the PROT_NONE scheme. EINVAL means the specific mapping can't be guarded (VM_LOCKED), so it resets the cached state to force a re-probe: under mlockall(MCL_FUTURE) the probe mapping is itself locked and latches the feature off, while freeing a one-off mlock'd allocation only loses the single call. errno is preserved across the fallback.

It is off by default because the guard bytes and quarantined regions are then accounted as committed memory (resident memory and total address space are unchanged), which regresses strict overcommit (vm.overcommit_memory=2).

Add large allocation guard regression tests covering underflow, aligned allocation overflow/underflow and the in-place realloc shrink paths, which apply to both guard schemes, and build the new configuration in CI.
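As a rough illustration of the scheme the commit message describes, here is a minimal sketch of a single read-write mapping with guard markers at both ends and a best-effort fallback. It is not the PR's code: `alloc_with_guards` is a hypothetical name, the runtime probe and cached feature state are left out, and the fallback simply `mprotect`s the guard ranges, which may differ from how the allocator's existing PROT_NONE scheme is structured. The fallback definition of `MADV_GUARD_INSTALL` as 102 assumes the generic Linux UAPI value, for headers that predate 6.13.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102 /* assumed generic UAPI value (Linux 6.13+) */
#endif

/* Hypothetical helper: map [guard][usable][guard] as one read-write mapping
 * and mark the two ends as guard regions. Returns a pointer to the usable
 * region, or NULL on failure. *used_madvise reports which scheme was used.
 * Both sizes must be multiples of the page size. */
static void *alloc_with_guards(size_t usable, size_t guard, bool *used_madvise) {
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && usable % (size_t)page == 0 && guard % (size_t)page == 0);

    size_t total = guard + usable + guard;
    char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        return NULL;
    }

    /* Preferred path: page-table guard markers, the mapping stays one VMA. */
    int saved_errno = errno;
    if (madvise(base, guard, MADV_GUARD_INSTALL) == 0 &&
        madvise(base + guard + usable, guard, MADV_GUARD_INSTALL) == 0) {
        *used_madvise = true;
        return base + guard;
    }
    errno = saved_errno; /* do not leak the madvise failure to the caller */

    /* Fallback: PROT_NONE guards, which split the mapping into three VMAs. */
    if (mprotect(base, guard, PROT_NONE) != 0 ||
        mprotect(base + guard + usable, guard, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }
    *used_madvise = false;
    return base + guard;
}
```

A caller would pass page-multiple sizes, for example `alloc_with_guards(4 * page, page, &used)`, and later unmap the whole range starting at `p - guard`. Under either scheme, an access just below or just past the usable region should fault.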
Force-pushed from 35a0009 to 6252879.
Contributor
Author
Thanks @rdevshp, the PR has been updated quite a lot since your last review.
Addresses the high-VMA-count concern from `KERNEL_FEATURE_WISHLIST.md` (see #258). `MADV_GUARD_INSTALL` (Linux 6.13+) lets guard regions live inside a single read-write mapping at the page-table level instead of as separate `PROT_NONE` VMAs.

Change
Adds `CONFIG_GUARD_PAGES_USE_MADVISE` (default false). When enabled, guard regions for large allocations are installed with `MADV_GUARD_INSTALL` inside one read-write mapping rather than carved out as separate `PROT_NONE` mappings, keeping each large allocation to a single VMA instead of three. This is applied in `allocate_pages()`, `allocate_pages_aligned()`, the region quarantine, and the in-place realloc shrink, so the single-VMA property holds under allocation churn rather than only for live allocations. It also holds across the `mremap` growth path: guard markers move with the mapping and the moved body merges into the never-faulted destination fragments (verified on 6.17). For aligned allocations the installed guards sit exactly adjacent to the usable region, giving the clean `[guard][usable][guard]` layout discussed in #350.

Syscall cost:
`MADV_GUARD_INSTALL` zaps any existing pages in the range, so no separate `MADV_DONTNEED` is needed and the quarantine and the shrink path stay at one syscall each (one `madvise` instead of one `mmap`). Only allocation pays one extra syscall (`mmap` + 2x `madvise` instead of `mmap` + `mprotect`).

Kernel support is probed once at runtime on a fresh mapping and cached. Guard installation is best-effort: any `madvise` failure falls back to the existing `PROT_NONE` scheme rather than failing the allocation, preserving `errno`. `MADV_GUARD_INSTALL` returns `EINVAL` on `VM_LOCKED` mappings; that resets the cached state so the next allocation re-probes. Under `mlockall(MCL_FUTURE)` the probe mapping is itself locked, so the feature latches off rather than being retried per allocation, while freeing a one-off `mlock`'d allocation only loses that single call. Under `CONFIG_LABEL_MEMORY` the quarantined region is labeled as a whole so `PR_SET_VMA_ANON_NAME` does not split the single VMA back into three.

One sharp edge is documented rather than fixed: guard install is not atomic, so if it fails partway through the realloc-shrink path and the `PROT_NONE` fallback also fails (two ENOMEMs back to back), part of the discarded tail may be left guarded or zapped while realloc returns NULL. Failing loudly on a later access is preferred over a `MADV_GUARD_REMOVE` recovery that would silently expose zeroed pages.

Why off by default
In #258 it was noted this would "require having full overcommit enabled if it doesn't reduce the accounted memory", and that is what I measured. Resident memory and total address space are unchanged (`RLIMIT_AS` unaffected), but private-writable commit charge grows, which regresses strict overcommit (`vm.overcommit_memory=2`):

[...] `PROT_NONE` scheme). RSS is still released because guard install zaps the pages (measured at +184 KiB resident after 2560 quarantined 1 MiB frees).

There is also a throughput cost on allocation-rate-bound workloads: guard installation writes a page-table marker for every 4 KiB page and allocates page tables for the guard range, so its cost scales with the randomized guard size, while a `PROT_NONE` reservation populates no page tables at all. Measured: ~-2% for single-threaded churn with pages touched, -26% for pure alloc/free of 256 KiB allocations, ~2x slower in an 8-thread 256 KiB churn stress test (medians of 9 interleaved runs), and several times slower when churning allocations in the tens of MiB, where guards span thousands of pages. Measured TLB shootdown IPIs are slightly lower than with the `PROT_NONE` scheme, so the cost is in-kernel page-table work rather than interrupt traffic. The win is in whole-process operations that scale with VMA count, which is the actual motivation (see below). Hence opt-in rather than a default behavior change, following the `CONFIG_LABEL_MEMORY` precedent of a compile-time option defaulting to false.

Measurements
Linux 6.17 x86_64, 8 cores. 2000 concurrently-live 256 KiB allocations, all pages touched:

[table not preserved]

Adjacent single-VMA allocations merge, so the result is better than one VMA per allocation. VMAs after sustained churn (2560 x 1 MiB alloc/free, full quarantine):

[table not preserved]

That's a ~5.6x reduction under `CONFIG_LABEL_MEMORY` (the Android default), and no more VMAs than the `PROT_NONE` scheme in every config. (Counts vary with the randomized guard sizes.) Whole-process operations with 2000 live allocations: the `/proc/self/smaps` payload drops from 2844 KiB to 46 KiB, so code that does work per VMA scans far less, and VMA-dominated `fork()` latency drops ~32%.

Verification
- [...] `-Werror` under gcc and clang, feature off and on, with and without `CONFIG_LABEL_MEMORY`; the CI matrix (gcc, clang, musl) now also runs the test suite with the feature enabled. All 56 tests pass in every configuration.
- [...] `CONFIG_LABEL_MEMORY`; quarantined, shrunk and mremap-grown regions stay single-VMA.
- `mlockall(MCL_FUTURE)` latches the feature off via the probe and all allocations succeed on the `PROT_NONE` scheme; freeing an `mlock`'d allocation falls back for that call only.
- [...] `madvise` faults with strace: with every call failing `ENOMEM`, failing from the 7th call onward, and failing `EINVAL` intermittently (forcing repeated re-probes and mixed-scheme allocations), the full suite passes and guards still fault through the fallbacks.
- [...] `mlock`'d frees racing the probe) completes ~230k operations with no corruption; the only cross-thread state is the single atomic feature flag.
- [...] `madvise`'s return value, so the feature must be validated on a real kernel: qemu-user silently no-ops `MADV_GUARD_INSTALL`, which would leave large allocations without guards. This is a reason it must stay opt-in.
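For readers who want to check VMA counts on their own kernel, one simple approach is to count the lines of `/proc/self/maps`, since the kernel prints one line per VMA. This is only a sketch and not necessarily how the numbers above were gathered; `count_vmas` is a hypothetical helper name.

```c
#include <stdio.h>

/* Hypothetical helper: return the number of VMAs of the calling process by
 * counting lines in /proc/self/maps (one line per VMA), or -1 on error. */
static long count_vmas(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL) {
        return -1;
    }
    long lines = 0;
    int c;
    while ((c = fgetc(f)) != EOF) {
        if (c == '\n') {
            lines++;
        }
    }
    fclose(f);
    return lines;
}
```

Calling it before and after a batch of large allocations gives the number of VMAs the batch added. Adjacent mappings with identical attributes can merge, so the delta may be smaller than the number of mappings created.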