Code huge pages on lld-style PIE binaries (sublime, Discord, slack, libjvm) #5
The hugifyr transformation aims to make the kernel grant code huge pages on the binary's executable LOAD. For that to work, every 2MB chunk that LOAD RE touches must be exclusively RE: if a non-exec LOAD's vaddr range overlaps any of those chunks, mmap-order overlay mixes protections and the kernel can't issue a code huge page on it.

Add a parser for readelf -lW LOAD entries plus check_re_chunk_isolation, which asserts no other LOAD's vaddr range intersects an RE 2MB chunk. Wire it into test_basic. The check fires loudly if a future change to the layout pass picks a vaddr_delta that's just large enough to land .text on a 2MB boundary but not large enough to push subsequent LOADs out of RE's last chunk, i.e. start-aligning instead of end-aligning the executable segment.

Also add test_load_layouts, which builds test1.c with default ld and with -Wl,-z,noseparate-code (Oracle JDK-style combined R+E first segment) and verifies hugifyr produces a runnable binary for each. The lld-style layout (rodata in seg0, used by Chromium-based apps and Sublime Text) isn't covered here because hugifyr's main path doesn't currently handle it: shifting .text without also shifting seg0's rodata breaks RIP-relative LEAs from code into rodata. Fixing that while keeping end-alignment is separate work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
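A minimal sketch of the strict chunk-isolation idea described in this commit message, assuming the usual one-row-per-segment layout of `readelf -lW`. The names `parse_load_rows`, `chunks_touched` and `shared_chunks` are hypothetical and are not the helpers in tests/test.py; the sample rows are made up for illustration.

```python
import re

HUGE = 0x200000  # 2 MB

# One LOAD row of `readelf -lW`:
#   Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD_ROW = re.compile(
    r"^\s*LOAD\s+(0x[0-9a-f]+)\s+(0x[0-9a-f]+)\s+(0x[0-9a-f]+)\s+"
    r"(0x[0-9a-f]+)\s+(0x[0-9a-f]+)\s+([RWE ]{3})\s+(0x[0-9a-f]+)"
)


def parse_load_rows(text):
    """Return one dict per LOAD row found in `readelf -lW` output."""
    loads = []
    for line in text.splitlines():
        m = LOAD_ROW.match(line)
        if not m:
            continue
        offset, vaddr, _paddr, filesz, memsz = (
            int(v, 16) for v in m.group(1, 2, 3, 4, 5)
        )
        loads.append({
            "offset": offset, "vaddr": vaddr, "filesz": filesz, "memsz": memsz,
            "flags": m.group(6).replace(" ", ""),
        })
    return loads


def chunks_touched(load, huge=HUGE):
    """Indices of every 2 MB chunk that the LOAD's vaddr range overlaps."""
    first = load["vaddr"] // huge
    last = (load["vaddr"] + load["memsz"] - 1) // huge
    return set(range(first, last + 1))


def shared_chunks(loads, huge=HUGE):
    """Map each non-exec LOAD's vaddr to the RE chunks it also touches."""
    exec_load = next(l for l in loads if "E" in l["flags"])
    re_chunks = chunks_touched(exec_load, huge)
    shared = {}
    for load in loads:
        if load is exec_load or load["memsz"] == 0:
            continue
        overlap = re_chunks & chunks_touched(load, huge)
        if overlap:
            shared[load["vaddr"]] = sorted(overlap)
    return shared


# Made-up end-aligned layout: LOAD RE ends exactly at 0x400000.
SAMPLE = """\
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x001000 0x001000 R   0x1000
  LOAD           0x1ff000 0x00000000003ff000 0x00000000003ff000 0x001000 0x001000 R E 0x200000
  LOAD           0x200000 0x0000000000400000 0x0000000000400000 0x000300 0x000400 RW  0x1000
"""

loads = parse_load_rows(SAMPLE)
print(shared_chunks(loads))  # empty dict: no LOAD shares a chunk with LOAD RE
```

With a start-aligned layout (LOAD RE at 0x200000 followed directly by LOAD RW at 0x201000), `shared_chunks` would report chunk 1 as shared, which is the regression the commit message says the check is meant to catch.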
The main shifting path crashes lld-style PIE binaries (Sublime Text,
Discord, Slack, MS Edge, Chrome, MongoDB) because their first read-only
LOAD ("seg0") carries .rodata / .eh_frame_hdr / .eh_frame /
.gcc_except_table — sections that .text RIP-references via direct LEA
displacements with no relocation entries. Shifting .text without
shifting those sections invalidates every cross-segment LEA and the
binary segfaults during dl_main / unwinder init.
This commit doesn't fix that fully — moving seg0's rodata into a
shifted segment with end-alignment preserved is structurally bigger
work. It establishes the necessary precondition: a safe transformation
that runs on lld-style binaries, leaves them runnable, and ensures the
exec LOAD's p_offset and p_vaddr have the same residue modulo 2MB.
Detection: seg0_has_movable_sections() walks sections at vaddrs below
the first PT_X LOAD's p_vaddr. Anything SHF_ALLOC, not SHT_NOBITS, and
not in the existing relocatable_section_types whitelist (which already
covers .dynsym, .gnu.hash, .rela.*, .dynamic, .interp, .note.*) is
considered RIP-referenced from code => the binary is lld-style. The
whitelist is conservative; unknown section types route to the safe
padding-only path rather than to the shifting path.
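The detection rule above can be sketched in Python over a list of section records. This is an illustration only: the real check lives in C, and the whitelist is described as type-based, whereas this sketch matches the section names listed in the paragraph (an assumption made because .interp and .rodata share the same section type). The function name and the sample section table are hypothetical.

```python
SHF_ALLOC = 0x2      # section occupies memory at run time
SHF_EXECINSTR = 0x4
SHT_PROGBITS = 1
SHT_NOTE = 7
SHT_NOBITS = 8       # no bytes in the file (.bss-like)
SHT_DYNSYM = 11

# Assumption: name prefixes standing in for relocatable_section_types.
METADATA_PREFIXES = (".dynsym", ".gnu.hash", ".rela.", ".dynamic", ".interp", ".note.")


def seg0_movable_sections(sections, first_exec_vaddr):
    """Names of sections below the exec LOAD that must move with .text."""
    movable = []
    for sec in sections:
        if sec["addr"] >= first_exec_vaddr:
            continue                      # not in seg0
        if not sec["flags"] & SHF_ALLOC:
            continue                      # not mapped at run time
        if sec["type"] == SHT_NOBITS:
            continue                      # nothing in the file to move
        if sec["name"].startswith(METADATA_PREFIXES):
            continue                      # whitelisted metadata
        movable.append(sec["name"])
    return movable


# Made-up lld-style section table: rodata and unwind tables precede .text.
lld_style = [
    {"name": ".interp", "type": SHT_PROGBITS, "flags": SHF_ALLOC, "addr": 0x2A8},
    {"name": ".note.gnu.build-id", "type": SHT_NOTE, "flags": SHF_ALLOC, "addr": 0x2C4},
    {"name": ".dynsym", "type": SHT_DYNSYM, "flags": SHF_ALLOC, "addr": 0x2E8},
    {"name": ".rodata", "type": SHT_PROGBITS, "flags": SHF_ALLOC, "addr": 0x1000},
    {"name": ".eh_frame_hdr", "type": SHT_PROGBITS, "flags": SHF_ALLOC, "addr": 0x3000},
    {"name": ".eh_frame", "type": SHT_PROGBITS, "flags": SHF_ALLOC, "addr": 0x3400},
    {"name": ".text", "type": SHT_PROGBITS,
     "flags": SHF_ALLOC | SHF_EXECINSTR, "addr": 0x5000},
    {"name": ".comment", "type": SHT_PROGBITS, "flags": 0, "addr": 0},
]

movable = seg0_movable_sections(lld_style, first_exec_vaddr=0x5000)
print(movable)  # a non-empty list marks the binary as lld-style
```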
Padding-only path: pad_offset_to_match_vaddr() computes
delta = (p_vaddr_RE - p_offset_RE) mod 2MB, bumps p_offset of every
phdr at-or-after the original exec offset by delta, bumps every
section's sh_offset similarly, bumps e_shoff, and stamps the first
LOAD's p_align to 2MB. It does NOT touch any p_vaddr / sh_addr /
relocations / symbols / DWARF / build-id. The output is byte-identical
to the input except for the inserted file padding and the updated
offset fields. The transformed binary runs identically to the
original.
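A worked example of the delta arithmetic, as a sketch rather than the C implementation. The helper names and the sample offsets are hypothetical; only the formula delta = (p_vaddr_RE - p_offset_RE) mod 2MB and the "bump everything at-or-after the exec offset" rule come from the description above.

```python
HUGE = 0x200000  # 2 MB


def padding_delta(p_vaddr, p_offset, huge=HUGE):
    """Bytes of file padding that make p_offset congruent to p_vaddr mod 2 MB."""
    return (p_vaddr - p_offset) % huge


def bump_offsets(offsets, exec_offset, delta):
    """Shift every file offset at-or-after the original exec offset by delta."""
    return [off + delta if off >= exec_offset else off for off in offsets]


# Made-up lld-style numbers: the exec LOAD sits one page further in vaddr
# space than in the file.
exec_vaddr, exec_offset = 0x6F1000, 0x6F0000
delta = padding_delta(exec_vaddr, exec_offset)
new_offsets = bump_offsets([0x0, 0x6F0000, 0x800000], exec_offset, delta)

print(hex(delta), [hex(o) for o in new_offsets])
```

Here delta is one page (0x1000), the first offset is left alone because it lies before the exec segment, and the exec LOAD's new offset has the same residue modulo 2MB as its vaddr.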
Tests:
- check_offset_vaddr_mod_2mb_match: asserts p_offset%2MB ==
p_vaddr%2MB on the exec LOAD. Wired into test_basic and every
test_load_layouts variant.
- test_load_layouts gets the lld variant back (built with
-fuse-ld=lld); it now exercises the new padding-only path.
Verified on real-world closed-source PIE binaries we already had
downloaded:
- Sublime Text 4180: --version → "Sublime Text Build 4180" matches
- Discord 0.0.135: matches
- Slack 4.42.117: matches
- MEGAsync (modern): main path, matches
- Cisco Webex CEF: main path, matches
- cloudflared/terraform: ET_EXEC fallback, unchanged
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… path)

The padding-only path added in ceec465 only fixed the file-side mod-2MB alignment of LOAD RE without changing any vaddr, so lld-style binaries became correct but never huge-page-eligible. This commit replaces it with a transformation that does enable code huge pages on lld-style PIE.

What's new:
- AdjInfo carries a list of "movable seg0" vaddr ranges: sections in seg0 that are SHF_ALLOC, non-NOBITS, and NOT in relocatable_section_types (.rodata, .eh_frame, .eh_frame_hdr, .gcc_except_table). calc_adjusted_addr remaps addresses inside those ranges by the same vaddr_delta as everything at-or-after old_exec_vaddr, so RIP-relative LEAs from .text into .rodata stay valid after the shift. Empty for non-lld binaries (the existing behavior).
- adjust_program_headers extends seg0 LOAD R's filesz/memsz to cover the shifted seg0 contents, clamps LOAD RE's p_vaddr to max(round_down(p_vaddr,2MB), seg0_end_after_shift) so seg0 LOAD R and LOAD RE never overlap in vaddr space, and shifts PT_GNU_EH_FRAME (which targets a movable .eh_frame_hdr).
- adjust_section_headers shifts sh_offset for movable seg0 sections; seg0 has p_vaddr == p_offset == 0, so the file delta equals vaddr_delta.
- segment_offset_delta for lld-style is exec_p_vaddr_clamped - old_p_offset (LOAD RE's file region starts where extended seg0 ends); section_offset_delta accounts for the clamp so every section in LOAD RE has sh_offset_new - sh_addr_new == p_offset_new - p_vaddr_clamped (kernel constraint for a single LOAD's file mapping).
- pad_segment_start now fills the gap between the last non-exec section and the first executable section in LOAD RE, never below p_vaddr or over metadata. This avoids clobbering ELF header / PHDR / .interp / .note in the 2-LOAD R+E first ("combined" -z noseparate-code) layout.
- pad_offset_to_match_vaddr removed.

For modern PIE (4-LOAD with metadata-only seg0) and 2-LOAD R+E first ("combined") the new code is a no-op via the seg0_end_after_shift = 0 clamp degeneration.
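The address remap and the clamp can be illustrated with a small Python sketch. It borrows the names calc_adjusted_addr, vaddr_delta, old_exec_vaddr and seg0_end_after_shift from the commit message, but the signatures, the sample addresses, and the choice of which p_vaddr is passed to the clamp are assumptions; the real code is C in src/hugifyr.c.

```python
HUGE = 0x200000  # 2 MB


def round_down(value, align):
    return value & ~(align - 1)


def calc_adjusted_addr(addr, old_exec_vaddr, vaddr_delta, movable_ranges):
    """Shift addresses at-or-after the exec LOAD and inside movable seg0 ranges."""
    if addr >= old_exec_vaddr:
        return addr + vaddr_delta
    for start, end in movable_ranges:
        if start <= addr < end:
            return addr + vaddr_delta
    return addr  # seg0 metadata stays where it was


def clamp_exec_vaddr(p_vaddr, seg0_end_after_shift, huge=HUGE):
    """Keep LOAD RE from starting below the end of the extended seg0."""
    return max(round_down(p_vaddr, huge), seg0_end_after_shift)


# Made-up layout: .rodata in seg0, a LEA in .text pointing into it.
old_exec_vaddr, vaddr_delta = 0x6F1000, 0x10F000
rodata = (0x1000, 0x6F0000)
lea_site, target = 0x6F1234, 0x2040


def displacement(ranges):
    """RIP-relative distance from the LEA to its target after the shift."""
    return (calc_adjusted_addr(target, old_exec_vaddr, vaddr_delta, ranges)
            - calc_adjusted_addr(lea_site, old_exec_vaddr, vaddr_delta, ranges))


disp_before = target - lea_site
disp_with_ranges = displacement([rodata])   # rodata moves with .text
disp_without = displacement([])             # only .text moves

print(disp_with_ranges == disp_before, disp_without == disp_before)
print(hex(clamp_exec_vaddr(0x5FF000, 0x4A0000)), hex(clamp_exec_vaddr(0x5FF000, 0)))
```

With the movable range registered, the displacement encoded in the instruction is still correct after the shift; without it, the displacement is off by vaddr_delta. The second print shows the clamp taking effect for a non-zero seg0 end and collapsing to plain round_down when seg0_end_after_shift is 0.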
Tests:
- check_segment_alignment unchanged for the modern path.
- New check_exec_load_end_aligned: every variant must have LOAD RE's end on a 2MB boundary.
- check_re_chunk_isolation relaxed to require only that fully-covered 2MB chunks be exclusive code (partial chunks at the start/end of LOAD RE can legitimately share their range with adjacent LOAD R / LOAD RW).
- All three checks (offset/vaddr-mod, end-aligned, chunk-isolation) wired into every test_load_layouts variant including lld.

Verification:
- test_basic + test_load_layouts (default, combined, lld) + TLS + TLS-relocs all pass.
- Real-world smoke test on lld-style PIE: sublime_text (Build 4180), Discord (0.0.135), slack (4.42.117) all run identically; LOAD RE ends at 2MB, full chunks isolated.
- libjvm.so (Oracle JDK 21.0.11, 2-LOAD R+E first / 20MB code) runs the full Java workload (JIT, GC, Streams, ConcurrentHashMap, Executors, recursion) bit-identical to the original.
- Booted under /boot/vmlinuz-6.14.11rothp (READ_ONLY_THP_FOR_FS=y) with the hugified libjvm.so: THPeligible=1 (was 0 on host), khugepaged collapsed 16384 kB into 8 file-PMD-mapped 2MB pages on the libjvm.so r-xp mapping after running the workload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
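A sketch of what the relaxed chunk-isolation rule could look like: only 2MB chunks that LOAD RE covers completely are required to be free of other LOADs. The function names and the sample layouts are hypothetical and do not reproduce the code in tests/test.py.

```python
HUGE = 0x200000  # 2 MB


def full_chunks(load, huge=HUGE):
    """Indices of 2 MB chunks lying entirely inside the LOAD's vaddr range."""
    first = -(-load["vaddr"] // huge)               # round the start up
    end = (load["vaddr"] + load["memsz"]) // huge   # round the end down (exclusive)
    return range(first, end)


def overlaps_chunk(load, chunk, huge=HUGE):
    lo, hi = chunk * huge, (chunk + 1) * huge
    return load["memsz"] > 0 and load["vaddr"] < hi and load["vaddr"] + load["memsz"] > lo


def full_chunk_violations(loads, huge=HUGE):
    """(vaddr, chunk) pairs where another LOAD intrudes on a fully-covered RE chunk."""
    exec_load = next(l for l in loads if "E" in l["flags"])
    return [
        (other["vaddr"], chunk)
        for chunk in full_chunks(exec_load, huge)
        for other in loads
        if other is not exec_load and overlaps_chunk(other, chunk, huge)
    ]


# Made-up lld-style result: LOAD RE starts mid-chunk right after seg0
# and ends on a 2 MB boundary.
lld_like = [
    {"flags": "R", "vaddr": 0x0, "memsz": 0x4A0000},
    {"flags": "RE", "vaddr": 0x4A0000, "memsz": 0x360000},
    {"flags": "RW", "vaddr": 0x800000, "memsz": 0x1000},
]

print(list(full_chunks(lld_like[1])), full_chunk_violations(lld_like))
```

In this layout seg0 and LOAD RE share the partial chunk starting at 0x400000, which the strict rule would reject, while the one fully-covered chunk (0x600000 to 0x800000) is exclusive code and the relaxed rule passes.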
Pull request overview
Adds lld-style PIE support to hugifyr’s huge-page transformation by shifting RIP-referenced seg0 content alongside the exec LOAD (instead of shifting .text alone), and strengthens regression coverage for multiple PT_LOAD layout shapes.
Changes:
- Extend address-adjustment logic to also shift “movable” seg0 section ranges (lld-style layouts) and clamp exec LOAD start to avoid vaddr overlap.
- Recompute offset deltas against the clamped exec p_vaddr, update PHDR/SHDR adjustments accordingly, and refine exec-segment padding behavior.
- Expand test harness with readelf-based validations for offset/vaddr modulo matching, exec LOAD end alignment, and 2MB chunk isolation across layout variants.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/hugifyr.c | Implements lld-style PIE handling via movable seg0 ranges, exec p_vaddr clamping, and updated PHDR/SHDR/offset adjustment logic. |
| tests/test.py | Adds regression checks for exec LOAD modulo constraint, end alignment, and full-chunk isolation; adds a layout-variant test matrix (default/combined/lld). |
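The offset-delta recomputation listed in the overview has to preserve one invariant that the commit message states for LOAD RE: every file-backed section inside the segment satisfies sh_offset - sh_addr == p_offset - p_vaddr. A minimal Python sketch of such a check follows; the function name, the dict keys, and the sample values are hypothetical.

```python
def sections_off_bias(load, sections):
    """Names of sections inside `load` whose file/vaddr bias differs from the segment's."""
    bias = load["offset"] - load["vaddr"]
    end = load["vaddr"] + load["filesz"]   # only the file-backed part of the segment
    return [
        sec["name"]
        for sec in sections
        if load["vaddr"] <= sec["addr"] < end
        and sec["offset"] - sec["addr"] != bias
    ]


# Made-up exec LOAD whose offset and vaddr agree modulo 2 MB.
exec_load = {"offset": 0x4A1000, "vaddr": 0x6A1000, "filesz": 0x15F000}
sections = [
    {"name": ".text", "addr": 0x6A1000, "offset": 0x4A1000},   # consistent
    {"name": ".fini", "addr": 0x7FF000, "offset": 0x5FE000},   # off by one page
    {"name": ".data", "addr": 0x800000, "offset": 0x600000},   # outside LOAD RE
]

print(sections_off_bias(exec_load, sections))
```

Only the section whose offset was not moved by the same amount as its address is reported; sections outside the segment are ignored.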
    phdr->p_vaddr = new_pvaddr;
    phdr->p_paddr = new_pvaddr;
    if (shrink > 0 && i == (size_t)info->exec_index) {
        if (phdr->p_memsz > shrink) phdr->p_memsz -= shrink;
        if (phdr->p_filesz > shrink) phdr->p_filesz -= shrink;
In the exec LOAD clamping path, when shrink > 0 you only subtract shrink from p_memsz/p_filesz if the size is greater than shrink. If shrink >= p_memsz this leaves the segment end shifted forward (breaking the intended end-alignment and potentially mapping unintended file bytes). This should be handled explicitly (e.g., validate shrink <= p_memsz and fail, or clamp sizes to 0 / recompute sizes consistently for the new p_vaddr).
Suggested change:

    - phdr->p_vaddr = new_pvaddr;
    - phdr->p_paddr = new_pvaddr;
    - if (shrink > 0 && i == (size_t)info->exec_index) {
    -     if (phdr->p_memsz > shrink) phdr->p_memsz -= shrink;
    -     if (phdr->p_filesz > shrink) phdr->p_filesz -= shrink;
    + if (shrink > 0 && i == (size_t)info->exec_index) {
    +     if (shrink > phdr->p_memsz || shrink > phdr->p_filesz) {
    +         pr_error("Invalid executable LOAD clamp: shrink=%ju memsz=%ju filesz=%ju\n",
    +                  (uintmax_t)shrink, (uintmax_t)phdr->p_memsz,
    +                  (uintmax_t)phdr->p_filesz);
    +         abort();
    +     }
    + }
    + phdr->p_vaddr = new_pvaddr;
    + phdr->p_paddr = new_pvaddr;
    + if (shrink > 0 && i == (size_t)info->exec_index) {
    +     phdr->p_memsz -= shrink;
    +     phdr->p_filesz -= shrink;
    def check_exec_load_end_aligned(filename, huge=0x200000):
        """Verify that for the executable LOAD, p_vaddr + p_memsz lands on a
        2MB boundary AND p_align is 2MB. The END being aligned is what makes
        the last code huge page eligible; it's required regardless of
        whether the START is also 2MB-aligned. (lld-style transformed
        binaries have a non-aligned p_vaddr clamped to seg0_end_after_shift,
        but the end is still extended to a 2MB boundary.)"""
        loads = parse_load_segments(filename)
        exec_load = next((l for l in loads if 'E' in l['flags']), None)
        if not exec_load:
            raise RuntimeError(f"{filename}: no executable LOAD")
        end = exec_load['vaddr'] + exec_load['memsz']
        if end % huge != 0:
            raise RuntimeError(
                f"{filename}: exec LOAD end 0x{end:x} (vaddr=0x{exec_load['vaddr']:x} + "
                f"memsz=0x{exec_load['memsz']:x}) is not 2MB-aligned")
        print(f"exec LOAD end 2MB-aligned OK in {filename} "
check_exec_load_end_aligned()'s docstring says it verifies both end alignment and that p_align is 2MB, but the implementation never parses or asserts p_align. Either update the docstring to match what’s actually checked, or extend parse_load_segments() to capture p_align and assert it here (so test_load_layouts() also enforces the alignment requirement).
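One way the second option could look, as a sketch only: it assumes parse_load_segments() were extended to return an 'align' key per LOAD (it does not today, per the comment above), and it takes the parsed list directly so the example runs without readelf. Whether the right condition is equality with 2MB or merely a multiple of it is a judgment call; equality is used here because that is what the docstring claims.

```python
def check_exec_load_end_and_align(loads, huge=0x200000):
    """Assert both properties the docstring promises: end alignment and p_align."""
    exec_load = next((l for l in loads if 'E' in l['flags']), None)
    if not exec_load:
        raise RuntimeError("no executable LOAD")
    end = exec_load['vaddr'] + exec_load['memsz']
    if end % huge != 0:
        raise RuntimeError(f"exec LOAD end 0x{end:x} is not 2MB-aligned")
    # 'align' is a hypothetical key that the parser would need to capture.
    if exec_load['align'] != huge:
        raise RuntimeError(
            f"exec LOAD p_align 0x{exec_load['align']:x} is not 0x{huge:x}")
    return True


# Made-up parsed LOAD list: end on a 2 MB boundary, p_align of 2 MB.
good = [
    {'flags': 'R', 'vaddr': 0x0, 'memsz': 0x1000, 'align': 0x1000},
    {'flags': 'RE', 'vaddr': 0x3FF000, 'memsz': 0x1000, 'align': 0x200000},
]
print(check_exec_load_end_and_align(good))
```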
    main shifting path would crash this binary. Routes through the new
    padding-only path that establishes
    p_offset%2MB == p_vaddr%2MB for the exec LOAD without changing any
    vaddrs. Output runs identically to the original."""
The test_load_layouts() docstring still describes the lld variant as going through a “padding-only path” that keeps vaddrs unchanged. In the current implementation (hugifyr.c) the lld-style case shifts seg0 movable sections and clamps the exec LOAD start, so vaddrs do change. Please update this docstring to reflect the current approach (section-aware shift / end-aligned clamp) so future readers don’t infer the wrong safety properties.
Suggested change:

    - main shifting path would crash this binary. Routes through the new
    - padding-only path that establishes
    - p_offset%2MB == p_vaddr%2MB for the exec LOAD without changing any
    - vaddrs. Output runs identically to the original."""
    + generic whole-segment shift would crash this binary. Instead, the
    + lld-specific path shifts only movable seg0 sections, then clamps the
    + exec LOAD start to seg0_end_after_shift and pads the file as needed
    + to establish p_offset%2MB == p_vaddr%2MB. Output runs identically to
    + the original."""
…shift # Conflicts: # tests/test.py
…ibjvm) (#5)
* Add chunk-isolation regression test for the exec LOAD's last 2MB
* Padding-only path for lld-style PIE; align p_offset%2MB to p_vaddr%2MB
* lld-style PIE: end-aligned section-aware shift (replaces padding-only path)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Three commits on top of main that add real code-huge-page support across all the LOAD-segment shapes hugifyr currently sees:
- … p_offset % 2MB == p_vaddr % 2MB on the exec LOAD without changing any vaddr. Safe but does not actually enable code huge pages.

The motivation is the lld-style layout: .rodata / .eh_frame* / .gcc_except_table live in seg0 (the first read-only LOAD), and .text reaches them through RIP-relative references. The original main path shifted .text without shifting that data, breaking RIP-relative LEAs at runtime.

What changed in 943d280
- AdjInfo now carries the seg0 movable vaddr ranges (everything in seg0 that's SHF_ALLOC, non-SHT_NOBITS, and not in relocatable_section_types). calc_adjusted_addr shifts addresses inside those ranges by the same vaddr_delta as everything at-or-after the exec LOAD, so relocations / symbols / dynamic-section pointers / PT_GNU_EH_FRAME / DWARF references stay consistent.
- adjust_program_headers extends seg0 LOAD R's filesz/memsz to cover the shifted contents and clamps LOAD RE's p_vaddr to max(round_down(p_vaddr, 2MB), seg0_end_after_shift) so seg0 LOAD R and LOAD RE never overlap in vaddr space.
- segment_offset_delta and section_offset_delta are recomputed against the clamped p_vaddr so sh_offset - sh_addr == p_offset - p_vaddr holds for every section in LOAD RE (kernel constraint for a single file-backed mapping).
- pad_segment_start rewritten to pad only the gap between the last non-executable section in LOAD RE and the first executable section, never over the ELF header / PHDR / .interp / .note*. This fixes the 2-LOAD R+E first ("combined" / -z noseparate-code) layout.
- pad_offset_to_match_vaddr removed.
- New check_exec_load_end_aligned; check_re_chunk_isolation now requires only that fully-covered 2MB chunks be exclusive code; both checks wired into every test_load_layouts variant (default, combined, lld).

For modern 4-LOAD PIE and 2-LOAD R+E first ("combined") the new code is a no-op: seg0_end_after_shift = 0 collapses the clamp into the existing round_down.

Test plan
- make clean.
- cd tests && python3 test.py: test_basic, test_load_layouts (default / combined / lld), TLS, TLS relocations all pass; every layout variant satisfies offset%2MB == vaddr%2MB, end-aligned, and chunk-isolation.
- … --version output identical to original; LOAD RE ends on 2MB; full-chunk isolation OK.
- … (tmp/jdk-21.0.11/lib/server/libjvm.so, 2-LOAD R+E first / ~20 MB of code): a Java workload exercising JIT, GC, parallel Streams, ConcurrentHashMap, ExecutorService, recursion, and 12 MB of allocation churn returns bit-identical output through the hugified library on the host.
- /boot/vmlinuz-6.14.11rothp VM with READ_ONLY_THP_FOR_FS=y, hugified libjvm.so, THP=always: after running the workload, the libjvm.so r-xp mapping reports THPeligible: 1 (was 0 on host) and FilePmdMapped: 16384 kB, i.e. khugepaged collapsed eight 2 MB file PMDs over the 20 MB exec mapping. Workload output matches host.

🤖 Generated with Claude Code
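The THP observation in the test plan can be read out of /proc/<pid>/smaps. The sketch below parses smaps-formatted text and reports THPeligible and the number of 2 MB file PMDs for the executable mapping of a given library. The function names, the sample text, and the sample path are hypothetical; on a live system the text would come from reading /proc/<pid>/smaps for the running JVM.

```python
import re

# Mapping header: "start-end perms offset dev inode path"
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+(\S+)\s+\S+\s+\S+\s+\S+\s*(.*)$")
# Numeric field: "Name:   value [kB]"
FIELD = re.compile(r"^(\w+):\s+(\d+)(?:\s+kB)?\s*$")


def parse_smaps(text):
    """Split smaps text into mappings, each with its numeric fields."""
    mappings = []
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            mappings.append(
                {"perms": header.group(1), "path": header.group(2), "fields": {}})
            continue
        field = FIELD.match(line)
        if field and mappings:
            mappings[-1]["fields"][field.group(1)] = int(field.group(2))
    return mappings


def exec_mapping_thp(text, name):
    """Return (THPeligible, number of 2 MB file PMDs) for the r-xp mapping of `name`."""
    for mapping in parse_smaps(text):
        if mapping["perms"] == "r-xp" and mapping["path"].endswith(name):
            fields = mapping["fields"]
            return fields.get("THPeligible"), fields.get("FilePmdMapped", 0) // 2048
    return None


# Made-up smaps excerpt shaped like the result reported in the test plan.
SAMPLE = """\
7f0000000000-7f0001400000 r-xp 00000000 08:01 1234                       /opt/jdk/lib/server/libjvm.so
Size:              20480 kB
FilePmdMapped:     16384 kB
THPeligible:           1
VmFlags: rd ex mr mw me
"""

print(exec_mapping_thp(SAMPLE, "libjvm.so"))
```

FilePmdMapped is reported in kB, so dividing by 2048 gives the count of 2 MB pages; 16384 kB corresponds to the eight file PMDs mentioned above.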