radlink: link-time performance pass (~1.9x on a Fortnite link)#830
Open
honkstar1 wants to merge 6 commits into
get_cpu_features was the top main-thread hot spot (~5.6s for one Fortnite link), 97% of it inside a single ATOMIC_LOAD(g_cpu_features). On MSVC, blake3_dispatch.c defines ATOMIC_LOAD as _InterlockedOr(&x,0) -- a lock'd RMW (full barrier) run on every BLAKE3 compress dispatch. The value is written once and read-only after, so the barrier is pointless.

Enable BLAKE3's plain-load path (C11 _Atomic, a plain mov on x86) via build flags only, leaving the vendored third_party/blake3 source untouched:

  /std:c11 /experimental:c11atomics -DBLAKE3_ATOMICS=1

Scoped to the radlink target. MSVC C11 atomics need both /std:c11 and /experimental:c11atomics.

get_cpu_features: 5591ms -> 4ms (main thread).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
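For reviewers less familiar with the pattern, here is a minimal sketch of a write-once, read-many feature cache using C11 atomics. It is not the BLAKE3 source: g_features, detect_features_slow, g_detect_calls, and the sentinel value are hypothetical names chosen for illustration. On x86 an atomic load like this typically compiles to a plain mov, whereas an interlocked OR is a locked read-modify-write.

```c
#include <assert.h>
#include <stdatomic.h>

/* Sentinel meaning "not detected yet" (hypothetical value). */
enum { FEATURES_UNDEFINED = 1 << 30 };

/* Written once by whichever thread gets here first, read-only afterwards. */
static _Atomic int g_features = FEATURES_UNDEFINED;

/* Counts slow-path runs so the caching behaviour can be observed. */
int g_detect_calls = 0;

/* Hypothetical stand-in for the real CPUID probing. */
int detect_features_slow(void)
{
  g_detect_calls += 1;
  return 0x7;
}

int get_features(void)
{
  /* Hot path: an atomic load, no read-modify-write. */
  int f = atomic_load(&g_features);
  if (f != FEATURES_UNDEFINED) {
    return f;
  }
  /* Cold path: detect and publish. Racing threads would compute the same
     value, so a duplicate store is harmless. */
  f = detect_features_slow();
  assert(f != FEATURES_UNDEFINED);
  atomic_store(&g_features, f);
  return f;
}
```

Building this with MSVC needs the same /std:c11 /experimental:c11atomics flags as the commit above; GCC and Clang accept it with -std=c11.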
coff_read_symbol_name scans a cstr in the memory-mapped string table -- the
dominant, page-fault-bound cost of bulk symbol parsing. Many hot callers parse
a full symbol but only read scalar fields (value/section/storage_class/aux) to
interpret the symbol value; the name is never used.
Add name-skipping parse variants and route the interp-only paths through them:
- coff_parse_symbol{16,32}_no_name (coff_parse.c) -- and the full variants now
  call these + add the name, so the scalar logic lives in one place
- lnk_parsed_symbol_from_coff_symbol_idx_no_name (lnk_obj.c)
- lnk_interp_from_symbol / lnk_can_replace_symbol / lnk_on_symbol_replace
  (lnk_symbol_table.c) and the lnk_search_lib_task loop (lnk.c)
Where the name is still needed (lnk_search_lib) it uses the already-cached
LNK_Symbol.name instead of re-parsing. lnk_can_replace_symbol previously parsed
dst/src twice (full parse + a second parse for interp); collapsed to one
no-name parse each.
coff_parse_symbol32 on the main thread: 3922ms -> ~810ms (name-needed callers).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
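The split described above can be sketched roughly as follows. This is an illustration, not the radlink code: ParsedSymbol, parse_symbol16_no_name, and parse_symbol16 are hypothetical names, the record layout is the standard 18-byte COFF symbol, and the memcpy reads assume a little-endian host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct ParsedSymbol {
  const char *name;        /* NULL when parsed without a name */
  size_t      name_size;
  uint32_t    value;
  int32_t     section_number;
  uint16_t    type;
  uint8_t     storage_class;
  uint8_t     aux_symbol_count;
} ParsedSymbol;

/* Scalar fields only: never touches the string table. */
ParsedSymbol parse_symbol16_no_name(const uint8_t *rec)
{
  assert(rec != NULL);
  ParsedSymbol s = {0};
  int16_t sect;
  memcpy(&s.value, rec + 8, 4);   /* Value */
  memcpy(&sect, rec + 12, 2);     /* SectionNumber (signed 16-bit) */
  s.section_number = sect;
  memcpy(&s.type, rec + 14, 2);   /* Type */
  s.storage_class = rec[16];
  s.aux_symbol_count = rec[17];
  return s;
}

/* Full parse: reuse the scalar parse, then resolve the name. */
ParsedSymbol parse_symbol16(const uint8_t *rec, const char *string_table)
{
  ParsedSymbol s = parse_symbol16_no_name(rec);
  uint32_t zeroes;
  memcpy(&zeroes, rec, 4);
  if (zeroes == 0) {
    /* Long name: bytes 4..7 hold an offset into the string table.
       This scan is the costly part when the table is memory-mapped. */
    uint32_t offset;
    memcpy(&offset, rec + 4, 4);
    s.name = string_table + offset;
    s.name_size = strlen(s.name);
  } else {
    /* Short name: up to 8 bytes inline, NUL-padded. */
    const uint8_t *nul = memchr(rec, 0, 8);
    s.name = (const char *)rec;
    s.name_size = nul ? (size_t)(nul - rec) : 8;
  }
  return s;
}
```

Callers that only need value, section, or storage class would call the no-name variant and never page in the string table.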
lnk_on_symbol_replace merged ref lists by walking the destination's singly linked refs list to its tail on every merge. Across repeated COMDAT merges into one accumulating leader this is O(n^2) and was 96% of the function.

Add a refs_tail pointer to LNK_Symbol so the append is O(1):

  src->refs_tail->next = dst->refs;
  src->refs_tail = dst->refs_tail;

maintained at all ref-list write sites (lnk_make_symbol, the null_symbol and import-stub sites in lnk.c).

Order and head identity are preserved exactly, so this is a pure perf change: the head node stays the primary ref, and interior order is irrelevant (every multi-ref consumer sorts).

lnk_on_symbol_replace (main thread): 1306ms -> 161ms exclusive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
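A self-contained sketch of the tail-pointer append, for illustration only. RefNode, Symbol, symbol_push_ref, and symbol_concat_refs are hypothetical names rather than the LNK_Symbol definitions, and the empty-list handling and the clearing of dst are assumptions of this sketch, not something the commit states.

```c
#include <assert.h>
#include <stddef.h>

typedef struct RefNode {
  struct RefNode *next;
  int             obj_idx;
} RefNode;

typedef struct Symbol {
  RefNode *refs;       /* head: the primary ref */
  RefNode *refs_tail;  /* last node, kept in sync at every write site */
} Symbol;

/* Append one node, keeping the tail pointer up to date. */
void symbol_push_ref(Symbol *s, RefNode *n)
{
  n->next = NULL;
  if (s->refs_tail != NULL) {
    s->refs_tail->next = n;
  } else {
    s->refs = n;
  }
  s->refs_tail = n;
}

/* Splice dst's ref list after src's in O(1); no walk to the tail. */
void symbol_concat_refs(Symbol *src, Symbol *dst)
{
  assert(src != dst);
  if (dst->refs == NULL) {
    return;
  }
  if (src->refs == NULL) {
    src->refs = dst->refs;
  } else {
    src->refs_tail->next = dst->refs;
  }
  src->refs_tail = dst->refs_tail;
  /* Assumption of this sketch: dst gives up ownership of its nodes. */
  dst->refs = NULL;
  dst->refs_tail = NULL;
}
```

The head of src is untouched by the splice, which matches the "head identity is preserved" claim above.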
tp_for_parallel woke workers with a loop of single semaphore_drop calls -- one ReleaseSemaphore syscall per worker (up to worker_count, twice in shared mode). The main thread spent ~3.3s in ReleaseSemaphore over a Fortnite link.

Add semaphore_drop_n(sem, count) (a single ReleaseSemaphore(h, count, 0) on Windows; loop on POSIX) and replace the wakeup loops. Residual is the unavoidable kernel cost of waking N threads.

ReleaseSemaphore (main thread): 3274ms -> 940ms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
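One possible shape for the batched wakeup, as a sketch. The Semaphore typedef, the int return convention, and the zero-count handling are assumptions made for this example; only the ReleaseSemaphore and sem_post calls are real platform APIs.

```c
#include <assert.h>

#if defined(_WIN32)
#include <windows.h>
typedef HANDLE Semaphore;
#else
#include <semaphore.h>
typedef sem_t *Semaphore;
#endif

/* Increment the semaphore by count. Returns 1 on success, 0 on failure. */
int semaphore_drop_n(Semaphore sem, int count)
{
  assert(count >= 0);
  if (count == 0) {
    return 1; /* nothing to wake; ReleaseSemaphore rejects a zero count */
  }
#if defined(_WIN32)
  /* One syscall releases all count permits at once. */
  return ReleaseSemaphore(sem, (LONG)count, NULL) != 0;
#else
  /* POSIX has no batched post, so this remains a loop. */
  for (int i = 0; i < count; i += 1) {
    if (sem_post(sem) != 0) {
      return 0;
    }
  }
  return 1;
#endif
}
```

On Windows the saving comes from making one kernel transition instead of one per worker; the POSIX branch keeps the old cost.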
The capped cstr length scan was a byte-by-byte loop. Switch to memchr, which is SIMD-accelerated in the CRT. Speeds up every capped-cstr scan in the codebase; notably cv_name_from_symbol (CodeView symbol-name scan during GSI build).

cv_name_from_symbol (main thread): 1098ms -> 201ms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
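The memchr form of a capped length scan looks roughly like this. cstr_len_capped is a hypothetical name and signature used for illustration; the real helper may take different arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the C string at ptr, never reading more than cap bytes.
   Returns cap when no NUL terminator is found within the cap. */
size_t cstr_len_capped(const void *ptr, size_t cap)
{
  assert(ptr != NULL);
  const char *start = (const char *)ptr;
  const char *nul = memchr(start, 0, cap);
  return nul ? (size_t)(nul - start) : cap;
}
```

The behaviour matches a byte loop that stops at the first NUL or at cap, whichever comes first; only the scan itself is delegated to the C runtime.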
lnk_fixup_cv_type_indices did two open-addressing probes per type-index reference: lnk_leaf_hash_table_search (leaf_ref -> canonical bucket), then lnk_assigned_type_ht_search (canonical bucket -> assigned type index, via a second hash table keyed by leaf-ref content). Both are cache-miss-bound and this ran across every type-index reference in every obj.

Store the assigned type index directly on the leaf hash table: add a ti_arr parallel to bucket_arr. lnk_assign_type_indices_task writes ti = min+i into the leaf's bucket slot (each unique leaf owns a distinct slot, so worker writes never collide), and the new lnk_leaf_hash_table_search_ti recovers it in one probe. Removes the entire assigned_type_hts table and its build pass; deletes the now-dead lnk_leaf_hash_table_search and lnk_assigned_type_ht_search.

Correctness: deduplicated leaves share the same ghash (debug_h value), so the fixup query and the assign-time canonical bucket hash to the same slot.

lnk_fixup_cv_type_indices (main thread): 1445ms -> 390ms inclusive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
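To make the parallel-array idea concrete, here is a simplified single-threaded sketch. LeafTable and the leaf_table_* functions are hypothetical; the sketch keys slots by a nonzero 64-bit hash, uses linear probing, and assumes the table never fills up, none of which is claimed about the actual radlink table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TI_UNASSIGNED 0u

typedef struct LeafTable {
  uint64_t *bucket_arr; /* leaf hash per slot; 0 marks an empty slot */
  uint32_t *ti_arr;     /* assigned type index, parallel to bucket_arr */
  size_t    cap;        /* slot count, must be a power of two */
} LeafTable;

/* Linear probe: returns the slot holding hash, or the first empty slot. */
size_t leaf_table_probe(const LeafTable *t, uint64_t hash)
{
  assert(hash != 0);
  size_t mask = t->cap - 1;
  size_t i = (size_t)hash & mask;
  while (t->bucket_arr[i] != 0 && t->bucket_arr[i] != hash) {
    i = (i + 1) & mask;
  }
  return i;
}

/* Insert (or find) a leaf and return its slot. */
size_t leaf_table_insert(LeafTable *t, uint64_t hash)
{
  size_t slot = leaf_table_probe(t, hash);
  t->bucket_arr[slot] = hash;
  return slot;
}

/* Each unique leaf owns one slot, so the index is stored right there. */
void leaf_table_assign_ti(LeafTable *t, size_t slot, uint32_t ti)
{
  t->ti_arr[slot] = ti;
}

/* One probe recovers the assigned type index; no second table needed. */
uint32_t leaf_table_search_ti(const LeafTable *t, uint64_t hash)
{
  size_t slot = leaf_table_probe(t, hash);
  return t->bucket_arr[slot] == hash ? t->ti_arr[slot] : TI_UNASSIGNED;
}
```

Because the index lives in the slot the probe already lands on, the lookup costs one chain of cache misses instead of two.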
26be2b6 to
7c99b7f
Profiled radlink linking UnrealEditorFortnite-Engine.dll (Superluminal, MSVC release) and landed 7 optimizations plus dead-code cleanup. Cold-run wall time dropped ~45s → ~23s (~1.9x). The vendored BLAKE3 source is left untouched.
Each commit is independent and carries its own rationale + measured gain.
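Gains (main-thread, same workload)
- get_cpu_features: 5591ms → 4ms
- coff_parse_symbol32 (main): 3922ms → ~810ms
- lnk_on_symbol_replace: 1306ms → 161ms
- ReleaseSemaphore (main): 3274ms → 940ms
- cv_name_from_symbol: 1098ms → 201ms
- lnk_fixup_cv_type_indices (incl): 1445ms → 390ms
Commits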
- _InterlockedOr(&x,0) per-dispatch barrier → plain load, via build flags only (no third_party edit).
- refs_tail: lnk_on_symbol_replace ref-merge O(n²) → O(1).
- ReleaseSemaphore(h, count, 0) instead of a per-worker loop.
- memchr in str8_cstring_capped instead of a byte loop.
Correctness
Linker torture suite (build/torture.exe): 65/65 linker tests pass (COMDAT, weak/undef/abs, ghash type-merge, relocs, import/export), unchanged from baseline, verified after every commit.
Caveats
- /std:c11 /experimental:c11atomics (scoped to the radlink target).
- /experimental:c11atomics is an unstable MSVC switch; happy to use a project-local ATOMIC_LOAD override instead if preferred.
🤖 Generated with Claude Code