radlink: link-time performance pass (~1.9x on a Fortnite link) by honkstar1 · Pull Request #830 · EpicGames/raddebugger

honkstar1 · 2026-06-16T22:00:29Z

Profiled radlink linking UnrealEditorFortnite-Engine.dll (Superluminal, MSVC release) and landed 7 optimizations plus dead-code cleanup. Cold-run wall time dropped ~45s → ~23s (~1.9x). The vendored BLAKE3 source is left untouched.

Each commit is independent and carries its own rationale + measured gain.

Gains (main-thread, same workload)

| Symbol | Before | After |
|---|---|---|
| `get_cpu_features` | 5591 ms | 4 ms |
| `coff_parse_symbol32` (main) | 3922 ms | ~810 ms |
| `lnk_on_symbol_replace` | 1306 ms | 161 ms |
| `ReleaseSemaphore` (main) | 3274 ms | 940 ms |
| `cv_name_from_symbol` | 1098 ms | 201 ms |
| `lnk_fixup_cv_type_indices` (incl) | 1445 ms | 390 ms |

Commits

1. BLAKE3 C11 atomics: the per-dispatch _InterlockedOr(&x,0) barrier becomes a plain load, via build flags only (no third_party edit).
2. No-name symbol parse: skip the string-table cstr scan on interp-only paths.
3. refs_tail: the lnk_on_symbol_replace ref-list append goes from an O(n) tail walk to O(1), removing the O(n²) cost of repeated merges.
4. Batched worker wakeup: one ReleaseSemaphore(h, count, 0) instead of a per-worker loop.
5. memchr in str8_cstring_capped instead of a byte loop.
6. Type-index fixup single probe: store the assigned ti on the leaf hash table; removes the second hash table and its build pass.

Correctness

Linker torture suite (build/torture.exe): 65/65 linker tests pass (COMDAT, weak/undef/abs, ghash type-merge, relocs, import/export), unchanged from baseline, verified after every commit.

Caveats

Commit 1 uses /std:c11 /experimental:c11atomics (scoped to the radlink target). /experimental:c11atomics is an unstable MSVC switch — happy to use a project-local ATOMIC_LOAD override instead if preferred.
radlink output is already run-to-run non-deterministic, so the torture suite (not a byte-diff) was used as the gate.

🤖 Generated with Claude Code

get_cpu_features was the top main-thread hot spot (~5.6s for one Fortnite link), 97% of it inside a single ATOMIC_LOAD(g_cpu_features). On MSVC, blake3_dispatch.c defines ATOMIC_LOAD as _InterlockedOr(&x,0) -- a lock'd RMW (full barrier) run on every BLAKE3 compress dispatch. The value is written once and read-only after, so the barrier is pointless.

Enable BLAKE3's plain-load path (C11 _Atomic, a plain mov on x86) via build flags only, leaving the vendored third_party/blake3 source untouched:

    /std:c11 /experimental:c11atomics -DBLAKE3_ATOMICS=1

Scoped to the radlink target. MSVC C11 atomics need both /std:c11 and /experimental:c11atomics.

get_cpu_features: 5591ms -> 4ms (main thread).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
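The pattern being described, a value computed once and then only read, can be sketched with a C11 `_Atomic int`. This is a minimal illustration, not BLAKE3's or radlink's code: `g_features_cache`, `detect_features_slow`, `get_features_cached`, and the `0x2A` mask are hypothetical stand-ins, and the sketch assumes a compiler that provides `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical sentinel meaning "not computed yet". */
#define FEATURES_UNKNOWN (-1)

static _Atomic int g_features_cache = FEATURES_UNKNOWN;
static _Atomic int g_detect_calls   = 0; /* counts slow-path runs, for illustration */

/* Stand-in for an expensive CPUID-style probe. */
static int detect_features_slow(void)
{
  atomic_fetch_add(&g_detect_calls, 1);
  return 0x2A; /* arbitrary feature mask */
}

int get_features_cached(void)
{
  /* An atomic load, not a read-modify-write: on x86 this is an ordinary mov. */
  int features = atomic_load(&g_features_cache);
  if (features == FEATURES_UNKNOWN) {
    /* Racing threads may both take this path; they store the same value. */
    features = detect_features_slow();
    assert(features != FEATURES_UNKNOWN);
    atomic_store(&g_features_cache, features);
  }
  return features;
}
```

After the first call, every later call is a single load and a compare, which is the behavior the build flags are meant to restore on MSVC.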

coff_read_symbol_name scans a cstr in the memory-mapped string table -- the dominant, page-fault-bound cost of bulk symbol parsing. Many hot callers parse a full symbol but only read scalar fields (value/section/storage_class/aux) to interpret the symbol value; the name is never used.

Add name-skipping parse variants and route the interp-only paths through them:

- coff_parse_symbol{16,32}_no_name (coff_parse.c) -- and the full variants now call these + add the name, so the scalar logic lives in one place
- lnk_parsed_symbol_from_coff_symbol_idx_no_name (lnk_obj.c)
- lnk_interp_from_symbol / lnk_can_replace_symbol / lnk_on_symbol_replace (lnk_symbol_table.c) and the lnk_search_lib_task loop (lnk.c)

Where the name is still needed (lnk_search_lib) it uses the already-cached LNK_Symbol.name instead of re-parsing. lnk_can_replace_symbol previously parsed dst/src twice (full parse + a second parse for interp); collapsed to one no-name parse each.

coff_parse_symbol32 on the main thread: 3922ms -> ~810ms (name-needed callers).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
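The split between a scalar-only parse and a full parse layered on top of it might look roughly like the sketch below. The `RawSymbol` and `ParsedSymbol` layouts and both function names are simplified, hypothetical stand-ins; they are not the real COFF record or the radlink signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified symbol record (not the real COFF layout). */
typedef struct RawSymbol {
  uint32_t name_offset;    /* offset of a NUL-terminated name in the string table */
  uint32_t value;
  uint16_t section_number;
  uint8_t  storage_class;
} RawSymbol;

typedef struct ParsedSymbol {
  const char *name;        /* NULL when produced by the no-name variant */
  size_t      name_size;
  uint32_t    value;
  uint16_t    section_number;
  uint8_t     storage_class;
} ParsedSymbol;

/* Scalar fields only: never touches the string table. */
ParsedSymbol parse_symbol_no_name(const RawSymbol *raw)
{
  ParsedSymbol p = {0};
  p.value          = raw->value;
  p.section_number = raw->section_number;
  p.storage_class  = raw->storage_class;
  return p;
}

/* Full variant: reuses the scalar parse, then pays for the name scan. */
ParsedSymbol parse_symbol(const RawSymbol *raw, const char *string_table,
                          size_t string_table_size)
{
  ParsedSymbol p = parse_symbol_no_name(raw);
  assert(raw->name_offset < string_table_size);
  const char *name = string_table + raw->name_offset;
  size_t      cap  = string_table_size - raw->name_offset;
  const char *nul  = memchr(name, 0, cap);
  p.name      = name;
  p.name_size = nul ? (size_t)(nul - name) : cap;
  return p;
}
```

Callers that only interpret the value, section, or storage class would use the first function and never fault in string-table pages.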

lnk_on_symbol_replace merged ref lists by walking the destination's singly linked refs list to its tail on every merge. Across repeated COMDAT merges into one accumulating leader this is O(n^2) and was 96% of the function.

Add a refs_tail pointer to LNK_Symbol so the append is O(1):

    src->refs_tail->next = dst->refs;
    src->refs_tail = dst->refs_tail;

maintained at all ref-list write sites (lnk_make_symbol, the null_symbol and import-stub sites in lnk.c).

Order and head identity are preserved exactly, so this is a pure perf change: the head node stays the primary ref, and interior order is irrelevant (every multi-ref consumer sorts).

lnk_on_symbol_replace (main thread): 1306ms -> 161ms exclusive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
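The tail-pointer technique can be shown on a generic singly linked list. In this sketch `RefNode`, `Symbol`, `symbol_push_ref`, and `symbol_take_refs` are hypothetical names, and the neutral `keep`/`donor` roles are an assumption made for illustration; the real code decides which symbol's list is appended to which.

```c
#include <assert.h>
#include <stddef.h>

typedef struct RefNode {
  struct RefNode *next;
  int             id;
} RefNode;

typedef struct Symbol {
  RefNode *refs;      /* head of the list */
  RefNode *refs_tail; /* last node, or NULL when the list is empty */
} Symbol;

/* Append one node; every write site must keep refs_tail in sync. */
void symbol_push_ref(Symbol *s, RefNode *n)
{
  n->next = NULL;
  if (s->refs == NULL) s->refs = n;
  else                 s->refs_tail->next = n;
  s->refs_tail = n;
}

/* Splice donor's whole list onto the end of keep's list in O(1). */
void symbol_take_refs(Symbol *keep, Symbol *donor)
{
  assert(keep != donor);
  if (donor->refs == NULL) return;
  if (keep->refs == NULL) keep->refs = donor->refs;
  else                    keep->refs_tail->next = donor->refs;
  keep->refs_tail  = donor->refs_tail;
  donor->refs      = NULL;
  donor->refs_tail = NULL;
}
```

Without `refs_tail`, the splice would have to walk `keep`'s list to find its last node, which is what makes repeated merges into one growing list quadratic.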

tp_for_parallel woke workers with a loop of single semaphore_drop calls -- one ReleaseSemaphore syscall per worker (up to worker_count, twice in shared mode). The main thread spent ~3.3s in ReleaseSemaphore over a Fortnite link.

Add semaphore_drop_n(sem, count) (a single ReleaseSemaphore(h, count, 0) on Windows; loop on POSIX) and replace the wakeup loops. Residual is the unavoidable kernel cost of waking N threads.

ReleaseSemaphore (main thread): 3274ms -> 940ms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The capped cstr length scan was a byte-by-byte loop. Switch to memchr, which is SIMD-accelerated in the CRT. Speeds up every capped-cstr scan in the codebase; notably cv_name_from_symbol (CodeView symbol-name scan during GSI build).

cv_name_from_symbol (main thread): 1098ms -> 201ms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
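A capped length scan built on `memchr` can be sketched as below. `cstr_length_capped` is a hypothetical helper that returns a plain length; it is not the actual `str8_cstring_capped` signature, and it assumes `ptr` is a valid, non-NULL pointer with `ptr <= cap`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the string at `ptr`, stopping at the first NUL or at `cap`,
   whichever comes first. `cap` points one past the last readable byte. */
size_t cstr_length_capped(const char *ptr, const char *cap)
{
  assert(ptr != NULL && ptr <= cap);
  size_t      max = (size_t)(cap - ptr);
  const char *nul = memchr(ptr, 0, max);
  return nul ? (size_t)(nul - ptr) : max;
}
```

The function never reads at or beyond `cap`, so it is safe on a buffer that lacks a terminator.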

lnk_fixup_cv_type_indices did two open-addressing probes per type-index reference: lnk_leaf_hash_table_search (leaf_ref -> canonical bucket), then lnk_assigned_type_ht_search (canonical bucket -> assigned type index, via a second hash table keyed by leaf-ref content). Both are cache-miss-bound and this ran across every type-index reference in every obj.

Store the assigned type index directly on the leaf hash table: add a ti_arr parallel to bucket_arr. lnk_assign_type_indices_task writes ti = min+i into the leaf's bucket slot (each unique leaf owns a distinct slot, so worker writes never collide), and the new lnk_leaf_hash_table_search_ti recovers it in one probe. Removes the entire assigned_type_hts table and its build pass; deletes the now-dead lnk_leaf_hash_table_search and lnk_assigned_type_ht_search.

Correctness: deduplicated leaves share the same ghash (debug_h value), so the fixup query and the assign-time canonical bucket hash to the same slot.

lnk_fixup_cv_type_indices (main thread): 1445ms -> 390ms inclusive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
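The idea of keeping the assigned index in an array parallel to the bucket array can be illustrated with a small open-addressing table. Everything here is a hypothetical simplification: `LeafTable` and its functions are invented for the sketch, a nonzero 64-bit hash stands in for the leaf's identity (the real table compares leaf content via its ghash), and the caller is assumed to supply zeroed arrays with a power-of-two capacity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TI_UNASSIGNED 0u

typedef struct LeafTable {
  uint64_t *bucket_arr; /* 0 = empty slot, otherwise the leaf's hash */
  uint32_t *ti_arr;     /* parallel to bucket_arr: assigned type index */
  size_t    cap;        /* power of two */
  size_t    count;      /* occupied slots, kept below cap */
} LeafTable;

/* Linear probe: returns the slot holding `hash`, or the first empty slot. */
static size_t leaf_table_probe(const LeafTable *t, uint64_t hash)
{
  size_t mask = t->cap - 1;
  size_t i    = (size_t)hash & mask;
  while (t->bucket_arr[i] != 0 && t->bucket_arr[i] != hash) {
    i = (i + 1) & mask;
  }
  return i;
}

/* Insert (or find) a leaf and return its slot. */
size_t leaf_table_insert(LeafTable *t, uint64_t hash)
{
  assert(hash != 0);
  size_t i = leaf_table_probe(t, hash);
  if (t->bucket_arr[i] == 0) {
    assert(t->count + 1 < t->cap); /* keep one slot empty so probes terminate */
    t->bucket_arr[i] = hash;
    t->count += 1;
  }
  return i;
}

/* Each unique leaf owns its slot, so distinct leaves never write the same entry. */
void leaf_table_assign_ti(LeafTable *t, size_t slot, uint32_t ti)
{
  t->ti_arr[slot] = ti;
}

/* One probe recovers the assigned index; no second table is consulted. */
uint32_t leaf_table_search_ti(const LeafTable *t, uint64_t hash)
{
  size_t i = leaf_table_probe(t, hash);
  return t->bucket_arr[i] == hash ? t->ti_arr[i] : TI_UNASSIGNED;
}
```

Because the index lives at the same slot the probe already lands on, the fixup pass touches one table per reference instead of two.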

honkstar1 and others added 6 commits June 16, 2026 14:50

honkstar1 force-pushed the perf/radlink-link-time branch from 26be2b6 to 7c99b7f Compare June 16, 2026 23:29

honkstar1 wants to merge 6 commits into EpicGames:master from honkstar1:perf/radlink-link-time
