radlink: link-time performance pass (~1.9x on a Fortnite link) #830

Open: honkstar1 wants to merge 6 commits into EpicGames:master from honkstar1:perf/radlink-link-time
@honkstar1

Profiled radlink linking UnrealEditorFortnite-Engine.dll (Superluminal, MSVC release) and landed 6 optimizations (one per commit) plus dead-code cleanup. Cold-run wall time dropped from ~45s to ~23s (~1.9x). The vendored BLAKE3 source is left untouched.

Each commit is independent and carries its own rationale + measured gain.

Gains (main-thread, same workload)

| Symbol | Before | After |
| --- | --- | --- |
| get_cpu_features | 5591 ms | 4 ms |
| coff_parse_symbol32 (main) | 3922 ms | ~810 ms |
| lnk_on_symbol_replace | 1306 ms | 161 ms |
| ReleaseSemaphore (main) | 3274 ms | 940 ms |
| cv_name_from_symbol | 1098 ms | 201 ms |
| lnk_fixup_cv_type_indices (incl) | 1445 ms | 390 ms |

Commits

  1. BLAKE3 C11 atomics: _InterlockedOr(&x,0) per-dispatch barrier → plain load, via build flags only (no third_party edit).
  2. No-name symbol parse — skip the string-table cstr scan on interp-only paths.
  3. refs_tail: the lnk_on_symbol_replace ref-list append goes from an O(n) tail walk to O(1), removing the O(n²) behavior across repeated merges.
  4. Batched worker wakeup — one ReleaseSemaphore(h, count, 0) instead of a per-worker loop.
  5. memchr in str8_cstring_capped instead of a byte loop.
  6. Type-index fixup single probe — store assigned ti on the leaf hash table; removes the second hash table + its build pass.

Correctness

Linker torture suite (build/torture.exe): 65/65 linker tests pass (COMDAT, weak/undef/abs, ghash type-merge, relocs, import/export), unchanged from baseline, verified after every commit.

Caveats

  • Commit 1 uses /std:c11 /experimental:c11atomics (scoped to the radlink target). /experimental:c11atomics is an unstable MSVC switch — happy to use a project-local ATOMIC_LOAD override instead if preferred.
  • radlink output is already run-to-run non-deterministic, so the torture suite (not a byte-diff) was used as the gate.

🤖 Generated with Claude Code

honkstar1 and others added 6 commits June 16, 2026 14:50
get_cpu_features was the top main-thread hot spot (~5.6s for one Fortnite
link), 97% of it inside a single ATOMIC_LOAD(g_cpu_features). On MSVC,
blake3_dispatch.c defines ATOMIC_LOAD as _InterlockedOr(&x,0) -- a lock'd
RMW (full barrier) run on every BLAKE3 compress dispatch. The value is
written once and read-only after, so the barrier is pointless.

Enable BLAKE3's plain-load path (C11 _Atomic, a plain mov on x86) via build
flags only, leaving the vendored third_party/blake3 source untouched:
  /std:c11 /experimental:c11atomics -DBLAKE3_ATOMICS=1
Scoped to the radlink target. MSVC C11 atomics need both /std:c11 and
/experimental:c11atomics.

get_cpu_features: 5591ms -> 4ms (main thread).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
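
A minimal sketch of the write-once, read-many pattern this commit relies on, written with standard C11 atomics rather than BLAKE3's own macros. The names g_cpu_features_sketch, detect_features_sketch and get_features_sketch are hypothetical stand-ins, and the detection routine is a placeholder that is assumed to be idempotent.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the cached CPU-feature word; -1 means "not yet detected". */
static _Atomic int g_cpu_features_sketch = -1;

/* Placeholder for the real CPUID-based detection (assumed idempotent). */
static int detect_features_sketch(void)
{
  return 0x2A;
}

/* Hot path: an atomic load. On x86 an atomic load is an ordinary mov,
   unlike a lock-prefixed read-modify-write such as _InterlockedOr(&x, 0). */
static int get_features_sketch(void)
{
  int features = atomic_load(&g_cpu_features_sketch);
  if (features == -1) {
    /* Racing threads may both detect; they store the same value, so the race is benign. */
    features = detect_features_sketch();
    atomic_store(&g_cpu_features_sketch, features);
  }
  return features;
}
```

After the first call, every later call is a single load and a compare, which is why the per-dispatch barrier cost disappears once the atomics path is selected.
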
coff_read_symbol_name scans a cstr in the memory-mapped string table -- the
dominant, page-fault-bound cost of bulk symbol parsing. Many hot callers parse
a full symbol but only read scalar fields (value/section/storage_class/aux) to
interpret the symbol value; the name is never used.

Add name-skipping parse variants and route the interp-only paths through them:
  coff_parse_symbol{16,32}_no_name (coff_parse.c) -- and the full variants now
    call these + add the name, so the scalar logic lives in one place
  lnk_parsed_symbol_from_coff_symbol_idx_no_name (lnk_obj.c)
  lnk_interp_from_symbol / lnk_can_replace_symbol / lnk_on_symbol_replace
    (lnk_symbol_table.c) and the lnk_search_lib_task loop (lnk.c)

Where the name is still needed (lnk_search_lib) it uses the already-cached
LNK_Symbol.name instead of re-parsing. lnk_can_replace_symbol previously parsed
dst/src twice (full parse + a second parse for interp); collapsed to one
no-name parse each.

coff_parse_symbol32 (main thread): 3922ms -> ~810ms; the remainder comes from callers that still need the name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
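
The split described above might look roughly like the following. This is a sketch with a simplified, hypothetical record layout (RawSymbolSketch, ParsedSymbolSketch), not radlink's COFF structures: the scalar logic lives in the no-name variant, and the full variant calls it and then adds the string-table scan.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified on-disk symbol record. */
typedef struct {
  uint32_t value;
  uint32_t section;
  uint8_t  storage_class;
  uint32_t name_offset;   /* offset of the name in the string table */
} RawSymbolSketch;

typedef struct {
  uint32_t    value;
  uint32_t    section;
  uint8_t     storage_class;
  const char *name;       /* NULL when parsed without a name */
  size_t      name_size;
} ParsedSymbolSketch;

/* Scalar-only parse: never touches the string table. */
static ParsedSymbolSketch parse_symbol_no_name_sketch(const RawSymbolSketch *raw)
{
  ParsedSymbolSketch sym = {0};
  sym.value         = raw->value;
  sym.section       = raw->section;
  sym.storage_class = raw->storage_class;
  return sym;
}

/* Full parse: reuses the scalar parse, then scans for the name's terminator. */
static ParsedSymbolSketch parse_symbol_sketch(const RawSymbolSketch *raw,
                                              const char *strtab, size_t strtab_size)
{
  ParsedSymbolSketch sym = parse_symbol_no_name_sketch(raw);
  if (raw->name_offset < strtab_size) {
    const char *start = strtab + raw->name_offset;
    size_t      cap   = strtab_size - raw->name_offset;
    const char *nul   = memchr(start, 0, cap);
    sym.name      = start;
    sym.name_size = nul ? (size_t)(nul - start) : cap;
  }
  return sym;
}
```

Callers that only interpret the symbol value would use the no-name variant and never fault in string-table pages.
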
lnk_on_symbol_replace merged ref lists by walking the destination's singly
linked refs list to its tail on every merge. Across repeated COMDAT merges
into one accumulating leader this is O(n^2) and was 96% of the function.

Add a refs_tail pointer to LNK_Symbol so the append is O(1):
  src->refs_tail->next = dst->refs;
  src->refs_tail       = dst->refs_tail;
maintained at all ref-list write sites (lnk_make_symbol, the null_symbol and
import-stub sites in lnk.c). Order and head identity are preserved exactly, so
this is a pure perf change: the head node stays the primary ref, and interior
order is irrelevant (every multi-ref consumer sorts).

lnk_on_symbol_replace (main thread): 1306ms -> 161ms exclusive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
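
For illustration, a self-contained sketch of constant-time list concatenation with a tail pointer. RefSketch, SymSketch and the helper names are hypothetical and only mirror the idea; the real LNK_Symbol has more fields and its own write sites.

```c
#include <stddef.h>

typedef struct RefSketch {
  struct RefSketch *next;
  int               id;
} RefSketch;

typedef struct {
  RefSketch *refs;       /* head: stays the primary ref */
  RefSketch *refs_tail;  /* last node, kept in sync at every write site */
} SymSketch;

/* Append a single ref, maintaining the tail pointer. */
static void sym_push_ref_sketch(SymSketch *sym, RefSketch *ref)
{
  ref->next = NULL;
  if (sym->refs_tail != NULL) {
    sym->refs_tail->next = ref;
  } else {
    sym->refs = ref;
  }
  sym->refs_tail = ref;
}

/* Splice all of `from`'s refs after `into`'s refs without walking either list. */
static void sym_concat_refs_sketch(SymSketch *into, SymSketch *from)
{
  if (from->refs == NULL) {
    return;
  }
  if (into->refs == NULL) {
    into->refs = from->refs;
  } else {
    into->refs_tail->next = from->refs;
  }
  into->refs_tail = from->refs_tail;
  from->refs      = NULL;
  from->refs_tail = NULL;
}
```

The cost of a merge no longer depends on how many refs the accumulating symbol already holds, which is what removes the quadratic behavior across repeated merges.
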
tp_for_parallel woke workers with a loop of single semaphore_drop calls -- one
ReleaseSemaphore syscall per worker (up to worker_count, twice in shared mode).
The main thread spent ~3.3s in ReleaseSemaphore over a Fortnite link.

Add semaphore_drop_n(sem, count) (a single ReleaseSemaphore(h, count, 0) on
Windows; loop on POSIX) and replace the wakeup loops. Residual is the
unavoidable kernel cost of waking N threads.

ReleaseSemaphore (main thread): 3274ms -> 940ms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
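
A rough sketch of what a batched drop can look like. SemSketch and semaphore_drop_n_sketch are hypothetical names with a simplified signature, not radlink's OS layer. ReleaseSemaphore accepts a release count, while POSIX sem_post has none, so the POSIX branch still loops.

```c
#if defined(_WIN32)
#include <windows.h>

typedef HANDLE SemSketch;

/* One syscall releases `count` waiters. ReleaseSemaphore rejects a zero count,
   so treat zero as a successful no-op. Returns 1 on success, 0 on failure. */
static int semaphore_drop_n_sketch(SemSketch *sem, unsigned count)
{
  if (count == 0) {
    return 1;
  }
  return ReleaseSemaphore(*sem, (LONG)count, NULL) != 0;
}

#else
#include <semaphore.h>

typedef sem_t SemSketch;

/* POSIX has no counted post, so fall back to one sem_post per waiter. */
static int semaphore_drop_n_sketch(SemSketch *sem, unsigned count)
{
  for (unsigned i = 0; i < count; i += 1) {
    if (sem_post(sem) != 0) {
      return 0;
    }
  }
  return 1;
}
#endif
```

The wakeup site would then call this once with the number of workers to wake instead of looping over single drops.
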
The capped cstr length scan was a byte-by-byte loop. Switch to memchr, which is
SIMD-accelerated in the CRT. Speeds up every capped-cstr scan in the codebase;
notably cv_name_from_symbol (CodeView symbol-name scan during GSI build).

cv_name_from_symbol (main thread): 1098ms -> 201ms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
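
A small sketch of a capped C-string length scan built on memchr. The function name and the (start, cap) pointer-pair signature are assumptions for illustration; the real str8_cstring_capped returns radlink's own string type.

```c
#include <stddef.h>
#include <string.h>

/* Length of the C string at `start`, never reading at or past `cap`.
   If no terminator is found before `cap`, the whole capped range is the length. */
static size_t cstring_length_capped_sketch(const void *start, const void *cap)
{
  const char *first = (const char *)start;
  const char *end   = (const char *)cap;
  if (first == NULL || end <= first) {
    return 0;
  }
  size_t      max = (size_t)(end - first);
  const char *nul = (const char *)memchr(first, 0, max);
  return nul != NULL ? (size_t)(nul - first) : max;
}
```

The behavior matches a byte loop with the same bound; only the scan itself is delegated to the C runtime.
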
lnk_fixup_cv_type_indices did two open-addressing probes per type-index
reference: lnk_leaf_hash_table_search (leaf_ref -> canonical bucket), then
lnk_assigned_type_ht_search (canonical bucket -> assigned type index, via a
second hash table keyed by leaf-ref content). Both are cache-miss-bound and this
ran across every type-index reference in every obj.

Store the assigned type index directly on the leaf hash table: add a ti_arr
parallel to bucket_arr. lnk_assign_type_indices_task writes ti = min+i into the
leaf's bucket slot (each unique leaf owns a distinct slot, so worker writes
never collide), and the new lnk_leaf_hash_table_search_ti recovers it in one
probe. Removes the entire assigned_type_hts table and its build pass; deletes
the now-dead lnk_leaf_hash_table_search and lnk_assigned_type_ht_search.

Correctness: deduplicated leaves share the same ghash (debug_h value), so the
fixup query and the assign-time canonical bucket hash to the same slot.

lnk_fixup_cv_type_indices (main thread): 1445ms -> 390ms inclusive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
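
To make the single-probe idea concrete, here is a toy open-addressing table with a value array parallel to the bucket array. It is a sketch under simplifying assumptions: a fixed capacity, linear probing, the hash value itself used as the key, and 0 reserved as the empty marker. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { LEAF_CAP_SKETCH = 16 };  /* power of two so the mask below works */

typedef struct {
  uint64_t bucket_arr[LEAF_CAP_SKETCH];  /* leaf hash per slot; 0 = empty */
  uint32_t ti_arr[LEAF_CAP_SKETCH];      /* assigned type index, parallel to bucket_arr */
} LeafTableSketch;

/* Linear probe: returns the slot holding `hash`, or the first empty slot,
   or LEAF_CAP_SKETCH if the table is full and `hash` is absent. */
static size_t leaf_find_slot_sketch(const LeafTableSketch *table, uint64_t hash)
{
  for (size_t i = 0; i < LEAF_CAP_SKETCH; i += 1) {
    size_t slot = (size_t)((hash + i) & (LEAF_CAP_SKETCH - 1));
    if (table->bucket_arr[slot] == hash || table->bucket_arr[slot] == 0) {
      return slot;
    }
  }
  return LEAF_CAP_SKETCH;
}

/* Insert (or find) a leaf; duplicates resolve to the same slot. */
static size_t leaf_insert_sketch(LeafTableSketch *table, uint64_t hash)
{
  size_t slot = leaf_find_slot_sketch(table, hash);
  if (slot != LEAF_CAP_SKETCH) {
    table->bucket_arr[slot] = hash;
  }
  return slot;
}

/* Assign step: each unique leaf owns its slot, so writes to distinct slots never collide. */
static void leaf_assign_ti_sketch(LeafTableSketch *table, size_t slot, uint32_t ti)
{
  table->ti_arr[slot] = ti;
}

/* Fixup step: one probe yields the assigned type index. Returns 1 if found. */
static int leaf_search_ti_sketch(const LeafTableSketch *table, uint64_t hash, uint32_t *ti_out)
{
  size_t slot = leaf_find_slot_sketch(table, hash);
  if (slot == LEAF_CAP_SKETCH || table->bucket_arr[slot] != hash) {
    return 0;
  }
  *ti_out = table->ti_arr[slot];
  return 1;
}
```

Because the value sits at the same index as the key, a lookup touches one table instead of chaining through a second one.
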
@honkstar1 honkstar1 force-pushed the perf/radlink-link-time branch from 26be2b6 to 7c99b7f Compare June 16, 2026 23:29