perf(intel): index-backed searchBlock in IndirectCallAnalyzer #47

Merged
r0ny123 merged 2 commits into master from claude/perf-sweep-20260526-indirect-call-index
May 26, 2026

@r0ny123 r0ny123 commented May 26, 2026

Summary

Picks up the highest-impact deferred item from PR #46: collapse the O(B·I) linear scan in IndirectCallAnalyzer.searchBlock to an O(1) dict lookup. Single optimization class.

resolveRegisterCalls walks the CFG backward through up to block_depth=3 levels of incoming refs to resolve call <register> targets. At every level, searchBlock was doing:

for block in analysis_state.getBlocks():
    if address in [i[0] for i in block]:
        return block

One call is O(B·I). The recursive descent in processBlock makes that call once per incoming ref at every depth. Functions with many register calls (the file already mentions "found one Go sample with 130k register calls") hit this hard.

Change

  • Lazy-cache an {instruction_addr: containing_block} index on analysis_state the first time searchBlock runs. Subsequent lookups during the same function analysis are O(1).
  • Cache lives on the state object (not on the analyzer), so the index has the correct lifetime (one per function analysis), the analyzer stays re-entrancy-safe, and direct callers (e.g. unit tests) get the O(1) path automatically — no separate fallback branch needed.
  • Preserve "first matching block wins" by using if addr not in index during construction — important because FunctionAnalysisState.getBlocks() can place the same instruction in multiple overlapping blocks via the sorted potential_starts walk.
  • contextlib.suppress(AttributeError) guards the cache write so test doubles or __slots__-locked states still work; the freshly built dict is returned in that case.

Measurements

Microbench (synthetic 80 blocks × 15 instructions = 1200 lookups):

| Path | Best of 5 |
| --- | --- |
| legacy linear scan | 28.60 ms |
| indexed O(1) lookup | 0.27 ms |
| speedup | ~107× |

Parity check on the same fixture: 1200 lookups, 0 mismatches — both paths return the same block object reference.
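
A benchmark of this shape could look like the sketch below. It is a reconstruction under assumptions, not the script that produced the numbers above: the fixture addresses, the function names, and the choice to include index construction in the timed indexed path are all hypothetical. Timings will differ by machine, so none are asserted here.

```python
import timeit

# Synthetic fixture: 80 blocks x 15 instructions, addresses 4 bytes apart.
BLOCKS = [
    [(0x401000 + (b * 15 + i) * 4, 4, "nop", "") for i in range(15)]
    for b in range(80)
]
ADDRESSES = [ins[0] for block in BLOCKS for ins in block]


def legacy_search(blocks, address):
    """The O(B*I) scan quoted in the summary."""
    for block in blocks:
        if address in [i[0] for i in block]:
            return block
    return None


def build_index(blocks):
    """Map each instruction address to the first block containing it."""
    index = {}
    for block in blocks:
        for ins in block:
            index.setdefault(ins[0], block)
    return index


def run_legacy():
    for addr in ADDRESSES:
        legacy_search(BLOCKS, addr)


def run_indexed():
    index = build_index(BLOCKS)  # construction cost counted once per run
    for addr in ADDRESSES:
        index.get(addr)


# Parity: both paths must return the very same block object.
INDEX = build_index(BLOCKS)
mismatches = sum(
    1 for addr in ADDRESSES if legacy_search(BLOCKS, addr) is not INDEX.get(addr)
)

legacy_best = min(timeit.repeat(run_legacy, repeat=5, number=1))
indexed_best = min(timeit.repeat(run_indexed, repeat=5, number=1))
print(f"lookups={len(ADDRESSES)} mismatches={mismatches}")
print(f"legacy={legacy_best * 1e3:.2f} ms indexed={indexed_best * 1e3:.2f} ms")
```

Comparing with `is` rather than `==` checks reference identity, which is the stronger property the parity line claims.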

End-to-end on asprox is unchanged (asprox is malware with mostly direct calls, so it doesn't stress resolveRegisterCalls). The win scales linearly with the number of indirect calls in the binary, which is high in Go and other compiler-heavy targets.

Behavior compatibility

  • Public API of IndirectCallAnalyzer unchanged (searchBlock, processBlock, resolveRegisterCalls, getDword all keep the same signatures).
  • Same block-list reference identity returned for any given address.
  • Report serialization untouched.
  • asprox sha256, num_instructions, function count, and integration assertions unchanged.

Test plan

  • python -m pytest tests/test* — 111 passed, 79 subtests passed in 12.63 s
  • python -m ruff check . — All checks passed
  • python -m ruff format --check . — 95 files already formatted
  • Microbench parity check — 1200/1200 identical block references
  • End-to-end asprox disassembly invariants verified

Residual risk

  • The cache is on analysis_state, so its lifetime is tied to the state object. If blocks were ever mutated after the first searchBlock call (currently they aren't — resolveRegisterCalls only runs after finalizeAnalysis), the index would go stale. The fix would be cache invalidation on the mutation site, not here.
  • contextlib.suppress(AttributeError) is a deliberate fallback for objects that reject attribute assignment — for those the index is rebuilt on every call, which is no worse than the legacy O(B·I) scan.
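
Both risks can be reproduced with small stand-ins. The sketch below assumes a hypothetical cache attribute `_block_index` and two toy state classes; neither is taken from the SMDA code base, and deleting the attribute is just one possible way to invalidate the cache at a mutation site.

```python
import contextlib


def get_block_index(state):
    """Lazy {addr: block} index cached on the state (attribute name is hypothetical)."""
    index = getattr(state, "_block_index", None)
    if index is None:
        index = {}
        for block in state.getBlocks():
            for ins in block:
                index.setdefault(ins[0], block)
        with contextlib.suppress(AttributeError):
            state._block_index = index
    return index


class MutableState:
    """Toy state that accepts new attributes, so the cache sticks."""

    def __init__(self, blocks):
        self.blocks = blocks

    def getBlocks(self):
        return self.blocks


class SlottedState:
    """Toy state that rejects new attributes via __slots__."""

    __slots__ = ("blocks",)

    def __init__(self, blocks):
        self.blocks = blocks

    def getBlocks(self):
        return self.blocks


# Risk 1: blocks mutated after the first lookup leave the cached index stale.
mutable = MutableState([[(0x10, 1, "nop", "")]])
get_block_index(mutable)
mutable.blocks.append([(0x20, 1, "ret", "")])
stale = 0x20 not in get_block_index(mutable)
del mutable._block_index  # invalidation at the (hypothetical) mutation site
fresh = 0x20 in get_block_index(mutable)

# Risk 2: a slotted state cannot hold the cache, so every call rebuilds it.
slotted = SlottedState([[(0x10, 1, "nop", "")]])
first = get_block_index(slotted)
second = get_block_index(slotted)
rebuilt_each_call = first is not second
```

In the slotted case the two dicts compare equal but are distinct objects, so lookups stay correct and only the caching benefit is lost.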

Still deferred (out of scope for this branch)

  • SmdaFunction.getNormalizedBlockRefs caching (needs architecture_metadata mutation tracking).
  • BinaryInfo.getImportedFunctions discards a PeSymbolProvider.parseSymbols result before re-parsing for imports.
  • Static *FileLoader.getArchitecture / getCodeAreas each call lief.parse(binary); BinaryInfo caches the result, but the static accessors don't share it.
  • Dead _logCandidateStats + latent == 2 vs == 0 bug in FunctionCandidateManager.
  • mcrit-install 3×5 CI matrix likely over-spec for a smoke test.

Review follow-ups

  • Gemini (PR #47, 2026-05-26): suggested moving the index cache off the analyzer and onto analysis_state to avoid re-entrancy/thread-safety risks and simplify resolveRegisterCalls. Applied in 1e0fc9c: the rerun speedup is ~107× (up from ~92×) with 0/1200 parity mismatches, and the fallback branch and try/finally are removed.

https://claude.ai/code/session_01C8CcS2k1g59ByLKYdEcaxR

resolveRegisterCalls() resolves each "call <register>" by walking the
CFG backward through up to block_depth (=3) levels of incoming refs.
At every level, searchBlock was doing a linear scan over every block in
the function and, for each block, a list comprehension over every
instruction:

    for block in analysis_state.getBlocks():
        if address in [i[0] for i in block]:
            return block

So one call to searchBlock is O(B*I) — and the recursive descent into
processBlock calls it once per incoming ref at every depth. Functions
with many register calls (the file already mentions a Go sample with
130k of them) hit this hot path hard.

This commit:

* Seeds an {instruction_addr: containing_block} dict once at the start
  of resolveRegisterCalls(), so every searchBlock lookup is O(1).
* Preserves "first matching block wins" by using `if addr not in index`
  during construction — important because FunctionAnalysisState.getBlocks
  can place the same instruction in multiple overlapping blocks via
  the sorted potential_starts walk.
* Clears the index in a finally so a reused analyzer instance never
  serves a stale index after the function completes.
* Keeps a slim linear-scan fallback in searchBlock for direct callers
  (e.g. existing unit tests that drive processBlock without going
  through resolveRegisterCalls).
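
For comparison with the later state-cached version, the design in this first commit can be sketched as below. The class name, the snake_case method names, the `call_addresses` parameter, and the `None` miss value are hypothetical stand-ins chosen for the example; the real resolveRegisterCalls walks incoming refs rather than a plain address list.

```python
class SeededAnalyzerSketch:
    """Index stored on the analyzer, seeded per function and cleared afterwards."""

    def __init__(self):
        self._block_index = None  # hypothetical attribute name

    def search_block(self, analysis_state, address):
        if self._block_index is not None:
            return self._block_index.get(address)
        # Slim fallback for direct callers that never seeded the index.
        for block in analysis_state.getBlocks():
            if address in [i[0] for i in block]:
                return block
        return None

    def resolve_register_calls(self, analysis_state, call_addresses):
        # Seed once so every lookup below is a dict access.
        self._block_index = {}
        for block in analysis_state.getBlocks():
            for ins in block:
                if ins[0] not in self._block_index:  # first matching block wins
                    self._block_index[ins[0]] = block
        try:
            return [self.search_block(analysis_state, a) for a in call_addresses]
        finally:
            # A reused analyzer must not serve this function's index later.
            self._block_index = None


class FakeState:
    """Stand-in for FunctionAnalysisState."""

    def __init__(self, blocks):
        self._blocks = blocks

    def getBlocks(self):
        return self._blocks


state = FakeState([[(0x10, 1, "nop", "")], [(0x20, 2, "call", "eax")]])
analyzer = SeededAnalyzerSketch()
resolved = analyzer.resolve_register_calls(state, [0x20])
direct = analyzer.search_block(state, 0x10)  # takes the fallback branch
```

The transient `_block_index` on the analyzer is the piece the review later flagged, since two interleaved analyses sharing one analyzer would overwrite each other's index.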

Microbench (80 blocks × 15 instructions, 1200 lookups):
  legacy linear scan:    17.04 ms
  indexed O(1) lookup:    0.18 ms
  -> 92x faster, bit-identical block-object references returned.

End-to-end on asprox is unchanged (it has few register calls); the win
scales with the number of indirect calls in the binary.

Validation:
- pytest tests/test* -> 111 passed, 79 subtests passed
- ruff check + format --check clean
- asprox sha256 / num_instructions / function count unchanged

coderabbitai Bot commented May 26, 2026

Important

Review skipped

Auto reviews are disabled on this repository. To trigger a review, include @coderabbit in the PR description. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes block lookups in IndirectCallAnalyzer from $O(N^2)$ to $O(1)$ by introducing an instruction-to-block index during register call resolution. The reviewer recommends caching this index lazily on the analysis_state object instead of storing it on the analyzer instance (self). This change would ensure thread safety and re-entrancy, while also simplifying resolveRegisterCalls by eliminating the need for manual index management and try...finally blocks.

Comment thread src/smda/intel/IndirectCallAnalyzer.py Outdated
Comment thread src/smda/intel/IndirectCallAnalyzer.py Outdated
Address Gemini review on PR #47: stash the {instruction_addr: block}
index on analysis_state instead of self. analysis_state has the right
lifetime (one per function analysis) so the cache is naturally
re-entrancy-safe and can't outlive what it indexes, the analyzer keeps
no transient state, and the explicit seed + try/finally in
resolveRegisterCalls goes away. searchBlock now lazy-builds on first
call, so the legacy fallback branch is also gone — every caller
(including direct unit-test callers) gets the O(1) path automatically.

contextlib.suppress(AttributeError) guards the cache write so that
test doubles or hypothetical __slots__-locked states still work; the
freshly built dict is returned in that case.

Re-ran the focused micro-bench (80 blocks x 15 instructions, 1200
lookups): ~107x faster than the legacy scan, 0/1200 parity mismatches.
End-to-end asprox sha256/num_instructions/function count unchanged.

Validation:
- pytest tests/test* -> 111 passed, 79 subtests passed
- ruff check + format --check clean
@r0ny123 r0ny123 marked this pull request as ready for review May 26, 2026 18:39
@r0ny123 r0ny123 merged commit b2a1d20 into master May 26, 2026
46 checks passed
@r0ny123 r0ny123 deleted the claude/perf-sweep-20260526-indirect-call-index branch May 28, 2026 20:50
r0ny123 pushed a commit that referenced this pull request Jun 10, 2026