
perf: SIMD writeEscaped + telemetry early exit + bench Heisenbench fix#300

Merged
justrach merged 3 commits into main from perf/simd-telemetry-bench-fix
Apr 19, 2026

Conversation

@justrach
Owner

Summary

Three targeted perf/fix commits cherry-picked cleanly onto current main.

  • SIMD writeEscaped in handleCall: replaces scalar per-byte loop with mcpj.writeEscaped (16-byte SIMD batches from mcp-zig) in the hot response-assembly path and BenchContext.runToolCall. -3 to -8% on tree/outline.
  • Telemetry early exit: recordToolCall returns immediately when telemetry is disabled, skipping the getrusage syscall. Zero cost for --no-telemetry users.
  • Bench Heisenbench fix: BenchContext.runToolCall previously returned usize, so runCase timed the full call including telem.recordToolCall(). When the telemetry early exit removed the getrusage from the timed window, codedb_status appeared 35% slower. Fixed by returning struct { dispatch_ns, response_bytes } and accumulating only the inner dispatch timer.

Benchmark (dispatch-only, 22-file corpus, 100 iters, current main baseline)

Tool             Latency   Ops/sec
codedb_tree       16.1µs    62,035
codedb_outline    53.3µs    18,737
codedb_search     76.6µs    13,045
codedb_bundle    118.5µs     8,433
codedb_status     70.8µs    14,106

zig build and zig build bench both clean. No regressions.

Test plan

  • zig build compiles clean on macOS arm64
  • zig build bench runs and outputs consistent latency table

🤖 Generated with Claude Code

justrach and others added 3 commits April 19, 2026 13:13
… handleCall

The 3-block response assembler was calling the local scalar writeEscaped
(1-byte inner loop) for summary, raw data, and guidance. Switch to
mcpj.writeEscaped (mcp-zig SIMD, 16-byte vectors) which is already
imported — meaningful gain on large tool outputs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
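The batching idea behind mcpj.writeEscaped can be sketched in C. This is an illustrative stand-in, not the mcp-zig API: the 16-byte "is the chunk clean?" scan here is scalar, where the real SIMD version does it with one vector compare, but the fast-path/slow-path split is the same.

```c
#include <stdio.h>
#include <string.h>

/* Nonzero if byte c must be escaped inside a JSON string. */
static int needs_escape(unsigned char c) {
    return c < 0x20 || c == '"' || c == '\\';
}

/* Illustrative batched escape writer: scan 16 bytes at a time; when the
 * whole chunk is clean, emit it with one memcpy instead of 16 per-byte
 * checks-and-writes. Returns the number of bytes written to out. */
static size_t write_escaped(char *out, const char *in, size_t len) {
    size_t o = 0, i = 0;
    while (i + 16 <= len) {
        int clean = 1;
        for (size_t j = 0; j < 16; j++)
            if (needs_escape((unsigned char)in[i + j])) { clean = 0; break; }
        if (clean) { memcpy(out + o, in + i, 16); o += 16; i += 16; continue; }
        /* chunk contains an escapable byte: handle these 16 bytes scalar */
        for (size_t j = 0; j < 16; j++, i++) {
            unsigned char c = (unsigned char)in[i];
            if (c == '"' || c == '\\') { out[o++] = '\\'; out[o++] = (char)c; }
            else if (c == '\n') { out[o++] = '\\'; out[o++] = 'n'; }
            else if (!needs_escape(c)) out[o++] = (char)c;
            else o += (size_t)sprintf(out + o, "\\u%04x", c);
        }
    }
    for (; i < len; i++) {   /* scalar tail for the final partial chunk */
        unsigned char c = (unsigned char)in[i];
        if (c == '"' || c == '\\') { out[o++] = '\\'; out[o++] = (char)c; }
        else if (c == '\n') { out[o++] = '\\'; out[o++] = 'n'; }
        else if (!needs_escape(c)) out[o++] = (char)c;
        else o += (size_t)sprintf(out + o, "\\u%04x", c);
    }
    return o;
}
```

The win comes from large tool outputs (tree listings, outlines) being overwhelmingly escape-free, so the memcpy fast path dominates.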
recordToolCall called getRssKb() unconditionally before passing to
record(), which guards on self.enabled. Move the early-exit to
recordToolCall so --no-telemetry users pay zero syscall overhead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
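The early-exit pattern described above can be sketched in C, with a hypothetical `telemetry` struct standing in for the real type; the point is only that the `enabled` check happens before the getrusage syscall, not inside the inner record step.

```c
#include <sys/resource.h>

/* Hypothetical telemetry state mirroring the commit's description. */
struct telemetry {
    int enabled;
    long last_rss_kb;
    int calls_recorded;
};

/* Early exit: when telemetry is disabled, return before touching
 * getrusage, so --no-telemetry runs pay zero syscall cost. Previously
 * the RSS sample was taken unconditionally and only the inner record()
 * step checked `enabled`. */
static void record_tool_call(struct telemetry *t) {
    if (!t->enabled) return;           /* the fix: bail before the syscall */
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0)
        t->last_rss_kb = ru.ru_maxrss; /* kB on Linux, bytes on macOS */
    t->calls_recorded++;
}
```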
BenchContext.runToolCall previously returned usize (response bytes),
so runCase timed the entire call including telem.recordToolCall() and
response assembly. With the telemetry early-exit added in 2ad595e,
the telem.recordToolCall() on disabled runs shrank from ~25µs (getrusage
syscall) to ~0µs, making status appear 35% slower — a Heisenbench.

Fix: runToolCall now returns struct { dispatch_ns, response_bytes }.
runCase accumulates dispatch_ns (the inner nanoTimestamp delta around
dispatch() only). The outer wall-clock timer is removed entirely.

Also: BenchContext.runToolCall now uses mcpj.writeEscaped (SIMD) for
the bench path, matching handleCall.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

Benchmark Regression Report

Threshold: 10.00%

Tool             Base (ns)  Head (ns)    Delta  Status
codedb_bundle       575620     519993   -9.66%  OK
codedb_changes      101400      61716  -39.14%  OK
codedb_deps          50442      12928  -74.37%  OK
codedb_edit          18263       5796  -68.26%  OK
codedb_find          91043      67202  -26.19%  OK
codedb_hot          147424     109964  -25.41%  OK
codedb_outline      347133     259522  -25.24%  OK
codedb_read         134856      96718  -28.28%  OK
codedb_search       238751     191011  -20.00%  OK
codedb_snapshot    4003872    2516884  -37.14%  OK
codedb_status       140760     116487  -17.24%  OK
codedb_symbol       107942      60583  -43.87%  OK
codedb_tree         105743      79150  -25.15%  OK
codedb_word         134555      77695  -42.26%  OK


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ef54b6167b


Comment thread: src/mcp.zig

    - result.appendSlice(alloc, if (is_error) "],\"isError\":true}" else "],\"isError\":false}") catch return result.items.len;
    - return result.items.len;
    + result.appendSlice(alloc, if (is_error) "],\"isError\":true}" else "],\"isError\":false}") catch return .{ .dispatch_ns = @intCast(elapsed), .response_bytes = result.items.len };
    + return .{ .dispatch_ns = @intCast(elapsed), .response_bytes = result.items.len };

P1: Switch dispatch timing to monotonic nanoseconds

runToolCall now returns dispatch_ns via @intCast(elapsed), but elapsed is derived from cio.nanoTimestamp() (wall-clock CLOCK_REALTIME), which can go backwards or jump during NTP or manual clock adjustments. In that case the cast to u64 traps and aborts zig build bench; even without a trap, a clock step can skew benchmark latencies well beyond the 10% threshold. This was introduced when runCase started trusting dispatch_ns instead of its previous monotonic timer path.


@justrach justrach merged commit c463e3b into main Apr 19, 2026
1 check passed
