perf(core): Automated performance tuning by Claude #1633

Draft
yamadashy wants to merge 9 commits into main from perf/auto-perf-tuning

@yamadashy yamadashy commented Jun 11, 2026

Owner

Summary

Eight behavior-preserving optimizations from automated perf-tuning passes, each the single highest-impact candidate of its run:

  1. searchFiles: fs adapter answering globby's gitignore stat() calls from dirent types (commit 1)
  2. Security check: pre-warm the secretlint worker pool so thread spawn overlaps file collection (commit 2)
  3. Security check: stream lint batches to the workers while file collection is still in flight (commit 3)
  4. searchFiles: cache readdir results across globby's internal double traversal (commit 4)
  5. truncateBase64: sampled base64-run detection replacing the per-character precondition scan (commit 5)
  6. createRenderContext: lazy memoized getters for fileLineCounts / markdownCodeBlockDelimiter (commit 6)
  7. searchFiles: in-repo synchronous gitignore filter replacing globby's async per-path predicate (commit 7)
  8. searchFiles: fast-path the ignore-file predicate with constant-prefix keys + direct ignore.ignores() (commit 8)

Each change was individually verified against its own baseline on its own machine: −4.8%, −5.1%, −4.3%, −3.5%, −3.1%, −2.7%, −3.6% and −4.6% end-to-end respectively (details below; runs were measured on different machines/days, so per-change deltas are the reliable numbers).


Change 1: globby fs adapter (searchFiles)

With gitignore: true, globby's ignore filter calls fs.stat on every matched path (~1,100 on this repo per run) only to decide whether trailing-slash rules (dir/) apply. The information it needs (is this path a directory?) was already produced by the traversal's readdir(withFileTypes) calls.

createGlobbyFsAdapter() in src/core/file/fileSearch.ts:

  • Wraps the callback-form readdir to record each dirent's type while delegating to the real node:fs readdir.
  • Serves globby's stat() calls from that map; falls through to a real stat for symlinks/special entries and unseen paths, so behavior is preserved exactly.
  • Forwards statSync so globby's cwd-is-directory validation keeps running; a fresh adapter per call means the cache cannot go stale.
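The caching idea in the list above can be sketched as follows. This is a minimal illustration, not the PR's createGlobbyFsAdapter(): the names `DirentLike` and `DirentTypeCache` are hypothetical, and the wiring into globby's `fs` option is left out. It only shows how entry types recorded during readdir can later answer an "is this a directory?" question, with `undefined` signalling that a real stat is still needed.

```typescript
import * as path from "node:path";

// Minimal shape of the fs.Dirent members this sketch relies on.
interface DirentLike {
  name: string;
  isFile(): boolean;
  isDirectory(): boolean;
}

// Hypothetical helper (not the PR's code): remembers entry types seen by readdir.
class DirentTypeCache {
  // Absolute path -> whether that path is a directory.
  private readonly kinds = new Map<string, boolean>();

  // Record the entries returned by one readdir(dir, { withFileTypes: true }) call.
  record(dir: string, dirents: readonly DirentLike[]): void {
    for (const dirent of dirents) {
      // Symlinks and special entries are skipped on purpose: they still need a real stat.
      if (dirent.isFile() || dirent.isDirectory()) {
        this.kinds.set(path.resolve(dir, dirent.name), dirent.isDirectory());
      }
    }
  }

  // Returns true/false when the type is known, or undefined to request a real fs.stat.
  isDirectory(target: string): boolean | undefined {
    return this.kinds.get(path.resolve(target));
  }
}
```

A caller would consult `isDirectory()` first and fall back to the real stat whenever it returns `undefined`, which is what keeps symlinks and unseen paths on the original code path.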

Regression test: tests/core/file/fileSearchFsAdapter.test.ts (real directory tree with symlinks, trailing-slash rules, nested .gitignore; asserts per-path stats are eliminated).

Benchmark (8 interleaved warm runs): end-to-end median 1242ms → 1182ms (−60ms, −4.8%); [globby] phase 254ms → 206ms.

Review feedback addressed: gemini-code-assist suggested posix-normalizing cache keys for Windows. This was investigated and declined after path.win32 verification: globby's stat path goes through path.normalize() + path.resolve(), producing the same native-separator keys, and a blanket \ → / rewrite would risk collisions on POSIX. The rationale is documented in the adapter's docstring; see thread replies.

Change 2: security worker pool pre-warm (packager / securityCheck)

Spawning the 2 secretlint worker threads costs ~50–100ms each (thread creation + importing the 7MB @secretlint/secretlint-rule-preset-recommend bundle) and previously happened inside runSecurityCheck — i.e. after file collection finished, squarely on the critical path. The security leg gates the security ∥ processFiles phase (~208ms of a ~242ms phase).

  • createSecurityCheckTaskRunner() in src/core/security/securityCheck.ts creates the pool early in pack() (mirroring the already-merged metrics prewarm pattern) and posts one empty-items task per worker; the spawn then overlaps the ~165ms collect+git phase.
  • The warm-up count mirrors Tinypool's own sizing (min(2, concurrency, ceil(numOfTasks/100))), so no thread is spawned that the real workload would not have created.
  • An empty-items batch returns [] without linting anything — the same rules run on exactly the same content; only spawn timing changes.
  • Pool teardown starts right after the security phase resolves, overlapping output generation and metrics; the finally block awaits the same memoized promise on every path (including errors and the skill-generate early return), so no path leaks threads.
  • runSecurityCheck / validateFileSafety accept the pre-created runner via their deps objects (Partial-merge, backward compatible — existing callers unaffected).
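The sizing rule quoted above, min(2, concurrency, ceil(numOfTasks/100)), is small enough to write out. The function name and constants below are hypothetical; only the formula comes from the description.

```typescript
// Assumed constants, taken from the formula in the description.
const MAX_SECURITY_WORKERS = 2;
const TASKS_PER_WORKER = 100;

// Hypothetical helper: how many workers to warm up for a given workload.
function warmupWorkerCount(numOfTasks: number, concurrency: number): number {
  return Math.min(
    MAX_SECURITY_WORKERS,
    concurrency,
    Math.ceil(numOfTasks / TASKS_PER_WORKER),
  );
}
```

Under this rule a 100-item workload warms one worker and a 101-item workload warms two, so the warm-up never spawns a thread the real workload would not have used.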

8 new tests: warmup sizing/failure tolerance, runner reuse without pool creation or premature cleanup, forwarding through validateFileSafety, and packager lifecycle (cleanup on success / on error / disabled security). securityScanSpec.test.ts now routes its inline runner through the new production forwarding path, keeping the end-to-end regression net intact.

Benchmark (20 interleaved warm pairs, quiet 4-core host): end-to-end median 859ms → 815.5ms (−43.5ms, −5.1%); security check phase (trace log) 207.8ms → 161.9ms.

Change 3: stream security batches during file collection

Even with pre-warmed workers (change 2), all lint batches were dispatched only after file collection had fully completed, leaving the entire lint wall time (~160ms on this repo) on the critical path while the workers sat idle during the I/O-bound collection phase.

  • createSecurityCheckStream() (new src/core/security/securityCheckStreaming.ts): collectFiles now reports each file via an optional onFileCollected callback as soon as its content is read; the stream buffers them and dispatches every full BATCH_SIZE batch to the worker pool immediately, so lint work overlaps collection instead of following it.
  • finalize() returns exactly what runSecurityCheck would: it flushes the remainder plus git diff/log items (same construction, same trailing position), enqueues any raw file that never arrived via the callback (a safety net so custom collectFiles implementations cannot skip the check), and re-orders suspicious file results back to canonical rawFiles order — streamed batches complete in nondeterministic collection order.
  • Batch failures are captured, not rejected, until finalize awaits them — so error paths that abandon the session can never surface unhandled rejections; finalize re-throws the first failure like Promise.all would.
  • Two-stage warm-up: createSecurityCheckTaskRunner() now runs before file search, so the first worker's spawn + secretlint preset import (~100ms) overlaps the ~155ms search phase and the worker is ready when the first batch arrives. The second warm-up task is posted via completeWarmup() once the file count is known, preserving the existing sizing rule (second worker only from 101 items).
  • runSecurityCheck keeps its original behavior for non-streamed callers (MCP, lib API); multi-root display-path labeling is applied before items enter the stream, so streamed paths match rawFiles exactly.
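The eager-dispatch and captured-failure behavior described above can be illustrated with a generic batcher. This is a sketch under stated assumptions, not createSecurityCheckStream(): `BatchStream` and `BatchOutcome` are hypothetical names, the worker pool is replaced by an arbitrary `runBatch` function, and the git items, safety net and result re-ordering are omitted.

```typescript
// A settled batch: either its results or the error it failed with.
type BatchOutcome<R> = { ok: true; results: R[] } | { ok: false; error: unknown };

// Hypothetical generic batcher illustrating eager dispatch with deferred failures.
class BatchStream<T, R> {
  private buffer: T[] = [];
  private readonly pending: Promise<BatchOutcome<R>>[] = [];
  private readonly batchSize: number;
  private readonly runBatch: (items: T[]) => Promise<R[]>;

  constructor(batchSize: number, runBatch: (items: T[]) => Promise<R[]>) {
    this.batchSize = batchSize;
    this.runBatch = runBatch;
  }

  // Buffer one item and dispatch immediately once a full batch is available.
  push(item: T): void {
    this.buffer.push(item);
    if (this.buffer.length >= this.batchSize) this.dispatch();
  }

  private dispatch(): void {
    const items = this.buffer;
    this.buffer = [];
    // The Promise constructor also captures a synchronous throw from runBatch.
    const run = new Promise<R[]>((resolve) => resolve(this.runBatch(items)));
    // Failures become values, so an abandoned stream never leaves an unhandled rejection.
    this.pending.push(
      run.then(
        (results): BatchOutcome<R> => ({ ok: true, results }),
        (error: unknown): BatchOutcome<R> => ({ ok: false, error }),
      ),
    );
  }

  // Flush the remainder, wait for every batch, then surface the first failure (in dispatch order).
  async finalize(): Promise<R[]> {
    if (this.buffer.length > 0) this.dispatch();
    const outcomes = await Promise.all(this.pending);
    const results: R[] = [];
    for (const outcome of outcomes) {
      if (!outcome.ok) throw outcome.error;
      results.push(...outcome.results);
    }
    return results;
  }
}
```

One difference from Promise.all worth noting: this sketch re-throws the first failure in dispatch order rather than the first one to reject in time.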

12 new tests: batch sizing identical to runSecurityCheck (full batches + remainder), eager dispatch before finalize, the unstreamed-file safety net, dedup of already-streamed files, git item construction/ordering, canonical result re-ordering, error propagation (incl. pre-finalize failures), progress totals, two-stage warm-up sizing/idempotency/failure tolerance, and the onFileCollected contract in collectFiles.

Benchmark (39 interleaved warm pairs, 4-core host, default pack of this repo): end-to-end median 869ms → 832ms (−37ms, −4.3%), paired mean delta −34.8ms (−4.0%), paired t = 5.51. Trace: security-check tail after collection ~160ms → ~60–90ms.

Behavior verification (change 3)

  • Output byte-identical (cmp) between base and patched builds for the default run, --no-security-check, --include-empty-directories, and multi-root (src website).
  • Planted-secrets directory: both builds flag and exclude the same 2 suspicious files with identical output.
  • Local review (correctness + tests/conventions): no correctness findings; a weak packager-wiring assertion and a sort-sentinel nit were fixed before push.

Change 4: readdir cache across globby's double traversal (commit 4)

This was flagged as the "future candidate" in the change-3 run, now measured and implemented. A single globby() call walks the tree twice: globIgnoreFiles first discovers .gitignore/.repomixignore/.ignore files, then the main fast-glob scan re-walks the same tree, issuing 582 readdir(withFileTypes) calls for 291 directories on this repo. Both walks share the per-call options.fs adapter, so it is the natural cache point.

  • createGlobbyFsAdapter() now records each successful readdir(withFileTypes) result and replays the second walk's calls from memory via process.nextTick (preserving the async callback contract fast-glob expects). One readdir syscall per directory per search.
  • Errors are never cached — a transiently failing directory is retried by the second walk, unchanged from before.
  • Sharing one Dirent[] across walks is safe: fast-glob wraps dirents in its own entry objects and never mutates the source array (the only mutation site in @nodelib/fs.scandir is behind followSymbolicLinks: true, which repomix never sets).
  • Both walks now see one consistent directory snapshot per search; previously a directory changing between the two walks could be listed differently by each (noted in the adapter docstring).
  • New regression test asserts no directory is listed twice during a search (fails on the previous code with 2× calls).
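The replay mechanism from the list above, reduced to its core, might look like the sketch below. `cacheReaddir` and its types are hypothetical and generic over the entry type; the real adapter wraps node:fs readdir with withFileTypes and is not reproduced here. As a simplification, two calls for the same directory that are in flight at the same time are not deduplicated.

```typescript
type ReaddirCallback<E> = (error: Error | null, entries?: E[]) => void;
type Readdir<E> = (dir: string, callback: ReaddirCallback<E>) => void;

// Hypothetical wrapper: caches successful listings and replays them asynchronously.
function cacheReaddir<E>(realReaddir: Readdir<E>): Readdir<E> {
  const cache = new Map<string, E[]>();
  return (dir, callback) => {
    const cached = cache.get(dir);
    if (cached !== undefined) {
      // Replay on the next tick so the caller never observes a synchronous callback.
      process.nextTick(() => callback(null, cached));
      return;
    }
    realReaddir(dir, (error, entries) => {
      // Only successful listings are stored; a failing directory is retried next time.
      if (error === null && entries !== undefined) cache.set(dir, entries);
      callback(error, entries);
    });
  };
}
```

Creating one wrapper per search, as the description does with its per-call adapter, keeps the cache from outliving the directory snapshot it was built from.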

Benchmark (25 interleaved warm pairs, quiet 4-core Linux, default pack of this repo): end-to-end median 808ms → 780ms (−28ms, −3.5%), paired delta median −30ms, paired t ≈ 4.6, 20/25 pairs improved. [globby] phase 166–177ms → 139–149ms.

Behavior verification (change 4)

  • Output byte-identical (cmp) between base and patched builds for the default run, --include-empty-directories, multi-root (src website), and --no-gitignore (with the A/B-swapped lib file excluded from the pack, since --no-gitignore packs lib/ itself).
  • Local review (correctness + tests/conventions): no correctness findings; two test-clarity suggestions (spy-mechanism comment, withFileTypes === true filter guard) applied before push.

Alternatives investigated and rejected (change 4 run, 5 parallel investigation passes)

  • SECURITY_CHECK_BATCH_SIZE 50 → 200: measured −58ms under heavy machine load, but 25 quiet interleaved pairs show identical medians (793ms vs 793ms) — the IPC overhead it removes only matters on contended hosts; 400 regresses (~+20ms, worker imbalance).
  • Lazy fileLineCounts / markdownCodeBlockDelimiter getters in createRenderContext (~14–15ms on the default XML path; they are only consumed by skill generation / the markdown style): below the 2% threshold. (Superseded — see change 6: re-measured at −2.7% e2e on the current tip, where the tail overlap structure has changed.)
  • Lazy handlebars import: cuts ~17ms of startup, but the deferred import() evaluation blocks in-flight collection I/O callbacks (a 0.24ms readFile stretched to 94ms in the probe), netting only ~4ms end-to-end. Module evaluation is single-threaded — this constraint applies to all lazy-import ideas in this codebase.
  • Fire-and-forget metrics pool destroy (~15ms median): still borderline below threshold, unchanged from the previous pass.
  • Startup config probing parallelization (9 sequential stat() calls): <1ms, OS handles it efficiently.

Change 5: sampled base64-run scan in truncateBase64 (commit 5)

This repo's own repomix.config.json enables truncateBase64: true (the CI benchmark workload), so hasLongBase64Run — the cheap precondition that gates the expensive standalone-base64 regex — runs over every packed file's content (~5.5 MB per pack). It walked every character (23ms main-thread self time in CPU profiles, 35.6ms isolated on the corpus) even though it almost always returns false.

  • The scan now samples one character every MIN_BASE64_LENGTH_STANDALONE (256) positions: a qualifying run occupies 256 consecutive indices, so it must contain a sample point — no run can slip between samples (the last sample is clamped to len-1 to cover the trailing partial window). Only a sampled base64-class hit triggers a bounded outward expansion measuring the surrounding run; after a too-short run the sampling phase resets past it (a qualifying run starting at hi+1 always contains sample hi+256).
  • Equivalence: differential-tested against the per-character reference on the full repo corpus (1,096 files, 0 mismatches) plus 20k randomized fuzz cases; false positives only cost a regex pass (the regex re-validates), false negatives are provably impossible. Surrogate-pair/charCodeAt semantics unchanged. Worst case stays O(n) (bounded ~2× character inspections on pathological alternating runs).
  • 6 new tests: sample-stride alignments 0–511, run ending exactly at EOF, whole-content run, repeated expand-and-skip phase resets over misaligned near-threshold runs, and a seeded differential fuzz pinning both false-positive and false-negative directions.
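To make the sampling argument concrete, here is one way to implement a sampled run detector. It is a sketch, not the PR's hasLongBase64Run: the function names are hypothetical, the character class is assumed to be the [A-Za-z0-9+/] set from the rejected regex precheck, and the probe placement (always `start + minRun - 1`) is a simplification of the fixed-stride scheme described above. The invariant is that no qualifying run can begin before `start`, so any qualifying run beginning in the current window must cover the probe index.

```typescript
// Assumed threshold, matching MIN_BASE64_LENGTH_STANDALONE from the description.
const MIN_RUN = 256;

// Assumed character class: A-Z, a-z, 0-9, '+' and '/'.
function isBase64Code(code: number): boolean {
  return (
    (code >= 65 && code <= 90) ||
    (code >= 97 && code <= 122) ||
    (code >= 48 && code <= 57) ||
    code === 43 ||
    code === 47
  );
}

// Hypothetical sampled scan: true when content holds at least minRun consecutive base64 chars.
function hasLongBase64RunSampled(content: string, minRun: number = MIN_RUN): boolean {
  const n = content.length;
  let start = 0; // invariant: no qualifying run can begin before this index
  while (start + minRun <= n) {
    // A run of minRun chars beginning anywhere in [start, probe] must cover probe.
    const probe = start + minRun - 1;
    if (!isBase64Code(content.charCodeAt(probe))) {
      start = probe + 1;
      continue;
    }
    // Bounded outward expansion around the sampled hit.
    let lo = probe;
    while (lo > start && isBase64Code(content.charCodeAt(lo - 1))) lo--;
    let hi = probe;
    while (hi + 1 < n && hi - lo + 1 < minRun && isBase64Code(content.charCodeAt(hi + 1))) hi++;
    if (hi - lo + 1 >= minRun) return true;
    // The run was too short; index hi + 1 is a non-base64 char or the end of the content.
    start = hi + 2;
  }
  return false;
}
```

On content with no base64-class characters at the probe positions, this inspects roughly one character per `minRun` positions, which is where the reported isolated speedup would come from; the worst case remains linear because each expansion moves `start` past everything it inspected.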

Benchmark (20 interleaved warm pairs, quiet 4-core Linux, default pack of this repo, pristine HEAD worktree build vs patched build): end-to-end median 865ms → 820.5ms (paired delta median −26.5ms, −3.1%), paired mean −37.5ms, t = 5.14, 18/20 pairs improved. Isolated scan cost over the packed corpus: 35.6ms → 1.6ms p50 (~22×).

Behavior verification (change 5)

  • Output byte-identical (cmp) between base and patched builds on the same tree.
  • Local review (2 agents: correctness, tests/conventions): no correctness findings (sampling coverage, termination, off-by-ones, complexity all verified); two comment-accuracy nits (doc fraction claim, test-rationale comment) fixed before push.

Alternatives investigated and rejected (change 5 run, 5 parallel investigation scopes)

  • Regex precheck /[A-Za-z0-9+/]{256}/.test(): measured 4.5× slower than the per-character loop (155ms vs 35ms on the corpus) — bounded-repetition re-scanning at each start position.
  • Early git diff/log token dispatch from the packager: with a warm token cache those tasks resolve while calculateMetrics awaits outputPromise (final Promise.all resolves in ~0ms; the 63–67ms wall figure is main-thread-busy completion latency, not queue wait) — e2e median +15ms under noise, unproven.
  • FILE_COLLECT_CONCURRENCY 50 → 128/256: identical medians over 40 quiet interleaved runs; libuv's 4-thread pool is saturated at queue depth 50.
  • Startup lazy-import prefetches (tinypool / fast-glob / handlebars module-level import()): 0 to −3ms — ESM already fetches/compiles the static graph in parallel; the startup budget is sequential evaluation of ~255 modules, which only bundling would cut.
  • Skip fileLineCounts + markdownCodeBlockDelimiter on the XML path: re-measured at ~11ms p50 on a quiet machine (6.2 + 4.7) — still below the 2% threshold, matching the change-4 run's rejection. (Superseded — see change 6.)
  • Exit tail: measured 0.1–0.2ms after the summary prints — no event-loop retention, nothing to reclaim.

Round 6 (2026-06-12): no qualifying change — all candidates measured below the 2% threshold

This run investigated with 5 parallel scopes (startup, file search, file collection, output/metrics tail, cross-cutting profile), prototyped the two strongest candidates, and measured both below the ≥2% bar on a quiet machine; nothing was committed. Recording the negatives so future passes skip them:

  • Replace globby's ignore-file discovery walk with a readdir BFS + literal ignoreFiles paths: fully prototyped (exact fast-glob deep/entry-filter replication via micromatch, git-root gating, fallback paths). An isolated investigation measured −45–53ms, but that machine was running 5 agents concurrently; on a quiet machine, interleaved searchFiles medians are identical (106–109ms both) and 10 CLI pairs average −1.6ms. With the change-4 readdir cache already landed, the discovery walk's marginal cost over a raw BFS is only ~10–15ms of fast-glob matching CPU — the 45–53ms figure was contention inflation. Reverted.
  • Handlebars precompile + handlebars/runtime (drops the compiler, neo-async, source-map from the module graph; ~35–50ms isolated module-load saving): prototyped via lib patching, output byte-identical, but 20 interleaved CLI pairs show only ~7ms mean — module-graph savings compress e2e exactly like the change-5 run's lazy-import findings predicted.
  • Cache git diff/log token counts in tokenCountCache: the worker-side encode is only ~5ms per item (trace: Counted tokens ... Took: 4.96ms); the 69/85ms "Git diff/log token calculation" wall figures are main-thread-busy completion latency, not compute — same artifact the change-5 run documented for early dispatch. No e2e win available.
  • esbuild bundling (partial or full CLI): partial bundles of the globby/handlebars chains save ~35ms in isolation but ~5ms e2e (module cache already warm when defaultAction lazy-loads); full-CLI bundle has hard blockers (commander's dynamic require, import.meta.url in processConcurrency) and would be a build-system change beyond an automated perf pass.
  • Collection phase: I/O-bound at its floor — pure fs.readFile of the 1,095-file corpus is ~139ms of the ~162ms pipeline; remaining CPU is TextDecoder UTF-8 validation (~15ms, required). Sub-threshold micro-items only (shared TextDecoder ~0.5ms, path.resolve ~1ms).
  • Cross-cutting profile: GC total 43ms but no pause >9ms on the critical path; security/metrics IPC dispatch ~13.5ms spread across 44 non-blocking posts; hasLongBase64Run 2KB-prescreen ~5ms; cold-run MD5 fast-reject via byteLength index ~5.5ms (cold only); sortPaths/reportTopFiles/lineCounts all <3ms real.

Change 6: lazy render-context getters for fileLineCounts / markdownCodeBlockDelimiter (commit 6)

createRenderContext eagerly ran two full scans over every packed file's content (~5.5MB per pack) on every run: calculateFileLineCounts (consumed only by --skill-generate) and calculateMarkdownDelimiter (consumed only by the markdown/skill templates). Both are now memoized getters, so the default XML (and plain/JSON) path never executes them; markdown/skill consumers compute them once on first access with identical values. RenderContext shape and all call sites are unchanged.
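A memoized getter of the kind described above can be sketched as follows. The interface and function names are hypothetical, and `countLines` is an assumed stand-in (newline count plus one), not repomix's actual line-counting rule; only the lazy-and-memoized shape is the point.

```typescript
interface FileSketch {
  readonly path: string;
  readonly content: string;
}

// Hypothetical, reduced context shape with a single derived field.
interface RenderContextSketch {
  readonly files: readonly FileSketch[];
  readonly fileLineCounts: Record<string, number>;
}

// Assumed line-count rule for illustration only: number of "\n" characters plus one.
function countLines(content: string): number {
  let lines = 1;
  for (let i = 0; i < content.length; i++) {
    if (content.charCodeAt(i) === 10) lines++;
  }
  return lines;
}

function createRenderContextSketch(files: readonly FileSketch[]): RenderContextSketch {
  let lineCounts: Record<string, number> | undefined;
  return {
    files,
    // Computed on first access only; later reads return the same object.
    get fileLineCounts(): Record<string, number> {
      if (lineCounts === undefined) {
        lineCounts = {};
        for (const file of files) lineCounts[file.path] = countLines(file.content);
      }
      return lineCounts;
    },
  };
}
```

A template that never reads `fileLineCounts` never triggers the scan, while one that reads it repeatedly pays for it once.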

Why this clears the bar now when the change-4/5/round-6 passes rejected it (~11–15ms function cost; "<1ms e2e, hidden behind the parallel metrics branch"): this round's tail tracing established that on warm runs the metrics workers are pure cache hits (zero dispatches) and the git diff/log token tasks complete ~28ms before produceOutput resolves. generateOutput is therefore the sole main-thread bottleneck at the tail, so render-side cuts are wall-visible up to that slack. The earlier "fully hidden" rationale does not hold on the current tip's overlap structure.

Benchmark (90 interleaved ABBA pairs in 3 batches, quiet 4-core Linux, warm, default pack of this repo):

  • batch 1 (20 pairs): median Δ −21.7ms, t = −3.51, 15/20 improved
  • batch 2 (30 pairs): median Δ −25.6ms, t = −2.49, 19/30 improved
  • batch 3 (40 pairs): median Δ −40.3ms, t = −6.71, 37/40 improved
  • pooled mean −28.6ms on a ~1045ms baseline = −2.7%
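For readers who want to reproduce figures of this kind, the paired statistics can be computed as in the sketch below. It assumes the conventional paired t-statistic (mean delta divided by its standard error, with the sample standard deviation); the description does not say which exact formula the benchmark script used, and the names here are hypothetical.

```typescript
interface PairedSummary {
  pairs: number;
  meanDelta: number;
  medianDelta: number;
  tStatistic: number;
  improved: number;
}

// Hypothetical helper: summarizes patched-minus-baseline timings for interleaved pairs.
function summarizePairs(baselineMs: number[], patchedMs: number[]): PairedSummary {
  if (baselineMs.length !== patchedMs.length || baselineMs.length < 2) {
    throw new Error("need at least two pairs of equal-length samples");
  }
  const deltas = patchedMs.map((patched, i) => patched - baselineMs[i]);
  const n = deltas.length;
  const meanDelta = deltas.reduce((sum, d) => sum + d, 0) / n;

  const sorted = [...deltas].sort((a, b) => a - b);
  const mid = Math.floor(n / 2);
  const medianDelta = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;

  // Sample variance (n - 1 denominator), then t = mean / (sd / sqrt(n)).
  const variance = deltas.reduce((sum, d) => sum + (d - meanDelta) ** 2, 0) / (n - 1);
  const tStatistic = meanDelta / Math.sqrt(variance / n);

  // A negative delta means the patched build was faster in that pair.
  const improved = deltas.filter((d) => d < 0).length;
  return { pairs: n, meanDelta, medianDelta, tStatistic, improved };
}
```

With this sign convention an improvement yields a negative t, which matches how the later rounds report it.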

Isolated scan cost is ~11–16ms warm; the larger e2e delta is consistent with reduced allocation/GC pressure — match(/\n/g) allocates one match string per line (~200k allocations across the corpus) at peak heap, right before the 5MB render.

Behavior verification (change 6)

  • Output byte-identical (cmp) between base and patched builds for the xml and markdown styles (markdown exercises the getter path).
  • All 1385 tests pass; lint clean (3 pre-existing warnings in unrelated files).
  • Local review (2 agents: correctness, tests/conventions): no blockers — Handlebars lookupProperty triggers own-property getters identically to data properties (verified against the runtime source), object-literal getters satisfy the readonly interface, processedFiles is not mutated between context creation and render, and no code spreads/serializes a RenderContext outside the skill path (where memoization keeps values identical).

Alternatives investigated and rejected (change 6 run, 5 parallel scopes)

  • searchFiles: gitignore: false + pre-built predicate from globby's internal getIgnorePatternsAndPredicate, applied as a sync Array.filter: measured −5.2% e2e (median −56.5ms, t = −4.71, 19/20 pairs) with byte-identical output on this repo — but unshippable as prototyped. globby/ignore.js is not in globby's exports map (the prototype patched node_modules, which would not exist for npm consumers), and disabling globby's gitignore machinery silently drops its gitignore→fast-glob pattern injection that prunes ignored directories during traversal on repos whose collected gitignore patterns contain no negations (repomix does not feed .gitignore contents into its own ignore option) — a large traversal-regression risk on repos with big ignored directories (.venv/, target/, build/). The discovery walk's ignore set would also differ from globby's (which inherits the full repomix ignore array). Future path: an upstream globby PR exporting ignore.js (or the predicate helper) plus replication of convertPatternsForFastGlob would unlock this ~5% safely.
  • Security scope re-audit (3–4 workers, byte-based batching, batch size 75/100, real-content warm-up item): all measured worse or noise. 3 workers regress ~+20% (contention with the metrics pool); byte-based batching changes nothing here (~4KB average files rarely hit a 300KB cap); the post-collection tail is already the irreducible final-batch IPC round-trip (~15–20ms).
  • Metrics/tail scope: git diff/log token tasks are not on the critical path (resolve ~28ms before produceOutput); the "63–85ms git-token latency" is the synchronous MD5 cache-key loop (22–68ms) delaying dispatch — harmless here, and partial MD5 stays rejected (cache invalidation). Decoupling metrics from writeOutputToDisk via an onContentReady callback saves only ~8.5ms (~0.8%); overlapping pool destroy with saveTokenCountCache saves only the 1–7ms cache-save window. The unmerged perf/output-token-ipc-optimization branch is superseded by the wrapper fast path (warm XML runs never reach calculateOutputMetrics).
  • Critical-path audit: no unaccounted serial gap ≥20ms remains; the only ≥20ms serial segments inside produceOutput were the two scans removed by this change. Removing the second sortOutputFiles call inside generateOutput re-confirmed rejected (0ms wall on a quiet machine per the round-6 measurement, and generateOutput is a public API whose direct callers may pass unsorted files).

Change 7: in-repo synchronous gitignore filter (commit 7)

This ships the candidate the change-6 round measured at −5.2% e2e but rejected as unshippable: with gitignore: true / ignoreFiles, globby filters every matched path through an async predicate — Promise.all over one promise per path plus a promisified stat each, ~1,100 microtask round-trips per search on this repo. The two original blockers are both resolved without touching node_modules:

  • New gitignoreParse.ts + gitignoreFilter.ts discover and parse .gitignore / .repomixignore / .ignore in-repo (using the ignore package — the same matcher globby uses internally, promoted to a direct dependency along with fast-glob), build a synchronous predicate, and hand searchFiles the same fast-glob pruning-pattern injection globby derives on its no-negations fast path. The main scan runs with gitignore: false, ignoreFiles: [] and the predicate is applied as a plain Array.filter.
  • globby is retained for the traversal itself, so expandDirectories, negative include patterns, and all other pattern preprocessing keep globby's exact behavior — the earlier prototype's raw fast-glob swap would have silently changed --include "src"-style semantics.
  • Parent .gitignore collection up to the git root (including worktree .git files), per-ignore-file relative anchoring (gitignore spec §2.22.1), directory trailing-slash rules, and the usingGitRoot pruning bail-out are ported line-by-line from globby v16's ignore.js/utilities.js. The discovery walk shares the readdir-caching fs adapter (change 4) with the main scan, so the tree is still read once per search.

Behavior verification (change 7)

  • Output byte-identical (cmp) vs the previous build: default pack, --no-gitignore, multi-root (src website), and a subdirectory pack (website/client, exercising parent-gitignore collection).
  • searchFiles/listFiles/listDirectories parity on synthetic repos: negations, nested .gitignore (CRLF), anchored and trailing-slash patterns, .repomixignore/.ignore, a 300-directory ignored tree (readdir counts equal: 301 = 301 — traversal pruning preserved), and a subdir-of-git-root with parent gitignores and a nested negation override.
  • Differential fuzz vs globby's internal predicate: 9,600 checks across 60 random ignore-file trees, 0 mismatches; pruning-pattern derivation equal as multisets across 120 configurations (ordering differs only where globby's own async discovery order is already nondeterministic).
  • 29 new unit tests for the two modules; local review (2 agents: correctness vs globby source line-by-line, tests/conventions): no blockers — the one correctness note (EPERM on an ignore-file read loses its .code in the wrapped error) is byte-for-byte globby's own pre-existing behavior.

Benchmark (30 interleaved ABBA pairs, quiet 4-core Linux, warm, default pack of this repo): e2e median 816ms → 796.5ms, paired delta median −29ms (−3.6%), mean −25.7ms (−3.1%), t = −3.40, 24/30 pairs improved. Search phase ([globby] trace) 149–189ms → 125–145ms.

Alternatives investigated and rejected (change 7 round, 5 parallel scopes)

  • Skip the second sortOutputFiles inside generateOutput via an alreadySorted flag: re-measured at −22ms (the round-6 "0ms" figure was wrong — the comparator's two dictionary-mode object lookups per comparison cost ~22ms for 1,072 files with sortByChanges enabled). Above threshold but smaller than change 7; strong candidate for a future round.
  • Direct XML string builder replacing the Handlebars render on the non-parsable xml path: ≈−19ms (the per-process Handlebars JIT warm-up dominates; the template uses {{{triple-brace}}} everywhere so direct concatenation is byte-identical, md5-verified). Above threshold but smaller than change 7 and adds template/builder dual maintenance; future candidate.
  • Startup window: re-measured at ~15ms total (process start → search start) with the compile cache warm — nothing ≥2% exists there; gpt-tokenizer/@secretlint/web-tree-sitter confirmed worker-only, never on the main-thread graph.
  • Sampled/partial MD5 for contentCacheKey (~11ms isolated): below threshold and weakens the cache key — stays rejected.
  • Security stream tail (9–49ms variance after collection): structurally irreducible — the last batches depend on the last files collected; pre-dispatching git items saves ≤5ms.
  • Git diff/log child processes: spawn ~250ms before collection ends — fully off the critical path; no overlap win available.

Change 8: fast-path the ignore-file predicate (commit 8)

The change-7 module kept globby's createIgnoreMatcher as the per-path entry point of isIgnored(): every tested path went through several path operations (resolve, normalize, relative, inside-path check, slash conversion) before reaching the ignore matcher — ~1,400 file/directory paths per search on this repo, with the whole pipeline run twice for directories. That resolution work dominated the post-scan filter cost.

  • Fast path in buildIgnoreFileFilter: fast-glob only ever emits clean, slash-separated scan-root-relative strings, and for those the baseDir-relative form the ignore package expects is a precomputed constant prefix (the scan root's path below the git root; empty when they coincide) plus the input — one string concatenation + ig.ignores(), zero per-path node:path calls.
  • A routing regex sends every other input shape ('', '.' or '..' segments, backslashes, doubled or trailing slashes, absolute paths) to the unchanged legacy createIgnoreMatcher, so edge-case semantics are preserved by construction rather than re-implemented (the ignore package would throw on several of these shapes).
  • Local review (2 agents): correctness review verified that ignores() is equivalent to test().ignored in the ignore package source (including under negation patterns and its per-method caches), routing completeness against the package's throw conditions, Windows separator handling, and all fileSearch.ts call-site formats (absolute: false, no markDirectories), with no blockers. Tests/conventions review: no blockers; two comment-accuracy should-fixes applied before push.
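The routing idea above can be sketched without the ignore package by passing the matchers in as plain functions. Everything named here is hypothetical: `CLEAN_RELATIVE` is an illustrative regex written for this sketch (not the PR's), `ignores` stands in for a direct matcher call, and `legacy` stands in for the slower path-resolving matcher. The regex is deliberately conservative: anything it does not recognize as a clean relative path goes to the legacy path, which is always safe.

```typescript
type PathPredicate = (input: string) => boolean;

// Illustrative regex (an assumption, not the PR's): non-empty segments joined by single
// forward slashes, no backslashes, no "." or ".." segments, no drive-letter prefix.
const CLEAN_RELATIVE =
  /^(?![A-Za-z]:)(?!\.{1,2}(?:\/|$))[^/\\]+(?:\/(?!\.{1,2}(?:\/|$))[^/\\]+)*$/;

// Hypothetical builder: prefix is the scan root's path below the git root ("" or "dir/").
function buildFastIgnoreFilter(
  prefix: string,
  ignores: PathPredicate,
  legacy: PathPredicate,
): PathPredicate {
  return (input) =>
    CLEAN_RELATIVE.test(input)
      ? ignores(prefix + input) // fast path: one concatenation, no path resolution
      : legacy(input); // every other shape keeps the original semantics
}
```

Keeping the unusual shapes on the legacy path means the fast path never has to reason about normalization, which is the "preserved by construction" property the description relies on.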

Behavior verification (change 8)

  • Output byte-identical (cmp) vs the previous build: default pack, subdirectory pack (website/client — exercises the git-root prefix branch), multi-root (src website), and --no-gitignore.
  • 1416/1416 tests pass (3 new tests pin the fallback routing: dot-segment/doubled/trailing-slash normalization, scan root & outside-base inputs, absolute inputs); lint clean.

Benchmark (32 interleaved ABBA pairs, warm, default pack of this repo, 4-core Linux; note this round's container is ~1.6–1.7× slower than earlier rounds' hosts, so absolute ms are not comparable across rounds): e2e median 1615ms → 1540ms, paired mean Δ −82.0ms (−5.1%), median Δ −74ms (−4.6%), t = −7.68, 29/32 pairs improved. Search phase trace: 749–778ms → 643–684ms with identical results (1099 files, 255 directories).

Alternatives investigated and rejected (change 8 round, 5 parallel scopes)

  • Skip the second sortOutputFiles inside generateOutput (the change-7 round's −22ms "strong candidate"): re-measured at ~0.12ms isolated on the real corpus — the 22ms figure was the git-log subprocess cost that prefetchSortData's cache already eliminated. Permanently dead; the only remaining value would be structural cleanliness.
  • Direct XML string builder replacing the Handlebars render (change-7 round's ≈−19ms): re-prototyped on the tip, byte-identical, but noise-level e2e (20 interleaved pairs, t = −0.14). The earlier estimate came from measurements without the landed change-6 lazy render-context getters. Dead unless the render path regresses.
  • Precompute the MD5 contentCacheKey during processFiles: prototyped (optional ProcessedFile.contentCacheKey field, hash after all transforms, fallback for other callers), byte-identical — but +7.6ms (noise, t = 0.86) over 20 interleaved pairs on the tip. The warm-run MD5 loop (~19ms) does execute before the Handlebars render, but moving it into the processFiles∥security window does not shorten the wall on the current overlap structure.
  • Fire-and-forget metrics pool destroy: re-measured at ~30ms on this tip (up from the earlier ~15ms figure) but stays rejected — pack() is a library entry point (MCP server); dropping the awaited cleanup risks leaked worker threads in long-running hosts.
  • Process note for future rounds: three of this round's five investigation worktrees were silently based on main instead of the branch tip and "re-discovered" the already-landed change 6; their absolute findings were discarded and only tip-verified measurements were used. Verify the worktree base commit before trusting agent measurements.

Checklist

  • Run npm run test — 1416/1416 pass
  • Run npm run lint — clean (3 pre-existing warnings in unrelated files)

https://claude.ai/code/session_015jxJ9Nx3ncjkTTPHtLJqq7
https://claude.ai/code/session_015sBq63cfQRHYkmnvrokGF2
https://claude.ai/code/session_011DHBuMqYeyMgJuYRSeJxSa
https://claude.ai/code/session_01Ea6eConhLEQFKZsVkJz1zE
https://claude.ai/code/session_016akbidec8cut61QAGRKb99
https://claude.ai/code/session_01RD8vNvv1qtYV8BgdxMU7js
https://claude.ai/code/session_014MsDPw1ZUnHVU4giu48JA7
https://claude.ai/code/session_01N3uqykUShsrDKkyvjuKi13


Generated by Claude Code

coderabbitai Bot commented Jun 11, 2026

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 32dc20bb-36ed-4ca2-aeee-9045f5f3f063

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@yamadashy yamadashy added the automated label Jun 11, 2026 — with Claude
@github-actions

github-actions Bot commented Jun 11, 2026

Contributor

⚡ Performance Benchmark

Latest commit:9276d79 Merge remote-tracking branch 'origin/main' into perf/auto-perf-tuning
Status:✅ Benchmark complete!
Ubuntu:0.74s (±0.01s) → 0.64s (±0.01s) · -0.11s (-14.7%)
macOS:0.54s (±0.10s) → 0.44s (±0.06s) · -0.09s (-17.2%)
Windows:0.95s (±0.01s) → 0.81s (±0.01s) · -0.15s (-15.4%)
Details
  • Packing the repomix repository with node bin/repomix.cjs
  • Warmup: 2 runs (discarded), interleaved execution
  • Measurement: 20 runs / 30 on macOS (median ± IQR)
  • Workflow run
History

a008d75 perf(file): Fast-path the ignore-file predicate without per-path resolution

Ubuntu:0.79s (±0.02s) → 0.66s (±0.02s) · -0.13s (-16.2%)
macOS:0.43s (±0.01s) → 0.38s (±0.01s) · -0.05s (-12.5%)
Windows:0.95s (±0.03s) → 0.82s (±0.04s) · -0.13s (-14.0%)

b053bbe perf(file): Fast-path the ignore-file predicate without per-path resolution

Ubuntu:0.68s (±0.02s) → 0.58s (±0.01s) · -0.10s (-14.6%)
macOS:0.44s (±0.04s) → 0.38s (±0.02s) · -0.05s (-11.9%)
Windows:1.22s (±0.14s) → 1.01s (±0.05s) · -0.21s (-17.3%)

632bf8f perf(file): Replace globby's gitignore machinery with an in-repo synchronous ignore-file filter

Ubuntu:0.74s (±0.01s) → 0.64s (±0.01s) · -0.10s (-13.0%)
macOS:0.68s (±0.11s) → 0.60s (±0.14s) · -0.08s (-12.3%)
Windows:0.95s (±0.01s) → 0.83s (±0.01s) · -0.13s (-13.1%)

a890579 perf(output): Defer line-count and markdown-delimiter scans with lazy render-context getters

Ubuntu:0.70s (±0.02s) → 0.63s (±0.01s) · -0.07s (-10.5%)
macOS:0.72s (±0.19s) → 0.69s (±0.13s) · -0.04s (-5.1%)
Windows:1.05s (±0.02s) → 0.91s (±0.03s) · -0.14s (-13.3%)

0fe40b4 perf(file): Sample-scan base64 run detection in truncateBase64

Ubuntu:0.81s (±0.02s) → 0.72s (±0.02s) · -0.10s (-11.9%)
macOS:0.46s (±0.12s) → 0.42s (±0.16s) · -0.05s (-9.9%)
Windows:0.74s (±0.02s) → 0.68s (±0.02s) · -0.06s (-8.4%)

2d0a45a perf(file): Sample-scan base64 run detection in truncateBase64

Ubuntu:0.78s (±0.01s) → 0.69s (±0.01s) · -0.08s (-10.6%)
macOS:0.99s (±0.19s) → 0.90s (±0.21s) · -0.09s (-9.0%)
Windows:0.98s (±0.03s) → 0.88s (±0.03s) · -0.10s (-10.0%)

1f2621e perf(file): Cache readdir results across globby's double traversal

Ubuntu:0.71s (±0.02s) → 0.65s (±0.01s) · -0.06s (-8.9%)
macOS:0.55s (±0.10s) → 0.50s (±0.08s) · -0.05s (-9.4%)
Windows:0.96s (±0.01s) → 0.87s (±0.03s) · -0.09s (-9.7%)

7eeca34 perf(security): Stream security check batches during file collection

Ubuntu:0.69s (±0.01s) → 0.62s (±0.02s) · -0.07s (-9.6%)
macOS:0.46s (±0.08s) → 0.44s (±0.07s) · -0.02s (-4.2%)
Windows:1.04s (±0.14s) → 0.92s (±0.15s) · -0.12s (-11.2%)

21fa845 perf(security): Pre-warm security worker pool to overlap spawn with file collection

Ubuntu:0.81s (±0.02s) → 0.77s (±0.01s) · -0.04s (-4.7%)
macOS:0.43s (±0.01s) → 0.43s (±0.01s) · -0.01s (-1.4%)
Windows:1.22s (±0.05s) → 1.15s (±0.04s) · -0.07s (-5.8%)

5101545 perf(file): Answer globby's gitignore stat calls from readdir dirent types

Ubuntu:0.78s (±0.02s) → 0.75s (±0.02s) · -0.02s (-3.0%)
macOS:0.71s (±0.14s) → 0.67s (±0.07s) · -0.04s (-5.7%)
Windows:0.54s (±0.02s) → 0.53s (±0.01s) · -0.01s (-1.1%)

e722492 perf(file): Answer globby's gitignore stat calls from readdir dirent types

Ubuntu:0.79s (±0.02s) → 0.76s (±0.02s) · -0.03s (-3.3%)
macOS:0.50s (±0.07s) → 0.49s (±0.07s) · -0.01s (-1.4%)
Windows:0.85s (±0.27s) → 0.83s (±0.09s) · -0.02s (-2.4%)

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a custom fs adapter for globby to cache directory entry types during traversal, significantly reducing the number of stat syscalls when applying gitignore rules. It also adds a comprehensive test suite to verify this optimization. The review feedback points out a cross-platform issue on Windows where backslash path separators cause cache misses. To resolve this, it is recommended to normalize paths to POSIX style using toPosixPath in both the caching and lookup steps.


Comment on lines +400 to +404
for (const entry of entries) {
if (entry.isDirectory() || entry.isFile()) {
isDirectoryByPath.set(path.join(dirPath, entry.name), entry.isDirectory());
}
}

Contributor

medium

On Windows, path.join produces paths with backslashes (\\), whereas globby/fast-glob standardizes on forward slashes (/) internally. This causes the cache lookups in stat to always miss on Windows, silently bypassing the performance optimization. Normalizing the path to POSIX style using the existing toPosixPath helper ensures the cache works correctly across all platforms.

Suggested change
- for (const entry of entries) {
-   if (entry.isDirectory() || entry.isFile()) {
-     isDirectoryByPath.set(path.join(dirPath, entry.name), entry.isDirectory());
-   }
- }
+ for (const entry of entries) {
+   if (entry.isDirectory() || entry.isFile()) {
+     isDirectoryByPath.set(toPosixPath(path.join(dirPath, entry.name)), entry.isDirectory());
+   }
+ }

Owner Author

Investigated — the cache keys actually do match on Windows, so this suggestion isn't applied.

The stat lookup path doesn't come from fast-glob's slash-separated strings: globby's filter runs every result through path.normalize() and then path.resolve(cwd, …) (globby createFilterFunctionAsync), which produces native-separator absolute paths on Windows. Our cache key is built with path.join(dirPath, entry.name), which applies the same native normalization (it converts the walker's / joins to \). Verified with path.win32:

const p = require('path').win32;
const cwd = 'C:\\Users\\x\\repo';
const walkerDir = p.resolve(cwd, '.') + '/src/core';      // fast-glob walker joins with '/'
p.join(walkerDir, 'file.ts')                               // cache key
// => 'C:\\Users\\x\\repo\\src\\core\\file.ts'
p.resolve(cwd, p.normalize('src/core/file.ts'))            // globby stat path
// => 'C:\\Users\\x\\repo\\src\\core\\file.ts'  — identical

Applying toPosixPath would also be a small correctness risk on POSIX, where \ is a legal filename character: a blanket \ → / rewrite could collide two distinct paths and serve a wrong isDirectory, whereas a separator mismatch in the current code merely falls through to a real stat (perf-neutral, never wrong). Added a docstring note in createGlobbyFsAdapter documenting this invariant.
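
The collision risk described above can be illustrated with a minimal sketch. `toPosixPathSketch` is a hypothetical stand-in for a blanket backslash rewrite, written only for this illustration; it is not the repository's toPosixPath helper, and the paths are made up.

```typescript
// Hypothetical stand-in for a blanket "\" -> "/" rewrite (assumption, not the repo helper).
const toPosixPathSketch = (p: string): string => p.replace(/\\/g, "/");

// On POSIX, "\" is a legal filename character, so these are two distinct paths:
const fileNamedWithBackslash = "/repo/a\\b"; // one entry literally named "a\b" inside /repo
const nestedEntry = "/repo/a/b"; // entry "b" inside directory /repo/a

const collisionDemoCache = new Map<string, boolean>();
collisionDemoCache.set(toPosixPathSketch(nestedEntry), true); // recorded as a directory
collisionDemoCache.set(toPosixPathSketch(fileNamedWithBackslash), false); // a file, same key

// Both paths normalize to the same key, so the second write replaced the first.
const keysCollide =
  toPosixPathSketch(fileNamedWithBackslash) === toPosixPathSketch(nestedEntry);
```

With native-separator keys the two entries would stay distinct, and an unmatched lookup would only cost one real stat.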


Generated by Claude Code

Comment on lines +414 to +416
const stat = (statPath: string, callback: (error: NodeJS.ErrnoException | null, stats: Stats) => void): void => {
const isDirectory = isDirectoryByPath.get(statPath);
if (isDirectory === undefined) {

Contributor

medium

To match the POSIX-normalized keys stored in isDirectoryByPath, the lookup path in stat should also be normalized using toPosixPath.

Suggested change
- const stat = (statPath: string, callback: (error: NodeJS.ErrnoException | null, stats: Stats) => void): void => {
-   const isDirectory = isDirectoryByPath.get(statPath);
-   if (isDirectory === undefined) {
+ const stat = (statPath: string, callback: (error: NodeJS.ErrnoException | null, stats: Stats) => void): void => {
+   const isDirectory = isDirectoryByPath.get(toPosixPath(statPath));
+   if (isDirectory === undefined) {

Owner Author

Not applied for the same reason as the sibling comment: globby's stat path goes through path.normalize() + path.resolve() before reaching the adapter, so it carries native separators that already match the path.join-built cache keys on Windows (verified with path.win32). A toPosixPath rewrite on lookups would additionally risk key collisions on POSIX filenames containing \. The invariant is now documented in the adapter's docstring.


Generated by Claude Code

…types

intent(file-search): automated perf tuning pass — single highest-impact, behavior-preserving change against a ~1.25s default pack run
learned(globby): with gitignore enabled, globby's ignore filter calls fs.stat on every matched path (~1100 syscalls on this repo) only to decide whether trailing-slash rules apply; the traversal's readdir(withFileTypes) already carried each entry's type
decision(fs-adapter): pass a per-call fs adapter to globby that records dirent types during readdir and serves the stat calls from memory; symlinks/special entries and unseen paths fall through to a real stat since stat follows links while dirents do not
rejected(secretlint-prefilter): trigger-regex pre-filter before lintSource (~88ms) — a hand-maintained trigger list can silently miss preset rules and produce security false negatives
rejected(gitlog-token-cache): caching the git log token count (~10-16ms) — below the 2% improvement threshold and off the critical path
constraint(fs-adapter): statSync must be forwarded so globby's cwd-is-directory validation keeps running; ignore.js readFile falls back to real fs on its own
constraint(cache-keys): keys are path.join-normalized to native separators, matching globby's normalize+resolve chain on every platform; deliberately NOT posix-normalized because a blanket backslash rewrite could collide distinct POSIX paths (review feedback on PR #1633 investigated and declined — verified via path.win32 that join/resolve produce identical keys)

Benchmark (repomix repo itself, ~1100 files, 8 interleaved runs each, warm):
- end-to-end: median 1242ms -> 1182ms (-60ms, -4.8%)
- globby phase: 254ms -> 206ms (-48ms)
- output byte-identical vs base build (cmp) for default run, --include-empty-directories, symlink/trailing-slash edge cases, and non-git directories
- npm run lint clean (3 pre-existing warnings), npm run test 1357/1357 pass
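
The decision above (record dirent types during readdir, answer later stat questions from memory, fall through for anything unknown) can be sketched roughly as follows. This is a simplified illustration, not the actual createGlobbyFsAdapter: `DirentTypeCache`, `DirentLike`, and `makeDirent` are hypothetical names, and the real adapter wraps callback-style fs functions rather than exposing `record`/`lookup`.

```typescript
import * as path from "node:path";

// Minimal shape of what readdir(withFileTypes) reports per entry.
interface DirentLike {
  name: string;
  isDirectory(): boolean;
  isFile(): boolean;
}

// Hypothetical sketch: remembers entry types seen during traversal.
class DirentTypeCache {
  private readonly isDirectoryByPath = new Map<string, boolean>();

  // pathImpl is injectable so the sketch behaves the same on every platform.
  constructor(private readonly pathImpl: path.PlatformPath = path) {}

  record(dirPath: string, entries: DirentLike[]): void {
    for (const entry of entries) {
      // Symlinks and special entries are skipped: stat follows links, dirents do not.
      if (entry.isDirectory() || entry.isFile()) {
        this.isDirectoryByPath.set(this.pathImpl.join(dirPath, entry.name), entry.isDirectory());
      }
    }
  }

  // undefined means "unknown": the caller should fall through to a real stat.
  lookup(statPath: string): boolean | undefined {
    return this.isDirectoryByPath.get(statPath);
  }
}

const makeDirent = (name: string, kind: "dir" | "file" | "symlink"): DirentLike => ({
  name,
  isDirectory: () => kind === "dir",
  isFile: () => kind === "file",
});

const direntCache = new DirentTypeCache(path.posix);
direntCache.record("/repo/src", [
  makeDirent("core", "dir"),
  makeDirent("index.ts", "file"),
  makeDirent("link", "symlink"),
]);
```

A miss is always safe in this design: it costs one real stat and can never return a wrong answer.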
@yamadashy yamadashy force-pushed the perf/auto-perf-tuning branch from e722492 to 5101545 Compare June 11, 2026 18:14
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 11, 2026

Deploying repomix with Cloudflare Pages

Latest commit: 9276d79
Status: ✅ Deploy successful!
Preview URL: https://861b5050.repomix.pages.dev
Branch Preview URL: https://perf-auto-perf-tuning.repomix.pages.dev

View logs

claude added 4 commits June 11, 2026 22:13
…ile collection

intent(security): automated perf tuning pass — single highest-impact, behavior-preserving change against a ~860ms default pack run
learned(security-workers): spawning the 2 secretlint worker threads costs ~50-100ms each (thread creation + 7MB preset bundle import) and previously happened inside runSecurityCheck, i.e. after file collection — squarely on the critical path; the security leg gates the security∥processFiles phase (~208ms of a ~242ms phase)
decision(security-pool): create the pool in pack() right after createMetricsTaskRunner and post one empty-items task per worker, mirroring the already-merged metrics prewarm pattern; the spawn then overlaps the ~165ms collect+git phase instead of starting after it
decision(pool-teardown): start taskRunner.cleanup() right after the security phase resolves so the worker destroy overlaps output generation and metrics; the finally block awaits the same promise (starting it if an earlier stage threw) so no path leaks threads
constraint(warmup-count): warm-up count mirrors Tinypool's own sizing (min(2, concurrency, ceil(numOfTasks/100))) so no thread is spawned that the real workload would not have created anyway
constraint(secretlint-rules): same rules run on exactly the same content — only worker spawn timing changes; an empty-items warmup batch returns [] without linting anything
rejected(metrics-prewarm-zero): skipping metrics warmup on warm-likely runs (~41ms) — exposes a ~150ms BPE init to the common "one file changed" incremental run, the exact case the merged cache-aware prewarm hedges
rejected(base64-sampling): sparse-sampling hasLongBase64Run (~38ms CPU) — runs inside processFiles, which is parallel to and shorter than the security leg, so the CPU saving does not move wall time
rejected(metrics-cleanup-noawait): fire-and-forget metrics pool destroy (~14-17ms) — borderline at the 2% threshold, kept as a future candidate

Benchmark (repomix repo itself, ~1100 files, 20 interleaved warm pairs, quiet 4-core host):
- end-to-end: median 859ms -> 815.5ms (-43.5ms, -5.1%)
- security check phase (trace log): 207.8ms -> 161.9ms (-46ms)
- output byte-identical vs base build (cmp) for default run and --no-security-check
- npm run lint clean (3 pre-existing warnings), npm run test 1365/1365 pass (8 new tests)
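
The sizing rule quoted in the warmup-count constraint above can be written out as a small function. The formula is taken from the commit message; the function name `warmupWorkerCount` is hypothetical, and the sketch assumes at least one task.

```typescript
// Sketch of the warm-up sizing rule min(2, concurrency, ceil(numOfTasks / 100)).
// Assumes numOfTasks >= 1; the name is illustrative, not the repository's.
function warmupWorkerCount(numOfTasks: number, concurrency: number): number {
  return Math.min(2, concurrency, Math.ceil(numOfTasks / 100));
}

const workersFor100 = warmupWorkerCount(100, 4); // one worker: 100 tasks fit a single slot
const workersFor101 = warmupWorkerCount(101, 4); // the second worker starts at 101 tasks
const workersFor5000 = warmupWorkerCount(5000, 4); // capped at 2
const workersSingleCore = warmupWorkerCount(5000, 1); // capped by concurrency
```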
The security check previously dispatched all its worker batches only
after file collection had fully completed, putting the entire lint wall
time (~160ms on this repo) on the critical path between collection and
output generation. The security workers sit idle during collection
(which is I/O-bound on the main thread), so the lint work can overlap
it almost entirely.

Changes:
- New createSecurityCheckStream (securityCheckStreaming.ts): buffers
  collected files and dispatches each full BATCH_SIZE batch to the
  worker pool immediately. finalize() flushes the remainder plus git
  diff/log items, enqueues any raw file that never arrived via addFile
  (so custom collectFiles implementations that ignore the callback
  cannot skip the check), awaits all batches, and re-orders suspicious
  file results back to canonical rawFiles order. Batch failures are
  captured (not rejected) until finalize so abandoned sessions on error
  paths cannot surface unhandled rejections.
- collectFiles gains an optional onFileCollected callback, invoked in
  completion order for every file that ends up in rawFiles.
- pack() wires the callback to the stream using final display paths
  (multi-root labels applied), and forwards the stream to
  validateFileSafety, which uses finalize() instead of runSecurityCheck.
- createSecurityCheckTaskRunner now runs before file search and warms
  up in two stages: one worker immediately (its spawn + secretlint
  preset import overlap the ~155ms search phase, so it is ready when
  the first batch arrives), and the second via completeWarmup() once
  the file count is known — preserving the existing sizing rule
  (second worker only from 101 items).
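
As a rough illustration of the buffering and finalize behavior listed above, here is a minimal sketch. It is not the code in securityCheckStreaming.ts: `createBatchStreamSketch`, `LintResult`, and the `dispatch` callback are hypothetical, files are represented by their paths only, and git diff/log items are left out.

```typescript
type LintResult = { filePath: string; suspicious: boolean };
type Dispatch = (batch: string[]) => Promise<LintResult[]>;

function createBatchStreamSketch(dispatch: Dispatch, batchSize: number) {
  let buffer: string[] = [];
  const seen = new Set<string>();
  // Each in-flight batch settles to results or a captured Error, never a rejection,
  // so an abandoned stream cannot surface an unhandled rejection.
  const inFlight: Promise<LintResult[] | Error>[] = [];

  const flush = (): void => {
    if (buffer.length === 0) return;
    const batch = buffer;
    buffer = [];
    inFlight.push(
      dispatch(batch).catch((e: unknown) => (e instanceof Error ? e : new Error(String(e)))),
    );
  };

  const addFile = (filePath: string): void => {
    if (seen.has(filePath)) return;
    seen.add(filePath);
    buffer.push(filePath);
    if (buffer.length >= batchSize) flush(); // dispatch while collection continues
  };

  const finalize = async (canonicalOrder: string[]): Promise<LintResult[]> => {
    // Enqueue anything that never arrived via addFile, then flush the remainder.
    for (const filePath of canonicalOrder) addFile(filePath);
    flush();
    const settled = await Promise.all(inFlight);
    const suspicious: LintResult[] = [];
    for (const outcome of settled) {
      if (outcome instanceof Error) throw outcome; // failures surface only here
      for (const result of outcome) if (result.suspicious) suspicious.push(result);
    }
    // Restore canonical order, since batches were filled in completion order.
    const rank = new Map<string, number>();
    canonicalOrder.forEach((p, i) => rank.set(p, i));
    return suspicious.sort((a, b) => (rank.get(a.filePath) ?? 0) - (rank.get(b.filePath) ?? 0));
  };

  return { addFile, finalize };
}
```

A caller would invoke `addFile` from the collection callback and `finalize` with the final file list once collection completes.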

Behavior is unchanged: the same items are linted with the same batch
size and rules, and outputs were verified byte-identical (cmp) against
the base build for the default run, --no-security-check,
--include-empty-directories, multi-root, and a planted-secrets case
(same 2 files flagged and excluded). runSecurityCheck keeps its
original behavior for non-streamed callers (MCP, lib API).

Benchmark (this repo, warm, 39 interleaved pairs, 4-core Linux,
default pack): end-to-end median 869ms -> 832ms (-37ms, -4.3%), paired
mean delta -34.8ms (-4.0%), paired t = 5.51. Trace: security-check
tail after collection ~160ms -> ~60-90ms.

Variants measured and rejected: holding dispatch until the metrics
warm-up settles (warm-up spans nearly the whole collection window,
nullifying the overlap) and capping pre-finalize dispatch to one batch
in flight (paired delta ~2ms, no effect).

npm run test: 1378/1378 pass (12 new). npm run lint: clean.

https://claude.ai/code/session_015sBq63cfQRHYkmnvrokGF2
intent(file-search): automated perf tuning pass — single highest-impact, behavior-preserving change against a ~810ms default pack run
learned(globby-traversal): a single globby() call walks the tree twice — globIgnoreFiles discovers .gitignore/.repomixignore/.ignore files, then the main fast-glob scan re-walks the same tree — issuing 582 readdir(withFileTypes) calls for 291 directories on this repo; both walks share the per-call options.fs adapter, so it is the natural cache point
decision(fs-adapter): record successful withFileTypes readdir results in the existing createGlobbyFsAdapter and replay the second walk's calls via process.nextTick (preserves the async callback contract fast-glob expects); errors are never cached so a transiently failing directory is retried
constraint(dirent-sharing): both traversals receive the same Dirent[] array — safe because fast-glob wraps dirents in its own entry objects and never mutates the source array
constraint(snapshot-semantics): both walks now see one consistent directory snapshot per search; previously a directory changing between the two walks could be listed differently by each (noted in the adapter docstring)
rejected(security-batch-size): SECURITY_CHECK_BATCH_SIZE 50 -> 200 measured -58ms under heavy machine load, but 25 quiet interleaved pairs show identical medians (793ms vs 793ms) — the IPC overhead it removes only matters on contended hosts; 400 regresses (~+20ms, worker imbalance)
rejected(lazy-render-context): lazy fileLineCounts/markdownCodeBlockDelimiter getters in createRenderContext (~14-15ms on the default XML path, 1.8%) — below the 2% threshold
rejected(lazy-handlebars-import): deferring the handlebars import cuts ~17ms of startup but the deferred import() evaluation blocks in-flight collection I/O callbacks (a 0.24ms readFile stretched to 94ms in the probe), netting only ~4ms end to end
rejected(metrics-cleanup-noawait): fire-and-forget metrics pool destroy (~15ms median) — still borderline below threshold, unchanged from the previous pass

Benchmark (repomix repo itself, ~1100 files, 25 interleaved warm pairs,
quiet 4-core Linux, default pack):
- end-to-end median 808ms -> 780ms (-28ms, -3.5%), paired delta median
  -30ms, paired mean -28.8ms (t = 4.6), 20/25 pairs improved
- [globby] search phase 166-177ms -> 139-149ms
- readdir(withFileTypes) during search: one call per directory; a new
  regression test asserts no directory is listed twice
- output byte-identical (cmp) vs the base build for the default run,
  --include-empty-directories, multi-root (src website), and
  --no-gitignore (with the A/B-swapped lib file excluded from the pack)

npm run test: 1379/1379 pass (1 new). npm run lint: clean (3
pre-existing warnings in unrelated files).
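
The caching decision above (cache successful listings, replay them asynchronously, never cache errors) might look roughly like this. It is a simplified sketch: `cacheReaddir` and the two-argument `Readdir` type are hypothetical, and the real fs.readdir signature also takes an options argument that is omitted here.

```typescript
type ReaddirCallback<T> = (error: Error | null, entries?: T[]) => void;
type Readdir<T> = (dirPath: string, callback: ReaddirCallback<T>) => void;

// Wraps a callback-style readdir with a per-search cache (illustrative only).
function cacheReaddir<T>(readdir: Readdir<T>): Readdir<T> {
  const cache = new Map<string, T[]>();
  return (dirPath, callback) => {
    const cached = cache.get(dirPath);
    if (cached !== undefined) {
      // Replay on a later tick so callers never observe a synchronous callback.
      process.nextTick(() => callback(null, cached));
      return;
    }
    readdir(dirPath, (error, entries) => {
      // Errors are never cached, so a transiently failing directory is retried.
      if (!error && entries) cache.set(dirPath, entries);
      callback(error, entries);
    });
  };
}
```

Both walks receive the same array instance from the cache, which is why the commit notes that consumers must not mutate it.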

https://claude.ai/code/session_011DHBuMqYeyMgJuYRSeJxSa
intent(file-process): automated perf tuning pass — single highest-impact, behavior-preserving change against a ~865ms default pack run; truncateBase64 is enabled in this repo's own config so its precondition scan runs on every packed file in the benchmark workload
learned(base64-scan): hasLongBase64Run walked every character of every file (~5.5MB per pack, 23ms main-thread self time in CPU profiles, 35ms isolated) even though it almost always returns false — the per-character loop was itself the previous optimization over the regex it gates
decision(sampled-scan): sample one character every MIN_BASE64_LENGTH_STANDALONE (256) positions — any qualifying run occupies 256 consecutive indices, so it must contain a sample point; only a sampled base64-class hit triggers a bounded outward expansion to measure the surrounding run, and the sampling phase resets cleanly after each short-run skip (next possible run from hi+1 always covers sample hi+256)
constraint(equivalence): differential-tested against the per-character reference on the full repo corpus (1096 files, 0 mismatches) plus 20k randomized fuzz cases; a deterministic-LCG differential test now pins both false-positive and false-negative directions in the suite
rejected(regex-precheck): /[A-Za-z0-9+/]{256}/.test() measured 4.5x SLOWER than the per-character loop (155ms vs 35ms on the corpus) — bounded-repetition re-scanning at each start position, not a viable replacement
rejected(early-git-token-dispatch): pre-dispatching git diff/log token counts from the packager — with a warm token cache they resolve while calculateMetrics awaits outputPromise (Promise.all resolves in ~0ms; the 63-67ms wall time is main-thread-busy completion latency, not queue wait), e2e median +15ms under noise, unproven
rejected(collect-concurrency): FILE_COLLECT_CONCURRENCY 50 -> 128/256 — identical medians over 40 quiet interleaved runs; libuv's 4-thread pool is saturated at depth 50, queue depth adds nothing
rejected(startup-lazy-imports): module-level import() prefetches of tinypool/fast-glob/handlebars all measure 0 to -3ms — ESM already fetches/compiles the static graph in parallel; the budget is sequential module evaluation (~255 modules), only bundling would cut it
rejected(lazy-render-context): skipping fileLineCounts + markdownCodeBlockDelimiter on the XML path re-measured at ~11ms p50 quiet (6.2 + 4.7) — still below the 2% threshold, matching the previous pass's rejection

Benchmark (repomix repo itself, ~1100 files, 20 interleaved warm pairs,
quiet 4-core Linux, default pack, pristine HEAD worktree build vs
patched build):
- end-to-end median 865ms -> 820.5ms (paired delta median -26.5ms,
  -3.1%), paired mean -37.5ms (t = 5.14), 18/20 pairs improved
- isolated scan cost over the packed corpus: 35.6ms -> 1.6ms p50 (~22x)
- output byte-identical (cmp) vs the base build on the same tree
- 6 new tests: stride alignments 0-511, run ending at EOF,
  whole-content run, phase reset after short-run skips, near-threshold
  non-matches, and the seeded differential fuzz

npm run test: 1385/1385 pass. npm run lint: clean (3 pre-existing
warnings in unrelated files).
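
The sampling argument above can be made concrete with a short sketch. This is not the repository's hasLongBase64Run: the function names are hypothetical, and the character class [A-Za-z0-9+/] is assumed from the regex quoted in the rejected alternative. A per-character reference is included for comparison.

```typescript
// Assumed base64 class: A-Z, a-z, 0-9, "+", "/".
function isBase64Code(code: number): boolean {
  return (
    (code >= 65 && code <= 90) ||
    (code >= 97 && code <= 122) ||
    (code >= 48 && code <= 57) ||
    code === 43 ||
    code === 47
  );
}

// Per-character reference: walks every character.
function hasLongBase64RunReference(content: string, minLen = 256): boolean {
  let run = 0;
  for (let i = 0; i < content.length; i++) {
    run = isBase64Code(content.charCodeAt(i)) ? run + 1 : 0;
    if (run >= minLen) return true;
  }
  return false;
}

// Sampled variant: a run of minLen characters spans minLen consecutive indices,
// so it must contain one of the sample points spaced minLen apart.
function hasLongBase64RunSampled(content: string, minLen = 256): boolean {
  const n = content.length;
  let sample = minLen - 1;
  while (sample < n) {
    if (!isBase64Code(content.charCodeAt(sample))) {
      sample += minLen; // no run can cross a non-base64 character
      continue;
    }
    // Sample hit: expand outward to measure the surrounding run.
    let lo = sample;
    while (lo > 0 && isBase64Code(content.charCodeAt(lo - 1))) lo--;
    let hi = sample;
    while (hi - lo + 1 < minLen && hi + 1 < n && isBase64Code(content.charCodeAt(hi + 1))) hi++;
    if (hi - lo + 1 >= minLen) return true;
    // Short run ended at hi; the next candidate run starts at hi + 2 at the earliest,
    // and any qualifying run from there covers index hi + 1 + minLen.
    sample = hi + 1 + minLen;
  }
  return false;
}
```

The left expansion is bounded because it always stops at a character already known to be outside the class, so the total work stays close to one read per stride on files without long runs.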

https://claude.ai/code/session_01Ea6eConhLEQFKZsVkJz1zE
@yamadashy yamadashy force-pushed the perf/auto-perf-tuning branch from 2d0a45a to 0fe40b4 Compare June 12, 2026 10:13

Owner Author

Perf-tuning run 2026-06-12: no qualifying candidate found

This automated pass investigated 5 scopes in parallel (startup/module evaluation, packager orchestration, file collection, file search, output/metrics tail) against the current branch tip (0fe40b4, baseline ~800-850ms warm on a 4-core Linux host). No candidate met the ≥2% (~16ms) end-to-end threshold, so no commit was pushed. Recording the findings so future runs skip them:

Investigated and rejected (this run)

  • Bundling the CLI with esbuild (the "only bundling would cut startup" idea from the change-4 run): measured +200ms slower than the unbundled build — bin/repomix.cjs enables module.enableCompileCache(), and evaluating one 3.3MB bundle is slower than 144 individually-cached modules. Bundling also breaks import.meta.url-relative tinypool worker paths, require.resolve of tree-sitter WASM, and the package.json version read. Startup module evaluation (~150ms) is effectively already paid for by the compile cache; lazy-import variants remain rejected from earlier runs.
  • Speculative generateOutput during the security wait: +169ms regression — the ~75ms synchronous Handlebars render blocks the event loop while security workers need the main thread, plus GC pressure from the 5MB string.
  • Skipping the redundant sortOutputFiles call inside generateOutput: saves ~20ms of CPU but 0ms wall (20 interleaved pairs) — the second call is a cache hit whose event-loop yield actually lets worker threads run.
  • Skip fileLineCounts/markdownCodeBlockDelimiter on non-applicable styles (re-measured): ~27ms inside generateOutput, <1ms e2e — fully hidden behind the parallel calculateMetrics/git-token branch.
  • Partial/sampled MD5 for contentCacheKey: ~9-50ms CPU saved in the cache-key loop, ~0ms e2e (metrics finishes ~19ms from cache and is never the bottleneck); would also invalidate all existing user token caches (CACHE_VERSION bump).
  • File-read path micro-optimizations (FileHandle vs path readFile, manual chunked reads skipping fstat, readFileSync in workers, path.join vs path.resolve, TextDecoder singleton): all 0ms or regressions — on Node 22/io_uring the collect phase is kernel-batched I/O with ~15ms total main-thread CPU; the UTF-8 fast path (null-probe + fatal TextDecoder) already skips jschardet for ~99% of files.
  • expandDirectories: false for globby: ~7ms, below noise.
  • Dirent-based filter inside globby's union filter: ~6ms, and lives in node_modules/globby (upstream change, not patchable here).
  • Git prefetch reordering: prefetchSortData already fires at pack() start in parallel with file search; getGitDiffs/getGitLogs have no remaining serial gap.
  • Config-load phase: 7-17ms total, nothing serially wasteful.

Where the remaining time is (for future reference)

~150ms module evaluation (mitigated by compile cache; only architectural changes would cut it), ~150-170ms globby (76ms already saved by the readdir cache on this branch; the residual union-filter cost is dominated by per-result async ignore matching), ~230-260ms collection I/O (kernel-bound, overlapped with security), ~100ms output render+write (overlapped with git token workers), ~75ms git diff/log token counting (the slower parallel branch at the tail).

The five changes already on this branch remain the full set of measurable wins; the pipeline now has no serial segment whose removal clears the 2% bar on a warm run.


Generated by Claude Code

claude added 3 commits June 12, 2026 19:02
… render-context getters

createRenderContext eagerly ran calculateFileLineCounts and
calculateMarkdownDelimiter — two full scans over every packed file's
content (~5.5MB) — on every run, even though fileLineCounts is only
consumed by skill generation and markdownCodeBlockDelimiter only by the
markdown/skill templates. Memoized getters defer both scans, so the
default XML (and plain/JSON) path never pays for them. Output is
byte-identical (verified with cmp for xml and markdown styles); when a
template or packSkill touches the property, the same computation runs
once and is cached.

Why this clears the bar now when earlier rounds rejected it: this
round's tail profiling showed that on warm runs (token cache hit) the
metrics workers resolve instantly and git token tasks complete ~28ms
before produceOutput, so generateOutput is the sole main-thread
bottleneck at the tail — the scans are wall-visible, not hidden behind
the parallel metrics branch as previously assumed.

Benchmark (packing this repo, warm, quiet 4-core Linux, interleaved
ABBA pairs, node bin/repomix.cjs --quiet):

- batch 1 (20 pairs): median delta -21.7ms, mean -20.1ms, t=-3.51, 15/20 improved
- batch 2 (30 pairs): median delta -25.6ms, mean -18.0ms, t=-2.49, 19/30 improved
- batch 3 (40 pairs): median delta -40.3ms, mean -40.9ms, t=-6.71, 37/40 improved
- pooled (90 pairs): mean -28.6ms on a ~1045ms baseline = -2.7%

Isolated scan cost is ~11-16ms (warm); the larger e2e delta is
consistent with reduced allocation/GC pressure from dropping the
per-line match(/\n/g) array allocations at peak heap.
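
For readers checking the arithmetic, the pooled mean above is the pair-weighted average of the three batch means, and a paired t statistic is conventionally the mean delta divided by its standard error. The sketch below assumes that conventional formula (sample standard deviation, n − 1); the commit does not state how its t values were computed, and the function names are hypothetical.

```typescript
// Paired t statistic under the usual definition: mean(d) / (sd(d) / sqrt(n)).
function pairedStats(deltas: number[]): { n: number; mean: number; t: number } {
  const n = deltas.length;
  const mean = deltas.reduce((sum, d) => sum + d, 0) / n;
  const variance = deltas.reduce((sum, d) => sum + (d - mean) * (d - mean), 0) / (n - 1);
  return { n, mean, t: mean / Math.sqrt(variance / n) };
}

// Pair-weighted mean across batches.
function pooledMean(batches: { pairs: number; mean: number }[]): number {
  const totalPairs = batches.reduce((sum, b) => sum + b.pairs, 0);
  return batches.reduce((sum, b) => sum + b.pairs * b.mean, 0) / totalPairs;
}

// Batch figures copied from the benchmark list above.
const pooledDeltaMs = pooledMean([
  { pairs: 20, mean: -20.1 },
  { pairs: 30, mean: -18.0 },
  { pairs: 40, mean: -40.9 },
]);
const pooledRelative = pooledDeltaMs / 1045; // against the ~1045ms baseline
```

The weighted sum is −2578ms over 90 pairs, which gives the reported −28.6ms and about −2.7%.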

intent(perf-tuning): automated round targeting >=2% end-to-end CLI improvement with behavior preserved
decision(render-context): memoized getters over style-conditional skips — RenderContext shape and all call sites unchanged, and any consumer that does read the properties still gets identical values
rejected(file-search): gitignore:false + prebuilt globby predicate measured -5.2% e2e but unshippable — depends on globby's unexported ignore.js internals (exports-map patch won't exist for npm consumers) and loses globby's gitignore-pattern directory pruning, risking large traversal regressions on repos with big ignored dirs and no negation patterns
learned(metrics-tail): on warm runs the tail critical path is produceOutput -> write -> wrapper extraction; git diff/log token tasks finish ~28ms earlier, so render-side CPU cuts are wall-visible up to that slack

https://claude.ai/code/session_01RD8vNvv1qtYV8BgdxMU7js
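
A minimal sketch of the memoized-getter pattern described in this commit follows. It is not the actual createRenderContext: the helper names, the line-count rule, and the delimiter rule (one backtick longer than the longest backtick run, minimum three) are assumptions made for illustration, and `onScan` exists only so the deferral can be observed.

```typescript
interface ProcessedFileLike {
  path: string;
  content: string;
}

// Runs compute() at most once, on first access.
function memoize<T>(compute: () => T): () => T {
  let computed = false;
  let value: T | undefined;
  return () => {
    if (!computed) {
      value = compute();
      computed = true;
    }
    return value as T;
  };
}

function createRenderContextSketch(
  files: ProcessedFileLike[],
  onScan: (name: string) => void = () => {},
) {
  const lineCounts = memoize(() => {
    onScan("fileLineCounts");
    const counts: Record<string, number> = {};
    for (const f of files) counts[f.path] = f.content === "" ? 0 : f.content.split("\n").length;
    return counts;
  });
  const delimiter = memoize(() => {
    onScan("markdownCodeBlockDelimiter");
    let longest = 0;
    for (const f of files) {
      for (const run of f.content.match(/`+/g) ?? []) longest = Math.max(longest, run.length);
    }
    return "`".repeat(Math.max(3, longest + 1));
  });
  return {
    files,
    // The object shape is unchanged for consumers; only the timing of the scans differs.
    get fileLineCounts(): Record<string, number> {
      return lineCounts();
    },
    get markdownCodeBlockDelimiter(): string {
      return delimiter();
    },
  };
}
```

A template that never reads either property, such as the default XML path in the commit's description, triggers neither scan.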
…hronous ignore-file filter

With gitignore:true / ignoreFiles, globby filters every matched path
through an async predicate — one Promise per path via Promise.all plus
a promisified stat each, ~1,100 microtask round-trips per search on
this repo. searchFiles now discovers and parses .gitignore /
.repomixignore / .ignore itself (new gitignoreParse.ts +
gitignoreFilter.ts, using the `ignore` package that globby itself
uses), passes gitignore:false to globby, and applies the predicate as
a synchronous Array.filter over the results. The fast-glob
pruning-pattern injection (no-negations fast path) and
parent-.gitignore collection up to the git root are replicated
exactly, and the discovery walk shares the readdir-caching fs adapter
with the main scan so the tree is still read only once per search.

Equivalence verification:
- byte-identical output (cmp) vs HEAD build: default pack,
  --no-gitignore, multi-root (src website), subdir pack (website/client)
- searchFiles/listFiles/listDirectories parity on synthetic repos:
  negations, nested .gitignore (CRLF), anchored and trailing-slash
  patterns, .repomixignore/.ignore, 300-dir ignored tree (readdir
  counts equal: 301 = 301), subdir-of-git-root with parent gitignores
  and nested negation override
- differential fuzz vs globby's internal predicate: 9,600 checks
  across 60 random ignore-file trees, 0 mismatches; pruning patterns
  equal as multisets across 120 configurations

Benchmark (packing this repo, warm, quiet 4-core Linux, 30 interleaved
ABBA pairs, node bin/repomix.cjs --quiet):
- e2e median 816ms -> 796.5ms, paired delta median -29ms (-3.6%),
  mean -25.7ms (-3.1%), t = -3.40, 24/30 pairs improved
- search phase ([globby] trace) 149-189ms -> 125-145ms

intent(perf): round-7 automated pass; the previous round measured this
  approach at -5.2% but rejected it as unshippable via globby internals
decision(gitignore-filter): keep globby for traversal and include
  handling; replace only the ignore-file machinery — preserves
  expandDirectories and negative-include behavior that a raw fast-glob
  swap would silently lose
rejected(globby-public-api): isIgnoredByIgnoreFiles alone — it
  hardcodes includeParentIgnoreFiles=false (drops parent .gitignore
  files when packing a subdirectory) and does not expose patterns for
  traversal pruning
rejected(candidates): skipping the redundant sortOutputFiles call in
  generateOutput (-22ms) and a direct XML string builder replacing the
  Handlebars render (-19ms) — both clear the 2% bar but are smaller
  than this change; viable for future rounds
constraint(equivalence): ignore-file discovery order is
  nondeterministic (async fast-glob) in globby today as well; order
  only affects cross-file negation interplay, so the behavior class is
  unchanged
learned(globby): gitignore:true injects converted gitignore patterns
  into fast-glob's ignore option for directory pruning only when cwd
  is the git root and no negations exist anywhere; replicated in
  convertPatternsForFastGlob

https://claude.ai/code/session_014MsDPw1ZUnHVU4giu48JA7
…lution

The isIgnored() predicate returned by buildIgnoreFileFilter routed every
tested path through createIgnoreMatcher, which performs ~5 path operations
per call (resolve, normalize, relative, isInsidePath check, slash
conversion) before reaching the `ignore` matcher. Applied to every file
and directory the scans emit (~1,400 paths per search on this repo, plus
the empty-directory scans), this dominated the post-scan filter cost.

fast-glob only ever emits clean, slash-separated paths relative to the
scan root. For those, the baseDir-relative form the `ignore` package
expects is a constant prefix (the scan root's path below the git root;
empty when they coincide) plus the input string, so the fast path is now
one string concatenation + ig.ignores(). A single regex routes every
other input shape ('', '.'/'..' segments, backslashes, doubled or
trailing slashes, absolute paths) to the unchanged legacy matcher, so
edge-case semantics stay exactly as before.
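
The routing described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this commit: `buildFastIsIgnored`, `NEEDS_LEGACY` and the `IgnoreLike` interface are hypothetical names, the regex is one plausible way to detect the listed input shapes, and a fake matcher stands in for an `ignore` instance so the block runs on its own.

```typescript
// Minimal matcher shape; only the ignores(path) method is assumed here.
interface IgnoreLike {
  ignores(path: string): boolean;
}

// Hypothetical guard: matches every input that is NOT a clean, relative,
// slash-separated path: '', leading '/', backslashes, doubled or trailing
// slashes, '.' or '..' segments, and Windows drive prefixes.
const NEEDS_LEGACY = /^$|^\/|\\|\/\/|\/$|(^|\/)\.\.?(\/|$)|^[A-Za-z]:/;

function buildFastIsIgnored(
  ig: IgnoreLike,
  prefix: string, // scan root's path below the git root; '' when they coincide
  legacy: (path: string) => boolean,
): (path: string) => boolean {
  return (path) =>
    NEEDS_LEGACY.test(path)
      ? legacy(path) // unusual shape: keep the old normalization path
      : ig.ignores(prefix + path); // hot path: one concatenation + one match
}

// Fake matcher and legacy fallback used only to exercise the routing.
const seen: string[] = [];
const fakeIg: IgnoreLike = {
  ignores: (p) => {
    seen.push(p);
    return p.startsWith("website/client/dist/");
  },
};
let legacyCalls = 0;
const legacy = (_p: string): boolean => {
  legacyCalls++;
  return false;
};

const isIgnored = buildFastIsIgnored(fakeIg, "website/client/", legacy);
console.log(isIgnored("dist/app.js"), isIgnored("src/main.ts"), isIgnored("./dist/app.js"));
```

Keeping the fallback as a separate function rather than inlining its normalization means the fast path only has to be correct for the one input shape it accepts.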

decision(gitignore-filter): keep createIgnoreMatcher as the fallback for
non-fast-glob input shapes instead of replicating its normalization
inline — equivalence by construction, and the fallback never runs on the
hot path
rejected(double-sort): skipping the second sortOutputFiles inside
generateOutput — re-measured at ~0.12ms isolated; the change-7 round's
22ms figure was the git-log subprocess cost already eliminated by
prefetchSortData
rejected(xml-direct-builder): direct string builder replacing the
Handlebars xml render — noise-level on the current tip (t=-0.14 over 20
interleaved pairs); earlier ~19ms estimates came from builds without the
landed lazy render-context getters
rejected(md5-precompute): computing contentCacheKey during processFiles
to clear it from the tail — +7.6ms (noise, t=0.86) over 20 interleaved
pairs on the current tip
learned(bench): this container runs ~1.6-1.7x slower than the previous
rounds' quiet host (e2e baseline ~1615ms vs ~800-950ms); relative deltas
from interleaved pairs are the comparable metric

Benchmark (32 interleaved ABBA pairs, warm, default pack of this repo,
4-core Linux): e2e median 1615ms -> 1540ms, paired mean delta -82.0ms
(-5.1%), median delta -74ms (-4.6%), t = -7.68, 29/32 pairs improved.
Search phase ([globby] trace, --verbose): 749-778ms -> 643-684ms with
identical results (1099 files, 255 directories).

Output byte-identical (cmp) vs the previous build for: default pack,
subdirectory pack (website/client — exercises the git-root prefix
branch), multi-root (src website), and --no-gitignore. 1416/1416 tests
pass; lint clean (3 pre-existing warnings in unrelated files).

https://claude.ai/code/session_01N3uqykUShsrDKkyvjuKi13
@yamadashy yamadashy force-pushed the perf/auto-perf-tuning branch from b053bbe to a008d75 Compare June 13, 2026 01:03
# Conflicts:
#	package-lock.json
#	package.json

Owner Author

Perf-tuning run 2026-06-13: no qualifying candidate found

This automated pass first merged the latest main into the branch (resolving the package.json conflict: kept commander@^15.0.0 from main + the branch's fast-glob@^3.3.3; package-lock.json regenerated). Build + CLI smoke verified, pushed.

It then investigated 5 non-overlapping scopes in parallel (token/metrics pipeline, file processing, output generation, git operations + the MD5 cache-key loop, and a whole-process CPU/GC profile) against the current branch tip (9276d79, warm baseline ~700ms on a quiet 4-core Linux host; 2% ≈ 14ms). No candidate met the ≥2% end-to-end bar, so nothing was committed. Recording the findings so future runs skip them:

Strongest candidate — measured, but below the bar

  • contentCacheKey MD5 + Buffer.byteLength → SHA-1 + content.length (src/core/metrics/tokenCountCache.ts, +CACHE_VERSION bump). Three agents independently converged here: the per-file cache-key loop in calculateFileMetrics is the single biggest reducible main-thread item on the warm-run tail critical path (~17–21ms; it runs synchronously before the output token metrics). SHA-1 is hardware-accelerated and faster than MD5; content.length avoids the full UTF-8 re-encode Buffer.byteLength performs; the truncated 64-bit digest is no weaker than before and still covers the full content (token counts unchanged, output byte-identical, verified via cmp).
    • Measured e2e (swap one compiled file in-place, same bin + same warm compile cache, per-version isolated token caches, interleaved ABBA, quiet host): 40 pairs → −11.2ms (−1.61%, t=2.42); 70 pairs → −8.9ms (−1.24%, t=2.13). Real and statistically significant, but ~1.2–1.6% < 2%. The corpus is only ~4.8MB, so the hash speedup doesn't scale to 14ms here.
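
To make the two key shapes concrete, here is a minimal sketch. The helper names and the `digest:length` key format are assumptions for illustration; the actual contentCacheKey in src/core/metrics/tokenCountCache.ts may be laid out differently. Only `createHash` and `Buffer.byteLength` from Node's standard library are used.

```typescript
import { createHash } from "node:crypto";

// Hypothetical "before" shape: MD5 digest truncated to 64 bits (16 hex
// characters) plus the UTF-8 byte length, which re-encodes the whole string.
function md5ByteLengthKey(content: string): string {
  const digest = createHash("md5").update(content).digest("hex").slice(0, 16);
  return `${digest}:${Buffer.byteLength(content, "utf8")}`;
}

// Hypothetical "after" shape: SHA-1 digest truncated to the same 64 bits
// plus content.length, which counts UTF-16 code units without re-encoding.
function sha1CharLengthKey(content: string): string {
  const digest = createHash("sha1").update(content).digest("hex").slice(0, 16);
  return `${digest}:${content.length}`;
}

const sample = "héllo";
console.log(md5ByteLengthKey(sample)); // length part is the UTF-8 byte count
console.log(sha1CharLengthKey(sample)); // length part is the code-unit count
```

The two length fields differ for non-ASCII content, so keys from the two schemes are not interchangeable, which is consistent with the CACHE_VERSION bump mentioned above.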

Investigated and rejected this run

  • Overlap the metrics warm-up with output render (move await metricsWarmupPromise / await tokenCacheLoadPromise out of the pre-produceOutput critical path into the calculateMetrics branch of the tail Promise.all): byte-identical output, but noise (30 pairs, −0.9%, t=0.76, median +1.1ms). The ~84ms "idle waiting for metricsWarmupPromise" seen in one profile was a contention artifact from running 5 profiling agents concurrently; under single-process load the worker BPE warm-up already completes during collect/security, so the await is near-instant. The MD5 loop and the Handlebars render both contend for the main thread, so reordering can't overlap them — only reducing absolute CPU (the candidate above) helps.
  • Output scope (src/core/output/): Handlebars template compile already cached per-process; render ~0.4ms warm. Best idea (trim template source + drop runtime .trim()) saves ~5ms isolated but ~2ms e2e — once produceOutput drops below calculateFileMetrics in the tail race, the metrics side gates it. Buffer pre-encode variants: ~6ms isolated, ~1.5ms e2e, same bottleneck shift.
  • File processing (fileProcess/fileManipulate/fileCollect): entire phase is <1.1% of wall time on default config; fileManipulate is inert (removeComments/removeEmptyLines false). TextDecoder/buf.toString swaps and the identity rawFiles.map are all <1ms. Nothing close.
  • Git operations: all git subprocesses complete ~320ms before the tail starts — fully off the critical path. prefetchSortData already overlaps search; no redundant work.
  • Other in-scope micro-items: buildTokenCountTree (~1ms, post-pack() in cliReport), extractOutputWrapper (~6ms, wrapper is a cache hit), two-level byteLength index and charLen-only keys (rejected — silent wrong-token-count correctness bugs). GC: all Scavenge pauses ≤5ms, no stop-the-world on the critical path.

Verdict

The MD5→SHA-1+charLen cache-key change is a correct, safe, measurable ~1.2–1.6% win but falls short of the strict 2% threshold, so it was reverted. No other behavior-preserving single change in src clears the bar on this corpus; the pipeline has no remaining serial main-thread segment whose removal reaches 2% on a warm run.


Generated by Claude Code
