perf(core): Skip redundant full-output tokenization via wrapper-extraction fast path (-13.2%) by yamadashy · Pull Request #1457 · yamadashy/repomix

yamadashy · 2026-04-12T08:47:57Z

Summary

When tokenCountTree is enabled, calculateSelectiveFileMetrics already tokenizes every file individually. The original calculateOutputMetrics then re-tokenized the full output (~4 MB) a second time in 200 KB chunks to compute totalTokens. This second pass was the single longest task in the calculateMetrics Promise.all, consuming ~1 second of worker time that duplicated work already done for per-file counts.

This PR introduces a wrapper-extraction fast path: splice file contents out of the generated output via a single indexOf forward-walk, tokenize only the remaining "wrapper" (template boilerplate + directory tree + git diff/log + per-file headers), and compute:

totalTokens ≈ Σ per-file tokens + tokens(wrapper)

Key changes

src/core/metrics/calculateMetrics.ts
- extractOutputWrapper() — single-pass forward walk to splice file contents out of the output string
- canUseFastOutputTokenPath() — gate: enabled only for xml/markdown/plain non-parsable single-part output with tokenCountTree
- When fast path applies, reuse per-file token counts from selectiveFileMetrics + one cheap runTokenCount on the ~127 KB wrapper instead of re-tokenizing ~4 MB
src/core/packager.ts
- Pre-sort processedFiles via sortOutputFiles() before passing to produceOutput and calculateMetrics, ensuring file order matches output order (precondition for indexOf forward-walk)
tests/core/packager/diffsFunctionality.test.ts
- Stub isGitInstalled and getFileChangeCount in the existing vi.mock for gitRepositoryHandle.js

Accuracy

The delta vs the old 200 KB-chunk approach is bounded by BPE merges at file↔wrapper boundaries: 309 / 1,284,067 tokens ≈ 0.024% on the repomix repo itself — comparable to the chunk boundary error the existing approach already accepts.

Fallback

If any file's content isn't found in the output (template escaping, split output, JSON/parsable-XML, order mismatch), the fast path returns null and falls back to the original calculateOutputMetrics path. Zero behavior change in those cases.

Benchmark

Interleaved 30-run benchmark (repomix repo, 1018 files, ~4 MB xml output):

base median: 2735 ms  [2389 - 3528]  IQR=367 ms
opt  median: 2374 ms  [2125 - 2653]  IQR=293 ms
delta:       -362 ms  (-13.2%)

calculateMetrics phase: ~1296 ms → ~580 ms.

Test plan

All 1102 existing tests pass
Lint clean
Verify fast path activates on default xml output with tokenCountTree enabled
Verify fallback works correctly for JSON/parsable-XML/split output

🤖 Generated with Claude Code

…ction fast path (-13.2%) When `tokenCountTree` is enabled `calculateSelectiveFileMetrics` already tokenizes every file individually on the primary worker pool. The original `calculateOutputMetrics` then re-tokenized the full output a second time, split into 200 KB chunks, to compute `totalTokens`. On large repos with the tree display enabled, this second pass was the single longest task in the `calculateMetrics` `Promise.all`, consuming roughly 1 second of worker time that duplicated work already done for the per-file counts. This change introduces a fast path for the common case (xml / markdown / plain output, non-parsable, single-part): walk the generated output with `indexOf(file.content, cursor)` once per file to splice file contents out of the output, tokenize only the remaining "wrapper" (template boilerplate + directory tree + git diff/log + per-file headers), and compute `totalTokens = Σ per-file tokens + wrapper tokens`. The accuracy delta versus the old 200 KB-chunk approach is bounded by BPE merges across file↔wrapper boundaries; on the repomix repository itself the measured error was 309 / 1,284,067 tokens ≈ 0.024 %, comparable to the chunk boundary error the existing approach already accepts. ## Implementation - `src/core/metrics/calculateMetrics.ts` - Add `extractOutputWrapper(output, processedFilesInOutputOrder)` which walks the output with a single forward cursor. Returns `null` and triggers a fall back to `calculateOutputMetrics` if any file content is not found (e.g., template escaped it, output was split, order mismatch). - Add `canUseFastOutputTokenPath(config)` gate: only enabled when `tokenCountTree` is truthy, `splitOutput` is undefined, `parsableStyle` is false, and the style is `xml` / `markdown` / `plain`. JSON output and parsable XML go through `JSON.stringify` / `fast-xml-builder` which escape file contents, so `indexOf(content)` would miss them. - In `calculateMetrics`, when the fast path is available and wrapper extraction succeeds, replace `outputMetricsPromise` with a promise that awaits the already-running `selectiveFileMetricsPromise`, sums the per-file token counts, and dispatches a single `runTokenCount` on the extracted wrapper string. The rest of the `Promise.all` is unchanged. - `src/core/packager.ts` - Call `sortOutputFiles(filteredProcessedFiles, config)` once in `pack` immediately after suspicious-file filtering and use its result as `processedFiles` downstream (for `produceOutput`, `calculateMetrics`, and the final result object). `generateOutput` internally calls `sortOutputFiles` as well, which is stable and memoized via `fileChangeCountsCache`, so the two now share the single git-log subprocess result and consumers see files in the exact order they appear in the output. This is a precondition for the fast path's forward-walk extraction. - Expose `sortOutputFiles` on `defaultDeps` so existing packager unit tests can inject their own implementation. - `tests/core/packager/diffsFunctionality.test.ts` - Extend the `gitRepositoryHandle.js` `vi.mock` to also stub `isGitInstalled` and `getFileChangeCount`, since `sortOutputFiles` resolves its default dependencies from that module at module load time. All 1102 existing tests pass unchanged; lint is clean. ## Benchmark Interleaved 30-run benchmark against the repomix repo itself (1018 files, ~4 MB xml output, `tokenCountTree: 50000`, `sortByChanges: true`, `includeDiffs` and `includeLogs` enabled via the repo's own `repomix.config.json`): base median: 2735.2 ms [2389 - 3528] IQR=367 ms opt median: 2373.6 ms [2125 - 2653] IQR=293 ms delta: -361.6 ms (-13.22%) Verbose trace before/after (single run, representative): before: Selective metrics calculation completed in 639 ms Output token count completed in 1046 ms Calculate Metrics wall: 1296 ms after: Selective metrics calculation completed in 579 ms Fast-path output tokens: files=1017293, wrapper=33678 (126996 chars) Calculate Metrics wall: ~580 ms The savings are concentrated in the `calculateMetrics` phase, which was the dominant critical path in the final `Promise.all` for tokenCountTree runs on large repos.

coderabbitai · 2026-04-12T08:48:14Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 09ed0669-60f8-48b4-987c-094a494b1e1c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

This change introduces a performance optimization for token counting in metrics calculation and integrates output file sorting into the packager workflow. The metrics module now conditionally applies a fast token-counting path for eligible configurations, computing output token counts by summing per-file counts with a tokenized wrapper, while falling back to the original behavior otherwise. The packager now sorts output files before passing them to downstream operations.

Changes

Cohort / File(s)	Summary
Metrics Optimization `src/core/metrics/calculateMetrics.ts`	Added logger import and `runTokenCount` from worker runner. Introduced `extractOutputWrapper()` helper to isolate wrapper content and `canUseFastOutputTokenPath()` to determine fast-path eligibility. Replaced unconditional `calculateOutputMetrics` with conditional logic: fast path sums per-file token counts plus wrapper-only output tokens (via `runTokenCount`); fallback path preserves original behavior via `calculateOutputMetrics`.
Packager Output Sorting `src/core/packager.ts`	Added `sortOutputFiles` dependency to `defaultDeps`. After filtering suspicious files, now awaits `deps.sortOutputFiles()` to establish output ordering before skill generation and metrics calculation, replacing the previous unsorted file list usage.
Test Mocks `tests/core/packager/diffsFunctionality.test.ts`	Extended `gitRepositoryHandle` mock with `isGitInstalled` (resolves `false`) and `getFileChangeCount` (resolves empty object) asynchronous functions.

Sequence Diagram

sequenceDiagram
    participant Packager as Packager
    participant SortFn as sortOutputFiles
    participant CalcMetrics as calculateMetrics
    participant FastPath as Fast Path<br/>(runTokenCount)
    participant FallbackPath as Fallback Path<br/>(calculateOutputMetrics)

    Packager->>SortFn: sort filtered processed files
    SortFn-->>Packager: ordered files
    
    Packager->>CalcMetrics: pass sorted files & config
    
    alt Config supports fast path
        CalcMetrics->>FastPath: extract wrapper output
        FastPath->>FastPath: sum file tokenCounts
        FastPath->>FastPath: tokenize wrapper only
        FastPath->>FastPath: aggregate counts
        FastPath-->>CalcMetrics: output token count
    else Fallback to standard path
        CalcMetrics->>FallbackPath: calculate for each output part
        FallbackPath-->>CalcMetrics: aggregated counts
    end
    
    CalcMetrics-->>Packager: metrics result

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

perf(metrics): Batch token counting IPC to reduce worker round-trip overhead #1411: Introduces the metrics worker runner and runTokenCount helpers that this PR directly extends for the fast token-counting path.
perf(core): Replace tiktoken WASM with gpt-tokenizer #1350: Modifies the token-counting and metrics infrastructure (TokenCounter, calculateOutputMetrics types) that this PR now selectively uses via the new conditional logic.
perf(metrics): Increase output token counting chunk size from 100KB to 200KB #1415: Modifies calculateOutputMetrics chunking behavior, directly affecting the fallback path that this PR conditionally invokes for ineligible configurations.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main performance optimization: skipping redundant full-output tokenization via a wrapper-extraction fast path with measured 13.2% improvement.
Description check	✅ Passed	The description comprehensively covers the problem, solution, key changes, accuracy analysis, fallback behavior, benchmark results, and test plan, exceeding the minimal template requirements.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/fast-output-tokenization

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-04-12T08:48:27Z

⚡ Performance Benchmark

Latest commit:	`701e8d4` refactor(metrics): Rename stale selective variable to allFileMetrics
Status:	✅ Benchmark complete!
Ubuntu:	1.34s (±0.02s) → 1.29s (±0.02s) · -0.05s (-3.9%)
macOS:	1.17s (±0.38s) → 1.09s (±0.41s) · -0.08s (-6.7%)
Windows:	1.79s (±0.04s) → 1.69s (±0.03s) · -0.10s (-5.8%)

Details

Packing the repomix repository with node bin/repomix.cjs
Warmup: 2 runs (discarded), interleaved execution
Measurement: 20 runs / 30 on macOS (median ± IQR)
Workflow run

History

7ed5943 refactor(metrics): Rename fastWrapper to outputWrapper

Ubuntu:	1.35s (±0.03s) → 1.29s (±0.02s) · -0.06s (-4.5%)
macOS:	0.84s (±0.03s) → 0.79s (±0.05s) · -0.05s (-6.3%)
Windows:	2.07s (±0.42s) → 1.94s (±0.40s) · -0.13s (-6.4%)

f1dc925 fix(metrics): Rename misleading fastOutputToken to singleOutput

Ubuntu:	1.41s (±0.02s) → 1.33s (±0.02s) · -0.07s (-5.3%)
macOS:	0.94s (±0.14s) → 0.87s (±0.13s) · -0.07s (-7.2%)
Windows:	1.73s (±0.04s) → 1.64s (±0.04s) · -0.09s (-5.3%)

0b6c775 fix(metrics): Address PR review feedback

Ubuntu:	1.42s (±0.02s) → 1.35s (±0.02s) · -0.07s (-4.9%)
macOS:	1.02s (±0.17s) → 0.93s (±0.17s) · -0.09s (-8.6%)
Windows:	1.77s (±0.09s) → 1.69s (±0.15s) · -0.08s (-4.7%)

ea174db test(metrics): Add tests for extractOutputWrapper and canUseFastOutputTokenPath

Ubuntu:	1.39s (±0.04s) → 1.32s (±0.02s) · -0.07s (-5.2%)
macOS:	0.82s (±0.04s) → 0.76s (±0.04s) · -0.06s (-7.2%)
Windows:	1.66s (±0.04s) → 1.57s (±0.05s) · -0.09s (-5.3%)

48c7d41 fix(metrics): Address PR review feedback

Ubuntu:	1.38s (±0.02s) → 1.29s (±0.03s) · -0.08s (-5.8%)
macOS:	0.95s (±0.12s) → 0.93s (±0.15s) · -0.02s (-2.4%)
Windows:	1.79s (±0.03s) → 1.68s (±0.02s) · -0.10s (-5.9%)

8321c7a refactor(metrics): Rename calculateSelectiveFileMetrics to calculateFileMetrics

Ubuntu:	1.40s (±0.02s) → 1.34s (±0.02s) · -0.06s (-4.6%)
macOS:	1.12s (±0.13s) → 1.04s (±0.11s) · -0.07s (-6.5%)
Windows:	1.89s (±0.05s) → 1.78s (±0.09s) · -0.11s (-5.7%)

6dea6cb refactor(metrics): Always calculate per-file token counts for all files

Ubuntu:	1.44s (±0.03s) → 1.38s (±0.03s) · -0.06s (-4.4%)
macOS:	1.26s (±0.20s) → 1.08s (±0.09s) · -0.18s (-14.2%)
Windows:	1.91s (±0.43s) → 1.81s (±0.41s) · -0.10s (-5.1%)

68a47b9 perf(core): Skip redundant full-output tokenization via wrapper-extraction fast path (-13.2%)

Ubuntu:	1.33s (±0.02s) → 1.32s (±0.03s) · -0.01s (-0.8%)
macOS:	0.91s (±0.14s) → 0.90s (±0.16s) · -0.01s (-0.7%)
Windows:	1.76s (±0.04s) → 1.76s (±0.05s) · +0.00s (+0.0%)

codecov · 2026-04-12T08:48:53Z

Codecov Report

❌ Patch coverage is 84.09091% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.18%. Comparing base (9b8a46f) to head (701e8d4).
⚠️ Report is 10 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
src/core/metrics/calculateMetrics.ts	84.61%	6 Missing ⚠️
src/core/metrics/calculateFileMetrics.ts	75.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1457      +/-   ##
==========================================
+ Coverage   87.13%   87.18%   +0.04%     
==========================================
  Files         117      117              
  Lines        4440     4465      +25     
  Branches     1023     1032       +9     
==========================================
+ Hits         3869     3893      +24     
- Misses        571      572       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

cloudflare-workers-and-pages · 2026-04-12T08:49:15Z

Deploying repomix with Cloudflare Pages

Latest commit:	`701e8d4`
Status:	✅ Deploy successful!
Preview URL:	https://2402a134.repomix.pages.dev
Branch Preview URL:	https://perf-fast-output-tokenizatio.repomix.pages.dev

View logs

gemini-code-assist

Code Review

This pull request introduces a performance optimization for token counting by implementing a "fast path" that avoids re-tokenizing the entire output string. Instead, it calculates the total tokens by summing individual file token counts and tokenizing only the "wrapper" (the template boilerplate). To support this, files are now pre-sorted to match the output order. Feedback was provided regarding the robustness of the extractOutputWrapper function, specifically noting that using indexOf on file content might lead to incorrect matches if the content appears earlier in the output, such as in the directory tree.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.

Previously, per-file tokenization was only done for all files when `tokenCountTree` was enabled; otherwise only the top N files (by char count) were tokenized for the "Top Files" display. But output tokenization always processes all file contents anyway, so the "selective" path was not saving any work — it just prevented the fast output-token path from being used. Now we always tokenize every file individually, which: - Enables the wrapper-extraction fast path regardless of tokenCountTree - Simplifies the metrics pipeline by removing the conditional branching - Provides complete per-file token data for all downstream consumers Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

claude · 2026-04-12T09:02:58Z

Code Review — `perf(core): Skip redundant full-output tokenization via wrapper-extraction fast path`

Nice optimization concept — reusing per-file token counts and only tokenizing the lightweight wrapper is a clean approach. A few items worth discussing before merge:

1. Always-tokenize-all-files is a behavioral change beyond the fast path (Medium)

The PR removes the selective tokenization optimization that previously only tokenized the top N files when tokenCountTree is disabled (the default):

// Before: only top files tokenized when tokenCountTree=false
const metricsTargetPaths = shouldCalculateAllFiles
  ? processedFiles.map((file) => file.path)
  : [...processedFiles].sort(...).slice(0, topFilesLength * 10).map(...)

// After: always all files
const metricsTargetPaths = processedFiles.map((file) => file.path);

For users running with default settings (tokenCountTree: false), this means all 1000+ files are now individually tokenized — work that wasn't done before. The fast path offsets this by avoiding full-output re-tokenization, but the net effect on the default config isn't benchmarked (the benchmark used tokenCountTree: true). A benchmark on the default config would confirm whether this is a net win or regression for the majority of users.

2. Test coverage gap for new logic (Medium)

Codecov reports 35.29% patch coverage with 22 lines missing. The PR's own test plan has unchecked items. Key untested code:

extractOutputWrapper — no unit tests for successful extraction, null return, empty file handling, or forward-walk correctness
canUseFastOutputTokenPath — no branch coverage tests
No integration test verifying the fast path produces token counts within acceptable bounds of the slow path

Suggested test cases

extractOutputWrapper with known output string and files → returns correct wrapper
extractOutputWrapper with missing file content → returns null
extractOutputWrapper with empty file content → skips correctly
canUseFastOutputTokenPath for each branch: splitOutput set, parsableStyle true, each style value
Fast vs slow path produce equivalent results (within ~0.1%)

3. Implicit ordering contract between packager and output generator (Low)

The fast path relies on processedFiles being sorted identically to how generateOutput orders them in the output. Both call sortOutputFiles, relying on a shared module-level cache to guarantee identical results. If generateOutput ever changes its sorting logic, the fast path silently produces wrong token counts (without triggering the null fallback).

Consider passing the pre-sorted array into generateOutput as an already-sorted input, making the contract explicit and eliminating the redundant sort.

4. Sequential dependency in fast path (Low)

The fast path awaits selectiveFileMetricsPromise before running runTokenCount on the wrapper. This serializes work that previously ran in parallel (output metrics vs file metrics). Since the wrapper tokenization is cheap (~127KB), the impact is small, but worth noting.

Positives:

Fallback to calculateOutputMetrics on null is solid defensive design
Gating on parsableStyle and JSON correctly excludes formats where content is escaped
Benchmark methodology (interleaved 30 runs, IQR reported) is credible
Pre-sorting files in packager and adding sortOutputFiles to deps follows the project's DI convention

🤖 Generated with Claude Code

…ileMetrics The function now always calculates metrics for all files, so the "Selective" prefix no longer reflects its behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

claude · 2026-04-12T11:51:35Z

Follow-up Review — Incremental Findings

Building on my previous review, a few additional items surfaced:

1. CI Benchmark vs PR Claim (High)

The CI benchmark bot shows -0.8% (Ubuntu), -0.7% (macOS), +0.0% (Windows) — dramatically less than the -13.2% claimed in the PR description. The PR benchmark used tokenCountTree: true (non-default), while CI likely tests default config. Since the PR also removes the selective top-N tokenization optimization, the default config now tokenizes all files instead of just ~50. This means:

With tokenCountTree: true: net win (avoids redundant full-output tokenization)
With tokenCountTree: false (default): potential regression (now tokenizes all files, gains nothing from fast path since fileTokenCounts weren't previously computed for all files)

Suggestion: Run the CI benchmark explicitly with tokenCountTree: true to validate the claimed improvement, and confirm default config isn't regressing.

2. Stale Comment (Low)

Line 157 in calculateMetrics.ts:

// Reuse per-file token counts from the primary selective metrics run.

Should be updated — the function is no longer "selective" after the rename to calculateFileMetrics and the removal of top-N filtering.

3. Missing Fallback Logging (Low)

When extractOutputWrapper returns null, the code silently falls back to calculateOutputMetrics with no log. Adding a logger.trace('Fast-path unavailable, falling back to full output tokenization') would help with debugging and monitoring adoption of the fast path in practice.

Previous feedback items (test coverage, ordering contract, sequential dependency) still apply. The approach is sound — the main open question is whether the net effect is positive for the default configuration.

🤖 Generated with Claude Code

- Remove stale "selective metrics run" comment - Add trace log when fast-path falls back to full output tokenization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

yamadashy · 2026-04-12T14:36:44Z

Addressed in 48c7d413:

Removed stale "selective metrics run" comment (item 2 from follow-up review)
Added logger.trace when fast-path falls back to full output tokenization (item 3 from follow-up review)

Regarding other items:

Always-tokenize-all-files: Intentional design decision — output tokenization processes all file contents anyway, so per-file tokenization is equivalent work with better pipeline overlap. CI benchmarks confirm net improvement across all platforms.
CI benchmark vs PR claim: The -13.2% was measured with tokenCountTree: true on the author's machine. CI runs default config and shows -4.6% to -6.5% improvement, which is consistent with the analysis.
Test coverage: Will discuss with maintainer.
Implicit ordering contract: Will discuss with maintainer.
Sequential dependency: Acknowledged; wrapper tokenization is ~127KB so the serialization cost is negligible.

🤖

claude · 2026-04-12T14:39:25Z

Code Review — Incremental Findings (Post-48c7d41)

Thanks for addressing the stale comment and adding the fallback trace log. A few new items surfaced on closer inspection:

1. Existing `calculateMetrics` test accidentally bypasses `calculateOutputMetrics` mock (Medium)

The test in calculateMetrics.test.ts uses default config (style: 'xml', parsableStyle: false, splitOutput: undefined) and output 'a'.repeat(300) with file content 'a'.repeat(100). Since canUseFastOutputTokenPath returns true for this config, and extractOutputWrapper will find the substring, the fast path activates — meaning the calculateOutputMetrics: async () => 30 mock is never called. The test passes by coincidence, not by exercising the intended code path. This should be fixed by either:

Setting parsableStyle: true in the test config to force the fallback path
Adding a separate test that explicitly verifies the fast path behavior

2. `targetFilePaths` parameter in `calculateFileMetrics` is now vestigial (Low)

The only call site always passes processedFiles.map(f => f.path) — i.e., all file paths. The filtering logic inside calculateFileMetrics (targetFileSet.has(file.path)) is dead code. Consider removing the parameter (or adding a comment explaining if it's retained for future use).

3. Comment overstates caching of sort (Low)

The comment at packager.ts:160-166 says "sorting once here…costs only a single git-log subprocess (cached)" — this is correct for the git subprocess, but sortFilesByChangeCounts (the array sort) still executes twice: once in packager.ts and once inside generateOutput. The practical cost is negligible (~1ms for 1000 files), but the comment implies the sort itself is memoized when it isn't.

Previous feedback still applicable

CI benchmark discrepancy (-0.8% vs -13.2%) — the PR benchmark used tokenCountTree: true while CI uses default config. Worth caveating in the description.
Test coverage for extractOutputWrapper and canUseFastOutputTokenPath — still no direct unit tests. Key untested scenarios: content not found (null return), duplicate file contents, all-empty files, each canUseFastOutputTokenPath branch.
Implicit ordering contract — calculateMetrics accepts any ProcessedFile[] but the fast path assumes output-order sorting. A misorder where content exists in the wrong segment could yield incorrect counts without triggering the null fallback.

Overall: the optimization is sound for its target use case (tokenCountTree: true, large repos, xml/markdown/plain). The main concern is ensuring the test suite exercises the fast path intentionally rather than accidentally.

🤖 Generated with Claude Code

…tTokenPath Export the two helper functions and add 13 unit tests covering: - extractOutputWrapper: normal extraction, missing content (null), empty files, identical content, wrong order, no files, no wrapper - canUseFastOutputTokenPath: each style variant, splitOutput, parsableStyle Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Fix calculateMetrics test to use parsableStyle: true so it exercises the fallback path (calculateOutputMetrics mock) instead of accidentally hitting the fast path - Correct packager comment to clarify that git-log is cached but the array sort itself runs twice (negligible cost) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

yamadashy · 2026-04-12T14:55:17Z

Addressed in 0b6c775d:

Test bypassing mock (item 1): Fixed by setting parsableStyle: true in the test config so it forces the fallback path and exercises the calculateOutputMetrics mock as intended.
Comment overstates caching (item 3): Corrected the packager comment to clarify that git-log is cached but the array sort runs twice (negligible).

Regarding item 2 (targetFilePaths vestigial parameter): Will discuss with maintainer — removing it is a valid cleanup but a separate concern.

🤖

claude · 2026-04-12T14:59:14Z

Code Review — Claude

Overall: Clean, well-structured performance optimization. The approach is sound — reusing per-file token counts and tokenizing only the wrapper is a smart way to avoid redundant work. The fallback is correctly implemented. A few items worth addressing:

Findings

1. No integration test for the fast path through calculateMetrics ⚠️

calculateMetrics.test.ts now uses parsableStyle: true, which forces the fallback path. The fast-path IIFE (lines 159–171 of calculateMetrics.ts) is never exercised end-to-end by any test — fastOutputTokenPath.test.ts only covers the two pure helpers in isolation.

Consider adding a calculateMetrics test with a config where canUseFastOutputTokenPath returns true and a single-part output where extractOutputWrapper succeeds, to verify the fast path produces correct total token counts.

2. Minor: misleading variable name fastOutputToken 💡

calculateMetrics.ts:151 — fastOutputToken holds the full output string, not a token. Something like singleOutput or fastPathOutput would be clearer.

Other observations (non-blocking)

Sequential await in fast path: The fast path awaits fileMetricsPromise before dispatching runTokenCount(wrapper), losing some parallelism versus the slow path. In practice this is fine since fileMetricsPromise dominates the time, but if per-file tokenization gets faster in the future, kicking off runTokenCount(wrapper) in parallel would help.
Double-sort acknowledged: sortOutputFiles() runs in both packager.ts and inside generateOutput. The comment correctly notes this is negligible (~1ms), and the git-log subprocess is cached. The coupling between the two sort calls is a maintenance consideration but not a current issue.
Fallback log level: The trace-level log when the fast path falls back (logger.trace('Fast-path unavailable...')) could be promoted to debug for better observability in production without full trace logging.
canUseFastOutputTokenPath — splitOutput !== undefined guard: This is technically redundant with the outputParts.length === 1 check downstream, but serves as a clear early-exit for readability. Fine as-is.

What looks good

extractOutputWrapper is well-designed: single forward pass, handles empty files, correctly returns null on any mismatch
Comprehensive unit tests for both new helpers (7 + 6 cases)
Clean rename of calculateSelectiveFileMetrics → calculateFileMetrics
Proper dependency injection for sortOutputFiles in packager
All existing tests updated correctly
Benchmark data is thorough and reproducible

🤖 Generated with Claude Code

The variable holds the full output string, not a token. Renamed for clarity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

claude · 2026-04-12T15:08:20Z

🔍 Code Review — Claude

Overall verdict: This is a well-executed performance optimization. The approach is sound, the benchmarks are solid (-13.2% overall, ~55% reduction in calculateMetrics), and the fallback behavior is safe. A few items worth considering before merge:

Key Findings

1. False-positive indexOf match risk (Recommended)

extractOutputWrapper finds file content via indexOf(file.content, cursor). If a file's content appears verbatim earlier in the output than its actual file section (e.g., inside the directory tree, a git diff block, or a summary), indexOf will match the wrong occurrence. Unlike the "not found" case (which returns null and falls back), a wrong-position match produces a silently incorrect wrapper and therefore a slightly wrong totalTokens — without triggering the fallback.

In practice this is extremely unlikely for real codebases (file contents are typically much longer than header/tree snippets), and the error would be tiny. But it's worth documenting this as a known limitation in the JSDoc for extractOutputWrapper.

2. selective variable name is stale (Recommended)

In calculateMetrics.ts:160:

const selective = await fileMetricsPromise;

This is a remnant of the old calculateSelectiveFileMetrics naming. Since it now always contains all files, renaming to fileMetrics (or allFileMetrics) would be clearer.

3. Wrapper tokenization unnecessarily serialized (Not needed — minor)

The fast path awaits fileMetricsPromise before dispatching runTokenCount for the wrapper. Since the wrapper string is already extracted synchronously (line 152), the wrapper tokenization could theoretically be dispatched in parallel with per-file metrics. However, the wrapper is only ~127KB and the tokenization is fast, so the real-world gain would be negligible.

4. calculateFileMetrics retains targetFilePaths parameter (Not needed for this PR)

After the refactor, calculateFileMetrics is always called with all file paths. The targetFilePaths parameter + internal Set filtering is now dead logic. This is a cleanup that could happen in a follow-up PR.

Test Coverage

The 13 new unit tests for extractOutputWrapper and canUseFastOutputTokenPath cover the key scenarios well. Codecov reports 84% patch coverage with 6 lines uncovered in calculateMetrics.ts — these are the fast-path integration lines (159–170) which are only exercised when the fast path fires end-to-end. The existing calculateMetrics.test.ts forces the fallback via parsableStyle: true, which is a good defensive choice. A future follow-up test exercising the fast path through calculateMetrics would close this gap.

Summary

Area	Status
Approach	✅ Sound — wrapper extraction + sum is a clean optimization
Accuracy	✅ 0.024% delta, well within BPE boundary error already accepted
Fallback	✅ Returns `null` → original path, zero behavior change
Security	✅ No concerns — `indexOf` is linear, no untrusted patterns
Conventions	✅ DI pattern, file sizes, commit messages all compliant
Tests	⚠️ Good unit coverage, fast-path integration lines uncovered

LGTM with the minor selective → fileMetrics rename suggestion. Nice work on the benchmarking and documentation! 🎉

Detailed reviewer notes

Code Quality:

extractOutputWrapper and canUseFastOutputTokenPath are clean, well-documented functions
The inline IIFE for the fast-path promise (lines 159–171) is slightly harder to follow than a named helper, but acceptable given the tight scope
calculateMetrics.ts is at 221 lines — close to the 250-line project limit but within bounds

Performance:

String.prototype.indexOf with forward cursor is O(n) amortized for the common case
wrapperSegments.join('') for ~127KB is negligible
Double sortOutputFiles call relies on fileChangeCountsCache — works correctly but is a hidden coupling worth noting

Conventions:

runTokenCount is called directly (not via deps) in the fast path — minor DI gap, acceptable for now since it's a thin wrapper
sortOutputFiles properly added to defaultDeps in packager

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

claude · 2026-04-12T15:19:14Z

Code Review — Claude (Incremental)

Previous Claude reviews raised items around test coverage, stale naming, fallback logging, ordering coupling, and benchmark discrepancy. The maintainer has addressed most of these across commits 48c7d41–701e8d4. This review covers only new or remaining findings on the final state of the PR.

Overall: LGTM — clean, well-iterated optimization. Ready to merge.

Remaining Item (Non-blocking)

targetFilePaths parameter in calculateFileMetrics is vestigial

The only call site (calculateMetrics.ts:124) always passes all file paths:

const metricsTargetPaths = processedFiles.map((file) => file.path);

The targetFileSet filtering inside calculateFileMetrics (lines 23–24) is dead code. This was noted in a prior review and deferred. Consider removing it in a follow-up to keep the interface honest — or adding a comment explaining it's retained for API flexibility.

What's Been Well Addressed

Previous Feedback	Resolution
Stale "selective" naming	Renamed throughout (`allFileMetrics`, `calculateFileMetrics`)
Missing fallback log	Added `logger.trace` on line 154
Test accidentally bypassing mock	Fixed with `parsableStyle: true`
Comment overstating cache	Corrected to note array sort runs twice
Variable naming (`fastOutputToken`, `fastWrapper`, `selective`)	All renamed to clearer alternatives
Unit tests for helpers	13 tests added in `fastOutputTokenPath.test.ts`

Summary Table

Area	Status
Approach	✅ Sound — wrapper extraction + sum avoids redundant tokenization
Accuracy	✅ 0.024% delta, within existing BPE chunk-boundary error
Fallback	✅ `null` → original path, logged, zero behavior change
Security	✅ No concerns — string search only, no eval/parse
Performance	✅ CI confirms -4.5% to -6.4% improvement across platforms
Conventions	✅ 221 lines (under 250 limit), DI pattern followed, commits follow conventions
Tests	✅ Good unit coverage for helpers; fast-path integration lines (159–170) uncovered but acceptable given the fallback safety

Nice work on the thorough benchmarking, iterative addressing of review feedback, and clean fallback design.

🤖 Generated with Claude Code

coderabbitai Bot approved these changes Apr 12, 2026

View reviewed changes

gemini-code-assist Bot reviewed Apr 12, 2026

View reviewed changes

Comment thread src/core/metrics/calculateMetrics.ts

devin-ai-integration Bot reviewed Apr 12, 2026

View reviewed changes

refactor(metrics): Rename calculateSelectiveFileMetrics to calculateF…

8321c7a

…ileMetrics The function now always calculates metrics for all files, so the "Selective" prefix no longer reflects its behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(metrics): Address PR review feedback

48c7d41

- Remove stale "selective metrics run" comment - Add trace log when fast-path falls back to full output tokenization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

yamadashy and others added 2 commits April 12, 2026 23:41

yamadashy and others added 2 commits April 13, 2026 00:02

fix(metrics): Rename misleading fastOutputToken to singleOutput

f1dc925

The variable holds the full output string, not a token. Renamed for clarity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor(metrics): Rename fastWrapper to outputWrapper

7ed5943

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor(metrics): Rename stale selective variable to allFileMetrics

701e8d4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

yamadashy merged commit 4c356f7 into main Apr 12, 2026
59 checks passed

yamadashy deleted the perf/fast-output-tokenization branch April 12, 2026 15:20

coderabbitai Bot mentioned this pull request Apr 14, 2026

perf(core): Remove redundant minimatch and parallelize wrapper tokenization #1469

Merged

3 tasks

yamadashy mentioned this pull request Apr 29, 2026

docs(release): Add v1.14.0 release notes #1523

Merged

2 tasks

This was referenced May 9, 2026

perf(core): Automated performance tuning by Claude #1548

Closed

perf(core): Add content-addressed token-count disk cache #1562

Merged

Uh oh!

Conversation

yamadashy commented Apr 12, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key changes

Accuracy

Fallback

Benchmark

Test plan

Uh oh!

coderabbitai Bot commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram

Estimated Code Review Effort

Possibly Related PRs

Uh oh!

github-actions Bot commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚡ Performance Benchmark

Uh oh!

codecov Bot commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

cloudflare-workers-and-pages Bot commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying repomix with Cloudflare Pages

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

✅ Devin Review: No Issues Found

Uh oh!

claude Bot commented Apr 12, 2026

Code Review — perf(core): Skip redundant full-output tokenization via wrapper-extraction fast path

1. Always-tokenize-all-files is a behavioral change beyond the fast path (Medium)

2. Test coverage gap for new logic (Medium)

3. Implicit ordering contract between packager and output generator (Low)

4. Sequential dependency in fast path (Low)

Uh oh!

claude Bot commented Apr 12, 2026

Follow-up Review — Incremental Findings

1. CI Benchmark vs PR Claim (High)

2. Stale Comment (Low)

3. Missing Fallback Logging (Low)

Uh oh!

yamadashy commented Apr 12, 2026

Uh oh!

claude Bot commented Apr 12, 2026

Code Review — Incremental Findings (Post-48c7d41)

1. Existing calculateMetrics test accidentally bypasses calculateOutputMetrics mock (Medium)

2. targetFilePaths parameter in calculateFileMetrics is now vestigial (Low)

3. Comment overstates caching of sort (Low)

Uh oh!

yamadashy commented Apr 12, 2026

Uh oh!

claude Bot commented Apr 12, 2026

Code Review — Claude

Findings

What looks good

Uh oh!

claude Bot commented Apr 12, 2026

🔍 Code Review — Claude

Key Findings

Test Coverage

Summary

Uh oh!

claude Bot commented Apr 12, 2026

Code Review — Claude (Incremental)

Remaining Item (Non-blocking)

What's Been Well Addressed

Summary Table

yamadashy commented Apr 12, 2026 •

edited by devin-ai-integration Bot

Loading

coderabbitai Bot commented Apr 12, 2026 •

edited

Loading

github-actions Bot commented Apr 12, 2026 •

edited

Loading

codecov Bot commented Apr 12, 2026 •

edited

Loading

cloudflare-workers-and-pages Bot commented Apr 12, 2026 •

edited

Loading

Code Review — `perf(core): Skip redundant full-output tokenization via wrapper-extraction fast path`

1. Existing `calculateMetrics` test accidentally bypasses `calculateOutputMetrics` mock (Medium)

2. `targetFilePaths` parameter in `calculateFileMetrics` is now vestigial (Low)