Commit 5ff9b1f

perf(core): Add newline pre-filter to base64 run detection
Skip the per-character `hasLongBase64Run` scan for files whose lines are all shorter than the 256-char standalone-base64 threshold.

Why:
- `truncateBase64Content` runs on the main thread for every collected file (no worker pool on the default pack path), so its CPU cost is fully on the serial critical path. With `truncateBase64: true` (set in this repo's own repomix.config.json, the benchmark target) it is the dominant cost of the file-processing phase.
- `hasLongBase64Run` previously charCodeAt-scanned every byte of every file (~5.5 MB across ~1.1k files) just to gate the standalone-base64 regex.

What:
- A 256-char base64 run cannot contain a newline (`\n` is not a base64 character and resets the run), so it must fit inside a single line. Before the byte scan, walk newline offsets with the native `String.prototype.indexOf`; if no line reaches the threshold, no run is possible and we return early. Files with a long line fall through to the unchanged full scan.

Behavior-preserving:
- The pre-filter can only short-circuit when a long run is provably absent; any file with a >=256-char line still runs the authoritative byte scan, so results are identical. CLI output verified byte-identical across xml/markdown/json/plain. Isolated run over all 1127 repo files: 0 mismatches vs the previous implementation.

Benchmark (this container, `node bin/repomix.cjs`, warm cache):
- Isolated `truncateBase64Content` over the full repo file set (interleaved, JIT-warmed median): 42.5ms -> 17.3ms, -25.2ms.
- Whole-process wall clock (interleaved, noise floor):
    min -32.5ms (-3.73%)
    p25 -24.3ms (-2.53%)
  Comfortably above the 2%-of-total improvement bar.

Tests:
- Added newline-split / many-short-lines / CRLF / no-newline cases to tests/core/file/truncateBase64.test.ts. Full suite: 1345 passing.
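The equivalence claim above ("0 mismatches vs the previous implementation") can be checked with a small harness along the following lines. This is only a sketch under stated assumptions: `MIN_RUN`, `isBase64Char`, `fullScan`, `withPreFilter`, `countMismatches`, and the sample inputs are hypothetical names for illustration, and the body of the full scan is a guess at the loop that the diff shows only in part, not the repository's actual code.

```typescript
// Assumed threshold, taken from the "256-char" figure in the commit message.
const MIN_RUN = 256;

// Hypothetical alphabet check mirroring the [A-Za-z0-9+/] character class.
const isBase64Char = (c: number): boolean =>
  (c >= 65 && c <= 90) || // A-Z
  (c >= 97 && c <= 122) || // a-z
  (c >= 48 && c <= 57) || // 0-9
  c === 43 || // +
  c === 47; // /

// Reference behavior: scan every character, with no pre-filter.
const fullScan = (content: string): boolean => {
  let run = 0;
  for (let i = 0; i < content.length; i++) {
    if (isBase64Char(content.charCodeAt(i))) {
      if (++run >= MIN_RUN) return true;
    } else {
      run = 0; // any non-base64 character, including "\n", resets the run
    }
  }
  return false;
};

// Same check with the newline pre-filter placed in front of the scan.
const withPreFilter = (content: string): boolean => {
  const len = content.length;
  if (len < MIN_RUN) return false;
  let lineStart = 0;
  let newlineIndex = content.indexOf('\n');
  while (newlineIndex !== -1) {
    // Found a line long enough to hold a run: stop and do the full scan.
    if (newlineIndex - lineStart >= MIN_RUN) break;
    lineStart = newlineIndex + 1;
    newlineIndex = content.indexOf('\n', lineStart);
  }
  // Every newline-terminated line was short; check the final segment too.
  if (newlineIndex === -1 && len - lineStart < MIN_RUN) return false;
  return fullScan(content);
};

// Counts inputs on which the two implementations disagree.
const countMismatches = (inputs: string[]): number =>
  inputs.filter((s) => fullScan(s) !== withPreFilter(s)).length;

// Illustrative inputs modeled on the test cases added in this commit.
const longRun = 'A'.repeat(300);
const samples: string[] = [
  '',
  `${'A'.repeat(160)}\n${'A'.repeat(160)}`, // run split by a newline
  `${'const a = 1;\n'.repeat(50)}const data = "${longRun}";`, // run after short lines
  `const data = "${longRun}";\r\nconst next = 2;\r\n`, // CRLF line endings
  `prefix-${longRun}-suffix`, // no newline at all
  ' '.repeat(400), // long line without any base64 run
];

console.log(`mismatches: ${countMismatches(samples)}`);
```

The pre-filter only ever returns `false` early, so a disagreement could arise only if it ruled out a run that the full scan would have found. The samples above exercise each exit path of the pre-filter.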
1 parent 84b2603 commit 5ff9b1f

2 files changed

Lines changed: 56 additions & 0 deletions

src/core/file/truncateBase64.ts

Lines changed: 19 additions & 0 deletions
@@ -24,6 +24,25 @@ const standaloneBase64Pattern = new RegExp(`([A-Za-z0-9+/]{${MIN_BASE64_LENGTH_S
 const hasLongBase64Run = (content: string): boolean => {
   const len = content.length;
   if (len < MIN_BASE64_LENGTH_STANDALONE) return false;
+  // Newline pre-filter: `\n` is not a base64 character, so it always resets the
+  // run below. A run of `MIN_BASE64_LENGTH_STANDALONE` therefore has to fit
+  // inside a single line. If every line is shorter than that threshold no run is
+  // possible, and we can bail out before the per-character scan. `indexOf` is a
+  // native (memchr-style) scan, far cheaper than the charCodeAt loop, and the
+  // vast majority of source files have no such long line, so this skips the hot
+  // loop entirely for them.
+  let lineStart = 0;
+  let newlineIndex = content.indexOf('\n');
+  while (newlineIndex !== -1) {
+    if (newlineIndex - lineStart >= MIN_BASE64_LENGTH_STANDALONE) break;
+    lineStart = newlineIndex + 1;
+    newlineIndex = content.indexOf('\n', lineStart);
+  }
+  // The final segment (after the last newline, or the whole content when there
+  // is none) also needs the length check before we can rule out a long run.
+  if (newlineIndex === -1 && len - lineStart < MIN_BASE64_LENGTH_STANDALONE) {
+    return false;
+  }
   let run = 0;
   for (let i = 0; i < len; i++) {
     const c = content.charCodeAt(i);

tests/core/file/truncateBase64.test.ts

Lines changed: 37 additions & 0 deletions
@@ -127,6 +127,43 @@ describe('truncateBase64Content', () => {
     expect(result).toBe(input);
   });
 
+  it('should not truncate a base64-like run split across a newline', () => {
+    // A 320-char base64 body interrupted by a newline: neither line segment
+    // reaches 256, and `\n` resets the run, so nothing should be truncated.
+    // Guards the newline pre-filter in `hasLongBase64Run`.
+    const half = longBase64.slice(0, 160);
+    const input = `const data = "${half}\n${half}";`;
+    const result = truncateBase64Content(input);
+    expect(result).toBe(input);
+  });
+
+  it('should truncate a long base64 run that follows many short lines', () => {
+    // Many short lines (each < 256) precede the real run, so the newline
+    // pre-filter must fall through to the full scan and still truncate.
+    const shortLines = 'const a = 1;\n'.repeat(50);
+    const input = `${shortLines}const data = "${longBase64}";`;
+    const result = truncateBase64Content(input);
+    expect(result).toContain('DTJXfKHG6xA1Wn+kye4TOF2Cp8zxFjtg...');
+    expect(result.startsWith(shortLines)).toBe(true);
+  });
+
+  it('should truncate a base64 run on a CRLF-terminated line', () => {
+    // The `\r` before `\n` is also non-base64; the long line must still be
+    // detected by the pre-filter and truncated by the full scan.
+    const input = `const data = "${longBase64}";\r\nconst next = 2;\r\n`;
+    const result = truncateBase64Content(input);
+    expect(result).toContain('DTJXfKHG6xA1Wn+kye4TOF2Cp8zxFjtg...');
+    expect(result).toContain('const next = 2;');
+  });
+
+  it('should truncate a long base64 run with no newline at all', () => {
+    // Single-line content (no `\n`): the pre-filter treats the whole string as
+    // one segment and must fall through to the full scan.
+    const input = `prefix-${longBase64}-suffix`;
+    const result = truncateBase64Content(input);
+    expect(result).toContain('DTJXfKHG6xA1Wn+kye4TOF2Cp8zxFjtg...');
+  });
+
   it('should leave non-base64 data URIs untouched', () => {
     // `data:text/plain,hello` has no `;base64,` literal, so the dataUriPattern
     // cannot match. Verifies the `includes(';base64,')` guard short-circuits
