Commit 2d0a45a

perf(file): Sample-scan base64 run detection in truncateBase64
intent(file-process): automated perf tuning pass; single highest-impact, behavior-preserving change against a ~865ms default pack run. truncateBase64 is enabled in this repo's own config, so its precondition scan runs on every packed file in the benchmark workload.

learned(base64-scan): hasLongBase64Run walked every character of every file (~5.5MB per pack, 23ms main-thread self time in CPU profiles, 35ms isolated) even though it almost always returns false. The per-character loop was itself the previous optimization over the regex it gates.

decision(sampled-scan): sample one character every MIN_BASE64_LENGTH_STANDALONE (256) positions. Any qualifying run occupies at least 256 consecutive indices, so it must contain a sample point. Only a sampled base64-class hit triggers a bounded outward expansion to measure the surrounding run, and the sampling phase resets cleanly after each short-run skip (the next possible run, starting at hi+1, always covers sample hi+256).

constraint(equivalence): differential-tested against the per-character reference on the full repo corpus (1096 files, 0 mismatches) plus 20k randomized fuzz cases. A deterministic-LCG differential test now pins both the false-positive and false-negative directions in the suite.

rejected(regex-precheck): /[A-Za-z0-9+/]{256}/.test() measured 4.5x SLOWER than the per-character loop (155ms vs 35ms on the corpus). Bounded repetition re-scans at each start position, so it is not a viable replacement.

rejected(early-git-token-dispatch): pre-dispatching git diff/log token counts from the packager. With a warm token cache they resolve while calculateMetrics awaits outputPromise (Promise.all resolves in ~0ms; the 63-67ms wall time is main-thread-busy completion latency, not queue wait). e2e median +15ms, within noise; unproven.

rejected(collect-concurrency): FILE_COLLECT_CONCURRENCY 50 -> 128/256 gave identical medians over 40 quiet interleaved runs. libuv's 4-thread pool is saturated at depth 50; extra queue depth adds nothing.

rejected(startup-lazy-imports): module-level import() prefetches of tinypool/fast-glob/handlebars all measure 0 to -3ms. ESM already fetches/compiles the static graph in parallel; the budget is sequential module evaluation (~255 modules), and only bundling would cut it.

rejected(lazy-render-context): skipping fileLineCounts + markdownCodeBlockDelimiter on the XML path re-measured at ~11ms p50 quiet (6.2 + 4.7), still below the 2% threshold, matching the previous pass's rejection.

Benchmark (repomix repo itself, ~1100 files, 20 interleaved warm pairs, quiet 4-core Linux, default pack, pristine HEAD worktree build vs patched build):
- end-to-end median 865ms -> 820.5ms (paired delta median -26.5ms, -3.1%), paired mean -37.5ms (t = 5.14), 18/20 pairs improved
- isolated scan cost over the packed corpus: 35.6ms -> 1.6ms p50 (~22x)
- output byte-identical (cmp) vs the base build on the same tree
- 6 new tests: stride alignments 0-511, run ending at EOF, whole-content run, phase reset after short-run skips, near-threshold non-matches, and the seeded differential fuzz

npm run test: 1385/1385 pass.
npm run lint: clean (3 pre-existing warnings in unrelated files).

https://claude.ai/code/session_01Ea6eConhLEQFKZsVkJz1zE
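
The coverage argument behind decision(sampled-scan) can be checked by brute force. The sketch below is not part of the commit: `STRIDE`, `isSampleIndex`, and `windowContainsSample` are hypothetical names, and it models only the fixed-stride case (samples at 255, 511, ...), not the phase resets after short-run skips.

```typescript
// Hypothetical check, not from the commit.
// STRIDE mirrors MIN_BASE64_LENGTH_STANDALONE (256 per the commit message).
const STRIDE = 256;

// Fixed-stride sample points: 255, 511, 767, ...
const isSampleIndex = (i: number): boolean => (i + 1) % STRIDE === 0;

// True if the window [start, start + length) contains at least one sample.
const windowContainsSample = (start: number, length: number): boolean => {
  for (let i = start; i < start + length; i++) {
    if (isSampleIndex(i)) return true;
  }
  return false;
};

// Every window of exactly STRIDE indices should contain a sample,
// whatever its alignment.
let uncovered = 0;
for (let start = 0; start < 4 * STRIDE; start++) {
  if (!windowContainsSample(start, STRIDE)) uncovered++;
}
console.log(`uncovered full-length windows: ${uncovered}`);

// A window one index shorter can fall between samples, which is why the
// stride cannot exceed the minimum run length.
console.log(`255-index window at 0 covered: ${windowContainsSample(0, STRIDE - 1)}`);
```

A stride larger than the minimum run length would leave gaps, so 256 is the largest stride this argument supports.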
1 parent 1f2621e commit 2d0a45a

2 files changed

Lines changed: 139 additions & 15 deletions

src/core/file/truncateBase64.ts

Lines changed: 37 additions & 15 deletions
@@ -12,30 +12,52 @@ const dataUriPattern = new RegExp(
 );
 const standaloneBase64Pattern = new RegExp(`([A-Za-z0-9+/]{${MIN_BASE64_LENGTH_STANDALONE},}={0,2})`, 'g');
 
+// [A-Z]:65-90, [a-z]:97-122, [0-9]:48-57, '+':43, '/':47
+const isBase64CharCode = (c: number): boolean =>
+  (c >= 65 && c <= 90) || (c >= 97 && c <= 122) || (c >= 48 && c <= 57) || c === 43 || c === 47;
+
 /**
- * Cheap precondition for `standaloneBase64Pattern`: scans for any run of
- * `[A-Za-z0-9+/]` reaching `MIN_BASE64_LENGTH_STANDALONE`, the smallest body
+ * Cheap precondition for `standaloneBase64Pattern`: detects whether any run of
+ * `[A-Za-z0-9+/]` reaches `MIN_BASE64_LENGTH_STANDALONE`, the smallest body
  * the regex can match. When this returns false, the regex provably has zero
  * matches, so we can skip the much more expensive backtracking scan over the
- * whole content. The hot loop avoids regex engine overhead and runs ~4x faster
- * than the original `replace`, which dominated `applyLightweightTransforms`
- * CPU on profiles of repos with `truncateBase64: true`.
+ * whole content.
+ *
+ * Instead of testing every character, the scan samples one character every
+ * `MIN_BASE64_LENGTH_STANDALONE` positions: a run of that length occupies
+ * `MIN_BASE64_LENGTH_STANDALONE` consecutive indices, so it necessarily
+ * contains a sample point — no qualifying run can slip between two samples.
+ * Only when a sampled character is in the base64 class does the scan expand
+ * outward to measure the surrounding run (bounded by the run itself, typically
+ * a handful of characters in source code). This visits ~1/64th of the input on
+ * typical text and replaced a per-character loop that dominated
+ * `applyLightweightTransforms` CPU on profiles of repos with
+ * `truncateBase64: true`.
  */
 const hasLongBase64Run = (content: string): boolean => {
   const len = content.length;
   if (len < MIN_BASE64_LENGTH_STANDALONE) return false;
-  let run = 0;
-  for (let i = 0; i < len; i++) {
-    const c = content.charCodeAt(i);
-    // [A-Z]:65-90, [a-z]:97-122, [0-9]:48-57, '+':43, '/':47
-    if ((c >= 65 && c <= 90) || (c >= 97 && c <= 122) || (c >= 48 && c <= 57) || c === 43 || c === 47) {
-      run++;
-      if (run >= MIN_BASE64_LENGTH_STANDALONE) return true;
-    } else {
-      run = 0;
+  let i = MIN_BASE64_LENGTH_STANDALONE - 1;
+  while (true) {
+    // Clamp the last sample to the final character so the trailing partial
+    // window (shorter than the sampling stride) is still covered.
+    if (i >= len) i = len - 1;
+    if (isBase64CharCode(content.charCodeAt(i))) {
+      // Sample hit: measure the maximal run containing it.
+      let lo = i - 1;
+      while (lo >= 0 && isBase64CharCode(content.charCodeAt(lo))) lo--;
+      let hi = i + 1;
+      while (hi < len && isBase64CharCode(content.charCodeAt(hi))) hi++;
+      if (hi - lo - 1 >= MIN_BASE64_LENGTH_STANDALONE) return true;
+      // The run around this sample is too short. Resume sampling after it:
+      // `hi` is a non-base64 index (or `len`), so the next possible run starts
+      // at `hi + 1` and any qualifying run from there contains index
+      // `hi + MIN_BASE64_LENGTH_STANDALONE`.
+      i = hi;
     }
+    if (i >= len - 1) return false;
+    i += MIN_BASE64_LENGTH_STANDALONE;
   }
-  return false;
 };
 
 /**
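
To make the sampled scan traceable by hand, the sketch below restates it with the threshold as a parameter and compares it against a per-character reference on every string of length 0 to 10 over a two-symbol alphabet. This is an illustration under assumptions, not the commit's code: `hasLongRunSampled`, `hasLongRunReference`, and the `minRun` parameter are hypothetical names, and the real function uses the fixed constant `MIN_BASE64_LENGTH_STANDALONE`.

```typescript
// Hypothetical restatement of the diff's logic with a configurable threshold.
const isBase64CharCode = (c: number): boolean =>
  (c >= 65 && c <= 90) || (c >= 97 && c <= 122) || (c >= 48 && c <= 57) || c === 43 || c === 47;

// Sampled scan: probe every `minRun`-th index, expand only on a hit.
const hasLongRunSampled = (content: string, minRun: number): boolean => {
  const len = content.length;
  if (len < minRun) return false;
  let i = minRun - 1;
  while (true) {
    if (i >= len) i = len - 1;
    if (isBase64CharCode(content.charCodeAt(i))) {
      let lo = i - 1;
      while (lo >= 0 && isBase64CharCode(content.charCodeAt(lo))) lo--;
      let hi = i + 1;
      while (hi < len && isBase64CharCode(content.charCodeAt(hi))) hi++;
      if (hi - lo - 1 >= minRun) return true;
      i = hi; // skip past the short run; hi is a non-base64 index or len
    }
    if (i >= len - 1) return false;
    i += minRun;
  }
};

// Per-character reference: counts the current run length at each index.
const hasLongRunReference = (content: string, minRun: number): boolean => {
  let run = 0;
  for (let i = 0; i < content.length; i++) {
    run = isBase64CharCode(content.charCodeAt(i)) ? run + 1 : 0;
    if (run >= minRun) return true;
  }
  return false;
};

// Exhaustive differential check: 'a' is in the base64 class, '-' is not.
let mismatches = 0;
for (let length = 0; length <= 10; length++) {
  for (let bits = 0; bits < 1 << length; bits++) {
    let s = '';
    for (let j = 0; j < length; j++) s += ((bits >> j) & 1) === 1 ? 'a' : '-';
    for (const minRun of [1, 2, 3, 4]) {
      if (hasLongRunSampled(s, minRun) !== hasLongRunReference(s, minRun)) mismatches++;
    }
  }
}
console.log(`mismatches: ${mismatches}`);
```

Small thresholds exercise the clamp and the phase reset far more densely than 256-character inputs do, so this kind of exhaustive check complements the randomized test rather than replacing it.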

tests/core/file/truncateBase64.test.ts

Lines changed: 102 additions & 0 deletions
@@ -135,4 +135,106 @@ describe('truncateBase64Content', () => {
     const result = truncateBase64Content(input);
     expect(result).toBe(input);
   });
+
+  describe('sampled run detection (hasLongBase64Run precondition)', () => {
+    // The precondition samples one character every 256 positions instead of
+    // scanning every character. These cases pin the sampling-specific edges:
+    // runs at arbitrary offsets, runs in the trailing partial window, and
+    // sampling-phase resets after short-run expansions.
+
+    it('should detect a qualifying run at every alignment relative to the sample stride', () => {
+      // A 256-char run starting at offset k occupies [k, k+255], which must
+      // contain a sample regardless of k. Exercise alignments around the
+      // first two sample points (indices 255 and 511).
+      for (const offset of [0, 1, 127, 254, 255, 256, 257, 300, 511]) {
+        const input = `${'-'.repeat(offset)}${longBase64.slice(0, 256)}#tail`;
+        const result = truncateBase64Content(input);
+        expect(result, `offset ${offset}`).toContain('...');
+      }
+    });
+
+    it('should detect a run that ends exactly at the end of content', () => {
+      // The final partial window is shorter than the sampling stride; the
+      // clamped last sample must still see this run.
+      const input = `${'x '.repeat(150)}${longBase64.slice(0, 256)}`;
+      const result = truncateBase64Content(input);
+      expect(result).toContain('...');
+    });
+
+    it('should detect a run when the whole content is exactly one run of threshold length', () => {
+      const input = longBase64.slice(0, 256);
+      const result = truncateBase64Content(input);
+      expect(result).toBe(`${longBase64.slice(0, 32)}...`);
+    });
+
+    it('should detect a qualifying run that follows many short runs', () => {
+      // Every sample before the real run lands inside a short base64-like word,
+      // forcing repeated expand-and-skip steps that reset the sampling phase.
+      const shortWords = 'word1 path/to2 abc3 '.repeat(60); // 1200 chars of short runs
+      const input = `${shortWords}${longBase64.slice(0, 256)} end`;
+      const result = truncateBase64Content(input);
+      expect(result).toContain('...');
+    });
+
+    it('should preserve content of many near-threshold runs separated by breaks', () => {
+      // 255-char runs (one below threshold) back to back with separators must
+      // never match, even though almost every sample hits a base64 character.
+      const nearRun = longBase64.slice(0, 255).replace(/[+/]/g, 'a');
+      const input = Array.from({ length: 10 }, () => nearRun).join('\n');
+      const result = truncateBase64Content(input);
+      expect(result).toBe(input);
+    });
+
+    it('should match the per-character reference scan on randomized content', () => {
+      // Differential check: the sampled precondition must agree with a
+      // straightforward per-character reference on generated inputs.
+      const referenceHasLongRun = (content: string): boolean => {
+        let run = 0;
+        for (let i = 0; i < content.length; i++) {
+          if (/[A-Za-z0-9+/]/.test(content[i])) {
+            run++;
+            if (run >= 256) return true;
+          } else {
+            run = 0;
+          }
+        }
+        return false;
+      };
+      // Deterministic LCG so failures are reproducible.
+      let seed = 0x2f6e2b1;
+      const rand = () => {
+        seed = (seed * 1103515245 + 12345) & 0x7fffffff;
+        return seed / 0x7fffffff;
+      };
+      const alphabet = 'Aa0+/ .,\n=-_';
+      for (let trial = 0; trial < 500; trial++) {
+        let s = '';
+        const length = Math.floor(rand() * 700);
+        for (let j = 0; j < length; j++) {
+          s += alphabet[Math.floor(rand() * alphabet.length)];
+        }
+        let injectedQualifyingRun = false;
+        if (rand() < 0.3) {
+          const runLength = 200 + Math.floor(rand() * 120);
+          const pos = Math.floor(rand() * (s.length + 1));
+          // Repetitions of this diverse base64 prefix always pass isLikelyBase64.
+          const run = longBase64
+            .slice(0, 32)
+            .repeat(Math.ceil(runLength / 32))
+            .slice(0, runLength);
+          s = s.slice(0, pos) + run + s.slice(pos);
+          injectedQualifyingRun = runLength >= 256;
+        }
+        if (injectedQualifyingRun) {
+          // False-negative direction: a diverse run of >= 256 chars exists, so
+          // the sampled precondition must let the truncation happen.
+          expect(truncateBase64Content(s), `trial ${trial}`).not.toBe(s);
+        } else if (!referenceHasLongRun(s)) {
+          // False-positive direction: no qualifying run anywhere, so content
+          // must come back untouched.
+          expect(truncateBase64Content(s), `trial ${trial}`).toBe(s);
+        }
+      }
+    });
+  });
 });
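
The randomized test relies on a seeded generator so that a failing trial number reproduces the same input on every run. The sketch below illustrates that property with a variant generator; it is an assumption-laden example, not the test's code. `makeLcg` is a hypothetical helper that reuses the test's multiplier, increment, and seed, but multiplies with `Math.imul` (32-bit integer arithmetic) and divides by 2^31 so the result is always strictly below 1. It therefore produces a different sequence from the test's `rand`.

```typescript
// Hypothetical helper, not the commit's code: a seeded linear congruential
// generator returning values in [0, 1).
const makeLcg = (initialSeed: number): (() => number) => {
  let seed = initialSeed & 0x7fffffff;
  return () => {
    // Math.imul keeps the product in 32-bit integer arithmetic; the mask
    // reduces the state modulo 2^31.
    seed = (Math.imul(seed, 1103515245) + 12345) & 0x7fffffff;
    return seed / 0x80000000;
  };
};

// Two generators built from the same seed yield the same sequence, so a
// failure reported as "trial N" can be replayed exactly.
const a = makeLcg(0x2f6e2b1);
const b = makeLcg(0x2f6e2b1);
const first = Array.from({ length: 5 }, () => a());
const second = Array.from({ length: 5 }, () => b());
console.log(first.every((v, i) => v === second[i])); // prints true
```

Because the state never leaves the 31-bit range, indexing such as `alphabet[Math.floor(rand() * alphabet.length)]` cannot land one past the end of the string with this variant.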
