Commit ade7f34

perf(metrics): Cache gpt-tokenizer BPE ranks as JSON to cut tokenizer init
The metrics worker's TokenCounter initialization is the single largest cost on a warm-cache CLI run. gpt-tokenizer ships each BPE merge-rank table as a ~2 MB CommonJS module of inline array literals; `resolveEncodingAsync` `require`s it, forcing V8 to lex/parse/execute the file and allocate a ~200k-element array on every cold worker thread (~120 ms), before `GptEncoding.getEncodingApi` then builds the rank Map (~90 ms). A verbose trace showed this landing at ~211 ms on the metrics critical path that gates output generation. The value returned by `resolveEncodingAsync` is a plain JSON-serializable array (strings, plus byte arrays of 1–19 bytes for ranks whose token bytes are not valid UTF-8). New `bpeRanksCache.ts` persists it once as JSON under `$TMPDIR/repomix/cache/bpe-ranks/` and, on later runs, reloads it via `readFileSync` + `JSON.parse` (~40 ms) — a restricted-grammar parse V8 handles in native C++, ~3x faster than re-executing the JS module. `getEncodingApi` receives a byte-identical ranks array, so token counts are unchanged. 
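
The claim that the rank table survives a JSON round-trip can be illustrated with a minimal sketch. The five-entry `sampleRanks` array and the `sameShape` helper are hypothetical stand-ins, not part of this commit; the real tables hold roughly 200k entries.

```typescript
// Hypothetical miniature of a BPE rank table: strings for tokens that are
// valid UTF-8, number arrays for raw byte sequences that are not.
type RankEntry = string | number[];

const sampleRanks: RankEntry[] = ["!", "\"", "#", [161], [194, 162]];

// Serialize once (the "write" side of a cache) ...
const serialized: string = JSON.stringify(sampleRanks);

// ... and parse on later runs (the "read" side).
const reloaded = JSON.parse(serialized) as RankEntry[];

// Entry-by-entry comparison: strings must match exactly, byte arrays must
// match in length and in every byte value.
const sameShape = (a: RankEntry[], b: RankEntry[]): boolean =>
  a.length === b.length &&
  a.every((entry, i) => {
    const other = b[i];
    if (typeof entry === "string") {
      return entry === other;
    }
    return (
      Array.isArray(other) &&
      entry.length === other.length &&
      entry.every((byte, j) => byte === other[j])
    );
  });

const roundTripOk = sameShape(sampleRanks, reloaded);
console.log(`round trip preserved the ranks: ${roundTripOk}`);
```

Strings and arrays of small integers are both native JSON types, so no custom encoding of the byte entries is needed in this sketch.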
intent(metrics): cut the ~211ms tokenizer init that dominates warm-cache runs
decision(bpe-cache): runtime JSON disk cache via gpt-tokenizer's public resolveEncodingAsync output — no build step, no bundled data, no internal imports
rejected(bpe-cache): bundling a pre-built ranks JSON in lib/ — adds ~2MB to the published package and reaches into gpt-tokenizer's internal cjs/bpeRanks path, fragile across upgrades
decision(module-split): extract the cache to bpeRanksCache.ts mirroring tokenCountCache.ts — keeps TokenCounter focused and makes the cache unit-testable
decision(cache-key): key the file by gpt-tokenizer version + format version so an upgrade auto-invalidates stale tables (miss → rebuild)
constraint(bpe-cache): runs inside worker threads — sync fs + crypto-random tmp name + atomic rename so a concurrent worker (which shares this pid) never reads a partial file; all read/write errors fall back to resolveEncodingAsync (pure optimization, never a correctness signal)
constraint(shape-guard): readBpeRanksCache rejects any non-array/empty JSON as a clean miss, so a structurally-valid-but-wrong file is rebuilt rather than silently producing zero counts
constraint(opt-out): shares the REPOMIX_TOKEN_CACHE=0 switch with the token-count cache; REPOMIX_BPE_RANKS_CACHE_PATH redirects the dir for tests

Correctness:
- Library-level equivalence verified: JSON round-trip yields identical encode() output and countTokens() across source code, unicode, emoji, special tokens (<|endoftext|>) and invalid UTF-8; all 1571 byte-array rank entries (lengths 1–19) preserved for o200k_base and cl100k_base.
- Full CLI token totals identical with the cache enabled vs REPOMIX_TOKEN_CACHE=0, and the generated output file is byte-identical.
- 1332 tests pass (12 new for the cache: hit/miss/disabled/corrupt/malformed-shape fallback); `tsgo --noEmit`, biome and oxlint clean.
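
The cache-key decision above can be sketched as follows. This is an illustration only: `cacheFileName`, `filesOnDisk`, and `isHit` are hypothetical names, and the version strings `3.0.0` and `3.1.0` are made up for the example rather than taken from gpt-tokenizer's release history.

```typescript
// Hypothetical stand-in for the file-name scheme: encoding name, tokenizer
// version, and on-disk format version all appear in the name.
const FORMAT_VERSION = 1;

const cacheFileName = (
  encodingName: string,
  tokenizerVersion: string,
  formatVersion: number = FORMAT_VERSION,
): string => `${encodingName}-${tokenizerVersion}-v${formatVersion}.json`;

// Pretend this file was written by an earlier run (made-up version number).
const filesOnDisk = new Set<string>([cacheFileName("o200k_base", "3.0.0")]);

const isHit = (
  encodingName: string,
  tokenizerVersion: string,
  formatVersion?: number,
): boolean =>
  filesOnDisk.has(cacheFileName(encodingName, tokenizerVersion, formatVersion));

// Same encoding, same versions: the existing file is found.
const sameVersionHit = isHit("o200k_base", "3.0.0");
// A dependency upgrade changes the name, so the stale file is never read.
const afterUpgradeHit = isHit("o200k_base", "3.1.0");
// A different encoding has its own file.
const otherEncodingHit = isHit("cl100k_base", "3.0.0");
// Bumping the format version also changes the name.
const formatBumpHit = isHit("o200k_base", "3.0.0", 2);

console.log({ sameVersionHit, afterUpgradeHit, otherEncodingHit, formatBumpHit });
```

Because invalidation falls out of the file name, no explicit version check is needed when reading; the trade-off, as the commit notes elsewhere, is that files from superseded versions are left behind.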
Benchmark — `node bin/repomix.cjs --include src -o <tmp> --quiet`, warm cache, 25 runs/round, baseline (no cache) and patched rebuilt and run interleaved (two rounds each, to control for environment drift):

  base:    773.1 / 742.6 ms (mean 757.9)
  patched: 722.0 / 688.1 ms (mean 705.1)
  delta:   -52.8 ms (-7.0%) — minimums ~703 → ~665 ms move in lockstep

Exceeds the 2% target and sits outside the run-to-run noise band (sd ~25 ms on the patched runs) as shown by the consistent direction across both rounds.
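
The benchmark arithmetic can be re-derived with a short sketch. The per-round figures are copied from the message above; the standard-error step is an added assumption (that the ~25 ms sd is per run and that the 25 runs in a round are independent), not something the commit states.

```typescript
// Per-round figures in milliseconds, copied from the benchmark above.
const baseRounds: number[] = [773.1, 742.6];
const patchedRounds: number[] = [722.0, 688.1];

const mean = (xs: number[]): number =>
  xs.reduce((sum, x) => sum + x, 0) / xs.length;

const baseMean = mean(baseRounds);
const patchedMean = mean(patchedRounds);

// Negative values mean the patched build is faster.
const deltaMs = patchedMean - baseMean;
const deltaPct = (deltaMs / baseMean) * 100;

// Assumption: sd of about 25 ms per run, 25 independent runs per round.
// The standard error of a round mean is then sd / sqrt(n).
const sdMs = 25;
const runsPerRound = 25;
const stdErrPerRound = sdMs / Math.sqrt(runsPerRound);

console.log(`delta: ${deltaMs.toFixed(1)} ms (${deltaPct.toFixed(1)}%)`);
console.log(`standard error of a round mean: ${stdErrPerRound} ms`);
```

Under those assumptions the roughly 53 ms difference is about ten standard errors of a single round mean, which is consistent with the message's statement that the change sits outside the noise band.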
1 parent 6106f30 commit ade7f34

4 files changed

Lines changed: 259 additions & 3 deletions

src/core/metrics/TokenCounter.ts

Lines changed: 24 additions & 3 deletions
@@ -1,12 +1,32 @@
 import { GptEncoding } from 'gpt-tokenizer/GptEncoding';
 import { resolveEncodingAsync } from 'gpt-tokenizer/resolveEncodingAsync';
 import { logger } from '../../shared/logger.js';
+import { readBpeRanksCache, writeBpeRanksCache } from './bpeRanksCache.js';
 import { TOKEN_ENCODINGS, type TokenEncoding } from './tokenEncodings.js';
 
 // Re-export for backward compatibility with existing
 // `import { TOKEN_ENCODINGS, TokenEncoding } from './TokenCounter.js'` call sites.
 export { TOKEN_ENCODINGS, type TokenEncoding };
 
+// Resolved BPE merge-rank table, as returned by `resolveEncodingAsync`. A plain
+// JSON-serializable array of strings, plus byte arrays (1–19 bytes in
+// o200k_base) for ranks whose token bytes are not valid UTF-8.
+type BpeRanks = Awaited<ReturnType<typeof resolveEncodingAsync>>;
+
+// Load the BPE merge-rank table from the on-disk JSON cache when present,
+// otherwise resolve it from gpt-tokenizer and persist it for next time. See
+// bpeRanksCache.ts for why this avoids the ~120 ms JS-module parse on warm runs.
+const resolveBpeRanks = async (encodingName: TokenEncoding): Promise<BpeRanks> => {
+  const cached = readBpeRanksCache(encodingName);
+  if (cached !== undefined) {
+    return cached as BpeRanks;
+  }
+
+  const bpeRanks = await resolveEncodingAsync(encodingName);
+  writeBpeRanksCache(encodingName, bpeRanks);
+  return bpeRanks;
+};
+
 interface CountTokensOptions {
   disallowedSpecial?: Set<string>;
 }
@@ -30,9 +50,10 @@ const loadEncoding: LoadEncodingFn = async (encodingName) => {
 
   const startTime = process.hrtime.bigint();
 
-  // Use resolveEncodingAsync to lazily load BPE rank data, then create a GptEncoding instance.
-  // resolveEncodingAsync uses static import paths internally, so bundlers (rolldown) can resolve them.
-  const bpeRanks = await resolveEncodingAsync(encodingName);
+  // Load BPE rank data (from the on-disk JSON cache when available, else from
+  // gpt-tokenizer), then create a GptEncoding instance. resolveEncodingAsync
+  // uses static import paths internally, so bundlers (rolldown) can resolve them.
+  const bpeRanks = await resolveBpeRanks(encodingName);
   const encoder = GptEncoding.getEncodingApi(encodingName, () => bpeRanks);
   const countFn = encoder.countTokens.bind(encoder) as CountTokensFn;
   encodingModules.set(encodingName, countFn);
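
The `resolveBpeRanks` helper in the diff above follows a read-through pattern: try the cache, fall back to the slow source, then persist the result. A generic sketch of that pattern, using in-memory stand-ins, might look like this. `readThrough`, `Reader`, `Writer`, `Loader`, and `demo` are hypothetical names introduced for illustration and do not appear in the commit.

```typescript
// Hypothetical generic form of the read-through pattern.
type Reader<T> = (key: string) => T | undefined;
type Writer<T> = (key: string, value: T) => void;
type Loader<T> = (key: string) => Promise<T>;

const readThrough = async <T>(
  key: string,
  read: Reader<T>,
  write: Writer<T>,
  load: Loader<T>,
): Promise<T> => {
  const cached = read(key);
  if (cached !== undefined) {
    return cached; // Hit: the slow loader is skipped entirely.
  }
  const fresh = await load(key); // Miss: pay the full cost once ...
  write(key, fresh); // ... and persist the result for next time.
  return fresh;
};

// A Map stands in for the disk cache and a counter tracks loader invocations.
const demo = async (): Promise<number> => {
  const store = new Map<string, string[]>();
  let loaderCalls = 0;
  const slowLoad: Loader<string[]> = async (key) => {
    loaderCalls += 1;
    return [`ranks-for-${key}`];
  };
  const read: Reader<string[]> = (key) => store.get(key);
  const write: Writer<string[]> = (key, value) => {
    store.set(key, value);
  };

  await readThrough("o200k_base", read, write, slowLoad); // miss, then store
  await readThrough("o200k_base", read, write, slowLoad); // hit
  return loaderCalls;
};

demo().then((calls) => console.log(`loader calls: ${calls}`));
```

In this sketch the second lookup never reaches the loader, which is the behavior the commit relies on to skip the module parse on later runs.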

src/core/metrics/bpeRanksCache.ts

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
+import { randomBytes } from 'node:crypto';
+import fs from 'node:fs';
+import { createRequire } from 'node:module';
+import path from 'node:path';
+import { logger } from '../../shared/logger.js';
+import { getRepomixTmpDir } from '../../shared/tmpDir.js';
+import { isCacheDisabled } from './tokenCountCache.js';
+import type { TokenEncoding } from './tokenEncodings.js';
+
+// On-disk JSON cache for gpt-tokenizer's BPE merge-rank tables.
+//
+// gpt-tokenizer ships each table as a ~2 MB CommonJS module of inline array
+// literals. `resolveEncodingAsync` `require`s it, forcing V8 to lex/parse/
+// execute the file and allocate a ~200k-element array on every cold worker
+// thread (~120 ms) — the single largest cost on a warm-cache CLI run. The
+// resolved value is a plain JSON-serializable array, so we persist it once and
+// reload it with `readFileSync` + `JSON.parse` (~40 ms) on later runs: a
+// restricted-grammar parse V8 handles in native code, ~3x faster than
+// re-executing the JS module. The reloaded array is byte-identical to the
+// resolved one (same encode output), so token counts are unchanged.
+//
+// This is a pure optimization: every read/write failure is swallowed and the
+// caller falls back to `resolveEncodingAsync`. The cache shares the
+// `REPOMIX_TOKEN_CACHE=0` opt-out and the `$TMPDIR/repomix/cache/` umbrella
+// with the token-count cache (see tokenCountCache.ts).
+
+const cjsRequire = createRequire(import.meta.url);
+
+// On-disk serialization format version. Bump only if the persisted JSON shape
+// changes in a way incompatible with files written by older repomix versions.
+const BPE_RANKS_CACHE_FORMAT = 1;
+// Shares the `cache/` umbrella with the token-count cache; the per-encoding
+// files live in a `bpe-ranks/` subdirectory beneath it.
+const CACHE_SUBDIR_NAME = 'cache';
+const BPE_RANKS_SUBDIR_NAME = 'bpe-ranks';
+
+// gpt-tokenizer version keys the cache file name so a dependency upgrade that
+// changes a table automatically invalidates the stale file (different name →
+// cache miss → rebuild). `gpt-tokenizer/package.json` is exported by the
+// package, so this resolves the same way `gpt-tokenizer/GptEncoding` does.
+//
+// Files from superseded versions are not swept; they live under $TMPDIR (which
+// the OS may evict) and amount to a few MB per version, so the simplicity is
+// worth more than reclaiming the space.
+const getGptTokenizerVersion = (): string => {
+  try {
+    return (cjsRequire('gpt-tokenizer/package.json') as { version: string }).version;
+  } catch {
+    return 'unknown';
+  }
+};
+
+/**
+ * Absolute path of the cached BPE-ranks file for an encoding.
+ *
+ * `REPOMIX_BPE_RANKS_CACHE_PATH` overrides the parent directory for tests and
+ * explicit user configuration (mirrors `REPOMIX_TOKEN_CACHE_PATH`).
+ */
+export const getBpeRanksCachePath = (encodingName: TokenEncoding): string => {
+  const fileName = `${encodingName}-${getGptTokenizerVersion()}-v${BPE_RANKS_CACHE_FORMAT}.json`;
+  const override = process.env.REPOMIX_BPE_RANKS_CACHE_PATH;
+  if (override) {
+    return path.join(override, fileName);
+  }
+  return path.join(getRepomixTmpDir(), CACHE_SUBDIR_NAME, BPE_RANKS_SUBDIR_NAME, fileName);
+};
+
+/**
+ * Read and parse the cached BPE ranks for `encodingName`. Returns `undefined`
+ * on a cache miss, an unreadable/corrupt file, or when caching is disabled
+ * (`REPOMIX_TOKEN_CACHE=0`). Never throws.
+ *
+ * A shape check rejects any structurally-valid-but-wrong file (e.g. an object,
+ * a number, or an empty array left by an incompatible writer) so it is treated
+ * as a clean miss and rebuilt, rather than handed to the tokenizer where it
+ * would silently produce zero token counts.
+ */
+export const readBpeRanksCache = (encodingName: TokenEncoding): unknown | undefined => {
+  if (isCacheDisabled()) {
+    return undefined;
+  }
+  try {
+    const parsed = JSON.parse(fs.readFileSync(getBpeRanksCachePath(encodingName), 'utf8'));
+    if (!Array.isArray(parsed) || parsed.length === 0) {
+      logger.trace(`Ignoring malformed BPE ranks cache for ${encodingName}`);
+      return undefined;
+    }
+    logger.trace(`Loaded BPE ranks for ${encodingName} from cache`);
+    return parsed;
+  } catch {
+    // Cache miss or unreadable/corrupt file — caller resolves from gpt-tokenizer.
+    return undefined;
+  }
+};
+
+/**
+ * Persist `bpeRanks` for `encodingName` as JSON. Best-effort and never throws.
+ *
+ * A unique tmp name (pid + crypto-random suffix) written then atomically
+ * renamed means a concurrent reader never observes a partial file, even when
+ * several worker threads (which share this process's pid) resolve the same
+ * encoding at once. No-ops when caching is disabled. All errors (read-only FS,
+ * permission denied, races) are swallowed — the cache is optional.
+ */
+export const writeBpeRanksCache = (encodingName: TokenEncoding, bpeRanks: unknown): void => {
+  if (isCacheDisabled()) {
+    return;
+  }
+  const cachePath = getBpeRanksCachePath(encodingName);
+  try {
+    fs.mkdirSync(path.dirname(cachePath), { recursive: true });
+    const tmpPath = `${cachePath}.${process.pid}.${randomBytes(4).toString('hex')}.tmp`;
+    fs.writeFileSync(tmpPath, JSON.stringify(bpeRanks), { mode: 0o600 });
+    fs.renameSync(tmpPath, cachePath);
+  } catch (error) {
+    logger.trace(`Failed to persist BPE ranks cache for ${encodingName}:`, error);
+  }
+};
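
The write-then-rename step used by `writeBpeRanksCache` can be exercised in isolation with a small standalone sketch. `atomicWriteJson`, `demoDir`, and `target` are hypothetical names, and the two writes below run one after the other in a throwaway temp directory; this illustrates the file-level behavior and is not a concurrency test of the commit's code.

```typescript
import { randomBytes } from "node:crypto";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: write JSON to a uniquely named sibling temp file, then
// rename it over the target so readers see either the old or the new file.
const atomicWriteJson = (targetPath: string, value: unknown): void => {
  fs.mkdirSync(path.dirname(targetPath), { recursive: true });
  const suffix = randomBytes(4).toString("hex");
  const tmpPath = `${targetPath}.${process.pid}.${suffix}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(value));
  fs.renameSync(tmpPath, targetPath);
};

// Throwaway directory so the demo never touches a real cache.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "atomic-write-demo-"));
const target = path.join(demoDir, "ranks.json");

// Two writers targeting the same path; each uses its own temp name, so the
// second rename simply replaces the first complete file.
atomicWriteJson(target, ["first", [161]]);
atomicWriteJson(target, ["second", [194, 162]]);

const finalContent = JSON.parse(fs.readFileSync(target, "utf8")) as unknown;
const leftovers = fs
  .readdirSync(demoDir)
  .filter((name) => name.endsWith(".tmp"));

console.log(finalContent, leftovers);

// Clean up the throwaway directory.
fs.rmSync(demoDir, { recursive: true, force: true });
```

The random suffix matters because worker threads share a pid, so a pid-only temp name could collide between two threads writing the same encoding.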
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
+import fs from 'node:fs';
+import os from 'node:os';
+import path from 'node:path';
+import { afterEach, beforeEach, describe, expect, test } from 'vitest';
+import {
+  getBpeRanksCachePath,
+  readBpeRanksCache,
+  writeBpeRanksCache,
+} from '../../../src/core/metrics/bpeRanksCache.js';
+
+// A small stand-in for the real BPE ranks: a mix of strings and short byte
+// arrays, mirroring the shape gpt-tokenizer returns (and the JSON
+// round-trip that token correctness depends on).
+const SAMPLE_RANKS: unknown = ['!', '"', '#', [161], [194, 162]];
+
+describe('bpeRanksCache', () => {
+  let tmpDir: string;
+  let prevPathEnv: string | undefined;
+  let prevDisableEnv: string | undefined;
+
+  beforeEach(() => {
+    tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'repomix-bpe-test-'));
+    prevPathEnv = process.env.REPOMIX_BPE_RANKS_CACHE_PATH;
+    prevDisableEnv = process.env.REPOMIX_TOKEN_CACHE;
+    process.env.REPOMIX_BPE_RANKS_CACHE_PATH = tmpDir;
+    // The suite disables caching globally; enable it for these tests.
+    delete process.env.REPOMIX_TOKEN_CACHE;
+  });
+
+  afterEach(() => {
+    fs.rmSync(tmpDir, { recursive: true, force: true });
+    if (prevPathEnv === undefined) {
+      delete process.env.REPOMIX_BPE_RANKS_CACHE_PATH;
+    } else {
+      process.env.REPOMIX_BPE_RANKS_CACHE_PATH = prevPathEnv;
+    }
+    if (prevDisableEnv === undefined) {
+      delete process.env.REPOMIX_TOKEN_CACHE;
+    } else {
+      process.env.REPOMIX_TOKEN_CACHE = prevDisableEnv;
+    }
+  });
+
+  test('getBpeRanksCachePath honors the path override and includes the encoding name', () => {
+    const cachePath = getBpeRanksCachePath('o200k_base');
+    expect(path.dirname(cachePath)).toBe(tmpDir);
+    expect(path.basename(cachePath)).toMatch(/^o200k_base-.*\.json$/);
+  });
+
+  test('returns undefined on a cache miss', () => {
+    expect(readBpeRanksCache('o200k_base')).toBeUndefined();
+  });
+
+  test('write then read round-trips the ranks (including byte-array entries)', () => {
+    writeBpeRanksCache('o200k_base', SAMPLE_RANKS);
+    expect(fs.existsSync(getBpeRanksCachePath('o200k_base'))).toBe(true);
+    expect(readBpeRanksCache('o200k_base')).toEqual(SAMPLE_RANKS);
+  });
+
+  test('a corrupt cache file falls back to a miss without throwing', () => {
+    fs.writeFileSync(getBpeRanksCachePath('o200k_base'), 'not valid json {{{');
+    expect(readBpeRanksCache('o200k_base')).toBeUndefined();
+  });
+
+  test.each([
+    ['an object', '{}'],
+    ['a number', '42'],
+    ['null', 'null'],
+    ['an empty array', '[]'],
+  ])('rejects structurally-valid-but-wrong cache content (%s) as a miss', (_label, content) => {
+    fs.writeFileSync(getBpeRanksCachePath('o200k_base'), content);
+    expect(readBpeRanksCache('o200k_base')).toBeUndefined();
+  });
+
+  test('leaves no stray temp files after a write', () => {
+    writeBpeRanksCache('o200k_base', SAMPLE_RANKS);
+    const leftovers = fs.readdirSync(tmpDir).filter((f) => f.endsWith('.tmp'));
+    expect(leftovers).toEqual([]);
+  });
+
+  test('different encodings use distinct cache files', () => {
+    writeBpeRanksCache('o200k_base', ['a']);
+    writeBpeRanksCache('cl100k_base', ['b']);
+    expect(readBpeRanksCache('o200k_base')).toEqual(['a']);
+    expect(readBpeRanksCache('cl100k_base')).toEqual(['b']);
+  });
+
+  describe('when caching is disabled via REPOMIX_TOKEN_CACHE=0', () => {
+    beforeEach(() => {
+      process.env.REPOMIX_TOKEN_CACHE = '0';
+    });
+
+    test('read returns undefined and write is a no-op', () => {
+      writeBpeRanksCache('o200k_base', SAMPLE_RANKS);
+      // Path is computed independent of the disable flag, so the file must be absent.
+      expect(fs.existsSync(getBpeRanksCachePath('o200k_base'))).toBe(false);
+      expect(readBpeRanksCache('o200k_base')).toBeUndefined();
+    });
+
+    test('read returns undefined even when a cache file exists', () => {
+      // Write a file directly, bypassing the disabled writer.
+      fs.writeFileSync(getBpeRanksCachePath('o200k_base'), JSON.stringify(SAMPLE_RANKS));
+      expect(readBpeRanksCache('o200k_base')).toBeUndefined();
+    });
+  });
+});

tests/testing/vitestSetup.ts

Lines changed: 11 additions & 0 deletions
@@ -1,3 +1,6 @@
+import os from 'node:os';
+import path from 'node:path';
+
 // Disable the token-count disk cache by default for the entire test suite so
 // that (a) test runs do not read or write the developer's real cache file in
 // $TMPDIR and (b) tests asserting on worker dispatch behavior are not skewed
@@ -6,3 +9,11 @@
 if (process.env.REPOMIX_TOKEN_CACHE === undefined) {
   process.env.REPOMIX_TOKEN_CACHE = '0';
 }
+
+// Redirect the BPE-ranks disk cache to an isolated per-process temp directory so
+// any test that re-enables the cache (by clearing REPOMIX_TOKEN_CACHE) never
+// touches the developer's real $TMPDIR/repomix cache. Tests that exercise the
+// cache directly override this with their own temp dir.
+if (process.env.REPOMIX_BPE_RANKS_CACHE_PATH === undefined) {
+  process.env.REPOMIX_BPE_RANKS_CACHE_PATH = path.join(os.tmpdir(), `repomix-test-bpe-ranks-${process.pid}`);
+}
