Design from justrach/turbodb — adapted from src/storage/mmap.zig zero-copy page-aligned I/O.
Problem
indexFile loads the entire file content into memory, then iterates over every 3-byte window to extract trigrams. For large files this means multi-MB heap allocations whose only purpose is trigram extraction.
Proposed Solution
Stream trigrams directly from file handles in 4KB chunks:
- Read 4KB chunk from disk
- Extract trigrams from chunk (carry over last 2 bytes across chunk boundaries)
- Never hold more than 4KB per file in memory for trigram extraction
- Outline parsing still needs full content, but trigrams don't
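The carry-over step is the part most likely to go wrong, so here is a minimal sketch of it. It assumes a recent Zig toolchain (for `@min`), and every name in it (`packTrigram`, `TrigramStream`, `CollectSink`) is hypothetical and not taken from src/index.zig. The extractor keeps only the last two bytes it has seen, which is what lets a window that straddles two chunks still be emitted. The test feeds 4-byte chunks as a stand-in for 4KB reads from a file handle.

```zig
const std = @import("std");

// Hypothetical helper: pack three bytes into a single u32 key.
fn packTrigram(a: u8, b: u8, c: u8) u32 {
    return (@as(u32, a) << 16) | (@as(u32, b) << 8) | @as(u32, c);
}

// Hypothetical streaming extractor. The only state kept between chunks
// is the previous two bytes, so memory use does not grow with file size.
const TrigramStream = struct {
    prev: [2]u8 = .{ 0, 0 },
    seen: u8 = 0, // bytes consumed so far, capped at 2

    // Emits one trigram via `sink.add` for every 3-byte window completed
    // by `chunk`, including windows that began in an earlier chunk.
    fn feed(self: *TrigramStream, chunk: []const u8, sink: anytype) void {
        for (chunk) |byte| {
            if (self.seen == 2) {
                sink.add(packTrigram(self.prev[0], self.prev[1], byte));
            } else {
                self.seen += 1;
            }
            self.prev[0] = self.prev[1];
            self.prev[1] = byte;
        }
    }
};

// Test-only sink with a fixed capacity; a real index would insert into
// its own posting structure instead.
const CollectSink = struct {
    items: [64]u32 = undefined,
    len: usize = 0,

    fn add(self: *CollectSink, trigram: u32) void {
        self.items[self.len] = trigram;
        self.len += 1;
    }
};

test "sketch: chunked feed matches single feed" {
    const data = "fn indexFile(path) void";

    // Full-load path: the whole content in one call.
    var full = CollectSink{};
    var whole = TrigramStream{};
    whole.feed(data, &full);

    // Streaming path: small chunks stand in for 4KB reads.
    var streamed = CollectSink{};
    var stream = TrigramStream{};
    const chunk_size: usize = 4;
    var off: usize = 0;
    while (off < data.len) : (off += chunk_size) {
        const end = @min(off + chunk_size, data.len);
        stream.feed(data[off..end], &streamed);
    }

    try std.testing.expectEqual(data.len - 2, full.len);
    try std.testing.expectEqualSlices(u32, full.items[0..full.len], streamed.items[0..streamed.len]);
}
```

In the real indexer the chunks would come from reading the file into a 4096-byte buffer in a loop. The sketch passes slices so that it does not depend on any particular file API.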
Reference: justrach/turbodb storage/mmap.zig — zero-copy page-aligned access pattern that avoids loading full files into heap.
Failing test
test "issue-209: streaming trigram extraction matches full-load results" {
    // Index a file with streaming vs full-load
    // Verify identical trigram candidates
}
Expected impact
| Metric | Before | After |
|---|---|---|
| Per-file memory (trigrams) | file_size bytes | 4KB |
| Peak RSS during build | ~1GB (5K files) | ~100MB |
Files to modify
src/index.zig — TrigramIndex.indexFile