perf: streaming trigram extraction — no full file load #209

@justrach

Design from justrach/turbodb — adapted from src/storage/mmap.zig zero-copy page-aligned I/O.

Problem

indexFile loads the entire file content into memory, then iterates over every 3-byte window to extract trigrams. For large files this means multi-MB heap allocations just for trigram extraction.

Proposed Solution

Stream trigrams directly from the file handle in 4KB chunks:

  1. Read a 4KB chunk from disk
  2. Extract trigrams from the chunk, carrying over the last 2 bytes so that trigrams spanning a chunk boundary are not missed
  3. Never hold more than one 4KB chunk (plus the 2 carried bytes) per file in memory for trigram extraction
  4. Outline parsing still needs the full content, but trigram extraction does not
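
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in src/index.zig: `StreamingTrigrams`, `pack`, and `fullLoad` are hypothetical names, the trigram key packing (three bytes in a u32) is an assumption, and chunks are fed from an in-memory slice rather than a file handle to keep the example independent of any particular file API. The test uses 4-byte chunks only to force many boundary crossings; the real chunk size would be 4KB.

```zig
const std = @import("std");

// Assumed key layout: three bytes packed into the low 24 bits of a u32.
fn pack(a: u8, b: u8, c: u8) u32 {
    return (@as(u32, a) << 16) | (@as(u32, b) << 8) | @as(u32, c);
}

// Hypothetical streaming extractor. Between chunks it keeps only the last
// two bytes seen, so trigrams that straddle a chunk boundary are still found.
const StreamingTrigrams = struct {
    set: std.AutoHashMap(u32, void),
    tail: [2]u8 = .{ 0, 0 },
    seen: usize = 0, // total bytes fed so far

    fn init(allocator: std.mem.Allocator) StreamingTrigrams {
        return .{ .set = std.AutoHashMap(u32, void).init(allocator) };
    }

    fn deinit(self: *StreamingTrigrams) void {
        self.set.deinit();
    }

    // Feed the next chunk. Chunks may have any length, including 0 or 1.
    fn feed(self: *StreamingTrigrams, chunk: []const u8) !void {
        for (chunk) |byte| {
            // A trigram is complete once two earlier bytes are available.
            if (self.seen >= 2) {
                try self.set.put(pack(self.tail[0], self.tail[1], byte), {});
            }
            self.tail[0] = self.tail[1];
            self.tail[1] = byte;
            self.seen += 1;
        }
    }
};

// Reference behaviour: every 3-byte window over fully loaded content.
fn fullLoad(set: *std.AutoHashMap(u32, void), content: []const u8) !void {
    var i: usize = 0;
    while (i + 3 <= content.len) : (i += 1) {
        try set.put(pack(content[i], content[i + 1], content[i + 2]), {});
    }
}

test "streaming in small chunks matches full-load" {
    const data = "fn main() void { return; }";
    const chunk_len = 4; // tiny on purpose; 4096 in the proposal

    var expected = std.AutoHashMap(u32, void).init(std.testing.allocator);
    defer expected.deinit();
    try fullLoad(&expected, data);

    var streamed = StreamingTrigrams.init(std.testing.allocator);
    defer streamed.deinit();
    var off: usize = 0;
    while (off < data.len) : (off += chunk_len) {
        const end = @min(off + chunk_len, data.len);
        try streamed.feed(data[off..end]);
    }

    // Same number of distinct trigrams, and every expected one is present.
    try std.testing.expectEqual(expected.count(), streamed.set.count());
    var it = expected.keyIterator();
    while (it.next()) |key| {
        try std.testing.expect(streamed.set.contains(key.*));
    }
}
```

Keeping the carry as two bytes plus a byte counter means the extractor does not care where chunk boundaries fall, so a short final read or a file smaller than 3 bytes needs no special casing.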

Reference: justrach/turbodb storage/mmap.zig — zero-copy page-aligned access pattern that avoids loading full files into heap.

Failing test

test "issue-209: streaming trigram extraction matches full-load results" {
    // Index a file with streaming vs full-load
    // Verify identical trigram candidates
}

Expected impact

| Metric | Before | After |
|---|---|---|
| Per-file memory (trigrams) | file_size bytes | 4KB |
| Peak RSS during build | ~1GB (5K files) | ~100MB |

Files to modify

  • src/index.zig: TrigramIndex.indexFile
