
Address memory consumption during "semcode-index --lore"#34

Open
chucklever wants to merge 3 commits into facebookexperimental:main from chucklever:main

Conversation

@chucklever
Contributor

My recent optimization for "semcode-index --lore" let the database optimization step grow the working set of the semcode-index and semcode MCP processes to the point where running semcode at all on small-memory systems became impossible. This series addresses the regression.

"cargo fmt" adjusted a bit of code that was added by commit
37f4b7a ("switch LSP server from tower-lsp 0.20 to
tower-lsp-server 0.23").
Commit 8ac9f79 ("lore: Use incremental FTS index updates
instead of full rebuilds") removed an early-return guard in
optimize_single_table() that previously skipped the lore table
entirely. The guard had been documented as protecting FTS index
references, and that protection was no longer needed once
ensure_lore_fts_indices() + optimize_lore_fts_indices() became
the canonical FTS update path. Removing the guard exposed a
different, previously-dormant issue: compaction of the 290k-row
lore table now runs on every --lore invocation.

lance/index/append.rs:merge_indices() opens every delta index
fragment for a column before merging any of them, and for the
scalar/FTS path indices_merged is hard-coded to 1, so the
num_indices_to_merge option has no effect on how many fragments
are touched per call. On a host with enough memory the cost is
acceptable. On a 6GB system the resident set grows linearly
with the per-column fragment count, bleeds into swap, and the
OOM killer eventually terminates semcode-index. The run then
leaves behind fresh delta fragments that the next run will also
have to walk, so the problem is monotonic.
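The monotonic-backlog behavior can be sketched with a toy model (all numbers and the function itself are illustrative, not measured from semcode):

```rust
/// Toy model of the failure mode described above: each run must walk
/// every existing delta fragment before it can merge them, so resident
/// memory grows with the backlog. If the run is killed part-way, the
/// fresh fragments it appended persist and the next run inherits them.
fn simulate_runs(runs: usize, frags_per_run: usize, memory_limit: usize) -> usize {
    let mut backlog = 0usize;
    for _ in 0..runs {
        backlog += frags_per_run; // fresh delta fragments from this run
        // Resident set is (roughly) linear in the fragment count walked.
        let resident = backlog;
        if resident <= memory_limit {
            backlog = 1; // merge completes; fragments collapse into one index
        }
        // else: OOM-killed before the merge finishes; the backlog remains
    }
    backlog
}
```

With a generous memory limit the backlog collapses after every run; below the limit it only ever grows, which is why each failed run makes the next one strictly worse.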

Two paths now reach the expensive merge_indices walk:

 1. compact_lore_tables() -> optimize_single_table("lore"),
    which runs Compact + Prune + Index (step 3 is the index
    optimize that walks all fragments).

 2. optimize_lore_fts_indices() called directly from the
    --lore pipeline after compact_lore_tables() returns.

Restore the early-return skip in optimize_single_table() for the
lore table so path (1) is a no-op again, and guard
optimize_lore_fts_indices() with a _indices/ fragment-count
threshold so path (2) bails out cleanly when the backlog is
already pathologically large. Query correctness is preserved
in both cases: LanceDB's native FTS engine serves unindexed
rows via a brute-force fallback, so searches still return
correct results while compaction is deferred to a host with
enough memory to complete it.
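The threshold guard for path (2) might look roughly like the following. This is a minimal sketch, assuming the delta-index fragments live under a per-table `_indices/` directory; the helper name and the threshold constant are illustrative, not the actual semcode code:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Illustrative threshold; the real cutoff is a tuning decision
/// based on how much memory the merge walk needs per fragment.
const MAX_LORE_FTS_FRAGMENTS: usize = 64;

/// Sketch of the bail-out check: count the delta-index fragments under a
/// table's `_indices/` directory and report whether the backlog is already
/// too large to merge safely on a small-memory host.
fn should_skip_fts_optimize(table_dir: &Path) -> io::Result<bool> {
    let indices_dir = table_dir.join("_indices");
    if !indices_dir.is_dir() {
        // No index fragments yet: nothing to merge, nothing to skip.
        return Ok(false);
    }
    let fragment_count = fs::read_dir(&indices_dir)?.count();
    Ok(fragment_count > MAX_LORE_FTS_FRAGMENTS)
}
```

Skipping is safe precisely because of the brute-force fallback mentioned above: deferring the merge costs query latency, not correctness.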

Fixes: 8ac9f79 ("lore: Use incremental FTS index updates instead of full rebuilds")

After inserting emails, the --lore handler called
optimize_database() which runs compact_and_cleanup() across
every table in the database — functions, types, 16 content
shards, and several metadata tables.  When a database
already contains a code index, those tables carry thousands
of fragments with full function bodies and type definitions.
Compacting them loads hundreds of megabytes of data that the
lore run never modified, and on a 6 GB system the combined
working set triggers the OOM killer.

Add compact_lore_tables() which processes only the lore and
lore_indexed_commits tables, sequentially, and call it from
both --lore code paths instead of optimize_database().  Peak
memory during post-pipeline cleanup is now proportional to
the lore data alone.
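The shape of compact_lore_tables() follows directly from that description. A minimal sketch, with stand-in types since the real Database and per-table cleanup helper live in semcode (the method name compact_and_cleanup is taken from the commit message; its per-table signature here is an assumption):

```rust
use std::cell::RefCell;

// Minimal stand-ins so the sketch compiles; the real types live in semcode.
struct Database {
    compacted: RefCell<Vec<String>>,
}

#[derive(Debug)]
struct Error;

impl Database {
    /// Stand-in for the real per-table Compact + Prune step; here it only
    /// records which table was processed, in order.
    fn compact_and_cleanup(&self, table: &str) -> Result<(), Error> {
        self.compacted.borrow_mut().push(table.to_string());
        Ok(())
    }
}

/// Sketch of the narrowed cleanup path: touch only the two tables the
/// --lore pipeline writes to, one at a time, so peak memory stays
/// proportional to the lore data rather than the whole database.
fn compact_lore_tables(db: &Database) -> Result<(), Error> {
    for table in ["lore", "lore_indexed_commits"] {
        db.compact_and_cleanup(table)?;
    }
    Ok(())
}
```

Processing the tables sequentially rather than concurrently is the point: only one table's fragments are resident at any moment.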