Address memory consumption during "semcode-index --lore" #34
Open
chucklever wants to merge 3 commits into facebookexperimental:main
"cargo fmt" adjusted a bit of code that was added by commit 37f4b7a ("switch LSP server from tower-lsp 0.20 to tower-lsp-server 0.23").
Commit 8ac9f79 ("lore: Use incremental FTS index updates instead of full rebuilds") removed an early-return guard in optimize_single_table() that previously skipped the lore table entirely. The guard had been documented as protecting FTS index references, and that protection was no longer needed once ensure_lore_fts_indices() + optimize_lore_fts_indices() became the canonical FTS update path.

Removing the guard exposed a different, previously-dormant issue: compaction of the 290k-row lore table now runs on every --lore invocation. lance/index/append.rs:merge_indices() opens every delta index fragment for a column before merging any of them, and for the scalar/FTS path indices_merged is hard-coded to 1, so the num_indices_to_merge option has no effect on how many fragments are touched per call.

On a host with enough memory the cost is acceptable. On a 6GB system the resident set grows linearly with the per-column fragment count, bleeds into swap, and the OOM killer eventually terminates semcode-index. The run then leaves behind fresh delta fragments that the next run will also have to walk, so the problem is monotonic.

Two paths now reach the expensive merge_indices walk:

1. compact_lore_tables() -> optimize_single_table("lore"), which runs Compact + Prune + Index (step 3 is the index optimize that walks all fragments).
2. optimize_lore_fts_indices() called directly from the --lore pipeline after compact_lore_tables() returns.

Restore the early-return skip in optimize_single_table() for the lore table so path (1) is a no-op again, and guard optimize_lore_fts_indices() with a _indices/ fragment-count threshold so path (2) bails out cleanly when the backlog is already pathologically large.

Query correctness is preserved in both cases: LanceDB's native FTS engine serves unindexed rows via a brute-force fallback, so searches still return correct results while compaction is deferred to a host with enough memory to complete it.
Fixes: 8ac9f79 ("lore: Use incremental FTS index updates instead of full rebuilds")
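The two guards described in the commit message above could look roughly like the following. This is a minimal sketch, not the code from the patch: the function names `count_index_entries`, `should_skip_index_optimize`, and `skips_table_optimize` are hypothetical, the threshold is passed in as a parameter because the actual value is not stated here, and it assumes each delta index shows up as one entry directly under the table's `_indices/` directory.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Count the entries directly under `<table_dir>/_indices`.
/// Assumption: one entry per index fragment. A missing directory counts as zero.
fn count_index_entries(table_dir: &Path) -> io::Result<usize> {
    let indices_dir = table_dir.join("_indices");
    match fs::read_dir(&indices_dir) {
        Ok(entries) => {
            let mut count = 0;
            for entry in entries {
                entry?; // propagate per-entry I/O errors
                count += 1;
            }
            Ok(count)
        }
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(0),
        Err(e) => Err(e),
    }
}

/// Path (2): returns true when the fragment backlog exceeds `max_fragments`,
/// meaning the FTS index optimize should be skipped on this host.
fn should_skip_index_optimize(table_dir: &Path, max_fragments: usize) -> io::Result<bool> {
    Ok(count_index_entries(table_dir)? > max_fragments)
}

/// Path (1): the restored early return, reduced to a predicate.
/// Table-level optimize is a no-op for the "lore" table.
fn skips_table_optimize(table_name: &str) -> bool {
    table_name == "lore"
}
```

Checking the directory listing before opening any index keeps the guard itself cheap: it never loads fragment data, so it cannot trigger the memory growth it is meant to avoid.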
After inserting emails, the --lore handler called optimize_database() which runs compact_and_cleanup() across every table in the database: functions, types, 16 content shards, and several metadata tables. When a database already contains a code index, those tables carry thousands of fragments with full function bodies and type definitions. Compacting them loads hundreds of megabytes of data that the lore run never modified, and on a 6 GB system the combined working set triggers the OOM killer.

Add compact_lore_tables() which processes only the lore and lore_indexed_commits tables, sequentially, and call it from both --lore code paths instead of optimize_database(). Peak memory during post-pipeline cleanup is now proportional to the lore data alone.
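The shape of that change, restricting cleanup to the two lore tables and handling them one at a time, can be sketched as follows. This is an illustration under assumptions rather than the patch itself: `compact_lore_tables_sketch` and `LORE_TABLES` are hypothetical names, the per-table work is abstracted into a caller-supplied closure, and stopping at the first error is a choice made for this sketch.

```rust
/// The only tables a --lore run modifies, per the commit message.
const LORE_TABLES: [&str; 2] = ["lore", "lore_indexed_commits"];

/// Run `compact` on each lore table in order, one at a time.
/// Stops and returns the error if any table fails (a choice made for this sketch).
fn compact_lore_tables_sketch<F, E>(mut compact: F) -> Result<(), E>
where
    F: FnMut(&str) -> Result<(), E>,
{
    for table in LORE_TABLES {
        // Sequential on purpose: only one table's working set is live at a time.
        compact(table)?;
    }
    Ok(())
}
```

Because the table list is fixed and small, the code-index tables (functions, types, content shards) are never opened during a lore run, which is what bounds peak memory to the lore data.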
My recent optimization for "semcode-index --lore" lets the database optimization step grow the working set of the semcode-index and semcode MCP processes to the point that semcode cannot run at all on small-memory systems. This series addresses the regression.