Skip to content

Fix rebuild efficency#586

Open
NeptuneHub wants to merge 7 commits into
mainfrom
devel
Open

Fix rebuild efficency#586
NeptuneHub wants to merge 7 commits into
mainfrom
devel

Conversation

@NeptuneHub
Copy link
Copy Markdown
Owner

This is to improve the speed and stability of Index Rebuil with:

  • Code quality: Eliminates ~400 lines of duplicated index-building logic across 4 builders (CLAP, lyrics, lyrics-axes, SemGrove, audio Voyager) into reusable helpers. Follows DRY principle.
  • Dashboard UX fix: The real problem this solves—users couldn't see which index was running during the final 95–97% window because only the CLAP builder announced itself via log_fn. Now each builder reports progress before running.
  • Memory optimization: SemGrove switched from materializing full embedding matrices (dicts + vstack) to streaming with running sum/sum-of-squares accumulators. No extra copies held in RAM.
  • Concurrency safety: Moved from holding the main DB connection idle-in-transaction during the entire embedding fetch (which broke if other workers wrote to those tables) to opening short-lived autocommit read-only side connections. Much safer.
  • Test coverage: 443 lines of comprehensive unit tests covering identifier validation, buffer streaming, NULL/wrong-dim skipping, buffer growth on concurrent writes, Voyager round-trips, and segmented storage.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a centralized helper module (tasks/index_build_helpers.py) to unify the process of streaming embeddings, building Voyager indexes, and persisting them in segmented database rows. This refactoring is applied across CLAP, lyrics, SemGrove, and audio Voyager index builders, significantly improving memory efficiency and preventing idle-in-transaction database connection timeouts. Additionally, the dashboard logging is updated to show active builders, and comprehensive unit tests are added. Feedback is provided to optimize memory usage in the streaming helper by avoiding an unnecessary buffer copy when converting database blobs to numpy arrays.

Comment thread tasks/index_build_helpers.py Outdated
@NeptuneHub
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a centralized helper module (tasks/index_build_helpers.py) to standardize and optimize the building, streaming, and segmented persistence of Voyager indexes across CLAP, lyrics, SemGrove, and audio managers. It also includes memory optimizations for map and artist projections, UI progress updates during index rebuilds, and comprehensive unit tests. The reviewer suggested a performance improvement in tasks/sem_grove_manager.py to fully vectorize the whitening statistics computation using NumPy, which avoids a Python loop over common IDs.

Comment thread tasks/sem_grove_manager.py
@NeptuneHub
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a centralized index_build_helpers module to optimize the memory footprint and transaction handling of Voyager index builds (CLAP, lyrics, lyrics-axes, SemGrove, and audio). By streaming embeddings from a dedicated side connection and building indexes incrementally, the changes prevent high peak RAM usage and avoid database timeout issues. Additionally, the PR updates progress logging, increments the app version, and adds comprehensive unit tests. The reviewer feedback correctly identifies potential ValueError exceptions when calling np.frombuffer on malformed blobs in the streaming helpers, suggesting a safe length check beforehand.

Comment thread tasks/index_build_helpers.py Outdated
Comment thread tasks/index_build_helpers.py Outdated
@NeptuneHub
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the Voyager index building and persistence logic across multiple modules (CLAP, lyrics, SemGrove, and audio Voyager) by centralizing common phases into a new tasks/index_build_helpers.py module. This helper module introduces memory-efficient streaming of embeddings into pre-allocated numpy buffers and segmented BYTEA storage to prevent exceeding PostgreSQL's 1 GB MaxAllocSize limit. Additionally, the artist GMM metadata is optimized by dropping unused covariance fields and serializing the remaining parameters into a compact binary format stored in a dedicated artist_metadata_data table. The review identified a critical NameError in tasks/artist_gmm_manager.py where db_conn is referenced instead of conn, a potential ValueError in tasks/index_build_helpers.py due to unaligned offsets when calling np.frombuffer on packed binary data, and a concurrency risk in app_helper.py when iterating over a shared global dictionary during the two-pass projection build.

Comment thread tasks/artist_gmm_manager.py Outdated
Comment thread tasks/index_build_helpers.py
Comment thread app_helper.py
@NeptuneHub
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request centralizes and refactors the building and persistence of Voyager indexes across CLAP, lyrics, SemGrove, and audio modules into a new tasks/index_build_helpers.py helper module. This change optimizes memory usage through batched streaming and prevents database row-size limit issues by segmenting large binaries and GMM metadata. Feedback on these changes highlights a critical alignment issue when using np.frombuffer with an offset in the metadata decoder, a potential crash in the artist GMM manager if the new metadata table does not exist yet, and a robustness improvement to wrap the artist projection population loop in a try...except block to handle dimension mismatches gracefully.

Comment thread tasks/index_build_helpers.py
Comment thread tasks/artist_gmm_manager.py
Comment thread app_helper.py
@NeptuneHub
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request centralizes and optimizes the building and storage of Voyager indexes across CLAP, lyrics, SemGrove, and audio search tasks by introducing streaming database fetches and segmented binary storage helpers. It also refactors artist GMM metadata storage to use a custom binary codec stored in a dedicated table, dropping unused covariance fields to prevent exceeding PostgreSQL's allocation limits. Feedback is provided regarding a critical alignment issue in the metadata unpacking logic where using np.frombuffer with a variable byte offset can raise a ValueError if the offset is not a multiple of 4 bytes.

Comment thread tasks/index_build_helpers.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant