[ENH] Add maxscore index to metadata segment #6880

Open
Sicheng-Pan wants to merge 2 commits into hammad/maxscore_schema_gating from hammad/maxscore_segment_wiring

Contributor

@Sicheng-Pan Sicheng-Pan commented Apr 10, 2026

Description of changes

This is PR 7 in the MaxScore stack. It wires MaxScoreWriter/MaxScoreReader into the metadata segment so that collections with algorithm: "max_score" in their schema use the new index format on both the write and read paths. The query pipeline is not yet connected — that follows in PR 8.

  • MaxScoreReader::count_postings() (maxscore.rs): New method that counts total posting entries for a dimension by summing len() across posting blocks. O(n_blocks). Used by the IDF operator in PR 8 to compute document frequency for BM25 scoring.
  • Metadata segment writer (blockfile_metadata.rs):
    • Added maxscore_index_writer: Option<MaxScoreWriter> field to MetadataSegmentWriterShard.
    • Added schema: Option<&Schema> parameter to both MetadataSegmentWriter::from_segment() and MetadataSegmentWriterShard::from_segment().
    • 3-way branch in writer construction:
      1. SPARSE_POSTING in file_path → fork MaxScore index (open reader + forked writer)
      2. SPARSE_MAX in file_path → fork existing WAND index (unchanged)
      3. Neither (fresh collection) → check schema.is_maxscore_enabled() to decide which writer to create
    • Only one of sparse_index_writer / maxscore_index_writer is Some at a time.
    • Dual dispatch in set_metadata/delete_metadata SparseVector arms — checks maxscore_index_writer first, falls back to sparse_index_writer.
  • Metadata segment flusher (blockfile_metadata.rs):
    • Changed sparse_index_flusher from SparseFlusher to Option<SparseFlusher> and added an Option<MaxScoreFlusher> alongside it.
    • commit() handles both writer paths; flush() conditionally inserts SPARSE_POSTING or SPARSE_MAX+SPARSE_OFFSET_VALUE into the flushed file_path map.
  • Metadata segment reader (blockfile_metadata.rs):
    • Added maxscore_index_reader: Option<MaxScoreReader> field to MetadataSegmentReaderShard.
    • SPARSE_POSTING blockfile loaded concurrently in the existing tokio::join!. If present, maxscore_index_reader is populated and the old WAND reader is skipped.
  • Call site updates (~27 sites):
    • 2 production orchestrators (log_fetch_orchestrator.rs, attached_function_orchestrator.rs) pass collection.schema.as_ref().
    • create_new_shard extracts schema from &Collection.
    • ~24 test sites pass None (backward-compatible — default WAND path).
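
As a rough illustration of the count_postings() bullet above, the sketch below sums block lengths for a single dimension. PostingBlock, SketchReader, and demo_reader are hypothetical in-memory stand-ins and not the types in maxscore.rs; the real reader works over blockfiles and may well be async and fallible.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one posting block: (document offset, weight) pairs.
struct PostingBlock {
    entries: Vec<(u32, f32)>,
}

impl PostingBlock {
    fn len(&self) -> usize {
        self.entries.len()
    }
}

/// Hypothetical in-memory reader keyed by dimension id (assumption: u32 ids).
struct SketchReader {
    blocks_by_dimension: BTreeMap<u32, Vec<PostingBlock>>,
}

impl SketchReader {
    /// Sums block lengths for one dimension. Each block is visited once,
    /// so the cost is O(n_blocks) for that dimension. Unknown dimensions count as 0.
    fn count_postings(&self, dimension: u32) -> usize {
        self.blocks_by_dimension
            .get(&dimension)
            .map(|blocks| blocks.iter().map(PostingBlock::len).sum::<usize>())
            .unwrap_or(0)
    }
}

/// Builds a tiny reader: dimension 7 has blocks of 3 and 2 entries, dimension 9 has one entry.
fn demo_reader() -> SketchReader {
    fn block(offsets: &[u32]) -> PostingBlock {
        PostingBlock {
            entries: offsets.iter().map(|&offset| (offset, 1.0)).collect(),
        }
    }
    let mut blocks_by_dimension = BTreeMap::new();
    blocks_by_dimension.insert(7, vec![block(&[0, 1, 2]), block(&[5, 8])]);
    blocks_by_dimension.insert(9, vec![block(&[3])]);
    SketchReader { blocks_by_dimension }
}
```

The count for a dimension is its document frequency in this sketch, which is the quantity the description says the IDF operator will need in PR 8.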

Test plan

  • All existing metadata segment tests pass unchanged (they pass None for schema, so the WAND writer is created as before).
  • Compilation verified for chroma-segment and worker crates.
  • Integration tests for the MaxScore write-then-read path will be added in PR 8 alongside the operator and orchestrator routing.

Migration plan

No migration needed. Existing collections with SPARSE_MAX+SPARSE_OFFSET_VALUE in their file_path continue to use the WAND reader/writer. New collections only get the MaxScore index if the schema has algorithm: "max_score" (set by the frontend gating in PR 6). The segment reader auto-detects which format is present based on file_path keys.
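
The selection rule described above can be sketched roughly as follows. This is a simplified model, not the code in blockfile_metadata.rs: the string values of the key constants, the shape of the file_path map, the Schema stand-in, and the names SparseIndexChoice, choose_sparse_index, and flushed_sparse_keys are all assumptions made for illustration.

```rust
use std::collections::HashMap;

// Constant names come from the PR description; the string values are assumed.
const SPARSE_POSTING: &str = "sparse_posting";
const SPARSE_MAX: &str = "sparse_max";
const SPARSE_OFFSET_VALUE: &str = "sparse_offset_value";

/// Hypothetical stand-in for the collection schema.
struct Schema {
    maxscore_enabled: bool,
}

impl Schema {
    fn is_maxscore_enabled(&self) -> bool {
        self.maxscore_enabled
    }
}

/// Hypothetical summary of which sparse index the writer ends up with.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SparseIndexChoice {
    ForkMaxScore,
    ForkWand,
    CreateMaxScore,
    CreateWand,
}

/// Persisted files win over the schema; the schema only matters for a fresh segment.
fn choose_sparse_index(
    file_path: &HashMap<String, Vec<String>>,
    schema: Option<&Schema>,
) -> SparseIndexChoice {
    if file_path.contains_key(SPARSE_POSTING) {
        SparseIndexChoice::ForkMaxScore
    } else if file_path.contains_key(SPARSE_MAX) {
        SparseIndexChoice::ForkWand
    } else if matches!(schema, Some(s) if s.is_maxscore_enabled()) {
        SparseIndexChoice::CreateMaxScore
    } else {
        // No files and no schema (or MaxScore not enabled): default WAND path.
        SparseIndexChoice::CreateWand
    }
}

/// Keys a flush would record for the chosen index, mirroring the flusher bullet.
fn flushed_sparse_keys(choice: SparseIndexChoice) -> Vec<&'static str> {
    match choice {
        SparseIndexChoice::ForkMaxScore | SparseIndexChoice::CreateMaxScore => {
            vec![SPARSE_POSTING]
        }
        SparseIndexChoice::ForkWand | SparseIndexChoice::CreateWand => {
            vec![SPARSE_MAX, SPARSE_OFFSET_VALUE]
        }
    }
}
```

Under this model, a segment that already has WAND files stays on WAND even if the schema later enables MaxScore, which matches the "no migration needed" claim.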

Observability plan

No new metrics or spans. The 3-way branch in from_segment() is logged implicitly through existing tracing on blockfile open/create operations.

Documentation Changes

None.

@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)

Contributor Author

Sicheng-Pan commented Apr 10, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@Sicheng-Pan Sicheng-Pan changed the title from "Wire MaxScore index into metadata segment writer, reader, and flusher" to "[ENH] Add maxscore index to metadata segment" on Apr 10, 2026
@Sicheng-Pan Sicheng-Pan marked this pull request as ready for review April 10, 2026 20:37
@propel-code-bot
Contributor

propel-code-bot bot commented Apr 10, 2026

Wire MaxScore sparse index into metadata segment read/write/flush paths

This PR adds end-to-end metadata segment support for the new MaxScore sparse index format and selects it based on persisted segment files or collection schema. The metadata writer now supports both legacy WAND (SPARSE_MAX + SPARSE_OFFSET_VALUE) and MaxScore (SPARSE_POSTING) with mutually exclusive writer state, and the reader auto-detects which sparse format to load. Flushing and commit logic were updated to emit the correct file path keys for the chosen sparse index type.

The change also introduces MaxScoreReader::count_postings() for per-dimension posting counts, updates orchestrator call sites to pass collection.schema.as_ref() into metadata writer construction, and adds a substantial multi-commit consistency test that validates MaxScore behavior across add/delete/update cycles and query recall checks.

This summary was automatically generated by @propel-code-bot

Contributor

@propel-code-bot propel-code-bot bot left a comment

Review found no issues; changes cleanly integrate MaxScore metadata indexing with backward-compatible reader/writer routing.

Status: No Issues Found | Risk: Low

Review Details

📁 7 files reviewed | 💬 0 comments

@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_segment_wiring branch from b8b2470 to 294856f on April 10, 2026 23:27
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_segment_wiring branch from ed6eeec to 3f7caf7 on April 11, 2026 00:41