[ENH] Add maxscore lazy cursor#6829
[ENH] Add maxscore lazy cursor#6829HammadB wants to merge 3 commits intohammad/maxscore_writer_readerfrom
Conversation
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
There was a problem hiding this comment.
Changes are promising, but one important unsafe lifetime issue and code duplication should be addressed before merge.
Status: Changes Suggested | Risk: Medium
Issues Identified & Suggestions
- Unsafe lifetime transmute may cause use-after-free; return owned bytes or Arc.
rust/blockstore/src/arrow/blockfile.rs - Duplicate View/Lazy logic; extract shared raw-block helper to reduce drift.
rust/index/src/sparse/cursor.rs
Review Details
📁 13 files reviewed | 💬 2 comments
👍 / 👎 individual comments to help improve reviews for you
| let target_block_id = self.root.sparse_index.get_target_block_id(&search_key); | ||
| let guard = self.loaded_blocks.read(); | ||
| let block: &Block = guard.get(&target_block_id)?; | ||
| // Safety: the Block is heap-allocated (Box<Block>) and is never | ||
| // removed from loaded_blocks. The returned &[u8] points into | ||
| // Arrow buffers owned by the Block. Same lifetime-extension | ||
| // argument used in get_block's transmute. |
There was a problem hiding this comment.
[Logic] The transmute here extends the lifetime of a &Block borrowed from inside the RwLockReadGuard to &'me Block, where 'me is the lifetime parameter of the reader. This is unsound: once the read guard drops at the end of this function, the underlying Block could theoretically be deallocated if a write guard is acquired elsewhere (e.g., if blocks were evicted from the cache), while the returned &[u8] still points into its Arrow buffers.
The comment says "Block is never removed from loaded_blocks", and the same invariant is relied upon by get_block's existing transmutes (lines 498, 513). However, this is a behavioral invariant, not a type-system invariant. This PR expands the unsafe surface area by adding a new function that relies on the same unverifiable guarantee. If that invariant ever changes (cache eviction, compaction, etc.), this becomes a use-after-free with no compiler warning.
A safer approach is to either:
- Return an owned
Vec<u8>(a copy of the raw bytes), or - Wrap Block in
Arc<Block>so the caller can hold a reference-counted handle that prevents deallocation without needing the transmute.
Context for Agents
The `transmute` here extends the lifetime of a `&Block` borrowed from inside the `RwLockReadGuard` to `&'me Block`, where `'me` is the lifetime parameter of the reader. This is unsound: once the read guard drops at the end of this function, the underlying `Block` could theoretically be deallocated if a write guard is acquired elsewhere (e.g., if blocks were evicted from the cache), while the returned `&[u8]` still points into its Arrow buffers.
The comment says "Block is never removed from loaded_blocks", and the same invariant is relied upon by `get_block`'s existing transmutes (lines 498, 513). However, this is a behavioral invariant, not a type-system invariant. This PR expands the unsafe surface area by adding a new function that relies on the same unverifiable guarantee. If that invariant ever changes (cache eviction, compaction, etc.), this becomes a use-after-free with no compiler warning.
A safer approach is to either:
1. Return an owned `Vec<u8>` (a copy of the raw bytes), or
2. Wrap Block in `Arc<Block>` so the caller can hold a reference-counted handle that prevents deallocation without needing the transmute.
File: rust/blockstore/src/arrow/blockfile.rs
Line: 5718925c1a to
814fd3f
Compare
9d41c2f to
2fed177
Compare
This comment has been minimized.
This comment has been minimized.
814fd3f to
d3f0665
Compare
2fed177 to
ffb0104
Compare
d3f0665 to
756d94a
Compare
e71b3f4 to
6b0e4e6
Compare
2834ec4 to
30520f9
Compare
6b0e4e6 to
d57413d
Compare
30520f9 to
609b9b0
Compare
|
Add This PR introduces a new It also rewrites This summary was automatically generated by @propel-code-bot |
b3e6c5a to
fcca860
Compare
d57413d to
9941a72
Compare
fcca860 to
da68907
Compare
9941a72 to
0dec95c
Compare
da68907 to
59904b4
Compare
59904b4 to
c76ce23
Compare
6ad7e5c to
be9d78b
Compare
c76ce23 to
4b94ecf
Compare
be9d78b to
4c6a02c
Compare

Description of changes
This is PR #3 of the BlockMaxMaxScore series, stacked on
hammad/maxscore_writer_reader. It replaces the eager-only cursor with a tri-modal cursor and rewrites the query pipeline to use a 3-batch I/O strategy for reduced latency.PostingCursor(cursor.rs): Replaces the eager-only cursor from PR Test queries in nested parquet coco files #2 with three source modes:SparsePostingBlocks (unchanged from PR Test queries in nested parquet coco files #2).&[u8]slices from the Arrow block cache. Offsets are decompressed into a reusable buffer; f16 weights are read directly from raw bytes on the fly (fused reads, no bulk conversion).None, populated on demand viapopulate_from_cache/populate_all_from_cache. Same fused f16 read path as View once populated.get_raw_from_cache(synchronous cache-only raw byte lookup),count_blocks_for_prefix,Block::get_raw/Block::get_prefix_raw, andArrowReadableValue::get_raw_bytestrait method.query():query_weight × max_weight ≤ thresholdare skipped entirely.Test plan
ms_12_cursor_view— View cursor vs Eager equivalence: advance, drain_essential, score_candidates, get_value, window_upper_bound, current.ms_13_cursor_lazy— Lazy cursor via real blockfile I/O: advance, get_value, drain_essential.ms_14_three_batch_pipeline— End-to-end 3-batch query: basic correctness, single dim, k > results, empty query, k=0, missing dim, masks, multi-block per dim.ms_15_multi_window— Documents spanning 4096-entry window boundaries: boundary transitions, accumulator reset, many-docs-across-windows recall.ms_16_lazy_partial_load— Lazy cursor with partially loaded blocks: advance skips unloaded, drain_essential skips unloaded, score_candidates skips unloaded.cargo testMigration plan
No migration needed. This changes the internal query execution strategy but does not alter blockfile formats or segment layouts.
Observability plan
No new instrumentation in this PR. Block-skip counters and batch timing will be added alongside segment integration.
Documentation Changes
No user-facing API changes. Internal design is documented in
rust/index/src/sparse/maxscore.md.