[ENH] Commit eagerly in ordered blockfile writer (#6109)
## Description of changes
_Summarize the changes made by this PR._
- Improvements & Bug fixes
- N/A
- New functionality
- Eagerly commit blocks in the ordered blockfile writer to avoid excessive memory usage (see the sketch below)
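
As a rough illustrative sketch only (not the actual Chroma implementation; `Writer`, `BlockDelta`, and `flush_delta` are hypothetical stand-ins), the eager-commit idea is that sealed block deltas are flushed as soon as they are complete rather than being buffered until the final commit, so peak memory is bounded by the deltas still in flight:

```rust
// Hypothetical sketch of eager commit; names and types are illustrative.
struct BlockDelta {
    size_bytes: usize,
}

struct Writer {
    // Deltas that have finished receiving writes but are not yet persisted.
    completed_block_deltas: Vec<BlockDelta>,
}

impl Writer {
    // Called whenever a delta is sealed. Flushing here keeps memory usage
    // bounded by the in-flight deltas rather than the whole blockfile.
    fn seal_delta(&mut self, delta: BlockDelta) {
        self.completed_block_deltas.push(delta);
        for sealed in self.completed_block_deltas.drain(..) {
            Self::flush_delta(sealed);
        }
    }

    fn flush_delta(delta: BlockDelta) {
        // Serialize the block and hand it to the storage layer (omitted).
        let _ = delta.size_bytes;
    }
}
```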
## Test plan
_How are these changes tested?_
- [ ] Tests pass locally with `pytest` for python, `yarn test` for js,
`cargo test` for rust
## Migration plan
_Are there any migrations, or any forwards/backwards compatibility
changes needed in order to make sure this change deploys reliably?_
## Observability plan
_What is the plan to instrument and monitor this change?_
## Documentation Changes
_Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
section](https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com)?_
Relevant excerpt from the diff:

```rust
for delta in inner.completed_block_deltas.drain(..) {
    // Don't we split on-mutation (.set() calls)?
    // Yes, but that is only a performance optimization. For correctness, we must also split on commit. Why?
    //
    // We need to defer copying old forked data until:
    // - we receive a set()/delete() for a later key
    // - we are committing the delta (it will receive no further writes)
    //
    // Because of this constraint, we cannot always effectively split on-mutation if the writer is over a forked blockfile. Imagine this scenario:
    // 1. There is 1 existing block whose size == limit.
    // 2. We receive a .set() for a key before the existing block's start key.
    // 3. We turn the existing block into a delta and add the new KV pair.
    // 4. At this point, the total size of the delta (materialized + pending forked data) is above the limit.
    // 5. We would like to split our delta into two immediately after the newly-added key. However, this means that the right half of the split is empty (there is no materialized data), which violates a fundamental assumption made by our blockstore code. And we cannot materialize only the first key in the right half from the pending forked data because that would violate the above constraint.
    //
    // Thus, we handle splitting in two places:
    //
    // 1. Split deltas in half on-mutation if the materialized size is over the limit (just a performance optimization).
    // 2. During the commit phase, after all deltas have been fully materialized, split if necessary.
    //
    // An alternative would be to create a fresh delta that does not fork from an existing block if we receive a .set() for a key that is not contained in any existing block key range; however, this complicates writing logic and potentially increases fragmentation.
    if delta.get_size::<K, V>() > self.root.max_block_size_bytes {
        let split_blocks = delta.split::<K, V>(self.root.max_block_size_bytes);
        // ...
    }
    // ...
}
```
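
To make the split-on-commit rule described in the comments above concrete, here is a hedged, self-contained sketch (the `Delta` type, its `split` signature, and `commit_block` are simplifications, not the real blockstore API): at commit time every delta is fully materialized, so an oversized delta can safely be split into pieces that each fit under the block size limit, which on-mutation splitting alone cannot guarantee for forked blocks.

```rust
// Simplified stand-in for a fully materialized block delta.
struct Delta {
    materialized_size: usize,
}

impl Delta {
    // Split a delta that grew past the limit into pieces that each fit.
    fn split(self, max_size: usize) -> Vec<Delta> {
        let mut pieces = Vec::new();
        let mut remaining = self.materialized_size;
        while remaining > max_size {
            pieces.push(Delta { materialized_size: max_size });
            remaining -= max_size;
        }
        pieces.push(Delta { materialized_size: remaining });
        pieces
    }
}

fn commit(deltas: Vec<Delta>, max_block_size_bytes: usize) {
    for delta in deltas {
        // Every delta is fully materialized by now, so splitting here never
        // produces an empty half the way a premature on-mutation split could.
        if delta.materialized_size > max_block_size_bytes {
            for piece in delta.split(max_block_size_bytes) {
                commit_block(piece);
            }
        } else {
            commit_block(delta);
        }
    }
}

fn commit_block(delta: Delta) {
    // Serialize the block and register it with the index (omitted).
    let _ = delta.materialized_size;
}
```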