Run containers attempt 3 #320

lucascool12 · 2025-04-10T20:58:44Z

This PR continues on #66. My main goal is to move each part of the original branch to the new project layout, e.g. the run_store.rs file or whatever it should be called.

Each commit will move such a piece of code and also add tests for this (and then fix any resulting bugs).

Example of such a commit: a57aff1

Closes: #12

Implements and tests `insert` and `insert_range` methods on runs.

This fixes some failing tests and adds some `#[allow(todo)]` and `#[allow(unused)]`.

lucascool12 · 2025-05-11T13:07:25Z

I think inserting with insert_range (especially as a new container) should probably try to make a range container if it's a big enough range. It would be nice if bitmap.insert_range(0, HUGE_NUMBER) was efficient.

I have implemented this based on CRoaring's implementation in eff381a.

Dr-Emann · 2025-05-11T15:02:52Z

I think the important factor is that:

Unless one calls (run_)optimize then all containers that can be represented more efficiently by run containers will be represented by run containers.

There are cases where a container can be represented equally efficiently as either a range, or an {array/bitmap}. Both implementations (correctly imo) default to leaving the existing container type when converting to/from a run container is not strictly more efficient.

Therefore, in these cases, the result of (run_)optimize/remove_run_compression depends on which of the two equally valid container types was already there.

e.g.

for the Roaring Bitmap containing [0, 1, 2], it could be represented in two ways

runs: [0..=2] # serialized size `2 + (1 * 4) = 6`
array: [0, 1, 2] # serialized size `3 * 2 = 6`

So e.g. both implementations have to match on the result type of container for all operations for all container types, e.g. run[0..=8] ^ array[3,4,5,6,7,8] needs to have the same container type in both implementations if we want to be able to guarantee they always serialize the same, even after doing (run_)optimize
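
The size arithmetic above can be sketched in a few lines of Rust. This is only an illustration of the two formulas quoted in this comment (2 bytes per array value, a 2-byte run count plus 4 bytes per run); the function and enum names are hypothetical and are not taken from roaring-rs or CRoaring, and bitmap containers are ignored for simplicity.

```rust
/// Serialized size in bytes of an array container: 2 bytes per value.
fn array_serialized_size(cardinality: usize) -> usize {
    cardinality * 2
}

/// Serialized size in bytes of a run container:
/// a 2-byte run count followed by 4 bytes (start, length) per run.
fn run_serialized_size(n_runs: usize) -> usize {
    2 + n_runs * 4
}

/// Which representation is strictly smaller, if any (hypothetical helper).
#[derive(Debug, PartialEq)]
enum Preferred {
    Run,
    Array,
    Tie,
}

fn preferred(cardinality: usize, n_runs: usize) -> Preferred {
    let array = array_serialized_size(cardinality);
    let run = run_serialized_size(n_runs);
    if run < array {
        Preferred::Run
    } else if array < run {
        Preferred::Array
    } else {
        // Equal sizes: this is the case where the existing container type is kept.
        Preferred::Tie
    }
}
```

For the bitmap containing [0, 1, 2], `preferred(3, 1)` is a tie, which is exactly the situation where the result depends on the container type that was already there.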

lucascool12 · 2025-05-11T15:19:02Z

> e.g.
>
> for the Roaring Bitmap containing [0, 1, 2], it could be represented in two ways
>
> runs: [0..=2] # serialized size `2 + (1 * 4) = 6`
> array: [0, 1, 2] # serialized size `3 * 2 = 6`
>
> So e.g. both implementations have to match on the result type of container for all operations for all container types, e.g. run[0..=8] ^ array[3,4,5,6,7,8] needs to have the same container type in both implementations if we want to be able to guarantee they always serialize the same, even after doing (run_)optimize

I see. I don't think it is feasible to ensure we also produce runs in the exact same situations as CRoaring.
CRoaring presumably doesn't make any promises about which operations automatically produce runs, so a minor version bump in CRoaring might make our fuzz CI fail, which is not desirable.

Relaxing the serialization comparison would be the best option we have.

lucascool12 · 2025-05-11T15:27:01Z

Couldn't we call remove_run_compression before (run_)optimize to ensure we always have the same Roaring bitmap?
And also the other way around?

lucascool12 · 2025-05-11T17:31:28Z

> Couldn't we call remove_run_compression before (run_)optimize to ensure we always have the same Roaring bitmap? And also the other way around?

I ran the fuzzer with the following patch applied on croaring-rs and found nothing after letting it run for 45 minutes. Yeey!

diff --git i/croaring-sys/CRoaring/roaring.c w/croaring-sys/CRoaring/roaring.c
index d49cda5..ba61acb 100644
--- i/croaring-sys/CRoaring/roaring.c
+++ w/croaring-sys/CRoaring/roaring.c
@@ -1494,7 +1494,7 @@ bool array_container_validate(const array_container_t *v, const char **reason);
  * Return the serialized size in bytes of a container having cardinality "card".
  */
 static inline int32_t array_container_serialized_size_in_bytes(int32_t card) {
-    return card * 2 + 2;
+    return card * 2;
 }
 
 /**

Kerollmops · 2025-05-17T11:55:09Z

Hey @lucascool12 and @Dr-Emann 👋

I hope you're good 😊 I was wondering whether the final change we need before merging this PR is to merge RoaringBitmap/CRoaring#702? And if so, what's actually missing for it to be merged?

Have a nice day 🥬

lucascool12 · 2025-05-17T15:55:04Z

> Hey @lucascool12 and @Dr-Emann 👋
>
> I hope you're good 😊 I was wondering whether the final change we need before merging this PR is to merge RoaringBitmap/CRoaring#702? And if so, what's actually missing for it to be merged?
>
> Have a nice day 🥬

I noticed that `Interval` assumes `self.start <= self.end`, but this is only weakly enforced right now. I'll change this by making the `new` function return an `Option` and adding a `new_unchecked`. Lastly, I'm going to review my own code one more time and resolve anything unsatisfactory that I find. Then this PR will be completely ready from my end.
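
A minimal sketch of what such a checked/unchecked constructor pair could look like. The field types, the `len` helper, and the use of `debug_assert!` are assumptions made for illustration; the actual `Interval` in this PR may differ.

```rust
/// Hypothetical inclusive interval with the invariant `start <= end`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Interval {
    start: u16,
    end: u16,
}

impl Interval {
    /// Checked constructor: returns `None` when the invariant would be violated.
    fn new(start: u16, end: u16) -> Option<Self> {
        if start <= end {
            Some(Self { start, end })
        } else {
            None
        }
    }

    /// Unchecked constructor: the caller promises `start <= end`.
    /// The invariant is only verified in debug builds.
    fn new_unchecked(start: u16, end: u16) -> Self {
        debug_assert!(start <= end);
        Self { start, end }
    }

    /// Number of values covered; relies on the invariant so the
    /// subtraction cannot underflow.
    fn len(&self) -> u32 {
        u32::from(self.end - self.start) + 1
    }
}
```

With this shape, code that already knows the bounds are ordered can use `new_unchecked`, while everything else goes through `new` and has to handle the `None` case explicitly.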

Also I think we are all in favour of the current semantics of optimize even though it is different from croaring's run_optimize, correct? And as @Dr-Emann said since optimize didn't exist previously adding a breaking change in this PR is a bit odd. Maybe we should remove the breaking label?

Dr-Emann · 2025-05-18T00:53:23Z

Did find something in fuzzing:

Fuzz input

FuzzInput {
    ops: [
        MutateLhs(
            Extend(
                [
                    Num(
                        97619,
                    ),
                    Num(
                        97917,
                    ),
                    Num(
                        97661,
                    ),
                    Num(
                        77184,
                    ),
                    Num(
                        72989,
                    ),
                    Num(
                        70941,
                    ),
                    Num(
                        104237,
                    ),
                ],
            ),
        ),
        SwapSides,
        MutateLhs(
            InsertRange(
                Num(
                    72981,
                )..=Num(
                    72989,
                ),
            ),
        ),
        Binary(
            Xor,
        ),
        Binary(
            Or,
        ),
        MutateLhs(
            RemoveRunCompression,
        ),
    ],
    initial_input: [],
}

Base64: A319fX3Fl4eDfX19U1N9fn19fX19fX0tgB0VHR2NHRUdHcWXLYAdFR0dxZd9fX19fS2AHRUdHY0dFR0BAAAAHR1TyUEABR0VHR3Fl5eXl5dw5VNTyTA=

Looking a bit closer at https://github.com/lucascool12/roaring-rs/blob/c3ebe863e377b58a0732f0ba27da13dc8a1b987f/fuzz/fuzz_targets/arbitrary_ops/mod.rs#L280-L282

x.run_optimize();
y.optimize();
assert_eq!(x.remove_run_compression(), y.remove_run_compression());

I don't think we can do that assert: if we've got a container that can be either a run or an {array/bitmap} container, the optimize call won't do anything. E.g. croaring could have a run container while roaring has an array container, so removing runs will return true for croaring and false for roaring.

I think we could either just not check the return values, or use the statistics call to check whether the types of containers have changed, rather than comparing against whether the croaring bitmap changed.

lucascool12 · 2025-05-18T06:34:28Z

> I think we could either just not check the return values, or use the statistics call to check whether the types of containers have changed, rather than comparing against whether the croaring bitmap changed.

So using a statistics call before and after and then checking no run containers exist?

I tried adding `x.remove_run_compression();` and `y.remove_run_compression();` before the optimize calls; this works for this crash. And unless I'm missing something, this should always give the same result, no?
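
To make the failure mode and the proposed normalization concrete, here is a toy model, not the roaring-rs or croaring API. It assumes a single container that is either an array or a list of inclusive runs, ignores bitmap containers, and uses the size formulas from earlier in this thread; all names are hypothetical.

```rust
/// Toy container: a sorted array of values or a list of inclusive runs.
#[derive(Debug, Clone, PartialEq)]
enum Container {
    Array(Vec<u16>),
    Run(Vec<(u16, u16)>),
}

/// Collapse sorted, deduplicated values into inclusive runs.
fn to_runs(values: &[u16]) -> Vec<(u16, u16)> {
    let mut runs: Vec<(u16, u16)> = Vec::new();
    for &v in values {
        if let Some(last) = runs.last_mut() {
            if last.1 != u16::MAX && last.1 + 1 == v {
                last.1 = v;
                continue;
            }
        }
        runs.push((v, v));
    }
    runs
}

/// Expand inclusive runs back into individual values.
fn to_values(runs: &[(u16, u16)]) -> Vec<u16> {
    runs.iter().flat_map(|&(start, end)| start..=end).collect()
}

impl Container {
    /// Convert only when the other representation is strictly smaller.
    /// Returns whether the container type changed.
    fn optimize(&mut self) -> bool {
        let (values, runs) = match &*self {
            Container::Array(v) => (v.clone(), to_runs(v)),
            Container::Run(r) => (to_values(r), r.clone()),
        };
        let array_size = values.len() * 2;
        let run_size = 2 + runs.len() * 4;
        let is_run = matches!(self, Container::Run(_));
        if run_size < array_size && !is_run {
            *self = Container::Run(runs);
            true
        } else if array_size < run_size && is_run {
            *self = Container::Array(values);
            true
        } else {
            false
        }
    }

    /// Turn a run container into an array; returns whether anything changed.
    fn remove_run_compression(&mut self) -> bool {
        if let Container::Run(runs) = &*self {
            let values = to_values(runs);
            *self = Container::Array(values);
            true
        } else {
            false
        }
    }
}

/// Fuzz-style check on two containers holding the same values.
/// Returns the two `remove_run_compression` results that the assert compares.
fn check(mut x: Container, mut y: Container, normalize_first: bool) -> (bool, bool) {
    if normalize_first {
        x.remove_run_compression();
        y.remove_run_compression();
    }
    x.optimize();
    y.optimize();
    (x.remove_run_compression(), y.remove_run_compression())
}
```

In this model, `check(Run[0..=2], Array[0, 1, 2], false)` yields `(true, false)` because the tie leaves both container types untouched, while passing `true` for `normalize_first` puts both sides in the array form first and the results agree.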

Kerollmops · 2025-05-27T07:35:12Z

> I tried adding `x.remove_run_compression();` and `y.remove_run_compression();` before the optimize calls; this works for this crash. And unless I'm missing something, this should always give the same result, no?

@lucascool12 Do you think this change can be part of the final PR or should we implement the statistic-based solution?

What I don't like/understand with the remove-run-compression solution is that it doesn't check the run-container optimization. At least, doesn't compare it to the C implementation. Am I wrong?

Kerollmops · 2025-05-31T08:14:42Z

Hey @lucascool12 👋

Daniel just merged the PR on the C roaring library. I think we need the croaring Rust wrapper to update its dependency and we will be ready to merge this very PR 👏

lucascool12 · 2025-05-31T08:27:27Z

> > I tried adding `x.remove_run_compression();` and `y.remove_run_compression();` before the optimize calls; this works for this crash. And unless I'm missing something, this should always give the same result, no?
>
> @lucascool12 Do you think this change can be part of the final PR or should we implement the statistic-based solution?
>
> What I don't like/understand with the remove-run-compression solution is that it doesn't check the run-container optimization. At least, doesn't compare it to the C implementation. Am I wrong?

Well, it only checks that both implementations agree on whether the bitmap changed or stayed the same.
A later operation could then check whether the statistics are the same; this depends on what the fuzzer decides.
We could check the statistics right after the optimization if we really wanted to, but I don't think it makes much of a difference.

> Daniel just merged the PR on the C roaring library. I think we need the croaring Rust wrapper to update its dependency and we will be ready to merge this very PR 👏

Great! I'll push the last remnants, such as the remove_run_compression and Interval changes. Then this PR will be ready.

Fixes a fuzz failure by ensuring no run containers are present in both implementations before adding run containers and then removing them again to check if both remove operations had the same effect.

Dr-Emann · 2025-06-01T03:46:05Z

There's a new version of croaring-sys which picks up the croaring update; it should just need a `cargo update` in the fuzzer directory to pick it up.

lucascool12

Alright this is my final review of this PR. I'll remove the dbg! statements and push the updated fuzz code in 5 seconds.

Presumably we would also want to revert the change to the Debug impl for RoaringBitmap, but I'd like some confirmation for this.

roaring/src/bitmap/fmt.rs

roaring/tests/serialization.rs

lucascool12 · 2025-06-05T12:58:39Z

Is there anything left to be done from my end to get this merged?
In my humble opinion this PR is ready to merge.
Or does it require review from someone in particular before being merged?

Kerollmops · 2025-06-05T19:03:45Z

Hey @lucascool12 👋

Thank you again for the good work. I'll merge it right away; your mission is a success. I plan to release a new version soon enough, either before or after trying it on Meilisearch 🤔

Dr-Emann · 2025-06-07T01:54:15Z

I've been running the fuzzer for a few days, and no findings!

josephglanville and others added 30 commits September 11, 2020 17:34

WIP: Run container

9b67893

Fix some bugs in the run container implementation

3124aa4

Fix the to_array/bitmap impl for runs, the end bound is inclusive

2068bb6

Rework the array bitmap intersect_with using Vec::retain

e605f64

Implement the array run intersect_with operation

9321618

Implement the run array intersect_with operation

a62fc7d

Implement the run run union_with operation

d658f28

Implement the run array union_with operation

0ded028

Implement the array run union_with operation

fe8a4ab

Implement the bitmap run union_with operation

613163f

Implement the run run intersect_with operation

0a66483

Implement the bitmap run intersect_with operation

9af4366

Implement the run bitmap intersect_with operation

9612ae9

Simplify the run run intersect_with operation

4ae8986

Implement the remove_range operation for the run store type

924d4db

Implement the run array and array run is_disjoint operation

d7bcad3

Implement the run run is_disjoint operation

cb69d80

Simplify the array bitmap difference_with operation

c77c0f8

Implement the array run difference_with operation

3a9eefd

Implement the bitmap run difference_with operation

183c1bb

Clippy and fmt pass

07d0fcc

Implement the run array difference_with operation

3c99804

Mark array run symmetric_difference_with operation as unimplemented

c762f93

Implement the array run is_subset operation

9744f12

Implement the run run difference_with operation

67784ad

Merge remote-tracking branch 'origin/main' into run-containers

ec86619

feat: insert and insert_range on runs

a57aff1

Implements and tests `insert` and `insert_range` methods on runs.

fix: fixes ci failures introduced in a57aff1

47b7cbf

This fixes some failing tests and adds some `#[allow(todo)]` and `#[allow(unused)]`.

feat: run store push

71c6679

feat: run store remove index

7dfdd92

feat: remove_run_compression

c3ebe86

Dr-Emann mentioned this pull request May 12, 2025

fix: computation of array container serialized size was incorrect RoaringBitmap/CRoaring#702

Merged

Kerollmops removed the "breaking" label ("This change will require a bump of the minor or major version.") May 18, 2025

lucascool12 added 2 commits May 31, 2025 10:28

fix: fuzzing against croaring failure by optimize

a9614bb

Fixes a fuzz failure by ensuring no run containers are present in both implementations before adding run containers and then removing them again to check if both remove operations had the same effect.

fix: enforce Interval invariants

69fe5e6

lucascool12 force-pushed the run-containers branch from 8b41f22 to 69fe5e6 Compare May 31, 2025 09:23

chore: update croaring to 2.3.1 for fuzzing

5b9372a

lucascool12 commented Jun 1, 2025

test: remove dbg! statement

5427897

lucascool12 force-pushed the run-containers branch from b16e44f to 5427897 Compare June 1, 2025 18:01

Kerollmops approved these changes Jun 5, 2025

Kerollmops added this pull request to the merge queue Jun 5, 2025

Merged via the queue into RoaringBitmap:main with commit 6535a82 Jun 5, 2025
15 checks passed
