Skip to content

Run containers attempt 3 #320

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 69 commits into
base: main
Choose a base branch
from

Conversation

lucascool12
Copy link

This PR continues on #66. My main goal is to move each part of the original branch to the new project layout, e.g. the run_store.rs file or whatever it should be called.

Each commit will move such a piece of code and also add tests for this (and then fix any resulting bugs).

Example of such a commit: a57aff1

josephglanville and others added 30 commits September 11, 2020 17:34
Implements and tests `insert` and `insert_range` methods on runs.
This fixes some failing tests and adds some `#[allow(todo]` and
`#[allow(unused]`.
@lucascool12
Copy link
Author

I opened lucascool12#1 to extend the fuzzer to include ranges, and it seems to find something.

I fixed some bugs found by the fuzzer, but found the following:
The documentation for roaring_bitmap_run_optimize found here says:

Returns true if the result has at least one run container.

But the docs in the croaring crate state:

Compresses of the bitmap. Returns true if the bitmap was modified.

As such fuzzing fails because the rust implementation (currently) returns true if anything was optimized.
I'll fix this for now so by also checking if the croaring bitmap changed.

Some things to consider: should we follow croaring's 'return true if bitmap contains any run' or 'return true if we changed anything'. I'm more in favor of the current (second) option, since it makes more sense.

@Kerollmops
Copy link
Member

Thank you, @lucascool12,

I think I am more in favor of the second option. I am willing to change the behavior and to do so we will have to document that in the breaking release. Would you mind updating the main PR message to include breaking changes, please?

@Kerollmops Kerollmops added the breaking This change will require a bump of the minor or major version. label May 6, 2025
@Kerollmops
Copy link
Member

Kerollmops commented May 8, 2025

Hey @Dr-Emann and @lucascool12 👋

I was looking at the cargo fuzz discovery and trying to reproduce it on my laptop. However, I haven't found a way to provide the seed to cargo fuzz. Would you happen to know how to reproduce, or should we work with the provided debug output?

Have a lovely week 🦆

@lucascool12
Copy link
Author

Hey @Dr-Emann and @lucascool12 👋

I was looking at the cargo fuzz discovery and trying to reproduce it on my laptop. However, I haven't found a way to provide the seed to cargo fuzz. Would you happen to know how to reproduce, or should we work with the provided debug output?

Have a lovely week 🦆

I'm not sure how you can use the seed. But you can recreate the exact error from these lines:

artifact_prefix='/home/runner/work/roaring-rs/roaring-rs/fuzz/artifacts/against_croaring/'; Test unit written to /home/runner/work/roaring-rs/roaring-rs/fuzz/artifacts/against_croaring/crash-5496d655c8cc97f820e887fd9ab710723de50c8f
Base64: E7AI/1ZMKZMn7f/+/wMA//8ZmzvcTAEAFf+Tk5OTk531CJNh

You can use this command to recreate the crash file locally:

echo 'E7AI/1ZMKZMn7f/+/wMA//8ZmzvcTAEAFf+Tk5OTk531CJNh' | base64 -d > fuzz/artifacts/against_croaring/crash-5496d655c8cc97f820e887fd9ab710723de50c8f

Then you can run this exact crash using the suggested command in the logs

cargo fuzz run --sanitizer=none against_croaring fuzz/artifacts/against_croaring/crash-5496d655c8cc97f820e887fd9ab710723de50c8f

@Kerollmops
Copy link
Member

Thank you, @lucascool12,

It works. I investigated a bit and both serialization gives the same bitmaps. They compare equally.

I must investigate further to understand the reason why the serialization differs. I'll bet on either an empty container or a container with a different type 🤔

@Dr-Emann
Copy link
Member

Dr-Emann commented May 9, 2025

You can run with fuzz run --sanitizer=address against_croaring -- -help=1 to see the libfuzzer help, to set the seed, you can run with -- -seed=12345, but like @lucascool12 said, using the base64 to recreate the crashing file is the better way.

@Dr-Emann
Copy link
Member

Dr-Emann commented May 9, 2025

The difference appears to be in the type of container used.

roaring-rs

image

CRoaring

image

It seems croaring chooses to make the first container a run container (with 1 run) vs roaring-rs makes the first container an array container (with 2 items). Both containers take up 2 u16s, so either is valid to pick.

@Kerollmops
Copy link
Member

Kerollmops commented May 9, 2025

Nice investigation @Dr-Emann and thanks for the help. What tool do you use to analyze serialized bitmaps?

Both containers take up 2 u16s, so either is valid to pick.

Maybe we can change the optimization condition so that roaring-rs decide to do a run as well?

@Dr-Emann
Copy link
Member

I wrote a Kaitai Struct definition for the format (RoaringBitmap/RoaringFormatSpec#17) and used it on https://ide.kaitai.io/devel/#, which gives that nice object-tree view of the parsed file format.

@Dr-Emann
Copy link
Member

@Kerollmops I think there's some confusion (or I'm confused), I think all the discussions about changing behavior is for the optimize function, which is newly introduced by this PR, so there shouldn't be any breaking changes from that.

I just opened an issue and a PR against croaring-rs, the docs were just wrong, its behavior exactly matches the CRoaring behavior (because it's a thin wrapper).

FWIW, I don't have any strong opinions on the better meaning of the return value of the optimize function, either "true if modified" or "true of any run containers", I struggle to really think of a usecase that could benefit from knowing either of those bits of info after calling the optimize function. I think the "true if modified" is the more standard type of API in general, so I lean slightly toward that.

@Dr-Emann
Copy link
Member

Wait, what I said was false:

Both containers take up 2 u16s, so either is valid to pick.

That's not true, a run container is a u16 number of runs, then 2 u16s per run, so an array container should be chosen. It looks like CRoaring has a bug:

https://github.com/RoaringBitmap/CRoaring/blob/37c0388f475dd42275661127917795c75f73bc50/include/roaring/containers/array.h#L165-L167

This isn't correct, an array container only takes card * sizeof(u16) bytes.

However, even if that's fixed, we still have a problem, because of this:
https://github.com/lucascool12/roaring-rs/blob/b88e25824639ff37e20f8e0465a0552f03042be1/roaring/src/bitmap/container.rs#L214

When there is a point where two container types are equally valid, both roaring-rs and CRoaring both default to leaving the container as is. In order to match serialization exactly in all cases, we would need to always create the same container types with the same operations.

@lucascool12 do you think that's feasible? If not, we can relax the fuzzing to not require serializations to exactly match.

@lucascool12
Copy link
Author

@Dr-Emann

When there is a point where two container types are equally valid, both roaring-rs and CRoaring both default to leaving the container as is. In order to match serialization exactly in all cases, we would need to always create the same container types with the same operations.

@lucascool12 do you think that's feasible? If not, we can relax the fuzzing to not require serializations to exactly match.

Unless I misunderstand Roaring bitmaps, for any set of u16 a container will either be a bitmap or an array, where the choice between the two is based on the amount of values right? But run containers are allowed to replace both, depending on implementation details. Unless one calls (run_)optimize then all containers that can be represented more efficiently by run containers will be represented by run containers.

As such shouldn't the optimize and remove runs method always produce the same results (barring bugs)?

@lucascool12
Copy link
Author

I think inserting with insert_range (especially as a new container) should probably try to make a range container if it's a big enough range. It would be nice if bitmap.insert_range(0, HUGE_NUMBER) was efficent.

I have implemented this based on CRoaring's implementation in eff381a.

@Dr-Emann
Copy link
Member

I think the important factor is that:

Unless one calls (run_)optimize then all containers that can be represented more efficiently by run containers will be represented by run containers.

There are cases where a container can be represented equally efficiently as either a range, or an {array/bitmap}. Both implementations (correctly imo) default to leaving the existing container type when converting to/from a run container is not strictly more efficient.

Therefore, in these cases, the result of (run_)optimize/remove_run_compression depends on which of the two equally valid container types was already there.

e.g.

for the Roaring Bitmap containing [0, 1, 2], it could be represented in two ways

runs: [0..=2] # serialized size `2 + (1 * 4) = 6`
array: [0, 1, 2] # serialized size `3 * 2 = 6`

So e.g. both implementations have to match on the result type of container for all operations for all container types, e.g. run[0..=8] ^ array[3,4,5,6,7,8] needs to have the same container type in both implementations if we want to be able to guarantee they always serialize the same, even after doing (run_)optimize

@lucascool12
Copy link
Author

e.g.

for the Roaring Bitmap containing [0, 1, 2], it could be represented in two ways

runs: [0..=2] # serialized size `2 + (1 * 4) = 6`
array: [0, 1, 2] # serialized size `3 * 2 = 6`

So e.g. both implementations have to match on the result type of container for all operations for all container types, e.g. run[0..=8] ^ array[3,4,5,6,7,8] needs to have the same container type in both implementations if we want to be able to guarantee they always serialize the same, even after doing (run_)optimize

I see. I don't think it is feasible to ensure we also produces runs in the exact same situations as CRoaring.
CRoaring presumably doesn't make any promises about which operations automatically produce runs. so a minor version bump in CRoaring might make our fuzz ci fail, which is not desirable.

Relaxing the serialization comparison would be the best option we have.

@lucascool12
Copy link
Author

Couldn't we call remove_run_compression before (run_)opitimize to ensure we always have the same Roaring bitmap?
And also the other way around?

@lucascool12
Copy link
Author

Couldn't we call remove_run_compression before (run_)opitimize to ensure we always have the same Roaring bitmap? And also the other way around?

I ran the fuzzer with the following patch applied on croaring-rs and found nothing after letting it run for 45 minutes. Yeey!

diff --git i/croaring-sys/CRoaring/roaring.c w/croaring-sys/CRoaring/roaring.c
index d49cda5..ba61acb 100644
--- i/croaring-sys/CRoaring/roaring.c
+++ w/croaring-sys/CRoaring/roaring.c
@@ -1494,7 +1494,7 @@ bool array_container_validate(const array_container_t *v, const char **reason);
  * Return the serialized size in bytes of a container having cardinality "card".
  */
 static inline int32_t array_container_serialized_size_in_bytes(int32_t card) {
-    return card * 2 + 2;
+    return card * 2;
 }
 
 /**

@Kerollmops
Copy link
Member

Hey @lucascool12 and @Dr-Emann 👋

I hope you're good 😊 I was wondering if the final change we want to merge this PR is to merge RoaringBitmap/CRoaring#702? And if so, what's actually missing for it to be merged?

Have a nice day 🥬

@lucascool12
Copy link
Author

Hey @lucascool12 and @Dr-Emann 👋

I hope you're good 😊 I was wondering if the final change we want to merge this PR is to merge RoaringBitmap/CRoaring#702? And if so, what's actually missing for it to be merged?

Have a nice day 🥬

I noticed that Interval assumes self.start <= self.end but this is very weakly enforced right now. I'll change this by making the new function return an option and add a new_unchecked. Lastly, I'm going to review my own code one more time, resolve anything I find that is unsatisfactory. And then this PR will be completely ready from my end.

Also I think we are all in favour of the current semantics of optimize even though it is different from croaring's run_optimize, correct? And as @Dr-Emann said since optimize didn't exist previously adding a breaking change in this PR is a bit odd. Maybe we should remove the breaking label?

@Dr-Emann
Copy link
Member

Did find something in fuzzing:

Fuzz input
FuzzInput {
    ops: [
        MutateLhs(
            Extend(
                [
                    Num(
                        97619,
                    ),
                    Num(
                        97917,
                    ),
                    Num(
                        97661,
                    ),
                    Num(
                        77184,
                    ),
                    Num(
                        72989,
                    ),
                    Num(
                        70941,
                    ),
                    Num(
                        104237,
                    ),
                ],
            ),
        ),
        SwapSides,
        MutateLhs(
            InsertRange(
                Num(
                    72981,
                )..=Num(
                    72989,
                ),
            ),
        ),
        Binary(
            Xor,
        ),
        Binary(
            Or,
        ),
        MutateLhs(
            RemoveRunCompression,
        ),
    ],
    initial_input: [],
}

Base64: A319fX3Fl4eDfX19U1N9fn19fX19fX0tgB0VHR2NHRUdHcWXLYAdFR0dxZd9fX19fS2AHRUdHY0dFR0BAAAAHR1TyUEABR0VHR3Fl5eXl5dw5VNTyTA=

Looking a bit closer at https://github.com/lucascool12/roaring-rs/blob/c3ebe863e377b58a0732f0ba27da13dc8a1b987f/fuzz/fuzz_targets/arbitrary_ops/mod.rs#L280-L282

x.run_optimize();
y.optimize();
assert_eq!(x.remove_run_compression(), y.remove_run_compression());

I don't think we can do that assert: If we've got a bitmap that can be either a bitmap or {array/bitmap}, the optimize call won't do anything, e.g. croaring could have a run container, roaring could have an array container, so removing runs will return true for croaring, false for roaring.

Think we could either just not check the return values, or we could use the statistics call to check if the type of containers have changed, rather than comparing with if the croaring bitmap changed.

@lucascool12
Copy link
Author

Think we could either just not check the return values, or we could use the statistics call to check if the type of containers have changed, rather than comparing with if the croaring bitmap changed.

So using a statistics call before and after and then checking no run containers exist?

I tried adding x.remove_run_compression(); and y.remove_run_compression(); before the optimize calls, this works for this crash. And unless I'm missing something this should always result in the same result no?

@Kerollmops Kerollmops removed the breaking This change will require a bump of the minor or major version. label May 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants