Parallelize arroy again #130
Merged
@irevoire just curious, what tool are you using to profile this?
On Linux, I was using valgrind+cachegrind and visualizing the output with kcachegrind.
Once merged I'll need to re-implement the progress properly. An idea I just had is to precompute the total number of items we must insert (nb_trees * to_insert) and then decrement this number in parallel every time we write a descendant.
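A minimal sketch of that progress idea, assuming a plain atomic counter is enough. The `Progress` type, its methods, and the `simulate` helper are hypothetical names for illustration and are not part of arroy:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical progress tracker: starts at `nb_trees * to_insert`
/// and is decremented by the workers as they write descendants.
pub struct Progress {
    total: usize,
    remaining: AtomicUsize,
}

impl Progress {
    pub fn new(nb_trees: usize, to_insert: usize) -> Self {
        let total = nb_trees * to_insert;
        Progress { total, remaining: AtomicUsize::new(total) }
    }

    /// Called by a worker right after it wrote a descendant holding `items` items.
    pub fn descendant_written(&self, items: usize) {
        self.remaining.fetch_sub(items, Ordering::Relaxed);
    }

    pub fn remaining(&self) -> usize {
        self.remaining.load(Ordering::Relaxed)
    }

    /// Fraction of the work done so far, between 0.0 and 1.0.
    pub fn fraction_done(&self) -> f64 {
        if self.total == 0 {
            return 1.0;
        }
        (self.total - self.remaining()) as f64 / self.total as f64
    }
}

/// Toy simulation (assumption: one thread per tree, each writing
/// descendants of at most `chunk` items until `to_insert` items are placed).
pub fn simulate(nb_trees: usize, to_insert: usize, chunk: usize) -> Progress {
    let progress = Progress::new(nb_trees, to_insert);
    thread::scope(|s| {
        for _ in 0..nb_trees {
            s.spawn(|| {
                let mut left = to_insert;
                while left > 0 {
                    // `max(1)` guards against an endless loop when `chunk` is 0.
                    let written = left.min(chunk.max(1));
                    progress.descendant_written(written);
                    left -= written;
                }
            });
        }
    });
    progress
}
```

Since the counter is only decremented and never used to synchronize other data, relaxed ordering should be sufficient here; a reporting thread could poll `fraction_done()` from time to time without slowing the workers down.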
…changed the rng and some snapshots
…oment. We don't want one thread to stay stuck forever
ManyTheFish requested changes on Jun 17, 2025
ManyTheFish requested changes on Jun 18, 2025
Co-authored-by: Many the fish <many@meilisearch.com>
ManyTheFish approved these changes on Jun 18, 2025

What changed
Since this is almost a complete rewrite of the indexing algorithm, once again, I'll try to detail everything I did here to help the reviewer, but more information is available below on the "why".
The main idea of this rework is to remove all the writes to the database while we're indexing. This means we don't have to refresh the leaves and tree nodes during the whole indexing process.
Since we no longer lose access to the leaves and tree nodes, it also means we can now parallelize arroy.
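To illustrate why dropping the writes unlocks parallelism, here is a rough sketch, not arroy's actual code. It assumes an in-memory `BTreeMap` as a stand-in for the LMDB database (`FakeDb`, `compute_in_parallel`, and `write_results` are hypothetical names), and the per-item "work" is just a sum of the vector components:

```rust
use std::collections::BTreeMap;
use std::thread;

/// Stand-in for the database. This is an assumption for the sketch,
/// NOT the LMDB/heed API used by arroy.
pub type FakeDb = BTreeMap<u32, Vec<f32>>;

/// Read phase: every worker only reads `db` and fills its own local buffer,
/// so the threads share no mutable state at all.
pub fn compute_in_parallel(db: &FakeDb, nb_threads: usize) -> Vec<Vec<(u32, f32)>> {
    let ids: Vec<u32> = db.keys().copied().collect();
    let nb_threads = nb_threads.max(1);
    // Ceiling division, with a minimum of 1 because `chunks(0)` panics.
    let chunk_size = ((ids.len() + nb_threads - 1) / nb_threads).max(1);

    thread::scope(|s| {
        let handles: Vec<_> = ids
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        // Placeholder computation standing in for the real indexing work.
                        .map(|id| (*id, db[id].iter().sum::<f32>()))
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

/// Write phase: a single writer applies all the buffers once the threads are done.
pub fn write_results(out: &mut BTreeMap<u32, f32>, buffers: Vec<Vec<(u32, f32)>>) {
    for buffer in buffers {
        for (id, value) in buffer {
            out.insert(id, value);
        }
    }
}
```

The point of the split is that the immutable borrow of `db` can be handed to every thread for the whole read phase, which would not be possible if any of them had to write to it in between.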
An almost exhaustive list of the new components I had to develop to make the algorithm work:
How does an indexing process work now:
memory/nb_threads RAM to each thread and then let them do their loops on their side without any synchronization? It means we would load far fewer elements than we actually could and multiply the number of times we have to traverse our tree, but maybe it's not an issue?
Some notes:
Context
After reducing the number of writes we do to LMDB, here's a view of where the time is lost during an indexing process.
In this PR, I'll get rid of all the writes we do in LMDB and instead read my own TmpFile.
That means we don't have to refresh the leaves anymore, saving 27% of the time. Probably also the 6% spent creating and removing temp files, as we will have only one file per thread instead.
That should also open the way to better parallelization later on as we won't have any common state between the threads. Only the queue of large descendants to explode will be shared behind a mutex where each thread can pop and push I guess 🤔
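A possible shape for that shared queue, as a sketch only. It assumes a `Mutex` plus a `Condvar`, and it splits descendants in half as a placeholder for the real split; `explode`, `Shared`, and `max_leaf` are hypothetical names. The `busy` counter is there so that a thread seeing an empty queue can tell "nothing left" apart from "another thread is about to push more work", and does not stay stuck forever:

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};
use std::thread;

/// State shared by all workers, protected by a single mutex.
struct Shared {
    /// Descendants that are still too large and must be split.
    queue: VecDeque<Vec<u32>>,
    /// Number of workers currently processing a popped descendant.
    busy: usize,
}

/// Splits `items` into leaves of at most `max_leaf` items using `nb_threads` workers.
pub fn explode(items: Vec<u32>, max_leaf: usize, nb_threads: usize) -> Vec<Vec<u32>> {
    let max_leaf = max_leaf.max(1);
    let shared = Mutex::new(Shared { queue: VecDeque::from([items]), busy: 0 });
    let wakeup = Condvar::new();
    let leaves = Mutex::new(Vec::new());

    thread::scope(|s| {
        for _ in 0..nb_threads.max(1) {
            s.spawn(|| loop {
                // Pop a descendant, or stop once the queue is empty and nobody is busy.
                let popped = {
                    let mut guard = shared.lock().unwrap();
                    loop {
                        if let Some(descendant) = guard.queue.pop_front() {
                            guard.busy += 1;
                            break Some(descendant);
                        }
                        if guard.busy == 0 {
                            break None;
                        }
                        // Another worker may still push children: wait for it.
                        guard = wakeup.wait(guard).unwrap();
                    }
                };
                let Some(mut descendant) = popped else { break };

                if descendant.len() <= max_leaf {
                    leaves.lock().unwrap().push(descendant);
                    shared.lock().unwrap().busy -= 1;
                } else {
                    // Placeholder split: cut in half and push both halves back.
                    let right = descendant.split_off(descendant.len() / 2);
                    let mut guard = shared.lock().unwrap();
                    guard.queue.push_back(descendant);
                    guard.queue.push_back(right);
                    guard.busy -= 1;
                }
                // Wake the waiters so they can grab new work or notice we are done.
                wakeup.notify_all();
            });
        }
    });

    leaves.into_inner().unwrap()
}
```

The lock is only held while popping or pushing, never during the split itself, so contention should stay low as long as splitting a descendant is the expensive part.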
After first implementation
I ran some benchmarks on my MacBook Pro with unlimited RAM and here are the results:
It needs more profiling
First investigation
The more chunks we index, the more time is spent in rayon... waiting
Side quest
Before each chunk starts processing, we see a "tail" where we're barely doing anything.
That's the time it takes to check if any of the updated items have been removed. The bigger the database, the more time it takes. On the last batch it takes 1.6s to insert 10k items in a 90k-item database where there is actually nothing to remove.
Fixed in 9c70222

The bigger the database, the bigger the gain basically
In the end why are we still worse than v0.6.1
After a lot of profiling where nothing really stood out, I finally noticed the main difference between 0.6.1 and this PR.
It's just the number of trees generated.
In #105 (which was merged after the 0.6.1 but was never released officially), I tried to guess the number of trees we would generate ahead of time.
Looks like I was bad at it, and now I end up generating 2 to 3 times more trees than 0.6, which also explains why the relevancy was better, I guess, even if I don't understand why it's also better than main.
With the number of trees fixed to, let's say, 300, here are the results of this PR against 0.6 with 10 cores:
Both are really close to each other, with no clear winner.
It's hard to see on the chart, but it seems that 0.6.1 is still twice as fast as this PR when it comes to inserting very few elements in the database with a lot of threads.
With 2 threads available it's similar and this PR ends up being quicker to process very few elements:

In conclusion, I'll rebase on main, implement the error handling, and we'll be able to merge as-is.
Then we'll absolutely need to work on #134