Parallelize arroy again* by irevoire · Pull Request #130 · meilisearch/arroy

irevoire · 2025-05-28T09:58:15Z

What changed

Since this is almost a complete rewrite of the indexing algorithm, once again, I'll try to detail everything I did here to help the reviewer, but more information is available below on the "why".

The main idea of this rework is to remove all the writes to the database while we're indexing. This means we don't have to refresh the leaves and tree nodes during the whole indexing process
Since we no longer lose access to the leaves and tree nodes, it also means we can now parallelize arroy.

An almost exhaustive list of the new component I had to develop to make the algorithm work:

Now we can read and write in a temporary file
- The writes always happen at the end of the file and thus are efficient with a bufwriter
- The reads are always looking at the last tree built, so overall, by looking at the file from the end, we're guaranteed that the worst case scenario is reaching an element that's at the end of the tree we just built
- When reading we're seeking + reading, there are no tricks involved to make that faster, but in my benchmarks, I never saw anything related to that, so I believe it's efficient enough
Instead of creating one temporary file by tree like we were doing before, we're now creating one file per thread (thanks to the thread_local crate
Instead of parallelizing at the tree level like before, all the threads are now enqueuing a task for each of the large descendants. Rayon is used a lot.

How does an indexing process work now:

Retrieve the items to remove and remove them immediately with the rwtxn
Insert the new items into the current trees without creating new nodes. The descendants to update are stored in RAM. The parallelization is made at the tree level:
We start by taking as many elements as possible that fits in RAM
Every thread inserts these elements in their tree
We wait for all threads to finish and then select a new batches of items to insert
Should we try to change that to instead give memory/nb_threads RAM to each threads and then let them do their loops on their side without any synchronization? It means we would load way less elements than what we could in reality and multiply the number of time we have to traverse our tree but maybe it's not an issue?
From this insertion, we've extracted a list of descendants to update
We loop over these descendants, and when a descendant is too large to be stored as-is, we spawn a new tokio task that'll be in charge of creating a tree from this descendant
All the threads start processing the descendants too large asap, the steps to process a descendant are the following:
Create a new tree from this descendant in a tmp file using as much memory as possible (total memory divided by the number of threads)
Insert all the remaining items that are too large in batches in the tree while reading the tree from the tmp file. The descendants are stored in RAM
Iterate over the list of descendants. When a descendant is large, enqueue a new task in rayon; otherwise, store it in the file

Some notes:

Everything we see below was made before merging Add bias info to SplitPlaneNormal #132 to keep the benchmarks fair
I want to merge this PR as-is, but there will be a few follow-ups before the release that should improve the performance even more: https://github.com/meilisearch/arroy/milestone/5
To handle errors in rayon, when an error occurs, the thread encountering the error writes it to a channel. All the other threads must stop all work as soon as the channel contains something. I had to use a crossbeam channel for that but it doesn't add dependencies as rayon was already using crossbeam

Context

After reducing the number of writes we do to LMDB, here's a view of where the time is lost during an indexing process.

47% of the time is spent in building the tree. In this function, almost all the time is spent on computing the Cosine distance. We can't really optimize that more, but it's the part we could multi-thread in the future.
27% of the time is spent refreshing our leaves after a write to LMDB. That's also synchronization points where we cannot properly multi-thread
An additional 4.5%+2.1% are lost creating and removing TmpFile during the loop

In this PR, I'll get rid of all the writes we do in LMDB and instead read my own TmpFile.
That means we don't have to refresh the leaves anymore, saving 27% of the time. Probably also the 6% creating and removing temp files, as we will have only one file per thread instead.

That should also open the way to better parallelization later on as we won't have any common state between the threads. Only the queue of large descendants to explode will be shared behind a mutex where each thread can pop and push I guess 🤔

After first implementation

I ran some benchmarks on my MacBook Pro with unlimited RAM and here are the results:

version	100k vectors	10k * 10 vectors
v0.6.1
This PR

I didn't save the previous benchmarks I've done, sadly but this PR is about 5 times faster than right before the PR, which means my 10 cores are being used but not at their full potential
I would say we're around twice slower than the v0.6.1 (which is the fastest version we ever made yet)
The relevancy increased a bit, but I don't know why yet

It needs more profiling

First investigation

The more chunks we index and the more time is spent in rayon... waiting

Side quest

Before each chunk starts processing, we see a "tail" where we're barely doing anything.
That's the time it takes to check if any of the updated items have been removed. The bigger the database is and the most time it takes. On the last batch it takes 1.6s to insert 10k items in a 90k items database where there is actually nothing to remove.

Fixed in 9c70222

  => Vectors:         77108,   154216,   231324,   308432,   385540,   462648,   539756,   616864,   693972,   771080,   771089
  => Before:          93.10s,   62.29s,   62.89s,   67.26s,   65.28s,   74.95s,   78.85s,   84.32s,   85.17s,   89.61s,   34.91s
  => After:           87.72s,   57.98s,   57.70s,   65.85s,   58.20s,   66.44s,   70.30s,   68.51s,   66.90s,   75.74s,   14.60s

The bigger the database, the bigger the gain basically

In the end why are we still worse than v0.6.1

After a lot of profiling where nothing really stands out I finally noticed that the main difference between 0.6.1 and this PR.
It's just the number of trees generated.
In #105 (which was merged after the 0.6.1 but was never released officially), I tried to guess the number of trees we would generate ahead of time.
Looks like I was bad at it, and now I end up generating between 2-3 times more trees than the 0.6, which also explains why the relevancy was better, I guess, even if I don't understand why it's also better than main.
By fixing the number of trees to, let's say, 300. Here are the results of this PR against 0.6 with 10 cores:

Both are really close to each other, with no clear winner.
It's hard to see on the chart, but it seems that 0.6.1 is still twice as fast as this PR when it comes to inserting very few elements in the database with a lot of threads.

With 2 threads available it's similar and this PR ends up being quicker to process very few elements:

In conclusion, I'll rebase on main, implement the error handling, and we'll be able to merge as-is.
Then we'll absolutely need to work on #134

nnethercott · 2025-05-28T12:31:32Z

@irevoire just curious, what tool are you using to profile this ?

irevoire · 2025-05-28T13:09:27Z

It's instruments but it's only available on macOS.
Here's the view it gives:

On Linux, I was using valgrind+cachegrind and was visualizing the output with kcachegrind.

irevoire · 2025-06-12T13:45:05Z

Something else I noticed is that 12% of the time is spent on this function:

We can probably optimize that a bit by sharing the allocation between the successive calls or at least specifying the capacity if that helps (but rust should be able to guess it in this case 🤔)

irevoire · 2025-06-12T14:39:33Z

Once merged I'll need to re-implement the progress properly.

An idea I just had is to precompute the total number of items we must insert (nb_trees * to_insert) and then decrement this number in parallel every time we write a descendant.
It's easy, it won't really help us understand precisely what's going on, but it'll at least give a rough idea to the user that something is going on

…changed the rng and some snapshots

…oment. We don't want one thread to stay stuck forever

… run

Co-authored-by: Many the fish <many@meilisearch.com>

Remove a first batch of useless writes to tmp files and LMDB

ad16bd8

irevoire changed the title ~~Remove a first batch of useless writes to tmp files and LMDB~~ Remove writes to LMDB until the very end of the indexing process May 28, 2025

irevoire added 7 commits May 28, 2025 16:18

update the tmp_nodes to be readable

e9d2c64

Parallelize the split of large descendants

20b891e

randomize the item selection

fe56192

get rids of the remap method on the tmp files

ae9c4c2

makes the clippy gods happy

c118447

fmt

dc14f4b

Make the deletion of items faster

9c70222

irevoire mentioned this pull request Jun 12, 2025

Something else I noticed is that 12% of the time is spent on this function: #136

Open

irevoire mentioned this pull request Jun 12, 2025

Once merged I'll need to re-implement the progress properly. #137

Closed

Merge with main, it was too complex to rebase

5a6dcb3

irevoire changed the title ~~Remove writes to LMDB until the very end of the indexing process~~ Parallelize arroy Jun 12, 2025

irevoire changed the title ~~Parallelize arroy~~ Parallelize arroy again* Jun 12, 2025

irevoire added this to the v1.7.0 milestone Jun 12, 2025

irevoire added 2 commits June 12, 2025 19:38

fmt

ae21f60

make clippy happy

9c29c04

irevoire mentioned this pull request Jun 12, 2025

Implement Hamming distance for binary strings #124

Open

3 tasks

irevoire added 4 commits June 16, 2025 16:12

share the code between the writer and the incremental indexing. That …

a16fbdd

…changed the rng and some snapshots

handle error properly in the rayon pool

d4d646d

use try_send instead on send in case two errors happens at the same m…

c1874ff

…oment. We don't want one thread to stay stuck forever

fmt+lints

2face0e

irevoire marked this pull request as ready for review June 16, 2025 14:41

irevoire added indexing Everything related to indexing performance relevancy labels Jun 16, 2025

irevoire added 2 commits June 17, 2025 10:11

remove debug prints

5b8efc0

remove new_fit_in_memory

e9209c2

ManyTheFish requested changes Jun 17, 2025

View reviewed changes

Comment thread src/tests/writer.rs Outdated

Comment thread src/writer.rs

Comment thread src/writer.rs Outdated

Comment thread src/writer.rs Outdated

Comment thread src/writer.rs Outdated

Comment thread src/writer.rs Outdated

Comment thread src/writer.rs

irevoire added 2 commits June 17, 2025 14:37

fix panic when we have low memory

af2cdca

check for error when inserting elements in a tree

ecf0df4

irevoire mentioned this pull request Jun 17, 2025

insert_descendants_in_file_and_spawn_tasks is multi-threaded only when there is not enough RAM #139

Open

irevoire added 4 commits June 17, 2025 16:02

fix multiple bugs

5262b08

fix a new bug and add unit tests on fit_in_memory

fa62650

Add a large test with no memory to force the full indexing process to…

3a94056

… run

fmt

ccdafc2

irevoire requested a review from ManyTheFish June 17, 2025 15:38

ManyTheFish requested changes Jun 18, 2025

View reviewed changes

Comment thread src/writer.rs Outdated

Update src/writer.rs

29bc623

Co-authored-by: Many the fish <many@meilisearch.com>

irevoire requested a review from ManyTheFish June 18, 2025 09:17

fmt

e0d18a0

ManyTheFish approved these changes Jun 18, 2025

View reviewed changes

irevoire added this pull request to the merge queue Jun 18, 2025

Merged via the queue into main with commit d8caf55 Jun 18, 2025
15 checks passed

irevoire deleted the remove-sync-point branch June 18, 2025 09:50

irevoire mentioned this pull request Jun 20, 2025

Still too slow when there is not enough RAM #145

Open

irevoire mentioned this pull request Jul 7, 2025

Parallelize the incremental indexing of large descendants #118

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parallelize arroy again*#130

Parallelize arroy again*#130
irevoire merged 25 commits into
mainfrom
remove-sync-point

irevoire commented May 28, 2025 •

edited

Loading

Uh oh!

nnethercott commented May 28, 2025

Uh oh!

irevoire commented May 28, 2025

Uh oh!

irevoire commented Jun 12, 2025

Uh oh!

irevoire commented Jun 12, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

irevoire commented May 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed

Context

After first implementation

First investigation

Side quest

In the end why are we still worse than v0.6.1

Uh oh!

nnethercott commented May 28, 2025

Uh oh!

irevoire commented May 28, 2025

Uh oh!

irevoire commented Jun 12, 2025

Uh oh!

irevoire commented Jun 12, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

irevoire commented May 28, 2025 •

edited

Loading