Computing KMeans for datasets that don't fit in memory #3586
nicolas-dufour started this conversation in General
Replies: 2 comments
-
If you have a very large dataset, it probably means that you also have many centroids. In that case you probably want to use a distributed k-means implementation; see the distributed k-means example in the Faiss repository.
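For intuition on why the distributed approach scales, one Lloyd iteration can be split into independent per-shard passes (assignment plus partial centroid sums) followed by a cheap reduction. The sketch below only illustrates that decomposition on a single machine; it is not Faiss's actual distributed implementation, and the dimension, centroid count and shard layout are made-up assumptions.

```python
# Illustrative sketch only, not Faiss's distributed implementation:
# one Lloyd iteration splits into per-shard work (assignment + partial sums)
# and a cheap reduction, which is what lets k-means scale across machines.
# d and k are made-up values; shards and centroids are float32, C-contiguous arrays.
import numpy as np
import faiss

d, k = 512, 100_000   # assumed embedding dimension / number of centroids

def shard_stats(shard, centroids):
    """Assign one shard to the current centroids; return partial sums and counts."""
    index = faiss.IndexFlatL2(d)
    index.add(centroids)
    _, assign = index.search(shard, 1)
    assign = assign.ravel()
    sums = np.zeros((k, d), dtype=np.float64)
    np.add.at(sums, assign, shard)                  # per-centroid vector sums
    counts = np.bincount(assign, minlength=k).astype(np.float64)
    return sums, counts

def lloyd_iteration(shards, centroids):
    """One k-means iteration over shards that never coexist in memory."""
    sums = np.zeros((k, d), dtype=np.float64)
    counts = np.zeros(k, dtype=np.float64)
    for shard in shards:   # in a distributed run, each shard lives on its own worker
        s, c = shard_stats(shard, centroids)
        sums += s          # the reduction is tiny compared to the assignment step
        counts += c
    nonempty = counts > 0
    centroids[nonempty] = (sums[nonempty] / counts[nonempty, None]).astype(np.float32)
    return centroids
```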
-
Thanks for the suggestion @mdouze! Do you have a sense of the data/centroid scale at which the distributed method becomes preferable? I was planning to cluster around 10M-20M embeddings with 50K to 100K centroids.
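For a rough sense of the scale in question, a back-of-the-envelope estimate helps gauge whether a single machine is enough. The dimension is not given in the thread, so d = 512 below is an assumption.

```python
# Back-of-the-envelope estimate for the numbers mentioned above.
# d = 512 is an assumption; the thread does not state the embedding dimension.
n, k, d = 20_000_000, 100_000, 512

dataset_gib   = n * d * 4 / 2**30    # float32 dataset size    -> ~38 GiB
centroids_mib = k * d * 4 / 2**20    # float32 centroid table  -> ~200 MiB
flops_per_it  = 2 * n * k * d        # brute-force assignment  -> ~2e15 FLOPs per iteration

print(f"dataset: {dataset_gib:.0f} GiB, centroids: {centroids_mib:.0f} MiB, "
      f"assignment: {flops_per_it:.1e} FLOPs/iteration")
```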
-
Hi,
Looking at the docs, I see that training faiss.Kmeans requires passing the full dataset as a single array.
Is it possible to feed the data in batches when the full dataset cannot be loaded into memory, or to provide a PyTorch DataLoader?
Otherwise, what is the recommended way to handle very large datasets?
Thanks!
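One common workaround, sketched below, is to train on a random subsample that fits in RAM and then assign the full dataset to the trained centroids in batches; Faiss's k-means subsamples its training set to max_points_per_centroid (256 by default) anyway, so a subsample usually loses little. The file name, dimensions and sizes are placeholders, and the embeddings are assumed to be stored as a float32 .npy file.

```python
# A minimal sketch (a workaround, not an official Faiss out-of-core API):
# train on a subsample that fits in RAM, then assign the full dataset in batches.
# "embeddings.npy", d, k and the sizes below are made-up values.
import numpy as np
import faiss

d = 512            # assumed embedding dimension
k = 50_000         # assumed number of centroids

# Memory-map the dataset so vectors are only paged in when touched.
x_all = np.load("embeddings.npy", mmap_mode="r")    # shape (n, d), float32
n = x_all.shape[0]

# k-means does not need every point for training: pick a subsample whose
# size (n_train * d * 4 bytes) fits comfortably in memory.
n_train = min(n, 64 * k)
train_idx = np.sort(np.random.choice(n, size=n_train, replace=False))
x_train = np.ascontiguousarray(x_all[train_idx], dtype=np.float32)

kmeans = faiss.Kmeans(d, k, niter=20, verbose=True)
kmeans.train(x_train)

# Assign every vector to its nearest centroid in batches that fit in memory.
assignments = np.empty(n, dtype=np.int64)
batch = 1_000_000
for start in range(0, n, batch):
    stop = min(start + batch, n)
    chunk = np.ascontiguousarray(x_all[start:stop], dtype=np.float32)
    _, ids = kmeans.index.search(chunk, 1)
    assignments[start:stop] = ids.ravel()
```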