Computing KMeans for datasets that don't fit in memory #3586
nicolas-dufour started this conversation in General
Replies: 2 comments
-
If you have a very large dataset, it probably means that you also have many centroids. In that case you probably want to use a distributed k-means implementation; see the distributed k-means example in the Faiss repository.
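For intuition on why the distributed approach scales, one Lloyd iteration can be split into independent per-shard passes (assignment plus partial centroid sums) followed by a cheap reduction. The sketch below only illustrates that decomposition on a single machine; it is not Faiss's actual distributed implementation, and the dimension, centroid count and shard layout are made-up assumptions.

```python
# Illustrative sketch only, not Faiss's distributed implementation:
# one Lloyd iteration splits into per-shard work (assignment + partial sums)
# and a cheap reduction, which is what lets k-means scale across machines.
# d and k are made-up values; shards and centroids are float32, C-contiguous arrays.
import numpy as np
import faiss

d, k = 512, 100_000   # assumed embedding dimension / number of centroids

def shard_stats(shard, centroids):
    """Assign one shard to the current centroids; return partial sums and counts."""
    index = faiss.IndexFlatL2(d)
    index.add(centroids)
    _, assign = index.search(shard, 1)
    assign = assign.ravel()
    sums = np.zeros((k, d), dtype=np.float64)
    np.add.at(sums, assign, shard)                  # per-centroid vector sums
    counts = np.bincount(assign, minlength=k).astype(np.float64)
    return sums, counts

def lloyd_iteration(shards, centroids):
    """One k-means iteration over shards that never coexist in memory."""
    sums = np.zeros((k, d), dtype=np.float64)
    counts = np.zeros(k, dtype=np.float64)
    for shard in shards:   # in a distributed run, each shard lives on its own worker
        s, c = shard_stats(shard, centroids)
        sums += s          # the reduction is tiny compared to the assignment step
        counts += c
    nonempty = counts > 0
    centroids[nonempty] = (sums[nonempty] / counts[nonempty, None]).astype(np.float32)
    return centroids
```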
-
Thanks for the suggestion @mdouze! Do you have a sense of the data/centroid scale at which the distributed method becomes preferable? I was planning to cluster around 10M-20M embeddings with 50K to 100K centroids.
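For a rough sense of the scale in question, a back-of-the-envelope estimate helps gauge whether a single machine is enough. The dimension is not given in the thread, so d = 512 below is an assumption.

```python
# Back-of-the-envelope estimate for the numbers mentioned above.
# d = 512 is an assumption; the thread does not state the embedding dimension.
n, k, d = 20_000_000, 100_000, 512

dataset_gib   = n * d * 4 / 2**30    # float32 dataset size    -> ~38 GiB
centroids_mib = k * d * 4 / 2**20    # float32 centroid table  -> ~200 MiB
flops_per_it  = 2 * n * k * d        # brute-force assignment  -> ~2e15 FLOPs per iteration

print(f"dataset: {dataset_gib:.0f} GiB, centroids: {centroids_mib:.0f} MiB, "
      f"assignment: {flops_per_it:.1e} FLOPs/iteration")
```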
-
Hi,
Looking at the docs, I see that training faiss.Kmeans requires passing the full dataset as a single array.
Is it possible to feed the data in batches when the full dataset cannot be loaded into memory, or to provide a PyTorch DataLoader?
Otherwise, what is the recommended way to handle very large datasets?
Thanks!
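One common workaround, sketched below, is to train on a random subsample that fits in RAM and then assign the full dataset to the trained centroids in batches; Faiss's k-means subsamples its training set to max_points_per_centroid (256 by default) anyway, so a subsample usually loses little. The file name, dimensions and sizes are placeholders, and the embeddings are assumed to be stored as a float32 .npy file.

```python
# A minimal sketch (a workaround, not an official Faiss out-of-core API):
# train on a subsample that fits in RAM, then assign the full dataset in batches.
# "embeddings.npy", d, k and the sizes below are made-up values.
import numpy as np
import faiss

d = 512            # assumed embedding dimension
k = 50_000         # assumed number of centroids

# Memory-map the dataset so vectors are only paged in when touched.
x_all = np.load("embeddings.npy", mmap_mode="r")    # shape (n, d), float32
n = x_all.shape[0]

# k-means does not need every point for training: pick a subsample whose
# size (n_train * d * 4 bytes) fits comfortably in memory.
n_train = min(n, 64 * k)
train_idx = np.sort(np.random.choice(n, size=n_train, replace=False))
x_train = np.ascontiguousarray(x_all[train_idx], dtype=np.float32)

kmeans = faiss.Kmeans(d, k, niter=20, verbose=True)
kmeans.train(x_train)

# Assign every vector to its nearest centroid in batches that fit in memory.
assignments = np.empty(n, dtype=np.int64)
batch = 1_000_000
for start in range(0, n, batch):
    stop = min(start + batch, n)
    chunk = np.ascontiguousarray(x_all[start:stop], dtype=np.float32)
    _, ids = kmeans.index.search(chunk, 1)
    assignments[start:stop] = ids.ravel()
```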