Replies: 4 comments
-
hi @rafayaar, I suggest you try training the index on a subset and check whether the performance is actually as poor as you expect. The train index method will select a subsample of the data anyway if you pass too much. See
-
Hi @mlomeli1
-
How do you sample the subset? It can represent the whole dataset if the sampling scheme yields a representative sample, @rafayaar.
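One common scheme consistent with the point above is uniform random sampling without replacement, which is usually representative enough for k-means training when the vectors have no special ordering; the sizes below are small stand-ins:

```python
# Sketch: draw a uniform random subsample for index training.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1_000_000, 1024     # stand-ins for the real 100M x 1024 dataset
sample_size = 50_000

# Without replacement, every vector is equally likely to be picked once.
idx = rng.choice(n, size=sample_size, replace=False)
# train_sample = xb[idx]   # rows you would feed to index.train()
```

If the data on disk is grouped (e.g. by source or time), shuffling or stratifying before sampling avoids a biased training set.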
-
Split the 0.1B × 1024 dataset into 8 × (0.1B × 128), i.e. eight small datasets.
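The column split suggested above can be sketched with a small stand-in array; each 128-d block could then get its own index (this is an interpretation of the suggestion, not an established Faiss recipe):

```python
# Sketch: split one (n, 1024) matrix into eight (n, 128) column blocks.
import numpy as np

xb = np.random.rand(1000, 1024).astype("float32")  # stand-in for 0.1B rows
blocks = np.split(xb, 8, axis=1)                   # eight (1000, 128) arrays
```

Note that searching eight 128-d indexes separately does not give the same results as one 1024-d index, so the merged results would need re-ranking against the full vectors.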
-
A Bit of Background / Goal
I have 100+ million vectors of dimension 1024 that I want to fit into a Faiss index.
Obviously, I can't use a Flat index because it would have an immense memory footprint, and search (and even building the index) would take forever.
According to the documentation, I suppose I have to use an IVF or PCA index and train it on my dataset.
Concern
When creating an index using IVFPQ, we first have to TRAIN the index.
MY CONCERN IS that for training I obviously cannot fit all 100M vectors in RAM. Training has to happen on a small chunk, and then the remaining chunks are read iteratively and ADDed to the index.
I don't think training on only a small chunk (say, 5M) and generalizing it over the remaining 95+ million would give me good results.
I need help finding an approach to train/add 100M (1024-d) vectors into a Faiss index.
Platform
OS:
Faiss version:
Installed from:
Faiss compilation options:
Running on:
Interface: