Replies: 5 comments 1 reply
This is surprising. What is the initial data dimensionality? Would it be possible to share the vectors?
Hey there! Thanks for taking a look!

The vectors have shape:

Here's a file containing a small subset (100) of the vectors: 2020_12_12_subset.fvecs.zip. Here is a link to a file containing the entire set of vectors added to a single index (one index per month of data). The training set is generated by randomly sampling across all monthly files, until we have between

Here is a single vector (before normalizing):
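For reference, this is roughly how the shared .fvecs file can be read and normalized the same way as before indexing (a minimal sketch assuming the standard fvecs layout, i.e. an int32 dimension followed by that many float32 values per record; the extracted filename below is assumed from the zip name above):

```python
import numpy as np
import faiss

def read_fvecs(path):
    # .fvecs layout: each record is an int32 dimension d followed by d float32 components
    data = np.fromfile(path, dtype=np.float32)
    d = data[:1].view(np.int32)[0]            # reinterpret the first 4 bytes as the dimension
    return data.reshape(-1, d + 1)[:, 1:].copy()

vecs = read_fvecs("2020_12_12_subset.fvecs")  # assumed name, extracted from the zip above
faiss.normalize_L2(vecs)                      # in-place L2 normalization, as done before indexing/searching
print(vecs.shape)
```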
Hi there @mdouze! Just wondering if you might have had a chance to take another look at this? Cheers!
Have you had any success with this problem?
@leothomas How did you end up handling these datasets?
Summary
Hey there!
I'm looking into switching from the L2 distance metric to the inner product (IP) distance metric. Our current index is built using PCA + IVFFlat (`PCA128,IVF{K},Flat`) with L2 distance, where the input vectors have dimension 512 and the number of IVF centroids K is chosen such that `4*sqrt(N) < K < 16*sqrt(N)` (N = number of vectors indexed). Compared to a `Flat` index, this index reaches a kNN intersection measure of 0.96 @ rank 100.

However, when I build the same index with the inner product distance metric (vectors are normalized prior to training, adding to the index, and searching, for both the L2 and IP metrics), I only get a kNN intersection measure of 0.019 @ rank 100. Setting `nprobe` to the number of centroids (to mimic a flat search) actually reduces the kNN intersection measure to 0.003 @ rank 100. Without the PCA pre-processing, the IVF index with the inner product distance metric has a kNN intersection measure of 0.97 @ rank 100, which is ideal, but the index is simply much too big to hold in memory.
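Roughly, the comparison I'm running looks like this (a minimal sketch: the random `xt` / `xb` / `xq` arrays, the dataset sizes, and `nprobe = 64` below are placeholders, not our real data or settings):

```python
import faiss
import numpy as np

d, k = 512, 100                                            # vector dimensionality, intersection rank
rng = np.random.default_rng(0)
xt = rng.standard_normal((50_000, d), dtype=np.float32)    # random stand-ins for the real
xb = rng.standard_normal((100_000, d), dtype=np.float32)   # training / database / query vectors
xq = rng.standard_normal((1_000, d), dtype=np.float32)
for x in (xt, xb, xq):
    faiss.normalize_L2(x)                                  # normalize before training, adding and searching

K = int(8 * np.sqrt(len(xb)))                              # 4*sqrt(N) < K < 16*sqrt(N)

# exact inner-product ground truth from a flat index
flat = faiss.IndexFlatIP(d)
flat.add(xb)
_, gt = flat.search(xq, k)

# PCA down to 128 dims, then IVF with flat storage, inner-product metric
index = faiss.index_factory(d, f"PCA128,IVF{K},Flat", faiss.METRIC_INNER_PRODUCT)
index.train(xt)
index.add(xb)
faiss.ParameterSpace().set_index_parameter(index, "nprobe", 64)   # nprobe = K would mimic a flat search
_, I = index.search(xq, k)

# kNN intersection measure @ rank k: mean overlap with the exact neighbour lists
inter = np.mean([len(set(I[i]) & set(gt[i])) / k for i in range(len(xq))])
print(f"kNN intersection @ rank {k}: {inter:.3f}")
```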
Is there some sort of fundamental incompatibility between PCA pre-processing and the inner product distance metric?
I was able to achieve excellent compression and a kNN intersection measure of ~0.70 @ rank 100 with the `OPQ{M}_{D},IVF{K},PQ{M}` and `OPQ{M}_{D},IVF{K}_HNSW32,PQ{M}` indexes with the inner product distance metric. Are there any other indexing recommendations for pre-processing, coarse or fine quantization, or even search-time parameters (`efSearch`, etc.) that might work better with the inner product distance metric?

Thanks again for taking a look at this!
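For reference, the OPQ variants mentioned above look roughly like this as `index_factory` calls (a minimal sketch; `M = 64`, `D = 256` and `K = 4096` are illustrative values, not necessarily the ones we use):

```python
import faiss

d, M, D, K = 512, 64, 256, 4096   # input dim, PQ sub-quantizers, OPQ output dim, IVF centroids (illustrative)

# OPQ rotation + IVF coarse quantizer + PQ codes, inner-product metric
opq_ivf_pq = faiss.index_factory(d, f"OPQ{M}_{D},IVF{K},PQ{M}", faiss.METRIC_INNER_PRODUCT)

# same, but the coarse-quantizer centroids are searched with HNSW for faster assignment
opq_ivf_hnsw_pq = faiss.index_factory(d, f"OPQ{M}_{D},IVF{K}_HNSW32,PQ{M}", faiss.METRIC_INNER_PRODUCT)

# both are then trained / filled / queried like the index in the sketch above:
#   idx.train(xt); idx.add(xb); D_, I_ = idx.search(xq, 100)
# efSearch of the HNSW coarse quantizer can be tuned at search time, e.g.
#   faiss.ParameterSpace().set_index_parameter(opq_ivf_hnsw_pq, "quantizer_efSearch", 64)
```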
Platform
OS: macOS 13.0.1
Faiss version: 1.7.3
Installed from: `pip install 'faiss-cpu==1.7.3'`
Faiss compilation options:
Running on:
Interface:
Reproduction instructions