-
Notifications
You must be signed in to change notification settings - Fork 327
[py-tx] Implementation of IVF faiss indices in PDQ #1756
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @b8zhong! Thanks again for contributing, I am very excited about this work, and so glad it's got you on it.
I suspect this change is much simpler than you might think! You have added a new wrapping class to the heirarchy, but I don't think we need it.
Is it possible to make the following changes:
- Migrate the index selection logic from
PDQSignalTypeIndex2.build
toPDQIndex2.build
. - Take a look at the existing tests for pdqindex2 and see if you can re-use them by just changing the index you pass in
- See if there's anything left in
PDQSignalTypeIndex2
.
python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py
Outdated
Show resolved
Hide resolved
@@ -0,0 +1,184 @@ | |||
import typing as t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about these tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean python-threatexchange/threatexchange/signal_type/tests/test_pdq_index2.py
?
python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py
Outdated
Show resolved
Hide resolved
Didn't forget ab this.. I'll check it out tn; thanks for the review! |
Think it's ready to get a second look now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, I had unsubmitted comments!
This looks like it won't break any of the existing code, what I'd like to see is some results from benchmarking to prove that we've gotten any kind of speedup.
and
https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/benchmarks/README.md
Let me know if you'd rather take a look at these in a followup instead.
entries_list = list(entries) | ||
|
||
if len(entries_list) >= cls.IVF_THRESHOLD: | ||
nlist = int(len(entries_list) ** 0.5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Why not
nlist = int(len(entries_list) ** 0.5) | |
nlist = len(entries_list) // 2 |
Also, why is 2 the magic number here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because setting nlist = number of entries divided by 2 creates clusters averaging 2 items each... ? then the index will have half as many clusters as data points. What do you think? Open to feedback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last time I messed around with it, it seemed like there was a warning if there wasn't at least ~40 points per cluster.
There are a bunch of parameters to FAISS that can be tuned, we should pick based on maintaining a very high recall (as we would prefer to find everything indexed), while providing as much speedup as we can.
python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py
Outdated
Show resolved
Hide resolved
I think I would prefer that :) |
Ignore the earlier comment: I think I was doing something dumb.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try and find out why we are getting a performance regression! Ideally this approach should be faster, but maybe we picked the wrong tech and we should be looking more into IndexBinaryIVF
and IndexBinaryMultiHash
. I think we should skip IndexBinaryHash
because the first few bits of a PDQ hash correspond to the upper left hand content of an image, which should have no predictive power towards the rest of the image, and has some bad cases with padding.
We can look at the FAISS Documents: https://github.com/facebookresearch/faiss/wiki/Binary-indexes
Check out also https://github.com/facebookresearch/faiss/wiki/Binary-hashing-index-benchmark, which uses the equivalent of PDQ hashes (256 bit), in roughly our target scale (50M hashes), with radius 32 (our target PDQ dist).
@Dcallies I agree. Thanks for review -looking into it now. Or anything in general really |
There certainly is, which is to start building out the storage interface, backed by DBM. I can either give you a couple of PR's to serve as the starting point, and you can build the rest out, or you can do a couple of smaller PRs in that direction and hash it out over the review comments. Which is your preference? |
@Dcallies Either works for me! If you want it to get done faster... I don't mind if you do the first few. |
Can get you started then, give me a week to make a few PRs! |
lint: black Update PDQ index and signal implementation remove unecessary test file remove unused import Update python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py Co-authored-by: David Callies <[email protected]> Some more PR comments Co-authored-by: David Callies <[email protected]>
Still working on this: swapping to Apparently -
Gonna try to figure it out something this week |
I made a few small PRs for how you might do DBM and interface migrations, respectively, if you get stuck here and why to try that instead. |
Summary
PDQ now using IVF Faiss for better scalability (as we move away from the old imp)
#1613
Test Plan
Run the tests....
python3 -m pytest threatexchange/signal_type/tests/test_pdq_signal_type_index2.py -v -W ignore::DeprecationWarning
Passes.. ?