
[py-tx] Implementation of IVF faiss indices in PDQ #1756


Open
wants to merge 1 commit into
base: main

Conversation

b8zhong
Contributor

@b8zhong b8zhong commented Feb 6, 2025

Summary

PDQ now uses a Faiss IVF index for better scalability (as we move away from the old implementation).

  • Adds PDQSignalTypeIndex2 that automatically switches between flat and IVF indices based on dataset size
  • Uses IVF-Faiss for datasets >= 1000 entries (as you said), and a flat index for smaller datasets
  • Maintains backward compatibility with existing PDQIndex
  • Updates signal.py to use the new index implementation as default (if we're not swapping yet, I'll change it back)
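
As a rough illustration of the size-based switch described in the list above, the selection step could look like the sketch below. The helper name `choose_index_kind` is hypothetical and not part of threatexchange; only the 1000-entry cutoff and the `IVF_THRESHOLD` constant name come from this PR.

```python
import typing as t

# Cutoff taken from the PR summary. The constant name mirrors the diff excerpt,
# but this standalone helper is hypothetical and is not the PR's actual code.
IVF_THRESHOLD = 1000


def choose_index_kind(entries: t.Iterable[t.Any]) -> str:
    """Return "ivf" for datasets at or above the cutoff, else "flat"."""
    # Materialize once so that generators can be counted.
    entries_list = list(entries)
    return "ivf" if len(entries_list) >= IVF_THRESHOLD else "flat"
```

A caller would then build whichever Faiss index matches the returned kind; that part is left out here because it depends on the index classes chosen in the diff.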

#1613

Test Plan

Run the tests....

python3 -m pytest threatexchange/signal_type/tests/test_pdq_signal_type_index2.py -v -W ignore::DeprecationWarning

Passes.. ?

@b8zhong b8zhong requested a review from Dcallies as a code owner February 6, 2025 20:31
@b8zhong b8zhong changed the title from "Implementation of IVF faiss indices in PD" to "[py-tx] Implementation of IVF faiss indices in PDQ" Feb 6, 2025
@b8zhong b8zhong marked this pull request as draft February 10, 2025 19:05
@b8zhong b8zhong marked this pull request as ready for review February 10, 2025 19:05
Contributor

@Dcallies Dcallies left a comment


Hey @b8zhong! Thanks again for contributing, I am very excited about this work, and so glad it's got you on it.

I suspect this change is much simpler than you might think! You have added a new wrapping class to the hierarchy, but I don't think we need it.

Is it possible to make the following changes:

  1. Migrate the index selection logic from PDQSignalTypeIndex2.build to PDQIndex2.build.
  2. Take a look at the existing tests for pdqindex2 and see if you can re-use them by just changing the index you pass in
  3. See if there's anything left in PDQSignalTypeIndex2.

@@ -0,0 +1,184 @@
import typing as t
Contributor

Contributor Author


Did you mean python-threatexchange/threatexchange/signal_type/tests/test_pdq_index2.py?

@b8zhong
Contributor Author

b8zhong commented Feb 11, 2025

Didn't forget about this. I'll check it out tonight; thanks for the review!

@b8zhong
Contributor Author

b8zhong commented Feb 22, 2025

Think it's ready to get a second look now?

@Dcallies Dcallies self-requested a review February 24, 2025 12:08
Contributor

@Dcallies Dcallies left a comment


Apologies, I had unsubmitted comments!

This looks like it won't break any of the existing code. What I'd like to see now is some benchmark results showing that we've actually gotten a speedup.

Check out https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/benchmarks/benchmark_pdq_faiss_matchers.py

and

https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/benchmarks/README.md

Let me know if you'd rather take a look at these in a followup instead.

entries_list = list(entries)

if len(entries_list) >= cls.IVF_THRESHOLD:
    nlist = int(len(entries_list) ** 0.5)
Contributor


nit: Why not

Suggested change
nlist = int(len(entries_list) ** 0.5)
nlist = len(entries_list) // 2

Also, why is 2 the magic number here?

Contributor Author


Because setting nlist to the number of entries divided by 2 creates clusters averaging 2 items each, so the index has half as many clusters as data points. What do you think? Open to feedback.

Contributor


The last time I messed around with it, it seemed like there was a warning if there weren't at least ~40 points per cluster.

There are a bunch of FAISS parameters that can be tuned. We should pick them to maintain very high recall (we would prefer to find everything indexed) while providing as much speedup as we can.
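
To make the points-per-cluster concern concrete, here is a minimal sketch that keeps the square-root heuristic from the diff but caps nlist so each cluster averages at least a minimum number of points. The helper name `pick_nlist` is hypothetical, and the default of 39 is an assumption based on the warning threshold Faiss's k-means uses by default; check it against the Faiss version in use.

```python
import math

# Assumed default: Faiss's clustering warns below roughly this many training
# points per centroid. Verify against your Faiss version before relying on it.
MIN_POINTS_PER_CLUSTER = 39


def pick_nlist(num_entries: int, min_points: int = MIN_POINTS_PER_CLUSTER) -> int:
    """Pick a cluster count near sqrt(n) without starving any cluster."""
    if num_entries <= 0:
        return 1
    sqrt_guess = math.isqrt(num_entries)  # the square-root heuristic from the diff
    cap = num_entries // min_points  # keeps the average cluster at >= min_points
    return max(1, min(sqrt_guess, cap))
```

For comparison, `len(entries_list) // 2` on 1000 entries gives 500 clusters of about 2 points each, while this sketch gives 25 clusters of about 40 points each. Whether that is the right recall/speed trade-off would still need benchmarking.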

@b8zhong
Contributor Author

b8zhong commented Feb 27, 2025

Let me know if you'd rather take a look at these in a followup instead.

I think I would prefer that :)

@b8zhong
Contributor Author

b8zhong commented Feb 27, 2025

Ignore the earlier comment: I think I was doing something dumb.

python3 benchmarks/benchmark_pdq_indices.py --dataset-size 10000 --num-queries 1000 --implementations index2_flat flat_hash multi_hash --thresholds 31 --query-mode both
Benchmark: PDQ Index Implementation Comparison

Options:
	 faiss_threads :  1
	 dataset_size :  10000
	 num_queries :  1000
	 thresholds :  [31]
	 seed :  None
	 implementations :  ['index2_flat', 'flat_hash', 'multi_hash']
	 serialize_test :  False
	 query_mode :  both

using random seed of  1740628723234617000
use --seed  1740628723234617000  to rerun with same random values

Generating dataset...
Generated 10000 hashes in 0.0388s


=== Benchmarking PDQFlatHashIndex ===
Running benchmark with threshold 31...
  Running batch query benchmark...
  Running individual query benchmark...

=== Benchmarking PDQMultiHashIndex ===
Running benchmark with threshold 31...
  Running batch query benchmark...
  Running individual query benchmark...

=== Benchmarking PDQIndex2 (Flat) ===
Running benchmark with threshold 31...
  Running batch query benchmark...
  Running individual query benchmark...

========== MEMORY USAGE ESTIMATE ==========
Note: These are serialized sizes, not runtime memory usage
PDQFlatHashIndex: 390KB
PDQMultiHashIndex: 1,208KB
PDQIndex2 (Flat): 10,791KB


========== BENCHMARK SUMMARY ==========

===== PDQFlatHashIndex Results =====
Build time: 0.0156s
Size: 390KB

Threshold 31:
	Batch mode:
		Search time: 0.0120s
		Per query: 0.0120ms
		Targets found: 100.00%
	Individual mode:
		Search time: 0.0200s
		Per query: 0.0200ms
		Targets found: 100.00%
Single item test: SUCCESS

===== PDQMultiHashIndex Results =====
Build time: 0.0281s
Size: 1,208KB

Threshold 31:
	Batch mode:
		Search time: 0.0166s
		Per query: 0.0166ms
		Targets found: 100.00%
	Individual mode:
		Search time: 0.0286s
		Per query: 0.0286ms
		Targets found: 100.00%
Single item test: SUCCESS

===== PDQIndex2 (Flat) Results =====
Build time: 0.0275s
Size: 10,791KB

Threshold 31:
	Batch mode:
		Search time: 0.2088s
		Per query: 0.2088ms
		Targets found: 100.00%
	Individual mode:
		Search time: 0.2089s
		Per query: 0.2089ms
		Targets found: 100.00%
Single item test: SUCCESS

========== PERFORMANCE COMPARISON ==========

--- BATCH QUERY MODE ---
Implementation       | Build Time (s)  | Size (KB)       | Search Time (ms) | Targets Found (%)
----------------------------------------------------------------------------------------
PDQFlatHashIndex     | 0.0156          | 390             | 0.0120          | 100.00         
PDQMultiHashIndex    | 0.0281          | 1,208           | 0.0166          | 100.00         
PDQIndex2 (Flat)     | 0.0275          | 10,791          | 0.2088          | 100.00         

--- INDIVIDUAL QUERY MODE ---
Implementation       | Build Time (s)  | Size (KB)       | Search Time (ms) | Targets Found (%)
----------------------------------------------------------------------------------------
PDQFlatHashIndex     | 0.0156          | 390             | 0.0200          | 100.00         
PDQMultiHashIndex    | 0.0281          | 1,208           | 0.0286          | 100.00         
PDQIndex2 (Flat)     | 0.0275          | 10,791          | 0.2089          | 100.00
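
To put the summary above in relative terms, the short calculation below divides the PDQIndex2 (Flat) figures by the PDQFlatHashIndex ones. The numbers are copied from this single run; the dictionary layout and the helper name `ratio_vs_baseline` are only for illustration.

```python
# Figures copied from the batch and individual rows of the summary above.
results = {
    "PDQFlatHashIndex": {"size_kb": 390, "batch": 0.0120, "individual": 0.0200},
    "PDQMultiHashIndex": {"size_kb": 1208, "batch": 0.0166, "individual": 0.0286},
    "PDQIndex2 (Flat)": {"size_kb": 10791, "batch": 0.2088, "individual": 0.2089},
}


def ratio_vs_baseline(name: str, field: str, baseline: str = "PDQFlatHashIndex") -> float:
    """How many times larger `field` is for `name` than for the baseline."""
    return results[name][field] / results[baseline][field]


for field in ("batch", "individual", "size_kb"):
    print(field, round(ratio_vs_baseline("PDQIndex2 (Flat)", field), 1))
```

On this run that works out to roughly 17x slower in batch mode, about 10x slower for individual queries, and about 28x larger when serialized.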

Contributor

@Dcallies Dcallies left a comment


Let's try and find out why we are getting a performance regression! Ideally this approach should be faster, but maybe we picked the wrong tech and we should be looking more into IndexBinaryIVF and IndexBinaryMultiHash. I think we should skip IndexBinaryHash because the first few bits of a PDQ hash correspond to the upper left hand content of an image, which should have no predictive power towards the rest of the image, and has some bad cases with padding.

We can look at the FAISS documentation: https://github.com/facebookresearch/faiss/wiki/Binary-indexes

Check out also https://github.com/facebookresearch/faiss/wiki/Binary-hashing-index-benchmark, which uses the equivalent of PDQ hashes (256 bit), in roughly our target scale (50M hashes), with radius 32 (our target PDQ dist).
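
For reference when comparing recall across index types, the sketch below shows the exhaustive search any of these indexes approximates: a brute-force Hamming-distance scan over 256-bit hashes given as 64-character hex strings. The function names are hypothetical, this is not how Faiss implements its indexes, and treating the radius as inclusive is an assumption.

```python
import typing as t


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count differing bits between two 64-character hex PDQ hashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def flat_range_search(query: str, indexed: t.Sequence[str], radius: int) -> t.List[int]:
    """Return positions of indexed hashes within `radius` bits of the query."""
    # Inclusive comparison is an assumption; adjust to match the real index.
    return [i for i, h in enumerate(indexed) if hamming_distance(query, h) <= radius]
```

Because it checks every entry, a scan like this finds 100% of targets by construction, which makes it a useful ground truth when measuring how much recall an IVF or multi-hash index gives up for speed.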

@b8zhong
Contributor Author

b8zhong commented Mar 1, 2025

@Dcallies I agree. Thanks for the review, looking into it now.
By the way, if there's anything else related to DBM I can do concurrently, let me know.

Or anything in general really

@Dcallies
Contributor

Dcallies commented Mar 2, 2025

There certainly is, which is to start building out the storage interface, backed by DBM.

I can either give you a couple of PRs to serve as the starting point, and you can build the rest out, or you can do a couple of smaller PRs in that direction and we can hash it out over the review comments. Which is your preference?

@b8zhong
Contributor Author

b8zhong commented Mar 2, 2025

@Dcallies Either works for me! If you want it to get done faster... I don't mind if you do the first few.

@Dcallies
Contributor

Dcallies commented Mar 2, 2025

Can get you started then, give me a week to make a few PRs!

@github-actions github-actions bot added the python-threatexchange Items related to the threatexchange python tool / library label Mar 11, 2025
lint: black

Update PDQ index and signal implementation

remove unecessary test file

remove unused import

Update python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py

Co-authored-by: David Callies <[email protected]>

Some more PR comments

Co-authored-by: David Callies <[email protected]>
@b8zhong
Contributor Author

b8zhong commented Mar 12, 2025

Still working on this: swapping to index = faiss.IndexBinaryFlat(BITS_IN_PDQ), etc. seems to cause a bunch of downstream issues.

Apparently the convert_pdq_strings_to_ndarray function unpacks the binary data into individual bits, creating a 256-element array for a 32-byte PDQ hash.

Going to try to figure something out this week.
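
The mismatch described above can be sketched in plain Python. Faiss binary indexes take vectors packed as bytes (d / 8 bytes per vector, so 32 bytes for a 256-bit hash), whereas an unpacked representation has one entry per bit. The helper names below are hypothetical and are not the threatexchange functions; in NumPy, `np.unpackbits` and `np.packbits` perform the same conversions.

```python
import typing as t

BITS_IN_PDQ = 256


def unpack_bits(pdq_hex: str) -> t.List[int]:
    """Expand a hex PDQ hash into one 0/1 entry per bit, most significant first."""
    return [
        (byte >> shift) & 1
        for byte in bytes.fromhex(pdq_hex)
        for shift in range(7, -1, -1)
    ]


def pack_bits(bits: t.Sequence[int]) -> bytes:
    """Fold 0/1 entries back into bytes, eight bits per byte."""
    out = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)
```

Under this reading, code that hands the 256-element form to a binary index would need a packing step first, or the conversion function would need a packed variant. Which of the two fits the existing callers better is a design question for the PR.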

@Dcallies
Contributor

I made a few small PRs showing how you might do the DBM and interface migrations, respectively, in case you get stuck here and want to try that instead.

Labels
CLA Signed python-threatexchange Items related to the threatexchange python tool / library

3 participants