[py-tx] Implementation of IVF faiss indices in PDQ #1756

b8zhong · 2025-02-06T20:31:13Z

Summary

PDQ now uses an IVF FAISS index for better scalability (as we move away from the old implementation).

Adds PDQSignalTypeIndex2, which automatically switches between flat and IVF indices based on dataset size
Uses an IVF FAISS index for datasets with >= 1000 entries (as you said), and a flat index for anything smaller
Maintains backward compatibility with the existing PDQIndex
Updates signal.py to use the new index implementation as the default (if we're not swapping yet, I'll change it back)

#1613

Test Plan

Run the tests:

python3 -m pytest threatexchange/signal_type/tests/test_pdq_signal_type_index2.py -v -W ignore::DeprecationWarning

Passes.. ?

Dcallies

Hey @b8zhong! Thanks again for contributing, I am very excited about this work, and so glad it's got you on it.

I suspect this change is much simpler than you might think! You have added a new wrapping class to the hierarchy, but I don't think we need it.

Is it possible to make the following changes:

Migrate the index selection logic from PDQSignalTypeIndex2.build to PDQIndex2.build.
Take a look at the existing tests for pdqindex2 and see if you can re-use them by just changing the index you pass in
See if there's anything left in PDQSignalTypeIndex2.
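A rough sketch of what the first bullet could look like is below. It is only an illustration under stated assumptions: FlatBackend, IVFBackend, and PDQIndex2Sketch are hypothetical stand-ins (the real class wraps FAISS), while IVF_THRESHOLD = 1000 and the square-root nlist come from this PR's summary and diff.

```python
import typing as t


class FlatBackend:
    """Hypothetical stand-in for an exhaustive (flat) index."""

    def __init__(self, hashes: t.List[str]) -> None:
        self.hashes = hashes


class IVFBackend:
    """Hypothetical stand-in for an IVF index with nlist clusters."""

    def __init__(self, hashes: t.List[str], nlist: int) -> None:
        self.hashes = hashes
        self.nlist = nlist


class PDQIndex2Sketch:
    # Threshold taken from the PR summary: IVF at >= 1000 entries.
    IVF_THRESHOLD = 1000

    def __init__(self, backend: t.Union[FlatBackend, IVFBackend]) -> None:
        self.backend = backend

    @classmethod
    def build(cls, entries: t.Iterable[str]) -> "PDQIndex2Sketch":
        # Materialize once so generators can be counted and then indexed.
        entries_list = list(entries)
        if len(entries_list) >= cls.IVF_THRESHOLD:
            nlist = int(len(entries_list) ** 0.5)
            return cls(IVFBackend(entries_list, nlist))
        return cls(FlatBackend(entries_list))
```

With the selection living in build, callers and existing tests only ever see one index class, which is what would make the wrapper unnecessary.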

python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py

python-threatexchange/threatexchange/signal_type/pdq/signal.py

Dcallies · 2025-02-10T19:32:17Z

python-threatexchange/threatexchange/signal_type/tests/test_pdq_signal_type_index2.py

@@ -0,0 +1,184 @@
+import typing as t


What do you think about these tests?

https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py

Did you mean python-threatexchange/threatexchange/signal_type/tests/test_pdq_index2.py?

python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py

b8zhong · 2025-02-11T19:41:45Z

Didn't forget about this. I'll check it out tonight; thanks for the review!

b8zhong · 2025-02-22T13:45:53Z

Think it's ready to get a second look now?

Dcallies

Apologies, I had unsubmitted comments!

This looks like it won't break any of the existing code. What I'd like to see is some benchmark results showing that we've actually gotten a speedup.

Check out https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/benchmarks/benchmark_pdq_faiss_matchers.py

and

https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/benchmarks/README.md

Let me know if you'd rather take a look at these in a followup instead.

Dcallies · 2025-02-13T16:20:45Z

python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py

+        entries_list = list(entries)
+
+        if len(entries_list) >= cls.IVF_THRESHOLD:
+            nlist = int(len(entries_list) ** 0.5)


nit: Why not

Suggested change

nlist = int(len(entries_list) ** 0.5)

nlist = len(entries_list) // 2

Also, why is 2 the magic number here?

Because setting nlist to the number of entries divided by 2 creates clusters averaging 2 items each, so the index will have half as many clusters as data points. What do you think? Open to feedback.

The last time I messed around with it, it seemed like there was a warning if there wasn't at least ~40 points per cluster.

There are a bunch of FAISS parameters that can be tuned. We should pick them to maintain very high recall (as we would prefer to find everything indexed) while providing as much speedup as we can.
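To make the cluster-size concern concrete, here is a minimal sketch of capping nlist so that the average cluster stays above a floor. The floor of 40 points per cluster is an assumption taken from the comment above rather than a verified FAISS constant, and pick_nlist is a hypothetical helper that is not part of this PR.

```python
import math


def pick_nlist(num_entries: int, min_points_per_cluster: int = 40) -> int:
    """Choose a cluster count for an IVF index (illustrative heuristic only)."""
    if num_entries <= 0:
        raise ValueError("num_entries must be positive")
    # Common starting point: roughly sqrt(N) clusters.
    by_sqrt = math.isqrt(num_entries)
    # Upper bound that keeps the average cluster at or above the floor.
    by_floor = num_entries // min_points_per_cluster
    return max(1, min(by_sqrt, by_floor))


for n in (1_000, 10_000, 1_000_000):
    nlist = pick_nlist(n)
    print(f"N={n}: nlist={nlist}, avg points per cluster={n / nlist:.1f}")
```

By comparison, N // 2 averages 2 points per cluster at every dataset size, far below that floor. Any such heuristic would still need to be checked against recall in the benchmarks.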

python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py

b8zhong · 2025-02-27T00:19:15Z

> Let me know if you'd rather take a look at these in a followup instead.

I think I would prefer that :)

b8zhong · 2025-02-27T04:00:34Z

Ignore the earlier comment: I think I was doing something dumb.

python3 benchmarks/benchmark_pdq_indices.py --dataset-size 10000 --num-queries 1000 --implementations index2_flat flat_hash multi_hash --thresholds 31 --query-mode both
Benchmark: PDQ Index Implementation Comparison

Options:
	 faiss_threads :  1
	 dataset_size :  10000
	 num_queries :  1000
	 thresholds :  [31]
	 seed :  None
	 implementations :  ['index2_flat', 'flat_hash', 'multi_hash']
	 serialize_test :  False
	 query_mode :  both

using random seed of  1740628723234617000
use --seed  1740628723234617000  to rerun with same random values

Generating dataset...
Generated 10000 hashes in 0.0388s


=== Benchmarking PDQFlatHashIndex ===
Running benchmark with threshold 31...
  Running batch query benchmark...
  Running individual query benchmark...

=== Benchmarking PDQMultiHashIndex ===
Running benchmark with threshold 31...
  Running batch query benchmark...
  Running individual query benchmark...

=== Benchmarking PDQIndex2 (Flat) ===
Running benchmark with threshold 31...
  Running batch query benchmark...
  Running individual query benchmark...

========== MEMORY USAGE ESTIMATE ==========
Note: These are serialized sizes, not runtime memory usage
PDQFlatHashIndex: 390KB
PDQMultiHashIndex: 1,208KB
PDQIndex2 (Flat): 10,791KB


========== BENCHMARK SUMMARY ==========

===== PDQFlatHashIndex Results =====
Build time: 0.0156s
Size: 390KB

Threshold 31:
	Batch mode:
		Search time: 0.0120s
		Per query: 0.0120ms
		Targets found: 100.00%
	Individual mode:
		Search time: 0.0200s
		Per query: 0.0200ms
		Targets found: 100.00%
Single item test: SUCCESS

===== PDQMultiHashIndex Results =====
Build time: 0.0281s
Size: 1,208KB

Threshold 31:
	Batch mode:
		Search time: 0.0166s
		Per query: 0.0166ms
		Targets found: 100.00%
	Individual mode:
		Search time: 0.0286s
		Per query: 0.0286ms
		Targets found: 100.00%
Single item test: SUCCESS

===== PDQIndex2 (Flat) Results =====
Build time: 0.0275s
Size: 10,791KB

Threshold 31:
	Batch mode:
		Search time: 0.2088s
		Per query: 0.2088ms
		Targets found: 100.00%
	Individual mode:
		Search time: 0.2089s
		Per query: 0.2089ms
		Targets found: 100.00%
Single item test: SUCCESS

========== PERFORMANCE COMPARISON ==========

--- BATCH QUERY MODE ---
Implementation       | Build Time (s)  | Size (KB)       | Search Time (ms) | Targets Found (%)
----------------------------------------------------------------------------------------
PDQFlatHashIndex     | 0.0156          | 390             | 0.0120          | 100.00         
PDQMultiHashIndex    | 0.0281          | 1,208           | 0.0166          | 100.00         
PDQIndex2 (Flat)     | 0.0275          | 10,791          | 0.2088          | 100.00         

--- INDIVIDUAL QUERY MODE ---
Implementation       | Build Time (s)  | Size (KB)       | Search Time (ms) | Targets Found (%)
----------------------------------------------------------------------------------------
PDQFlatHashIndex     | 0.0156          | 390             | 0.0200          | 100.00         
PDQMultiHashIndex    | 0.0281          | 1,208           | 0.0286          | 100.00         
PDQIndex2 (Flat)     | 0.0275          | 10,791          | 0.2089          | 100.00
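
To put the numbers above side by side, this short sketch computes how much slower and larger PDQIndex2 (Flat) was than each baseline in this one run. The values are copied from the summary tables; the dictionary layout and the ratios_vs helper are just for illustration.

```python
# Per-query search time (ms) and serialized size (KB), copied from the run above.
RESULTS = {
    "PDQFlatHashIndex": {"batch_ms": 0.0120, "individual_ms": 0.0200, "size_kb": 390},
    "PDQMultiHashIndex": {"batch_ms": 0.0166, "individual_ms": 0.0286, "size_kb": 1208},
    "PDQIndex2 (Flat)": {"batch_ms": 0.2088, "individual_ms": 0.2089, "size_kb": 10791},
}


def ratios_vs(candidate: str, baseline: str) -> dict:
    """Return candidate/baseline ratios for every metric (hypothetical helper)."""
    cand, base = RESULTS[candidate], RESULTS[baseline]
    return {metric: cand[metric] / base[metric] for metric in cand}


for baseline in ("PDQFlatHashIndex", "PDQMultiHashIndex"):
    r = ratios_vs("PDQIndex2 (Flat)", baseline)
    print(baseline, {k: f"{v:.1f}x" for k, v in r.items()})
```

In this single run, PDQIndex2 (Flat) comes out roughly 17x slower than PDQFlatHashIndex in batch mode and about 28x larger when serialized. A single unseeded run is not conclusive, so the ratios are best read as a rough indication.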

Dcallies

Let's try and find out why we are getting a performance regression! Ideally this approach should be faster, but maybe we picked the wrong tech and we should be looking more into IndexBinaryIVF and IndexBinaryMultiHash. I think we should skip IndexBinaryHash because the first few bits of a PDQ hash correspond to the upper left hand content of an image, which should have no predictive power towards the rest of the image, and has some bad cases with padding.

We can look at the FAISS documentation: https://github.com/facebookresearch/faiss/wiki/Binary-indexes

Check out also https://github.com/facebookresearch/faiss/wiki/Binary-hashing-index-benchmark, which uses the equivalent of PDQ hashes (256 bit), in roughly our target scale (50M hashes), with radius 32 (our target PDQ dist).
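For reference, the radius mentioned above is a Hamming distance over the 256 bits of the hash. A minimal pure-Python sketch of that check follows. The function names and example hashes are made up for illustration, the threshold of 31 is the value used in the benchmark run earlier in this thread, and treating the threshold as inclusive is an assumption.

```python
BITS_IN_PDQ = 256


def pdq_hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count differing bits between two 64-character hex PDQ hashes."""
    if len(hash_a) != BITS_IN_PDQ // 4 or len(hash_b) != BITS_IN_PDQ // 4:
        raise ValueError("expected 64 hex characters per hash")
    # XOR leaves a 1 wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def is_match(hash_a: str, hash_b: str, threshold: int = 31) -> bool:
    """True when the hashes are within the given Hamming distance (inclusive, by assumption)."""
    return pdq_hamming_distance(hash_a, hash_b) <= threshold


base = "f" * 64
near = "0" * 2 + "f" * 62   # differs in the first 8 bits
far = "0" * 16 + "f" * 48   # differs in the first 64 bits
print(is_match(base, near), is_match(base, far))
```

A flat index effectively runs this comparison against every stored hash, which is the baseline any approximate index has to beat without losing matches.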

b8zhong · 2025-03-01T23:58:49Z

@Dcallies I agree. Thanks for the review, looking into it now.
By the way, if there's anything else related to DBM that I can do concurrently, let me know.

Or anything in general really

Dcallies · 2025-03-02T17:26:56Z

There certainly is, which is to start building out the storage interface, backed by DBM.

I can either give you a couple of PRs to serve as the starting point, and you can build the rest out, or you can do a couple of smaller PRs in that direction and we hash it out over the review comments. Which is your preference?

b8zhong · 2025-03-02T21:02:40Z

@Dcallies Either works for me! If you want it to get done faster... I don't mind if you do the first few.

Dcallies · 2025-03-02T21:29:14Z

Can get you started then, give me a week to make a few PRs!

lint: black
Update PDQ index and signal implementation
remove unecessary test file
remove unused import
Update python-threatexchange/threatexchange/signal_type/pdq/pdq_index2.py (Co-authored-by: David Callies)
Some more PR comments (Co-authored-by: David Callies)

b8zhong · 2025-03-12T12:47:49Z

Still working on this: swapping to index = faiss.IndexBinaryFlat(BITS_IN_PDQ), etc. seems to cause a bunch of downstream issues.

Apparently the convert_pdq_strings_to_ndarray function unpacks the binary data into individual bits, creating a 256-element array for a 32-byte PDQ hash.

Gonna try to figure something out this week.
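
The shape mismatch described above can be seen without FAISS at all. FAISS binary indexes such as IndexBinaryFlat take the dimension in bits and expect each vector as packed bytes (d / 8 of them), so a 256-element array of individual bits is not what they consume. The sketch below shows the two representations in plain Python; the helper names are hypothetical and are not the ones in the codebase.

```python
import typing as t

BITS_IN_PDQ = 256


def pdq_hex_to_packed(pdq_hex: str) -> bytes:
    """64 hex characters -> 32 packed bytes (8 bits per byte)."""
    packed = bytes.fromhex(pdq_hex)
    if len(packed) != BITS_IN_PDQ // 8:
        raise ValueError("expected a 256-bit PDQ hash")
    return packed


def unpack_bits(packed: bytes) -> t.List[int]:
    """32 packed bytes -> 256 individual bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in packed for shift in range(7, -1, -1)]


def pack_bits(bits: t.Sequence[int]) -> bytes:
    """Inverse of unpack_bits: every group of 8 bits becomes one byte."""
    if len(bits) % 8 != 0:
        raise ValueError("bit count must be a multiple of 8")
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i : i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


packed = pdq_hex_to_packed("a5" * 32)
bits = unpack_bits(packed)
print(len(packed), len(bits))  # 32 bytes vs 256 bits
```

Whichever representation the rest of the code settles on, the conversion into the index and the conversion out of query results need to agree on it.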

Dcallies · 2025-03-12T20:09:36Z

I made a few small PRs showing how you might do the DBM and interface migrations, respectively, in case you get stuck here and want to try that instead.

b8zhong · 2025-05-17T21:32:34Z

Thanks!

Apologies for the massive delay @Dcallies... to be honest, I'm not sure when I'll have time to get around to this. Feel free to take over the branch if you'd like 👍

b8zhong requested a review from Dcallies as a code owner February 6, 2025 20:31

facebook-github-bot added the CLA Signed label Feb 6, 2025

b8zhong changed the title ~~Implementation of IVF faiss indices in PD~~ [py-tx] Implementation of IVF faiss indices in PDQ Feb 6, 2025

b8zhong force-pushed the IVF-Faiss branch from c230ff6 to 4f9d1dd Compare February 6, 2025 21:30

b8zhong marked this pull request as draft February 10, 2025 19:05

b8zhong marked this pull request as ready for review February 10, 2025 19:05

Dcallies requested changes Feb 10, 2025

Dcallies self-requested a review February 24, 2025 12:08

Dcallies approved these changes Feb 24, 2025

Dcallies requested changes Mar 1, 2025

github-actions bot added the python-threatexchange Items related to the threatexchange python tool / library label Mar 11, 2025

b8zhong force-pushed the IVF-Faiss branch from 554702d to 5b304cb Compare March 12, 2025 01:12
