Parallelize blocking (Fingerprinter) #831
Description
AFAIK, Fingerprinter.__call__ is embarrassingly parallel: you just need to partition your records by the number of CPUs you have, call Fingerprinter.__call__ on each partition, then write the results to a single blocking_map table.
Currently that's left to the implementer. Isn't this something the library could do itself, given that it already has a num_cores parameter? I could help with this.
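A minimal sketch of the partition/map/merge shape I mean, with a stand-in fingerprint function rather than dedupe's actual Fingerprinter (fingerprint_partition, partition, and parallel_fingerprint are invented for illustration; a thread pool stands in here for the process pool you'd want for CPU-bound blocking):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

NUM_WORKERS = 4  # in dedupe this would come from the num_cores parameter


def fingerprint_partition(records):
    """Stand-in for Fingerprinter.__call__ on one partition.

    Returns (block_key, record_id) pairs; here the "predicate" is
    just the lowercased first letter of the name field.
    """
    return [(name[0].lower(), record_id) for record_id, name in records]


def partition(records, n):
    """Split records into n roughly equal chunks."""
    return [records[i::n] for i in range(n)]


def parallel_fingerprint(records, n_workers=NUM_WORKERS):
    chunks = partition(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(fingerprint_partition, chunks)
    # "reduce" step: merge per-partition results into one blocking map
    return list(chain.from_iterable(results))


records = [(1, "Alice"), (2, "Bob"), (3, "alina"), (4, "bert")]
blocking_map = parallel_fingerprint(records)
```

In the DB-backed case the final merge would be a write into the blocking_map table instead of an in-memory list.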
I've found #305, but it's quite old. That issue mentions message-passing costs, but for DB-based "big dedupe" applications that's not a concern, since the data isn't in main memory: each worker process can read its own partition of the data from the DB.
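To illustrate the per-worker reads: each worker opens its own DB connection and selects only its modulo-based slice of the table, so no record data has to be shipped between processes. This is a hedged sketch with sqlite3 standing in for Postgres, and the table/column names (entities, id, name) are invented for the example:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Shared in-memory sqlite DB stands in for the Postgres instance
# used in pgsql_big_dedupe_example.py.
DB_URI = "file:blocking_demo?mode=memory&cache=shared"
N_WORKERS = 3

keeper = sqlite3.connect(DB_URI, uri=True)  # keeps the shared in-memory DB alive
keeper.execute("CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT)")
keeper.executemany(
    "INSERT INTO entities VALUES (?, ?)",
    [(i, f"name_{i}") for i in range(10)],
)
keeper.commit()


def read_partition(worker_id):
    """Each worker opens its own connection and reads only its slice."""
    con = sqlite3.connect(DB_URI, uri=True)
    rows = con.execute(
        "SELECT id, name FROM entities WHERE id % ? = ?",
        (N_WORKERS, worker_id),
    ).fetchall()
    con.close()
    # A real worker would run Fingerprinter.__call__ on these rows
    # and write the resulting keys into the blocking_map table.
    return rows


with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    partitions = list(pool.map(read_partition, range(N_WORKERS)))
```

With Postgres the same slicing works via WHERE id %% %(n)s = %(w)s (or a range on the primary key), and each worker would hold its own psycopg2 connection.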
Even if we decide the library won't do this by default, maybe we should update the DB-based "big dedupe" examples, such as pgsql_big_dedupe_example.py, to parallelize blocking?