Parallelize blocking (Fingerprinter) #831
Description
AFAIK, Fingerprinter.__call__ is embarrassingly parallel: you just need to partition your records by the number of CPUs you have, call Fingerprinter.__call__ on each partition, then write the results to a single blocking_map table.
Currently that's left to the implementer. Isn't this something the library could do itself, given that it already has a num_cores parameter? I could help with this.
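A minimal sketch of the partition/map/merge shape I mean, with a stand-in fingerprint function rather than dedupe's actual Fingerprinter (fingerprint_partition, partition, and parallel_fingerprint are invented for illustration; a thread pool stands in here for the process pool you'd want for CPU-bound blocking):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

NUM_WORKERS = 4  # in dedupe this would come from the num_cores parameter


def fingerprint_partition(records):
    """Stand-in for Fingerprinter.__call__ on one partition.

    Returns (block_key, record_id) pairs; here the "predicate" is
    just the lowercased first letter of the name field.
    """
    return [(name[0].lower(), record_id) for record_id, name in records]


def partition(records, n):
    """Split records into n roughly equal chunks."""
    return [records[i::n] for i in range(n)]


def parallel_fingerprint(records, n_workers=NUM_WORKERS):
    chunks = partition(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(fingerprint_partition, chunks)
    # "reduce" step: merge per-partition results into one blocking map
    return list(chain.from_iterable(results))


records = [(1, "Alice"), (2, "Bob"), (3, "alina"), (4, "bert")]
blocking_map = parallel_fingerprint(records)
```

In the DB-backed case the final merge would be a write into the blocking_map table instead of an in-memory list.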
I've found #305, but it's quite old. That issue mentions message-passing costs, but for DB-based "big dedupe" applications that's not a concern, since the data isn't in main memory: each worker process can read its own partition of the data from the DB.
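To illustrate the per-worker reads: each worker opens its own DB connection and selects only its modulo-based slice of the table, so no record data has to be shipped between processes. This is a hedged sketch with sqlite3 standing in for Postgres, and the table/column names (entities, id, name) are invented for the example:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Shared in-memory sqlite DB stands in for the Postgres instance
# used in pgsql_big_dedupe_example.py.
DB_URI = "file:blocking_demo?mode=memory&cache=shared"
N_WORKERS = 3

keeper = sqlite3.connect(DB_URI, uri=True)  # keeps the shared in-memory DB alive
keeper.execute("CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT)")
keeper.executemany(
    "INSERT INTO entities VALUES (?, ?)",
    [(i, f"name_{i}") for i in range(10)],
)
keeper.commit()


def read_partition(worker_id):
    """Each worker opens its own connection and reads only its slice."""
    con = sqlite3.connect(DB_URI, uri=True)
    rows = con.execute(
        "SELECT id, name FROM entities WHERE id % ? = ?",
        (N_WORKERS, worker_id),
    ).fetchall()
    con.close()
    # A real worker would run Fingerprinter.__call__ on these rows
    # and write the resulting keys into the blocking_map table.
    return rows


with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    partitions = list(pool.map(read_partition, range(N_WORKERS)))
```

With Postgres the same slicing works via WHERE id %% %(n)s = %(w)s (or a range on the primary key), and each worker would hold its own psycopg2 connection.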
Even if we decide the library won't do this by default, maybe we should update the DB-based "big dedupe" examples, such as pgsql_big_dedupe_example.py, to parallelize blocking?