This project is awesome, but I am trying to use it to detect near-duplicate documents (e.g. JSON), and I can't find enough information on how to do that. The documentation only shows how to compute and compare two hashes:
import simhash
a = simhash.compute(...)
b = simhash.compute(...)
simhash.num_differing_bits(a, b)
or how to find matches using:
import simhash
hashes = []
blocks = 4
distance = 3
matches = simhash.find_all(hashes, blocks, distance)
But before that, how do I make hashes of my documents? Could someone update the README.md or post a full step-by-step example or tutorial showing how to use this simhash library from Python?
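To make the question concrete, here is a minimal pure-Python sketch of the step I think is missing: turning a JSON document into a single 64-bit fingerprint. It does not use this library's code. The helper names (`tokens_from_json`, `shingles`, `hash64`, `simhash64`, `document_hash`), the 3-token shingle width, and the choice of BLAKE2b as the feature hash are all my own assumptions for illustration, not the project's documented method.

```python
import hashlib
import json
import re


def tokens_from_json(doc):
    """Flatten a JSON-compatible object into lowercase word tokens.

    Keys are sorted so that key order does not change the result.
    """
    text = json.dumps(doc, sort_keys=True)
    return re.findall(r"\w+", text.lower())


def shingles(tokens, width=3):
    """Yield overlapping runs of `width` tokens joined by spaces."""
    if len(tokens) < width:
        yield " ".join(tokens)
        return
    for i in range(len(tokens) - width + 1):
        yield " ".join(tokens[i:i + width])


def hash64(text):
    """Stable unsigned 64-bit hash of a string (8-byte BLAKE2b digest)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash64(features):
    """Combine 64-bit feature hashes into one 64-bit fingerprint.

    Each bit of the result is set when more features have that bit
    set than unset (a per-bit majority vote).
    """
    counts = [0] * 64
    for h in features:
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    result = 0
    for bit in range(64):
        if counts[bit] > 0:
            result |= 1 << bit
    return result


def num_differing_bits(a, b):
    """Hamming distance between two 64-bit fingerprints."""
    return bin(a ^ b).count("1")


def document_hash(doc, width=3):
    """Hypothetical helper: JSON object -> tokens -> shingles -> fingerprint."""
    features = [hash64(s) for s in shingles(tokens_from_json(doc), width)]
    return simhash64(features)


if __name__ == "__main__":
    doc_a = {"title": "Simhash intro", "body": "near duplicate detection for documents"}
    doc_b = {"title": "Simhash intro", "body": "near duplicate detection for web documents"}
    a, b = document_hash(doc_a), document_hash(doc_b)
    print(num_differing_bits(a, b))
```

If this is roughly the intended approach, I would guess that a list of such fingerprints is what goes into `hashes` for `simhash.find_all(hashes, blocks, distance)`, and that `simhash.compute(...)` plays the role of `simhash64` above, but that is an assumption on my part and is exactly what I would like the README to confirm.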