These are the absolute numbers of shared fingerprints in the left and right file, respectively. The two values are not necessarily equal, because a shared fingerprint can occur a different number of times in each file.

A fingerprint is a sequence of $$k$$ consecutive tokens (a k-gram) in the syntax tree, selected out of a window of $$w$$ k-grams.
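To make the selection step more concrete, here is a minimal Python sketch of picking one k-gram per window. It rests on assumptions that are not stated above: tokens are plain strings, k-grams are hashed with CRC32 (picked only because it is deterministic), and the k-gram with the smallest hash in each window is the one selected, as in the classic winnowing scheme. The names `kgrams`, `kgram_hash` and `fingerprints` are hypothetical and not part of any real API.

```python
import zlib


def kgrams(tokens, k):
    """Return every run of k consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def kgram_hash(kgram):
    # Assumption: CRC32 is only a stand-in hash so that results are reproducible.
    return zlib.crc32("\x00".join(kgram).encode("utf-8"))


def fingerprints(tokens, k, w):
    """Select one k-gram hash per window of w consecutive k-grams.

    Assumption: the selected k-gram is the one with the smallest hash
    (rightmost on ties). A k-gram selected by several overlapping windows
    is recorded only once, keyed by its position.
    """
    hashes = [kgram_hash(g) for g in kgrams(tokens, k)]
    if not hashes:
        return []
    selected = {}  # position of the k-gram -> its hash
    # With fewer than w k-grams, the whole list is treated as a single window.
    for start in range(max(len(hashes) - w + 1, 1)):
        window = hashes[start:start + w]
        smallest = min(window)
        offset = len(window) - 1 - window[::-1].index(smallest)
        selected[start + offset] = smallest
    return [selected[pos] for pos in sorted(selected)]


tokens = "def f ( x ) : return x + 1".split()
print(fingerprints(tokens, k=3, w=4))
```

With 10 tokens and $$k = 3$$ there are 8 k-grams, and with $$w = 4$$ there are 5 windows, so at most 5 fingerprints are kept.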

The similarity between two source files a and b is computed as
$$sim(a,b) = \frac{S_a + S_b}{T_a + T_b}$$
with $$T_x$$ the total number of fingerprints in file $$x$$ and $$S_x$$ the number of fingerprints in file $$x$$ that also occur in the other file.
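As an illustration of the formula, the following Python sketch computes $$S_a$$, $$S_b$$ and the similarity for two small fingerprint lists. It assumes that fingerprints are counted per occurrence, which would explain why the shared counts of the two files can differ. The function names and the example values are hypothetical.

```python
def shared_count(own, other):
    """S_x: occurrences of fingerprints in `own` that also occur in `other`."""
    other_set = set(other)
    return sum(1 for fp in own if fp in other_set)


def similarity(a, b):
    """sim(a, b) = (S_a + S_b) / (T_a + T_b), with T_x = len(x)."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # assumption: two files without fingerprints have similarity 0
    return (shared_count(a, b) + shared_count(b, a)) / total


# Hypothetical fingerprints; "h1" occurs twice in the left file.
left = ["h1", "h1", "h2", "h3"]
right = ["h1", "h2", "h4"]

print(shared_count(left, right))   # 3
print(shared_count(right, left))   # 2
print(similarity(left, right))     # 5/7, about 0.714
```

Here the left file has 3 shared fingerprints and the right file has 2, because `h1` occurs twice on the left and once on the right.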

The naming is indeed a bit awkward. It was chosen a few years ago and we try to keep our API somewhat …

Answer selected by rien