Factors to consider in deciding which algorithm variant to use #1

@rpygithub

Before I begin, I would like to say that this is a welcome and overdue implementation of the TLSH algorithm that respects Java's conventions. Thank you for putting work into writing an implementation that is more efficient and well-documented.

I have but one suggestion regarding the documentation: I think it would be worth describing in general terms what the benefits and drawbacks are of the different window sizes and digest lengths in the context of TLSH. Does a sliding window value larger than 5 offer greater accuracy when comparing hashes for similarity? Should the choice be influenced by the size of files in a dataset?

These questions sprang to mind as I reviewed the table. I am not fully familiar with the theory behind TLSH, so a paragraph about it would offer valuable insight.
