Skip to content

Algorithmically dense data structure for Corpus #2

@Eh2406

Description

@Eh2406

From a quick perusal of this code it consists of a series of checks to be performed on a corpus. The checks are fundamentally looking for similar names in the corpus, and the corpus is implemented as a Map (either hash or btree). Fundamentally this all looks like fuzzy queries over a data set, which is a well studied problem.

The fst as described in the excellent blog post Index 1,600,000,000 Keys with Automata and Rust allows storing the database in a very dense fashion while still supporting fuzzy queries. Levenstein is implemented in the crate, but it also supports defining your own similarity metrics. Fst really shines with extremely large data sets. I recently put all crate names in Fst and it was <2MB. I should still have that script around if you would like me to retrieve a more accurate number.

In situations where Fst is heavyweight for the number of items being searched there are other data structures that are efficient for doing similarity matching. I have heard of the fuzzy-search crate, but don't know how production ready it is.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions