Algorithmically dense data structure for Corpus

From a quick perusal, this code consists of a series of checks performed on a corpus. The checks fundamentally look for similar names in the corpus, and the corpus is implemented as a map (either hash or B-tree). This all looks like fuzzy querying over a data set, which is a well-studied problem.
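To make the "fuzzy query over a map" framing concrete, here is a minimal sketch of the baseline approach: a linear scan that computes the Levenshtein distance between a query and every key. The function names and the `BTreeMap<String, V>` shape are assumptions for illustration, not the project's actual API.

```rust
use std::collections::BTreeMap;

/// Levenshtein edit distance between two strings, counted in Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the first i chars of `a` and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = Vec::with_capacity(b.len() + 1);
        curr.push(i + 1);
        for (j, &cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            let delete = prev[j + 1] + 1;
            let insert = curr[j] + 1;
            curr.push(substitute.min(delete).min(insert));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Hypothetical helper: every name in the corpus within `max_distance` edits
/// of `query`, excluding the exact match. Costs one distance computation per key.
fn similar_names<'a, V>(
    corpus: &'a BTreeMap<String, V>,
    query: &str,
    max_distance: usize,
) -> Vec<&'a str> {
    corpus
        .keys()
        .map(String::as_str)
        .filter(|name| {
            let d = levenshtein(name, query);
            d > 0 && d <= max_distance
        })
        .collect()
}
```

Every query here touches every key, which is the cost that a denser, automaton-based index is meant to avoid.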

The [fst](https://docs.rs/fst/latest/fst/) crate, as described in the excellent blog post [Index 1,600,000,000 Keys with Automata and Rust](https://blog.burntsushi.net/transducers/), allows storing the database in a very dense fashion while still supporting fuzzy queries. Levenshtein matching is implemented in the crate, but it also supports defining your own similarity metrics. fst really shines with extremely large data sets. I recently put all crate names into an fst and it came out at under 2 MB. I should still have that script around if you would like me to retrieve a more accurate number.

In situations where fst is too heavyweight for the number of items being searched, there are other data structures that do similarity matching efficiently. I have heard of the [fuzzy-search](https://github.com/ynqa/fuzzy-search) crate, but I don't know how production-ready it is.
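As one illustration of such a lighter-weight structure, here is a minimal BK-tree sketch using only the standard library. This is a swapped-in example of the general idea; I have not checked whether the fuzzy-search crate works this way, and all type and method names below are hypothetical.

```rust
use std::collections::BTreeMap;

/// Levenshtein edit distance, counted in Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            curr.push(substitute.min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// A BK-tree node; each child is keyed by its edit distance to this node's word.
struct BkNode {
    word: String,
    children: BTreeMap<usize, BkNode>,
}

impl BkNode {
    fn new(word: &str) -> Self {
        BkNode { word: word.to_string(), children: BTreeMap::new() }
    }

    fn insert(&mut self, word: &str) {
        let d = levenshtein(&self.word, word);
        if d == 0 {
            return; // already present
        }
        match self.children.get_mut(&d) {
            Some(child) => child.insert(word),
            None => {
                self.children.insert(d, BkNode::new(word));
            }
        }
    }
}

/// Hypothetical BK-tree wrapper over a set of names.
#[derive(Default)]
struct BkTree {
    root: Option<BkNode>,
}

impl BkTree {
    fn insert(&mut self, word: &str) {
        match self.root.as_mut() {
            Some(root) => root.insert(word),
            None => self.root = Some(BkNode::new(word)),
        }
    }

    /// All stored words within `max_distance` edits of `query`, sorted.
    fn search(&self, query: &str, max_distance: usize) -> Vec<&str> {
        let mut matches = Vec::new();
        let mut stack: Vec<&BkNode> = self.root.iter().collect();
        while let Some(node) = stack.pop() {
            let d = levenshtein(&node.word, query);
            if d <= max_distance {
                matches.push(node.word.as_str());
            }
            // Triangle inequality: only subtrees whose edge label lies in
            // [d - max_distance, d + max_distance] can contain a match.
            let (lo, hi) = (d.saturating_sub(max_distance), d + max_distance);
            for (edge, child) in &node.children {
                if (lo..=hi).contains(edge) {
                    stack.push(child);
                }
            }
        }
        matches.sort_unstable();
        matches
    }
}
```

After inserting a handful of names with `tree.insert(name)`, calling `tree.search("serde", 1)` returns every stored name within one edit of `serde`. The pruning step skips whole subtrees, so a query usually computes far fewer distances than a full scan, although the worst case is still linear.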
