-
Notifications
You must be signed in to change notification settings - Fork 13
history
Home > History
Based largely on the works of Stoyan Mihov, Klaus Schulz, and Petar Nikolaev Mitankin, this library generates Levenshtein transducers using nothing more than an input list of dictionary terms. The referenced literature includes: "Fast String Correction with Levenshtein-Automata", which defines the algorithm used to generate the Levenshtein automata, "Universal Levenshtein Automata. Building and Properties", which provided many mathematical proofs that helped me understand the algorithm and supplied the recursive definitions upon which I based my distance functions, and "Incremental Construction of Minimal Acyclic Finite-State Automata", that defined an algorithm for constructing Minimal Acyclic Finite-State Automata in linear time (i.e. MA-FSA, also known as DAWGs: Directed Acyclic Word Graphs) which I use to store the dictionary of terms.
Upon construction, the list of dictionary terms is indexed as an MA-FSA and a transducer is initialized according to the maximum edit distance and algorithm provided. When queried against, the states of the Levenshtein automaton corresponding to the query term, maximum edit distance, and algorithm specified are constructed on-demand (lazily) as the automaton is evaluated, as described by the paper, "Fast String Correction with Levenshtein-Automata". The Levenshtein automaton is intersected with the dictionary MA-FSA, and every string accepted by both automata is emitted in a list of correction candidates for the query term.
In contrast to many other Levenshtein automata-based algorithms, the entire Levenshtein automaton needn't be constructed prior to evaluation, and only those states of the automaton which are actually required are derived, as needed, thereby greatly improving the efficiency of the transducer in terms of both space and time complexity.
The infamous blog post, "Damn Cool Algorithms: Levenshtein Automata", provided me with a good starting point for this transducer, but the algorithm proposed therein was too inefficient for my needs. Yet, it did reference the paper "Fast String Correction with Levenshtein-Automata", which I ultimately used as the basis of the Levenshtein automata. The same paper also serves as the basis of the Levenshtein automata used by the Apache projects, Lucene and Solr (Lucene's FuzzyQuery is 100 times faster in 4.0), which itself is based on the project by Jean-Philippe Barrette-LaPierre, Moman.
Steve Hanov pointed me to the paper, "Incremental Construction of Minimal Acyclic Finite-State Automata", in his blog post entitled, "Compressing dictionaries with a DAWG". Another of Steve's blogs also made an impact on me, namely "Fast and Easy Levenshtein distance using a Trie".
I've become heavily involved with the online movement regarding MOOCs (Massive Open Online Classrooms), and the following courses taught me much regarding the material within this library:
Finally, I must credit the course which first introduced me to the field of Information Retrieval, "Mathematical Applications in Search Engine Design", taught by Dr. Rao Li at the University of South Carolina Aiken. I highly recommend that course if you are in the area.
liblevenshtein is maintained by @dylon