README.md: 50 additions & 2 deletions
@@ -1953,8 +1953,56 @@ Several optimizations are known:
The last approach is quite powerful and performant, and is used by the great [RapidFuzz][rapidfuzz] library.
It's less known than the others. It is derived from the Baeza-Yates-Gonnet algorithm, was extended to bounded edit-distance search by Manber and Wu in the 1990s, and was further extended by Gene Myers in 1999 and Heikki Hyyrö between 2002 and 2004.
-StringZilla introduces a different approach, extensively used in Unum's internal combinatorial optimization libraries.
-The approach doesn't change the number of trivial operations, but performs them in a different order, removing the data dependency, that occurs when computing the insertion costs.
+StringZilla focuses on a different approach, extensively used in Unum's internal combinatorial optimization libraries.
+It doesn't change the number of trivial operations, but performs them in a different order, removing the data dependency that occurs when computing the insertion costs.
+StringZilla __evaluates diagonals instead of rows__, exploiting the fact that all cells within a diagonal are independent and can be computed in parallel.
+We'll store 3 diagonals instead of 2 rows, and each consecutive diagonal will be computed from the previous two.
+Substitution costs will come from the earlier of the two stored diagonals, while insertion and deletion costs will come from the later one.
+<code>░</code> = cells processed and forgotten
+<code>■</code> = stored cells
+<code>□</code> = computing in parallel
+<code>→ ↘</code> = movement direction
+<code>.</code> = cells to compute later
+</td>
+</tr>
+</table>
+
This results in much better vectorization for intra-core parallelism and potentially multi-core evaluation of a single request.
Moreover, it's easy to generalize to weighted edit distances, often used in bioinformatics, where the cost of a substitution may differ from one pair of characters to another.
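
The diagonal-order evaluation described in this change can be illustrated with a minimal pure-Python sketch. It is not StringZilla's implementation: the function name `diagonal_edit_distance` and the `substitution_cost` parameter are hypothetical, insertions and deletions are assumed to cost 1, and the inner loop runs sequentially. It only shows that every cell of a diagonal reads from the two previous diagonals and never from its own, which is what makes the cells of one diagonal independent.

```python
def diagonal_edit_distance(a, b, substitution_cost=None):
    """Edit distance computed diagonal by diagonal (illustrative sketch).

    `substitution_cost(x, y)` is a hypothetical hook for weighted
    distances; by default it is 0 for equal characters and 1 otherwise.
    Insertions and deletions are assumed to cost 1.
    """
    if substitution_cost is None:
        substitution_cost = lambda x, y: 0 if x == y else 1

    m, n = len(a), len(b)
    # Three diagonals are kept, each indexed by the row number `i`.
    # Cell (i, j) of the full matrix lies on diagonal d = i + j.
    prev2 = [0] * (m + 1)  # diagonal d - 2: source of substitution costs
    prev1 = [0] * (m + 1)  # diagonal d - 1: source of insertion/deletion costs
    curr = [0] * (m + 1)   # diagonal d: being computed

    for d in range(m + n + 1):
        lo, hi = max(0, d - n), min(d, m)
        # No iteration reads `curr`, so the cells of this diagonal do not
        # depend on each other and could be evaluated in parallel.
        for i in range(lo, hi + 1):
            j = d - i
            if i == 0:
                curr[i] = j  # first row: j insertions
            elif j == 0:
                curr[i] = i  # first column: i deletions
            else:
                curr[i] = min(
                    prev1[i - 1] + 1,  # cell (i - 1, j): deletion
                    prev1[i] + 1,      # cell (i, j - 1): insertion
                    prev2[i - 1] + substitution_cost(a[i - 1], b[j - 1]),
                )
        # Rotate the buffers: the oldest diagonal is recycled as scratch space.
        prev2, prev1, curr = prev1, curr, prev2

    # After the final rotation, `prev1` holds the last diagonal, cell (m, n).
    return prev1[m]
```

Passing a different `substitution_cost` callable, for example a lookup into a scoring matrix, gives the weighted variant without changing the traversal order.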