Commit 16d00a7

Faster Hashing
2 parents f65090e + 30ad812 commit 16d00a7

6 files changed

+530
-294
lines changed

README.md

Lines changed: 50 additions & 2 deletions
@@ -1953,8 +1953,56 @@ Several optimizations are known:
19531953
The last approach is quite powerful and performant, and is used by the great [RapidFuzz][rapidfuzz] library.
19541954
It's less known than the others. It derives from the Baeza-Yates-Gonnet algorithm, which was extended to bounded edit-distance search by Manber and Wu in the 1990s, and further extended by Gene Myers in 1999 and Heikki Hyyrö between 2002 and 2004.
19551955

1956-
StringZilla introduces a different approach, extensively used in Unum's internal combinatorial optimization libraries.
1957-
The approach doesn't change the number of trivial operations, but performs them in a different order, removing the data dependency, that occurs when computing the insertion costs.
1956+
StringZilla focuses on a different approach, extensively used in Unum's internal combinatorial optimization libraries.
1957+
It doesn't change the number of elementary operations, but performs them in a different order, removing the data dependency that occurs when computing the insertion costs.
1958+
StringZilla __evaluates diagonals instead of rows__, exploiting the fact that all cells within a diagonal are independent, and can be computed in parallel.
1959+
We'll store 3 diagonals instead of the 2 rows, and each consecutive diagonal will be computed from the previous two.
1960+
Substitution costs will come from the older of the two previous diagonals, while insertion and deletion costs will come from the more recent one.
1961+
1962+
<table>
1963+
<tr>
1964+
<td>
1965+
<strong>Row-by-Row Algorithm</strong><br>
1966+
Computing row 4:
1967+
1968+
<pre>
1969+
∅ A B C D E
1970+
∅ 0 1 2 3 4 5
1971+
P 1 ░ ░ ░ ░ ░
1972+
Q 2 ■ ■ ■ ■ ■
1973+
R 3 ■ ■ □ → .
1974+
S 4 . . . . .
1975+
T 5 . . . . .
1976+
</pre>
1977+
</td>
1978+
<td>
1979+
<strong>Anti-Diagonal Algorithm</strong><br>
1980+
Computing diagonal 5:
1981+
1982+
<pre>
1983+
∅ A B C D E
1984+
∅ 0 1 2 3 4 5
1985+
P 1 ░ ░ ■ ■ □
1986+
Q 2 ░ ■ ■ □ ↘
1987+
R 3 ■ ■ □ ↘ .
1988+
S 4 ■ □ ↘ . .
1989+
T 5 □ ↘ . . .
1990+
</pre>
1991+
</td>
1992+
</tr>
1993+
<tr>
1994+
<td colspan="2">
1995+
<strong>Legend:</strong><br>
1996+
<code>0,1,2,3...</code> = initialization constants &nbsp;&nbsp;
1997+
<code>░</code> = cells processed and forgotten &nbsp;&nbsp;
1998+
<code>■</code> = stored cells &nbsp;&nbsp;
1999+
<code>□</code> = computing in parallel &nbsp;&nbsp;
2000+
<code>→ ↘</code> = movement direction &nbsp;&nbsp;
2001+
<code>.</code> = cells to compute later
2002+
</td>
2003+
</tr>
2004+
</table>
2005+
19582006
This results in much better vectorization for intra-core parallelism and potentially multi-core evaluation of a single request.
19592007
Moreover, it's easy to generalize to weighted edit distances, often used in bioinformatics, where the cost of a substitution may differ from one pair of characters to another.
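
The diagonal evaluation order can be illustrated with a short scalar sketch in Python. This is not StringZilla's implementation: the function name `diagonal_edit_distance` and the parameters `substitution_cost` and `gap_cost` are assumptions made for this example. It keeps three rolling diagonals, each indexed by the row number, and accepts an optional per-pair substitution cost to show the weighted generalization.

```python
def diagonal_edit_distance(a, b, substitution_cost=None, gap_cost=1):
    """Edit distance evaluated diagonal by diagonal (illustrative sketch).

    `substitution_cost` and `gap_cost` are hypothetical parameters:
    by default mismatches cost 1 and insertions/deletions cost `gap_cost`.
    """
    if substitution_cost is None:
        substitution_cost = lambda x, y: 0 if x == y else 1

    m, n = len(a), len(b)
    # Three rolling buffers, each indexed by the row number `i`.
    older = [0] * (m + 1)    # diagonal k - 2: source of substitution costs
    newer = [0] * (m + 1)    # diagonal k - 1: source of insertion/deletion costs
    current = [0] * (m + 1)  # diagonal k: being computed

    # Diagonal `k` holds every cell (i, j) of the matrix with i + j == k.
    for k in range(m + n + 1):
        first_row = max(0, k - n)
        last_row = min(m, k)
        # No cell below reads another cell of `current`, so the iterations
        # of this inner loop are independent of each other.
        for i in range(first_row, last_row + 1):
            j = k - i
            if i == 0:
                current[i] = j * gap_cost  # top border of the matrix
            elif j == 0:
                current[i] = i * gap_cost  # left border of the matrix
            else:
                current[i] = min(
                    older[i - 1] + substitution_cost(a[i - 1], b[j - 1]),
                    newer[i - 1] + gap_cost,  # cell directly above
                    newer[i] + gap_cost,      # cell directly to the left
                )
        # Rotate the buffers: the oldest diagonal is recycled as scratch space.
        older, newer, current = newer, current, older

    # After the final rotation the last diagonal lives in `newer`.
    return newer[m]


print(diagonal_edit_distance("kitten", "sitting"))  # 3
```

Because the inner loop only reads from the two previous diagonals, it is the part a vectorized implementation would replace with SIMD operations. Passing a different `substitution_cost` callable, for instance a lookup into a scoring matrix, changes only the first term of the `min`.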
19602008
