Given a string S = GATCAC and another string T = GTCACC, what is the minimum number of additions/deletions/substitutions required to convert S to T?
Build up a bank of solutions based on substrings that eventually increase to both strings. (Dynamic Programming— and a video here.)
| G | A | T | C | A | C | ||
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
| G | 1 | 0 | 1 | 2 | 3 | 4 | 5 |
| T | 2 | 1 | 1 | 1 | 2 | 3 | 4 |
| C | 3 | 2 | 2 | 2 | 1 | 2 | 3 |
| A | 4 | 3 | 2 | 3 | 2 | 1 | 2 |
| C | 5 | 4 | 3 | 3 | 3 | 2 | 1 |
| C | 6 | 5 | 4 | 4 | 3 | 3 | 2 |
It's very mechanical:
- Build a table
Eof size |S+1| × |T+1| - Fill the first row with the numbers 1 to |S|
- For every row
i(skipping the first row):- For every column
j:- If
T[i] = S[j]— that is, if the characters are the same:- Set
E[i][j]=E[i-1][j-1]
- Set
- Otherwise, set it to the minimum of either:
E[i-1][j-1](one cell up and left)E[i][j-1](one cell left)E[i-1][j](one cell up)
- If
- For every column
- Return the bottom-right-most value as the edit distance
The key technique is:
| A | B |
|---|---|
| C | = min(A,B,C) |
Otherwise, just A if the character is the same.
def edit_distance(S,T):
E = [ [None for char in range(len(S)+1)] ] * (len(T) + 1)
for i in range(len(S)):
E[0][i] = i
for i in range(1,len(S)):
for j in range(len(T)):
if S[i] = T[j]:
E[i][j] = E[i-1][j-1]
else:
a = E[i-1][j-1]
b = E[i][j-1]
c = E[i-1][j]
E[i][j] = min(a,b,c)
return E[len(S)][len(T)]