A Julia implementation of suffix automata for generic sequences. Works with characters, integers, strings, or any other iterable type.
Some use cases are demonstrated in the files examples/numbers.jl and examples/shakespeare.jl in this repo.
- Build a suffix automaton from any sequence (
String,Vector{Char},Vector{Int}, etc.). - Check substring existence in O(m) time (
occursin). - Find all occurrences of a pattern (
findall). - Count the number of distinct substrings (
substring_count). - Compute the longest common substring between two sequences (
lcs).
You will normally construct a SuffixAutomaton{T} from a Vector{T} where T is any iterable type:
using SuffixAutomata
a = SuffixAutomaton([1, 2, 3, 1, 2, 3])
occursin([1, 2, 3], a) # true
findall([1, 2, 3], a) # [1:3, 4:6]
occursin(4, a) # false
lcs([1, 2, 3, 4], a) # ([1, 2, 3], 1)
push!(a, 4)
occursin(4, a) # true
lcs([1, 2, 3, 4], a) # ([1, 2, 3, 4], 1)
length(a) # 7
a[1:5] # [1, 2, 3, 1, 2]
append!(a, a[1:end]) # double the length by appending
length(a) # 14Alternatively, for the more canonical case of dealing with character data, you may pass in a single string and it will be converted to a Vector{Char} for you internally:
a = SuffixAutomaton("ababc")
typeof(a) # SuffixAutomaton{Char}
eltype(a) # Char
occursin("ab", a) # true
findall("ab", a) # [1:2, 3:4]
lcs("abcdef", a) # ("abc", 1)
append!(a, "def")
lcs( "abcdef", a) # ("abcdef", 1)
lcs( "zabcdef", a) # ("abcdef", 2)
lcs("zzabcdef", a) # ("abcdef", 3)Note that we adhere to the f(needle, haystack) argument order. For lcs (longest common substring), however, there is a gotcha in the fact that the position returned is the position of the match in the query term (the "needle"), perhaps contrary to expectations; but this is the way the algorithm intrinsically works.