1
- # 🐎 daachorse
1
+ # 🐎 daachorse: Double-Array Aho-Corasick
2
2
3
- Daac Horse: Double-Array Aho-Corasick
3
+ A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure.
4
4
5
5
[![Crates.io](https://img.shields.io/crates/v/daachorse)](https://crates.io/crates/daachorse)
6
6
[![Documentation](https://docs.rs/daachorse/badge.svg)](https://docs.rs/daachorse)
7
7
![Build Status](https://github.com/legalforce-research/daachorse/actions/workflows/rust.yml/badge.svg)
8
8
9
9
## Overview
10
10
11
- A fast implementation of the Aho-Corasick algorithm using Double-Array Trie.
11
+ Daachorse is a crate for fast multiple pattern matching using
12
+ the [Aho-Corasick algorithm](https://dl.acm.org/doi/10.1145/360825.360855),
13
+ running in linear time over the length of the input text.
14
+ For time- and memory-efficiency, the pattern match automaton is implemented using
15
+ the [compact double-array data structure](https://doi.org/10.1016/j.ipm.2006.04.004).
16
+ The data structure not only supports constant-time state-to-state traversal,
17
+ but also represents each state in a compact space of only 12 bytes.
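
To make the 12-byte figure concrete, here is a minimal sketch of one way a state could be packed into three 32-bit words. The struct, its field names, and the 24-bit/8-bit split are assumptions made for illustration only; they are not taken from the crate's source.

```rust
use std::mem::size_of;

/// Hypothetical 12-byte automaton state (NOT the crate's actual definition).
#[repr(C)]
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct State {
    /// Offset used to locate the child states of this state.
    pub base: u32,
    /// Index of the state that the failure link points to.
    pub fail: u32,
    /// Upper 24 bits: position of the output list (assumed packing).
    /// Lower 8 bits: byte label of the incoming transition.
    pub output_and_check: u32,
}

impl State {
    /// Packs the output position and the check byte into a single word.
    pub fn new(base: u32, fail: u32, output_pos: u32, check: u8) -> Self {
        assert!(output_pos < (1 << 24), "output position must fit in 24 bits");
        Self {
            base,
            fail,
            output_and_check: (output_pos << 8) | u32::from(check),
        }
    }

    /// Byte label stored in the lower 8 bits.
    pub fn check(&self) -> u8 {
        (self.output_and_check & 0xFF) as u8
    }

    /// Output-list position stored in the upper 24 bits.
    pub fn output_pos(&self) -> u32 {
        self.output_and_check >> 8
    }
}

/// Size in bytes of one packed state.
pub fn state_size() -> usize {
    size_of::<State>()
}
```

With three `u32` fields and `#[repr(C)]`, `state_size()` returns 12, so an array of such states needs 12 bytes per automaton state.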
12
18
13
- ### Examples
19
+ For example, compared to the NFA of the [aho-corasick](https://github.com/BurntSushi/aho-corasick) crate,
20
+ which is the most popular Aho-Corasick implementation in Rust,
21
+ Daachorse can perform pattern matching **3.1 times faster**
22
+ while consuming **45% less** memory, when using a word dictionary of 675K patterns.
23
+ Other experimental results can be found in
24
+ the [Wiki](https://github.com/legalforce-research/daachorse/wiki).
25
+
26
+ ![](./figures/comparison.svg)
27
+
28
+ ## Installation
29
+
30
+ To use ` daachorse ` , depend on it in your Cargo manifest:
31
+
32
+ ```toml
33
+ # Cargo.toml
34
+
35
+ [dependencies]
36
+ daachorse = "0.3"
37
+ ```
38
+
39
+ ## Example usage
40
+
41
+ Daachorse provides several search options,
42
+ ranging from basic matching with the Aho-Corasick algorithm to trickier matching.
43
+ All of them will run very fast based on the double-array data structure and
44
+ can be easily plugged into your application as shown below.
45
+
46
+ ### Finding overlapped occurrences
47
+
48
+ To search for all occurrences of registered patterns
49
+ that allow for positional overlap in the input text,
50
+ use ` find_overlapping_iter() ` . When you use ` new() ` for construction,
51
+ a unique identifier is assigned to each pattern in input order.
52
+ The match result has the byte positions of the occurrence and its identifier.
53
+
54
+ ```rust
55
+ use daachorse::DoubleArrayAhoCorasick;
56
+
57
+ let patterns = vec!["bcd", "ab", "a"];
58
+ let pma = DoubleArrayAhoCorasick::new(patterns).unwrap();
59
+
60
+ let mut it = pma.find_overlapping_iter("abcd");
61
+
62
+ let m = it.next().unwrap();
63
+ assert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));
64
+
65
+ let m = it.next().unwrap();
66
+ assert_eq!((0, 2, 1), (m.start(), m.end(), m.value()));
67
+
68
+ let m = it.next().unwrap();
69
+ assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));
70
+
71
+ assert_eq!(None, it.next());
72
+ ```
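
To make the expected output easier to check by hand, the following is a naive, standard-library-only reference for overlapping search. It is not how Daachorse works internally (there is no automaton here, and it runs in time proportional to text length times total pattern length). The function name `naive_find_overlapping` is hypothetical, and the ordering of ties (by pattern index when several patterns end at the same position) is a choice made for this sketch.

```rust
/// Naive reference for overlapping search (illustration only, not the crate's algorithm).
/// Returns every occurrence as (start, end, pattern_index), ordered by end position.
pub fn naive_find_overlapping(patterns: &[&str], text: &str) -> Vec<(usize, usize, usize)> {
    let text = text.as_bytes();
    let mut matches = Vec::new();
    // Visit each end position in increasing order, as a left-to-right scan would.
    for end in 1..=text.len() {
        for (id, pat) in patterns.iter().enumerate() {
            let pat = pat.as_bytes();
            // A pattern occurs here if the prefix text[..end] ends with it.
            if !pat.is_empty() && text[..end].ends_with(pat) {
                matches.push((end - pat.len(), end, id));
            }
        }
    }
    matches
}
```

For the patterns `["bcd", "ab", "a"]` and the text `"abcd"`, this returns `[(0, 1, 2), (0, 2, 1), (1, 4, 0)]`, the same triples asserted in the example above.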
73
+
74
+ ### Finding non-overlapped occurrences with shortest matching
75
+
76
+ If you do not want to allow positional overlap, use ` find_iter() ` instead.
77
+ It reports the first pattern found in each iteration,
78
+ which is the shortest pattern starting from each search position.
14
79
15
80
```rust
16
81
use daachorse::DoubleArrayAhoCorasick;
17
82
18
83
let patterns = vec!["bcd", "ab", "a"];
19
84
let pma = DoubleArrayAhoCorasick::new(patterns).unwrap();
20
85
86
+ let mut it = pma.find_iter("abcd");
87
+
88
+ let m = it.next().unwrap();
89
+ assert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));
90
+
91
+ let m = it.next().unwrap();
92
+ assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));
93
+
94
+ assert_eq!(None, it.next());
95
+ ```
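
The non-overlapping behaviour can be approximated with a naive, standard-library-only sketch: report the occurrence that ends earliest, then resume the search right after it. This is only a reading of the description above, not the crate's implementation. The name `naive_find_shortest` is hypothetical, and breaking ties by pattern index is an assumption of the sketch.

```rust
/// Naive reference for non-overlapping, earliest-ending search (illustration only).
/// Returns occurrences as (start, end, pattern_index).
pub fn naive_find_shortest(patterns: &[&str], text: &str) -> Vec<(usize, usize, usize)> {
    let text = text.as_bytes();
    let mut matches = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        // Look for the earliest-ending occurrence lying entirely at or after `pos`.
        let mut found = None;
        'scan: for end in pos + 1..=text.len() {
            for (id, pat) in patterns.iter().enumerate() {
                let pat = pat.as_bytes();
                if !pat.is_empty() && text[pos..end].ends_with(pat) {
                    found = Some((end - pat.len(), end, id));
                    break 'scan;
                }
            }
        }
        match found {
            Some(m) => {
                matches.push(m);
                // Resume after the reported occurrence so that matches never overlap.
                pos = m.1;
            }
            None => break,
        }
    }
    matches
}
```

For the patterns `["bcd", "ab", "a"]` and the text `"abcd"`, this returns `[(0, 1, 2), (1, 4, 0)]`: `ab` is never reported because `a` ends first and the search restarts at byte 1.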
96
+
97
+ ### Finding non-overlapped occurrences with longest matching
98
+
99
+ If you want to search for the longest pattern without positional overlap in each iteration,
100
+ use ` leftmost_find_iter() ` and specify ` MatchKind::LeftmostLongest ` during construction.
101
+
102
+ ```rust
103
+ use daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};
104
+
105
+ let patterns = vec!["ab", "a", "abcd"];
106
+ let pma = DoubleArrayAhoCorasickBuilder::new()
107
+     .match_kind(MatchKind::LeftmostLongest)
108
+     .build(&patterns)
109
+     .unwrap();
110
+
111
+ let mut it = pma.leftmost_find_iter("abcd");
112
+
113
+ let m = it.next().unwrap();
114
+ assert_eq!((0, 4, 2), (m.start(), m.end(), m.value()));
115
+
116
+ assert_eq!(None, it.next());
117
+ ```
118
+
119
+ ### Finding non-overlapped occurrences with leftmost-first matching
120
+
121
+ If you want to find the earliest registered pattern
122
+ among ones starting from the search position,
123
+ use ` leftmost_find_iter() ` and specify ` MatchKind::LeftmostFirst ` during construction.
124
+
125
+ This is the so-called *leftmost-first match*, a slightly tricky search option that is also
126
+ supported in the [aho-corasick](https://github.com/BurntSushi/aho-corasick) crate.
127
+ For example, in the following code,
128
+ ` ab ` is reported because it is the earliest registered one.
129
+
130
+ ```rust
131
+ use daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};
132
+
133
+ let patterns = vec!["ab", "a", "abcd"];
134
+ let pma = DoubleArrayAhoCorasickBuilder::new()
135
+     .match_kind(MatchKind::LeftmostFirst)
136
+     .build(&patterns)
137
+     .unwrap();
138
+
139
+ let mut it = pma.leftmost_find_iter("abcd");
140
+
141
+ let m = it.next().unwrap();
142
+ assert_eq!((0, 2, 0), (m.start(), m.end(), m.value()));
143
+
144
+ assert_eq!(None, it.next());
145
+ ```
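
The difference between the two leftmost modes can be seen in a naive, standard-library-only sketch: find the leftmost position where any pattern starts, then pick either the longest candidate or the one registered first. The names `NaiveKind` and `naive_leftmost` are hypothetical, and the code only illustrates the semantics described above; it does not reflect how the crate implements them.

```rust
/// Tie-breaking rule among patterns that start at the same leftmost position.
#[derive(Clone, Copy)]
pub enum NaiveKind {
    /// Prefer the longest pattern.
    Longest,
    /// Prefer the pattern registered earliest (smallest index).
    First,
}

/// Naive reference for leftmost, non-overlapping search (illustration only).
/// Returns occurrences as (start, end, pattern_index).
pub fn naive_leftmost(
    patterns: &[&str],
    text: &str,
    kind: NaiveKind,
) -> Vec<(usize, usize, usize)> {
    let text = text.as_bytes();
    let mut matches = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        // All patterns that occur starting exactly at `pos`, as (length, index).
        let candidates = patterns
            .iter()
            .enumerate()
            .filter(|(_, p)| !p.is_empty() && text[pos..].starts_with(p.as_bytes()))
            .map(|(id, p)| (p.len(), id));
        let best = match kind {
            NaiveKind::Longest => candidates.max_by_key(|&(len, _)| len),
            NaiveKind::First => candidates.min_by_key(|&(_, id)| id),
        };
        match best {
            Some((len, id)) => {
                matches.push((pos, pos + len, id));
                // Skip past the match so that later matches cannot overlap it.
                pos += len;
            }
            // Nothing starts here; try the next byte.
            None => pos += 1,
        }
    }
    matches
}
```

With the patterns `["ab", "a", "abcd"]` and the text `"abcd"`, `NaiveKind::Longest` yields `[(0, 4, 2)]` and `NaiveKind::First` yields `[(0, 2, 0)]`, matching the two examples above.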
146
+
147
+ ### Associating arbitrary values with patterns
148
+
149
+ To build the automaton from pairs of a pattern and an integer value instead of assigning
150
+ identifiers automatically, use ` with_values() ` .
151
+
152
+ ```rust
153
+ use daachorse::DoubleArrayAhoCorasick;
154
+
155
+ let patvals = vec![("bcd", 0), ("ab", 10), ("a", 20)];
156
+ let pma = DoubleArrayAhoCorasick::with_values(patvals).unwrap();
157
+
21
158
let mut it = pma.find_overlapping_iter("abcd");
22
159
23
160
let m = it.next().unwrap();
24
- assert_eq!((0, 1, 2), (m.start(), m.end(), m.pattern()));
161
+ assert_eq!((0, 1, 20), (m.start(), m.end(), m.value()));
25
162
26
163
let m = it.next().unwrap();
27
- assert_eq!((0, 2, 1), (m.start(), m.end(), m.pattern()));
164
+ assert_eq!((0, 2, 10), (m.start(), m.end(), m.value()));
28
165
29
166
let m = it.next().unwrap();
30
- assert_eq!((1, 4, 0), (m.start(), m.end(), m.pattern()));
167
+ assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));
31
168
32
169
assert_eq!(None, it.next());
33
170
```
34
171
172
+ ## CLI
173
+
174
+ This repository contains a command line interface named ` daacfind ` for searching patterns in text files.
175
+
176
+ ```
177
+ % cat ./pat.txt
178
+ fn
179
+ const fn
180
+ pub fn
181
+ unsafe fn
182
+ % find . -name "*.rs" | xargs cargo run --release -p daacfind -- --color=auto -nf ./pat.txt
183
+ ...
184
+ ...
185
+ ./src/errors.rs:67: fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
186
+ ./src/errors.rs:81: fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
187
+ ./src/lib.rs:115: fn default() -> Self {
188
+ ./src/lib.rs:126: pub fn base(&self) -> Option<u32> {
189
+ ./src/lib.rs:131: pub const fn check(&self) -> u8 {
190
+ ./src/lib.rs:136: pub const fn fail(&self) -> u32 {
191
+ ...
192
+ ...
193
+ ```
194
+
35
195
## Disclaimer
36
196
37
197
This software is developed by LegalForce, Inc.,
@@ -48,6 +208,8 @@ Licensed under either of
48
208
49
209
at your option.
50
210
211
+ For software under ` bench/data ` , follow the license terms of each piece of software.
212
+
51
213
## Contribution
52
214
53
215
Unless you explicitly state otherwise, any contribution intentionally submitted