Commit e47fec1

Release v0.4.0 (#63)
* Add descriptions of variants of DaachorseError
* Fix the builder to work with fixed-size helpers (#28)
  * access extra via member func
  * use only FREE_STATES elements
  * add #[must_use]
  * fix by clippy
  * Update src/builder.rs (Co-authored-by: Koichi Akabe <[email protected]>)
* Address empty patterns (#29)
  * handle empty patterns
  * move some test
  * fix following clippy
* Add basic parts of charwise daachorse (#31)
  * add api
  * fix
  * Update src/charwise/mapper.rs (Co-authored-by: Koichi Akabe <[email protected]>)
  * Update src/charwise.rs (Co-authored-by: Koichi Akabe <[email protected]>)
  * add lifetime param
  * rename
  * add no_suffix
  * Update src/charwise/iter.rs (Co-authored-by: Koichi Akabe <[email protected]>)
  * Update src/charwise.rs (Co-authored-by: Koichi Akabe <[email protected]>)
  * Update src/charwise/iter.rs (Co-authored-by: Koichi Akabe <[email protected]>)
  * rm new
  * Update src/charwise/mapper.rs (Co-authored-by: Koichi Akabe <[email protected]>)
* Add trait of original NFA builder (#32)
  * generalize sparse_nfa
  * add comments (and minor)
  * move for github diff
  * add comment
  * unify
  * add iter
  * add error handling
  * add comment
  * Update src/builder.rs (Co-authored-by: Koichi Akabe <[email protected]>)
  * dyn dispatch -> enum dispatch
  * fix
  * fix the generalization
  * add wrapper for EdgeMapIter
  * add default to SparseNfaBuilderState
  * rm clone_to_vec
  * add EdgeMap for chars
  * use BTreeMap for EdgeMap
  * minor
  * use type alias for EdgeMap
* Use stack to traverse (#34)
* Use RefCell to avoid cloning edges (#33)
* Avoid to store unnecessary pointers in construction (#35)
  * use u8vec for labels
  * Update src/builder.rs (Co-authored-by: Koichi Akabe <[email protected]>)
* Add test for input order (#36)
  * add test
  * rm clone
  * enhance
* Separate NFA builder into another file (#37)
  * separate nfa_builder
  * rm dependency
* handle error msg (#39)
* Move tests to src/tests (#38)
* Add charwise builder (#40)
  * add builder and freq
  * add builder
  * substract
  * add mapper argument
  * add comment
* Add Result type alias (#41)
  * Add Result type alias
  * update
* Remove Default implementation of MatchKind (#42)
* Implement mappers and examples (#43)
  * add mappers
  * add tests
  * add examples
  * fix
  * add example
  * substract
  * Update src/charwise/mapper.rs (Co-authored-by: Koichi Akabe <[email protected]>)
  * implement a naive dat
  * fix
  * implement a naive dat
  * fix
  * fix length bug
  * rm FreqMapper
  * modify examples with multibyte chars
  * Update src/nfa_builder.rs (Co-authored-by: Koichi Akabe <[email protected]>)
* Move integration tests to tests directory (#45)
  * Move integration tests to tests directory
  * Add missing file
  * Remove unnecessary file
* Simplify test module on random strings (#44)
  * add naive find funcs
  * fix args
  * fix bug for findIter
* Add leftmost-first random test (#46)
  * add leftmost-first
  * fix
  * rm minmax
  * simplify
* move (#47)
* Add duplicate pattern tests (#49)
* Remove mapper (#50)
  * rm mapper
  * fix
* Add leftmost iterators of charwise version and examples (#51)
  * add leftmost
  * minor
* Add integration tests for charwise daachorse (#52)
  * add charwise
  * add tests
* add charwise bench (#53)
* make unsafe (#54)
* Refactor DaachorseError (#55)
  * Refactor DaachorseError
  * fix
  * fix
  * fmt
* Add `_with_iter()` functions (#56)
  * Add U8SliceIterator and use it in each FindIterator
  * Add _with_iter() functions
  * with -> from
  * Refactoring
* Add `_from_iter()` functions for charwise automata (#57)
  * Add U8SliceIterator and use it in each FindIterator
  * Add _with_iter() functions
  * with -> from
  * Add CharWithEndOffsetIterator
  * Add from_iter functions
  * Add inline
  * Refactoring
  * fix
  * clippy
* Add a test for CharWithEndOffsetIterator (#58)
* Enhance documents for charwise version (#59)
  * add example
  * minor
  * add
  * add doc
  * fix
  * fix linkage
  * minor
  * add Requirements
  * add
  * minor
  * minor
  * Update README.md (Co-authored-by: Koichi Akabe <[email protected]>)
  * Update src/charwise.rs (Co-authored-by: Koichi Akabe <[email protected]>)
  * Update src/lib.rs (Co-authored-by: Koichi Akabe <[email protected]>)
  * Update src/charwise.rs (Co-authored-by: Koichi Akabe <[email protected]>)
  * Update README.md (Co-authored-by: Koichi Akabe <[email protected]>)
* Add codes to measure memory usages (#60)
  * add memory stats
  * fix
* Bump up to 0.4.0 (#61)
  * Bump up to 0.4.0
  * Fix README
* Update figures (#62)
  * Update figures
  * Update README.md

Co-authored-by: Shunsuke Kanda <[email protected]>
1 parent e05a245 commit e47fec1

28 files changed (+3768 -1222 lines changed)

Cargo.toml

+2 -3
@@ -1,19 +1,18 @@
 [package]
 name = "daachorse"
-version = "0.3.0"
+version = "0.4.0"
 edition = "2021"
 authors = [
     "Koichi Akabe <[email protected]>",
     "Shunsuke Kanda <[email protected]>",
 ]
-description = "Daac Horse: Double-Array Aho-Corasick"
+description = "Daachorse: Double-Array Aho-Corasick"
 license = "MIT OR Apache-2.0"
 homepage = "https://github.com/legalforce-research/daachorse"
 repository = "https://github.com/legalforce-research/daachorse"
 readme = "README.md"
 keywords = ["string", "search", "text", "aho", "multi"]
 categories = ["text-processing"]
-autotests = false
 exclude = [".*"]

 [dependencies]

README.md

+33 -3
@@ -18,8 +18,8 @@ but also represents each state in a compact space of only 12 bytes.

 For example, compared to the NFA of the [aho-corasick](https://github.com/BurntSushi/aho-corasick) crate
 that is the most poplar Aho-Corasick implementation in Rust,
-Daachorse can perform pattern matching **3.1 times faster**
-while consuming **45% smaller** memory, when using a word dictionary of 675K patterns.
+Daachorse can perform pattern matching **3.0~5.1 times faster**
+while consuming **45~55% smaller** memory, when using a word dictionary of 675K patterns.
 Other experimental results can be found in
 [Wiki](https://github.com/legalforce-research/daachorse/wiki).

@@ -33,9 +33,13 @@ To use `daachorse`, depend on it in your Cargo manifest:
 # Cargo.toml

 [dependencies]
-daachorse = "0.3"
+daachorse = "0.4"
 ```

+### Requirements
+
+To compile this crate, Rust 1.58 or higher is required.
+
 ## Example usage

 Daachorse contains some search options,
@@ -169,6 +173,32 @@ assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));
 assert_eq!(None, it.next());
 ```

+### Building faster automaton on multibyte characters
+
+To build a faster automaton on multibyte characters, use `CharwiseDoubleArrayAhoCorasick` instead.
+
+The standard version `DoubleArrayAhoCorasick` handles strings as UTF-8 sequences
+and defines transition labels using byte values.
+On the other hand, `CharwiseDoubleArrayAhoCorasick` uses code point values of Unicode,
+resulting in reducing the number of transitions and faster matching.
+
+```rust
+use daachorse::charwise::CharwiseDoubleArrayAhoCorasick;
+
+let patterns = vec!["全世界", "世界", "に"];
+let pma = CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();
+
+let mut it = pma.find_iter("全世界中に");
+
+let m = it.next().unwrap();
+assert_eq!((0, 9, 0), (m.start(), m.end(), m.value()));
+
+let m = it.next().unwrap();
+assert_eq!((12, 15, 2), (m.start(), m.end(), m.value()));
+
+assert_eq!(None, it.next());
+```
+
 ## CLI

 This repository contains a command line interface named `daacfind` for searching patterns in text files.
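The new README section contrasts byte labels with code point labels. The following std-only sketch illustrates that difference on the README's own haystack. It does not use daachorse, and `label_counts` and `char_byte_spans` are hypothetical helper names introduced only for this illustration. It also shows why the asserted match positions `(0, 9)` and `(12, 15)` read as byte offsets even in the charwise example.

```rust
// Std-only illustration; these helpers are hypothetical and not part of daachorse.

/// Number of transition labels read while scanning `text`:
/// (bytewise = UTF-8 bytes, charwise = Unicode code points).
fn label_counts(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

/// Byte span `(start, end)` occupied by each character of `text`.
fn char_byte_spans(text: &str) -> Vec<(usize, usize)> {
    text.char_indices()
        .map(|(start, c)| (start, start + c.len_utf8()))
        .collect()
}

fn main() {
    let haystack = "全世界中に";
    let (bytes, chars) = label_counts(haystack);
    println!("bytewise labels: {}, charwise labels: {}", bytes, chars);

    // Each of these characters takes three bytes in UTF-8.
    for (c, (start, end)) in haystack.chars().zip(char_byte_spans(haystack)) {
        println!("{}: bytes {}..{}", c, start, end);
    }
}
```

For this haystack a bytewise automaton would follow 15 labels while a charwise one follows 5, which is the reduction in transitions the README text refers to.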

bench/Cargo.toml

+7 -3
@@ -4,14 +4,18 @@ version = "0.1.0"
 edition = "2021"

 [dependencies]
-
-[dev-dependencies]
 aho-corasick = "0.7.18" # Unlicense or MIT
-criterion = { version = "0.3", features = ["html_reports"] } # Apache-2.0 or MIT
 daachorse = { path = ".." } # Apache-2.0 or MIT
 fst = "0.4.7" # Unlicense or MIT
 yada = "0.5.0" # Apache-2.0 or MIT

+[dev-dependencies]
+criterion = { version = "0.3", features = ["html_reports"] } # Apache-2.0 or MIT
+
 [[bench]]
 name = "benchmark"
 harness = false
+
+[[bin]]
+name = "memory"
+path = "src/memory.rs"

bench/benches/benchmark.rs

+85
@@ -137,6 +137,10 @@ fn add_build_benches(group: &mut BenchmarkGroup<WallTime>, patterns: &[String])
         b.iter(|| daachorse::DoubleArrayAhoCorasick::new(patterns).unwrap());
     });

+    group.bench_function("daachorse/charwise", |b| {
+        b.iter(|| daachorse::charwise::CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap());
+    });
+
     group.bench_function("aho_corasick/nfa", |b| {
         b.iter(|| aho_corasick::AhoCorasick::new(patterns));
     });
@@ -197,6 +201,21 @@ fn add_find_benches(
         });
     });

+    group.bench_function("daachorse/charwise", |b| {
+        let pma = daachorse::charwise::CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();
+        b.iter(|| {
+            let mut sum = 0;
+            for haystack in haystacks {
+                for m in pma.find_iter(haystack) {
+                    sum += m.start() + m.end() + m.value();
+                }
+            }
+            if sum == 0 {
+                panic!();
+            }
+        });
+    });
+
     group.bench_function("aho_corasick/nfa", |b| {
         let pma = aho_corasick::AhoCorasick::new(patterns);
         b.iter(|| {
@@ -265,6 +284,36 @@ fn add_find_overlapping_benches(
         });
     });

+    group.bench_function("daachorse/charwise", |b| {
+        let pma = daachorse::charwise::CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();
+        b.iter(|| {
+            let mut sum = 0;
+            for haystack in haystacks {
+                for m in pma.find_overlapping_iter(haystack) {
+                    sum += m.start() + m.end() + m.value();
+                }
+            }
+            if sum == 0 {
+                panic!();
+            }
+        });
+    });
+
+    group.bench_function("daachorse/charwise/no_suffix", |b| {
+        let pma = daachorse::charwise::CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();
+        b.iter(|| {
+            let mut sum = 0;
+            for haystack in haystacks {
+                for m in pma.find_overlapping_no_suffix_iter(haystack) {
+                    sum += m.start() + m.end() + m.value();
+                }
+            }
+            if sum == 0 {
+                panic!();
+            }
+        });
+    });
+
     group.bench_function("aho_corasick/nfa", |b| {
         let pma = aho_corasick::AhoCorasick::new(patterns);
         b.iter(|| {
@@ -373,6 +422,24 @@ fn add_leftmost_longest_find_benches(
         });
     });

+    group.bench_function("daachorse/charwise", |b| {
+        let pma = daachorse::charwise::CharwiseDoubleArrayAhoCorasickBuilder::new()
+            .match_kind(daachorse::MatchKind::LeftmostLongest)
+            .build(patterns)
+            .unwrap();
+        b.iter(|| {
+            let mut sum = 0;
+            for haystack in haystacks {
+                for m in pma.leftmost_find_iter(haystack) {
+                    sum += m.start() + m.end() + m.value();
+                }
+            }
+            if sum == 0 {
+                panic!();
+            }
+        });
+    });
+
     group.bench_function("aho_corasick/nfa", |b| {
         let pma = aho_corasick::AhoCorasickBuilder::new()
             .match_kind(aho_corasick::MatchKind::LeftmostLongest)
@@ -432,6 +499,24 @@ fn add_leftmost_first_find_benches(
         });
     });

+    group.bench_function("daachorse/charwise", |b| {
+        let pma = daachorse::charwise::CharwiseDoubleArrayAhoCorasickBuilder::new()
+            .match_kind(daachorse::MatchKind::LeftmostFirst)
+            .build(patterns)
+            .unwrap();
+        b.iter(|| {
+            let mut sum = 0;
+            for haystack in haystacks {
+                for m in pma.leftmost_find_iter(haystack) {
+                    sum += m.start() + m.end() + m.value();
+                }
+            }
+            if sum == 0 {
+                panic!();
+            }
+        });
+    });
+
     group.bench_function("aho_corasick/nfa", |b| {
         let pma = aho_corasick::AhoCorasickBuilder::new()
             .match_kind(aho_corasick::MatchKind::LeftmostFirst)
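Every added benchmark folds each match into `sum` and panics if the total stays zero. Presumably this keeps the matching loop observable and catches a dictionary that matches nothing. The sketch below reproduces that checksum pattern against a naive std-only matcher. `naive_find_overlapping` and `checksum` are hypothetical names for this illustration, and the ordering of reported matches is not claimed to equal that of `find_overlapping_iter`.

```rust
// Std-only illustration of the benchmark's checksum pattern;
// the matcher is a naive stand-in, not the daachorse automaton.

/// Reports `(start, end, pattern_index)` for every occurrence of every
/// non-empty pattern, ordered by end position and then by pattern index.
fn naive_find_overlapping(patterns: &[&str], haystack: &str) -> Vec<(usize, usize, usize)> {
    let hay = haystack.as_bytes();
    let mut matches = Vec::new();
    for end in 1..=hay.len() {
        for (value, pattern) in patterns.iter().enumerate() {
            let pat = pattern.as_bytes();
            if !pat.is_empty() && hay[..end].ends_with(pat) {
                matches.push((end - pat.len(), end, value));
            }
        }
    }
    matches
}

/// Folds all matches into one number, mirroring `sum += start + end + value`.
fn checksum(patterns: &[&str], haystacks: &[&str]) -> usize {
    let mut sum = 0;
    for haystack in haystacks {
        for (start, end, value) in naive_find_overlapping(patterns, haystack) {
            sum += start + end + value;
        }
    }
    sum
}

fn main() {
    let patterns = ["bcd", "ab", "a"];
    let sum = checksum(&patterns, &["abcd"]);
    // Same guard as in the benchmarks: a zero checksum means nothing matched.
    if sum == 0 {
        panic!("no matches found");
    }
    println!("checksum = {}", sum);
}
```

A naive matcher like this is quadratic in the worst case, so it is only suitable as a reference for small inputs.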

bench/src/memory.rs

+82
@@ -0,0 +1,82 @@
+use std::convert::TryFrom;
+use std::fs::File;
+use std::io::BufRead;
+use std::io::BufReader;
+use std::path::Path;
+
+fn main() {
+    {
+        println!("== data/words_100000 ==");
+        let mut patterns = load_file("data/words_100000");
+        patterns.sort_unstable();
+        show_memory_stats(&patterns);
+    }
+    {
+        println!("== data/unidic/unidic ==");
+        let mut patterns = load_file("data/unidic/unidic");
+        patterns.sort_unstable();
+        show_memory_stats(&patterns);
+    }
+}
+
+fn show_memory_stats(patterns: &[String]) {
+    {
+        let pma = daachorse::DoubleArrayAhoCorasick::new(patterns).unwrap();
+        format_memory("daachorse (bytewise)", pma.heap_bytes());
+    }
+    {
+        let pma = daachorse::charwise::CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();
+        format_memory("daachorse (charwise)", pma.heap_bytes());
+    }
+    {
+        let pma = aho_corasick::AhoCorasick::new(patterns);
+        format_memory("aho_corasick (nfa)", pma.heap_bytes());
+    }
+    {
+        let pma = aho_corasick::AhoCorasickBuilder::new()
+            .dfa(true)
+            .build(patterns);
+        format_memory("aho_corasick (dfa)", pma.heap_bytes());
+    }
+    {
+        let fst = fst::raw::Fst::from_iter_map(
+            patterns
+                .iter()
+                .cloned()
+                .enumerate()
+                .map(|(i, pattern)| (pattern, i as u64)),
+        )
+        .unwrap();
+        format_memory("fst", fst.as_bytes().len());
+    }
+    {
+        let data = yada::builder::DoubleArrayBuilder::build(
+            &patterns
+                .iter()
+                .cloned()
+                .enumerate()
+                .map(|(i, pattern)| (pattern, u32::try_from(i).unwrap()))
+                .collect::<Vec<_>>(),
+        )
+        .unwrap();
+        format_memory("yada", data.len());
+    }
+}
+
+fn format_memory(title: &str, bytes: usize) {
+    println!(
+        "{}: {} bytes, {:.3} MiB",
+        title,
+        bytes,
+        bytes as f64 / (1024.0 * 1024.0)
+    );
+}
+
+fn load_file<P>(path: P) -> Vec<String>
+where
+    P: AsRef<Path>,
+{
+    let file = File::open(path).unwrap();
+    let buf = BufReader::new(file);
+    buf.lines().map(|line| line.unwrap()).collect()
+}
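The two helpers in `memory.rs` print directly and unwrap I/O errors. A possible variant, sketched here as a std-only illustration and not as the repository's code, returns the formatted string and propagates read errors so that both pieces can be checked in isolation. The name `load_lines` and the changed `format_memory` signature (returning `String`) are assumptions made for this sketch.

```rust
use std::io::{self, BufRead};

/// Same layout as the `format_memory` in the diff, but returns the string
/// instead of printing it (an assumption made for testability).
fn format_memory(title: &str, bytes: usize) -> String {
    format!(
        "{}: {} bytes, {:.3} MiB",
        title,
        bytes,
        bytes as f64 / (1024.0 * 1024.0)
    )
}

/// Hypothetical counterpart of `load_file`: reads one pattern per line from
/// any buffered reader and surfaces the first I/O error instead of panicking.
fn load_lines<R: BufRead>(reader: R) -> io::Result<Vec<String>> {
    reader.lines().collect()
}

fn main() {
    // A byte slice implements `BufRead`, so no file is needed for the demo.
    let input = "b\na\n";
    match load_lines(input.as_bytes()) {
        Ok(mut patterns) => {
            patterns.sort_unstable();
            let total: usize = patterns.iter().map(|p| p.len()).sum();
            println!("{}", format_memory("patterns (raw bytes)", total));
        }
        Err(err) => eprintln!("{:?}", err),
    }
}
```

Collecting an iterator of `io::Result<String>` into `io::Result<Vec<String>>` stops at the first error, which is the standard-library behavior this sketch relies on.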

daacfind/src/main.rs

+16 -10
@@ -169,23 +169,29 @@ fn main() -> Result<(), Box<dyn std::error::Error>> {
                     let line = match line {
                         Ok(line) => line,
                         Err(err) => {
-                            if let Some(filename) = filename {
-                                eprintln!("{}: {:?}", filename, err);
-                            } else {
-                                eprintln!("{:?}", err);
-                            }
+                            filename.map_or_else(
+                                || {
+                                    eprintln!("{:?}", err);
+                                },
+                                |filename| {
+                                    eprintln!("{}: {:?}", filename, err);
+                                },
+                            );
                             break;
                         }
                     };
                     find_and_output(&pma, &line, filename, line_number, opt.color, &mut stdout)?;
                 }
             }
             Err(err) => {
-                if let Some(filename) = filename.to_str() {
-                    eprintln!("{}: {:?}", filename, err);
-                } else {
-                    eprintln!("{:?}", err);
-                }
+                filename.to_str().map_or_else(
+                    || {
+                        eprintln!("{:?}", err);
+                    },
+                    |filename| {
+                        eprintln!("{}: {:?}", filename, err);
+                    },
+                );
             }
         }
     }
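The `daacfind` change replaces `if let ... else` with `Option::map_or_else`, whose first closure handles `None` and whose second handles `Some`. The sketch below checks that both forms yield the same message. It builds strings instead of calling `eprintln!` so the result can be compared, and the function names are hypothetical.

```rust
use std::fmt::Debug;

/// Message built with the original `if let` form.
fn message_if_let<E: Debug>(filename: Option<&str>, err: &E) -> String {
    if let Some(filename) = filename {
        format!("{}: {:?}", filename, err)
    } else {
        format!("{:?}", err)
    }
}

/// Message built with `map_or_else`: the first closure is the `None` case,
/// the second receives the value inside `Some`.
fn message_map_or_else<E: Debug>(filename: Option<&str>, err: &E) -> String {
    filename.map_or_else(
        || format!("{:?}", err),
        |filename| format!("{}: {:?}", filename, err),
    )
}

fn main() {
    let err = "boom";
    for filename in [Some("a.txt"), None] {
        assert_eq!(
            message_if_let(filename, &err),
            message_map_or_else(filename, &err)
        );
        eprintln!("{}", message_map_or_else(filename, &err));
    }
}
```

The argument order is the part that is easy to get wrong: the fallback for `None` comes first, which is why the diff lists the filename-less `eprintln!` before the one that prints the filename.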
