Commit e05a245

Merge pull request #26 from legalforce-research/develop
Release 0.3.0
2 parents 16a23d7 + 1a1bd13 commit e05a245

18 files changed, +3770 -814 lines changed

.github/workflows/rust.yml

+4 -4
@@ -1,8 +1,8 @@
 on:
   push:
-    branches: [ main ]
+    branches: [ main, develop ]
   pull_request:
-    branches: [ main ]
+    branches: [ main, develop ]
 
 name: build
 
@@ -34,7 +34,7 @@ jobs:
         uses: actions-rs/cargo@v1
         with:
           command: clippy
-          args: -- -D warnings -W clippy::nursery
+          args: --all -- -D warnings -W clippy::nursery -W clippy::cast_lossless -W clippy::cast_possible_truncation
 
       - name: Run cargo test
         uses: actions-rs/cargo@v1
@@ -69,7 +69,7 @@ jobs:
         uses: actions-rs/cargo@v1
         with:
           command: clippy
-          args: -- -D warnings -W clippy::nursery
+          args: --all -- -D warnings -W clippy::nursery -W clippy::cast_lossless -W clippy::cast_possible_truncation
 
       - name: Run cargo test
         uses: actions-rs/cargo@v1
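Note on the two lints added to the clippy invocation: both target `as` casts. The sketch below is only an illustration of what they flag and of one common way to satisfy them. The function names are hypothetical and are not taken from the daachorse sources.

```rust
use std::convert::TryFrom;

/// Widening a `u8` to `u32` can never lose information, so
/// `clippy::cast_lossless` prefers `u32::from(b)` over `b as u32`.
/// (Hypothetical helper, for illustration only.)
fn widen(b: u8) -> u32 {
    u32::from(b)
}

/// Narrowing a `u64` to `u32` may drop the high bits, which is what
/// `clippy::cast_possible_truncation` warns about for `n as u32`.
/// A checked conversion surfaces the overflow instead of hiding it.
/// (Hypothetical helper, for illustration only.)
fn narrow(n: u64) -> Option<u32> {
    u32::try_from(n).ok()
}
```

Here `narrow` returns `None` for any value above `u32::MAX`, where `n as u32` would silently wrap. Whether the crate uses checked conversions or explicit `#[allow]` attributes at such sites is not visible in this excerpt.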

.gitignore

+14 -1
@@ -1 +1,14 @@
-/target
+# Generated by Cargo
+# will have compiled files and executables
+debug/
+target/
+
+# Remove Cargo.lock from gitignore if creating an executable, leave it for libraries
+# More information here https://doc.rust-lang.org/cargo/guide/cargo-toml-vs-cargo-lock.html
+Cargo.lock
+
+# These are backup files generated by rustfmt
+**/*.rs.bk
+
+# MSVC Windows builds of rustc generate these, which store debugging information
+*.pdb

Cargo.toml

+3 -3
@@ -1,7 +1,7 @@
 [package]
 name = "daachorse"
-version = "0.2.1"
-edition = "2018"
+version = "0.3.0"
+edition = "2021"
 authors = [
     "Koichi Akabe <[email protected]>",
     "Shunsuke Kanda <[email protected]>",
@@ -17,12 +17,12 @@ autotests = false
 exclude = [".*"]
 
 [dependencies]
-byteorder = "1.4.3" # Unlicense or MIT
 
 [dev-dependencies]
 rand = "0.8.4" # MIT or Apache-2.0
 
 [workspace]
 members = [
     "bench",
+    "daacfind",
 ]
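Note on the dropped `byteorder` dependency: the code that replaces it is not part of this excerpt. One std-only way to do the same job, shown here purely as an assumption about the kind of change involved, is to use `u32::to_le_bytes` and `u32::from_le_bytes`. The helper names below are hypothetical.

```rust
use std::io::{self, Read, Write};

/// Writes `x` in little-endian byte order using only the standard library.
/// (Hypothetical helper; not taken from the daachorse sources.)
fn write_u32_le<W: Write>(mut wtr: W, x: u32) -> io::Result<()> {
    wtr.write_all(&x.to_le_bytes())
}

/// Reads four bytes and interprets them as a little-endian `u32`.
/// (Hypothetical helper; not taken from the daachorse sources.)
fn read_u32_le<R: Read>(mut rdr: R) -> io::Result<u32> {
    let mut buf = [0u8; 4];
    rdr.read_exact(&mut buf)?;
    Ok(u32::from_le_bytes(buf))
}
```

Writing `0x1234_5678` into a `Vec<u8>` with `write_u32_le` yields the bytes `78 56 34 12`, and `read_u32_le` on that slice returns the original value.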

README.md

+169 -7
@@ -1,37 +1,197 @@
-# 🐎 daachorse
+# 🐎 daachorse: Double-Array Aho-Corasick
 
-Daac Horse: Double-Array Aho-Corasick
+A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure.
 
 [![Crates.io](https://img.shields.io/crates/v/daachorse)](https://crates.io/crates/daachorse)
 [![Documentation](https://docs.rs/daachorse/badge.svg)](https://docs.rs/daachorse)
 ![Build Status](https://github.com/legalforce-research/daachorse/actions/workflows/rust.yml/badge.svg)
 
 ## Overview
 
-A fast implementation of the Aho-Corasick algorithm using Double-Array Trie.
+Daachorse is a crate for fast multiple pattern matching using
+the [Aho-Corasick algorithm](https://dl.acm.org/doi/10.1145/360825.360855),
+running in linear time over the length of the input text.
+For time and memory efficiency, the pattern match automaton is implemented using
+the [compact double-array data structure](https://doi.org/10.1016/j.ipm.2006.04.004).
+The data structure not only supports constant-time state-to-state traversal,
+but also represents each state in a compact space of only 12 bytes.
 
-### Examples
+For example, compared to the NFA of the [aho-corasick](https://github.com/BurntSushi/aho-corasick) crate,
+which is the most popular Aho-Corasick implementation in Rust,
+Daachorse can perform pattern matching **3.1 times faster**
+while consuming **45% less** memory, when using a word dictionary of 675K patterns.
+Other experimental results can be found in
+the [Wiki](https://github.com/legalforce-research/daachorse/wiki).
+
+![](./figures/comparison.svg)
+
+## Installation
+
+To use `daachorse`, depend on it in your Cargo manifest:
+
+```toml
+# Cargo.toml
+
+[dependencies]
+daachorse = "0.3"
+```
+
+## Example usage
+
+Daachorse offers several search options,
+ranging from basic matching with the Aho-Corasick algorithm to trickier matching.
+All of them run very fast thanks to the double-array data structure and
+can be easily plugged into your application as shown below.
+
+### Finding overlapped occurrences
+
+To search for all occurrences of registered patterns,
+allowing positional overlap in the input text,
+use `find_overlapping_iter()`. When you use `new()` for construction,
+unique identifiers are assigned to each pattern in the input order.
+The match result has the byte positions of the occurrence and its identifier.
+
+```rust
+use daachorse::DoubleArrayAhoCorasick;
+
+let patterns = vec!["bcd", "ab", "a"];
+let pma = DoubleArrayAhoCorasick::new(patterns).unwrap();
+
+let mut it = pma.find_overlapping_iter("abcd");
+
+let m = it.next().unwrap();
+assert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));
+
+let m = it.next().unwrap();
+assert_eq!((0, 2, 1), (m.start(), m.end(), m.value()));
+
+let m = it.next().unwrap();
+assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));
+
+assert_eq!(None, it.next());
+```
+
+### Finding non-overlapped occurrences with shortest matching
+
+If you do not want to allow positional overlap, use `find_iter()` instead.
+It reports the first pattern found in each iteration,
+which is the shortest pattern starting from each search position.
 
 ```rust
 use daachorse::DoubleArrayAhoCorasick;
 
 let patterns = vec!["bcd", "ab", "a"];
 let pma = DoubleArrayAhoCorasick::new(patterns).unwrap();
 
+let mut it = pma.find_iter("abcd");
+
+let m = it.next().unwrap();
+assert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));
+
+let m = it.next().unwrap();
+assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));
+
+assert_eq!(None, it.next());
+```
+
+### Finding non-overlapped occurrences with longest matching
+
+If you want to search for the longest pattern without positional overlap in each iteration,
+use `leftmost_find_iter()` and specify `MatchKind::LeftmostLongest` at construction.
+
+```rust
+use daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};
+
+let patterns = vec!["ab", "a", "abcd"];
+let pma = DoubleArrayAhoCorasickBuilder::new()
+    .match_kind(MatchKind::LeftmostLongest)
+    .build(&patterns)
+    .unwrap();
+
+let mut it = pma.leftmost_find_iter("abcd");
+
+let m = it.next().unwrap();
+assert_eq!((0, 4, 2), (m.start(), m.end(), m.value()));
+
+assert_eq!(None, it.next());
+```
+
+### Finding non-overlapped occurrences with leftmost-first matching
+
+If you want to find the earliest registered pattern
+among those starting from the search position,
+use `leftmost_find_iter()` and specify `MatchKind::LeftmostFirst`.
+
+This is the so-called *leftmost-first match*, a slightly tricky search option that is also
+supported in the [aho-corasick](https://github.com/BurntSushi/aho-corasick) crate.
+For example, in the following code,
+`ab` is reported because it is the earliest registered one.
+
+```rust
+use daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};
+
+let patterns = vec!["ab", "a", "abcd"];
+let pma = DoubleArrayAhoCorasickBuilder::new()
+    .match_kind(MatchKind::LeftmostFirst)
+    .build(&patterns)
+    .unwrap();
+
+let mut it = pma.leftmost_find_iter("abcd");
+
+let m = it.next().unwrap();
+assert_eq!((0, 2, 0), (m.start(), m.end(), m.value()));
+
+assert_eq!(None, it.next());
+```
+
+### Associating arbitrary values with patterns
+
+To build the automaton from pairs of a pattern and an integer value instead of assigning
+identifiers automatically, use `with_values()`.
+
+```rust
+use daachorse::DoubleArrayAhoCorasick;
+
+let patvals = vec![("bcd", 0), ("ab", 10), ("a", 20)];
+let pma = DoubleArrayAhoCorasick::with_values(patvals).unwrap();
+
 let mut it = pma.find_overlapping_iter("abcd");
 
 let m = it.next().unwrap();
-assert_eq!((0, 1, 2), (m.start(), m.end(), m.pattern()));
+assert_eq!((0, 1, 20), (m.start(), m.end(), m.value()));
 
 let m = it.next().unwrap();
-assert_eq!((0, 2, 1), (m.start(), m.end(), m.pattern()));
+assert_eq!((0, 2, 10), (m.start(), m.end(), m.value()));
 
 let m = it.next().unwrap();
-assert_eq!((1, 4, 0), (m.start(), m.end(), m.pattern()));
+assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));
 
 assert_eq!(None, it.next());
 ```
 
+## CLI
+
+This repository contains a command line interface named `daacfind` for searching patterns in text files.
+
+```
+% cat ./pat.txt
+fn
+const fn
+pub fn
+unsafe fn
+% find . -name "*.rs" | xargs cargo run --release -p daacfind -- --color=auto -nf ./pat.txt
+...
+...
+./src/errors.rs:67: fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+./src/errors.rs:81: fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+./src/lib.rs:115: fn default() -> Self {
+./src/lib.rs:126: pub fn base(&self) -> Option<u32> {
+./src/lib.rs:131: pub const fn check(&self) -> u8 {
+./src/lib.rs:136: pub const fn fail(&self) -> u32 {
+...
+...
+```
+
 ## Disclaimer
 
 This software is developed by LegalForce, Inc.,
@@ -48,6 +208,8 @@ Licensed under either of
 
 at your option.
 
+For software under `bench/data`, follow the license terms of each piece of software.
+
 ## Contribution
 
 Unless you explicitly state otherwise, any contribution intentionally submitted
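Note on the four search modes documented in the README diff: their differences are easy to miss when reading the examples one at a time. The following is a naive, std-only reference sketch that reproduces the README results for `"abcd"`. It is quadratic, does not use the double-array automaton, and is not the crate's implementation. The names `Match`, `Kind`, `overlapping`, and `non_overlapping` are hypothetical, and the tie-breaking for matches that share an end position is an assumption.

```rust
/// A match as `(start, end, pattern_id)`, with byte offsets into the text.
type Match = (usize, usize, usize);

/// Non-overlapping search flavours, named after the README sections.
#[derive(Clone, Copy)]
enum Kind {
    Shortest,
    LeftmostLongest,
    LeftmostFirst,
}

/// Every occurrence of every pattern, overlaps included, ordered by end
/// position. Empty patterns are skipped.
fn overlapping(patterns: &[&str], text: &str) -> Vec<Match> {
    let bytes = text.as_bytes();
    let mut out = Vec::new();
    for end in 1..=bytes.len() {
        for (id, pat) in patterns.iter().enumerate() {
            if !pat.is_empty() && bytes[..end].ends_with(pat.as_bytes()) {
                out.push((end - pat.len(), end, id));
            }
        }
    }
    out
}

/// Repeatedly picks one match that starts at or after the current search
/// position, then resumes searching from the end of that match.
fn non_overlapping(patterns: &[&str], text: &str, kind: Kind) -> Vec<Match> {
    let all = overlapping(patterns, text);
    let mut out = Vec::new();
    let mut pos = 0;
    loop {
        let candidates = all.iter().copied().filter(|&(start, _, _)| start >= pos);
        let best = match kind {
            // First match to finish; ties on `end` go to the longer one (assumption).
            Kind::Shortest => candidates.min_by_key(|&(s, e, _)| (e, s)),
            // Leftmost start, then the longest pattern at that start.
            Kind::LeftmostLongest => {
                candidates.min_by_key(|&(s, e, _)| (s, std::cmp::Reverse(e)))
            }
            // Leftmost start, then the earliest registered pattern.
            Kind::LeftmostFirst => candidates.min_by_key(|&(s, _, id)| (s, id)),
        };
        match best {
            Some(m) => {
                out.push(m);
                pos = m.1; // every match is non-empty, so `pos` always advances
            }
            None => return out,
        }
    }
}
```

With the patterns `["ab", "a", "abcd"]` and the text `"abcd"`, `Kind::LeftmostLongest` yields the single match `(0, 4, 2)` and `Kind::LeftmostFirst` yields `(0, 2, 0)`, the same tuples asserted in the README examples.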

bench/Cargo.toml

+1 -1
@@ -1,7 +1,7 @@
 [package]
 name = "daachorse-bench"
 version = "0.1.0"
-edition = "2018"
+edition = "2021"
 
 [dependencies]
 