Commit ad4498b

v0.10.0: Fat Teddy 16-bucket SIMD for 33-64 patterns (#68)
Fat Teddy AVX2 implementation with 9+ GB/s throughput for 33-64 patterns.

Key features:
- 16 buckets (vs Slim Teddy's 8) = 2x pattern capacity
- AVX2 assembly with VPALIGNR half-shift algorithm
- Aho-Corasick fallback for small haystacks (<64 bytes)
- Pure Go scalar fallback for non-AVX2 platforms

Performance:
- 40 patterns: 9.1 GB/s (73x faster than Aho-Corasick)
- Small haystack fallback: 2.4x faster (110ns vs 267ns)
1 parent 6b02713 commit ad4498b
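
The tiering described in the commit message can be sketched as a small selection function. This is a minimal illustration, not the code in `meta/strategy.go`: the `engine` type, the constant names, and `pickEngine` are hypothetical, and the result for fewer than two patterns is an assumption, since the commit does not describe that case.

```go
package main

import "fmt"

// engine is an illustrative label, not a coregex identifier.
type engine string

const (
	slimTeddy   engine = "SlimTeddy"
	fatTeddy    engine = "FatTeddy"
	ahoCorasick engine = "AhoCorasick"
	other       engine = "Other" // assumption: anything outside the multi-pattern tiers
)

const (
	maxSlimPatterns = 32 // Slim Teddy handles 2-32 patterns
	maxFatPatterns  = 64 // Fat Teddy handles 33-64 patterns
	fatMinHaystack  = 64 // below this many bytes, Fat Teddy defers to Aho-Corasick
)

// pickEngine mirrors the tiers from the commit message:
// Slim Teddy (2-32), Fat Teddy (33-64), Aho-Corasick (>64),
// with the small-haystack fallback applied only to the Fat Teddy tier.
func pickEngine(numPatterns, haystackLen int) engine {
	switch {
	case numPatterns < 2:
		return other
	case numPatterns <= maxSlimPatterns:
		return slimTeddy
	case numPatterns <= maxFatPatterns:
		if haystackLen < fatMinHaystack {
			return ahoCorasick
		}
		return fatTeddy
	default:
		return ahoCorasick
	}
}

func main() {
	fmt.Println(pickEngine(8, 1024))  // Slim Teddy tier
	fmt.Println(pickEngine(40, 1024)) // Fat Teddy tier
	fmt.Println(pickEngine(40, 37))   // Fat Teddy tier, haystack too small
	fmt.Println(pickEngine(70, 1024)) // Aho-Corasick tier
}
```

The fallback is checked per haystack rather than at compile time because the same compiled pattern set may be run against inputs of very different sizes.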

15 files changed

Lines changed: 1634 additions & 54 deletions

CHANGELOG.md

Lines changed: 53 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -15,6 +15,49 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
1515

1616
---
1717

18+
## [0.10.0] - 2026-01-07
19+
20+
### Added
21+
- **Fat Teddy 16-bucket SIMD prefilter for 33-64 patterns**
22+
- New strategy tier: Slim Teddy (2-32 patterns) → Fat Teddy (33-64) → Aho-Corasick (>64)
23+
- AVX2 assembly implementation with 9+ GB/s throughput
24+
- 16 buckets (vs Slim Teddy's 8) = 2x pattern capacity
25+
- Pure Go scalar fallback for non-AVX2 platforms
26+
- Algorithm from Rust aho-corasick `generic.rs` `Fat<V, 2>` implementation
27+
28+
- **Aho-Corasick fallback for small haystacks with Fat Teddy**
29+
- Fat Teddy's AVX2 setup overhead makes it slower than Aho-Corasick on small inputs
30+
- Automatic fallback for haystacks < 64 bytes (threshold based on benchmarks)
31+
- 2.4x faster on 37-byte haystacks with 50 patterns (267ns → 110ns)
32+
- Follows Rust regex's `minimum_len()` approach (`builder.rs:585`)
33+
34+
### Technical Details
35+
- **fatTeddyMasks struct**: 32-byte SIMD masks (256-bit AVX2 vectors)
36+
- Low lane (bytes 0-15): buckets 0-7
37+
- High lane (bytes 16-31): buckets 8-15
38+
- **AVX2 algorithm**:
39+
- VBROADCASTI128: Load 16 bytes, duplicate to both lanes
40+
- VPSHUFB: Parallel nibble lookup in bucket masks
41+
- VPALIGNR $15: Half-shift for 2-byte fingerprint alignment
42+
- VPMOVMSKB: Extract 32-bit candidate mask
43+
44+
### Performance
45+
| Patterns | Engine | Throughput | vs Aho-Corasick |
46+
|----------|--------|------------|-----------------|
47+
| 40 | Fat Teddy AVX2 | 9.1 GB/s | **73x faster** |
48+
| 40 | Fat Teddy scalar | 228 MB/s | **1.5x faster** |
49+
| 70 | Aho-Corasick | 152 MB/s | baseline |
50+
51+
### Files
52+
- `prefilter/teddy_fat.go` - Fat Teddy core implementation + MinimumLen()
53+
- `prefilter/teddy_fat_amd64.go` - AVX2 dispatch
54+
- `prefilter/teddy_avx2_amd64.s` - AVX2 assembly (~300 lines)
55+
- `meta/meta.go` - Aho-Corasick fallback for small haystacks
56+
- `meta/strategy.go` - strategy selection update (33-64→Fat Teddy, >64→Aho-Corasick)
57+
- `meta/fat_teddy_fallback_test.go` - tests for fallback logic
58+
59+
---
60+
1861
## [0.9.5] - 2026-01-06
1962

2063
### Changed
@@ -1111,8 +1154,9 @@ v0.7.0 → OnePass DFA (DONE ✅)
11111154
v0.8.0 → ReverseInner strategy (DONE ✅)
11121155
v0.8.14-18 → GoAWK integration, Teddy, BoundedBacktracker (DONE ✅)
11131156
v0.8.19 → FindAll ReverseSuffix optimization (DONE ✅)
1114-
v0.8.20 → ReverseSuffixSet for multi-suffix patterns (DONE ✅) ← CURRENT
1115-
v0.9.0 → Beta testing period
1157+
v0.8.20 → ReverseSuffixSet for multi-suffix patterns (DONE ✅)
1158+
v0.9.x → Performance tuning, Teddy 2-byte fingerprint (DONE ✅)
1159+
v0.10.0 → Fat Teddy 33-64 patterns, AVX2 SIMD (DONE ✅) ← CURRENT
11161160
v1.0.0 → Stable release (API frozen)
11171161
```
11181162

@@ -1142,7 +1186,13 @@ v1.0.0 → Stable release (API frozen)
11421186

11431187
---
11441188

1145-
[Unreleased]: https://github.com/coregx/coregex/compare/v0.9.0...HEAD
1189+
[Unreleased]: https://github.com/coregx/coregex/compare/v0.10.0...HEAD
1190+
[0.10.0]: https://github.com/coregx/coregex/releases/tag/v0.10.0
1191+
[0.9.5]: https://github.com/coregx/coregex/releases/tag/v0.9.5
1192+
[0.9.4]: https://github.com/coregx/coregex/releases/tag/v0.9.4
1193+
[0.9.3]: https://github.com/coregx/coregex/releases/tag/v0.9.3
1194+
[0.9.2]: https://github.com/coregx/coregex/releases/tag/v0.9.2
1195+
[0.9.1]: https://github.com/coregx/coregex/releases/tag/v0.9.1
11461196
[0.9.0]: https://github.com/coregx/coregex/releases/tag/v0.9.0
11471197
[0.8.24]: https://github.com/coregx/coregex/releases/tag/v0.8.24
11481198
[0.8.23]: https://github.com/coregx/coregex/releases/tag/v0.8.23
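
The Technical Details entry in the CHANGELOG describes 32-byte masks with buckets 0-7 in the low lane and buckets 8-15 in the high lane, looked up by nibble. A scalar sketch of that idea follows. It is not the code in `prefilter/teddy_fat.go`: the names `fatMask`, `fingerprintMasks`, `build`, and `find` are hypothetical, assigning a pattern to bucket `index % 16` is an assumption, and every pattern is assumed to be at least 2 bytes long. Plain array indexing stands in for the VPSHUFB lookups, and reading `hay[i]` and `hay[i+1]` directly stands in for the VPALIGNR alignment step.

```go
package main

import (
	"bytes"
	"fmt"
	"math/bits"
)

const numBuckets = 16

// fatMask follows the layout from the CHANGELOG: byte n (0-15) holds
// buckets 0-7 for nibble value n, byte 16+n holds buckets 8-15.
// Hypothetical type for illustration.
type fatMask [32]byte

// add marks bucket as possible for the given nibble value.
func (m *fatMask) add(nibble byte, bucket int) {
	if bucket < 8 {
		m[nibble] |= byte(1) << uint(bucket)
	} else {
		m[16+nibble] |= byte(1) << uint(bucket-8)
	}
}

// buckets returns the 16-bit bucket set for a nibble value.
func (m *fatMask) buckets(nibble byte) uint16 {
	return uint16(m[nibble]) | uint16(m[16+nibble])<<8
}

// fingerprintMasks holds low- and high-nibble masks for each of the
// two fingerprint bytes.
type fingerprintMasks struct {
	lo, hi [2]fatMask
}

// build records the first two bytes of every pattern in its bucket.
func build(patterns [][]byte) *fingerprintMasks {
	fm := &fingerprintMasks{}
	for i, p := range patterns {
		bucket := i % numBuckets // assumption: round-robin bucket assignment
		for pos := 0; pos < 2; pos++ {
			fm.lo[pos].add(p[pos]&0x0F, bucket)
			fm.hi[pos].add(p[pos]>>4, bucket)
		}
	}
	return fm
}

// find returns the leftmost start offset with a verified match and the index
// of the pattern that matched there, or (-1, -1) if nothing matches.
func find(fm *fingerprintMasks, patterns [][]byte, hay []byte) (start, pattern int) {
	for i := 0; i+1 < len(hay); i++ {
		cand := uint16(0xFFFF)
		for pos := 0; pos < 2; pos++ {
			b := hay[i+pos]
			cand &= fm.lo[pos].buckets(b&0x0F) & fm.hi[pos].buckets(b>>4)
		}
		// Nibble masks can report false positives, so every candidate
		// bucket is verified against its full patterns.
		for ; cand != 0; cand &= cand - 1 {
			bucket := bits.TrailingZeros16(cand)
			for j := bucket; j < len(patterns); j += numBuckets {
				if bytes.HasPrefix(hay[i:], patterns[j]) {
					return i, j
				}
			}
		}
	}
	return -1, -1
}

func main() {
	patterns := [][]byte{[]byte("alpha"), []byte("bravo"), []byte("tango")}
	fm := build(patterns)
	start, idx := find(fm, patterns, []byte("this is alpha and omega, with bravo"))
	fmt.Println(start, idx)
}
```

The AVX2 version performs the same per-byte lookups for a whole block of positions in one instruction sequence; the verification step is what keeps the prefilter exact despite the lossy nibble masks.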

README.md

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -73,7 +73,7 @@ Cross-language benchmarks on 6MB input ([source](https://github.com/kolkov/regex
7373
- IP/phone patterns (`\d+\.\d+\.\d+\.\d+`) — optimized DFA strategy
7474
- Suffix patterns (`.*\.log`, `.*\.txt`) — reverse search optimization
7575
- Inner literals (`.*error.*`, `.*@example\.com`) — bidirectional DFA
76-
- Multi-pattern (`foo|bar|baz|...`) — Teddy (≤8) or Aho-Corasick (>8 patterns)
76+
- Multi-pattern (`foo|bar|baz|...`) — Slim Teddy (≤32), Fat Teddy (33-64), or Aho-Corasick (>64)
7777

7878
## Features
7979

@@ -86,9 +86,10 @@ coregex automatically selects the optimal engine:
8686
| ReverseInner | `.*keyword.*` | 100-200x |
8787
| ReverseSuffix | `.*\.txt` | 100-220x |
8888
| LazyDFA | IP, complex patterns | 10-150x |
89-
| AhoCorasick | `a\|b\|c\|...\|z` (>8 patterns) | 75-113x |
89+
| AhoCorasick | `a\|b\|c\|...\|z` (>64 patterns) | 75-113x |
9090
| CharClassSearcher | `[\w]+`, `\d+` | 4-25x |
91-
| Teddy | `foo\|bar\|baz` (2-8 patterns) | 15-240x |
91+
| Slim Teddy | `foo\|bar\|baz` (2-32 patterns) | 15-240x |
92+
| Fat Teddy | 33-64 patterns | 60-73x |
9293
| OnePass | Anchored captures | 10x |
9394
| BoundedBacktracker | Small patterns | 2-5x |
9495

@@ -168,7 +169,8 @@ Input → Prefilter (SIMD) → Engine → Match Result
168169
**SIMD Primitives** (AMD64):
169170
- `memchr` — single byte search (AVX2)
170171
- `memmem` — substring search (SSSE3)
171-
- `teddy` — multi-pattern search (SSSE3)
172+
- `Slim Teddy` — multi-pattern search, 2-32 patterns (SSSE3)
173+
- `Fat Teddy` — multi-pattern search, 33-64 patterns (AVX2, 9+ GB/s)
172174

173175
Pure Go fallback on other architectures.
174176

ROADMAP.md

Lines changed: 18 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
> **Strategic Focus**: Production-grade regex engine with RE2/rust-regex level optimizations
44
5-
**Last Updated**: 2025-12-13 | **Current Version**: v0.8.22 | **Target**: v1.0.0 stable
5+
**Last Updated**: 2026-01-07 | **Current Version**: v0.10.0 | **Target**: v1.0.0 stable
66

77
---
88

@@ -12,7 +12,7 @@ Build a **production-ready, high-performance regex engine** for Go that matches
1212

1313
### Current State vs Target
1414

15-
| Metric | Current (v0.8.22) | Target (v1.0.0) |
15+
| Metric | Current (v0.10.0) | Target (v1.0.0) |
1616
|--------|-------------------|-----------------|
1717
| Inner literal speedup | **87-3154x** | ✅ Achieved |
1818
| Case-insensitive speedup | **263x** | ✅ Achieved |
@@ -21,7 +21,9 @@ Build a **production-ready, high-performance regex engine** for Go that matches
2121
| Small string perf | **1.4-20x faster** | ✅ Achieved |
2222
| Reverse search | **Yes (4 strategies)** | ✅ Achieved |
2323
| OnePass DFA | **Yes** | ✅ Achieved |
24-
| Teddy SIMD prefilter | **Yes** | ✅ Achieved |
24+
| Slim Teddy (2-32 patterns) | **Yes (SSSE3)** | ✅ Achieved |
25+
| Fat Teddy (33-64 patterns) | **Yes (AVX2, 9+ GB/s)** | ✅ Achieved |
26+
| Aho-Corasick (>64 patterns) | **Yes** | ✅ Achieved |
2527
| BoundedBacktracker | **Yes** | ✅ Achieved |
2628
| CharClassSearcher | **Yes (23x, 2x vs Rust)** | ✅ Achieved |
2729
| ARM NEON SIMD | No | Planned |
@@ -32,9 +34,9 @@ Build a **production-ready, high-performance regex engine** for Go that matches
3234
## Release Strategy
3335

3436
```
35-
v0.8.22 (Current) ✅ → Small string optimization (1.4-20x faster)
37+
v0.10.0 (Current) ✅ → Fat Teddy 33-64 patterns (AVX2, 9+ GB/s)
3638
37-
v0.9.x → Beta testing, API stabilization
39+
v0.11.x → API stabilization, performance tuning
3840
3941
v1.0.0-rc → Feature freeze, API locked
4042
@@ -56,6 +58,8 @@ v1.0.0 STABLE → Production release with API stability guarantee
5658
-**v0.8.20**: ReverseSuffixSet for multi-suffix patterns (34-385x faster)
5759
-**v0.8.21**: CharClassSearcher (23x faster, 2x faster than Rust!)
5860
-**v0.8.22**: Small string optimization (1.4-20x faster on ~44B inputs)
61+
-**v0.9.x**: DigitPrefilter, Aho-Corasick integration, Teddy 2-byte fingerprint
62+
-**v0.10.0**: Fat Teddy 16-bucket SIMD (33-64 patterns, 9+ GB/s)
5963

6064
---
6165

@@ -147,20 +151,21 @@ v1.0.0 STABLE → Production release with API stability guarantee
147151

148152
## Feature Comparison Matrix
149153

150-
| Feature | RE2 | rust-regex | coregex v0.8.20 | coregex v1.0 |
154+
| Feature | RE2 | rust-regex | coregex v0.10.0 | coregex v1.0 |
151155
|---------|-----|------------|-----------------|--------------|
152156
| Lazy DFA |||||
153157
| Thompson NFA |||||
154158
| PikeVM |||||
155-
| Teddy SIMD |||||
159+
| Slim Teddy (≤32) |||||
160+
| Fat Teddy (33-64) |||||
156161
| Start State Cache | 8 | 6 | 6 ||
157162
| Reverse Search || ✅ (3) | ✅ (4) ||
158163
| ReverseSuffixSet |||||
159164
| OnePass DFA |||||
160165
| BoundedBacktracker |||||
161166
| Named Captures |||||
162167
| Prefilter Tracking |||||
163-
| Aho-Corasick ||| | Planned |
168+
| Aho-Corasick ||| | |
164169
| ARM NEON |||| Planned |
165170
| Look-around |||| Planned |
166171

@@ -188,7 +193,6 @@ v1.0.0 STABLE → Production release with API stability guarantee
188193
|---------|--------|----------|
189194
| ARM NEON SIMD | Planned | Medium |
190195
| Look-around assertions | Planned | Medium |
191-
| Aho-Corasick for large sets | Planned | Low |
192196
| API stability guarantee | Required | High |
193197

194198
---
@@ -223,7 +227,10 @@ Reference implementations available locally:
223227

224228
| Version | Date | Type | Key Changes |
225229
|---------|------|------|-------------|
226-
| **v0.8.20** | 2025-12-12 | Performance | **ReverseSuffixSet (34-385x faster) - NOT in rust-regex!** |
230+
| **v0.10.0** | 2026-01-07 | Feature | **Fat Teddy 33-64 patterns (AVX2, 9+ GB/s)** |
231+
| v0.9.5 | 2026-01-06 | Fix | Teddy limit 8→32, literal extraction fix |
232+
| v0.9.0-v0.9.4 | 2026-01-05 | Performance | DigitPrefilter, Aho-Corasick, 2-byte fingerprint |
233+
| v0.8.20 | 2025-12-12 | Performance | ReverseSuffixSet (34-385x faster) |
227234
| v0.8.19 | 2025-12-12 | Performance | FindAll ReverseSuffix (87x faster) |
228235
| v0.8.18 | 2025-12-12 | Performance | Teddy prefilter for alternations (242x faster) |
229236
| v0.8.17 | 2025-12-12 | Feature | BoundedBacktracker engine |
@@ -239,4 +246,4 @@ Reference implementations available locally:
239246

240247
---
241248

242-
*Current: v0.8.20 | Next: v0.9.x (Beta) | Target: v1.0.0*
249+
*Current: v0.10.0 | Next: v0.11.x (API stabilization) | Target: v1.0.0*

meta/ahocorasick_test.go

Lines changed: 23 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -5,27 +5,31 @@ import (
55
"testing"
66
)
77

8-
// TestAhoCorasickStrategySelection verifies that patterns with >32 literals
8+
// TestAhoCorasickStrategySelection verifies that patterns with >64 literals
99
// select UseAhoCorasick strategy.
1010
func TestAhoCorasickStrategySelection(t *testing.T) {
11-
// Pattern with 33 literals (above Teddy's limit of 32)
11+
// Pattern with 67 literals (above Teddy's limit of 64)
12+
// Teddy supports up to 64 patterns via Slim (2-32) and Fat (33-64) variants.
13+
// For >64 patterns, Aho-Corasick is selected.
1214
// Each literal >= 3 bytes, all complete (no regex meta-characters)
1315
// IMPORTANT: No shared prefixes! Go's regex parser factors common prefixes,
1416
// e.g., "two|three" becomes "t(wo|hree)", which extracts only "t" as incomplete.
15-
// Using fruits/vegetables/colors with unique first letters.
16-
pattern := `apple|banana|cherry|date|elderberry|fig|grape|honeydew|` +
17-
`imbe|jackfruit|kiwi|lemon|mango|nectarine|orange|papaya|` +
18-
`quince|raspberry|strawberry|tomato|ugli|vanilla|watermelon|` +
19-
`ximenia|yuzu|zucchini|apricot|blueberry|coconut|dragonfruit|` +
20-
`eggplant|feijoa|guava`
17+
// Using unique words; adjacent alternatives start with different characters.
18+
pattern := `alpha|bravo|charlie|delta|echo|foxtrot|golf|hotel|india|juliet|` + // 10
19+
`kilo|lima|mike|november|oscar|papa|quebec|romeo|sierra|tango|` + // 20
20+
`uniform|victor|whiskey|xray|yankee|zulu|anise|basil|cilantro|dill|` + // 30
21+
`endive|fennel|ginger|hops|ivory|jasmine|kelp|lavender|mint|nutmeg|` + // 40
22+
`oregano|parsley|quassia|rosemary|sage|thyme|urtica|verbena|wasabi|xylose|` + // 50
23+
`yarrow|zinnia|acacia|bamboo|cactus|dahlia|ebony|fern|grass|holly|` + // 60
24+
`iris|juniper|kudzu|lotus|moss|nettle|oak` // 67
2125

2226
re, err := Compile(pattern)
2327
if err != nil {
2428
t.Fatalf("Compile(%q) failed: %v", pattern, err)
2529
}
2630

2731
if re.Strategy() != UseAhoCorasick {
28-
t.Errorf("Strategy() = %s, want UseAhoCorasick", re.Strategy())
32+
t.Errorf("Strategy() = %s, want UseAhoCorasick for 67 patterns", re.Strategy())
2933
}
3034
}
3135

@@ -223,22 +227,25 @@ func TestAhoCorasickCount(t *testing.T) {
223227
}
224228
}
225229

226-
// TestAhoCorasickLargePatternSet tests with many patterns.
230+
// TestAhoCorasickLargePatternSet tests with many patterns (>64 triggers Aho-Corasick).
227231
func TestAhoCorasickLargePatternSet(t *testing.T) {
228-
// 35 patterns - above Teddy's limit of 32
232+
// 70 patterns - above Teddy's limit of 64 (Fat Teddy handles up to 64)
229233
// No shared prefixes to avoid Go's regex parser factoring (e.g., "two|three" → "t(wo|hree)")
230-
pattern := `alpha|bravo|charlie|delta|echo|foxtrot|golf|hotel|india|juliet|` +
231-
`kilo|lima|mike|november|oscar|papa|quebec|romeo|sierra|tango|` +
232-
`uniform|victor|whiskey|xray|yankee|zulu|anise|basil|cilantro|dill|` +
233-
`endive|fennel|ginger|hops|ivy`
234+
pattern := `alpha|bravo|charlie|delta|echo|foxtrot|golf|hotel|india|juliet|` + // 10
235+
`kilo|lima|mike|november|oscar|papa|quebec|romeo|sierra|tango|` + // 20
236+
`uniform|victor|whiskey|xray|yankee|zulu|anise|basil|cilantro|dill|` + // 30
237+
`endive|fennel|ginger|hops|ivory|jasmine|kelp|lavender|mint|nutmeg|` + // 40
238+
`oregano|parsley|quassia|rosemary|sage|thyme|urtica|verbena|wasabi|xylose|` + // 50
239+
`yarrow|zinnia|acacia|bamboo|cactus|dahlia|ebony|fern|grass|holly|` + // 60
240+
`iris|juniper|kudzu|lotus|moss|nettle|oak|plum|reed|sorrel` // 70
234241

235242
re, err := Compile(pattern)
236243
if err != nil {
237244
t.Fatalf("Compile(%q) failed: %v", pattern, err)
238245
}
239246

240247
if re.Strategy() != UseAhoCorasick {
241-
t.Errorf("Strategy() = %s, want UseAhoCorasick for 35 patterns", re.Strategy())
248+
t.Errorf("Strategy() = %s, want UseAhoCorasick for 70 patterns", re.Strategy())
242249
}
243250

244251
haystack := []byte("this is alpha and omega, with bravo and tango at the end")
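
The test comments above warn that Go's regex parser factors common prefixes out of alternations. The effect can be observed directly with the standard library's `regexp/syntax` package. The helper names `isAlternation` and `parsed` are hypothetical, and the sketch only inspects the standard parser; how coregex extracts literals from the resulting tree is not shown here.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// isAlternation reports whether the root of the parsed pattern is still an
// alternation after the standard parser has applied its prefix factoring.
func isAlternation(pattern string) (bool, error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return false, err
	}
	return re.Op == syntax.OpAlternate, nil
}

// parsed returns the parser's own rendering of the pattern.
func parsed(pattern string) (string, error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return "", err
	}
	return re.String(), nil
}

func main() {
	for _, p := range []string{"two|three", "alpha|bravo"} {
		alt, err := isAlternation(p)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		s, _ := parsed(p)
		fmt.Printf("%-12s root is alternation: %-5v parsed as: %s\n", p, alt, s)
	}
}
```

For `two|three` the root is a concatenation of the shared `t` and an inner alternation, which is why the test patterns avoid neighboring alternatives with a common first character.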

meta/config.go

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -55,7 +55,8 @@ type Config struct {
5555
MinLiteralLen int
5656

5757
// MaxLiterals limits the number of literals to extract for prefiltering.
58-
// Default: 64
58+
// Must be > 64 to properly detect patterns that exceed Teddy's capacity.
59+
// Default: 256 (allows detecting patterns with >64 literals for Aho-Corasick)
5960
MaxLiterals int
6061

6162
// MaxRecursionDepth limits recursion during NFA compilation.
@@ -83,8 +84,8 @@ func DefaultConfig() Config {
8384
EnablePrefilter: true,
8485
MaxDFAStates: 10000,
8586
DeterminizationLimit: 1000,
86-
MinLiteralLen: 1, // Allow single-byte prefilters (memchr) like Rust
87-
MaxLiterals: 64,
87+
MinLiteralLen: 1, // Allow single-byte prefilters (memchr) like Rust
88+
MaxLiterals: 256, // Allow detecting >64 literals for Aho-Corasick
8889
MaxRecursionDepth: 100,
8990
}
9091
}
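
The reason `MaxLiterals` must exceed 64 can be shown with a toy model. It assumes an extractor that simply stops counting once it reaches the limit; that capping behavior, and the names `extractCount` and `exceedsTeddy`, are assumptions for illustration and may not match how coregex's literal extraction actually behaves.

```go
package main

import "fmt"

// maxTeddyPatterns is Fat Teddy's upper bound as stated in this commit.
const maxTeddyPatterns = 64

// extractCount models an extractor that stops collecting literals once it
// reaches maxLiterals. ASSUMPTION: the real extractor may behave differently.
func extractCount(actual, maxLiterals int) int {
	if actual > maxLiterals {
		return maxLiterals
	}
	return actual
}

// exceedsTeddy reports whether a selector that only sees the extracted count
// can tell that the pattern set is too large for Teddy.
func exceedsTeddy(actual, maxLiterals int) bool {
	return extractCount(actual, maxLiterals) > maxTeddyPatterns
}

func main() {
	for _, limit := range []int{64, 256} {
		fmt.Printf("MaxLiterals=%d: 70 literals seen as >64? %v\n",
			limit, exceedsTeddy(70, limit))
	}
}
```

Under this model a limit of 64 makes a 70-literal pattern indistinguishable from a 64-literal one, so the Aho-Corasick tier could never be selected; any limit above 64 removes the ambiguity.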
