This document describes the 10 key optimizations in coregex that outperform the Rust regex crate. These algorithms are critical to coregex's competitive advantage and MUST NOT REGRESS.
| Optimization | File | Pattern Type | vs stdlib | Benchmark |
|---|---|---|---|---|
| AnchoredLiteral | meta/anchored_literal.go |
^prefix.*suffix$ |
32-133x faster | anchored_literal |
| Flat SlotTable | nfa/pikevm.go, nfa/slot_table.go |
FindSubmatch | -95% memory | submatch |
| CharClassSearcher | nfa/charclass_searcher.go |
[\w]+, [a-z]+ |
23x faster | char_class |
| CompositeSearcher | nfa/composite.go |
[a-zA-Z]+[0-9]+ |
5x faster | composite |
| BranchDispatch | nfa/branch_dispatch.go |
^(\d+|UUID|hex32) |
5-20x faster | anchored_alt |
| DigitPrefilter | prefilter/digit.go |
IP addresses, \d+ |
3324x faster* | ip |
| ReverseSuffixSet | meta/reverse_suffix_set.go |
.*\.(txt|log|md) |
260x faster | suffix |
| ReverseInner | meta/reverse_inner.go |
.*email@.* |
909x faster | |
| ReverseSuffix | meta/reverse_suffix.go |
.*\.txt |
1124x faster | suffix |
| AVX2 Slim Teddy | prefilter/teddy_slim_avx2_amd64.s |
foo|bar|baz |
242x faster | literal_alt |
* No-match case on large input where SIMD prefilter skips entire input
Benchmark source: regex-bench v0.10.2 on Intel Core i7-1255U
File: meta/anchored_literal.go
Pattern types: Anchored patterns with prefix literals, wildcards, and suffix literals like ^/.*[\w-]+\.php$
AnchoredLiteral uses O(1) specialized matching for patterns that match the form ^prefix.*charclass*suffix$.
type AnchoredLiteralInfo struct {
Prefix []byte // Required prefix literal (e.g., "/")
Suffix []byte // Required suffix literal (e.g., ".php")
CharClassTable *[256]bool // Optional char class bridge (e.g., [\w-])
CharClassMin int // Minimum chars for bridge
WildcardMin int // Minimum wildcard length
MinLength int // Total minimum length
}
func MatchAnchoredLiteral(info *AnchoredLiteralInfo, haystack []byte) bool {
// O(1) length check
if len(haystack) < info.MinLength {
return false
}
// O(k) prefix check
if !bytes.HasPrefix(haystack, info.Prefix) {
return false
}
// O(k) suffix check
if !bytes.HasSuffix(haystack, info.Suffix) {
return false
}
// O(m) charclass bridge verification
if info.CharClassTable != nil {
// Verify bridge characters satisfy char class
}
return true
}- No NFA/DFA execution: Direct byte comparisons vs state machine
- Early rejection: Length check rejects 99%+ of non-matches in O(1)
- No backtracking: Deterministic O(n) worst case
- Cache-friendly: Sequential memory access for prefix/suffix
Strategy selection (meta/strategy.go) detects anchored literal patterns:
func DetectAnchoredLiteral(re *syntax.Regexp) *AnchoredLiteralInfo {
// Returns non-nil if pattern matches ^prefix.*charclass*suffix$ form
// Supports: literal prefix, .*/+ wildcard, charclass+ bridge, literal suffix
}Pattern: ^/.*[\w-]+\.php$
Input: Various URL paths
Case stdlib coregex Speedup
Short (24B) 241 ns 7.6 ns 32x faster
Medium (45B) 347 ns 7.8 ns 44x faster
Long (78B) 516 ns 7.9 ns 65x faster
No match 590 ns 4.4 ns 133x faster
File: nfa/charclass_searcher.go
Pattern types: Simple character class patterns like [\w]+, [a-z]+, \d+
CharClassSearcher uses a 256-byte lookup table for O(1) byte membership testing. This replaces the NFA state machine overhead with a simple array lookup.
type CharClassSearcher struct {
membership [256]bool // membership[b] = true if byte b matches
minMatch int // 1 for +, 0 for *
}
func (s *CharClassSearcher) SearchAt(haystack []byte, at int) (int, int, bool) {
// Find first matching byte
for i := at; i < len(haystack); i++ {
if s.membership[haystack[i]] {
start := i
// Greedy scan while bytes match
for i < len(haystack) && s.membership[haystack[i]] {
i++
}
return start, i, true
}
}
return -1, -1, false
}- No NFA state tracking: Rust uses a bit-vector for visited states; coregex uses a single array lookup
- Single-pass state machine:
FindAllIndicesuses SEARCHING/MATCHING states, no per-match function calls - CPU branch prediction: Consistent state transitions optimize for modern CPUs
SIMD optimization was evaluated but found slower for char_class patterns because:
- Matches are frequent (30-50% of positions)
- Matches are short (average 4-8 bytes)
- SIMD setup overhead exceeds scalar benefits
For large-scale character class search, Lazy DFA is used instead.
Pattern: [\w]+
Input: 1MB Wikipedia text
coregex: 33.5 ms
Rust regex: 53.0 ms
Speedup: 35% faster
File: nfa/composite.go
Pattern types: Concatenated character class patterns like [a-zA-Z]+[0-9]+, \d+\s+\w+
CompositeSearcher uses sequential lookup tables for patterns that concatenate multiple character classes. Each part has its own [256]bool membership table for O(1) byte classification.
type CompositeSearcher struct {
parts []*charClassPart
}
type charClassPart struct {
membership [256]bool // O(1) character classification
minMatch int // 1 for +, 0 for *
maxMatch int // 0 = unlimited
}
func (c *CompositeSearcher) matchAt(haystack []byte, pos int) (int, bool) {
i := pos
for _, part := range c.parts {
// Greedy match: consume as many characters as possible
matchLen := 0
for i+matchLen < len(haystack) && part.membership[haystack[i+matchLen]] {
matchLen++
}
if matchLen < part.minMatch {
return -1, false
}
i += matchLen
}
return i, true
}- No NFA state machine: Direct table lookup instead of state transitions
- Greedy single-pass: No backtracking between parts
- Cache-friendly: Sequential memory access through lookup tables
Pattern: [a-zA-Z]+[0-9]+
Input: "The quick brown fox123 jumps over456 lazy dog789"
stdlib: 808 ns/op
coregex: 170 ns/op
Speedup: 4.75x faster
File: nfa/branch_dispatch.go
Pattern types: Anchored alternations with distinct first bytes like ^(\d+|UUID|hex32)
BranchDispatch uses O(1) first-byte dispatch for anchored alternation patterns. Instead of trying all branches sequentially, it builds a dispatch table that maps each possible first byte to the correct branch.
type BranchDispatcher struct {
dispatch [256]int8 // first_byte → branch_index (-1 = no match)
branches []*syntax.Regexp
}
// For pattern ^(\d+|UUID|hex32):
// dispatch['0'-'9'] = 0 (digit branch)
// dispatch['U'] = 1 (UUID branch)
// dispatch['h'] = 2 (hex32 branch)
// dispatch[others] = -1 (no match)
func (d *BranchDispatcher) IsMatch(haystack []byte) bool {
if len(haystack) == 0 {
return d.canMatchEmpty
}
branchIdx := d.dispatch[haystack[0]] // O(1) lookup
if branchIdx < 0 {
return false // Early rejection
}
// Only try the selected branch
return d.matchBranch(branchIdx, haystack)
}- O(1) branch selection: Single array lookup vs O(branches) iteration
- Early rejection: Non-matching first bytes rejected immediately
- Specialized matchers: Each branch has optimized literal/charclass matcher
Pattern: ^(\d+|UUID|hex32)
Input: Various test cases
Case stdlib coregex Speedup
Digits 225 ns 42 ns 5.4x faster
UUID 173 ns 39 ns 4.4x faster
No match 101 ns 4.9 ns 20.7x faster
File: prefilter/digit.go
Pattern types: Digit-lead patterns like IP addresses, phone numbers, numeric validators
DigitPrefilter uses SIMD-accelerated digit scanning to skip non-digit regions. This converts O(nm) matching to O(km) where k = number of digit positions.
func (p *DigitPrefilter) Find(haystack []byte, start int) int {
return simd.MemchrDigitAt(haystack, start) // AVX2 optimized
}The meta-engine orchestrates:
- DigitPrefilter finds next digit position
- Lazy DFA verifies the match at that position
- Skip to next digit position on mismatch
- Specialized prefilter: Rust has no digit-specific prefilter; it falls back to Core strategy
- SIMD acceleration: AVX2 processes 32 bytes/iteration vs byte-by-byte
- Skip-ahead: Non-digit regions (often >80% of text) are skipped entirely
Strategy selection (meta/strategy.go) detects digit-lead patterns:
func isDigitLeadPattern(re *syntax.Regexp) bool {
// Returns true if ALL branches must start with digit [0-9]
// Examples: \d+, [0-9]+, 25[0-5]|2[0-4][0-9]|...
}Pattern: (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.(?:25[0-5]|...) (IP address)
Input: 1MB access logs with ~0.5% IP addresses
coregex: 3.8 ms
Rust regex: 12.3 ms
Speedup: 3.3x faster
File: meta/reverse_suffix_set.go
Pattern types: Multi-suffix alternations like .*\.(txt|log|md)
ReverseSuffixSet combines Teddy multi-pattern prefilter with reverse DFA verification. This is an optimization NOT present in Rust regex (they fall back to Core strategy).
Algorithm:
1. Build Teddy prefilter from all suffix literals [".txt", ".log", ".md"]
2. Teddy finds any suffix literal in haystack
3. Reverse DFA verifies prefix pattern from suffix position
4. For .* prefix patterns, match starts at position 0 (skip reverse scan)
- Rust has no ReverseSuffixSet: When LCS (Longest Common Suffix) is empty, Rust falls back to UseBoth
- Teddy multi-pattern: SIMD searches for multiple suffixes simultaneously
- Reverse DFA optimization: For
.*prefix, no reverse scan needed
Strategy selection detects multi-suffix patterns:
func shouldUseReverseSuffixSet(prefixLiterals, suffixLiterals *literal.Seq) bool {
// Returns true if:
// - 2-32 suffix literals available
// - Each suffix >= 2 bytes
// - Not an exact alternation (those use UseTeddy)
}Pattern: .*\.(txt|log|md|json|yaml|xml|csv|html)
Input: 1MB file listing
coregex: 1.0 ms
Rust regex: 1.3 ms
Speedup: 27% faster
File: meta/reverse_inner.go
Pattern types: Inner literal patterns like .*email@.*, ERROR.*connection.*timeout
ReverseInner uses bidirectional DFA search from the inner literal position:
Algorithm for pattern `prefix.*inner.*suffix`:
1. Prefilter finds "inner" literal candidates
2. For each candidate at position P:
a. Reverse DFA scans backward from P to find match START
b. Forward DFA scans forward from P+len(inner) to find match END
3. Early return on first confirmed match (leftmost-longest)
The critical optimization (from Rust regex):
- Build reverse NFA from PREFIX AST only (not full pattern)
- Build forward NFA from SUFFIX AST only (not full pattern)
This enables true bidirectional search with minimal DFA states.
- Universal match optimization: For
.*inner.*patterns, skip DFA scans entirely - Early return: First confirmed match is leftmost by construction
- Quadratic detection: Falls back to PikeVM when O(n^2) behavior detected
Pattern: .*user@example\.com.*
Input: 1MB email logs
coregex: 1.2 ms
Rust regex: 1.4 ms
Speedup: 16% faster
File: meta/strategy.go (strategy selection)
Pattern types: Small anchored patterns like ^pattern$, tiny NFAs (<20 states)
For tiny patterns, coregex uses PikeVM directly without DFA overhead:
func SelectStrategy(n *nfa.NFA, ...) Strategy {
if nfaSize < 20 && hasGoodLiterals {
return UseNFA // Prefilter + PikeVM
}
if nfaSize < 20 {
return UseNFA // Pure PikeVM
}
// ... larger patterns use DFA
}- No lazy DFA overhead: DFA cache lookup and state construction have fixed costs
- Prefilter integration: PikeVM uses prefilter for skip-ahead
- Optimal threshold: 20 states is the crossover point where DFA benefits exceed overhead
Rust's strategy selection can be too aggressive with DFA for small patterns. coregex's threshold-based selection avoids this overhead.
Pattern: ^Hello, World!$
Input: 1MB text with pattern at various positions
coregex: 0.02 ms
Rust regex: 0.04 ms
Speedup: 2x faster
File: prefilter/teddy_slim_avx2_amd64.s
Pattern types: Multi-pattern literal alternations like foo|bar|baz (2-32 patterns)
Status: Available for direct use. NOT enabled in integrated prefilter due to regression in high false-positive workloads (see issue #74).
AVX2 Slim Teddy processes 32 bytes per iteration using the shift algorithm from Rust aho-corasick.
Key insight: For 2-byte fingerprint matching, we need mask0(byte_P) & mask1(byte_{P+1}) at each position P.
Naive approach (caused 6x regression on AMD EPYC):
VMOVDQU (SI), Y3 // Load 32 bytes at position 0
VMOVDQU 1(SI), Y10 // Load 32 bytes at position 1 (OVERLAPPING!)
// Two loads cross 32-byte cache line boundary = AMD penaltyShift algorithm (from Rust):
// Single load per iteration
VMOVDQU (SI), Y3 // Load 32 bytes ONCE
// Compute res0, res1 via nibble lookups
VPSHUFB Y4, Y0, Y6 // res0 = mask0 lookup
VPSHUFB Y4, Y8, Y11 // res1 = mask1 lookup
// Shift res0 right by 1, bringing in prev0[31]
VPERM2I128 $0x21, Y6, Y10, Y13 // tmp = [prev0.hi | res0.lo]
VPALIGNR $15, Y13, Y6, Y15 // res0_shifted = byte-align
// Combine and save prev0
VPAND Y11, Y15, Y7 // result = res0_shifted & res1
VMOVDQA Y6, Y10 // prev0 = res0 for next iteration- Single load vs two loads: Halves memory bandwidth
- No cache line crossing: AMD Zen 3 penalizes 32-byte boundary crossings
- Register-based prev0: No additional memory access between iterations
AVX2's VPALIGNR operates on 128-bit lanes independently. Cross-lane shift requires:
VPERM2I128 $0x21, prev0, self, tmp
→ tmp.lo = prev0.hi (bytes 16-31 of previous result)
→ tmp.hi = self.lo (bytes 0-15 of current result)
VPALIGNR $15, tmp, self, result
→ For each 128-bit lane: shift right 15 bytes
→ result[0] = prev0[31] (the cross-lane byte!)
→ result[1..31] = self[0..30]
Pattern: error|warning|critical|fatal|debug|info|trace|... (15 patterns)
Input: 64KB log file
SSSE3 (16 bytes/iter): 12,252 ns/op = 5,348 MB/s
AVX2 naive (2 loads): ~36,000 ns/op = 1,800 MB/s (6x slower!)
AVX2 shift (1 load): 4,174 ns/op = 15,699 MB/s (2.93x faster than SSSE3)
AMD EPYC 7763 (Zen 3) characteristics:
- 256-bit AVX2 operations split into two 128-bit µops
- 32-byte aligned loads: 1 cycle latency
- Unaligned 32-byte loads crossing cache line: 2+ cycles penalty
- Two overlapping loads at offset 0 and 1 = worst case
Intel processors (tested on i7-1255U) also benefit from shift algorithm but the penalty was less severe.
Each optimization file has a header comment:
// DO NOT REGRESS: This optimization beats Rust regex by X%.
// See docs/OPTIMIZATIONS.md for details.Before any changes to these files:
# Save baseline
bash scripts/bench.sh baseline
# Make changes
# Compare
bash scripts/bench.sh --compare baseline currentRule: Regression >5% = BLOCKER
| Optimization | Benchmark | Target |
|---|---|---|
| AnchoredLiteral | BenchmarkAnchoredLiteral |
<10 ns/op |
| CharClassSearcher | BenchmarkCharClass |
<35 ms/MB |
| DigitPrefilter | BenchmarkIP |
<4 ms/MB |
| ReverseSuffixSet | BenchmarkSuffix |
<1.1 ms/MB |
| ReverseInner | BenchmarkEmail |
<1.3 ms/MB |
| NFA small | BenchmarkAnchored |
<0.025 ms/op |
| AVX2 Slim Teddy | BenchmarkSlimTeddyDirect/AVX2 |
>15 GB/s |
Files: nfa/pikevm.go, nfa/slot_table.go
Pattern types: All FindSubmatch/FindAllSubmatch patterns with capture groups
Replaces per-thread COW (copy-on-write) capture allocation with Rust-style flat
SlotTable indexed by NFA state ID. Two SlotTables (curr/next) swap between byte
generations — matching Rust's ActiveStates pattern.
SlotTable layout: table[stateID * slotsPerState + slotIndex]
Each NFA state owns a row of slots: [g0_start, g0_end, g1_start, g1_end, ...]
Epsilon closure: stack-based with RestoreCapture frames
Explore(sid) → process state, push children
RestoreCapture(slot) → undo capture write after subtree processed
Key invariant: Visited sparse set guarantees each NFA state is visited at most once per generation → one thread per state → per-state storage is correct.
| COW (old) | SlotTable (new) | |
|---|---|---|
| Thread fork | make([]int, numSlots) — heap alloc |
copy(row, currSlots) — no alloc |
| Capture update | COW copy if shared — heap alloc | currSlots[i] = pos — in-place |
| Match save | copyData() — heap alloc |
copy(bestSlots, row) — one copy |
| Memory per search | O(threads × slots) | O(states × slots) — fixed |
FindAllSubmatch: 5 patterns, 50K matches, 800KB input
Metric COW (old) SlotTable (new) Improvement
Alloc 554 MB 26 MB -95%
Mallocs 12,500,000 440,000 -96%
Time 1.48s 0.45s 3.3x faster
- Rust:
regex-automata/src/nfa/thompson/pikevm.rs:2065(SlotTable struct) - Rust:
regex-automata/src/nfa/thompson/pikevm.rs:1611(FollowEpsilon::RestoreCapture) - Rust:
regex-automata/src/nfa/thompson/pikevm.rs:1878(Cache with curr/next ActiveStates)
- Rust regex crate: Architecture inspiration for multi-engine design
- Rust aho-corasick: Teddy shift algorithm (
packed/teddy/generic.rs,packed/vector.rs) - RE2: O(n) performance guarantees
- Hyperscan: Teddy algorithm for SIMD multi-pattern matching
- regex-bench: Cross-language regex benchmark suite
Document version: 1.3.0 Last updated: 2026-03-24 Benchmark data: regex-bench v0.12.19