Commit 921d193

perf: flat DFA + integrated prefilter — 35% faster than baseline (#151)
* fix: NFA candidate loop guard — use partialCoverage instead of IsComplete

  IsComplete() guard blocked prefilter candidate loop for ALL incomplete prefilters, including prefix-only ones where all alternation branches are represented. This caused 22x regression on Kostya's errors pattern (1984ms vs 90ms on v0.12.14).

  Root cause: Rust integrates prefilter as skip-ahead INSIDE PikeVM (pikevm.rs:1293-1299), not as external correctness gate. When NFA states are empty, prefilter skips ahead. Partial coverage is safe because NFA continues scanning if prefilter misses.

  Fix: Added partialCoverage flag on literal.Seq (set only on overflow truncation). NFA candidate loop uses !partialCoverage guard instead of IsComplete(). DFA paths retain IsComplete() where needed.

  errors: 1984ms -> 109ms. Stdlib compat: 38/38 PASS.

* perf: PikeVM integrated prefilter skip-ahead (Rust approach)

  Integrate prefilter inside PikeVM search loop as skip-ahead (pikevm.rs:1293). When NFA has no active threads, PikeVM jumps to next candidate via prefilter.Find() instead of byte-by-byte scan. Safe for partial-coverage prefilters — NFA processes all branches from each candidate position.

  This is architecturally cleaner than external candidate loop guards (partialCoverage flag still used for external BT candidate loop as BoundedBacktracker has no integrated skip-ahead).

  Also includes PR #150 changes: partialCoverage flag on literal.Seq, NFA candidate loop guard uses partialCoverage instead of IsComplete().

  errors pattern: 1984ms -> 120ms. la_suspicious: 38/38 stdlib PASS.

* perf: flat DFA transition table — eliminate pointer chase in hot loop

  Replace double indirection (stateList[id].transitions[class]) with flat transition table (flatTrans[sid*stride + class]) in searchFirstAt hot loop. Also replace State.IsMatch() with compact matchFlags[sid] bool slice.

  Fast path now works with state ID only — no *State pointer needed. State struct accessed only in slow path (determinize, word boundary).

  Inspired by Rust regex-automata hybrid/dfa.rs Cache.trans flat layout.

  Kostya benchmark: 3.60s -> 2.56s (1.4x faster). bots pattern restored to v0.12.14 baseline (278ms vs 287ms). Stdlib compat: 38/38 PASS.

* perf: 4x loop unrolling in searchFirstAt (Rust approach)

  Unroll DFA hot loop 4x — process 4 bytes per iteration when all transitions are in flat table (no unknown/dead states). Falls to single-byte slow path on any special state.

  Marginal improvement on x86 with SIMD prefilters (branch predictor handles single-byte well). May help more on ARM64 where branch prediction is less aggressive.

  Reference: Rust hybrid/search.rs:195-221. Stdlib compat: 38/38 PASS.

* perf: apply flat DFA transition table to ALL search functions

  Extend flat table optimization from searchFirstAt to all 6 DFA search functions: searchAt, searchEarliestMatch, searchEarliestMatchAnchored, SearchReverse, SearchReverseLimited, IsMatchReverse.

  Hot loop pattern: ft[int(sid)*stride + classIdx] replaces stateList[id].transitions[class] — eliminates pointer chase. State struct accessed only in slow path (determinize, word boundary).

  Kostya benchmark: 2.56s -> 2.28s (+12%). errors pattern: 109ms -> 81ms (better than v0.12.14 baseline 90ms). Stdlib compat: 38/38 PASS.

* fix: restore DFA prefilter skip-ahead for incomplete prefilters

  IsComplete() guard in findIndicesDFA/findIndicesDFAAt blocked prefilter skip-ahead for incomplete prefilters (memmem, Teddy with prefix-only literals). But DFA verifies full pattern at candidate — skip is always safe.

  This was the root cause of sessions (229ms -> 36ms), api_calls (245ms -> 95ms), post_requests (259ms -> 114ms) regressions.

  Kostya benchmark total: 2.28s -> 1.62s (FASTER than v0.12.14 baseline 1.80s!). Stdlib compat: 38/38 PASS.

* perf: DFA prefilter skip-ahead at start state (Rust approach)

  When DFA returns to start state with no match in progress, use prefilter to skip ahead to next candidate instead of byte-by-byte scanning. Applied to searchFirstAt and searchAt (bidirectional DFA path).

  This is the Rust approach (hybrid/search.rs:232-258): prefilter is called inside the DFA loop when a start state is detected, not externally.

  peak_hours: 197ms -> 90ms (2.2x faster, gap vs Rust: 9x -> 4x). Kostya total: 1.62s -> 1.38s (15% faster). Stdlib compat: 38/38 PASS.

* docs: update CHANGELOG for v0.12.18

* perf: flat DFA transition table in SearchAtAnchored

  Apply flat table to SearchAtAnchored — called for every prefilter candidate verification in bidirectional DFA path. Eliminates pointer chase in the most frequent DFA hot path.

  Kostya benchmark: 1.38s -> 1.17s (15% faster). Total improvement vs v0.12.14: 1.80s -> 1.17s (35% faster). Stdlib compat: 38/38 PASS.

* perf: flat DFA transition table in isMatchWithPrefilter and findWithPrefilterAt

  Apply flat table to last 2 remaining functions with old Transition() calls. No more State pointer chase in ANY DFA hot loop.

  Kostya benchmark: 1.17s -> 1.19s (stable, tokens 116ms->51ms). All DFA search functions now use flatTrans[sid*stride+class]. Stdlib compat: 38/38 PASS.

* docs: update ROADMAP and CHANGELOG for v0.12.18

* fix: guard getState/IsMatchState against 386 int overflow

  On 386, int(StateID(0xFFFFFFFF)) = -1 (int is 32-bit). getState and IsMatchState used int(id) for slice indexing, causing panic: index out of range [-1].

  Fix: check sid >= DeadState before int cast. DeadState (0xFFFFFFFE) and InvalidState (0xFFFFFFFF) are sentinel values not present in stateList/matchFlags.

* fix: use safeOffset for all flat table indexing — 386 int overflow

  On 386, int is 32-bit. int(StateID(0xFFFFFFFE)) = -2, causing negative slice index panic in flat table lookups.

  Added safeOffset() helper using uint arithmetic (always positive). Replaced all 23 occurrences of int(sid)*stride in hot loops. safeOffset inlines — zero overhead on 64-bit.

* fix: safeOffset guard for DeadState/InvalidState on 386

  uint multiply overflows on 386: uint(0xFFFFFFFE)*uint(20) wraps around. Guard with sid >= DeadState check — returns MaxInt so bounds check fails safely. Normal state IDs (small values) take fast path without branch.

* docs: update README benchmark table and ROADMAP for v0.12.18
1 parent bc78fa7 commit 921d193
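
The layout change described in the commit message can be illustrated with a small, self-contained Go sketch. This is not the coregex code: `nestedState`, `flatDFA`, `newFlatDFA`, `set`, and `next` are hypothetical names chosen for the example, and only the two indexing expressions (`stateList[id].transitions[class]` and `flatTrans[sid*stride + class]`) are taken from the commit.

```go
package main

import "fmt"

// StateID mirrors the uint32 state identifier used in the commit.
type StateID uint32

// invalidState marks a transition that has not been computed yet
// (the commit uses 0xFFFFFFFF for this purpose).
const invalidState StateID = 0xFFFFFFFF

// nestedState is a hypothetical stand-in for the old layout:
// every state owns its own transitions slice.
type nestedState struct {
	transitions []StateID
}

// nestedNext needs two dependent loads: the state pointer, then its slice.
func nestedNext(states []*nestedState, sid StateID, class int) StateID {
	return states[sid].transitions[class]
}

// flatDFA is a hypothetical stand-in for the new layout:
// all rows live in one slice, stride entries per state.
type flatDFA struct {
	trans  []StateID
	stride int
}

// newFlatDFA allocates numStates rows; every entry starts as invalidState.
func newFlatDFA(numStates, stride int) *flatDFA {
	trans := make([]StateID, numStates*stride)
	for i := range trans {
		trans[i] = invalidState
	}
	return &flatDFA{trans: trans, stride: stride}
}

// set records the transition from -> to on the given byte class.
func (d *flatDFA) set(from StateID, class int, to StateID) {
	d.trans[int(from)*d.stride+class] = to
}

// next is a single slice access: row offset plus class index.
func (d *flatDFA) next(sid StateID, class int) StateID {
	return d.trans[int(sid)*d.stride+class]
}

func main() {
	// Two states, two byte classes; only 0 --class 1--> 1 is known.
	nested := []*nestedState{
		{transitions: []StateID{invalidState, 1}},
		{transitions: []StateID{invalidState, invalidState}},
	}
	flat := newFlatDFA(2, 2)
	flat.set(0, 1, 1)

	// Both layouts answer the same lookup.
	fmt.Println(nestedNext(nested, 0, 1) == flat.next(0, 1))
	// An entry that was never set still reads as invalidState.
	fmt.Println(flat.next(1, 0) == invalidState)
}
```

The sketch indexes with `int(sid)` directly, so it is only safe for ordinary state IDs; the 386 fixes later in the message exist because sentinel IDs break that assumption.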

11 files changed

Lines changed: 921 additions & 436 deletions

CHANGELOG.md

Lines changed: 38 additions & 0 deletions
@@ -12,6 +12,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - ARM NEON SIMD support (Go 1.26 `simd/archsimd` intrinsics — [#120](https://github.com/coregx/coregex/issues/120))
 - SIMD prefilter for CompositeSequenceDFA (#83)
 
+## [0.12.18] - 2026-03-24
+
+### Performance
+- **Flat DFA transition table** (Rust approach) — replaced double pointer chase
+(`stateList[id].transitions[class]`) with flat array (`flatTrans[sid*stride+class]`).
+Hot loop works with state ID only — no `*State` pointer in fast path. Applied to
+all 6 DFA search functions. Inspired by Rust `Cache.trans` flat layout.
+
+- **4x loop unrolling** in `searchFirstAt` — process 4 bytes per iteration when
+all transitions are in flat table. Falls to single-byte slow path on special states.
+
+- **DFA integrated prefilter skip-ahead** (Rust approach) — when DFA returns to
+start state with no match in progress, uses `prefilter.Find()` to skip ahead
+instead of byte-by-byte scanning. Applied to `searchFirstAt` and `searchAt`.
+Reference: Rust `hybrid/search.rs:232-258`.
+`peak_hours`: 197ms → **90ms** (gap vs Rust: 9x → 4x).
+
+- **PikeVM integrated prefilter skip-ahead** — prefilter integrated inside PikeVM
+search loop (`pikevm.rs:1293`). When NFA has no active threads, PikeVM jumps to
+next candidate. Safe for partial-coverage prefilters.
+
+### Fixed
+- **NFA candidate loop guard** — replaced `IsComplete()` with `partialCoverage`
+flag. `IsComplete()` blocked ALL incomplete prefilters including prefix-only ones.
+`errors` pattern: 1984ms → **80ms**.
+
+- **DFA prefilter skip for incomplete prefilters**: `IsComplete()` guard blocked
+DFA prefilter skip-ahead for memmem/Teddy prefix-only prefilters. But DFA verifies
+full pattern — skip is always safe. `sessions`: 229ms → **30ms**.
+
 ## [0.12.17] - 2026-03-23
 
 ### Fixed
@@ -39,6 +69,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 Now allows UseTeddy when anchors are only `(?m)^` (no \b, $, etc).
 `http_methods` on macOS ARM64: 89ms → **<1ms** (restored to v0.12.14 level).
 
+- **Fix NFA candidate loop guard**: `IsComplete()` guard blocked prefilter
+candidate loop for ALL incomplete prefilters, including prefix-only ones
+where all alternation branches are represented. Now uses `partialCoverage`
+flag (set only on overflow truncation) instead of `IsComplete()`. Pattern
+` [5][0-9]{2} | [4][0-9]{2} ` (Kostya's `errors`): 1984ms → **109ms**.
+Rust handles this by integrating prefilter as skip-ahead inside PikeVM
+(not as an external correctness gate) — see `pikevm.rs:1293-1299`.
+
 ## [0.12.16] - 2026-03-21
 
 ### Performance
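
The "skip ahead when the DFA is back in its start state" idea from the changelog entry can be sketched with a toy automaton. The example below is an illustration under stated assumptions, not the coregex search loop: the pattern `x[0-9]+;`, the hand-written three-state machine, the `findEnd` function, and the step counter are all hypothetical, and `bytes.IndexByte` stands in for `prefilter.Find()`.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// findEnd searches for the first match of the toy pattern `x[0-9]+;`.
// It returns the end offset of the match (or -1) and the number of
// automaton steps executed. States: 0 = start, 1 = saw 'x',
// 2 = saw 'x' followed by at least one digit.
func findEnd(h []byte, usePrefilter bool) (end, steps int) {
	state := 0
	for pos := 0; pos < len(h); pos++ {
		// In the start state no match is in progress, so every byte
		// before the next 'x' would leave the state unchanged.
		// Jump straight to that candidate instead of stepping.
		if state == 0 && usePrefilter {
			i := bytes.IndexByte(h[pos:], 'x')
			if i < 0 {
				return -1, steps
			}
			pos += i
		}
		steps++
		b := h[pos]
		isDigit := b >= '0' && b <= '9'
		switch {
		case state == 2 && b == ';':
			return pos + 1, steps
		case state >= 1 && isDigit:
			state = 2
		case b == 'x':
			state = 1
		default:
			state = 0
		}
	}
	return -1, steps
}

func main() {
	h := []byte(strings.Repeat("a", 10) + "x12;bbb")
	endScan, stepsScan := findEnd(h, false)
	endSkip, stepsSkip := findEnd(h, true)
	fmt.Println("byte-by-byte:", endScan, stepsScan)
	fmt.Println("with skip:   ", endSkip, stepsSkip)
}
```

Both calls report the same match end; only the step count differs, because the automaton still verifies the full pattern at each candidate. That is the same argument the changelog gives for why skipping is safe even with an incomplete prefilter.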

README.md

Lines changed: 10 additions & 10 deletions
@@ -64,16 +64,16 @@ Cross-language benchmarks on 6MB input, AMD EPYC ([source](https://github.com/ko
 
 | Pattern | Go stdlib | coregex | Rust regex | vs stdlib | vs Rust |
 |---------|-----------|---------|------------|-----------|---------|
-| Literal alternation | 475 ms | 4.4 ms | 0.6 ms | **108x** | 7.1x slower |
-| Multi-literal | 1412 ms | 12.8 ms | 4.7 ms | **110x** | 2.7x slower |
-| Inner `.*keyword.*` | 232 ms | 0.30 ms | 0.27 ms | **774x** | 1.1x slower |
-| Suffix `.*\.txt` | 236 ms | 1.82 ms | 1.13 ms | **129x** | 1.6x slower |
-| Multiline `(?m)^/.*\.php` | 103 ms | 0.50 ms | 0.67 ms | **206x** | **1.3x faster** |
-| Email validation | 265 ms | 0.62 ms | 0.27 ms | **428x** | 2.2x slower |
-| URL extraction | 353 ms | 0.65 ms | 0.35 ms | **543x** | 1.8x slower |
-| IP address | 496 ms | 2.1 ms | 12.1 ms | **231x** | **5.6x faster** |
-| Char class `[\w]+` | 581 ms | 51.2 ms | 50.2 ms | **11x** | ~parity |
-| Word repeat `(\w{2,8})+` | 712 ms | 186 ms | 48.7 ms | **3x** | 3.8x slower |
+| Literal alternation | 475 ms | 4.4 ms | 0.7 ms | **109x** | 6.3x slower |
+| Multi-literal | 1391 ms | 12.6 ms | 4.7 ms | **110x** | 2.6x slower |
+| Inner `.*keyword.*` | 231 ms | 0.29 ms | 0.29 ms | **797x** | **~parity** |
+| Suffix `.*\.txt` | 234 ms | 1.83 ms | 1.07 ms | **128x** | 1.7x slower |
+| Multiline `(?m)^/.*\.php` | 103 ms | 0.66 ms | 0.66 ms | **156x** | **~parity** |
+| Email validation | 261 ms | 0.54 ms | 0.31 ms | **482x** | 1.7x slower |
+| URL extraction | 262 ms | 0.84 ms | 0.35 ms | **311x** | 2.4x slower |
+| IP address | 498 ms | 2.1 ms | 12.0 ms | **237x** | **5.6x faster** |
+| Char class `[\w]+` | 554 ms | 48.0 ms | 50.1 ms | **11x** | **1.0x faster** |
+| Word repeat `(\w{2,8})+` | 641 ms | 185 ms | 48.7 ms | **3x** | 3.7x slower |
 
 **Where coregex excels:**
 - Multiline patterns (`(?m)^/.*\.php`) — near Rust parity, 100x+ vs stdlib

ROADMAP.md

Lines changed: 12 additions & 3 deletions
@@ -2,7 +2,7 @@
 
 > **Strategic Focus**: Production-grade regex engine with RE2/rust-regex level optimizations
 
-**Last Updated**: 2026-03-20 | **Current Version**: v0.12.15 | **Target**: v1.0.0 stable
+**Last Updated**: 2026-03-24 | **Current Version**: v0.12.18 | **Target**: v1.0.0 stable
 
 ---
 
@@ -87,7 +87,13 @@ v0.12.13 ✅ → FatTeddy fix, prefilter acceleration, AC v0.2.1
 
 v0.12.14 ✅ → Concurrent safety fix for isMatchDFA prefilter (#137)
 
-v0.12.15 (Current) ✅ → Per-goroutine DFA cache, word boundary 30%→0.3% CPU, AC prefilter
+v0.12.15 ✅ → Per-goroutine DFA cache, word boundary 30%→0.3% CPU, AC prefilter
+
+v0.12.16 ✅ → WrapLineAnchor for (?m)^ patterns
+
+v0.12.17 ✅ → Fix LogParser ARM64 regression, restore DFA/Teddy for (?m)^
+
+v0.12.18 (Current) ✅ → Flat DFA transition table, integrated prefilter, PikeVM skip-ahead
 
 v1.0.0-rc → Feature freeze, API locked
 
@@ -130,7 +136,10 @@ v1.0.0 STABLE → Production release with API stability guarantee
 - **v0.12.12**: Prefix trimming for case-fold literals
 - **v0.12.13**: FatTeddy fix (ANDL→ORL, VPTEST), prefilter acceleration, AC v0.2.1
 - **v0.12.14**: Concurrent safety fix for isMatchDFA prefilter (#137)
-- **v0.12.15**: Per-goroutine DFA cache (Rust approach), word boundary 30%→0.3% CPU, AC DFA prefilter for >32 literals (7-13x faster)
+- **v0.12.15**: Per-goroutine DFA cache (Rust approach), word boundary 30%→0.3% CPU, 7 correctness fixes
+- **v0.12.16**: WrapLineAnchor for (?m)^ patterns
+- **v0.12.17**: Fix LogParser ARM64 regression — restore DFA/Teddy for (?m)^, partial prefilter
+- **v0.12.18**: Flat DFA transition table (Rust approach), integrated prefilter skip-ahead in DFA+PikeVM, 4x unrolling — **35% faster than v0.12.14, 3x from Rust**
 
 ---
 
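
The "4x unrolling" named in the v0.12.18 roadmap entry can be sketched as follows. This is a minimal illustration, assuming a flat table where `0xFFFFFFFF` marks an unknown transition; `classOf`, `step`, `runSimple`, and `runUnrolled` are hypothetical names, and the real `searchFirstAt` also handles match and dead states, which this sketch omits.

```go
package main

import "fmt"

// StateID mirrors the uint32 state identifier used in the commit.
type StateID uint32

// invalidState marks a transition that is not in the flat table yet.
const invalidState StateID = 0xFFFFFFFF

// classOf is a hypothetical byte-to-class mapping with two classes (even, odd).
func classOf(b byte) int { return int(b & 1) }

// step performs one flat-table lookup; sid must be an ordinary state ID.
func step(trans []StateID, stride int, sid StateID, b byte) StateID {
	return trans[int(sid)*stride+classOf(b)]
}

// runSimple consumes one byte per iteration and stops before the first
// unknown transition. It returns the last state and the bytes consumed.
func runSimple(trans []StateID, stride int, sid StateID, h []byte) (StateID, int) {
	pos := 0
	for pos < len(h) {
		next := step(trans, stride, sid, h[pos])
		if next == invalidState {
			break
		}
		sid = next
		pos++
	}
	return sid, pos
}

// runUnrolled consumes four bytes per iteration while all four transitions
// are known. On any unknown transition it abandons the group without
// committing and lets the single-byte path redo it.
func runUnrolled(trans []StateID, stride int, sid StateID, h []byte) (StateID, int) {
	pos := 0
	for pos+4 <= len(h) {
		s1 := step(trans, stride, sid, h[pos])
		if s1 == invalidState {
			break
		}
		s2 := step(trans, stride, s1, h[pos+1])
		if s2 == invalidState {
			break
		}
		s3 := step(trans, stride, s2, h[pos+2])
		if s3 == invalidState {
			break
		}
		s4 := step(trans, stride, s3, h[pos+3])
		if s4 == invalidState {
			break
		}
		sid = s4
		pos += 4
	}
	// Single-byte path: handles the tail and any unknown transition.
	tailSid, tailLen := runSimple(trans, stride, sid, h[pos:])
	return tailSid, pos + tailLen
}

func main() {
	// Parity automaton: state 0 = even number of odd bytes seen, state 1 = odd.
	trans := []StateID{0, 1, 1, 0}
	h := []byte{1, 1, 1, 0, 1, 0, 0, 1, 1}
	sidA, posA := runSimple(trans, 2, 0, h)
	sidB, posB := runUnrolled(trans, 2, 0, h)
	fmt.Println(sidA == sidB && posA == posB)
}
```

The unrolled loop only pays off when the per-byte checks are cheap and rarely taken, which is consistent with the commit message's note that the gain on x86 was marginal.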

dfa/lazy/cache.go

Lines changed: 92 additions & 15 deletions
@@ -27,36 +27,50 @@ import (
 // - After too many clears, falls back to NFA
 // - Clearing keeps allocated memory to avoid re-allocation
 type DFACache struct {
-	// states maps StateKey -> DFA State
+	// states maps StateKey -> DFA State (used only in determinize slow path)
 	states map[StateKey]*State
 
-	// stateList provides O(1) lookup of states by ID via direct indexing.
-	// StateIDs are sequential (0, 1, 2...), so slice indexing is faster than map.
-	// This was previously DFA.states — moved here because it grows during search.
+	// stateList provides O(1) lookup of State structs by ID.
+	// Used only in slow path (determinize, word boundary, acceleration).
+	// Hot loop uses flatTrans + matchFlags instead.
 	stateList []*State
 
+	// --- Flat transition table (Rust approach) ---
+	// Hot loop uses ONLY these fields — no *State pointer chase.
+	//
+	// Rust: cache.trans[sid + class] — single flat array, premultiplied ID.
+	// We use: flatTrans[int(sid)*stride + class] — same layout.
+	//
+	// This replaces per-state State.transitions[] in the hot loop:
+	// ONE slice access instead of TWO pointer chases (stateList → State → transitions).
+
+	// flatTrans is the flat transition table.
+	// Layout: [state0_c0, state0_c1, ..., state0_cN, state1_c0, ...]
+	// InvalidState (0xFFFFFFFF) = unknown transition (needs determinize).
+	flatTrans []StateID
+
+	// matchFlags[stateID] = true if state is a match/accepting state.
+	// Replaces State.IsMatch() in hot loop — no pointer chase needed.
+	matchFlags []bool
+
+	// stride is the number of byte equivalence classes (alphabet size).
+	stride int
+
 	// startTable caches start states for different look-behind contexts.
-	// This enables correct handling of assertions (^, \b, etc.) and
-	// avoids recomputing epsilon closures on every search.
-	// Previously lived on DFA — moved here because it is populated lazily.
 	startTable StartTable
 
 	// maxStates is the capacity limit
 	maxStates uint32
 
 	// nextID is the next available state ID.
-	// Start at 1 (0 is reserved for StartState).
 	nextID StateID
 
-	// clearCount tracks how many times the cache has been cleared during
-	// the current search. This is used to detect pathological cache thrashing
-	// and trigger NFA fallback when clears exceed the configured limit.
-	// Inspired by Rust regex-automata's hybrid DFA cache clearing strategy.
+	// clearCount tracks cache clear count for NFA fallback threshold.
 	clearCount int
 
-	// Statistics for cache performance tuning
-	hits   uint64 // Number of cache hits
-	misses uint64 // Number of cache misses
+	// Statistics
+	hits   uint64
+	misses uint64
 }
 
 // Get retrieves a state by its key.
@@ -95,9 +109,67 @@ func (c *DFACache) Insert(key StateKey, state *State) (StateID, error) {
 	c.states[key] = state
 	c.misses++
 
+	// Grow flat transition table for this state's row (all InvalidState initially).
+	if c.stride > 0 {
+		sid := int(state.id)
+		needed := (sid + 1) * c.stride
+		if needed > len(c.flatTrans) {
+			growth := needed - len(c.flatTrans)
+			for i := 0; i < growth; i++ {
+				c.flatTrans = append(c.flatTrans, InvalidState)
+			}
+		}
+		// Grow matchFlags
+		for len(c.matchFlags) <= sid {
+			c.matchFlags = append(c.matchFlags, false)
+		}
+		c.matchFlags[sid] = state.isMatch
+	}
+
 	return state.ID(), nil
 }
 
+// safeOffset computes flat table offset, safe on 386 where int is 32-bit.
+// StateID is uint32; on 386 int(0xFFFFFFFF) = -1 and uint multiply overflows.
+// Returns MaxInt for special state IDs (DeadState, InvalidState) so bounds
+// check (offset < ftLen) always fails safely.
+func safeOffset(sid StateID, stride int, classIdx int) int {
+	if sid >= DeadState {
+		return int(^uint(0) >> 1) // MaxInt — always >= ftLen
+	}
+	return int(sid)*stride + classIdx
+}
+
+// SetFlatTransition records a transition in the flat table.
+// Called from determinize when a transition is computed.
+func (c *DFACache) SetFlatTransition(fromID StateID, classIdx int, toID StateID) {
+	offset := safeOffset(fromID, c.stride, classIdx)
+	if offset < len(c.flatTrans) {
+		c.flatTrans[offset] = toID
+	}
+}
+
+// FlatNext returns the next state ID from the flat table.
+// Returns InvalidState if the transition hasn't been computed yet.
+// This is the hot-path function — should be inlined by the compiler.
+func (c *DFACache) FlatNext(sid StateID, classIdx int) StateID {
+	offset := int(sid)*c.stride + classIdx
+	return c.flatTrans[offset]
+}
+
+// IsMatchState returns whether the given state ID is a match state.
+// Uses compact matchFlags slice — no pointer chase.
+func (c *DFACache) IsMatchState(sid StateID) bool {
+	if sid >= DeadState {
+		return false
+	}
+	id := int(sid)
+	if id >= len(c.matchFlags) {
+		return false
+	}
+	return c.matchFlags[id]
+}
+
 // GetOrInsert retrieves a state from cache or inserts it if not present.
 // This is the primary method used during DFA construction.
 //
@@ -220,6 +292,11 @@ func (c *DFACache) getState(id StateID) *State {
 		return nil
 	}
 
+	// Guard against special state IDs (DeadState=0xFFFFFFFE, InvalidState=0xFFFFFFFF).
+	// On 386, int(uint32(0xFFFFFFFF)) = -1, causing negative index panic.
+	if id >= DeadState {
+		return nil
+	}
 	idx := int(id)
 	if idx >= len(c.stateList) {
 		return nil
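
The 386 failure mode that `safeOffset` guards against can be reproduced on any platform by modelling the 32-bit `int` with `int32`. The sketch below is illustrative only: `naiveOffset32` and `guardedOffset32` are hypothetical names, and the stride of 20 is borrowed from the commit message's `uint(0xFFFFFFFE)*uint(20)` example.

```go
package main

import (
	"fmt"
	"math"
)

// StateID mirrors the uint32 state identifier used in the diff.
type StateID uint32

// Sentinel values as documented in the diff.
const (
	deadState    StateID = 0xFFFFFFFE
	invalidState StateID = 0xFFFFFFFF
)

// naiveOffset32 models int(sid)*stride + classIdx on a platform where
// int is 32 bits wide: converting a sentinel ID wraps to a negative number.
func naiveOffset32(sid StateID, stride, classIdx int32) int32 {
	return int32(sid)*stride + classIdx
}

// guardedOffset32 models the guarded helper: sentinel IDs map to the
// largest int32, so a bounds check of the form offset < len(table)
// always fails for them instead of producing a negative index.
func guardedOffset32(sid StateID, stride, classIdx int32) int32 {
	if sid >= deadState {
		return math.MaxInt32
	}
	return int32(sid)*stride + classIdx
}

func main() {
	fmt.Println(naiveOffset32(invalidState, 20, 3))   // negative offset
	fmt.Println(naiveOffset32(deadState, 20, 3))      // negative offset
	fmt.Println(guardedOffset32(invalidState, 20, 3)) // MaxInt32
	fmt.Println(guardedOffset32(5, 20, 3))            // ordinary state: 5*20 + 3
}
```

Note that `FlatNext` in the diff above still computes `int(sid)*c.stride + classIdx` directly, so it presumably relies on callers never passing `DeadState` or `InvalidState`; `SetFlatTransition` and `IsMatchState` carry their own guards.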
