coregx
diff --git a/CHANGELOG.md b/CHANGELOG.md
diff --git a/README.md b/README.md
diff --git a/ROADMAP.md b/ROADMAP.md
diff --git a/dfa/lazy/builder.go b/dfa/lazy/builder.go
@@ -12,6 +12,62 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - ARM NEON SIMD support (Go 1.26 `simd/archsimd` intrinsics — [#120](https://github.com/coregx/coregex/issues/120))
 - SIMD prefilter for CompositeSequenceDFA (#83)
 
+## [0.12.21] - 2026-03-27
+
+### Performance
+- **Tagged start states** (Rust `LazyStateID` approach) — start states get tag bit,
+  always route to the slow path. Prefilter skip-ahead then runs only at the start state,
+  eliminating the O(n²) cost of the start-state self-loop. Unlocks UseDFA for tiny NFA patterns.
+
+- **DFA multiline $ fix** — EndLine look-ahead re-computation in determinize
+  (Rust mod.rs:131-212). `(?m)hello$` now works correctly in DFA.
+
+- **Dead-state prefilter restart** in searchEarliestMatch — IsMatch path uses
+  prefilter to skip past dead states, matching Rust find_fwd_imp approach.
+
+- **1100x fewer mallocs** — FindAllIndex/FindAllSubmatchIndex use flat buffer
+  (`compactToSliceOfSlice`): N matches → 2 allocations instead of N+1.
+
+- **Local SearchState cache** on Engine — atomic.Pointer single-slot cache
+  survives GC, avoids sync.Pool re-allocation overhead.
+
+- **Tiny NFA → UseDFA routing** — patterns with < 20 NFA states now use
+  bidirectional DFA (was PikeVM). DFA is 7x faster than PikeVM on large inputs.
+
+### Added
+- **`AllIndex(b []byte) iter.Seq[[2]int]`** — zero-alloc match index iterator (Go 1.23+)
+- **`AllStringIndex(s string) iter.Seq[[2]int]`** — string version
+- **`All(b []byte) iter.Seq[[]byte]`** — zero-alloc match content iterator
+- **`AllString(s string) iter.Seq[string]`** — string version
+- **`AppendAllIndex(dst [][2]int, b []byte, n int) [][2]int`** — buffer-reuse API
+- **`AppendAllStringIndex(dst [][2]int, s string, n int) [][2]int`** — string version
+
+Naming follows Go proposal #61902 (regexp iterator methods) and `strconv.Append*` convention.
+
+### Fixed
+- DFA `isMatchWithPrefilter` pfSkip off-by-one — `zx+` on "zzx" now correct
+- DFA multiline `$` EndLine look-ahead — `(?m)hello$` now matches before `\n`
+
+### Benchmarks (LangArena LogParser, 7.2 MB, 13 patterns)
+
+| Metric | v0.12.20 | v0.12.21 | Improvement |
+|--------|----------|----------|-------------|
+| Total time (FindAll) | 163ms | **107ms** | **-34%** |
+| errors pattern | 23ms | **8ms** (FindAll) / **5.5ms** (AllIndex) | **-65% / -76%** |
+| vs Rust gap | 3.9x | **2.9x** (FindAll) / **1.7x** (AllIndex) | **-26% / -56%** |
+| Mallocs/iter | 203K | **182** | **-99.9%** |
+
+### Zero-Alloc API Benchmarks (new methods vs stdlib-compat)
+
+| Method | errors (33K matches) | Alloc | vs Rust |
+|--------|---------------------|-------|---------|
+| FindAllStringIndex (stdlib) | 8.2ms / 3890 KB | 19 mallocs | 2.6x slower |
+| **AllIndex (iter.Seq)** | **5.9ms / 0 KB** | **0 mallocs** | **1.8x slower** |
+| **AppendAllIndex (reuse)** | **5.5ms / 0 KB** | **0 mallocs** | **1.7x slower** |
+| Rust find_iter | 3.2ms / 0 | 0 | — |
+
+emails pattern: `AppendAllIndex` takes **2.0ms vs 2.6ms for Rust**, about 23% less time.
+
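Review note on the flat-buffer bullet in this changelog entry: one way N matches can cost 2 allocations instead of N+1 is to collect every start/end pair in a single flat `[]int` and then carve two-element sub-slices out of one shared backing array. The diff does not show `compactToSliceOfSlice`, so the signature and body below are assumptions for illustration, not the project's code. The sketch also assumes the flat scratch buffer is reused across calls and therefore not counted.

```go
package main

import "fmt"

// compactToSliceOfSlice is a hypothetical reconstruction of the helper named
// in the changelog. flat holds start/end pairs back to back:
// [s0, e0, s1, e1, ...]. A trailing unpaired element is ignored.
func compactToSliceOfSlice(flat []int) [][]int {
	n := len(flat) / 2
	if n == 0 {
		return nil
	}
	out := make([][]int, n)     // allocation 1: the outer slice of headers
	backing := make([]int, 2*n) // allocation 2: one array shared by all pairs
	copy(backing, flat)
	for i := range out {
		// Three-index slice clips capacity to 2, so an append on one
		// result cannot overwrite the neighbouring pair.
		out[i] = backing[2*i : 2*i+2 : 2*i+2]
	}
	return out
}

func main() {
	flat := []int{0, 3, 5, 9, 12, 14} // three matches collected by a search loop
	for _, m := range compactToSliceOfSlice(flat) {
		fmt.Println(m[0], m[1])
	}
}
```

The result has the same `[][]int` shape that `FindAllIndex` returns, which is presumably why the stdlib-compatible methods can adopt it without an API change.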
 ## [0.12.20] - 2026-03-25
 
 ### Performance
 
@@ -83,6 +83,7 @@ Cross-language benchmarks on 6MB input, AMD EPYC ([source](https://github.com/ko
 - Multi-pattern (`foo|bar|baz|...`) — Slim Teddy (≤32), Fat Teddy (33-64), or Aho-Corasick (>64)
 - Anchored alternations (`^(\d+|UUID|hex32)`) — O(1) branch dispatch (5-20x)
 - Concatenated char classes (`[a-zA-Z]+[0-9]+`) — DFA with byte classes (5-7x)
+- **Zero-alloc iterators** (`AllIndex`, `AppendAllIndex`) — 0 heap allocs, up to **30% faster** than FindAll. Email pattern **faster than Rust** with `AppendAllIndex`.
 
 ## Features
 
@@ -130,11 +131,28 @@ Supported methods:
 ### Zero-Allocation APIs
 
 ```go
-// Zero allocations — returns bool
+// Zero allocations — boolean match
 matched := re.IsMatch(text)
 
-// Zero allocations — returns (start, end, found)
+// Zero allocations — single match indices
 start, end, found := re.FindIndices(text)
+
+// Zero allocations — iterator over all matches (Go 1.23+)
+for m := range re.AllIndex(data) {
+    fmt.Printf("match at [%d, %d]\n", m[0], m[1])
+}
+
+// Zero allocations — match content iterator
+for s := range re.AllString(text) {
+    fmt.Println(s)
+}
+
+// Buffer-reuse — append to caller's slice (strconv.Append* pattern)
+var buf [][2]int
+for _, chunk := range chunks {
+    buf = re.AppendAllIndex(buf[:0], chunk, -1)
+    process(buf)
+}
 ```
 
 ### Configuration
 
@@ -97,8 +97,11 @@ v0.12.18 ✅ → Flat DFA transition table, integrated prefilter, PikeVM skip-ah
          ↓
 v0.12.19 ✅ → Zero-alloc FindSubmatch, byte-based DFA cache, Rust-aligned visited limits
          ↓
-v0.12.20 (Current) → Premultiplied/tagged StateIDs, break-at-match DFA determinize,
-                      Phase 3 elimination (2-pass bidirectional DFA)
+v0.12.20 ✅ → Premultiplied/tagged StateIDs, break-at-match DFA determinize,
+               Phase 3 elimination (2-pass bidirectional DFA)
+         ↓
+v0.12.21 (Current) → Tagged start states, zero-alloc API (AllIndex iter.Seq),
+                      1100x fewer mallocs, UseDFA for tiny NFA, -34% LangArena
          ↓
 v1.0.0-rc → Feature freeze, API locked
          ↓
 
@@ -64,6 +64,9 @@ func (b *Builder) Build() (*DFA, error) {
 	// Check if the NFA contains word boundary assertions
 	hasWordBoundary := b.checkHasWordBoundary()
 
+	// Check if the NFA contains EndLine ($) assertions
+	hasEndLine := b.checkHasEndLine()
+
 	// Check if the pattern is always anchored (has ^ prefix)
 	isAlwaysAnchored := b.nfa.IsAlwaysAnchored()
 
@@ -80,6 +83,7 @@ func (b *Builder) Build() (*DFA, error) {
 		byteClasses:      b.nfa.ByteClasses(),
 		unanchoredStart:  b.nfa.StartUnanchored(),
 		hasWordBoundary:  hasWordBoundary,
+		hasEndLine:       hasEndLine,
 		isAlwaysAnchored: isAlwaysAnchored,
 		startByteMap:     startByteMap,
 	}
@@ -706,3 +710,23 @@ func (b *Builder) checkHasWordBoundary() bool {
 	}
 	return false
 }
+
+// checkHasEndLine checks if the NFA contains EndLine ($) look assertions.
+// When true, determinize performs look-ahead re-computation on '\n' bytes.
+// Computed once at DFA build time for O(1) check in hot loop.
+func (b *Builder) checkHasEndLine() bool {
+	numStates := b.nfa.States()
+	for i := nfa.StateID(0); int(i) < numStates; i++ {
+		state := b.nfa.State(i)
+		if state == nil {
+			continue
+		}
+		if state.Kind() == nfa.StateLook {
+			look, _ := state.Look()
+			if look == nfa.LookEndLine {
+				return true
+			}
+		}
+	}
+	return false
+}
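
Review note: the two fixes listed in the changelog can be pinned down with a small regression check. The sketch below uses the standard library `regexp` as the reference for expected spans, on the assumption (taken from the README's stdlib-compat framing) that coregex should agree with it. `firstIndex` is a hypothetical helper, and wiring the same cases to the coregex engine is left out because its constructor is not shown in this diff.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstIndex returns the [start, end) span of the first match of pattern in
// s, or nil when there is no match. Stdlib regexp is the reference here.
func firstIndex(pattern, s string) []int {
	return regexp.MustCompile(pattern).FindStringIndex(s)
}

func main() {
	// Multiline $: EndLine must match just before '\n'.
	fmt.Println(firstIndex(`(?m)hello$`, "hello\nworld")) // [0 5]

	// Without (?m), $ only matches at end of text, so there is no match.
	fmt.Println(firstIndex(`hello$`, "hello\nworld")) // []

	// Prefilter off-by-one case from the changelog: the match starts at
	// the second 'z', not the first.
	fmt.Println(firstIndex(`zx+`, "zzx")) // [1 3]
}
```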