Releases: coregx/coregex

v0.10.3: Critical capture group fix

@kolkov kolkov released this 08 Jan 17:29
02d3e30

Critical Bug Fix

FindStringSubmatch returned incorrect capture groups for .+ patterns

The Bug

re := coregex.MustCompile(`^(.+)-(\d+)$`)
matches := re.FindStringSubmatch("hello-123")
// Expected: matches[1] = "hello"
// Actual:   matches[1] = "hello-123" ← BUG

Root Cause

StateSplit in PikeVM passed the same captures to both branches without cloning. Because the reference count stayed at refs=1, the COW (Copy-on-Write) mechanism treated the shared captures as exclusively owned, so one branch's in-place modifications corrupted the other branch's capture data.

Fix

Clone captures for the right branch in StateSplit to ensure refs>1, enabling proper copy-on-write semantics.
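
The failure mode can be reproduced outside the engine. The following is a minimal sketch, assuming a reference-counted capture buffer; the `shared`, `caps`, `split`, and `demo` names are hypothetical and are not coregex's actual types. It shows why two aliases with refs=1 overwrite each other, and why registering the second holder first makes the next write copy.

```go
package main

import "fmt"

// shared is the reference-counted buffer behind one or more capture handles.
type shared struct {
	refs  int
	slots []int
}

// caps is a hypothetical copy-on-write handle, not coregex's actual type.
type caps struct{ s *shared }

func newCaps(n int) caps {
	return caps{&shared{refs: 1, slots: make([]int, n)}}
}

// clone registers a second holder; the data stays shared until a write.
func (c caps) clone() caps {
	c.s.refs++
	return c
}

// set writes slot i, copying the buffer first if another holder exists.
func (c *caps) set(i, v int) {
	if c.s.refs > 1 {
		c.s.refs--
		c.s = &shared{refs: 1, slots: append([]int(nil), c.s.slots...)}
	}
	c.s.slots[i] = v
}

// split hands captures to two branches. With fixed=false both branches alias
// one buffer whose refs is still 1, which reproduces the bug described above.
func split(c caps, fixed bool) (left, right caps) {
	if fixed {
		return c, c.clone()
	}
	return c, c
}

// demo lets each branch record a different end offset for the same group.
func demo(fixed bool) (leftEnd, rightEnd int) {
	left, right := split(newCaps(2), fixed)
	left.set(1, 9)
	right.set(1, 5)
	return left.s.slots[1], right.s.slots[1]
}

func main() {
	l, r := demo(false)
	fmt.Println("without clone:", l, r) // left branch sees the right branch's write
	l, r = demo(true)
	fmt.Println("with clone:   ", l, r) // each branch keeps its own value
}
```

Without the clone, the left branch's value is silently replaced by the right branch's write; with it, the first write detaches a private copy.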

Affected Patterns

  • .+, .+?, .*, .*? in capture groups
  • Not affected: Explicit character classes ([a-z]+, \w+)

Added

  • 16 regression tests for .+ capture group scenarios
  • Stdlib compatibility verification tests
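
For reference, the expected groups can be checked against Go's standard `regexp` package. This sketch uses only the standard library; the `submatch` helper is a hypothetical name, and swapping in `coregex.MustCompile` is assumed to work the same way only because the bug report above calls it with the same method.

```go
package main

import (
	"fmt"
	"regexp"
)

// submatch returns the capture groups that the standard library produces.
func submatch(pattern, input string) []string {
	return regexp.MustCompile(pattern).FindStringSubmatch(input)
}

func main() {
	m := submatch(`^(.+)-(\d+)$`, "hello-123")
	fmt.Printf("%q\n", m) // ["hello-123" "hello" "123"]
}
```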

Fixes #77

v0.10.2: Version pattern hotfix

@kolkov kolkov released this 07 Jan 11:29

Fixed

  • Version pattern regression (#75)
    • Restored DigitPrefilter for digit-lead patterns like \d+\.\d+\.\d+
    • v0.10.1 incorrectly chose ReverseInner with "." as inner literal
    • Performance restored: 8.2ms → 2.15ms (3.8x speedup)

Changed

  • Lint config: Added exclusion for unused AVX2 functions in _amd64.go files

Full Changelog: v0.10.1...v0.10.2

v0.10.1: AVX2 Slim Teddy, Version Pattern Fix

@kolkov kolkov released this 07 Jan 04:18

Changes

  • AVX2 Slim Teddy implementation (#69) - Available for direct benchmarks, uses shift algorithm from Rust aho-corasick
  • Version pattern ReverseInner (#70) - Improved strategy selection for digit-lead patterns
  • Optimization documentation (#71) - docs/OPTIMIZATIONS.md with 6 optimizations that beat Rust regex

Known Issues

  • AVX2 Slim Teddy not enabled in integrated prefilter due to regression in high false-positive workloads (#74)

v0.10.0: Fat Teddy 16-bucket SIMD for 33-64 patterns

@kolkov kolkov released this 06 Jan 22:35
ad4498b

Summary

This release introduces Fat Teddy, a 16-bucket AVX2 SIMD prefilter for patterns with 33-64 literals, completing the multi-pattern search tier:

Patterns | Engine             | Throughput
2-32     | Slim Teddy (SSSE3) | ~2 GB/s
33-64    | Fat Teddy (AVX2)   | 9+ GB/s
>64      | Aho-Corasick       | ~150 MB/s

Key Features

  • Fat Teddy AVX2 implementation - 16 buckets (vs Slim Teddy's 8) = 2x pattern capacity
  • 40x faster than scalar - AVX2 assembly with VPALIGNR half-shift algorithm
  • Smart fallback for small haystacks - Aho-Corasick used for <64 bytes (2.4x faster)
  • Pure Go scalar fallback - Works on non-AVX2 platforms
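
The tiers and the small-haystack fallback can be summarized as a selection function. This is a sketch of the thresholds stated in this release only; `chooseEngine` is a hypothetical name, and the real selection logic in meta/meta.go may consider more inputs.

```go
package main

import "fmt"

// chooseEngine is a hypothetical selector mirroring the tiers in this release.
func chooseEngine(patterns, haystackLen int) string {
	switch {
	case patterns > 64:
		return "Aho-Corasick"
	case patterns >= 33:
		if haystackLen < 64 {
			return "Aho-Corasick" // small-haystack fallback
		}
		return "Fat Teddy"
	case patterns >= 2:
		return "Slim Teddy"
	default:
		return "single-literal search" // outside the tiers; an assumption
	}
}

func main() {
	fmt.Println(chooseEngine(40, 1024)) // Fat Teddy
	fmt.Println(chooseEngine(40, 37))   // Aho-Corasick
	fmt.Println(chooseEngine(70, 1024)) // Aho-Corasick
}
```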

Performance

Patterns | Engine           | Throughput | vs Aho-Corasick
40       | Fat Teddy AVX2   | 9.1 GB/s   | 73x faster
40       | Fat Teddy scalar | 228 MB/s   | 1.5x faster
70       | Aho-Corasick     | 152 MB/s   | baseline

Small Haystack Optimization

  • Before: ~267 ns/op (Fat Teddy on 37-byte input)
  • After: ~110 ns/op (Aho-Corasick fallback)
  • Improvement: 2.4x faster

Technical Details

AVX2 Algorithm (from Rust aho-corasick):

  • VBROADCASTI128: Load 16 bytes, duplicate to both 128-bit lanes
  • VPSHUFB: Parallel nibble lookup in bucket masks
  • VPALIGNR $15: Half-shift for 2-byte fingerprint alignment
  • VPMOVMSKB: Extract 32-bit candidate mask
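
A scalar model may help when reading the instruction list. The sketch below assumes 16 buckets stored as bits of a `uint16` and a 2-byte fingerprint; `masks`, `build`, and `candidates` are hypothetical names, and the pattern-to-bucket assignment (`i % 16`) is a simplification. Roughly, the table lookups correspond to what VPSHUFB does for many bytes at once, and combining position `i` with position `i+1` is the alignment that the VPALIGNR step provides.

```go
package main

import "fmt"

// masks holds, for one fingerprint position, the buckets whose patterns have
// a byte with a given low nibble (lo) or high nibble (hi) at that position.
type masks struct{ lo, hi [16]uint16 }

// build assigns pattern i to bucket i%16 and records its first two bytes.
// Every pattern must be at least 2 bytes long.
func build(patterns []string) (m [2]masks) {
	for i, p := range patterns {
		bit := uint16(1) << (i % 16)
		for pos := 0; pos < 2; pos++ {
			b := p[pos]
			m[pos].lo[b&0x0F] |= bit
			m[pos].hi[b>>4] |= bit
		}
	}
	return m
}

// candidates returns, for each offset, the bucket bits that survive both
// fingerprint bytes. A non-zero value still needs full verification.
func candidates(m [2]masks, hay []byte) []uint16 {
	out := make([]uint16, len(hay))
	for i := 0; i+1 < len(hay); i++ {
		b0, b1 := hay[i], hay[i+1]
		c0 := m[0].lo[b0&0x0F] & m[0].hi[b0>>4]
		c1 := m[1].lo[b1&0x0F] & m[1].hi[b1>>4]
		out[i] = c0 & c1
	}
	return out
}

func main() {
	m := build([]string{"foo", "bar"})
	for i, c := range candidates(m, []byte("xxbarxfoo")) {
		if c != 0 {
			fmt.Printf("offset %d: buckets %016b\n", i, c)
		}
	}
}
```

Because several patterns can share a bucket, a surviving bit is only a candidate; the engine still has to compare the full pattern at that offset.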

Files Changed

  • prefilter/teddy_fat.go - Fat Teddy core implementation
  • prefilter/teddy_avx2_amd64.s - AVX2 assembly (~300 lines)
  • meta/meta.go - Aho-Corasick fallback for small haystacks
  • Documentation updates (README, CHANGELOG, ROADMAP)

Breaking Changes

None. This is a purely additive feature.


Full Changelog: v0.9.5...v0.10.0

v0.9.5: Teddy 8→32 patterns, literal extraction fix

@kolkov kolkov released this 06 Jan 19:45

Changes

  • Teddy pattern limit expanded from 8 to 32 (#67)
    • Slim Teddy now handles up to 32 patterns (was 8)
    • Strategy threshold updated: Aho-Corasick triggers at >32 patterns (was >8)
    • Follows Rust aho-corasick architecture

Fixed

  • Literal extraction for factored prefixes (#67)
    • Problem: syntax.Parse factors (Wanderlust|Weltanschauung) into W(anderlust|eltanschauung)
    • Caused wrong strategy selection: UseReverseSuffixSet instead of UseTeddy
    • Benchmark fix: 376µs → 1.7µs (220x faster)
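
One way to recover the complete literals from a factored tree is to expand concatenations over alternations. The sketch below uses the standard `regexp/syntax` package; `literals` and `literalsOf` are hypothetical helpers written for illustration, not the extraction code that coregex ships.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// literals expands a (possibly prefix-factored) alternation back into its
// full literal strings. ok is false when the tree has non-literal nodes.
func literals(re *syntax.Regexp) (lits []string, ok bool) {
	switch re.Op {
	case syntax.OpEmptyMatch:
		return []string{""}, true
	case syntax.OpLiteral:
		return []string{string(re.Rune)}, true
	case syntax.OpCapture:
		return literals(re.Sub[0])
	case syntax.OpAlternate:
		for _, sub := range re.Sub {
			l, ok := literals(sub)
			if !ok {
				return nil, false
			}
			lits = append(lits, l...)
		}
		return lits, true
	case syntax.OpConcat:
		lits = []string{""}
		for _, sub := range re.Sub {
			tails, ok := literals(sub)
			if !ok {
				return nil, false
			}
			var next []string
			for _, head := range lits {
				for _, tail := range tails {
					next = append(next, head+tail)
				}
			}
			lits = next
		}
		return lits, true
	}
	return nil, false
}

// literalsOf parses pattern with Perl syntax and expands its literals.
func literalsOf(pattern string) ([]string, bool) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return nil, false
	}
	return literals(re)
}

func main() {
	lits, ok := literalsOf(`(Wanderlust|Weltanschauung)`)
	fmt.Println(lits, ok)
}
```

With the full literals available, a strategy selector can see two complete patterns rather than a one-byte prefix followed by a suffix set.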

Install

go get github.com/coregx/coregex@v0.9.5

v0.9.4: Streaming State Machine for CharClassSearcher

@kolkov kolkov released this 06 Jan 14:32

What's Changed

Changed

  • Streaming state machine for CharClassSearcher - single-pass FindAll/Count
    • New methods: FindAllIndices(), Count() with streaming state machine
    • Eliminates per-match function call overhead
    • Based on Rust regex approach: SEARCHING/MATCHING states
    • Integrated into public API: FindAll(), FindAllIndex() use streaming path
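
The single-pass idea can be sketched with a 256-entry membership table and two states. This is an illustration under stated assumptions: `findAllIndices` and `wordClass` here are hypothetical free functions for a pattern shaped like `\w+`, not the CharClassSearcher methods themselves.

```go
package main

import "fmt"

// findAllIndices scans hay once and returns [start, end) for every maximal
// run of bytes that are members of the class table.
func findAllIndices(class *[256]bool, hay []byte) [][2]int {
	const (
		searching = iota
		matching
	)
	var out [][2]int
	state, start := searching, 0
	for i, b := range hay {
		switch {
		case state == searching && class[b]:
			state, start = matching, i
		case state == matching && !class[b]:
			out = append(out, [2]int{start, i})
			state = searching
		}
	}
	if state == matching {
		out = append(out, [2]int{start, len(hay)})
	}
	return out
}

// wordClass builds the table for [A-Za-z0-9_], i.e. ASCII \w.
func wordClass() *[256]bool {
	var t [256]bool
	for b := 0; b < 256; b++ {
		c := byte(b)
		t[b] = c == '_' || (c >= '0' && c <= '9') ||
			(c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
	}
	return &t
}

func main() {
	fmt.Println(findAllIndices(wordClass(), []byte("ab, cd1 x"))) // [[0 2] [4 7] [8 9]]
}
```

A count-only variant would increment a counter where this version appends, which avoids allocating the result slice.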

Performance

  • CharClassFindAll: 15-30% faster (1500ns → 1100-1400ns on 1KB)
  • char_class gap vs Rust: reduced from 2.6x to ~1.9x
  • No regressions on other patterns (+0.05% geomean)

Full Changelog: v0.9.3...v0.9.4

v0.9.3: Teddy 2-byte fingerprint + strategy optimization

@kolkov kolkov released this 06 Jan 12:42

Summary

Optimize strategy selection and implement Teddy 2-byte fingerprint for reduced false positives.

Changes

Teddy 2-byte Fingerprint

  • Changed default from 1-byte to 2-byte fingerprint
  • New SSSE3 assembly: teddySlimSSSE3_2
  • Reduces false positives from ~25% to <0.5%

Strategy Selection Reorder

  • DigitPrefilter now checked before tiny NFA fallback
  • Added isDigitLeadPattern() helper for digit-lead pattern detection
  • Prevents high-frequency literals (like .) from being used as inner search targets
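
Digit-lead detection can be approximated by walking the parsed pattern and checking what its first consumed byte can be. The sketch below is built on the standard `regexp/syntax` package; `digitLead` and `digitLeadPattern` are hypothetical names, and the actual isDigitLeadPattern() helper may apply different rules.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

func isDigit(r rune) bool { return r >= '0' && r <= '9' }

// digitLead reports whether every match of re must begin with an ASCII digit.
// It is conservative: unknown node kinds return false.
func digitLead(re *syntax.Regexp) bool {
	switch re.Op {
	case syntax.OpLiteral:
		return len(re.Rune) > 0 && isDigit(re.Rune[0])
	case syntax.OpCharClass:
		// Rune holds inclusive [lo, hi] pairs; each range must be digits only.
		for i := 0; i+1 < len(re.Rune); i += 2 {
			if !isDigit(re.Rune[i]) || !isDigit(re.Rune[i+1]) {
				return false
			}
		}
		return len(re.Rune) > 0
	case syntax.OpCapture, syntax.OpPlus:
		return digitLead(re.Sub[0])
	case syntax.OpRepeat:
		return re.Min >= 1 && digitLead(re.Sub[0])
	case syntax.OpConcat:
		return len(re.Sub) > 0 && digitLead(re.Sub[0])
	case syntax.OpAlternate:
		for _, sub := range re.Sub {
			if !digitLead(sub) {
				return false
			}
		}
		return len(re.Sub) > 0
	}
	return false
}

// digitLeadPattern parses pattern with Perl syntax and applies digitLead.
func digitLeadPattern(pattern string) bool {
	re, err := syntax.Parse(pattern, syntax.Perl)
	return err == nil && digitLead(re)
}

func main() {
	fmt.Println(digitLeadPattern(`\d+\.\d+\.\d+`)) // true
	fmt.Println(digitLeadPattern(`[a-z]+\d`))      // false
}
```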

Performance

Pattern     | v0.9.2 | v0.9.3 | Change
literal_alt | 31ms   | 8ms    | +4x faster
version     | 8.2ms  | 2ms    | +4x faster
IP          | 3.9ms  | 5.5ms  | -43% (trade-off)

Note: IP pattern is 43% slower but remains 2.2x faster than Rust regex. See #62 for future optimization research.

Full Changelog

https://github.com/coregx/coregex/blob/main/CHANGELOG.md#093---2026-01-06

v0.9.2: Simplified DigitPrefilter (146x IP speedup)

@kolkov kolkov released this 06 Jan 10:59
30dbd01

What's Changed

Replaced the adaptive switching approach from v0.9.1 with a simpler and faster solution.

Background

v0.9.1 added runtime adaptive switching to handle dense digit data. Testing revealed that:

  1. Adaptive tracking itself added overhead (~50ms on 6MB)
  2. Complex patterns (like IP with 74 NFA states) are better served by pure DFA

New Approach

Instead of runtime adaptation, we now use compile-time strategy selection:

  • Simple digit patterns (≤100 NFA states) → DigitPrefilter
  • Complex digit patterns (>100 NFA states) → LazyDFA

This eliminates runtime overhead while achieving better performance.
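
As a sketch of the compile-time decision: the threshold of 100 comes from this release, while `strategyFor` and `estimateStates` are hypothetical names. The estimate uses the instruction count of the standard library's compiled program as a stand-in for coregex's own NFA state count, which is an assumption and will not match its numbers exactly.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// maxPrefilterStates mirrors the 100-state limit described in this release.
const maxPrefilterStates = 100

// strategyFor picks a strategy once, at compile time, from the state count.
func strategyFor(nfaStates int) string {
	if nfaStates <= maxPrefilterStates {
		return "DigitPrefilter"
	}
	return "LazyDFA"
}

// estimateStates uses the stdlib compiler's instruction count as a rough
// proxy for NFA size (an assumption; coregex counts its own states).
func estimateStates(pattern string) (int, error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return 0, err
	}
	prog, err := syntax.Compile(re.Simplify())
	if err != nil {
		return 0, err
	}
	return len(prog.Inst), nil
}

func main() {
	n, err := estimateStates(`\d+\.\d+\.\d+`)
	if err != nil {
		panic(err)
	}
	fmt.Println(n, strategyFor(n))
}
```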

Performance Improvements

Pattern     | v0.9.1 | v0.9.2 | Speedup
IP          | 731ms  | 5ms    | 146x
char_class  | 183ms  | 113ms  | 1.6x
literal_alt | 61ms   | 29ms   | 2.1x

Changes

  • Remove digitPrefilterAdaptiveThreshold (runtime tracking)
  • Add digitPrefilterMaxNFAStates=100 (compile-time limit)
  • Add PikeVM.SearchBetween for bounded search optimization
  • Update benchmarks in README

Full Changelog: v0.9.1...v0.9.2

v0.9.1: DigitPrefilter Adaptive Switching

@kolkov kolkov released this 05 Jan 01:10
d5c6862

Fixed

DigitPrefilter adaptive switching for high false-positive scenarios

  • Problem: DigitPrefilter was slow on dense digit data (many consecutive FPs)
  • Solution: Runtime adaptive switching - after 64 consecutive false positives, switch to DFA
  • Based on Rust regex insight: "prefilter with high FP rate makes search slower"

Performance (IP regex benchmarks)

Scenario    | stdlib  | coregex | Speedup
Sparse 64KB | 833 µs  | 2.8 µs  | 300x
Dense 64KB  | 8.5 µs  | 2.4 µs  | 3.5x
No IPs 1MB  | 60.7 ms | 19.8 µs | 3000x

Details

  • Sparse data: prefilter remains fast (100-3000x speedup via SIMD skip)
  • Dense data: adaptively switches to lazy DFA (3-5x speedup vs stdlib)
  • New stat: Stats.PrefilterAbandoned tracks adaptive switching events
  • New constant: digitPrefilterAdaptiveThreshold = 64
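
The switching rule can be modeled with the standard library alone. In this sketch a scalar digit scan stands in for the SIMD prefilter and a plain `regexp` search stands in for the lazy DFA; `searcher`, `nextDigit`, and `abandonAfter` are hypothetical names, and only the threshold of 64 and the abandoned counter come from this release.

```go
package main

import (
	"fmt"
	"regexp"
)

const abandonAfter = 64 // consecutive false positives before giving up

type stats struct{ prefilterAbandoned int }

type searcher struct {
	anchored *regexp.Regexp // pattern pinned to the candidate position
	full     *regexp.Regexp // stand-in for the lazy DFA fallback
	stats    stats
}

// newSearcher expects a pattern whose matches always start with a digit.
func newSearcher(pattern string) *searcher {
	return &searcher{
		anchored: regexp.MustCompile(`^(?:` + pattern + `)`),
		full:     regexp.MustCompile(pattern),
	}
}

// nextDigit returns the index of the first ASCII digit at or after from.
func nextDigit(hay []byte, from int) int {
	for i := from; i < len(hay); i++ {
		if hay[i] >= '0' && hay[i] <= '9' {
			return i
		}
	}
	return -1
}

// find returns [start, end) of the leftmost match, or nil.
func (s *searcher) find(hay []byte) []int {
	falsePositives := 0
	for pos := 0; ; {
		i := nextDigit(hay, pos)
		if i < 0 {
			return nil
		}
		if loc := s.anchored.FindIndex(hay[i:]); loc != nil {
			return []int{i + loc[0], i + loc[1]}
		}
		falsePositives++
		pos = i + 1
		if falsePositives >= abandonAfter {
			// Too many wasted verifications: stop prefiltering.
			s.stats.prefilterAbandoned++
			if loc := s.full.FindIndex(hay[pos:]); loc != nil {
				return []int{pos + loc[0], pos + loc[1]}
			}
			return nil
		}
	}
}

func main() {
	var dense []byte
	for i := 0; i < 100; i++ {
		dense = append(dense, '1', ' ')
	}
	dense = append(dense, "10.0.0.1"...)

	s := newSearcher(`\d+\.\d+\.\d+\.\d+`)
	fmt.Println(s.find(dense), s.stats.prefilterAbandoned)
}
```

Because every match must start at a digit and all earlier digit positions were already rejected, resuming the fallback search just past the last candidate still yields the leftmost match.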

Full Changelog: v0.9.0...v0.9.1

v0.9.0: UseAhoCorasick, DigitPrefilter, Paired-byte SIMD

@kolkov kolkov released this 04 Jan 22:58
e09f196

Highlights

UseAhoCorasick Strategy

  • Large literal alternations (>8 patterns) via github.com/coregx/ahocorasick
  • 75-113x faster than stdlib on 15-20 pattern alternations
  • O(n) multi-pattern matching with ~1.6 GB/s throughput

DigitPrefilter Strategy (#56)

  • AVX2 SIMD digit scanner for IP regex patterns
  • 2500x faster on no-match scenarios
  • 39-152x faster on sparse IP data

Paired-byte SIMD Search (#55)

  • Byte frequency analysis for optimal rare byte selection
  • AVX2 MemchrPair() searches two bytes simultaneously
  • Dramatically reduces false positives
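
A scalar sketch of the paired-byte idea: choose the two least frequent bytes of the needle, then run the full comparison only where both already match at their offsets. `rarestPair`, `indexPaired`, and the frequency table are hypothetical; the real MemchrPair() is an AVX2 routine whose signature is not shown in these notes.

```go
package main

import (
	"bytes"
	"fmt"
)

// rarestPair returns the offsets of the rarest (i) and second-rarest (j)
// needle bytes according to freq. The needle must have at least 2 bytes.
func rarestPair(needle []byte, freq *[256]int) (i, j int) {
	i, j = 0, 1
	if freq[needle[j]] < freq[needle[i]] {
		i, j = j, i
	}
	for k := 2; k < len(needle); k++ {
		switch f := freq[needle[k]]; {
		case f < freq[needle[i]]:
			i, j = k, i
		case f < freq[needle[j]]:
			j = k
		}
	}
	return i, j
}

// indexPaired reports the first index of needle in hay. A position is fully
// compared only when both rare bytes already match at their offsets.
func indexPaired(hay, needle []byte, i, j int) int {
	for p := 0; p+len(needle) <= len(hay); p++ {
		if hay[p+i] == needle[i] && hay[p+j] == needle[j] &&
			bytes.Equal(hay[p:p+len(needle)], needle) {
			return p
		}
	}
	return -1
}

func main() {
	hay := []byte("a quick quiz")
	needle := []byte("quiz")

	// Hypothetical frequency table: counted from the haystack itself here,
	// whereas a real engine would more likely use a fixed ranking.
	var freq [256]int
	for _, b := range hay {
		freq[b]++
	}

	i, j := rarestPair(needle, &freq)
	fmt.Println(i, j, indexPaired(hay, needle, i, j))
}
```

Checking two rare bytes instead of one rejects positions such as "quic", where a single common byte would have triggered a full comparison.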

Installation

go get github.com/coregx/coregex@v0.9.0

See CHANGELOG.md for full details.