Skip to content

Latest commit

 

History

History
681 lines (504 loc) · 33.6 KB

File metadata and controls

681 lines (504 loc) · 33.6 KB

Changelog

0.15.5 - 2026-04-17

Changed

  • Process Maps: Significantly expanded NORM.txt with Unicode confusables data — non-ASCII characters that are visually similar to ASCII equivalents are now normalized, improving coverage against homoglyph-based evasion.
  • Process Maps: Extended NUM-NORM.txt with additional numeric character mappings and corrected existing values.
  • Process Maps: Extended TEXT-DELETE.txt with new codepoints for more comprehensive deletion coverage.
  • Process Maps: Removed comments from ROMANIZE.txt and VARIANT_NORM.txt, retaining only essential mappings.
  • Process Maps: Updated manifest.json to reflect new counts, sources, and Python/Unicode data versions.
  • Tooling: Enhanced generate_process_map.py to download and apply Unicode confusables, mapping non-ASCII characters to ASCII equivalents where applicable, with improved handling of combining marks.

Documentation

  • Clarify CJK variant normalization in README and DESIGN docs.

0.15.4 - 2026-04-15

Performance

  • Core: Custom DFA next_state walk via Automaton API, replacing materialized-match iteration. Enables fused prefilter-aware dispatch: Teddy active → materialize + try_find_overlapping; no prefilter → stream via next_state loop.
  • Core: Replace NEON scratch-buffer scan with bitmask extraction for SIMD delete filtering.
  • Core: Switch engine selection from byte-ratio heuristic to character density (bytecount::num_chars / len), improving CJK dispatch accuracy.

Added

  • Core: BytewiseDFAEngine — extracted DFA engine encapsulating dfa::DFA, dfa_to_value, and has_prefilter flag with 4-way fused-path dispatch.
  • Core: Parallel batch API (batch_is_match, batch_process, batch_find_match) via rayon work-stealing. Behind rayon feature flag.
  • Core: Profiling boundaries (#[profiling::function]) on key DFA and matcher functions.
  • Tooling: Benchmarking orchestration (scripts/run_benchmarks.py) and interactive Plotly visualization (scripts/bench_viz.py).
  • Tooling: Profiling workflow via macOS Instruments (just profile record/analyze).

Fixed

  • Core: Delete dual-scan correctness — propagate mask bit to root when Delete is a direct child, ensuring patterns with deletable characters are scanned against both original and transformed text.
  • Core: Merge colliding ProcessType buckets before parse_rules to prevent pattern index conflicts.

Changed

  • Core: Replace #[inline(always)] with #[inline] across codebase, letting LLVM decide inlining under full LTO.
  • Core: Streamline automata compilation by removing unnecessary closures.
  • Core: Simplify Delete dual-scan by reusing apply() result instead of redundant transform.
  • Core: Extract seek helper and encapsulate dual-scan check in scan module.
  • Core: Replace inline super:: paths with top-level imports for readability.
  • Core: Introduce MatcherError enum for structured construction failure reporting.
  • Config: Clean up configuration files, remove unnecessary flags, consolidate environment variables.

Dependencies

  • Update bitflags, daachorse, rayon.

Documentation

  • Comprehensive call-graph documentation for public API and internal logic.
  • Update CLAUDE.md for BytewiseDFAEngine and prefilter-aware dispatch.

0.15.3 - 2026-04-12

Changed

  • Core: Fuse WordState + satisfied_masks into a single RuleState struct, consolidating per-rule hot state into one cache line.

Fixed

  • Core: Resolve NORM/NUM-NORM overlap in process maps where shared codepoints caused ambiguous transform behavior.
  • Docs: Improve clarity in density-based engine dispatch problem description in DESIGN.md.

Tests

  • Add exhaustive process_map coverage validating all transform tables (VARIANT_NORM, ROMANIZE, DELETE, NORM, NUM-NORM, EMOJI_NORM).

0.15.2 - 2026-04-11

Changed

  • Python: Batch methods (batch_is_match, batch_process, batch_find_match) use PyBackedStr for zero-copy string handling, avoiding redundant UTF-8 copies across the FFI boundary.
  • Core: Replace unsafe get_unchecked with safe indexing guarded by assert_unchecked hints across scan and pattern modules.
  • Core: Add safety assertions for bitset indexing in DeleteMatcher.
  • Core: Simplify text processing by inlining replace_cow functionality.
  • Core: Move has_match into ScanState and remove unused methods.
  • Tooling: Trim cargo-all-features allowlist to only the perf feature.

Documentation

  • Update all README Quick Start examples to use builder APIs (Python, Java, C) for consistency with Rust.
  • Remove phantom process_iter reference from Rust README (method was removed in 0.15.0).
  • Fix formatting in root README architecture diagram.

0.15.1 - 2026-04-11

Added

  • Core: batch_is_match, batch_process, batch_find_match — parallel batch API powered by rayon. Distributes texts across CPU cores via work-stealing. Behind rayon feature flag (off by default in matcher_rs, enabled by all binding crates).
  • Core: Batch benchmarks (bench_search::batch) comparing sequential vs parallel throughput. 2.6–7.2× speedup on M3 Max.
  • C: simple_matcher_batch_is_match, simple_matcher_batch_process, simple_matcher_batch_find_match FFI functions with corresponding drop_* deallocators.

Changed

  • Python/Java: Existing batch_is_match, batch_process, batch_find_match now use rayon parallelism internally (previously sequential).

0.15.0 - 2026-04-10

Bindings

  • Python: Rewrite from_dict to iterate PyDict directly (no json.dumps round-trip). Replace simple_table_bytes storage with (pt, id, word) triples, halving input memory. Remove __getstate__/__setstate__ (pickle via __getnewargs__ only). Add SimpleMatcherBuilder, heap_bytes().
  • Java: Make SimpleMatcher Serializable for Spark distribution (stores config bytes, reconstructs native matcher on deserialization). Convert SimpleResult to a Java record. Add SimpleMatcherBuilder. Remove FastJSON runtime dependency. Make MatcherJava package-private; expose textProcess/reduceTextProcess as static methods on SimpleMatcher.
  • C: Extract decode_c_str helper and ffi_fn! macro to reduce FFI boilerplate. Add SimpleMatcherBuilder (init/add_word/build/drop). Add simple_matcher_heap_bytes.
  • Python: Add batch_find_match for batch first-match queries with single GIL release.

Refactor

  • Unify rule evaluation into single eval_hit with simplified direct-rule encoding.
  • Remove string pool, process_iter, and PatternKind::Simple; merge into And+SingleAnd.
  • Simplify WordState from 3 generation stamps to 1 + vetoed flag.
  • Flatten transform/replace/ into transform/; merge process/api.rs into process/mod.rs; move graph.rssimple_matcher/tree.rs; rename engine.rsscan.rs; merge encoding.rs into pattern.rs.
  • Consolidate RuleHot/RuleCold into single Rule type.
  • Introduce streaming byte iterators for filter transform steps.
  • Replace TinyVec with Vec across modules.

Documentation

  • Rewrite DESIGN.md around concepts and rationale.
  • Fix stale doc links across workspace.

Tooling

  • Align pre-commit hooks with just lint-check.
  • Consolidate bench files and deduplicate test data.

0.14.3 - 2026-04-08

Refactor

  • Merge AllSimple fast path into General — all query methods (process, for_each_match, process_iter) now use the unified walk_and_scan path. is_match retains a minimal AC-direct bypass for simple literal matchers.
  • Reject empty pattern sets at construction with MatcherError::EmptyPatterns.
  • Bundle bytewise and charwise engines in a non-optional Engines struct behind a unified ScanEngine trait with dispatch! macro.
  • Remove Option from DFA field — always present under cfg(feature = "dfa"). has_dfa() is now cfg!(feature = "dfa").
  • Unify 4 streaming filter iterators (DeleteFilterIterator, NormalizeFilterIterator, RomanizeFilterIterator, VariantNormFilterIterator) into a generic FilterIterator<F> backed by a CodepointFilter trait.

Breaking Changes

  • SimpleMatcher::new and SimpleMatcherBuilder::build now return Err(MatcherError::EmptyPatterns) when no scannable patterns remain after parsing. Previously, empty matchers silently returned no matches.
  • SimpleMatcher Debug output no longer includes search_mode.
  • Python SimpleMatcher.stats() no longer includes search_mode key.

Documentation

  • Update CLAUDE.md, DESIGN.md, README.md, and matcher_rs/README.md for engine architecture changes.

0.14.2 - 2026-04-07

Documentation

  • Add Architecture section and Common Pitfalls FAQ to root README.
  • Add performance tuning guidance ("When to Use Which") to matcher_rs README.
  • Add format header comments to all process_map/*.txt files (with # comment support in build.rs).
  • Add data/README.md documenting benchmark haystack and word list files.

Testing

  • Add 4 ProcessType composition edge case tests (Delete|EmojiNorm, None|Delete, Romanize vs RomanizeChar, VariantNormDeleteNormalize).
  • Add cargo-fuzz targets for SimpleMatcher::new and text_process (adversarial input fuzzing).
  • Fix proptest ASCII generator: exclude backslash to prevent \b word-boundary false negatives.

Bindings

  • Python: add __repr__ to SimpleMatcher (shows search mode and rule count).
  • Python: add missing EMOJI_NORM to .pyi type stub.
  • C: add matcher_version() function for runtime version queries.

Tooling

  • Add cargo-semver-checks CI job to catch accidental API breaks.
  • Add just fuzz / just fuzz-list recipes.

Refactor

  • Replace .expect() with let-else unreachable!() in api.rs and search.rs.

0.14.1 - 2026-04-07

Performance

  • LUT + unchecked indexing for word boundary checks.
  • Fused romanize-scan path via RomanizeFilterIterator.
  • Store and_count in PatternEntry to eliminate RuleHot cache misses.

Documentation

  • Add # Panics sections for compile_automata, walk_and_scan, process_entry, and get_transform_step.
  • Fix 2 broken RuleHot::and_count intra-doc links (field moved to PatternEntry).
  • Update CLAUDE.md and DESIGN.md for RuleHot/PatternEntry restructure.

Refactor

  • Unify NormalizeFilterIterator state into single remainder struct.
  • Remove OPTIMIZATION_IDEAS.md (no longer needed).

Bug Fixes

  • Remove unused import of text_process in bench.rs.
  • Update profiling category rules for current architecture.
  • Expand DFA scan category in time profile parser for improved accuracy.

Tooling

  • Add profile_build example and --target build support to profiler.
  • Add overlap comparison benchmarks for 3 AC engines.
  • Rewrite text_transform benchmarks to measure full matcher pipeline.

0.14.0 - 2026-04-06

Features

  • Add word boundary matching (\b) for whole-word precision in pattern rules.
  • Add OR operator (|) for alternative patterns within rules.
  • Add EmojiNorm ProcessType for emoji-to-English-word normalization via CLDR short names.
  • Generalize CJK transforms — rename Fanjian→VariantNorm, PinYin→Romanize with expanded JP/KR data.

Performance

  • Replace is_ascii + Harry SIMD dispatch with density-based engine selection (count_non_ascii_simd NEON/AVX2/portable). Harry matcher removed entirely.
  • 3-way fused scan dispatch — DFA materialize at low density, streaming charwise at high density, with 0.67 non-ASCII threshold.
  • Always build DFA + DAAC bytewise together; raise DFA pattern threshold to 25K.
  • Replace Normalize AC DFA with page-table + fused streaming scan.
  • Implement fused delete-scan path to stream non-deleted bytes directly into AC.
  • Eliminate Vec pointer re-resolution in scan hot path via ScanState split-borrow.
  • Optimize AC scan closure by pre-resolving &[RuleHot] slice and removing per-hit indirection.
  • Enhance bytewise matcher with prefilter acceleration.
  • Replace PrefixMap binary search with AHashMap for O(1) verification.
  • Specialize AllSimple process loop for single-transform-type matchers.
  • Skip is_ascii() dispatch when all patterns are ASCII.

Refactor

  • Split simple_matcher/rule.rs into encoding.rs, pattern.rs, and rule.rs modules.
  • Split replace.rs into variant_norm.rs, romanize.rs, normalize.rs sub-modules.
  • Add Fanjian streaming byte iterator and integrate into transform pipeline.
  • Replace #[inline(always)] with #[inline] for improved inlining heuristics.
  • Remove runtime_build feature.
  • Merge duplicate leaf-node scan paths in walk_and_scan.
  • Remove dead abstractions and fix stale doc links.

Bug Fixes

  • Resolve broken rustdoc links after module split.
  • Propagate transform output density for correct engine dispatch.

Tooling

  • Add interactive benchmark visualization with Plotly (just bench-viz).
  • Add engine dispatch characterization example and visualization.
  • Add Instruments profiling with atos inline resolution and source attribution.
  • Add pre-commit configuration with hooks for all languages.
  • Simplify bench/profiling tooling and add missing operator coverage.

Documentation

  • Enhance documentation with examples and performance notes across modules.
  • Document ScanState split-borrow optimization and RuleHot compaction in DESIGN.md.
  • Streamline CLAUDE.md with updated architecture and commands.

0.13.0 - 2026-04-03

Features

  • Add heap_bytes() to HarryMatcher and SimpleMatcher for heap memory introspection across all matcher components (AC automata, Harry tables, rule metadata, process-type trie).

Performance

  • Unify HarryMatcher into a single matcher with wildcarded columns, eliminating per-prefix-length scans (6x on CJK, 3-4x on mixed haystacks).
  • Column-0 early exit in NEON/AVX512 kernels skips columns 1-7 for ~95% of non-ASCII chunks.
  • Replace AHashMap with sorted split-array PrefixMap in Harry verification for L1-friendly binary search.
  • Gate Harry dispatch on ASCII-only patterns and DFA absence; improve non-ASCII haystack routing.
  • Const-generic SIMD kernels with PREFIX_LEN-scoped column loading.

Testing

  • Add 15 targeted coverage tests (process type display, streaming scan paths, NEON edge cases, threaded compilation).
  • Coverage: 86% of testable lines (excluding platform-gated AVX512, binding crates, benchmarks).

CI

  • Fix SIGILL on x86_64 CI runners by overriding target-cpu=native from .cargo/config.toml.
  • Add separate coverage workflow with tarpaulin and Codecov integration.
  • Replace Makefile with Justfile for all build/test/bench/lint commands.

Build

  • Add scripts/bump-version.sh and scripts/dev-setup.sh for release and onboarding automation.

0.12.3 - 2026-04-02

Performance

  • Add per-plan charwise_density_threshold to ScanPlan; AcDfa never routes to charwise at any density, DaacBytewise uses 0.1.
  • Raise AC_DFA_PATTERN_THRESHOLD 5000 → 7000 based on M3 Max benchmarks (+14% at 7k, -15% cliff at 8k due to L2 cache boundary).
  • Align ASCII transform fast paths — consolidate is_ascii / output_density tracking across TransformStep, simplify per-transform ASCII detection.

Bug Fixes

  • Fix leaf-transform noop handling: leaf nodes in the process trie that are ASCII no-ops were incorrectly re-scanning instead of reusing the parent variant.

Data

  • Regenerate all process maps (VARIANT_NORM, NORM, NUM-NORM, ROMANIZE, TEXT-DELETE) from updated Python sources.
  • Move map generator script into matcher_rs/scripts/generate_process_map.py; add manifest.json for reproducibility.
  • Remove large raw Unicode source files from data/str_conv/ (now generated on demand).

Benchmarks

  • Add density_dispatch bench module to calibrate the charwise threshold.
  • Add pattern_mix_en / pattern_mix_cn modules with CJK-% sweep to validate the all_ascii guard.
  • Extend search_ascii_en with 6000/7000/8000 pattern counts around the DFA threshold.

Testing

  • Add proptest-based property tests for transform correctness.
  • Extend transform unit tests; remove redundant matcher_rs coverage.

Documentation

  • Major rustdoc pass: ReplacementFinder, string pool, decode_utf8_raw, AsciiInputBehavior, get_transform_step, build_process_type_tree, multibyte_density, SIMD skip functions.
  • Update DESIGN.md: density-based engine selection, StepOutput shape, ScanPlan accessor list, threshold and constructor docs.
  • Add #![warn(missing_docs)] to crate root.

0.12.2 - 2026-03-31

Performance

  • Fix Romanize regression by eliminating Replacement enum indirection in replacement engines.
  • Unify streaming tree walk into single walk_and_scan method — 25% faster process, 33% faster is_match.
  • Lazy transform pipeline for is_match — skips materializing text variants when early exit is possible.

Refactor

  • Merge charwise + normalize into unified replace.rs with shared ReplacementFinder trait.
  • Deduplicate SIMD dispatch and AVX2 entry points with macros.
  • Extract shared UTF-8 decoder to transform/utf8.rs.
  • Merge step.rs and registry.rs into single step module.
  • Remove dead public API after walk_and_scan unification.
  • Remove unused optimizations (masks pool, VariantNorm in-place, SingleProcessType const generic).
  • Remove unused daachorse dependency and related non-overlapping code.

Bug Fixes

  • Remove broken single-step match processing methods from SimpleMatcher.

Testing

  • Add unit tests for critical internals and improve coverage infrastructure.
  • Simplify runtime build test configuration.

Documentation

  • Add doc tests and expand rustdoc for public API gaps.
  • Update CLAUDE.md and DESIGN.md for post-refactor accuracy.

0.12.1 - 2026-03-30

Performance

  • Optimize search throughput 10-17% via six hot-path improvements.
  • Encode rule_idx directly in automaton values for simple single-PT patterns (DIRECT_RULE_BIT), eliminating one indirection per hit.
  • Skip text.is_ascii() scan when only ASCII patterns exist.
  • Optimize is_match hot path with two targeted improvements.
  • Raise AC_DFA_PATTERN_THRESHOLD to 5000 and optimize bench_engine.
  • Improve SimpleMatcher build performance up to 42%.
  • Replace std HashMap with ahash in runtime_build transform init.

API

  • SimpleMatcher::new and builder::build now return Result instead of panicking.
  • SimpleMatcherBuilder::add_word accepts owned String in addition to &str.
  • Add #[must_use] to public types and query methods.
  • Derive PartialEq/Eq on SimpleResult; add Send + Sync static assertions.
  • Add manual Debug impl for SimpleMatcher.

Features

  • Release GIL and add batch methods (is_match_batch, process_batch) in Python bindings.

Bug Fixes

  • Harden construction against invalid ProcessType and edge-case rules.
  • Fix Romanize handling to correctly track is_ascii for unmapped characters.
  • Resolve broken intra-doc link to cfg-gated private function.

Safety

  • Deny unsafe_op_in_unsafe_fn lint, document all unsafe blocks with SAFETY comments.
  • Add SAFETY comments to all unsafe blocks in AVX2 SIMD functions.
  • Add crate-level Safety section documenting unsafe usage.

Refactor

  • Reorganize simple_matcher internals into focused modules (build.rs, engine.rs, rule.rs, search.rs, state.rs).
  • Reorganize transform pipeline into dedicated modules under process/.
  • Replace FLAG_* bit flags with RuleShape enum in PatternEntry.

Testing

  • Add 6 property tests for correctness invariants.
  • Reorganize test suite by system-under-test.

Documentation

  • Rewrite DESIGN.md and update CLAUDE.md to match refactored codebase.
  • Add API tutorial and profiling targets in examples/.
  • Update DESIGN.md to reflect search throughput optimizations.

CI

  • Adopt cargo-nextest across all test workflows.
  • Enable rust-lld linker for test and bench builds.
  • Streamline cargo installation in release workflow.
  • Improve CI workflow reliability and efficiency.

0.12.0 - 2026-03-28

Performance

  • all_simple fast path for is_match — bypasses TLS state, generation counters, and overlapping iteration for pure-literal matchers.
  • Dedup length pre-filter to skip redundant pattern entries during construction.
  • Thread-local TRANSFORM_STATE bundles scratch buffers into a single TLS lookup per call; literal fast path avoids TLS entirely for simple cases.
  • In-place VariantNorm optimization — exploits same-byte-length property of 99%+ Traditional-to-Simplified mappings to avoid scan-and-rebuild allocations.
  • Shrink PatternEntry from 16 to 8 bytes via sequential process-type indexing.
  • Embed dedup indices directly in DAAC automaton values, eliminating one indirection per hit.
  • Track is_ascii flag through the transform pipeline to skip redundant charwise scans on ASCII-only text.
  • Auto-select DAAC bytewise engine over AC DFA when ASCII pattern count exceeds 2000.

Refactor

  • Replace PatternEntry boolean flags with PatternKind enum for clearer dispatch in process_match.
  • Reorganize matcher_rs into focused single-responsibility modules: simple_matcher/ split into types.rs, construction.rs, scan.rs; process/ split into process_type.rs, string_pool.rs, process_tree.rs, transform/.
  • Improve code clarity via named structs (ScanContext, RuleHot, RuleCold) and bundled TLS parameters.

Dependencies

  • Bump sonic-rs to 0.5.8, tinyvec to 1.11.0, proptest to 1.11.0.
  • Migrate matcher_java JNI bindings to jni 0.22.4.

Documentation

  • Rewrite DESIGN.md to reflect current implementation with detailed sections on state management, SIMD dispatch, and const-generic optimizations.
  • Update all READMEs to match current package APIs: document text_process/reduce_text_process in C and Java bindings, add ProcessType reference tables, fix paths, improve build instructions.

0.11.0 - 2026-03-12

Breaking Changes

  • Removed the vectorscan backend to simplify the build process and eliminate the external Boost dependency requirement.

Performance

  • Simplified SIMD utility dispatching by removing OnceLock/SimdDispatch for AArch64 (NEON is now always baseline) and gating it for x86_64 only.
  • Removed dead API surface and unused parameters in SIMD hot paths.
  • Optimized search hot paths and benchmark tooling in matcher_rs.
  • Added comprehensive benchmark results for MacBook Air M4 (Apple Silicon).

Documentation

  • Filled documentation gaps, added # Panics / # Errors / # Arguments sections, and explained internal implementations in matcher_rs.
  • Aligned public documentation and improved comments on private items for better maintainability.

0.10.3 - 2026-03-11

Performance

  • Hot/cold struct split, pre-computed masks, TLS consolidation for reduced per-call overhead in SimpleMatcher.
  • Skip unused text variants during process-tree traversal, avoiding redundant transformations.
  • Cache Romanize trim metadata to eliminate repeated recomputation.
  • Lazy tree walking for unique text variants — process-tree nodes are now visited on demand rather than eagerly.

Refactor

  • Extract is_rule_satisfied as a dedicated method for clarity and measurable performance improvement.
  • Optimize tree node index handling in walk_process_tree (formerly reduce_text_process_with_tree).
  • Rename traversal function to walk_process_tree and update terminology throughout.
  • Improve encapsulation: SingleCharMatcher/SingleCharMatch visibility narrowed; SimpleMatcher internals use more descriptive struct names (RuleHot, etc.).
  • Add safety assertions in page_table_lookup.
  • Update type-ignore comments in test cases for clarity.

Documentation

  • Update terminology and traversal descriptions in DESIGN.md.
  • Update benchmark records in README.md with new results.

0.10.2 - 2026-03-10

Bug Fixes

  • Fix DeleteFindIter SIMD fast-skip incorrectly advancing past deletable ASCII bytes (e.g. spaces) that appear before a non-ASCII character in the same 16-byte chunk. The non_ascii_mask was checked before del_mask, causing the skip to jump to the first non-ASCII byte and silently drop intervening deletable characters. Fixed by ORing both masks and stopping at the first set bit in either.

0.10.1 - 2026-03-10

Performance

  • Monomorphize SingleChar iterators and add SIMD ASCII chunk-skip for faster inner loops.
  • Byte-level Romanize/Delete iterators and ascii_lut fast-path, eliminating UTF-8 decoding overhead on ASCII-heavy input.
  • portable_simd SIMD helpers (skip_ascii_simd, simd_ascii_delete_mask, skip_non_digit_ascii_simd) for 16-byte parallel probing in SingleChar skip loops.

Features

  • Exhaustive property-based and unit tests for VariantNorm, Delete, Normalize, and Romanize process types.
  • Macro-based benchmark generation with BytesCount metric for normalized throughput measurement.

Refactor

  • Improve clarity and consistency across the process module.

Documentation

  • Improve CLAUDE.md with benchmark scoping, test-file syntax, and architecture details.
  • Move benchmark output to bench_records/ and link from README.
  • Clarify get_or_init_matcher return type in docs.

0.10.0 - 2026-03-07

Breaking Changes

  • Removed Matcher, RegexMatcher, and SimMatcher components to focus on the high-performance SimpleMatcher.
  • Updated C and Java FFI interfaces to only support SimpleMatcher.

Documentation

  • Updated README.md, DESIGN.md, and GEMINI.md to reflect the focus on SimpleMatcher.
  • Cleaned up documentation and examples across all language bindings.

0.9.0 - 2026-03-05

Refactor & Performance

  • Replace standard HashMap and HashSet with FxHashMap and FxHashSet for improved execution speed.
  • Replace Vec<i32> with TinyVec in simple_matcher for improved performance.
  • Optimize inner loop with Vec indexing and flat matrix in simple_matcher.
  • Use FxHashMap + u64 bitmask for the inner loop of simple_matcher.
  • Rename ProcessedTextSet to ProcessedTextMasks and update its representation to use a u64 bitmask for process types.
  • Simplify TextMatcherTrait by deriving is_match and process_iter from process, and remove the TextMatcherInternal trait.
  • Simplify word splitting logic in SimpleMatcher::new using a helper closure and adjust lifetime bounds for borrowed types.
  • Simplify C FFI panic handling and wrap all panic::catch_unwind calls in FFI functions with AssertUnwindSafe.
  • Remove word_id from match result structs, refine regex pattern handling and matching.
  • Unconditionally configure mimalloc as the global allocator and remove conditional allocator dependencies.

Maintenance & Documentation

  • Standardize Rust documentation and include detailed algorithm explanations across all matching engines.
  • Update benchmark results in README.md after modifications to the simple matcher.
  • Configure rustflags to use 8 compilation threads.
  • Streamline CI Rust testing by adopting cargo-all-features and enabling RUST_BACKTRACE.
  • Expand and update CI workflows (upgrade action runners to ubuntu-24.04-arm and macos-latest).
  • Remove AGENTS.md and legacy tracker files.

0.8.1 - 2026-03-01

Refactor & Performance

  • Replace nohash-hasher, id-set, FxHashMap (rustc-hash), and micromap with std collections (HashMap/HashSet), removing these external dependencies.
  • Replace tinyvec::ArrayVec with std::vec::Vec for dynamic collections in the process matcher.

Maintenance & Documentation

  • Standardize rustdoc comments and add intra-doc links to type names across the project for improved readability.
  • Improve build/linting commands and remove outdated feature mentions.

0.8.0 - 2026-02-28

Breaking Changes

  • Implement sealed trait pattern for TextMatcherTrait.

Refactor & Performance

  • Use Box<[T]> for frozen Vec fields to optimize memory.
  • Introduce gen blocks for process_iter implementations to improve iteration.
  • Remove unsafe code, update aho-corasick dependency, optimize matcher with tinyvec.
  • Introduce ProcessTypeError for text_process handling.
  • Use eprintln for warnings instead of println.
  • Consolidate conditional matching logic and update FFI function attributes to unsafe(no_mangle).
  • Improve struct initializations and Option block handling.

Features

  • Derive Debug on MatchResult for consistency.
  • Add diagnostic::on_unimplemented to public traits for better compiler errors.

Maintenance & Documentation

  • Update Rust edition to 2024.
  • Add rust-toolchain.toml to use nightly toolchains for reproducible builds.
  • Remove direct deserialization for core types.
  • Improve SimpleMatcher and Matcher instantiation examples to recommend builder patterns.
  • Ensure correct and modern rust idiom implementations across repo.

0.7.2 - 2026-02-25

Refactor & Performance

  • Removed explicit ASCII case-insensitivity from AhoCorasickBuilder to simplify builder configuration.
  • Deferred String allocation in ProcessMatcher's replace_all and delete_all for performance optimization.
  • Simplified TextMatcherTrait and various internal matcher method implementations.
  • Expanded testing suite by separating tests into individual files, adding edge case checks and fixing slice coercion in proptests.

Maintenance & Documentation

  • Switched aho-corasick-unsafe dependency from git source to crates.io.
  • Updated benchmarks with deterministic scenarios for process types.
  • Enhanced Java example to use the high-level API and adjusted the environment for macOS.
  • Heavily improved documentation across README.md, README_CN.md, AGENTS.md and specific language READMEs.

0.7.1 - 2026-02-21

Security & Safety (Audit Fixes)

  • FFI Panic Safety: All entry points in matcher_c are now wrapped in catch_unwind to prevent native crashes when Rust code panics.
  • Memory Robustness: Fixed brittle raw pointer usage in reduce_text_process_with_tree (process matcher) by switching to indexing.
  • ReDoS Protection: Added pattern length limits (1024 chars) to RegexMatcher to mitigate exponential backtracking risks.
  • Invariants: Added debug_assert! checks across SimpleMatcher to verify internal consistency in development.

Java

  • Ergonomics: Introduced high-level Matcher and SimpleMatcher classes that implement AutoCloseable for automatic native memory management (RAII).

API (from 0.7.0)

  • Breaking: MatchResultTrait::similarity now returns Option<f64>None for exact matchers (Simple, Regex) and Some(score) for similarity matchers.
  • Breaking: MatchTableTrait::word_list and exemption_word_list now return &[S] instead of &Vec<S>.
  • Internal TextMatcherTrait methods are now marked #[doc(hidden)].

Performance / Correctness

  • Fixed double-checked locking in get_process_matcher.
  • Re-enabled overflow-checks globally; hot-path arithmetic uses wrapping_add / wrapping_mul.

Maintenance

  • Replaced lazy_static with std::sync::LazyLock.
  • Updated documentation regarding !Send iterators and git-dependency limitations.

0.6.0 - 2026-02-21

Added

  • Builder API: SimpleMatcherBuilder, MatchTableBuilder, MatcherBuilder.
  • process_iter — lazy iterator over match results for all four matcher types. RegexMatcher and SimMatcher have truly lazy implementations; SimpleMatcher wraps process() (two-pass AC constraint documented); Matcher avoids the final collect() via into_values().flatten().

0.5.9 - 2025-08-23

Changed

  • Update dependencies.

0.5.8 - 2025-08-23

  • staticmethod for extension_types.py
  • Update dependencies.

0.5.7 - 2025-03-17

Flexibility

  • Update dependencies.

0.5.6 - 2024-11-18

Performance

  • Fix build_process_type_tree function, use set instead of list.
  • Update several dependencies.

0.5.5 - 2024-10-14

Bug fixes

  • Change XXX(Enum) to XXX(str, Enum) in extension_types.py to fix json dumps issue.

Flexibility

  • Add Python 3.13 support.
  • Remove msgspec, only use json in README.md.

0.5.4 - 2024-08-23

Readability

  • Fix typo and cargo clippy warnings.
  • Add single line benchmark.

0.5.3 - 2024-07-26

Bug fixes

  • Fix simple matcher is_match function.

0.5.2 - 2024-07-22

Flexibility

  • Remove msgpack, now non-rust users should use json to serialize input of Matcher and SimpleMatcher.
  • Refactor Java code.

0.5.1 - 2024-07-19

Performance

  • Use FxHash to speed up simple matcher process.

Flexibility

  • Remove unnecessary dependencies.

0.5.0 - 2024-07-18

Changed

  • Major internal refactor of SimpleMatcher internals. See git history for details.

0.4.6 - 2024-07-15

Performance

  • Optimize SimpleMatcher hot-path performance.

0.4.5 - 2024-07-12

Changed

  • Optimize Simple Matcher process function when multiple simple_match_type are used.
  • Add dfa feature to matcher_rs.
  • Shrink VARIANT_NORM conversion map.

0.4.4 - 2024-07-09

Changed

  • Merge ROMANIZE and ROMANIZECHAR process matcher build.
  • Add process function to matcher_py/c/java.
  • Fix simple matcher process function issue.
  • Refactor matcher_py file structure, use rye to manage matcher_py.
  • Delete println! in matcher_c.

0.4.3 - 2024-07-08

Changed

  • Fix exemption word list wrongly reject entire match, not a single table.
  • Add match_id to MatchResult.
  • Reverse DFA structure to AhoCorasick structure.
  • matcher_c use from_utf8_unchecked instead of from_utf8.
  • Build multiple wheels for different python version.
  • Update VARIANT_NORM.txt and NORM.txt.
  • Fix issues with runtime_build feature.

0.4.2 - 2024-07-07

Performance

  • Optimize SimpleMatcher construction and search throughput.

0.4.1 - 2024-07-06

Changed

  • Rebuild Transformation Rules based on Unicode Standard.

0.4.0 - 2024-07-03

Changed

  • Implement NOT logic word-wise inside SimpleMatcher, now you can use &(and) and ~(not) separator to config simple word, eg: hello&world~helo.