- Process Maps: Significantly expanded `NORM.txt` with Unicode confusables data: non-ASCII characters that are visually similar to ASCII equivalents are now normalized, improving coverage against homoglyph-based evasion.
- Process Maps: Extended `NUM-NORM.txt` with additional numeric character mappings and corrected existing values.
- Process Maps: Extended `TEXT-DELETE.txt` with new codepoints for more comprehensive deletion coverage.
- Process Maps: Removed comments from `ROMANIZE.txt` and `VARIANT_NORM.txt`, retaining only essential mappings.
- Process Maps: Updated `manifest.json` to reflect new counts, sources, and Python/Unicode data versions.
- Tooling: Enhanced `generate_process_map.py` to download and apply Unicode confusables, mapping non-ASCII characters to ASCII equivalents where applicable, with improved handling of combining marks.
- Clarify CJK variant normalization in README and DESIGN docs.
- Core: Custom DFA `next_state` walk via `Automaton` API, replacing materialized-match iteration. Enables fused prefilter-aware dispatch: Teddy active → materialize + `try_find_overlapping`; no prefilter → stream via `next_state` loop.
- Core: Replace NEON scratch-buffer scan with bitmask extraction for SIMD delete filtering.
- Core: Switch engine selection from byte-ratio heuristic to character density (`bytecount::num_chars / len`), improving CJK dispatch accuracy.
- Core: `BytewiseDFAEngine`, an extracted DFA engine encapsulating `dfa::DFA`, `dfa_to_value`, and `has_prefilter` flag with 4-way fused-path dispatch.
- Core: Parallel batch API (`batch_is_match`, `batch_process`, `batch_find_match`) via rayon work-stealing. Behind `rayon` feature flag.
- Core: Profiling boundaries (`#[profiling::function]`) on key DFA and matcher functions.
- Tooling: Benchmarking orchestration (`scripts/run_benchmarks.py`) and interactive Plotly visualization (`scripts/bench_viz.py`).
- Tooling: Profiling workflow via macOS Instruments (`just profile record/analyze`).
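
The character-density heuristic mentioned in the list above can be sketched roughly as follows. This is only an illustration: std's `chars().count()` stands in for `bytecount::num_chars`, and the `Engine` enum, the function names, and the direction and value of the threshold are assumptions rather than the crate's actual API.

```rust
/// Hypothetical engine choice; the crate's real types are not shown in this changelog.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Engine {
    Bytewise,
    Charwise,
}

/// Characters per byte: 1.0 for pure ASCII, about 0.33 for text made of 3-byte CJK characters.
pub fn char_density(text: &str) -> f64 {
    if text.is_empty() {
        return 1.0;
    }
    text.chars().count() as f64 / text.len() as f64
}

/// Assumed policy: dense (mostly single-byte) text goes to the bytewise engine,
/// sparse (mostly multi-byte) text goes to the charwise engine.
pub fn select_engine(text: &str, threshold: f64) -> Engine {
    if char_density(text) >= threshold {
        Engine::Bytewise
    } else {
        Engine::Charwise
    }
}
```

Unlike a plain "is any byte non-ASCII" check, the ratio degrades gradually, so a mostly-English text with a few CJK characters still routes to the bytewise side under this sketch.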
- Core: Delete dual-scan correctness — propagate mask bit to root when Delete is a direct child, ensuring patterns with deletable characters are scanned against both original and transformed text.
- Core: Merge colliding `ProcessType` buckets before `parse_rules` to prevent pattern index conflicts.
- Core: Replace `#[inline(always)]` with `#[inline]` across codebase, letting LLVM decide inlining under full LTO.
- Core: Streamline automata compilation by removing unnecessary closures.
- Core: Simplify Delete dual-scan by reusing `apply()` result instead of redundant transform.
- Core: Extract seek helper and encapsulate dual-scan check in scan module.
- Core: Replace inline `super::` paths with top-level imports for readability.
- Core: Introduce `MatcherError` enum for structured construction failure reporting.
- Config: Clean up configuration files, remove unnecessary flags, consolidate environment variables.
- Update `bitflags`, `daachorse`, `rayon`.
- Comprehensive call-graph documentation for public API and internal logic.
- Update CLAUDE.md for `BytewiseDFAEngine` and prefilter-aware dispatch.
- Core: Fuse `WordState` + `satisfied_masks` into a single `RuleState` struct, consolidating per-rule hot state into one cache line.
- Core: Resolve NORM/NUM-NORM overlap in process maps where shared codepoints caused ambiguous transform behavior.
- Docs: Improve clarity in density-based engine dispatch problem description in DESIGN.md.
- Add exhaustive `process_map` coverage validating all transform tables (VARIANT_NORM, ROMANIZE, DELETE, NORM, NUM-NORM, EMOJI_NORM).
- Python: Batch methods (`batch_is_match`, `batch_process`, `batch_find_match`) use `PyBackedStr` for zero-copy string handling, avoiding redundant UTF-8 copies across the FFI boundary.
- Core: Replace unsafe `get_unchecked` with safe indexing guarded by `assert_unchecked` hints across scan and pattern modules.
- Core: Add safety assertions for bitset indexing in `DeleteMatcher`.
- Core: Simplify text processing by inlining `replace_cow` functionality.
- Core: Move `has_match` into `ScanState` and remove unused methods.
- Tooling: Trim `cargo-all-features` allowlist to only the `perf` feature.
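
As a rough illustration of the `get_unchecked` to `assert_unchecked` change above, the sketch below keeps ordinary safe indexing and only passes the invariant to the optimizer as a hint. The function name and the table shape are hypothetical; `std::hint::assert_unchecked` itself is a standard-library function (stable since Rust 1.81).

```rust
use std::hint::assert_unchecked;

/// Sums `table[i]` for every index in `indices`.
///
/// # Safety
/// Every index in `indices` must be strictly less than `table.len()`.
pub unsafe fn sum_lookup(table: &[u32], indices: &[usize]) -> u64 {
    let mut total = 0u64;
    for &i in indices {
        // SAFETY: the caller guarantees `i < table.len()`.
        unsafe { assert_unchecked(i < table.len()) };
        // Safe indexing: the hint above lets the optimizer drop the bounds check,
        // while the access itself still goes through the checked `Index` impl.
        total += u64::from(table[i]);
    }
    total
}
```

Compared with `get_unchecked`, the unsafe surface shrinks to a single statement that states the invariant explicitly.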
- Update all README Quick Start examples to use builder APIs (Python, Java, C) for consistency with Rust.
- Remove phantom `process_iter` reference from Rust README (method was removed in 0.15.0).
- Fix formatting in root README architecture diagram.
- Core: `batch_is_match`, `batch_process`, `batch_find_match`: parallel batch API powered by rayon. Distributes texts across CPU cores via work-stealing. Behind `rayon` feature flag (off by default in `matcher_rs`, enabled by all binding crates).
- Core: Batch benchmarks (`bench_search::batch`) comparing sequential vs parallel throughput. 2.6–7.2× speedup on M3 Max.
- C: `simple_matcher_batch_is_match`, `simple_matcher_batch_process`, `simple_matcher_batch_find_match` FFI functions with corresponding `drop_*` deallocators.
- Python/Java: Existing `batch_is_match`, `batch_process`, `batch_find_match` now use rayon parallelism internally (previously sequential).
- Python: Rewrite `from_dict` to iterate PyDict directly (no `json.dumps` round-trip). Replace `simple_table_bytes` storage with `(pt, id, word)` triples, halving input memory. Remove `__getstate__`/`__setstate__` (pickle via `__getnewargs__` only). Add `SimpleMatcherBuilder`, `heap_bytes()`.
- Java: Make `SimpleMatcher` `Serializable` for Spark distribution (stores config bytes, reconstructs native matcher on deserialization). Convert `SimpleResult` to a Java record. Add `SimpleMatcherBuilder`. Remove FastJSON runtime dependency. Make `MatcherJava` package-private; expose `textProcess`/`reduceTextProcess` as static methods on `SimpleMatcher`.
- C: Extract `decode_c_str` helper and `ffi_fn!` macro to reduce FFI boilerplate. Add `SimpleMatcherBuilder` (`init`/`add_word`/`build`/`drop`). Add `simple_matcher_heap_bytes`.
- Python: Add `batch_find_match` for batch first-match queries with single GIL release.
- Unify rule evaluation into single `eval_hit` with simplified direct-rule encoding.
- Remove string pool, `process_iter`, and `PatternKind::Simple`; merge into `And` + `SingleAnd`.
- Simplify `WordState` from 3 generation stamps to 1 + vetoed flag.
- Flatten `transform/replace/` into `transform/`; merge `process/api.rs` into `process/mod.rs`; move `graph.rs` → `simple_matcher/tree.rs`; rename `engine.rs` → `scan.rs`; merge `encoding.rs` into `pattern.rs`.
- Consolidate `RuleHot`/`RuleCold` into single `Rule` type.
- Introduce streaming byte iterators for filter transform steps.
- Replace `TinyVec` with `Vec` across modules.
- Rewrite DESIGN.md around concepts and rationale.
- Fix stale doc links across workspace.
- Align pre-commit hooks with `just lint-check`.
- Consolidate bench files and deduplicate test data.
- Merge AllSimple fast path into General: all query methods (`process`, `for_each_match`, `process_iter`) now use the unified `walk_and_scan` path. `is_match` retains a minimal AC-direct bypass for simple literal matchers.
- Reject empty pattern sets at construction with `MatcherError::EmptyPatterns`.
- Bundle bytewise and charwise engines in a non-optional `Engines` struct behind a unified `ScanEngine` trait with `dispatch!` macro.
- Remove `Option` from DFA field: it is always present under `cfg(feature = "dfa")`. `has_dfa()` is now `cfg!(feature = "dfa")`.
- Unify 4 streaming filter iterators (`DeleteFilterIterator`, `NormalizeFilterIterator`, `RomanizeFilterIterator`, `VariantNormFilterIterator`) into a generic `FilterIterator<F>` backed by a `CodepointFilter` trait.
- `SimpleMatcher::new` and `SimpleMatcherBuilder::build` now return `Err(MatcherError::EmptyPatterns)` when no scannable patterns remain after parsing. Previously, empty matchers silently returned no matches.
- `SimpleMatcher` `Debug` output no longer includes `search_mode`.
- Python `SimpleMatcher.stats()` no longer includes `search_mode` key.
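
The behavior change above (construction fails instead of yielding a matcher that can never match) can be illustrated with a toy type. `ToyMatcher` and its substring search are stand-ins invented for this sketch; only the `MatcherError::EmptyPatterns` name comes from the changelog, and the real enum may carry other variants.

```rust
use std::fmt;

/// Mirrors the error name from the changelog; other variants are omitted here.
#[derive(Debug, PartialEq, Eq)]
pub enum MatcherError {
    EmptyPatterns,
}

impl fmt::Display for MatcherError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MatcherError::EmptyPatterns => write!(f, "no scannable patterns remain after parsing"),
        }
    }
}

impl std::error::Error for MatcherError {}

/// Hypothetical stand-in for a matcher: stores trimmed, non-empty words.
#[derive(Debug)]
pub struct ToyMatcher {
    patterns: Vec<String>,
}

impl ToyMatcher {
    pub fn new<I, S>(words: I) -> Result<Self, MatcherError>
    where
        I: IntoIterator<Item = S>,
        S: AsRef<str>,
    {
        let patterns: Vec<String> = words
            .into_iter()
            .map(|w| w.as_ref().trim().to_owned())
            .filter(|w| !w.is_empty())
            .collect();
        // Reject construction up front rather than returning a matcher that never matches.
        if patterns.is_empty() {
            return Err(MatcherError::EmptyPatterns);
        }
        Ok(Self { patterns })
    }

    pub fn is_match(&self, text: &str) -> bool {
        self.patterns.iter().any(|p| text.contains(p.as_str()))
    }
}
```

Callers that previously relied on an empty matcher returning no matches would, under this model, need to handle the `Err` case explicitly.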
- Update CLAUDE.md, DESIGN.md, README.md, and matcher_rs/README.md for engine architecture changes.
- Add Architecture section and Common Pitfalls FAQ to root README.
- Add performance tuning guidance ("When to Use Which") to matcher_rs README.
- Add format header comments to all `process_map/*.txt` files (with `#` comment support in `build.rs`).
- Add `data/README.md` documenting benchmark haystack and word list files.
- Add 4 ProcessType composition edge case tests (`Delete|EmojiNorm`, `None|Delete`, `Romanize` vs `RomanizeChar`, `VariantNormDeleteNormalize`).
- Add `cargo-fuzz` targets for `SimpleMatcher::new` and `text_process` (adversarial input fuzzing).
- Fix proptest ASCII generator: exclude backslash to prevent `\b` word-boundary false negatives.
- Python: add `__repr__` to `SimpleMatcher` (shows search mode and rule count).
- Python: add missing `EMOJI_NORM` to `.pyi` type stub.
- C: add `matcher_version()` function for runtime version queries.
- Add `cargo-semver-checks` CI job to catch accidental API breaks.
- Add `just fuzz` / `just fuzz-list` recipes.
- Replace `.expect()` with `let-else unreachable!()` in `api.rs` and `search.rs`.
- LUT + unchecked indexing for word boundary checks.
- Fused romanize-scan path via `RomanizeFilterIterator`.
- Store `and_count` in `PatternEntry` to eliminate `RuleHot` cache misses.
- Add `# Panics` sections for `compile_automata`, `walk_and_scan`, `process_entry`, and `get_transform_step`.
- Fix 2 broken `RuleHot::and_count` intra-doc links (field moved to `PatternEntry`).
- Update CLAUDE.md and DESIGN.md for RuleHot/PatternEntry restructure.
- Unify `NormalizeFilterIterator` state into single remainder struct.
- Remove OPTIMIZATION_IDEAS.md (no longer needed).
- Remove unused import of `text_process` in bench.rs.
- Update profiling category rules for current architecture.
- Expand DFA scan category in time profile parser for improved accuracy.
- Add `profile_build` example and `--target build` support to profiler.
- Add overlap comparison benchmarks for 3 AC engines.
- Rewrite `text_transform` benchmarks to measure full matcher pipeline.
- Add word boundary matching (`\b`) for whole-word precision in pattern rules.
- Add OR operator (`|`) for alternative patterns within rules.
- Add `EmojiNorm` ProcessType for emoji-to-English-word normalization via CLDR short names.
- Generalize CJK transforms: rename Fanjian→VariantNorm, PinYin→Romanize with expanded JP/KR data.
- Replace `is_ascii` + Harry SIMD dispatch with density-based engine selection (`count_non_ascii_simd` NEON/AVX2/portable). Harry matcher removed entirely.
- 3-way fused scan dispatch: DFA materialize at low density, streaming charwise at high density, with 0.67 non-ASCII threshold.
- Always build DFA + DAAC bytewise together; raise DFA pattern threshold to 25K.
- Replace Normalize AC DFA with page-table + fused streaming scan.
- Implement fused delete-scan path to stream non-deleted bytes directly into AC.
- Eliminate `Vec` pointer re-resolution in scan hot path via `ScanState` split-borrow.
- Optimize AC scan closure by pre-resolving `&[RuleHot]` slice and removing per-hit indirection.
- Enhance bytewise matcher with prefilter acceleration.
- Replace PrefixMap binary search with AHashMap for O(1) verification.
- Specialize `AllSimple` process loop for single-transform-type matchers.
- Skip `is_ascii()` dispatch when all patterns are ASCII.
- Split `simple_matcher/rule.rs` into `encoding.rs`, `pattern.rs`, and `rule.rs` modules.
- Split `replace.rs` into `variant_norm.rs`, `romanize.rs`, `normalize.rs` sub-modules.
- Add `Fanjian` streaming byte iterator and integrate into transform pipeline.
- Replace `#[inline(always)]` with `#[inline]` for improved inlining heuristics.
- Remove `runtime_build` feature.
- Merge duplicate leaf-node scan paths in `walk_and_scan`.
- Remove dead abstractions and fix stale doc links.
- Resolve broken rustdoc links after module split.
- Propagate transform output density for correct engine dispatch.
- Add interactive benchmark visualization with Plotly (`just bench-viz`).
- Add engine dispatch characterization example and visualization.
- Add Instruments profiling with `atos` inline resolution and source attribution.
- Add pre-commit configuration with hooks for all languages.
- Simplify bench/profiling tooling and add missing operator coverage.
- Enhance documentation with examples and performance notes across modules.
- Document `ScanState` split-borrow optimization and `RuleHot` compaction in DESIGN.md.
- Streamline CLAUDE.md with updated architecture and commands.
- Add `heap_bytes()` to `HarryMatcher` and `SimpleMatcher` for heap memory introspection across all matcher components (AC automata, Harry tables, rule metadata, process-type trie).
- Unify HarryMatcher into a single matcher with wildcarded columns, eliminating per-prefix-length scans (6x on CJK, 3-4x on mixed haystacks).
- Column-0 early exit in NEON/AVX512 kernels skips columns 1-7 for ~95% of non-ASCII chunks.
- Replace AHashMap with sorted split-array PrefixMap in Harry verification for L1-friendly binary search.
- Gate Harry dispatch on ASCII-only patterns and DFA absence; improve non-ASCII haystack routing.
- Const-generic SIMD kernels with PREFIX_LEN-scoped column loading.
- Add 15 targeted coverage tests (process type display, streaming scan paths, NEON edge cases, threaded compilation).
- Coverage: 86% of testable lines (excluding platform-gated AVX512, binding crates, benchmarks).
- Fix SIGILL on x86_64 CI runners by overriding `target-cpu=native` from `.cargo/config.toml`.
- Add separate coverage workflow with tarpaulin and Codecov integration.
- Replace Makefile with Justfile for all build/test/bench/lint commands.
- Add `scripts/bump-version.sh` and `scripts/dev-setup.sh` for release and onboarding automation.
- Add per-plan `charwise_density_threshold` to `ScanPlan`; `AcDfa` never routes to charwise at any density, `DaacBytewise` uses 0.1.
- Raise `AC_DFA_PATTERN_THRESHOLD` 5000 → 7000 based on M3 Max benchmarks (+14% at 7k, -15% cliff at 8k due to L2 cache boundary).
- Align ASCII transform fast paths: consolidate `is_ascii`/`output_density` tracking across `TransformStep`, simplify per-transform ASCII detection.
- Fix leaf-transform noop handling: leaf nodes in the process trie that are ASCII no-ops were incorrectly re-scanning instead of reusing the parent variant.
- Regenerate all process maps (VARIANT_NORM, NORM, NUM-NORM, ROMANIZE, TEXT-DELETE) from updated Python sources.
- Move map generator script into `matcher_rs/scripts/generate_process_map.py`; add `manifest.json` for reproducibility.
- Remove large raw Unicode source files from `data/str_conv/` (now generated on demand).
- Add `density_dispatch` bench module to calibrate the charwise threshold.
- Add `pattern_mix_en`/`pattern_mix_cn` modules with CJK-% sweep to validate the `all_ascii` guard.
- Extend `search_ascii_en` with 6000/7000/8000 pattern counts around the DFA threshold.
- Add `proptest`-based property tests for transform correctness.
- Extend transform unit tests; remove redundant `matcher_rs` coverage.
- Major rustdoc pass: `ReplacementFinder`, string pool, `decode_utf8_raw`, `AsciiInputBehavior`, `get_transform_step`, `build_process_type_tree`, `multibyte_density`, SIMD skip functions.
- Update DESIGN.md: density-based engine selection, `StepOutput` shape, `ScanPlan` accessor list, threshold and constructor docs.
- Add `#![warn(missing_docs)]` to crate root.
- Fix Romanize regression by eliminating `Replacement` enum indirection in replacement engines.
- Unify streaming tree walk into single `walk_and_scan` method: 25% faster `process`, 33% faster `is_match`.
- Lazy transform pipeline for `is_match`: skips materializing text variants when early exit is possible.
- Merge charwise + normalize into unified `replace.rs` with shared `ReplacementFinder` trait.
- Deduplicate SIMD dispatch and AVX2 entry points with macros.
- Extract shared UTF-8 decoder to `transform/utf8.rs`.
- Merge `step.rs` and `registry.rs` into single `step` module.
- Remove dead public API after `walk_and_scan` unification.
- Remove unused optimizations (masks pool, VariantNorm in-place, `SingleProcessType` const generic).
- Remove unused `daachorse` dependency and related non-overlapping code.
- Remove broken single-step match processing methods from `SimpleMatcher`.
- Add unit tests for critical internals and improve coverage infrastructure.
- Simplify runtime build test configuration.
- Add doc tests and expand rustdoc for public API gaps.
- Update `CLAUDE.md` and `DESIGN.md` for post-refactor accuracy.
- Optimize search throughput 10-17% via six hot-path improvements.
- Encode `rule_idx` directly in automaton values for simple single-PT patterns (`DIRECT_RULE_BIT`), eliminating one indirection per hit.
- Skip `text.is_ascii()` scan when only ASCII patterns exist.
- Optimize `is_match` hot path with two targeted improvements.
- Raise `AC_DFA_PATTERN_THRESHOLD` to 5000 and optimize `bench_engine`.
- Improve `SimpleMatcher` build performance up to 42%.
- Replace std `HashMap` with `ahash` in `runtime_build` transform init.
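
The `DIRECT_RULE_BIT` entry above describes tagging automaton values so that a hit on a simple pattern resolves straight to its rule. A minimal sketch of such a tagged encoding follows; the bit position (the top bit of a `u32`), the `Hit` enum, and the function names are assumptions made for illustration, since the changelog does not give the actual layout.

```rust
/// Assumed tag: the top bit marks a value that holds a rule index directly.
pub const DIRECT_RULE_BIT: u32 = 1 << 31;

/// What an automaton value resolves to in this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum Hit {
    /// The value carries the rule index itself; no table lookup is needed.
    DirectRule(u32),
    /// The value indexes a pattern-entry table that must be consulted first.
    PatternEntry(u32),
}

pub fn encode_direct(rule_idx: u32) -> u32 {
    assert!(rule_idx < DIRECT_RULE_BIT, "rule index must fit in 31 bits");
    rule_idx | DIRECT_RULE_BIT
}

pub fn encode_indirect(entry_idx: u32) -> u32 {
    assert!(entry_idx < DIRECT_RULE_BIT, "entry index must fit in 31 bits");
    entry_idx
}

pub fn decode(value: u32) -> Hit {
    if value & DIRECT_RULE_BIT != 0 {
        Hit::DirectRule(value & !DIRECT_RULE_BIT)
    } else {
        Hit::PatternEntry(value)
    }
}
```

The cost of this scheme is one bit of index space; the benefit is that the common case avoids a dependent memory load per hit.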
- `SimpleMatcher::new` and `builder::build` now return `Result` instead of panicking.
- `SimpleMatcherBuilder::add_word` accepts owned `String` in addition to `&str`.
- Add `#[must_use]` to public types and query methods.
- Derive `PartialEq`/`Eq` on `SimpleResult`; add `Send + Sync` static assertions.
- Add manual `Debug` impl for `SimpleMatcher`.
- Release GIL and add batch methods (`is_match_batch`, `process_batch`) in Python bindings.
- Harden construction against invalid `ProcessType` and edge-case rules.
- Fix Romanize handling to correctly track `is_ascii` for unmapped characters.
- Resolve broken intra-doc link to cfg-gated private function.
- Deny `unsafe_op_in_unsafe_fn` lint, document all unsafe blocks with `SAFETY` comments.
- Add `SAFETY` comments to all unsafe blocks in AVX2 SIMD functions.
- Add crate-level Safety section documenting unsafe usage.
- Reorganize `simple_matcher` internals into focused modules (`build.rs`, `engine.rs`, `rule.rs`, `search.rs`, `state.rs`).
- Reorganize transform pipeline into dedicated modules under `process/`.
- Replace `FLAG_*` bit flags with `RuleShape` enum in `PatternEntry`.
- Add 6 property tests for correctness invariants.
- Reorganize test suite by system-under-test.
- Rewrite `DESIGN.md` and update `CLAUDE.md` to match refactored codebase.
- Add API tutorial and profiling targets in `examples/`.
- Update `DESIGN.md` to reflect search throughput optimizations.
- Adopt `cargo-nextest` across all test workflows.
- Enable `rust-lld` linker for test and bench builds.
- Streamline cargo installation in release workflow.
- Improve CI workflow reliability and efficiency.
- `all_simple` fast path for `is_match`: bypasses TLS state, generation counters, and overlapping iteration for pure-literal matchers.
- Dedup length pre-filter to skip redundant pattern entries during construction.
- Thread-local `TRANSFORM_STATE` bundles scratch buffers into a single TLS lookup per call; literal fast path avoids TLS entirely for simple cases.
- In-place VariantNorm optimization: exploits same-byte-length property of 99%+ Traditional-to-Simplified mappings to avoid scan-and-rebuild allocations.
- Shrink `PatternEntry` from 16 to 8 bytes via sequential process-type indexing.
- Embed dedup indices directly in DAAC automaton values, eliminating one indirection per hit.
- Track `is_ascii` flag through the transform pipeline to skip redundant charwise scans on ASCII-only text.
- Auto-select DAAC bytewise engine over AC DFA when ASCII pattern count exceeds 2000.
- Replace `PatternEntry` boolean flags with `PatternKind` enum for clearer dispatch in `process_match`.
- Reorganize `matcher_rs` into focused single-responsibility modules: `simple_matcher/` split into `types.rs`, `construction.rs`, `scan.rs`; `process/` split into `process_type.rs`, `string_pool.rs`, `process_tree.rs`, `transform/`.
- Improve code clarity via named structs (`ScanContext`, `RuleHot`, `RuleCold`) and bundled TLS parameters.
- Bump `sonic-rs` to 0.5.8, `tinyvec` to 1.11.0, `proptest` to 1.11.0.
- Migrate `matcher_java` JNI bindings to `jni` 0.22.4.
- Rewrite `DESIGN.md` to reflect current implementation with detailed sections on state management, SIMD dispatch, and const-generic optimizations.
- Update all READMEs to match current package APIs: document `text_process`/`reduce_text_process` in C and Java bindings, add ProcessType reference tables, fix paths, improve build instructions.
- Removed the `vectorscan` backend to simplify the build process and eliminate the external Boost dependency requirement.
- Simplified SIMD utility dispatching by removing `OnceLock`/`SimdDispatch` for AArch64 (NEON is now always baseline) and gating it for x86_64 only.
- Removed dead API surface and unused parameters in SIMD hot paths.
- Optimized search hot paths and benchmark tooling in `matcher_rs`.
- Added comprehensive benchmark results for MacBook Air M4 (Apple Silicon).
- Filled documentation gaps, added `# Panics`/`# Errors`/`# Arguments` sections, and explained internal implementations in `matcher_rs`.
- Aligned public documentation and improved comments on private items for better maintainability.
- Hot/cold struct split, pre-computed masks, TLS consolidation for reduced per-call overhead in `SimpleMatcher`.
- Skip unused text variants during process-tree traversal, avoiding redundant transformations.
- Cache Romanize trim metadata to eliminate repeated recomputation.
- Lazy tree walking for unique text variants: process-tree nodes are now visited on demand rather than eagerly.
- Extract `is_rule_satisfied` as a dedicated method for clarity and measurable performance improvement.
- Optimize tree node index handling in `walk_process_tree` (formerly `reduce_text_process_with_tree`).
- Rename traversal function to `walk_process_tree` and update terminology throughout.
- Improve encapsulation: `SingleCharMatcher`/`SingleCharMatch` visibility narrowed; `SimpleMatcher` internals use more descriptive struct names (`RuleHot`, etc.).
- Add safety assertions in `page_table_lookup`.
- Update type-ignore comments in test cases for clarity.
- Update terminology and traversal descriptions in `DESIGN.md`.
- Update benchmark records in `README.md` with new results.
- Fix `DeleteFindIter` SIMD fast-skip incorrectly advancing past deletable ASCII bytes (e.g. spaces) that appear before a non-ASCII character in the same 16-byte chunk. The `non_ascii_mask` was checked before `del_mask`, causing the skip to jump to the first non-ASCII byte and silently drop intervening deletable characters. Fixed by ORing both masks and stopping at the first set bit in either.
- Monomorphize `SingleChar` iterators and add SIMD ASCII chunk-skip for faster inner loops.
- Byte-level `Romanize`/`Delete` iterators and `ascii_lut` fast-path, eliminating UTF-8 decoding overhead on ASCII-heavy input.
- `portable_simd` SIMD helpers (`skip_ascii_simd`, `simd_ascii_delete_mask`, `skip_non_digit_ascii_simd`) for 16-byte parallel probing in `SingleChar` skip loops.
- Exhaustive property-based and unit tests for `VariantNorm`, `Delete`, `Normalize`, and `Romanize` process types.
- Macro-based benchmark generation with `BytesCount` metric for normalized throughput measurement.
- Improve clarity and consistency across the `process` module.
- Improve `CLAUDE.md` with benchmark scoping, test-file syntax, and architecture details.
- Move benchmark output to `bench_records/` and link from README.
- Clarify `get_or_init_matcher` return type in docs.
- Removed `Matcher`, `RegexMatcher`, and `SimMatcher` components to focus on the high-performance `SimpleMatcher`.
- Updated C and Java FFI interfaces to only support `SimpleMatcher`.
- Updated `README.md`, `DESIGN.md`, and `GEMINI.md` to reflect the focus on `SimpleMatcher`.
- Cleaned up documentation and examples across all language bindings.
- Replace standard `HashMap` and `HashSet` with `FxHashMap` and `FxHashSet` for improved execution speed.
- Replace `Vec<i32>` with `TinyVec` in `simple_matcher` for improved performance.
- Optimize inner loop with `Vec` indexing and flat matrix in `simple_matcher`.
- Use `FxHashMap` + `u64` bitmask for the inner loop of `simple_matcher`.
- Rename `ProcessedTextSet` to `ProcessedTextMasks` and update its representation to use a `u64` bitmask for process types.
- Simplify `TextMatcherTrait` by deriving `is_match` and `process_iter` from `process`, and remove the `TextMatcherInternal` trait.
- Simplify word splitting logic in `SimpleMatcher::new` using a helper closure and adjust lifetime bounds for borrowed types.
- Simplify C FFI panic handling and wrap all `panic::catch_unwind` calls in FFI functions with `AssertUnwindSafe`.
- Remove `word_id` from match result structs, refine regex pattern handling and matching.
- Unconditionally configure mimalloc as the global allocator and remove conditional allocator dependencies.
- Standardize Rust documentation and include detailed algorithm explanations across all matching engines.
- Update benchmark results in README.md after modifications to the simple matcher.
- Configure `rustflags` to use 8 compilation threads.
- Streamline CI Rust testing by adopting `cargo-all-features` and enabling `RUST_BACKTRACE`.
- Expand and update CI workflows (upgrade action runners to `ubuntu-24.04-arm` and `macos-latest`).
- Remove `AGENTS.md` and legacy tracker files.
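
The `u64` bitmask representation for process types mentioned above can be sketched as a small set type. `ProcessTypeMask` and its methods are hypothetical names; the assumption here is simply that each process-type combination is assigned an index below 64, which is what makes a single machine word sufficient.

```rust
/// A set of up to 64 process-type indices packed into one word.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct ProcessTypeMask(u64);

impl ProcessTypeMask {
    /// Marks index `idx` as present. Panics if `idx >= 64`.
    pub fn insert(&mut self, idx: u8) {
        assert!(idx < 64, "index must be below 64");
        self.0 |= 1u64 << idx;
    }

    /// Returns true when index `idx` is present.
    pub fn contains(self, idx: u8) -> bool {
        idx < 64 && self.0 & (1u64 << idx) != 0
    }

    /// Returns true when the two sets share at least one index.
    pub fn intersects(self, other: Self) -> bool {
        self.0 & other.0 != 0
    }

    /// Number of indices present.
    pub fn count(self) -> u32 {
        self.0.count_ones()
    }
}
```

Membership and intersection become single AND instructions, which is the usual motivation for replacing a hash set in an inner loop.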
- Replace `nohash-hasher`, `id-set`, `FxHashMap` (`rustc-hash`), and `micromap` with std collections (`HashMap`/`HashSet`), removing these external dependencies.
- Replace `tinyvec::ArrayVec` with `std::vec::Vec` for dynamic collections in the process matcher.
- Standardize rustdoc comments and add intra-doc links to type names across the project for improved readability.
- Improve build/linting commands and remove outdated feature mentions.
- Implement sealed trait pattern for `TextMatcherTrait`.
- Use `Box<[T]>` for frozen `Vec` fields to optimize memory.
- Introduce `gen` blocks for `process_iter` implementations to improve iteration.
- Remove unsafe code, update `aho-corasick` dependency, optimize matcher with `tinyvec`.
- Introduce `ProcessTypeError` for `text_process` handling.
- Use `eprintln` for warnings instead of `println`.
- Consolidate conditional matching logic and update FFI function attributes to `unsafe(no_mangle)`.
- Improve struct initializations and Option block handling.
- Derive `Debug` on `MatchResult` for consistency.
- Add `diagnostic::on_unimplemented` to public traits for better compiler errors.
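
The "`Box<[T]>` for frozen `Vec` fields" item above refers to a common idiom: once a collection is fully built and will never grow, converting it with `into_boxed_slice` drops the spare capacity and the capacity field. A minimal sketch, with `FrozenTable` as a hypothetical example type:

```rust
/// A lookup table that is built once and never modified afterwards.
pub struct FrozenTable {
    values: Box<[u32]>,
}

impl FrozenTable {
    /// Sorts and deduplicates the input, then freezes it.
    pub fn new(mut values: Vec<u32>) -> Self {
        values.sort_unstable();
        values.dedup();
        // `into_boxed_slice` shrinks the allocation to the exact length.
        Self { values: values.into_boxed_slice() }
    }

    pub fn contains(&self, v: u32) -> bool {
        self.values.binary_search(&v).is_ok()
    }

    pub fn len(&self) -> usize {
        self.values.len()
    }
}
```

The boxed slice is a pointer plus a length, so the field is one word smaller than a `Vec`, and the type itself documents that the data is immutable in size.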
- Update Rust edition to 2024.
- Add `rust-toolchain.toml` to use nightly toolchains for reproducible builds.
- Remove direct deserialization for core types.
- Improve `SimpleMatcher` and `Matcher` instantiation examples to recommend builder patterns.
- Ensure correct, modern Rust idioms across the repo.
- Removed explicit ASCII case-insensitivity from `AhoCorasickBuilder` to simplify builder configuration.
- Deferred `String` allocation in `ProcessMatcher`'s `replace_all` and `delete_all` for performance optimization.
- Simplified `TextMatcherTrait` and various internal matcher method implementations.
- Expanded testing suite by separating tests into individual files, adding edge case checks and fixing slice coercion in proptests.
- Switched `aho-corasick-unsafe` dependency from git source to `crates.io`.
- Updated benchmarks with deterministic scenarios for process types.
- Enhanced Java example to use the high-level API and adjusted the environment for macOS.
- Heavily improved documentation across `README.md`, `README_CN.md`, `AGENTS.md` and specific language READMEs.
- FFI Panic Safety: All entry points in `matcher_c` are now wrapped in `catch_unwind` to prevent native crashes when Rust code panics.
- Memory Robustness: Fixed brittle raw pointer usage in `reduce_text_process_with_tree` (process matcher) by switching to indexing.
- ReDoS Protection: Added pattern length limits (1024 chars) to `RegexMatcher` to mitigate exponential backtracking risks.
- Invariants: Added `debug_assert!` checks across `SimpleMatcher` to verify internal consistency in development.
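
The FFI panic-safety item above follows a standard pattern: catch the panic at the `extern "C"` boundary and convert it into an error value the caller can check. The sketch below uses a made-up entry point, `toy_checked_div`, and an assumed sentinel convention (`i64::MIN`); the real `matcher_c` functions and their error signalling are not shown in this changelog. A real exported symbol would also carry a `no_mangle` attribute, omitted here to keep the sketch edition-independent.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical FFI entry point: divides `a` by `b`.
///
/// Returns the quotient widened to `i64`, or `i64::MIN` if the Rust code panicked
/// (division by zero, or `i32::MIN / -1`). No `i32` quotient can equal `i64::MIN`,
/// so the sentinel is unambiguous.
pub extern "C" fn toy_checked_div(a: i32, b: i32) -> i64 {
    // The closure only captures two integers, so asserting unwind safety is sound here.
    match panic::catch_unwind(AssertUnwindSafe(|| a / b)) {
        Ok(quotient) => i64::from(quotient),
        Err(_) => i64::MIN,
    }
}
```

The default panic hook still prints the panic message to stderr; the wrapper only ensures the unwind does not cross into the foreign caller.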
- Ergonomics: Introduced high-level `Matcher` and `SimpleMatcher` classes that implement `AutoCloseable` for automatic native memory management (RAII).
- Breaking: `MatchResultTrait::similarity` now returns `Option<f64>`: `None` for exact matchers (Simple, Regex) and `Some(score)` for similarity matchers.
- Breaking: `MatchTableTrait::word_list` and `exemption_word_list` now return `&[S]` instead of `&Vec<S>`.
- Internal `TextMatcherTrait` methods are now marked `#[doc(hidden)]`.
- Fixed double-checked locking in `get_process_matcher`.
- Re-enabled `overflow-checks` globally; hot-path arithmetic uses `wrapping_add`/`wrapping_mul`.
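
With `overflow-checks` enabled, a plain `*` or `+` that overflows panics, so arithmetic that is meant to wrap has to say so explicitly. The sketch below shows the idea with the well-known 64-bit FNV-1a hash, chosen only as a familiar example of intentionally wrapping multiplication; the changelog does not say which hot-path computations in the crate use `wrapping_add`/`wrapping_mul`.

```rust
/// 64-bit FNV-1a hash. The multiplication is expected to overflow, so it uses
/// `wrapping_mul`, which behaves identically with or without overflow checks.
pub fn fnv1a(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    bytes
        .iter()
        .fold(OFFSET_BASIS, |hash, &b| (hash ^ u64::from(b)).wrapping_mul(PRIME))
}
```

Keeping the checks on globally means accidental overflows elsewhere still surface as panics, while the explicitly wrapping call sites stay free of the check.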
- Replaced `lazy_static` with `std::sync::LazyLock`.
- Updated documentation regarding `!Send` iterators and git-dependency limitations.
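
For readers unfamiliar with the `lazy_static` to `std::sync::LazyLock` migration above, the replacement looks roughly like this. The `DIGIT_NORM` table and `normalize_digits` function are invented for the example and are not part of the crate; `LazyLock` is in the standard library since Rust 1.80.

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

/// Hypothetical lookup table, initialized on first access and shared across threads.
static DIGIT_NORM: LazyLock<HashMap<char, char>> = LazyLock::new(|| {
    HashMap::from([('０', '0'), ('１', '1'), ('２', '2')])
});

/// Maps full-width digits found in the table to their ASCII form.
pub fn normalize_digits(text: &str) -> String {
    text.chars()
        .map(|c| DIGIT_NORM.get(&c).copied().unwrap_or(c))
        .collect()
}
```

The static is an ordinary item with an explicit type, so no macro is needed and one external dependency goes away.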
- Builder API: `SimpleMatcherBuilder`, `MatchTableBuilder`, `MatcherBuilder`.
- `process_iter`: lazy iterator over match results for all four matcher types. `RegexMatcher` and `SimMatcher` have truly lazy implementations; `SimpleMatcher` wraps `process()` (two-pass AC constraint documented); `Matcher` avoids the final `collect()` via `into_values().flatten()`.
- Update dependencies.
- staticmethod for extension_types.py
- Update dependencies.
- Update dependencies.
- Fix `build_process_type_tree` function, use set instead of list.
- Update several dependencies.
- Change `XXX(Enum)` to `XXX(str, Enum)` in extension_types.py to fix json dumps issue.
- Add Python 3.13 support.
- Remove msgspec, only use json in README.md.
- Fix typo and cargo clippy warnings.
- Add single line benchmark.
- Fix simple matcher is_match function.
- Remove msgpack, now non-rust users should use json to serialize input of Matcher and SimpleMatcher.
- Refactor Java code.
- Use FxHash to speed up simple matcher process.
- Remove unnecessary dependencies.
- Major internal refactor of SimpleMatcher internals. See git history for details.
- Optimize SimpleMatcher hot-path performance.
- Optimize Simple Matcher `process` function when multiple simple_match_type are used.
- Add `dfa` feature to matcher_rs.
- Shrink `VARIANT_NORM` conversion map.
- Merge ROMANIZE and ROMANIZECHAR process matcher build.
- Add `process` function to matcher_py/c/java.
- Fix simple matcher process function issue.
- Refactor matcher_py file structure, use `rye` to manage matcher_py.
- Delete `println!` in matcher_c.
- Fix exemption word list wrongly rejecting the entire match instead of a single table.
- Add match_id to MatchResult.
- Reverse DFA structure to AhoCorasick structure.
- matcher_c: use `from_utf8_unchecked` instead of `from_utf8`.
- Build multiple wheels for different python version.
- Update VARIANT_NORM.txt and NORM.txt.
- Fix issues with `runtime_build` feature.
- Optimize SimpleMatcher construction and search throughput.
- Rebuild Transformation Rules based on Unicode Standard.
- Implement word-wise NOT logic inside SimpleMatcher: use the `&` (and) and `~` (not) separators to configure a simple word, e.g. `hello&world~helo`.
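
To make the `&` / `~` syntax above concrete, here is a toy evaluator. It is a deliberately naive sketch using plain substring checks, and its reading of the grammar (everything before the first `~` is an AND group, every later `~`-separated segment is a NOT word) is an assumption; the real SimpleMatcher uses Aho-Corasick automata and text transforms rather than `str::contains`.

```rust
/// Returns true when `text` contains every AND word of `rule` and none of its NOT words.
///
/// Assumed grammar: `a&b~c~d` means "a and b, but neither c nor d".
pub fn rule_matches(rule: &str, text: &str) -> bool {
    let mut segments = rule.split('~');
    // The first segment holds the AND words, separated by `&`.
    let and_group = segments.next().unwrap_or("");
    let all_present = and_group
        .split('&')
        .filter(|w| !w.is_empty())
        .all(|w| text.contains(w));
    // Every remaining segment is a word that vetoes the match when present.
    let none_vetoed = segments
        .filter(|w| !w.is_empty())
        .all(|w| !text.contains(w));
    all_present && none_vetoed
}
```

With the rule `hello&world~helo`, the text `hello world` matches, `hello world, helo` is vetoed by the NOT word, and `hello` alone fails because `world` is missing.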