Commit f85f3b8
fix(memory): extend tokenizer + slug regex to Thai/Arabic/Hebrew/Cyrillic (HKUDS#104)

The previous CJK tokenizer ranges (HKUDS#87, HKUDS#95) only matched ``一-鿿``
(U+4E00-U+9FFF) and ``㐀-䶿`` (U+3400-U+4DBF), so memory entries with Thai,
Arabic, Hebrew, or Cyrillic titles:
- Tokenized to the empty set, so recall always missed (e.g.
  ``find_relevant("ถัวเฉลี่ย")`` returned nothing even when the body
  contained the word).
- Had every slug character replaced with ``_``, so two distinct Thai
  titles of equal length silently overwrote each other on disk.
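
Both failure modes can be reproduced with a small sketch. The patterns here are a hypothetical reconstruction of a CJK-only tokenizer and slug filter, not the project's actual code: ``old_tokenize``, ``old_slug``, and the Latin ``[a-z0-9]+`` branch are assumptions made for illustration.

```python
import re

# The two CJK ranges named in the commit message.
_OLD_CJK_RANGES = "\u4e00-\u9fff\u3400-\u4dbf"

# Hypothetical stand-ins for the pre-fix patterns (assumed shape).
_OLD_TOKEN_RE = re.compile(rf"[a-z0-9]+|[{_OLD_CJK_RANGES}]")
_OLD_SLUG_DISALLOWED_RE = re.compile(rf"[^A-Za-z0-9_{_OLD_CJK_RANGES}]")


def old_tokenize(text: str) -> set:
    """Return the set of tokens a CJK-only pattern can see."""
    return set(_OLD_TOKEN_RE.findall(text.lower()))


def old_slug(title: str) -> str:
    """Replace every character outside the allowed set with '_'."""
    return _OLD_SLUG_DISALLOWED_RE.sub("_", title)


title_a = "ถัวเฉลี่ย"
title_b = "ประเทศไทย"

# No Thai character matches, so the token set is empty and recall misses.
print(old_tokenize(title_a))

# Both titles collapse to the same run of underscores when their
# lengths (in code points) are equal, so they share one filename.
print(old_slug(title_a) == old_slug(title_b))
```

Because the slug carries no information beyond the title's length, any two titles with the same number of code points map to the same file.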
The new ``_NON_LATIN_SCRIPT_RANGES`` constant covers CJK + Thai +
Arabic + Hebrew + Cyrillic and is reused by:
- ``_TOKEN_RE``: a single alternation pattern, so each ``_tokenize``
  call does one ``re.findall`` (one text scan instead of two). It is
  precompiled at module level, so it doesn't go through the
  ``re.compile`` cache lookup on each invocation.
- ``_SLUG_DISALLOWED_RE``: negation pattern used by ``add()``.
Arabic and Hebrew are deliberately narrowed to the basic letter blocks
(U+0620-U+064A, U+05D0-U+05EA) to keep bidi-control codepoints like
U+061C ARABIC LETTER MARK and combining marks out of on-disk slugs,
where they would render as invisible-but-distinct filenames.
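
The effect of the narrowing can be checked with ``unicodedata``. This is an illustrative sketch: the two narrowed ranges are the ones stated above, while the full-block pattern and the ``filter_slug`` helper are hypothetical and exist only for comparison.

```python
import re
import unicodedata

FULL_ARABIC_BLOCK = re.compile("[\u0600-\u06ff]")  # hypothetical wide range
ARABIC_LETTERS = re.compile("[\u0620-\u064a]")     # narrowed range
HEBREW_LETTERS = re.compile("[\u05d0-\u05ea]")     # narrowed range

ALM = "\u061c"       # ARABIC LETTER MARK (format control, category Cf)
FATHATAN = "\u064b"  # ARABIC FATHATAN (combining mark, category Mn)
SHEVA = "\u05b0"     # HEBREW POINT SHEVA (combining mark, category Mn)

for ch in (ALM, FATHATAN, SHEVA):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


def filter_slug(allowed: re.Pattern, title: str) -> str:
    """Hypothetical helper: keep allowed characters, replace the rest."""
    return "".join(c if allowed.fullmatch(c) else "_" for c in title)


title = "\u0633\u0644\u0627\u0645"  # four Arabic letters
with_mark = title + ALM             # renders the same, compares unequal

# The wide range keeps the invisible mark, giving two distinct filenames
# that look identical; the narrowed range makes the difference visible.
print(filter_slug(FULL_ARABIC_BLOCK, with_mark) == with_mark)
print(filter_slug(ARABIC_LETTERS, with_mark) == title + "_")
```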
Tests cover tokenization for each script, slug preservation
(parametrized across the four new scripts), and a Thai collision-
distinction regression.
Out of scope: ``agent/src/session/search.py`` has the same CJK-only
range in its FTS sanitizer; worth a follow-up PR to consume the same
constant.

1 parent 9bfaa4c commit f85f3b8
2 files changed
Lines changed: 65 additions & 9 deletions