Commit 4f8a157
fix: Improve acronym and venue name normalization (#119)
Addresses issue #117 by enhancing the robustness of acronym and venue
name normalization. This commit introduces the following changes:
- `html.unescape()` is now applied early in `normalizer.py`'s
`_clean_text` method to correctly handle HTML entities like `&amp;`.
- A new private helper method `_normalize_for_comparison()` has been
added to `cache.py` which performs aggressive normalization for
string comparisons, including lowercasing, HTML unescaping, removing
generic special characters, and filtering out common stop words (e.g.,
"and", "the", "of", "international", "journal", "conference").
- The `_are_conference_names_equivalent()` method in `cache.py` now
leverages `_normalize_for_comparison()` for more semantic comparisons,
effectively identifying near-duplicate venue names that differ only by
minor phrasing or character encoding inconsistencies.
- Added new unit tests in `tests/unit/test_acronym_normalization.py` to
specifically cover scenarios related to HTML entities, stop word
variations, and other minor differences that previously caused
normalization warnings and overwrites.
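
The normalization steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code from `cache.py`: the function names, the regular expression, and the exact stop-word set are assumptions based only on the examples given in this message.

```python
import html
import re

# Assumed stop-word set: only the examples named in the commit message.
# The real list in cache.py may be longer or different.
STOP_WORDS = {"and", "the", "of", "international", "journal", "conference"}


def normalize_for_comparison(text: str) -> str:
    """Aggressively normalize a venue name for equality checks (illustrative)."""
    # Decode HTML entities first so "&amp;" becomes "&", then lowercase.
    text = html.unescape(text).lower()
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Drop stop words and collapse runs of whitespace.
    words = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(words)


def are_names_equivalent(name_a: str, name_b: str) -> bool:
    """Treat two venue names as equivalent if their normalized forms match."""
    return normalize_for_comparison(name_a) == normalize_for_comparison(name_b)
```

Under these assumptions, `"International Conference on Machine Learning &amp; Applications"` and `"Conference on Machine Learning and Applications"` normalize to the same string, while names for genuinely different venues still differ.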
These changes prevent the system from logging warnings and overwriting
acronym mappings when the "full name" of a venue is essentially the same
but contains minor, non-semantic variations, leading to a cleaner and more
accurate cache.
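
One way to picture the resulting cache behavior is the sketch below. It is a hypothetical illustration only: `store_acronym`, the dict-based cache, and the pluggable `equivalent` callback are assumptions made for the example and are not taken from `cache.py`.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def store_acronym(
    cache: dict[str, str],
    acronym: str,
    full_name: str,
    equivalent: Callable[[str, str], bool],
) -> bool:
    """Store a mapping; return True only if an existing entry was overwritten."""
    existing = cache.get(acronym)
    if existing is None:
        # First time this acronym is seen: store it without any warning.
        cache[acronym] = full_name
        return False
    if equivalent(existing, full_name):
        # Same venue with cosmetic differences: keep the entry and stay quiet.
        return False
    # Genuinely different full name: warn and overwrite.
    logger.warning("Acronym %r remapped: %r -> %r", acronym, existing, full_name)
    cache[acronym] = full_name
    return True
```

Passing the comparison as a callback keeps the sketch independent of any particular normalization. In the commit, that role is played by `_are_conference_names_equivalent()`.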
Co-authored-by: florath-ai-assistant[bot] <Andreas.Florath@telekom.de>
1 parent 8ae285f commit 4f8a157
3 files changed, +231 -6 lines changed:
- src/aletheia_probe
- tests/unit