Commit 1674891
Guard URL+title merge against contradicting strong signals
Per review (#16), checking which (URL, title) groups
the rule would merge across the whole corpus reveals ~40% likely-wrong
merges before guards: 14% with distinct DOIs (definite false-positives,
e.g. each journal issue's "Front Matter" sharing a publisher landing
URL), 20% with distinct first-author last names (e.g. college catalog
PDFs serving multiple department entries), 6% with year span >5y.
Adds a _strong_signals_disagree() guard that blocks the URL+title hard
merge whenever DOIs disagree, PMIDs disagree, or first-author last
names disagree (with both sides populated in each case). Reduces
force-merges from 224,802 -> 147,345 corpus IDs, eliminating the
predictable false-positive classes Sergey called out while still
covering all 24,931 IDs in the canonical-plus-bare-stubs target bucket.
Adds three regression tests (DOI disagreement blocks the merge,
first-author disagreement blocks the merge, matching first author
still triggers the merge).
1 parent 00ccd5d
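The guard described above can be sketched as follows. This is a minimal illustration, not the committed code: the function name strong_signals_disagree, the dict-based record shape, and the field names doi, pmid and first_author_last are all assumptions, as is the normalization (trim and lowercase).

```python
from typing import Mapping, Optional

# Hypothetical field names; the real record schema is not shown in the commit.
STRONG_FIELDS = ("doi", "pmid", "first_author_last")


def _norm(value: Optional[str]) -> Optional[str]:
    """Trim and lowercase a value; treat missing or blank values as None."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return cleaned or None


def strong_signals_disagree(
    a: Mapping[str, Optional[str]], b: Mapping[str, Optional[str]]
) -> bool:
    """Return True if any strong identifier is populated on both sides and differs.

    A field that is missing or blank on either side is not evidence of a
    conflict, so it never blocks the merge on its own.
    """
    for field in STRONG_FIELDS:
        left, right = _norm(a.get(field)), _norm(b.get(field))
        if left is not None and right is not None and left != right:
            return True
    return False


# The three regression cases named in the message, on made-up records.
front_matter_1 = {"doi": "10.1000/issue1", "first_author_last": None}
front_matter_2 = {"doi": "10.1000/issue2", "first_author_last": None}
assert strong_signals_disagree(front_matter_1, front_matter_2)  # DOIs differ

catalog_math = {"first_author_last": "Smith"}
catalog_physics = {"first_author_last": "Jones"}
assert strong_signals_disagree(catalog_math, catalog_physics)  # authors differ

stub = {"first_author_last": "smith", "doi": None}
canonical = {"first_author_last": "Smith", "doi": "10.1000/paper"}
assert not strong_signals_disagree(stub, canonical)  # merge still allowed
```

Requiring both sides to be populated is what keeps bare stubs mergeable: a stub with no DOI or PMID cannot contradict a canonical record, so only positive evidence of a different work blocks the merge.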
2 files changed
Lines changed: 91 additions & 3 deletions
File 1 (name not captured): 21 additions, 3 deletions
- new lines 79-93 added
- new line 433 added
- old lines 422-424 replaced by new lines 438-442
File 2 (name not captured): 70 additions
- new lines 143-212 added