Commit 97486fd
fix(config): scrape Berlin Presse from /pressemitteilungen listing, not news sub-sitemap

The news sub-sitemap (PR #673/#674/#677) aggregates ALL Berlin LV posts:
press releases, AG-Sitzung announcements, LAG meetings, and online events.
Articles indexed from there polluted berlin-lv-presse with non-press content
(8 AG-Sitzung, 8 LAG, 1 Termin, 2 Veranstaltung in the audit) plus 3 entries
contaminated with listing-page boilerplate where TYPO3's /nachrichten/<slug>
alias silently 404s to the listing.

/pressemitteilungen is TYPO3's category-filtered route, so it lists only real press
releases. Verified pagination works (unlike /nachrichten where tx_xblog_pi1
is silently ignored): pages 1, 2, 3, 57 each return distinct article IDs and
next-page links carry per-page cHash signatures.
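
The distinct-ID check described above could be scripted roughly as follows. This is a minimal sketch, not the check that was actually run: it assumes the `<id>` suffix in /pressemitteilungen/<slug>_<id> is numeric, and `article_ids` and `pages_are_distinct` are hypothetical helper names.

```python
import re
from itertools import combinations

# Assumption: article URLs end in "_<numeric id>", following the
# /pressemitteilungen/<slug>_<id> form named in this commit message.
ARTICLE_ID_RE = re.compile(r"/pressemitteilungen/[^/?#]*_(\d+)/?$")


def article_ids(urls):
    """Return the set of article IDs found in a list of article URLs."""
    ids = set()
    for url in urls:
        match = ARTICLE_ID_RE.search(url)
        if match:
            ids.add(int(match.group(1)))
    return ids


def pages_are_distinct(pages):
    """pages maps page number -> list of article URLs scraped from that page.

    True only if every page yields at least one ID and no two pages share one.
    """
    id_sets = [article_ids(urls) for urls in pages.values()]
    if any(not ids for ids in id_sets):
        return False
    return all(a.isdisjoint(b) for a, b in combinations(id_sets, 2))
```

A route that ignores its pagination parameter, as /nachrichten does, would return the same IDs on every page and fail this check.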

Use paginationLinkSelector mode so the extractor follows next-links from
HTML rather than constructing pagination URLs (which can't include the
required cHash). paginationPattern stays as fallback. listSelector and
contentSelectors unchanged. The off-path filter naturally drops legacy
/nachrichten/<slug> teaser links from the listing page since URLs don't
share the /pressemitteilungen prefix — exactly the desired behavior.
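
As an illustration of the behavior described above, the sketch below follows the next-link found in the HTML and drops off-path links. It is not the extractor's actual code: the `rel="next"` marker stands in for whatever paginationLinkSelector is configured to match, and `LinkCollector`, `on_path`, and `parse_listing` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

LISTING_PREFIX = "/pressemitteilungen"


class LinkCollector(HTMLParser):
    """Collects anchor hrefs; the next-page link is kept separately."""

    def __init__(self):
        super().__init__()
        self.links = []        # ordinary hrefs, in document order
        self.next_href = None  # first href carrying rel="next" (assumed marker)

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        if "next" in (attr.get("rel") or "").split():
            if self.next_href is None:
                self.next_href = href
        else:
            self.links.append(href)


def on_path(url, prefix=LISTING_PREFIX):
    """True if the URL path is the listing itself or lies below it."""
    path = urlparse(url).path
    return path == prefix or path.startswith(prefix + "/")


def parse_listing(html, base_url):
    """Return (on-path article URLs, next-page URL or None) for one listing page."""
    collector = LinkCollector()
    collector.feed(html)
    absolute = [urljoin(base_url, href) for href in collector.links]
    articles = [
        url for url in absolute
        if on_path(url) and urlparse(url).path.rstrip("/") != LISTING_PREFIX
    ]
    next_url = urljoin(base_url, collector.next_href) if collector.next_href else None
    return articles, next_url
```

Because the next-page URL is copied from the page rather than constructed, its cHash signature is preserved as served.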

Going forward, scrapes index articles at the canonical
/pressemitteilungen/<slug>_<id> URLs. Existing legacy /nachrichten/ entries in Qdrant should be
purged separately so dedup works against a single URL form.

1 parent b2d72ef commit 97486fd
1 file changed
Lines changed: 19 additions & 12 deletions
@@ -294,20 +294,27 @@ (line contents not captured)