Releases: flairNLP/fundus
v0.5.5
🌍 Extend Publisher Support & Maintenance 🌍
This release expands our publisher coverage with 9 new publishers from 7 countries, increasing Fundus’ total to 171 supported news outlets.
Alongside the expansion, we maintained existing publishers and enhanced the robustness of the forward crawler to better handle unexpected exceptions when fetching HTML files.
✨ Quality of Life Improvements
- Improve robustness of fetch method for WebSource by @MaxDall in #875
- Rework break transformation by @MaxDall in #885
🚀 New Publishers
🇩🇪
- Add LTO (Legal Tribune Online) publisher by @elias-polyapp in #799
🇻🇳
- Add VN publisher (VnExpress) by @bachthyaglx in #802
🇸🇪
🇮🇩
🇺🇦
🇱🇧
- LBC publisher integrated by @nancyboukamel-ds in #814
🇿🇦
- Add TheCitizen by @addie9800 in #847
- Add EyethuNews by @addie9800 in #835
- Add Ilanga by @addie9800 in #848
🔧 Updated Publishers
- Update Tageblatt by @addie9800 in #868
- Fix SeznamZpravy by @addie9800 in #873
- Fix paragraph selector for LeMonde parser by @MaxDall in #878
- Fix paragraph selector for stern parser by @MaxDall in #879
- Update Landesspiegel parser by @MaxDall in #877
- Update HankookIlbo parser by @MaxDall in #882
🚫 Deprecated
- Deprecate NikkanGeadai by @addie9800 in #872
- Deprecate LesothoTimes by @MaxDall in #880
🐛 Bug fixes
- Update error message by @addie9800 in #869
- Skip functioning publishers in publisher coverage by @addie9800 in #871
- Fix a bug with VALID_UNTIL date in long crawls by @MaxDall in #876
- Fix error message in BaseParser by @MaxDall in #881
- Remove unfinished bar in check_coverage by @MaxDall in #883
- Ignore capitalization in supported_publishers.md ordering by @addie9800 in #886
New Contributors
- @elias-polyapp made their first contribution in #799
- @bachthyaglx made their first contribution in #802
- @rekordii made their first contribution in #803
- @vrdhn91 made their first contribution in #804
- @bucheben made their first contribution in #807
- @nancyboukamel-ds made their first contribution in #814
Full Changelog: v0.5.4...v0.5.5
v0.5.4
🛠️ Maintenance Update 🛠️
This release introduces quality-of-life improvements that streamline the update process for existing parsers. Using them, we improved 20 existing publishers and added 2 new ones. In addition, this release fixes several bugs related to xpath_search, encoding detection, and sitemap parsing.
✨ Quality of Life Improvements
- Add check_coverage script by @MaxDall in #839
- Apply general quality improvements by @MaxDall in #859
🚀 Publishers
🆕 New
- Add German publisher T-Online by @freylily in #805
- add klassegegenklasse (DE) publisher + parser + tests + tables by @baurlaur in #809
🔧 Updates
- Adjust paragraph_selector for Rheinische Post by @MaxDall in #838
- FIX CBC News by @MaxDall in #842
- Deprecate FreiePresse by @MaxDall in #857
- Update Dagbladet parser to version V1_1 by @MaxDall in #856
- Update SeznamZpravy parser by @MaxDall in #843
- Fix Tageszeitung by @MaxDall in #855
- Update TheMirror parser by @MaxDall in #850
- Deprecate authors for ThePortugalNews by @MaxDall in #845
- Update selectors by @addie9800 in #828
- Update parser for SalzburgerNachrichten by @MaxDall in #854
- Deprecate Morgunbladid by @MaxDall in #853
- Update NTV parser by @MaxDall in #846
- Update Euronews parser to version V1_1 by @MaxDall in #852
- Update DailyMaverick parser by @MaxDall in #851
- Fix SRF summary selector by @MaxDall in #861
- Fix summary selector for 20Minutes by @MaxDall in #860
- Fix sitemaps for BR by @MaxDall in #862
🐛 Bug fixes
- Fix a bug in the encoding detection by @MaxDall in #841
- Fix escaping in xpath_search by @MaxDall in #840
- Skip lazy loading images by @MaxDall in #849
- Catch unexpected HTML by @MaxDall in #863
New Contributors
Full Changelog: v0.5.3...v0.5.4
v0.5.3
🌍 Expanded Publisher Support & Key Bug Fixes 🌍
This release introduces 10 new publishers to fundus, bringing the total to 160 publishers across 37 countries. We've also added a feature that respects publishers' preferences to not be scraped for AI purposes (see our documentation for details).
Additionally, we resolved several bugs related to deadlocks that appeared in specific edge cases within our threading logic.
New Publishers
🇸🇪 Sweden
- Add SE Expressen by @ghostsshadow in #800
🇩🇪 Germany
🇬🇧 UK
- Add Nature (UK Scientific Journal) by @Kucki2018 in #797
🇺🇸 USA
- Add Rest Of World Publisher by @marten-ti in #801
🇧🇪 Belgium
🇿🇦 South Africa
- Add Dizindaba by @addie9800 in #832
- Add Independent Online newspapers by @addie9800 in #827
Maintained Existing Publishers
- Add V1_1 for SeznamZpravy by @addie9800 in #821
- Fix ZwanzigMinuten by @addie9800 in #820
- Update upper_boundary_selector for NZZ by @addie9800 in #819
- Update topics for Funke by @addie9800 in #823
- Fix BoersenZeitung by @addie9800 in #824
- Add V1_1 for ZDF by @addie9800 in #825
- Fix summary_selector for TheNation by @addie9800 in #831
- Add V1_1 to NTVTR by @addie9800 in #830
New Features
- Add skip_publishers_disallowing_training by @addie9800 in #772
- Update generic_parsing by @addie9800 in #822
Bug Fixes
- Fix spacing error in LaVanguardia by @MaxDall in #795
- Handle malformed XML by @addie9800 in #794
- Fix race conditions and improve exception handling by @MaxDall in #796
- Ignore type check for MONTHS by @addie9800 in #810
- Update User Agents by @addie9800 in #818
- Fix deadlock in queue_wrapper by @MaxDall in #833
New Contributors
- @ghostsshadow made their first contribution in #800
- @Kucki2018 made their first contribution in #797
- @marten-ti made their first contribution in #801
- @bresslem made their first contribution in #798
- @rascaria made their first contribution in #811
Full Changelog: v0.5.2...v0.5.3
v0.5.2
🔧 Maintenance Update 🔧
This release includes several fixes to existing publishers, bug patches across the codebase, and some quality-of-life improvements for updating parsers.
✨ Quality of Life Improvements
Implemented a mechanism to deprecate specific parser attributes based on timestamps
- Deprecated Attributes by @addie9800 in #745
Added a filter option for text extraction to omit certain tags via XPath selectors
Publisher Updates
- Updated sources for ThePortugalNews by @addie9800 in #790
- Updated funke by @addie9800 in #786
- Fixed Golem parser by @addie9800 in #787
Bug Fixes
- Fixed language code testing by @addie9800 in #784
- Updated README.md by @sanj4git in #785
- Updated user-agent by @addie9800 in #788
- Generalized sitemap selectors by @addie9800 in #789
New Contributors
Full Changelog: v0.5.1...v0.5.2
v0.5.1
🌍 Support for 150 Publishers & New Language-Based Search and Corpus Controls 🚀
With this release, Fundus now supports 150 publishers across 30 countries, thanks to the addition of 14 new regions and 24 new publishers!
✨ New Features
As our coverage grows, so does the need for better language and data management—so we’ve introduced two powerful new features:
🔎 Language-Based Publisher Search
You can now filter publishers based on the languages they support. This makes it easier to target specific linguistic corpora or build multilingual datasets.
from fundus import Crawler, PublisherCollection

# Find publishers that support Japanese
filtered_publishers = PublisherCollection.search(languages=["ja"])

# US-based publishers that also offer Spanish content
filtered_publishers = PublisherCollection.us.search(languages=["es"])

crawler = Crawler(*filtered_publishers)
for article in crawler.crawl():
    print(article)
- Add search by language functionality by @addie9800 in #667
🧮 Balanced Article Crawling
You can now cap the number of articles per publisher during crawling using the new max_articles_per_publisher parameter—ideal for creating balanced datasets.
from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.us)
for article in crawler.crawl(max_articles_per_publisher=10, save_to_file="my_corpus.json"):
    print(article)
Check out our documentation for more details!
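To illustrate what a per-publisher cap does to a mixed stream of articles, here is a minimal standalone sketch. It is not Fundus's internal implementation; the `cap_per_publisher` helper and the `(publisher, headline)` tuples are hypothetical stand-ins for real articles.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Hypothetical stand-in for an article: (publisher name, headline).
Article = Tuple[str, str]


def cap_per_publisher(articles: Iterable[Article], limit: int) -> Iterator[Article]:
    """Yield articles in their original order, keeping at most `limit` per publisher."""
    seen = Counter()
    for publisher, headline in articles:
        if seen[publisher] < limit:
            seen[publisher] += 1
            yield publisher, headline


# A toy stream where both publishers offer three articles each.
stream = [("A", "a1"), ("A", "a2"), ("B", "b1"), ("A", "a3"), ("B", "b2"), ("B", "b3")]
balanced = list(cap_per_publisher(stream, limit=2))
print(balanced)  # [('A', 'a1'), ('A', 'a2'), ('B', 'b1'), ('B', 'b2')]
```

With a cap in place, a prolific publisher can no longer dominate the resulting corpus.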
Publishers
This update brings 14 new regions and 24 additional publishers, pushing our total to 150 supported publishers!
Added Regions
- Add PL by @addie9800 in #698
- Add PT by @addie9800 in #699
- Add CZ by @horychtom in #725
- Add MX + minor bug fixes by @addie9800 in #734
- Add GL by @addie9800 in #735
- Add ISL by @addie9800 in #736
- Add IL by @addie9800 in #737
- Add PY by @addie9800 in #741
- Add RU by @addie9800 in #757
- Add KR by @addie9800 in #758
- Add KR with MBN by @zxxxv in #765
- Add ZA by @addie9800 in #760
- Add LS by @addie9800 in #762
- Add LU by @addie9800 in #775
- Add LI by @addie9800 in #777
Added Publishers
- Added Turkish publisher Anadolu Ajansı by @MSDuran in #722
- Add Tageszeitung by @addie9800 in #738
- Add MallorcaMagazin by @addie9800 in #739
- Add MallorcaZeitung by @addie9800 in #740
- Add DailyMaverick by @addie9800 in #761
- Add LuxemburgerWort by @addie9800 in #776
- Add Spanish publishers by @Finiluh in #768
- Add SalzburgerNachrichten by @addie9800 in #770
- Add DiePresse by @addie9800 in #771
- Add KleineZeitung by @addie9800 in #778
Updated Publishers
- add url_filter to voa by @addie9800 in #715
- add url_filter and RSSFeeds by @addie9800 in #716
- Update BusinessInsider by @addie9800 in #717
- Update author extraction for JyllandsPosten by @addie9800 in #718
- Fix Focus sources by @MaxDall in #732
- Update FAZ parser to version V3 by @MaxDall in #733
- Adjust ZDFParser to be more suitable for live tickers by @MaxDall in #747
- Modify BBC selectors by @addie9800 in #749
- Update RSSFeed for Bild by @addie9800 in #752
- Add V1_1 for NationalPost by @addie9800 in #754
- Update Sitemaps for Tanzania by @addie9800 in #755
- Update _paragraph_selector for JyllandsPosten by @addie9800 in #756
- Update Tanzanian publishers by @addie9800 in #766
- Change name of MBN by @addie9800 in #769
- Add V1_1 for NDR by @addie9800 in #773
- Update Nieuwsblad by @addie9800 in #780
Deprecated Publishers
- Deprecate TheTelegraph by @MaxDall in #711
- Deprecate Nikkei by @addie9800 in #767
- Deprecate TheNamibian by @addie9800 in #779
Bug Fixes & Stability
- Set allow_all=True when robots cannot be loaded by @MaxDall in #709
- Add max_articles_per_publisher parameter to crawl by @MaxDall in #710
- Extend timeout in publisher coverage by @addie9800 in #712
- Properly release resources by @MaxDall in #713
- Docs: Fix date_filter example by @dallasbrittany in #714
- Bug Fixes - Events by @addie9800 in #719
- Register default stop event for WebSource by @MaxDall in #721
- Make network connections interruptible by @MaxDall in #723
- Rework language attribution by @MaxDall in #726
- Make lang attribute deterministic by @MaxDall in #742
- Bug Fix in Source Restriction by @addie9800 in #746
- Bug Fixes from Publisher Coverage by @addie9800 in #753
- Add logging for source restriction by @addie9800 in #774
- Remove duplicate entries in PublisherCollection after merge of #757 by @MaxDall in #781
- Remove Whitespace Normalization in image source parsing by @addie9800 in #692
Cleanup & Maintenance
- Remove leftover ANADOLUAJANSI.json by @MaxDall in #727
- Remove unused imports by @MaxDall in #729
- Update publisher_coverage.yaml by @addie9800 in #750
- Update publisher_coverage.yaml by @addie9800 in #751
Testing
New Contributors
- @dallasbrittany made their first contribution in #714
- @MSDuran made their first contribution in #722
- @horychtom made their first contribution in #725
- @zxxxv made their first contribution in #765
- @Finiluh made their first contribution in #768
Full Changelog: v0.5.0...v0.5.1
v0.5.0
🚀 Get millions of labeled images in just a few hours* 🚀
This release adds image extraction and new publishers, updates existing ones, and fixes several bugs.
*Testing involved crawling 1 million images including at least a caption or description, which took 1 hour and 20 minutes. This was done on a machine using 10Gbit/s bandwidth and the CC-NEWS crawl running with 50 processes. Results may vary based on the use case and bandwidth.
Image Extraction
Thanks to @addie9800, Fundus now provides image extraction for most of our publishers. Each crawled article automatically parses image links and metadata, allowing users to retrieve millions of labeled images in just a few hours. Parsed images include the caption, description, author, and various image versions (sorted by size).

Language distribution of one million crawled images, excluding languages with fewer than 1000 entries
Check out our list of supported publishers to see which ones are covered.
New Publishers for IT, CH, JP, ES, DK, TZ, BE
With this major release, Fundus now offers support for 124 publishers from 22 different countries.
IT
- Initial support for Italian publishers, starting with La Repubblica by @ruggsea in #670
- add CorriereDellaSera by @addie9800 in #677
- Support for 2 new Italian newspapers - Corriere della Sera & Il Giornale by @ruggsea in #700
CH
JP
- Add Taipei Times by @MaxDall in #674
- Add AsahiShimbun by @MaxDall in #682
- Add ChunichiShimbun and TokyoShimbun by @MaxDall in #683
- Add MainichiShimbun by @MaxDall in #685
- add Nikkei by @MaxDall in #686
- Add SankeiShimbun by @MaxDall in #688
- Add NikkanGeadai by @MaxDall in #689
ES
- Add El Mundo by @MaxDall in #675
- Add ABC by @addie9800 in #681
- Add LaVanguardia by @addie9800 in #684
DK
- Add DK by @addie9800 in #696
TZ
- Add Tanzanian Publishers by @addie9800 in #691
BE
- Add BE by @addie9800 in #697
Update Publishers
- Update FreiePresse by @addie9800 in #663
- Fix Metro by @addie9800 in #665
- Update BoersenZeitung parser by @MaxDall in #666
- Update BBC by @addie9800 in #668
- Layout Change SRF by @addie9800 in #680
- Add parser v1_1 - iNews by @addie9800 in #693
- Update Dagbladet by @addie9800 in #695
Bug fixes
- Reraise exceptions in main thread when error handling is set to raise by @MaxDall in #662
- Fix a bug returning None for empty values in xpath_search by @MaxDall in #671
- Add IST to tzinfo by @MaxDall in #690
- Fix article serialization for images by @MaxDall in #703
Improvements
New Contributors
Full Changelog: v0.4.6...v0.5.0
v0.4.6
🚨 Hotfix release for CCNewsCrawler 🚨
With the newly added xpath_search in version 0.4.5, some parsers generated unpicklable extractions, which crashed the CCNewsCrawler when piping results back to the main thread and rendered the crawler unusable. This issue is now fixed with #655.
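For readers unfamiliar with this failure mode: objects sent between workers and the main thread or process via pipes or queues are typically serialized with pickle, and some objects cannot be pickled. The sketch below is a generic illustration using only the standard library, not the actual fix from #655; the `is_picklable` helper and the sample dictionaries are hypothetical.

```python
import pickle


def is_picklable(obj) -> bool:
    """Return True if `obj` survives pickle serialization, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


# Plain data structures serialize without problems.
plain_extraction = {"title": "Example", "topics": ["a", "b"]}

# A lambda stored inside the extraction cannot be pickled (hypothetical example).
broken_extraction = {"title": "Example", "helper": lambda value: value}

print(is_picklable(plain_extraction))
print(is_picklable(broken_extraction))
```

A check like this in a test suite can catch unpicklable parser output before it reaches a multi-process crawl.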
Updated parsers
- Fix paragraph and subheadline selectors for MDR by @MaxDall in #648
- Fix BoersenZeitung by @addie9800 in #647
- Fix Merkur by @addie9800 in #654
- Fix Frankfurter Rundschau by @addie9800 in #652
- Update Stern parser by @MaxDall in #658
- Add RSSFeed to LeFigaro by @addie9800 in #657
Bug fixes
- Fix a bug with attribute defaults and add default_factory parameter by @MaxDall in #649
- Fix pickling problem in LinkedDataMapping by @addie9800 in #655
QoL
- Add additional space characters to normalize_whitespace by @MaxDall in #646
- Improve encoding detection by @MaxDall in #650
Full Changelog: v0.4.5...v0.4.6
v0.4.5
Important
This is a re-release of version 0.4.5 from 10/21/2024, as the package couldn't be published on PyPI.
New publishers for Japan and Spain and some maintenance 🔧
Publishers
New
We added two new publishers located in Japan (The Japan News/Yomiuri Shimbun) and one from Spain (El Pais).
- Add The Japan News by @addie9800 in #627
- Add Yomiuri Shimbun by @addie9800 in #628
- Add El Pais by @addie9800 in #632
Fixes
- Fix bug in author parsing in TheNamibian by @addie9800 in #619
- Fix Hessenschau by @addie9800 in #624
- Fix Focus by @addie9800 in #623
- Update Taz parser by @MaxDall in #642
- Handle author dict Bug by @addie9800 in #641
for DEVs
JSON+LD
We refactored our JSON and JSON-LD parsers to be more robust and to support multi-type LDs.
- Cleaner code for LD and JSON parsing by @MaxDall in #625
- Handle multiple ld types by @addie9800 in #631
- Fix trailing whitespace issue by @addie9800 in #635
Deprecation
Bug fixes
Fixed a bug where using suppress as the error-handling mode resulted in articles being skipped.
Full Changelog: v0.4.4...v0.4.5
v0.4.4
New publishers for India, Switzerland, and Australia
With this release, we added 3 new publishers, updated several existing ones, and added some QoL functionality for DEVs.
Publishers
New
- IND: Bhaskar (@MaxDall in #605)
- CH: TagesAnzeiger (@MaxDall in #608)
- AU: TheWestAustralian (@MaxDall in #615)
Updates
- DE: SportSchau (@addie9800 in #611)
- FR: LesEchos is now deprecated (@MaxDall in #617)
- UK: TheTelegraph (@MaxDall in #616)
What's new?
We implemented XPath queries for LinkedDataMapping to allow more fine-grained searches through the data (@MaxDall in #614). Further, we now parse crawl-delays from publisher-provided robots.txt files; these delays can be omitted through the crawler (@MaxDall in #609). Additionally, we ...
- Ignore robots.txt in coverage script by @MaxDall in #610
- Adjust generic_topic_parsing to return only unique topics by @MaxDall in #620
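As a rough illustration of what parsing a crawl-delay involves, the following sketch reads the Crawl-delay directive with Python's standard urllib.robotparser. It is not Fundus's own code; the robots.txt content, the `get_crawl_delay` helper, and the "example-bot" user agent are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this file.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /search
"""


def get_crawl_delay(robots_txt: str, user_agent: str = "example-bot", default: float = 0.0) -> float:
    """Return the crawl delay in seconds for `user_agent`, or `default` if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


print(get_crawl_delay(ROBOTS_TXT))  # 2.0
```

A polite crawler would then sleep for at least this many seconds between two requests to the same publisher.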
Bug fixes
Full Changelog: v0.4.3...v0.4.4
v0.4.3
Introducing New Publishers from Canada, Germany, and India 🚀
This release includes:
- Support for five new publishers (three from Canada, one from India, and one from Germany)
- Article filtering based on robots.txt
New Features
With this update, we've implemented article filtering using robots.txt. Each URL fetched is now evaluated against the path and user-agent restrictions specified by publishers in their robots.txt files. This feature is enabled by default, but users can disable it by setting ignore_robots=True in the Crawler constructor.
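The evaluation described above can be sketched with Python's standard urllib.robotparser. This is a minimal illustration of path and user-agent checks in general, not Fundus's implementation; the robots.txt content, the helper names, and the "example-bot" user agent are assumptions.

```python
from typing import Iterable, List
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_urls(urls: Iterable[str], parser: RobotFileParser, user_agent: str = "example-bot") -> List[str]:
    """Keep only the URLs that the given user agent is allowed to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = [
    "https://example.com/news/article-1",
    "https://example.com/private/draft",
]
allowed = filter_urls(urls, build_parser(ROBOTS_TXT))
print(allowed)  # ['https://example.com/news/article-1']
```

Setting ignore_robots=True, as mentioned above, corresponds to skipping such a check entirely.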
New Publishers
Canada (CA)
- Introduced CBC as the first Canadian publisher by @addie9800 in #583
- Added NationalPost by @addie9800 in #584
- Included The Globe and Mail by @addie9800 in #587
India (IND)
- Added Times Of India by @addie9800 in #569
Germany (DE)
Updates
We've updated our APNews parser to accurately parse authors once more and applied additional fixes.
Bug Fixes
- Protected key access for RSSFeed entries by @MaxDall in #599
- Fixed an issue in test file generation by @addie9800 in #597
Full Changelog: v0.4.2...v0.4.3






