02 Feb 13:35

MaxDall

042040a

v0.5.5 Latest

🌍 Extend Publisher Support & Maintenance 🌍

This release expands our publisher coverage with 9 new publishers from 7 countries, increasing Fundus’ total to 171 supported news outlets.

Alongside the expansion, we maintained existing publishers and enhanced the robustness of the forward crawler to better handle unexpected exceptions when fetching HTML files.

✨ Quality of Life Improvements

Improve robustness of fetch method for WebSource by @MaxDall in #875
Rework break transformation by @MaxDall in #885

🚀 New Publishers

🇩🇪

Add LTO (Legal Tribune Online) publisher by @elias-polyapp in #799

🇻🇳

Add VN publisher (VnExpress) by @bachthyaglx in #802

🇸🇪

Add SE publisher (Aftonbladet) by @rekordii in #803

🇮🇩

Add Media Indonesia Publisher by @vrdhn91 in #804

🇺🇦

[WIP] Add publisher pravda (Ukrainska Pravda) by @bucheben in #807

🇱🇧

LBC publisher integrated by @nancyboukamel-ds in #814

🇿🇦

Add TheCitizen by @addie9800 in #847
Add EyethuNews by @addie9800 in #835
Add Ilanga by @addie9800 in #848

🔧 Updated Publishers

Update Tageblatt by @addie9800 in #868
Fix SeznamZpravy by @addie9800 in #873
Fix paragraph selector for LeMonde parser by @MaxDall in #878
Fix paragraph selector for stern parser by @MaxDall in #879
Update Landesspiegel parser by @MaxDall in #877
Update HankookIlbo parser by @MaxDall in #882

🚫 Deprecated

Deprecate NikkanGeadai by @addie9800 in #872
Deprecate LesothoTimes by @MaxDall in #880

🐛 Bug fixes

Update error message by @addie9800 in #869
Skip functioning publishers in publisher coverage by @addie9800 in #871
Fix a bug with VALID_UNTIL date in long crawls by @MaxDall in #876
Fix error message in BaseParser by @MaxDall in #881
Remove unfinished bar in check_coverage by @MaxDall in #883
Ignore capitalization in supported_publishers.md ordering by @addie9800 in #886

New Contributors

@elias-polyapp made their first contribution in #799
@bachthyaglx made their first contribution in #802
@rekordii made their first contribution in #803
@vrdhn91 made their first contribution in #804
@bucheben made their first contribution in #807
@nancyboukamel-ds made their first contribution in #814

Full Changelog: v0.5.4...v0.5.5

Contributors

nancyboukamel-ds, vrdhn91, and 6 other contributors

08 Jan 12:04

MaxDall

v0.5.4

bc42e21

🛠️ Maintenance Update 🛠️

This release introduces quality-of-life improvements that streamline the update process for existing parsers. We put them to use right away, improving 20 existing publishers and adding 2 new ones. In addition, we fixed several bugs related to xpath_search, encoding detection, and sitemap parsing.

✨ Quality of Life Improvements

Add check_coverage script by @MaxDall in #839
Apply general quality improvements by @MaxDall in #859

🚀 Publishers

🆕 New

Add German publisher T-Online by @freylily in #805
add klassegegenklasse (DE) publisher + parser + tests + tables by @baurlaur in #809

🔧 Updates

Adjust paragraph_selector for Rheinische Post by @MaxDall in #838
FIX CBC News by @MaxDall in #842
Deprecate FreiePresse by @MaxDall in #857
Update Dagbladet parser to version V1_1 by @MaxDall in #856
Update SeznamZpravy parser by @MaxDall in #843
Fix Tageszeitung by @MaxDall in #855
Update TheMirror parser by @MaxDall in #850
Deprecate authors for ThePortugalNews by @MaxDall in #845
Update selectors by @addie9800 in #828
Update parser for SalzburgerNachrichten by @MaxDall in #854
Deprecate Morgunbladid by @MaxDall in #853
Update NTV parser by @MaxDall in #846
Update Euronews parser to version V1_1 by @MaxDall in #852
Update DailyMaverick parser by @MaxDall in #851
Fix SRF summary selector by @MaxDall in #861
Fix summary selector for 20Minutes by @MaxDall in #860
Fix sitemaps for BR by @MaxDall in #862

🐛 Bug fixes

Fix a bug in the encoding detection by @MaxDall in #841
Fix escaping in xpath_search by @MaxDall in #840
Skip lazy loading images by @MaxDall in #849
Catch unexpected HTML by @MaxDall in #863

New Contributors

@freylily made their first contribution in #805
@baurlaur made their first contribution in #809

Full Changelog: v0.5.3...v0.5.4

Contributors

addie9800, MaxDall, and 2 other contributors

27 Nov 19:21

MaxDall

v0.5.3

444c936

🌍 Expanded Publisher Support & Key Bug Fixes 🌍

This release introduces 10 new publishers to fundus, bringing the total to 160 publishers across 37 countries. We've also added a feature that respects publishers' preferences to not be scraped for AI purposes (see our documentation for details).

Additionally, we resolved several bugs related to deadlocks that appeared in specific edge cases within our threading logic.

New Publishers

🇸🇪 Sweden

Add SE Expressen by @ghostsshadow in #800

🇩🇪 Germany

Add Stuttgarter Zeitung Parser by @myoncee in #808
Add Der Freitag by @bresslem in #798

🇬🇧 UK

Add Nature (UK Scientific Journal) by @Kucki2018 in #797

🇺🇸 USA

Add Rest Of World Publisher by @marten-ti in #801

🇧🇪 Belgium

Add BE Publisher (Politico EU) by @rascaria in #811

🇿🇦 South Africa

Add Dizindaba by @addie9800 in #832
Add Independent Online newspapers by @addie9800 in #827

Maintained Existing Publishers

Add V1_1 for SeznamZpravy by @addie9800 in #821
Fix ZwanzigMinuten by @addie9800 in #820
Update upper_boundary_selector for NZZ by @addie9800 in #819
Update topics for Funke by @addie9800 in #823
Fix BoersenZeitung by @addie9800 in #824
Add V1_1 for ZDF by @addie9800 in #825
Fix summary_selector for TheNation by @addie9800 in #831
Add V1_1 to NTVTR by @addie9800 in #830

New Features

Add skip_publishers_disallowing_training by @addie9800 in #772
Update generic_parsing by @addie9800 in #822

Bug Fixes

Fix spacing error in LaVanguardia by @MaxDall in #795
Handle malformed XML by @addie9800 in #794
Fix race conditions and improve exception handling by @MaxDall in #796
Ignore type check for MONTHS by @addie9800 in #810
Update User Agents by @addie9800 in #818
Fix deadlock in queue_wrapper by @MaxDall in #833

New Contributors

@ghostsshadow made their first contribution in #800
@Kucki2018 made their first contribution in #797
@marten-ti made their first contribution in #801
@bresslem made their first contribution in #798
@rascaria made their first contribution in #811

Full Changelog: v0.5.2...v0.5.3

Contributors

addie9800, bresslem, and 6 other contributors

26 Sep 09:21

MaxDall

v0.5.2

c59d8cc

🔧 Maintenance Update 🔧

This release includes several fixes to existing publishers, bug patches across the codebase, and some quality-of-life improvements for updating parsers.

✨ Quality of Life Improvements

Implemented a mechanism to deprecate specific parser attributes based on timestamps

Deprecated Attributes by @addie9800 in #745

Added a filter option for text extraction to omit certain tags via XPath selectors

Added tag filter for text extraction by @MaxDall in #792
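To illustrate the idea behind an XPath-based tag filter, here is a minimal sketch that uses only the Python standard library. It is not Fundus's implementation: the function name `extract_text` and the parameter `excluded_xpaths` are hypothetical, and `xml.etree.ElementTree` supports only a limited XPath subset and requires well-formed markup.

```python
import xml.etree.ElementTree as ET
from typing import List


def extract_text(markup: str, excluded_xpaths: List[str]) -> str:
    """Return the text of a well-formed snippet, skipping subtrees matched by any XPath.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(markup)
    # ElementTree keeps no parent pointers, so build a child -> parent lookup first.
    parents = {child: parent for parent in root.iter() for child in parent}

    for xpath in excluded_xpaths:
        for node in root.findall(xpath):
            parent = parents.get(node)
            if parent is None:
                continue  # the root itself is never removed
            # Text that follows the removed tag (its "tail") belongs to the
            # surrounding content, so hand it to the previous sibling or the parent.
            siblings = list(parent)
            index = siblings.index(node)
            tail = node.tail or ""
            if index > 0:
                previous = siblings[index - 1]
                previous.tail = (previous.tail or "") + tail
            else:
                parent.text = (parent.text or "") + tail
            parent.remove(node)

    # Join text fragments with spaces and collapse repeated whitespace.
    # This may add a space around inline markup, which is fine for a sketch.
    return " ".join(" ".join(root.itertext()).split())


sample = (
    "<article><p>Breaking news.</p>"
    "<figure><figcaption>Photo credit</figcaption></figure>"
    "<p>More <span class='ad'>Buy now!</span> details.</p></article>"
)
print(extract_text(sample, [".//figure", ".//span[@class='ad']"]))
```

Filtering the tree before collecting text keeps the extraction step itself simple: it never needs to know which tags were unwanted.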

Publisher Updates

Updated sources for ThePortugalNews by @addie9800 in #790
Updated funke by @addie9800 in #786
Fixed Golem parser by @addie9800 in #787

Bug Fixes

Fixed language code testing by @addie9800 in #784
Updated README.md by @sanj4git in #785
Updated user-agent by @addie9800 in #788
Generalized sitemap selectors by @addie9800 in #789

New Contributors

@sanj4git made their first contribution in #785 🎉

Full Changelog: v0.5.1...v0.5.2

Contributors

addie9800, MaxDall, and sanj4git

22 Jul 21:02

addie9800

v0.5.1

5e4761f

🌍 Support for 150 Publishers & New Language-Based Search and Corpus Controls 🚀

With this release, Fundus now supports 150 publishers across 30 countries, thanks to the addition of 14 new regions and 24 new publishers!

✨ New Features

As our coverage grows, so does the need for better language and data management—so we’ve introduced two powerful new features:

🔎 Language-Based Publisher Search

You can now filter publishers based on the languages they support. This makes it easier to target specific linguistic corpora or build multilingual datasets.

from fundus import Crawler, PublisherCollection

# Find publishers that support Japanese
filtered_publishers = PublisherCollection.search(languages=["ja"])

# US-based publishers that also offer Spanish content
filtered_publishers = PublisherCollection.us.search(languages=["es"])

crawler = Crawler(*filtered_publishers)
for article in crawler.crawl():
    print(article)

Add search by language functionality by @addie9800 in #667

🧮 Balanced Article Crawling

You can now cap the number of articles per publisher during crawling using the new max_articles_per_publisher parameter—ideal for creating balanced datasets.

from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.us)

for article in crawler.crawl(max_articles_per_publisher=10, save_to_file="my_corpus.json"):
    print(article)

Add max_articles_per_publisher parameter to crawl by @MaxDall in #710

Check out our documentation for more details!

Publishers

This update brings 14 new regions and 24 additional publishers, pushing our total to 150 supported publishers!

Added Regions

Add PL by @addie9800 in #698
Add PT by @addie9800 in #699
Add CZ by @horychtom in #725
Add MX + minor bug fixes by @addie9800 in #734
Add GL by @addie9800 in #735
Add ISL by @addie9800 in #736
Add IL by @addie9800 in #737
Add PY by @addie9800 in #741
Add RU by @addie9800 in #757
Add KR by @addie9800 in #758
Add KR with MBN by @zxxxv in #765
Add ZA by @addie9800 in #760
Add LS by @addie9800 in #762
Add LU by @addie9800 in #775
Add LI by @addie9800 in #777

Added Publishers

Added turkish publisher Anadolu Ajansı by @MSDuran in #722
Add Tageszeitung by @addie9800 in #738
Add MallorcaMagazin by @addie9800 in #739
Add MallorcaZeitung by @addie9800 in #740
Add DailyMaverick by @addie9800 in #761
Add LuxemburgerWort by @addie9800 in #776
Add Spanish publishers by @Finiluh in #768
Add SalzburgerNachrichten by @addie9800 in #770
Add DiePresse by @addie9800 in #771
Add KleineZeitung by @addie9800 in #778

Updated Publishers

add url_filter to voa by @addie9800 in #715
add url_filter and RSSFeeds by @addie9800 in #716
Update BusinessInsider by @addie9800 in #717
Update author extraction for JyllandsPosten by @addie9800 in #718
Fix Focus sources by @MaxDall in #732
Update FAZ parser to version V3 by @MaxDall in #733
Adjust ZDFParser to be more suitable for live tickers by @MaxDall in #747
Modify BBC selectors by @addie9800 in #749
Update RSSFeed for Bild by @addie9800 in #752
Add V1_1 for NationalPost by @addie9800 in #754
Update Sitemaps for Tanzania by @addie9800 in #755
Update _paragraph_selector for JyllandsPosten by @addie9800 in #756
Update Tanzanian publishers by @addie9800 in #766
Change name of MBN by @addie9800 in #769
Add V1_1 for NDR by @addie9800 in #773
Update Nieuwsblad by @addie9800 in #780

Deprecated Publishers

Deprecate TheTelegraph by @MaxDall in #711
Deprecate Nikkei by @addie9800 in #767
Deprecate TheNamibian by @addie9800 in #779

Bug Fixes & Stability

Set allow_all=True when robots cannot be loaded by @MaxDall in #709
Add max_articles_per_publisher parameter to crawl by @MaxDall in #710
Extend timeout in publisher coverage by @addie9800 in #712
Properly release resources by @MaxDall in #713
Docs: Fix date_filter example by @dallasbrittany in #714
Bug Fixes - Events by @addie9800 in #719
Register default stop event for WebSource by @MaxDall in #721
Make network connections interruptible by @MaxDall in #723
Rework language attribution by @MaxDall in #726
Make lang attribute deterministic by @MaxDall in #742
Bug Fix in Source Restriction by @addie9800 in #746
Bug Fixes from Publisher Coverage by @addie9800 in #753
Add logging for source restriction by @addie9800 in #774
Remove duplicate entries in PublisherCollection after merge of #757 by @MaxDall in #781
Remove Whitespace Normalization in image source parsing by @addie9800 in #692

Cleanup & Maintenance

Remove leftover ANADOLUAJANSI.json by @MaxDall in #727
Remove unused imports by @MaxDall in #729
Update publisher_coverage.yaml by @addie9800 in #750
Update publisher_coverage.yaml by @addie9800 in #751

Testing

Add unit test if default_language is ISO 639 language code by @MaxDall in #744

New Contributors

@dallasbrittany made their first contribution in #714
@MSDuran made their first contribution in #722
@horychtom made their first contribution in #725
@zxxxv made their first contribution in #765
@Finiluh made their first contribution in #768

Full Changelog: v0.5.0...v0.5.1

Contributors

dallasbrittany, addie9800, and 5 other contributors

15 Feb 15:40

MaxDall

v0.5.0

fa4342a

🚀 Get millions of labeled images in just a few hours* 🚀

This release adds image extraction and new publishers, updates existing ones, and fixes several bugs.

*Testing involved crawling 1 million images including at least a caption or description, which took 1 hour and 20 minutes. This was done on a machine using 10Gbit/s bandwidth and the CC-NEWS crawl running with 50 processes. Results may vary based on the use case and bandwidth.

Image Extraction

Thanks to @addie9800, Fundus now provides image extraction for most of our publishers. Each crawled article automatically parses image links and metadata, allowing users to retrieve millions of labeled images in just a few hours. Parsed images include the caption, description, author, and various image versions (sorted by size).
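As a rough mental model of the metadata described above, the following sketch shows one way such an image record could be represented. The class names `ImageVersion` and `ParsedImage`, the `width` field, and the `is_labeled` property are hypothetical and are not Fundus's actual classes. Ordering versions from smallest to largest is also an assumption, since the notes only say "sorted by size".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ImageVersion:
    """One rendition of an image (hypothetical model)."""

    url: str
    width: int  # pixel width, used here as the size criterion


@dataclass
class ParsedImage:
    """Image metadata as listed in the release notes (hypothetical model)."""

    caption: Optional[str]
    description: Optional[str]
    authors: List[str]
    versions: List[ImageVersion] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumption: keep versions ordered from smallest to largest.
        self.versions = sorted(self.versions, key=lambda version: version.width)

    @property
    def is_labeled(self) -> bool:
        # "Labeled" here means at least a caption or a description is present.
        return bool(self.caption or self.description)


image = ParsedImage(
    caption="Skyline at dusk",
    description=None,
    authors=["Jane Doe"],
    versions=[
        ImageVersion("https://example.com/large.jpg", 1200),
        ImageVersion("https://example.com/small.jpg", 300),
    ],
)
print(image.is_labeled, [version.width for version in image.versions])
```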

Figure: Language distribution of one million crawled images, excluding languages with fewer than 1000 entries.

Check out our list of supported publishers to find out which ones are covered.

New Publishers for `it`, `ch`, `jp`, `es`, `dk`, `tz`, `be`

With this major release, Fundus now offers support for 124 publishers from 22 different countries.

`IT`

Initial support for Italian publishers, starting with La Repubblica by @ruggsea in #670
add CorriereDellaSera by @addie9800 in #677
Support for 2 new italian newspapers - Corriere della Sera & Il Giornale by @ruggsea in #700

`CH`

Add 20 Minuten by @MaxDall in #673

`JP`

Add Taipei Times by @MaxDall in #674
Add AsahiShimbun by @MaxDall in #682
Add ChunichiShimbun and TokyoShimbun by @MaxDall in #683
Add MainichiShimbun by @MaxDall in #685
add Nikkei by @MaxDall in #686
Add SankeiShimbun by @MaxDall in #688
Add NikkanGeadai by @MaxDall in #689

`ES`

Add El Mundo by @MaxDall in #675
Add ABC by @addie9800 in #681
Add LaVanguardia by @addie9800 in #684

`DK`

Add DK by @addie9800 in #696

`TZ`

Add Tanzanian Publishers by @addie9800 in #691

`BE`

Add BE by @addie9800 in #697

Update Publishers

Update FreiePresse by @addie9800 in #663
Fix Metro by @addie9800 in #665
Update BoersenZeitung parser by @MaxDall in #666
Update BBC by @addie9800 in #668
Layout Change SRF by @addie9800 in #680
Add parser v1_1 - iNews by @addie9800 in #693
Update Dagbladet by @addie9800 in #695

Bug fixes

Reraise exceptions in main thread when error handling is set to raise by @MaxDall in #662
Fix a bug returning None for empty values in xpath_search by @MaxDall in #671
Add IST to tzinfo by @MaxDall in #690
Fix article serialization for images by @MaxDall in #703

Improvements

Add octet-stream to decompressor by @MaxDall in #660

New Contributors

@ruggsea made their first contribution in #670

Full Changelog: v0.4.6...v0.5.0

Contributors

addie9800, MaxDall, and ruggsea

05 Nov 18:24

MaxDall

v0.4.6

f06969f

🚨 Hotfix release for `CCNewsCrawler` 🚨

With the newly added xpath_search in version 0.4.5, some parsers generated unpicklable extractions, crashing the CCNewsCrawler when piping back to the main thread and thus rendering the crawler unusable. This issue is now fixed with #655.
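For readers unfamiliar with this class of failure, the sketch below shows why an unpicklable value breaks any pipeline that hands results between workers via `pickle`. It is unrelated to Fundus's internals: the helper `is_picklable` and both example dictionaries are hypothetical, and a generator merely stands in for any object that cannot be serialized.

```python
import pickle
from typing import Any


def is_picklable(value: Any) -> bool:
    """Return True if the value survives serialization with pickle."""
    try:
        pickle.dumps(value)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


# Plain containers of strings serialize without problems.
plain_extraction = {"title": "Example", "topics": ["politics", "economy"]}

# A generator cannot be pickled, so an extraction holding one would fail
# as soon as it is sent to another worker.
lazy_extraction = {"title": "Example", "topics": (t for t in ["politics", "economy"])}

print(is_picklable(plain_extraction))
print(is_picklable(lazy_extraction))
```

A check like this can be run in a unit test over parser output to catch such values before they reach a multi-worker crawl.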

Updated parsers

Fix paragraph and subheadline selectors for MDR by @MaxDall in #648
Fix BoersenZeitung by @addie9800 in #647
Fix Merkur by @addie9800 in #654
Fix Frankfurter Rundschau by @addie9800 in #652
Update Stern parser by @MaxDall in #658
Add RSSFeed to LeFigaro by @addie9800 in #657

Bug fixes

Fix a bug with attribute defaults and add default_factory parameter by @MaxDall in #649
Fix pickling problem in LinkedDataMapping by @addie9800 in #655

QoL

Add additional space characters to normalize_whitespace by @MaxDall in #646
Improve encoding detection by @MaxDall in #650

Full Changelog: v0.4.5...v0.4.6

Contributors

addie9800 and MaxDall

22 Oct 18:30

MaxDall

v0.4.5

5d3f301

Important

This is a re-release of version 0.4.5 from 10/21/2024, as the package couldn't be published on PyPI.

New publishers for Japan and Spain and some maintenance 🔧

Publishers

New

We added two new publishers located in Japan (The Japan News/Yomiuri Shimbun) and one from Spain (El Pais).

Add The Japan News by @addie9800 in #627
Add Yomiuri Shimbun by @addie9800 in #628
Add El Pais by @addie9800 in #632

Fixes

Fix bug in author parsing in TheNamibian by @addie9800 in #619
Fix Hessenschau by @addie9800 in #624
Fix Focus by @addie9800 in #623
Update Taz parser by @MaxDall in #642
Handle author dict Bug by @addie9800 in #641

for DEVs

JSON+LD

We refactored our JSON and JSON-LD parser to be more robust and support multi-type LDs.

Cleaner code for LD and JSON parsing by @MaxDall in #625
Handle multiple ld types by @addie9800 in #631
Fix trailing whitespace issue by @addie9800 in #635
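The sketch below illustrates one reading of "multi-type LDs": a JSON-LD object whose `@type` is a list rather than a single string. It is a minimal, standard-library example and not the refactored Fundus parser. The function name `group_by_type` is hypothetical.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List


def group_by_type(raw_blocks: List[str]) -> Dict[str, List[Dict[str, Any]]]:
    """Index JSON-LD objects under every type they declare (hypothetical helper)."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for raw in raw_blocks:
        data = json.loads(raw)
        # A block may hold one object, a list of objects, or an @graph container.
        if isinstance(data, dict) and "@graph" in data:
            entries = data["@graph"]
        elif isinstance(data, list):
            entries = data
        else:
            entries = [data]

        for entry in entries:
            if not isinstance(entry, dict):
                continue
            types = entry.get("@type", [])
            # @type may be a single string or a list of strings.
            if isinstance(types, str):
                types = [types]
            for type_name in types:
                grouped[type_name].append(entry)
    return dict(grouped)


blocks = [
    '{"@type": ["NewsArticle", "Article"], "headline": "Hello"}',
    '[{"@type": "Person", "name": "Ada"}]',
]
print(sorted(group_by_type(blocks)))
```

Registering an object under each of its types means a lookup for either `NewsArticle` or `Article` finds the same entry.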

Deprecation

Deprecate get_value_by_key_path and replace with xpath_search by @MaxDall in #626

Bug fixes

Fixed a bug where using suppress as error handling would result in skipping articles.

Add default return values for attributes by @MaxDall in #633

Full Changelog: v0.4.4...v0.4.5

Contributors

addie9800 and MaxDall

30 Sep 15:46

MaxDall

v0.4.4

cf5b17f

New publishers for India, Switzerland, and Australia

With this release, we added 3 new publishers, updated several existing ones, and added some QoL functionality for DEVs.

Publishers

New

IND: Bhaskar (@MaxDall in #605)
CH: TagesAnzeiger (@MaxDall in #608)
AU: TheWestAustralian (@MaxDall in #615)

Updates

DE: SportSchau (@addie9800 in #611)
FR: LesEchos is now deprecated (@MaxDall in #617)
UK: TheTelegraph (@MaxDall in #616)

What's new?

We implemented XPath queries for LinkedDataMapping to allow more fine-grained searches through the data (@MaxDall in #614). Further, we now parse crawl-delays from publisher-given robots.txt files, which can be omitted through the crawler (@MaxDall in #609). Additionally, we ...

Ignore robots.txt in coverage script by @MaxDall in #610
Adjust generic_topic_parsing to return only unique topics by @MaxDall in #620
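To show what parsing a crawl-delay from robots.txt involves, here is a small sketch using Python's standard `urllib.robotparser`. It does not reflect how Fundus implements this. The function `effective_delay` and its `ignore_delay` flag are hypothetical names that mimic the option to omit the delay.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def effective_delay(robots_txt: str, user_agent: str = "*", ignore_delay: bool = False) -> Optional[int]:
    """Return the crawl-delay in seconds for a user agent, or None if there is none.

    `ignore_delay` is a hypothetical switch that skips the delay entirely.
    """
    if ignore_delay:
        return None
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


robots_txt = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

print(effective_delay(robots_txt))
print(effective_delay(robots_txt, ignore_delay=True))
```

A crawler would typically sleep for the returned number of seconds between two requests to the same publisher.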

Bug fixes

Fix a bug with the plaintext property of Article by @MaxDall in #612

Full Changelog: v0.4.3...v0.4.4

Contributors

addie9800 and MaxDall

04 Sep 12:07

MaxDall

v0.4.3

ccf5a80

Introducing New Publishers from Canada, Germany, and India 🚀

This release includes:

Support for five new publishers (three from Canada, one from India, and one from Germany)
Article filtering based on robots.txt

New Features

With this update, we've implemented article filtering using robots.txt. Each URL fetched is now evaluated against the path and user-agent restrictions specified by publishers in their robots.txt files. This feature is enabled by default, but users can disable it by setting ignore_robots=True in the Crawler constructor.

Added robots.txt based filtering by @MaxDall in #590
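The filtering described above can be approximated with Python's standard `urllib.robotparser`. The following is a sketch under stated assumptions rather than Fundus's actual code. The function `filter_allowed` is hypothetical, and its `ignore_robots` argument only mirrors the name of the Crawler option mentioned in these notes.

```python
from typing import Iterable, List
from urllib.robotparser import RobotFileParser


def filter_allowed(
    urls: Iterable[str],
    robots_txt: str,
    user_agent: str,
    ignore_robots: bool = False,
) -> List[str]:
    """Keep only the URLs that robots.txt permits for the given user agent."""
    if ignore_robots:
        # Filtering disabled: pass every URL through unchanged.
        return list(urls)
    parser = RobotFileParser()
    # Feed the rules directly as lines, so the sketch runs offline.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


robots_txt = """\
User-agent: *
Disallow: /premium/
"""

urls = [
    "https://example.com/news/article-1",
    "https://example.com/premium/article-2",
]

print(filter_allowed(urls, robots_txt, user_agent="ExampleBot"))
print(filter_allowed(urls, robots_txt, user_agent="ExampleBot", ignore_robots=True))
```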

New Publishers

Canada (CA)

Introduced CBC as the first Canadian publisher by @addie9800 in #583
Added NationalPost by @addie9800 in #584
Included The Globe and Mail by @addie9800 in #587

India (IND)

Added Times Of India by @addie9800 in #569

Germany (DE)

Included Krautreporter by @dkm1006 in #588

Updates

We've updated our APNews parser to accurately parse authors once more and applied additional fixes.

Updated APNews by @MaxDall in #603

Bug Fixes

Protected key access for RSSFeed entries by @MaxDall in #599
Fixed an issue in test file generation by @addie9800 in #597

Full Changelog: v0.4.2...v0.4.3

Contributors

addie9800, dkm1006, and MaxDall

Releases: flairNLP/fundus
