Releases: flairNLP/fundus

v0.5.5

02 Feb 13:35
042040a

🌍 Extend Publisher Support & Maintenance 🌍

This release expands our publisher coverage with 9 new publishers from 7 countries, increasing Fundus’ total to 171 supported news outlets.

Alongside the expansion, we maintained existing publishers and enhanced the robustness of the forward crawler to better handle unexpected exceptions when fetching HTML files.

✨ Quality of Life Improvements

🚀 New Publishers

🇩🇪 Germany

🇻🇳 Vietnam

🇸🇪 Sweden

🇮🇩 Indonesia

🇺🇦 Ukraine

  • [WIP] Add publisher pravda (Ukrainska Pravda) by @bucheben in #807

🇱🇧 Lebanon

🇿🇦 South Africa

🔧 Updated Publishers

🚫 Deprecated

🐛 Bug fixes

New Contributors

Full Changelog: v0.5.4...v0.5.5

v0.5.4

08 Jan 12:04
bc42e21

🛠️ Maintenance Update 🛠️

This release introduces quality-of-life improvements that streamline the update process for existing parsers. Using them, we improved 20 existing publishers and added 2 new ones. We also fixed several bugs related to xpath_search, encoding detection, and sitemap parsing.

✨ Quality of Life Improvements

🚀 Publishers

🆕 New

  • Add German publisher T-Online by @freylily in #805
  • add klassegegenklasse (DE) publisher + parser + tests + tables by @baurlaur in #809

🔧 Updates

🐛 Bug fixes

New Contributors

Full Changelog: v0.5.3...v0.5.4

v0.5.3

27 Nov 19:21
444c936

🌍 Expanded Publisher Support & Key Bug Fixes 🌍

This release introduces 10 new publishers to Fundus, bringing the total to 160 publishers across 37 countries. We've also added a feature that respects publishers' preferences not to be scraped for AI purposes (see our documentation for details).

Additionally, we resolved several bugs related to deadlocks that appeared in specific edge cases within our threading logic.

New Publishers

🇸🇪 Sweden

🇩🇪 Germany

🇬🇧 UK

🇺🇸 USA

🇧🇪 Belgium

🇿🇦 South Africa

Maintained Existing Publishers

New Features

Bug Fixes

New Contributors

Full Changelog: v0.5.2...v0.5.3

v0.5.2

26 Sep 09:21
c59d8cc

🔧 Maintenance Update 🔧

This release includes several fixes to existing publishers, bug patches across the codebase, and some quality-of-life improvements for updating parsers.

✨ Quality of Life Improvements

Implemented a mechanism to deprecate specific parser attributes based on timestamps

Added a filter option for text extraction to omit certain tags via XPath selectors
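The idea behind such a filter can be sketched with the standard library alone. This is not Fundus' implementation or API: the function name `extract_text` and the `skip_xpaths` parameter are hypothetical, and `xml.etree.ElementTree` only supports a limited XPath subset.

```python
import xml.etree.ElementTree as ET


def extract_text(root, skip_xpaths=()):
    """Return the text of all <p> elements, skipping any that sit inside
    a subtree matched by one of the given XPath selectors.

    Hypothetical helper for illustration; not part of Fundus.
    """
    excluded = set()
    for xpath in skip_xpaths:
        for node in root.findall(xpath):
            # node.iter() yields the node itself and all of its descendants
            excluded.update(node.iter())
    return [
        "".join(p.itertext()).strip()
        for p in root.iter("p")
        if p not in excluded
    ]


html = """
<article>
  <p>First paragraph.</p>
  <aside class="ad"><p>Buy now!</p></aside>
  <p>Second paragraph.</p>
</article>
"""
root = ET.fromstring(html)

# Omit everything inside <aside class="ad"> from the extracted text
paragraphs = extract_text(root, skip_xpaths=[".//aside[@class='ad']"])
print(paragraphs)
```

Collecting the excluded nodes first and filtering afterwards leaves the parsed tree unmodified, so the same tree can be reused with different selectors.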

Publisher Updates

Bug Fixes

New Contributors

Full Changelog: v0.5.1...v0.5.2

v0.5.1

22 Jul 21:02
5e4761f

🌍 Support for 150 Publishers & New Language-Based Search and Corpus Controls 🚀

With this release, Fundus now supports 150 publishers across 30 countries, thanks to the addition of 14 new regions and 24 new publishers!

✨ New Features

As our coverage grows, so does the need for better language and data management—so we’ve introduced two powerful new features:

🔎 Language-Based Publisher Search

You can now filter publishers based on the languages they support. This makes it easier to target specific linguistic corpora or build multilingual datasets.

```python
from fundus import Crawler, PublisherCollection

# Find publishers that support Japanese
filtered_publishers = PublisherCollection.search(languages=["ja"])

# US-based publishers that also offer Spanish content
filtered_publishers = PublisherCollection.us.search(languages=["es"])

crawler = Crawler(*filtered_publishers)
for article in crawler.crawl():
    print(article)
```

🧮 Balanced Article Crawling

You can now cap the number of articles per publisher during crawling using the new max_articles_per_publisher parameter—ideal for creating balanced datasets.

```python
from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.us)

for article in crawler.crawl(max_articles_per_publisher=10, save_to_file="my_corpus.json"):
    print(article)
```

  • Add max_articles_per_publisher parameter to crawl by @MaxDall in #710

Check out our documentation for more details!

Publishers

This update brings 14 new regions and 24 additional publishers, pushing our total to 150 supported publishers!

Added Regions

Added Publishers

Updated Publishers

Deprecated Publishers

Bug Fixes & Stability

Cleanup & Maintenance

Testing

  • Add unit test if default_language is ISO 639 language code by @MaxDall in #744

New Contributors

Full Changelog: v0.5.0...v0.5.1

v0.5.0

15 Feb 15:40
fa4342a

🚀 Get millions of labeled images in just a few hours* 🚀

This release adds image extraction and new publishers, updates existing ones, and fixes several bugs.

*Testing involved crawling 1 million images, each with at least a caption or description, which took 1 hour and 20 minutes. This was done on a machine with 10 Gbit/s bandwidth, with the CC-NEWS crawl running on 50 processes. Results may vary based on the use case and bandwidth.

Image Extraction

Thanks to @addie9800, Fundus now provides image extraction for most of our publishers. Each crawled article automatically parses image links and metadata, allowing users to retrieve millions of labeled images in just a few hours. Parsed images include the caption, description, author, and various image versions (sorted by size).
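To make the shape of such a record concrete, here is a minimal sketch of an image with its metadata and size-sorted versions. The class and field names below are hypothetical stand-ins rather than Fundus' actual data model, and the ascending sort order is an assumption (the notes only say "sorted by size").

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ImageVersion:
    """One rendition of an image (hypothetical structure)."""
    url: str
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height


@dataclass
class Image:
    """An image with metadata and several renditions (hypothetical structure)."""
    versions: List[ImageVersion]
    caption: Optional[str] = None
    description: Optional[str] = None
    author: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: keep versions ordered from smallest to largest
        self.versions = sorted(self.versions, key=lambda v: v.area)

    @property
    def largest(self) -> ImageVersion:
        return self.versions[-1]


image = Image(
    versions=[
        ImageVersion("https://example.com/img_640.jpg", 640, 360),
        ImageVersion("https://example.com/img_1280.jpg", 1280, 720),
        ImageVersion("https://example.com/img_320.jpg", 320, 180),
    ],
    caption="Example caption",
    author="Example Photographer",
)
print([v.width for v in image.versions], image.largest.url)
```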

[Figure: Language distribution of one million crawled images (log scale), excluding languages with fewer than 1,000 entries]

Check out our list of supported publishers to find out which ones support image extraction.

New Publishers for it, ch, jp, es, dk, tz, be

With this major release, Fundus now offers support for 124 publishers from 22 different countries.

IT

  • Initial support for Italian publishers, starting with La Repubblica by @ruggsea in #670
  • add CorriereDellaSera by @addie9800 in #677
  • Support for 2 new italian newspapers - Corriere della Sera & Il Giornale by @ruggsea in #700

CH

JP

ES

DK

TZ

BE

Update Publishers

Bug fixes

  • Reraise exceptions in main thread when error handling is set to raise by @MaxDall in #662
  • Fix a bug returning None for empty values in xpath_search by @MaxDall in #671
  • Add IST to tzinfo by @MaxDall in #690
  • Fix article serialization for images by @MaxDall in #703

Improvements

New Contributors

Full Changelog: v0.4.6...v0.5.0

v0.4.6

05 Nov 18:24
f06969f

🚨 Hotfix release for CCNewsCrawler 🚨

With the newly added xpath_search in version 0.4.5, some parsers generated unpicklable extractions. These crashed the CCNewsCrawler when results were piped back to the main thread, rendering the crawler unusable. This issue is now fixed with #655.
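As general background, objects sent through a Python multiprocessing pipe or queue are serialized with pickle, so a single unpicklable value inside a result is enough to make the transfer fail. The sketch below illustrates that failure mode in isolation. It does not reproduce the actual Fundus bug: the lambda merely stands in for any object pickle cannot serialize, and `is_picklable` is a hypothetical helper.

```python
import pickle


def is_picklable(obj) -> bool:
    """Return True if obj survives pickle.dumps (hypothetical helper)."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# Plain data can cross process boundaries without problems
plain_extraction = {"title": "Example", "topics": ["a", "b"]}

# A lambda stands in for any object that pickle cannot serialize
bad_extraction = {"title": "Example", "lookup": lambda key: key}

print(is_picklable(plain_extraction), is_picklable(bad_extraction))
```

A check like this in a test suite can catch unpicklable parser output before it reaches a multi-process crawl.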

Updated parsers

Bug fixes

  • Fix a bug with attribute defaults and add default_factory parameter by @MaxDall in #649
  • Fix pickling problem in LinkedDataMapping by @addie9800 in #655

QoL

  • Add additional space characters to normalize_whitespace by @MaxDall in #646
  • Improve encoding detection by @MaxDall in #650
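For readers unfamiliar with this kind of normalization, the following is a rough sketch of collapsing different space characters into single ASCII spaces. It is not Fundus' `normalize_whitespace`. The function name and the extra characters handled here (zero-width space and byte-order mark) are assumptions chosen for illustration.

```python
import re

# \s already covers Unicode whitespace such as no-break space (U+00A0) and
# thin space (U+2009); the zero-width space (U+200B) and BOM (U+FEFF) are
# not whitespace according to \s, so they are listed explicitly.
_SPACE_RUN = re.compile(r"[\s\u200b\ufeff]+")


def normalize_whitespace_sketch(text: str) -> str:
    """Collapse runs of space-like characters into one space and trim the ends."""
    return _SPACE_RUN.sub(" ", text).strip()


cleaned = normalize_whitespace_sketch("  Hello\u00a0\u2009world\u200b\n")
print(repr(cleaned))
```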

Full Changelog: v0.4.5...v0.4.6

v0.4.5

22 Oct 18:30
5d3f301

Important

This is a re-release of version 0.4.5 from 10/21/2024, as the package couldn't be published on PyPI.

New publishers for Japan and Spain and some maintenance 🔧

Publishers

New

We added two new publishers located in Japan (The Japan News/Yomiuri Shimbun) and one from Spain (El Pais).

Fixes

for DEVs

JSON+LD

We refactored our JSON and JSON-LD parser to be more robust and to support multi-type LDs.

Deprecation

  • Deprecate get_value_by_key_path and replace with xpath_search by @MaxDall in #626

Bug fixes

Fixed a bug where using suppress as the error-handling mode resulted in articles being skipped.

  • Add default return values for attributes by @MaxDall in #633

Full Changelog: v0.4.4...v0.4.5

v0.4.4

30 Sep 15:46
cf5b17f

New publishers for India, Switzerland, and Australia

With this release, we added 3 new publishers, updated several existing ones, and added some QoL functionality for DEVs.

Publishers

New

Updates

What's new?

We implemented XPath queries for LinkedDataMapping to allow more fine-grained searches through the data (@MaxDall in #614). Further, we now parse crawl-delays from publishers' robots.txt files; the crawler can be configured to ignore them (@MaxDall in #609). Additionally, we ...

  • Ignore robots.txt in coverage script by @MaxDall in #610
  • Adjust generic_topic_parsing to return only unique topics by @MaxDall in #620
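The crawl-delay handling mentioned above can be illustrated with Python's standard `urllib.robotparser`. This is a minimal sketch and not Fundus' implementation: the function `crawl_delay_for` and its `ignore_crawl_delay` flag are hypothetical names, and the robots.txt content is made up.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def crawl_delay_for(robots_txt: str, user_agent: str = "*", ignore_crawl_delay: bool = False) -> float:
    """Return the delay in seconds that a polite crawler should wait
    between requests, or 0.0 if none is given or delays are ignored.

    Hypothetical helper for illustration; not part of Fundus.
    """
    if ignore_crawl_delay:
        return 0.0
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else 0.0


print(crawl_delay_for(ROBOTS_TXT))
print(crawl_delay_for(ROBOTS_TXT, ignore_crawl_delay=True))
```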

Bug fixes

  • Fix a bug with the plaintext property of Article by @MaxDall in #612

Full Changelog: v0.4.3...v0.4.4

v0.4.3

04 Sep 12:07
ccf5a80

Introducing New Publishers from Canada, Germany, and India 🚀

This release includes:

  • Support for five new publishers (three from Canada, one from India, and one from Germany)
  • Article filtering based on robots.txt

New Features

With this update, we've implemented article filtering using robots.txt. Each URL fetched is now evaluated against the path and user-agent restrictions specified by publishers in their robots.txt files. This feature is enabled by default, but users can disable it by setting ignore_robots=True in the Crawler constructor.
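The filtering described above can be sketched with the standard library's `urllib.robotparser`. This is an illustration under assumptions, not Fundus' code: `filter_urls` is a hypothetical helper, the robots.txt rules and user-agent names are invented, and only the `ignore_robots` flag name is taken from the release note.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /premium/

User-agent: ExampleBot
Disallow: /
"""


def filter_urls(urls, robots_txt, user_agent, ignore_robots=False):
    """Keep only the URLs that the given user agent may fetch.

    Hypothetical helper for illustration; not part of Fundus.
    """
    if ignore_robots:
        return list(urls)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = [
    "https://example.com/news/a",
    "https://example.com/premium/b",
]

# Path restriction: /premium/ is blocked for all agents
print(filter_urls(urls, ROBOTS_TXT, "OtherBot"))
# User-agent restriction: ExampleBot is blocked everywhere
print(filter_urls(urls, ROBOTS_TXT, "ExampleBot"))
```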

New Publishers

Canada (CA)

India (IND)

Germany (DE)

Updates

We've updated our APNews parser to accurately parse authors once more and applied additional fixes.

Bug Fixes

Full Changelog: v0.4.2...v0.4.3