This tutorial will show further options such as searching for specific publishers in the PublisherCollection or dealing with deprecated ones.
There are quite a few differences between the publishers, especially in the attributes the underlying parser supports.
You can search through the collection to get only publishers fitting your use case by utilizing the search() method.
Let's get some publishers based in the US that support an attribute called topics and use NewsMap as a source, and then use them to initialize a crawler.
The search() method also implements an internal language filter, allowing you to restrict your results to specific languages.
In this example, we are only interested in Spanish articles.
```python
from fundus import Crawler, PublisherCollection, NewsMap

fitting_publishers = PublisherCollection.us.search(attributes=["topics"], source_types=[NewsMap], languages=["es"])
crawler = Crawler(*fitting_publishers)
```

When we notice that a publisher is uncrawlable for whatever reason, we mark it with a deprecated flag.
This is mostly used internally, since the default value of the Crawler's ignore_deprecated flag is False.
You can alter this behaviour by setting the ignore_deprecated flag when initializing the Crawler.
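As a minimal sketch of how this could look, the flag is passed at Crawler construction time (the exact keyword placement is an assumption here; consult the Fundus API reference for details):

```python
from fundus import Crawler, PublisherCollection

# Sketch: skip publishers that have been marked as deprecated.
crawler = Crawler(*PublisherCollection.us, ignore_deprecated=True)
```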
Some publishers explicitly disallow the use of their content for AI training purposes.
We try to respect these wishes by introducing the skip_publishers_disallowing_training parameter in the crawl() function.
Users intending to use Fundus to gather training data for AI models should set this parameter to True to avoid collecting articles from publishers who do not want their content used in this way.
Yet, as publishers are not required to mention this in their robots.txt file, users should additionally check the terms of use of the publishers they want to crawl and set the disallows_training attribute of the Publisher class accordingly.
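Putting this together, a crawl run that respects these wishes might look like the following sketch (the max_articles limit is used here only to keep the example short):

```python
from fundus import Crawler, PublisherCollection

crawler = Crawler(*PublisherCollection.us)

# Skip publishers that disallow the use of their content for AI training.
for article in crawler.crawl(max_articles=10, skip_publishers_disallowing_training=True):
    print(article.title)
```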
In the next section, we introduce you to Fundus' logging mechanics.