This tutorial will show further options such as searching for specific publishers in the PublisherCollection or dealing with deprecated ones.
There are quite a few differences between the publishers, especially in the attributes the underlying parser supports.
You can search through the collection to get only publishers fitting your use case by utilizing the search() method.
Let's get some publishers based in the US that support an attribute called topics and use NewsMap as a source, and then use them to initialize a crawler.
The search() method also implements an internal language filter, allowing you to restrict your results to specific languages.
In this example, we are only interested in Spanish articles.
```python
from fundus import Crawler, PublisherCollection, NewsMap

fitting_publishers = PublisherCollection.us.search(attributes=["topics"], source_types=[NewsMap], languages=["es"])
crawler = Crawler(*fitting_publishers)
```

When we notice that a publisher is uncrawlable for whatever reason, we mark it with a deprecated flag.
This is mostly used internally, since the default value of the Crawler's ignore_deprecated flag is False.
You can alter this behaviour by setting the ignore_deprecated flag when initializing the Crawler.
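As a minimal sketch of how this could look, the flag is passed at Crawler construction time (the exact keyword placement is an assumption here; consult the Fundus API reference for details):

```python
from fundus import Crawler, PublisherCollection

# Sketch: skip publishers that have been marked as deprecated.
crawler = Crawler(*PublisherCollection.us, ignore_deprecated=True)
```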
Some publishers explicitly disallow the use of their content for AI training purposes.
We try to respect these wishes by introducing the skip_publishers_disallowing_training parameter in the crawl() function.
Users intending to use Fundus to gather training data for AI models should set this parameter to True to avoid collecting articles from publishers who do not want their content used in this way.
Yet, as publishers are not required to mention this in their robots.txt file, users should additionally check the terms of use of the publishers they want to crawl and set the disallows_training attribute of the Publisher class accordingly.
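Putting this together, a crawl run that respects these wishes might look like the following sketch (the max_articles limit is used here only to keep the example short):

```python
from fundus import Crawler, PublisherCollection

crawler = Crawler(*PublisherCollection.us)

# Skip publishers that disallow the use of their content for AI training.
for article in crawler.crawl(max_articles=10, skip_publishers_disallowing_training=True):
    print(article.title)
```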
In the next section, we introduce you to Fundus' logging mechanics.