Add support for URLs in DUD_LOAD_RULE_PATHS. #15

Open
wRAR wants to merge 1 commit into main from rule-urls

wRAR (Member) commented May 24, 2024

Fixes #1.

This uses treq for downloading, but there are many other options, so this is open for discussion.

  • Any sync function (so requests). Pros: synchronous requests simplify the code. Cons: a synchronous download blocks while it runs, which may be bad?
  • Any async function (listed below). Pros: can run in parallel if we want, doesn't block (is this important?). Cons: makes the code more complicated, though all DUD code is self-contained and so doesn't affect the design of user code.
  • Scrapy downloader. Pros: reuses Scrapy (not a benefit by itself I think?), easy parallel downloading (I think), better logging and error handling etc. Cons: UrlCanonicalizer() is currently decoupled from Scrapy.
  • treq. Pros: straightforward. Cons: an additional dependency.
  • aiohttp. Pros: more modern. Cons: an additional dependency, and it requires the asyncio reactor, which is a blocker.

Also, this doesn't have tests for URLs. Should we add a mockserver, or wait until we publish the rules and use them in the tests?

Comment on lines +35 to +36
response = await maybe_deferred_to_future(treq.get(rule_path))
data = await response.text()
Contributor

Let's handle the case where the rules are not retrieved successfully because of connection issues, timeouts, etc.:

  1. Log the error.
  2. Terminate the crawl. I think this would be good behavior: if the spider proceeds without any rules, its requests go unfiltered and the user accumulates a lot of them.
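
The two steps above could look roughly like the sketch below. It is not the PR's code: it is synchronous for brevity, `RuleLoadError` and `load_rule_data` are hypothetical names, and the download is passed in as a `fetch` callable so the sketch does not depend on treq or any other HTTP client.

```python
import logging
from pathlib import Path
from typing import Callable, Union

logger = logging.getLogger(__name__)


class RuleLoadError(Exception):
    """Hypothetical exception: a rule source could not be retrieved."""


def load_rule_data(rule_path: Union[str, Path], fetch: Callable[[str], str]) -> str:
    """Return the raw rule text from a URL (via fetch) or from a local path."""
    is_remote = isinstance(rule_path, str) and rule_path.startswith(
        ("http://", "https://")
    )
    try:
        if is_remote:
            return fetch(rule_path)
        return Path(rule_path).read_text()
    except Exception as exc:  # assumption: real code would catch narrower errors
        # Step 1: log the error.
        logger.error("Could not load rules from %s: %s", rule_path, exc)
        # Step 2: re-raise so the caller can stop the crawl instead of
        # continuing without rules.
        raise RuleLoadError(str(rule_path)) from exc
```

How the crawl is actually stopped is left to the caller; the sketch only guarantees that a failed download surfaces as an error instead of an empty rule set.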

for rule_path in rule_paths:
    data = Path(rule_path).read_text()
    data: str
    if isinstance(rule_path, str) and self._is_url(rule_path):
Contributor

minor: isinstance(rule_path, str) can be placed inside _is_url().
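
A minimal sketch of that suggestion, written as a standalone function instead of the method in the PR. The body of the real `_is_url()` is not shown in this thread, so the http/https scheme check below is an assumption.

```python
from pathlib import Path
from typing import Union
from urllib.parse import urlparse


def is_url(rule_path: Union[str, Path]) -> bool:
    """Return True if rule_path is a string that looks like an HTTP(S) URL."""
    # The type check lives here, so callers need no isinstance() of their own.
    if not isinstance(rule_path, str):
        return False
    parsed = urlparse(rule_path)
    # Assumption: only http and https count as URLs for rule loading.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

With this, the call site would shrink to something like `if self._is_url(rule_path):`.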

Development

Successfully merging this pull request may close these issues.

DUD_LOAD_POLICY_PATH: support URLs

3 participants