Add support for URLs in DUD_LOAD_RULE_PATHS. by wRAR · Pull Request #15 · zytedata/duplicate-url-discarder

wRAR · 2024-05-24T16:43:27Z

Fixes #1.

This uses treq for downloading, but there are many other options so this is open for discussion.

* Any sync function (e.g. requests). Pros: synchronous requests simplify the code. Cons: a synchronous download blocks while it runs, which may be bad?
* Any async function (listed below). Pros: can run in parallel if we want, doesn't block (is this important?). Cons: makes the code more complicated, though all DUD code is self-contained and so doesn't influence the user code design.
  * Scrapy downloader. Pros: reuses Scrapy (not a benefit by itself I think?), easy parallel downloading (I think), better logging and error handling etc. Cons: UrlCanonicalizer() is currently decoupled from Scrapy.
  * treq. Pros: straightforward. Cons: additional dep.
  * aiohttp. Pros: just a more modern thing. Cons: additional dep, requires the asyncio reactor which is a blocker.
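To illustrate the "can run in parallel" point for the async options, here is a minimal sketch using plain asyncio. It is not the PR's code and does not use treq or Twisted; `fetch_all` and the `fetch` callable are hypothetical names, and the HTTP client is passed in so the sketch stays independent of whichever library is chosen.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def fetch_all(
    urls: Sequence[str],
    fetch: Callable[[str], Awaitable[str]],
) -> List[str]:
    """Download all rule files concurrently.

    `fetch` is a hypothetical coroutine function that returns the body
    of one URL as text. asyncio.gather() returns the results in the
    same order as `urls`, regardless of which download finishes first.
    """
    return list(await asyncio.gather(*(fetch(url) for url in urls)))
```

A synchronous variant would simply loop over the URLs and block on each one in turn, which is simpler but serializes the downloads.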

Also this doesn't have tests for URLs, should we add mockserver, or wait until we publish the rules and use them in the tests?

BurnzZ · 2024-05-29T06:37:40Z

duplicate_url_discarder/url_canonicalizer.py

+                response = await maybe_deferred_to_future(treq.get(rule_path))
+                data = await response.text()


Let's handle the case when the rules are not successfully retrieved due to things like connection issues, timeouts, etc.:

* Log the error.
* Terminate the crawl. I think this would be good behavior, since if the spider proceeds without any rules, the user would accumulate a lot of requests that should have been filtered out.
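The suggested behavior could look roughly like the sketch below. It is only an illustration under stated assumptions: `RuleLoadError`, `load_remote_rules`, and the `fetch` callable are hypothetical names that do not come from the PR, and how the exception would actually stop the crawl is left to the caller.

```python
import logging
from typing import Awaitable, Callable, Dict, Iterable

logger = logging.getLogger(__name__)


class RuleLoadError(RuntimeError):
    """Hypothetical exception signalling that rules could not be loaded."""


async def load_remote_rules(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
) -> Dict[str, str]:
    """Download each rule file, failing loudly on the first error.

    `fetch` is a hypothetical coroutine function returning the response
    body as text; in the PR this role is played by the treq call.
    """
    loaded: Dict[str, str] = {}
    for url in urls:
        try:
            loaded[url] = await fetch(url)
        except Exception as exc:
            # Log the failure, then re-raise so the caller can abort
            # the crawl instead of running without any rules.
            logger.error("Could not download rules from %s: %r", url, exc)
            raise RuleLoadError(f"Could not download rules from {url}") from exc
    return loaded
```

Raising instead of returning an empty result keeps the failure visible: the caller has to decide explicitly whether a crawl without rules is acceptable.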

BurnzZ · 2024-05-29T06:38:35Z

duplicate_url_discarder/url_canonicalizer.py

        for rule_path in rule_paths:
-            data = Path(rule_path).read_text()
+            data: str
+            if isinstance(rule_path, str) and self._is_url(rule_path):


minor: isinstance(rule_path, str) can be placed inside _is_url().
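A possible shape for that refactoring is sketched below. The real `_is_url()` body is not shown in this thread, so the scheme check here is an assumption, and the function is written as a standalone `is_url` helper rather than the PR's method.

```python
from os import PathLike
from typing import Union
from urllib.parse import urlparse


def is_url(rule_path: Union[str, PathLike]) -> bool:
    """Return True only for str values that look like HTTP(S) URLs.

    The isinstance() check lives inside the helper, so callers can pass
    either a str or a path-like object without checking the type first.
    Limiting the accepted schemes to http/https is an assumption.
    """
    if not isinstance(rule_path, str):
        return False
    return urlparse(rule_path).scheme in ("http", "https")
```

With this, the call site reduces to `if is_url(rule_path):`.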

Add support for URLs in DUD_LOAD_RULE_PATHS.

285de46

Gallaecio approved these changes May 28, 2024


BurnzZ reviewed May 29, 2024

wRAR wants to merge 1 commit into main from rule-urls

3 participants
