SERP (MVP) #62
Merged
Changes from 17 commits
Commits (20, all by Gallaecio):

d785a2d  SERP (MVP)
4c47efc  Fix references and complete the docs
d10d75e  Customize UI strings for SERP and add tests
ff97f07  Fix requests mocking
3a44330  Enable the aggressive retry policy by default for the SERP spider
efc6e83  Merge remote-tracking branch 'zytedata/main' into serp-mvp
1bc4a29  Make the SERP spider more Google-specific, in line with the current a…
8f3ab3e  Add a mandatory search keywords field, and set a default input URL
b0786e6  Improve the SERP implementation, get all tests to pass
bcf5566  Use a domain drop-down list
89c1b7f  Improve the search_keywords tooltip and update tests
c3a2f23  search keywords → search queries
25d4cf7  Fix metadata JSON schema comparison
d8fe94c  Min zyte-common-items: 0.13.0 → 0.22.0
b94b7b4  Remove potentially confusing search keyword references
d7b724a  Make crawl logging more flexible for new page types
a9d5588  Update test expectations
7aa70b2  Apply feedback
916b58c  Release notes for 0.9.0
5c5502e  Remove valid_page_types
@@ -0,0 +1,19 @@
+.. _google-search:
+
+=================================================
+Google search spider template (``google_search``)
+=================================================
+
+Basic use
+=========
+
+.. code-block:: shell
+
+    scrapy crawl google_search -a search_queries="foo bar"
+
+Parameters
+==========
+
+.. autopydantic_model:: zyte_spider_templates.spiders.serp.GoogleSearchSpiderParams
+    :inherited-members: BaseModel
+    :exclude-members: model_computed_fields
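
The documentation above passes a search query through the search_queries argument. As a rough illustration of how such a query could be turned into a search URL, here is a minimal sketch. The helper name build_search_url, the /search?q=... URL layout and the default domain are assumptions made for this example; none of them is taken from the spider's implementation in this pull request.

```python
from urllib.parse import urlencode


def build_search_url(query, domain="www.google.com"):
    """Hypothetical helper: build a search URL for a single query.

    The URL layout is an assumption for illustration only; it is not
    copied from zyte_spider_templates.
    """
    # urlencode() percent-encodes the value and turns spaces into "+".
    return f"https://{domain}/search?" + urlencode({"q": query})


# One URL per query, e.g. for the "foo bar" query used in the docs above.
urls = [build_search_url(query) for query in ["foo bar"]]
print(urls)  # ['https://www.google.com/search?q=foo+bar']
```

Keeping the domain as a parameter mirrors the idea of choosing the target domain separately from the queries, as the "Use a domain drop-down list" commit suggests.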
@@ -1,6 +1,4 @@
-import json
 import logging
-import re
 from unittest.mock import MagicMock, call, patch

 import pytest
@@ -11,7 +9,6 @@
 from scrapy_spider_metadata import get_spider_metadata
 from zyte_common_items import ProbabilityRequest, Product, ProductNavigation, Request

-from zyte_spider_templates import BaseSpiderParams
 from zyte_spider_templates._geolocations import (
     GEOLOCATION_OPTIONS,
     GEOLOCATION_OPTIONS_WITH_CODE,
@@ -24,6 +21,7 @@

 from . import get_crawler
 from .test_utils import URL_TO_DOMAIN
+from .utils import assertEqualSpiderMetadata


 def test_parameters():
@@ -362,21 +360,6 @@ def test_arguments():
     assert spider.allowed_domains == ["example.com"]


-def assertEqualJson(actual, expected):
-    """Compare the JSON representation of 2 Python objects.
-
-    This allows to take into account things like the order of key-value pairs
-    in dictionaries, which would not be taken into account when comparing
-    dictionaries directly.
-
-    It also generates a better diff in pytest output when enums are involved,
-    e.g. geolocation values.
-    """
-    actual_json = json.dumps(actual, indent=2)
-    expected_json = json.dumps(expected, indent=2)
-    assert actual_json == expected_json
-
-
 def test_metadata():
     actual_metadata = get_spider_metadata(EcommerceSpider, normalize=True)
     expected_metadata = {
@@ -480,7 +463,7 @@ def test_metadata():
                             "title": "Pagination Only",
                         },
                     },
-                    "title": "Crawl strategy",
+                    "title": "Crawl Strategy",
                     "enum": [
                         "automatic",
                         "full",
@@ -550,60 +533,14 @@ def test_metadata():
             "type": "object",
         },
     }
-    assertEqualJson(actual_metadata, expected_metadata)
+    assertEqualSpiderMetadata(actual_metadata, expected_metadata)

     geolocation = actual_metadata["param_schema"]["properties"]["geolocation"]
     assert geolocation["enum"][0] == "AF"
     assert geolocation["enumMeta"]["UY"] == {"title": "Uruguay (UY)"}
     assert set(geolocation["enum"]) == set(geolocation["enumMeta"])


-@pytest.mark.parametrize(
-    "valid,url",
-    [
-        (False, ""),
-        (False, "http://"),
-        (False, "http:/example.com"),
-        (False, "ftp://example.com"),
-        (False, "example.com"),
-        (False, "//example.com"),
-        (False, "http://foo:bar@example.com"),
-        (False, " http://example.com"),
-        (False, "http://example.com "),
-        (False, "http://examp le.com"),
-        (False, "https://example.com:232323"),
-        (True, "http://example.com"),
-        (True, "http://bücher.example"),
-        (True, "http://xn--bcher-kva.example"),
-        (True, "https://i❤.ws"),
-        (True, "https://example.com"),
-        (True, "https://example.com/"),
-        (True, "https://example.com:2323"),
-        (True, "https://example.com:2323/"),
-        (True, "https://example.com:2323/foo"),
-        (True, "https://example.com/f"),
-        (True, "https://example.com/foo"),
-        (True, "https://example.com/foo/"),
-        (True, "https://example.com/foo/bar"),
-        (True, "https://example.com/foo/bar/"),
-        (True, "https://example.com/foo/bar?baz"),
-        (True, "https://example.com/foo/bar/?baz"),
-        (True, "https://example.com?foo"),
-        (True, "https://example.com?foo=bar"),
-        (True, "https://example.com/?foo=bar&baz"),
-        (True, "https://example.com/?foo=bar&baz#"),
-        (True, "https://example.com/?foo=bar&baz#frag"),
-        (True, "https://example.com#"),
-        (True, "https://example.com/#"),
-        (True, "https://example.com/&"),
-        (True, "https://example.com/&#"),
-    ],
-)
-def test_validation_url(url, valid):
-    url_re = BaseSpiderParams.model_fields["url"].metadata[0].pattern
-    assert bool(re.match(url_re, url)) == valid
-
-
 def test_get_parse_product_request():
     base_kwargs = {
         "url": "https://example.com",
@@ -818,7 +755,7 @@ def test_urls_file():
     crawler = get_crawler()
     url = "https://example.com"

-    with patch("zyte_spider_templates.spiders.ecommerce.requests.get") as mock_get:
+    with patch("zyte_spider_templates.params.requests.get") as mock_get:
         response = requests.Response()
         response._content = (
             b"https://a.example\n \nhttps://b.example\nhttps://c.example\n\n"
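
The docstring of the assertEqualJson helper shown in this file's diff explains that comparing JSON text, unlike comparing dictionaries directly, takes key order into account. The following minimal sketch illustrates that difference with the standard library only. The snake_case name and the sample dictionaries are made up for the example, and nothing here describes how the assertEqualSpiderMetadata replacement works, since its body is not part of this diff.

```python
import json


def assert_equal_json(actual, expected):
    """Compare two objects through their indented JSON text.

    json.dumps() keeps dictionary insertion order, so two dictionaries
    with the same items in a different order produce different text.
    """
    assert json.dumps(actual, indent=2) == json.dumps(expected, indent=2)


# Sample data invented for this sketch.
first = {"title": "Crawl strategy", "type": "string"}
second = {"type": "string", "title": "Crawl strategy"}

# Plain dictionary equality ignores key order.
dicts_equal = first == second

# The JSON text differs because the keys are serialized in a different order.
json_equal = json.dumps(first, indent=2) == json.dumps(second, indent=2)

print(dicts_equal, json_equal)  # True False
```

A text-based comparison of this kind is stricter than dictionary equality, which is useful when the order of parameters in a schema matters.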
@@ -0,0 +1,51 @@
+import re
+
+import pytest
+
+from zyte_spider_templates.params import URL_FIELD_KWARGS
+
+
+@pytest.mark.parametrize(
+    "valid,url",
+    [
+        (False, ""),
+        (False, "http://"),
+        (False, "http:/example.com"),
+        (False, "ftp://example.com"),
+        (False, "example.com"),
+        (False, "//example.com"),
+        (False, "http://foo:bar@example.com"),
+        (False, " http://example.com"),
+        (False, "http://example.com "),
+        (False, "http://examp le.com"),
+        (False, "https://example.com:232323"),
+        (True, "http://example.com"),
+        (True, "http://bücher.example"),
+        (True, "http://xn--bcher-kva.example"),
+        (True, "https://i❤.ws"),
+        (True, "https://example.com"),
+        (True, "https://example.com/"),
+        (True, "https://example.com:2323"),
+        (True, "https://example.com:2323/"),
+        (True, "https://example.com:2323/foo"),
+        (True, "https://example.com/f"),
+        (True, "https://example.com/foo"),
+        (True, "https://example.com/foo/"),
+        (True, "https://example.com/foo/bar"),
+        (True, "https://example.com/foo/bar/"),
+        (True, "https://example.com/foo/bar?baz"),
+        (True, "https://example.com/foo/bar/?baz"),
+        (True, "https://example.com?foo"),
+        (True, "https://example.com?foo=bar"),
+        (True, "https://example.com/?foo=bar&baz"),
+        (True, "https://example.com/?foo=bar&baz#"),
+        (True, "https://example.com/?foo=bar&baz#frag"),
+        (True, "https://example.com#"),
+        (True, "https://example.com/#"),
+        (True, "https://example.com/&"),
+        (True, "https://example.com/&#"),
+    ],
+)
+def test_url_pattern(url, valid):
+    assert isinstance(URL_FIELD_KWARGS["pattern"], str)
+    assert bool(re.match(URL_FIELD_KWARGS["pattern"], url)) == valid
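
The test above checks URL_FIELD_KWARGS["pattern"] against a table of valid and invalid URLs, but the pattern itself is not shown in this diff. To make the table-driven check easier to follow without the project installed, here is a minimal standalone sketch. URL_PATTERN is an illustrative regular expression written for this example and is only assumed to behave like the real pattern on the few cases listed; it is not copied from zyte_spider_templates.params.

```python
import re

# Illustrative pattern (an assumption, not the project's actual pattern):
# http or https scheme, a host without ":", "/" or whitespace, an optional
# port of 1 to 5 digits, then optional path and fragment without whitespace.
URL_PATTERN = r"^https?://[^:/\s]+(:\d{1,5})?(/[^\s]*)*(#[^\s]*)?$"

# A small subset of the (valid, url) cases from the test above.
CASES = [
    (False, "ftp://example.com"),
    (False, "example.com"),
    (False, " http://example.com"),
    (False, "https://example.com:232323"),
    (True, "http://example.com"),
    (True, "https://example.com:2323/foo"),
    (True, "https://example.com/?foo=bar&baz#frag"),
]

# Collect every case where the pattern disagrees with the expectation.
mismatches = [
    url for valid, url in CASES if bool(re.match(URL_PATTERN, url)) != valid
]
print(mismatches)  # []
```

Collecting mismatches in a list, instead of asserting inside the loop, shows every failing URL at once, which is similar in spirit to what pytest.mark.parametrize reports per case.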