14 changes: 7 additions & 7 deletions tests/test_pages_price.py
@@ -62,21 +62,21 @@ def price(self):
url = "https://example.com"
page = CustomProductPage(response=HttpResponse(url=url, body=html))
assert page.call_count == 0
assert page.price == "$13.2"
assert page.price == "13.20"
@kmike (Contributor), Aug 21, 2024:

This is a big change, because it goes against the documentation - see https://zyte-common-items.readthedocs.io/en/latest/usage/field-processors.html#overview:

By design, the processors enabled by default are “transparent”: they don’t change the output of the field if the result is of the expected final type. For example, if there is a str attribute in the item, and the field returns str value, the default processor returns the value as-is.

This design decision was made for two reasons:

  1. It makes it possible to safely add new default processors without the risk of breaking user code. If the code was working before a processor was introduced, the field was already returning the final data type; the processor won't touch it, so nothing can break.
  2. It provides an escape hatch for the user: if you want to return final data, you can do so simply by returning a value of the final type.
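The transparency contract described above can be sketched as follows; this is a minimal illustration, not the library's actual code, and the `Brand` class here is a hypothetical stand-in for `zyte_common_items.Brand`:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Brand:
    """Stand-in for zyte_common_items.Brand, for illustration only."""

    name: str


def transparent_brand_processor(value: Any) -> Any:
    """A "transparent" default processor: converts raw values, but returns
    values that already have the final type unchanged."""
    if isinstance(value, Brand):
        # Final type: returned as-is, so adding this processor cannot break
        # code whose fields already return Brand objects.
        return value
    if isinstance(value, str):
        # Raw value: convert (an empty string maps to no brand).
        return Brand(name=value) if value else None
    return value
```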

It's not an issue for Real support in price parsing, for Brand processing, or for Images (it all looks good in this PR), but it's an issue for price fields.

Obviously, this "design principle" has a downside: in most cases (?), we do want to process strings by default. I think most large Scrapy projects develop this kind of utility over time, where all values are processed by default. We do want to make that more standard.

I can see 2 ways to address it.

  1. Start processing final data types by default; drop this design principle.
  2. Introduce more aggressive / complete versions of the processors. Keep the "cautious" processors, which don't handle final data types, enabled by default, and make it easy to enable the complete versions: either by defining a Processors class which users can use, or by providing another base ProductPage class with processing of final values enabled by default.
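Option (2) could look roughly like the sketch below. All names and the `processors` registry shape are hypothetical, not actual zyte-common-items API:

```python
from typing import Any, Callable, Dict, List


def cautious_price_processor(value: Any) -> Any:
    # "Transparent": a str is already the final type, so it is returned as-is.
    if isinstance(value, str):
        return value
    return f"{float(value):.2f}"


def complete_price_processor(value: Any) -> Any:
    # Also normalizes final-typed (str) values, e.g. "$13.2" -> "13.20".
    if isinstance(value, str):
        value = value.lstrip("$").replace(",", "")
    return f"{float(value):.2f}"


class ProductPage:
    # Cautious processors stay enabled by default.
    processors: Dict[str, List[Callable]] = {"price": [cautious_price_processor]}


class CompleteProductPage(ProductPage):
    # Opt-in base class with processing of final values enabled by default.
    processors = {"price": [complete_price_processor]}
```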

I'm +0.5 for option (2), but willing to hear the opinions and arguments. I like keeping the "safe" approach, but it introduces more complexity (e.g. an additional base class, where users need to pick the right one).

cc @wRAR @Gallaecio.

Contributor Author:

I'm kind of against this transparency principle, and I've even proposed another PR #101 to ensure all strings are trimmed.

It's difficult for me to imagine why we would want to emit data with extra whitespace, and it's bothersome to always manually add some sort of default text cleaning processor to every field that needs it. I think this kind of processing should be applied by default at some point; it would help ensure that all our output data respects a common standard.
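The default text cleaning argued for here could be as simple as the sketch below (my own assumption, not the #101 implementation):

```python
from typing import Any


def strip_processor(value: Any) -> Any:
    """Trim surrounding whitespace from every string output; leave
    non-string values untouched."""
    return value.strip() if isinstance(value, str) else value
```

Enabled by default on every field, a processor like this would guarantee that no field emits data with leading or trailing whitespace.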

Contributor Author:

Decided to revert this change for now, to not block the rest.

assert page.call_count == 1
assert page.currency is None
assert await page.currencyRaw is None
assert page.call_count == 2 # we want this to be 1
assert await page.currencyRaw is None
assert page.call_count == 2 # we want this to be 1
assert await page.currencyRaw == "$"
assert page.call_count == 1 # we want this to be 1
Member:

I wonder why the count changed, and what the new expectation in the comment should be.

Contributor Author:

Yes, I've noticed that the comment info doesn't match the test values in some places on master. Not sure whether this is the proper behavior.

On master, such a difference can be found in 5 places:

This PR fixes two of these:

It seems like an improvement, albeit an incomplete one (but fixing this for all cases seems out of scope, so perhaps we can live with it for now...?)

Contributor Author:

Reverted parsing strings in price_processor, as noted in #99 (comment), so this is no longer relevant: this PR no longer modifies test_pages_price.py.

assert await page.currencyRaw == "$"
assert page.call_count == 1 # we want this to be 1

# access currency fields before the price field
page = CustomProductPage(response=HttpResponse(url=url, body=html))
assert page.call_count == 0
assert page.currency is None
assert await page.currencyRaw is None
assert await page.currencyRaw == "$"
assert page.call_count == 1
assert page.price == "$13.2"
assert page.price == "13.20"
assert page.call_count == 2 # we want this to be 1


105 changes: 94 additions & 11 deletions tests/test_processors.py
@@ -7,12 +7,22 @@
from zyte_parsers import Gtin as zp_Gtin
from zyte_parsers import extract_breadcrumbs

from zyte_common_items import AggregateRating, BasePage, Breadcrumb, Gtin, ProductPage
from zyte_common_items import (
AggregateRating,
BasePage,
Brand,
Breadcrumb,
Gtin,
Image,
ProductPage,
)
from zyte_common_items.processors import (
_format_price,
brand_processor,
breadcrumbs_processor,
gtin_processor,
images_processor,
price_processor,
rating_processor,
)

@@ -125,16 +135,16 @@ def breadcrumbs(self):
"input_value,expected_value",
[
(None, None),
("", ""),
("foo", "foo"),
("", None),
("foo", Brand(name="foo")),
(Selector(text="<html></html>"), None),
(SelectorList([]), None),
(fromstring("<p>foo</p>"), "foo"),
(fromstring("<img alt='foo'>"), "foo"),
(fromstring("<p><img alt='foo'></p>"), "foo"),
(fromstring("<p><p><img alt='foo'></p></p>"), "foo"),
(Selector(text="<p>foo</p>"), "foo"),
(SelectorList([Selector(text="<p>foo</p>")]), "foo"),
(fromstring("<p>foo</p>"), Brand(name="foo")),
(fromstring("<img alt='foo'>"), Brand(name="foo")),
(fromstring("<p><img alt='foo'></p>"), Brand(name="foo")),
(fromstring("<p><p><img alt='foo'></p></p>"), Brand(name="foo")),
(Selector(text="<p>foo</p>"), Brand(name="foo")),
(SelectorList([Selector(text="<p>foo</p>")]), Brand(name="foo")),
],
)
def test_brand(input_value, expected_value):
@@ -149,7 +159,7 @@ def brand(self):

def test_brand_page():
class MyProductPage(ProductPage):
@field
@field(out=[brand_processor])
def brand(self):
return self.css("body")

@@ -158,7 +168,7 @@ def brand(self):
body="<html><body><img alt='foo'></body></html>".encode(),
)
page = MyProductPage(response=response)
assert page.brand == "foo"
assert page.brand == Brand(name="foo")


@pytest.mark.parametrize(
@@ -321,3 +331,76 @@ def aggregateRating(self):
assert page.aggregateRating == AggregateRating(
ratingValue=3.8, bestRating=10, reviewCount=5
)


@pytest.mark.parametrize(
"input_value,expected_value",
[
(None, None),
([], []),
("https://www.url.com/img.jpg", [Image(url="https://www.url.com/img.jpg")]),
(
[
Image("https://www.url.com/img1.jpg"),
Image("https://www.url.com/img2.jpg"),
],
[
Image("https://www.url.com/img1.jpg"),
Image("https://www.url.com/img2.jpg"),
],
),
(
["https://www.url.com/img1.jpg", "https://www.url.com/img2.jpg"],
[
Image("https://www.url.com/img1.jpg"),
Image("https://www.url.com/img2.jpg"),
],
),
(
[
{"url": "https://www.url.com/img1.jpg"},
{"url": "https://www.url.com/img2.jpg"},
],
[
Image("https://www.url.com/img1.jpg"),
Image("https://www.url.com/img2.jpg"),
],
),
],
)
def test_images(input_value, expected_value):
class ImagesPage(BasePage):
@field(out=[images_processor])
def images(self):
return input_value

page = ImagesPage(base_url) # type: ignore[arg-type]
assert page.images == expected_value


@pytest.mark.parametrize(
"input_value,expected_value",
[
("$10", "10.00"),
("100 ", "100.00"),
("100rub", "100.00"),
(100, "100.00"),
(None, None),
([], []),
({}, {}),
("", None),
("buy 10 ab ab", "10.00"),
("1,000.17", "1000.17"),
("1,000", "1000.00"),
(22.9, "22.90"),
(22.0, "22.00"),
],
)
def test_prices(input_value, expected_value):
class PricePage(BasePage):
@field(out=[price_processor])
def price(self):
return input_value

page = PricePage(base_url) # type: ignore[arg-type]
assert page.price == expected_value
2 changes: 1 addition & 1 deletion tox.ini
@@ -73,7 +73,7 @@ commands = mypy zyte_common_items tests
[testenv:twinecheck]
basepython = python3
deps =
twine==4.0.2
twine==5.1.1
build==0.10.0
commands =
python -m build --sdist
3 changes: 3 additions & 0 deletions zyte_common_items/pages/product.py
@@ -19,6 +19,7 @@
description_html_processor,
description_processor,
gtin_processor,
images_processor,
price_processor,
rating_processor,
simple_price_processor,
@@ -46,6 +47,7 @@ class Processors(BasePage.Processors):
gtin = [gtin_processor]
price = [price_processor]
regularPrice = [simple_price_processor]
images = [images_processor]


class ProductPage(
@@ -62,6 +64,7 @@ class Processors(Page.Processors):
gtin = [gtin_processor]
price = [price_processor]
regularPrice = [simple_price_processor]
images = [images_processor]


@attrs.define
108 changes: 90 additions & 18 deletions zyte_common_items/processors.py
@@ -1,5 +1,6 @@
from collections.abc import Iterable
from collections.abc import Iterable, Mapping
from functools import wraps
from numbers import Real
from typing import Any, Callable, List, Optional, Union

from clear_html import clean_node, cleaned_node_to_html, cleaned_node_to_text
@@ -21,8 +22,10 @@
from .components import (
AggregateRating,
BaseMetadata,
Brand,
Breadcrumb,
Gtin,
Image,
ProbabilityRequest,
Request,
)
@@ -104,50 +107,80 @@ def _from_zp_breadcrumb(value: zp_Breadcrumb) -> Breadcrumb:
return results


@only_handle_nodes
def brand_processor(value: Union[Selector, HtmlElement], page: Any) -> Any:
def brand_processor(value: Any, page: Any) -> Union[Brand, None]:
"""Convert the data into a brand name if possible.

Supported inputs are :class:`~parsel.selector.Selector`,
:class:`~parsel.selector.SelectorList` and :class:`~lxml.html.HtmlElement`.
Other inputs are returned as is.
If the input is a :class:`~parsel.selector.Selector`,
:class:`~parsel.selector.SelectorList` or :class:`~lxml.html.HtmlElement`,
the brand name is extracted from it.

If the input is a non-empty string, it is used as the brand name.

Other inputs are returned unchanged.
"""
return extract_brand_name(value, search_depth=2)
value = _handle_selectorlist(value)

if isinstance(value, str):
return Brand(name=value) if value else None

@only_handle_nodes
def price_processor(value: Union[Selector, HtmlElement], page: Any) -> Any:
if isinstance(value, (Selector, SelectorList, HtmlElement)):
if brand_name := extract_brand_name(value, search_depth=2):
return Brand(name=brand_name)
else:
return None

return value


def price_processor(value: Any, page: Any) -> Any:
"""Convert the data into a price string if possible.

Uses the price-parser_ library.

Supported inputs are :class:`~parsel.selector.Selector`,
:class:`~parsel.selector.SelectorList` and :class:`~lxml.html.HtmlElement`.
:class:`~parsel.selector.SelectorList`, :class:`~lxml.html.HtmlElement`, string
instances and numeric values.

Other inputs are returned as is.

Puts the parsed Price object into ``page._parsed_price``.

.. _price-parser: https://github.com/scrapinghub/price-parser
"""
price = extract_price(value)
page._parsed_price = price
return _format_price(price)
value = _handle_selectorlist(value)

if isinstance(value, Real):
return f"{value:.2f}"
elif isinstance(value, (Selector, HtmlElement, str)):
price = extract_price(value)
page._parsed_price = price
return _format_price(price)
else:
return value


@only_handle_nodes
def simple_price_processor(value: Union[Selector, HtmlElement], page: Any) -> Any:
def simple_price_processor(value: Any, page: Any) -> Any:
"""Convert the data into a price string if possible.

Uses the price-parser_ library.

Supported inputs are :class:`~parsel.selector.Selector`,
:class:`~parsel.selector.SelectorList` and :class:`~lxml.html.HtmlElement`.
:class:`~parsel.selector.SelectorList`, :class:`~lxml.html.HtmlElement`, string
instances and numeric values.

Other inputs are returned as is.

.. _price-parser: https://github.com/scrapinghub/price-parser
"""
price = extract_price(value)
return _format_price(price)
value = _handle_selectorlist(value)

if isinstance(value, Real):
return f"{value:.2f}"
Contributor:

It seems this would allow removing the duplication between price_processor and simple_price_processor.

Contributor Author:

We'd like to migrate our stuff quickly before certain deadlines, which is why I want to have this working for now and to avoid coordinating PRs in yet another dependency repository (which is also why I put the proposed parsing of images into a TODO). I think we can improve it later and migrate it to a more fitting place.

Contributor:

Sounds good!

elif isinstance(value, (Selector, HtmlElement, str)):
price = extract_price(value)
return _format_price(price)
else:
return value


@only_handle_nodes
@@ -330,6 +363,45 @@ def aggregateRating(self):
return value


def images_processor(value: Any, page: Any) -> List[Image]:
"""Convert the data into a list of :class:`~zyte_common_items.Image`
objects if possible.

If the input is a string, it is used as the URL of the returned image object.

If the input is an iterable of strings, or of mappings with a "url" key, they
are used to populate the image objects.

Other inputs are returned unchanged.
"""

value = _handle_selectorlist(value)

# TODO: add generic-purpose extract_images utility to zyte-parsers
#
# if isinstance(value, (Selector, HtmlElement)):
# images = extract_images(value)
# return [Image(url=url) for url in images]

if isinstance(value, str):
return [Image(url=value)]

if isinstance(value, Iterable):
results: List[Any] = []
for item in value:
if isinstance(item, Image):
results.append(item)
elif isinstance(item, Mapping):
if url := item.get("url"):
results.append(Image(url=url))
elif isinstance(item, str):
results.append(Image(url=item))

return results

return value


def probability_request_list_processor(
request_list: List[Request],
) -> List[ProbabilityRequest]: