gh-67022: Document bytes/str inconsistency in email.header.decode_header() and add .decode_header_to_string() as a sane alternative #92900

dlenski · 2022-05-17T20:52:48Z

This function's possible return types have been surprising and error-prone
for the entirety of its Python 3.x history. It can return either:

typing.List[typing.Tuple[str, None]], of length exactly 1
or typing.List[typing.Tuple[bytes, typing.Optional[str]]]

This function can't be rewritten to be more consistent in a backwards-compatible way, because some users of this function depend on the existing return type(s).
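
To make the two return shapes concrete, here is a minimal sketch that calls the existing `email.header.decode_header()` on a header with and without RFC 2047 encoded words. The `normalize_parts` helper is hypothetical (it is not part of this PR or of the standard library) and assumes every reported charset names a codec Python knows.

```python
from email.header import decode_header

# No RFC 2047 encoded words: the input comes back unchanged, as str.
plain = decode_header("hello bar")

# At least one encoded word: every part comes back as bytes.
mixed = decode_header("hello =?utf-8?B?ZsOzbw==?= bar")


def normalize_parts(parts):
    """Hypothetical helper: return each part as str, whatever type it arrived as."""
    out = []
    for text, charset in parts:
        if isinstance(text, bytes):
            # Unencoded fragments carry charset None; treat them as ASCII here.
            text = text.decode(charset or "ascii", errors="replace")
        out.append(text)
    return out


print(plain)
print(mixed)
print("".join(normalize_parts(mixed)))
```

Callers that skip the `isinstance` check tend to work on one kind of header and fail on the other, which is the error-prone behavior described above.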

This PR addresses the inconsistency as suggested by @JelleZijlstra in #67022 (comment):

we should document the surprising return type at https://docs.python.org/3.10/library/email.header.html.
[create] a new function with a sane return type.

The "sane", Pythonic way to handle the decoding of an email/MIME message header value is simply to convert the whole header to a str; the details of exactly which parts of that header were encoded in which charsets are not relevant to most users. Fortunately, the email.header module already contains a mechanism to do this, via the __str__ method of email.header.Header, so we can simply create a wrapper function to guide users in the right direction.
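
For readers on a Python without this patch, a wrapper along these lines can be sketched from functions that already exist in `email.header`. The name `header_to_str` is hypothetical, and this is only an assumption about the general shape of such a wrapper, not the code added by this PR.

```python
from email.header import decode_header, make_header


def header_to_str(value):
    """Hypothetical stand-in for the proposed wrapper.

    decode_header() splits the value into (text, charset) parts,
    make_header() rebuilds a Header object from those parts, and
    str() on that Header yields the fully decoded text.
    """
    return str(make_header(decode_header(value)))


print(header_to_str("hello =?utf-8?B?ZsOzbw==?= bar"))
print(header_to_str("=?iso-8859-1?q?hello_f=F3o_bar?="))
print(header_to_str("hello bar"))
```

Because `make_header()` accepts both the `str` and the `bytes` forms that `decode_header()` produces, the caller never has to branch on the part type.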

Example of the old/inconsistent (decode_header) vs. new/sane (decode_header_to_string) functions:

>>> from email.header import decode_header, decode_header_to_string
>>>
>>> # Do most users care about this distinction in (sub)encodings? I think not.
>>> print(decode_header('hello =?utf-8?B?ZsOzbw==?= bar'))
[(b'hello ', None), (b'f\xc3\xb3o', 'utf-8'), (b' bar', None)]
>>> print(decode_header('=?iso-8859-1?q?hello_f=F3o_bar?='))
[(b'hello f\xf3o bar', 'iso-8859-1')]
>>>
>>> # Assuming not, this is a much saner interface
>>> print(decode_header_to_string('hello =?utf-8?B?ZsOzbw==?= bar'))
hello fóo bar
>>> print(decode_header_to_string('=?iso-8859-1?q?hello_f=F3o_bar?='))
hello fóo bar

(Closes #30548 and replaces it.)

Issue: The decode_header() function decodes raw part to bytes or str, depending on encoded part #67022

ghost · 2022-05-17T20:52:50Z

All commit authors signed the Contributor License Agreement.

warsaw

In general, I think this would help users of the legacy API, although I think we should also steer people to the new API. What does @bitdancer think?

Doc/library/email.header.rst

warsaw · 2022-07-20T19:23:47Z

Doc/library/email.header.rst

+   .. note::
+
+      This function exists for backwards compatibility only. For
+      new code we recommend using :func:`email.header.decode_header_to_string`.


How about adding a link to the non-legacy API, or an example using that newer API?

dlenski · 2022-07-20

Do you mean the non-legacy API as described in https://docs.python.org/3/library/email.html, e.g. email.parser?

To my knowledge, there is not any function/method in that API which can be straightforwardly used instead of email.header.decode_header.

bitdancer · 2023-02-21

>>> from email.headerregistry import HeaderRegistry
>>> decoder = HeaderRegistry()
>>> decoder('To', '=?utf-8?q?M=C3=A4x?= <foo@bar.com>')
'Mäx <foo@bar.com>'
>>> decoder('To', '=?utf-8?q?M=C3=A4x?= <foo@bar.com>').addresses
(Address(display_name='Mäx', username='foo', domain='bar.com'),)

You really don't want to use the legacy decode_header. It has many bugs that the new API fixes.

Obviously this needs to be better documented...
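
As an illustration of the newer API mentioned in this thread, the sketch below parses a whole message with `email.policy.default`, under which header values are already decoded when they are read. The sample message text is made up for the example.

```python
import email
from email import policy

# Made-up sample message with RFC 2047 encoded words in two headers.
raw = (
    "To: =?utf-8?q?M=C3=A4x?= <foo@bar.com>\n"
    "Subject: =?utf-8?B?ZsOzbw==?=\n"
    "\n"
    "body\n"
)

# policy.default selects the non-legacy API: header values are returned
# as str subclasses that have already been decoded.
msg = email.message_from_string(raw, policy=policy.default)

subject = msg["Subject"]
to = msg["To"]

print(subject)
print(to.addresses[0].display_name)
print(to.addresses[0].addr_spec)
```

With this policy there is no separate decoding step, so the bytes/str question from `decode_header()` does not come up for code that starts from a full message.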

Lib/email/header.py

Misc/NEWS.d/next/Library/2022-01-11-21-40-14.bpo-22833.WB-JWw.rst

Lib/email/header.py

…de_header()

This function's possible return types have been surprising and error-prone for the entirety of its Python 3.x history. It can return either:

1. `typing.List[typing.Tuple[bytes, typing.Optional[str]]]` of length >1
2. or `typing.List[typing.Tuple[str, None]]`, of length exactly 1

This means that any user of this function must be prepared to accept either `bytes` or `str` for the first member of the 2-tuples it returns, which is a very surprising behavior in Python 3.x, particularly given that the second member of the tuple is supposed to represent the charset/encoding of the first member.

This patch documents the behavior of this function, and adds test cases to demonstrate it.

As discussed in bpo-22833, this cannot be changed in a backwards-compatible way, and some users of this function depend precisely on the existing behavior.

This function takes an email header, possibly with portions encoded according to RFC2047, and converts it to a standard Python string.

It is intended to provide a sane, Pythonic replacement for `email.header.decode_header()`, which has two major problems:

1. May return either bytes or str (bpo-22833/pythongh-67022), an inconsistent and error-prone interface
2. Exposes details of an email header value's encoding which most users will not care about or want to deal with.

Many users likely just want to decode an email header value to a Python string. It turns out that `email.header` already contained most of the code necessary to do this, and providing `decode_header_to_string` as a documented wrapper function points users in the right direction.

dlenski · 2022-07-20T21:24:34Z

I have made the requested changes; please review again, @warsaw.

And if you don't make the requested changes, you will be put in the comfy chair!

😂

dlenski · 2023-02-21T01:43:40Z

I have made the requested changes; please review again

dlenski requested a review from a team as a code owner May 17, 2022 20:52

dlenski force-pushed the gh60722 branch from 6b35dc0 to 712d83d Compare May 17, 2022 20:54

dlenski mentioned this pull request May 17, 2022

bpo-22833: Fix bytes/str inconsistency in email.header.decode_header() #30548

Closed

dlenski force-pushed the gh60722 branch from 712d83d to 9a8a34c Compare May 17, 2022 20:59

dlenski mentioned this pull request Jul 19, 2022

The decode_header() function decodes raw part to bytes or str, depending on encoded part #67022

Open

srittau mentioned this pull request Jul 19, 2022

gh-67022: Improve email.header.decode_header() documentation #95020

Closed

warsaw requested changes Jul 20, 2022

dlenski added 2 commits July 20, 2022 14:09

dlenski force-pushed the gh60722 branch from 9a8a34c to e760911 Compare July 20, 2022 21:11

Merge branch 'main' into gh60722

0ced18b
