bpo-22833: Fix bytes/str inconsistency in email.header.decode_header() #30548

dlenski · 2022-01-11T21:29:48Z

This function's possible return types have been non-intuitive and surprising
for the entirety of its Python 3.x history. It can return either:

1. typing.List[typing.Tuple[str, None]], of length exactly 1
2. or typing.List[typing.Tuple[bytes, typing.Optional[str]]]

This has meant that any user of this function must be prepared to accept
either bytes or str for the first member of the 2-tuples it returns,
which is a very unexpected behavior in Python 3.x, particularly given
that the second member of the tuple is supposed to represent the
charset/encoding of the first member.

This change eliminates case (1), ensuring that
email.header.decode_header() always returns bytes, never str, as the
first member of the 2-tuples it returns.

https://bugs.python.org/issue22833

Lib/email/header.py

Lib/test/test_email/test_email.py

This function's possible return types have been non-intuitive and surprising for the entirety of its Python 3.x history. It can return either:

1. `typing.List[typing.Tuple[bytes, typing.Optional[str]]]`
2. or `typing.List[typing.Tuple[str, None]]`, of length exactly 1

This has meant that any user of this function must be prepared to accept either `bytes` or `str` for the first member of the 2-tuples it returns, which is a very surprising behavior in Python 3.x, particularly given that the second member of the tuple is supposed to represent the charset/encoding of the first member.

This change eliminates case (2), ensuring that `email.header.decode_header()` always returns `bytes`, never `str`, as the first member of the 2-tuples it returns. It also adds a test case to verify this behavior.
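For readers who have not run into this, here is a minimal sketch of the two return shapes described above, as the function behaves without this change. The header values are illustrative only; the last line prints just the types of the mixed case rather than asserting exact bytes.

```python
import email.header

# No RFC 2047 encoded words at all: a single (str, None) pair.
plain = email.header.decode_header('plain string')
print(plain)    # [('plain string', None)]

# Only an encoded word: a (bytes, charset) pair.
encoded = email.header.decode_header('=?utf-8?B?ZsOzbw==?=')
print(encoded)  # [(b'f\xc3\xb3o', 'utf-8')]

# Mixed: the unencoded part now comes back as bytes, still labelled None.
mixed = email.header.decode_header('bar =?utf-8?B?ZsOzbw==?=')
print([(type(part).__name__, charset) for part, charset in mixed])
```

So a caller has to branch on `isinstance(part, str)` before it can touch the first member of each pair.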

dlenski · 2022-01-21T03:06:18Z

@JelleZijlstra, as you wrote in https://bugs.python.org/msg411069 …

This behavior is definitely unfortunate, but by now it's also been baked into more than a decade of Python 3 releases, so backward compatibility constraints make it difficult to fix.

How can we be sure this change won't break users' code?

For reference, here are a few uses of the function I found in major open-source packages:

https://github.com/httplib2/httplib2/blob/cde9e87d8b2c4c5fc966431965998ed5f45d19c7/python3/httplib2/__init__.py#L1608 - this assumes it only ever hits the (bytes, encoding) case.

This one is in fact "safe" because it does its own check for the encoding delimiters (?=/=?), so it would never hit the (bytes, None) case with the new version.

My guess is it does that because the authors didn't understand the unpredictable return types from this function. 🤕

https://github.com/cherrypy/cherrypy/blob/98929b519fbca003cbf7b14a6b370a3cabc9c412/cherrypy/lib/httputil.py#L258 - this assumes it only gets (str, None) or (bytes, encoding) pairs, which seems unsafe. But if it currently sees (str, None) and would see (bytes, None) with this change, it would break.

Ugh. I also proposed replacing any case of None as the charset with ascii, but that is liable to break as well for the case of unexpected non-ASCII literal characters, which we already discussed from a different angle.

A more robust alternative: return (bytes, "ascii") when there's an unencoded part, and it contains only ASCII characters. Return (bytes, "utf8") when there's an unencoded part and it contains non-ASCII characters, in violation of RFC 2047.

That would ensure that the function would only ever return (bytes, encoding) pairs which could actually be decoded according to the named encoding.
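A rough sketch of that alternative, written as a wrapper around the current function rather than as a patch to `Lib/email/header.py`. The name `decode_header_as_bytes` is hypothetical, and the round trip through `raw-unicode-escape` assumes that is how the stdlib produced the unencoded bytes, so a header containing a literal backslash-u sequence would be misread.

```python
import email.header


def decode_header_as_bytes(header):
    """Hypothetical wrapper: always return (bytes, charset) pairs
    whose charset can actually decode the bytes."""
    result = []
    for part, charset in email.header.decode_header(header):
        if charset is None:
            # Unencoded part: str when the header has no encoded words,
            # otherwise bytes (assumed here to be 'raw-unicode-escape').
            if isinstance(part, str):
                text = part
            else:
                text = part.decode('raw-unicode-escape')
            # ASCII-only parts are labelled 'ascii'; anything else
            # (a violation of RFC 2047) is labelled 'utf8'.
            charset = 'ascii' if text.isascii() else 'utf8'
            part = text.encode(charset)
        result.append((part, charset))
    return result


for header in ('plain string', '¡hola!', 'thís =?ascii?Q?hello?='):
    print(decode_header_as_bytes(header))
```

With this shape, `part.decode(charset)` is meant to work for every pair, at least for charsets that are also valid Python codec names.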

An alternative solution could be a new function with a sane return type.

The sane function would just decode the pieces per their encodings, and concatenate into one single Python str. I don't believe there's a good reason a consumer of this function should care about the "implementation detail" of the encoding(s) of the individual substring(s).

Any preference among these possibilities at this point?

1. Put ascii as the charset when there's no encoding and the input is pure-ASCII, utf8 when there's no encoding and non-ASCII chars are present (violating RFC 2047)
2. New function. What to call it? email.header_to_string?

Even if we decide to not change anything, we should document the surprising return type at https://docs.python.org/3.10/library/email.header.html.

Once we decide on a preferred path, I'll update https://github.com/python/cpython/blob/main/Doc/library/email.header.rst

dlenski · 2022-01-21T03:09:00Z

The "sane return type" version of this function is relatively easy to describe in terms of the "insane return type" function.

A lightly updated version of what I proposed in https://bugs.python.org/msg409391, taking into account the possibility of raw-unicode-escape encoding of the source:

#!/usr/bin/python3
import email.header

# Workaround for https://bugs.python.org/issue22833
def decode_header_to_string(header):
    '''Decodes an email message header (possibly RFC2047-encoded)
    into a string, while working around https://bugs.python.org/issue22833'''

    return ''.join(
        alleged_string if isinstance(alleged_string, str) else alleged_string.decode(
            alleged_charset or 'raw-unicode-escape')
        for alleged_string, alleged_charset in email.header.decode_header(header))


for header in ('=?utf-8?B?ZsOzbw==',
               '=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?=',
               'bar=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?=',
               'plain string',
               'thís isn’t “allowed” by RFC 2047=?ascii?Q?hello?=',
               '¡not allowed but decode_header doesn’t even notice!'):
    print("Header value: %r" % header)
    print("email.header.decode_header(...) -> %r" % email.header.decode_header(header))
    print("decode_header_to_string(...)    -> %r" % decode_header_to_string(header))
    print("-------")

JelleZijlstra · 2022-01-21T04:33:17Z

First let me note that I just stumbled on this PR looking through the list of open CPython PRs, and I don't really have experience with working with email headers. So don't value my opinion too much.

That said, I'd be hesitant to change the return value of the existing function. It's a backward compatibility break and there's no good way to tell users who may have relied on the previous behavior about it. Also, decode_header is designed to be used together with email.header.make_header, so we'd want to maintain that relationship.

I also learned that the second element of the pair is not strictly an encoding but a charset. The possible charsets are specified in email.charset.CHARSETS, and there is one (viscii) that is not valid as an encoding. So the open-source code from above is buggy when it assumes that it can use the charset as an argument to .encode().

I might suggest a function that returns pairs (bytes, Charset object). This would produce a consistent return type, still work with make_header(), and make it harder to misinterpret the charset as an encoding.
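To make that suggestion concrete, here is one possible sketch. The name `decode_header_with_charsets` is hypothetical, and leaving the charset as `None` for unencoded parts is an assumption on my part, since the thread does not settle what those parts should be labelled.

```python
import email.header
from email.charset import Charset


def decode_header_with_charsets(header):
    """Hypothetical variant: return (bytes, Charset-or-None) pairs."""
    pairs = []
    for part, charset in email.header.decode_header(header):
        if isinstance(part, str):
            # Mirror what decode_header() does for unencoded parts
            # when the header also contains encoded words.
            part = part.encode('raw-unicode-escape')
        # Assumption: unencoded parts keep None instead of a Charset.
        pairs.append((part, Charset(charset) if charset is not None else None))
    return pairs


pairs = decode_header_with_charsets('=?utf-8?B?ZsOzbw==?=')
# make_header() accepts Charset instances as well as charset names.
print(str(email.header.make_header(pairs)))  # fóo
```

Because the second member is a `Charset` rather than a bare string, a caller is less likely to hand it straight to `bytes.decode()`.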

dlenski · 2022-05-17T20:55:25Z

Closing in favor of #92900

dlenski requested a review from a team as a code owner January 11, 2022 21:29

the-knights-who-say-ni added the CLA signed label Jan 11, 2022

bedevere-bot added the awaiting review label Jan 11, 2022

dlenski force-pushed the bpo22833 branch from a8281b0 to d78ec0c Compare January 11, 2022 22:16

JelleZijlstra reviewed Jan 21, 2022

Lib/email/header.py (outdated, resolved)

JelleZijlstra reviewed Jan 21, 2022

Lib/test/test_email/test_email.py (resolved)

dlenski force-pushed the bpo22833 branch from d78ec0c to 990195f Compare January 21, 2022 02:39

dlenski force-pushed the bpo22833 branch from 990195f to 4729b37 Compare January 21, 2022 02:40

This was referenced May 17, 2022

The decode_header() function decodes raw part to bytes or str, depending on encoded part #67022

Closed

gh-67022: Document bytes/str inconsistency in email.header.decode_header() and suggest email.headerregistry.HeaderRegistry as a sane alternative #92900

Merged

dlenski closed this May 17, 2022
