Skip to content

bpo-22833: Fix bytes/str inconsistency in email.header.decode_header() #30548

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

dlenski
Copy link

@dlenski dlenski commented Jan 11, 2022

This function's possible return types have been non-intuitive and surprising
for the entirety of its Python 3.x history. It can return either:

  1. typing.List[typing.Tuple[str, None]], of length exactly 1
  2. or typing.List[typing.Tuple[bytes, typing.Optional[str]]]

This has meant that any user of this function must be prepared to accept
either bytes or str for the first member of the 2-tuples it returns,
which is a very unexpected behavior in Python 3.x, particularly given
that the second member of the tuple is supposed to represent the
charset/encoding of the first member.

This change eliminates case (1), ensuring that
email.header.decode_header() always returns bytes, never str, as the
first member of the 2-tuples it returns.

https://bugs.python.org/issue22833

This functions possible return types have been non-intuitive and surprising
for the entirety of its Python 3.x history. It can return either:

1. `typing.List[typing.Tuple[bytes, typing.Optional[str]]]`
2. or `typing.List[typing.Tuple[str, None]]`, of length exactly 1

This has meant that any user of this function must be prepared to accept
either `bytes` or `str` for the first member of the 2-tuples it returns,
which is a very surprising behavior in Python 3.x, particularly given
that the second member of the tuple is supposed to represent the
charset/encoding of the first member.

This change eliminates case (2), ensuring that
`email.header.decode_header()` always returns `bytes`, never `str`, as the
first member of the 2-tuples it returns. It also adds a test case to verify
this behavior.
@dlenski
Copy link
Author

dlenski commented Jan 21, 2022

@JelleZijlstra, as you wrote in https://bugs.python.org/msg411069

This behavior is definitely unfortunate, but by now it's also been baked into more than a decade of Python 3 releases, so backward compatibility constraints make it difficult to fix.

How can we be sure this change won't break users' code?

For reference, here are a few uses of the function I found in major open-source packages:

This one is in fact "safe" because it does its own check for the encoding delimiters (?=/=?), so it would never hit the (bytes, None) case with the new version.

My guess is it does that because the authors didn't understand the unpredictable return types from this function. 🤕

Ugh. I also proposed replacing any case of None as the charset with ascii, but that is liable to break as well for the case of unexpected non-ASCII literal characters, which we already discussed from a different angle.

A more robust alternative: return (bytes, "ascii") when there's an unencoded part, and it contains only ASCII characters. Return (bytes, "utf8") when there's an unencoded part and it contains non-ASCII characters, in violation of RFC 2047.

That would ensure that the function would only ever return (bytes, encoding) pairs which could actually be decoded according to the named encoding.

An alternative solution could be a new function with a sane return type.

The sane function would just decode the pieces per their encodings, and concatenate into one single Python str. I don't believe there's a good reason a consumer of this function should care about the "implementation detail" of the encoding(s) of the individual substring(s).

Any preference among these possibilities at this point?

  1. Put ascii as the charset when there's no encoding and the input is pure-ASCII, utf8 when there's no encoding and non-ASCII chars are present (violating RFC 2047)
  2. New function. What to call it? email.header_to_string?

Even if we decide to not change anything, we should document the surprising return type at https://docs.python.org/3.10/library/email.header.html.

Once we decide on a preferred path, I'll update https://github.com/python/cpython/blob/main/Doc/library/email.header.rst

@dlenski
Copy link
Author

dlenski commented Jan 21, 2022

The "sane return type" version of this function is relatively easy to describe in terms of the "insane return type" function.

A lightly updated version of what I proposed in https://bugs.python.org/msg409391, taking into account the possibility of raw-unicode-escape encdoing of the source:

#!/usr/bin/python3
import email.header

# Workaround for https://bugs.python.org/issue22833
def decode_header_to_string(header):
    '''Decodes an email message header (possibly RFC2047-encoded)
    into a string, while working around https://bugs.python.org/issue22833'''

    return ''.join(
        alleged_string if isinstance(alleged_string, str) else alleged_string.decode(
            alleged_charset or 'raw-unicode-escape')
        for alleged_string, alleged_charset in email.header.decode_header(header))


for header in ('=?utf-8?B?ZsOzbw==',
               '=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?=',
               'bar=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?=',
               'plain string',
               'thís isn’t “allowed” by RFC 2047=?ascii?Q?hello?=',
               '¡not allowed but decode_header doesn’t even notice!'):
    print("Header value: %r" % header)
    print("email.header.decode_header(...) -> %r" % email.header.decode_header(header))
    print("decode_header_to_string(...)    -> %r" % decode_header_to_string(header))
    print("-------")

@JelleZijlstra
Copy link
Member

First let me note that I just stumbled on this PR looking through the list of open CPython PRs, and I don't really have experience with working with email headers. So don't value my opinion too much.

That said, I'd be hesitant to change the return value of the existing function. It's a backward compatibility break and there's no good way to tell users who may have relied on the previous behavior about it. Also, decode_header is designed to be used together with email.header.make_header, so we'd want to maintain that relationship.

I also learned that the second element of the pair is not strictly an encoding but a charset. The possible charsets are specified in email.charset.CHARSETS, and there is one (viscii) that is not valid as an encoding. So the open-source code from above is buggy when it assumes that it can use the charset as an argument to .encode().

I might suggest a function that returns pairs (bytes, Charset object). This would produce a consistent return type, still work with make_header(), and make it harder to misinterpret the charset as an encoding.

@dlenski
Copy link
Author

dlenski commented May 17, 2022

Closing in favor of #92900

@dlenski dlenski closed this May 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants