bpo-22833: Fix bytes/str inconsistency in email.header.decode_header() #30548
This function's possible return types have been non-intuitive and surprising for the entirety of its Python 3.x history. It can return either:

1. `typing.List[typing.Tuple[bytes, typing.Optional[str]]]`
2. or `typing.List[typing.Tuple[str, None]]`, of length exactly 1

This has meant that any user of this function must be prepared to accept either `bytes` or `str` for the first member of the 2-tuples it returns, which is very surprising behavior in Python 3.x, particularly given that the second member of the tuple is supposed to represent the charset/encoding of the first member.

This change eliminates case (2), ensuring that `email.header.decode_header()` always returns `bytes`, never `str`, as the first member of the 2-tuples it returns.

It also adds a test case to verify this behavior.
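To make the two cases concrete, here is a minimal sketch that inspects what `email.header.decode_header()` returns on an interpreter where this change has not been applied. The sample header values are illustrative only, and the variable names are not part of the patch.

```python
import email.header

# No RFC 2047 encoded-words at all: case (2), a single (str, None) pair.
plain = email.header.decode_header('plain string')

# One encoded-word: case (1), a (bytes, charset name) pair.
encoded = email.header.decode_header('=?utf-8?B?ZsOzbw==?=')

# Unencoded text followed by an encoded-word: case (1) again, and the
# unencoded prefix is returned as bytes with a charset of None.
mixed = email.header.decode_header('bar=?ascii?Q?hello?=')

for label, result in (('plain', plain), ('encoded', encoded), ('mixed', mixed)):
    # Show only the type of the first member and the charset of each pair.
    print(label, [(type(text).__name__, charset) for text, charset in result])
```

Callers therefore cannot know the type of the first member without first checking whether the header contained any encoded-words.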
@JelleZijlstra, as you wrote in https://bugs.python.org/msg411069 …
This one is in fact "safe" because it does its own check for the encoding delimiters ( My guess is it does that because the authors didn't understand the unpredictable return types from this function. 🤕
Ugh. I also proposed replacing any case of A more robust alternative: return That would ensure that the function would only ever return
The sane function would just decode the pieces per their encodings, and concatenate into one single Python Any preference among these possibilities at this point?
Once we decide on a preferred path, I'll update https://github.com/python/cpython/blob/main/Doc/library/email.header.rst
The "sane return type" version of this function is relatively easy to describe in terms of the "insane return type" function. A lightly updated version of what I proposed in https://bugs.python.org/msg409391, taking into account the possibility of

    #!/usr/bin/python3
    import email.header

    # Workaround for https://bugs.python.org/issue22833
    def decode_header_to_string(header):
        '''Decodes an email message header (possibly RFC2047-encoded)
        into a string, while working around https://bugs.python.org/issue22833'''
        return ''.join(
            alleged_string if isinstance(alleged_string, str) else alleged_string.decode(
                alleged_charset or 'raw-unicode-escape')
            for alleged_string, alleged_charset in email.header.decode_header(header))

    for header in ('=?utf-8?B?ZsOzbw==',
                   '=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?=',
                   'bar=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?=',
                   'plain string',
                   'thís isn’t “allowed” by RFC 2047=?ascii?Q?hello?=',
                   '¡not allowed but decode_header doesn’t even notice!'):
        print("Header value: %r" % header)
        print("email.header.decode_header(...) -> %r" % email.header.decode_header(header))
        print("decode_header_to_string(...) -> %r" % decode_header_to_string(header))
        print("-------")
First let me note that I just stumbled on this PR while looking through the list of open CPython PRs, and I don't really have experience working with email headers, so don't value my opinion too much.

That said, I'd be hesitant to change the return value of the existing function. It's a backward compatibility break and there's no good way to tell users who may have relied on the previous behavior about it.

I also learned that the second element of the pair is not strictly an encoding but a charset. The possible charsets are specified in email.charset.CHARSETS, and there is one ( I might suggest a function that returns pairs (bytes, Charset object). This would produce a consistent return type, still work with
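One way the suggested (bytes, Charset object) pairs could look is sketched below. This is only an illustration: the name `decode_header_with_charsets` is hypothetical, and two choices here are assumptions rather than anything settled in this thread: a missing charset is left as `None`, and a `str` piece is converted to bytes with `raw-unicode-escape`, the same fallback codec used in the workaround above.

```python
import email.header
from email.charset import Charset
from typing import List, Optional, Tuple


def decode_header_with_charsets(header: str) -> List[Tuple[bytes, Optional[Charset]]]:
    """Hypothetical wrapper: always return (bytes, Charset-or-None) pairs."""
    pairs = []
    for text, charset in email.header.decode_header(header):
        if isinstance(text, str):
            # Assumption: normalise the str-only case to bytes.
            text = text.encode('raw-unicode-escape')
        # Assumption: keep None when no charset was declared.
        pairs.append((text, Charset(charset) if charset is not None else None))
    return pairs


for header in ('plain string', '=?utf-8?B?ZsOzbw==?='):
    print(header, '->', decode_header_with_charsets(header))
```

Because it is a new function, existing callers of `decode_header()` would be unaffected, which sidesteps the backward compatibility concern.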
Closing in favor of #92900
This function's possible return types have been non-intuitive and surprising
for the entirety of its Python 3.x history. It can return either:

1. `typing.List[typing.Tuple[str, None]]`, of length exactly 1
2. `typing.List[typing.Tuple[bytes, typing.Optional[str]]]`

This has meant that any user of this function must be prepared to accept
either `bytes` or `str` for the first member of the 2-tuples it returns,
which is a very unexpected behavior in Python 3.x, particularly given
that the second member of the tuple is supposed to represent the
charset/encoding of the first member.

This change eliminates case (1), ensuring that `email.header.decode_header()`
always returns `bytes`, never `str`, as the first member of the 2-tuples it
returns.

https://bugs.python.org/issue22833