gh-67022: Document bytes/str inconsistency in email.header.decode_header() and add .decode_header_to_string() as a sane alternative #92900

Open
wants to merge 3 commits into
base: main
57 changes: 46 additions & 11 deletions Doc/library/email.header.rst
@@ -173,21 +173,23 @@ Here is the :class:`Header` class description:
The :mod:`email.header` module also provides the following convenient functions.


.. function:: decode_header(header)
.. function:: decode_header_to_string(header)

Decode a message header value without converting the character set. The header
value is in *header*.
Decode a message header value to a Unicode string, including handling
portions encoded according to :rfc:`2047`.

This function returns a list of ``(decoded_string, charset)`` pairs containing
each of the decoded parts of the header. *charset* is ``None`` for non-encoded
parts of the header, otherwise a lower case string containing the name of the
character set specified in the encoded string.
An :exc:`email.errors.HeaderParseError` may be raised when
certain decoding errors occur (e.g. a base64 decoding exception).

Here's an example::
Here are examples:

>>> from email.header import decode_header
>>> decode_header('=?iso-8859-1?q?p=F6stal?=')
[(b'p\xf6stal', 'iso-8859-1')]
>>> from email.header import decode_header_to_string
>>> decode_header_to_string('=?iso-8859-1?q?p=F6stal?=')
'p\xf6stal'
>>> decode_header_to_string('unencoded_string')
'unencoded_string'
>>> decode_header_to_string('bar =?utf-8?B?ZsOzbw==?=')
'bar f\xf3o'


.. function:: make_header(decoded_seq, maxlinelen=None, header_name=None, continuation_ws=' ')
@@ -203,3 +205,36 @@ The :mod:`email.header` module also provides the following convenient functions.
:class:`Header` instance. Optional *maxlinelen*, *header_name*, and
*continuation_ws* are as in the :class:`Header` constructor.


.. function:: decode_header(header)

Decode a message header value without converting the character set. The header
value is in *header*.

For historical reasons, this function may return either:

1. A list of pairs containing each of the decoded parts of the header,
``(decoded_bytes, charset)``, where *decoded_bytes* is always an instance of
:class:`bytes`, and *charset* is either:
- A lower case string containing the name of the character set specified.
- ``None`` for non-encoded parts of the header.
2. A list of length 1 containing a pair ``(string, None)``, where
*string* is always an instance of :class:`str`.

An :exc:`email.errors.HeaderParseError` may be raised when
certain decoding errors occur (e.g. a base64 decoding exception).

Here are examples:

>>> from email.header import decode_header
>>> decode_header('=?iso-8859-1?q?p=F6stal?=')
[(b'p\xf6stal', 'iso-8859-1')]
>>> decode_header('unencoded_string')
[('unencoded_string', None)]
>>> decode_header('bar =?utf-8?B?ZsOzbw==?=')
[(b'bar ', None), (b'f\xc3\xb3o', 'utf-8')]

.. note::

This function exists for backwards compatibility only. For
new code we recommend using :func:`email.header.decode_header_to_string`.
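
The two return shapes described in the documentation above can be normalised by checking the type of each part. The following is a minimal sketch and not part of this PR: the helper name `decode_header_text_parts` is hypothetical, and treating parts without a declared charset as ASCII (with `errors='replace'`) is an assumption made for illustration.

```python
from email.header import decode_header


def decode_header_text_parts(value):
    """Return (text, charset) pairs where text is always a str.

    Hypothetical helper for illustration; not part of email.header.
    """
    parts = []
    for part, charset in decode_header(value):
        if isinstance(part, bytes):
            # Shape 1: bytes plus an optional charset name.
            # Assumption: parts with no declared charset are ASCII.
            part = part.decode(charset or 'ascii', errors='replace')
        # Shape 2: the single (str, None) pair passes through unchanged.
        parts.append((part, charset))
    return parts
```

A charset name for which Python has no codec would raise `LookupError` in this sketch, so real code would need to handle that case separately.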

Member

How about adding a link to the non-legacy API, or an example using that newer API?

Author

Do you mean the non-legacy API as described in https://docs.python.org/3/library/email.html, e.g. email.parser?

To my knowledge, there is not any function/method in that API which can be straightforwardly used instead of email.header.decode_header.

Member (@bitdancer, Feb 21, 2023)

>>> from email.headerregistry import HeaderRegistry
>>> decoder = HeaderRegistry()
>>> decoder('To', '=?utf-8?q?M=C3=A4x?= <foo@bar.com>')
'Mäx <foo@bar.com>'
>>> decoder('To', '=?utf-8?q?M=C3=A4x?= <foo@bar.com>').addresses
(Address(display_name='Mäx', username='foo', domain='bar.com'),)

You really don't want to use the legacy decode_header. It has many bugs that the new API fixes.

Member

Obviously this needs to be better documented...
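
For readers looking for the newer API mentioned in this thread, headers are also decoded automatically when a whole message is parsed with `email.policy.default`. The following is a minimal sketch of that route; the message text and the address in it are made up for illustration.

```python
from email import message_from_string
from email.policy import default

# Illustrative message with two RFC 2047 encoded headers.
raw = (
    "Subject: =?iso-8859-1?q?p=F6stal?=\n"
    "To: =?utf-8?q?M=C3=A4x?= <foo@example.com>\n"
    "\n"
    "body\n"
)

# With the default policy, header values come back as decoded strings.
msg = message_from_string(raw, policy=default)
subject = str(msg['Subject'])

# Address headers additionally expose structured Address objects.
first = msg['To'].addresses[0]
```

This only helps when a full message is available; for a lone header value, the `HeaderRegistry` call shown in the comment above is the closer equivalent.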

31 changes: 27 additions & 4 deletions Lib/email/header.py
@@ -61,16 +61,22 @@
def decode_header(header):
"""Decode a message header value without converting charset.

Returns a list of (string, charset) pairs containing each of the decoded
parts of the header. Charset is None for non-encoded parts of the header,
otherwise a lower-case string containing the name of the character set
specified in the encoded string.
For historical reasons, this function may return either:

1. A list of length 1 containing a pair (str, None).
2. A list of (bytes, charset) pairs containing each of the decoded
parts of the header. Charset is None for non-encoded parts of the header,
otherwise a lower-case string containing the name of the character set
specified in the encoded string.

header may be a string that may or may not contain RFC2047 encoded words,
or it may be a Header object.

An email.errors.HeaderParseError may be raised when certain decoding errors
occur (e.g. a base64 decoding exception).

This function exists for backwards compatibility only. For new code, we
recommend using decode_header_to_string instead.
"""
# If it is a Header object, we can just return the encoded chunks.
if hasattr(header, '_chunks'):
@@ -152,6 +158,23 @@ def decode_header(header):
return collapsed



def decode_header_to_string(header):
"""Decode a message header into a string.

header may be a string that may or may not contain RFC2047 encoded words,
or it may be a Header object; in the latter case, this is equivalent to
str(header).

An email.errors.HeaderParseError may be raised when certain decoding errors
occur (e.g. a base64 decoding exception).
"""

if not isinstance(header, Header):
header = make_header(decode_header(header))
return str(header)



def make_header(decoded_seq, maxlinelen=None, header_name=None,
continuation_ws=' '):
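
Since `decode_header_to_string` is only proposed here, code running on a Python without it could get the same behaviour from the existing `make_header` and `decode_header` functions, which is what the implementation above does. A minimal sketch, assuming a hypothetical helper name `decode_header_to_string_compat`:

```python
from email.header import Header, decode_header, make_header


def decode_header_to_string_compat(header):
    """Decode a header value to str using only existing email.header APIs.

    Hypothetical stand-in for the function proposed in this PR.
    """
    if isinstance(header, Header):
        # A Header object already knows how to render itself as text.
        return str(header)
    # decode_header() yields (bytes or str, charset) pairs;
    # make_header() accepts both shapes and rebuilds a Header.
    return str(make_header(decode_header(header)))
```

The same caveats as for `decode_header` apply, including the possibility of `email.errors.HeaderParseError` on malformed input.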
24 changes: 24 additions & 0 deletions Lib/test/test_email/test_email.py
@@ -18,6 +18,7 @@
import email.policy

from email.charset import Charset
from email.header import Header, decode_header, decode_header_to_string, make_header
from email.generator import Generator, DecodedGenerator, BytesGenerator
from email.header import Header, decode_header, make_header
from email.headerregistry import HeaderRegistry
@@ -2464,6 +2465,29 @@ def test_multiline_header(self):
self.assertEqual(str(make_header(decode_header(s))),
'"Müller T" <[email protected]>')

def test_unencoded_ascii(self):
# bpo-22833/gh-67022: returns [(str, None)] rather than [(bytes, None)]
s = 'header without encoded words'
self.assertEqual(decode_header(s),
[('header without encoded words', None)])

def test_unencoded_utf8(self):
# bpo-22833/gh-67022: returns [(str, None)] rather than [(bytes, None)]
s = 'header with unexpected non ASCII caract\xe8res'
self.assertEqual(decode_header(s),
[('header with unexpected non ASCII caract\xe8res', None)])

def test_decode_header_to_string_from_string(self):
s = '=?windows-1252?q?=22M=FCller_T=22?=\r\n <[email protected]>'
self.assertEqual(str(make_header(decode_header(s))),
decode_header_to_string(s))

def test_decode_header_to_string_from_header_obj(self):
s = '\xeatre'
h = Header(s)
self.assertEqual(str(h),
decode_header_to_string(h))


# Test the MIMEMessage class
class TestMIMEMessage(TestEmailBase):
@@ -0,0 +1,4 @@
The inconsistent return types of :func:`email.header.decode_header` are now documented.

:func:`email.header.decode_header_to_string` is provided as a less error-prone and
more straightforward alternative to it.