gh-83938, gh-122476: Stop incorrectly RFC 2047 encoding non-ASCII email addresses #122540

Open: medmunds wants to merge 6 commits into main from fix-issues-83938-122476

medmunds
Contributor

@medmunds medmunds commented Aug 1, 2024

This PR fixes gh-83938 and fixes gh-122476, which have the same underlying issue.

Email generators had been incorrectly flattening non-ASCII email addresses to RFC 2047 encoded-word format, leaving them undeliverable. (RFC 2047 prohibits use of encoded-word in an addr-spec.) This change raises a ValueError when attempting to flatten an EmailMessage with a non-ASCII addr-spec and a policy with utf8=False. (Exception: If the non-ASCII address originated from parsing a message, it will be flattened as originally parsed, without error.)
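
As a rough illustration of the behavior change (a sketch, not code from this PR): the hypothetical helper `try_flatten` below flattens a single To header with a utf8=False policy and reports what happened. On an interpreter without this change, a non-ASCII localpart is expected to come out as an RFC 2047 encoded-word; with this change, the same call should raise ValueError instead.

```python
from email import policy
from email.message import EmailMessage


def try_flatten(address):
    """Flatten a To header with a utf8=False policy and classify the result.

    Hypothetical helper for illustration only; not part of the email package.
    """
    msg = EmailMessage()
    msg["To"] = address
    try:
        text = msg.as_string(policy=policy.SMTP)  # policy.SMTP has utf8=False
    except ValueError:
        # Behavior proposed in this PR for a non-ASCII addr-spec.
        return "rejected"
    # "=?" marks the start of an RFC 2047 encoded-word.
    return "encoded-word" if "=?" in text else "plain"


ascii_result = try_flatten("jorg@example.com")
non_ascii_result = try_flatten("jörg@example.com")
print(ascii_result, non_ascii_result)
```

Which of the two non-ASCII outcomes you see depends on whether the interpreter includes this change.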

Non-ASCII email addresses are supported when using a policy with utf8=True (such as email.policy.SMTPUTF8) under RFCs 6531 and 6532.
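
For example, a minimal sketch of flattening with a utf8=True policy (the address and message content are made up for illustration). The address is written to the header as raw UTF-8 rather than as an encoded-word:

```python
from email import policy
from email.message import EmailMessage

msg = EmailMessage()
msg["To"] = "jörg@example.com"  # illustrative non-ASCII localpart
msg["Subject"] = "test"
msg.set_content("hello")

# policy.SMTPUTF8 is policy.SMTP with utf8=True, so headers are emitted as UTF-8.
data = msg.as_bytes(policy=policy.SMTPUTF8)
to_line = next(line for line in data.split(b"\r\n") if line.startswith(b"To:"))
print(to_line.decode("utf-8"))
```

Delivering such a message still requires that the receiving server advertise SMTPUTF8 support (RFC 6531).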

Non-ASCII email address domains (but not localparts) can also be used with non-SMTPUTF8 policies by encoding the domain as an IDNA A-label. (The email package does not perform this encoding, because it cannot know whether the caller wants IDNA 2003, IDNA 2008, or some other variant such as UTS-46.)
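
A minimal sketch of what a caller could do, assuming IDNA 2003 is acceptable: Python's built-in "idna" codec implements IDNA 2003 (RFC 3490), so IDNA 2008 or UTS-46 processing would need a different library. The helper name `encode_domain_idna2003` and the address are hypothetical.

```python
from email import policy
from email.message import EmailMessage


def encode_domain_idna2003(address):
    """Return the address with its domain converted to IDNA 2003 A-labels.

    Hypothetical helper: splits on the last "@" and leaves the localpart
    untouched, so a non-ASCII localpart would still need utf8=True.
    """
    localpart, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError(f"not an addr-spec: {address!r}")
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{localpart}@{ascii_domain}"


msg = EmailMessage()
msg["To"] = encode_domain_idna2003("info@bücher.example")
msg.set_content("hello")

# The address is now pure ASCII, so a utf8=False policy can flatten it.
flattened = msg.as_string(policy=policy.SMTP)
print(msg["To"])
```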


📚 Documentation preview 📚: https://cpython-previews--122540.org.readthedocs.build/

@medmunds medmunds requested a review from a team as a code owner August 1, 2024 00:35
@medmunds medmunds force-pushed the fix-issues-83938-122476 branch 2 times, most recently from d1f0bdc to 2e0696c Compare August 1, 2024 00:43
@medmunds
Contributor Author

medmunds commented Aug 1, 2024

This is based on a comment on #81074:

> we should probably be raising an error if the rendering policy does not have utf8=True and we don't have an "original source line" from parsing a message (which is the case here), rather than using the incorrect RFC2047 encoding.

Checking part.token_type == 'addr-spec' seemed like the simplest approach.

An alternative would be to introduce a new NonASCIIDomainLiteralDefect paralleling NonASCIILocalPartDefect, apply it in _header_value_parser.get_domain_literal(), and add NonASCIIAddrSpecDefect as a superclass of both. _refold_parse_tree() would then check any(isinstance(d, NonASCIIAddrSpecDefect) for d in part.all_defects) (and that check could perhaps move up with the other UnicodeEncodeError logic). (If we go this direction, PR #122477 will also need an update.)
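
To make the defect-based idea concrete, here is a sketch of the same kind of check using only the defect that exists today, NonASCIILocalPartDefect. The proposed NonASCIIDomainLiteralDefect and NonASCIIAddrSpecDefect classes do not exist, so they are not used, and the helper name `has_non_ascii_localpart` is hypothetical.

```python
from email.errors import NonASCIILocalPartDefect
from email.message import EmailMessage


def has_non_ascii_localpart(address):
    """Report whether parsing the address records NonASCIILocalPartDefect.

    Hypothetical helper: inspects the defects collected on the parsed
    To header, much as the alternative would inspect part.all_defects.
    """
    msg = EmailMessage()
    msg["To"] = address
    return any(isinstance(d, NonASCIILocalPartDefect) for d in msg["To"].defects)


print(has_non_ascii_localpart("jörg@example.com"))
print(has_non_ascii_localpart("jorg@example.com"))
```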

Also, I think charset == 'unknown-8bit' is only possible in _refold_parse_tree() when the non-ASCII characters resulted from parsing an existing message: see the UndecodableBytesDefect logic just above the new code. (The added tests seem to confirm this.)

@medmunds medmunds force-pushed the fix-issues-83938-122476 branch from 2e0696c to cbedf5d Compare August 1, 2024 01:11
medmunds added 2 commits July 31, 2024 18:35
Email generators had been incorrectly flattening non-ASCII email
addresses to RFC 2047 encoded-word format, leaving them undeliverable.
(RFC 2047 prohibits use of encoded-word in an addr-spec.)
This change raises a ValueError when attempting to flatten an
EmailMessage with a non-ASCII addr-spec and a policy with utf8=False.
(Exception: If the non-ASCII address originated from parsing a message,
it will be flattened as originally parsed, without error.)

Non-ASCII email addresses are supported when using a policy with
utf8=True (such as email.policy.SMTPUTF8) under RFCs 6531 and 6532.

Non-ASCII email address domains (but not localparts) can also be used
with non-SMTPUTF8 policies by encoding the domain as an IDNA A-label.
(The email package does not perform this encoding, because it cannot
know whether the caller wants IDNA 2003, IDNA 2008, or some other
variant such as UTS-46.)
@picnixz picnixz changed the title from "gh-83938: Stop incorrectly RFC 2047 encoding non-ASCII email addresses" to "gh-83938, gh-122476: Stop incorrectly RFC 2047 encoding non-ASCII email addresses" Dec 3, 2024
Member

@bitdancer bitdancer left a comment

Your analysis is solid and the fix looks great. We'll need a follow-on PR to have smtplib handle the new error, but that should be trivial.

@bedevere-app

bedevere-app bot commented Mar 31, 2025

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

And if you don't make the requested changes, you will be poked with soft cushions!

@medmunds medmunds force-pushed the fix-issues-83938-122476 branch from 5d60c1c to bd6845d Compare April 1, 2025 20:14
@medmunds
Contributor Author

medmunds commented Apr 1, 2025

I have made the requested changes; please review again

@bedevere-app

bedevere-app bot commented Apr 1, 2025

Thanks for making the requested changes!

@bitdancer: please review the changes made to this pull request.

@bedevere-app bedevere-app bot requested a review from bitdancer April 1, 2025 20:26
Development

Successfully merging this pull request may close these issues.

EmailMessage bad encoding for non-ASCII localpart
EmailMessage bad encoding for international domain
3 participants