GH-115512: Optimize peak memory usage and runtime for large emails #132709

Open. Wants to merge 8 commits into base branch main.
Conversation

@JAJames commented Apr 18, 2025

GH-115512: email.message_from_bytes heavy memory use

Note: Throughout this description, time taken, peak overhead, and overhead ratio refer to the similarly named variables in the snippet below:

```python
import time
import tracemalloc
import email.policy
from email import message_from_bytes

# msg_bytes holds the raw message being measured

# Call message_from_bytes, gathering some memory usage stats in the process
tracemalloc.start()
start_time = time.perf_counter()
msg = message_from_bytes(msg_bytes, policy=email.policy.default)
time_taken = time.perf_counter() - start_time
after_bytes, after_peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

# "How many bytes did we allocate, that were ultimately discarded?"
peak_overhead = after_peak_bytes - after_bytes

# "How large was that overhead, relative to the size of the message?"
overhead_ratio = peak_overhead / len(msg_bytes) if len(msg_bytes) > 0 else None
```

Changes:

  • Removed a full copy caused by text.decode in parser.BytesParser.parsebytes, by decoding one chunk of bytes at a time instead.
  • Removed a full copy (with 4x expansion on ASCII text) caused by StringIO use in parser.Parser.parsestr, by slicing into the string instead. This also benefits parser.BytesParser.parsebytes.
  • Removed a circumstantial full copy (with 4x expansion on ASCII text) caused by StringIO use in feedparser.BufferedSubFile, by reverting to a list while retaining the universal-newline behavior exhibited by StringIO.
  • Added a runtime optimization that avoids expensive line-by-line processing when consuming a blob (i.e. a message/attachment body) rather than lines. The optimization covers both single-part messages (blob until end-of-file) and multipart messages (blob until end-of-part). For single-part messages, it dumps every chunk fed into BufferedSubFile. For multipart messages, it dumps every chunk lacking any potential boundary (i.e. no "-" character), as well as every boundary-free line, up until the boundary we are looking for as indicated by _eofstack. Without this change, the changes above would have introduced a noticeable runtime regression; with it, runtime performance is significantly improved.
  • Added tests asserting that memory overhead (that is, wasted peak memory) does not exceed 1.05x the size of the underlying email for large (10 MiB) emails.
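To illustrate the chunked-decoding idea from the first bullet, here is a minimal sketch (not the actual patch; `iter_decoded_chunks` and `chunk_size` are hypothetical names) of how an incremental decoder avoids holding a second full-size copy of the text:

```python
import codecs

def iter_decoded_chunks(data, encoding="ascii", chunk_size=8192):
    # Decode the input one slice at a time, so peak memory holds at most one
    # extra chunk-sized str rather than a full-size decoded copy.
    decoder = codecs.getincrementaldecoder(encoding)(errors="surrogateescape")
    for start in range(0, len(data), chunk_size):
        yield decoder.decode(data[start:start + chunk_size])
    # Flush any bytes the decoder buffered at a chunk boundary.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

Joining the yielded chunks reproduces the same string a single `data.decode(encoding)` call would have produced, without the transient full-size allocation.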

Benchmarking

As part of internal testing, I benchmarked these changes by directly measuring the time to parse ~907k email files using message_from_bytes. For each blob, a script called email.message_from_bytes, measuring memory usage with tracemalloc and time taken with time.perf_counter(), and then performed the same call and measurements using a fork of the email library that at the time included only these changes. It then deep-compared the output of each, to validate that they are exactly equal.
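The per-message measurement described above can be sketched roughly as follows (`measure_parse` is a hypothetical helper; the author's actual benchmark script has not been published):

```python
import time
import tracemalloc
import email.policy
from email import message_from_bytes

def measure_parse(msg_bytes):
    # Parse one raw message, returning the parsed object plus
    # (time_taken, peak_overhead, overhead_ratio) as defined earlier.
    tracemalloc.start()
    start_time = time.perf_counter()
    msg = message_from_bytes(msg_bytes, policy=email.policy.default)
    time_taken = time.perf_counter() - start_time
    after_bytes, after_peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    peak_overhead = after_peak_bytes - after_bytes
    overhead_ratio = peak_overhead / len(msg_bytes) if msg_bytes else None
    return msg, time_taken, peak_overhead, overhead_ratio
```

Running this once per message against both the stock parser and the patched fork yields the per-message overhead and timing figures aggregated below.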

General information:

  • Number of emails benchmarked against: 907,274
  • Total bytes parsed: 44,643,207,581 bytes
  • Average bytes: 49,205.87 bytes

Without the changes, these were the stats of the Python 3.12.9 email parser:

  • Total overhead: 322,356,018,510 bytes
  • Minimum overhead: 2,768 bytes
  • Maximum overhead: 3,244,859,393 bytes
  • Average overhead: 355,301.726 bytes
  • Average overhead ratio: 7.22x

Time stats:

  • Total time taken: 5,120.472s
  • Minimum time taken: 0.000174s
  • Maximum time taken: 36.738s
  • Average time taken: 0.00564s

With the changes, these were the stats of this email parser when using Python 3.12.9:

  • Total overhead: 74,988,772,312 bytes (76.737% decrease)
  • Minimum overhead: 3,464 bytes (25.144% increase, but this seems negligible since it's a minimum)
  • Maximum overhead: 816,899,342 bytes (74.824% decrease)
  • Average overhead: 82,651.839 bytes (76.737% decrease)
  • Average overhead ratio: 1.679x (76.745% decrease)

Time stats:

  • Total time taken: 1,780.947s (65.219% decrease)
  • Minimum time taken: 0.000134s (22.988% decrease)
  • Maximum time taken: 10.3979s (71.697% decrease)
  • Average time taken: 0.00196s (65.248% decrease)

Focusing on the totals, this represents:

  • A 76.737% decrease in memory overhead
  • A 65.219% decrease in time taken
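The headline percentages follow directly from the totals in the tables above:

```python
# Totals reported in the benchmark tables above
overhead_before = 322_356_018_510  # bytes, Python 3.12.9 without the changes
overhead_after = 74_988_772_312    # bytes, with the changes
time_before = 5120.472             # seconds, without the changes
time_after = 1780.947              # seconds, with the changes

mem_decrease = (overhead_before - overhead_after) / overhead_before
time_decrease = (time_before - time_after) / time_before

print(f"{mem_decrease:.3%}")   # 76.737%
print(f"{time_decrease:.3%}")  # 65.219%
```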

@JAJames JAJames requested a review from a team as a code owner April 18, 2025 21:33
python-cla-bot bot commented Apr 18, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

bedevere-app bot commented Apr 18, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@picnixz (Member) left a comment

Before reviewing this PR, please

  • Remove all type annotations. They are left to https://github.com/python/typeshed.
  • Wrap all lines under 80 characters.
  • Avoid comments that state what the code does. There are some trivial comments here and there. Some are not very useful.
  • Do not test statistical overheads. They depend entirely on the host machine and other parameters that are hard to guarantee. We only test functionality; we don't necessarily want to test that X or Y takes more or less time or memory than some threshold.
  • If possible, make smaller PRs, targeting either time or memory improvements, and if possible, only one function or method at a time.

Note: it'd be a good idea to provide the full benchmarking script so that others can also verify the results.

bedevere-app bot commented Apr 18, 2025

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests, along with any other requests in other reviews from core developers, that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@JAJames (Author) commented Apr 19, 2025

I have made the requested changes; please review again

Re: Failing build check: This looks unrelated to my changes, and was not failing in previous commits.

Re: Testing for memory usage:

I removed the memory usage tests, but I do think there's value in testing something of that nature in an automated way. Memory usage tests are easier to make consistent than time-based tests. Perhaps some subset of tests could run only in a controlled environment (e.g. the Ubuntu tests check) and be skipped otherwise, or perhaps this merits its own separate test suite. In general, though, the removed tests were intentionally written to be as close to deterministic as possible, and repeatedly running them on the same machine consistently produced the same results, as best I could tell. I understand if that's entirely out of scope for this issue and PR, though.
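One way such a test could be made more repeatable is sketched below (`min_overhead_ratio` and the repeat count are illustrative, and any fixed threshold like the removed tests' 1.05x would only hold with the patch applied):

```python
import tracemalloc
import email.policy
from email import message_from_bytes

def min_overhead_ratio(msg_bytes, repeats=3):
    # tracemalloc-based peak overhead is close to deterministic for a given
    # interpreter build; taking the minimum over a few runs further damps
    # one-off allocations (e.g. caches warmed during the first parse).
    ratios = []
    for _ in range(repeats):
        tracemalloc.start()
        message_from_bytes(msg_bytes, policy=email.policy.default)
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        ratios.append((peak - current) / len(msg_bytes))
    return min(ratios)
```

A test could then assert `min_overhead_ratio(large_msg_bytes)` stays under a chosen ceiling, running only in a controlled CI environment.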

Re: Splitting the PR:

The parser.py changes are entirely separable, but I don't think the changes within each file are easily separable. If that's sufficient, I can split the PR into two PRs, with the parser.py changes standing on their own.

Re: Benchmark script:

I'll have to look into releasing some variation of the benchmark script. It may take a fair bit of time (at least a week), and it's unfortunately not a zero-effort endeavor. It hadn't occurred to me that it might be helpful to include. Let me follow up on this.

bedevere-app bot commented Apr 19, 2025

Thanks for making the requested changes!

@picnixz: please review the changes made to this pull request.

@bedevere-app bedevere-app bot requested a review from picnixz April 19, 2025 18:35
Review comment on the following diff in email/feedparser.py:

```diff
 for line in self._input:
     if line is NeedMoreData:
         yield NeedMoreData
         continue
-self._cur.epilogue = EMPTYSTRING.join(epilogue)
+self._cur.epilogue = ''
```
@JAJames (Author):
Note: I kept the behavior here the same, but I'm not actually certain whether it's correct. The previous code appeared as though it intended to assign the remainder of the message to the epilogue, but then did not.

So, this data is discarded. This is the only place where we discard the rest of the message like this. It's out of scope for this PR, and it's very narrow in scope, but it's interesting.
