GH-115512: Optimize peak memory usage and runtime for large emails #132709
Conversation
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the `skip news` label instead.
Before reviewing this PR, please:

- Remove all type annotations. They are left to https://github.com/python/typeshed.
- Wrap all lines under 80 characters.
- Avoid comments that merely restate what the code does; some of the trivial comments here and there are not very useful.
- Do not test statistical overheads. They depend entirely on the host machine and other parameters that are hard to guarantee. We only test functionality; we don't want to assert that X or Y takes more or less time than some bound.
- If possible, make smaller PRs, targeting either time or memory improvements and, if possible, only one function or method at a time.

Note: it'd be a good idea to provide the full benchmarking script so that others can also verify the results.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase: `I have made the requested changes; please review again`.
I have made the requested changes; please review again.

Re: Failing build check: This looks unrelated to my changes, and was not failing in previous commits.

Re: Testing for memory usage: I removed the memory usage tests, but I do think there's some value in testing something of that nature in an automated way. Memory usage tests are easier to get consistent results from than time-based tests. Maybe some subset of tests could be run in a controlled environment (i.e. the Ubuntu tests check) and skipped otherwise; maybe it merits its own separate test suite. In general, though, the way it was being tested was intentionally as close to deterministic as possible: repeatedly running the same tests on the same machine seemed to consistently produce the same results, as best I could tell. I understand if that's entirely out of the scope of this issue & PR, though.

Re: Splitting the PR: The

Re: Benchmark script: I'll have to look into releasing some variation of the benchmark script. It may take a fair bit of time (at least a week), and it's unfortunately not a zero-effort endeavor. It hadn't occurred to me that it might be helpful to include. Let me follow up on this.
Thanks for making the requested changes! @picnixz: please review the changes made to this pull request.
```diff
 for line in self._input:
     if line is NeedMoreData:
         yield NeedMoreData
         continue
-self._cur.epilogue = EMPTYSTRING.join(epilogue)
+self._cur.epilogue = ''
```
Note: I kept the behavior here the same, but I'm not actually certain whether it's correct. The previous code appeared as though it intended to assign the remainder of the message to the epilogue, but then did not.
So, this data is discarded. This is the only place where we discard the rest of the message like this. It's out of the scope of this PR, and it's very narrow in scope, but it's interesting.
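For context, here is a minimal illustration of the `epilogue` attribute under discussion, using a toy input of my own (not from the PR). For a simple top-level multipart, text after the closing boundary is kept on `msg.epilogue`; the code path in the diff above is one where trailing input is instead consumed without being stored.

```python
import email

# Toy multipart message; "trailing text" appears after the closing boundary.
raw = (
    'Content-Type: multipart/mixed; boundary="XXX"\n'
    '\n'
    'preamble\n'
    '--XXX\n'
    '\n'
    'part one\n'
    '--XXX--\n'
    'trailing text\n'
)
msg = email.message_from_string(raw)
print(repr(msg.preamble))   # the text before the first boundary
print(repr(msg.epilogue))   # the text after the closing boundary
```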
GH-115512: email.message_from_bytes heavy memory use
Note: In the rest of this, `time taken`, `peak overhead`, and `overhead ratio` refer to the similarly named variables in the below snippet.
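The snippet itself is not reproduced here; the following is a sketch of what it plausibly computed, based on the Benchmarking section below (`tracemalloc` for memory, `time.perf_counter()` for time). The exact definitions of `peak_overhead` and `overhead_ratio` are my assumptions.

```python
import email
import time
import tracemalloc

def measure(blob):
    # `blob` is allocated before tracing starts, so the traced peak counts
    # only memory allocated while parsing.
    tracemalloc.start()
    start = time.perf_counter()
    msg = email.message_from_bytes(blob)
    time_taken = time.perf_counter() - start
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    peak_overhead = peak                         # bytes allocated during the parse
    overhead_ratio = peak_overhead / len(blob)   # relative to the input size
    return msg, time_taken, peak_overhead, overhead_ratio
```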
Changes:

- Removed the `text.decode` call in `parser.BytesParser.parsebytes`, by decoding one chunk of bytes at a time instead.
- Removed the `StringIO` use in `parser.Parser.parsestr`, slicing into the string instead. This also impacted `parser.BytesParser.parsebytes`.
- Removed the `StringIO` use in `feedparser.BufferedSubFile`, by reverting back to using a list, while still retaining the universal newline behavior exhibited by `StringIO`.
- Changed `BufferedSubFile` so that, for multipart messages, it dumps every chunk lacking any potential boundary (i.e. no `-` character), as well as every line lacking any boundary, up until the boundary we are looking for as indicated by `_eofstack`; see the sketch after this list. Without this change, the above changes would have introduced a noticeable runtime performance regression. With this change, runtime performance is significantly improved.
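A simplified sketch of the chunk/line "dumping" idea from the last bullet; this is an assumed shape, not the PR's actual `BufferedSubFile` code, and `iter_body_lines`/`boundary_re` are hypothetical names. The key observation: a boundary line always contains `-`, so a chunk containing no `-` at all cannot hold a boundary and can be passed through without per-line matching.

```python
import re

def iter_body_lines(chunks, boundary_re):
    """Yield body lines until a boundary match; dump boundary-free chunks."""
    tail = ''  # unterminated line carried over between chunks
    for chunk in chunks:
        chunk = tail + chunk
        tail = ''
        lines = chunk.splitlines(keepends=True)
        if lines and not lines[-1].endswith('\n'):
            tail = lines.pop()  # may be completed by the next chunk
        if '-' not in chunk:
            # Fast path: no '-' anywhere, so no line here can be a
            # boundary like '--frontier'; emit everything unchecked.
            yield from lines
            continue
        # Slow path: check each complete line against the boundary pattern.
        for line in lines:
            if boundary_re.match(line):
                return  # caller takes over at the boundary
            yield line
    if tail:
        yield tail

# Example: stream a body that arrives split at arbitrary offsets.
boundary_re = re.compile(r'--frontier(--)?\s*$')
chunks = ['lots of body\ntext with no das', 'hes\n--fron', 'tier--\n']
print(list(iter_body_lines(chunks, boundary_re)))
# -> ['lots of body\n', 'text with no dashes\n']
```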
Benchmarking

As part of internal testing, I performed some benchmarking by directly measuring the time to parse ~907k email files using `message_from_bytes`. For each blob, a script called `email.message_from_bytes`, measured the memory usage using `tracemalloc` as well as the time taken using `time.perf_counter()`, and then did the same function call and measurements using a fork of the email library which at the time included only these changes. It then deep-compared the output of each, to validate that they're exactly equal.
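The comparison code is not shown in the PR; one plausible shape for "deep-compared the output" over `email.message.Message` trees (with `messages_equal` a hypothetical helper) would be:

```python
def messages_equal(a, b):
    """Recursively compare two parsed Message trees (hypothetical helper)."""
    if a.items() != b.items():                 # headers, order-sensitive
        return False
    if (a.preamble, a.epilogue) != (b.preamble, b.epilogue):
        return False
    if a.is_multipart() != b.is_multipart():
        return False
    if a.is_multipart():
        pa, pb = a.get_payload(), b.get_payload()
        return len(pa) == len(pb) and all(
            messages_equal(x, y) for x, y in zip(pa, pb))
    return a.get_payload() == b.get_payload()  # raw (undecoded) body text
    # (one could also compare a.defects against b.defects)
```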
General information:

Without the changes, these were the stats of the Python 3.12.9 email parser:
Time stats:
With the changes, these were the stats of this email parser when using Python 3.12.9:
Time stats:
Focusing in on the totals, this represents: