⚡️ Speed up method ElementHtml._get_children_html by 234%
#4087
Here is a **faster rewrite** of your program, based on your line profiling results, the imported code constraints, and the code logic.

### Key optimizations

- **Avoid repeated parsing:** The hotspot is in recursive calls to `child.get_html_element(**kwargs)`, each of which is re-creating a new `BeautifulSoup` object in every call. Solution: **Pass down and reuse a single `BeautifulSoup` instance** when building child HTML elements.
- **Minimize object creation:** Create `soup` once at the *topmost* call and reuse for all children and subchildren.
- **Reduce .get_text_as_html use:** Optimize to only use the soup instance when really necessary and avoid repeated blank parses.
- **Avoid double wrapping:** Only allocate wrappers and new tags if absolutely required.
- **General micro-optimizations:** Use `None` instead of `or []`, fast-path checks on empty children, etc.
- **Preserve all comments and signatures as specified.**

Below is the optimized version.

### Explanation of improvements

- **Soup passing**: The `get_html_element` method now optionally receives a `_soup` kwarg. At the top of the tree, it is `None`, so a new one is created. Then, for all descendants, the same `soup` instance is passed via `_soup`, avoiding repeated parsing and allocation.
- **Children check**: `self.children` is checked once, and the attribute itself is kept as a list (not or-ed with empty list at every call).
- **No unnecessary soup parsing**: `get_text_as_html()` doesn't need a soup argument, since it only returns a Tag (from the parent module).
- **No changes to existing comments, new comments added only where logic was changed.**
- **Behavior (output and signature) preserved.**

This **avoids creating thousands of BeautifulSoup objects recursively**, which was the primary bottleneck found in the profiler. The result is vastly improved performance, especially for large/complex trees.
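The soup-passing idea described above can be sketched in isolation. This is a minimal illustration, not the code from this PR: `TagFactory` is a hypothetical stand-in for `BeautifulSoup` (so the sketch runs without third-party packages), `Node` is a simplified stand-in for `ElementHtml`, and the `_factory` keyword plays the role of the `_soup` kwarg.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


class TagFactory:
    """Hypothetical stand-in for an expensive-to-build object such as BeautifulSoup."""

    instances_created = 0  # class-level counter, for illustration only

    def __init__(self) -> None:
        TagFactory.instances_created += 1

    def new_tag(self, name: str, text: str = "") -> dict:
        # A plain dict stands in for a real tag object.
        return {"name": name, "text": text, "children": []}


@dataclass
class Node:
    """Simplified stand-in for ElementHtml; not the real class."""

    tag: str
    text: str = ""
    children: list[Node] = field(default_factory=list)

    def get_html_element(self, _factory: Optional[TagFactory] = None) -> dict:
        # Build the factory only at the topmost call; descendants receive it.
        factory = _factory if _factory is not None else TagFactory()
        element = factory.new_tag(self.tag, self.text)
        # Fast path: leaf nodes skip the recursion entirely.
        if self.children:
            element["children"] = [
                child.get_html_element(_factory=factory) for child in self.children
            ]
        return element


tree = Node("div", children=[Node("p", "a"), Node("ul", children=[Node("li", "b")])])
html = tree.get_html_element()
```

In this sketch the whole tree is rendered with a single `TagFactory`, whereas constructing one inside every `get_html_element` call would build one per node. Keeping the parameter optional and defaulting to `None` leaves the call signature unchanged for existing callers.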
Looks good, but CI is throwing an error because the changelog was not updated. Please fix.
@mpolomdeepsense I have modified the changelog, it should be ready to merge. I noticed there seems to be an issue in

@mpolomdeepsense ready to merge
📄 234% (2.34x) speedup for `ElementHtml._get_children_html` in `unstructured/partition/html/convert.py`

⏱️ Runtime: `12.3 milliseconds` → `3.69 milliseconds` (best of `101` runs)

📝 Explanation and details
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-ElementHtml._get_children_html-mcsd67co` and push.