-
Notifications
You must be signed in to change notification settings - Fork 979
Feat/remove reference of PageLayout.elements #3943
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/remove reference of PageLayout.elements #3943
Conversation
…-of-pagelayout-elements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Attaching the reports for visibility
reports.zip
for page in final_document_layout.pages: | ||
for el in page.elements: | ||
el.text = el.text or "" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those lines are weird so I am happy they are removed. But do you have some insights on why they were added in the first place? Removing them should not break anything right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is when and element's text can be None
. But vectorized data structure cleaned that upstream so at this point there is no need for those lines
|
||
### Enhancements | ||
|
||
- **`extract_image_block_types` now also works for CamelCase elemenet type names**. Previously `NarrativeText` and similar CamelCase element types can't be extracted using the mentioned parameter in `partition`. Now figures for those elements can be extracted like `Image` and `Table` elements | ||
- **stop using `PageLayout.elements` to save memory and cpu cost**. Now only use `PageLayout.elements_array` throughout the partition, except when `analysis=True` where the drawing logic still uses `elements`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you recall any downstream tasks that can still be using PageLayout.elements and would need some refactor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing in the unstructured lib. If user still want elements
they can still get it (it is a cached property so will be computed if not already when getting that attribute)
This PR removes usage of
PageLayout.elements
from partition function, except for whenanalysis=True
. This PR updates the partition logic so thatPageLayout.elements_array
is used everywhere to save memory and cpu cost.Since the analysis function is intended for investigation and not for general document processing purposes, this part of the code is left for a future refactor.
PageLayout.elements
uses a list to store layout elements' data whileelements_array
usesnumpy
array to store the data, which has much lower memory requirements. Usingmemory_profiler
to test the differences is usually around 10x.