Feat/remove reference of PageLayout.elements (#3943)
This PR removes usage of `PageLayout.elements` from the partition function,
except when `analysis=True`. The partition logic now uses
`PageLayout.elements_array` everywhere else to save memory and
CPU cost.
Since the analysis function is intended for investigation and not for
general document processing purposes, this part of the code is left for
a future refactor.
`PageLayout.elements` uses a list to store layout elements' data, while
`elements_array` stores the data in a `numpy` array, which has much
lower memory requirements. In tests with `memory_profiler`, the
difference is usually around 10x.
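To illustrate why an array-backed layout is smaller than a list of per-element objects, here is a minimal sketch. It is not the code from this PR: `BoxElement` is a hypothetical stand-in for a layout element, the field set (four coordinates plus a probability) is an assumption, and the standard-library `tracemalloc` is used in place of `memory_profiler` so the example has no extra dependency.

```python
import tracemalloc
from dataclasses import dataclass

import numpy as np


@dataclass
class BoxElement:
    """Hypothetical per-element object; not the real layout element class."""

    x1: float
    y1: float
    x2: float
    y2: float
    prob: float


def measure_list_bytes(n: int) -> int:
    """Bytes allocated while a list of n BoxElement objects is alive."""
    tracemalloc.start()
    elements = [
        BoxElement(float(i), float(i), float(i + 1), float(i + 1), 0.5)
        for i in range(n)
    ]
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del elements
    return current


def measure_array_bytes(n: int) -> int:
    """Bytes held by the same data stored as two float32 arrays."""
    coords = np.zeros((n, 4), dtype=np.float32)  # x1, y1, x2, y2 per row
    probs = np.full(n, 0.5, dtype=np.float32)    # one probability per row
    return coords.nbytes + probs.nbytes


n = 50_000
list_bytes = measure_list_bytes(n)
array_bytes = measure_array_bytes(n)
print(f"list of objects: {list_bytes / 1e6:.1f} MB")
print(f"numpy arrays:    {array_bytes / 1e6:.1f} MB")
```

The exact ratio depends on the Python version and on which fields each element carries, so the numbers printed here should be read as an order-of-magnitude illustration rather than a reproduction of the figure quoted above.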
CHANGELOG.md (+5 -1)
@@ -1,10 +1,13 @@
-## 0.16.26-dev3
+## 0.17.0

 ### Enhancements

 - **Add support for images in html partitioner** `<img>` tags will now be parsed as `Image` elements. When `extract_image_block_types` includes `Image` and `extract_image_block_to_payload`=True then the `image_base64` will be included for images that specify the base64 data (rather than url) as the source.
+
 - **Use kwargs instead of env to specify `ocr_agent` and `table_ocr_agent`** for `hi_res` strategy.

+- **stop using `PageLayout.elements` to save memory and cpu cost**. Now only use `PageLayout.elements_array` throughout the partition, except when `analysis=True` where the drawing logic still uses `elements`.
+
 ### Features

 ### Fixes
@@ -28,6 +31,7 @@
 in unstructured and `register_partitioner` to enable registering your own partitioner for any file type.

 - **`extract_image_block_types` now also works for CamelCase elemenet type names**. Previously `NarrativeText` and similar CamelCase element types can't be extracted using the mentioned parameter in `partition`. Now figures for those elements can be extracted like `Image` and `Table` elements
+
 - **use block matrix to reduce peak memory usage for pdf/image partition**.