
Commit d3b2981

fix: typeerror when using chipper (#311)
This PR resolves #310.

- Chipper, or any page layout extracted with an element extraction model, does not have key attributes like `image_metadata` populated. This leads to `None` values for image width and height, which cause the bug.
- This fix prevents the function from returning early after chipper finds the elements; it continues the logic so that the other key attributes of the page are filled.
- As a bonus, the fix removes the image data from the page (which is not needed downstream) for chipper-generated pages.

## Test

A unit test is modified to test all the routes for page layout, including the one that uses an element extraction model.

Additionally, grab the attached PDF and run partition using chipper: the main branch leads to a type error, while this fix runs without error.

[005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/13731533/005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf)
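The failure mode described above can be illustrated with a minimal sketch. It is not the library's code: `scale_to_width` and its parameters are hypothetical stand-ins for any downstream step that does arithmetic on the image width, and the dictionary keys follow the `width`/`height` metadata mentioned in the description.

```python
from typing import Optional


def scale_to_width(image_metadata: Optional[dict], target_width: int = 1000) -> float:
    """Hypothetical downstream step that needs the page's image width."""
    # When metadata was never populated, the width falls back to None.
    width = image_metadata["width"] if image_metadata else None
    # Dividing by None raises TypeError, which is the kind of error the PR describes.
    return target_width / width


# Metadata populated (the behaviour after the fix): the computation works.
print(scale_to_width({"width": 500, "height": 700}))

# Metadata missing (the behaviour before the fix): arithmetic on None fails.
try:
    scale_to_width(None)
except TypeError as err:
    print(f"TypeError: {err}")
```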
1 parent 8b33f14 commit d3b2981

4 files changed: +18 -6 lines changed


CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+## 0.7.21
+
+* fix: fix a bug where chipper, or any element extraction model based `PageLayout` object, lack `image_metadata` and other attributes that are required for downstream processing; this fix also reduces the memory overhead of using chipper model
+
 ## 0.7.20
 
 * chipper-v3: improved table prediction

test_unstructured_inference/inference/test_layout.py

Lines changed: 11 additions & 3 deletions
@@ -9,7 +9,10 @@
 
 import unstructured_inference.models.base as models
 from unstructured_inference.inference import elements, layout, layoutelement
-from unstructured_inference.inference.elements import EmbeddedTextRegion, ImageTextRegion
+from unstructured_inference.inference.elements import (
+    EmbeddedTextRegion,
+    ImageTextRegion,
+)
 from unstructured_inference.models.unstructuredmodel import (
     UnstructuredElementExtractionModel,
     UnstructuredObjectDetectionModel,
@@ -271,12 +274,14 @@ def filter_by(self, *args, **kwargs):
     return MockLayout()
 
 
+@pytest.mark.parametrize("element_extraction_model", [None, "foo"])
 @pytest.mark.parametrize("filetype", ["png", "jpg", "tiff"])
-def test_from_image_file(monkeypatch, mock_final_layout, filetype):
+def test_from_image_file(monkeypatch, mock_final_layout, filetype, element_extraction_model):
     def mock_get_elements(self, *args, **kwargs):
         self.elements = [mock_final_layout]
 
     monkeypatch.setattr(layout.PageLayout, "get_elements_with_detection_model", mock_get_elements)
+    monkeypatch.setattr(layout.PageLayout, "get_elements_using_image_extraction", mock_get_elements)
     filename = f"sample-docs/loremipsum.{filetype}"
     image = Image.open(filename)
     image_metadata = {
@@ -285,7 +290,10 @@ def mock_get_elements(self, *args, **kwargs):
         "height": image.height,
     }
 
-    doc = layout.DocumentLayout.from_image_file(filename)
+    doc = layout.DocumentLayout.from_image_file(
+        filename,
+        element_extraction_model=element_extraction_model,
+    )
     page = doc.pages[0]
     assert page.elements[0] == mock_final_layout
     assert page.image is None
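
Stacking two `pytest.mark.parametrize` decorators, as this test now does, makes pytest run the test once for every combination of the two parameter lists. The sketch below enumerates those combinations with `itertools.product` instead of invoking pytest, so it runs on its own; the route labels are descriptive names chosen for illustration and are not identifiers from the library.

```python
from itertools import product

# Parameter lists copied from the two decorators on the test.
element_extraction_models = [None, "foo"]
filetypes = ["png", "jpg", "tiff"]

# Each (model, filetype) pair corresponds to one test invocation.
cases = list(product(element_extraction_models, filetypes))

for model, filetype in cases:
    # A non-None model selects the element extraction route in the test.
    route = "element extraction" if model is not None else "detection"
    print(f"sample-docs/loremipsum.{filetype}: {route} route")

print(f"{len(cases)} test cases in total")
```

With two models and three file types the test body runs six times, so both the detection route and the element extraction route are exercised for every sample image.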
Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.7.20" # pragma: no cover
+__version__ = "0.7.21" # pragma: no cover

unstructured_inference/inference/layout.py

Lines changed: 2 additions & 2 deletions
@@ -322,10 +322,10 @@ def from_image(
             detection_model=detection_model,
             element_extraction_model=element_extraction_model,
         )
+        # FIXME (yao): refactor the other methods so they all return elements like the third route
         if page.element_extraction_model is not None:
             page.get_elements_using_image_extraction()
-            return page
-        if fixed_layout is None:
+        elif fixed_layout is None:
             page.get_elements_with_detection_model()
         else:
             page.elements = page.get_elements_from_layout(fixed_layout)
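
The effect of replacing the early `return` with an `elif` can be sketched in isolation. This is a simplified model under stated assumptions, not the library's implementation: `SketchPage` and `build_page` are hypothetical names, the element lists are placeholders, and the shared tail (recording metadata and dropping the image) is inferred from the commit description and the test's `page.image is None` assertion.

```python
from typing import List, Optional


class SketchPage:
    """Hypothetical stand-in for a page layout object; not the library's PageLayout."""

    def __init__(self, image: str, extraction_model: Optional[str] = None):
        self.image: Optional[str] = image
        self.extraction_model = extraction_model
        self.elements: List[str] = []
        self.image_metadata: Optional[dict] = None


def build_page(image, extraction_model=None, fixed_layout=None):
    page = SketchPage(image, extraction_model)
    # Exactly one of the three routes fills page.elements; none returns early.
    if page.extraction_model is not None:
        page.elements = ["from-extraction-model"]
    elif fixed_layout is None:
        page.elements = ["from-detection-model"]
    else:
        page.elements = list(fixed_layout)
    # Shared tail (assumed): every route records metadata and drops the image.
    page.image_metadata = {"width": 800, "height": 600}
    page.image = None
    return page


page = build_page("raw-image", extraction_model="foo")
print(page.elements, page.image_metadata, page.image)
```

Because all three branches fall through to the same tail, a page built through the extraction route ends up with the same attributes as the other two, which is the property the parametrized unit test checks.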
