
Commit c305d10

Fix PDFMiner bug (#253)
Issue: In some installations PDFMiner identifies an image document as a full page, while in others it does not, and it is difficult to determine when PDFMiner behaves one way or the other. In both cases tested the version was `pdfminer.six v20221105`. The solution is to ignore any annotation coming from Chipper when the full-page clearing code is activated. It is not clear whether this is relevant to other models.

Co-authored-by: Antonio Jimeno Yepes <[email protected]>
1 parent: 2493089

File tree

4 files changed (+9, -2 lines)

CHANGELOG.md

Lines changed: 3 additions & 1 deletion
@@ -1,5 +1,7 @@
-## 0.7.4-dev0
+## 0.7.4-dev1
 
+* Fixed bug when PDFMiner predicts that an image text occupies the full page and removes annotations by Chipper.
+* Added random seed to Chipper text generation to avoid differences between calls to Chipper.
 * Allows user to use super-gradients model if they have a callback predict function, a yaml file with names field corresponding to classes and a path to the model weights
 
 ## 0.7.3
unstructured_inference/__version__.py

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-__version__ = "0.7.4-dev0" # pragma: no cover
+__version__ = "0.7.4-dev1" # pragma: no cover

unstructured_inference/inference/layoutelement.py

Lines changed: 3 additions & 0 deletions

@@ -125,6 +125,9 @@ def merge_inferred_layout_with_extracted_layout(
             continue
         region_matched = False
         for inferred_region in inferred_layout:
+            if inferred_region.source in (Source.CHIPPER, Source.CHIPPERV1):
+                continue
+
             if inferred_region.bbox.intersects(extracted_region.bbox):
                 same_bbox = region_bounding_boxes_are_almost_the_same(
                     inferred_region.bbox,
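For readers outside the codebase, here is a minimal, self-contained sketch of the skip logic added above. The `Source` enum, `Region` dataclass, and `mergeable_regions` helper below are hypothetical stand-ins, not the library's real classes (in the actual code the check runs inline in `merge_inferred_layout_with_extracted_layout`); only the membership test on `source` mirrors the diff.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):  # hypothetical stand-in for the library's Source enum
    CHIPPER = "chipper"
    CHIPPERV1 = "chipperv1"
    DETECTRON2 = "detectron2"


@dataclass
class Region:  # hypothetical, simplified inferred region
    label: str
    source: Source


def mergeable_regions(inferred_layout):
    """Yield only the inferred regions the extracted-layout merge may touch.

    Chipper annotations are skipped, so a PDFMiner region that covers the
    full page can no longer clear them during the merge.
    """
    for region in inferred_layout:
        if region.source in (Source.CHIPPER, Source.CHIPPERV1):
            continue
        yield region


layout = [Region("Title", Source.CHIPPER), Region("Text", Source.DETECTRON2)]
print([r.label for r in mergeable_regions(layout)])  # ['Text']
```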

unstructured_inference/models/chipper.py

Lines changed: 2 additions & 0 deletions

@@ -5,6 +5,7 @@
 import cv2
 import numpy as np
 import torch
+import transformers
 from huggingface_hub import hf_hub_download
 from PIL.Image import Image
 from transformers import DonutProcessor, VisionEncoderDecoderModel
@@ -134,6 +135,7 @@ def predict_tokens(
         image: Image,
     ) -> Tuple[List[int], Sequence[Sequence[torch.Tensor]]]:
         """Predict tokens from image."""
+        transformers.set_seed(42)
         with torch.no_grad():
             outputs = self.model.generate(
                 self.processor(
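A brief usage note on the seed line: `transformers.set_seed` seeds Python's `random`, NumPy, and PyTorch in one call, which is what makes repeated Chipper generations produce the same tokens. A minimal sketch of that behaviour, independent of Chipper (the value 42 simply mirrors the diff; any fixed seed works):

```python
import torch
import transformers

transformers.set_seed(42)   # seeds random, numpy, and torch in one call
first = torch.rand(3)

transformers.set_seed(42)   # re-seeding reproduces the exact same draw
second = torch.rand(3)

assert torch.equal(first, second)
print(first)
```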
