Commit 4e5c4e6

refactor: remove remaining table OCR logic in inference (#302)
### Summary

Remove all OCR related code:

* table OCR code -> table structure now requires OCR tokens to be passed in
* parameter `extract_tables` -> already moved to unst; unst decides whether to extract and calls the table model
* function `interpret_table_block` -> this was a wrapper to call the table model in inference at block level; logic moved to unst
* paddle OCR related code and README instructions

### Test

* shouldn't affect anything since it just removes deprecated logic
* added some tests for coverage
* CCT metrics comparison (no change):

before (main on core product):

```
metric        average  sample_sd  population_sd  count
--------------------------------------------------
cct-accuracy  0.665    0.278      0.277          109
cct-%missing  0.094    0.176      0.176          109
```

after (inference checked out to this branch):

```
metric        average  sample_sd  population_sd  count
--------------------------------------------------
cct-accuracy  0.665    0.278      0.277          109
cct-%missing  0.094    0.176      0.176          109
```
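The "no change" claim for the CCT metrics can be checked mechanically rather than by eye. The following is a minimal sketch, not part of this commit: `parse_metrics` is a hypothetical helper, and the two tables are copied from the commit message above.

```python
def parse_metrics(table: str) -> dict:
    """Parse a whitespace-delimited metrics table into {metric: {column: value}}."""
    lines = [
        ln
        for ln in table.strip().splitlines()
        # Skip blank lines and the dashed separator row.
        if ln.strip() and not set(ln.strip()) <= {"-"}
    ]
    header = lines[0].split()[1:]  # drop the leading "metric" column name
    rows = {}
    for ln in lines[1:]:
        name, *values = ln.split()
        rows[name] = dict(zip(header, map(float, values)))
    return rows


# Tables as reported in the commit message (before: main, after: this branch).
BEFORE = """
metric        average  sample_sd  population_sd  count
--------------------------------------------------
cct-accuracy  0.665    0.278      0.277          109
cct-%missing  0.094    0.176      0.176          109
"""
AFTER = """
metric        average  sample_sd  population_sd  count
--------------------------------------------------
cct-accuracy  0.665    0.278      0.277          109
cct-%missing  0.094    0.176      0.176          109
"""

before = parse_metrics(BEFORE)
after = parse_metrics(AFTER)

# Metrics whose values differ between the two runs; empty means "no change".
changed = {name for name in before if before[name] != after.get(name)}
print("changed metrics:", sorted(changed))
```

An empty `changed` set corresponds to the "no change" result stated in the commit message.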
1 parent d4785df commit 4e5c4e6

15 files changed: +585 -340 lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+## 0.7.19
+
+* refactor: remove all OCR related code
+
 ## 0.7.18
 
 * refactor: remove all image extraction related code

Makefile

Lines changed: 1 addition & 7 deletions
@@ -22,7 +22,7 @@ install-base: install-base-pip-packages
 install: install-base-pip-packages install-dev install-detectron2
 
 .PHONY: install-ci
-install-ci: install-base-pip-packages install-test install-paddleocr
+install-ci: install-base-pip-packages install-test
 
 .PHONY: install-base-pip-packages
 install-base-pip-packages:
@@ -32,12 +32,6 @@ install-base-pip-packages:
 install-detectron2:
 	pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@57bdb21249d5418c130d54e2ebdc94dda7a4c01a"
 
-.PHONY: install-paddleocr
-install-paddleocr:
-	pip install --no-cache-dir paddlepaddle
-	pip install --no-cache-dir paddlepaddle-gpu
-	pip install --no-cache-dir "unstructured.PaddleOCR"
-
 .PHONY: install-test
 install-test: install-base
 	pip install -r requirements/test.txt

README.md

Lines changed: 0 additions & 18 deletions
@@ -34,24 +34,6 @@ Windows is not officially supported by Detectron2, but some users are able to in
 See discussion [here](https://layout-parser.github.io/tutorials/installation#for-windows-users) for
 tips on installing Detectron2 on Windows.
 
-### PaddleOCR
-
-[PaddleOCR](https://github.com/Unstructured-IO/unstructured.PaddleOCR) is suggested for table processing. Please set
-environment variable `TABLE_OCR`
-to `paddle` if you wish to use paddle for table processing instead of default `tesseract`.
-
-PaddleOCR may be with installed with:
-
-```shell
-pip install paddlepaddle
-pip install "unstructured.PaddleOCR"
-```
-
-We suggest that you install paddlepaddle-gpu with `pip install paddepaddle-gpu` if you have gpu devices available for better OCR performance.
-
-Please note that **paddlepaddle does not work on MacOS with Apple Silicon**. So if you want it running on Apple M1/M2 chip, we have a custom wheel of paddlepaddle for aarch64 architecture, you can install it with `pip install unstructured.paddlepaddle`, and run it inside a docker container.
-
-
 ### Repository
 
 To install the repository for development, clone the repo and run `make install` to install dependencies.

test_unstructured_inference/inference/test_layout.py

Lines changed: 0 additions & 3 deletions
@@ -215,13 +215,11 @@ def __init__(
         number=1,
         image=None,
         model=None,
-        extract_tables=False,
         detection_model=None,
     ):
         self.image = image
         self.layout = layout
         self.model = model
-        self.extract_tables = extract_tables
         self.number = number
         self.detection_model = detection_model
 
@@ -596,7 +594,6 @@ def test_process_file_with_model_routing(monkeypatch, model_type, is_detection_m
             detection_model=detection_model,
             element_extraction_model=element_extraction_model,
             fixed_layouts=None,
-            extract_tables=False,
             pdf_image_dpi=200,
         )
 

test_unstructured_inference/inference/test_layout_element.py

Lines changed: 0 additions & 12 deletions
@@ -1,33 +1,21 @@
-import pytest
 from layoutparser.elements import TextBlock
 from layoutparser.elements.layout_elements import Rectangle as LPRectangle
 
 from unstructured_inference.constants import Source
 from unstructured_inference.inference.layoutelement import LayoutElement, TextRegion
 
 
-@pytest.mark.parametrize("is_table", [False, True])
 def test_layout_element_extract_text(
     mock_layout_element,
     mock_text_region,
-    mock_pil_image,
-    is_table,
 ):
-    if is_table:
-        mock_layout_element.type = "Table"
-
     extracted_text = mock_layout_element.extract_text(
         objects=[mock_text_region],
-        image=mock_pil_image,
-        extract_tables=True,
     )
 
     assert isinstance(extracted_text, str)
     assert "Sample text" in extracted_text
 
-    if mock_layout_element.type == "Table":
-        assert hasattr(mock_layout_element, "text_as_html")
-
 
 def test_layout_element_do_dict(mock_layout_element):
     expected = {
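
The updated test no longer passes an image or `extract_tables` to `extract_text`, which matches the summary's note that table structure now requires OCR tokens to be passed in by the caller. The following is only an illustrative sketch of such a caller-side hand-off. `tokens_in_region` and the token dictionary shape (`text`, `bbox`) are hypothetical and are not taken from this repository's API.

```python
from typing import Dict, List, Sequence

# Hypothetical token shape (an assumption, not the library's documented format):
# {"text": str, "bbox": [x1, y1, x2, y2]}
Token = Dict[str, object]


def tokens_in_region(tokens: List[Token], region: Sequence[float]) -> List[Token]:
    """Keep the tokens whose bounding-box center lies inside `region`."""
    rx1, ry1, rx2, ry2 = region
    selected = []
    for token in tokens:
        x1, y1, x2, y2 = token["bbox"]
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        if rx1 <= cx <= rx2 and ry1 <= cy <= ry2:
            selected.append(token)
    return selected


# OCR output computed by the caller, outside the inference package.
page_tokens = [
    {"text": "Name", "bbox": [12, 10, 48, 22]},
    {"text": "Total", "bbox": [60, 10, 95, 22]},
    {"text": "Footer", "bbox": [12, 300, 60, 312]},
]

# Bounding box of a detected table region (illustrative values).
table_bbox = [0, 0, 100, 50]

# Only these tokens would be handed to a table-structure step.
table_tokens = tokens_in_region(page_tokens, table_bbox)
print([token["text"] for token in table_tokens])
```

Filtering by box center rather than full containment is a design choice in this sketch: it tolerates tokens that slightly overlap the table border.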
