Commit a501616

ajjimeno and qued authored
Table processing (#72)
* First commit
* Table processing in document layout
* Platform x86_64 check
* PaddleOCR integrated
* Deactivate show_log in paddleocr
* Utilize layout updates
* Formatting and linting
* Correct how linting ignores are accumulated
* Removed fitz
* Bug fixed in intersect_rect
* Bump to default 200 dpi
* Updated README with instructions to install paddleocr
* Fixed typo
* Deal with empty case
* Formatting
* Updates to pass flake8
* Added table test
* Fixed test
* Typing changes
* formatting for large fixture
* Up pixel to reflect new dpi
* Make table extraction opt-in
* Change content to check for
* Add install targets for paddleocr
* Add optional pip install for paddleocr
* Update README.md Updated paddleocr installation instructions
* Remove unused functions
* New image for table testing
* Remove non-unique assignment case
* Correct slot_into_contains arguments
* Test for nms
* update fixtures
* Added test for nms
* fix for disable table extraction by default
* Revised test
* Update old tests
* reuse postprocess
* Additional tests
* More rect tests, extract_text_from_spans
* Remove unused code
* Align supercells test
* Updated removal supercell test
* header_supercell_tree test
* name change to forked paddleocr
* Updated installation and removal of a print statement
* tidied file
* Version update
* linting
* Updated test
* Changed Makefile

---------

Co-authored-by: Antonio Jimeno Yepes <[email protected]>
Co-authored-by: Alan Bertl <[email protected]>
1 parent 4a52922 commit a501616

14 files changed: +1894 -13 lines changed

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,5 +1,6 @@
-## 0.2.13-dev0
+## 0.2.13
 
+* Add table processing
 * Change OCR logic to be aware of PDF image elements
 
 ## 0.2.12

Makefile

Lines changed: 5 additions & 1 deletion
@@ -20,7 +20,7 @@ install-base: install-base-pip-packages
 install: install-base-pip-packages install-dev install-detectron2 install-test
 
 .PHONY: install-ci
-install-ci: install-base-pip-packages install-test
+install-ci: install-base-pip-packages install-test install-paddleocr
 
 .PHONY: install-base-pip-packages
 install-base-pip-packages:
@@ -31,6 +31,10 @@ install-base-pip-packages:
 install-detectron2:
 	pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@78d5b4f335005091fe0364ce4775d711ec93566e"
 
+.PHONY: install-paddleocr
+install-paddleocr:
+	pip install "unstructured.PaddleOCR"
+
 .PHONY: install-test
 install-test:
 	pip install -r requirements/test.txt

README.md

Lines changed: 11 additions & 0 deletions
@@ -34,6 +34,17 @@ Windows is not officially supported by Detectron2, but some users are able to in
 See discussion [here](https://layout-parser.github.io/tutorials/installation#for-windows-users) for
 tips on installing Detectron2 on Windows.
 
+### PaddleOCR
+
+[PaddleOCR](https://github.com/Unstructured-IO/unstructured.PaddleOCR) is required for table processing for `x86_64` architectures.
+It should not be installed under MacOS with Apple Silicon cpu.
+
+PaddleOCR should be installed using the following instructions.
+
+```shell
+pip install "unstructured.PaddleOCR"
+```
+
 ### Repository
 
 To install the repository for development, clone the repo and run `make install` to install dependencies.
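
The README addition above limits PaddleOCR to `x86_64` machines. A minimal sketch of how calling code might guard on that before attempting table processing follows. The helper names, the `amd64` alias (the value Windows typically reports), and the `paddleocr` module name are assumptions for illustration and are not taken from this commit.

```python
import importlib.util
import platform

# Architecture strings treated as x86_64. The "amd64" alias is an assumption.
SUPPORTED_MACHINES = {"x86_64", "amd64"}


def paddleocr_supported(machine=None):
    """Return True when the machine string names an x86_64 architecture."""
    machine = platform.machine() if machine is None else machine
    return machine.lower() in SUPPORTED_MACHINES


def paddleocr_available(machine=None, module_name="paddleocr"):
    """Return True only on a supported architecture with the module installed.

    module_name is a hypothetical default; adjust it to whatever the
    installed package actually exposes.
    """
    if not paddleocr_supported(machine):
        return False
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("architecture supported:", paddleocr_supported())
    print("PaddleOCR importable:", paddleocr_available())
```

On an Apple Silicon machine `platform.machine()` reports `arm64`, so both helpers return `False` without trying to import anything.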

sample-docs/example_table.jpg

536 KB

setup.cfg

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ license_files = LICENSE.md
 
 [flake8]
 max-line-length = 100
-ignore = D100, D101, D104, D105, D107, D2, D4
+extend-ignore = D100, D101, D104, D105, D107, D2, D4
 per-file-ignores =
     test_*/**: D
 
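
The switch from `ignore` to `extend-ignore` matches the commit message item "Correct how linting ignores are accumulated": in flake8, `ignore` replaces the default ignore list, while `extend-ignore` adds to it. The sketch below is a simplified model of that behavior, not flake8's own code. The default list is copied from the flake8 documentation and should be treated as an assumption that may differ between versions.

```python
# Default ignore list as documented by flake8 (assumption: may vary by version).
FLAKE8_DEFAULT_IGNORE = ("E121", "E123", "E126", "E226", "E24", "E704", "W503", "W504")

# Codes listed in this project's setup.cfg.
PROJECT_CODES = ("D100", "D101", "D104", "D105", "D107", "D2", "D4")


def effective_ignore(ignore=None, extend_ignore=()):
    """Model the resulting ignore list.

    `ignore` replaces the defaults when given; `extend_ignore` is always
    added on top of whichever base list is in effect.
    """
    base = FLAKE8_DEFAULT_IGNORE if ignore is None else tuple(ignore)
    return sorted(set(base) | set(extend_ignore))


if __name__ == "__main__":
    before = effective_ignore(ignore=PROJECT_CODES)
    after = effective_ignore(extend_ignore=PROJECT_CODES)
    print("ignore =        ->", before)
    print("extend-ignore = ->", after)
```

Under this model the old setting silently dropped defaults such as `W503`, whereas the new setting keeps them alongside the docstring codes.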

setup.py

Lines changed: 2 additions & 1 deletion
@@ -18,6 +18,7 @@
 limitations under the License.
 """
 from setuptools import setup, find_packages
+from platform import machine
 
 from unstructured_inference.__version__ import __version__
 
@@ -60,5 +61,5 @@
         "onnxruntime",
         "transformers",
     ],
-    extras_require={},
+    extras_require={"paddle-ocr": "unstructured.PaddleOCR"},
 )
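
The hunks above import `machine` from `platform` but declare the `paddle-ocr` extra unconditionally. As a hedged illustration only, the sketch below shows one way an extra could be limited to `x86_64` using that import. The `build_extras` helper is hypothetical and does not appear in this commit.

```python
from platform import machine


def build_extras(arch=None):
    """Return an extras_require mapping that offers PaddleOCR only on x86_64.

    Hypothetical helper: the commit itself passes a fixed dict to setup().
    """
    arch = machine() if arch is None else arch
    extras = {}
    if arch == "x86_64":
        extras["paddle-ocr"] = ["unstructured.PaddleOCR"]
    return extras


if __name__ == "__main__":
    # In a setup.py this would be passed as setup(..., extras_require=build_extras()).
    print(build_extras())
```

With the extra declared as in the commit, it can be requested from a checkout with `pip install ".[paddle-ocr]"`.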

test_unstructured_inference/inference/test_layout.py

Lines changed: 2 additions & 1 deletion
@@ -186,11 +186,12 @@ def points(self):
 
 
 class MockPageLayout(layout.PageLayout):
-    def __init__(self, layout=None, model=None, ocr_strategy="auto"):
+    def __init__(self, layout=None, model=None, ocr_strategy="auto", extract_tables=False):
         self.image = None
         self.layout = layout
         self.model = model
         self.ocr_strategy = ocr_strategy
+        self.extract_tables = extract_tables
 
     def ocr(self, text_block: MockTextRegion):
         return text_block.ocr_text
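
The new `extract_tables=False` default reflects the commit message item "Make table extraction opt-in". The sketch below shows, under stated assumptions, how such a flag might gate the extra work. `Region`, `process_regions`, and `table_extractor` are hypothetical names used for illustration and are not the library's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Region:
    """Hypothetical stand-in for a detected layout element."""

    kind: str
    text: str = ""
    table_html: Optional[str] = None


def process_regions(
    regions: List[Region],
    extract_tables: bool = False,
    table_extractor: Optional[Callable[[Region], str]] = None,
) -> List[Region]:
    """Run the table extractor only when the caller opts in."""
    for region in regions:
        is_table = region.kind == "Table"
        if extract_tables and is_table and table_extractor is not None:
            region.table_html = table_extractor(region)
    return regions


if __name__ == "__main__":
    def fake_extractor(region: Region) -> str:
        # Placeholder output; a real extractor would analyze the image.
        return "<table></table>"

    default_run = process_regions([Region("Table")], table_extractor=fake_extractor)
    opted_in = process_regions(
        [Region("Table")], extract_tables=True, table_extractor=fake_extractor
    )
    print(default_run[0].table_html, opted_in[0].table_html)
```

Keeping the default at `False` means existing callers pay no extra cost and need no PaddleOCR install unless they ask for tables.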
