Commit 3d3f3df

enhancement: add "ocr_only" strategy for PDFs (#553)
* add tests for validating strategy
* refactor into determine_pdf_strategy function
* refactor pdf strategies into strategies
* remove commented out code
* remove unreachable code
* add in handling for image types
* a little more refactoring
* import ocr partioning for images
* catch warnings, partition type for valid strategies
* fallback to ocr_only from fast
* fallback logic for hi_res
* test for fallback to ocr only
* fallback logic ofr ocr_only
* more tests for fallback logic
* update doc strings
* version and changelog
* linting, linting, linting
* update docs to include notes about strategy
* fix typos
* change back patched filename
1 parent 1ac72c6 commit 3d3f3df

8 files changed

+360
-137
lines changed

Diff for: CHANGELOG.md

+11
@@ -1,3 +1,14 @@
+## 0.6.4
+
+### Enhancements
+
+* Added an "ocr_only" strategy for `partition_pdf`. Refactored the strategy decision
+logic into its own module.
+
+### Features
+
+### Fixes
+
 ## 0.6.3
 
 ### Enhancements

Diff for: docs/source/bricks.rst

+30 -7
@@ -138,7 +138,7 @@ to disable SSL verification in the request.
 
 ``partition_via_api`` allows users to partition documents using the hosted Unstructured API.
 The API partitions documents using the automatic ``partition`` function. Currently, the API
-supports all filetypes except for RTF and EPUBs.
+supports all filetypes except for RTF and EPUBs.
 To use another URL for the API use the ``api_url`` kwarg. This is helpful if you're hosting
 the API yourself or running it locally through a container. You can pass in your API key
 using the ``api_key`` kwarg. You can use the ``content_type`` kwarg to pass in the MIME
@@ -255,7 +255,7 @@ Examples:
 ------------------
 
 The ``partition_odt`` partitioning brick pre-processes Open Office documents
-saved in the ``.odt`` format. The function first converst the document
+saved in the ``.odt`` format. The function first converts the document
 to ``.docx`` using ``pandoc`` and then processes it using ``partition_docx``.
 
 Examples:
@@ -363,10 +363,22 @@ if you'd like to run inference locally.
 If you set the URL, ``partition_pdf`` will make a call to a remote inference server.
 ``partition_pdf`` also includes a ``token`` function that allows you to pass in an authentication
 token for a remote API call.
-The ``strategy`` kwarg controls the method that will be used to process the PDF. The ``"hi_res"`` strategy
-will identify the layout of the document using ``detectron2``. The ``"fast"`` strategy will extract the
-text using ``pdfminer`` and process the raw text with ``partition_text``. If ``detectron2`` is not available,
-and the ``"hi_res"`` strategy is set, ``partition_pdf`` will fallback to the ``"fast"`` strategy.
+
+The ``strategy`` kwarg controls the method that will be used to process the PDF.
+The available strategies for PDFs are `"hi_res"`, `"ocr_only"`, and `"fast"`.
+The ``"hi_res"`` strategy will identify the layout of the document using ``detectron2``. The advantage of `"hi_res"` is that
+it uses the document layout to gain additional information about document elements. We recommend using this strategy
+if your use case is highly sensitive to correct classifications for document elements. If ``detectron2`` is not available,
+the ``"hi_res"`` strategy will fall back to the ``"ocr_only"`` strategy.
+The ``"ocr_only"`` strategy runs the document through Tesseract for OCR and then runs the raw text through ``partition_text``.
+Currently, ``"hi_res"`` has difficulty ordering elements for documents with multiple columns. If you have a document with
+multiple columns that does not have extractable text, we recommend using the ``"ocr_only"`` strategy. ``"ocr_only"`` falls
+back to ``"fast"`` if Tesseract is not available and the document has extractable text.
+The ``"fast"`` strategy will extract the text using ``pdfminer`` and process the raw text with ``partition_text``.
+If the PDF text is not extractable, ``partition_pdf`` will fall back to ``"ocr_only"``. We recommend using the
+``"fast"`` strategy in most cases where the PDF has extractable text.
+
+
 You can also specify what languages to use for OCR with the ``ocr_languages`` kwarg. For example,
 use ``ocr_languages="eng+deu"`` to use the English and German language packs. See the
 `Tesseract documentation <https://github.com/tesseract-ocr/tessdata>`_ for a full list of languages and
@@ -411,6 +423,17 @@ Examples:
 The ``partition_image`` function has the same API as ``partition_pdf``, which is document above.
 The only difference is that ``partition_image`` does not need to convert a PDF to an image
 prior to processing. The ``partition_image`` function supports ``.png`` and ``.jpg`` files.
+
+The ``strategy`` kwarg controls the method that will be used to process the PDF.
+The available strategies for images are `"hi_res"` and ``"ocr_only"``.
+The ``"hi_res"`` strategy will identify the layout of the document using ``detectron2``. The advantage of `"hi_res"` is that it
+uses the document layout to gain additional information about document elements. We recommend using this strategy
+if your use case is highly sensitive to correct classifications for document elements. If ``detectron2`` is not available,
+the ``"hi_res"`` strategy will fall back to the ``"ocr_only"`` strategy.
+The ``"ocr_only"`` strategy runs the document through Tesseract for OCR and then runs the raw text through ``partition_text``.
+Currently, ``"hi_res"`` has difficulty ordering elements for documents with multiple columns. If you have a document with
+multiple columns that does not have extractable text, we recoomend using the ``"ocr_only"`` strategy.
+
 You can also specify what languages to use for OCR with the ``ocr_languages`` kwarg. For example,
 use ``ocr_languages="eng+deu"`` to use the English and German language packs. See the
 `Tesseract documentation <https://github.com/tesseract-ocr/tessdata>`_ for a full list of languages and
@@ -430,7 +453,7 @@ Examples:
 elements = partition_image("example-docs/layout-parser-paper-fast.jpg", ocr_languages="eng+swe")
 
 
-The default partitioning strategy for ``partition_image`` is `"hi_res"`, which segements the document using
+The default partitioning strategy for ``partition_image`` is `"hi_res"`, which segments the document using
 ``detectron2`` and then OCRs the document. You can also choose ``"ocr_only"`` as the partitioning strategy,
 which OCRs the document and then runs the output through ``partition_text``. This can be helpful
 if ``detectron2`` does not detect a text element in the image. To run example below, ensure you
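
Note: the fallback rules described in the docs change above, together with the cases exercised in test_pdf.py, can be summarised as a small decision function. The sketch below is only an illustration of those rules, not the code from this commit (the strategies module itself is not shown on this page). The name choose_pdf_strategy and its boolean parameters are hypothetical, and the branch where "fast" falls back to "hi_res" when Tesseract is also missing is an assumption.

```python
VALID_PDF_STRATEGIES = ("hi_res", "ocr_only", "fast")


def choose_pdf_strategy(
    requested: str,
    text_extractable: bool,
    has_detectron2: bool,
    has_tesseract: bool,
) -> str:
    """Hypothetical helper: return the strategy that would actually run."""
    if requested not in VALID_PDF_STRATEGIES:
        raise ValueError(f"{requested} is not a valid strategy for PDFs.")

    # With no extractable text and no OCR/layout dependency, nothing can run.
    if not (text_extractable or has_detectron2 or has_tesseract):
        raise ValueError("PDF cannot be processed with the available tools.")

    if requested == "hi_res" and not has_detectron2:
        # Docs: hi_res falls back to ocr_only; tests: to fast if OCR is missing too.
        return "ocr_only" if has_tesseract else "fast"

    if requested == "ocr_only" and not has_tesseract:
        # Docs: fall back to fast when text is extractable; tests: else hi_res.
        return "fast" if text_extractable else "hi_res"

    if requested == "fast" and not text_extractable:
        # Docs: fall back to ocr_only. The hi_res branch is an assumption.
        return "ocr_only" if has_tesseract else "hi_res"

    return requested
```

For example, requesting "hi_res" without detectron2 but with Tesseract yields "ocr_only", which matches test_partition_pdf_falls_back_to_ocr_only.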

Diff for: test_unstructured/partition/test_pdf.py

+79 -27
@@ -6,7 +6,7 @@
 from unstructured_inference.inference import layout
 
 from unstructured.documents.elements import NarrativeText, PageBreak, Text, Title
-from unstructured.partition import pdf
+from unstructured.partition import pdf, strategies
 
 
 class MockResponse:
@@ -161,7 +161,7 @@ def test_partition_pdf_api_raises_with_failed_api_call(
     [("fakeurl", True, False), (None, False, True)],
 )
 def test_partition_pdf(url, api_called, local_called, monkeypatch):
-    monkeypatch.setattr(pdf, "is_pdf_text_extractable", lambda *args, **kwargs: True)
+    monkeypatch.setattr(strategies, "is_pdf_text_extractable", lambda *args, **kwargs: True)
     with mock.patch.object(
         pdf,
         attribute="_partition_via_api",
@@ -177,7 +177,7 @@ def test_partition_pdf(url, api_called, local_called, monkeypatch):
     [("fakeurl", True, False), (None, False, True)],
 )
 def test_partition_pdf_with_template(url, api_called, local_called, monkeypatch):
-    monkeypatch.setattr(pdf, "is_pdf_text_extractable", lambda *args, **kwargs: True)
+    monkeypatch.setattr(strategies, "is_pdf_text_extractable", lambda *args, **kwargs: True)
     with mock.patch.object(
         pdf,
         attribute="_partition_via_api",
@@ -253,13 +253,83 @@ def test_partition_pdf_falls_back_to_fast(
     caplog,
     filename="example-docs/layout-parser-paper-fast.pdf",
 ):
-    monkeypatch.setattr(pdf, "dependency_exists", lambda dep: dep != "detectron2")
+    def mock_exists(dep):
+        return dep not in ["detectron2", "pytesseract"]
+
+    monkeypatch.setattr(strategies, "dependency_exists", mock_exists)
+
+    mock_return = [Text("Hello there!")]
+    with mock.patch.object(
+        pdf,
+        "_partition_pdf_with_pdfminer",
+        return_value=mock_return,
+    ) as mock_partition:
+        pdf.partition_pdf(filename=filename, url=None, strategy="hi_res")
+
+    mock_partition.assert_called_once()
+    assert "detectron2 is not installed" in caplog.text
+
+
+def test_partition_pdf_falls_back_to_fast_from_ocr_only(
+    monkeypatch,
+    caplog,
+    filename="example-docs/layout-parser-paper-fast.pdf",
+):
+    def mock_exists(dep):
+        return dep not in ["pytesseract"]
+
+    monkeypatch.setattr(strategies, "dependency_exists", mock_exists)
 
     mock_return = [Text("Hello there!")]
     with mock.patch.object(
         pdf,
         "_partition_pdf_with_pdfminer",
         return_value=mock_return,
+    ) as mock_partition:
+        pdf.partition_pdf(filename=filename, url=None, strategy="ocr_only")
+
+    mock_partition.assert_called_once()
+    assert "pytesseract is not installed" in caplog.text
+
+
+def test_partition_pdf_falls_back_to_hi_res_from_ocr_only(
+    monkeypatch,
+    caplog,
+    filename="example-docs/layout-parser-paper-fast.pdf",
+):
+    def mock_exists(dep):
+        return dep not in ["pytesseract"]
+
+    monkeypatch.setattr(strategies, "dependency_exists", mock_exists)
+    monkeypatch.setattr(strategies, "is_pdf_text_extractable", lambda *args, **kwargs: False)
+
+    mock_return = [Text("Hello there!")]
+    with mock.patch.object(
+        pdf,
+        "_partition_pdf_or_image_local",
+        return_value=mock_return,
+    ) as mock_partition:
+        pdf.partition_pdf(filename=filename, url=None, strategy="ocr_only")
+
+    mock_partition.assert_called_once()
+    assert "pytesseract is not installed" in caplog.text
+
+
+def test_partition_pdf_falls_back_to_ocr_only(
+    monkeypatch,
+    caplog,
+    filename="example-docs/layout-parser-paper-fast.pdf",
+):
+    def mock_exists(dep):
+        return dep not in ["detectron2"]
+
+    monkeypatch.setattr(strategies, "dependency_exists", mock_exists)
+
+    mock_return = [Text("Hello there!")]
+    with mock.patch.object(
+        pdf,
+        "_partition_pdf_or_image_with_ocr",
+        return_value=mock_return,
     ) as mock_partition:
         pdf.partition_pdf(filename=filename, url=None, strategy="hi_res")
 
@@ -276,27 +346,6 @@ def test_partition_pdf_uses_table_extraction():
         assert mock_process_file_with_model.call_args[1]["extract_tables"]
 
 
-@pytest.mark.parametrize(
-    ("filename", "from_file", "expected"),
-    [
-        ("layout-parser-paper-fast.pdf", True, True),
-        ("copy-protected.pdf", True, False),
-        ("layout-parser-paper-fast.pdf", False, True),
-        ("copy-protected.pdf", False, False),
-    ],
-)
-def test_is_pdf_text_extractable(filename, from_file, expected):
-    filename = os.path.join("example-docs", filename)
-
-    if from_file:
-        with open(filename, "rb") as f:
-            extractable = pdf.is_pdf_text_extractable(file=f)
-    else:
-        extractable = pdf.is_pdf_text_extractable(filename=filename)
-
-    assert extractable is expected
-
-
 def test_partition_pdf_with_copy_protection():
     filename = os.path.join("example-docs", "copy-protected.pdf")
     elements = pdf.partition_pdf(filename=filename, strategy="hi_res")
@@ -314,8 +363,11 @@ def test_partition_pdf_fails_if_pdf_not_processable(
     monkeypatch,
     filename="example-docs/layout-parser-paper-fast.pdf",
 ):
-    monkeypatch.setattr(pdf, "dependency_exists", lambda dep: dep != "detectron2")
-    monkeypatch.setattr(pdf, "is_pdf_text_extractable", lambda *args, **kwargs: False)
+    def mock_exists(dep):
+        return dep not in ["detectron2", "pytesseract"]
+
+    monkeypatch.setattr(strategies, "dependency_exists", mock_exists)
+    monkeypatch.setattr(strategies, "is_pdf_text_extractable", lambda *args, **kwargs: False)
 
     with pytest.raises(ValueError):
         pdf.partition_pdf(filename=filename)

Diff for: test_unstructured/partition/test_strategies.py

+41
@@ -0,0 +1,41 @@
+import os
+
+import pytest
+
+from unstructured.partition import strategies
+
+
+def test_validate_strategy_validates():
+    # Nothing should raise for a valid strategy
+    strategies.validate_strategy("hi_res", "pdf")
+
+
+def test_validate_strategy_raises_for_bad_filetype():
+    with pytest.raises(ValueError):
+        strategies.validate_strategy("fast", "image")
+
+
+def test_validate_strategy_raises_for_bad_strategy():
+    with pytest.raises(ValueError):
+        strategies.validate_strategy("totally_guess_the_text", "image")
+
+
+@pytest.mark.parametrize(
+    ("filename", "from_file", "expected"),
+    [
+        ("layout-parser-paper-fast.pdf", True, True),
+        ("copy-protected.pdf", True, False),
+        ("layout-parser-paper-fast.pdf", False, True),
+        ("copy-protected.pdf", False, False),
+    ],
+)
+def test_is_pdf_text_extractable(filename, from_file, expected):
+    filename = os.path.join("example-docs", filename)
+
+    if from_file:
+        with open(filename, "rb") as f:
+            extractable = strategies.is_pdf_text_extractable(file=f)
+    else:
+        extractable = strategies.is_pdf_text_extractable(filename=filename)
+
+    assert extractable is expected
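
Note: the three validate_strategy tests above pin down a simple contract: a known strategy for a file type passes silently, and anything else raises ValueError. The sketch below shows one way such a check could look. It is a guess at the shape, not the implementation in unstructured/partition/strategies.py (which is not shown on this page); the name validate_strategy_sketch and the VALID_STRATEGIES mapping are hypothetical, with the per-type lists taken from the docs change in this commit.

```python
# Assumed mapping, inferred from the documented strategies for each file type.
VALID_STRATEGIES = {
    "pdf": ("hi_res", "ocr_only", "fast"),
    "image": ("hi_res", "ocr_only"),
}


def validate_strategy_sketch(strategy: str, filetype: str) -> None:
    """Raise ValueError unless `strategy` is allowed for `filetype`."""
    valid = VALID_STRATEGIES.get(filetype)
    if valid is None:
        raise ValueError(f"{filetype} is not a supported file type.")
    if strategy not in valid:
        raise ValueError(
            f"{strategy} is not a valid strategy for filetype {filetype}. "
            f"Valid strategies are {list(valid)}.",
        )
```

Keeping the allowed strategies in one mapping means "fast" is rejected for images by the same lookup that rejects an unknown strategy name, which is the behaviour the two failing-case tests expect.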

Diff for: unstructured/__version__.py

+1 -1
@@ -1 +1 @@
-__version__ = "0.6.3" # pragma: no cover
+__version__ = "0.6.4" # pragma: no cover

Diff for: unstructured/partition/image.py

+16 -35
@@ -1,14 +1,8 @@
 from typing import List, Optional
 
-import pytesseract
-from PIL import Image
-
 from unstructured.documents.elements import Element
 from unstructured.partition.common import exactly_one
 from unstructured.partition.pdf import partition_pdf_or_image
-from unstructured.partition.text import partition_text
-
-VALID_STRATEGIES = ["hi_res", "ocr_only"]
 
 
 def partition_image(
@@ -42,35 +36,22 @@ def partition_image(
         to install the appropriate Tesseract language pack.
     strategy
         The strategy to use for partitioning the PDF. Valid strategies are "hi_res" and
-        "ocr_only". When using the "hi_res" strategy, the function ses a layout detection
-        model if to identify document elements. When using the "ocr_only strategy",
-        partition_image simply extracts the text from the document and processes it.
+        "ocr_only". When using the "hi_res" strategy, the function uses a layout detection
+        model if to identify document elements. When using the "ocr_only" strategy,
+        partition_image simply extracts the text from the document using OCR and processes it.
     """
     exactly_one(filename=filename, file=file)
 
-    if strategy == "hi_res":
-        if template is None:
-            template = "layout/image"
-        return partition_pdf_or_image(
-            filename=filename,
-            file=file,
-            url=url,
-            template=template,
-            token=token,
-            include_page_breaks=include_page_breaks,
-            ocr_languages=ocr_languages,
-        )
-
-    elif strategy == "ocr_only":
-        if file is not None:
-            image = Image.open(file)
-            text = pytesseract.image_to_string(image, config=f"-l '{ocr_languages}'")
-        else:
-            text = pytesseract.image_to_string(filename, config=f"-l '{ocr_languages}'")
-        return partition_text(text=text)
-
-    else:
-        raise ValueError(
-            f"{strategy} is not a valid strategy for partition_image. "
-            f"Choose one of {VALID_STRATEGIES}.",
-        )
+    if template is None:
+        template = "layout/image"
+
+    return partition_pdf_or_image(
+        filename=filename,
+        file=file,
+        url=url,
+        template=template,
+        token=token,
+        include_page_breaks=include_page_breaks,
+        ocr_languages=ocr_languages,
+        strategy=strategy,
+    )
