Commit 727d366

enhancement: auto strategy for PDFs and images (#578)
* added functions for determining auto strategy
* change default strategy to auto
* tests for auto strategy
* update docs
* changelog and version
* bump version
* remove ingest file in wrong location
* update jpg output
* typo fix
1 parent 210e735 commit 727d366

12 files changed (+296, -285 lines)

Diff for: CHANGELOG.md

+5 -1
@@ -1,7 +1,11 @@
-## 0.6.6-dev2
+## 0.6.6
 
 ### Enhancements
 
+* Adds an `"auto"` strategy that chooses the partitioning strategy based on document
+characteristics and function kwargs. This is the new default strategy for `partition_pdf`
+and `partition_image`. Users can maintain existing behavior by explicitly setting
+`strategy="hi_res"`.
 * Added an additional trace logger for NLP debugging.
 * Add `get_date` method to `ElementMetadata` for converting the datestring to a `datetime` object.
 * Cleanup the `filename` attribute on `ElementMetadata` to remove the full filepath.

Diff for: docs/source/bricks.rst

+41 -30
@@ -364,21 +364,6 @@ If you set the URL, ``partition_pdf`` will make a call to a remote inference ser
 ``partition_pdf`` also includes a ``token`` function that allows you to pass in an authentication
 token for a remote API call.
 
-The ``strategy`` kwarg controls the method that will be used to process the PDF.
-The available strategies for PDFs are `"hi_res"`, `"ocr_only"`, and `"fast"`.
-The ``"hi_res"`` strategy will identify the layout of the document using ``detectron2``. The advantage of `"hi_res"` is that
-it uses the document layout to gain additional information about document elements. We recommend using this strategy
-if your use case is highly sensitive to correct classifications for document elements. If ``detectron2`` is not available,
-the ``"hi_res"`` strategy will fall back to the ``"ocr_only"`` strategy.
-The ``"ocr_only"`` strategy runs the document through Tesseract for OCR and then runs the raw text through ``partition_text``.
-Currently, ``"hi_res"`` has difficulty ordering elements for documents with multiple columns. If you have a document with
-multiple columns that does not have extractable text, we recommend using the ``"ocr_only"`` strategy. ``"ocr_only"`` falls
-back to ``"fast"`` if Tesseract is not available and the document has extractable text.
-The ``"fast"`` strategy will extract the text using ``pdfminer`` and process the raw text with ``partition_text``.
-If the PDF text is not extractable, ``partition_pdf`` will fall back to ``"ocr_only"``. We recommend using the
-``"fast"`` strategy in most cases where the PDF has extractable text.
-
-
 You can also specify what languages to use for OCR with the ``ocr_languages`` kwarg. For example,
 use ``ocr_languages="eng+deu"`` to use the English and German language packs. See the
 `Tesseract documentation <https://github.com/tesseract-ocr/tessdata>`_ for a full list of languages and
@@ -398,9 +383,31 @@ Examples:
 elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", ocr_languages="eng+swe")
 
 
+The ``strategy`` kwarg controls the method that will be used to process the PDF.
+The available strategies for PDFs are `"auto"`, `"hi_res"`, `"ocr_only"`, and `"fast"`.
+
+The ``"auto"`` strategy will choose the partitioning strategy based on document characteristics and the function kwargs.
+If ``infer_table_structure`` is passed, the strategy will be ``"hi_res"`` because that is the only strategy that
+currently extracts tables for PDFs. Otherwise, ``"auto"`` will choose ``"fast"`` if the PDF text is extractable and
+``"ocr_only"`` otherwise. ``"auto"`` is the default strategy.
+
+The ``"hi_res"`` strategy will identify the layout of the document using ``detectron2``. The advantage of `"hi_res"` is that
+it uses the document layout to gain additional information about document elements. We recommend using this strategy
+if your use case is highly sensitive to correct classifications for document elements. If ``detectron2`` is not available,
+the ``"hi_res"`` strategy will fall back to the ``"ocr_only"`` strategy.
+
+The ``"ocr_only"`` strategy runs the document through Tesseract for OCR and then runs the raw text through ``partition_text``.
+Currently, ``"hi_res"`` has difficulty ordering elements for documents with multiple columns. If you have a document with
+multiple columns that does not have extractable text, we recommend using the ``"ocr_only"`` strategy. ``"ocr_only"`` falls
+back to ``"fast"`` if Tesseract is not available and the document has extractable text.
+
+The ``"fast"`` strategy will extract the text using ``pdfminer`` and process the raw text with ``partition_text``.
+If the PDF text is not extractable, ``partition_pdf`` will fall back to ``"ocr_only"``. We recommend using the
+``"fast"`` strategy in most cases where the PDF has extractable text.
+
 If a PDF is copy protected, ``partition_pdf`` can process the document with the ``"hi_res"`` strategy (which
-will treat it like an image), but cannot process the document with the ``"fast"`` strategy. If the user
-chooses ``"fast"`` on a copy protected PDF, ``partition_pdf`` will fall back to the ``"hi_res"``
+will treat it like an image), but cannot process the document with the ``"fast"`` strategy.
+If the user chooses ``"fast"`` on a copy protected PDF, ``partition_pdf`` will fall back to the ``"hi_res"``
 strategy. If ``detectron2`` is not installed, ``partition_pdf`` will fail for copy protected
 PDFs because the document will not be processable by any of the available methods.
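
Note on the hunk above: the fallback rules for explicitly requested PDF strategies can be read as a small decision procedure. The block below is a minimal sketch of that reading, not the library's implementation. The function name `resolve_pdf_fallback` and its boolean parameters are hypothetical names introduced for illustration, chaining the rules one after another is an assumption, and the copy-protected case is deliberately left out.

```python
def resolve_pdf_fallback(
    requested: str,
    text_extractable: bool,
    detectron2_available: bool,
    tesseract_available: bool,
) -> str:
    """Return the strategy that would effectively run for an explicit request.

    Hypothetical helper: it only encodes the three fallback sentences from the
    documentation and assumes "auto" has already been resolved elsewhere.
    """
    if requested not in ("hi_res", "ocr_only", "fast"):
        raise ValueError(f"unsupported strategy: {requested!r}")

    strategy = requested
    # "hi_res" needs detectron2; without it the docs say it falls back to OCR.
    if strategy == "hi_res" and not detectron2_available:
        strategy = "ocr_only"
    # "fast" needs extractable text; otherwise it falls back to OCR.
    if strategy == "fast" and not text_extractable:
        strategy = "ocr_only"
    # "ocr_only" needs Tesseract; it can fall back to "fast" only if the
    # document has extractable text.
    if strategy == "ocr_only" and not tesseract_available and text_extractable:
        strategy = "fast"
    return strategy


if __name__ == "__main__":
    # hi_res requested on a machine without detectron2
    print(resolve_pdf_fallback("hi_res", True, False, True))
```

Under these assumptions, a `"hi_res"` request on a machine with neither `detectron2` nor Tesseract ends up at `"fast"` when the PDF has extractable text; whether the library chains fallbacks this way is not stated in the docs.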

@@ -424,16 +431,6 @@ The ``partition_image`` function has the same API as ``partition_pdf``, which is
 The only difference is that ``partition_image`` does not need to convert a PDF to an image
 prior to processing. The ``partition_image`` function supports ``.png`` and ``.jpg`` files.
 
-The ``strategy`` kwarg controls the method that will be used to process the PDF.
-The available strategies for images are `"hi_res"` and ``"ocr_only"``.
-The ``"hi_res"`` strategy will identify the layout of the document using ``detectron2``. The advantage of `"hi_res"` is that it
-uses the document layout to gain additional information about document elements. We recommend using this strategy
-if your use case is highly sensitive to correct classifications for document elements. If ``detectron2`` is not available,
-the ``"hi_res"`` strategy will fall back to the ``"ocr_only"`` strategy.
-The ``"ocr_only"`` strategy runs the document through Tesseract for OCR and then runs the raw text through ``partition_text``.
-Currently, ``"hi_res"`` has difficulty ordering elements for documents with multiple columns. If you have a document with
-multiple columns that does not have extractable text, we recommend using the ``"ocr_only"`` strategy.
-
 You can also specify what languages to use for OCR with the ``ocr_languages`` kwarg. For example,
 use ``ocr_languages="eng+deu"`` to use the English and German language packs. See the
 `Tesseract documentation <https://github.com/tesseract-ocr/tessdata>`_ for a full list of languages and
@@ -453,9 +450,23 @@ Examples:
 elements = partition_image("example-docs/layout-parser-paper-fast.jpg", ocr_languages="eng+swe")
 
 
-The default partitioning strategy for ``partition_image`` is `"hi_res"`, which segments the document using
-``detectron2`` and then OCRs the document. You can also choose ``"ocr_only"`` as the partitioning strategy,
-which OCRs the document and then runs the output through ``partition_text``. This can be helpful
+The ``strategy`` kwarg controls the method that will be used to process the PDF.
+The available strategies for images are ``"auto"``, ``"hi_res"`` and ``"ocr_only"``.
+
+The ``"auto"`` strategy will choose the partitioning strategy based on document characteristics and the function kwargs.
+If ``infer_table_structure`` is passed, the strategy will be ``"hi_res"`` because that is the only strategy that
+currently extracts tables for PDFs. Otherwise, ``"auto"`` will choose ``ocr_only``. ``"auto"`` is the default strategy.
+
+The ``"hi_res"`` strategy will identify the layout of the document using ``detectron2``. The advantage of `"hi_res"` is that it
+uses the document layout to gain additional information about document elements. We recommend using this strategy
+if your use case is highly sensitive to correct classifications for document elements. If ``detectron2`` is not available,
+the ``"hi_res"`` strategy will fall back to the ``"ocr_only"`` strategy.
+
+The ``"ocr_only"`` strategy runs the document through Tesseract for OCR and then runs the raw text through ``partition_text``.
+Currently, ``"hi_res"`` has difficulty ordering elements for documents with multiple columns. If you have a document with
+multiple columns that does not have extractable text, we recommend using the ``"ocr_only"`` strategy.
+
+It is helpful to use ``"ocr_only"`` instead of ``"hi_res"``
 if ``detectron2`` does not detect a text element in the image. To run example below, ensure you
 have the Korean language pack for Tesseract installed on your system.
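
Note on the hunk above: for images the documented selection is simpler than for PDFs, because there is no `"fast"` option. The block below is a minimal sketch of the documented behavior, assuming a hypothetical helper `choose_image_strategy` with a hypothetical `detectron2_available` flag; neither name comes from the commit.

```python
IMAGE_STRATEGIES = ("auto", "hi_res", "ocr_only")


def choose_image_strategy(
    requested: str = "auto",
    infer_table_structure: bool = False,
    detectron2_available: bool = True,
) -> str:
    """Sketch of the image strategy rules described in the docs (hypothetical helper)."""
    if requested not in IMAGE_STRATEGIES:
        raise ValueError(f"unsupported image strategy: {requested!r}")

    strategy = requested
    if strategy == "auto":
        # Table extraction is only documented for "hi_res"; otherwise use OCR.
        strategy = "hi_res" if infer_table_structure else "ocr_only"
    # Documented fallback: "hi_res" becomes "ocr_only" without detectron2.
    if strategy == "hi_res" and not detectron2_available:
        strategy = "ocr_only"
    return strategy


if __name__ == "__main__":
    print(choose_image_strategy())                            # default "auto" request
    print(choose_image_strategy(infer_table_structure=True))  # table extraction requested
```

One consequence of the new default, visible in the sketch: callers who relied on layout detection for images now need to pass `strategy="hi_res"` explicitly, as the changelog entry says.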

Diff for: slack-ingest-output/C052BGT7718.json

-10
This file was deleted.

Diff for: test_unstructured/partition/test_auto.py

+2 -2
@@ -331,7 +331,7 @@ def test_partition_pdf_doesnt_raise_warning():
     [(False, None), (False, "image/jpeg"), (True, "image/jpeg"), (True, None)],
 )
 def test_auto_partition_jpg(pass_file_filename, content_type):
-    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "example.jpg")
+    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.jpg")
     file_filename = filename if pass_file_filename else None
     elements = partition(filename=filename, file_filename=file_filename, content_type=content_type)
     assert len(elements) > 0
@@ -342,7 +342,7 @@ def test_auto_partition_jpg(pass_file_filename, content_type):
     [(False, None), (False, "image/jpeg"), (True, "image/jpeg"), (True, None)],
 )
 def test_auto_partition_jpg_from_file(pass_file_filename, content_type):
-    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "example.jpg")
+    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.jpg")
     file_filename = filename if pass_file_filename else None
     with open(filename, "rb") as f:
         elements = partition(file=f, file_filename=file_filename, content_type=content_type)

Diff for: test_unstructured/partition/test_image.py

+11 -4
@@ -162,29 +162,36 @@ def test_partition_image(url, api_called, local_called):
         attribute="_partition_via_api",
         new=mock.MagicMock(),
     ), mock.patch.object(pdf, "_partition_pdf_or_image_local", mock.MagicMock()):
-        image.partition_image(filename="fake.pdf", url=url)
+        image.partition_image(filename="fake.pdf", strategy="hi_res", url=url)
         assert pdf._partition_via_api.called == api_called
         assert pdf._partition_pdf_or_image_local.called == local_called
 
 
+def test_partition_image_with_auto_strategy(filename="example-docs/layout-parser-paper-fast.jpg"):
+    elements = image.partition_image(filename=filename, strategy="auto")
+    titles = [el for el in elements if el.category == "Title" and len(el.text.split(" ")) > 10]
+    title = "LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis"
+    assert titles[0].text == title
+
+
 def test_partition_image_with_language_passed(filename="example-docs/example.jpg"):
     with mock.patch.object(layout, "process_file_with_model", mock.MagicMock()) as mock_partition:
-        image.partition_image(filename=filename, ocr_languages="eng+swe")
+        image.partition_image(filename=filename, strategy="hi_res", ocr_languages="eng+swe")
 
     assert mock_partition.call_args.kwargs.get("ocr_languages") == "eng+swe"
 
 
 def test_partition_image_from_file_with_language_passed(filename="example-docs/example.jpg"):
     with mock.patch.object(layout, "process_data_with_model", mock.MagicMock()) as mock_partition:
         with open(filename, "rb") as f:
-            image.partition_image(file=f, ocr_languages="eng+swe")
+            image.partition_image(file=f, strategy="hi_res", ocr_languages="eng+swe")
 
     assert mock_partition.call_args.kwargs.get("ocr_languages") == "eng+swe"
 
 
 def test_partition_image_raises_with_invalid_language(filename="example-docs/example.jpg"):
     with pytest.raises(TesseractError):
-        image.partition_image(filename=filename, ocr_languages="fakeroo")
+        image.partition_image(filename=filename, strategy="hi_res", ocr_languages="fakeroo")
 
 
 @pytest.mark.skipif(is_in_docker, reason="Skipping this test in Docker container")

Diff for: test_unstructured/partition/test_pdf.py

+9 -2
@@ -168,7 +168,7 @@ def test_partition_pdf(url, api_called, local_called, monkeypatch):
         attribute="_partition_via_api",
         new=mock.MagicMock(),
     ), mock.patch.object(pdf, "_partition_pdf_or_image_local", mock.MagicMock()):
-        pdf.partition_pdf(filename="fake.pdf", url=url)
+        pdf.partition_pdf(filename="fake.pdf", strategy="hi_res", url=url)
         assert pdf._partition_via_api.called == api_called
         assert pdf._partition_pdf_or_image_local.called == local_called
 
@@ -202,11 +202,18 @@ def test_partition_pdf_with_template(url, api_called, local_called, monkeypatch)
         attribute="_partition_via_api",
         new=mock.MagicMock(),
     ), mock.patch.object(pdf, "_partition_pdf_or_image_local", mock.MagicMock()):
-        pdf.partition_pdf(filename="fake.pdf", url=url, template="checkbox")
+        pdf.partition_pdf(filename="fake.pdf", strategy="hi_res", url=url, template="checkbox")
         assert pdf._partition_via_api.called == api_called
         assert pdf._partition_pdf_or_image_local.called == local_called
 
 
+def test_partition_pdf_with_auto_strategy(filename="example-docs/layout-parser-paper-fast.pdf"):
+    elements = pdf.partition_pdf(filename=filename, strategy="auto")
+    titles = [el for el in elements if el.category == "Title" and len(el.text.split(" ")) > 10]
+    title = "LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis"
+    assert titles[0].text == title
+
+
 def test_partition_pdf_with_page_breaks(filename="example-docs/layout-parser-paper-fast.pdf"):
     elements = pdf.partition_pdf(filename=filename, url=None, include_page_breaks=True)
     assert PageBreak() in elements
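
Note on the new auto-strategy tests: both filter the partitioned elements down to titles with more than ten space-separated words before comparing against the paper title, presumably so that short headings classified as `Title` do not end up first in the list. The block below is a small self-contained sketch of that filter. `FakeElement` and `long_titles` are hypothetical stand-ins, not classes or functions from the library, and the sample elements are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeElement:
    """Stand-in for a partitioned element; only the two attributes the tests read."""
    category: str
    text: str


def long_titles(elements: List[FakeElement], min_words: int = 10) -> List[FakeElement]:
    # Same predicate as the tests: a Title with more than `min_words` words,
    # counted by splitting on single spaces.
    return [
        el
        for el in elements
        if el.category == "Title" and len(el.text.split(" ")) > min_words
    ]


PAPER_TITLE = "LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis"

sample = [
    FakeElement("Title", "1 Introduction"),                      # short heading, filtered out
    FakeElement("NarrativeText", "Some body text " * 5),         # wrong category, filtered out
    FakeElement("Title", PAPER_TITLE),                           # 11 words, kept
]

if __name__ == "__main__":
    print(long_titles(sample)[0].text)
```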

Diff for: test_unstructured/partition/test_strategies.py

+31
@@ -39,3 +39,34 @@ def test_is_pdf_text_extractable(filename, from_file, expected):
         extractable = strategies.is_pdf_text_extractable(filename=filename)
 
     assert extractable is expected
+
+
+@pytest.mark.parametrize(
+    ("infer_table_structure", "expected"),
+    [
+        (True, "hi_res"),
+        (False, "ocr_only"),
+    ],
+)
+def test_determine_image_auto_strategy(infer_table_structure, expected):
+    strategy = strategies._determine_image_auto_strategy(
+        infer_table_structure=infer_table_structure,
+    )
+    assert strategy is expected
+
+
+@pytest.mark.parametrize(
+    ("pdf_text_extractable", "infer_table_structure", "expected"),
+    [
+        (True, True, "hi_res"),
+        (False, True, "hi_res"),
+        (True, False, "fast"),
+        (False, False, "ocr_only"),
+    ],
+)
+def test_determine_image_pdf_strategy(pdf_text_extractable, infer_table_structure, expected):
+    strategy = strategies._determine_pdf_auto_strategy(
+        pdf_text_extractable=pdf_text_extractable,
+        infer_table_structure=infer_table_structure,
+    )
+    assert strategy is expected
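
Note on the parametrized cases above: the four PDF cases form a complete truth table over the two boolean inputs, so they pin down the selection logic entirely. The block below is a minimal sketch of a function that satisfies that table. It is written from the test cases alone and is not the body of `strategies._determine_pdf_auto_strategy`, which this page does not show; the name `determine_pdf_auto_strategy_sketch` is hypothetical.

```python
def determine_pdf_auto_strategy_sketch(
    pdf_text_extractable: bool,
    infer_table_structure: bool,
) -> str:
    """Pick a PDF strategy consistent with the parametrized test cases."""
    # Table extraction takes priority regardless of text extractability.
    if infer_table_structure:
        return "hi_res"
    # Otherwise prefer direct text extraction when it is possible.
    if pdf_text_extractable:
        return "fast"
    return "ocr_only"


# The same (pdf_text_extractable, infer_table_structure, expected) rows as the test.
CASES = [
    (True, True, "hi_res"),
    (False, True, "hi_res"),
    (True, False, "fast"),
    (False, False, "ocr_only"),
]

if __name__ == "__main__":
    for extractable, tables, expected in CASES:
        got = determine_pdf_auto_strategy_sketch(extractable, tables)
        # Compare by value; identity checks on strings depend on interning.
        assert got == expected, (extractable, tables, got)
    print("all cases match")
```

The sketch compares strings with `==`. The tests in the commit use `is`, which checks object identity and passes here only because CPython typically shares identical string literals; `==` would express the intent more directly.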
