enhancement: add "ocr_only" strategy for PDFs (#553)

MthwRobinson · web-flow · commit 3d3f3df3ecfc · 2023-05-08T17:21:24.000Z
* add tests for validating strategy

* refactor into determine_pdf_strategy function

* refactor pdf strategies into strategies

* remove commented out code

* remove unreachable code

* add in handling for image types

* a little more refactoring

* import ocr partitioning for images

* catch warnings, partition type for valid strategies

* fallback to ocr_only from fast

* fallback logic for hi_res

* test for fallback to ocr only

* fallback logic for ocr_only

* more tests for fallback logic

* update doc strings

* version and changelog

* linting, linting, linting

* update docs to include notes about strategy

* fix typos

* change back patched filename
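
The fallback rules described in the docs changes and exercised by the tests in this commit can be summarized in a small sketch. This is not the code from `unstructured/partition/strategies.py`, whose diff is not shown here. `choose_strategy`, `FALLBACK_ORDER`, and the three boolean parameters are hypothetical names used only for illustration, and the ordering is inferred from the documentation and test names.

```python
# Hypothetical sketch of the fallback rules; not the library's implementation.
# Fallback order per requested strategy, inferred from the docs and tests.
FALLBACK_ORDER = {
    "hi_res": ("ocr_only", "fast"),
    "ocr_only": ("fast", "hi_res"),
    # The docs only mention "ocr_only" as a fallback for "fast".
    "fast": ("ocr_only",),
}


def choose_strategy(requested, *, has_detectron2, has_tesseract, text_extractable):
    """Return the first usable strategy, starting with the requested one."""
    if requested not in FALLBACK_ORDER:
        raise ValueError(f"{requested} is not a valid strategy.")

    # Assumption: each strategy depends on exactly one precondition.
    usable = {
        "hi_res": has_detectron2,
        "ocr_only": has_tesseract,
        "fast": text_extractable,
    }

    for candidate in (requested, *FALLBACK_ORDER[requested]):
        if usable[candidate]:
            return candidate

    raise ValueError("No usable strategy for this document and environment.")


# Example: detectron2 is missing, so "hi_res" resolves to "ocr_only".
resolved = choose_strategy(
    "hi_res",
    has_detectron2=False,
    has_tesseract=True,
    text_extractable=True,
)
```

Modeling the rules as an ordered list per strategy keeps each fallback path readable in one place. The real module may structure this differently.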
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,14 @@
+## 0.6.4
+
+### Enhancements
+
+* Added an "ocr_only" strategy for `partition_pdf`. Refactored the strategy decision
+  logic into its own module.
+
+### Features
+
+### Fixes
+
 ## 0.6.3
 
 ### Enhancements
diff --git a/docs/source/bricks.rst b/docs/source/bricks.rst
@@ -138,7 +138,7 @@ to disable SSL verification in the request.
 
 ``partition_via_api`` allows users to partition documents using the hosted Unstructured API.
 The API partitions documents using the automatic ``partition`` function. Currently, the API
-supports all filetypes except for RTF and EPUBs. 
+supports all filetypes except for RTF and EPUBs.
 To use another URL for the API use the ``api_url`` kwarg. This is helpful if you're hosting
 the API yourself or running it locally through a container. You can pass in your API key
 using the ``api_key`` kwarg. You can use the ``content_type`` kwarg to pass in the MIME
@@ -255,7 +255,7 @@ Examples:
 ------------------
 
 The ``partition_odt`` partitioning brick pre-processes Open Office documents
-saved in the ``.odt`` format. The function first converst the document
+saved in the ``.odt`` format. The function first converts the document
 to ``.docx`` using ``pandoc`` and then processes it using ``partition_docx``.
 
 Examples:
@@ -363,10 +363,22 @@ if you'd like to run inference locally.
 If you set the URL, ``partition_pdf`` will make a call to a remote inference server.
 ``partition_pdf`` also includes a ``token`` function that allows you to pass in an authentication
 token for a remote API call.
-The ``strategy`` kwarg controls the method that will be used to process the PDF. The ``"hi_res"`` strategy
-will identify the layout of the document using ``detectron2``. The ``"fast"`` strategy will extract the
-text using ``pdfminer`` and process the raw text with ``partition_text``. If ``detectron2`` is not available,
-and the ``"hi_res"`` strategy is set, ``partition_pdf`` will fallback to the ``"fast"`` strategy.
+
+The ``strategy`` kwarg controls the method that will be used to process the PDF.
+The available strategies for PDFs are `"hi_res"`, `"ocr_only"`, and `"fast"`.
+The ``"hi_res"`` strategy will identify the layout of the document using ``detectron2``. The advantage of `"hi_res"` is that
+it uses the document layout to gain additional information about document elements. We recommend using this strategy
+if your use case is highly sensitive to correct classifications for document elements. If ``detectron2`` is not available,
+the ``"hi_res"`` strategy will fall back to the ``"ocr_only"`` strategy.
+The ``"ocr_only"`` strategy runs the document through Tesseract for OCR and then runs the raw text through ``partition_text``.
+Currently, ``"hi_res"`` has difficulty ordering elements for documents with multiple columns. If you have a document with
+multiple columns that does not have extractable text, we recommend using the ``"ocr_only"`` strategy. ``"ocr_only"`` falls
+back to ``"fast"`` if Tesseract is not available and the document has extractable text.
+The ``"fast"`` strategy will extract the text using ``pdfminer`` and process the raw text with ``partition_text``.
+If the PDF text is not extractable, ``partition_pdf`` will fall back to ``"ocr_only"``. We recommend using the
+``"fast"`` strategy in most cases where the PDF has extractable text.
+
+
 You can also specify what languages to use for OCR with the ``ocr_languages`` kwarg. For example,
 use ``ocr_languages="eng+deu"`` to use the English and German language packs. See the
 `Tesseract documentation <https://github.com/tesseract-ocr/tessdata>`_ for a full list of languages and
@@ -411,6 +423,17 @@ Examples:
 The ``partition_image`` function has the same API as ``partition_pdf``, which is document above.
 The only difference is that ``partition_image`` does not need to convert a PDF to an image
 prior to processing. The ``partition_image`` function supports ``.png`` and ``.jpg`` files.
+
+The ``strategy`` kwarg controls the method that will be used to process the PDF.
+The available strategies for images are `"hi_res"` and ``"ocr_only"``.
+The ``"hi_res"`` strategy will identify the layout of the document using ``detectron2``. The advantage of `"hi_res"` is that it
+uses the document layout to gain additional information about document elements. We recommend using this strategy
+if your use case is highly sensitive to correct classifications for document elements. If ``detectron2`` is not available,
+the ``"hi_res"`` strategy will fall back to the ``"ocr_only"`` strategy.
+The ``"ocr_only"`` strategy runs the document through Tesseract for OCR and then runs the raw text through ``partition_text``.
+Currently, ``"hi_res"`` has difficulty ordering elements for documents with multiple columns. If you have a document with
+multiple columns that does not have extractable text, we recommend using the ``"ocr_only"`` strategy.
+
 You can also specify what languages to use for OCR with the ``ocr_languages`` kwarg. For example,
 use ``ocr_languages="eng+deu"`` to use the English and German language packs. See the
 `Tesseract documentation <https://github.com/tesseract-ocr/tessdata>`_ for a full list of languages and
@@ -430,7 +453,7 @@ Examples:
   elements = partition_image("example-docs/layout-parser-paper-fast.jpg", ocr_languages="eng+swe")
 
 
-The default partitioning strategy for ``partition_image`` is `"hi_res"`, which segements the document using
+The default partitioning strategy for ``partition_image`` is `"hi_res"`, which segments the document using
 ``detectron2`` and then OCRs the document. You can also choose ``"ocr_only"`` as the partitioning strategy,
 which OCRs the document and then runs the output through ``partition_text``. This can be helpful
 if ``detectron2`` does not detect a text element in the image. To run example below, ensure you
diff --git a/test_unstructured/partition/test_pdf.py b/test_unstructured/partition/test_pdf.py
@@ -6,7 +6,7 @@
 from unstructured_inference.inference import layout
 
 from unstructured.documents.elements import NarrativeText, PageBreak, Text, Title
-from unstructured.partition import pdf
+from unstructured.partition import pdf, strategies
 
 
 class MockResponse:
@@ -161,7 +161,7 @@ def test_partition_pdf_api_raises_with_failed_api_call(
     [("fakeurl", True, False), (None, False, True)],
 )
 def test_partition_pdf(url, api_called, local_called, monkeypatch):
-    monkeypatch.setattr(pdf, "is_pdf_text_extractable", lambda *args, **kwargs: True)
+    monkeypatch.setattr(strategies, "is_pdf_text_extractable", lambda *args, **kwargs: True)
     with mock.patch.object(
         pdf,
         attribute="_partition_via_api",
@@ -177,7 +177,7 @@ def test_partition_pdf(url, api_called, local_called, monkeypatch):
     [("fakeurl", True, False), (None, False, True)],
 )
 def test_partition_pdf_with_template(url, api_called, local_called, monkeypatch):
-    monkeypatch.setattr(pdf, "is_pdf_text_extractable", lambda *args, **kwargs: True)
+    monkeypatch.setattr(strategies, "is_pdf_text_extractable", lambda *args, **kwargs: True)
     with mock.patch.object(
         pdf,
         attribute="_partition_via_api",
@@ -253,13 +253,83 @@ def test_partition_pdf_falls_back_to_fast(
     caplog,
     filename="example-docs/layout-parser-paper-fast.pdf",
 ):
-    monkeypatch.setattr(pdf, "dependency_exists", lambda dep: dep != "detectron2")
+    def mock_exists(dep):
+        return dep not in ["detectron2", "pytesseract"]
+
+    monkeypatch.setattr(strategies, "dependency_exists", mock_exists)
+
+    mock_return = [Text("Hello there!")]
+    with mock.patch.object(
+        pdf,
+        "_partition_pdf_with_pdfminer",
+        return_value=mock_return,
+    ) as mock_partition:
+        pdf.partition_pdf(filename=filename, url=None, strategy="hi_res")
+
+    mock_partition.assert_called_once()
+    assert "detectron2 is not installed" in caplog.text
+
+
+def test_partition_pdf_falls_back_to_fast_from_ocr_only(
+    monkeypatch,
+    caplog,
+    filename="example-docs/layout-parser-paper-fast.pdf",
+):
+    def mock_exists(dep):
+        return dep not in ["pytesseract"]
+
+    monkeypatch.setattr(strategies, "dependency_exists", mock_exists)
 
     mock_return = [Text("Hello there!")]
     with mock.patch.object(
         pdf,
         "_partition_pdf_with_pdfminer",
         return_value=mock_return,
+    ) as mock_partition:
+        pdf.partition_pdf(filename=filename, url=None, strategy="ocr_only")
+
+    mock_partition.assert_called_once()
+    assert "pytesseract is not installed" in caplog.text
+
+
+def test_partition_pdf_falls_back_to_hi_res_from_ocr_only(
+    monkeypatch,
+    caplog,
+    filename="example-docs/layout-parser-paper-fast.pdf",
+):
+    def mock_exists(dep):
+        return dep not in ["pytesseract"]
+
+    monkeypatch.setattr(strategies, "dependency_exists", mock_exists)
+    monkeypatch.setattr(strategies, "is_pdf_text_extractable", lambda *args, **kwargs: False)
+
+    mock_return = [Text("Hello there!")]
+    with mock.patch.object(
+        pdf,
+        "_partition_pdf_or_image_local",
+        return_value=mock_return,
+    ) as mock_partition:
+        pdf.partition_pdf(filename=filename, url=None, strategy="ocr_only")
+
+    mock_partition.assert_called_once()
+    assert "pytesseract is not installed" in caplog.text
+
+
+def test_partition_pdf_falls_back_to_ocr_only(
+    monkeypatch,
+    caplog,
+    filename="example-docs/layout-parser-paper-fast.pdf",
+):
+    def mock_exists(dep):
+        return dep not in ["detectron2"]
+
+    monkeypatch.setattr(strategies, "dependency_exists", mock_exists)
+
+    mock_return = [Text("Hello there!")]
+    with mock.patch.object(
+        pdf,
+        "_partition_pdf_or_image_with_ocr",
+        return_value=mock_return,
     ) as mock_partition:
         pdf.partition_pdf(filename=filename, url=None, strategy="hi_res")
 
@@ -276,27 +346,6 @@ def test_partition_pdf_uses_table_extraction():
         assert mock_process_file_with_model.call_args[1]["extract_tables"]
 
 
-@pytest.mark.parametrize(
-    ("filename", "from_file", "expected"),
-    [
-        ("layout-parser-paper-fast.pdf", True, True),
-        ("copy-protected.pdf", True, False),
-        ("layout-parser-paper-fast.pdf", False, True),
-        ("copy-protected.pdf", False, False),
-    ],
-)
-def test_is_pdf_text_extractable(filename, from_file, expected):
-    filename = os.path.join("example-docs", filename)
-
-    if from_file:
-        with open(filename, "rb") as f:
-            extractable = pdf.is_pdf_text_extractable(file=f)
-    else:
-        extractable = pdf.is_pdf_text_extractable(filename=filename)
-
-    assert extractable is expected
-
-
 def test_partition_pdf_with_copy_protection():
     filename = os.path.join("example-docs", "copy-protected.pdf")
     elements = pdf.partition_pdf(filename=filename, strategy="hi_res")
@@ -314,8 +363,11 @@ def test_partition_pdf_fails_if_pdf_not_processable(
     monkeypatch,
     filename="example-docs/layout-parser-paper-fast.pdf",
 ):
-    monkeypatch.setattr(pdf, "dependency_exists", lambda dep: dep != "detectron2")
-    monkeypatch.setattr(pdf, "is_pdf_text_extractable", lambda *args, **kwargs: False)
+    def mock_exists(dep):
+        return dep not in ["detectron2", "pytesseract"]
+
+    monkeypatch.setattr(strategies, "dependency_exists", mock_exists)
+    monkeypatch.setattr(strategies, "is_pdf_text_extractable", lambda *args, **kwargs: False)
 
     with pytest.raises(ValueError):
         pdf.partition_pdf(filename=filename)
diff --git a/test_unstructured/partition/test_strategies.py b/test_unstructured/partition/test_strategies.py
@@ -0,0 +1,41 @@
+import os
+
+import pytest
+
+from unstructured.partition import strategies
+
+
+def test_validate_strategy_validates():
+    # Nothing should raise for a valid strategy
+    strategies.validate_strategy("hi_res", "pdf")
+
+
+def test_validate_strategy_raises_for_bad_filetype():
+    with pytest.raises(ValueError):
+        strategies.validate_strategy("fast", "image")
+
+
+def test_validate_strategy_raises_for_bad_strategy():
+    with pytest.raises(ValueError):
+        strategies.validate_strategy("totally_guess_the_text", "image")
+
+
+@pytest.mark.parametrize(
+    ("filename", "from_file", "expected"),
+    [
+        ("layout-parser-paper-fast.pdf", True, True),
+        ("copy-protected.pdf", True, False),
+        ("layout-parser-paper-fast.pdf", False, True),
+        ("copy-protected.pdf", False, False),
+    ],
+)
+def test_is_pdf_text_extractable(filename, from_file, expected):
+    filename = os.path.join("example-docs", filename)
+
+    if from_file:
+        with open(filename, "rb") as f:
+            extractable = strategies.is_pdf_text_extractable(file=f)
+    else:
+        extractable = strategies.is_pdf_text_extractable(filename=filename)
+
+    assert extractable is expected
diff --git a/unstructured/__version__.py b/unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.6.3"  # pragma: no cover
+__version__ = "0.6.4"  # pragma: no cover
diff --git a/unstructured/partition/image.py b/unstructured/partition/image.py
@@ -1,14 +1,8 @@
 from typing import List, Optional
 
-import pytesseract
-from PIL import Image
-
 from unstructured.documents.elements import Element
 from unstructured.partition.common import exactly_one
 from unstructured.partition.pdf import partition_pdf_or_image
-from unstructured.partition.text import partition_text
-
-VALID_STRATEGIES = ["hi_res", "ocr_only"]
 
 
 def partition_image(
@@ -42,35 +36,22 @@ def partition_image(
         to install the appropriate Tesseract language pack.
     strategy
         The strategy to use for partitioning the PDF. Valid strategies are "hi_res" and
-        "ocr_only". When using the "hi_res" strategy, the function  ses a layout detection
-        model if to identify document elements. When using the "ocr_only strategy",
-        partition_image simply extracts the text from the document and processes it.
+        "ocr_only". When using the "hi_res" strategy, the function uses a layout detection
+        model to identify document elements. When using the "ocr_only" strategy,
+        partition_image simply extracts the text from the document using OCR and processes it.
     """
     exactly_one(filename=filename, file=file)
 
-    if strategy == "hi_res":
-        if template is None:
-            template = "layout/image"
-        return partition_pdf_or_image(
-            filename=filename,
-            file=file,
-            url=url,
-            template=template,
-            token=token,
-            include_page_breaks=include_page_breaks,
-            ocr_languages=ocr_languages,
-        )
-
-    elif strategy == "ocr_only":
-        if file is not None:
-            image = Image.open(file)
-            text = pytesseract.image_to_string(image, config=f"-l '{ocr_languages}'")
-        else:
-            text = pytesseract.image_to_string(filename, config=f"-l '{ocr_languages}'")
-        return partition_text(text=text)
-
-    else:
-        raise ValueError(
-            f"{strategy} is not a valid strategy for partition_image. "
-            f"Choose one of {VALID_STRATEGIES}.",
-        )
+    if template is None:
+        template = "layout/image"
+
+    return partition_pdf_or_image(
+        filename=filename,
+        file=file,
+        url=url,
+        template=template,
+        token=token,
+        include_page_breaks=include_page_breaks,
+        ocr_languages=ocr_languages,
+        strategy=strategy,
+    )
diff --git a/unstructured/partition/pdf.py b/unstructured/partition/pdf.py
diff --git a/unstructured/partition/strategies.py b/unstructured/partition/strategies.py
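
The diff body for `unstructured/partition/strategies.py` is not shown here. Going only by `test_strategies.py` and the documented strategy lists (`"hi_res"`, `"ocr_only"`, and `"fast"` for PDFs; `"hi_res"` and `"ocr_only"` for images), a validation helper consistent with those tests might look like the sketch below. The name `validate_strategy_sketch` and the `VALID_STRATEGIES_SKETCH` mapping are assumptions for illustration, not the module's actual contents.

```python
# Hypothetical sketch consistent with test_strategies.py; not the real module.
VALID_STRATEGIES_SKETCH = {
    "pdf": ["hi_res", "ocr_only", "fast"],
    "image": ["hi_res", "ocr_only"],
}


def validate_strategy_sketch(strategy, filetype):
    """Raise ValueError if the strategy is not allowed for the filetype."""
    # Assumption: an unknown filetype is also treated as an error.
    valid = VALID_STRATEGIES_SKETCH.get(filetype)
    if valid is None:
        raise ValueError(f"{filetype} is not a supported filetype.")
    if strategy not in valid:
        raise ValueError(
            f"{strategy} is not a valid strategy for filetype {filetype}. "
            f"Valid strategies are {valid}.",
        )


# A valid combination returns None without raising.
result = validate_strategy_sketch("hi_res", "pdf")
```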
