@@ -364,21 +364,6 @@ If you set the URL, ``partition_pdf`` will make a call to a remote inference ser
``partition_pdf`` also includes a ``token`` function that allows you to pass in an authentication
token for a remote API call.
- The ``strategy`` kwarg controls the method that will be used to process the PDF.
- The available strategies for PDFs are ``"hi_res"``, ``"ocr_only"``, and ``"fast"``.
- The ``"hi_res"`` strategy will identify the layout of the document using ``detectron2``. The advantage of ``"hi_res"`` is that
- it uses the document layout to gain additional information about document elements. We recommend using this strategy
- if your use case is highly sensitive to correct classifications for document elements. If ``detectron2`` is not available,
- the ``"hi_res"`` strategy will fall back to the ``"ocr_only"`` strategy.
- The ``"ocr_only"`` strategy runs the document through Tesseract for OCR and then runs the raw text through ``partition_text``.
- Currently, ``"hi_res"`` has difficulty ordering elements for documents with multiple columns. If you have a document with
- multiple columns that does not have extractable text, we recommend using the ``"ocr_only"`` strategy. ``"ocr_only"`` falls
- back to ``"fast"`` if Tesseract is not available and the document has extractable text.
- The ``"fast"`` strategy will extract the text using ``pdfminer`` and process the raw text with ``partition_text``.
- If the PDF text is not extractable, ``partition_pdf`` will fall back to ``"ocr_only"``. We recommend using the
- ``"fast"`` strategy in most cases where the PDF has extractable text.
-
-
You can also specify what languages to use for OCR with the ``ocr_languages`` kwarg. For example,
use ``ocr_languages="eng+deu"`` to use the English and German language packs. See the
`Tesseract documentation <https://github.com/tesseract-ocr/tessdata>`_ for a full list of languages and
@@ -398,9 +383,31 @@ Examples:
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", ocr_languages="eng+swe")
+ The ``strategy`` kwarg controls the method that will be used to process the PDF.
+ The available strategies for PDFs are ``"auto"``, ``"hi_res"``, ``"ocr_only"``, and ``"fast"``.
+
+ The ``"auto"`` strategy will choose the partitioning strategy based on document characteristics and the function kwargs.
+ If ``infer_table_structure`` is passed, the strategy will be ``"hi_res"`` because that is the only strategy that
+ currently extracts tables for PDFs. Otherwise, ``"auto"`` will choose ``"fast"`` if the PDF text is extractable and
+ ``"ocr_only"`` otherwise. ``"auto"`` is the default strategy.
+
+ The ``"hi_res"`` strategy will identify the layout of the document using ``detectron2``. The advantage of ``"hi_res"`` is that
+ it uses the document layout to gain additional information about document elements. We recommend using this strategy
+ if your use case is highly sensitive to correct classifications for document elements. If ``detectron2`` is not available,
+ the ``"hi_res"`` strategy will fall back to the ``"ocr_only"`` strategy.
+
+ The ``"ocr_only"`` strategy runs the document through Tesseract for OCR and then runs the raw text through ``partition_text``.
+ Currently, ``"hi_res"`` has difficulty ordering elements for documents with multiple columns. If you have a document with
+ multiple columns that does not have extractable text, we recommend using the ``"ocr_only"`` strategy. ``"ocr_only"`` falls
+ back to ``"fast"`` if Tesseract is not available and the document has extractable text.
+
+ The ``"fast"`` strategy will extract the text using ``pdfminer`` and process the raw text with ``partition_text``.
+ If the PDF text is not extractable, ``partition_pdf`` will fall back to ``"ocr_only"``. We recommend using the
+ ``"fast"`` strategy in most cases where the PDF has extractable text.
+
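The ``"auto"`` rules for PDFs described above can be read as a small decision function. The sketch below is only an illustration of that description: ``choose_pdf_strategy`` and its ``text_extractable`` flag are hypothetical names, not part of the library, and the real selection logic may differ.

```python
def choose_pdf_strategy(
    strategy: str = "auto",
    text_extractable: bool = True,
    infer_table_structure: bool = False,
) -> str:
    """Resolve "auto" to a concrete PDF strategy.

    Hypothetical helper for illustration only; it mirrors the prose
    description and is not the library's implementation.
    """
    valid = {"auto", "hi_res", "ocr_only", "fast"}
    if strategy not in valid:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if strategy != "auto":
        # An explicitly requested strategy is returned unchanged.
        return strategy
    if infer_table_structure:
        # Table extraction is described as available only with "hi_res".
        return "hi_res"
    # Otherwise prefer direct text extraction, and OCR when it is not possible.
    return "fast" if text_extractable else "ocr_only"
```

For example, ``choose_pdf_strategy(text_extractable=False)`` resolves to ``"ocr_only"`` under these assumptions.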
If a PDF is copy protected, ``partition_pdf`` can process the document with the ``"hi_res"`` strategy (which
- will treat it like an image), but cannot process the document with the ``"fast"`` strategy. If the user
- chooses ``"fast"`` on a copy protected PDF, ``partition_pdf`` will fall back to the ``"hi_res"``
+ will treat it like an image), but cannot process the document with the ``"fast"`` strategy.
+ If the user chooses ``"fast"`` on a copy protected PDF, ``partition_pdf`` will fall back to the ``"hi_res"``
strategy. If ``detectron2`` is not installed, ``partition_pdf`` will fail for copy protected
PDFs because the document will not be processable by any of the available methods.
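The fallback behaviour described for ``partition_pdf`` can be summarised as a single-step lookup. This is a minimal sketch under stated assumptions: ``fallback_pdf_strategy`` and all of its flags are hypothetical names introduced here, it applies one fallback step only, and it leaves cases the text does not cover unchanged.

```python
def fallback_pdf_strategy(
    strategy: str,
    *,
    text_extractable: bool,
    copy_protected: bool = False,
    has_detectron2: bool = True,
    has_tesseract: bool = True,
) -> str:
    """Apply one documented fallback step to a requested PDF strategy.

    Hypothetical helper for illustration only; not the library's code.
    """
    if copy_protected:
        if not has_detectron2:
            # Described as unprocessable by any available method.
            raise RuntimeError("copy protected PDF requires detectron2")
        if strategy == "fast":
            # "fast" cannot read a copy protected PDF, so use "hi_res".
            return "hi_res"
        # Other combinations are not covered by the description.
        return strategy
    if strategy == "hi_res" and not has_detectron2:
        return "ocr_only"
    if strategy == "ocr_only" and not has_tesseract and text_extractable:
        return "fast"
    if strategy == "fast" and not text_extractable:
        return "ocr_only"
    return strategy
```

Keeping the fallback to a single step avoids guessing how the library chains fallbacks when several dependencies are missing at once.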
@@ -424,16 +431,6 @@ The ``partition_image`` function has the same API as ``partition_pdf``, which is
The only difference is that ``partition_image`` does not need to convert a PDF to an image
prior to processing. The ``partition_image`` function supports ``.png`` and ``.jpg`` files.
- The ``strategy`` kwarg controls the method that will be used to process the PDF.
- The available strategies for images are ``"hi_res"`` and ``"ocr_only"``.
- The ``"hi_res"`` strategy will identify the layout of the document using ``detectron2``. The advantage of ``"hi_res"`` is that it
- uses the document layout to gain additional information about document elements. We recommend using this strategy
- if your use case is highly sensitive to correct classifications for document elements. If ``detectron2`` is not available,
- the ``"hi_res"`` strategy will fall back to the ``"ocr_only"`` strategy.
- The ``"ocr_only"`` strategy runs the document through Tesseract for OCR and then runs the raw text through ``partition_text``.
- Currently, ``"hi_res"`` has difficulty ordering elements for documents with multiple columns. If you have a document with
- multiple columns that does not have extractable text, we recommend using the ``"ocr_only"`` strategy.
-
You can also specify what languages to use for OCR with the ``ocr_languages`` kwarg. For example,
use ``ocr_languages="eng+deu"`` to use the English and German language packs. See the
`Tesseract documentation <https://github.com/tesseract-ocr/tessdata>`_ for a full list of languages and
@@ -453,9 +450,23 @@ Examples:
elements = partition_image("example-docs/layout-parser-paper-fast.jpg", ocr_languages="eng+swe")
- The default partitioning strategy for ``partition_image`` is ``"hi_res"``, which segments the document using
- ``detectron2`` and then OCRs the document. You can also choose ``"ocr_only"`` as the partitioning strategy,
- which OCRs the document and then runs the output through ``partition_text``. This can be helpful
+ The ``strategy`` kwarg controls the method that will be used to process the image.
+ The available strategies for images are ``"auto"``, ``"hi_res"`` and ``"ocr_only"``.
+
+ The ``"auto"`` strategy will choose the partitioning strategy based on document characteristics and the function kwargs.
+ If ``infer_table_structure`` is passed, the strategy will be ``"hi_res"`` because that is the only strategy that
+ currently extracts tables for PDFs. Otherwise, ``"auto"`` will choose ``"ocr_only"``. ``"auto"`` is the default strategy.
+
+ The ``"hi_res"`` strategy will identify the layout of the document using ``detectron2``. The advantage of ``"hi_res"`` is that it
+ uses the document layout to gain additional information about document elements. We recommend using this strategy
+ if your use case is highly sensitive to correct classifications for document elements. If ``detectron2`` is not available,
+ the ``"hi_res"`` strategy will fall back to the ``"ocr_only"`` strategy.
+
+ The ``"ocr_only"`` strategy runs the document through Tesseract for OCR and then runs the raw text through ``partition_text``.
+ Currently, ``"hi_res"`` has difficulty ordering elements for documents with multiple columns. If you have a document with
+ multiple columns that does not have extractable text, we recommend using the ``"ocr_only"`` strategy.
+
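For images, the ``"auto"`` description reduces to a two-way choice, since there is no ``"fast"`` strategy to consider. The following is a rough sketch of that rule only: ``choose_image_strategy`` is a hypothetical name used for illustration and is not a function provided by the library.

```python
def choose_image_strategy(
    strategy: str = "auto",
    infer_table_structure: bool = False,
) -> str:
    """Resolve "auto" to a concrete image strategy.

    Hypothetical helper for illustration only; it mirrors the prose
    description and is not the library's implementation.
    """
    valid = {"auto", "hi_res", "ocr_only"}
    if strategy not in valid:
        # "fast" is not listed as an available strategy for images.
        raise ValueError(f"unknown image strategy: {strategy!r}")
    if strategy != "auto":
        # An explicitly requested strategy is returned unchanged.
        return strategy
    return "hi_res" if infer_table_structure else "ocr_only"
```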
+ It is helpful to use ``"ocr_only"`` instead of ``"hi_res"``
if ``detectron2`` does not detect a text element in the image. To run the example below, ensure you
have the Korean language pack for Tesseract installed on your system.