Commit 2836e2d

doc nit
1 parent a4701a4 commit 2836e2d

3 files changed: +8 -5 lines changed

Diff for: unstructured/partition/auto.py

+2 -2
@@ -70,8 +70,8 @@ def partition(
  include_page_breaks
  If True, the output will include page breaks if the filetype supports it
  strategy
- The strategy to use for partitioning the PDF. Uses a layout detection model if set
- to 'hi_res', otherwise partition_pdf simply extracts the text from the document
+ The strategy to use for partitioning PDF/image. Uses a layout detection model if set
+ to 'hi_res', otherwise partition simply extracts the text from the document
  and processes it.
  encoding
  The encoding method used to decode the text input. If None, utf-8 will be used.

Diff for: unstructured/partition/image.py

+3 -1
@@ -35,10 +35,12 @@ def partition_image(
  The languages to use for the Tesseract agent. To use a language, you'll first need
  to install the appropriate Tesseract language pack.
  strategy
- The strategy to use for partitioning the PDF. Valid strategies are "hi_res" and
+ The strategy to use for partitioning the image. Valid strategies are "hi_res" and
  "ocr_only". When using the "hi_res" strategy, the function uses a layout detection
  model if to identify document elements. When using the "ocr_only" strategy,
  partition_image simply extracts the text from the document using OCR and processes it.
+ The default strategy `auto` will determine when a image can be extracted using `ocr_only` mode,
+ otherwise it will fall back to `hi_res`.
  """
  exactly_one(filename=filename, file=file)
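
The `auto` behaviour described in the updated partition_image docstring can be read as a small decision rule. The sketch below is only an illustration of that wording, not the library's implementation: `resolve_image_strategy`, `IMAGE_STRATEGIES`, and the `ocr_only_suitable` flag are hypothetical names, and how the library actually decides that an image suits `ocr_only` is not shown in this commit.

```python
# Hypothetical helper that mirrors the docstring wording only.
# It is NOT part of unstructured; all names here are assumptions.
IMAGE_STRATEGIES = ("auto", "hi_res", "ocr_only")


def resolve_image_strategy(strategy: str = "auto", ocr_only_suitable: bool = False) -> str:
    """Return the concrete strategy an image would be partitioned with."""
    if strategy not in IMAGE_STRATEGIES:
        raise ValueError(f"invalid image strategy: {strategy!r}")
    if strategy != "auto":
        # An explicit choice is used as given.
        return strategy
    # `auto`: use `ocr_only` when it is suitable, otherwise fall back to `hi_res`.
    return "ocr_only" if ocr_only_suitable else "hi_res"
```

Under this reading, "fast" is rejected for images because the docstring lists only "hi_res" and "ocr_only" (plus the `auto` default) as valid.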

Diff for: unstructured/partition/pdf.py

+3 -2
@@ -57,9 +57,10 @@ def partition_pdf(
  The strategy to use for partitioning the PDF. Valid strategies are "hi_res",
  "ocr_only", and "fast". When using the "hi_res" strategy, the function uses
  a layout detection model to identify document elements. When using the
- "ocr_only" strategy, partition_image simply extracts the text from the
+ "ocr_only" strategy, partition_pdf simply extracts the text from the
  document using OCR and processes it. If the "fast" strategy is used, the text
- is extracted directly from the PDF.
+ is extracted directly from the PDF. The default strategy `auto` will determine
+ when a page can be extracted using `fast` mode, otherwise it will fall back to `hi_res`.
  infer_table_structure
  Only applicable if `strategy=hi_res`.
  If True, any Table elements that are extracted will also have a metadata field
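
For the PDF case, the docstring combines two rules: `auto` resolves to `fast` or `hi_res`, and `infer_table_structure` only applies under `hi_res`. A minimal sketch of how those two rules interact follows. `PdfPlan`, `plan_pdf_partition`, and the `text_extractable` flag are hypothetical names introduced for illustration; the check the library really uses to decide that `fast` is possible is not part of this diff.

```python
from dataclasses import dataclass

# Strategy names taken from the partition_pdf docstring.
PDF_STRATEGIES = ("auto", "hi_res", "ocr_only", "fast")


@dataclass(frozen=True)
class PdfPlan:
    """Hypothetical record of what a partition_pdf call would end up doing."""
    strategy: str
    infer_tables: bool


def plan_pdf_partition(
    strategy: str = "auto",
    text_extractable: bool = False,
    infer_table_structure: bool = False,
) -> PdfPlan:
    """Resolve `auto` and report whether table inference would apply."""
    if strategy not in PDF_STRATEGIES:
        raise ValueError(f"invalid PDF strategy: {strategy!r}")
    resolved = strategy
    if strategy == "auto":
        # `fast` when text can be extracted directly, otherwise `hi_res`.
        resolved = "fast" if text_extractable else "hi_res"
    # Per the docstring, table structure inference only applies to `hi_res`.
    return PdfPlan(resolved, infer_table_structure and resolved == "hi_res")
```

One consequence of this reading: a caller who leaves the default `auto` and sets `infer_table_structure=True` would get table structure only when the document falls back to `hi_res`.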
