3 files changed: +8 -5 lines changed
@@ -70,8 +70,8 @@ def partition(
     include_page_breaks
         If True, the output will include page breaks if the filetype supports it
     strategy
-        The strategy to use for partitioning the PDF. Uses a layout detection model if set
-        to 'hi_res', otherwise partition_pdf simply extracts the text from the document
+        The strategy to use for partitioning PDF/image. Uses a layout detection model if set
+        to 'hi_res', otherwise partition simply extracts the text from the document
         and processes it.
     encoding
         The encoding method used to decode the text input. If None, utf-8 will be used.
@@ -35,10 +35,12 @@ def partition_image(
         The languages to use for the Tesseract agent. To use a language, you'll first need
         to install the appropriate Tesseract language pack.
     strategy
-        The strategy to use for partitioning the PDF. Valid strategies are "hi_res" and
+        The strategy to use for partitioning the image. Valid strategies are "hi_res" and
         "ocr_only". When using the "hi_res" strategy, the function uses a layout detection
         model to identify document elements. When using the "ocr_only" strategy,
         partition_image simply extracts the text from the document using OCR and processes it.
+        The default strategy `auto` will determine when an image can be extracted using `ocr_only` mode,
+        otherwise it will fall back to `hi_res`.
     """
     exactly_one(filename=filename, file=file)
@@ -57,9 +57,10 @@ def partition_pdf(
         The strategy to use for partitioning the PDF. Valid strategies are "hi_res",
         "ocr_only", and "fast". When using the "hi_res" strategy, the function uses
         a layout detection model to identify document elements. When using the
-        "ocr_only" strategy, partition_image simply extracts the text from the
+        "ocr_only" strategy, partition_pdf simply extracts the text from the
         document using OCR and processes it. If the "fast" strategy is used, the text
-        is extracted directly from the PDF.
+        is extracted directly from the PDF. The default strategy `auto` will determine
+        when a page can be extracted using `fast` mode, otherwise it will fall back to `hi_res`.
     infer_table_structure
         Only applicable if `strategy=hi_res`.
         If True, any Table elements that are extracted will also have a metadata field
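
The `auto` fallback described in the partition_image and partition_pdf docstrings can be read as a small decision rule. The sketch below is only an illustration of that rule, not the library's implementation: `resolve_strategy` and the `cheap_mode_ok` flag are hypothetical names, and how the library actually decides whether a page or image qualifies for the cheaper mode is not stated in this diff.

```python
# Hypothetical helper: mirrors the fallback rule from the docstrings only.
# It is NOT part of the library and does not inspect any document.
VALID_STRATEGIES = {
    "pdf": {"auto", "hi_res", "ocr_only", "fast"},
    "image": {"auto", "hi_res", "ocr_only"},
}

# The cheaper mode that `auto` tries first for each file type.
CHEAP_MODE = {"pdf": "fast", "image": "ocr_only"}


def resolve_strategy(strategy: str, filetype: str, cheap_mode_ok: bool) -> str:
    """Map a requested strategy to the concrete one that would be used.

    `cheap_mode_ok` is an assumed, caller-supplied flag standing in for
    whatever check decides that the cheaper mode can extract the content.
    """
    if filetype not in VALID_STRATEGIES:
        raise ValueError(f"unsupported filetype: {filetype!r}")
    if strategy not in VALID_STRATEGIES[filetype]:
        raise ValueError(f"invalid strategy {strategy!r} for {filetype}")
    if strategy != "auto":
        # An explicit strategy is used as requested.
        return strategy
    # `auto`: prefer the cheaper mode, otherwise fall back to hi_res.
    return CHEAP_MODE[filetype] if cheap_mode_ok else "hi_res"
```

Under these assumptions, `resolve_strategy("auto", "pdf", cheap_mode_ok=False)` yields `"hi_res"`, while an explicit `"ocr_only"` is passed through unchanged for either file type.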