Commit eabf116

chore/change default split page behavior to true (#118)
* Set the split_pdf_page default to true and run `make client-generate` locally.
* Update the readme, add another reference back to our docs
* Change some warning logs to info. The user should not be warned about default behavior for non pdf files

# Testing

Use the client locally and verify that split mode is the default, and that the client behavior is consistent with older versions.

* Set up (or activate) your pyenv for the client: `pyenv virtualenv 3.12 unstructured-client; pyenv activate unstructured-client`
* Check out this branch and install: `pip install -e .`
* Run this sample script in the top level of the client repo. Try different files in `_sample_docs` and verify that the logging and results look acceptable.

```
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared, operations
import json

api_key = "free-api-key"
filename = "_sample_docs/layout-parser-paper.pdf"

s = UnstructuredClient(
    api_key_auth=api_key,
)

with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = operations.PartitionRequest(
    shared.PartitionParameters(
        files=files,
        strategy=shared.Strategy.AUTO
    ),
)

try:
    resp = s.general.partition(req)
    print(json.dumps(resp.elements, indent=4))
except Exception as e:
    print(e)
```
1 parent c55e721 commit eabf116

10 files changed: +16 -15 lines changed

Diff for: .speakeasy/gen.lock

+3 -3
@@ -1,12 +1,12 @@
 lockVersion: 2.0.0
 id: 8b5fa338-9106-4734-abf0-e30d67044a90
 management:
-  docChecksum: 5365c99c52e23b044ef9916ecf51b1a9
+  docChecksum: c7e23b3b8242eb21eccb2091bcc57c72
   docVersion: 1.0.35
   speakeasyVersion: 1.308.1
   generationVersion: 2.342.6
-  releaseVersion: 0.23.5
-  configChecksum: e210d7bff3afd386269cb7c6adeef630
+  releaseVersion: 0.23.6
+  configChecksum: 4e2e510c7f4b61e04b61acf7de2939a3
   repoURL: https://github.com/Unstructured-IO/unstructured-python-client.git
   repoSubDirectory: .
   installationURL: https://github.com/Unstructured-IO/unstructured-python-client.git

Diff for: README.md

+3 -2
@@ -72,7 +72,9 @@ Refer to the [API parameters page](https://docs.unstructured.io/api-reference/ap

 #### Splitting PDF by pages

-In order to speed up processing of long PDF files, `split_pdf_page` can be set to `True` (defaults to `False`). It will cause the PDF to be split at client side, before sending to API, and combining individual responses as single result. This parameter will affect only PDF files, no need to disable it for other filetypes.
+See [page splitting](https://docs.unstructured.io/api-reference/api-services/sdk#page-splitting) for more details.
+
+In order to speed up processing of large PDF files, the client splits up PDFs into smaller files, sends these to the API concurrently, and recombines the results. `split_pdf_page` can be set to `False` to disable this.

 The amount of workers utilized for splitting PDFs is dictated by the `split_pdf_concurrency_level` parameter, with a default of 5 and a maximum of 15 to keep resource usage and costs in check. The splitting process leverages `asyncio` to manage concurrency effectively.
 The size of each batch of pages (ranging from 2 to 20) is internally determined based on the concurrency level and the total number of pages in the document. Because the splitting process uses `asyncio` the client can encouter event loop issues if it is nested in another async runner, like running in a `gevent` spawned task. Instead, this is safe to run in multiprocessing workers (e.g., using `multiprocessing.Pool` with `fork` context).
@@ -83,7 +85,6 @@ req = shared.PartitionParameters(
     files=files,
     strategy="fast",
     languages=["eng"],
-    split_pdf_page=True,
     split_pdf_concurrency_level=8
 )
 ```
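
Reviewer note: the README text in this diff describes splitting a PDF, sending the pieces concurrently, and recombining the results, with `asyncio` managing the concurrency. The sketch below illustrates that general pattern only. `send_chunk` is a hypothetical stand-in for one API request, chunks are plain lists of page numbers, and nothing here is taken from the client's actual implementation; only the default of 5 and the maximum of 15 come from the README.

```python
import asyncio

MAX_CONCURRENCY = 15  # upper bound stated in the README


async def send_chunk(chunk):
    """Hypothetical stand-in for one API request; returns one element per page."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [f"element-from-page-{page}" for page in chunk]


async def partition_concurrently(chunks, concurrency_level=5):
    """Send chunks with at most `concurrency_level` requests in flight."""
    limit = asyncio.Semaphore(min(concurrency_level, MAX_CONCURRENCY))

    async def guarded(chunk):
        async with limit:
            return await send_chunk(chunk)

    # gather() returns results in input order, so pages recombine in sequence.
    results = await asyncio.gather(*(guarded(c) for c in chunks))
    return [element for part in results for element in part]


def run_example():
    """Run the sketch on three small chunks covering pages 1 to 5."""
    chunks = [[1, 2], [3, 4], [5]]
    return asyncio.run(partition_concurrently(chunks, concurrency_level=2))
```

Because `asyncio.run` starts its own event loop, calling `run_example()` from inside an already running loop raises an error, which is the same class of nesting problem the README warns about.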

Diff for: _test_unstructured_client/unit/test_split_pdf_hook.py

+1 -1
@@ -276,7 +276,7 @@ def test_unit_is_pdf_invalid_extension(caplog):
     """Test is pdf method returns False for file with invalid extension."""
     file = shared.Files(b"txt_content", "test_file.txt")

-    with caplog.at_level(logging.WARNING):
+    with caplog.at_level(logging.INFO):
         result = pdf_utils.is_pdf(file)

     assert result is False

Diff for: gen.yaml

+1 -1
@@ -10,7 +10,7 @@ generation:
   auth:
     oAuth2ClientCredentialsEnabled: false
 python:
-  version: 0.23.5
+  version: 0.23.6
   additionalDependencies:
     dependencies:
       deepdiff: '>=6.0'

Diff for: overlay_client.yaml

+1 -1
@@ -10,7 +10,7 @@ actions:
           "type": "boolean",
           "title": "Split Pdf Page",
           "description": "This parameter determines if the PDF file should be split on the client side. It's an internal parameter for the Python client and is not sent to the backend.",
-          "default": false,
+          "default": true,
         }
   - target: $["components"]["schemas"]["partition_parameters"]["properties"]
     update:

Diff for: setup.py

+1 -1
@@ -19,7 +19,7 @@

 setuptools.setup(
     name='unstructured-client',
-    version='0.23.5',
+    version='0.23.6',
     author='Unstructured',
     description='Python Client SDK for Unstructured API',
     license = 'MIT',

Diff for: src/unstructured_client/_hooks/custom/pdf_utils.py

+1 -1
@@ -59,7 +59,7 @@ def is_pdf(file: shared.Files) -> bool:
         True if the file is a PDF, False otherwise.
     """
     if not file.file_name.endswith(".pdf"):
-        logger.warning("Given file doesn't have '.pdf' extension. Continuing without splitting.")
+        logger.info("Given file doesn't have '.pdf' extension, so splitting is not enabled.")
         return False

     try:
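
Reviewer note: since this message moves from WARNING to INFO, callers whose logger threshold is WARNING will no longer see it, which is why the unit test now captures at `logging.INFO`. The sketch below shows the effect with the standard library only. The logger name `split-sketch`, the `ListHandler` class, and `has_pdf_extension` are hypothetical names for illustration; the function copies only the extension check and message from the diff, not the rest of `is_pdf`.

```python
import logging


class ListHandler(logging.Handler):
    """Collects log records in memory so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("split-sketch")
logger.propagate = False  # keep the sketch's output out of the root logger
handler = ListHandler()
logger.addHandler(handler)


def has_pdf_extension(file_name):
    """Mirror of the extension check in the diff, logging at INFO."""
    if not file_name.endswith(".pdf"):
        logger.info("Given file doesn't have '.pdf' extension, so splitting is not enabled.")
        return False
    return True


def messages_seen(level, file_name):
    """Return the messages recorded when the logger threshold is `level`."""
    handler.records.clear()
    logger.setLevel(level)
    has_pdf_extension(file_name)
    return [record.getMessage() for record in handler.records]
```

With the threshold at `logging.WARNING` the call records nothing for a `.txt` file, while `logging.INFO` records the single message.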

Diff for: src/unstructured_client/_hooks/custom/split_pdf_hook.py

+2 -2
@@ -135,7 +135,7 @@ def before_request(
             or not isinstance(file, shared.Files)
             or not pdf_utils.is_pdf(file)
         ):
-            logger.warning("File could not be split. Partitioning without split.")
+            logger.info("Partitioning without split.")
             return request

         starting_page_number = form_utils.get_starting_page_number(
@@ -160,7 +160,7 @@ def before_request(
         logger.info("Determined optimal split size of %d pages.", split_size)

         if split_size >= len(pdf.pages):
-            logger.warning(
+            logger.info(
                 "Document has too few pages (%d) to be split efficiently. Partitioning without split.",
                 len(pdf.pages),
             )
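
Reviewer note: the guard in this hunk skips splitting when the chosen split size already covers the whole document. The README states only that the batch size ranges from 2 to 20 and depends on the concurrency level and the page count; it does not give the formula. The sketch below assumes an even share of pages per worker, clamped to that range, so `choose_split_size` and `plan_splits` are hypothetical and may differ from what the hook actually computes.

```python
import math

MIN_PAGES_PER_SPLIT = 2   # range stated in the README
MAX_PAGES_PER_SPLIT = 20


def choose_split_size(num_pages, concurrency_level):
    """Assumed rule: spread pages evenly across workers, clamped to 2..20."""
    even_share = math.ceil(num_pages / concurrency_level)
    return max(MIN_PAGES_PER_SPLIT, min(MAX_PAGES_PER_SPLIT, even_share))


def plan_splits(num_pages, concurrency_level=5):
    """Return (first_page, last_page) ranges, or [] when splitting is skipped."""
    split_size = choose_split_size(num_pages, concurrency_level)
    if split_size >= num_pages:
        # Same guard as the hook: too few pages to split efficiently.
        return []
    return [
        (start, min(start + split_size - 1, num_pages))
        for start in range(1, num_pages + 1, split_size)
    ]
```

Under this assumed rule a two-page document is never split, which is the ordinary case the commit stops warning about now that splitting is on by default.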

Diff for: src/unstructured_client/models/shared/partition_parameters.py

+1 -1
@@ -83,7 +83,7 @@ class PartitionParameters:
     r"""The document types that you want to skip table extraction with. Default: []"""
     split_pdf_concurrency_level: Optional[int] = dataclasses.field(default=5, metadata={'multipart_form': { 'field_name': 'split_pdf_concurrency_level' }})
     r"""When `split_pdf_page` is set to `True`, this parameter specifies the number of workers used for sending requests when the PDF is split on the client side. It's an internal parameter for the Python client and is not sent to the backend."""
-    split_pdf_page: Optional[bool] = dataclasses.field(default=False, metadata={'multipart_form': { 'field_name': 'split_pdf_page' }})
+    split_pdf_page: Optional[bool] = dataclasses.field(default=True, metadata={'multipart_form': { 'field_name': 'split_pdf_page' }})
     r"""This parameter determines if the PDF file should be split on the client side. It's an internal parameter for the Python client and is not sent to the backend."""
     starting_page_number: Optional[int] = dataclasses.field(default=None, metadata={'multipart_form': { 'field_name': 'starting_page_number' }})
     r"""When PDF is split into pages before sending it into the API, providing this information will allow the page number to be assigned correctly. Introduced in 1.0.27."""
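
Reviewer note: the practical effect of this one-word change is that callers who never set `split_pdf_page` now get client-side splitting for PDFs, and must pass `False` to opt out. The sketch below models that with a stand-in dataclass so it runs without the SDK installed. `PartitionParametersSketch` and `should_attempt_split` are hypothetical names, and the extension test is a simplification of the real `is_pdf` check.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class PartitionParametersSketch:
    """Hypothetical stand-in holding only the two client-side fields."""

    split_pdf_concurrency_level: Optional[int] = 5
    split_pdf_page: Optional[bool] = True  # was False before this commit


def should_attempt_split(params, file_name):
    """Splitting is attempted only when enabled and the file looks like a PDF."""
    return bool(params.split_pdf_page) and file_name.endswith(".pdf")
```

A caller relying on the old behavior would construct the parameters with `split_pdf_page=False`, while non-PDF inputs are unaffected either way.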

Diff for: src/unstructured_client/sdkconfiguration.py

+2 -2
@@ -29,9 +29,9 @@ class SDKConfiguration:
     server: Optional[str] = ''
     language: str = 'python'
     openapi_doc_version: str = '1.0.35'
-    sdk_version: str = '0.23.5'
+    sdk_version: str = '0.23.6'
     gen_version: str = '2.342.6'
-    user_agent: str = 'speakeasy-sdk/python 0.23.5 2.342.6 1.0.35 unstructured-client'
+    user_agent: str = 'speakeasy-sdk/python 0.23.6 2.342.6 1.0.35 unstructured-client'
     retry_config: Optional[RetryConfig] = None

     def __post_init__(self):
