Description
Describe the bug
When using v0.25.5 of unstructured-client in VS Code to process PDFs of more than one page with the "hi_res" strategy, I consistently receive INFO: Failed to process a request due to API server error with status code 504.
followed by:
INFO: Server message - <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>
To Reproduce
import os

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

os.environ['UNSTRUCTURED_API_KEY'] = "<MY_API_KEY>"
os.environ['UNSTRUCTURED_API_URL'] = "<MY_API_URL>"

client_obj = UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
)

filename = "./data/kenwood_en.pdf"
# Read the PDF bytes up front so the file handle is closed promptly
with open(filename, "rb") as file:
    content = file.read()

req = shared.PartitionParameters(
    # Note that this currently only supports a single file
    files=shared.Files(
        content=content,
        file_name=filename,
    ),
    chunking_strategy="by_title",
    max_characters=1024,
    split_pdf_page=True,
    split_pdf_allow_failed=True,
)

try:
    res = client_obj.general.partition(request=req)
    print(res.elements[0])
except SDKError as e:
    print(e)
Expected behavior
I expected the document to be partitioned. Instead, after 2 minutes, the call always fails with the following output:
INFO: Preparing to split document for partition.
INFO: Starting page number set to 1
INFO: Allow failed set to 1
INFO: Concurrency level set to 5
INFO: Splitting pages 1 to 40 (40 total)
INFO: Determined optimal split size of 8 pages.
INFO: Partitioning 5 files with 8 page(s) each.
INFO: Partitioning set #1 (pages 1-8).
INFO: Partitioning set #2 (pages 9-16).
INFO: Partitioning set #3 (pages 17-24).
INFO: Partitioning set #4 (pages 25-32).
INFO: Partitioning set #5 (pages 33-40).
INFO: HTTP Request: POST <MY_API_URL> "HTTP/1.1 504 Gateway Time-out"
ERROR: Failed to send request for page 25
INFO: HTTP Request: POST <MY_API_URL> "HTTP/1.1 504 Gateway Time-out"
ERROR: Failed to send request for page 17
INFO: HTTP Request: POST <MY_API_URL> "HTTP/1.1 504 Gateway Time-out"
ERROR: Failed to send request for page 9
INFO: HTTP Request: POST <MY_API_URL> "HTTP/1.1 504 Gateway Time-out"
ERROR: Failed to send request for page 1
WARNING: Failed to partition set #1, its elements will be omitted in the final result.
WARNING: Failed to partition set #2, its elements will be omitted in the final result.
WARNING: Failed to partition set #3, its elements will be omitted in the final result.
WARNING: Failed to partition set #4, its elements will be omitted in the final result.
WARNING: Failed to partition set #5, its elements will be omitted in the final result.
INFO: Failed to process a request due to API server error with status code 504. Attempting retry number 1 after sleep.
INFO: Server message - <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>
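The split plan in the log above is consistent with dividing the page count evenly across the concurrency level. Below is a minimal sketch of that arithmetic, assuming the split size is simply the page count divided by the concurrency level, rounded up. The client's actual logic may apply extra limits, and `split_page_ranges` is a hypothetical helper, not part of unstructured-client.

```python
import math


def split_page_ranges(total_pages: int, concurrency: int) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first, last) page ranges, one per request."""
    if total_pages < 1 or concurrency < 1:
        raise ValueError("total_pages and concurrency must be positive")
    # Assumption: pages are spread evenly across workers, rounding up.
    split_size = math.ceil(total_pages / concurrency)
    ranges = []
    for first in range(1, total_pages + 1, split_size):
        # The final range may be shorter than split_size.
        last = min(first + split_size - 1, total_pages)
        ranges.append((first, last))
    return ranges


if __name__ == "__main__":
    for number, (first, last) in enumerate(split_page_ranges(40, 5), start=1):
        print(f"set #{number}: pages {first}-{last}")
```

For 40 pages and a concurrency level of 5, this yields five ranges of 8 pages each, matching the sets listed in the log.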
After these failures, the client falls back to its retry strategy, which I presume is the one defined in general.py.
The same cycle of 504 responses then repeats on every retry.
I have tried adjusting the RetryConfig both on my client and in the general.partition call, but neither changes when or how my program fails.
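To judge whether a retry setting should change when the program gives up, it can help to compute the sleep schedule a generic exponential backoff would produce. The sketch below assumes four parameters (initial interval, maximum interval, exponent, maximum elapsed time) modelled loosely on common backoff configurations. `backoff_delays` is a hypothetical helper, not part of unstructured-client, and the SDK's real schedule may add jitter or differ in other ways.

```python
def backoff_delays(initial_ms: float, max_ms: float, exponent: float,
                   max_elapsed_ms: float) -> list[float]:
    """List the sleep intervals (in ms) of a plain exponential backoff, without jitter."""
    if initial_ms <= 0 or exponent < 1:
        raise ValueError("initial_ms must be positive and exponent at least 1")
    delays = []
    elapsed = 0
    attempt = 0
    while True:
        # Grow the delay geometrically, capped at max_ms.
        delay = min(initial_ms * exponent ** attempt, max_ms)
        # Stop once the next sleep would exceed the total time budget.
        if elapsed + delay > max_elapsed_ms:
            break
        delays.append(delay)
        elapsed += delay
        attempt += 1
    return delays


if __name__ == "__main__":
    # Illustrative values only, not the SDK defaults.
    print(backoff_delays(1000, 8000, 2, 20000))
```

Note that a schedule like this only controls the pauses between attempts. If each attempt itself waits on the server until the gateway returns a 504, changing the retry settings would not shorten the time to the first failure.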
Environment Info
I am running this in a Jupyter notebook in VSCode, within a venv.
Additional Info
The pdf I used to reproduce this example is here
Does anyone have a solution, or could anyone help me work out whether this is a problem on my end or a bug?