When split_pdf_page is enabled, only the last set's API request succeeds. #220

Open
@issj6

Describe the bug
When I set split_pdf_page=True and split_pdf_concurrency_level=15, every split request except the last one fails.
For example, if the PDF is divided into 10 sets, the client logs the following:
ERROR: Failed to send request for page 1
...
WARNING: Failed to partition set #1, its elements will be omitted in the final result.
...
WARNING: Failed to partition set #9, its elements will be omitted in the final result.
INFO: Successfully partitioned set #10, elements added to the final result.

To Reproduce
code:

import os

import requests
import unstructured_client
from unstructured_client.models import operations, shared
from unstructured_client.models.shared import ChunkingStrategy

os.environ["UNSTRUCTURED_API_KEY"] = "EMPTY"
os.environ["UNSTRUCTURED_API_URL"] = ""

requests_client = requests.Session()
client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
    client=requests_client
)

filename = "./test_pdf.pdf"

# Read the PDF inside a context manager so the file handle is always closed.
with open(filename, "rb") as file:
    req = operations.PartitionRequest(
        partition_parameters=shared.PartitionParameters(
            files=shared.Files(
                content=file.read(),
                file_name=filename,
            ),
            strategy=shared.Strategy.HI_RES,
            split_pdf_page=True,
            split_pdf_concurrency_level=15,
            chunking_strategy=ChunkingStrategy("by_title")
        )
    )

try:
    res = client.general.partition(req)
    # res.elements may be None if nothing was returned.
    element_dicts = list(res.elements or [])

    print(element_dicts)
    for element in element_dicts:
        print(element['text'])
except Exception as exc:
    print(exc)
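
To confirm which pages actually made it into the final result, a small check like the one below may help. It is only a sketch: it assumes each returned element is a dict whose `metadata` carries a `page_number` key, and the helper names `pages_present` and `missing_pages` are hypothetical, not part of unstructured_client.

```python
def pages_present(elements):
    """Return the sorted page numbers found in the elements' metadata.

    Assumption: each element is a dict with an optional "metadata" dict
    that may contain "page_number".
    """
    pages = set()
    for element in elements:
        metadata = element.get("metadata") or {}
        page = metadata.get("page_number")
        if page is not None:
            pages.add(page)
    return sorted(pages)


def missing_pages(elements, total_pages):
    """Return the 1-indexed pages that have no element in the result."""
    present = set(pages_present(elements))
    return [page for page in range(1, total_pages + 1) if page not in present]


# Made-up sample data shaped like the result described in this issue:
# only elements from the last page came back.
sample_elements = [
    {"text": "Closing remarks", "metadata": {"page_number": 23}},
    {"text": "Appendix", "metadata": {"page_number": 23}},
]
print(pages_present(sample_elements))
print(missing_pages(sample_elements, total_pages=23))
```

With the script above, this would be called as `missing_pages(element_dicts, total_pages=23)` after the partition call returns.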

Console Information:

INFO: Preparing to split document for partition.
INFO: Concurrency level set to 15
INFO: Splitting pages 1 to 23 (23 total)
INFO: Determined optimal split size of 2 pages.
INFO: Partitioning 11 files with 2 page(s) each.
INFO: Partitioning 1 file with 1 page(s).
INFO: Partitioning set #1 (pages 1-2).
INFO: Partitioning set #2 (pages 3-4).
INFO: Partitioning set #3 (pages 5-6).
INFO: Partitioning set #4 (pages 7-8).
INFO: Partitioning set #5 (pages 9-10).
INFO: Partitioning set #6 (pages 11-12).
INFO: Partitioning set #7 (pages 13-14).
INFO: Partitioning set #8 (pages 15-16).
INFO: Partitioning set #9 (pages 17-18).
INFO: Partitioning set #10 (pages 19-20).
INFO: Partitioning set #11 (pages 21-22).
INFO: Partitioning set #12 (pages 23-23).
ERROR: Failed to send request for page 1
ERROR: Failed to send request for page 3
ERROR: Failed to send request for page 5
ERROR: Failed to send request for page 7
ERROR: Failed to send request for page 9
ERROR: Failed to send request for page 11
ERROR: Failed to send request for page 13
ERROR: Failed to send request for page 15
ERROR: Failed to send request for page 17
ERROR: Failed to send request for page 19
ERROR: Failed to send request for page 21
WARNING: Failed to partition set #1, its elements will be omitted in the final result.
WARNING: Failed to partition set #2, its elements will be omitted in the final result.
WARNING: Failed to partition set #3, its elements will be omitted in the final result.
WARNING: Failed to partition set #4, its elements will be omitted in the final result.
WARNING: Failed to partition set #5, its elements will be omitted in the final result.
WARNING: Failed to partition set #6, its elements will be omitted in the final result.
WARNING: Failed to partition set #7, its elements will be omitted in the final result.
WARNING: Failed to partition set #8, its elements will be omitted in the final result.
WARNING: Failed to partition set #9, its elements will be omitted in the final result.
WARNING: Failed to partition set #10, its elements will be omitted in the final result.
WARNING: Failed to partition set #11, its elements will be omitted in the final result.
INFO: Successfully partitioned set #12, elements added to the final result.
INFO: Successfully partitioned the document.
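
The page ranges in the log above can be reproduced with a short sketch. It takes the split size of 2 pages and the total of 23 pages directly from the log and does not claim to be how the client computes them; `split_page_ranges` is a hypothetical helper name.

```python
def split_page_ranges(num_pages: int, split_size: int) -> list[tuple[int, int]]:
    """Return 1-indexed, inclusive (first, last) page ranges, one per set."""
    if num_pages < 1 or split_size < 1:
        raise ValueError("num_pages and split_size must be positive")
    return [
        (start, min(start + split_size - 1, num_pages))
        for start in range(1, num_pages + 1, split_size)
    ]


# Values taken from the console output: 23 pages, split size of 2 pages.
ranges = split_page_ranges(num_pages=23, split_size=2)
for set_number, (first, last) in enumerate(ranges, start=1):
    print(f"set #{set_number}: pages {first}-{last}")
```

This yields 11 two-page sets plus a final one-page set. The first pages of the two-page sets (1, 3, ..., 21) are exactly the pages named in the "Failed to send request" errors, and the only set that succeeds is the last, single-page one.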
