
Incorrect Bounding Boxes for PDFs #3867

Open
@charlottecrnj


Describe the bug
The bounding boxes returned by the HI_RES strategy are wrong for PDFs.

To Reproduce

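import os

from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

# Client setup was not shown in the original report; the environment
# variable name used here is an assumption.
client = UnstructuredClient(api_key_auth=os.environ["UNSTRUCTURED_API_KEY"])

filename = "example.pdf"
with open(filename, "rb") as f:
    data = f.read()

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=data,
            file_name=filename,
        ),
        strategy=shared.Strategy.HI_RES,
        coordinates=True,  # ask for bounding-box coordinates on each element
        languages=["de"],
    ),
)

try:
    res = client.general.partition(request=req)
    # Print the first element, including its coordinate metadata
    print(res.elements[0])
except Exception as e:
    print(e)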

Expected behavior
I would expect the bounding boxes to be correctly placed around each of the elements returned by the unstructured API.
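
To make "correctly placed" concrete, here is a minimal sketch of how I would map a returned box onto the PDF page. It assumes each element's coordinate metadata has the shape `points`, `layout_width`, `layout_height` (pixel space of the rendered page image), and that the page size in PDF points is known. The helper name `scale_points_to_page` and the sample values are hypothetical, not taken from the API response in this report.

```python
from typing import Dict, List, Tuple


def scale_points_to_page(
    coords: Dict, page_width: float, page_height: float
) -> List[Tuple[float, float]]:
    """Rescale box corners from layout (pixel) space to the page's own units."""
    sx = page_width / coords["layout_width"]
    sy = page_height / coords["layout_height"]
    return [(x * sx, y * sy) for x, y in coords["points"]]


# Hypothetical coordinate metadata (assumed shape, illustrative values only)
example = {
    "points": [[100.0, 200.0], [100.0, 300.0], [500.0, 300.0], [500.0, 200.0]],
    "system": "PixelSpace",
    "layout_width": 1700,
    "layout_height": 2200,
}

# Assumed page size: US Letter, 612 x 792 PDF points
scaled = scale_points_to_page(example, 612.0, 792.0)
print(scaled)
```

If boxes rescaled this way line up with the text, the returned coordinates are only in a different space; if they still do not, the boxes themselves are off. Note that this sketch keeps the top-left origin of the pixel space, so a viewer with a bottom-left origin would additionally need the y values flipped.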

Screenshots of Actual (Wrong) Behavior
Image

Additional context
This issue was already discussed in a previous issue (#3100). Back then, the default strategy would still return bounding boxes. This no longer seems to be the case: all strategies except hi_res return no coordinates (and hence no bounding boxes). So is there currently no way to retrieve proper bounding boxes for PDFs?

Does anyone know a way to retrieve correct bounding boxes?


Labels: bug
