Version 0.0.71 breaks ocr_only pdf recognition on some documents #485

Open
@fatadel

Describe the bug
Starting with version 0.0.71, ocr_only PDF processing of some documents yields an empty result. Exactly the same request works fine with version 0.0.70.

To Reproduce

  1. Start the 0.0.71 container and post the PDF with the ocr_only strategy:
docker run --platform linux/x86_64 -p 8000:8000 -d --rm --name unstructured-api downloads.unstructured.io/unstructured-io/unstructured-api:0.0.71
curl -X 'POST' \
  'http://127.0.0.1:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@4.pdf' \
  -F 'strategy=ocr_only' \
  -F 'languages=deu'
  2. Now pull version 0.0.70 and send the same request. With 0.0.71 the result is empty; with 0.0.70 the result is correct and non-empty.
  • Filetype: PDF
  • Any additional API parameters: -
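For anyone who wants to compare the two versions side by side, here is a minimal Python sketch of the same request using only the standard library. The endpoint and the strategy/languages form fields are taken from the curl call above. The `files` field name, the two-container setup on ports 8000 and 8001, and the assumption that an "empty result" means a blank body or an empty JSON list are mine, not confirmed behavior. The helper names are hypothetical.

```python
import json
import os
import sys
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode plain form fields plus one file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{file_field}"; '
            f'filename="{filename}"\r\n'
            "Content-Type: application/pdf\r\n\r\n"
        ).encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def is_empty_result(response_text):
    """True for a blank body or an empty JSON list/object (assumed meaning of 'empty')."""
    text = response_text.strip()
    if not text:
        return True
    parsed = json.loads(text)
    return isinstance(parsed, (list, dict)) and len(parsed) == 0


def partition_pdf(base_url, pdf_path, strategy="ocr_only", languages="deu"):
    """POST the PDF to /general/v0/general and return the raw response body."""
    with open(pdf_path, "rb") as fh:
        pdf_bytes = fh.read()
    # "files" is the assumed name of the upload field.
    body, content_type = build_multipart(
        {"strategy": strategy, "languages": languages},
        "files",
        os.path.basename(pdf_path),
        pdf_bytes,
    )
    request = urllib.request.Request(
        f"{base_url}/general/v0/general",
        data=body,
        headers={"accept": "application/json", "Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical setup: 0.0.71 published on port 8000, 0.0.70 on port 8001.
    targets = {
        "0.0.71": "http://127.0.0.1:8000",
        "0.0.70": "http://127.0.0.1:8001",
    }
    for version, base_url in targets.items():
        empty = is_empty_result(partition_pdf(base_url, sys.argv[1]))
        print(f"{version}: {'EMPTY' if empty else 'non-empty'}")
```

Saved under any file name and run with the path to the attached 4.pdf as its only argument, it prints one line per version saying whether the response was empty. Running the 0.0.70 container needs a second `docker run` with `-p 8001:8000` and a different `--name`.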

Environment:

  • self-hosted API
  • any client yields the same result

Additional context
Attaching the problematic PDF.
4.pdf
