Text Extraction Issue: Greek Language PDFs Rendered with Incorrect Alphabet #2939

Open
@DarioBernardo

Describe the bug
I am evaluating the UnstructuredClient for processing PDF documents and am encountering an issue with the Greek language text extraction. When I attempt to extract text from PDF documents in Greek, the output text appears in a non-Greek alphabet and is unreadable, making it impossible to use for my purposes.

To Reproduce
This is the code I am using. Running it on any Greek document reproduces the error:

import json

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

s = UnstructuredClient(
    api_key_auth=DLAI_API_KEY,
    server_url=DLAI_API_URL,
)

filename = "example_files/c_20230111133942393_2525540.pdf"
with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(), 
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy='hi_res',
    pdf_infer_table_structure=True,
    languages=["gr"],
)
try:
    resp = s.general.partition(req)
    print(json.dumps(resp.elements[:3], indent=2))
except SDKError as e:
    print(e)

Expected behavior
I expect the extracted text to accurately represent the original Greek characters from the PDF document.

Actual results
The extracted text contains characters that are not in the Greek alphabet, rendering the text unreadable. Here's a snippet of what I get:

{
  "type": "NarrativeText",
  "element_id": "aaad19db9a99367b392003c6db4a7e2b",
  "text": "\u00a3635 TO TGS TOV KTV sPSopmva E5 yihddav oySovia Svo ko Sixx entd exatootdv (176.082,17) EURO, nov avriotogi o8 g&ivia exatoppudpia (60.000.000) Spayy\u00e9e,",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "gr"
    ],
    "page_number": 1,
    "parent_id": "1b406d8798c823dcd1d195f4a4f331dd",
    "filename": "c_20230111133942393_2525540.pdf"
  }
}
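
To make the symptom measurable rather than judged by eye, the share of Greek letters in each extracted `text` field can be computed. The following is a minimal sketch using only the Python standard library; the helper name `greek_letter_ratio` is hypothetical and not part of the Unstructured SDK. It treats a character as Greek when its Unicode name starts with `GREEK`.

```python
import unicodedata


def greek_letter_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Greek letters.

    Hypothetical helper for illustration; not part of unstructured_client.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        # No letters at all (digits or punctuation only): report 0.0.
        return 0.0
    greek = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("GREEK")
    )
    return greek / len(letters)


# An excerpt of the text returned in the element above.
sample = "\u00a3635 TO TGS TOV KTV sPSopmva E5 yihddav"
print(greek_letter_ratio(sample))
```

For a correctly extracted Greek document this ratio should be close to 1.0; the excerpt above contains only Latin letters, so it scores 0.0.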

Additional context

  • Using the latest version of the Unstructured SDK.
  • Issue occurs consistently with multiple documents in Greek.

Could this issue be due to a missing OCR plugin for the Greek language? Since I am utilizing the API, I would expect such components to be managed server-side.
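
A second thing worth ruling out is the `languages=["gr"]` value itself. `gr` is the ISO 3166 country code for Greece, whereas the ISO 639 language codes for Greek are `el` and `ell`, and Tesseract names its modern Greek model `ell`. How the API handles an unrecognized code is not confirmed here, so the following is only a sketch under the assumption that the server expects Tesseract-style codes. The names `GREEK_ALIASES` and `to_tesseract_lang` are hypothetical.

```python
# Hypothetical mapping for illustration; not part of unstructured_client.
# Codes people commonly use for Greek, including the country code "gr".
GREEK_ALIASES = {"gr", "el", "gre", "ell", "greek"}


def to_tesseract_lang(code: str) -> str:
    """Map a user-supplied language code to a Tesseract-style code.

    Assumption: the OCR backend expects Tesseract model names, where
    modern Greek is "ell". Codes that are not Greek aliases are returned
    lowercased and stripped, otherwise unchanged.
    """
    normalized = code.strip().lower()
    if normalized in GREEK_ALIASES:
        return "ell"
    return normalized


print(to_tesseract_lang("gr"))
```

Re-running the reproduction with the normalized code in `languages` would show whether the language parameter is the cause or whether the problem is on the server side.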

Labels: bug (Something isn't working), ocr (Related to optical character recognition (OCR))