Text Extraction Issue: Greek Language PDFs Rendered with Incorrect Alphabet #2939

**Describe the bug**
I am evaluating UnstructuredClient for processing PDF documents and have run into a problem with Greek text extraction. When I extract text from Greek-language PDFs, the output appears in a non-Greek alphabet and is unreadable, so I cannot use it.

**To Reproduce**
This is the code I am using; running it on any Greek document reproduces the error:
```
import json

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

s = UnstructuredClient(
    api_key_auth=DLAI_API_KEY,
    server_url=DLAI_API_URL,
)

filename = "example_files/c_20230111133942393_2525540.pdf"
with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(), 
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy='hi_res',
    pdf_infer_table_structure=True,
    languages=["gr"],
)
try:
    resp = s.general.partition(req)
    print(json.dumps(resp.elements[:3], indent=2))
except SDKError as e:
    print(e)
```

**Expected behavior**
I expect the extracted text to accurately represent the original Greek characters from the PDF document.

**Actual results**
The extracted text contains characters that are not in the Greek alphabet, rendering the text unreadable. Here's a snippet of what I get:
```
{
    "type": "NarrativeText",
    "element_id": "aaad19db9a99367b392003c6db4a7e2b",
    "text": "\u00a3635 TO TGS TOV KTV sPSopmva E5 yihddav oySovia Svo ko Sixx entd exatootdv (176.082,17) EURO, nov avriotogi o8 g&ivia exatoppudpia (60.000.000) Spayy\u00e9e,",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "gr"
      ],
      "page_number": 1,
      "parent_id": "1b406d8798c823dcd1d195f4a4f331dd",
      "filename": "c_20230111133942393_2525540.pdf"
    }
  }
```
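To check the symptom across several documents without reading each element by eye, a rough script like the one below could measure how much of the returned text is actually Greek. This is only a sketch: `greek_letter_ratio` is a hypothetical helper of mine, not part of the SDK, and it relies solely on the standard library's `unicodedata` module.

```python
import unicodedata


def greek_letter_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Greek letters.

    Hypothetical helper, not part of unstructured_client.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode names of Greek letters (including accented and polytonic
    # forms) all start with "GREEK".
    greek = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("GREEK")
    )
    return greek / len(letters)


# Start of the "text" field from the element shown above.
sample = "\u00a3635 TO TGS TOV KTV sPSopmva E5 yihddav"
print(greek_letter_ratio(sample))
```

A ratio near zero for a document known to be in Greek would indicate that the OCR step is reading the page with a Latin-script model.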

**Additional context**
 - Using the latest version of the Unstructured SDK.
 - Issue occurs consistently with multiple documents in Greek.

Could this issue be due to a missing OCR plugin for the Greek language? Since I am utilizing the API, I would expect such components to be managed server-side.
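One thing I have not ruled out is the language code itself. Tesseract names its modern Greek model `ell`, and `gr` is not an ISO 639 code for Greek (`el` is the two-letter code). If the API passes `languages` through to a Tesseract-based OCR step, which is an assumption on my part, then `gr` may simply be unrecognized. A minimal sketch of normalizing the code before building the request, where `GREEK_ALIASES` and `normalize_language` are hypothetical names and not SDK features:

```python
# Assumption: the service expects Tesseract-style codes, where Greek is "ell".
GREEK_ALIASES = {"gr", "el", "gre", "greek"}


def normalize_language(code: str) -> str:
    """Map common spellings of Greek to "ell"; pass other codes through lowercased."""
    cleaned = code.strip().lower()
    return "ell" if cleaned in GREEK_ALIASES else cleaned


# The list that would be passed as `languages=` in PartitionParameters.
languages = [normalize_language(code) for code in ["gr"]]
print(languages)
```

If the output is still not Greek with `ell`, the language code is not the cause and the problem is more likely server-side.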

