bug/right2left_pdf_output #3232

Open
@DsDastgheib

Describe the bug
The PDF partitioner returns incorrect text for right-to-left languages: the extracted text comes out reversed.

To Reproduce
I downloaded a sample PDF from this link and ran the following code:

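from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# Placeholder API key; replace with your own.
client = UnstructuredClient(api_key_auth="YOUR_API_KEY")

filename = "Path_to_the_sample_pdf_file"

# Read the PDF and wrap it in the SDK's file container.
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(files=files)

try:
    resp = client.general.partition(req)
    print(resp)
except SDKError as e:
    print(e)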

I got the following output (truncated):

PartitionResponse(content_type='application/json', status_code=200, raw_response=<Response [200]>, elements=[{'type': 'Header', 'element_id': '4e8ada3c22ab6f719d3a16379b9d2ca5', 'text': 'See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/381042047', 'metadata': {'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'filename': 'drbarh.pdf'}}, {'type': 'Title', 'element_id': '193c5b2dbecb6826b3e4d0ad1a37e699', 'text': 'ﻲﻣدﺎﻛآ و هﺮﻣزور ﻲﮔﺪﻧز رد بﻮﺧ عوﺮﺷ ﻚﻳ هرﺎﺑرد', 'metadata': {'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'parent_id': '4e8ada3c22ab6f719d3a16379b9d2ca5', 'filename': 'drbarh.pdf'}}, {'type': 'NarrativeText', 'element_id': 'a632662d5c3182a47e0a547204c7a311', 'text': 'Article · June 2024', 'metadata': {'filetype': 'application/pdf',

Expected behavior
The text should read as follows (the extracted text appears to be reversed):

دربارهٔ یک شروع خوب در زندگی روزمره و آکادمی

Additional context
The problem affects the text of the whole document, and changing the language setting does not help.
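
To illustrate what seems to be happening, here is a minimal client-side sketch, not a fix in the partitioner itself. It assumes the returned text is in visual order and uses Arabic presentation-form code points, which is what the `Title` element above looks like. The helper names `has_presentation_forms` and `to_logical_order` are hypothetical and not part of any library. Reversing the whole string is naive: it would also reverse any embedded Latin text or digits, so it only suits strings that are purely right-to-left.

```python
import unicodedata


def has_presentation_forms(text: str) -> bool:
    """Return True if text contains Arabic presentation-form code points.

    Hypothetical helper: checks the Arabic Presentation Forms-A (U+FB50..U+FDFF)
    and Forms-B (U+FE70..U+FEFC) ranges as a hint that the text was extracted
    in visual order.
    """
    return any(
        "\uFB50" <= ch <= "\uFDFF" or "\uFE70" <= ch <= "\uFEFC"
        for ch in text
    )


def to_logical_order(text: str) -> str:
    """Naively convert a purely right-to-left string from visual to logical order.

    Hypothetical helper: reverses the characters, then applies NFKC
    normalization to map presentation forms back to base letters.
    Text without presentation forms is returned unchanged.
    """
    if not has_presentation_forms(text):
        return text
    return unicodedata.normalize("NFKC", text[::-1])


# Visual-order presentation forms: meem, alef, lam, seen (left to right in memory).
visual = "\uFEE1\uFE8E\uFEE0\uFEB3"
print(to_logical_order(visual))
```

Applied to the `text` field of each element, this would bring the title closer to the expected string, though the result may still differ in details: for example, NFKC maps the yeh presentation forms to Arabic yeh (U+064A) rather than the Persian yeh (U+06CC) used in the expected text.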

Labels: bug, pdf
