bug: correctly combine words spanning multiple lines #2234

Open
@Coniferish

The bug
When partitioning PDFs using the auto strategy, some elements contain words that were hyphenated across a line break in the source. Although the line separators are removed in the final element.text, the hyphen (and a trailing space) remains in the middle of the word.

Example:
When partitioning example-docs/layout-parser-paper-fast.pdf, the word "distribution" in the numbered list spans two lines and is left broken as "distribu- tion" in the combined list element:

(pdb) elements[22].text
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'

Note: although this ListItem passes through _combine_list_elements, it is unlike the other list items in this document, which were broken into separate elements and needed to be combined. This suggests the bug occurs somewhere earlier in the call stack.
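
For illustration, the expected behavior could be sketched as a small post-processing step. This is a minimal sketch, not the library's implementation: `join_hyphenated_words` is a hypothetical helper name, and the rule it applies (drop a hyphen plus whitespace when it sits between two lowercase letters) is an assumption about what a fix might do.

```python
import re

# Hypothetical helper, not part of the library's API.
# Assumption: a line-break hyphen shows up as "<lowercase>- <lowercase>"
# once line separators have been replaced by whitespace.
_BROKEN_WORD = re.compile(r"(?<=[a-z])-\s+(?=[a-z])")


def join_hyphenated_words(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return _BROKEN_WORD.sub("", text)


text = "for the easy sharing, distribu- tion, and discussion of DIA models"
print(join_hyphenated_words(text))
# for the easy sharing, distribution, and discussion of DIA models
```

One caveat with this rule: a genuine compound that happens to break at its hyphen (for example "state-of-the- art") would also lose the hyphen, so a real fix would probably need information from earlier in the pipeline about where the line breaks were.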

Labels: bug, pdf
