Description
The bug
When partitioning PDFs with the auto strategy, some elements contain words that were hyphenated across a line break in the source. Although the line separators are removed in the final element.text, the hyphen remains.
Example:
When partitioning example-docs/layout-parser-paper-fast.pdf, the word "distribution" in the numbered list spans multiple lines and is left broken in the combined list element:
(pdb) elements[22].text
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
Note: although this ListItem passes through _combine_list_elements, it is not one of the list items in this document that were broken and needed to be combined, so the bug must occur somewhere earlier in the call stack.
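To make the symptom concrete, here is a minimal standalone sketch that does not call the library at all. It assumes the text is joined by replacing line separators with spaces, which would explain the leftover "distribu- tion". Both function names are hypothetical, and the second one is only one possible way to rejoin the word, not the library's actual behavior or a proposed patch.

```python
import re

# Matches a hyphen at the end of a line when it sits between two lowercase
# letters, e.g. "distribu-\ntion". Hypothetical pattern, for illustration only.
_HYPHEN_BREAK = re.compile(r"(?<=[a-z])-\s*\n\s*(?=[a-z])")


def join_lines(raw: str) -> str:
    """Naive join: line separators become spaces, so the hyphen survives."""
    return " ".join(line.strip() for line in raw.splitlines())


def join_lines_dehyphenated(raw: str) -> str:
    """Drop end-of-line hyphens first, then join the remaining lines."""
    dehyphenated = _HYPHEN_BREAK.sub("", raw)
    return " ".join(line.strip() for line in dehyphenated.splitlines())


raw = "for the easy sharing, distribu-\ntion, and discussion of DIA models"
print(join_lines(raw))               # reproduces the broken "distribu- tion"
print(join_lines_dehyphenated(raw))  # yields "distribution"
```

One caveat with this approach: it also merges genuine compounds that happen to break at a hyphen (for example "state-of-the-" followed by "art" on the next line), so a real fix would probably need a dictionary check or layout information.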