Commit f445724

fix: partition_pdf() removes spaces from the text (#3106)
Closes #2896.

This PR fixes `partition_pdf()` so that it keeps spaces in the text. The control character `\t` is now replaced with a space instead of being removed when merging inferred and embedded elements.

### Testing

PDF: [rok_20230930_1-1.pdf](https://github.com/Unstructured-IO/unstructured/files/15001636/rok_20230930_1-1.pdf)

```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="rok_20230930_1-1.pdf",
    strategy="hi_res",
)
print(str(elements[20]))
```

**Results:**

- PR
```
Name of each exchange on which registered New York Stock Exchange
```
- main branch
```
Nameofeachexchangeonwhichregistered NewYorkStockExchange
```
1 parent 3158169 commit f445724

4 files changed, +5 -4 lines changed

CHANGELOG.md

+2 -1
@@ -1,4 +1,4 @@
-## 0.14.3-dev5
+## 0.14.3
 
 ### Enhancements
 
@@ -10,6 +10,7 @@
 
 ### Fixes
 
+* **Fix `partition_pdf()` to keep spaces in the text**. The control character `\t` is now replaced with a space instead of being removed when merging inferred elements with embedded elements.
 * **Turn off XML resolve entities** Sets `resolve_entities=False` for XML parsing with `lxml`
 to avoid text being dynamically injected into the XML document.
 * **Add backward compatibility for the deprecated pdf_infer_table_structure parameter**.

test_unstructured/partition/pdf_image/test_pdf_image_utils.py

+1 -1
@@ -347,7 +347,7 @@ def test_annotate_layout_elements_file_not_found_error():
 
 @pytest.mark.parametrize(
     ("text", "expected"),
-    [("c\to\x0cn\ftrol\ncharacter\rs\b", "control characters"), ("\"'\\", "\"'\\")],
+    [("test\tco\x0cn\ftrol\ncharacter\rs\b", "test control characters"), ("\"'\\", "\"'\\")],
 )
 def test_remove_control_characters(text, expected):
     assert pdf_image_utils.remove_control_characters(text) == expected
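
Every escape sequence in the test input above denotes a character in the Unicode general category `Cc` (control), which is why the `category(c)[0] != "C"` filter in `remove_control_characters()` drops them unless they are replaced first. A minimal sketch that inspects those categories with the standard library; the `chars` list and `categories` dict are illustrative names, not part of the commit:

```python
import unicodedata

# Control characters that appear in the updated test input.
# Note that "\f" and "\x0c" are two spellings of the same form-feed character.
chars = ["\t", "\x0c", "\n", "\r", "\b"]

# Map each character to its Unicode general category.
categories = {c: unicodedata.category(c) for c in chars}

for c, cat in categories.items():
    # repr() makes the otherwise invisible character readable.
    print(repr(c), cat)
```

Each character reports `Cc`, so a tab that is not converted to a space beforehand is deleted along with the other control characters.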

unstructured/__version__.py

+1 -1
@@ -1 +1 @@
-__version__ = "0.14.3-dev5"  # pragma: no cover
+__version__ = "0.14.3"  # pragma: no cover

unstructured/partition/pdf_image/pdf_image_utils.py

+1 -1
@@ -427,7 +427,7 @@ def remove_control_characters(text: str) -> str:
     """Removes control characters from text."""
 
     # Replace newline character with a space
-    text = text.replace("\n", " ")
+    text = text.replace("\t", " ").replace("\n", " ")
     # Remove other control characters
     out_text = "".join(c for c in text if unicodedata.category(c)[0] != "C")
     return out_text
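
The effect of the one-line change can be reproduced outside the library. The sketch below copies the function body before and after the change into two standalone functions; the `_old` and `_new` suffixes and the tab-separated `sample` string are assumptions for illustration, since the raw embedded text of the test PDF is not shown here:

```python
import unicodedata


def remove_control_characters_old(text: str) -> str:
    """Behavior before this commit: only newlines become spaces."""
    text = text.replace("\n", " ")
    return "".join(c for c in text if unicodedata.category(c)[0] != "C")


def remove_control_characters_new(text: str) -> str:
    """Behavior after this commit: tabs and newlines become spaces."""
    text = text.replace("\t", " ").replace("\n", " ")
    return "".join(c for c in text if unicodedata.category(c)[0] != "C")


# Hypothetical input: words separated by tabs, as some PDFs embed them.
sample = "New\tYork\tStock\tExchange"

print(remove_control_characters_old(sample))  # NewYorkStockExchange
print(remove_control_characters_new(sample))  # New York Stock Exchange
```

Replacing the tab before the category filter runs preserves the word boundary, while the other control characters are still stripped.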
