
Commit 1dede50

fix: parsing pdf error - new_cells as str has no "copy" (#3130)
Closes #3119.

### Testing

Parsing the provided PDF should succeed.

[testing_brochure_2.pdf](https://github.com/user-attachments/files/15518094/testing_brochure_2.pdf)

```
from unstructured.partition.pdf import partition_pdf

filename = "testing_brochure_2.pdf"

# Partition the PDF with table structure inference enabled, then chunk by title.
with open(filename, "rb") as pdf_content:
    elements = partition_pdf(
        file=pdf_content,
        infer_table_structure=True,
        extract_image_block_types=["Image", "Table"],
        chunking_strategy="by_title",
        max_characters=1000,
        new_after_n_chars=3000,
        combine_text_under_n_chars=1000,
    )

print("\n\n".join([str(el) for el in elements]))
```
1 parent 1b43102 commit 1dede50

3 files changed, +5 -3 lines changed


CHANGELOG.md

+2 -1
@@ -1,4 +1,4 @@
-## 0.14.4-dev6
+## 0.14.4
 
 ### Enhancements
 
@@ -12,6 +12,7 @@
 
 ### Fixes
 
+* **Address the issue of unrecognized tables in `UnstructuredTableTransformerModel`** When a table is not recognized, the `element.metadata.text_as_html` attribute is set to an empty string.
 * **Remove root handlers in ingest logger**. Removes root handlers in ingest loggers to ensure secrets aren't accidentally exposed in Colab notebooks.
 * **Fix V2 S3 Destination Connector authentication** Fixes bugs with S3 Destination Connector where the connection config was neither registered nor properly deserialized.
 * **Clarified dependence on particular version of `python-docx`** Pinned `python-docx` version to ensure a particular method `unstructured` uses is included.
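Since an unrecognized table now carries an empty `text_as_html`, downstream code may want to treat an empty string as "no table structure available". The following is a minimal sketch of that check. `Metadata`, `Element`, and `recognized_tables` are hypothetical stand-ins defined here for illustration, not `unstructured` classes or functions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Metadata:
    # Hypothetical stand-in for an element's metadata object.
    text_as_html: Optional[str] = None


@dataclass
class Element:
    # Hypothetical stand-in for a partitioned document element.
    category: str
    text: str
    metadata: Metadata = field(default_factory=Metadata)


def recognized_tables(elements: List[Element]) -> List[Element]:
    """Keep only table elements whose HTML is non-empty.

    Both None and "" are falsy, so unrecognized tables are dropped.
    """
    return [
        el
        for el in elements
        if el.category == "Table" and el.metadata.text_as_html
    ]


elements = [
    Element("Table", "a b", Metadata(text_as_html="<table><tr><td>a</td></tr></table>")),
    Element("Table", "garbled", Metadata(text_as_html="")),  # table was not recognized
    Element("NarrativeText", "hello"),
]
kept = recognized_tables(elements)
```

Here `kept` contains only the first element; the unrecognized table and the non-table element are filtered out.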

unstructured/__version__.py

+1 -1
@@ -1 +1 @@
-__version__ = "0.14.4-dev6" # pragma: no cover
+__version__ = "0.14.4" # pragma: no cover

unstructured/partition/pdf_image/ocr.py

+2 -1
@@ -280,7 +280,8 @@ def supplement_element_with_table_extraction(
 cropped_image, ocr_tokens=table_tokens, result_format="cells"
 )
 
-text_as_html = cells_to_html(tatr_cells)
+# NOTE(christine): `tatr_cells == ""` means that the table was not recognized
+text_as_html = "" if tatr_cells == "" else cells_to_html(tatr_cells)
 element.text_as_html = text_as_html
 
 if env_config.EXTRACT_TABLE_AS_CELLS:
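The guard above can be illustrated in isolation. This is a sketch under stated assumptions: `cells_to_html_stub` is a hypothetical stand-in for `cells_to_html` that is assumed to call `.copy()` on its input (as the commit title suggests the real code path does), and the `"cell text"` key is likewise assumed for illustration.

```python
from typing import Dict, List, Union

Cells = List[Dict[str, object]]


def cells_to_html_stub(cells: Cells) -> str:
    """Hypothetical stand-in for `cells_to_html`; assumes a list of cell dicts."""
    rows = cells.copy()  # raises AttributeError when `cells` is a str
    body = "".join(f"<td>{c['cell text']}</td>" for c in rows)
    return f"<table><tr>{body}</tr></table>"


def table_html(tatr_cells: Union[str, Cells]) -> str:
    """Mirror the guard in the diff: "" signals an unrecognized table."""
    return "" if tatr_cells == "" else cells_to_html_stub(tatr_cells)


# Without the guard, the empty-string sentinel reaches list-only code and fails.
try:
    cells_to_html_stub("")  # type: ignore[arg-type]
    unguarded_failed = False
except AttributeError:
    unguarded_failed = True

recognized = table_html([{"cell text": "a"}, {"cell text": "b"}])
unrecognized = table_html("")
```

Comparing against `""` rather than testing truthiness keeps an empty list of cells on the normal conversion path, which matches the check in the diff.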
