Commit 27cd53b
badGarnet, plutasnyy, christinestraub, and ryannikolaidis authored
fix: fix multiple values for infer_table_structure (#3870)
This PR fixes a bug when using `partition` to partition an email with image attachments with the `hi_res` strategy and table structure inference enabled: partitioning the image failed with the error `got multiple values for keyword argument 'infer_table_structure'`.

The cause is that when `partition` passes `kwargs` on to partition "other" types of files in this [block](https://github.com/Unstructured-IO/unstructured/blob/50ea6fe7fc324efa09398898dc35d0cd4e78b1cf/unstructured/partition/auto.py#L270-L280), `infer_table_structure` is packaged into `partitioning_kwargs`. For email, at least, when there are attachments that can be partitioned with `hi_res`, that dict of `kwargs` is passed right back into the `partition` entry point. So when execution gets [here](https://github.com/Unstructured-IO/unstructured/blob/50ea6fe7fc324efa09398898dc35d0cd4e78b1cf/unstructured/partition/auto.py#L222-L235), `infer_table_structure` is both specified explicitly and present in the `kwargs` variable.

The fix is to first check whether `kwargs` already contains `infer_table_structure` and, if it does, use that value and pop it from `kwargs`.

---------

Co-authored-by: Kamil Plucinski <[email protected]>
Co-authored-by: christinestraub <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: christinestraub <[email protected]>
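The failure mode and the fix described above can be reproduced in isolation. The following is a minimal sketch, not the code from `auto.py`: `partition_image`, `partition_buggy`, and `partition_fixed` are hypothetical stand-ins, and the fallback to `pdf_infer_table_structure` when the value is `None` is an assumption inferred from the tests added in this commit.

```python
def partition_image(filename, infer_table_structure=False, **kwargs):
    """Hypothetical stand-in for a file-type-specific partitioner."""
    return {"filename": filename, "infer_table_structure": infer_table_structure}


def partition_buggy(filename, pdf_infer_table_structure=False, **kwargs):
    # Bug: infer_table_structure is passed explicitly, but a caller that
    # re-enters with the forwarded kwargs also has it inside **kwargs.
    return partition_image(
        filename,
        infer_table_structure=pdf_infer_table_structure,
        **kwargs,
    )


def partition_fixed(filename, pdf_infer_table_structure=False, **kwargs):
    # Fix: pop the key from kwargs so it is passed exactly once.
    infer_table_structure = kwargs.pop("infer_table_structure", None)
    if infer_table_structure is None:
        # Assumed fallback, mirroring the `when_none` test in this commit.
        infer_table_structure = pdf_infer_table_structure
    return partition_image(
        filename,
        infer_table_structure=infer_table_structure,
        **kwargs,
    )


if __name__ == "__main__":
    try:
        partition_buggy("attachment.png", infer_table_structure=True)
    except TypeError as exc:
        print(f"buggy version raised: {exc}")

    print(partition_fixed("attachment.png", infer_table_structure=True))
```

In plain Python, the duplicate keyword surfaces as a `TypeError` at the call site, which is why removing the key from `kwargs` before forwarding is enough to avoid it.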
1 parent 38eb661 commit 27cd53b

File tree

6 files changed

+1073
-897
lines changed

Diff for: .github/workflows/ingest-test-fixtures-update-pr.yml

+1
@@ -109,6 +109,7 @@ jobs:
           sudo apt-get install -y tesseract-ocr-kor
           sudo apt-get install diffstat
           tesseract --version
+          python -m nltk.downloader punkt_tab averaged_perceptron_tagger_eng
           ./test_unstructured_ingest/test-ingest-src.sh
 
       - name: Save branch name to environment file

Diff for: CHANGELOG.md

+9
@@ -1,3 +1,12 @@
+## 0.16.14-dev0
+
+### Enhancements
+
+### Features
+
+### Fixes
+- **Fix an issue with multiple values for `infer_table_structure`** When partitioning an email with image attachments, the kwargs passed into `partition` to partition the image already contain `infer_table_structure`. The `partition` function now checks whether `kwargs` already has `infer_table_structure`.
+
 ## 0.16.13
 
 ### Enhancements

Diff for: test_unstructured/partition/test_auto.py

+27
@@ -570,6 +570,33 @@ def test_auto_partition_pdf_with_fast_strategy(request: FixtureRequest):
     )
 
 
+@pytest.mark.parametrize("infer_bool", [True, False])
+def test_auto_handles_kwarg_with_infer_table_structure(infer_bool):
+    with patch(
+        "unstructured.partition.pdf_image.ocr.process_file_with_ocr",
+    ) as mock_process_file_with_model:
+        partition(
+            example_doc_path("pdf/layout-parser-paper-fast.pdf"),
+            pdf_infer_table_structure=True,
+            strategy=PartitionStrategy.HI_RES,
+            infer_table_structure=infer_bool,
+        )
+        assert mock_process_file_with_model.call_args[1]["infer_table_structure"] is infer_bool
+
+
+def test_auto_handles_kwarg_with_infer_table_structure_when_none():
+    with patch(
+        "unstructured.partition.pdf_image.ocr.process_file_with_ocr",
+    ) as mock_process_file_with_model:
+        partition(
+            example_doc_path("pdf/layout-parser-paper-fast.pdf"),
+            pdf_infer_table_structure=True,
+            strategy=PartitionStrategy.HI_RES,
+            infer_table_structure=None,
+        )
+        assert mock_process_file_with_model.call_args[1]["infer_table_structure"] is True
+
+
 def test_auto_partition_pdf_uses_pdf_infer_table_structure_argument():
     with patch(
         "unstructured.partition.pdf_image.ocr.process_file_with_ocr",
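
The new tests assert on `call_args[1]`, which is the keyword-argument dict of the most recent call to the mock. As a rough illustration of that pattern using only the standard library, here is a small sketch; `run_pipeline` and the `"doc.pdf"` argument are hypothetical and stand in for the real `partition` call chain.

```python
from unittest.mock import MagicMock


def run_pipeline(process_file, infer_table_structure):
    """Hypothetical caller that forwards the flag as a keyword argument."""
    return process_file("doc.pdf", infer_table_structure=infer_table_structure)


mock_process = MagicMock()
run_pipeline(mock_process, infer_table_structure=False)

# call_args behaves like an (args, kwargs) tuple for the last call,
# so index 1 is the dict of keyword arguments.
args, kwargs = mock_process.call_args
assert args == ("doc.pdf",)
assert mock_process.call_args[1]["infer_table_structure"] is False
assert mock_process.call_args.kwargs == kwargs
```

Asserting with `is` rather than `==` checks that the exact boolean was forwarded, not merely a truthy or falsy value.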
