Commit 06c8523

rfctr(ppt): remove double-decoration (#3701)
Somehow this slipped through the earlier PR that removed double-decoration from PPTX. Remove the decorators from PPT (because it is a delegating partitioner) and let the decorators on the partitioner it delegates to (`partition_pptx()`) apply metadata and chunking.
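
To illustrate why double-decoration matters, here is a minimal sketch. It does not use the library's decorators: `apply_metadata_sketch`, `principal`, `double_decorated_delegate`, and `plain_delegate` are hypothetical stand-ins, and the `passes` list only records how many times post-processing ran on each element.

```python
from functools import wraps
from typing import Any, Callable

Element = dict[str, Any]  # hypothetical stand-in for a document element


def apply_metadata_sketch(file_type: str) -> Callable[..., Any]:
    """Hypothetical decorator: records one post-processing pass per element."""

    def decorator(func: Callable[..., list[Element]]) -> Callable[..., list[Element]]:
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> list[Element]:
            elements = func(*args, **kwargs)
            for element in elements:
                element["passes"].append(file_type)
            return elements

        return wrapper

    return decorator


@apply_metadata_sketch("pptx")
def principal(filename: str, **kwargs: Any) -> list[Element]:
    """Stand-in for the principal partitioner, which is the only one decorated."""
    return [{"text": "Slide 1", "passes": []}]


@apply_metadata_sketch("ppt")
def double_decorated_delegate(filename: str, **kwargs: Any) -> list[Element]:
    """Delegating partitioner that is also decorated: post-processing runs twice."""
    return principal(filename, **kwargs)


def plain_delegate(filename: str, **kwargs: Any) -> list[Element]:
    """Delegating partitioner without decorators: post-processing runs once."""
    return principal(filename, **kwargs)
```

In this sketch, elements returned by `double_decorated_delegate()` are post-processed twice, while `plain_delegate()` relies on the single pass applied by `principal()`.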
1 parent 27fa2a3 commit 06c8523

3 files changed: +12 -46 lines changed

CHANGELOG.md

Lines changed: 3 additions & 2 deletions
@@ -1,4 +1,4 @@
-## 0.15.14-dev12
+## 0.15.14-dev13
 
 ### Enhancements
 
@@ -13,10 +13,11 @@
 * **Fix occasional `KeyError` when mapping parent ids to hash ids.** Occasionally the input elements into `assign_and_map_hash_ids` can contain duplicated element instances, which lead to error when mapping parent id.
 * **Allow empty text files.** Fixes an issue where text files with only white space would fail to be partitioned.
 * **Remove double-decoration for CSV, DOC, ODT partitioners.** Refactor these partitioners to use the new `@apply_metadata()` decorator and only decorate the principal partitioner (CSV and DOCX in this case); remove decoration from delegating partitioners.
-* **Remove double-decoration for PPT, PPTX, TSV, XLSX, and XML partitioners.** Refactor these partitioners to use the new `@apply_metadata()` decorator and only decorate the principal partitioner; remove decoration from delegating partitioners.
+* **Remove double-decoration for PPTX, TSV, XLSX, and XML partitioners.** Refactor these partitioners to use the new `@apply_metadata()` decorator and only decorate the principal partitioner; remove decoration from delegating partitioners.
 * **Remove double-decoration for HTML, EPUB, MD, ORG, RST, and RTF partitioners.** Refactor these partitioners to use the new `@apply_metadata()` decorator and only decorate the principal partitioner (HTML in this case); remove decoration from delegating partitioners.
 * **Remove obsolete min_partition/max_partition args from TXT and EML.** The legacy `min_partition` and `max_partition` parameters were an initial rough implementation of chunking but now interfere with chunking and are unused. Remove those parameters from `partition_text()` and `partition_email()`.
 * **Remove double-decoration on EML and MSG.** Refactor these partitioners to rely on the new `@apply_metadata()` decorator operating on partitioners they delegate to (TXT, HTML, and all others for attachments) and remove direct decoration from EML and MSG.
+* **Remove double-decoration for PPT.** Remove decorators from the delegating PPT partitioner.
 
 ## 0.15.13
 
unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.15.14-dev12" # pragma: no cover
+__version__ = "0.15.14-dev13" # pragma: no cover

unstructured/partition/ppt.py

Lines changed: 8 additions & 43 deletions
@@ -4,31 +4,18 @@
 import tempfile
 from typing import IO, Any, Optional
 
-from unstructured.chunking import add_chunking_strategy
-from unstructured.documents.elements import Element, process_metadata
-from unstructured.file_utils.filetype import add_metadata_with_filetype
+from unstructured.documents.elements import Element
 from unstructured.file_utils.model import FileType
 from unstructured.partition.common.common import convert_office_doc, exactly_one
 from unstructured.partition.common.metadata import get_last_modified_date
 from unstructured.partition.pptx import partition_pptx
-from unstructured.partition.utils.constants import PartitionStrategy
 
 
-@process_metadata()
-@add_metadata_with_filetype(FileType.PPT)
-@add_chunking_strategy
 def partition_ppt(
     filename: Optional[str] = None,
     file: Optional[IO[bytes]] = None,
-    include_page_breaks: bool = False,
-    include_slide_notes: Optional[bool] = None,
-    infer_table_structure: bool = True,
     metadata_filename: Optional[str] = None,
     metadata_last_modified: Optional[str] = None,
-    languages: Optional[list[str]] = ["auto"],
-    detect_language_per_element: bool = False,
-    starting_page_number: int = 1,
-    strategy: str = PartitionStrategy.FAST,
     **kwargs: Any,
 ) -> list[Element]:
     """Partitions Microsoft PowerPoint Documents in .ppt format into their document elements.
@@ -39,29 +26,11 @@ def partition_ppt(
         A string defining the target filename path.
     file
         A file-like object using "rb" mode --> open(filename, "rb").
-    include_page_breaks
-        If True, includes a PageBreak element between slides
-    include_slide_notes
-        If True, includes the slide notes as element
-    infer_table_structure
-        If True, any Table elements that are extracted will also have a metadata field
-        named "text_as_html" where the table's text content is rendered into an html string.
-        I.e., rows and cells are preserved.
-        Whether True or False, the "text" field is always present in any Table element
-        and is the text content of the table (no structure).
     metadata_last_modified
         The last modified date for the document.
-    languages
-        User defined value for `metadata.languages` if provided. Otherwise language is detected
-        using naive Bayesian filter via `langdetect`. Multiple languages indicates text could be
-        in either language.
-    Additional Parameters:
-        detect_language_per_element
-            Detect language per element instead of at the document level.
-        starting_page_number
-            Indicates what page number should be assigned to the first slide in the presentation.
-            This information will be reflected in elements' metadata and can be be especially
-            useful when partitioning a document that is part of a larger document.
+
+    Note that all arguments valid on `partition_pptx()` are also valid here and will be passed
+    along to the `partition_pptx()` function.
     """
     # -- Verify that only one of the arguments was provided
     exactly_one(filename=filename, file=file)
@@ -92,17 +61,13 @@ def partition_ppt(
             target_filter="Impress MS PowerPoint 2007 XML",
         )
         pptx_filename = os.path.join(tmpdir, f"{base_filename}.pptx")
+
         elements = partition_pptx(
             filename=pptx_filename,
-            detect_language_per_element=detect_language_per_element,
-            include_page_breaks=include_page_breaks,
-            include_slide_notes=include_slide_notes,
-            infer_table_structure=infer_table_structure,
-            languages=languages,
-            metadata_filename=metadata_filename,
+            metadata_filename=metadata_filename or filename,
+            metadata_file_type=FileType.PPT,
             metadata_last_modified=metadata_last_modified or last_modified,
-            starting_page_number=starting_page_number,
-            strategy=strategy,
+            **kwargs,
         )
 
     # -- Remove tmp.name from filename if parsing file
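
The `**kwargs` pass-through in the hunk above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the library code: `partition_principal_sketch` and `partition_delegate_sketch` are hypothetical stand-ins, elements are plain dicts, and the file conversion is replaced by a filename rewrite.

```python
from typing import Any, Optional


def partition_principal_sketch(
    filename: str,
    metadata_filename: Optional[str] = None,
    metadata_file_type: Optional[str] = None,
    include_page_breaks: bool = False,
    **kwargs: Any,
) -> list[dict[str, Any]]:
    """Hypothetical principal partitioner that owns all partitioning options."""
    return [
        {
            "text": "Slide 1",
            "filename": metadata_filename or filename,
            "file_type": metadata_file_type or "pptx",
            "include_page_breaks": include_page_breaks,
        }
    ]


def partition_delegate_sketch(
    filename: str,
    metadata_filename: Optional[str] = None,
    **kwargs: Any,
) -> list[dict[str, Any]]:
    """Hypothetical delegating partitioner.

    It names only the arguments it needs itself and forwards everything else
    untouched, so options added to the principal partitioner need no change here.
    """
    # Stand-in for converting the legacy file to the principal format.
    converted_filename = filename.rsplit(".", 1)[0] + ".pptx"
    return partition_principal_sketch(
        filename=converted_filename,
        # Report the original file and its type, not the temporary converted file.
        metadata_filename=metadata_filename or filename,
        metadata_file_type="ppt",
        **kwargs,
    )
```

A call such as `partition_delegate_sketch("deck.ppt", include_page_breaks=True)` forwards `include_page_breaks` to the principal function without the delegate declaring it, while the metadata overrides keep the original filename and file type on the returned elements.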
