Commit 601f250

feat: add partition_ppt for older power point docs (#238)
* added partition_ppt function and tests
* add ppt support to auto
* version bump
* update docs
* doc fixes
* update changelog
* `.docx` -> `.pptx`
* its -> their
* remove whitespace
1 parent 6036af3 commit 601f250

12 files changed: +157 -12 lines changed

Diff for: CHANGELOG.md

+3 -2
@@ -1,6 +1,7 @@
-## 0.4.11-dev0
+## 0.4.11
 
-* Adds `partition_doc` for partition Word documents in `.doc` format. Requires `libreoffice`.
+* Adds `partition_doc` for partitioning Word documents in `.doc` format. Requires `libreoffice`.
+* Adds `partition_ppt` for partitioning PowerPoint documents in `.ppt` format. Requires `libreoffice`.
 
 ## 0.4.10
 

Diff for: README.md

+3 -2
@@ -78,7 +78,8 @@ To install the library, run `pip install unstructured`.
 You can run this [Colab notebook](https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW) to run the examples below.
 
 The following examples show how to get started with the `unstructured` library.
-You can parse **TXT**, **HTML**, **PDF**, **EML** **DOC** and **DOCX** documents with one line of code!
+You can parse **TXT**, **HTML**, **PDF**, **EML**, **DOC**, **DOCX**, **PPT**, **PPTX**, **JPG**,
+and **PNG** documents with one line of code!
 <br></br>
 See our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
 of the features in the library.
@@ -92,7 +93,7 @@ If you are using the `partition` brick, you may need to install additional param
 instructions outlined [here](https://unstructured-io.github.io/unstructured/installing.html#filetype-detection)
 `partition` will always apply the default arguments. If you need
 advanced features, use a document-specific brick. The `partition` brick currently works for
-`.txt`, `.doc`, `.docx`, `.pptx`, `.jpg`, `.png`, `.eml`, `.html`, and `.pdf` documents.
+`.txt`, `.doc`, `.docx`, `.ppt`, `.pptx`, `.jpg`, `.png`, `.eml`, `.html`, and `.pdf` documents.
 
 ```python
 from unstructured.partition.auto import partition

Diff for: docs/source/bricks.rst

+22 -3
@@ -22,7 +22,7 @@ If you call the ``partition`` function, ``unstructured`` will attempt to detect
 file type and route it to the appropriate partitioning brick. All partitioning bricks
 called within ``partition`` are called using the defualt kwargs. Use the document-type
 specific bricks if you need to apply non-default settings.
-``partition`` currently supports ``.docx``, ``.doc``, ``.pptx``, ``.eml``, ``.html``, ``.pdf``,
+``partition`` currently supports ``.docx``, ``.doc``, ``.pptx``, ``.ppt``, ``.eml``, ``.html``, ``.pdf``,
 ``.png``, ``.jpg``, and ``.txt`` files.
 If you set the ``include_page_breaks`` kwarg to ``True``, the output will include page breaks. This is only supported for ``.pptx``, ``.html``, ``.pdf``,
 ``.png``, and ``.jpg``.
@@ -89,8 +89,8 @@ The ``partition_doc`` partitioning brick pre-processes Microsoft Word documents
 saved in the ``.doc`` format. This staging brick uses a combination of the styling
 information in the document and the structure of the text to determine the type
 of a text element. The ``partition_doc`` can take a filename or file-like object
-as input, as shown in the two examples below. ``partiton_doc``
-uses ``libreoffice`` to convert the file to ``.docx`` and then
+as input.
+``partiton_doc`` uses ``libreoffice`` to convert the file to ``.docx`` and then
 calls ``partition_docx``. Ensure you have ``libreoffice`` installed
 before using ``partition_doc``.
 
@@ -124,6 +124,25 @@ Examples:
       elements = partition_pptx(file=f)
 
 
+``partition_ppt``
+---------------------
+
+The ``partition_ppt`` partitioning brick pre-processes Microsoft PowerPoint documents
+saved in the ``.ppt`` format. This staging brick uses a combination of the styling
+information in the document and the structure of the text to determine the type
+of a text element. The ``partition_ppt`` can take a filename or file-like object.
+``partition_ppt`` uses ``libreoffice`` to convert the file to ``.pptx`` and then
+calls ``partition_pptx``. Ensure you have ``libreoffice`` installed
+before using ``partition_ppt``.
+
+Examples:
+
+.. code:: python
+
+  from unstructured.partition.ppt import partition_ppt
+
+  elements = partition_ppt(filename="example-docs/fake-power-point.ppt")
+
 ``partition_html``
 ---------------------
 
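
The documentation added above states that `partition_ppt` relies on `libreoffice` to turn a `.ppt` file into `.pptx` before partitioning. As a rough illustration of what such a conversion step could look like, the sketch below only builds a headless LibreOffice command line and the expected output path. The helper names `build_convert_command` and `converted_path` are hypothetical, and the sketch assumes the LibreOffice binary is reachable as `soffice`; the library's own `convert_office_doc` helper may work differently.

```python
import os
from typing import List


def build_convert_command(
    filename: str, output_directory: str, target_format: str = "pptx"
) -> List[str]:
    """Build (but do not run) a headless LibreOffice conversion command.

    Hypothetical helper for illustration; assumes the binary is named ``soffice``.
    """
    return [
        "soffice",
        "--headless",
        "--convert-to",
        target_format,
        "--outdir",
        output_directory,
        filename,
    ]


def converted_path(filename: str, output_directory: str, target_format: str = "pptx") -> str:
    """Path where the converted file is expected to be written."""
    base_filename, _ = os.path.splitext(os.path.basename(filename))
    return os.path.join(output_directory, f"{base_filename}.{target_format}")
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where LibreOffice is installed; the converted file would then be read from the path returned by `converted_path`.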

Diff for: example-docs/fake-power-point.ppt

594 KB
Binary file not shown.

Diff for: test_unstructured/partition/test_auto.py

+8
@@ -105,6 +105,7 @@ def test_auto_partition_doc_with_filename(mock_docx_document, expected_docx_elem
 
     elements = partition(filename=doc_filename)
     assert elements == expected_docx_elements
+    assert elements[0].metadata.filename == doc_filename
 
 
 # NOTE(robinson) - the application/x-ole-storage mime type is not specific enough to
@@ -240,6 +241,13 @@ def test_auto_partition_pptx_from_filename():
     assert elements[0].metadata.filename == filename
 
 
+def test_auto_partition_ppt_from_filename():
+    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "fake-power-point.ppt")
+    elements = partition(filename=filename)
+    assert elements == EXPECTED_PPTX_OUTPUT
+    assert elements[0].metadata.filename == filename
+
+
 def test_auto_with_page_breaks():
     filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.pdf")
     elements = partition(filename=filename, include_page_breaks=True)

Diff for: test_unstructured/partition/test_ppt.py

+49
@@ -0,0 +1,49 @@
+import os
+import pathlib
+import pytest
+
+from unstructured.partition.ppt import partition_ppt
+from unstructured.documents.elements import ListItem, NarrativeText, Title
+
+DIRECTORY = pathlib.Path(__file__).parent.resolve()
+EXAMPLE_DOCS_DIRECTORY = os.path.join(DIRECTORY, "..", "..", "example-docs")
+
+EXPECTED_PPT_OUTPUT = [
+    Title(text="Adding a Bullet Slide"),
+    ListItem(text="Find the bullet slide layout"),
+    ListItem(text="Use _TextFrame.text for first bullet"),
+    ListItem(text="Use _TextFrame.add_paragraph() for subsequent bullets"),
+    NarrativeText(text="Here is a lot of text!"),
+    NarrativeText(text="Here is some text in a text box!"),
+]
+
+
+def test_partition_ppt_from_filename():
+    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "fake-power-point.ppt")
+    elements = partition_ppt(filename=filename)
+    assert elements == EXPECTED_PPT_OUTPUT
+
+
+def test_partition_ppt_raises_with_missing_file():
+    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "doesnt-exist.ppt")
+    with pytest.raises(ValueError):
+        partition_ppt(filename=filename)
+
+
+def test_partition_ppt_from_file():
+    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "fake-power-point.ppt")
+    with open(filename, "rb") as f:
+        elements = partition_ppt(file=f)
+    assert elements == EXPECTED_PPT_OUTPUT
+
+
+def test_partition_ppt_raises_with_both_specified():
+    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "fake-power-point.ppt")
+    with open(filename, "rb") as f:
+        with pytest.raises(ValueError):
+            partition_ppt(filename=filename, file=f)
+
+
+def test_partition_ppt_raises_with_neither():
+    with pytest.raises(ValueError):
+        partition_ppt()

Diff for: unstructured/__version__.py

+1 -1
@@ -1 +1 @@
-__version__ = "0.4.11-dev0"  # pragma: no cover
+__version__ = "0.4.11"  # pragma: no cover

Diff for: unstructured/partition/auto.py

+3
@@ -6,6 +6,7 @@
 from unstructured.partition.email import partition_email
 from unstructured.partition.html import partition_html
 from unstructured.partition.pdf import partition_pdf
+from unstructured.partition.ppt import partition_ppt
 from unstructured.partition.pptx import partition_pptx
 from unstructured.partition.image import partition_image
 from unstructured.partition.text import partition_text
@@ -59,6 +60,8 @@ def partition(
         )
     elif filetype == FileType.TXT:
         return partition_text(filename=filename, file=file)
+    elif filetype == FileType.PPT:
+        return partition_ppt(filename=filename, file=file, include_page_breaks=include_page_breaks)
     elif filetype == FileType.PPTX:
         return partition_pptx(filename=filename, file=file, include_page_breaks=include_page_breaks)
     else:
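
The hunk above adds a `FileType.PPT` branch that forwards `include_page_breaks` to `partition_ppt`, mirroring the existing `FileType.PPTX` branch. The following self-contained sketch imitates that routing with stand-ins so the control flow can be run without the library. `FileType` here is a two-member mock, the `*_stub` functions are hypothetical placeholders, and raising `ValueError` in the final `else` is an assumption, since the body of that branch is not shown in the diff.

```python
from enum import Enum
from typing import List, Optional


class FileType(Enum):
    """Mock of the library's FileType enum; only the members needed here."""

    PPT = "ppt"
    PPTX = "pptx"
    UNKNOWN = "unknown"


def partition_ppt_stub(filename: Optional[str], include_page_breaks: bool) -> List[str]:
    # Placeholder: the real function converts to .pptx and partitions the result.
    return [f"ppt:{filename}:page_breaks={include_page_breaks}"]


def partition_pptx_stub(filename: Optional[str], include_page_breaks: bool) -> List[str]:
    # Placeholder for the real .pptx partitioner.
    return [f"pptx:{filename}:page_breaks={include_page_breaks}"]


def route(filetype: FileType, filename: str, include_page_breaks: bool = False) -> List[str]:
    """Route to a partitioner by file type, forwarding include_page_breaks."""
    if filetype == FileType.PPT:
        return partition_ppt_stub(filename, include_page_breaks)
    elif filetype == FileType.PPTX:
        return partition_pptx_stub(filename, include_page_breaks)
    else:
        # Assumption: unsupported types are rejected with an error.
        raise ValueError(f"Unsupported file type: {filetype}")
```

Calling `route(FileType.PPT, "deck.ppt", include_page_breaks=True)` reaches the `.ppt` stub with the flag intact, which is the behavior the new branch is meant to provide.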

Diff for: unstructured/partition/doc.py

+1 -1
@@ -40,6 +40,6 @@ def partition_doc(filename: Optional[str] = None, file: Optional[IO] = None) ->
     with tempfile.TemporaryDirectory() as tmpdir:
         convert_office_doc(filename, tmpdir, target_format="docx")
         docx_filename = os.path.join(tmpdir, f"{base_filename}.docx")
-        elements = partition_docx(filename=docx_filename)
+        elements = partition_docx(filename=docx_filename, metadata_filename=filename)
 
     return elements

Diff for: unstructured/partition/docx.py

+11 -2
@@ -56,7 +56,11 @@
 }
 
 
-def partition_docx(filename: Optional[str] = None, file: Optional[IO] = None) -> List[Element]:
+def partition_docx(
+    filename: Optional[str] = None,
+    file: Optional[IO] = None,
+    metadata_filename: Optional[str] = None,
+) -> List[Element]:
     """Partitions Microsoft Word Documents in .docx format into its document elements.
 
     Parameters
@@ -65,6 +69,10 @@ def partition_docx(filename: Optional[str] = None, file: Optional[IO] = None) ->
         A string defining the target filename path.
     file
         A file-like object using "rb" mode --> open(filename, "rb").
+    metadata_filename
+        The filename to use for the metadata. Relevant because partition_doc converts the
+        document to .docx before partition. We want the original source filename in the
+        metadata.
     """
 
     if not any([filename, file]):
@@ -77,11 +85,12 @@ def partition_docx(filename: Optional[str] = None, file: Optional[IO] = None) ->
     else:
         raise ValueError("Only one of filename or file can be specified.")
 
+    metadata_filename = metadata_filename or filename
     elements: List[Element] = []
     for paragraph in document.paragraphs:
         element = _paragraph_to_element(paragraph)
         if element is not None:
-            element.metadata = ElementMetadata(filename=filename)
+            element.metadata = ElementMetadata(filename=metadata_filename)
             elements.append(element)
 
     return elements

Diff for: unstructured/partition/ppt.py

+49
@@ -0,0 +1,49 @@
+import os
+import tempfile
+from typing import IO, List, Optional
+
+from unstructured.documents.elements import Element
+from unstructured.partition.common import convert_office_doc
+from unstructured.partition.pptx import partition_pptx
+
+
+def partition_ppt(
+    filename: Optional[str] = None, file: Optional[IO] = None, include_page_breaks: bool = False
+) -> List[Element]:
+    """Partitions Microsoft PowerPoint Documents in .ppt format into their document elements.
+
+    Parameters
+    ----------
+    filename
+        A string defining the target filename path.
+    file
+        A file-like object using "rb" mode --> open(filename, "rb").
+    include_page_breaks
+        If True, includes a PageBreak element between slides
+    """
+    if not any([filename, file]):
+        raise ValueError("One of filename or file must be specified.")
+
+    if filename is not None and not file:
+        _, filename_no_path = os.path.split(os.path.abspath(filename))
+        base_filename, _ = os.path.splitext(filename_no_path)
+    elif file is not None and not filename:
+        tmp = tempfile.NamedTemporaryFile(delete=False)
+        tmp.write(file.read())
+        tmp.close()
+        filename = tmp.name
+        _, filename_no_path = os.path.split(os.path.abspath(tmp.name))
+    else:
+        raise ValueError("Only one of filename or file can be specified.")
+
+    if not os.path.exists(filename):
+        raise ValueError(f"The file {filename} does not exist.")
+
+    base_filename, _ = os.path.splitext(filename_no_path)
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        convert_office_doc(filename, tmpdir, target_format="pptx")
+        pptx_filename = os.path.join(tmpdir, f"{base_filename}.pptx")
+        elements = partition_pptx(filename=pptx_filename, metadata_filename=filename)
+
+    return elements
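
The first half of `partition_ppt` decides which input was given, spools a file-like object to disk when needed, and derives the base name used to locate the converted `.pptx`. The sketch below isolates that logic using only the standard library so it can be exercised without LibreOffice. The function name `resolve_input` and its tuple return value are hypothetical and not part of the commit; the checks and error messages follow the ones shown in the diff.

```python
import os
import tempfile
from typing import IO, Optional, Tuple


def resolve_input(filename: Optional[str] = None, file: Optional[IO] = None) -> Tuple[str, str]:
    """Return (path_on_disk, base_filename) for exactly one of filename or file."""
    if not any([filename, file]):
        raise ValueError("One of filename or file must be specified.")
    if filename is not None and file is not None:
        raise ValueError("Only one of filename or file can be specified.")

    if file is not None:
        # Write the stream to a named temporary file so an external
        # converter can read it from a real path.
        tmp = tempfile.NamedTemporaryFile(delete=False)
        tmp.write(file.read())
        tmp.close()
        filename = tmp.name

    if not os.path.exists(filename):
        raise ValueError(f"The file {filename} does not exist.")

    # Strip the directory and extension: "a/b/deck.ppt" -> "deck".
    base_filename, _ = os.path.splitext(os.path.basename(filename))
    return filename, base_filename
```

Because the temporary file is created with `delete=False`, a caller of this sketch is responsible for removing it once the conversion has finished.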

Diff for: unstructured/partition/pptx.py

+7 -1
@@ -24,6 +24,7 @@ def partition_pptx(
     filename: Optional[str] = None,
     file: Optional[IO] = None,
     include_page_breaks: bool = True,
+    metadata_filename: Optional[str] = None,
 ) -> List[Element]:
     """Partitions Microsoft PowerPoint Documents in .pptx format into its document elements.
 
@@ -35,6 +36,10 @@ def partition_pptx(
         A file-like object using "rb" mode --> open(filename, "rb").
     include_page_breaks
         If True, includes a PageBreak element between slides
+    metadata_filename
+        The filename to use for the metadata. Relevant because partition_ppt converts the
+        document .pptx before partition. We want the original source filename in the
+        metadata.
     """
 
     if not any([filename, file]):
@@ -48,7 +53,8 @@ def partition_pptx(
         raise ValueError("Only one of filename or file can be specified.")
 
     elements: List[Element] = list()
-    metadata = ElementMetadata(filename=filename)
+    metadata_filename = metadata_filename or filename
+    metadata = ElementMetadata(filename=metadata_filename)
     num_slides = len(presentation.slides)
     for i, slide in enumerate(presentation.slides):
         metadata.page_number = i + 1
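
The `metadata_filename or filename` fallback introduced in `partition_docx` and `partition_pptx` keeps the original source name in element metadata even though the file actually parsed is a converted temporary copy. A minimal sketch of that fallback follows. The `ElementMetadata` dataclass is a simplified stand-in for the library class, and `build_metadata` is a hypothetical helper used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ElementMetadata:
    """Simplified stand-in for the library's ElementMetadata."""

    filename: Optional[str] = None
    page_number: Optional[int] = None


def build_metadata(
    filename: Optional[str] = None, metadata_filename: Optional[str] = None
) -> ElementMetadata:
    # Prefer the caller-supplied original name; otherwise fall back to the
    # name of the file that was actually opened.
    return ElementMetadata(filename=metadata_filename or filename)
```

With this fallback, callers that pass only `filename` see no change in behavior, while a converting wrapper can pass the original `.ppt` or `.doc` path as `metadata_filename`.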
