Commit e43cb0e

MthwRobinson and qued authored
feat: add partition_epub function (#364)
* add pypandoc dependency
* added epub partitioner and file conversion
* test for partition_epub
* tests for file conversion
* add epub to filetype detection
* added epub to auto partition
* update bricks docs
* updated installing docs
* changelot and version
* add pandoc to dependencies
* add pandoc to debian dependencies
* linting, linting, linting
* typo fix
* typo fix
* file conversion type hints
* more type hints

---------

Co-authored-by: qued <[email protected]>
1 parent aa49462 commit e43cb0e

18 files changed

+206
-7
lines changed

Diff for: .github/workflows/ci.yml

+1-1
Original file line numberDiff line numberDiff line change
@@ -105,7 +105,7 @@ jobs:
105105
source .venv/bin/activate
106106
make install-detectron2
107107
sudo apt-get update
108-
sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice
108+
sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice pandoc
109109
make test
110110
make check-coverage
111111
make install-ingest-s3

Diff for: CHANGELOG.md

+2-1
@@ -1,4 +1,4 @@
1-
## 0.5.4-dev7
1+
## 0.5.4
22

33
### Enhancements
44

@@ -21,6 +21,7 @@
2121

2222
* Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting
2323
from `FsspecConnector`
24+
* Add `partition_epub` for partitioning e-books in EPUB3 format.
2425

2526
### Fixes
2627

Diff for: README.md

+2-2
@@ -110,7 +110,7 @@ file to ensure your code matches the formatting and linting standards used in `u
110110
If you'd prefer not having code changes auto-tidied before every commit, you can use `make check` to see
111111
whether any linting or formatting changes should be applied, and `make tidy` to apply them.
112112

113-
If using the optional `pre-commit`, you'll just need to install the hooks with `pre-commit install` since the
113+
If using the optional `pre-commit`, you'll just need to install the hooks with `pre-commit install` since the
114114
`pre-commit` package is installed as part of `make install` mentioned above. Finally, if you decided to use `pre-commit`
115115
you can also uninstall the hooks with `pre-commit uninstall`.
116116

@@ -119,7 +119,7 @@ you can also uninstall the hooks with `pre-commit uninstall`.
119119
You can run this [Colab notebook](https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW) to run the examples below.
120120

121121
The following examples show how to get started with the `unstructured` library.
122-
You can parse **TXT**, **HTML**, **PDF**, **EML**, **DOC**, **DOCX**, **PPT**, **PPTX**, **JPG**,
122+
You can parse **TXT**, **HTML**, **PDF**, **EML**, **EPUB**, **DOC**, **DOCX**, **PPT**, **PPTX**, **JPG**,
123123
and **PNG** documents with one line of code!
124124
<br></br>
125125
See our [documentation page](https://unstructured-io.github.io/unstructured) for a full description

Diff for: docs/source/bricks.rst

+36-1
@@ -82,7 +82,7 @@ If you call the ``partition`` function, ``unstructured`` will attempt to detect
8282
file type and route it to the appropriate partitioning brick. All partitioning bricks
8383
called within ``partition`` are called using the default kwargs. Use the document-type
8484
specific bricks if you need to apply non-default settings.
85-
``partition`` currently supports ``.docx``, ``.doc``, ``.pptx``, ``.ppt``, ``.eml``, ``.html``, ``.pdf``,
85+
``partition`` currently supports ``.docx``, ``.doc``, ``.pptx``, ``.ppt``, ``.eml``, ``.epub``, ``.html``, ``.pdf``,
8686
``.png``, ``.jpg``, and ``.txt`` files.
8787
If you set the ``include_page_breaks`` kwarg to ``True``, the output will include page breaks. This is only supported for ``.pptx``, ``.html``, ``.pdf``,
8888
``.png``, and ``.jpg``.
@@ -306,6 +306,41 @@ Examples:
306306
elements = partition_email(text=text, include_headers=True)
307307
308308
309+
``partition_epub``
310+
---------------------
311+
312+
The ``partition_epub`` function processes e-books in EPUB3 format. The function
313+
first converts the document to HTML using ``pandoc`` and then calls ``partition_html``.
314+
You'll need `pandoc <https://pandoc.org/installing.html>`_ installed on your system
315+
to use ``partition_epub``.
316+
317+
318+
Examples:
319+
320+
.. code:: python
321+
322+
from unstructured.partition.epub import partition_epub
323+
324+
elements = partition_epub(filename="example-docs/winter-sports.epub")
325+
326+
327+
``partition_md``
328+
---------------------
329+
330+
The ``partition_md`` function provides the ability to parse markdown files. The
331+
following workflow shows how to use ``partition_md``.
332+
333+
334+
Examples:
335+
336+
.. code:: python
337+
338+
from unstructured.partition.md import partition_md
339+
340+
elements = partition_md(filename="README.md")
341+
342+
343+
309344
``partition_text``
310345
---------------------
311346
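
The documentation added in this diff notes that ``partition_epub`` needs pandoc installed on the system. As a minimal sketch (not part of this commit), a caller could check for the executable before partitioning. The helper name `pandoc_available` is hypothetical, and the check assumes the binary is on the PATH under the name `pandoc`.

```python
import shutil


def pandoc_available(executable: str = "pandoc") -> bool:
    """Return True if the named executable can be found on the PATH."""
    # shutil.which returns the full path to the executable, or None if absent.
    return shutil.which(executable) is not None


if pandoc_available():
    print("pandoc found; partition_epub should be able to convert EPUB files")
else:
    print("pandoc not found; see https://pandoc.org/installing.html")
```

A check like this only reports whether the binary is reachable; the commit itself handles the missing-pandoc case by catching `FileNotFoundError` in `convert_file_to_text`.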

Diff for: docs/source/installing.rst

+1
@@ -15,6 +15,7 @@ installation.
1515
* ``poppler-utils`` (images and PDFs)
1616
* ``tesseract-ocr`` (images and PDFs)
1717
* ``libreoffice`` (MS Office docs)
18+
* ``pandoc`` (EPUBs)
1819

1920
* If you are parsing PDFs, run the following to install the ``detectron2`` model, which ``unstructured`` uses for layout detection:
2021
* ``pip install "detectron2@git+https://github.com/facebookresearch/[email protected]#egg=detectron2"``

Diff for: example-docs/winter-sports.epub

205 KB
Binary file not shown.

Diff for: requirements/base.txt

+5
@@ -4,6 +4,9 @@
44
#
55
# pip-compile --output-file=requirements/base.txt
66
#
7+
--extra-index-url https://pypi.ngc.nvidia.com
8+
--trusted-host pypi.ngc.nvidia.com
9+
710
anyio==3.6.2
811
# via httpcore
912
argilla==1.4.0
@@ -72,6 +75,8 @@ pydantic==1.10.6
7275
# via argilla
7376
pygments==2.14.0
7477
# via rich
78+
pypandoc==1.11
79+
# via unstructured (setup.py)
7580
python-dateutil==2.8.2
7681
# via pandas
7782
python-docx==0.8.11

Diff for: scripts/setup_ubuntu.sh

+1-1
@@ -84,7 +84,7 @@ $sudo $pac install -y poppler-utils
8484

8585
#### Tesseract
8686
# Install tesseract as well as Russian language
87-
$sudo $pac install -y tesseract-ocr libtesseract-dev tesseract-ocr-rus libreoffice
87+
$sudo $pac install -y tesseract-ocr libtesseract-dev tesseract-ocr-rus libreoffice pandoc
8888

8989
#### libmagic
9090
$sudo $pac install -y libmagic-dev

Diff for: setup.py

+1
@@ -56,6 +56,7 @@
5656
"openpyxl",
5757
"pandas",
5858
"pillow",
59+
"pypandoc",
5960
"python-docx",
6061
"python-pptx",
6162
"python-magic",

Diff for: test_unstructured/file_utils/test_file_conversion.py

+23
@@ -0,0 +1,23 @@
1+
import os
2+
import pathlib
3+
from unittest.mock import patch
4+
5+
import pypandoc
6+
import pytest
7+
8+
from unstructured.file_utils.file_conversion import convert_file_to_text
9+
10+
DIRECTORY = pathlib.Path(__file__).parent.resolve()
11+
12+
13+
def test_convert_file_to_text():
14+
filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
15+
html_text = convert_file_to_text(filename, source_format="epub", target_format="html")
16+
assert html_text.startswith("<p>")
17+
18+
19+
def test_convert_to_file_raises_if_pandoc_not_available():
20+
filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
21+
with patch.object(pypandoc, "convert_file", side_effect=FileNotFoundError):
22+
with pytest.raises(FileNotFoundError):
23+
convert_file_to_text(filename, source_format="epub", target_format="html")

Diff for: test_unstructured/file_utils/test_filetype.py

+3
@@ -30,6 +30,7 @@
3030
("fake-html.html", FileType.HTML),
3131
("unsupported/fake-excel.xlsx", FileType.XLSX),
3232
("fake-power-point.pptx", FileType.PPTX),
33+
("winter-sports.epub", FileType.EPUB),
3334
],
3435
)
3536
def test_detect_filetype_from_filename(file, expected):
@@ -50,6 +51,7 @@ def test_detect_filetype_from_filename(file, expected):
5051
("fake-html.html", FileType.HTML),
5152
("unsupported/fake-excel.xlsx", FileType.XLSX),
5253
("fake-power-point.pptx", FileType.PPTX),
54+
("winter-sports.epub", FileType.EPUB),
5355
],
5456
)
5557
def test_detect_filetype_from_filename_with_extension(monkeypatch, file, expected):
@@ -73,6 +75,7 @@ def test_detect_filetype_from_filename_with_extension(monkeypatch, file, expecte
7375
("fake-html.html", FileType.HTML),
7476
("unsupported/fake-excel.xlsx", FileType.XLSX),
7577
("fake-power-point.pptx", FileType.PPTX),
78+
("winter-sports.epub", FileType.EPUB),
7679
],
7780
)
7881
def test_detect_filetype_from_file(file, expected):

Diff for: test_unstructured/partition/test_auto.py

+15
@@ -277,3 +277,18 @@ def test_auto_with_page_breaks():
277277
filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.pdf")
278278
elements = partition(filename=filename, include_page_breaks=True)
279279
assert PageBreak() in elements
280+
281+
282+
def test_auto_partition_epub_from_filename():
283+
filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
284+
elements = partition(filename=filename)
285+
assert len(elements) > 0
286+
assert elements[0].text.startswith("The Project Gutenberg eBook of Winter Sports")
287+
288+
289+
def test_auto_partition_epub_from_file():
290+
filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
291+
with open(filename, "rb") as f:
292+
elements = partition(file=f)
293+
assert len(elements) > 0
294+
assert elements[0].text.startswith("The Project Gutenberg eBook of Winter Sports")

Diff for: test_unstructured/partition/test_epub.py

+21
@@ -0,0 +1,21 @@
1+
import os
2+
import pathlib
3+
4+
from unstructured.partition.epub import partition_epub
5+
6+
DIRECTORY = pathlib.Path(__file__).parent.resolve()
7+
8+
9+
def test_partition_epub_from_filename():
10+
filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
11+
elements = partition_epub(filename=filename)
12+
assert len(elements) > 0
13+
assert elements[0].text.startswith("The Project Gutenberg eBook of Winter Sports")
14+
15+
16+
def test_partition_epub_from_file():
17+
filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
18+
with open(filename, "rb") as f:
19+
elements = partition_epub(file=f)
20+
assert len(elements) > 0
21+
assert elements[0].text.startswith("The Project Gutenberg eBook of Winter Sports")

Diff for: unstructured/__version__.py

+1-1
@@ -1 +1 @@
1-
__version__ = "0.5.4-dev7" # pragma: no cover
1+
__version__ = "0.5.4" # pragma: no cover

Diff for: unstructured/file_utils/file_conversion.py

+49
@@ -0,0 +1,49 @@
1+
import tempfile
2+
from typing import IO, Optional
3+
4+
import pypandoc
5+
6+
from unstructured.partition.common import exactly_one
7+
8+
9+
def convert_file_to_text(filename: str, source_format: str, target_format: str) -> str:
10+
"""Uses pandoc to convert the source document to a raw text string."""
11+
try:
12+
text = pypandoc.convert_file(filename, target_format, format=source_format)
13+
except FileNotFoundError as err:
14+
msg = (
15+
"Error converting the file to text. Ensure you have the pandoc "
16+
"package installed on your system. Install instructions are available at "
17+
"https://pandoc.org/installing.html. The original exception text was:\n"
18+
f"{err}"
19+
)
20+
raise FileNotFoundError(msg)
21+
22+
return text
23+
24+
25+
def convert_epub_to_html(
26+
filename: Optional[str] = None,
27+
file: Optional[IO] = None,
28+
) -> str:
29+
"""Converts an EPUB document to HTML raw text. Enables an EPUB document to be
30+
processed using the partition_html function."""
31+
exactly_one(filename=filename, file=file)
32+
33+
if file is not None:
34+
tmp = tempfile.NamedTemporaryFile(delete=False)
35+
tmp.write(file.read())
36+
tmp.close()
37+
html_text = convert_file_to_text(
38+
filename=tmp.name,
39+
source_format="epub",
40+
target_format="html",
41+
)
42+
elif filename is not None:
43+
html_text = convert_file_to_text(
44+
filename=filename,
45+
source_format="epub",
46+
target_format="html",
47+
)
48+
49+
return html_text
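
In the diff above, `convert_epub_to_html` copies a file-like object into a `NamedTemporaryFile` created with `delete=False`, and nothing in the shown code removes that file afterwards. The following is a minimal sketch of one way to clean up, not the commit's implementation. The helper name `convert_stream_with_cleanup` is hypothetical, and the `convert` callable is a stand-in for something like `convert_file_to_text`.

```python
import io
import os
import tempfile
from typing import IO, Callable


def convert_stream_with_cleanup(file: IO[bytes], convert: Callable[[str], str]) -> str:
    """Write the stream to a temporary file, convert it by path, then delete it."""
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        tmp.write(file.read())
        # Close before converting so the path can be reopened on every platform.
        tmp.close()
        return convert(tmp.name)
    finally:
        # close() is safe to call twice; unlink removes the temporary file
        # whether or not the conversion succeeded.
        tmp.close()
        os.unlink(tmp.name)


def fake_convert(path: str) -> str:
    """Hypothetical converter used only for this illustration."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8")


html = convert_stream_with_cleanup(io.BytesIO(b"<p>hello</p>"), fake_convert)
print(html)
```

Passing the converter in as a callable keeps the sketch runnable without pandoc installed; in the library it would presumably wrap the pandoc-based conversion.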

Diff for: unstructured/file_utils/filetype.py

+10
@@ -47,6 +47,11 @@
4747
"text/x-markdown",
4848
]
4949

50+
EPUB_MIME_TYPES = [
51+
"application/epub",
52+
"application/epub+zip",
53+
]
54+
5055
# NOTE(robinson) - .docx/.xlsx files are actually zip files with a .docx/.xlsx extension.
5156
# If the MIME type is application/octet-stream, we check if it's a .docx/.xlsx file by
5257
# looking for expected filenames within the zip file.
@@ -94,6 +99,7 @@ class FileType(Enum):
9499
HTML = 50
95100
XML = 51
96101
MD = 52
102+
EPUB = 53
97103

98104
# Compressed Types
99105
ZIP = 60
@@ -123,6 +129,7 @@ def __lt__(self, other):
123129
".ppt": FileType.PPT,
124130
".rtf": FileType.RTF,
125131
".json": FileType.JSON,
132+
".epub": FileType.EPUB,
126133
}
127134

128135

@@ -180,6 +187,9 @@ def detect_filetype(
180187
# NOTE - I am not sure whether libmagic ever returns these mimetypes.
181188
return FileType.MD
182189

190+
elif mime_type in EPUB_MIME_TYPES:
191+
return FileType.EPUB
192+
183193
elif mime_type in TXT_MIME_TYPES:
184194
if extension and extension == ".eml":
185195
return FileType.EML
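
This commit detects EPUB files through the MIME type and the `.epub` extension. The NOTE in the same file describes a different fallback for `.docx`/`.xlsx`: looking inside the zip archive. An EPUB container is also a zip archive, and its `mimetype` entry holds the string `application/epub+zip`, so a similar content check is possible. The sketch below is an illustration only; `looks_like_epub` is a hypothetical helper and is not part of this commit.

```python
import io
import zipfile
from typing import IO


def looks_like_epub(file: IO[bytes]) -> bool:
    """Return True if the stream is a zip archive declaring the EPUB MIME type."""
    try:
        with zipfile.ZipFile(file) as archive:
            # archive.read raises KeyError if there is no "mimetype" entry.
            declared = archive.read("mimetype")
    except (zipfile.BadZipFile, KeyError):
        return False
    return declared.decode("ascii", errors="replace").strip() == "application/epub+zip"


# Build a tiny in-memory archive that carries only the mimetype entry.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
buffer.seek(0)
print(looks_like_epub(buffer))
```

A check like this could complement MIME-based detection when the detected type is a generic zip or `application/octet-stream`.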

Diff for: unstructured/partition/auto.py

+3
@@ -4,6 +4,7 @@
44
from unstructured.partition.doc import partition_doc
55
from unstructured.partition.docx import partition_docx
66
from unstructured.partition.email import partition_email
7+
from unstructured.partition.epub import partition_epub
78
from unstructured.partition.html import partition_html
89
from unstructured.partition.image import partition_image
910
from unstructured.partition.json import partition_json
@@ -59,6 +60,8 @@ def partition(
5960
include_page_breaks=include_page_breaks,
6061
encoding=encoding,
6162
)
63+
elif filetype == FileType.EPUB:
64+
return partition_epub(filename=filename, file=file, include_page_breaks=include_page_breaks)
6265
elif filetype == FileType.MD:
6366
return partition_md(filename=filename, file=file, include_page_breaks=include_page_breaks)
6467
elif filetype == FileType.PDF:

Diff for: unstructured/partition/epub.py

+32
@@ -0,0 +1,32 @@
1+
from typing import IO, List, Optional
2+
3+
from unstructured.documents.elements import Element
4+
from unstructured.file_utils.file_conversion import convert_epub_to_html
5+
from unstructured.partition.html import partition_html
6+
7+
8+
def partition_epub(
9+
filename: Optional[str] = None,
10+
file: Optional[IO] = None,
11+
include_page_breaks: bool = False,
12+
) -> List[Element]:
13+
"""Partitions an EPUB document. The document is first converted to HTML and then
14+
partitioned using partition_html.
15+
16+
Parameters
17+
----------
18+
filename
19+
A string defining the target filename path.
20+
file
21+
A file-like object using "rb" mode --> open(filename, "rb").
22+
include_page_breaks
23+
If True, the output will include page breaks if the filetype supports it
24+
"""
25+
html_text = convert_epub_to_html(filename=filename, file=file)
26+
# NOTE(robinson) - pypandoc returns a text string with unicode encoding
27+
# ref: https://github.com/JessicaTegner/pypandoc#usage
28+
return partition_html(
29+
text=html_text,
30+
include_page_breaks=include_page_breaks,
31+
encoding="unicode",
32+
)
