Commit e43cb0e

MthwRobinson and qued authored
feat: add partition_epub function (#364)
* add pypandoc dependency
* added epub partitioner and file conversion
* test for partition_epub
* tests for file conversion
* add epub to filetype detection
* added epub to auto partition
* update bricks docs
* updated installing docs
* changelog and version
* add pandoc to dependencies
* add pandoc to debian dependencies
* linting, linting, linting
* typo fix
* typo fix
* file conversion type hints
* more type hints

---------

Co-authored-by: qued <[email protected]>
1 parent aa49462 commit e43cb0e

18 files changed: +206 -7 lines changed

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
@@ -105,7 +105,7 @@ jobs:
 source .venv/bin/activate
 make install-detectron2
 sudo apt-get update
-sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice
+sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice pandoc
 make test
 make check-coverage
 make install-ingest-s3

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,4 @@
-## 0.5.4-dev7
+## 0.5.4

 ### Enhancements

@@ -21,6 +21,7 @@

 * Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting
 from `FsspecConnector`
+* Add `partition_epub` for partitioning e-books in EPUB3 format.

 ### Fixes

README.md

Lines changed: 2 additions & 2 deletions
@@ -110,7 +110,7 @@ file to ensure your code matches the formatting and linting standards used in `u
 If you'd prefer not having code changes auto-tidied before every commit, you can use `make check` to see
 whether any linting or formatting changes should be applied, and `make tidy` to apply them.

-If using the optional `pre-commit`, you'll just need to install the hooks with `pre-commit install` since the
+If using the optional `pre-commit`, you'll just need to install the hooks with `pre-commit install` since the
 `pre-commit` package is installed as part of `make install` mentioned above. Finally, if you decided to use `pre-commit`
 you can also uninstall the hooks with `pre-commit uninstall`.

@@ -119,7 +119,7 @@ you can also uninstall the hooks with `pre-commit uninstall`.
 You can run this [Colab notebook](https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW) to run the examples below.

 The following examples show how to get started with the `unstructured` library.
-You can parse **TXT**, **HTML**, **PDF**, **EML**, **DOC**, **DOCX**, **PPT**, **PPTX**, **JPG**,
+You can parse **TXT**, **HTML**, **PDF**, **EML**, **EPUB**, **DOC**, **DOCX**, **PPT**, **PPTX**, **JPG**,
 and **PNG** documents with one line of code!
 <br></br>
 See our [documentation page](https://unstructured-io.github.io/unstructured) for a full description

docs/source/bricks.rst

Lines changed: 36 additions & 1 deletion
@@ -82,7 +82,7 @@ If you call the ``partition`` function, ``unstructured`` will attempt to detect
 file type and route it to the appropriate partitioning brick. All partitioning bricks
 called within ``partition`` are called using the default kwargs. Use the document-type
 specific bricks if you need to apply non-default settings.
-``partition`` currently supports ``.docx``, ``.doc``, ``.pptx``, ``.ppt``, ``.eml``, ``.html``, ``.pdf``,
+``partition`` currently supports ``.docx``, ``.doc``, ``.pptx``, ``.ppt``, ``.eml``, ``.epub``, ``.html``, ``.pdf``,
 ``.png``, ``.jpg``, and ``.txt`` files.
 If you set the ``include_page_breaks`` kwarg to ``True``, the output will include page breaks. This is only supported for ``.pptx``, ``.html``, ``.pdf``,
 ``.png``, and ``.jpg``.
@@ -306,6 +306,41 @@ Examples:
   elements = partition_email(text=text, include_headers=True)


+``partition_epub``
+---------------------
+
+The ``partition_epub`` function processes e-books in EPUB3 format. The function
+first converts the document to HTML using ``pandoc`` and then calls ``partition_html``.
+You'll need `pandoc <https://pandoc.org/installing.html>`_ installed on your system
+to use ``partition_epub``.
+
+
+Examples:
+
+.. code:: python
+
+  from unstructured.partition.epub import partition_epub
+
+  elements = partition_epub(filename="example-docs/winter-sports.epub")
+
+
+``partition_md``
+---------------------
+
+The ``partition_md`` function provides the ability to parse markdown files. The
+following workflow shows how to use ``partition_md``.
+
+
+Examples:
+
+.. code:: python
+
+  from unstructured.partition.md import partition_md
+
+  elements = partition_md(filename="README.md")
+
+
+
 ``partition_text``
 ---------------------
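
Review note: the first hunk above documents that ``partition`` detects the file type and routes to the matching brick with default kwargs. The sketch below illustrates that routing idea only. It assumes detection by file extension, and the names `route_by_extension`, `BRICKS`, and the `fake_partition_*` stand-ins are hypothetical; the library's actual detection and registry are not shown in this diff.

```python
from pathlib import Path
from typing import Callable, Dict, List


# Hypothetical stand-ins for the real partitioning bricks. They return plain
# strings so the example runs without `unstructured` or pandoc installed.
def fake_partition_epub(filename: str) -> List[str]:
    return [f"epub elements from {filename}"]


def fake_partition_text(filename: str) -> List[str]:
    return [f"text elements from {filename}"]


# Illustrative extension-to-brick table (an assumption, not the library's registry).
BRICKS: Dict[str, Callable[..., List[str]]] = {
    ".epub": fake_partition_epub,
    ".txt": fake_partition_text,
}


def route_by_extension(filename: str) -> List[str]:
    """Pick a brick from the file extension and call it with default kwargs."""
    extension = Path(filename).suffix.lower()
    brick = BRICKS.get(extension)
    if brick is None:
        raise ValueError(f"Unsupported file type: {extension!r}")
    return brick(filename=filename)


print(route_by_extension("example-docs/winter-sports.epub"))
```

Because the router passes only `filename`, any non-default setting has to go through the document-specific brick directly, which matches the advice in the docs text.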

docs/source/installing.rst

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ installation.
 * ``poppler-utils`` (images and PDFs)
 * ``tesseract-ocr`` (images and PDFs)
 * ``libreoffice`` (MS Office docs)
+* ``pandoc`` (EPUBs)

 * If you are parsing PDFs, run the following to install the ``detectron2`` model, which ``unstructured`` uses for layout detection:
 * ``pip install "detectron2@git+https://github.com/facebookresearch/[email protected]#egg=detectron2"``
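
Review note: since EPUB support depends on a system-level pandoc install, it can help to check for the executable before partitioning. This is a minimal sketch using only the standard library. The helper name `missing_system_dependencies` is hypothetical, and it checks only that a command is on `PATH`, not its version.

```python
import shutil
from typing import Iterable, List


def missing_system_dependencies(commands: Iterable[str]) -> List[str]:
    """Return the commands that cannot be found on PATH."""
    return [name for name in commands if shutil.which(name) is None]


missing = missing_system_dependencies(["pandoc"])
if missing:
    print("Missing system dependencies:", ", ".join(missing))
else:
    print("pandoc is available")
```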

example-docs/winter-sports.epub

205 KB
Binary file not shown.

requirements/base.txt

Lines changed: 5 additions & 0 deletions
@@ -4,6 +4,9 @@
 #
 # pip-compile --output-file=requirements/base.txt
 #
+--extra-index-url https://pypi.ngc.nvidia.com
+--trusted-host pypi.ngc.nvidia.com
+
 anyio==3.6.2
     # via httpcore
 argilla==1.4.0
@@ -72,6 +75,8 @@ pydantic==1.10.6
     # via argilla
 pygments==2.14.0
     # via rich
+pypandoc==1.11
+    # via unstructured (setup.py)
 python-dateutil==2.8.2
     # via pandas
 python-docx==0.8.11

scripts/setup_ubuntu.sh

Lines changed: 1 addition & 1 deletion
@@ -84,7 +84,7 @@ $sudo $pac install -y poppler-utils

 #### Tesseract
 # Install tesseract as well as Russian language
-$sudo $pac install -y tesseract-ocr libtesseract-dev tesseract-ocr-rus libreoffice
+$sudo $pac install -y tesseract-ocr libtesseract-dev tesseract-ocr-rus libreoffice pandoc

 #### libmagic
 $sudo $pac install -y libmagic-dev

setup.py

Lines changed: 1 addition & 0 deletions
@@ -56,6 +56,7 @@
 "openpyxl",
 "pandas",
 "pillow",
+"pypandoc",
 "python-docx",
 "python-pptx",
 "python-magic",
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+import os
+import pathlib
+from unittest.mock import patch
+
+import pypandoc
+import pytest
+
+from unstructured.file_utils.file_conversion import convert_file_to_text
+
+DIRECTORY = pathlib.Path(__file__).parent.resolve()
+
+
+def test_convert_file_to_text():
+    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
+    html_text = convert_file_to_text(filename, source_format="epub", target_format="html")
+    assert html_text.startswith("<p>")
+
+
+def test_convert_to_file_raises_if_pandoc_not_available():
+    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
+    with patch.object(pypandoc, "convert_file", side_effect=FileNotFoundError):
+        with pytest.raises(FileNotFoundError):
+            convert_file_to_text(filename, source_format="epub", target_format="html")
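
Review note: the implementation of `convert_file_to_text` is not part of the diff shown here, so the following is only a sketch of a wrapper that would satisfy the two tests above. It assumes the wrapper delegates to `pypandoc.convert_file` and re-raises `FileNotFoundError` with a more helpful message. The `_sketch` suffix and the `converter` parameter are hypothetical, added so the example runs without pandoc installed.

```python
from typing import Callable, Optional


def convert_file_to_text_sketch(
    filename: str,
    source_format: str,
    target_format: str,
    converter: Optional[Callable[..., str]] = None,  # hypothetical test hook
) -> str:
    """Convert a file with pandoc and return the converted text."""
    if converter is None:
        # Imported lazily so the sketch can be exercised without pypandoc.
        import pypandoc

        converter = pypandoc.convert_file
    try:
        # pypandoc.convert_file takes the source path, the target format,
        # and the source format via the `format` keyword.
        return converter(filename, target_format, format=source_format)
    except FileNotFoundError as err:
        # Assumption: a missing pandoc binary surfaces as FileNotFoundError,
        # as simulated by the second test in the diff.
        raise FileNotFoundError(
            f"Could not convert {filename} from {source_format} to "
            f"{target_format}. Is pandoc installed and on your PATH?"
        ) from err


# Stand-in converter so the example runs anywhere.
def fake_converter(source_file: str, to: str, format: Optional[str] = None) -> str:
    return f"<p>{source_file} ({format} -> {to})</p>"


html = convert_file_to_text_sketch("book.epub", "epub", "html", converter=fake_converter)
print(html)
```

Which exception pypandoc actually raises when pandoc is missing is not established by this diff; the test only pins down the behavior when `convert_file` itself raises `FileNotFoundError`.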
