
Commit 354eff1

build(deps): automatically download nltk models when required (#246)
* code for downloading nltk packages
* don't run nltk make command in ci
* test for model downloads
* remove nltk install from docs
* update changelog and bump version
1 parent 83f0454 commit 354eff1

File tree: 8 files changed, +42 -22 lines


Diff for: .github/workflows/ci.yml (-1 line)

@@ -103,7 +103,6 @@ jobs:
       - name: Test
         run: |
           source .venv/bin/activate
-          make install-nltk-models
           make install-detectron2
           sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice
           make test

Diff for: CHANGELOG.md (+4 lines)

@@ -1,3 +1,7 @@
+## 0.4.14
+
+* Automatically install `nltk` models in the `tokenize` module.
+
 ## 0.4.13

 * Fixes unstructured-ingest cli.

Diff for: Makefile (-1 line)

@@ -36,7 +36,6 @@ install-huggingface:
 install-nltk-models:
 	python -c "import nltk; nltk.download('punkt')"
 	python -c "import nltk; nltk.download('averaged_perceptron_tagger')"
-	python -c "import nltk; nltk.download('words')"

 .PHONY: install-test
 install-test:
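For environments without make, the same two downloads can be run straight from Python. A minimal sketch equivalent to the trimmed target above, using only standard nltk.download calls (nothing project-specific):

import nltk

# Fetch the two models the tokenize module uses; the "words" corpus
# removed above is no longer required.
for package in ("punkt", "averaged_perceptron_tagger"):
    nltk.download(package)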

Diff for: README.md (-4 lines)

@@ -62,10 +62,6 @@ installation.
 - `poppler-utils` (images and PDFs)
 - `tesseract-ocr` (images and PDFs)
 - `libreoffice` (MS Office docs)
-- Run the following to install NLTK dependencies. `unstructured` will handle this automatically
-  soon.
-  - `python -c "import nltk; nltk.download('punkt')"`
-  - `python -c "import nltk; nltk.download('averaged_perceptron_tagger')"`
 - If you are parsing PDFs, run the following to install the `detectron2` model, which
   `unstructured` uses for layout detection:
   - `pip install "detectron2@git+https://github.com/facebookresearch/[email protected]#egg=detectron2"`

Diff for: docs/source/installing.rst (-15 lines)

@@ -16,10 +16,6 @@ installation.
 * ``tesseract-ocr`` (images and PDFs)
 * ``libreoffice`` (MS Office docs)

-* Run the following to install NLTK dependencies. ``unstructured`` will handle this automatically soon.
-  * ``python -c "import nltk; nltk.download('punkt')"``
-  * ``python -c "import nltk; nltk.download('averaged_perceptron_tagger')"``
-
 * If you are parsing PDFs, run the following to install the ``detectron2`` model, which ``unstructured`` uses for layout detection:
   * ``pip install "detectron2@git+https://github.com/facebookresearch/[email protected]#egg=detectron2"``

@@ -141,17 +137,6 @@ If you are on Windows using ``conda``, run:

   $ conda install -c conda-forge libmagic

-
-=================
-NLTK Dependencies
-=================
-
-The `NLTK <https://www.nltk.org/>`_ library is used for word and sentence tokenziation and
-part of speech (POS) tagging. Tokenization and POS tagging help to identify sections of
-narrative text within a document and are used across parsing families. The ``make install``
-command downloads the ``punkt`` and ``averaged_perceptron_tagger`` depdenencies from ``nltk``.
-If they are not already installed, you can install them with ``make install-nltk``.
-
 ======================
 XML/HTML Depenedencies
 ======================

Diff for: test_unstructured/nlp/test_tokenize.py (+19 lines)

@@ -1,10 +1,29 @@
 from typing import List, Tuple
+from unittest.mock import patch
+
+import nltk

 import unstructured.nlp.tokenize as tokenize

 from test_unstructured.nlp.mock_nltk import mock_sent_tokenize, mock_word_tokenize


+def test_nltk_packages_download_if_not_present():
+    with patch.object(nltk, "find", side_effect=LookupError):
+        with patch.object(nltk, "download") as mock_download:
+            tokenize._download_nltk_package_if_not_present("fake_package", "tokenizers")
+
+    mock_download.assert_called_with("fake_package")
+
+
+def test_nltk_packages_do_not_download_if():
+    with patch.object(nltk, "find"):
+        with patch.object(nltk, "download") as mock_download:
+            tokenize._download_nltk_package_if_not_present("fake_package", "tokenizers")
+
+    mock_download.assert_not_called()
+
+
 def mock_pos_tag(tokens: List[str]) -> List[Tuple[str, str]]:
     pos_tags: List[Tuple[str, str]] = list()
     for token in tokens:
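One behavior worth noting when reading these tests: importing unstructured.nlp.tokenize already runs the download loop once at import time, so the patches only guard the explicit helper calls. A quick REPL sketch of the already-installed branch (assumes the import succeeded and fetched any missing models):

import nltk

# The import itself triggers the download-if-missing loop as a side effect.
import unstructured.nlp.tokenize  # noqa: F401

# Both resources should now resolve without raising LookupError.
nltk.find("tokenizers/punkt")
nltk.find("taggers/averaged_perceptron_tagger")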

Diff for: unstructured/__version__.py (+1 -1)

@@ -1 +1 @@
-__version__ = "0.4.13"  # pragma: no cover
+__version__ = "0.4.14"  # pragma: no cover

Diff for: unstructured/nlp/tokenize.py (+18 lines)

@@ -7,6 +7,7 @@
 else:
     from typing import Final

+import nltk
 from nltk import (
     pos_tag as _pos_tag,
     sent_tokenize as _sent_tokenize,
@@ -16,6 +17,23 @@
 CACHE_MAX_SIZE: Final[int] = 128


+def _download_nltk_package_if_not_present(package_name: str, package_category: str):
+    """If the required nltk package is not present, download it."""
+    try:
+        nltk.find(f"{package_category}/{package_name}")
+    except LookupError:
+        nltk.download(package_name)
+
+
+NLTK_PACKAGES = [
+    ("tokenizers", "punkt"),
+    ("taggers", "averaged_perceptron_tagger"),
+]
+
+for package_category, package_name in NLTK_PACKAGES:
+    _download_nltk_package_if_not_present(package_name, package_category)
+
+
 @lru_cache(maxsize=CACHE_MAX_SIZE)
 def sent_tokenize(text: str) -> List[str]:
     """A wrapper around the NLTK sentence tokenizer with LRU caching enabled."""
