Commit a447b81

Added auto_download logic to download NLTK data at runtime (#3883)
- **Add auto-download for NLTK for Python Environment** When a user imports `tokenize`, NLTK data is downloaded automatically. Added an `AUTO_DOWNLOAD_NLTK` flag in `tokenize.py` to gate the `NLTK_DATA` download.
1 parent 9e5ff22 commit a447b81

File tree

8 files changed, +30 −14 lines


CHANGELOG.md

+2-2

```diff
@@ -1,17 +1,17 @@
-## 0.16.16-dev2
+## 0.16.16
 
 ### Enhancements
 
 ### Features
 - **Vectorize layout (inferred, extracted, and OCR) data structure** Using `np.ndarray` to store a group of layout elements or text regions instead of using a list of objects. This improves the memory efficiency and compute speed around layout merging and deduplication.
 
 ### Fixes
+- **Add auto-download for NLTK for Python Environment** When a user imports `tokenize`, NLTK data is downloaded automatically. Added an `AUTO_DOWNLOAD_NLTK` flag in `tokenize.py` to gate the `NLTK_DATA` download.
 - **Correctly patch pdfminer to avoid PDF repair**. The patch applied to pdfminer's parser caused it to occasionally split tokens in content streams, throwing `PDFSyntaxError`. Repairing these PDFs sometimes failed (since they were not actually invalid), resulting in unnecessary OCR fallback.
 
 * **Drop usage of ndjson dependency**
 
 ## 0.16.15
-
 ### Enhancements
 
 ### Features
```
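The new `AUTO_DOWNLOAD_NLTK` flag is read as a plain string from the environment. A minimal sketch of the gating predicate (the helper name and the explicit `env` parameter are illustrative; the actual check in `tokenize.py` reads `os.getenv` directly at import time):

```python
def auto_download_enabled(env: dict) -> bool:
    # Mirrors the check in tokenize.py: downloads run unless the flag is
    # set to something other than "true" (case-insensitive). Note that
    # values like "1" or "yes" do NOT enable the download under this rule.
    return env.get("AUTO_DOWNLOAD_NLTK", "True").lower() == "true"

print(auto_download_enabled({}))                               # unset: enabled
print(auto_download_enabled({"AUTO_DOWNLOAD_NLTK": "FALSE"}))  # disabled
```

Setting `AUTO_DOWNLOAD_NLTK=false` in the environment before importing `unstructured.nlp.tokenize` therefore skips the download entirely.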

requirements/base.txt

+1-1

```diff
@@ -64,7 +64,7 @@ langdetect==1.0.9
     # via -r ./base.in
 lxml==5.3.0
     # via -r ./base.in
-marshmallow==3.25.1
+marshmallow==3.26.0
     # via
     #   dataclasses-json
     #   unstructured-client
```

requirements/extra-paddleocr.txt

+1-1

```diff
@@ -32,7 +32,7 @@ exceptiongroup==1.2.2
     # via
     #   -c ./base.txt
     #   anyio
-fonttools==4.55.4
+fonttools==4.55.5
     # via matplotlib
 h11==0.14.0
     # via
```

requirements/extra-pdf-image.txt

+2-2

```diff
@@ -42,15 +42,15 @@ filelock==3.17.0
     #   transformers
 flatbuffers==25.1.21
     # via onnxruntime
-fonttools==4.55.4
+fonttools==4.55.5
     # via matplotlib
 fsspec==2024.12.0
     # via
     #   huggingface-hub
     #   torch
 google-api-core[grpc]==2.24.0
     # via google-cloud-vision
-google-auth==2.37.0
+google-auth==2.38.0
     # via
     #   google-api-core
     #   google-cloud-vision
```

requirements/extra-pptx.txt

+1-1

```diff
@@ -12,5 +12,5 @@ python-pptx==1.0.2
     # via -r ./extra-pptx.in
 typing-extensions==4.12.2
     # via python-pptx
-xlsxwriter==3.2.0
+xlsxwriter==3.2.1
     # via python-pptx
```

requirements/test.txt

+1-1

```diff
@@ -54,7 +54,7 @@ exceptiongroup==1.2.2
     #   -c ./base.txt
     #   anyio
     #   pytest
-faker==34.0.0
+faker==35.0.0
     # via jsf
 flake8==7.1.1
     # via
```

unstructured/__version__.py

+1-1

```diff
@@ -1 +1 @@
-__version__ = "0.16.16-dev2"  # pragma: no cover
+__version__ = "0.16.16"  # pragma: no cover
```

unstructured/nlp/tokenize.py

+21-5

```diff
@@ -12,11 +12,6 @@
 CACHE_MAX_SIZE: Final[int] = 128
 
 
-def download_nltk_packages():
-    nltk.download("averaged_perceptron_tagger_eng", quiet=True)
-    nltk.download("punkt_tab", quiet=True)
-
-
 def check_for_nltk_package(package_name: str, package_category: str) -> bool:
     """Checks to see if the specified NLTK package exists on the image."""
     paths: list[str] = []
@@ -32,6 +27,27 @@ def check_for_nltk_package(package_name: str, package_category: str) -> bool:
     return False
 
 
+def download_nltk_packages():
+    """If required NLTK packages are not available, download them."""
+
+    tagger_available = check_for_nltk_package(
+        package_category="taggers",
+        package_name="averaged_perceptron_tagger_eng",
+    )
+    tokenizer_available = check_for_nltk_package(
+        package_category="tokenizers", package_name="punkt_tab"
+    )
+
+    if (not tokenizer_available) or (not tagger_available):
+        nltk.download("averaged_perceptron_tagger_eng", quiet=True)
+        nltk.download("punkt_tab", quiet=True)
+
+
+# auto download nltk packages if the environment variable is set
+if os.getenv("AUTO_DOWNLOAD_NLTK", "True").lower() == "true":
+    download_nltk_packages()
+
+
 @lru_cache(maxsize=CACHE_MAX_SIZE)
 def sent_tokenize(text: str) -> List[str]:
     """A wrapper around the NLTK sentence tokenizer with LRU caching enabled."""
```
