
Commit 01dbc7b

fix: nltk data download path to prevent redundant nested directories (#3546)
Closes #3543.

### Summary

This PR addresses an issue with the NLTK data download process. Previously, when downloading NLTK data, a nested "nltk_data" directory was created within the parent "nltk_data" directory if the parent directory already existed. This redundant directory structure led to two significant problems:

- errors in checking if data had already been downloaded, potentially causing redundant downloads in subsequent calls.
- failures in loading models from the downloaded NLTK data due to incorrect path resolution.

This fix modifies the NLTK data download logic to prevent creation of unnecessary nested directories. If the download path ends with "nltk_data" and that directory already exists, we now use the existing directory instead of creating a new nested one.

### Testing

CI should pass.
1 parent 1f8030d commit 01dbc7b

3 files changed: +15 -2 lines changed

CHANGELOG.md

+10-1
@@ -1,3 +1,13 @@
+## 0.15.7
+
+### Enhancements
+
+### Features
+
+### Fixes
+
+* **Fix NLTK data download path to prevent nested directories**. Resolved an issue where a nested "nltk_data" directory was created within the parent "nltk_data" directory when it already existed. This fix prevents errors in checking for existing downloads and loading models from NLTK data.
+
 ## 0.15.6

 ### Enhancements
@@ -10,7 +20,6 @@
 * **Update CI for `ingest-test-fixture-update-pr` to resolve NLTK model download errors.**
 * **Synchronized text and html on `TableChunk` splits.** When a `Table` element is divided during chunking to fit the chunking window, `TableChunk.text` corresponds exactly with the table text in `TableChunk.metadata.text_as_html`, `.text_as_html` is always parseable HTML, and the table is split on even row boundaries whenever possible.

-
 ## 0.15.5

 ### Enhancements

unstructured/__version__.py

+1-1
@@ -1 +1 @@
-__version__ = "0.15.6"  # pragma: no cover
+__version__ = "0.15.7"  # pragma: no cover

unstructured/nlp/tokenize.py

+4
@@ -70,6 +70,10 @@ def download_nltk_packages():
     if nltk_data_dir is None:
         raise OSError("NLTK data directory does not exist or is not writable.")

+    # Check if the path ends with "nltk_data" and remove it if it does
+    if nltk_data_dir.endswith("nltk_data"):
+        nltk_data_dir = os.path.dirname(nltk_data_dir)
+
     def sha256_checksum(filename: str, block_size: int = 65536):
         sha256 = hashlib.sha256()
         with open(filename, "rb") as f:
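
The effect of the added check can be illustrated in isolation. The sketch below is not part of the commit: `strip_nltk_data_suffix` is a hypothetical helper name, and the example paths are made up. It assumes, as the commit description implies, that the download step creates an `nltk_data` folder under whatever directory it is given, so stripping the trailing component makes the data land in the existing directory rather than in a nested one.

```python
import os


def strip_nltk_data_suffix(nltk_data_dir: str) -> str:
    """Hypothetical helper mirroring the check added in tokenize.py."""
    # If the path already ends with "nltk_data", step up to its parent.
    if nltk_data_dir.endswith("nltk_data"):
        return os.path.dirname(nltk_data_dir)
    return nltk_data_dir


# Made-up example paths, for illustration only.
for path in ("/home/user/nltk_data", "/opt/data", "/home/user/nltk_data/"):
    print(path, "->", strip_nltk_data_suffix(path))
```

Because `str.endswith` is a plain suffix test, a path with a trailing separator such as `/home/user/nltk_data/` is returned unchanged, while a directory named `my_nltk_data` would also be stripped to its parent. Whether either case occurs in practice depends on how the data directory is resolved, which this diff does not show.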
