fix(CVE-2024-39705): bump to nltk 3.9.1; correct model download issues (#3541)
### Summary
Bumps to `nltk==3.9.1` and resolves
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705). An
NLTK version bump was originally introduced in #3512 and rolled back in
#3527 because `nltk==3.8.2` was yanked from PyPI, and also because we
observed significant slowdowns in processing time after bumping to
`nltk==3.8.2`. The processing time regression does not appear in
`nltk==3.9.1`.
### Testing
After the bump, CI should pass. Additionally, we verified locally that
processing a long `.docx` file takes about as long as we would expect.
```python
In [1]: from unstructured.partition.auto import partition
In [2]: filename = "test-doc.docx"
In [3]: %timeit partition(filename=filename)
3.92 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
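Beyond timing, it can help to confirm that an environment actually picked up the bump. The following is a minimal sketch, not part of this commit: the names `MIN_NLTK`, `parse_release`, and `is_at_least` are hypothetical, and the comparison deliberately ignores pre-release suffixes.

```python
from importlib.metadata import PackageNotFoundError, version

# Version this commit bumps to. The constant and helper names are illustrative only.
MIN_NLTK = (3, 9, 1)


def parse_release(version_string):
    """Turn '3.9.1' into (3, 9, 1); non-numeric suffixes such as 'rc1' are ignored."""
    parts = []
    for component in version_string.split(".")[:3]:
        digits = ""
        for ch in component:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '3.9' to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_at_least(version_string, minimum=MIN_NLTK):
    """Return True when the release tuple is at or above the minimum."""
    return parse_release(version_string) >= minimum


if __name__ == "__main__":
    try:
        installed = version("nltk")
    except PackageNotFoundError:
        print("nltk is not installed")
    else:
        status = "ok" if is_at_least(installed) else "older than 3.9.1"
        print(f"nltk {installed}: {status}")
```

Run as a script, this prints the installed `nltk` version and whether it is at or above `3.9.1`. For stricter handling of pre-releases, a dedicated version parser would be a better fit than this hand-rolled comparison.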
CHANGELOG.md (+2 -1)
@@ -1,11 +1,12 @@
-## 0.15.6-dev1
+## 0.15.6
 
 ### Enhancements
 
 ### Features
 
 ### Fixes
 
+* **Bump to NLTK 3.9.x** Bumps to the latest `nltk` version to resolve CVE.
 * **Update CI for `ingest-test-fixture-update-pr` to resolve NLTK model download errors.**
 * **Synchronized text and html on `TableChunk` splits.** When a `Table` element is divided during chunking to fit the chunking window, `TableChunk.text` corresponds exactly with the table text in `TableChunk.metadata.text_as_html`, `.text_as_html` is always parseable HTML, and the table is split on even row boundaries whenever possible.