Commit 1f8030d

fix(CVE-2024-39705): bump to nltk 3.9.1; correct model download issues (#3541)
### Summary

Bumps to `nltk==3.9.1` and resolves [CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705). An NLTK version bump was originally introduced in #3512 and rolled back in #3527 because `nltk==3.8.2` was yanked from PyPI, and also because we observed significant slowdowns in processing time after bumping to `nltk==3.8.2`. The processing time regression does not appear in `nltk==3.9.1`.

### Testing

After the bump, CI should pass. Additionally, we verified locally that file processing takes about as long as we would expect for a long `.docx` file.

```python
In [1]: from unstructured.partition.auto import partition

In [2]: filename = "test-doc.docx"

In [3]: %timeit partition(filename=filename)
3.92 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
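The `%timeit` measurement depends on IPython. For a plain script, a rough equivalent could look like the sketch below. `time_call` is a hypothetical helper, not part of `unstructured`, and the commented usage assumes `unstructured` is installed and that `test-doc.docx` exists locally.

```python
import statistics
import timeit


def time_call(fn, repeat=7, number=1):
    """Return (mean, stdev) of per-call wall-clock seconds for fn()."""
    # timeit.repeat returns one total per run; divide by `number` for per-call time.
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    # statistics.stdev needs at least two samples, so keep repeat >= 2.
    return statistics.mean(per_call), statistics.stdev(per_call)


# Hypothetical usage (requires `unstructured` and a local test-doc.docx):
#   from unstructured.partition.auto import partition
#   mean, std = time_call(lambda: partition(filename="test-doc.docx"))
#   print(f"{mean:.2f} s ± {std * 1000:.0f} ms per loop")
```

The defaults mirror the "7 runs, 1 loop each" reported by `%timeit`, so the numbers should be roughly comparable, though IPython's own measurement details may differ.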
1 parent a861ed8 commit 1f8030d

35 files changed, +112 -101 lines changed

CHANGELOG.md

+2-1
@@ -1,11 +1,12 @@
-## 0.15.6-dev1
+## 0.15.6

 ### Enhancements

 ### Features

 ### Fixes

+* **Bump to NLTK 3.9.x** Bumps to the latest `nltk` version to resolve CVE.
 * **Update CI for `ingest-test-fixture-update-pr` to resolve NLTK model download errors.**
 * **Synchronized text and html on `TableChunk` splits.** When a `Table` element is divided during chunking to fit the chunking window, `TableChunk.text` corresponds exactly with the table text in `TableChunk.metadata.text_as_html`, `.text_as_html` is always parseable HTML, and the table is split on even row boundaries whenever possible.

requirements/base.txt

+3-3
@@ -69,7 +69,7 @@ mypy-extensions==1.0.0
     #   unstructured-client
 nest-asyncio==1.6.0
     # via unstructured-client
-nltk==3.8.1
+nltk==3.9.1
     # via -r ./base.in
 numpy==1.26.4
     # via -r ./base.in
@@ -110,7 +110,7 @@ sniffio==1.3.1
     # via
     #   anyio
     #   httpx
-soupsieve==2.5
+soupsieve==2.6
     # via beautifulsoup4
 tabulate==0.9.0
     # via -r ./base.in
@@ -129,7 +129,7 @@ typing-inspect==0.9.0
     # via
     #   dataclasses-json
     #   unstructured-client
-unstructured-client==0.25.4
+unstructured-client==0.25.5
     # via
     #   -c ././deps/constraints.txt
     #   -r ./base.in
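
The pin change above is the substance of the CVE fix. A minimal sketch of a check that a compiled requirements file no longer pins an affected `nltk` release follows. `pinned_version` and `LAST_AFFECTED` are hypothetical names, the sample text is illustrative, and the affected range (up to and including 3.8.1) is an assumption taken from the advisory wording.

```python
import re


def pinned_version(requirements_text, package):
    """Return the `==` pin for `package` as a tuple of ints, or None if absent."""
    pattern = re.compile(
        rf"^{re.escape(package)}==([0-9]+(?:\.[0-9]+)*)\s*$", re.MULTILINE
    )
    match = pattern.search(requirements_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# Assumption: the advisory covers NLTK releases up to and including 3.8.1.
LAST_AFFECTED = (3, 8, 1)

sample = """\
nest-asyncio==1.6.0
    # via unstructured-client
nltk==3.9.1
    # via -r ./base.in
"""

pin = pinned_version(sample, "nltk")
print(pin, pin is not None and pin > LAST_AFFECTED)  # (3, 9, 1) True
```

The regex only handles purely numeric versions, so a pin such as `3.9.1.post1` is reported as absent. A real check would more likely rely on a dedicated version-parsing library.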

requirements/deps/constraints.txt

+3
@@ -56,3 +56,6 @@ fsspec==2024.5.0
 wrapt>=1.14.0

 langchain-community>=0.2.5
+
+grpcio==1.64.3
+label-studio-sdk==0.0.34
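
The new constraints show up in the compiled files as `grpcio==1.64.3` with a `-c .../deps/constraints.txt` annotation. As a rough illustration, the sketch below compares exact (`==`) constraints against a compiled file. `parse_pins` and `constraint_violations` are hypothetical helpers, not part of pip-tools, and range constraints such as `>=` are deliberately ignored.

```python
def parse_pins(text):
    """Map package name -> exact version for every `name==version` line."""
    pins = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an exact pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def constraint_violations(constraints_text, compiled_text):
    """Return {package: (expected, actual)} where a compiled pin differs."""
    constraints = parse_pins(constraints_text)
    compiled = parse_pins(compiled_text)
    return {
        name: (expected, compiled[name])
        for name, expected in constraints.items()
        if name in compiled and compiled[name] != expected
    }


constraints = "grpcio==1.64.3\nlabel-studio-sdk==0.0.34\n"
before = "grpcio==1.65.4\n    # via clarifai-grpc\n"
after = "grpcio==1.64.3\n    # via\n    #   -c ./ingest/../deps/constraints.txt\n"

print(constraint_violations(constraints, before))  # {'grpcio': ('1.64.3', '1.65.4')}
print(constraint_violations(constraints, after))   # {}
```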

requirements/dev.txt

+2-2
@@ -310,7 +310,7 @@ pyyaml==6.0.2
     #   -c ./test.txt
     #   jupyter-events
     #   pre-commit
-pyzmq==26.1.0
+pyzmq==26.1.1
     # via
     #   ipykernel
     #   jupyter-client
@@ -360,7 +360,7 @@ sniffio==1.3.1
     #   -c ./base.txt
     #   anyio
     #   httpx
-soupsieve==2.5
+soupsieve==2.6
     # via
     #   -c ./base.txt
     #   beautifulsoup4

requirements/extra-markdown.txt

+1-1
@@ -6,7 +6,7 @@
 #
 importlib-metadata==8.2.0
     # via markdown
-markdown==3.6
+markdown==3.7
     # via -r ./extra-markdown.in
 zipp==3.20.0
     # via importlib-metadata

requirements/extra-paddleocr.txt

+4-4
@@ -13,7 +13,7 @@ astor==0.8.1
     # via paddlepaddle
 attrdict==2.0.1
     # via unstructured-paddleocr
-cachetools==5.4.0
+cachetools==5.5.0
     # via premailer
 certifi==2024.7.4
     # via
@@ -64,13 +64,13 @@ idna==3.7
     #   anyio
     #   httpx
     #   requests
-imageio==2.34.2
+imageio==2.35.1
     # via
     #   imgaug
     #   scikit-image
 imgaug==0.4.0
     # via unstructured-paddleocr
-importlib-resources==6.4.0
+importlib-resources==6.4.3
     # via matplotlib
 kiwisolver==1.4.5
     # via matplotlib
@@ -83,7 +83,7 @@ lxml==5.3.0
     #   -c ./base.txt
     #   premailer
     #   unstructured-paddleocr
-matplotlib==3.9.1.post1
+matplotlib==3.9.2
     # via imgaug
 more-itertools==10.4.0
     # via cssutils

requirements/extra-pdf-image.txt

+9-8
@@ -6,7 +6,7 @@
 #
 antlr4-python3-runtime==4.9.3
     # via omegaconf
-cachetools==5.4.0
+cachetools==5.5.0
     # via google-auth
 certifi==2024.7.4
     # via
@@ -48,7 +48,7 @@ fsspec==2024.5.0
     #   torch
 google-api-core[grpc]==2.19.1
     # via google-cloud-vision
-google-auth==2.33.0
+google-auth==2.34.0
     # via
     #   google-api-core
     #   google-cloud-vision
@@ -58,13 +58,14 @@ googleapis-common-protos==1.63.2
     # via
     #   google-api-core
     #   grpcio-status
-grpcio==1.65.4
+grpcio==1.64.3
     # via
+    #   -c ././deps/constraints.txt
     #   google-api-core
     #   grpcio-status
 grpcio-status==1.62.3
     # via google-api-core
-huggingface-hub==0.24.5
+huggingface-hub==0.24.6
     # via
     #   timm
     #   tokenizers
@@ -76,7 +77,7 @@ idna==3.7
     # via
     #   -c ./base.txt
     #   requests
-importlib-resources==6.4.0
+importlib-resources==6.4.3
     # via matplotlib
 iopath==0.1.10
     # via layoutparser
@@ -92,7 +93,7 @@ lxml==5.3.0
     #   pikepdf
 markupsafe==2.1.5
     # via jinja2
-matplotlib==3.9.1.post1
+matplotlib==3.9.2
     # via
     #   pycocotools
     #   unstructured-inference
@@ -120,7 +121,7 @@ onnx==1.16.2
     # via
     #   -r ./extra-pdf-image.in
     #   unstructured-inference
-onnxruntime==1.18.1
+onnxruntime==1.19.0
     # via unstructured-inference
 opencv-python==4.8.0.76
     # via
@@ -147,7 +148,7 @@ pdfminer-six==20231228
     # via
     #   -r ./extra-pdf-image.in
     #   pdfplumber
-pdfplumber==0.11.3
+pdfplumber==0.11.4
     # via layoutparser
 pikepdf==9.1.1
     # via -r ./extra-pdf-image.in

requirements/huggingface.txt

+1-1
@@ -27,7 +27,7 @@ fsspec==2024.5.0
     #   -c ././deps/constraints.txt
     #   huggingface-hub
     #   torch
-huggingface-hub==0.24.5
+huggingface-hub==0.24.6
     # via
     #   tokenizers
     #   transformers

requirements/ingest/azure.txt

+2-2
@@ -6,9 +6,9 @@
 #
 adlfs==2024.7.0
     # via -r ./ingest/azure.in
-aiohappyeyeballs==2.3.5
+aiohappyeyeballs==2.3.7
     # via aiohttp
-aiohttp==3.10.3
+aiohttp==3.10.4
     # via adlfs
 aiosignal==1.3.1
     # via aiohttp

requirements/ingest/biomed.txt

+1-1
@@ -10,7 +10,7 @@ beautifulsoup4==4.12.3
     #   bs4
 bs4==0.0.2
     # via -r ./ingest/biomed.in
-soupsieve==2.5
+soupsieve==2.6
     # via
     #   -c ./ingest/../base.txt
     #   beautifulsoup4

requirements/ingest/chroma.txt

+11-10
@@ -20,7 +20,7 @@ backoff==2.2.1
     #   posthog
 bcrypt==4.2.0
     # via chromadb
-cachetools==5.4.0
+cachetools==5.5.0
     # via google-auth
 certifi==2024.7.4
     # via
@@ -51,7 +51,7 @@ exceptiongroup==1.2.2
     # via
     #   -c ./ingest/../base.txt
     #   anyio
-fastapi==0.112.0
+fastapi==0.112.1
     # via chromadb
 filelock==3.15.4
     # via huggingface-hub
@@ -61,12 +61,13 @@ fsspec==2024.5.0
     # via
     #   -c ./ingest/../deps/constraints.txt
     #   huggingface-hub
-google-auth==2.33.0
+google-auth==2.34.0
     # via kubernetes
 googleapis-common-protos==1.63.2
     # via opentelemetry-exporter-otlp-proto-grpc
-grpcio==1.65.4
+grpcio==1.64.3
     # via
+    #   -c ./ingest/../deps/constraints.txt
     #   chromadb
     #   opentelemetry-exporter-otlp-proto-grpc
 h11==0.14.0
@@ -76,7 +77,7 @@ h11==0.14.0
     #   uvicorn
 httptools==0.6.1
     # via uvicorn
-huggingface-hub==0.24.5
+huggingface-hub==0.24.6
     # via tokenizers
 humanfriendly==10.0
     # via coloredlogs
@@ -88,7 +89,7 @@ idna==3.7
     #   requests
 importlib-metadata==8.2.0
     # via -r ./ingest/chroma.in
-importlib-resources==6.4.0
+importlib-resources==6.4.3
     # via chromadb
 kubernetes==30.1.0
     # via chromadb
@@ -106,7 +107,7 @@ oauthlib==3.2.2
     # via
     #   kubernetes
     #   requests-oauthlib
-onnxruntime==1.18.1
+onnxruntime==1.19.0
     # via chromadb
 opentelemetry-api==1.16.0
     # via
@@ -192,7 +193,7 @@ sniffio==1.3.1
     #   -c ./ingest/../base.txt
     #   anyio
     #   httpx
-starlette==0.37.2
+starlette==0.38.2
     # via fastapi
 sympy==1.13.2
     # via onnxruntime
@@ -231,9 +232,9 @@ urllib3==1.26.19
     #   -c ./ingest/../deps/constraints.txt
     #   kubernetes
     #   requests
-uvicorn[standard]==0.30.5
+uvicorn[standard]==0.30.6
     # via chromadb
-uvloop==0.19.0
+uvloop==0.20.0
     # via uvicorn
 watchfiles==0.23.0
     # via uvicorn

requirements/ingest/clarifai.txt

+5-3
@@ -15,14 +15,16 @@ charset-normalizer==3.3.2
     #   requests
 clarifai==10.7.0
     # via -r ./ingest/clarifai.in
-clarifai-grpc==10.7.1
+clarifai-grpc==10.7.2
     # via clarifai
 contextlib2==21.6.0
     # via schema
 googleapis-common-protos==1.63.2
     # via clarifai-grpc
-grpcio==1.65.4
-    # via clarifai-grpc
+grpcio==1.64.3
+    # via
+    #   -c ./ingest/../deps/constraints.txt
+    #   clarifai-grpc
 idna==3.7
     # via
     #   -c ./ingest/../base.txt

requirements/ingest/confluence.txt

+1-1
@@ -42,7 +42,7 @@ six==1.16.0
     # via
     #   -c ./ingest/../base.txt
     #   atlassian-python-api
-soupsieve==2.5
+soupsieve==2.6
     # via
     #   -c ./ingest/../base.txt
     #   beautifulsoup4

requirements/ingest/databricks-volumes.txt

+3-3
@@ -4,7 +4,7 @@
 #
 #    pip-compile ./ingest/databricks-volumes.in
 #
-cachetools==5.4.0
+cachetools==5.5.0
     # via google-auth
 certifi==2024.7.4
     # via
@@ -15,9 +15,9 @@ charset-normalizer==3.3.2
     # via
     #   -c ./ingest/../base.txt
     #   requests
-databricks-sdk==0.29.0
+databricks-sdk==0.30.0
     # via -r ./ingest/databricks-volumes.in
-google-auth==2.33.0
+google-auth==2.34.0
     # via databricks-sdk
 idna==3.7
     # via

requirements/ingest/delta-table.txt

+1-3
@@ -4,7 +4,7 @@
 #
 #    pip-compile ./ingest/delta-table.in
 #
-deltalake==0.18.2
+deltalake==0.19.0
     # via -r ./ingest/delta-table.in
 fsspec==2024.5.0
     # via
@@ -16,5 +16,3 @@ numpy==1.26.4
     #   pyarrow
 pyarrow==17.0.0
     # via deltalake
-pyarrow-hotfix==0.6
-    # via deltalake

requirements/ingest/discord.txt

+2-2
@@ -4,9 +4,9 @@
 #
 #    pip-compile ./ingest/discord.in
 #
-aiohappyeyeballs==2.3.5
+aiohappyeyeballs==2.3.7
     # via aiohttp
-aiohttp==3.10.3
+aiohttp==3.10.4
     # via discord-py
 aiosignal==1.3.1
     # via aiohttp

requirements/ingest/elasticsearch.txt

+3-3
@@ -4,9 +4,9 @@
 #
 #    pip-compile ./ingest/elasticsearch.in
 #
-aiohappyeyeballs==2.3.5
+aiohappyeyeballs==2.3.7
     # via aiohttp
-aiohttp==3.10.3
+aiohttp==3.10.4
     # via elasticsearch
 aiosignal==1.3.1
     # via aiohttp
@@ -21,7 +21,7 @@ certifi==2024.7.4
     #   elastic-transport
 elastic-transport==8.15.0
     # via elasticsearch
-elasticsearch[async]==8.14.0
+elasticsearch[async]==8.15.0
     # via -r ./ingest/elasticsearch.in
 frozenlist==1.4.1
     # via
