Commit 7437f0a

fix(CVE-2024-39705): update to latest nltk version (#3512)
### Summary

Addresses [CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705) by updating to `nltk==3.8.2` and closes #3511. This CVE had previously been mitigated in #3361.

---------

Co-authored-by: Christine Straub <[email protected]>
1 parent 1158d8f commit 7437f0a

32 files changed, +57 -75 lines changed

.github/workflows/ci.yml

+5 -11
@@ -120,8 +120,6 @@ jobs:
 matrix:
 python-version: ["3.9","3.10","3.11", "3.12"]
 runs-on: ubuntu-latest
-env:
-NLTK_DATA: ${{ github.workspace }}/nltk_data
 needs: [setup, lint]
 steps:
 - uses: actions/checkout@v4
@@ -161,7 +159,6 @@ jobs:
 python-version: ["3.10"]
 runs-on: ubuntu-latest
 env:
-NLTK_DATA: ${{ github.workspace }}/nltk_data
 UNSTRUCTURED_HF_TOKEN: ${{ secrets.HF_TOKEN }}
 needs: [setup, lint]
 steps:
@@ -179,6 +176,7 @@ jobs:
 UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
 run: |
 source .venv/bin/activate
+make install-nltk-models
 sudo apt-get update
 sudo apt-get install -y poppler-utils
 make install-pandoc install-test
@@ -193,8 +191,6 @@ jobs:
 matrix:
 python-version: ["3.10"]
 runs-on: ubuntu-latest
-env:
-NLTK_DATA: ${{ github.workspace }}/nltk_data
 needs: [setup, lint]
 steps:
 - uses: actions/checkout@v4
@@ -211,6 +207,7 @@ jobs:
 UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
 run: |
 source .venv/bin/activate
+make install-nltk-models
 make test-no-extras CI=true
 
 test_unit_dependency_extras:
@@ -276,8 +273,6 @@ jobs:
 matrix:
 python-version: [ "3.9","3.10" ]
 runs-on: ubuntu-latest
-env:
-NLTK_DATA: ${{ github.workspace }}/nltk_data
 needs: [ setup_ingest, lint ]
 steps:
 # actions/checkout MUST come before auth
@@ -296,6 +291,7 @@ jobs:
 - name: Test Ingest (unit)
 run: |
 source .venv/bin/activate
+make install-nltk-models
 PYTHONPATH=. pytest test_unstructured_ingest/unit
 
 
@@ -304,8 +300,6 @@ jobs:
 matrix:
 python-version: ["3.9","3.10"]
 runs-on: ubuntu-latest-m
-env:
-NLTK_DATA: ${{ github.workspace }}/nltk_data
 needs: [setup_ingest, lint]
 steps:
 # actions/checkout MUST come before auth
@@ -373,6 +367,7 @@ jobs:
 CI: "true"
 run: |
 source .venv/bin/activate
+make install-nltk-models
 sudo apt-get update
 sudo apt-get install -y libmagic-dev poppler-utils libreoffice
 make install-pandoc
@@ -391,8 +386,6 @@ jobs:
 matrix:
 python-version: ["3.9","3.10"]
 runs-on: ubuntu-latest-m
-env:
-NLTK_DATA: ${{ github.workspace }}/nltk_data
 needs: [setup_ingest, lint]
 steps:
 # actions/checkout MUST come before auth
@@ -445,6 +438,7 @@ jobs:
 CI: "true"
 run: |
 source .venv/bin/activate
+make install-nltk-models
 sudo apt-get update
 sudo apt-get install -y libmagic-dev poppler-utils libreoffice
 make install-pandoc

CHANGELOG.md

+2 -1
@@ -1,4 +1,4 @@
-## 0.15.2-dev8
+## 0.15.2
 
 ### Enhancements
 
@@ -10,6 +10,7 @@
 
 ### Fixes
 
+* **Updates NLTK data file for compatibility with `nltk>=3.8.2`**. The NLTK data file now contains `punkt_tab`, making it possible to upgrade to `nltk>=3.8.2`. The `nltk==3.8.2` patches CVE-2024-39705.
 * **Renames Astra to Astra DB** Conforms with DataStax internal naming conventions.
 * **Accommodate single-column CSV files.** Resolves a limitation of `partition_csv()` where delimiter detection would fail on a single-column CSV file (which naturally has no delimiters).
 * **Accommodate `image/jpg` in PPTX as alias for `image/jpeg`.** Resolves problem partitioning PPTX files having an invalid `image/jpg` (should be `image/jpeg`) MIME-type in the `[Content_Types].xml` member of the PPTX Zip archive.

Makefile

+1 -2
@@ -38,8 +38,7 @@ install-huggingface:
 
 .PHONY: install-nltk-models
 install-nltk-models:
-python3 -c "import nltk; nltk.download('punkt')"
-python3 -c "import nltk; nltk.download('averaged_perceptron_tagger')"
+python3 -c "from unstructured.nlp.tokenize import download_nltk_packages; download_nltk_packages()"
 
 .PHONY: install-test
 install-test:
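
The changelog entry says the NLTK data now contains `punkt_tab`. A minimal sketch of checking for that data on disk after `make install-nltk-models` could look like the following. It is not the implementation of `download_nltk_packages()`, which this diff does not show. The helper names `candidate_data_dirs` and `find_resource` are hypothetical, the search list is a simplified assumption (the `NLTK_DATA` variable plus `~/nltk_data`, whereas NLTK itself searches more locations), and the `tokenizers/punkt_tab` layout is assumed from NLTK's usual data layout.

```python
import os
from pathlib import Path


def candidate_data_dirs(env=None):
    """Return directories to search, in priority order.

    Simplified assumption: entries of the NLTK_DATA environment variable
    (split on the OS path separator) followed by ~/nltk_data.
    """
    env = os.environ if env is None else env
    dirs = [Path(p) for p in env.get("NLTK_DATA", "").split(os.pathsep) if p]
    dirs.append(Path.home() / "nltk_data")
    return dirs


def find_resource(relative_path, data_dirs):
    """Return the first existing path for relative_path, or None if absent."""
    for base in data_dirs:
        candidate = base / relative_path
        if candidate.exists():
            return candidate
    return None


if __name__ == "__main__":
    # Assumed layout: tokenizer data lives under tokenizers/<name>.
    found = find_resource("tokenizers/punkt_tab", candidate_data_dirs())
    print(found if found else "punkt_tab not found in the searched directories")
```

Because the CI jobs in this commit no longer set `NLTK_DATA`, a check like this would fall through to the home-directory default in those jobs.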

requirements/base.txt

+2 -2
@@ -57,7 +57,7 @@ jsonpath-python==1.0.6
 # via unstructured-client
 langdetect==1.0.9
 # via -r ./base.in
-lxml==5.2.2
+lxml==5.3.0
 # via -r ./base.in
 marshmallow==3.21.3
 # via
@@ -69,7 +69,7 @@ mypy-extensions==1.0.0
 # unstructured-client
 nest-asyncio==1.6.0
 # via unstructured-client
-nltk==3.8.1
+nltk==3.8.2
 # via -r ./base.in
 numpy==1.26.4
 # via -r ./base.in
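
Since the fix comes down to the `nltk` pin moving from 3.8.1 to 3.8.2, a small guard over a compiled requirements file can catch a regression. The following is a sketch only: the function names are hypothetical, the threshold is taken from the commit message, and the parser assumes plain numeric `nltk==X.Y.Z` pins as produced by `pip-compile` (no extras, markers, or pre-release tags).

```python
import re

# Version that the commit message says addresses CVE-2024-39705.
PATCHED = (3, 8, 2)


def parse_version(text):
    """Turn a purely numeric dotted version such as '3.8.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def nltk_pin(requirements_text):
    """Return the pinned nltk version as a tuple, or None if no pin is found."""
    for line in requirements_text.splitlines():
        match = re.match(r"^nltk==(\d+(?:\.\d+)*)\s*$", line.strip())
        if match:
            return parse_version(match.group(1))
    return None


def is_patched(requirements_text):
    """True when nltk is pinned at or above the patched version."""
    pin = nltk_pin(requirements_text)
    return pin is not None and pin >= PATCHED
```

For example, `is_patched(open("requirements/base.txt").read())` would be expected to return `True` for the file as changed in this commit, and `False` for its previous contents.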

requirements/dev.txt

+2 -2
@@ -423,7 +423,7 @@ virtualenv==20.26.3
 # via pre-commit
 wcwidth==0.2.13
 # via prompt-toolkit
-webcolors==24.6.0
+webcolors==24.8.0
 # via jsonschema
 webencodings==0.5.1
 # via
@@ -437,7 +437,7 @@ wheel==0.44.0
 # pip-tools
 widgetsnbextension==4.0.11
 # via ipywidgets
-zipp==3.19.2
+zipp==3.20.0
 # via importlib-metadata
 
 # The following packages are considered to be unsafe in a requirements file:

requirements/extra-docx.txt

+1 -1
@@ -4,7 +4,7 @@
 #
 # pip-compile ./extra-docx.in
 #
-lxml==5.2.2
+lxml==5.3.0
 # via
 # -c ./base.txt
 # python-docx

requirements/extra-markdown.txt

+1 -1
@@ -8,5 +8,5 @@ importlib-metadata==8.2.0
 # via markdown
 markdown==3.6
 # via -r ./extra-markdown.in
-zipp==3.19.2
+zipp==3.20.0
 # via importlib-metadata

requirements/extra-odt.txt

+1 -1
@@ -4,7 +4,7 @@
 #
 # pip-compile ./extra-odt.in
 #
-lxml==5.2.2
+lxml==5.3.0
 # via
 # -c ./base.txt
 # python-docx

requirements/extra-paddleocr.txt

+3 -3
@@ -78,7 +78,7 @@ lanms-neo==1.0.2
 # via unstructured-paddleocr
 lazy-loader==0.4
 # via scikit-image
-lxml==5.2.2
+lxml==5.3.0
 # via
 # -c ./base.txt
 # premailer
@@ -191,7 +191,7 @@ sniffio==1.3.1
 # -c ./base.txt
 # anyio
 # httpx
-tifffile==2024.7.24
+tifffile==2024.8.10
 # via scikit-image
 tqdm==4.66.5
 # via
@@ -208,5 +208,5 @@ urllib3==1.26.19
 # -c ././deps/constraints.txt
 # -c ./base.txt
 # requests
-zipp==3.19.2
+zipp==3.20.0
 # via importlib-resources

requirements/extra-pdf-image.txt

+3 -3
@@ -86,7 +86,7 @@ kiwisolver==1.4.5
 # via matplotlib
 layoutparser==0.3.4
 # via unstructured-inference
-lxml==5.2.2
+lxml==5.3.0
 # via
 # -c ./base.txt
 # pikepdf
@@ -249,7 +249,7 @@ six==1.16.0
 # via
 # -c ./base.txt
 # python-dateutil
-sympy==1.13.1
+sympy==1.13.2
 # via
 # onnxruntime
 # torch
@@ -301,5 +301,5 @@ wrapt==1.16.0
 # -c ././deps/constraints.txt
 # -c ./base.txt
 # deprecated
-zipp==3.19.2
+zipp==3.20.0
 # via importlib-resources

requirements/extra-pptx.txt

+1 -1
@@ -4,7 +4,7 @@
 #
 # pip-compile ./extra-pptx.in
 #
-lxml==5.2.2
+lxml==5.3.0
 # via python-pptx
 pillow==10.4.0
 # via python-pptx

requirements/huggingface.txt

+1 -1
@@ -85,7 +85,7 @@ six==1.16.0
 # via
 # -c ./base.txt
 # langdetect
-sympy==1.13.1
+sympy==1.13.2
 # via torch
 tokenizers==0.19.1
 # via

requirements/ingest/azure.txt

+1 -1
@@ -8,7 +8,7 @@ adlfs==2024.7.0
 # via -r ./ingest/azure.in
 aiohappyeyeballs==2.3.5
 # via aiohttp
-aiohttp==3.10.2
+aiohttp==3.10.3
 # via adlfs
 aiosignal==1.3.1
 # via aiohttp

requirements/ingest/chroma.txt

+2 -2
@@ -194,7 +194,7 @@ sniffio==1.3.1
 # httpx
 starlette==0.37.2
 # via fastapi
-sympy==1.13.1
+sympy==1.13.2
 # via onnxruntime
 tenacity==8.5.0
 # via
@@ -247,7 +247,7 @@ wrapt==1.16.0
 # -c ./ingest/../deps/constraints.txt
 # deprecated
 # opentelemetry-instrumentation
-zipp==3.19.2
+zipp==3.20.0
 # via
 # importlib-metadata
 # importlib-resources

requirements/ingest/clarifai.txt

+1 -1
@@ -15,7 +15,7 @@ charset-normalizer==3.3.2
 # requests
 clarifai==10.7.0
 # via -r ./ingest/clarifai.in
-clarifai-grpc==10.7.0
+clarifai-grpc==10.7.1
 # via clarifai
 contextlib2==21.6.0
 # via schema

requirements/ingest/discord.txt

+1 -1
@@ -6,7 +6,7 @@
 #
 aiohappyeyeballs==2.3.5
 # via aiohttp
-aiohttp==3.10.2
+aiohttp==3.10.3
 # via discord-py
 aiosignal==1.3.1
 # via aiohttp

requirements/ingest/elasticsearch.txt

+2 -2
@@ -6,7 +6,7 @@
 #
 aiohappyeyeballs==2.3.5
 # via aiohttp
-aiohttp==3.10.2
+aiohttp==3.10.3
 # via elasticsearch
 aiosignal==1.3.1
 # via aiohttp
@@ -19,7 +19,7 @@ certifi==2024.7.4
 # -c ./ingest/../base.txt
 # -c ./ingest/../deps/constraints.txt
 # elastic-transport
-elastic-transport==8.13.1
+elastic-transport==8.15.0
 # via elasticsearch
 elasticsearch[async]==8.14.0
 # via -r ./ingest/elasticsearch.in

requirements/ingest/embed-aws-bedrock.txt

+2 -2
@@ -6,7 +6,7 @@
 #
 aiohappyeyeballs==2.3.5
 # via aiohttp
-aiohttp==3.10.2
+aiohttp==3.10.3
 # via
 # langchain
 # langchain-community
@@ -70,7 +70,7 @@ langchain-core==0.2.29
 # langchain-text-splitters
 langchain-text-splitters==0.2.2
 # via langchain
-langsmith==0.1.98
+langsmith==0.1.99
 # via
 # langchain
 # langchain-community

requirements/ingest/embed-huggingface.txt

+2 -2
@@ -49,7 +49,7 @@ langchain-core==0.2.29
 # via langchain-huggingface
 langchain-huggingface==0.0.3
 # via -r ./ingest/embed-huggingface.in
-langsmith==0.1.98
+langsmith==0.1.99
 # via langchain-core
 markupsafe==2.1.5
 # via jinja2
@@ -107,7 +107,7 @@ scipy==1.11.3
 # sentence-transformers
 sentence-transformers==3.0.1
 # via langchain-huggingface
-sympy==1.13.1
+sympy==1.13.2
 # via torch
 tenacity==8.5.0
 # via langchain-core

requirements/ingest/embed-octoai.txt

+1 -1
@@ -49,7 +49,7 @@ idna==3.7
 # requests
 jiter==0.5.0
 # via openai
-openai==1.40.2
+openai==1.40.3
 # via -r ./ingest/embed-octoai.in
 pydantic==2.8.2
 # via openai

requirements/ingest/embed-openai.txt

+3 -3
@@ -55,11 +55,11 @@ jsonpointer==3.0.0
 # via jsonpatch
 langchain-core==0.2.29
 # via langchain-openai
-langchain-openai==0.1.20
+langchain-openai==0.1.21
 # via -r ./ingest/embed-openai.in
-langsmith==0.1.98
+langsmith==0.1.99
 # via langchain-core
-openai==1.40.2
+openai==1.40.3
 # via langchain-openai
 orjson==3.10.7
 # via langsmith

requirements/ingest/embed-vertexai.txt

+2 -2
@@ -6,7 +6,7 @@
 #
 aiohappyeyeballs==2.3.5
 # via aiohttp
-aiohttp==3.10.2
+aiohttp==3.10.3
 # via
 # langchain
 # langchain-community
@@ -120,7 +120,7 @@ langchain-google-vertexai==1.0.8
 # via -r ./ingest/embed-vertexai.in
 langchain-text-splitters==0.2.2
 # via langchain
-langsmith==0.1.98
+langsmith==0.1.99
 # via
 # langchain
 # langchain-community
