Commit 331c7fa

build(deps): split up dependencies by document type (#986)
* split dependencies by document type
* make pip-compile with new requirements
* add extra requirements to setup.py
* add in all docs; re pip-compile
* extra for all docs
* add pandas to xlsx
* dependency requires for tsv and csv
* handling for doc, docx and odt
* dependency check for pypandoc
* required dependencies for pandoc files
* xml and html
* markdown
* msg
* add in pdf
* add in pptx
* add in excel
* add lxml as base req
* extra all docs for local inference
* local inference installs all
* pin pillow version
* fixes for plain text tests
* fixes for doc
* update make commands
* changelog and version
* add xlrd
* update pip-compile
* pin numpy for python 3.8 support
* more constraints
* constraint on scipy
* update install docs
* constrain ipython
* add outlook to pip-compile
* more ipython constraints
* add extras to dockerfile
* pin office365 client
* few doc tweaks
* types as strings
* last pip-compile
* re pip-compile
* make tidy
* make tidy
1 parent 13d3559 commit 331c7fa

59 files changed, 507 additions and 352 deletions

CHANGELOG.md

Lines changed: 7 additions & 0 deletions
@@ -1,3 +1,9 @@
+## 0.9.0
+
+### Enhancements
+
+* Dependencies are now split by document type, creating a slimmer base installation.
+
 ## 0.8.8
 
 ### Enhancements
@@ -6,6 +12,7 @@
 
 ### Fixes
 
+
 * Rename "date" field to "last_modified"
 * Adds Box connector

Dockerfile

Lines changed: 9 additions & 1 deletion
@@ -30,7 +30,15 @@ RUN python3.8 -m pip install pip==${PIP_VERSION} && \
   pip install --no-cache -r requirements/ingest-s3.txt && \
   pip install --no-cache -r requirements/ingest-slack.txt && \
   pip install --no-cache -r requirements/ingest-wikipedia.txt && \
-  pip install --no-cache -r requirements/local-inference.txt && \
+  pip install --no-cache -r requirements/extra-csv.txt && \
+  pip install --no-cache -r requirements/extra-docx.txt && \
+  pip install --no-cache -r requirements/extra-markdown.txt && \
+  pip install --no-cache -r requirements/extra-msg.txt && \
+  pip install --no-cache -r requirements/extra-odt.txt && \
+  pip install --no-cache -r requirements/extra-pandoc.txt && \
+  pip install --no-cache -r requirements/extra-pdf-image.txt && \
+  pip install --no-cache -r requirements/extra-pptx.txt && \
+  pip install --no-cache -r requirements/extra-xlsx.txt && \
   dnf -y groupremove "Development Tools" && \
   dnf clean all

Makefile

Lines changed: 55 additions & 4 deletions
@@ -18,10 +18,10 @@ install-base: install-base-pip-packages install-nltk-models
 
 ## install: installs all test, dev, and experimental requirements
 .PHONY: install
-install: install-base-pip-packages install-dev install-nltk-models install-test install-huggingface install-unstructured-inference
+install: install-base-pip-packages install-dev install-nltk-models install-test install-huggingface install-all-docs
 
 .PHONY: install-ci
-install-ci: install-base-pip-packages install-nltk-models install-huggingface install-unstructured-inference install-test
+install-ci: install-base-pip-packages install-nltk-models install-huggingface install-all-docs install-test
 
 .PHONY: install-base-pip-packages
 install-base-pip-packages:
@@ -53,6 +53,45 @@ install-dev:
 install-build:
 	python3 -m pip install -r requirements/build.txt
 
+.PHONY: install-csv
+install-csv:
+	python3 -m pip install -r requirements/extra-csv.txt
+
+.PHONY: install-docx
+install-docx:
+	python3 -m pip install -r requirements/extra-docx.txt
+
+.PHONY: install-odt
+install-odt:
+	python3 -m pip install -r requirements/extra-odt.txt
+
+.PHONY: install-pypandoc
+install-pypandoc:
+	python3 -m pip install -r requirements/extra-pandoc.txt
+
+.PHONY: install-markdown
+install-markdown:
+	python3 -m pip install -r requirements/extra-markdown.txt
+
+.PHONY: install-msg
+install-msg:
+	python3 -m pip install -r requirements/extra-msg.txt
+
+.PHONY: install-pdf-image
+install-pdf-image:
+	python3 -m pip install -r requirements/extra-pdf-image.txt
+
+.PHONY: install-pptx
+install-pptx:
+	python3 -m pip install -r requirements/extra-pptx.txt
+
+.PHONY: install-xlsx
+install-xlsx:
+	python3 -m pip install -r requirements/extra-xlsx.txt
+
+.PHONY: install-all-docs
+install-all-docs: install-base install-csv install-docx install-docx install-odt install-pypandoc install-markdown install-msg install-pdf-image install-pptx install-xlsx
+
 .PHONY: install-ingest-google-drive
 install-ingest-google-drive:
 	python3 -m pip install -r requirements/ingest-google-drive.txt
@@ -124,7 +163,7 @@ install-unstructured-inference:
 
 ## install-local-inference: installs requirements for local inference
 .PHONY: install-local-inference
-install-local-inference: install install-unstructured-inference
+install-local-inference: install install-all-docs
 
 .PHONY: install-pandoc
 install-pandoc:
@@ -135,12 +174,23 @@ install-pandoc:
 .PHONY: pip-compile
 pip-compile:
 	pip-compile --upgrade requirements/base.in
+
+	# Extra requirements that are specific to document types
+	pip-compile --upgrade requirements/extra-csv.in
+	pip-compile --upgrade requirements/extra-docx.in
+	pip-compile --upgrade requirements/extra-pandoc.in
+	pip-compile --upgrade requirements/extra-markdown.in
+	pip-compile --upgrade requirements/extra-msg.in
+	pip-compile --upgrade requirements/extra-odt.in
+	pip-compile --upgrade requirements/extra-pdf-image.in
+	pip-compile --upgrade requirements/extra-pptx.in
+	pip-compile --upgrade requirements/extra-xlsx.in
+
 	# Extra requirements for huggingface staging functions
 	pip-compile --upgrade requirements/huggingface.in
 	pip-compile --upgrade requirements/test.in
 	pip-compile --upgrade requirements/dev.in
 	pip-compile --upgrade requirements/build.in
-	pip-compile --upgrade requirements/local-inference.in
 	# NOTE(robinson) - doc/requirements.txt is where the GitHub action for building
 	# sphinx docs looks for additional requirements
 	cp requirements/build.txt docs/requirements.txt
@@ -158,6 +208,7 @@ pip-compile:
 	pip-compile --upgrade requirements/ingest-google-drive.in
 	pip-compile --upgrade requirements/ingest-elasticsearch.in
 	pip-compile --upgrade requirements/ingest-onedrive.in
+	pip-compile --upgrade requirements/ingest-outlook.in
 	pip-compile --upgrade requirements/ingest-confluence.in
 
 ## install-project-local: install unstructured into your local python environment

docs/requirements.txt

Lines changed: 17 additions & 12 deletions
@@ -1,5 +1,5 @@
 #
-# This file is autogenerated by pip-compile with Python 3.8
+# This file is autogenerated by pip-compile with Python 3.11
 # by the following command:
 #
 #    pip-compile requirements/build.in
@@ -12,22 +12,26 @@ beautifulsoup4==4.12.2
     # via furo
 certifi==2023.7.22
     # via
+    #   -c requirements/base.txt
+    #   -c requirements/constraints.in
     #   -r requirements/build.in
     #   requests
 charset-normalizer==3.2.0
-    # via requests
+    # via
+    #   -c requirements/base.txt
+    #   requests
 docutils==0.18.1
     # via
     #   sphinx
     #   sphinx-rtd-theme
 furo==2023.7.26
     # via -r requirements/build.in
 idna==3.4
-    # via requests
+    # via
+    #   -c requirements/base.txt
+    #   requests
 imagesize==1.4.1
     # via sphinx
-importlib-metadata==6.8.0
-    # via sphinx
 jinja2==3.1.2
     # via sphinx
 markupsafe==2.1.3
@@ -38,10 +42,10 @@ pygments==2.15.1
     # via
     #   furo
     #   sphinx
-pytz==2023.3
-    # via babel
 requests==2.31.0
-    # via sphinx
+    # via
+    #   -c requirements/base.txt
+    #   sphinx
 snowballstemmer==2.2.0
     # via sphinx
 soupsieve==2.4.1
@@ -71,7 +75,8 @@ sphinxcontrib-qthelp==1.0.3
     # via sphinx
 sphinxcontrib-serializinghtml==1.1.5
     # via sphinx
-urllib3==2.0.4
-    # via requests
-zipp==3.16.2
-    # via importlib-metadata
+urllib3==1.26.16
+    # via
+    #   -c requirements/base.txt
+    #   -c requirements/constraints.in
+    #   requests

docs/source/installing.rst

Lines changed: 9 additions & 2 deletions
@@ -7,8 +7,15 @@ Quick Start
 Use the following instructions to get up and running with ``unstructured`` and test your
 installation.
 
-* Install the Python SDK with ``pip install "unstructured[local-inference]"``
-* If you do not need to process PDFs or images, you can run ``pip install unstructured``
+* Install the Python SDK with ``pip install unstructured``
+* Plain text files, HTML, XML, JSON and Emails do not require any extra dependencies.
+* If you need to process other document types, you can install the extras required for those documents
+with ``pip install "unstructured[docx,pptx]"``.
+* To install the extras for every document type, use ``pip install "unstructured[all-docs]"``.
+* For ``unstructured<0.9.0``, you can install the extras for all document types with
+``pip install "unstructured[local-inference]"``. The ``local-inference`` extra is still
+supported in newer versions for backward compatibility, but may be deprecated in a future version.
+The ``all-docs`` extra is the officially supported installation pattern.
 
 * Install the following system dependencies if they are not already available on your system. Depending on what document types you're parsing, you may not need all of these.
 * ``libmagic-dev`` (filetype detection)
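
The extras named in these instructions (``docx``, ``pptx``, ``all-docs``) line up with the per-type ``requirements/extra-*.txt`` files this commit introduces. The following is only a sketch of how a ``setup.py`` might turn those files into an ``extras_require`` mapping. The helper names ``load_requirements`` and ``build_extras`` are hypothetical, and the project's actual ``setup.py`` is not shown on this page and may be organized differently.

```python
from pathlib import Path
from typing import Dict, List


def load_requirements(path: Path) -> List[str]:
    """Hypothetical helper: read one compiled requirements file.

    Skips blank lines, comment lines (for example "# via pandas") and
    pip options (for example "-c constraints.in").
    """
    requirements = []
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        requirements.append(line)
    return requirements


def build_extras(requirements_dir: Path, doc_types: List[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: map each document type to its pinned requirements.

    Assumes files are named extra-<doc_type>.txt, as in this commit.
    """
    extras = {
        doc_type: load_requirements(requirements_dir / f"extra-{doc_type}.txt")
        for doc_type in doc_types
    }
    # "all-docs" is assumed to be the de-duplicated union of every per-type extra.
    extras["all-docs"] = sorted({req for reqs in extras.values() for req in reqs})
    return extras
```

Under these assumptions, the result of ``build_extras(Path("requirements"), ["csv", "docx", "pptx"])`` could be passed as the ``extras_require`` argument of ``setuptools.setup``.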

requirements/base.in

Lines changed: 2 additions & 13 deletions
@@ -1,19 +1,8 @@
 -c "constraints.in"
 chardet
 filetype
+python-magic
 lxml
-msg_parser
 nltk
-openpyxl
-pandas
-pdf2image
-pdfminer.six
-pillow
-pypandoc
-python-docx
-python-pptx
-python-magic
-markdown
-requests
 tabulate
-xlrd
+requests

requirements/base.txt

Lines changed: 2 additions & 58 deletions
@@ -1,5 +1,5 @@
 #
-# This file is autogenerated by pip-compile with Python 3.8
+# This file is autogenerated by pip-compile with Python 3.11
 # by the following command:
 #
 #    pip-compile requirements/base.in
@@ -8,89 +8,33 @@ certifi==2023.7.22
     # via
     #   -c requirements/constraints.in
     #   requests
-cffi==1.15.1
-    # via cryptography
 chardet==5.1.0
     # via -r requirements/base.in
 charset-normalizer==3.2.0
-    # via
-    #   pdfminer-six
-    #   requests
+    # via requests
 click==8.1.6
     # via nltk
-cryptography==41.0.2
-    # via pdfminer-six
-et-xmlfile==1.1.0
-    # via openpyxl
 filetype==1.2.0
     # via -r requirements/base.in
 idna==3.4
     # via requests
-importlib-metadata==6.8.0
-    # via markdown
 joblib==1.3.1
     # via nltk
 lxml==4.9.3
-    # via
-    #   -r requirements/base.in
-    #   python-docx
-    #   python-pptx
-markdown==3.4.4
-    # via -r requirements/base.in
-msg-parser==1.2.0
     # via -r requirements/base.in
 nltk==3.8.1
     # via -r requirements/base.in
-numpy==1.24.4
-    # via pandas
-olefile==0.46
-    # via msg-parser
-openpyxl==3.1.2
-    # via -r requirements/base.in
-pandas==2.0.3
-    # via -r requirements/base.in
-pdf2image==1.16.3
-    # via -r requirements/base.in
-pdfminer-six==20221105
-    # via -r requirements/base.in
-pillow==10.0.0
-    # via
-    #   -r requirements/base.in
-    #   pdf2image
-    #   python-pptx
-pycparser==2.21
-    # via cffi
-pypandoc==1.11
-    # via -r requirements/base.in
-python-dateutil==2.8.2
-    # via pandas
-python-docx==0.8.11
-    # via -r requirements/base.in
 python-magic==0.4.27
     # via -r requirements/base.in
-python-pptx==0.6.21
-    # via -r requirements/base.in
-pytz==2023.3
-    # via pandas
 regex==2023.6.3
     # via nltk
 requests==2.31.0
     # via -r requirements/base.in
-six==1.16.0
-    # via python-dateutil
 tabulate==0.9.0
     # via -r requirements/base.in
 tqdm==4.65.0
     # via nltk
-tzdata==2023.3
-    # via pandas
 urllib3==1.26.16
     # via
     #   -c requirements/constraints.in
     #   requests
-xlrd==2.0.1
-    # via -r requirements/base.in
-xlsxwriter==3.1.2
-    # via python-pptx
-zipp==3.16.2
-    # via importlib-metadata
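
With packages such as pandas, python-docx and pypandoc no longer part of the base install, code that handles those document types has to cope with them being absent (the commit message mentions a "dependency check for pypandoc"). The following is only a sketch of one way to guard a function this way. The decorator name ``requires_extra`` and the example function are hypothetical and are not taken from the library.

```python
import functools
import importlib.util
from typing import Callable, Sequence


def requires_extra(modules: Sequence[str], extra: str) -> Callable:
    """Hypothetical decorator: fail with an install hint if a module is missing.

    `modules` are top-level import names; `extra` is the name of the
    pip extra that would provide them.
    """

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None when a top-level module cannot be found.
            missing = [m for m in modules if importlib.util.find_spec(m) is None]
            if missing:
                raise ImportError(
                    f"Missing required modules: {', '.join(missing)}. "
                    f'Install them with: pip install "unstructured[{extra}]"'
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical usage: the check runs when the function is called, not at import time.
@requires_extra(["docx"], extra="docx")
def count_paragraphs(filename: str) -> int:
    import docx  # provided by python-docx

    return len(docx.Document(filename).paragraphs)
```

Running the check at call time keeps the base package importable without the extra and defers the error until a document type that needs it is actually processed.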

requirements/build.in

Lines changed: 3 additions & 0 deletions
@@ -1,3 +1,6 @@
+-c base.txt
+-c constraints.in
+
 sphinx
 # NOTE(alan) - Pinning to resolve a conflict with sphinx. We can unpin on next sphinx_rtd_theme release.
 sphinx_rtd_theme==1.2.2
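
The ``-c base.txt`` line makes pip-compile treat the pins in ``base.txt`` as constraints, so a package that appears in both compiled files resolves to the same version. That is why ``urllib3`` moves from 2.0.4 to 1.26.16 in ``docs/requirements.txt`` above. The following sketch shows one way to check two compiled files for disagreeing pins. The function names ``parse_pins`` and ``find_conflicts`` are hypothetical, and the parser assumes simple ``name==version`` lines without environment markers.

```python
from typing import Dict, List, Tuple


def parse_pins(text: str) -> Dict[str, str]:
    """Hypothetical helper: extract name -> version from pip-compile output."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Ignore blanks, "# via ..." annotations, and options such as "-c base.txt".
        if not line or line.startswith("#") or line.startswith("-") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_conflicts(base_text: str, other_text: str) -> List[Tuple[str, str, str]]:
    """Return (package, base_version, other_version) for every disagreeing pin."""
    base_pins = parse_pins(base_text)
    other_pins = parse_pins(other_text)
    shared = base_pins.keys() & other_pins.keys()
    return sorted(
        (name, base_pins[name], other_pins[name])
        for name in shared
        if base_pins[name] != other_pins[name]
    )


# Pins taken from the hunks above: the old docs/requirements.txt disagreed
# with base.txt on urllib3, and the recompiled file does not.
base = "requests==2.31.0\nurllib3==1.26.16\n    # via requests\n"
old_docs = "requests==2.31.0\nurllib3==2.0.4\n    # via requests\n"
new_docs = "requests==2.31.0\nurllib3==1.26.16\n    # via requests\n"
print(find_conflicts(base, old_docs))
print(find_conflicts(base, new_docs))
```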
