Commit e7d189f

chore: Bump inference and set default ocr_mode to entire_page (#1172)
* pip-compile in order to bump unstructured-inference
* Set the default `ocr_mode` back to `entire_page` now that [this error](Unstructured-IO/unstructured-inference#183) is addressed
* Explicitly add `sphinx-tabs` to `build.in`. This file provides `docs/requirements.txt`.
* Remove a pinned `pydantic` version
* Fix a makefile command to `pip-compile` a missing ingest file.
1 parent 05e3116 commit e7d189f

27 files changed

+240 -258 lines changed

Diff for: CHANGELOG.md

+6-2
@@ -1,12 +1,16 @@
-## 0.10.5-dev4
+## 0.10.5
 
 ### Enhancements
 * Create new CI Pipelines
   - Checking text, xml, email, and html doc tests against the library installed without extras
   - Checking each library extra against their respective tests
-* `partition` raises and error and tells the user to install the appropriate extra if a filetype
+* `partition` raises an error and tells the user to install the appropriate extra if a filetype
   is detected that is missing dependencies.
 * Add custom errors to ingest
+* Bump `unstructured-ingest==0.5.15`
+  - Handle an uncaught TesseractError (0.5.15)
+  - Add TIFF test file and TIFF filetype to `test_from_image_file` in `test_layout` (0.5.14)
+* Use `entire_page` ocr mode for pdfs and images
 * Add notes on extra installs to docs
 

Diff for: Makefile

+1-1
@@ -232,7 +232,7 @@ pip-compile:
 	pip-compile --upgrade requirements/ingest-gcs.in
 	pip-compile --upgrade requirements/ingest-dropbox.in
 	pip-compile --upgrade requirements/ingest-azure.in
-	pip-compile --upgrade requirements/ingest-delta-lake.in
+	pip-compile --upgrade requirements/ingest-delta-table.in
 	pip-compile --upgrade requirements/ingest-discord.in
 	pip-compile --upgrade requirements/ingest-reddit.in
 	pip-compile --upgrade requirements/ingest-github.in

Diff for: docs/requirements.txt

+6-3
@@ -26,7 +26,8 @@ docutils==0.18.1
     # via
     #   sphinx
     #   sphinx-rtd-theme
-furo==2023.7.26
+    #   sphinx-tabs
+furo==2023.8.19
     # via -r requirements/build.in
 idna==3.4
     # via
@@ -46,6 +47,7 @@ pygments==2.16.1
     # via
     #   furo
     #   sphinx
+    #   sphinx-tabs
 pytz==2023.3
     # via babel
 requests==2.31.0
@@ -64,13 +66,14 @@ sphinx==6.2.1
     #   furo
     #   sphinx-basic-ng
     #   sphinx-rtd-theme
+    #   sphinx-tabs
     #   sphinxcontrib-jquery
 sphinx-basic-ng==1.0.0b2
     # via furo
 sphinx-rtd-theme==1.2.2
     # via -r requirements/build.in
-sphinx-tabs
-# to enable tabbed code blocks
+sphinx-tabs==3.4.1
+    # via -r requirements/build.in
 sphinxcontrib-applehelp==1.0.4
     # via sphinx
 sphinxcontrib-devhelp==1.0.2

Diff for: requirements/base.txt

+1-1
@@ -14,7 +14,7 @@ chardet==5.2.0
     # via -r requirements/base.in
 charset-normalizer==3.2.0
     # via requests
-click==8.1.6
+click==8.1.7
     # via nltk
 emoji==2.8.0
     # via -r requirements/base.in

Diff for: requirements/build.in

+1
@@ -2,6 +2,7 @@
 -c constraints.in
 
 sphinx
+sphinx-tabs
 # NOTE(alan) - Pinning to resolve a conflict with sphinx. We can unpin on next sphinx_rtd_theme release.
 sphinx_rtd_theme==1.2.2
 furo

Diff for: requirements/build.txt

+6-1
@@ -26,7 +26,8 @@ docutils==0.18.1
     # via
     #   sphinx
     #   sphinx-rtd-theme
-furo==2023.7.26
+    #   sphinx-tabs
+furo==2023.8.19
     # via -r requirements/build.in
 idna==3.4
     # via
@@ -46,6 +47,7 @@ pygments==2.16.1
     # via
     #   furo
     #   sphinx
+    #   sphinx-tabs
 pytz==2023.3
     # via babel
 requests==2.31.0
@@ -64,11 +66,14 @@ sphinx==6.2.1
     #   furo
     #   sphinx-basic-ng
     #   sphinx-rtd-theme
+    #   sphinx-tabs
     #   sphinxcontrib-jquery
 sphinx-basic-ng==1.0.0b2
     # via furo
 sphinx-rtd-theme==1.2.2
     # via -r requirements/build.in
+sphinx-tabs==3.4.1
+    # via -r requirements/build.in
 sphinxcontrib-applehelp==1.0.4
     # via sphinx
 sphinxcontrib-devhelp==1.0.2

Diff for: requirements/constraints.in

+1-1
@@ -26,4 +26,4 @@ Pillow<10.0.0
 # AttributeError: 'ResourcePath' object has no attribute 'collection'
 Office365-REST-Python-Client<2.4.3
 # NOTE(christine) Pinned to set the `unstructured-inference` version
-unstructured-inference==0.5.13
+unstructured-inference==0.5.15

Diff for: requirements/dev.txt

+5-5
@@ -51,7 +51,7 @@ charset-normalizer==3.2.0
     #   -c requirements/base.txt
     #   -c requirements/test.txt
     #   requests
-click==8.1.6
+click==8.1.7
     # via
     #   -c requirements/base.txt
     #   -c requirements/test.txt
@@ -80,7 +80,7 @@ filelock==3.12.2
     # via virtualenv
 fqdn==1.5.1
     # via jsonschema
-identify==2.5.26
+identify==2.5.27
     # via pre-commit
 idna==3.4
     # via
@@ -167,7 +167,7 @@ jupyter-events==0.7.0
     # via jupyter-server
 jupyter-lsp==2.2.0
     # via jupyterlab
-jupyter-server==2.7.1
+jupyter-server==2.7.2
     # via
     #   jupyter-lsp
     #   jupyterlab
@@ -398,9 +398,9 @@ webencodings==0.5.1
     # via
     #   bleach
     #   tinycss2
-websocket-client==1.6.1
+websocket-client==1.6.2
     # via jupyter-server
-wheel==0.41.1
+wheel==0.41.2
     # via
     #   -c requirements/constraints.in
     #   pip-tools

Diff for: requirements/extra-pdf-image.txt

+3-3
@@ -35,7 +35,7 @@ filelock==3.12.2
     #   transformers
 flatbuffers==23.5.26
     # via onnxruntime
-fonttools==4.42.0
+fonttools==4.42.1
     # via matplotlib
 fsspec==2023.6.0
     # via huggingface-hub
@@ -196,7 +196,7 @@ tqdm==4.66.1
     #   huggingface-hub
     #   iopath
     #   transformers
-transformers==4.31.0
+transformers==4.32.0
     # via unstructured-inference
 typing-extensions==4.7.1
     # via
@@ -205,7 +205,7 @@ typing-extensions==4.7.1
     #   torch
 tzdata==2023.3
     # via pandas
-unstructured-inference==0.5.13
+unstructured-inference==0.5.15
     # via
     #   -c requirements/constraints.in
     #   -r requirements/extra-pdf-image.in
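
The version bumps in a compiled requirements file like the one above can be listed mechanically. The following is a minimal sketch, not part of this commit: `parse_pins` and `changed_pins` are hypothetical helper names, and the sample strings reuse only pins visible in the `requirements/extra-pdf-image.txt` hunks. It assumes every pin has the plain `name==version` form that `pip-compile` emits.

```python
import re
from typing import Dict, Optional, Tuple

# Matches a pinned requirement such as "click==8.1.7"; "# via ..." lines do not match.
PIN_RE = re.compile(r"^([A-Za-z0-9_.\-]+)==(\S+)$")


def parse_pins(text: str) -> Dict[str, str]:
    """Return {package: version} for every `name==version` line in `text`."""
    pins: Dict[str, str] = {}
    for raw in text.splitlines():
        match = PIN_RE.match(raw.strip())
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def changed_pins(old: str, new: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each package whose pin differs to (old_version, new_version).

    A package present on only one side gets None for the missing version.
    """
    old_pins, new_pins = parse_pins(old), parse_pins(new)
    return {
        name: (old_pins.get(name), new_pins.get(name))
        for name in sorted(old_pins.keys() | new_pins.keys())
        if old_pins.get(name) != new_pins.get(name)
    }


# Sample data copied from the hunks above (hypothetical variable names).
OLD = """fonttools==4.42.0
    # via matplotlib
tqdm==4.66.1
transformers==4.31.0
unstructured-inference==0.5.13
"""
NEW = """fonttools==4.42.1
    # via matplotlib
tqdm==4.66.1
transformers==4.32.0
unstructured-inference==0.5.15
"""

for name, (before, after) in changed_pins(OLD, NEW).items():
    print(f"{name}: {before} -> {after}")
```

Unchanged pins such as `tqdm` are skipped, so the output contains only the packages that the `pip-compile --upgrade` run actually moved.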

Diff for: requirements/huggingface.txt

+2-2
@@ -13,7 +13,7 @@ charset-normalizer==3.2.0
     # via
     #   -c requirements/base.txt
     #   requests
-click==8.1.6
+click==8.1.7
     # via
     #   -c requirements/base.txt
     #   sacremoses
@@ -88,7 +88,7 @@ tqdm==4.66.1
     #   huggingface-hub
     #   sacremoses
     #   transformers
-transformers==4.31.0
+transformers==4.32.0
     # via -r requirements/huggingface.in
 typing-extensions==4.7.1
     # via

Diff for: requirements/ingest-airtable.txt

+8-2
@@ -4,6 +4,8 @@
 #
 #    pip-compile requirements/ingest-airtable.in
 #
+annotated-types==0.5.0
+    # via pydantic
 certifi==2023.7.22
     # via
     #   -c requirements/base.txt
@@ -19,18 +21,22 @@ idna==3.4
     #   requests
 inflection==0.5.1
     # via pyairtable
-pyairtable==2.0.0
+pyairtable==2.1.0.post1
     # via -r requirements/ingest-airtable.in
-pydantic==1.10.12
+pydantic==2.2.1
     # via pyairtable
+pydantic-core==2.6.1
+    # via pydantic
 requests==2.31.0
     # via
     #   -c requirements/base.txt
     #   pyairtable
 typing-extensions==4.7.1
     # via
+    #   annotated-types
     #   pyairtable
     #   pydantic
+    #   pydantic-core
 urllib3==1.26.16
     # via
     #   -c requirements/base.txt

Diff for: requirements/ingest-azure.txt

+1-1
@@ -14,7 +14,7 @@ async-timeout==4.0.3
     # via aiohttp
 attrs==23.1.0
     # via aiohttp
-azure-core==1.29.2
+azure-core==1.29.3
     # via
     #   adlfs
     #   azure-identity

Diff for: requirements/ingest-biomed.txt

+6-2
@@ -5,8 +5,12 @@
 #    pip-compile requirements/ingest-biomed.in
 #
 beautifulsoup4==4.12.2
-    # via bs4
+    # via
+    #   -c requirements/base.txt
+    #   bs4
 bs4==0.0.1
     # via -r requirements/ingest-biomed.in
 soupsieve==2.4.1
-    # via beautifulsoup4
+    # via
+    #   -c requirements/base.txt
+    #   beautifulsoup4

Diff for: requirements/ingest-gcs.txt

+6-2
@@ -13,7 +13,9 @@ async-timeout==4.0.3
 attrs==23.1.0
     # via aiohttp
 beautifulsoup4==4.12.2
-    # via bs4
+    # via
+    #   -c requirements/base.txt
+    #   bs4
 bs4==0.0.1
     # via -r requirements/ingest-gcs.in
 cachetools==5.3.1
@@ -99,7 +101,9 @@ rsa==4.9
 six==1.16.0
     # via google-auth
 soupsieve==2.4.1
-    # via beautifulsoup4
+    # via
+    #   -c requirements/base.txt
+    #   beautifulsoup4
 urllib3==1.26.16
     # via
     #   -c requirements/base.txt

Diff for: requirements/ingest-reddit.txt

+1-1
@@ -33,5 +33,5 @@ urllib3==1.26.16
     #   -c requirements/base.txt
     #   -c requirements/constraints.in
     #   requests
-websocket-client==1.6.1
+websocket-client==1.6.2
     # via praw

Diff for: requirements/test.in

+1-2
@@ -11,8 +11,7 @@ flake8
 freezegun
 label_studio_sdk
 mypy
-# pinning to avoid error in argilla library
-pydantic<2
+pydantic
 pytest-cov
 pytest-mock
 ruff
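
Dropping the `pydantic<2` bound is what lets the compiled files move from `pydantic==1.10.12` to `pydantic==2.2.1`. The sketch below illustrates that with a simplified upper-bound check. `release_tuple` and `satisfies_upper_bound` are hypothetical names, and the parser assumes purely numeric dotted versions; real resolvers follow the full PEP 440 rules, which this does not attempt to reproduce.

```python
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse a numeric dotted version like "1.10.12" into a comparable tuple.

    Assumption: no pre/post/dev tags. Trailing zeros are dropped so that
    "2" and "2.0" compare as equal.
    """
    parts = [int(part) for part in version.split(".")]
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def satisfies_upper_bound(version: str, bound: str) -> bool:
    """Return True if `version` is strictly below `bound`, i.e. matches `<bound`."""
    return release_tuple(version) < release_tuple(bound)


# The old pin satisfies `<2`; the newly resolved one does not.
print(satisfies_upper_bound("1.10.12", "2"))
print(satisfies_upper_bound("2.2.1", "2"))
```

With the bound in place only the 1.x line was eligible; with it removed, `pip-compile --upgrade` is free to pick the 2.x release shown in the compiled files.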

Diff for: requirements/test.txt

+9-3
@@ -4,6 +4,8 @@
 #
 #    pip-compile requirements/test.in
 #
+annotated-types==0.5.0
+    # via pydantic
 appdirs==1.4.4
     # via label-studio-tools
 black==23.7.0
@@ -17,7 +19,7 @@ charset-normalizer==3.2.0
     # via
     #   -c requirements/base.txt
     #   requests
-click==8.1.6
+click==8.1.7
     # via
     #   -c requirements/base.txt
     #   -r requirements/test.in
@@ -72,10 +74,12 @@ pluggy==1.2.0
     # via pytest
 pycodestyle==2.11.0
     # via flake8
-pydantic==1.10.12
+pydantic==2.2.1
     # via
     #   -r requirements/test.in
     #   label-studio-sdk
+pydantic-core==2.6.1
+    # via pydantic
 pyflakes==3.1.0
     # via flake8
 pytest==7.4.0
@@ -94,7 +98,7 @@ requests==2.31.0
     # via
     #   -c requirements/base.txt
     #   label-studio-sdk
-ruff==0.0.284
+ruff==0.0.285
     # via -r requirements/test.in
 six==1.16.0
     # via python-dateutil
@@ -116,9 +120,11 @@ types-urllib3==1.26.25.14
     # via types-requests
 typing-extensions==4.7.1
     # via
+    #   annotated-types
     #   black
     #   mypy
     #   pydantic
+    #   pydantic-core
 urllib3==1.26.16
     # via
     #   -c requirements/base.txt

Diff for: test_unstructured/partition/pdf-image/test_pdf.py

+3-3
@@ -177,7 +177,7 @@ def test_partition_pdf_with_model_name_env_var(
 filename,
 is_image=False,
 ocr_languages="eng",
-ocr_mode="individual_blocks",
+ocr_mode="entire_page",
 extract_tables=False,
 model_name="checkbox",
 )
@@ -198,7 +198,7 @@ def test_partition_pdf_with_model_name(
 filename,
 is_image=False,
 ocr_languages="eng",
-ocr_mode="individual_blocks",
+ocr_mode="entire_page",
 extract_tables=False,
 model_name="checkbox",
 )
@@ -404,7 +404,7 @@ def test_partition_pdf_with_dpi():
 filename,
 is_image=False,
 ocr_languages="eng",
-ocr_mode="individual_blocks",
+ocr_mode="entire_page",
 extract_tables=False,
 model_name=None,
 pdf_image_dpi=100,
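
The three hunks above change the expected `ocr_mode` keyword from `"individual_blocks"` to `"entire_page"`. As a rough, self-contained illustration of a keyword default with validation, the sketch below uses only the two mode names that appear in this diff. `OcrMode` and `partition_stub` are hypothetical names, not the `unstructured` API, and the stub performs no OCR.

```python
from enum import Enum
from typing import Dict


class OcrMode(str, Enum):
    """The two mode strings that appear in this diff."""

    ENTIRE_PAGE = "entire_page"
    INDIVIDUAL_BLOCKS = "individual_blocks"


def partition_stub(filename: str, ocr_mode: str = OcrMode.ENTIRE_PAGE.value) -> Dict[str, str]:
    """Hypothetical stand-in for a partition call.

    It only validates `ocr_mode` and echoes the options it would use.
    """
    mode = OcrMode(ocr_mode)  # raises ValueError for an unknown mode string
    return {"filename": filename, "ocr_mode": mode.value}


# Omitting the keyword picks up the default; passing it explicitly overrides it.
print(partition_stub("example.pdf"))
print(partition_stub("example.pdf", ocr_mode="individual_blocks"))
```

Validating against an enum means a misspelling such as `enitre_page` fails immediately with a `ValueError` instead of silently selecting some fallback behavior.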
