Commit ad14321

yuming-long and qued authored
Chore: don't pass empty language code to tesseract CLI (#1996)
Summary: Closes #1920

* Stop passing an empty string from `languages` to tesseract; previously it ended up as the value of the `-l` language option on the tesseract CLI.
* Stop passing duplicate language codes from `languages` to tesseract OCR.
* If none of the ISO languages in the `languages` parameter can be converted, proceed with OCR using `eng` as the default.

### Test

* First confirm the tesseract error `Estimating resolution as X` before this change:
  * In the `unstructured-api` repo, on the main branch, run `make run-web-app`.
  * Use curl to trigger the error with an empty string, or with any invalid input such as `-F 'languages="eng,de"'`:

```
curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
  -F 'languages=""' \
  -F 'strategy=hi_res' \
  -F 'pdf_infer_table_structure=True' \
  | jq -C . | less -R
```

* After this change:
  * In your unstructured API environment, cd to the unstructured repo and install it locally with `pip install -e .`.
  * Check out this branch.
  * Run `make run-web-app` again in the API repo.
  * The curl command now returns output, and a warning appears in the log.

---------

Co-authored-by: qued <[email protected]>
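The three behaviours in the summary (skip empty strings, drop duplicates, fall back to `eng`) can be sketched as below. This is only an illustration under stated assumptions, not the code from this commit: `normalize_languages` and the small `ISO_TO_TESSERACT` table are hypothetical names, and the real conversion is done by the library's own language module.

```python
from typing import Iterable, List

# Hypothetical lookup table for illustration only; the real project
# resolves ISO codes through its own language-handling module.
ISO_TO_TESSERACT = {
    "en": "eng", "eng": "eng",
    "de": "deu", "deu": "deu",
    "fr": "fra", "fra": "fra",
}

DEFAULT_LANGUAGE = "eng"


def normalize_languages(languages: Iterable[str]) -> List[str]:
    """Return de-duplicated tesseract codes, falling back to 'eng'."""
    codes: List[str] = []
    for raw in languages:
        # Strip whitespace and stray quotes such as the literal "" sent by curl.
        key = raw.strip().strip('"').lower()
        if not key:
            continue  # never forward an empty string to the CLI
        code = ISO_TO_TESSERACT.get(key)
        if code is None:
            continue  # unknown input (e.g. "eng,de" as one token) is ignored
        if code not in codes:
            codes.append(code)  # keep first occurrence, drop duplicates
    # If nothing could be converted, proceed with the default language.
    return codes or [DEFAULT_LANGUAGE]
```

With this sketch, both `normalize_languages([""])` and `normalize_languages(["eng,de"])` fall back to the default, which matches the two failing inputs named in the test steps.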
1 parent 38ab35d commit ad14321

49 files changed: +720 -748 lines changed

Diff for: CHANGELOG.md

+2 -1
@@ -1,4 +1,4 @@
-## 0.10.29-dev13
+## 0.10.29
 
 ### Enhancements
 
@@ -21,6 +21,7 @@
 * **Ingest session handler not being shared correctly** All ingest docs that leverage the session handler should only need to set it once per process. It was recreating it each time because the right values weren't being set nor available given how dataclasses work in python.
 * **Ingest download-only fix.** Previously the download only flag was being checked after the doc factory pipeline step, which occurs before the files are actually downloaded by the source node. This check was moved after the source node to allow for the files to be downloaded first before exiting the pipeline.
 * **Fix flaky chunk-metadata.** Prior implementation was sensitive to element order in the section resulting in metadata values sometimes being dropped. Also, not all metadata items can be consolidated across multiple elements (e.g. coordinates) and so are now dropped from consolidated metadata.
+* **Fix tesseract error `Estimating resolution as X`** caused by invalid language parameter input. Proceed with the default language `eng` when `lang.py` fails to find a valid language code for tesseract, so that an empty string is never passed to the tesseract CLI, where it would raise an exception downstream.
 
 ## 0.10.28
 
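To make the changelog entry concrete, here is a minimal sketch of guarding the `-l` option at the point where the command line is assembled. The helper name `tesseract_args` is hypothetical and not part of this commit. The sketch relies only on the documented tesseract CLI behaviour: multiple languages are joined with `+`, and `stdout` can be given as the output base.

```python
from typing import List, Sequence


def tesseract_args(image_path: str, languages: Sequence[str]) -> List[str]:
    """Build a tesseract command line whose -l value is never empty."""
    # Join the non-empty codes the way the CLI expects, e.g. "eng+deu".
    lang_arg = "+".join(code for code in languages if code)
    if not lang_arg:
        # Assumed fallback, mirroring the default named in the changelog entry.
        lang_arg = "eng"
    return ["tesseract", image_path, "stdout", "-l", lang_arg]
```

The returned list is suitable for `subprocess.run`, so the language value is passed as a single argument without shell quoting.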
Diff for: docs/requirements.txt

+25 -25
@@ -2,25 +2,25 @@
 # This file is autogenerated by pip-compile with Python 3.8
 # by the following command:
 #
-# pip-compile --constraint=requirements/constraints.in requirements/build.in
+# pip-compile --output-file=build.txt build.in
 #
 alabaster==0.7.13
 # via sphinx
-babel==2.13.0
+babel==2.13.1
 # via sphinx
 beautifulsoup4==4.12.2
 # via
-# -c requirements/base.txt
+# -c base.txt
 # furo
 certifi==2023.7.22
 # via
-# -c requirements/base.txt
-# -c requirements/constraints.in
-# -r requirements/build.in
+# -c base.txt
+# -c constraints.in
+# -r build.in
 # requests
-charset-normalizer==3.3.0
+charset-normalizer==3.3.2
 # via
-# -c requirements/base.txt
+# -c base.txt
 # requests
 docutils==0.18.1
 # via
@@ -29,10 +29,10 @@ docutils==0.18.1
 # sphinx-rtd-theme
 # sphinx-tabs
 furo==2023.7.26
-# via -r requirements/build.in
+# via -r build.in
 idna==3.4
 # via
-# -c requirements/base.txt
+# -c base.txt
 # requests
 imagesize==1.4.1
 # via sphinx
@@ -53,10 +53,10 @@ mdit-py-plugins==0.4.0
 mdurl==0.1.2
 # via markdown-it-py
 myst-parser==2.0.0
-# via -r requirements/build.in
+# via -r build.in
 packaging==23.2
 # via
-# -c requirements/base.txt
+# -c base.txt
 # sphinx
 pygments==2.16.1
 # via
@@ -69,17 +69,17 @@ pyyaml==6.0.1
 # via myst-parser
 requests==2.31.0
 # via
-# -c requirements/base.txt
+# -c base.txt
 # sphinx
 snowballstemmer==2.2.0
 # via sphinx
 soupsieve==2.5
 # via
-# -c requirements/base.txt
+# -c base.txt
 # beautifulsoup4
 sphinx==6.2.1
 # via
-# -r requirements/build.in
+# -r build.in
 # furo
 # myst-parser
 # sphinx-basic-ng
@@ -89,37 +89,37 @@ sphinx==6.2.1
 sphinx-basic-ng==1.0.0b2
 # via furo
 sphinx-rtd-theme==1.2.2
-# via -r requirements/build.in
-sphinx-tabs==3.4.1
-# via -r requirements/build.in
+# via -r build.in
+sphinx-tabs==3.4.4
+# via -r build.in
 sphinxcontrib-applehelp==1.0.4
 # via
-# -r requirements/build.in
+# -r build.in
 # sphinx
 sphinxcontrib-devhelp==1.0.2
 # via
-# -r requirements/build.in
+# -r build.in
 # sphinx
 sphinxcontrib-htmlhelp==2.0.1
 # via
-# -r requirements/build.in
+# -r build.in
 # sphinx
 sphinxcontrib-jquery==4.1
 # via sphinx-rtd-theme
 sphinxcontrib-jsmath==1.0.1
 # via sphinx
 sphinxcontrib-qthelp==1.0.3
 # via
-# -r requirements/build.in
+# -r build.in
 # sphinx
 sphinxcontrib-serializinghtml==1.1.5
 # via
-# -r requirements/build.in
+# -r build.in
 # sphinx
 urllib3==1.26.18
 # via
-# -c requirements/base.txt
-# -c requirements/constraints.in
+# -c base.txt
+# -c constraints.in
 # requests
 zipp==3.17.0
 # via importlib-metadata

Diff for: requirements/base.txt

+22 -22
@@ -2,73 +2,73 @@
 # This file is autogenerated by pip-compile with Python 3.8
 # by the following command:
 #
-# pip-compile --constraint=requirements/constraints.in requirements/base.in
+# pip-compile --output-file=base.txt base.in
 #
 backoff==2.2.1
-# via -r requirements/base.in
+# via -r base.in
 beautifulsoup4==4.12.2
-# via -r requirements/base.in
+# via -r base.in
 certifi==2023.7.22
 # via
-# -c requirements/constraints.in
+# -c constraints.in
 # requests
 chardet==5.2.0
-# via -r requirements/base.in
-charset-normalizer==3.3.1
+# via -r base.in
+charset-normalizer==3.3.2
 # via requests
 click==8.1.7
 # via nltk
 dataclasses-json==0.6.1
-# via -r requirements/base.in
+# via -r base.in
 emoji==2.8.0
-# via -r requirements/base.in
+# via -r base.in
 filetype==1.2.0
-# via -r requirements/base.in
+# via -r base.in
 idna==3.4
 # via requests
 joblib==1.3.2
 # via nltk
 langdetect==1.0.9
-# via -r requirements/base.in
+# via -r base.in
 lxml==4.9.3
-# via -r requirements/base.in
+# via -r base.in
 marshmallow==3.20.1
 # via dataclasses-json
 mypy-extensions==1.0.0
 # via typing-inspect
 nltk==3.8.1
-# via -r requirements/base.in
+# via -r base.in
 numpy==1.24.4
 # via
-# -c requirements/constraints.in
-# -r requirements/base.in
+# -c constraints.in
+# -r base.in
 packaging==23.2
 # via marshmallow
 python-iso639==2023.6.15
-# via -r requirements/base.in
+# via -r base.in
 python-magic==0.4.27
-# via -r requirements/base.in
-rapidfuzz==3.4.0
-# via -r requirements/base.in
+# via -r base.in
+rapidfuzz==3.5.2
+# via -r base.in
 regex==2023.10.3
 # via nltk
 requests==2.31.0
-# via -r requirements/base.in
+# via -r base.in
 six==1.16.0
 # via langdetect
 soupsieve==2.5
 # via beautifulsoup4
 tabulate==0.9.0
-# via -r requirements/base.in
+# via -r base.in
 tqdm==4.66.1
 # via nltk
 typing-extensions==4.8.0
 # via
-# -r requirements/base.in
+# -r base.in
 # typing-inspect
 typing-inspect==0.9.0
 # via dataclasses-json
 urllib3==1.26.18
 # via
-# -c requirements/constraints.in
+# -c constraints.in
 # requests

Diff for: requirements/build.txt

+23 -23
@@ -2,25 +2,25 @@
 # This file is autogenerated by pip-compile with Python 3.8
 # by the following command:
 #
-# pip-compile --constraint=requirements/constraints.in requirements/build.in
+# pip-compile --output-file=build.txt build.in
 #
 alabaster==0.7.13
 # via sphinx
 babel==2.13.1
 # via sphinx
 beautifulsoup4==4.12.2
 # via
-# -c requirements/base.txt
+# -c base.txt
 # furo
 certifi==2023.7.22
 # via
-# -c requirements/base.txt
-# -c requirements/constraints.in
-# -r requirements/build.in
+# -c base.txt
+# -c constraints.in
+# -r build.in
 # requests
-charset-normalizer==3.3.1
+charset-normalizer==3.3.2
 # via
-# -c requirements/base.txt
+# -c base.txt
 # requests
 docutils==0.18.1
 # via
@@ -29,10 +29,10 @@ docutils==0.18.1
 # sphinx-rtd-theme
 # sphinx-tabs
 furo==2023.7.26
-# via -r requirements/build.in
+# via -r build.in
 idna==3.4
 # via
-# -c requirements/base.txt
+# -c base.txt
 # requests
 imagesize==1.4.1
 # via sphinx
@@ -53,10 +53,10 @@ mdit-py-plugins==0.4.0
 mdurl==0.1.2
 # via markdown-it-py
 myst-parser==2.0.0
-# via -r requirements/build.in
+# via -r build.in
 packaging==23.2
 # via
-# -c requirements/base.txt
+# -c base.txt
 # sphinx
 pygments==2.16.1
 # via
@@ -69,17 +69,17 @@ pyyaml==6.0.1
 # via myst-parser
 requests==2.31.0
 # via
-# -c requirements/base.txt
+# -c base.txt
 # sphinx
 snowballstemmer==2.2.0
 # via sphinx
 soupsieve==2.5
 # via
-# -c requirements/base.txt
+# -c base.txt
 # beautifulsoup4
 sphinx==6.2.1
 # via
-# -r requirements/build.in
+# -r build.in
 # furo
 # myst-parser
 # sphinx-basic-ng
@@ -89,37 +89,37 @@ sphinx==6.2.1
 sphinx-basic-ng==1.0.0b2
 # via furo
 sphinx-rtd-theme==1.2.2
-# via -r requirements/build.in
+# via -r build.in
 sphinx-tabs==3.4.4
-# via -r requirements/build.in
+# via -r build.in
 sphinxcontrib-applehelp==1.0.4
 # via
-# -r requirements/build.in
+# -r build.in
 # sphinx
 sphinxcontrib-devhelp==1.0.2
 # via
-# -r requirements/build.in
+# -r build.in
 # sphinx
 sphinxcontrib-htmlhelp==2.0.1
 # via
-# -r requirements/build.in
+# -r build.in
 # sphinx
 sphinxcontrib-jquery==4.1
 # via sphinx-rtd-theme
 sphinxcontrib-jsmath==1.0.1
 # via sphinx
 sphinxcontrib-qthelp==1.0.3
 # via
-# -r requirements/build.in
+# -r build.in
 # sphinx
 sphinxcontrib-serializinghtml==1.1.5
 # via
-# -r requirements/build.in
+# -r build.in
 # sphinx
 urllib3==1.26.18
 # via
-# -c requirements/base.txt
-# -c requirements/constraints.in
+# -c base.txt
+# -c constraints.in
 # requests
 zipp==3.17.0
 # via importlib-metadata
