Commit a11ad22

badGarnet, ryannikolaidis, and christinestraub authored
bump unstructured-inference (#3711)
This PR bumps `unstructured-inference` to `0.8.0`, which introduces a vectorized data structure for layout elements and text regions. It also cleans up a few places in CI that repeated the definition of environment variables or were missing the installation of testing dependencies in the cache.

A few document ingest results are changed:
- Two changes for `biomed-api` (actually processed locally on the runner) are due to very small changes in the numerical results of the bounding box areas: one results in a duplicated page number/header, and the other results in the deduplication of a word in a sentence that starts on a new line. (Yes, the two cases go in opposite directions.)
- The layout parser paper now outputs the code lines with the page number inside the code box as list items.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: badGarnet <[email protected]>
Co-authored-by: christinestraub <[email protected]>
1 parent e764bc5 commit a11ad22
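
The "vectorized data structure" in the commit message can be pictured as keeping all boxes of a page in per-coordinate columns instead of one object per element. The following is only a minimal sketch under stated assumptions: `BoxColumns` and its fields are hypothetical names, not the `unstructured-inference` API, and the standard-library `array` module stands in for whatever array library the package actually uses.

```python
from array import array
from dataclasses import dataclass, field


def _doubles() -> array:
    # Helper for the sketch: an empty array of C doubles.
    return array("d")


@dataclass
class BoxColumns:
    """Hypothetical column-oriented store: one array per coordinate."""

    x1: array = field(default_factory=_doubles)
    y1: array = field(default_factory=_doubles)
    x2: array = field(default_factory=_doubles)
    y2: array = field(default_factory=_doubles)
    texts: list = field(default_factory=list)

    def append(self, x1: float, y1: float, x2: float, y2: float, text: str) -> None:
        self.x1.append(x1)
        self.y1.append(y1)
        self.x2.append(x2)
        self.y2.append(y2)
        self.texts.append(text)

    def areas(self) -> list:
        # A single pass over the columns yields the area of every box.
        return [
            (x2 - x1) * (y2 - y1)
            for x1, y1, x2, y2 in zip(self.x1, self.y1, self.x2, self.y2)
        ]

    def __len__(self) -> int:
        return len(self.texts)


boxes = BoxColumns()
boxes.append(0.0, 0.0, 10.0, 2.0, "Header")
boxes.append(0.0, 3.0, 10.0, 8.0, "Body text")
print(boxes.areas())  # [20.0, 50.0]
```

Computing areas column-wise rather than per object changes the order of floating-point operations, which is one plausible way results can shift by tiny amounts, as the commit message reports for `biomed-api`.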

23 files changed, +184 -109 lines changed

Diff for: .github/actions/base-cache/action.yml

+4 -1
@@ -30,14 +30,17 @@ runs:
 shell: bash
 run: |
 python${{ inputs.python-version }} -m pip install --upgrade virtualenv
-python${{ inputs.python-version }} -m venv .venv
+if [ ! -d ".venv" ]; then
+python${{ inputs.python-version }} -m venv .venv
+fi
 source .venv/bin/activate
 [ ! -d "$NLTK_DATA" ] && mkdir "$NLTK_DATA"
 if [ "${{ inputs.python-version == '3.12' }}" == "true" ]; then
 python -m ensurepip --upgrade
 python -m pip install --upgrade setuptools
 fi
 make install-ci
+make install-nltk-models
 - name: Save Cache
 if: steps.virtualenv-cache-restore.outputs.cache-hit != 'true'
 id: virtualenv-cache-save
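
The `if [ ! -d ".venv" ]` guard added above makes environment creation idempotent: an existing `.venv` is reused instead of being recreated. A rough Python equivalent, offered only as a sketch: `ensure_venv` and its `create` parameter are hypothetical names introduced for illustration, while `venv.create` is the standard-library call.

```python
import venv
from pathlib import Path
from typing import Callable


def ensure_venv(path: Path, create: Callable[[Path], None] = venv.create) -> bool:
    """Create a virtual environment at `path` only if the directory is absent.

    Returns True when a new environment was created and False when an
    existing directory was reused, mirroring the shell guard in the action.
    """
    if path.is_dir():
        return False
    create(path)
    return True
```

The `create` argument exists so the guard can be exercised without building a real environment; calling `ensure_venv(Path(".venv"))` uses `venv.create` directly.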

Diff for: .github/actions/base-ingest-cache/action.yml

+4 -2
@@ -18,7 +18,7 @@ runs:
 path: |
 .venv
 nltk_data
-key: unstructured-ingest-${{ runner.os }}-${{ inputs.python-version }}-${{ hashFiles('requirements/ingest/*.txt') }}-${{ hashFiles('requirements/*.txt') }}
+key: unstructured-ingest-${{ runner.os }}-${{ inputs.python-version }}-${{ hashFiles('requirements/ingest/*.txt', 'requirements/*.txt') }}
 lookup-only: ${{ inputs.check-only }}
 - name: Set up Python ${{ inputs.python-version }}
 if: steps.ingest-virtualenv-cache-restore.outputs.cache-hit != 'true'
@@ -39,6 +39,8 @@ runs:
 python -m pip install --upgrade setuptools
 fi
 make install-ci
+make install-nltk-models
+make install-all-docs
 make install-ingest
 - name: Save Ingest Cache
 if: steps.ingest-virtualenv-cache-restore.outputs.cache-hit != 'true'
@@ -48,5 +50,5 @@ runs:
 path: |
 .venv
 nltk_data
-key: unstructured-ingest-${{ runner.os }}-${{ inputs.python-version }}-${{ hashFiles('requirements/ingest/*.txt') }}-${{ hashFiles('requirements/*.txt') }}
+key: unstructured-ingest-${{ runner.os }}-${{ inputs.python-version }}-${{ hashFiles('requirements/ingest/*.txt', 'requirements/*.txt') }}
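
The cache key change above replaces two `hashFiles` segments with a single call over both patterns, so the key carries one digest for the whole set of requirement files. The sketch below approximates that behavior in Python (hash each matched file, then hash the digests). It is an illustration only: `hash_files` is a hypothetical helper, the sort order and digest combination are assumptions, and its output will not match the value GitHub computes.

```python
import hashlib
from pathlib import Path


def hash_files(root: Path, *patterns: str) -> str:
    """Approximate a multi-pattern hashFiles: one digest for all matched files."""
    # Collect every file matched by any pattern, without duplicates.
    matched = sorted(
        {p for pattern in patterns for p in root.glob(pattern) if p.is_file()}
    )
    if not matched:
        return ""  # assumed to mirror the empty result when nothing matches
    combined = hashlib.sha256()
    for path in matched:
        # Feed each file's own SHA-256 digest into the combined hash.
        combined.update(hashlib.sha256(path.read_bytes()).digest())
    return combined.hexdigest()
```

With either key layout, editing any matched requirements file produces a different key; the single-call form simply yields a shorter key with one hash segment.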

Diff for: .github/workflows/ci.yml

+6 -11
@@ -12,14 +12,15 @@ permissions:
 id-token: write
 contents: read

+env:
+NLTK_DATA: ${{ github.workspace }}/nltk_data
+
 jobs:
 setup:
 strategy:
 matrix:
 python-version: ["3.9","3.10","3.11", "3.12"]
 runs-on: ubuntu-latest
-env:
-NLTK_DATA: ${{ github.workspace }}/nltk_data
 steps:
 - uses: actions/checkout@v4
 - uses: ./.github/actions/base-cache
@@ -78,8 +79,6 @@ jobs:
 strategy:
 matrix:
 python-version: ["3.9","3.10","3.11"]
-env:
-NLTK_DATA: ${{ github.workspace }}/nltk_data
 runs-on: ubuntu-latest
 needs: [setup, changelog]
 steps:
@@ -185,8 +184,6 @@ jobs:
 python-version: ["3.10"]
 extra: ["csv", "docx", "odt", "markdown", "pypandoc", "pdf-image", "pptx", "xlsx"]
 runs-on: ubuntu-latest
-env:
-NLTK_DATA: ${{ github.workspace }}/nltk_data
 needs: [setup, lint, test_unit_no_extras]
 steps:
 - uses: actions/checkout@v4
@@ -220,15 +217,14 @@ jobs:
 sudo apt-get update
 sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
 tesseract --version
+make install-${{ matrix.extra }}
 make test-extra-${{ matrix.extra }} CI=true

 setup_ingest:
 strategy:
 matrix:
 python-version: [ "3.9","3.10" ]
 runs-on: ubuntu-latest
-env:
-NLTK_DATA: ${{ github.workspace }}/nltk_data
 needs: [setup]
 steps:
 - uses: actions/checkout@v4
@@ -307,7 +303,6 @@ jobs:
 MXBAI_API_KEY: ${{secrets.MXBAI_API_KEY}}
 OCR_AGENT: "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract"
 CI: "true"
-NLTK_DATA: ${{ github.workspace }}/nltk_data
 PYTHON: python${{ matrix.python-version }}
 run: |
 source .venv/bin/activate
@@ -320,6 +315,8 @@ jobs:
 sudo apt-get install -y tesseract-ocr-kor
 sudo apt-get install diffstat
 tesseract --version
+make install-all-docs
+make install-ingest
 ./test_unstructured_ingest/test-ingest-src.sh


@@ -329,8 +326,6 @@ jobs:
 # NOTE(yuming): Unstructured API only use Python 3.10
 python-version: ["3.10"]
 runs-on: ubuntu-latest
-env:
-NLTK_DATA: ${{ github.workspace }}/nltk_data
 needs: [setup, lint]
 steps:
 - uses: actions/checkout@v4
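
Hoisting `NLTK_DATA` to the workflow-level `env` block works because every job and step inherits workflow-level variables unless a more specific scope overrides them. A small sketch of that lookup order, assuming the usual "most specific wins" rule; `effective_env` is a hypothetical helper and `/workspace/nltk_data` is a placeholder for `${{ github.workspace }}/nltk_data`.

```python
from collections import ChainMap


def effective_env(step: dict, job: dict, workflow: dict) -> dict:
    """Merge env scopes so that step overrides job, and job overrides workflow."""
    # ChainMap searches its maps left to right, so the first hit wins.
    return dict(ChainMap(step, job, workflow))


workflow_env = {"NLTK_DATA": "/workspace/nltk_data"}  # placeholder path
job_env = {}  # jobs no longer repeat NLTK_DATA after this change
step_env = {"CI": "true"}

merged = effective_env(step_env, job_env, workflow_env)
print(merged["NLTK_DATA"])  # /workspace/nltk_data
```

Under this model, removing the per-job definitions leaves the value each job sees unchanged, since all of the deleted entries were identical to the workflow-level one.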

Diff for: CHANGELOG.md

+2 -1
@@ -1,7 +1,8 @@
-## 0.16.1-dev5
+## 0.16.1-dev6

 ### Enhancements

+* **Bump `unstructured-inference` to 0.7.39** and upgrade other dependencies
 * **Round coordinates** Round coordinates when computing bounding box overlaps in `pdfminer_processing.py` to nearest machine precision. This can help reduce underterministic behavior from machine precision that affects which bounding boxes to combine.

 ### Features
Diff for: requirements/base.txt

+5 -5
@@ -4,7 +4,7 @@
 #
 # pip-compile ./base.in
 #
-anyio==4.6.0
+anyio==4.6.2.post1
 # via httpx
 backoff==2.2.1
 # via -r ./base.in
@@ -20,15 +20,15 @@ cffi==1.17.1
 # via cryptography
 chardet==5.2.0
 # via -r ./base.in
-charset-normalizer==3.3.2
+charset-normalizer==3.4.0
 # via
 # requests
 # unstructured-client
 click==8.1.7
 # via
 # nltk
 # python-oxmsg
-cryptography==43.0.1
+cryptography==43.0.3
 # via unstructured-client
 dataclasses-json==0.6.7
 # via
@@ -62,7 +62,7 @@ langdetect==1.0.9
 # via -r ./base.in
 lxml==5.3.0
 # via -r ./base.in
-marshmallow==3.22.0
+marshmallow==3.23.0
 # via
 # dataclasses-json
 # unstructured-client
@@ -84,7 +84,7 @@ packaging==24.1
 # via
 # marshmallow
 # unstructured-client
-psutil==6.0.0
+psutil==6.1.0
 # via -r ./base.in
 pycparser==2.22
 # via cffi

Diff for: requirements/dev.txt

+4 -4
@@ -4,7 +4,7 @@
 #
 # pip-compile ./dev.in
 #
-build==1.2.2
+build==1.2.2.post1
 # via pip-tools
 cfgv==3.4.0
 # via pre-commit
@@ -13,7 +13,7 @@ click==8.1.7
 # -c ./base.txt
 # -c ./test.txt
 # pip-tools
-distlib==0.3.8
+distlib==0.3.9
 # via virtualenv
 filelock==3.16.1
 # via virtualenv
@@ -36,7 +36,7 @@ platformdirs==4.3.6
 # via
 # -c ./test.txt
 # virtualenv
-pre-commit==3.8.0
+pre-commit==4.0.1
 # via -r ./dev.in
 pyproject-hooks==1.2.0
 # via
@@ -51,7 +51,7 @@ tomli==2.0.2
 # -c ./test.txt
 # build
 # pip-tools
-virtualenv==20.26.6
+virtualenv==20.27.0
 # via pre-commit
 wheel==0.44.0
 # via pip-tools

Diff for: requirements/extra-epub.txt

+1 -1
@@ -4,5 +4,5 @@
 #
 # pip-compile ./extra-epub.in
 #
-pypandoc==1.13
+pypandoc==1.14
 # via -r ./extra-epub.in

Diff for: requirements/extra-odt.txt

+1 -1
@@ -8,7 +8,7 @@ lxml==5.3.0
 # via
 # -c ./base.txt
 # python-docx
-pypandoc==1.13
+pypandoc==1.14
 # via -r ./extra-odt.in
 python-docx==1.1.2
 # via -r ./extra-odt.in

Diff for: requirements/extra-paddleocr.txt

+6 -6
@@ -4,7 +4,7 @@
 #
 # pip-compile ./extra-paddleocr.in
 #
-anyio==4.6.0
+anyio==4.6.2.post1
 # via
 # -c ./base.txt
 # httpx
@@ -16,7 +16,7 @@ certifi==2024.8.30
 # httpcore
 # httpx
 # requests
-charset-normalizer==3.3.2
+charset-normalizer==3.4.0
 # via
 # -c ./base.txt
 # requests
@@ -52,7 +52,7 @@ idna==3.10
 # anyio
 # httpx
 # requests
-imageio==2.35.1
+imageio==2.36.0
 # via
 # imgaug
 # scikit-image
@@ -104,7 +104,7 @@ paddlepaddle==3.0.0b1
 # via -r ./extra-paddleocr.in
 pdf2image==1.17.0
 # via unstructured-paddleocr
-pillow==10.4.0
+pillow==11.0.0
 # via
 # imageio
 # imgaug
@@ -117,9 +117,9 @@ protobuf==4.25.5
 # via
 # -c ././deps/constraints.txt
 # paddlepaddle
-pyclipper==1.3.0.post5
+pyclipper==1.3.0.post6
 # via unstructured-paddleocr
-pyparsing==3.1.4
+pyparsing==3.2.0
 # via matplotlib
 python-dateutil==2.9.0.post0
 # via

Diff for: requirements/extra-pandoc.txt

+1 -1
@@ -4,5 +4,5 @@
 #
 # pip-compile ./extra-pandoc.in
 #
-pypandoc==1.13
+pypandoc==1.14
 # via -r ./extra-pandoc.in

Diff for: requirements/extra-pdf-image.in

+1 -1
@@ -11,5 +11,5 @@ google-cloud-vision
 effdet
 # Do not move to constraints.in, otherwise unstructured-inference will not be upgraded
 # when unstructured library is.
-unstructured-inference==0.7.36
+unstructured-inference==0.8.0
 unstructured.pytesseract>=0.3.12
