Commit 180c55d

feat/improve ingest tests (#119)
* Use python script to check diff in generated files
* Add overwrite logic
* remove text evaluation code
* don't track skipped-files.txt
* Add back in cleanup in s3 test
* remove metrics
* add more print statements to test script
* bump client dep
* Update all ingest tests to use api for partitioning
* Add pip freeze to the end of both cache github actions
* print pip freeze before ingest tests
* tidy shell
* Explicitly install client for base cache
* update deps
* fix script to check for diffs
* omit compression test
* Fix mixedbread test
* Fix mixedbread test
* bump create pull request GH action
* Update ingest test fixtures (#127)
  Co-authored-by: rbiseck3 <[email protected]>
* tidy shell
* Add in chroma dep
* Don't install local partition deps in CI runners
* Add back in numpy dep constraints
* Bump changelog
* Add back in kdbai dep file
* Also ignore text_as_html metadata in ingest tests
* fix typo

---------

Co-authored-by: Unstructured-DevOps <[email protected]>
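
The script the commit message refers to is not among the files shown on this page. As a rough illustration only, a check for differences in generated files that also ignores `text_as_html` metadata could look like the sketch below. The function names, the directory layout, and the assumption that each output file is a JSON list of element dicts with a `metadata` mapping are all hypothetical and are not taken from the commit.

```python
import json
from pathlib import Path

# Assumption: the only ignored key is the one named in the commit message.
IGNORED_METADATA_KEYS = {"text_as_html"}


def normalize(elements):
    """Return a copy of the element list with ignored metadata keys removed."""
    cleaned = []
    for element in elements:
        element = dict(element)
        metadata = dict(element.get("metadata", {}))
        for key in IGNORED_METADATA_KEYS:
            metadata.pop(key, None)
        element["metadata"] = metadata
        cleaned.append(element)
    return cleaned


def find_diffs(expected_dir, actual_dir):
    """Return sorted relative paths that differ or exist on only one side."""
    expected_dir, actual_dir = Path(expected_dir), Path(actual_dir)
    expected = {p.relative_to(expected_dir) for p in expected_dir.rglob("*.json")}
    actual = {p.relative_to(actual_dir) for p in actual_dir.rglob("*.json")}
    # Files present in only one of the two trees always count as differences.
    differing = set(expected ^ actual)
    for rel in expected & actual:
        left = normalize(json.loads((expected_dir / rel).read_text()))
        right = normalize(json.loads((actual_dir / rel).read_text()))
        if left != right:
            differing.add(rel)
    return sorted(str(p) for p in differing)
```

A test script could call `find_diffs` with the expected-output directory and a freshly generated one, and fail when the returned list is non-empty.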
1 parent 2b91890 commit 180c55d

223 files changed: +14626 -5791 lines changed

.github/actions/base-cache/action.yml

Lines changed: 1 addition & 0 deletions
@@ -37,6 +37,7 @@ runs:
  fi
  make install-base
  make install-ci
+ make install-client
  - name: Save Cache
  if: steps.virtualenv-cache-restore.outputs.cache-hit != 'true'
  id: virtualenv-cache-save

.github/workflows/e2e.yml

Lines changed: 3 additions & 18 deletions
@@ -79,6 +79,7 @@ jobs:
  SHAREPOINT_PERMISSIONS_TENANT: ${{secrets.SHAREPOINT_PERMISSIONS_TENANT}}
  SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
  UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
+ UNS_PAID_API_KEY: ${{ secrets.UNS_PAID_API_KEY }}
  NOTION_API_KEY: ${{ secrets.NOTION_API_KEY }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
@@ -95,17 +96,9 @@ jobs:
  CI: "true"
  run: |
  source .venv/bin/activate
- sudo apt-get update
- sudo apt-get install -y libmagic-dev poppler-utils libreoffice
- make install-pandoc
- sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
- sudo apt-get update
- sudo apt-get install -y tesseract-ocr
- sudo apt-get install -y tesseract-ocr-kor
- sudo apt-get install diffstat
- tesseract --version
  sudo make install-docker-compose
  docker compose version
+ pip freeze
  ./test_e2e/test-src.sh

  test_src_api:
@@ -192,15 +185,7 @@ jobs:
  CI: "true"
  run: |
  source .venv/bin/activate
- sudo apt-get update
- sudo apt-get install -y libmagic-dev poppler-utils libreoffice
- make install-pandoc
- sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
- sudo apt-get update
- sudo apt-get install -y tesseract-ocr
- sudo apt-get install -y tesseract-ocr-kor
- sudo apt-get install diffstat
- tesseract --version
  sudo make install-docker-compose
  docker compose version
+ pip freeze
  ./test_e2e/test-dest.sh

.github/workflows/ingest-test-fixtures-update-pr.yml

Lines changed: 3 additions & 9 deletions
@@ -71,6 +71,7 @@ jobs:
  SHAREPOINT_PERMISSIONS_TENANT: ${{secrets.SHAREPOINT_PERMISSIONS_TENANT}}
  SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
  UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
+ UNS_PAID_API_KEY: ${{ secrets.UNS_PAID_API_KEY }}
  NOTION_API_KEY: ${{ secrets.NOTION_API_KEY }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
@@ -85,15 +86,9 @@ jobs:
  CI: "true"
  run: |
  source .venv/bin/activate
- sudo apt-get update
- sudo apt-get install -y libmagic-dev poppler-utils libreoffice
- make install-pandoc
- sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
- sudo apt-get install -y tesseract-ocr
- sudo apt-get install -y tesseract-ocr-kor
- tesseract --version
  sudo make install-docker-compose
  docker compose version
+ pip freeze
  ./test_e2e/test-src.sh

  - name: Save branch name to environment file
@@ -114,12 +109,11 @@ jobs:
  echo "PR_NAME=$pr_name" >> $GITHUB_ENV

  - name: Create Pull Request
- uses: peter-evans/create-pull-request@v5
+ uses: peter-evans/create-pull-request@v7
  with:
  token: ${{ secrets.GH_CREATE_PR_TOKEN }}
  add-paths: |
  test_e2e/expected-structured-output
- test_e2e/metrics
  commit-message: "Update ingest test fixtures"
  branch: ${{ env.BRANCH_NAME }}
  title: "${{ env.PR_NAME }} <- Ingest test fixtures update"

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
## 0.0.15-dev3
1+
## 0.0.15-dev4
22

33
### Fixes
44

Makefile

Lines changed: 0 additions & 4 deletions
@@ -43,10 +43,6 @@ install-all-embedders:
  install-all-deps:
  find requirements -type f -name "*.txt" ! -name "constraints.txt" -exec pip install -r '{}' ';'

- .PHONY: install-pandoc
- install-pandoc:
- ARCH=${ARCH} ./scripts/install-pandoc.sh
-
  .PHONY: install-docker-compose
  install-docker-compose:
  ARCH=${ARCH} ./scripts/install-docker-compose.sh

requirements/common/base.txt

Lines changed: 11 additions & 18 deletions
@@ -11,13 +11,7 @@ click==8.1.7
  dataclasses-json==0.6.7
  # via -r ./requirements/common/base.in
  deprecated==1.2.14
- # via
- # opentelemetry-api
- # opentelemetry-semantic-conventions
- importlib-metadata==7.1.0
- # via
- # -c ./requirements/common/constraints.txt
- # opentelemetry-api
+ # via opentelemetry-api
  marshmallow==3.22.0
  # via dataclasses-json
  mypy-extensions==1.0.0
@@ -26,29 +20,27 @@ numpy==1.26.4
  # via
  # -c ./requirements/common/constraints.txt
  # pandas
- opentelemetry-api==1.26.0
- # via
- # opentelemetry-sdk
- # opentelemetry-semantic-conventions
- opentelemetry-sdk==1.26.0
+ opentelemetry-api==1.16.0
+ # via opentelemetry-sdk
+ opentelemetry-sdk==1.16.0
  # via -r ./requirements/common/base.in
- opentelemetry-semantic-conventions==0.47b0
+ opentelemetry-semantic-conventions==0.37b0
  # via opentelemetry-sdk
  packaging==23.2
  # via
  # -c ./requirements/common/constraints.txt
  # marshmallow
  pandas==2.2.2
  # via -r ./requirements/common/base.in
- pydantic==2.8.2
+ pydantic==2.9.1
  # via -r ./requirements/common/base.in
- pydantic-core==2.20.1
+ pydantic-core==2.23.3
  # via pydantic
  python-dateutil==2.9.0.post0
  # via
  # -r ./requirements/common/base.in
  # pandas
- pytz==2024.1
+ pytz==2024.2
  # via pandas
  six==1.16.0
  # via python-dateutil
@@ -68,5 +60,6 @@ wrapt==1.16.0
  # via
  # -c ./requirements/common/constraints.txt
  # deprecated
- zipp==3.20.1
- # via importlib-metadata
+
+ # The following packages are considered to be unsafe in a requirements file:
+ # setuptools

requirements/common/constraints.txt

Lines changed: 13 additions & 47 deletions
@@ -5,62 +5,28 @@
  ####################################################################################################
  # consistency with local-inference-pin
  protobuf<4.24
- # NOTE(robinson): - Required pins for security scans
- jupyter-core>=4.11.2
- wheel>=0.38.1
- # NOTE(robinson): - The following pins are to address
- # vulnerabilities in dependency scans
- certifi>=2023.7.22
- # From pycocotools in local-inference
- pyparsing<3.1.0
- scipy<1.11.4
- IPython<8.13
- # NOTE(alan): Pinned to avoid error that occurs with 2.4.3:
- # AttributeError: 'ResourcePath' object has no attribute 'collection'
- Office365-REST-Python-Client<2.4.3
- # NOTE(trevor): `unstructured-inference` is set in extra-pdf-image.in to allow
- # unstructured-inference to be upgraded when unstructured library is upgraded
- # https://github.com/Unstructured-IO/unstructured/issues/1458
- # unstructured-inference
- # use the known compatible version of weaviate and unstructured.pytesseract
- unstructured.pytesseract>=0.3.12
- weaviate-client>3.25.0
- # NOTE(yuming): pining to avoid conflict with paddle install
- matplotlib==3.7.2
- # langchain limits anyio to below 4.0
- anyio<4.0
- # NOTE(crag): earlier versions fail in compilation step when pip installing the package
- pycocotools>=2.0.7
- # NOTE(crag): python3.8-python3.11 compat (if it ends up being required)
- torch>2
- # pinned in unstructured paddleocr
- opencv-python==4.8.0.76
- opencv-contrib-python==4.8.0.76
- platformdirs==3.10.0
-
+ grpcio>=1.65.5
+ # TODO: Pinned in transformers package, remove when that gets updated
+ tokenizers>=0.19,<0.20
+ # TODO: Constaint due to boto, with python before 3.10 not requiring openssl 1.1.1, remove when that gets
+ # updated or we drop support for 3.9
+ urllib3<1.27
+ # TODO: Constriant due to aiobotocore, remove when that gets updates:
+ botocore<1.34.132
+ # TODO: Constriant due to both 8.5.0 and 8.4.0 being installed during pip-compile
+ importlib-metadata>=8.5.0
  # TODO: Constraint due to langchain, remove when that gets updated:
  packaging<24.0
-
  # TODO: Constraint due to boto, with python before 3.10 not requiring openssl 1.1.1, remove when that gets
  # updated or we drop support for 3.9
  urllib3<1.27
-
- # TODO: Constraint due to aiobotocore, remove when that gets updates:
- botocore<1.34.52
-
- # NOTE(jennings): pinned due to later versions not supporting api_key_auth in UnstructuredClient
- unstructured-client>=0.15.1
-
+ unstructured-client>= 0.25.8
  fsspec==2024.5.0
-
  # python 3.12 support
  wrapt>=1.14.0
-
  langchain-community>=0.2.5
-
- # NOTE(robinson): choma was pinned to importlib-metadata>=7.1.0 but 7.1.0 was installed
+ # NOTE(robinson): chroma was pinned to importlib-metadata>=7.1.0 but 7.1.0 was installed
  # instead of 7.2.0. Need to investigate
  importlib-metadata==7.1.0
-
- unstructured==0.15.8
+ unstructured==0.15.10
  numpy<2

requirements/connectors/airtable.txt

Lines changed: 15 additions & 24 deletions
@@ -6,26 +6,18 @@
  #
  annotated-types==0.7.0
  # via pydantic
- certifi==2024.7.4
- # via
- # -c ./requirements/connectors/../common/constraints.txt
- # requests
+ certifi==2024.8.30
+ # via requests
  charset-normalizer==3.3.2
  # via requests
  click==8.1.7
  # via -r ./requirements/connectors/../common/base.in
  dataclasses-json==0.6.7
  # via -r ./requirements/connectors/../common/base.in
  deprecated==1.2.14
- # via
- # opentelemetry-api
- # opentelemetry-semantic-conventions
- idna==3.8
+ # via opentelemetry-api
+ idna==3.10
  # via requests
- importlib-metadata==7.1.0
- # via
- # -c ./requirements/connectors/../common/constraints.txt
- # opentelemetry-api
  inflection==0.5.1
  # via pyairtable
  marshmallow==3.22.0
@@ -36,13 +28,11 @@ numpy==1.26.4
  # via
  # -c ./requirements/connectors/../common/constraints.txt
  # pandas
- opentelemetry-api==1.26.0
- # via
- # opentelemetry-sdk
- # opentelemetry-semantic-conventions
- opentelemetry-sdk==1.26.0
+ opentelemetry-api==1.16.0
+ # via opentelemetry-sdk
+ opentelemetry-sdk==1.16.0
  # via -r ./requirements/connectors/../common/base.in
- opentelemetry-semantic-conventions==0.47b0
+ opentelemetry-semantic-conventions==0.37b0
  # via opentelemetry-sdk
  packaging==23.2
  # via
@@ -52,17 +42,17 @@ pandas==2.2.2
  # via -r ./requirements/connectors/../common/base.in
  pyairtable==2.3.3
  # via -r ./requirements/connectors/airtable.in
- pydantic==2.8.2
+ pydantic==2.9.1
  # via
  # -r ./requirements/connectors/../common/base.in
  # pyairtable
- pydantic-core==2.20.1
+ pydantic-core==2.23.3
  # via pydantic
  python-dateutil==2.9.0.post0
  # via
  # -r ./requirements/connectors/../common/base.in
  # pandas
- pytz==2024.1
+ pytz==2024.2
  # via pandas
  requests==2.32.3
  # via pyairtable
@@ -81,7 +71,7 @@ typing-inspect==0.9.0
  # via dataclasses-json
  tzdata==2024.1
  # via pandas
- urllib3==1.26.19
+ urllib3==1.26.20
  # via
  # -c ./requirements/connectors/../common/constraints.txt
  # pyairtable
@@ -90,5 +80,6 @@ wrapt==1.16.0
  # via
  # -c ./requirements/connectors/../common/constraints.txt
  # deprecated
- zipp==3.20.1
- # via importlib-metadata
+
+ # The following packages are considered to be unsafe in a requirements file:
+ # setuptools
