Commit 2279655

feat/migrate embedders over (#92)
* migrate embedders over
* update init in embedders
* Use aliases for model name
* fix chroma dest test approach
* Add embed unit tests
* Remove unstructured from embed requirements
* remove use of unstructured in embedders
* update v1 use of embedder
* update unit tests
* Fix small typo
* improve pinecone dest test
* Add additional logging to pinecone
* bring over the mixedbreadai src test
* Add missing column in singlestore test
* Add mixedbread ai requirements
* expose mixbread secret in CI
1 parent c13b3f0 commit 2279655

51 files changed, with 1291 additions and 970 deletions.

.github/workflows/e2e.yml

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ jobs:
 AZURE_SEARCH_API_KEY: ${{ secrets.AZURE_SEARCH_API_KEY }}
 OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
 OCTOAI_API_KEY: ${{ secrets.OCTOAI_API_KEY }}
-PINECONE_API_KEY: ${{secrets.PINECONE_API_KEY}}
+PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
 ASTRA_DB_APPLICATION_TOKEN: ${{ secrets.ASTRA_DB_APPLICATION_TOKEN }}
 ASTRA_DB_API_ENDPOINT: ${{ secrets.ASTRA_DB_ENDPOINT }}
 MXBAI_API_KEY: ${{ secrets.MXBAI_API_KEY }}
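
The workflow hunk above passes secrets such as `MXBAI_API_KEY` into the job environment. As a rough sketch of how a test might consume one of them, the helper below reads a variable and fails loudly when it is missing. The name `require_env` is hypothetical and is not taken from the repository; only the variable name comes from the workflow.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, raising if it is unset or blank.

    Hypothetical helper for illustration only; the repository may handle
    missing secrets differently (for example by skipping the test).
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


if __name__ == "__main__":
    # In CI the value would come from `secrets.MXBAI_API_KEY`; here a
    # placeholder is set so the sketch runs on its own.
    os.environ.setdefault("MXBAI_API_KEY", "placeholder-key")
    print(len(require_env("MXBAI_API_KEY")) > 0)
```

Failing early with a clear message tends to be easier to debug in CI than an authentication error raised deep inside a client library.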

CHANGELOG.md

Lines changed: 2 additions & 6 deletions
@@ -1,8 +1,9 @@
-## 0.0.14-dev1
+## 0.0.14
 
 ### Enhancements
 
 * **Support async batch uploads for pinecone connector**
+* **Migrate embedders** Move embedder implementations from the open source unstructured repo into this one.
 
 ### Fixes
 
@@ -28,17 +29,12 @@
 
 * **Fix OpenSearch connector** OpenSearch connector did not work when `http_auth` was not provided
 
-### Fixes
-
 ## 0.0.10
 
 ### Enhancements
 
 * "Fix tar extraction" - tar extraction function assumed archive was gzip compressed which isn't true for supported `.tar` archives. Updated to work for both compressed and uncompressed tar archives.
 
-### Fixes
-
-
 ## 0.0.9
 
 ### Enhancements

requirements/embed/aws-bedrock.in

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 -r ../common/base.in
--r ./common/base.in
+-c ../common/constraints.txt
 
 boto3
 langchain-community
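
The hunk above replaces a second `-r` include with a `-c` reference. In pip's requirements file format, `-r` pulls in every requirement listed in another file, while `-c` only constrains the versions of packages that are required elsewhere. The sketch below separates the two kinds of lines; `classify_requirements` is a hypothetical helper written for illustration, and it handles only the short `-r`/`-c` forms and whole-line comments.

```python
from typing import Dict, List


def classify_requirements(text: str) -> Dict[str, List[str]]:
    """Split a requirements/.in file into includes, constraints, and packages."""
    result: Dict[str, List[str]] = {"includes": [], "constraints": [], "packages": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        if line.startswith("-r "):
            result["includes"].append(line[3:].strip())
        elif line.startswith("-c "):
            result["constraints"].append(line[3:].strip())
        else:
            result["packages"].append(line)
    return result


# Contents of aws-bedrock.in after the change, as shown in the diff.
NEW_AWS_BEDROCK_IN = """\
-r ../common/base.in
-c ../common/constraints.txt

boto3
langchain-community
"""

if __name__ == "__main__":
    print(classify_requirements(NEW_AWS_BEDROCK_IN))
```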

requirements/embed/aws-bedrock.txt

Lines changed: 42 additions & 127 deletions
@@ -2,7 +2,7 @@
 # This file is autogenerated by pip-compile with Python 3.9
 # by the following command:
 #
-# pip-compile ./requirements/embed/aws-bedrock.in
+# pip-compile ./aws-bedrock.in
 #
 aiohappyeyeballs==2.4.0
 # via aiohttp
@@ -16,60 +16,41 @@ annotated-types==0.7.0
 # via pydantic
 anyio==3.7.1
 # via
-# -c ./requirements/embed/../common/constraints.txt
+# -c ./../common/constraints.txt
 # httpx
 async-timeout==4.0.3
 # via
 # aiohttp
 # langchain
 attrs==24.2.0
 # via aiohttp
-backoff==2.2.1
-# via unstructured
-beautifulsoup4==4.12.3
-# via unstructured
 boto3==1.34.51
-# via -r ./requirements/embed/aws-bedrock.in
+# via -r ./aws-bedrock.in
 botocore==1.34.51
 # via
-# -c ./requirements/embed/../common/constraints.txt
+# -c ./../common/constraints.txt
 # boto3
 # s3transfer
-certifi==2024.7.4
+certifi==2024.8.30
 # via
-# -c ./requirements/embed/../common/constraints.txt
+# -c ./../common/constraints.txt
 # httpcore
 # httpx
 # requests
-# unstructured-client
-chardet==5.2.0
-# via unstructured
 charset-normalizer==3.3.2
-# via
-# requests
-# unstructured-client
+# via requests
 click==8.1.7
-# via
-# -r ./requirements/embed/../common/base.in
-# nltk
+# via -r ./../common/base.in
 dataclasses-json==0.6.7
 # via
-# -r ./requirements/embed/../common/base.in
+# -r ./../common/base.in
 # langchain-community
-# unstructured
-# unstructured-client
-deepdiff==7.0.1
-# via unstructured-client
 deprecated==1.2.14
 # via
 # opentelemetry-api
 # opentelemetry-semantic-conventions
-emoji==2.12.1
-# via unstructured
 exceptiongroup==1.2.2
 # via anyio
-filetype==1.2.0
-# via unstructured
 frozenlist==1.4.1
 # via
 # aiohttp
@@ -79,207 +60,141 @@ h11==0.14.0
 httpcore==1.0.5
 # via httpx
 httpx==0.27.2
-# via
-# langsmith
-# unstructured-client
+# via langsmith
 idna==3.8
 # via
 # anyio
 # httpx
 # requests
-# unstructured-client
 # yarl
 importlib-metadata==7.1.0
 # via
-# -c ./requirements/embed/../common/constraints.txt
+# -c ./../common/constraints.txt
 # opentelemetry-api
 jmespath==1.0.1
 # via
 # boto3
 # botocore
-joblib==1.4.2
-# via nltk
 jsonpatch==1.33
 # via langchain-core
-jsonpath-python==1.0.6
-# via unstructured-client
 jsonpointer==3.0.0
 # via jsonpatch
-langchain==0.2.14
+langchain==0.2.16
 # via langchain-community
-langchain-community==0.2.12
+langchain-community==0.2.16
 # via
-# -c ./requirements/embed/../common/constraints.txt
-# -r ./requirements/embed/aws-bedrock.in
-langchain-core==0.2.35
+# -c ./../common/constraints.txt
+# -r ./aws-bedrock.in
+langchain-core==0.2.39
 # via
 # langchain
 # langchain-community
 # langchain-text-splitters
-langchain-text-splitters==0.2.2
+langchain-text-splitters==0.2.4
 # via langchain
-langdetect==1.0.9
-# via unstructured
-langsmith==0.1.104
+langsmith==0.1.117
 # via
 # langchain
 # langchain-community
 # langchain-core
-lxml==5.3.0
-# via unstructured
 marshmallow==3.22.0
-# via
-# dataclasses-json
-# unstructured-client
-multidict==6.0.5
+# via dataclasses-json
+multidict==6.1.0
 # via
 # aiohttp
 # yarl
 mypy-extensions==1.0.0
-# via
-# typing-inspect
-# unstructured-client
-nest-asyncio==1.6.0
-# via unstructured-client
-nltk==3.9.1
-# via unstructured
+# via typing-inspect
 numpy==1.26.4
 # via
-# -c ./requirements/embed/../common/constraints.txt
+# -c ./../common/constraints.txt
 # langchain
 # langchain-community
 # pandas
-# unstructured
-opentelemetry-api==1.26.0
+opentelemetry-api==1.27.0
 # via
 # opentelemetry-sdk
 # opentelemetry-semantic-conventions
-opentelemetry-sdk==1.26.0
-# via -r ./requirements/embed/../common/base.in
-opentelemetry-semantic-conventions==0.47b0
+opentelemetry-sdk==1.27.0
+# via -r ./../common/base.in
+opentelemetry-semantic-conventions==0.48b0
 # via opentelemetry-sdk
-ordered-set==4.1.0
-# via deepdiff
 orjson==3.10.7
 # via langsmith
 packaging==23.2
 # via
-# -c ./requirements/embed/../common/constraints.txt
+# -c ./../common/constraints.txt
 # langchain-core
 # marshmallow
-# unstructured-client
 pandas==2.2.2
-# via -r ./requirements/embed/../common/base.in
-psutil==6.0.0
-# via unstructured
-pydantic==2.8.2
+# via -r ./../common/base.in
+pydantic==2.9.1
 # via
-# -r ./requirements/embed/../common/base.in
+# -r ./../common/base.in
 # langchain
 # langchain-core
 # langsmith
-pydantic-core==2.20.1
+pydantic-core==2.23.3
 # via pydantic
-pypdf==4.3.1
-# via unstructured-client
 python-dateutil==2.9.0.post0
 # via
-# -r ./requirements/embed/../common/base.in
+# -r ./../common/base.in
 # botocore
 # pandas
-# unstructured-client
-python-iso639==2024.4.27
-# via unstructured
-python-magic==0.4.27
-# via unstructured
-pytz==2024.1
+pytz==2024.2
 # via pandas
 pyyaml==6.0.2
 # via
 # langchain
 # langchain-community
 # langchain-core
-rapidfuzz==3.9.6
-# via unstructured
-regex==2024.7.24
-# via nltk
 requests==2.32.3
 # via
 # langchain
 # langchain-community
 # langsmith
-# requests-toolbelt
-# unstructured
-# unstructured-client
-requests-toolbelt==1.0.0
-# via unstructured-client
 s3transfer==0.10.2
 # via boto3
 six==1.16.0
-# via
-# langdetect
-# python-dateutil
-# unstructured-client
+# via python-dateutil
 sniffio==1.3.1
 # via
 # anyio
 # httpx
-soupsieve==2.6
-# via beautifulsoup4
-sqlalchemy==2.0.32
+sqlalchemy==2.0.34
 # via
 # langchain
 # langchain-community
-tabulate==0.9.0
-# via unstructured
 tenacity==8.5.0
 # via
 # langchain
 # langchain-community
 # langchain-core
 tqdm==4.66.5
-# via
-# -r ./requirements/embed/../common/base.in
-# nltk
-# unstructured
+# via -r ./../common/base.in
 typing-extensions==4.12.2
 # via
-# emoji
 # langchain-core
+# multidict
 # opentelemetry-sdk
 # pydantic
 # pydantic-core
-# pypdf
 # sqlalchemy
 # typing-inspect
-# unstructured
-# unstructured-client
 typing-inspect==0.9.0
-# via
-# dataclasses-json
-# unstructured-client
+# via dataclasses-json
 tzdata==2024.1
 # via pandas
-unstructured==0.15.7
-# via
-# -c ./requirements/embed/../common/constraints.txt
-# -r ./requirements/embed/./common/base.in
-unstructured-client==0.25.5
-# via
-# -c ./requirements/embed/../common/constraints.txt
-# unstructured
-urllib3==1.26.19
+urllib3==1.26.20
 # via
-# -c ./requirements/embed/../common/constraints.txt
+# -c ./../common/constraints.txt
 # botocore
 # requests
-# unstructured-client
 wrapt==1.16.0
 # via
-# -c ./requirements/embed/../common/constraints.txt
+# -c ./../common/constraints.txt
 # deprecated
-# unstructured
-yarl==1.9.4
+yarl==1.11.1
 # via aiohttp
 zipp==3.20.1
 # via importlib-metadata
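
The compiled file above no longer pins `unstructured` or `unstructured-client`. One way to guard against such pins creeping back in is a small check over the compiled text. The following is only a sketch under stated assumptions: `parse_pins` and `find_forbidden` are hypothetical helpers, not part of the repository, and the two sample strings are short excerpts of the old and new file contents taken from the diff.

```python
import re
from typing import Dict, Iterable, List

# Matches pinned lines such as "boto3==1.34.51"; "# via ..." lines do not match.
PIN_RE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)==(\S+)")


def parse_pins(compiled: str) -> Dict[str, str]:
    """Map each pinned package name (lowercased) to its version."""
    pins: Dict[str, str] = {}
    for line in compiled.splitlines():
        match = PIN_RE.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def find_forbidden(compiled: str, forbidden: Iterable[str]) -> List[str]:
    """Return the forbidden package names that are still pinned, sorted."""
    pins = parse_pins(compiled)
    return sorted(name for name in forbidden if name.lower() in pins)


# Excerpts from the old and new aws-bedrock.txt as shown in the diff.
OLD_EXCERPT = """\
tzdata==2024.1
    # via pandas
unstructured==0.15.7
    # via -r ./requirements/embed/./common/base.in
unstructured-client==0.25.5
    # via unstructured
urllib3==1.26.19
"""

NEW_EXCERPT = """\
tzdata==2024.1
    # via pandas
urllib3==1.26.20
"""

if __name__ == "__main__":
    forbidden = {"unstructured", "unstructured-client"}
    print(find_forbidden(OLD_EXCERPT, forbidden))
    print(find_forbidden(NEW_EXCERPT, forbidden))
```

A check like this could run over every file in `requirements/embed/`, though whether the project does so is not stated in the commit.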

requirements/embed/common/base.in

Lines changed: 0 additions & 1 deletion
This file was deleted.
