Commit 4682138

feat: add lexical search flag for Weaviate
Add enable_lexical_search flag to WeaviateUploadStagerConfig to indicate that the collection is configured for BM25 keyword search. Users must manually configure their Weaviate collection schema with invertedIndexConfig.bm25 settings. See inline documentation for schema example. This flag is declarative - it documents user intent rather than controlling automatic schema creation.
1 parent ddd4b44 commit 4682138

5 files changed: +53 -13 lines changed

REPO_UPDATES.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+# Repository Updates
+
+## 2026-01-30 09:15:00
+
+- **fix**: Remove SFTP indexer run override (commit: 060bd52)
+- **fix**: SFTP protocol needs default (commit: ba493ee)
+
+## 2026-01-28 10:00:01
+
+- **feat**: Teradata Source and Destination Connector (commit: 577fd6b)
+- **feat**: SFTP connector improvements with better error handling (commit: 1fe1ecd, d3e8927)
+- **fix**: Fix AstraDB uploader timeouts for large files (commit: e6de207)
+- **fix**: Update Vectara token endpoint (commit: 64938d9)
+- **chore**: Remove Teradata charset setting (commit: 4fd9f71)

pyproject.toml

Lines changed: 3 additions & 0 deletions
@@ -157,6 +157,9 @@ test = [
     "htmlBuilder",
     "office365-rest-python-client",
 ]
+dev = [
+    "pytest>=9.0.2",
+]
 
 [project.scripts]
 unstructured-ingest = "unstructured_ingest.main:main"

unstructured_ingest/processes/connectors/assets/weaviate_collection_config.json

Lines changed: 1 addition & 10 deletions
@@ -19,14 +19,5 @@
       "tokenization": "word"
     }
   ],
-  "vectorizer": "none",
-  "invertedIndexConfig": {
-    "bm25": {
-      "b": 0.75,
-      "k1": 1.2
-    },
-    "stopwords": {
-      "preset": "en"
-    }
-  }
+  "vectorizer": "none"
 }
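
For context on the two values removed from the default collection config, the sketch below shows how `k1` and `b` enter the textbook Okapi BM25 term score: `k1` controls how quickly repeated term occurrences saturate, and `b` controls how strongly longer documents are penalized. This is the standard formula, not a description of Weaviate's internals, which may differ in detail; `bm25_term_score` and its parameter names are hypothetical and exist only for illustration.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one term in one document (illustrative only).

    tf: occurrences of the term in the document
    doc_len / avg_doc_len: this document's length and the corpus average
    n_docs / doc_freq: corpus size and number of documents containing the term
    """
    # Inverse document frequency: rarer terms weigh more.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b=0 ignores document length, b=1 applies it fully.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    # k1 bounds the contribution of repeated occurrences of the term.
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)


# Same term frequency, but the second document is twice the average length.
print(bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10))
print(bm25_term_score(tf=1, doc_len=200, avg_doc_len=100, n_docs=1000, doc_freq=10))
```

With the defaults used here (`k1=1.2`, `b=0.75`, the same values shown in the removed config), the longer document receives the lower score.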

unstructured_ingest/processes/connectors/weaviate/weaviate.py

Lines changed: 27 additions & 2 deletions
@@ -31,6 +31,27 @@
 
 CONNECTOR_TYPE = "weaviate"
 
+# Schema Example for BM25 Lexical Search:
+# To enable BM25 keyword search, configure your Weaviate collection with invertedIndexConfig:
+#
+# {
+#   "class": "YourCollection",
+#   "properties": [...],
+#   "vectorizer": "none",
+#   "invertedIndexConfig": {
+#     "bm25": {
+#       "b": 0.75,
+#       "k1": 1.2
+#     },
+#     "stopwords": {
+#       "preset": "en"
+#     }
+#   }
+# }
+#
+# Then set enable_lexical_search=True in WeaviateUploadStagerConfig to indicate the
+# collection is configured for hybrid search (vector + keyword/BM25).
+
 
 class WeaviateAccessConfig(AccessConfig, ABC):
     pass
@@ -59,8 +80,12 @@ def get_client(self) -> Generator["WeaviateClient", None, None]:
 class WeaviateUploadStagerConfig(UploadStagerConfig):
     enable_lexical_search: bool = Field(
         default=False,
-        description="Enable BM25 keyword search. Text fields are indexed "
-        "for lexical search via Weaviate's inverted index with BM25 scoring.",
+        description=(
+            "Indicates that the Weaviate collection is configured for BM25 keyword search. "
+            "Users must manually configure their collection schema with invertedIndexConfig.bm25 "
+            "settings. See Weaviate documentation for BM25 configuration: "
+            "https://weaviate.io/developers/weaviate/config-refs/schema#invertedindexconfig--bm25"
+        ),
     )
 
 
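
Because the flag is declarative, nothing in this change verifies that the target collection actually carries BM25 settings. A minimal sketch of such a check follows, assuming the collection schema is available as a plain dict shaped like the example in the module comment. `has_bm25_config` is a hypothetical helper, not part of unstructured-ingest or the Weaviate client.

```python
from collections.abc import Mapping
from typing import Any


def has_bm25_config(schema: Mapping[str, Any]) -> bool:
    """Return True if the schema dict has numeric invertedIndexConfig.bm25 "b" and "k1".

    Hypothetical helper for illustration; not part of unstructured-ingest.
    """
    inverted = schema.get("invertedIndexConfig")
    if not isinstance(inverted, Mapping):
        return False
    bm25 = inverted.get("bm25")
    if not isinstance(bm25, Mapping):
        return False
    # bool is a subclass of int, so it is excluded explicitly.
    return all(
        isinstance(bm25.get(key), (int, float)) and not isinstance(bm25.get(key), bool)
        for key in ("b", "k1")
    )


# Schema mirroring the example in the module comment.
example_schema = {
    "class": "YourCollection",
    "properties": [],
    "vectorizer": "none",
    "invertedIndexConfig": {
        "bm25": {"b": 0.75, "k1": 1.2},
        "stopwords": {"preset": "en"},
    },
}

print(has_bm25_config(example_schema))
print(has_bm25_config({"class": "YourCollection", "vectorizer": "none"}))
```

A caller could run a check like this before setting `enable_lexical_search=True`, so the flag and the collection schema do not drift apart.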

uv.lock

Lines changed: 8 additions & 1 deletion
Generated file; diff not rendered.
