Commit 92306b5

refactor: document processing interface (#419)
1 parent 5852042 commit 92306b5

67 files changed (+1276 -1285 lines changed)

.gitignore

+2
@@ -98,3 +98,5 @@ chroma/
 qdrant/
 
 .aider*
+
+.DS_Store
@@ -0,0 +1,7 @@
+# Element Enrichers
+
+::: ragbits.document_search.ingestion.enrichers.router.ElementEnricherRouter
+
+::: ragbits.document_search.ingestion.enrichers.base.ElementEnricher
+
+::: ragbits.document_search.ingestion.enrichers.image.ImageElementEnricher
@@ -0,0 +1,11 @@
+# Document Parsers
+
+::: ragbits.document_search.ingestion.parsers.router.DocumentParserRouter
+
+::: ragbits.document_search.ingestion.parsers.base.DocumentParser
+
+::: ragbits.document_search.ingestion.parsers.base.TextDocumentParser
+
+::: ragbits.document_search.ingestion.parsers.base.ImageDocumentParser
+
+::: ragbits.document_search.ingestion.parsers.unstructured.UnstructuredDocumentParser

docs/api_reference/document_search/processing.md

-24
This file was deleted.

docs/how-to/document_search/ingest/strategies/async_processing.md

+1-1
@@ -3,7 +3,7 @@
 In Ragbits, a component called "processing execution strategy" controls how document processing is executed during ingestion. There are multiple execution strategies available in Ragbits that can be easily interchanged. You can also [create new custom execution strategies](create_custom_execution_strategy.md) to meet your specific needs.
 
 !!! note
-    It's important to note that processing execution strategies are a separate concept from processors. While the former manage how the processing is executed, the latter deals with the actual processing of documents. Processors are managed by [DocumentProcessorRouter][ragbits.document_search.ingestion.document_processor.DocumentProcessorRouter].
+    It's important to note that processing execution strategies are a separate concept from processors. While the former manage how the processing is executed, the latter deals with the actual processing of documents. Processors are managed by [DocumentParserRouter][ragbits.document_search.ingestion.parsers.router.DocumentParserRouter].
 
 ## The Synchronous Execution Strategy

docs/how-to/document_search/ingest/strategies/create_custom_execution_strategy.md

+9-9
@@ -19,16 +19,16 @@ from collections.abc import Sequence
 
 from ragbits.document_search.documents.document import Document, DocumentMeta, Source
 from ragbits.document_search.documents.element import Element
-from ragbits.document_search.ingestion.document_processor import DocumentProcessorRouter
+from ragbits.document_search.ingestion.parsers.router import DocumentParserRouter
 from ragbits.document_search.ingestion.strategies import IngestStrategy
-from ragbits.document_search.ingestion.providers.base import BaseProvider
+from ragbits.document_search.ingestion.parsers.base import DocumentParser
 
 class DelayedExecutionStrategy(IngestStrategy):
     async def process_documents(
         self,
         documents: Sequence[DocumentMeta | Document | Source],
-        processor_router: DocumentProcessorRouter,
-        processor_overwrite: BaseProvider | None = None,
+        processor_router: DocumentParserRouter,
+        processor_overwrite: DocumentParser | None = None,
     ) -> list[Element]:
         elements = []
         for document in documents:
@@ -48,24 +48,24 @@ from collections.abc import Sequence
 
 from ragbits.document_search.documents.document import Document, DocumentMeta, Source
 from ragbits.document_search.documents.element import Element
-from ragbits.document_search.ingestion.document_processor import DocumentProcessorRouter
+from ragbits.document_search.ingestion.parsers.router import DocumentParserRouter
 from ragbits.document_search.ingestion.strategies import IngestStrategy
-from ragbits.document_search.ingestion.providers.base import BaseProvider
+from ragbits.document_search.ingestion.parsers.base import DocumentParser
 
 class DelayedExecutionStrategy(IngestStrategy):
     async def process_documents(
         self,
         documents: Sequence[DocumentMeta | Document | Source],
-        processor_router: DocumentProcessorRouter,
-        processor_overwrite: BaseProvider | None = None,
+        processor_router: DocumentParserRouter,
+        processor_overwrite: DocumentParser | None = None,
     ) -> list[Element]:
         elements = []
         for document in documents:
             # Convert the document to DocumentMeta
             document_meta = await self.to_document_meta(document)
 
             # Get the processor for the document
-            processor = processor_overwrite or processor_router.get_provider(document)
+            processor = processor_overwrite or processor_router.get(document)
 
             await asyncio.sleep(1)
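The custom strategy in this file chooses a parser per document: an explicit override wins, otherwise the router decides. The following is a minimal, self-contained sketch of that selection logic. It deliberately avoids importing ragbits: `StubParser`, `StubRouter`, the `parse` method, and the extension-based lookup are all hypothetical stand-ins chosen for illustration, not the library's actual classes or signatures.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubParser:
    """Hypothetical stand-in for a document parser (not the ragbits class)."""

    name: str

    async def parse(self, document: str) -> list[str]:
        # Pretend each document yields one element tagged with the parser name.
        return [f"{self.name}:{document}"]


class StubRouter:
    """Hypothetical stand-in for a router that maps file extensions to parsers."""

    def __init__(self, parsers: dict[str, StubParser]) -> None:
        self._parsers = parsers

    def get(self, document: str) -> StubParser:
        extension = document.rsplit(".", 1)[-1]
        if extension not in self._parsers:
            raise ValueError(f"No parser registered for .{extension}")
        return self._parsers[extension]


async def process_documents(documents, router, parser_overwrite=None, delay=0.0):
    """Process documents one by one, sleeping `delay` seconds before each."""
    elements = []
    for document in documents:
        # The explicit override takes precedence over the routed parser.
        parser = parser_overwrite or router.get(document)
        await asyncio.sleep(delay)
        elements.extend(await parser.parse(document))
    return elements


router = StubRouter({"txt": StubParser("text"), "jpg": StubParser("image")})
routed = asyncio.run(process_documents(["a.txt", "b.jpg"], router))
forced = asyncio.run(
    process_documents(["a.txt"], router, parser_overwrite=StubParser("custom"))
)
```

Here `routed` uses a different stub parser per extension, while `forced` bypasses the router entirely because an override was supplied.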

docs/how-to/document_search/search_documents.md

+7-7
@@ -76,26 +76,26 @@ library that supports parsing and chunking of most common document types (i.e. p
 from ragbits.core.embeddings.litellm import LiteLLMEmbedder
 from ragbits.core.vector_stores.in_memory import InMemoryVectorStore
 from ragbits.document_search import DocumentSearch
-from ragbits.document_search.ingestion.document_processor import DocumentProcessorRouter
+from ragbits.document_search.ingestion.parsers.router import DocumentParserRouter
 from ragbits.document_search.documents.document import DocumentType
-from ragbits.document_search.ingestion.providers.unstructured.default import UnstructuredDefaultProvider
+from ragbits.document_search.ingestion.parsers.unstructured.default import UnstructuredDocumentParser
 
 embedder = LiteLLMEmbedder()
 vector_store = InMemoryVectorStore(embedder=embedder)
 document_search = DocumentSearch(
     vector_store=vector_store,
-    parser_router=DocumentProcessorRouter({DocumentType.TXT: UnstructuredDefaultProvider()})
+    parser_router=DocumentParserRouter({DocumentType.TXT: UnstructuredDocumentParser()})
 )
 ```
 
-If you want to implement a new provider you should extend the [`BaseProvider`][ragbits.document_search.ingestion.providers.base.BaseProvider] class:
+If you want to implement a new provider you should extend the [`DocumentParser`][ragbits.document_search.ingestion.parsers.base.DocumentParser] class:
 ```python
 from ragbits.document_search.documents.document import DocumentMeta, DocumentType
 from ragbits.document_search.documents.element import Element
-from ragbits.document_search.ingestion.providers.base import BaseProvider
+from ragbits.document_search.ingestion.parsers.base import DocumentParser
 
 
-class CustomProvider(BaseProvider):
+class CustomProvider(DocumentParser):
     SUPPORTED_DOCUMENT_TYPES = { DocumentType.TXT } # provide supported document types
 
     async def process(self, document_meta: DocumentMeta) -> list[Element]:
@@ -112,7 +112,7 @@ There is an additional functionality of [`DocumentSearch`][ragbits.document_sear
 config = {
     "vector_store": {...},
     "reranker": {...},
-    "providers": {...},
+    "parser_router": {...},
     "rephraser": {...},
 }

docs/how-to/project/component_preferences.md

+1-1
@@ -167,6 +167,6 @@ This is the list of component types for which you can set a preferred configurat
 | `vector_store` | `ragbits-core` | [`VectorStore`][ragbits.core.vector_stores.base.VectorStore]| |
 | `history_compressor` | `ragbits-conversations` | [`ConversationHistoryCompressor`][ragbits.conversations.history.compressors.base.ConversationHistoryCompressor]| |
 | `document_search` | `ragbits-document-search` | [`DocumentSearch`][ragbits.document_search.DocumentSearch]| Specifics: [Configuration](#ds-configuration)|
-| `provider` | `ragbits-document-search` | [`BaseProvider`][ragbits.document_search.ingestion.providers.base.BaseProvider]| |
+| `parser` | `ragbits-document-search` | [`DocumentParser`][ragbits.document_search.ingestion.parsers.base.DocumentParser]| |
 | `rephraser` | `ragbits-document-search` | [`QueryRephraser`][ragbits.document_search.retrieval.rephrasers.QueryRephraser]| |
 | `reranker` | `ragbits-document-search` | [`Reranker`][ragbits.document_search.retrieval.rerankers.base.Reranker]| |

examples/document-search/configurable.py

+1-1
@@ -90,7 +90,7 @@ class to rephrase the query.
             },
         },
     },
-    "providers": {"txt": {"type": "DummyProvider"}},
+    "parser_router": {"txt": {"type": "TextDocumentParser"}},
     "rephraser": {
         "type": "LLMQueryRephraser",
         "config": {

examples/document-search/multimodal.py

+3-3
@@ -40,8 +40,8 @@
 from ragbits.document_search import DocumentSearch
 from ragbits.document_search.documents.document import DocumentMeta, DocumentType
 from ragbits.document_search.documents.sources import LocalFileSource
-from ragbits.document_search.ingestion.document_processor import DocumentProcessorRouter
-from ragbits.document_search.ingestion.providers.dummy import DummyImageProvider
+from ragbits.document_search.ingestion.parsers.base import ImageDocumentParser
+from ragbits.document_search.ingestion.parsers.router import DocumentParserRouter
 
 IMAGES_PATH = Path(__file__).parent / "images"
 
@@ -68,7 +68,7 @@ async def main() -> None:
     vector_store_hybrid = HybridSearchVectorStore(vector_store_text, vector_store_image)
 
     # For this example, we want to skip OCR and make sure that we test direct image embeddings.
-    parser_router = DocumentProcessorRouter.from_config({DocumentType.JPG: DummyImageProvider()})
+    parser_router = DocumentParserRouter({DocumentType.JPG: ImageDocumentParser()})
 
     document_search = DocumentSearch(
         vector_store=vector_store_hybrid,

examples/evaluation/document-search/advanced/config/experiments/chunking-1000.yaml

+1-1
@@ -5,7 +5,7 @@ task:
 
 pipeline:
   config:
-    providers:
+    parser_router:
       txt:
         config:
           chunking_kwargs:

examples/evaluation/document-search/advanced/config/experiments/chunking-250.yaml

+1-1
@@ -5,7 +5,7 @@ task:
 
 pipeline:
   config:
-    providers:
+    parser_router:
       txt:
         config:
           chunking_kwargs:

examples/evaluation/document-search/advanced/config/experiments/chunking-500.yaml

+1-1
@@ -5,7 +5,7 @@ task:
 
 pipeline:
   config:
-    providers:
+    parser_router:
       txt:
         config:
           chunking_kwargs:

examples/evaluation/document-search/advanced/config/pipeline/document_search.yaml

+1-1
@@ -2,7 +2,7 @@ defaults:
22
- vector_store@config.vector_store: chroma
33
44
5-
- providers@config.providers: unstructured
5+
- parser_router@config.parser_router: unstructured
66
77
- _self_
88

examples/evaluation/document-search/advanced/config/pipeline/document_search_optimization.yaml

+1-1
@@ -2,7 +2,7 @@ defaults:
22
- vector_store@config.vector_store: chroma_optimization
33
44
5-
- providers@config.providers: unstructured_optimization
5+
- parser_router@config.parser_router: unstructured_optimization
66
77
- _self_
88

examples/evaluation/document-search/advanced/config/pipeline/providers/unstructured.yaml renamed to examples/evaluation/document-search/advanced/config/pipeline/parser_router/unstructured.yaml

+2-2
@@ -1,5 +1,5 @@
 txt:
-  type: ragbits.document_search.ingestion.providers.unstructured:UnstructuredDefaultProvider
+  type: ragbits.document_search.ingestion.parsers.unstructured:UnstructuredDocumentParser
   config:
     use_api: false
     partition_kwargs:
@@ -12,7 +12,7 @@ txt:
       overlap_all: 0
 
 md:
-  type: ragbits.document_search.ingestion.providers.unstructured:UnstructuredDefaultProvider
+  type: ragbits.document_search.ingestion.parsers.unstructured:UnstructuredDocumentParser
   config:
     use_api: false
     partition_kwargs:

examples/evaluation/document-search/advanced/config/pipeline/providers/unstructured_optimization.yaml renamed to examples/evaluation/document-search/advanced/config/pipeline/parser_router/unstructured_optimization.yaml

+2-2
@@ -1,5 +1,5 @@
 txt:
-  type: ragbits.document_search.ingestion.providers.unstructured:UnstructuredDefaultProvider
+  type: ragbits.document_search.ingestion.parsers.unstructured:UnstructuredDocumentParser
   config:
     use_api: false
     partition_kwargs:
@@ -16,7 +16,7 @@ txt:
       overlap_all: 0
 
 md:
-  type: ragbits.document_search.ingestion.providers.unstructured:UnstructuredDefaultProvider
+  type: ragbits.document_search.ingestion.parsers.unstructured:UnstructuredDocumentParser
   config:
     use_api: false
     partition_kwargs:

examples/evaluation/document-search/basic/evaluate.py

+5-5
@@ -45,17 +45,17 @@
             },
         },
     },
-    "providers": {
-        "txt": {
-            "type": "ragbits.document_search.ingestion.providers.unstructured:UnstructuredDefaultProvider",
-        },
-    },
     "ingest_strategy": {
         "type": "ragbits.document_search.ingestion.strategies.batched:BatchedIngestStrategy",
         "config": {
             "batch_size": 10,
         },
     },
+    "parser_router": {
+        "txt": {
+            "type": "ragbits.document_search.ingestion.parsers.unstructured:UnstructuredDocumentParser",
+        },
+    },
     "source": {
         "type": "ragbits.document_search.documents.sources:HuggingFaceSource",
         "config": {

examples/evaluation/document-search/basic/optimize.py

+2-2
@@ -53,9 +53,9 @@
             },
         },
     },
-    "providers": {
+    "parser_router": {
         "txt": {
-            "type": "ragbits.document_search.ingestion.providers.unstructured:UnstructuredDefaultProvider",
+            "type": "ragbits.document_search.ingestion.parsers.unstructured:UnstructuredDocumentParser",
         },
     },
     "source": {
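The example and evaluation configs in this commit all apply the same mechanical change: the `providers` key becomes `parser_router`, and provider type strings are replaced by their parser equivalents. A rough sketch of that migration for dict-based configs follows. The `migrate_config` helper and the `TYPE_RENAMES` table are hypothetical and not part of ragbits; the renames listed are only the ones visible in this diff, so other provider types would be left untouched.

```python
import copy

# Renames taken from the diff above; any other type string is left as-is.
TYPE_RENAMES = {
    "ragbits.document_search.ingestion.providers.unstructured:UnstructuredDefaultProvider": (
        "ragbits.document_search.ingestion.parsers.unstructured:UnstructuredDocumentParser"
    ),
    "DummyProvider": "TextDocumentParser",
}


def migrate_config(config: dict) -> dict:
    """Return a copy of `config` with 'providers' renamed to 'parser_router'."""
    migrated = copy.deepcopy(config)
    if "providers" not in migrated:
        return migrated

    parsers = migrated.pop("providers")
    for entry in parsers.values():
        old_type = entry.get("type")
        if old_type in TYPE_RENAMES:
            entry["type"] = TYPE_RENAMES[old_type]

    migrated["parser_router"] = parsers
    return migrated


old_config = {
    "providers": {"txt": {"type": "DummyProvider"}},
    "rephraser": {"type": "LLMQueryRephraser"},
}
new_config = migrate_config(old_config)
```

The helper works on a deep copy, so the original config stays intact and the two can be compared side by side before the new one is written back.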

mkdocs.yml

+2-1
@@ -53,7 +53,8 @@ nav:
       - api_reference/document_search/index.md
       - api_reference/document_search/documents.md
       - Ingest:
-        - api_reference/document_search/processing.md
+        - api_reference/document_search/ingest/parsers.md
+        - api_reference/document_search/ingest/enrichers.md
         - api_reference/document_search/ingest/strategies.md
       - Retrieval:
         - api_reference/document_search/retrieval/rephrasers.md

packages/ragbits-core/CHANGELOG.md

+4
@@ -3,12 +3,16 @@
 ## Unreleased
 - Allow Prompt class to accept the asynchronous response_parser. Change the signature of parse_response method.
 
+- Fix Qdrant vector store serialization (#419)
+
 ## 0.11.0 (2025-03-25)
+
 - Add HybridSearchVectorStore which can aggregate results from multiple VectorStores (#412)
 
 ## 0.10.2 (2025-03-21)
 
 ## 0.10.1 (2025-03-19)
+
 - Better handling of cases when text and image embeddings are mixed in VectorStore
 
 ## 0.10.0 (2025-03-17)

packages/ragbits-document-search/CHANGELOG.md

+5-6
@@ -2,12 +2,15 @@
 
 ## Unreleased
 
+### Changed
+
+- BREAKING CHANGE: Providers and intermediate handlers refactored to parsers and enrichers (#419)
+
 ## 0.11.0 (2025-03-25)
 
 ### Changed
 
 - ragbits-core updated to version v0.11.0
-
 - Introduce picklable ingest error wrapper (#448)
 - Add support for Git source to fetch files from Git repositories (#439)
 
@@ -16,7 +19,6 @@
 ### Changed
 
 - ragbits-core updated to version v0.10.2
-
 - Remove obsolete ImageDescriber and llm from UnstructuredImageProvider (#430)
 - Make SourceError and its subclasses picklable (#435)
 - Allow for setting custom headers in WebSource (#437)
@@ -26,7 +28,6 @@
 ### Changed
 
 - ragbits-core updated to version v0.10.1
-
 - BREAKING CHANGE: Renamed HttpSource to WebSource and changed property names (#420)
 - Better error distinction for WebSource (#421)
 
@@ -35,14 +36,12 @@
 ### Changed
 
 - ragbits-core updated to version v0.10.0
-
 - BREAKING CHANGE: Processing strategies refactored to ingest strategies (#394)
 - Compability with the new Vector Store interface from ragbits-core (#288)
 - Fix docstring formatting to resolve Griffe warnings
 - Introduce intermediate image elements (#139)
 - Add HTTP source type, which downloads a file from the provided URL (#397)
-
-- added traceable
+- added traceable
 
 ## 0.9.0 (2025-02-25)
