Commit 92306b5

refactor: document processing interface (#419)
1 parent 5852042 commit 92306b5

67 files changed (+1276, -1285 lines changed)

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -98,3 +98,5 @@ chroma/
 qdrant/
 
 .aider*
+
+.DS_Store
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+# Element Enrichers
+
+::: ragbits.document_search.ingestion.enrichers.router.ElementEnricherRouter
+
+::: ragbits.document_search.ingestion.enrichers.base.ElementEnricher
+
+::: ragbits.document_search.ingestion.enrichers.image.ImageElementEnricher
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# Document Parsers
+
+::: ragbits.document_search.ingestion.parsers.router.DocumentParserRouter
+
+::: ragbits.document_search.ingestion.parsers.base.DocumentParser
+
+::: ragbits.document_search.ingestion.parsers.base.TextDocumentParser
+
+::: ragbits.document_search.ingestion.parsers.base.ImageDocumentParser
+
+::: ragbits.document_search.ingestion.parsers.unstructured.UnstructuredDocumentParser

docs/api_reference/document_search/processing.md

Lines changed: 0 additions & 24 deletions
This file was deleted.

docs/how-to/document_search/ingest/strategies/async_processing.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 In Ragbits, a component called "processing execution strategy" controls how document processing is executed during ingestion. There are multiple execution strategies available in Ragbits that can be easily interchanged. You can also [create new custom execution strategies](create_custom_execution_strategy.md) to meet your specific needs.
 
 !!! note
-    It's important to note that processing execution strategies are a separate concept from processors. While the former manage how the processing is executed, the latter deals with the actual processing of documents. Processors are managed by [DocumentProcessorRouter][ragbits.document_search.ingestion.document_processor.DocumentProcessorRouter].
+    It's important to note that processing execution strategies are a separate concept from processors. While the former manage how the processing is executed, the latter deals with the actual processing of documents. Processors are managed by [DocumentParserRouter][ragbits.document_search.ingestion.parsers.router.DocumentParserRouter].
 
 ## The Synchronous Execution Strategy
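The distinction drawn in the note above (a strategy decides how processing runs, a parser decides what is done to each document) can be illustrated with a minimal, library-independent sketch. Every name below (`Parser`, `split_lines`, `run_sequentially`, `run_concurrently`) is hypothetical and is not part of the Ragbits API; plain strings stand in for documents and elements.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence

# Hypothetical stand-in for a parser: turns one document into a list of elements.
Parser = Callable[[str], Awaitable[list[str]]]


async def split_lines(document: str) -> list[str]:
    """The 'what': the actual processing of a single document."""
    return [line for line in document.splitlines() if line]


async def run_sequentially(documents: Sequence[str], parser: Parser) -> list[str]:
    """The 'how', variant 1: parse documents one after another."""
    elements: list[str] = []
    for document in documents:
        elements.extend(await parser(document))
    return elements


async def run_concurrently(documents: Sequence[str], parser: Parser) -> list[str]:
    """The 'how', variant 2: schedule all documents at once and gather the results."""
    results = await asyncio.gather(*(parser(document) for document in documents))
    return [element for result in results for element in result]
```

Both functions accept the same parser and return the same elements in the same order, which is why the two concerns can be swapped independently.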

docs/how-to/document_search/ingest/strategies/create_custom_execution_strategy.md

Lines changed: 9 additions & 9 deletions
@@ -19,16 +19,16 @@ from collections.abc import Sequence
 
 from ragbits.document_search.documents.document import Document, DocumentMeta, Source
 from ragbits.document_search.documents.element import Element
-from ragbits.document_search.ingestion.document_processor import DocumentProcessorRouter
+from ragbits.document_search.ingestion.parsers.router import DocumentParserRouter
 from ragbits.document_search.ingestion.strategies import IngestStrategy
-from ragbits.document_search.ingestion.providers.base import BaseProvider
+from ragbits.document_search.ingestion.parsers.base import DocumentParser
 
 class DelayedExecutionStrategy(IngestStrategy):
     async def process_documents(
         self,
         documents: Sequence[DocumentMeta | Document | Source],
-        processor_router: DocumentProcessorRouter,
-        processor_overwrite: BaseProvider | None = None,
+        processor_router: DocumentParserRouter,
+        processor_overwrite: DocumentParser | None = None,
     ) -> list[Element]:
         elements = []
         for document in documents:
@@ -48,24 +48,24 @@ from collections.abc import Sequence
 
 from ragbits.document_search.documents.document import Document, DocumentMeta, Source
 from ragbits.document_search.documents.element import Element
-from ragbits.document_search.ingestion.document_processor import DocumentProcessorRouter
+from ragbits.document_search.ingestion.parsers.router import DocumentParserRouter
 from ragbits.document_search.ingestion.strategies import IngestStrategy
-from ragbits.document_search.ingestion.providers.base import BaseProvider
+from ragbits.document_search.ingestion.parsers.base import DocumentParser
 
 class DelayedExecutionStrategy(IngestStrategy):
     async def process_documents(
         self,
         documents: Sequence[DocumentMeta | Document | Source],
-        processor_router: DocumentProcessorRouter,
-        processor_overwrite: BaseProvider | None = None,
+        processor_router: DocumentParserRouter,
+        processor_overwrite: DocumentParser | None = None,
     ) -> list[Element]:
         elements = []
         for document in documents:
             # Convert the document to DocumentMeta
             document_meta = await self.to_document_meta(document)
 
             # Get the processor for the document
-            processor = processor_overwrite or processor_router.get_provider(document)
+            processor = processor_overwrite or processor_router.get(document)
 
             await asyncio.sleep(1)

docs/how-to/document_search/search_documents.md

Lines changed: 7 additions & 7 deletions
@@ -76,26 +76,26 @@ library that supports parsing and chunking of most common document types (i.e. p
 from ragbits.core.embeddings.litellm import LiteLLMEmbedder
 from ragbits.core.vector_stores.in_memory import InMemoryVectorStore
 from ragbits.document_search import DocumentSearch
-from ragbits.document_search.ingestion.document_processor import DocumentProcessorRouter
+from ragbits.document_search.ingestion.parsers.router import DocumentParserRouter
 from ragbits.document_search.documents.document import DocumentType
-from ragbits.document_search.ingestion.providers.unstructured.default import UnstructuredDefaultProvider
+from ragbits.document_search.ingestion.parsers.unstructured.default import UnstructuredDocumentParser
 
 embedder = LiteLLMEmbedder()
 vector_store = InMemoryVectorStore(embedder=embedder)
 document_search = DocumentSearch(
     vector_store=vector_store,
-    parser_router=DocumentProcessorRouter({DocumentType.TXT: UnstructuredDefaultProvider()})
+    parser_router=DocumentParserRouter({DocumentType.TXT: UnstructuredDocumentParser()})
 )
 ```
 
-If you want to implement a new provider you should extend the [`BaseProvider`][ragbits.document_search.ingestion.providers.base.BaseProvider] class:
+If you want to implement a new provider you should extend the [`DocumentParser`][ragbits.document_search.ingestion.parsers.base.DocumentParser] class:
 ```python
 from ragbits.document_search.documents.document import DocumentMeta, DocumentType
 from ragbits.document_search.documents.element import Element
-from ragbits.document_search.ingestion.providers.base import BaseProvider
+from ragbits.document_search.ingestion.parsers.base import DocumentParser
 
 
-class CustomProvider(BaseProvider):
+class CustomProvider(DocumentParser):
     SUPPORTED_DOCUMENT_TYPES = { DocumentType.TXT } # provide supported document types
 
     async def process(self, document_meta: DocumentMeta) -> list[Element]:
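The hunk above cuts off before the body of `process`, so the general shape of a parser that declares its supported document types may be easier to see in a standalone sketch. This is an illustration under stated assumptions, not the Ragbits implementation: `DocType`, `BaseParser`, `ParagraphParser`, `validate`, and the `parse` signature are hypothetical, and strings stand in for documents and elements.

```python
import asyncio
from enum import Enum


class DocType(Enum):
    """Hypothetical stand-in for a document type enum."""

    TXT = "txt"
    MD = "md"


class BaseParser:
    """Hypothetical base class: subclasses declare the types they accept."""

    supported_types: set[DocType] = set()

    def validate(self, doc_type: DocType) -> None:
        # Reject document types the subclass did not declare.
        if doc_type not in self.supported_types:
            raise ValueError(f"{type(self).__name__} does not support {doc_type.value}")

    async def parse(self, doc_type: DocType, content: str) -> list[str]:
        raise NotImplementedError


class ParagraphParser(BaseParser):
    """Splits plain text into non-empty paragraphs."""

    supported_types = {DocType.TXT}

    async def parse(self, doc_type: DocType, content: str) -> list[str]:
        self.validate(doc_type)
        return [part.strip() for part in content.split("\n\n") if part.strip()]
```

Declaring the supported types on the class lets a router or caller reject a mismatched document before any parsing work starts.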
@@ -112,7 +112,7 @@ There is an additional functionality of [`DocumentSearch`][ragbits.document_sear
 config = {
     "vector_store": {...},
     "reranker": {...},
-    "providers": {...},
+    "parser_router": {...},
     "rephraser": {...},
 }
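The contents of the `"parser_router"` key are elided in the hunk above. The `examples/document-search/configurable.py` change in this commit uses the shape `{"txt": {"type": "TextDocumentParser"}}`, a mapping from document type to a parser type name. How such a mapping could be resolved into parser instances is sketched below under loud assumptions: the registry `PARSER_TYPES`, the function `build_router`, the optional `"config"` sub-key, and both parser classes are hypothetical and do not describe how Ragbits itself resolves its configuration.

```python
class TextParser:
    """Hypothetical parser; the name is illustrative only."""


class ImageParser:
    """Hypothetical parser; the name is illustrative only."""


# Registry of known parser types, keyed by the "type" string used in the config.
PARSER_TYPES: dict[str, type] = {"TextParser": TextParser, "ImageParser": ImageParser}


def build_router(config: dict[str, dict]) -> dict[str, object]:
    """Turn {"txt": {"type": "TextParser"}} into {"txt": TextParser()}."""
    router: dict[str, object] = {}
    for doc_type, spec in config.items():
        try:
            parser_cls = PARSER_TYPES[spec["type"]]
        except KeyError as exc:
            raise ValueError(f"Unknown parser type for {doc_type!r}: {spec.get('type')!r}") from exc
        # Optional constructor arguments live under a hypothetical "config" key.
        router[doc_type] = parser_cls(**spec.get("config", {}))
    return router
```

Failing fast on an unknown type name keeps a typo in the configuration from surfacing later as a missing parser during ingestion.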

docs/how-to/project/component_preferences.md

Lines changed: 1 addition & 1 deletion
@@ -167,6 +167,6 @@ This is the list of component types for which you can set a preferred configurat
 | `vector_store` | `ragbits-core` | [`VectorStore`][ragbits.core.vector_stores.base.VectorStore]| |
 | `history_compressor` | `ragbits-conversations` | [`ConversationHistoryCompressor`][ragbits.conversations.history.compressors.base.ConversationHistoryCompressor]| |
 | `document_search` | `ragbits-document-search` | [`DocumentSearch`][ragbits.document_search.DocumentSearch]| Specifics: [Configuration](#ds-configuration)|
-| `provider` | `ragbits-document-search` | [`BaseProvider`][ragbits.document_search.ingestion.providers.base.BaseProvider]| |
+| `parser` | `ragbits-document-search` | [`DocumentParser`][ragbits.document_search.ingestion.parsers.base.DocumentParser]| |
 | `rephraser` | `ragbits-document-search` | [`QueryRephraser`][ragbits.document_search.retrieval.rephrasers.QueryRephraser]| |
 | `reranker` | `ragbits-document-search` | [`Reranker`][ragbits.document_search.retrieval.rerankers.base.Reranker]| |

examples/document-search/configurable.py

Lines changed: 1 addition & 1 deletion
@@ -90,7 +90,7 @@ class to rephrase the query.
             },
         },
     },
-    "providers": {"txt": {"type": "DummyProvider"}},
+    "parser_router": {"txt": {"type": "TextDocumentParser"}},
     "rephraser": {
         "type": "LLMQueryRephraser",
         "config": {

examples/document-search/multimodal.py

Lines changed: 3 additions & 3 deletions
@@ -40,8 +40,8 @@
 from ragbits.document_search import DocumentSearch
 from ragbits.document_search.documents.document import DocumentMeta, DocumentType
 from ragbits.document_search.documents.sources import LocalFileSource
-from ragbits.document_search.ingestion.document_processor import DocumentProcessorRouter
-from ragbits.document_search.ingestion.providers.dummy import DummyImageProvider
+from ragbits.document_search.ingestion.parsers.base import ImageDocumentParser
+from ragbits.document_search.ingestion.parsers.router import DocumentParserRouter
 
 IMAGES_PATH = Path(__file__).parent / "images"

@@ -68,7 +68,7 @@ async def main() -> None:
     vector_store_hybrid = HybridSearchVectorStore(vector_store_text, vector_store_image)
 
     # For this example, we want to skip OCR and make sure that we test direct image embeddings.
-    parser_router = DocumentProcessorRouter.from_config({DocumentType.JPG: DummyImageProvider()})
+    parser_router = DocumentParserRouter({DocumentType.JPG: ImageDocumentParser()})
 
     document_search = DocumentSearch(
         vector_store=vector_store_hybrid,
