enoch3712
diff --git a/docs/assets/completion_concatenate.png b/docs/assets/completion_concatenate.png
172 KB
diff --git a/docs/assets/completion_paginate.png b/docs/assets/completion_paginate.png
166 KB
diff --git a/docs/assets/completion_strategies.png b/docs/assets/completion_strategies.png
152 KB
diff --git a/docs/assets/gemini_example.png b/docs/assets/gemini_example.png
280 KB
diff --git a/docs/core-concepts/completion-strategies/concatenate.md b/docs/core-concepts/completion-strategies/concatenate.md
Lines changed: 75 additions & 0 deletions
diff --git a/docs/core-concepts/completion-strategies/index.md b/docs/core-concepts/completion-strategies/index.md
Lines changed: 51 additions & 0 deletions
diff --git a/docs/core-concepts/completion-strategies/paginate.md b/docs/core-concepts/completion-strategies/paginate.md
Lines changed: 81 additions & 0 deletions
diff --git a/docs/core-concepts/contracts/index.md b/docs/core-concepts/contracts/index.md
Lines changed: 6 additions & 6 deletions
diff --git a/docs/core-concepts/document-loaders/aws-textract.md b/docs/core-concepts/document-loaders/aws-textract.md
Lines changed: 11 additions & 9 deletions
diff --git a/docs/core-concepts/document-loaders/azure-form.md b/docs/core-concepts/document-loaders/azure-form.md
Lines changed: 2 additions & 9 deletions
@@ -0,0 +1,75 @@
+# Concatenate Strategy
+
+The Concatenate strategy handles responses that exceed the LLM's output limit: when a response is truncated, it requests a continuation and then combines the partial responses into a single result.
+
+<div align="center">
+  <img src="../../../assets/completion_concatenate.png" alt="Concatenate Strategy" width="50%">
+</div>
+
+## How It Works
+
+**1. Initial Request**
+
+- Sends the content to the LLM with the desired response structure
+- Monitors the LLM's response completion status
+
+**2. Continuation Process**
+
+- If response is truncated (finish_reason="length"), builds a continuation request
+- Includes previous partial response for context
+- Continues until LLM indicates completion
+
+**3. Validation**
+
+- Runs when the LLM indicates completion (`finish_reason="stop"`)
+- Validates the combined JSON response
+- Raises an error if the completed response is not valid JSON
+
+**4. Response Processing**
+
+- Combines all response parts
+- Validates against the specified response model
+- Returns structured data
+
+## Usage
+
+```python
+from extract_thinker import Extractor
+from extract_thinker.models.completion_strategy import CompletionStrategy
+
+extractor = Extractor()
+extractor.load_llm("gpt-4o")
+
+result = extractor.extract(
+    file_path,
+    ResponseModel,
+    completion_strategy=CompletionStrategy.CONCATENATE
+)
+```
+
+## Benefits
+
+- **Handles Large Content**: Can process documents whose extracted output exceeds the model's output token limit
+- **Maintains Context**: Attempts to keep related content together
+
+## Implementation Details
+
+??? example "Concatenation Handler Implementation"
+    The ConcatenationHandler implements the CONCATENATE strategy:
+    ```python
+    --8<-- "extract_thinker/concatenation_handler.py"
+    ```
+
+## When to Use
+
+CONCATENATE is the best choice when:
+
+**Context window is large**
+
+- For models like gpt-4o, claude-3-5-sonnet, etc.
+
+**The content is not too large**
+
+- Use it only when the document fits in the input context window (a 500-page document, for example, usually does not)
+
+For handling bigger documents, consider using the [PAGINATE strategy](../completion-strategies/paginate.md).
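The continuation loop described under "How It Works" in concatenate.md can be sketched as follows. This is a minimal illustration, not the library's `ConcatenationHandler`: `complete_with_continuation`, `make_fake_llm`, and the `(text, finish_reason)` callable signature are hypothetical names introduced here, and a stand-in function replaces the real model so the example runs on its own.

```python
import json
from typing import Callable, List, Tuple

# Hypothetical signature: receives the partial output so far and
# returns (next_text, finish_reason).
LLMCall = Callable[[str], Tuple[str, str]]


def complete_with_continuation(call_llm: LLMCall, max_rounds: int = 10) -> dict:
    """Request continuations until the model stops, then parse the JSON."""
    parts: List[str] = []
    for _ in range(max_rounds):
        text, finish_reason = call_llm("".join(parts))
        parts.append(text)
        if finish_reason == "stop":
            # Validate only once the model reports completion;
            # json.loads raises json.JSONDecodeError on invalid JSON.
            return json.loads("".join(parts))
        if finish_reason != "length":
            raise RuntimeError(f"unexpected finish_reason: {finish_reason!r}")
    raise RuntimeError("response still truncated after max_rounds continuations")


def make_fake_llm(full_text: str, chunk_size: int) -> LLMCall:
    """Stand-in for a real model: emits full_text in chunk_size pieces."""
    def call(partial: str) -> Tuple[str, str]:
        start = len(partial)
        chunk = full_text[start:start + chunk_size]
        done = start + chunk_size >= len(full_text)
        return chunk, "stop" if done else "length"
    return call


fake_llm = make_fake_llm('{"invoice_number": "INV-1", "total": 42.5}', chunk_size=16)
result = complete_with_continuation(fake_llm)
print(result)
```

The `max_rounds` cap is an assumption of this sketch; it keeps a model that never reports `"stop"` from looping forever.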
@@ -0,0 +1,51 @@
+# Completion Strategies
+
+ExtractThinker provides different strategies for handling document content processing through LLMs, especially when dealing with content that might exceed the model's context window. There are three main strategies: **Forbidden**, **Concatenate**, and **Paginate**.
+
+<div align="center">
+  <img src="../../assets/completion_strategies.png" alt="Completion Strategies">
+</div>
+
+### FORBIDDEN Strategy
+
+The FORBIDDEN strategy is the default: it refuses to process content that exceeds the model's context window. It is the simplest strategy; larger content must be handled with one of the other strategies.
+
+```python
+from extract_thinker import Extractor
+from extract_thinker.models.completion_strategy import CompletionStrategy
+
+extractor = Extractor()
+extractor.load_llm("gpt-4o")
+
+# Will raise ValueError if content is too large
+result = extractor.extract(
+    file_path,
+    ResponseModel,
+    completion_strategy=CompletionStrategy.FORBIDDEN # Default
+)
+```
+
+For more advanced strategies that handle larger content, see:
+
+- [CONCATENATE Strategy](concatenate.md) - For handling content larger than the context window
+- [PAGINATE Strategy](paginate.md) - For processing multi-page documents in parallel
+
+The choice of completion strategy depends on your specific use case:
+
+**Use FORBIDDEN when:**
+
+- Content is guaranteed to fit in context window
+- You need the simplest possible processing and default behavior
+- You want to ensure content is processed as a single unit
+
+**Use [CONCATENATE](concatenate.md) when:**
+
+- Content might exceed context window
+- The expected output exceeds the model's output limit, but the content still fits in the input context window
+- You want automatic handling of large content
+
+**Use [PAGINATE](paginate.md) when:**
+
+- Processing multi-page documents
+- The size exceeds both the output and the input context window
+- You need sophisticated conflict resolution between pages
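The selection guidance in index.md can be condensed into a small decision helper. This is only a sketch of the rule of thumb: `Strategy` is a local stand-in for the library's `CompletionStrategy`, and `choose_strategy` and its token-count parameters are hypothetical. The caller is assumed to estimate the token counts.

```python
from enum import Enum


class Strategy(Enum):
    """Local stand-in for illustration, not the library's CompletionStrategy."""
    FORBIDDEN = "forbidden"
    CONCATENATE = "concatenate"
    PAGINATE = "paginate"


def choose_strategy(
    input_tokens: int,
    expected_output_tokens: int,
    input_limit: int,
    output_limit: int,
) -> Strategy:
    """Pick a strategy from estimated sizes and the model's limits."""
    if input_tokens > input_limit:
        # Content does not fit in the input context window.
        return Strategy.PAGINATE
    if expected_output_tokens > output_limit:
        # Input fits, but the response would be truncated.
        return Strategy.CONCATENATE
    # Everything fits in a single request.
    return Strategy.FORBIDDEN


print(choose_strategy(1_000, 500, input_limit=128_000, output_limit=4_096))
```

As a simplification, the sketch routes any input that exceeds the context window to PAGINATE, whatever the expected output size.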
@@ -0,0 +1,81 @@
+# PAGINATE Strategy
+
+The PAGINATE strategy processes multi-page documents by handling each page independently and then intelligently merging the results, including sophisticated conflict resolution when pages contain overlapping information.
+
+<div align="center">
+  <img src="../../../assets/completion_paginate.png" alt="Paginate Strategy" width="35%">
+</div>
+
+## How It Works
+
+**Page Separation**
+
+- Identifies individual pages
+- Preserves page metadata
+- Maintains document structure
+
+**Parallel Processing**
+
+- Each page processed independently
+- Uses full context window per page
+- Handles page-specific content
+
+**Result Collection**
+
+- Gathers results from all pages
+- Validates individual page results
+- Prepares for merging
+
+**Conflict Resolution**
+
+- Detects overlapping information
+- Resolves conflicts using confidence scores
+- Maintains data consistency
+
+## Usage
+
+```python
+from extract_thinker import Extractor
+from extract_thinker.models.completion_strategy import CompletionStrategy
+
+extractor = Extractor()
+extractor.load_llm("gpt-4")
+
+result = extractor.extract(
+    file_path,
+    ResponseModel,
+    completion_strategy=CompletionStrategy.PAGINATE
+)
+```
+
+## Benefits
+
+- **Cheaper**: Several small per-page requests can cost less than one long Concatenate run
+- **Parallel Processing**: Pages can be processed independently
+- **Conflict Resolution**: Smart merging of results from different pages
+- **Scalability**: Handles documents of any length
+- **Accuracy**: Each page gets full context window attention
+
+## Implementation Details
+
+??? example "Pagination Handler Implementation"
+    The PaginationHandler implements the PAGINATE strategy:
+    ```python
+    --8<-- "extract_thinker/pagination_handler.py"
+    ```
+
+## When to Use
+
+PAGINATE is the best choice when:
+
+**Context window is small**
+
+- For local LLMs with smaller context windows (e.g. Llama 3.3 run with an 8k context window).
+
+**The content is too big**
+
+- When the file will not fit in the context window (e.g. a 500-page document)
+
+**Model Accuracy**
+
+- LLMs can lose focus when the context is too large; processing one page at a time mitigates this
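The parallel processing and confidence-based conflict resolution described in paginate.md can be sketched as below. This is an assumption-laden illustration, not the library's `PaginationHandler`: the `(value, confidence)` result shape, `extract_pages`, `merge_pages`, and the `fake_extract` stand-in for the per-page LLM call are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Tuple

FieldResult = Tuple[Any, float]        # (value, confidence in [0, 1])
PageResult = Dict[str, FieldResult]    # field name -> result for one page


def extract_pages(
    pages: List[str], extract_page: Callable[[str], PageResult]
) -> List[PageResult]:
    """Run the per-page extractor on every page in parallel, keeping page order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(extract_page, pages))


def merge_pages(results: List[PageResult]) -> Dict[str, Any]:
    """Keep the highest-confidence value per field; ties keep the earlier page."""
    best: Dict[str, FieldResult] = {}
    for page in results:
        for field, (value, confidence) in page.items():
            if field not in best or confidence > best[field][1]:
                best[field] = (value, confidence)
    return {field: value for field, (value, _) in best.items()}


def fake_extract(page: str) -> PageResult:
    """Stand-in for an LLM call: parses 'field=value@confidence' lines."""
    out: PageResult = {}
    for line in page.splitlines():
        key, rest = line.split("=", 1)
        value, conf = rest.rsplit("@", 1)
        out[key] = (value, float(conf))
    return out


pages = ["vendor=ACME@0.6\ntotal=100@0.9", "vendor=ACME Corp@0.8"]
merged = merge_pages(extract_pages(pages, fake_extract))
print(merged)
```

Here both pages report a `vendor`, and the merge keeps the value from the second page because its confidence is higher.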
@@ -5,6 +5,11 @@
 
 Contracts in ExtractThinker are Pydantic models that define the structure of data you want to extract. They provide type safety and validation for your extracted data.
 
+??? example "Base Contract Implementation"
+    ```python
+    --8<-- "extract_thinker/models/contract.py"
+    ```
+
 ## Basic Usage
 
 ```python
@@ -24,9 +29,4 @@ class InvoiceContract(Contract):
     total_amount: float = Field(description="Total invoice amount")
     line_items: List[InvoiceLineItem] = Field(description="List of items in invoice")
     notes: Optional[str] = Field(description="Additional notes", default=None)
-```
-
-??? example "Base Contract Implementation"
-    ```python
-    --8<-- "extract_thinker/models/contract.py"
-    ```
+```
@@ -58,14 +58,16 @@ The loader returns a dictionary with the following structure:
 
 ## Best Practices
 
-1. **Document Preparation**
-   - Use high-quality scans
-   - Support formats: PDF, JPEG, PNG
-   - Consider file size limits
+**Document Preparation**
 
-2. **Performance**
-   - Cache results when possible
-   - Process pages individually for large documents
-   - Monitor API quotas and costs
+- Use high-quality scans
+- Supported formats: `PDF`, `JPEG`, `PNG`
+- Consider file size limits
 
-For more examples and implementation details, check out the [AWS Stack](../../examples/aws-textract) in the repository. 
+**Performance**
+
+- Cache results when possible
+- Process pages individually for large documents
+- Monitor API quotas and costs
+
+For more examples and implementation details, check out the [AWS Stack](../../../examples/aws-stack) in the repository. 
@@ -52,13 +52,6 @@ for page in result["pages"]:
         print(f"Table data: {table}")
 ```
 
-Document Intelligence supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.
+Supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.
 
-## Best Practices
-
-- Use high-quality scans for best results
-- Consider caching results (built-in TTL of 300 seconds)
-- Handle tables and paragraphs separately for better accuracy
-- Process documents page by page for large files
-
-For more examples and implementation details, check out the [Azure Stack](../../examples/azure-form.md) in the repository. 
+For more examples and implementation details, check out the [Azure Stack](../../../examples/azure-stack) in the repository.