Commit b671989

Merge pull request #168 from enoch3712/167-add-strategy-to-docs-refactoring
Major refactoring and Strategy added to the docs
2 parents f2f6ab9 + 0dc6c31 commit b671989

36 files changed

+543
-644
lines changed

docs/assets/gemini_example.png

Lines changed: 75 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,75 @@
1+
# Concatenate Strategy
2+
3+
The Concatenate strategy is designed to handle responses that exceed the LLM's output limit: when a response is truncated, it requests continuations and then combines the partial responses into a single result.
4+
5+
<div align="center">
6+
<img src="../../../assets/completion_concatenate.png" alt="Concatenate Strategy" width="50%">
7+
</div>
8+
9+
## How It Works
10+
11+
**1. Initial Request**
12+
13+
- Sends the content to the LLM with the desired response structure
14+
- Monitors the LLM's response completion status
15+
16+
**2. Continuation Process**
17+
18+
- If the response is truncated (`finish_reason="length"`), builds a continuation request
19+
- Includes previous partial response for context
20+
- Continues until LLM indicates completion
21+
22+
**3. Validation**
23+
24+
- Runs once the LLM indicates completion (`finish_reason="stop"`)
25+
- Validates the combined JSON response
26+
- Raises error if invalid JSON is received on completion
27+
28+
**4. Response Processing**
29+
30+
- Combines all response parts
31+
- Validates against the specified response model
32+
- Returns structured data
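
The four steps above can be sketched as a small loop. This is a minimal illustration, not the library's `ConcatenationHandler`: the `LLMReply` shape, the `ask_llm` callable, the `max_rounds` guard, and the `fake_llm` stand-in are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LLMReply:
    """Hypothetical reply shape: text plus the finish reason."""
    content: str
    finish_reason: str  # "length" when truncated, "stop" when complete


def concatenate_completion(
    ask_llm: Callable[[str, str], LLMReply],
    content: str,
    max_rounds: int = 10,
) -> dict:
    """Request continuations until the model stops, then parse the joined JSON."""
    parts: List[str] = []
    for _ in range(max_rounds):
        # Pass the partial response so far as context for the continuation.
        reply = ask_llm(content, "".join(parts))
        parts.append(reply.content)
        if reply.finish_reason == "stop":
            break
    else:
        raise RuntimeError("LLM never signalled completion")
    combined = "".join(parts)
    try:
        return json.loads(combined)
    except json.JSONDecodeError as exc:
        raise ValueError("Invalid JSON received on completion") from exc


def fake_llm(content: str, partial: str) -> LLMReply:
    """Stand-in model that emits a fixed JSON answer 12 characters at a time."""
    full = '{"title": "Report", "pages": 3}'
    chunk = full[len(partial):len(partial) + 12]
    done = len(partial) + len(chunk) >= len(full)
    return LLMReply(chunk, "stop" if done else "length")


result = concatenate_completion(fake_llm, "document text")
```

In a real pipeline the parsed dictionary would then be validated against the response model; that step is left out here to keep the sketch dependency-free.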
33+
34+
## Usage
35+
36+
```python
from extract_thinker import Extractor
from extract_thinker.models.completion_strategy import CompletionStrategy

extractor = Extractor()
extractor.load_llm("gpt-4o")

# file_path: path to the document; ResponseModel: the contract describing the output
result = extractor.extract(
    file_path,
    ResponseModel,
    completion_strategy=CompletionStrategy.CONCATENATE
)
```
49+
50+
## Benefits
51+
52+
- **Handles Large Content**: Can process documents whose extracted output exceeds the model's output limit
53+
- **Maintains Context**: Each continuation request includes the previous partial response, which helps keep related content together
54+
55+
## Implementation Details
56+
57+
??? example "Concatenation Handler Implementation"
58+
The ConcatenationHandler implements the CONCATENATE strategy:
59+
```python
60+
--8<-- "extract_thinker/concatenation_handler.py"
61+
```
62+
63+
## When to Use
64+
65+
CONCATENATE is the best choice when:
66+
67+
**Context window is large**
68+
69+
- For models like gpt-4o, claude-3-5-sonnet, etc.
70+
71+
**The content is not too large**
72+
73+
- Use it when the document itself still fits in the input context window (a 500-page document, for example, would be too large)
74+
75+
For handling bigger documents, consider using the [PAGINATE strategy](../completion-strategies/paginate.md).
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
1+
# Completion Strategies
2+
3+
ExtractThinker provides different strategies for handling document content processing through LLMs, especially when dealing with content that might exceed the model's context window. There are three main strategies: **Forbidden**, **Concatenate**, and **Paginate**.
4+
5+
<div align="center">
6+
<img src="../../assets/completion_strategies.png" alt="Completion Strategies">
7+
</div>
8+
9+
### FORBIDDEN Strategy
10+
11+
The FORBIDDEN strategy is the default: it refuses to process content that exceeds the model's context window. It is the simplest strategy; to handle larger content, use one of the other strategies.
12+
13+
```python
from extract_thinker import Extractor
from extract_thinker.models.completion_strategy import CompletionStrategy

extractor = Extractor()
extractor.load_llm("gpt-4o")

# file_path: path to the document; ResponseModel: the contract describing the output
# Will raise ValueError if content is too large
result = extractor.extract(
    file_path,
    ResponseModel,
    completion_strategy=CompletionStrategy.FORBIDDEN  # Default
)
```
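
Because FORBIDDEN raises `ValueError` on oversized content, one possible pattern is to try it first and fall back to another strategy. The sketch below is only an illustration under assumptions: `extract_with_fallback` and `fake_extract` are hypothetical helpers, and strategies are plain strings so that the example runs without the library installed.

```python
from typing import Any, Callable, Sequence


def extract_with_fallback(
    extract: Callable[[str], Any],
    strategies: Sequence[str],
) -> Any:
    """Try each strategy in order, moving on when one raises ValueError."""
    last_error = None
    for strategy in strategies:
        try:
            return extract(strategy)
        except ValueError as exc:
            last_error = exc
    raise ValueError("content too large for every strategy tried") from last_error


def fake_extract(strategy: str) -> dict:
    """Stand-in for an extraction call; rejects FORBIDDEN as 'too large'."""
    if strategy == "FORBIDDEN":
        raise ValueError("content exceeds context window")
    return {"strategy_used": strategy}


result = extract_with_fallback(fake_extract, ["FORBIDDEN", "CONCATENATE"])
```

With the real library, `extract` could wrap the call shown above, for example `lambda s: extractor.extract(file_path, ResponseModel, completion_strategy=s)`, passing `CompletionStrategy` members instead of strings.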
27+
28+
For more advanced strategies that handle larger content, see:
29+
30+
- [CONCATENATE Strategy](concatenate.md) - For handling content larger than the context window
31+
- [PAGINATE Strategy](paginate.md) - For processing multi-page documents in parallel
32+
33+
The choice of completion strategy depends on your specific use case:
34+
35+
**Use FORBIDDEN when:**
36+
37+
- Content is guaranteed to fit in context window
38+
- You need the simplest possible processing and default behavior
39+
- You want to ensure content is processed as a single unit
40+
41+
**Use [CONCATENATE](concatenate.md) when:**
42+
43+
- Content might exceed context window
44+
- The response exceeds the model's output limit, but the content still fits in the input context window
45+
- You want automatic handling of large content
46+
47+
**Use [PAGINATE](paginate.md) when:**
48+
49+
- Processing multi-page documents
50+
- The size exceeds both the output limit and the input context window
51+
- You need sophisticated conflict resolution between pages
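
The guidance above can be condensed into a small decision helper. This is a sketch under assumptions: the `Strategy` enum is a local stand-in for the three strategy names, and the token counts and limits are values the caller would have to estimate; none of this is part of the ExtractThinker API.

```python
from enum import Enum


class Strategy(Enum):
    """Local stand-in mirroring the three strategy names on this page."""
    FORBIDDEN = "forbidden"
    CONCATENATE = "concatenate"
    PAGINATE = "paginate"


def choose_strategy(
    input_tokens: int,
    expected_output_tokens: int,
    input_limit: int,
    output_limit: int,
) -> Strategy:
    """Pick a strategy from estimated sizes and the model's limits."""
    if input_tokens > input_limit:
        # The content does not fit in the input window at all.
        return Strategy.PAGINATE
    if expected_output_tokens > output_limit:
        # The input fits, but the response would be truncated.
        return Strategy.CONCATENATE
    # Everything fits in a single request.
    return Strategy.FORBIDDEN
```

For example, `choose_strategy(50_000, 20_000, 128_000, 4_096)` returns `Strategy.CONCATENATE`, since the input fits but the expected output does not.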
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
1+
# PAGINATE Strategy
2+
3+
The PAGINATE strategy processes multi-page documents by handling each page independently and then intelligently merging the results, including sophisticated conflict resolution when pages contain overlapping information.
4+
5+
<div align="center">
6+
<img src="../../../assets/completion_paginate.png" alt="Paginate Strategy" width="35%">
7+
</div>
8+
9+
## How It Works
10+
11+
**Page Separation**
12+
13+
- Identifies individual pages
14+
- Preserves page metadata
15+
- Maintains document structure
16+
17+
**Parallel Processing**
18+
19+
- Each page processed independently
20+
- Uses full context window per page
21+
- Handles page-specific content
22+
23+
**Result Collection**
24+
25+
- Gathers results from all pages
26+
- Validates individual page results
27+
- Prepares for merging
28+
29+
**Conflict Resolution**
30+
31+
- Detects overlapping information
32+
- Resolves conflicts using confidence scores
33+
- Maintains data consistency
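
The flow above can be sketched roughly as follows. It is a simplified illustration, not the library's `PaginationHandler`: the `PageResult` shape (field name mapped to a value and a confidence score), `merge_pages`, `paginate`, and `fake_extract` are all assumptions made for the example, and conflicts are resolved by simply keeping the highest-confidence value.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical per-page result: field name -> (value, confidence score)
PageResult = Dict[str, Tuple[object, float]]


def merge_pages(results: List[PageResult]) -> Dict[str, object]:
    """Merge page results, keeping the highest-confidence value per field."""
    best: Dict[str, Tuple[object, float]] = {}
    for page in results:
        for field, (value, confidence) in page.items():
            if field not in best or confidence > best[field][1]:
                best[field] = (value, confidence)
    return {field: value for field, (value, _) in best.items()}


def paginate(
    pages: List[str],
    extract_page: Callable[[str], PageResult],
) -> Dict[str, object]:
    """Extract every page independently in a thread pool, then merge."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(extract_page, pages))
    return merge_pages(results)


def fake_extract(page: str) -> PageResult:
    """Stand-in for a per-page LLM call returning canned results."""
    canned = {
        "page 1": {"invoice_number": ("INV-1", 0.9), "total": (100.0, 0.4)},
        "page 2": {"total": (120.0, 0.8)},
    }
    return canned[page]


merged = paginate(["page 1", "page 2"], fake_extract)
```

Here both pages report a `total`, and the value from page 2 wins because its confidence score is higher.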
34+
35+
## Usage
36+
37+
```python
from extract_thinker import Extractor
from extract_thinker.models.completion_strategy import CompletionStrategy

extractor = Extractor()
extractor.load_llm("gpt-4")

# file_path: path to the document; ResponseModel: the contract describing the output
result = extractor.extract(
    file_path,
    ResponseModel,
    completion_strategy=CompletionStrategy.PAGINATE
)
```
50+
51+
## Benefits
52+
53+
- **Cheaper**: Several small parallel requests can cost less than one long CONCATENATE run
54+
- **Parallel Processing**: Pages can be processed independently
55+
- **Conflict Resolution**: Smart merging of results from different pages
56+
- **Scalability**: Handles documents of any length
57+
- **Accuracy**: Each page gets full context window attention
58+
59+
## Implementation Details
60+
61+
??? example "Pagination Handler Implementation"
62+
The PaginationHandler implements the PAGINATE strategy:
63+
```python
64+
--8<-- "extract_thinker/pagination_handler.py"
65+
```
66+
67+
## When to Use
68+
69+
PAGINATE is the best choice when:
70+
71+
**Context window is small**
72+
73+
- For local LLMs with smaller context windows (e.g., Llama 3.3 with an 8k context window)
74+
75+
**The content is too big**
76+
77+
- When the file will not fit in the context window (e.g., a 500-page document)
78+
79+
**Model Accuracy**
80+
81+
- LLMs can lose focus when the context is too big; processing one page at a time mitigates this

docs/core-concepts/contracts/index.md

Lines changed: 6 additions & 6 deletions
@@ -5,6 +5,11 @@
55

66
Contracts in ExtractThinker are Pydantic models that define the structure of data you want to extract. They provide type safety and validation for your extracted data.
77

8+
??? example "Base Contract Implementation"
9+
```python
10+
--8<-- "extract_thinker/models/contract.py"
11+
```
12+
813
## Basic Usage
914

1015
```python
@@ -24,9 +29,4 @@ class InvoiceContract(Contract):
2429
total_amount: float = Field(description="Total invoice amount")
2530
line_items: List[InvoiceLineItem] = Field(description="List of items in invoice")
2631
notes: Optional[str] = Field(description="Additional notes", default=None)
27-
```
28-
29-
??? example "Base Contract Implementation"
30-
```python
31-
--8<-- "extract_thinker/models/contract.py"
32-
```
32+
```

docs/core-concepts/document-loaders/aws-textract.md

Lines changed: 11 additions & 9 deletions
@@ -58,14 +58,16 @@ The loader returns a dictionary with the following structure:
5858

5959
## Best Practices
6060

61-
1. **Document Preparation**
62-
- Use high-quality scans
63-
- Support formats: PDF, JPEG, PNG
64-
- Consider file size limits
61+
**Document Preparation**
6562

66-
2. **Performance**
67-
- Cache results when possible
68-
- Process pages individually for large documents
69-
- Monitor API quotas and costs
63+
- Use high-quality scans
64+
- Supported formats: `PDF`, `JPEG`, `PNG`
65+
- Consider file size limits
7066

71-
For more examples and implementation details, check out the [AWS Stack](../../examples/aws-textract) in the repository.
67+
**Performance**
68+
69+
- Cache results when possible
70+
- Process pages individually for large documents
71+
- Monitor API quotas and costs
72+
73+
For more examples and implementation details, check out the [AWS Stack](../../../examples/aws-stack) in the repository.

docs/core-concepts/document-loaders/azure-form.md

Lines changed: 2 additions & 9 deletions
@@ -52,13 +52,6 @@ for page in result["pages"]:
5252
print(f"Table data: {table}")
5353
```
5454

55-
Document Intelligence supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.
55+
Supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.
5656

57-
## Best Practices
58-
59-
- Use high-quality scans for best results
60-
- Consider caching results (built-in TTL of 300 seconds)
61-
- Handle tables and paragraphs separately for better accuracy
62-
- Process documents page by page for large files
63-
64-
For more examples and implementation details, check out the [Azure Stack](../../examples/azure-form.md) in the repository.
57+
For more examples and implementation details, check out the [Azure Stack](../../../examples/azure-stack) in the repository.
