Commit fb431a3

Merge pull request #228 from enoch3712/225-add-pydanticai-support
add pydanticai support
2 parents 7236493 + 6a94d2f commit fb431a3

14 files changed: +382 −67 lines

docs/assets/dynamic_parsing.png (new binary file, 459 KB)

docs/core-concepts/llm-integration/dynamic-parsing.md (new file, +78 lines)

# Dynamic Parsing

<div align="center">
<img src="../../../assets/dynamic_parsing.png" alt="Dynamic Parsing" width="90%">
</div>

Dynamic parsing enables flexible handling of structured output from LLM responses. It is particularly useful with reasoning models such as DeepSeek R1.

## Overview

Dynamic parsing is enabled with the `set_dynamic()` method on your LLM instance. When enabled, the LLM will:

1. Attempt to parse and validate JSON responses
2. Include the structured thinking process in the output
3. Handle complex response models dynamically

## Usage

Enable dynamic parsing on your LLM instance:

```python
from extract_thinker import LLM

# Initialize the LLM
llm = LLM("ollama/deepseek-r1:1.5b")

# Enable dynamic parsing
llm.set_dynamic(True)
```

When enabled, requests are wrapped in this prompt template:

```
Please provide your thinking process within <think> tags, followed by your JSON output.

JSON structure:
{your_structure}

OUTPUT example:
<think>
Your step-by-step reasoning and analysis goes here...
</think>

##JSON OUTPUT
{
    ...
}
```
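Under the hood, the response has to be post-processed: the `<think>` block is stripped and the JSON tail is validated against the response model (the library's own helper for this, `extract_thinking_json`, is imported in `llm.py`). A simplified standalone sketch of that idea — illustrative only, not the library's actual implementation:

```python
import json
import re
from typing import Type, TypeVar

from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)

def parse_dynamic_response(text: str, model: Type[T]) -> T:
    """Sketch: strip <think>...</think>, then validate the JSON that follows."""
    # Remove the reasoning block the template asks the model to emit
    body = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Grab the JSON object from the ##JSON OUTPUT section
    match = re.search(r"\{.*\}", body, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in response")
    return model.model_validate(json.loads(match.group(0)))
```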
## Example: Invoice Extraction

Here's a complete example of using dynamic parsing for invoice extraction:

```python
from extract_thinker import LLM, Extractor
from extract_thinker.document_loader import DocumentLoaderPyPdf
from pydantic import BaseModel
from typing import List, Optional

# Define your invoice model
class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total_amount: float
    vendor_name: str
    line_items: List[dict]
    payment_terms: Optional[str] = None

# Initialize the LLM with dynamic parsing
llm = LLM("ollama/deepseek-r1:1.5b")
llm.set_dynamic(True)  # Enable dynamic JSON parsing

# Set up the document loader and extractor
document_loader = DocumentLoaderPyPdf()
extractor = Extractor(document_loader=document_loader, llm=llm)

# Extract information from the invoice
result = extractor.extract("path/to/invoice.pdf", response_model=InvoiceData)
```
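`extract()` returns a populated instance of the response model, so the result behaves like any Pydantic object (assuming Pydantic v2 for `model_dump_json`):

```python
# result is an InvoiceData instance
print(result.invoice_number, result.total_amount)
print(result.model_dump_json(indent=2))  # serialize for logging or storage
```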
docs/core-concepts/llm-integration/index.md (+43 −12)

````diff
@@ -1,30 +1,61 @@
 # LLM Integration
 
-!!! warning "🚧 In Development"
-    This component is currently under active development. The API might change in future releases.
-
 The LLM component in ExtractThinker acts as a bridge between your document processing pipeline and various Language Model providers. It handles request formatting, response parsing, and provider-specific optimizations.
 
 <div align="center">
     <img src="../../assets/llm_image.png" alt="LLM Architecture" width="50%">
 </div>
 
-The architecture consists of:
-
-- **Parser**: Uses [instructor](https://github.com/jxnl/instructor) for structured outputs with Pydantic
-- **LLM Broker**: Leverages [litellm](https://github.com/BerriAI/litellm) for unified model interface
-
 ??? example "Base LLM Implementation"
     ```python
     --8<-- "extract_thinker/llm.py"
     ```
 
-## Basic Usage
+The architecture supports two different stacks:
+
+**Default Stack**: Combines instructor and litellm
+
+- Uses [instructor](https://python.useinstructor.com/) for structured outputs with Pydantic
+- Leverages [litellm](https://docs.litellm.ai/docs/) for unified model interface
+
+**Pydantic AI Stack** <span class="beta-badge">🧪 In Beta</span>
+
+- All-in-one solution for Pydantic model integration
+- Handles both model interfacing and structured outputs
+- Built by the Pydantic team ([Learn more](https://ai.pydantic.dev/))
+
+## Backend Options
 
 ```python
 from extract_thinker import LLM
+from extract_thinker.llm_engine import LLMEngine
 
-# Initialize with specific model
+# Initialize with default stack (instructor + litellm)
 llm = LLM("gpt-4o")
+
+# Or use Pydantic AI stack (Beta)
+llm = LLM("openai:gpt-4o", backend=LLMEngine.PYDANTIC_AI)
 ```
+
+ExtractThinker supports two LLM stacks:
+
+### Default Stack (instructor + litellm)
+
+The default stack combines instructor for structured outputs and litellm for model interfacing. It leverages [LiteLLM's unified API](https://docs.litellm.ai/docs/#litellm-python-sdk) for consistent model access:
+
+```python
+llm = LLM("gpt-4o", backend=LLMEngine.DEFAULT)
+```
+
+### Pydantic AI Stack (Beta)
+
+An alternative all-in-one solution for model integration powered by [Pydantic AI](https://ai.pydantic.dev/):
+
+```python
+llm = LLM("openai:gpt-4o", backend=LLMEngine.PYDANTIC_AI)
+```
+
+!!! note "Pydantic AI Limitations"
+    - Batch processing is not supported with the Pydantic AI backend
+    - Router functionality is not available
+    - Requires the `pydantic-ai` package to be installed
+
+[Read more about Pydantic AI features](https://ai.pydantic.dev/#why-use-pydanticai)
````
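Under the hood, the beta backend delegates to a `pydantic_ai.Agent` (see the `llm.py` changes below). A minimal standalone sketch of the pydantic-ai API it builds on — assuming a pydantic-ai version contemporary with this PR; the model string and invoice fields are illustrative:

```python
from pydantic import BaseModel
from pydantic_ai import Agent

class Invoice(BaseModel):
    invoice_number: str
    total_amount: float

# Roughly what LLM("openai:gpt-4o", backend=LLMEngine.PYDANTIC_AI) wires up;
# this sketch pins the result type at construction time
agent = Agent("openai:gpt-4o", result_type=Invoice)
result = agent.run_sync("Extract: invoice INV-001, total 120.50")
print(result.data)  # -> Invoice(invoice_number='INV-001', total_amount=120.5)
```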

docs/stylesheets/extra.css (+11)

```diff
@@ -315,4 +315,15 @@
 .md-nav__item a[href="#"] span:first-child {
     opacity: 0.7;
     filter: grayscale(1);
+}
+
+/* Beta badge */
+.beta-badge {
+    background-color: #f5f3ff;
+    border: 1px solid #8b5cf6;
+    border-radius: 4px;
+    padding: 2px 8px;
+    font-size: 0.875rem;
+    color: #6d28d9;
+    margin-left: 8px;
 }
```

extract_thinker/extractor.py (+14)

```diff
@@ -5,6 +5,7 @@
 import uuid
 import litellm
 from pydantic import BaseModel
+from extract_thinker.llm_engine import LLMEngine
 from extract_thinker.concatenation_handler import ConcatenationHandler
 from extract_thinker.document_loader.document_loader import DocumentLoader
 from extract_thinker.document_loader.document_loader_llm_image import DocumentLoaderLLMImage
@@ -759,7 +760,20 @@ def extract_batch(
 
         Returns:
             A BatchJob object to monitor and retrieve batch processing results.
+
+        Raises:
+            ValueError: If batch processing is not supported by the current LLM configuration
         """
+        if not self.llm:
+            raise ValueError("LLM is not set. Please set an LLM before extraction.")
+
+        # Check if using the pydantic-ai backend
+        if self.llm.backend == LLMEngine.PYDANTIC_AI:
+            raise ValueError(
+                "Batch processing is not supported with the PYDANTIC_AI backend. "
+                "Please use GPT-4o models and the default backend for batch operations."
+            )
+
         if not self.can_handle_batch():
             raise ValueError(
                 f"Model {self.llm.model} does not support batch processing. "
```

extract_thinker/llm.py (+86 −5)

```diff
@@ -1,7 +1,9 @@
+import asyncio
 from typing import List, Dict, Any, Optional
 import instructor
 import litellm
 from litellm import Router
+from extract_thinker.llm_engine import LLMEngine
 from extract_thinker.utils import add_classification_structure, extract_thinking_json
 
 # Add these constants at the top of the file, after the imports
@@ -25,15 +27,72 @@ class LLM:
     TEMPERATURE = 0  # Always zero for deterministic outputs (IDP)
     TIMEOUT = 3000  # Timeout in milliseconds
 
-    def __init__(self,
-                 model: str,
-                 token_limit: int = None):
-        self.client = instructor.from_litellm(litellm.completion, mode=instructor.Mode.MD_JSON)
+    def __init__(
+        self,
+        model: str,
+        token_limit: int = None,
+        backend: LLMEngine = LLMEngine.DEFAULT
+    ):
+        """Initialize LLM with the specified backend.
+
+        Args:
+            model: The model name (e.g. "gpt-4", "claude-3")
+            token_limit: Optional maximum tokens
+            backend: LLMEngine enum (default: LLMEngine.DEFAULT)
+        """
         self.model = model
-        self.router = None
         self.token_limit = token_limit
+        self.router = None
         self.is_dynamic = False
+        self.backend = backend
+
+        if self.backend == LLMEngine.DEFAULT:
+            self.client = instructor.from_litellm(
+                litellm.completion,
+                mode=instructor.Mode.MD_JSON
+            )
+            self.agent = None
+        elif self.backend == LLMEngine.PYDANTIC_AI:
+            self._check_pydantic_ai()
+            from typing import cast
+            from pydantic_ai import Agent
+            from pydantic_ai.models import KnownModelName
+
+            self.client = None
+            self.agent = Agent(
+                cast(KnownModelName, self.model)
+            )
+        else:
+            raise ValueError(f"Unsupported backend: {self.backend}")
+
+    @staticmethod
+    def _check_pydantic_ai():
+        """Check that pydantic-ai is installed."""
+        try:
+            import pydantic_ai
+        except ImportError:
+            raise ImportError(
+                "Could not import pydantic-ai package. "
+                "Please install it with `pip install pydantic-ai`."
+            )
+
+    @staticmethod
+    def _get_pydantic_ai():
+        """Lazily import pydantic-ai."""
+        try:
+            import pydantic_ai
+            return pydantic_ai
+        except ImportError:
+            raise ImportError(
+                "Could not import pydantic-ai package. "
+                "Please install it with `pip install pydantic-ai`."
+            )
 
     def load_router(self, router: Router) -> None:
+        """Load a LiteLLM router for model fallbacks."""
+        if self.backend != LLMEngine.DEFAULT:
+            raise ValueError("Router is only supported with the DEFAULT backend")
         self.router = router
 
     def set_dynamic(self, is_dynamic: bool) -> None:
```
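`load_router` remains available on the default stack; a sketch of the intended fallback setup with litellm's `Router` (the deployment entries are illustrative):

```python
from litellm import Router

from extract_thinker import LLM

# Two deployments behind one alias; litellm fails over between them
router = Router(model_list=[
    {"model_name": "gpt-4o", "litellm_params": {"model": "gpt-4o"}},
    {"model_name": "gpt-4o", "litellm_params": {"model": "azure/gpt-4o"}},
])

llm = LLM("gpt-4o")  # default backend: instructor + litellm
llm.load_router(router)
```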
The new backend branch in `request()`:

```diff
@@ -52,6 +111,28 @@ def request(
         messages: List[Dict[str, str]],
         response_model: Optional[str] = None
     ) -> Any:
+        # Handle Pydantic-AI backend differently
+        if self.backend == LLMEngine.PYDANTIC_AI:
+            # Combine messages into a single prompt
+            combined_prompt = " ".join([m["content"] for m in messages])
+            try:
+                # Create event loop if it doesn't exist
+                try:
+                    loop = asyncio.get_event_loop()
+                except RuntimeError:
+                    loop = asyncio.new_event_loop()
+                    asyncio.set_event_loop(loop)
+
+                result = loop.run_until_complete(
+                    self.agent.run(
+                        combined_prompt,
+                        result_type=response_model if response_model else str
+                    )
+                )
+                return result.data
+            except Exception as e:
+                raise ValueError(f"Failed to extract from source: {str(e)}")
+
         # Uncomment the following lines if you need to calculate max_tokens
         # contents = map(lambda message: message['content'], messages)
         # all_contents = ' '.join(contents)
```
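For completeness, a hedged sketch of a direct `request()` call on the beta path (the `Invoice` model and message contents are illustrative; most code goes through `Extractor` instead). Note a design consequence visible above: the prompt flattening joins only the `content` values, so role distinctions (system vs. user) are lost on this backend:

```python
from pydantic import BaseModel

from extract_thinker import LLM
from extract_thinker.llm_engine import LLMEngine

class Invoice(BaseModel):
    invoice_number: str
    total_amount: float

llm = LLM("openai:gpt-4o", backend=LLMEngine.PYDANTIC_AI)

messages = [
    {"role": "system", "content": "Extract structured invoice data."},
    {"role": "user", "content": "Invoice INV-001, total 120.50, vendor ACME."},
]
invoice = llm.request(messages, response_model=Invoice)
```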

extract_thinker/llm_engine.py (new file, +12 lines)

```python
from enum import Enum


class LLMEngine(Enum):
    """Supported LLM backends.

    Attributes:
        DEFAULT: Uses litellm + instructor for model interfacing and structured outputs
        PYDANTIC_AI: Uses pydantic-ai for enhanced Pydantic model integration
    """
    DEFAULT = "default"  # Default backend using litellm + instructor
    PYDANTIC_AI = "pydantic_ai"  # Pydantic AI backend for enhanced model integration
```

mkdocs.yml (+1)

```diff
@@ -56,6 +56,7 @@ nav:
       - Kofax: '#'
   - LLM Integration:
       - Overview: core-concepts/llm-integration/index.md
+      - Dynamic Parsing: core-concepts/llm-integration/dynamic-parsing.md
   - Classification:
       - Overview: core-concepts/classification/index.md
       - Basic Classification: core-concepts/classification/basic.md
```

pyproject.toml (+1 −1)

```diff
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "extract_thinker"
-version = "0.1.3"
+version = "0.1.4"
 description = "Library to extract data from files and documents agnostically using LLMs"
 authors = ["Júlio Almeida <[email protected]>"]
 readme = "README.md"
```

tests/critical/test_critical_classification.py (+6 −2)

```diff
@@ -24,7 +24,7 @@ def test_critical_classification():
     # Setup
     document_loader = DocumentLoaderPyPdf()
     extractor = Extractor(document_loader)
-    extractor.load_llm("groq/llama-3.1-70b-versatile")
+    extractor.load_llm("groq/llama-3.3-70b-versatile")
 
     process = Process()
     process.add_classify_extractor([[extractor]])
@@ -47,4 +47,8 @@ def test_critical_classification():
 
     # Assert
     assert result is not None
-    assert result.name == "Invoice"
+    assert result.name == "Invoice"
+
+
+if __name__ == "__main__":
+    test_critical_classification()
```

tests/critical/test_critical_extraction.py (+1 −1)

```diff
@@ -44,7 +44,7 @@ def test_critical_extract_with_pypdf():
 
     extractor = Extractor()
     extractor.load_document_loader(DocumentLoaderPyPdf())
-    extractor.load_llm("groq/llama-3.1-70b-versatile")
+    extractor.load_llm("groq/llama-3.3-70b-versatile")
 
     result = extractor.extract(test_file_path, InvoiceContract)
```
