Commit bb3a58f

Merge pull request #285 from enoch3712/284-feature-evals---v1

[feature] evals v1

2 parents: a0c9b90 + 55127ac

27 files changed: +3,887 / -45 lines

docs/assets/student_teacher.png (693 KB)

Lines changed: 145 additions & 0 deletions
# Cost Tracking <span class="beta-badge">🧪 In Beta</span>

## Overview

ExtractThinker provides built-in cost tracking for evaluations, helping you monitor token usage and the associated costs when using various LLM models.

## Basic Usage

To enable cost tracking in your evaluations:

```python
from extract_thinker import Extractor, Contract
from extract_thinker.eval import Evaluator, FileSystemDataset

# Initialize your extractor
extractor = Extractor()
extractor.load_llm("gpt-4o")

# Create evaluator with cost tracking enabled
evaluator = Evaluator(
    extractor=extractor,
    response_model=YourContract,
    track_costs=True  # Enable cost tracking
)

# Run evaluation
report = evaluator.evaluate(dataset)
```

## Command Line Usage

You can also enable cost tracking through the CLI:

```bash
extract_thinker-eval --config eval_config.json --output results.json --track-costs
```

Or in your config file:

```json
{
  "evaluation_name": "Invoice Extraction Test",
  "dataset_name": "Invoice Dataset",
  "contract_path": "./contracts/invoice_contract.py",
  "documents_dir": "./test_invoices/",
  "labels_path": "./test_invoices/labels.json",
  "track_costs": true,
  "llm": {
    "model": "gpt-4o"
  }
}
```

## How It Works

Cost tracking leverages LiteLLM's built-in cost calculation features to:

1. Count input and output tokens for each extraction
2. Calculate costs based on current model pricing
3. Aggregate metrics across all evaluated documents
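
Roughly the same bookkeeping can be reproduced with LiteLLM's public helpers. The sketch below is illustrative rather than ExtractThinker's actual internals: it assumes one LiteLLM call per document and uses the response's `usage` block plus `litellm.completion_cost` to build the totals.

```python
import litellm

# One extraction call per document (illustrative prompt)
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract the invoice fields from ..."}],
)

# 1. Token counts come back on the response's usage block
input_tokens = response.usage.prompt_tokens
output_tokens = response.usage.completion_tokens

# 2. LiteLLM prices the call from its built-in model pricing table
call_cost = litellm.completion_cost(completion_response=response)

# 3. Aggregate across all evaluated documents
totals = {"cost": 0.0, "input_tokens": 0, "output_tokens": 0}
totals["cost"] += call_cost
totals["input_tokens"] += input_tokens
totals["output_tokens"] += output_tokens
```
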
## Interpreting Results

Cost metrics appear in the evaluation report:

```
=== Cost Metrics ===
Total cost: $2.4768
Average cost per document: $0.0495
Total tokens: 123,840
  - Input tokens: 98,450
  - Output tokens: 25,390
```

The cost data is also available programmatically:

```python
# Access overall cost metrics
total_cost = report.metrics["total_cost"]
average_cost = report.metrics["average_cost"]
total_tokens = report.metrics["total_tokens"]

# Access document-specific costs
for result in report.results:
    doc_id = result["doc_id"]
    doc_tokens = result["tokens"]
    doc_cost = result["cost"]

    print(f"Document: {doc_id}")
    print(f"  Cost: ${doc_cost:.4f}")
    print(f"  Input tokens: {doc_tokens['input']}")
    print(f"  Output tokens: {doc_tokens['output']}")
    print(f"  Total tokens: {doc_tokens['total']}")
```

## Cost-Benefit Analysis

Cost tracking is particularly useful for:

1. **Model comparison**: Understand the cost-accuracy tradeoffs between different models
2. **Optimization**: Identify expensive documents that might need prompt optimization
3. **Budgeting**: Estimate production deployment costs based on evaluation results
4. **ROI calculation**: Compare accuracy improvements against the added cost required to achieve them
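
For the optimization and budgeting points in particular, the per-document fields shown under "Interpreting Results" (`doc_id`, `cost`, `tokens`) are enough. The sketch below ranks documents by cost and extrapolates a monthly budget; the expected volume is a hypothetical planning figure, not something the report provides.

```python
# Rank documents by cost to find candidates for prompt optimization
most_expensive = sorted(report.results, key=lambda r: r["cost"], reverse=True)[:5]
for result in most_expensive:
    print(f"{result['doc_id']}: ${result['cost']:.4f} ({result['tokens']['total']} tokens)")

# Rough production budget: average evaluation cost times expected volume
expected_docs_per_month = 10_000  # hypothetical planning figure
estimated_monthly_cost = report.metrics["average_cost"] * expected_docs_per_month
print(f"Estimated monthly cost: ${estimated_monthly_cost:,.2f}")
```
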
## Teacher-Student Integration

Cost tracking works seamlessly with the teacher-student evaluation approach to help quantify the cost-benefit relationship of using more capable models:

```python
from extract_thinker.eval import TeacherStudentEvaluator

# Set up teacher-student evaluator with cost tracking
evaluator = TeacherStudentEvaluator(
    student_extractor=student_extractor,
    teacher_extractor=teacher_extractor,
    response_model=InvoiceContract,
    track_costs=True  # Enable cost tracking
)

# Run evaluation
report = evaluator.evaluate(dataset)

# The report will include cost differences between teacher and student models
student_cost = report.metrics["student_average_cost"]
teacher_cost = report.metrics["teacher_average_cost"]
cost_ratio = teacher_cost / student_cost

print(f"Cost ratio (teacher/student): {cost_ratio:.2f}x")
print(f"Accuracy improvement: {report.metrics['document_accuracy_improvement']:.2f}%")
```

This helps answer questions like "Is a 15% accuracy improvement worth a 3x cost increase?"
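
One way to make that question concrete is to normalize the extra spend by the accuracy gained. A small sketch using only the metrics and variables shown above:

```python
# Extra teacher cost per document, per percentage point of accuracy gained
accuracy_gain = report.metrics["document_accuracy_improvement"]  # percentage points
extra_cost_per_doc = teacher_cost - student_cost

if accuracy_gain > 0:
    cost_per_point = extra_cost_per_doc / accuracy_gain
    print(f"Extra cost per accuracy point: ${cost_per_point:.4f} per document")
else:
    print("No accuracy improvement: the cheaper student model wins outright.")
```
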
## Supported Models

Cost tracking works with all models supported by LiteLLM, including:

- OpenAI models (GPT-3.5, GPT-4, etc.)
- Claude models (Claude 3 Opus, Sonnet, etc.)
- Mistral models
- Most other major LLM providers

## Limitations

- Costs are estimated based on current pricing and may not reflect custom pricing arrangements
- For some models, costs may be approximate if token counting methods vary
- Document loading/preprocessing costs are not included
Lines changed: 139 additions & 0 deletions
# Field Comparison Types <span class="beta-badge">🧪 In Beta</span>

When evaluating extraction results, different fields may require different comparison methods. For example:

- **ID fields** (like invoice numbers) typically require exact matching
- **Text descriptions** might benefit from semantic similarity comparison
- **Numeric values** could use tolerance-based comparison
- **Notes or comments** might allow for fuzzy matching

ExtractThinker's evaluation framework supports multiple comparison methods to address these different requirements.

## Available Comparison Types

| Comparison Type | Description | Best For |
|-----------------|-------------|----------|
| `EXACT` | Perfect string/value match (default) | IDs, codes, dates, categorical values |
| `FUZZY` | Approximate string matching using Levenshtein distance | Text with potential minor variations |
| `SEMANTIC` | Semantic similarity using embeddings | Descriptions, summaries, longer text |
| `NUMERIC` | Numeric comparison with percentage tolerance | Amounts, quantities, measurements |
| `CUSTOM` | Custom comparison function | Complex or domain-specific comparisons |

## Basic Usage

```python
from extract_thinker import Extractor, Contract
from extract_thinker.eval import Evaluator, FileSystemDataset, ComparisonType

# Define your contract
class InvoiceContract(Contract):
    invoice_number: str  # Needs exact matching
    description: str     # Can use semantic similarity
    total_amount: float  # Can use numeric tolerance

# Initialize your extractor
extractor = Extractor()
extractor.load_llm("gpt-4o")

# Create a dataset
dataset = FileSystemDataset(
    documents_dir="./test_invoices/",
    labels_path="./test_invoices/labels.json",
    name="Invoice Test Set"
)

# Set up evaluator with different field comparison types
evaluator = Evaluator(
    extractor=extractor,
    response_model=InvoiceContract,
    field_comparisons={
        "invoice_number": ComparisonType.EXACT,   # Must match exactly
        "description": ComparisonType.SEMANTIC,   # Compare meaning
        "total_amount": ComparisonType.NUMERIC    # Allow small % difference
    }
)

# Run evaluation
report = evaluator.evaluate(dataset)
```

## Configuring Comparison Parameters

Each comparison type has configurable parameters:

```python
# Configure the threshold for semantic similarity
# (description should be at least 80% similar)
evaluator.set_field_comparison(
    "description",
    ComparisonType.SEMANTIC,
    similarity_threshold=0.8
)

# Configure the tolerance for numeric fields
# (total_amount can be within 2% of the expected value)
evaluator.set_field_comparison(
    "total_amount",
    ComparisonType.NUMERIC,
    numeric_tolerance=0.02
)
```
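
The comparison table also lists `FUZZY`, which is not configured anywhere on this page. The snippet below is a hypothetical sketch: the field name is made up, and the `similarity_threshold` keyword is assumed by analogy with the `SEMANTIC` example rather than confirmed from the API.

```python
# Hypothetical: tolerate minor typos in a free-text field
evaluator.set_field_comparison(
    "customer_notes",           # illustrative field name
    ComparisonType.FUZZY,
    similarity_threshold=0.9    # assumed keyword, mirroring the SEMANTIC example
)
```
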
## Custom Comparison Functions

For specialized comparisons, you can define custom comparison functions:

```python
def compare_dates(expected, predicted):
    """Custom date comparison that handles different date formats."""
    from datetime import datetime
    # Try to parse both as dates
    try:
        expected_date = datetime.strptime(expected, "%Y-%m-%d")
        # Try different formats for predicted
        for fmt in ["%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%B %d, %Y"]:
            try:
                predicted_date = datetime.strptime(predicted, fmt)
                return expected_date == predicted_date
            except ValueError:
                continue
        return False
    except ValueError:
        return expected == predicted

# Set custom comparison
evaluator.set_field_comparison(
    "invoice_date",
    ComparisonType.CUSTOM,
    custom_comparator=compare_dates
)
```
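
A quick sanity check of the comparator on its own, with illustrative values:

```python
print(compare_dates("2024-03-15", "03/15/2024"))      # True: same date, different format
print(compare_dates("2024-03-15", "March 16, 2024"))  # False: the dates differ
print(compare_dates("not-a-date", "not-a-date"))      # True: falls back to string equality
```
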
## Results Interpretation

The evaluation report will show which comparison type was used for each field:

```
=== Field-Level Metrics ===
invoice_number (comparison: exact):
  Precision: 98.00%
  Recall: 98.00%
  F1 Score: 98.00%
  Accuracy: 98.00%
description (comparison: semantic):
  Precision: 92.00%
  Recall: 92.00%
  F1 Score: 92.00%
  Accuracy: 92.00%
total_amount (comparison: numeric):
  Precision: 96.00%
  Recall: 96.00%
  F1 Score: 96.00%
  Accuracy: 96.00%
```

## Best Practices

- Use `EXACT` for fields where precise matching is critical (IDs, codes)
- Use `SEMANTIC` for long-form text that may vary in wording but should convey the same meaning
- Use `NUMERIC` for financial data, allowing for small rounding differences
- Use `FUZZY` for fields that may contain typos or minor variations
- Configure thresholds based on your application's tolerance for errors
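
Putting these guidelines together, a combined mapping for the invoice contract used earlier might look like the sketch below. The `notes` field is illustrative (it is not part of `InvoiceContract` as defined above), and `CUSTOM` fields still need their comparator attached separately.

```python
evaluator = Evaluator(
    extractor=extractor,
    response_model=InvoiceContract,
    field_comparisons={
        "invoice_number": ComparisonType.EXACT,   # IDs: exact match
        "description": ComparisonType.SEMANTIC,   # long-form text: compare meaning
        "total_amount": ComparisonType.NUMERIC,   # financial data: small tolerance
        "notes": ComparisonType.FUZZY             # illustrative free-text field
    }
)

# CUSTOM comparisons are attached with their comparator, as shown above
evaluator.set_field_comparison(
    "invoice_date",
    ComparisonType.CUSTOM,
    custom_comparator=compare_dates
)
```
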
