Commit bb3a58f

Merge pull request #285 from enoch3712/284-feature-evals---v1

[feature] evals v1

2 parents: a0c9b90 + 55127ac

27 files changed: +3,887 / -45 lines

docs/assets/student_teacher.png (693 KB)

Lines changed: 145 additions & 0 deletions
# Cost Tracking <span class="beta-badge">🧪 In Beta</span>

## Overview

ExtractThinker provides built-in cost tracking for evaluations, helping you monitor token usage and the associated costs when using various LLM models.

## Basic Usage

To enable cost tracking in your evaluations:

```python
from extract_thinker import Extractor, Contract
from extract_thinker.eval import Evaluator, FileSystemDataset

# Initialize your extractor
extractor = Extractor()
extractor.load_llm("gpt-4o")

# Create evaluator with cost tracking enabled
evaluator = Evaluator(
    extractor=extractor,
    response_model=YourContract,
    track_costs=True  # Enable cost tracking
)

# Run evaluation
report = evaluator.evaluate(dataset)
```

## Command Line Usage

You can also enable cost tracking through the CLI:

```bash
extract_thinker-eval --config eval_config.json --output results.json --track-costs
```

Or in your config file:

```json
{
  "evaluation_name": "Invoice Extraction Test",
  "dataset_name": "Invoice Dataset",
  "contract_path": "./contracts/invoice_contract.py",
  "documents_dir": "./test_invoices/",
  "labels_path": "./test_invoices/labels.json",
  "track_costs": true,
  "llm": {
    "model": "gpt-4o"
  }
}
```

## How It Works

Cost tracking leverages LiteLLM's built-in cost calculation features to:

1. Count input and output tokens for each extraction
2. Calculate costs based on current model pricing
3. Aggregate metrics across all evaluated documents
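
Roughly the same bookkeeping can be reproduced with LiteLLM's public helpers. The sketch below is illustrative rather than ExtractThinker's actual internals: it assumes one LiteLLM call per document and uses the response's `usage` block plus `litellm.completion_cost` to build the totals.

```python
import litellm

# One extraction call per document (illustrative prompt)
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract the invoice fields from ..."}],
)

# 1. Token counts come back on the response's usage block
input_tokens = response.usage.prompt_tokens
output_tokens = response.usage.completion_tokens

# 2. LiteLLM prices the call from its built-in model pricing table
call_cost = litellm.completion_cost(completion_response=response)

# 3. Aggregate across all evaluated documents
totals = {"cost": 0.0, "input_tokens": 0, "output_tokens": 0}
totals["cost"] += call_cost
totals["input_tokens"] += input_tokens
totals["output_tokens"] += output_tokens
```
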
## Interpreting Results

Cost metrics appear in the evaluation report:

```
=== Cost Metrics ===
Total cost: $2.4768
Average cost per document: $0.0495
Total tokens: 123,840
  - Input tokens: 98,450
  - Output tokens: 25,390
```

The cost data is also available programmatically:

```python
# Access overall cost metrics
total_cost = report.metrics["total_cost"]
average_cost = report.metrics["average_cost"]
total_tokens = report.metrics["total_tokens"]

# Access document-specific costs
for result in report.results:
    doc_id = result["doc_id"]
    doc_tokens = result["tokens"]
    doc_cost = result["cost"]

    print(f"Document: {doc_id}")
    print(f"  Cost: ${doc_cost:.4f}")
    print(f"  Input tokens: {doc_tokens['input']}")
    print(f"  Output tokens: {doc_tokens['output']}")
    print(f"  Total tokens: {doc_tokens['total']}")
```

## Cost-Benefit Analysis

Cost tracking is particularly useful for:

1. **Model comparison**: Understand the cost-accuracy tradeoffs between different models
2. **Optimization**: Identify expensive documents that might need prompt optimization
3. **Budgeting**: Estimate production deployment costs based on evaluation results
4. **ROI calculation**: Compare accuracy improvements against the added cost required to achieve them
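
For the optimization and budgeting points in particular, the per-document fields shown under "Interpreting Results" (`doc_id`, `cost`, `tokens`) are enough. The sketch below ranks documents by cost and extrapolates a monthly budget; the expected volume is a hypothetical planning figure, not something the report provides.

```python
# Rank documents by cost to find candidates for prompt optimization
most_expensive = sorted(report.results, key=lambda r: r["cost"], reverse=True)[:5]
for result in most_expensive:
    print(f"{result['doc_id']}: ${result['cost']:.4f} ({result['tokens']['total']} tokens)")

# Rough production budget: average evaluation cost times expected volume
expected_docs_per_month = 10_000  # hypothetical planning figure
estimated_monthly_cost = report.metrics["average_cost"] * expected_docs_per_month
print(f"Estimated monthly cost: ${estimated_monthly_cost:,.2f}")
```
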
## Teacher-Student Integration

Cost tracking works seamlessly with the teacher-student evaluation approach to help quantify the cost-benefit relationship of using more capable models:

```python
from extract_thinker.eval import TeacherStudentEvaluator

# Set up teacher-student evaluator with cost tracking
evaluator = TeacherStudentEvaluator(
    student_extractor=student_extractor,
    teacher_extractor=teacher_extractor,
    response_model=InvoiceContract,
    track_costs=True  # Enable cost tracking
)

# Run evaluation
report = evaluator.evaluate(dataset)

# The report will include cost differences between teacher and student models
student_cost = report.metrics["student_average_cost"]
teacher_cost = report.metrics["teacher_average_cost"]
cost_ratio = teacher_cost / student_cost

print(f"Cost ratio (teacher/student): {cost_ratio:.2f}x")
print(f"Accuracy improvement: {report.metrics['document_accuracy_improvement']:.2f}%")
```

This helps answer questions like "Is a 15% accuracy improvement worth a 3x cost increase?"
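
One way to make that question concrete is to normalize the extra spend by the accuracy gained. A small sketch using only the metrics and variables shown above:

```python
# Extra teacher cost per document, per percentage point of accuracy gained
accuracy_gain = report.metrics["document_accuracy_improvement"]  # percentage points
extra_cost_per_doc = teacher_cost - student_cost

if accuracy_gain > 0:
    cost_per_point = extra_cost_per_doc / accuracy_gain
    print(f"Extra cost per accuracy point: ${cost_per_point:.4f} per document")
else:
    print("No accuracy improvement: the cheaper student model wins outright.")
```
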
## Supported Models

Cost tracking works with all models supported by LiteLLM, including:

- OpenAI models (GPT-3.5, GPT-4, etc.)
- Claude models (Claude 3 Opus, Sonnet, etc.)
- Mistral models
- Most other major LLM providers

## Limitations

- Costs are estimated based on current pricing and may not reflect custom pricing arrangements
- For some models, costs may be approximate if token counting methods vary
- Document loading/preprocessing costs are not included
Lines changed: 139 additions & 0 deletions
# Field Comparison Types <span class="beta-badge">🧪 In Beta</span>

When evaluating extraction results, different fields may require different comparison methods. For example:

- **ID fields** (like invoice numbers) typically require exact matching
- **Text descriptions** might benefit from semantic similarity comparison
- **Numeric values** could use tolerance-based comparison
- **Notes or comments** might allow for fuzzy matching

ExtractThinker's evaluation framework supports multiple comparison methods to address these different requirements.

## Available Comparison Types

| Comparison Type | Description | Best For |
|-----------------|-------------|----------|
| `EXACT` | Perfect string/value match (default) | IDs, codes, dates, categorical values |
| `FUZZY` | Approximate string matching using Levenshtein distance | Text with potential minor variations |
| `SEMANTIC` | Semantic similarity using embeddings | Descriptions, summaries, longer text |
| `NUMERIC` | Numeric comparison with percentage tolerance | Amounts, quantities, measurements |
| `CUSTOM` | Custom comparison function | Complex or domain-specific comparisons |

## Basic Usage

```python
from extract_thinker import Extractor, Contract
from extract_thinker.eval import Evaluator, FileSystemDataset, ComparisonType

# Define your contract
class InvoiceContract(Contract):
    invoice_number: str  # Needs exact matching
    description: str     # Can use semantic similarity
    total_amount: float  # Can use numeric tolerance

# Initialize your extractor
extractor = Extractor()
extractor.load_llm("gpt-4o")

# Create a dataset
dataset = FileSystemDataset(
    documents_dir="./test_invoices/",
    labels_path="./test_invoices/labels.json",
    name="Invoice Test Set"
)

# Set up evaluator with different field comparison types
evaluator = Evaluator(
    extractor=extractor,
    response_model=InvoiceContract,
    field_comparisons={
        "invoice_number": ComparisonType.EXACT,   # Must match exactly
        "description": ComparisonType.SEMANTIC,   # Compare meaning
        "total_amount": ComparisonType.NUMERIC    # Allow small % difference
    }
)

# Run evaluation
report = evaluator.evaluate(dataset)
```

## Configuring Comparison Parameters

Each comparison type has configurable parameters:

```python
# Configure the threshold for semantic similarity
# (description should be at least 80% similar)
evaluator.set_field_comparison(
    "description",
    ComparisonType.SEMANTIC,
    similarity_threshold=0.8
)

# Configure the tolerance for numeric fields
# (total_amount can be within 2% of the expected value)
evaluator.set_field_comparison(
    "total_amount",
    ComparisonType.NUMERIC,
    numeric_tolerance=0.02
)
```
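
The comparison table also lists `FUZZY`, which is not configured anywhere on this page. The snippet below is a hypothetical sketch: the field name is made up, and the `similarity_threshold` keyword is assumed by analogy with the `SEMANTIC` example rather than confirmed from the API.

```python
# Hypothetical: tolerate minor typos in a free-text field
evaluator.set_field_comparison(
    "customer_notes",           # illustrative field name
    ComparisonType.FUZZY,
    similarity_threshold=0.9    # assumed keyword, mirroring the SEMANTIC example
)
```
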
## Custom Comparison Functions

For specialized comparisons, you can define custom comparison functions:

```python
def compare_dates(expected, predicted):
    """Custom date comparison that handles different date formats."""
    from datetime import datetime
    # Try to parse both as dates
    try:
        expected_date = datetime.strptime(expected, "%Y-%m-%d")
        # Try different formats for predicted
        for fmt in ["%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%B %d, %Y"]:
            try:
                predicted_date = datetime.strptime(predicted, fmt)
                return expected_date == predicted_date
            except ValueError:
                continue
        return False
    except ValueError:
        return expected == predicted

# Set custom comparison
evaluator.set_field_comparison(
    "invoice_date",
    ComparisonType.CUSTOM,
    custom_comparator=compare_dates
)
```
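
A quick sanity check of the comparator on its own, with illustrative values:

```python
print(compare_dates("2024-03-15", "03/15/2024"))      # True: same date, different format
print(compare_dates("2024-03-15", "March 16, 2024"))  # False: the dates differ
print(compare_dates("not-a-date", "not-a-date"))      # True: falls back to string equality
```
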
## Results Interpretation

The evaluation report will show which comparison type was used for each field:

```
=== Field-Level Metrics ===
invoice_number (comparison: exact):
  Precision: 98.00%
  Recall: 98.00%
  F1 Score: 98.00%
  Accuracy: 98.00%
description (comparison: semantic):
  Precision: 92.00%
  Recall: 92.00%
  F1 Score: 92.00%
  Accuracy: 92.00%
total_amount (comparison: numeric):
  Precision: 96.00%
  Recall: 96.00%
  F1 Score: 96.00%
  Accuracy: 96.00%
```

## Best Practices

- Use `EXACT` for fields where precise matching is critical (IDs, codes)
- Use `SEMANTIC` for long-form text that may vary in wording but should convey the same meaning
- Use `NUMERIC` for financial data, allowing for small rounding differences
- Use `FUZZY` for fields that may contain typos or minor variations
- Configure thresholds based on your application's tolerance for errors
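
Putting these guidelines together, a combined mapping for the invoice contract used earlier might look like the sketch below. The `notes` field is illustrative (it is not part of `InvoiceContract` as defined above), and `CUSTOM` fields still need their comparator attached separately.

```python
evaluator = Evaluator(
    extractor=extractor,
    response_model=InvoiceContract,
    field_comparisons={
        "invoice_number": ComparisonType.EXACT,   # IDs: exact match
        "description": ComparisonType.SEMANTIC,   # long-form text: compare meaning
        "total_amount": ComparisonType.NUMERIC,   # financial data: small tolerance
        "notes": ComparisonType.FUZZY             # illustrative free-text field
    }
)

# CUSTOM comparisons are attached with their comparator, as shown above
evaluator.set_field_comparison(
    "invoice_date",
    ComparisonType.CUSTOM,
    custom_comparator=compare_dates
)
```
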
