# Field Comparison Types <span class="beta-badge">🧪 In Beta</span>

When evaluating extraction results, different fields may require different comparison methods. For example:

- **ID fields** (like invoice numbers) typically require exact matching
- **Text descriptions** might benefit from semantic similarity comparison
- **Numeric values** could use tolerance-based comparison
- **Notes or comments** might allow for fuzzy matching

ExtractThinker's evaluation framework supports multiple comparison methods to address these different requirements.

## Available Comparison Types

| Comparison Type | Description | Best For |
|-----------------|-------------|----------|
| `EXACT` | Exact string/value match (default) | IDs, codes, dates, categorical values |
| `FUZZY` | Approximate string matching using Levenshtein distance | Text with potential minor variations |
| `SEMANTIC` | Semantic similarity using embeddings | Descriptions, summaries, longer text |
| `NUMERIC` | Numeric comparison with percentage tolerance | Amounts, quantities, measurements |
| `CUSTOM` | Custom comparison function | Complex or domain-specific comparisons |
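To make the `FUZZY` row concrete: Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another, and a similarity score is usually obtained by normalizing that distance by the longer string's length. The sketch below shows the idea from scratch; the function names are illustrative and not part of ExtractThinker's API.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]

def fuzzy_similarity(expected: str, predicted: str) -> float:
    """Normalize edit distance into a 0..1 similarity score."""
    if not expected and not predicted:
        return 1.0
    dist = levenshtein(expected, predicted)
    return 1.0 - dist / max(len(expected), len(predicted))
```

Under this scoring, `"ACME Corp"` vs `"ACME Corp."` differs by one edit out of ten characters, giving a similarity of 0.9, which is why fuzzy matching tolerates typos and minor punctuation drift that `EXACT` would reject.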
## Basic Usage

```python
from extract_thinker import Extractor, Contract
from extract_thinker.eval import Evaluator, FileSystemDataset, ComparisonType

# Define your contract
class InvoiceContract(Contract):
    invoice_number: str  # Needs exact matching
    description: str     # Can use semantic similarity
    total_amount: float  # Can use numeric tolerance

# Initialize your extractor
extractor = Extractor()
extractor.load_llm("gpt-4o")

# Create a dataset
dataset = FileSystemDataset(
    documents_dir="./test_invoices/",
    labels_path="./test_invoices/labels.json",
    name="Invoice Test Set"
)

# Set up evaluator with different field comparison types
evaluator = Evaluator(
    extractor=extractor,
    response_model=InvoiceContract,
    field_comparisons={
        "invoice_number": ComparisonType.EXACT,   # Must match exactly
        "description": ComparisonType.SEMANTIC,   # Compare meaning
        "total_amount": ComparisonType.NUMERIC    # Allow small % difference
    }
)

# Run evaluation
report = evaluator.evaluate(dataset)
```

## Configuring Comparison Parameters

Each comparison type has configurable parameters:

```python
# Configure thresholds for semantic similarity (description should be at least 80% similar)
evaluator.set_field_comparison(
    "description",
    ComparisonType.SEMANTIC,
    similarity_threshold=0.8
)

# Configure tolerance for numeric fields (total_amount can be within 2% of expected)
evaluator.set_field_comparison(
    "total_amount",
    ComparisonType.NUMERIC,
    numeric_tolerance=0.02
)
```
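A `numeric_tolerance` of 0.02 amounts to a relative-difference test: the prediction passes if it deviates from the expected value by at most 2%. A minimal sketch of that check (the function name is illustrative, not the library's internal API):

```python
def numeric_match(expected: float, predicted: float, tolerance: float = 0.02) -> bool:
    """True when predicted is within `tolerance` (a fraction of expected) of expected."""
    if expected == 0:
        # Relative difference is undefined at zero; require an exact match
        return predicted == 0
    return abs(predicted - expected) / abs(expected) <= tolerance
```

So with a 2% tolerance, an expected total of 100.00 accepts a predicted 101.50 but rejects 103.00. Tighten the tolerance for fields where rounding is the only acceptable source of difference.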
## Custom Comparison Functions

For specialized comparisons, you can define custom comparison functions:

```python
def compare_dates(expected, predicted):
    """Custom date comparison that handles different date formats."""
    from datetime import datetime
    # Try to parse both as dates
    try:
        expected_date = datetime.strptime(expected, "%Y-%m-%d")
        # Try different formats for predicted
        for fmt in ["%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%B %d, %Y"]:
            try:
                predicted_date = datetime.strptime(predicted, fmt)
                return expected_date == predicted_date
            except ValueError:
                continue
        return False
    except ValueError:
        return expected == predicted

# Set custom comparison
evaluator.set_field_comparison(
    "invoice_date",
    ComparisonType.CUSTOM,
    custom_comparator=compare_dates
)
```

## Results Interpretation

The evaluation report will show which comparison type was used for each field:

```
=== Field-Level Metrics ===
invoice_number (comparison: exact):
  Precision: 98.00%
  Recall: 98.00%
  F1 Score: 98.00%
  Accuracy: 98.00%
description (comparison: semantic):
  Precision: 92.00%
  Recall: 92.00%
  F1 Score: 92.00%
  Accuracy: 92.00%
total_amount (comparison: numeric):
  Precision: 96.00%
  Recall: 96.00%
  F1 Score: 96.00%
  Accuracy: 96.00%
```

## Best Practices

- Use `EXACT` for fields where precise matching is critical (IDs, codes)
- Use `SEMANTIC` for long-form text that may vary in wording but should convey the same meaning
- Use `NUMERIC` for financial data, allowing for small rounding differences
- Use `FUZZY` for fields that may contain typos or minor variations
- Configure thresholds based on your application's tolerance for errors
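To see why `SEMANTIC` tolerates reworded text, it helps to look at the threshold mechanics: two texts are embedded as vectors and compared by cosine similarity against the configured threshold. The sketch below uses a deliberately simplified bag-of-words vectorizer as a stand-in for real embedding models, purely to show how the threshold is applied; the helper names are illustrative, not ExtractThinker internals.

```python
import math
from collections import Counter

def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * v.get(word, 0) for word, count in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def semantic_match(expected: str, predicted: str, threshold: float = 0.8) -> bool:
    """Toy stand-in: real SEMANTIC comparison embeds text with a model,
    then applies the same cosine-vs-threshold decision shown here."""
    u = Counter(expected.lower().split())
    v = Counter(predicted.lower().split())
    return cosine_similarity(u, v) >= threshold
```

Even this toy version accepts a description whose words are merely reordered, while rejecting unrelated text; a real embedding model additionally recognizes synonyms and paraphrases, which is what makes `SEMANTIC` the right choice for long-form fields.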