The evaluation framework uses a golden dataset of 50 labeled cases derived from the IBM AML synthetic transaction dataset.
| Pattern Type | Count | Typology Labels |
|---|---|---|
| FAN-OUT | ~7 | structuring |
| FAN-IN | ~6 | structuring |
| CYCLE | ~7 | round_tripping |
| SCATTER-GATHER | ~6 | layering |
| GATHER-SCATTER | ~6 | layering |
| STACK | ~6 | layering, structuring |
| BIPARTITE | ~6 | round_tripping, layering |
| RANDOM | ~6 | layering |
Each case includes:
- Transaction data (10-50 transactions per case)
- Ground truth typology labels
- Laundering rate (percentage of transactions that are laundering)
- Case-level summary features
- Reference SAR narrative (LLM-generated, human-reviewed)
```
IBM AML CSV → filter Is_Laundering=1 → group by pattern type
→ sample cases → build case summaries → generate reference narratives (Groq 70B)
→ store as golden_cases.json + golden_narratives.json
```
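The filter/group/sample steps above can be sketched with the standard library alone. The column names `Is_Laundering` and `Pattern_Type` are assumptions (the actual IBM AML CSV schema may differ), and the summary-building and narrative-generation steps are omitted:

```python
import csv
import random
from collections import defaultdict

def build_golden_cases(csv_path, per_pattern=6, seed=0):
    """Filter laundering rows, group by pattern type, sample per-pattern cases.

    Assumes columns named 'Is_Laundering' and 'Pattern_Type'; the real
    dataset schema and the downstream narrative steps are not shown here.
    """
    by_pattern = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["Is_Laundering"] == "1":
                by_pattern[row["Pattern_Type"]].append(row)
    rng = random.Random(seed)  # fixed seed keeps the golden set reproducible
    return {pattern: rng.sample(rows, min(per_pattern, len(rows)))
            for pattern, rows in by_pattern.items()}
```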
**Typology Accuracy.** Compares detected typology labels against the golden ground truth using Jaccard similarity, mapping each pattern type to its expected typology labels (e.g., FAN-OUT → structuring).
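The score reduces to set intersection over union. A minimal sketch, where the pattern-to-typology mapping shown is a partial, illustrative version of the table above:

```python
def jaccard(predicted, expected):
    """Jaccard similarity between two label sets (1.0 when both are empty)."""
    p, e = set(predicted), set(expected)
    if not p and not e:
        return 1.0
    return len(p & e) / len(p | e)

# Illustrative subset of the pattern -> expected-typology mapping.
PATTERN_TYPOLOGIES = {
    "FAN-OUT": {"structuring"},
    "CYCLE": {"round_tripping"},
    "STACK": {"layering", "structuring"},
}

def typology_score(detected_labels, pattern_type):
    return jaccard(detected_labels, PATTERN_TYPOLOGIES.get(pattern_type, set()))
```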
**Confidence Calibration (ECE).** Expected Calibration Error measures whether confidence scores match actual accuracy: confidence predictions are placed into 10 equal-width bins, and the mean predicted confidence is compared against the observed positive rate within each bin.
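A minimal sketch of the ECE computation described above, assuming binary labels (1 = correct prediction) and equal-width bins:

```python
def expected_calibration_error(confidences, labels, n_bins=10):
    """ECE: bin-weighted average of |observed accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, labels):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into top bin
        bins[idx].append((conf, label))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(l for _, l in b) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece
```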
**Section Completeness.** A rule-based scorer checks each FinCEN-required section (Summary, WHO, WHAT, WHEN, WHERE, WHY, HOW, Recommended Action, Appendix) for presence and sufficient detail (>= 15 words).
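One way the rule-based check could look, assuming sections are delimited by `HEADER:` lines; the actual narrative format and parsing logic may differ:

```python
import re

REQUIRED_SECTIONS = ["Summary", "WHO", "WHAT", "WHEN", "WHERE", "WHY", "HOW",
                     "Recommended Action", "Appendix"]
MIN_WORDS = 15

def completeness_score(narrative):
    """Fraction of required sections present with >= MIN_WORDS words of body."""
    pattern = "|".join(re.escape(s) for s in REQUIRED_SECTIONS)
    # Capturing split yields [preamble, header1, body1, header2, body2, ...]
    parts = re.split(rf"^({pattern}):", narrative, flags=re.MULTILINE)
    word_counts = {header: len(body.split())
                   for header, body in zip(parts[1::2], parts[2::2])}
    passed = sum(1 for s in REQUIRED_SECTIONS if word_counts.get(s, 0) >= MIN_WORDS)
    return passed / len(REQUIRED_SECTIONS)
```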
**Numeric Consistency.** Verifies that dollar amounts, account numbers, dates, and transaction counts in the narrative match the source transaction data. This is the primary anti-hallucination check.
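A simplified sketch of the dollar-amount portion of this check; the regex and the transaction record shape (`{"amount": float}`) are assumptions, and the real scorer also covers account numbers, dates, and counts:

```python
import re

def extract_amounts(text):
    """Pull dollar amounts like $1,234.56 out of free text (simplified pattern)."""
    return {m.replace(",", "") for m in re.findall(r"\$[\d,]+(?:\.\d{2})?", text)}

def numeric_consistency(narrative, transactions):
    """Fraction of narrative dollar amounts that appear in the source data.

    A score below 1.0 flags possibly hallucinated figures.
    """
    source_amounts = {f"${t['amount']:,.2f}".replace(",", "") for t in transactions}
    narrated = extract_amounts(narrative)
    if not narrated:
        return 1.0  # no figures cited, so nothing to contradict
    return len(narrated & source_amounts) / len(narrated)
```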
**Regulatory Compliance.** Checks for FinCEN-required elements: filing institution identification, report period, regulatory keywords ("suspicious", "BSA", "CTR"), objective tone, and consistent date formatting.
**LLM-as-Judge Quality.** A separate LLM call (Groq) evaluates holistic narrative quality on clarity, completeness, professionalism, and actionability, returning a score on a 1-5 scale.
**PII Leakage.** Scans generated narratives for any unmasked PII entities (SSNs, names, addresses) that should have been anonymized. The threshold is zero tolerance: a single leak fails the run.
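A minimal regex-based sketch of the scan; the patterns and entity names here are illustrative, and a production scanner would also use NER for names and addresses:

```python
import re

# Hypothetical patterns for unmasked PII; properly anonymized narratives
# carry placeholders like [PERSON_1] or [ACCOUNT_1] instead of raw values.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "account_number": re.compile(r"\b\d{10,16}\b"),
}

def pii_leaks(narrative):
    """Return every unmasked PII hit; any hit fails the zero-tolerance check."""
    return [(kind, m.group())
            for kind, pat in PII_PATTERNS.items()
            for m in pat.finditer(narrative)]
```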
**Latency.** Measures total pipeline execution time from case ingestion to narrative output, including all LLM inference calls.
**False Positive Rate.** Measures how often the system generates SAR narratives for non-laundering cases; the false-positive bypass should catch most clean cases.
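The metric itself is a simple ratio over clean cases. In this sketch the `(is_laundering, narrative_generated)` tuple shape is an assumption about how per-case results are recorded:

```python
def false_positive_rate(results):
    """Rate of clean (non-laundering) cases that still produced a SAR narrative.

    results: iterable of (is_laundering: bool, narrative_generated: bool).
    """
    clean = [generated for laundering, generated in results if not laundering]
    return sum(clean) / len(clean) if clean else 0.0
```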
```shell
python evaluation/run_eval.py [--golden-dir data/golden] [--report-dir evaluation/reports]
```

The runner performs the following steps:
- Load the golden dataset (generate it if missing)
- For each case-narrative pair, run all scorers
- Compute aggregate metrics (averages, pass rates, ECE)
- Log metrics to the MLflow experiment
- Write a JSON report to `evaluation/reports/`
- Print a human-readable summary
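The aggregate-metrics step might look like this sketch, which assumes per-scorer score lists and pass thresholds arrive as plain dicts; the MLflow logging and JSON write are omitted:

```python
import statistics

def aggregate(scores, thresholds):
    """Summarize per-case scores into avg/min/max and threshold pass rates.

    scores: {scorer_name: [per-case floats]}
    thresholds: {scorer_name: minimum passing score}
    """
    report = {}
    for name, vals in scores.items():
        threshold = thresholds.get(name, 0.0)
        report[name] = {
            "avg": statistics.mean(vals),
            "min": min(vals),
            "max": max(vals),
            "pass_rate": sum(v >= threshold for v in vals) / len(vals),
        }
    return report
```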
All evaluation runs are logged to MLflow with:
- Per-scorer average, min, max metrics
- Pass rates against thresholds
- Case count and timing metadata
- Experiment name: `SAR_Narrative_Evaluation`
For extending the golden dataset:
- Ingest new investigation case
- Run automated typology agents to produce candidate labels
- Present candidates to a human labeler in the Streamlit evaluation page (`ui/pages/05_evaluation.py`)
- Labeler confirms, corrects, or adds labels
- Store labeled case in golden dataset
- Track inter-annotator agreement metrics
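For two annotators assigning one label per case, agreement can be tracked with Cohen's kappa. This sketch assumes single-label cases; multi-label typology cases would need a set-based variant:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same cases."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:
        return 1.0  # degenerate case: both annotators always pick one label
    return (observed - expected) / (1 - expected)
```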