The evaluation framework uses a golden dataset of 50 labeled cases derived from the IBM AML synthetic transaction dataset.
| Pattern Type | Count | Typology Labels |
|---|---|---|
| FAN-OUT | ~7 | structuring |
| FAN-IN | ~6 | structuring |
| CYCLE | ~7 | round_tripping |
| SCATTER-GATHER | ~6 | layering |
| GATHER-SCATTER | ~6 | layering |
| STACK | ~6 | layering, structuring |
| BIPARTITE | ~6 | round_tripping, layering |
| RANDOM | ~6 | layering |
Each case includes:
- Transaction data (10-50 transactions per case)
- Ground truth typology labels
- Laundering rate (percentage of transactions that are laundering)
- Case-level summary features
- Reference SAR narrative (LLM-generated, human-reviewed)
```
IBM AML CSV → filter Is_Laundering=1 → group by pattern type
→ sample cases → build case summaries → generate reference narratives (Groq 70B)
→ store as golden_cases.json + golden_narratives.json
```
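The filter/group/sample steps above can be sketched with the standard library alone. The column names `Is_Laundering` and `Pattern_Type` are assumptions (the actual IBM AML CSV schema may differ), and the summary-building and narrative-generation steps are omitted:

```python
import csv
import random
from collections import defaultdict

def build_golden_cases(csv_path, per_pattern=6, seed=0):
    """Filter laundering rows, group by pattern type, sample per-pattern cases.

    Assumes columns named 'Is_Laundering' and 'Pattern_Type'; the real
    dataset schema and the downstream narrative steps are not shown here.
    """
    by_pattern = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["Is_Laundering"] == "1":
                by_pattern[row["Pattern_Type"]].append(row)
    rng = random.Random(seed)  # fixed seed keeps the golden set reproducible
    return {pattern: rng.sample(rows, min(per_pattern, len(rows)))
            for pattern, rows in by_pattern.items()}
```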
**Typology Accuracy.** Compares detected typology labels against the golden ground truth using Jaccard similarity, mapping each pattern type to its expected typology labels (e.g., FAN-OUT → structuring).
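The score reduces to set intersection over union. A minimal sketch, where the pattern-to-typology mapping shown is a partial, illustrative version of the table above:

```python
def jaccard(predicted, expected):
    """Jaccard similarity between two label sets (1.0 when both are empty)."""
    p, e = set(predicted), set(expected)
    if not p and not e:
        return 1.0
    return len(p & e) / len(p | e)

# Illustrative subset of the pattern -> expected-typology mapping.
PATTERN_TYPOLOGIES = {
    "FAN-OUT": {"structuring"},
    "CYCLE": {"round_tripping"},
    "STACK": {"layering", "structuring"},
}

def typology_score(detected_labels, pattern_type):
    return jaccard(detected_labels, PATTERN_TYPOLOGIES.get(pattern_type, set()))
```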
**Confidence Calibration (ECE).** Expected Calibration Error measures whether confidence scores match actual accuracy: confidence predictions are placed into 10 equal-width bins, and the mean predicted confidence is compared against the observed positive rate within each bin.
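A minimal sketch of the ECE computation described above, assuming binary labels (1 = correct prediction) and equal-width bins:

```python
def expected_calibration_error(confidences, labels, n_bins=10):
    """ECE: bin-weighted average of |observed accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, labels):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into top bin
        bins[idx].append((conf, label))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(l for _, l in b) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece
```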
**Section Completeness.** A rule-based scorer checks each FinCEN-required section (Summary, WHO, WHAT, WHEN, WHERE, WHY, HOW, Recommended Action, Appendix) for presence and sufficient detail (>= 15 words).
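One way the rule-based check could look, assuming sections are delimited by `HEADER:` lines; the actual narrative format and parsing logic may differ:

```python
import re

REQUIRED_SECTIONS = ["Summary", "WHO", "WHAT", "WHEN", "WHERE", "WHY", "HOW",
                     "Recommended Action", "Appendix"]
MIN_WORDS = 15

def completeness_score(narrative):
    """Fraction of required sections present with >= MIN_WORDS words of body."""
    pattern = "|".join(re.escape(s) for s in REQUIRED_SECTIONS)
    # Capturing split yields [preamble, header1, body1, header2, body2, ...]
    parts = re.split(rf"^({pattern}):", narrative, flags=re.MULTILINE)
    word_counts = {header: len(body.split())
                   for header, body in zip(parts[1::2], parts[2::2])}
    passed = sum(1 for s in REQUIRED_SECTIONS if word_counts.get(s, 0) >= MIN_WORDS)
    return passed / len(REQUIRED_SECTIONS)
```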
**Numeric Consistency.** Verifies that dollar amounts, account numbers, dates, and transaction counts in the narrative match the source transaction data. This is the primary anti-hallucination check.
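A simplified sketch of the dollar-amount portion of this check; the regex and the transaction record shape (`{"amount": float}`) are assumptions, and the real scorer also covers account numbers, dates, and counts:

```python
import re

def extract_amounts(text):
    """Pull dollar amounts like $1,234.56 out of free text (simplified pattern)."""
    return {m.replace(",", "") for m in re.findall(r"\$[\d,]+(?:\.\d{2})?", text)}

def numeric_consistency(narrative, transactions):
    """Fraction of narrative dollar amounts that appear in the source data.

    A score below 1.0 flags possibly hallucinated figures.
    """
    source_amounts = {f"${t['amount']:,.2f}".replace(",", "") for t in transactions}
    narrated = extract_amounts(narrative)
    if not narrated:
        return 1.0  # no figures cited, so nothing to contradict
    return len(narrated & source_amounts) / len(narrated)
```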
**Regulatory Compliance.** Checks for FinCEN-required elements: filing institution identification, report period, regulatory keywords ("suspicious", "BSA", "CTR"), objective tone, and consistent date formatting.
**LLM-as-Judge Quality.** A separate LLM call (Groq) evaluates holistic narrative quality on clarity, completeness, professionalism, and actionability, returning a score on a 1-5 scale.
**PII Leakage.** Scans generated narratives for any unmasked PII entities (SSNs, names, addresses) that should have been anonymized. The threshold is zero tolerance: a single leak fails the run.
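A minimal regex-based sketch of the scan; the patterns and entity names here are illustrative, and a production scanner would also use NER for names and addresses:

```python
import re

# Hypothetical patterns for unmasked PII; properly anonymized narratives
# carry placeholders like [PERSON_1] or [ACCOUNT_1] instead of raw values.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "account_number": re.compile(r"\b\d{10,16}\b"),
}

def pii_leaks(narrative):
    """Return every unmasked PII hit; any hit fails the zero-tolerance check."""
    return [(kind, m.group())
            for kind, pat in PII_PATTERNS.items()
            for m in pat.finditer(narrative)]
```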
**Latency.** Measures total pipeline execution time from case ingestion to narrative output, including all LLM inference calls.
**False Positive Rate.** Measures how often the system generates SAR narratives for non-laundering cases; the false-positive bypass should catch most clean cases.
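The metric itself is a simple ratio over clean cases. In this sketch the `(is_laundering, narrative_generated)` tuple shape is an assumption about how per-case results are recorded:

```python
def false_positive_rate(results):
    """Rate of clean (non-laundering) cases that still produced a SAR narrative.

    results: iterable of (is_laundering: bool, narrative_generated: bool).
    """
    clean = [generated for laundering, generated in results if not laundering]
    return sum(clean) / len(clean) if clean else 0.0
```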
```shell
python evaluation/run_eval.py [--golden-dir data/golden] [--report-dir evaluation/reports]
```

The runner performs the following steps:
- Load the golden dataset (generate it if missing)
- For each case-narrative pair, run all scorers
- Compute aggregate metrics (averages, pass rates, ECE)
- Log metrics to the MLflow experiment
- Write a JSON report to `evaluation/reports/`
- Print a human-readable summary
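The aggregate-metrics step might look like this sketch, which assumes per-scorer score lists and pass thresholds arrive as plain dicts; the MLflow logging and JSON write are omitted:

```python
import statistics

def aggregate(scores, thresholds):
    """Summarize per-case scores into avg/min/max and threshold pass rates.

    scores: {scorer_name: [per-case floats]}
    thresholds: {scorer_name: minimum passing score}
    """
    report = {}
    for name, vals in scores.items():
        threshold = thresholds.get(name, 0.0)
        report[name] = {
            "avg": statistics.mean(vals),
            "min": min(vals),
            "max": max(vals),
            "pass_rate": sum(v >= threshold for v in vals) / len(vals),
        }
    return report
```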
All evaluation runs are logged to MLflow with:
- Per-scorer average, min, max metrics
- Pass rates against thresholds
- Case count and timing metadata
- Experiment name: `SAR_Narrative_Evaluation`
For extending the golden dataset:
- Ingest new investigation case
- Run automated typology agents to produce candidate labels
- Present candidates to a human labeler in the Streamlit evaluation page (`ui/pages/05_evaluation.py`)
- Labeler confirms, corrects, or adds labels
- Store labeled case in golden dataset
- Track inter-annotator agreement metrics
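For two annotators assigning one label per case, agreement can be tracked with Cohen's kappa. This sketch assumes single-label cases; multi-label typology cases would need a set-based variant:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same cases."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:
        return 1.0  # degenerate case: both annotators always pick one label
    return (observed - expected) / (1 - expected)
```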