
Evaluation Methodology

Golden Dataset

The evaluation framework uses a golden dataset of 50 labeled cases derived from the IBM AML synthetic transaction dataset.

Dataset Composition

| Pattern Type | Count | Typology Labels |
|---|---|---|
| FAN-OUT | ~7 | structuring |
| FAN-IN | ~6 | structuring |
| CYCLE | ~7 | round_tripping |
| SCATTER-GATHER | ~6 | layering |
| GATHER-SCATTER | ~6 | layering |
| STACK | ~6 | layering, structuring |
| BIPARTITE | ~6 | round_tripping, layering |
| RANDOM | ~6 | layering |

Each case includes:

  • Transaction data (10-50 transactions per case)
  • Ground truth typology labels
  • Laundering rate (percentage of transactions labeled as laundering)
  • Case-level summary features
  • Reference SAR narrative (LLM-generated, human-reviewed)

Generation Pipeline

IBM AML CSV → filter Is_Laundering=1 → group by pattern type
→ sample cases → build case summaries → generate reference narratives (Groq 70B)
→ store as golden_cases.json + golden_narratives.json
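The filter-group-sample portion of the pipeline can be sketched as below. This is a minimal illustration, not the repo's actual code; the column names `Is_Laundering` and `Pattern_Type` and the function name are assumptions based on the description above.

```python
import random
from collections import defaultdict

def sample_golden_cases(rows, per_pattern=7, seed=42):
    """Filter laundering rows, group by pattern type, and sample up to
    per_pattern cases per group (hypothetical column names)."""
    laundering = [r for r in rows if r.get("Is_Laundering") == "1"]
    by_pattern = defaultdict(list)
    for row in laundering:
        by_pattern[row["Pattern_Type"]].append(row)
    rng = random.Random(seed)  # fixed seed so the golden set is reproducible
    cases = {}
    for pattern, group in by_pattern.items():
        cases[pattern] = rng.sample(group, min(per_pattern, len(group)))
    return cases
```

A fixed random seed keeps the sampled golden set stable across regenerations.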

Scoring Dimensions

1. Typology Detection Accuracy (>= 0.85)

Compares detected typology labels against golden ground truth using Jaccard similarity. Maps pattern types to expected typology labels (e.g., FAN-OUT → structuring).
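The Jaccard comparison can be sketched as follows; the pattern-to-typology map shown is a partial illustration drawn from the dataset table above, not the full mapping used by the scorer.

```python
# Partial mapping, illustrative only (see the dataset composition table).
PATTERN_TO_TYPOLOGY = {
    "FAN-OUT": {"structuring"},
    "CYCLE": {"round_tripping"},
    "STACK": {"layering", "structuring"},
}

def jaccard(predicted, expected):
    """Jaccard similarity between two label sets: |A ∩ B| / |A ∪ B|."""
    p, e = set(predicted), set(expected)
    if not p and not e:
        return 1.0  # both empty: treat as perfect agreement
    return len(p & e) / len(p | e)
```

For example, predicting `{structuring, layering}` against a STACK case scores 1.0, while predicting only `{structuring}` scores 0.5.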

2. Confidence Calibration — ECE (<= 0.10)

Expected Calibration Error (ECE) measures whether confidence scores match actual accuracy: predictions are grouped into 10 equal-width confidence bins, and each bin's average confidence is compared against its observed accuracy.
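A minimal ECE computation, assuming confidences in [0, 1] and a boolean correctness flag per prediction:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average of |avg confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A well-calibrated system predicting 0.8 confidence should be right about 80% of the time, giving an ECE near zero.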

3. Narrative Completeness — 5W1H (>= 0.90)

Rule-based scorer checks for presence and sufficient detail (>= 15 words) in each FinCEN-required section: Summary, WHO, WHAT, WHEN, WHERE, WHY, HOW, Recommended Action, Appendix.
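A sketch of such a rule-based section checker, assuming section headings appear at the start of a line (the heading format and helper names are assumptions):

```python
import re

REQUIRED_SECTIONS = ["Summary", "WHO", "WHAT", "WHEN", "WHERE", "WHY", "HOW",
                     "Recommended Action", "Appendix"]

_HEADING_RE = re.compile(
    r"^(Summary|WHO|WHAT|WHEN|WHERE|WHY|HOW|Recommended Action|Appendix)\b[:\s]*",
    re.MULTILINE)

def split_sections(narrative):
    """Map each recognized heading to the text that follows it."""
    matches = list(_HEADING_RE.finditer(narrative))
    sections = {}
    for m, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(narrative)
        sections[m.group(1)] = narrative[m.end():end]
    return sections

def completeness_score(narrative, min_words=15):
    """Fraction of required sections present with at least min_words words."""
    sections = split_sections(narrative)
    ok = sum(1 for name in REQUIRED_SECTIONS
             if len(sections.get(name, "").split()) >= min_words)
    return ok / len(REQUIRED_SECTIONS)
```

A narrative missing one of the nine sections would score 8/9 ≈ 0.89 and fail the 0.90 threshold.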

4. Factual Grounding (>= 0.95)

Verifies that dollar amounts, account numbers, dates, and transaction counts in the narrative match source transaction data. Primary anti-hallucination check.
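The dollar-amount portion of this check might look like the following sketch; the transaction field name `amount` and the normalization scheme are assumptions.

```python
import re

def extract_amounts(text):
    """Pull dollar amounts like $1,234.56 out of free text, commas stripped."""
    return {m.replace(",", "") for m in re.findall(r"\$[\d,]+(?:\.\d{2})?", text)}

def grounding_score(narrative, transactions):
    """Fraction of dollar amounts in the narrative that appear in source data."""
    source = {f"${t['amount']:,.2f}".replace(",", "") for t in transactions}
    claimed = extract_amounts(narrative)
    if not claimed:
        return 1.0  # nothing claimed, nothing to contradict
    return len(claimed & source) / len(claimed)
```

An amount the LLM invented (not present in any source transaction) directly lowers this score, which is why it serves as the primary anti-hallucination check.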

5. Regulatory Compliance (>= 0.90)

Checks for FinCEN-required elements: filing institution identification, report period, regulatory keywords ("suspicious", "BSA", "CTR"), objective tone, and date format consistency.
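An illustrative subset of these rules as boolean checks; the exact keyword lists, tone heuristics, and accepted date formats here are assumptions, not the scorer's real rule set.

```python
import re

def compliance_score(narrative):
    """Average of pass/fail checks over an assumed subset of the rules."""
    dates = re.findall(r"\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4}", narrative)
    checks = {
        "filing_institution": bool(re.search(r"filing institution", narrative, re.I)),
        "report_period": bool(re.search(r"report(ing)? period", narrative, re.I)),
        "regulatory_keywords": any(k in narrative.lower()
                                   for k in ("suspicious", "bsa", "ctr")),
        # Objective tone heuristic (assumption): no speculative first person.
        "objective_tone": not re.search(r"\b(I think|probably|maybe)\b",
                                        narrative, re.I),
        # Date consistency: all dates found share one of the two formats.
        "date_format": len({bool(re.match(r"\d{4}-\d{2}-\d{2}", d))
                            for d in dates}) <= 1,
    }
    return sum(checks.values()) / len(checks)
```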

6. Narrative Quality — LLM Judge (>= 3.5 / 5.0)

Separate LLM call (Groq) evaluates holistic narrative quality on clarity, completeness, professionalism, and actionability. Returns a 1-5 scale score.
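One fragile step in LLM-as-judge setups is extracting a numeric score from free-text output. A hedged sketch of a judge prompt and score parser (the prompt wording and `Score: N` convention are assumptions):

```python
import re

JUDGE_PROMPT = """Rate the SAR narrative below on clarity, completeness,
professionalism, and actionability. Reply with 'Score: N' where N is 1-5.

{narrative}"""

def parse_judge_score(response):
    """Extract a 1-5 score (integers or halves) from the judge's reply."""
    m = re.search(r"Score:\s*([1-5](?:\.\d+)?)", response)
    if not m:
        raise ValueError("judge response missing 'Score: N'")
    return float(m.group(1))
```

Failing loudly on an unparseable reply is usually preferable to silently defaulting, since a default score would corrupt the aggregate.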

7. PII Leakage Rate (= 0.00)

Scans generated narratives for any unmasked PII entities (SSN, names, addresses) that should have been anonymized. Zero tolerance threshold.
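A minimal regex-based scan for two such entity types; real PII scanners cover many more patterns (names and addresses typically need NER, not regex), and the masking convention assumed here (accounts reduced to last four digits) is an assumption.

```python
import re

PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # Assumption: masked accounts keep only 4 digits, so 10+ digits = leak.
    "unmasked_account": re.compile(r"\b\d{10,16}\b"),
}

def pii_leaks(narrative):
    """Return the PII entity types found unmasked in the narrative."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(narrative)]
```

With a zero-tolerance threshold, any non-empty result fails the case outright.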

8. End-to-End Latency (<= 120s)

Measures total pipeline execution time from case ingestion to narrative output. Includes all LLM inference calls.

9. False Positive Rate (<= 0.15)

Measures the rate at which the system generates SAR narratives for non-laundering cases. A false-positive bypass stage is expected to screen out most clean cases before narrative generation.

Evaluation Pipeline

python evaluation/run_eval.py [--golden-dir data/golden] [--report-dir evaluation/reports]

Steps:

  1. Load golden dataset (generate if missing)
  2. For each case-narrative pair, run all scorers
  3. Compute aggregate metrics (averages, pass rates, ECE)
  4. Log metrics to MLflow experiment
  5. Write JSON report to evaluation/reports/
  6. Print human-readable summary
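Step 3 above can be sketched as follows for the "higher is better" scorers; metrics with upper-bound thresholds (ECE, latency, FP rate, PII) would invert the pass comparison. Function and key names are illustrative, not the repo's actual API.

```python
def aggregate(results, thresholds):
    """Per-scorer avg/min/max and pass rate against a >= threshold.

    results: list of {scorer_name: score} dicts, one per golden case.
    thresholds: {scorer_name: minimum passing score}.
    """
    summary = {}
    for scorer, threshold in thresholds.items():
        scores = [r[scorer] for r in results]
        summary[scorer] = {
            "avg": sum(scores) / len(scores),
            "min": min(scores),
            "max": max(scores),
            "pass_rate": sum(1 for s in scores if s >= threshold) / len(scores),
        }
    return summary
```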

MLflow Integration

All evaluation runs are logged to MLflow with:

  • Per-scorer average, min, max metrics
  • Pass rates against thresholds
  • Case count and timing metadata
  • Experiment name: SAR_Narrative_Evaluation

Labeling Pipeline

For extending the golden dataset:

  1. Ingest new investigation case
  2. Run automated typology agents to produce candidate labels
  3. Present candidates to human labeler in Streamlit evaluation page (ui/pages/05_evaluation.py)
  4. Labeler confirms, corrects, or adds labels
  5. Store labeled case in golden dataset
  6. Track inter-annotator agreement metrics
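For step 6, a common agreement metric is Cohen's kappa. The sketch below assumes each annotator assigns a single primary typology label per case; for the multi-label setting above, a per-case Jaccard average would be an alternative.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' single-label assignments."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa of 1.0 means perfect agreement, 0.0 means chance-level agreement, and negative values mean systematic disagreement.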