This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Python CLI tool that evaluates AI-generated smart contract security audit findings against ground truth data using LLM-based comparison. It compares "junior auditor" (AI scan) results against verified vulnerability reports and produces precision/recall/F1 metrics.
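For reference, the metrics follow the standard definitions from true-positive / false-positive / false-negative counts; this sketch is illustrative only, and the repository computes its own counts from the evaluated findings.

```python
# Standard precision / recall / F1 from confusion counts (illustrative only).
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```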
```bash
uv sync                                 # install runtime + dev dependencies

scoring-algo evaluate [--no-telemetry] [--log-level INFO]
scoring-algo report --benchmarks ./benchmarks --scan-root ./data/baseline --out REPORT.md

uv run ruff check scoring_algo/         # lint
uv run ruff check --fix scoring_algo/   # lint + autofix
uv run ruff format scoring_algo/        # format (black-compatible)
uv run mypy scoring_algo/               # type check
```

Pre-commit hooks run ruff (lint + format) and mypy automatically on commit.
- Line length: 100 (ruff lint + format aligned)
- Ruff handles linting (replaces flake8), import sorting (replaces isort), and formatting (replaces black)
- Type hints throughout; `from __future__ import annotations` in all modules
- Pydantic models for data validation and settings
- `scoring_algo/cli.py` — Typer CLI with `evaluate` and `report` subcommands
- `scoring_algo/generate_report.py` — Report generator (called via `scoring-algo report`)
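A rough sketch of the CLI shape implied above; the option names and defaults here are assumptions, and the real definitions live in `scoring_algo/cli.py`.

```python
# Hypothetical sketch of the Typer CLI; actual options live in scoring_algo/cli.py.
import typer

app = typer.Typer()

@app.command()
def evaluate(
    telemetry: bool = typer.Option(True, "--telemetry/--no-telemetry"),
    log_level: str = typer.Option("INFO", "--log-level"),
) -> None:
    """Run the LLM-based evaluation for the configured repos."""
    ...

@app.command()
def report(
    benchmarks: str = typer.Option("./benchmarks"),
    scan_root: str = typer.Option("./data/baseline"),
    out: str = typer.Option("REPORT.md"),
) -> None:
    """Generate a markdown report from benchmark results."""
    ...

if __name__ == "__main__":
    app()
```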
Evaluation flow (`evaluate.py`):

- Load data (`storage.py`) — reads ground truth from `data/source_of_truth/<repo>.json` and scan results from `data/<scan_source>/<repo>_results.json`; normalizes diverse JSON formats into `Vulnerability` Pydantic models
- Batch (`batching.py`) — splits scan findings into batches of `BATCH_SIZE` (default 10)
- LLM compare (`llm.py`) — for each truth finding × batch, sends a prompt to the OpenAI API via `AsyncOpenAI` with structured output parsing; returns `Finding` objects
- Consensus (`iteration.py`) — runs 3 LLM calls per comparison; majority vote determines the result (2-of-3 wins; 1-1-1 tie → partial match)
- Match resolution (`evaluate.py`) — iterates batches until an exact match is found; removes matched findings from the pool (one-to-one mapping); falls back to the best partial (see the sketch after this list)
- Post-process — suppresses duplicate partials, removes partials that conflict with exact matches, marks unmatched non-QA findings as false positives
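The match-resolution step can be pictured roughly as below. This is a hedged sketch, not the actual `evaluate.py` code: the `MatchOutcome` shape, the `compare` callable, and the control-flow details are assumptions made for illustration.

```python
# Hypothetical sketch of match resolution; names and data shapes are assumptions.
from dataclasses import dataclass

@dataclass
class MatchOutcome:
    kind: str                    # "exact" | "partial" | "false"
    finding: object | None = None

def resolve_matches(truth_findings, scan_batches, compare):
    """compare(truth, candidates) -> MatchOutcome, via the consensus LLM comparison."""
    remaining = [f for batch in scan_batches for f in batch]
    matches = {}
    for truth in truth_findings:
        best_partial = None
        found_exact = False
        for batch in scan_batches:
            candidates = [f for f in batch if f in remaining]
            if not candidates:
                continue
            outcome = compare(truth, candidates)
            if outcome.kind == "exact":
                remaining.remove(outcome.finding)   # one-to-one: drop from pool
                matches[id(truth)] = outcome
                found_exact = True
                break
            if outcome.kind == "partial" and best_partial is None:
                best_partial = outcome
        if not found_exact:
            matches[id(truth)] = best_partial       # fall back to best partial (may be None)
    return matches, remaining  # leftover non-QA scan findings count as false positives
```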
Key modules:
- `types.py` — Pydantic models (`Vulnerability`, `Finding`, `EvaluatedFinding`) and enums (`Severity`, `Category`); sketched below
- `prompt.py` — LLM prompt template (role: security expert, structured JSON output)
- `telemetry.py` — optional Langfuse integration (disable with `--no-telemetry`)
- `settings.py` — Pydantic `BaseSettings` with `SCORING_` env prefix; loads from `.env`
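For orientation, a minimal sketch of what the `types.py` models might look like; the enum members and field names are assumptions, not the actual definitions.

```python
# Hypothetical sketch of the core models; actual fields live in scoring_algo/types.py.
from __future__ import annotations

from enum import Enum
from pydantic import BaseModel

class Severity(str, Enum):              # members are illustrative assumptions
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"
    INFO = "Info"
    BEST_PRACTICES = "Best Practices"   # Info / Best Practices are treated as QA

class Vulnerability(BaseModel):         # normalized ground-truth or scan finding
    title: str
    description: str
    severity: Severity

class Finding(BaseModel):               # structured output of one LLM comparison
    match: str                          # e.g. "exact" / "partial" / "none"
    reasoning: str
```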
```
data/
├── auditagent/       # AI-generated audit results
├── baseline/         # Baseline scan results
├── repos/            # Repository metadata
└── source_of_truth/  # Verified ground truth findings
```
Results are written to `benchmarks/<repo>_results.json`.
All runtime settings are in `scoring_algo/settings.py` via env vars (matched by field name, no prefix):

- `REPOS_TO_RUN` — repos to evaluate (edit defaults in `settings.py`)
- `MODEL` — OpenAI model (must be in `SUPPORTED_MODELS`)
- `BATCH_SIZE` — number of scan findings per batch (default 10)
- `SCAN_SOURCE` — folder under the data root (`auditagent` or `baseline`)
- `DATA_ROOT`, `OUTPUT_ROOT` — paths (relative paths resolve from the `scoring_algo/` package dir)

Required env var: `OPENAI_API_KEY`. Optional: `OPENAI_BASE_URL` for third-party APIs.
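A hedged sketch of how this configuration could be expressed with `BaseSettings` (assuming pydantic-settings); defaults here other than `BATCH_SIZE` are placeholders, and the real definitions and env handling are in `scoring_algo/settings.py`.

```python
# Hypothetical sketch assuming pydantic-settings; real defaults live in settings.py.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    OPENAI_API_KEY: str                      # required
    OPENAI_BASE_URL: str | None = None       # optional, for third-party APIs
    MODEL: str = "placeholder-model"         # must be in SUPPORTED_MODELS
    BATCH_SIZE: int = 10                     # scan findings per batch
    SCAN_SOURCE: str = "baseline"            # or "auditagent"
    REPOS_TO_RUN: list[str] = []             # repos to evaluate
    DATA_ROOT: str = "data"                  # relative paths resolve from the package dir
    OUTPUT_ROOT: str = "benchmarks"

settings = Settings()  # values can be overridden via environment variables or .env
```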
- Consensus via majority voting: 3 LLM iterations per comparison with early exit when first 2 agree
- Conservative tie-breaking: a 1-exact + 1-partial + 1-false tie resolves to partial (not false)
- One-to-one mapping: matched scan findings are removed from the pool to prevent double-counting
- QA exclusion: Info and Best Practices severity findings are excluded from false positive counts
- Async parallelism: LLM calls run 2 in parallel with a conditional 3rd via `asyncio` (see the sketch below)
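A minimal sketch of that consensus pattern, assuming a `compare_once` coroutine that returns one of `"exact"`, `"partial"`, or `"false"`; the real logic lives in `iteration.py`.

```python
import asyncio

async def consensus(compare_once) -> str:
    """Run two comparisons in parallel; run a third only if they disagree.

    compare_once() is an assumed coroutine returning "exact", "partial", or "false".
    """
    first, second = await asyncio.gather(compare_once(), compare_once())
    if first == second:                      # early exit: first two agree
        return first
    third = await compare_once()             # conditional tie-breaker
    votes = [first, second, third]
    for verdict in ("exact", "partial", "false"):
        if votes.count(verdict) >= 2:        # 2-of-3 majority wins
            return verdict
    return "partial"                         # 1-1-1 tie → conservative partial match
```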