This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Python CLI tool that evaluates AI-generated smart contract security audit findings against ground truth data using LLM-based comparison. It compares "junior auditor" (AI scan) results against verified vulnerability reports and produces precision/recall/F1 metrics.
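For reference, the metrics follow the standard definitions from true-positive / false-positive / false-negative counts; this sketch is illustrative only, and the repository computes its own counts from the evaluated findings.

```python
# Standard precision / recall / F1 from confusion counts (illustrative only).
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```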
```bash
uv sync                                 # install runtime + dev dependencies

scoring-algo evaluate [--no-telemetry] [--log-level INFO]
scoring-algo report --benchmarks ./benchmarks --scan-root ./data/baseline --out REPORT.md

uv run ruff check scoring_algo/         # lint
uv run ruff check --fix scoring_algo/   # lint + autofix
uv run ruff format scoring_algo/        # format (black-compatible)
uv run mypy scoring_algo/               # type check
```

Pre-commit hooks run ruff (lint + format) and mypy automatically on commit.
- Line length: 100 (ruff lint + format aligned)
- Ruff handles linting (replaces flake8), import sorting (replaces isort), and formatting (replaces black)
- Type hints throughout; `from __future__ import annotations` in all modules
- Pydantic models for data validation and settings
- `scoring_algo/cli.py` — Typer CLI with `evaluate` and `report` subcommands
- `scoring_algo/generate_report.py` — Report generator (called via `scoring-algo report`)
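A rough sketch of the CLI shape implied above; the option names and defaults here are assumptions, and the real definitions live in `scoring_algo/cli.py`.

```python
# Hypothetical sketch of the Typer CLI; actual options live in scoring_algo/cli.py.
import typer

app = typer.Typer()

@app.command()
def evaluate(
    telemetry: bool = typer.Option(True, "--telemetry/--no-telemetry"),
    log_level: str = typer.Option("INFO", "--log-level"),
) -> None:
    """Run the LLM-based evaluation for the configured repos."""
    ...

@app.command()
def report(
    benchmarks: str = typer.Option("./benchmarks"),
    scan_root: str = typer.Option("./data/baseline"),
    out: str = typer.Option("REPORT.md"),
) -> None:
    """Generate a markdown report from benchmark results."""
    ...

if __name__ == "__main__":
    app()
```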
Evaluation flow (`evaluate.py`):

- Load data (`storage.py`) — reads ground truth from `data/source_of_truth/<repo>.json` and scan results from `data/<scan_source>/<repo>_results.json`; normalizes diverse JSON formats into `Vulnerability` Pydantic models
- Batch (`batching.py`) — splits scan findings into batches of `BATCH_SIZE` (default 10)
- LLM compare (`llm.py`) — for each truth finding × batch, sends a prompt to the OpenAI API via `AsyncOpenAI` with structured output parsing; returns `Finding` objects
- Consensus (`iteration.py`) — runs 3 LLM calls per comparison; majority vote determines the result (2-of-3 wins; 1-1-1 tie → partial match)
- Match resolution (`evaluate.py`) — iterates batches until an exact match is found; removes matched findings from the pool (one-to-one mapping); falls back to the best partial (see the sketch after this list)
- Post-process — suppresses duplicate partials, removes partials that conflict with exact matches, marks unmatched non-QA findings as false positives
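The match-resolution step can be pictured roughly as below. This is a hedged sketch, not the actual `evaluate.py` code: the `MatchOutcome` shape, the `compare` callable, and the control-flow details are assumptions made for illustration.

```python
# Hypothetical sketch of match resolution; names and data shapes are assumptions.
from dataclasses import dataclass

@dataclass
class MatchOutcome:
    kind: str                    # "exact" | "partial" | "false"
    finding: object | None = None

def resolve_matches(truth_findings, scan_batches, compare):
    """compare(truth, candidates) -> MatchOutcome, via the consensus LLM comparison."""
    remaining = [f for batch in scan_batches for f in batch]
    matches = {}
    for truth in truth_findings:
        best_partial = None
        found_exact = False
        for batch in scan_batches:
            candidates = [f for f in batch if f in remaining]
            if not candidates:
                continue
            outcome = compare(truth, candidates)
            if outcome.kind == "exact":
                remaining.remove(outcome.finding)   # one-to-one: drop from pool
                matches[id(truth)] = outcome
                found_exact = True
                break
            if outcome.kind == "partial" and best_partial is None:
                best_partial = outcome
        if not found_exact:
            matches[id(truth)] = best_partial       # fall back to best partial (may be None)
    return matches, remaining  # leftover non-QA scan findings count as false positives
```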
Key modules:
- `types.py` — Pydantic models (`Vulnerability`, `Finding`, `EvaluatedFinding`) and enums (`Severity`, `Category`); sketched below
- `prompt.py` — LLM prompt template (role: security expert, structured JSON output)
- `telemetry.py` — optional Langfuse integration (disable with `--no-telemetry`)
- `settings.py` — Pydantic `BaseSettings` with `SCORING_` env prefix; loads from `.env`
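For orientation, a minimal sketch of what the `types.py` models might look like; the enum members and field names are assumptions, not the actual definitions.

```python
# Hypothetical sketch of the core models; actual fields live in scoring_algo/types.py.
from __future__ import annotations

from enum import Enum
from pydantic import BaseModel

class Severity(str, Enum):              # members are illustrative assumptions
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"
    INFO = "Info"
    BEST_PRACTICES = "Best Practices"   # Info / Best Practices are treated as QA

class Vulnerability(BaseModel):         # normalized ground-truth or scan finding
    title: str
    description: str
    severity: Severity

class Finding(BaseModel):               # structured output of one LLM comparison
    match: str                          # e.g. "exact" / "partial" / "none"
    reasoning: str
```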
```
data/
├── auditagent/       # AI-generated audit results
├── baseline/         # Baseline scan results
├── repos/            # Repository metadata
└── source_of_truth/  # Verified ground truth findings
```
Results are written to `benchmarks/<repo>_results.json`.
All runtime settings are in `scoring_algo/settings.py` via env vars (matched by field name, no prefix):

- `REPOS_TO_RUN` — repos to evaluate (edit defaults in `settings.py`)
- `MODEL` — OpenAI model (must be in `SUPPORTED_MODELS`)
- `BATCH_SIZE` — number of scan findings per batch (default 10)
- `SCAN_SOURCE` — folder under the data root (`auditagent` or `baseline`)
- `DATA_ROOT`, `OUTPUT_ROOT` — paths (relative paths resolve from the `scoring_algo/` package dir)

Required env var: `OPENAI_API_KEY`. Optional: `OPENAI_BASE_URL` for third-party APIs.
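A hedged sketch of how this configuration could be expressed with `BaseSettings` (assuming pydantic-settings); defaults here other than `BATCH_SIZE` are placeholders, and the real definitions and env handling are in `scoring_algo/settings.py`.

```python
# Hypothetical sketch assuming pydantic-settings; real defaults live in settings.py.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    OPENAI_API_KEY: str                      # required
    OPENAI_BASE_URL: str | None = None       # optional, for third-party APIs
    MODEL: str = "placeholder-model"         # must be in SUPPORTED_MODELS
    BATCH_SIZE: int = 10                     # scan findings per batch
    SCAN_SOURCE: str = "baseline"            # or "auditagent"
    REPOS_TO_RUN: list[str] = []             # repos to evaluate
    DATA_ROOT: str = "data"                  # relative paths resolve from the package dir
    OUTPUT_ROOT: str = "benchmarks"

settings = Settings()  # values can be overridden via environment variables or .env
```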
- Consensus via majority voting: 3 LLM iterations per comparison with early exit when first 2 agree
- Conservative tie-breaking: a 1-exact + 1-partial + 1-false tie resolves to partial (not false)
- One-to-one mapping: matched scan findings are removed from the pool to prevent double-counting
- QA exclusion: Info and Best Practices severity findings are excluded from false positive counts
- Async parallelism: LLM calls run 2 in parallel with a conditional 3rd via `asyncio` (see the sketch below)
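A minimal sketch of that consensus pattern, assuming a `compare_once` coroutine that returns one of `"exact"`, `"partial"`, or `"false"`; the real logic lives in `iteration.py`.

```python
import asyncio

async def consensus(compare_once) -> str:
    """Run two comparisons in parallel; run a third only if they disagree.

    compare_once() is an assumed coroutine returning "exact", "partial", or "false".
    """
    first, second = await asyncio.gather(compare_once(), compare_once())
    if first == second:                      # early exit: first two agree
        return first
    third = await compare_once()             # conditional tie-breaker
    votes = [first, second, third]
    for verdict in ("exact", "partial", "false"):
        if votes.count(verdict) >= 2:        # 2-of-3 majority wins
            return verdict
    return "partial"                         # 1-1-1 tie → conservative partial match
```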