Surogate Eval

Surogate Eval is the core evaluation engine for the Surogate LLMOps framework.
It provides a unified interface for benchmarking, security scanning (Red Teaming), stress testing, and custom metric evaluation for Large Language Models.


🚀 Quick Start

Installation

Install the package using uv (recommended) or pip:

# Basic installation
uv pip install -e .

# With specific extras
uv pip install -e ".[dev,security]"

# Full installation with all extras
uv pip install -e ".[all]"

# Install from lock file (recommended for reproducibility)
uv sync

Available Extras

Extra       Description
dev         Development tools (pytest, ruff, black)
security    Red teaming & guardrails (deepteam, deepeval)
vision      Vision-language model evaluation (ms-vlmeval)
rag         RAG evaluation (ragas, llama-index)
inference   Local inference backends (vllm, sglang, flash-attn)
all         All of the above

Basic Usage

The framework is primarily driven by the surogate-eval CLI command.

# Run an evaluation using a config file
surogate-eval eval --config configs/my_eval.yaml

# Run only a specific target
surogate-eval eval --config configs/my_eval.yaml --target openrouter-gpt4

# List previous evaluation results
surogate-eval eval --list

# View a specific result file
surogate-eval eval --view results_20240114_120000.json

# Compare two evaluation runs
surogate-eval eval --compare run1.json run2.json

🛠 Features

Multi-Target Evaluation
Evaluate multiple models (local, API-based, or custom) in a single run.

Security & Guardrails
Integrated Red-Teaming via deepteam and automated guardrail validation.

Benchmark Integration
Native support for standard benchmarks like MMLU, GSM8K, ARC, HellaSwag, and more via evalscope.

Custom Dataset Support
Run benchmarks on translated or custom datasets loaded from local paths or the HuggingFace Hub.

Stress Testing
Measure throughput, latency, and resource consumption under load.

Distributed Execution
Automatic detection of multi-GPU setups using torch.distributed.
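
As a rough illustration of the multi-GPU detection mentioned above, the sketch below uses only public torch and torch.distributed calls; the helper name and exact flow are assumptions for illustration, not the project's actual code.

import os
import torch
import torch.distributed as dist

def maybe_init_distributed() -> bool:
    """Illustrative helper: join a torch.distributed process group only
    when a multi-GPU launch environment is detected."""
    # torchrun exports WORLD_SIZE, RANK, and LOCAL_RANK for every process.
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if world_size > 1 and torch.cuda.is_available() and not dist.is_initialized():
        torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
        dist.init_process_group(backend="nccl")
        return True
    return False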


📋 Configuration

Evaluations are defined in YAML configuration files.

Standard Evaluation

project:
  name: "Llama-3-Check"
  version: "1.0.0"

targets:
  - name: "llama3-8b"
    type: llm
    provider: openai
    model: meta-llama/llama-3.1-8b-instruct
    base_url: https://openrouter.ai/api/v1
    api_key: ${OPENROUTER_API_KEY}

    evaluations:
      - name: "General Knowledge"
        dataset: data/general_qa.jsonl
        metrics:
          - name: correctness
            type: g_eval
            criteria: "Is the response accurate?"
            judge_model:
              target: llama3-8b
          - name: latency
            type: latency
            threshold_ms: 5000

    red_teaming:
      enabled: true
      vulnerabilities:
        - toxicity
        - prompt_leakage
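
The api_key field above uses a ${OPENROUTER_API_KEY} placeholder. Below is a minimal sketch of how such placeholders can be expanded from the environment when a YAML config is read; the load_config helper is a hypothetical illustration, not the project's actual loader.

import os
import yaml  # PyYAML

def load_config(path: str) -> dict:
    """Read a YAML file and expand ${ENV_VAR} placeholders."""
    with open(path) as f:
        raw = f.read()
    # os.path.expandvars substitutes ${OPENROUTER_API_KEY} and similar
    # placeholders with their environment values.
    return yaml.safe_load(os.path.expandvars(raw))

config = load_config("configs/my_eval.yaml")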

Benchmarks with Custom Datasets

targets:
  - name: "gpt-4"
    type: llm
    provider: openai
    model: openai/gpt-4-turbo-preview
    base_url: https://openrouter.ai/api/v1
    api_key: ${OPENROUTER_API_KEY}

    evaluations:
      - name: "benchmarks"
        benchmarks:
          # Default dataset (ModelScope)
          - name: mmlu
            num_fewshot: 5
            limit: 100

          # HuggingFace translated dataset
          - name: gsm8k
            num_fewshot: 3
            limit: 50
            dataset_hub: huggingface
            dataset_path: OpenLLM-Ro/ro_gsm8k

          # Local dataset
          - name: arc_challenge
            num_fewshot: 3
            dataset_path: ./datasets/arc_romanian
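
The dataset_hub and dataset_path fields distinguish Hub-hosted datasets from local ones. The sketch below shows that distinction with the HuggingFace datasets library directly, not the project's own loader; the local file name is an assumption for illustration.

from datasets import load_dataset

# Translated dataset pulled from the HuggingFace Hub
ro_gsm8k = load_dataset("OpenLLM-Ro/ro_gsm8k")

# Local JSONL dataset under ./datasets/arc_romanian (file name assumed)
arc_ro = load_dataset("json", data_files="./datasets/arc_romanian/test.jsonl")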

📂 Project Structure

.
├── pyproject.toml              # Dependencies and entry-points
├── uv.lock                     # Locked dependency versions
├── surogate_eval/              # Main package
│   ├── cli/
│   │   ├── main.py             # CLI entry point
│   │   └── eval.py             # Evaluation command
│   ├── benchmarks/
│   │   ├── backends/
│   │   │   └── evalscope_backend.py  # EvalScope integration
│   │   └── registry.py         # Benchmark registry
│   ├── metrics/                # Evaluation metrics
│   ├── targets/                # Target model interfaces
│   └── utils/                  # Logging and utilities
├── examples/
│   ├── config.yaml             # Example configuration
│   └── datasets/               # Sample datasets
└── eval_results/               # Output directory

📊 Results & Reporting

Results are saved as JSON files in eval_results/:

{
  "project": {"name": "my-eval", "version": "1.0.0"},
  "timestamp": "2026-01-14T15:00:00",
  "summary": {
    "total_targets": 2,
    "total_evaluations": 5
  },
  "targets": [
    {
      "name": "gpt-4",
      "status": "success",
      "evaluations": [...],
      "benchmarks": [...]
    }
  ]
}

Each results file includes:

  • Project metadata and timestamps
  • Summary statistics across all targets
  • Detailed per-test-case inputs, outputs, and scores
  • Benchmark results with task breakdowns
  • Security findings from red-teaming and guardrails
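
Because results are plain JSON, they are easy to post-process outside the CLI. A small sketch, with the file path borrowed from the CLI example above and field names from the sample output:

import json

# Load a saved run from the eval_results/ output directory
with open("eval_results/results_20240114_120000.json") as f:
    results = json.load(f)

summary = results["summary"]
print(f"{summary['total_targets']} targets, {summary['total_evaluations']} evaluations")
for target in results["targets"]:
    print(target["name"], target["status"])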

🔧 Development

# Install dev dependencies
uv pip install -e ".[dev]"

# Run tests
pytest

# Lint
ruff check .

# Format
black .

Lock dependencies after updates:

uv lock

🛡 License

This project is licensed under the AGPL-3.0 License.
