Contributing to agentevals

Thank you for your interest in contributing to agentevals! This document covers how to get started, contribute code, and get your changes merged.

Note: This project is under active development. Expect breaking changes.

Ways to Contribute

Report bugs or request features — use GitHub Issues. Search existing issues before opening a new one.
Fix a bug — open a PR with a test that reproduces the issue.
Add a feature — open an issue first to discuss the approach, then submit a PR.
Improve docs — PRs for documentation fixes and improvements are always welcome.

Development Setup

Prerequisites

Python 3.11+
uv (Python package manager)
Node.js 20+ and npm (for the UI)
Optionally, Nix — the project includes a flake.nix devshell

Getting Started

# Fork and clone
git clone https://github.com/YOUR_USERNAME/agentevals.git
cd agentevals
git remote add upstream https://github.com/agentevals-dev/agentevals.git

# Install Python dependencies
uv sync

# Install UI dependencies
cd ui && npm ci && cd ..

Running Locally

Start the backend and frontend in separate terminals:

# Terminal 1 — backend (FastAPI, port 8001)
make dev-backend

# Terminal 2 — frontend (Vite, port 5173)
make dev-frontend

Open http://localhost:5173 to access the UI.

To test the full bundled experience (UI embedded in the backend):

make dev-bundle

See DEVELOPMENT.md for build tiers, Makefile targets, and release instructions.

Running Tests

make test
# or directly:
uv run pytest

Contributing Code

Workflow

Create a branch from main: git checkout -b feature/my-change
Make your changes
Add or update tests as needed
Run uv run pytest and ensure all tests pass
Commit with a clear message (see Commit Messages)
Push to your fork and open a PR against main

Small Changes

For bug fixes or minor improvements (< 100 lines), open a PR directly with tests.

Large Changes

For new features, refactors, or anything that touches multiple files:

Open an issue describing the change and your proposed approach
Get alignment before investing significant effort
Open a draft PR early to get feedback
Iterate based on review

Code Style

Python

Use type hints for function signatures
Keep functions focused and small

TypeScript / React

Follow the project's existing patterns: inline styles, CSS variables, Ant Design components
Use TypeScript for type safety
Use functional components with hooks

Commit Messages

Follow Conventional Commits:

type(scope): subject

body (optional)

Types: feat, fix, docs, refactor, test, chore

Examples:

feat(cli): add --threshold flag to run command
fix(ui): correct metric score display in eval table
docs: update development setup instructions
test: add coverage for OTLP trace parsing
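
As a rough illustration of the format above, the following sketch checks a commit subject line against the `type(scope): subject` pattern. The regular expression and the `parse_subject` helper are assumptions made for this example; they are not part of the project or of any tooling it ships.

```python
import re
from typing import Optional

# Types listed in this guide. The regex below is an illustrative assumption,
# not a rule enforced by the project.
ALLOWED_TYPES = ("feat", "fix", "docs", "refactor", "test", "chore")

SUBJECT_RE = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r")"  # one of the allowed types
    r"(?:\((?P<scope>[^()\s]+)\))?"                 # optional (scope)
    r": (?P<subject>\S.*)$"                         # colon, space, subject
)


def parse_subject(line: str) -> Optional[dict[str, Optional[str]]]:
    """Return the parts of a commit subject line, or None if it does not match."""
    match = SUBJECT_RE.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_subject("feat(cli): add --threshold flag to run command"))
    print(parse_subject("docs: update development setup instructions"))
    print(parse_subject("updated some stuff"))
```

A subject without a scope parses with `scope` set to `None`, and a line that does not follow the pattern yields `None`.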

Pull Request Process

Ensure tests pass
Update documentation if your change affects user-facing behavior
Keep PRs focused — one logical change per PR
Request a review from a maintainer

PR Checklist

Tests added or updated
All tests pass (uv run pytest)
Documentation updated (if applicable)
Commits are clean with meaningful messages

Project Structure

src/agentevals/       # Python backend (FastAPI, CLI, evaluation engine)
ui/src/               # React frontend (Vite, Ant Design, TypeScript)
tests/                # Python tests (pytest)
examples/             # Agent examples (zero-code, SDK, custom evaluators)
samples/              # Example traces and eval sets
docs/                 # Documentation

Trace Processing Architecture

agentevals converts OTel traces from agent frameworks into a common Invocation format for evaluation. If you're adding support for a new framework or changing how we extract data from spans, this section will help you find your way around.

Key Modules

| Module | What it does |
| --- | --- |
| `trace_attrs.py` | Single source of truth for OTel attribute key constants (`OTEL_GENAI_` for standard semconv, `ADK_` for Google ADK) |
| `extraction.py` | Shared extraction functions, span classifiers, and the `TraceFormatExtractor` protocol with `AdkExtractor` / `GenAIExtractor` |
| `converter.py` | Orchestrates batch conversion of ADK traces into `Invocation` objects |
| `genai_converter.py` | Batch conversion for GenAI semconv traces (single-turn and multi-turn) |
| `streaming/incremental_processor.py` | Real-time span processing for the live UI; uses the same shared extraction functions |
| `utils/log_enrichment.py` | Reconstructs `gen_ai.input.messages` and `gen_ai.output.messages` from OTel log records into span attributes |

Adding a new attribute constant

Add it to trace_attrs.py and import from there. Don't use hardcoded attribute key strings elsewhere.

Adding or modifying extraction logic

The extraction functions in extraction.py accept flat dict[str, Any] attribute maps. This means they work with both Span-based batch converters (via span.tags) and the raw OTLP dict incremental processor. When extracting data, check ADK-specific attributes first (they contain richer data), then fall back to GenAI semconv.
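
The ADK-first, GenAI-fallback rule can be sketched as below. This is a minimal illustration, not code from `extraction.py`: the function name `extract_model_name` and the `ADK_MODEL` key string are hypothetical, and in the real project the constants would be imported from `trace_attrs.py`. Only `gen_ai.request.model` is a standard GenAI semconv attribute.

```python
from typing import Any, Optional

# Hypothetical constants for illustration. In the project these would be
# imported from trace_attrs.py; the ADK key string is an assumption.
ADK_MODEL = "adk.example.model"
OTEL_GENAI_REQUEST_MODEL = "gen_ai.request.model"


def extract_model_name(attrs: dict[str, Any]) -> Optional[str]:
    """Prefer the ADK-specific attribute, then fall back to GenAI semconv."""
    for key in (ADK_MODEL, OTEL_GENAI_REQUEST_MODEL):
        value = attrs.get(key)
        if value:  # skip missing keys and empty values
            return str(value)
    return None


if __name__ == "__main__":
    # A plain dict stands in for both span.tags and a raw OTLP attribute map.
    print(extract_model_name({ADK_MODEL: "model-a", OTEL_GENAI_REQUEST_MODEL: "model-b"}))
    print(extract_model_name({OTEL_GENAI_REQUEST_MODEL: "model-b"}))
    print(extract_model_name({}))
```

Because the function only sees a flat dict, the same code path serves the batch converters and the incremental processor.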

Supporting a new trace format

Add a new TraceFormatExtractor implementation in extraction.py with detect(), find_invocation_spans(), find_llm_spans_in(), find_tool_spans_in(), and classify_span()
Register it in the _EXTRACTORS list. Order matters here: more specific formats should come first so they get detected before the generic GenAI fallback
If the format introduces new attribute keys, add them to trace_attrs.py
If you need conversion logic that the shared extraction functions don't cover, add a dedicated converter module (see genai_converter.py for an example)
Add tests to tests/test_extraction.py for detection and span classification
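
The steps above might look roughly like the following skeleton. Everything here is an assumption made for illustration: the `acme` format, the `acme.span.kind` key, the dict-based span shape, the method signatures, and the `pick_extractor` helper. Check the `TraceFormatExtractor` protocol in `extraction.py` for the real signatures before copying any of it.

```python
from typing import Any, Optional

# Stand-in span shape for this sketch: {"id", "parent_id", "attrs"}.
Span = dict[str, Any]

# Hypothetical attribute key; a real one would be added to trace_attrs.py.
ACME_SPAN_KIND = "acme.span.kind"


class AcmeExtractor:
    """Hypothetical extractor for an imaginary 'acme' trace format."""

    def detect(self, spans: list[Span]) -> bool:
        # The format is recognized if any span carries the format-specific key.
        return any(ACME_SPAN_KIND in s["attrs"] for s in spans)

    def classify_span(self, span: Span) -> Optional[str]:
        kind = span["attrs"].get(ACME_SPAN_KIND)
        return kind if kind in ("invocation", "llm", "tool") else None

    def find_invocation_spans(self, spans: list[Span]) -> list[Span]:
        return [s for s in spans if self.classify_span(s) == "invocation"]

    def _children_of_kind(self, parent: Span, spans: list[Span], kind: str) -> list[Span]:
        # Simplification: only direct children of the invocation are considered.
        return [
            s for s in spans
            if s["parent_id"] == parent["id"] and self.classify_span(s) == kind
        ]

    def find_llm_spans_in(self, invocation: Span, spans: list[Span]) -> list[Span]:
        return self._children_of_kind(invocation, spans, "llm")

    def find_tool_spans_in(self, invocation: Span, spans: list[Span]) -> list[Span]:
        return self._children_of_kind(invocation, spans, "tool")


class GenericExtractor:
    """Stand-in for the generic fallback: accepts any trace."""

    def detect(self, spans: list[Span]) -> bool:
        return True


# Stand-in for the _EXTRACTORS registry: specific formats before the fallback.
_EXTRACTORS = [AcmeExtractor(), GenericExtractor()]


def pick_extractor(spans: list[Span]):
    """Return the first registered extractor whose detect() accepts the spans."""
    return next(e for e in _EXTRACTORS if e.detect(spans))
```

If `GenericExtractor` were listed first, its `detect()` would accept every trace and `AcmeExtractor` would never be chosen, which is why the more specific format has to come earlier in the list.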

Adding an SDK example

Each example directory under examples/ is self-contained with its own requirements.txt. The example needs to actually produce OTel spans. For OpenAI-based agents this means including opentelemetry-instrumentation-openai-v2 in the requirements. Make sure all framework-specific OTel dependencies are listed in the example's requirements.txt.

Responsible AI Usage

We welcome contributors who use AI tools to assist their work, but we ask that you use them responsibly:

Do not generate issues, comments, or PR descriptions with AI. These should be written by you in your own words. Maintainers need to communicate with the person behind the contribution, not a language model.
Do not "vibe code." AI should assist and accelerate code that you (the human!) would write on your own. You are expected to understand every line of code you submit. If you cannot explain a change during review, it will not be merged.
Indicate non-trivial AI assistance. If AI played a significant role in writing your code (beyond autocomplete or minor suggestions), mention it in your PR description. This helps reviewers calibrate their review.

Getting Help

Open an issue for bugs or questions
Check DEVELOPMENT.md for detailed build and release instructions
