Contributing to EvalAP

Project Architecture

The stack is based on FastAPI, Pydantic, and SQLAlchemy for the API, in conjunction with ZeroMQ for the runner. The project includes:

  • A Documentation and Results platform based on Docusaurus (Port 3000).
  • A Local Results Explorer based on Streamlit (Port 8501).

evalap/
├── justfile    --> just is a handy way to save and run project-specific commands. See https://just.systems
├── evalap/     --> The evalap source code
│   ├── api/        --> The evaluation API source code
│   ├── runner/     --> The runner (message passing) source code
│   ├── mcp/        --> The MCP client
│   ├── ui/         --> The user interface source code
│   └── docs/       --> The documentation source code
├── tests/      --> The tests
└── notebooks/  --> Example and demo notebooks

System Requirements

Install just to run project-specific commands, jq to parse JSON responses, and uv to install the Python requirements.

Environment Variables

At a minimum, the project needs the following API key to be set in order to compute LLM-based metrics:

export OPENAI_API_KEY="Your secret key"

Recommended: Hugging Face Token

For downloading datasets from Hugging Face (used during database seeding), it's recommended to set a Hugging Face token:

export HF_TOKEN="Your Hugging Face token"

You can create a token at https://huggingface.co/settings/tokens. While not strictly required, having a token provides:

  • Higher rate limits for dataset downloads
  • Access to gated datasets (if needed)
  • Better reliability for API calls

The environment variables can also be defined in a .env file at the root of the project. See the .env.example file for an example.

All the project's global settings and environment variables are handled in evalap/api/config.py.
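
As a rough illustration of how these variables might be consumed, the sketch below reads them from the process environment using only the standard library. The `Settings` dataclass and the `load_settings` function are hypothetical names chosen for this example; the actual handling lives in evalap/api/config.py and may well differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container, not the project's real config class."""

    openai_api_key: str
    hf_token: Optional[str] = None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables described above from a mapping (default: os.environ)."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # LLM-based metrics need this key, so fail early with a clear message.
        raise RuntimeError("OPENAI_API_KEY is not set")
    # HF_TOKEN is optional; an empty string is treated as missing.
    return Settings(openai_api_key=api_key, hf_token=env.get("HF_TOKEN") or None)
```

Passing the environment as a parameter keeps the function easy to exercise without touching the real process environment.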

Python Requirements

Install python requirements with:

just sync

This will also install pre-commit hooks.

Development Setup

You can run EvalAP in two ways: using Docker Compose (recommended) or running services locally.

Option 1: Docker Compose (Recommended)

This is the easiest way to get started with full hot reloading for all services.

Quick Start

just run docker

This will:

  • Build the development Docker image (includes dev dependencies like watchdog)
  • Start PostgreSQL database
  • Run Alembic migrations automatically
  • Start all four services with hot reloading:
    • Uvicorn API on port 8000
    • Runner with file watching
    • Docusaurus UI on port 3000 (New platform)
    • Streamlit UI on port 8501 (Local results explorer)
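
Once the stack is up, a quick way to confirm that each service is listening is to probe the ports listed above. This is a minimal sketch, assuming the ports are published on localhost; `port_is_open` and `check_services` are hypothetical helpers for this example, not part of EvalAP.

```python
import socket

# Ports taken from the list above; "localhost" is an assumption about
# where the containers publish them.
SERVICE_PORTS = {"api": 8000, "docusaurus": 3000, "streamlit": 8501}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host: str = "localhost") -> dict[str, bool]:
    """Map each service name to whether its port currently accepts connections."""
    return {name: port_is_open(host, port) for name, port in SERVICE_PORTS.items()}
```

Calling `check_services()` returns a dictionary keyed by service name. An open port only shows that something is listening, not that the service is healthy.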

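Access Your Services

  • API: port 8000
  • Documentation (Docusaurus): port 3000
  • Local results explorer (Streamlit): port 8501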

Hot Reloading

Your code is live-mounted into the container. Any changes you make will automatically trigger reloads:

  • Edit API code (e.g., evalap/api/main.py) → Uvicorn auto-reloads
  • Edit runner code (e.g., evalap/runners/tasks.py) → Runner auto-restarts
  • Edit Documentation code (e.g., docs/docs/index.md) → Docusaurus auto-reloads
  • Edit UI code (e.g., evalap/ui/demo_streamlit/app.py) → Streamlit auto-reloads

Managing Docker Services

# View logs (all services)
docker compose -f compose.dev.yml logs -f

# View logs (specific service)
docker compose -f compose.dev.yml logs -f evalap_dev

# Enter the container
docker compose -f compose.dev.yml exec evalap_dev bash

# Inside the container, check process status
supervisorctl status

# Restart a specific process
supervisorctl restart uvicorn
supervisorctl restart runner
supervisorctl restart docusaurus
supervisorctl restart streamlit

# Stop services
docker compose -f compose.dev.yml down

# Rebuild after dependency changes
docker compose -f compose.dev.yml up --build

Option 2: Local Development (Without Docker)

If you prefer to run services directly on your machine:

Database Setup

  1. Launch the PostgreSQL database:
docker compose -f compose.dev.yml up -d postgres
  2. Initialize/Update the database schema:
alembic -c evalap/api/alembic.ini upgrade head
  3. If you modify the schema:
alembic -c evalap/api/alembic.ini revision --autogenerate -m "describe the change"
alembic -c evalap/api/alembic.ini upgrade head

Run All Services

Launch the API, runner, Docusaurus and Streamlit together:

just run
# or explicitly: just run local

This will:

  1. Seed the database with initial datasets from Hugging Face (if not already present):
    • llm-values-CIVICS: Cultural values evaluation dataset
    • lmsys-toxic-chat: Toxicity detection dataset
    • DECCP: Chinese censorship benchmark
  2. Start all four services in parallel with colored output and hot reloading

Note: Having an HF_TOKEN set is recommended for better dataset download reliability.

Run Services Separately

If needed, you can run each service individually:

Launch the API:

uvicorn evalap.api.main:app --reload

Launch the runner:

PYTHONPATH="." python -m evalap.runners
# To change the default logging level:
LOG_LEVEL="DEBUG" PYTHONPATH="." python -m evalap.runners
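
The LOG_LEVEL variable above takes a standard logging level name. How the runner consumes it is not shown in this guide; the following is a minimal sketch, assuming the value is mapped onto Python's logging module with a hypothetical default of INFO. `resolve_log_level` and `configure_logging` are illustrative names, not EvalAP functions.

```python
import logging
import os


def resolve_log_level(default: str = "INFO") -> int:
    """Translate LOG_LEVEL (e.g. "DEBUG") into a numeric logging level."""
    name = os.environ.get("LOG_LEVEL", default).upper()
    level = logging.getLevelName(name)
    # getLevelName returns a string such as "Level FOO" for unknown names.
    if not isinstance(level, int):
        raise ValueError(f"Unknown log level: {name}")
    return level


def configure_logging() -> None:
    """Apply the resolved level to the root logger."""
    logging.basicConfig(level=resolve_log_level())
```

Rejecting unknown names up front avoids silently running at an unintended verbosity.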

Launch Docusaurus:

cd docs && npm run start

Launch Streamlit:

uv run streamlit run evalap/ui/demo_streamlit/app.py

Troubleshooting

Hot Reload Not Working?

  1. Check volume mounting: Ensure the volume is mounted correctly in compose.dev.yml
  2. Check logs: Look for [Reloader] messages in runner logs
  3. Verify file changes: Make sure you're editing files in the mounted directory

Process Crashed?

# Check status
docker compose -f compose.dev.yml exec evalap_dev supervisorctl status

# View logs
docker compose -f compose.dev.yml logs evalap_dev

# Restart the service
docker compose -f compose.dev.yml restart evalap_dev

Database Issues?

# Reset the database
docker compose -f compose.dev.yml down -v
docker compose -f compose.dev.yml up --build

Jupyter Tutorial

The notebooks/ directory contains examples of API usage.

Adding new metrics

Each metric should be defined in its own file at evalap/api/metrics/{metric_name}.py. The file should be self-contained, i.e., it should contain any prompt and settings related to the metric. To be registered as a known metric of EvalAP, the metric function should be decorated as in the following example:

from . import metric_registry


@metric_registry.register(
    name="metric_name",  # the name that identifies the metric
    description="Explain the metric briefly",
    metric_type="llm",  # to be documented, not yet used
    require=["output", "output_true", "query"],  # fields that must be present in the dataset of the experiment under evaluation
)
def metric_name_metric(output: str, output_true: str, **kwargs) -> float:
    # ...
    # ... Your code goes here; it must compute `score`
    # ...
    return score
    # or, to also store the intermediate observation generated by the metric (typically a judge answer):
    # return score, observation
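
To make the template above concrete, here is a small sketch of a deterministic metric. So that it runs on its own, it defines a stand-in `MetricRegistry` class that only mimics the decorator signature shown above; inside the project you would import the real `metric_registry` instead. The metric name `exact_match` and the `metric_type` value are assumptions made for illustration.

```python
from typing import Callable


class MetricRegistry:
    """Stand-in for evalap's metric_registry so this sketch is self-contained.

    It only mimics the decorator signature from the template above.
    """

    def __init__(self) -> None:
        self.metrics: dict[str, dict] = {}

    def register(self, name: str, description: str, metric_type: str, require: list[str]) -> Callable:
        def decorator(func: Callable) -> Callable:
            # Record the metadata alongside the function, then return it unchanged.
            self.metrics[name] = {
                "func": func,
                "description": description,
                "metric_type": metric_type,
                "require": require,
            }
            return func

        return decorator


metric_registry = MetricRegistry()  # in the project: from . import metric_registry


@metric_registry.register(
    name="exact_match",  # hypothetical metric name, for illustration only
    description="1.0 if the output equals the reference after normalization, else 0.0",
    metric_type="deterministic",  # assumed value; the field is described as not yet used
    require=["output", "output_true"],
)
def exact_match_metric(output: str, output_true: str, **kwargs) -> float:
    # Normalize case and surrounding whitespace before comparing.
    return float(output.strip().lower() == output_true.strip().lower())
```

Because the decorator returns the function unchanged, the metric can still be called and tested directly, independently of the registry.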

Unit Tests

Tests can be found in api/tests. To run the unit tests, use:

just test

Install python package

just publish

Format code with ruff

just format

Dependency Management

This project uses Renovate for automated dependency management.

What Renovate Does

  • Python dependencies: Automatically updates the uv.lock file based on pyproject.toml constraints
  • Documentation dependencies: Updates npm packages in the docs/ folder
  • Docker dependencies: Updates base images in Dockerfiles and docker-compose files
  • GitHub Actions: Updates action versions in workflow files
  • Security updates: Creates immediate PRs for vulnerability alerts
  • Lock file maintenance: Monthly cleanup of lock files

Configuration

The Renovate configuration is located in .github/renovate.json5 and includes:

  • Scheduled updates: Regular dependency checks throughout the week
    • Monday: Documentation npm dependencies
    • Tuesday: Docker dependencies
    • Wednesday: Python dev dependencies
    • Thursday: GitHub Actions
    • Monthly: Lock file maintenance (1st of month)
  • Grouped updates: Dependencies are grouped to reduce PR noise
  • Version constraints: Respects Python >=3.12 and Node >=18.0 requirements

Managing Renovate PRs

  1. Review the changes: Ensure updates don't break functionality
  2. Test locally: Run just test after merging dependency updates
  3. Monitor schedules:
    • Documentation npm deps: Monday 6am UTC
    • Docker deps: Tuesday 6am UTC
    • Python dev deps: Wednesday 6am UTC
    • GitHub Actions: Thursday 6am UTC
    • Lock file maintenance: 1st of month 6am UTC
    • Security updates: Immediate when vulnerabilities detected

For more configuration options, see the Renovate documentation.