The stack is based on FastAPI, Pydantic, and SQLAlchemy for the API, in conjunction with ZeroMQ for the runner. The project includes:
- A Documentation and Results platform based on Docusaurus (Port 3000).
- A Local Results Explorer based on Streamlit (Port 8501).
evalap/
├── justfile --> just is a handy way to save and run project-specific commands. See [https://just.systems](https://just.systems)
├── evalap/ --> The evalap source code
│   ├── api/ --> The evaluation API source code
│   ├── runner/ --> The runner (message passing) source code
│   ├── mcp/ --> The MCP client
│   ├── ui/ --> The user interface source code
│   └── docs/ --> The documentation source code
├── tests/ --> The tests
└── notebooks/ --> Example and demo notebooks
Install just to run project-specific commands. You will also need jq to parse JSON responses and uv to install the Python requirements.
At a minimum, the project needs the following API key to be set in order to compute LLM-based metrics:
export OPENAI_API_KEY="Your secret key"
For downloading datasets from Hugging Face (used during database seeding), it's recommended to set a Hugging Face token:
export HF_TOKEN="Your Hugging Face token"
You can create a token at https://huggingface.co/settings/tokens. While not strictly required, having a token provides:
- Higher rate limits for dataset downloads
- Access to gated datasets (if needed)
- Better reliability for API calls
The environment variables can also be defined in a .env file at the root of the project. See the .env.example file for an example.
All the project's global settings and environment variables are handled in evalap/api/config.py.
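As a rough illustration of the variables described above, the following sketch checks that the required key is set and flags the recommended one. It is not how evalap/api/config.py works; the function name `check_environment` and the required/recommended split are assumptions made for this example.

```python
import os
from typing import Mapping

# Variable names come from the text above. The split is an assumption of this
# sketch: OPENAI_API_KEY is needed for LLM-based metrics, HF_TOKEN is optional.
REQUIRED_VARS = ("OPENAI_API_KEY",)
RECOMMENDED_VARS = ("HF_TOKEN",)


def check_environment(env: Mapping[str, str] = os.environ) -> dict[str, list[str]]:
    """Return the names of required and recommended variables that are unset or empty."""
    return {
        "missing_required": [name for name in REQUIRED_VARS if not env.get(name)],
        "missing_recommended": [name for name in RECOMMENDED_VARS if not env.get(name)],
    }
```

Calling `check_environment()` with no argument inspects the current process environment; passing a plain dict makes the check easy to exercise without touching real credentials.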
Install the Python requirements with:
just sync
This will also install pre-commit hooks.
You can run EvalAP in two ways: using Docker Compose (recommended) or running services locally.
Docker Compose is the easiest way to get started, with full hot reloading for all services.
just run docker
This will:
- Build the development Docker image (includes dev dependencies like watchdog)
- Start the PostgreSQL database
- Run Alembic migrations automatically
- Start all four services with hot reloading:
  - Uvicorn API on port 8000
  - Runner with file watching
  - Docusaurus UI on port 3000 (New platform)
  - Streamlit UI on port 8501 (Local results explorer)
Once started, the services are available at:
- API: http://localhost:8000
- API Docs: http://localhost:8000/api-docs or http://localhost:8000/redoc
- Docusaurus UI: http://localhost:3000 (Documentation & Results)
- Streamlit UI: http://localhost:8501 (Local results explorer)
- PostgreSQL: localhost:5432 (credentials: postgres/changeme)
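To confirm that the endpoints listed above are reachable after startup, a small TCP probe can help. This is a minimal sketch using only the standard library; the `SERVICES` table and the function names are illustrative and not part of the project.

```python
import socket

# Host/port pairs taken from the list above.
SERVICES = {
    "API": ("localhost", 8000),
    "Docusaurus UI": ("localhost", 3000),
    "Streamlit UI": ("localhost", 8501),
    "PostgreSQL": ("localhost", 5432),
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services: dict[str, tuple[str, int]] = SERVICES) -> dict[str, bool]:
    """Probe every service and map its name to whether the port accepted a connection."""
    return {name: is_port_open(host, port) for name, (host, port) in services.items()}
```

An open port only shows that something is listening; it does not prove the service behind it is healthy.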
Your code is live-mounted into the container. Any changes you make will automatically trigger reloads:
- Edit API code (e.g., evalap/api/main.py) → Uvicorn auto-reloads
- Edit runner code (e.g., evalap/runners/tasks.py) → Runner auto-restarts
- Edit documentation code (e.g., docs/docs/index.md) → Docusaurus auto-reloads
- Edit UI code (e.g., evalap/ui/demo_streamlit/app.py) → Streamlit auto-reloads
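The reload behaviour above rests on detecting file changes. The development image ships watchdog for this; the sketch below deliberately uses plain standard-library polling instead, only to show the idea. The helper names `snapshot` and `changed_files` are hypothetical and do not exist in the project.

```python
from pathlib import Path


def snapshot(root: str, pattern: str = "*.py") -> dict[str, int]:
    """Map each matching file under `root` to its last-modified time in nanoseconds."""
    return {
        str(path): path.stat().st_mtime_ns
        for path in Path(root).rglob(pattern)
        if path.is_file()
    }


def changed_files(before: dict[str, int], after: dict[str, int]) -> set[str]:
    """Return paths that were added, removed, or modified between two snapshots."""
    return {
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    }
```

A reloader built on this would take a snapshot, sleep briefly, take another, and restart the process whenever `changed_files` returns a non-empty set. Event-based watchers such as watchdog avoid the repeated directory scans that polling requires.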
# View logs (all services)
docker compose -f compose.dev.yml logs -f
# View logs (specific service)
docker compose -f compose.dev.yml logs -f evalap_dev
# Enter the container
docker compose -f compose.dev.yml exec evalap_dev bash
# Inside the container, check process status
supervisorctl status
# Restart a specific process
supervisorctl restart uvicorn
supervisorctl restart runner
supervisorctl restart docusaurus
supervisorctl restart streamlit
# Stop services
docker compose -f compose.dev.yml down
# Rebuild after dependency changes
docker compose -f compose.dev.yml up --build
If you prefer to run services directly on your machine:
- Launch the PostgreSQL database:
docker compose -f compose.dev.yml up -d postgres
- Initialize/update the database schema:
alembic -c evalap/api/alembic.ini upgrade head
- If you modify the schema:
alembic -c evalap/api/alembic.ini revision --autogenerate -m "explanatory text"
alembic -c evalap/api/alembic.ini upgrade head
Launch the API, runner, Docusaurus and Streamlit together:
just run
# or explicitly: just run local
This will:
- Seed the database with initial datasets from Hugging Face (if not already present):
  - llm-values-CIVICS: Cultural values evaluation dataset
  - lmsys-toxic-chat: Toxicity detection dataset
  - DECCP: Chinese censorship benchmark
- Start all four services in parallel with colored output and hot reloading
Note: Having an HF_TOKEN set is recommended for better dataset download reliability.
If needed, you can run each service individually:
Launch the API:
uvicorn evalap.api.main:app --reload
Launch the runner:
PYTHONPATH="." python -m evalap.runners
# To change the default logging level:
LOG_LEVEL="DEBUG" PYTHONPATH="." python -m evalap.runners
Launch Docusaurus:
cd docs && npm run start
Launch Streamlit:
uv run streamlit run evalap/ui/demo_streamlit/app.py
If hot reloading is not working:
- Check volume mounting: Ensure the volume is mounted correctly in compose.dev.yml
- Check logs: Look for [Reloader] messages in runner logs
- Verify file changes: Make sure you're editing files in the mounted directory
# Check status
docker compose -f compose.dev.yml exec evalap_dev supervisorctl status
# View logs
docker compose -f compose.dev.yml logs evalap_dev
# Restart the service
docker compose -f compose.dev.yml restart evalap_dev
# Reset the database
docker compose -f compose.dev.yml down -v
docker compose -f compose.dev.yml up --build
The notebooks/ directory contains examples of API usage.
Each metric should be defined in its own file, evalap/api/metrics/{metric_name}.py.
The file should be self-contained, i.e., it should contain any prompt and settings related to the metric.
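To make "self-contained" concrete, here is a hedged sketch of what the body of such a file might hold for an LLM-judged metric: the prompt, the settings, and the scoring helpers all live together. Everything in it is an assumption for illustration (the prompt wording, the `SETTINGS` keys, and the helper names `build_prompt` and `parse_score`); the call to the judge model is left out on purpose.

```python
import re

# Prompt kept next to the metric so the file stays self-contained
# (hypothetical wording, not taken from the project).
PROMPT_TEMPLATE = (
    "Question: {query}\n"
    "Reference answer: {output_true}\n"
    "Candidate answer: {output}\n"
    "Rate the candidate from 0 to 10. Reply with the number only."
)

# Metric-specific settings (hypothetical names and values).
SETTINGS = {"max_score": 10}


def build_prompt(output: str, output_true: str, query: str) -> str:
    """Fill the judge prompt with the fields of one dataset row."""
    return PROMPT_TEMPLATE.format(output=output, output_true=output_true, query=query)


def parse_score(judge_answer: str) -> float:
    """Extract the first number from the judge answer and normalize it to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", judge_answer)
    if match is None:
        raise ValueError(f"no numeric score found in: {judge_answer!r}")
    raw = float(match.group())
    return min(max(raw / SETTINGS["max_score"], 0.0), 1.0)
```

Keeping the prompt and settings as module-level constants means a reviewer can read one file to understand exactly how a score is produced.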
Decorate the metric as in the following example to register it as a known metric of EvalAP:
from . import metric_registry

@metric_registry.register(
    name="metric_name",  # the name that identifies the metric
    description="Explain the metric briefly",
    metric_type="llm",  # to be documented, not yet used
    require=["output", "output_true", "query"],  # the fields that must be present in the dataset of the experiment under evaluation
)
def metric_name_metric(output: str, output_true: str, **kwargs) -> float:
    # ...
    # ... Your code goes here
    # ...
    return score
    # or, if you want to store the intermediate observation generated by the metric (typically a judge answer):
    # return score, observation
Tests can be found in api/tests. To run unit tests, use:
just test
just publish
just format
This project uses Renovate for automated dependency management.
- Python dependencies: Automatically updates the uv.lock file based on pyproject.toml constraints
- Documentation dependencies: Updates npm packages in the docs/ folder
- Docker dependencies: Updates base images in Dockerfiles and docker-compose files
- GitHub Actions: Updates action versions in workflow files
- Security updates: Creates immediate PRs for vulnerability alerts
- Lock file maintenance: Monthly cleanup of lock files
The Renovate configuration is located in .github/renovate.json5 and includes:
- Scheduled updates: Regular dependency checks throughout the week
  - Monday: Documentation npm dependencies
  - Tuesday: Docker dependencies
  - Wednesday: Python dev dependencies
  - Thursday: GitHub Actions
  - Monthly: Lock file maintenance (1st of month)
- Grouped updates: Dependencies are grouped to reduce PR noise
- Version constraints: Respects Python >=3.12 and Node >=18.0 requirements
- Review the changes: Ensure updates don't break functionality
- Test locally: Run just test after merging dependency updates
- Monitor schedules:
  - Documentation npm deps: Monday 6am UTC
  - Docker deps: Tuesday 6am UTC
  - Python dev deps: Wednesday 6am UTC
  - GitHub Actions: Thursday 6am UTC
  - Lock file maintenance: 1st of month 6am UTC
  - Security updates: Immediate when vulnerabilities are detected
For more configuration options, see the Renovate documentation.