EvalHub

EvalHub is a lightweight REST API service for orchestrating LLM evaluations across multiple backends. Written in Go, it routes evaluation requests to frameworks such as lm-evaluation-harness, RAGAS, Garak, and GuideLLM, which are orchestrated through a complementary SDK. It tracks experiments via MLflow and runs natively on OpenShift.

Architecture

The service uses Go's standard net/http router, structured logging with zap, Prometheus metrics, and a pluggable storage layer (SQLite for development, PostgreSQL for production). Providers and benchmarks are declared in YAML configuration files shipped with the container image.

Quick start

Prerequisites

Go 1.25+
Make
Python 3 (for make test; used by scripts/grcat for colored output)
Podman (for container builds)
Access to an OpenShift or Kubernetes cluster (for deployment)

Run locally

make install-deps
make build
./bin/eval-hub

The API is available at http://localhost:8080. Verify it is running:

curl http://localhost:8080/api/v1/health

Interactive documentation is served at /docs.
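
When scripting around a freshly started server, it can help to wait for the health endpoint before sending requests. The following is a minimal sketch using only the Python standard library. It assumes a healthy service answers with HTTP 200, which this README does not state explicitly; the function names are illustrative.

```python
import time
import urllib.request


def health_url(base_url: str) -> str:
    """Build the health endpoint URL from the service base URL."""
    return base_url.rstrip("/") + "/api/v1/health"


def wait_until_healthy(base_url: str = "http://localhost:8080",
                       timeout: float = 30.0,
                       interval: float = 1.0) -> bool:
    """Poll the health endpoint until it answers 200 or the timeout expires.

    Assumption: a healthy service responds with HTTP 200.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(health_url(base_url), timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers connection errors and non-2xx responses (HTTPError);
            # the service is not ready yet, so retry after a short pause.
            pass
        time.sleep(interval)
    return False
```

Calling wait_until_healthy() right after ./bin/eval-hub starts returns True once the endpoint responds, or False if the timeout passes first.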

Run in a container

podman build -t eval-hub:latest -f Containerfile .
podman run --rm -p 8080:8080 eval-hub:latest

Deploy to OpenShift

EvalHub is managed by the TrustyAI Service Operator via a custom resource:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: EvalHub
metadata:
  name: evalhub
  namespace: my-namespace
spec:
  replicas: 1
  env:
    - name: MLFLOW_TRACKING_URI
      value: "http://mlflow:5000"

Apply the CR to your cluster:

oc apply -f evalhub-cr.yaml
oc get evalhub -n my-namespace    # check status
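
If you deploy to several namespaces, the manifest above can be generated rather than copied. The sketch below builds the same custom resource as a Python dictionary and prints it as JSON, which oc apply accepts as well as YAML. The helper name evalhub_cr is hypothetical, and only the fields shown in the manifest above are modelled.

```python
import json


def evalhub_cr(name: str, namespace: str, mlflow_uri: str, replicas: int = 1) -> dict:
    """Return an EvalHub custom resource with the same fields as the manifest above."""
    return {
        "apiVersion": "trustyai.opendatahub.io/v1alpha1",
        "kind": "EvalHub",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicas": replicas,
            "env": [{"name": "MLFLOW_TRACKING_URI", "value": mlflow_uri}],
        },
    }


if __name__ == "__main__":
    # Print the resource as JSON so it can be piped to `oc apply -f -`.
    print(json.dumps(evalhub_cr("evalhub", "my-namespace", "http://mlflow:5000"), indent=2))
```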

Local development

make start-service          # start in background (logs to bin/service.log)
make stop-service           # stop

make test                   # unit tests
make test-fvt               # BDD functional tests (godog)
make test-all               # both
make test-coverage          # generate coverage.html

make lint                   # go vet
make fmt                    # go fmt

Run a single test:

go test -v ./internal/handlers -run TestHandleName

To create a Python wheel distribution of the server for local development and testing:

make cross-compile
make build-wheel

Exposing private functions for tests

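To call an unexported function from a test in another package, create a file called export_test.go in the same package and re-export the function under an exported name. Go compiles _test.go files only when running tests, so the alias never reaches the production binary: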

package auth

var MatchEndpoint = matchEndpoint

Database

An in-memory SQLite database is the default. For PostgreSQL, use the targets in tests/postgres/Makefile:

make -C tests/postgres install-postgres && make -C tests/postgres start-postgres
make -C tests/postgres create-database && make -C tests/postgres create-user && make -C tests/postgres grant-permissions

Then set DB_URL to a PostgreSQL connection string:

export DB_URL="postgres://user@localhost:5432/eval_hub"
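
A malformed connection string is easy to miss until the service fails to start, so a quick check before exporting DB_URL can save a restart. This is a sketch using the standard library's URL parser. It is not the parsing EvalHub performs; accepting both postgres:// and postgresql:// schemes and defaulting to port 5432 are assumptions, and check_postgres_url is a hypothetical helper.

```python
from urllib.parse import urlsplit


def check_postgres_url(db_url: str) -> dict:
    """Split a PostgreSQL connection string into its parts, or raise ValueError."""
    parts = urlsplit(db_url)
    # Assumption: both common scheme spellings are acceptable.
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unexpected scheme {parts.scheme!r}")
    database = parts.path.lstrip("/")
    if not parts.hostname or not database:
        raise ValueError("DB_URL needs a host and a database name")
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port or 5432,  # assumed default PostgreSQL port
        "database": database,
    }
```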

Configuration

Configuration is loaded from config/config.yaml, overridden by environment variables and secret files.

Variable	Purpose	Default
`PORT`	API listen port	`8080`
`DB_URL`	Database connection string	SQLite in-memory
`MLFLOW_TRACKING_URI`	MLflow tracking server	`http://localhost:5000`
`MLFLOW_INSECURE_SKIP_VERIFY`	Skip TLS verification for MLflow	`false`
`LOG_LEVEL`	Logging level	`INFO`
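
The precedence described above (file values first, environment variables on top) can be illustrated with a small sketch. It is a model of the override order only, not the service's Viper-based loader: the defaults are copied from the table, file keys are assumed to use the same names as the environment variables for simplicity, and secret files are left out because their position in the order is not spelled out here.

```python
import os
from typing import Mapping, Optional

# Defaults taken from the table above. DB_URL has no string default
# (the service falls back to in-memory SQLite), so it is left unset.
DEFAULTS = {
    "PORT": "8080",
    "MLFLOW_TRACKING_URI": "http://localhost:5000",
    "MLFLOW_INSECURE_SKIP_VERIFY": "false",
    "LOG_LEVEL": "INFO",
}


def resolve_settings(file_values: Mapping[str, str],
                     environ: Optional[Mapping[str, str]] = None) -> dict:
    """Merge defaults, file values, and environment variables, in that order."""
    env = os.environ if environ is None else environ
    settings = dict(DEFAULTS)
    settings.update(file_values)  # config file overrides defaults
    for key in list(settings) + ["DB_URL"]:
        if key in env:
            settings[key] = env[key]  # environment overrides the file
    return settings
```

For example, resolve_settings({"LOG_LEVEL": "DEBUG"}, {"PORT": "9090"}) keeps the file's log level and takes the port from the environment.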

Provider configurations live in config/providers/ as YAML files. The default set includes lm-evaluation-harness (167 benchmarks), RAGAS, Garak, GuideLLM, LightEval, and MTEB.

API overview

All endpoints are versioned under /api/v1. Full specification at eval-hub.github.io/eval-hub.

Endpoint	Methods	Description
`/api/v1/evaluations/jobs`	POST, GET	Create or list evaluation jobs
`/api/v1/evaluations/jobs/{id}`	GET, DELETE	Get status or cancel a job
`/api/v1/evaluations/collections`	GET, POST	List or create benchmark collections
`/api/v1/evaluations/providers`	GET, POST	List or create providers
`/api/v1/evaluations/providers/{id}`	GET, PUT, PATCH, DELETE	Manage a provider
`/api/v1/evaluations/jobs/{id}/events`	POST	Submit job events
`/api/v1/health`	GET	Health check
`/metrics`	GET	Prometheus metrics

Detailed API documentation: eval-hub.github.io/eval-hub
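
As a rough illustration of how the job endpoints in the table fit together, the sketch below builds requests for listing, fetching, and cancelling jobs with the Python standard library. Request and response schemas are not covered in this README, so the code sends no body, returns the raw response bytes, and only assumes the API speaks JSON (the Accept header). All function names are hypothetical; consult the published specification for the actual payloads.

```python
import json
import urllib.request
from typing import Optional, Tuple
from urllib.parse import quote

API_PREFIX = "/api/v1/evaluations"


def build_request(base_url: str, method: str, path: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build a request for a path under /api/v1/evaluations."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Accept": "application/json"}  # assumption: JSON API
    if data is not None:
        headers["Content-Type"] = "application/json"
    url = base_url.rstrip("/") + API_PREFIX + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def list_jobs(base_url: str) -> urllib.request.Request:
    """GET /api/v1/evaluations/jobs"""
    return build_request(base_url, "GET", "/jobs")


def get_job(base_url: str, job_id: str) -> urllib.request.Request:
    """GET /api/v1/evaluations/jobs/{id}"""
    return build_request(base_url, "GET", "/jobs/" + quote(job_id, safe=""))


def cancel_job(base_url: str, job_id: str) -> urllib.request.Request:
    """DELETE /api/v1/evaluations/jobs/{id}"""
    return build_request(base_url, "DELETE", "/jobs/" + quote(job_id, safe=""))


def send(request: urllib.request.Request, timeout: float = 10.0) -> Tuple[int, bytes]:
    """Send a request and return the status code and raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status, resp.read()
```

With a server running locally, send(list_jobs("http://localhost:8080")) returns the status code and the unparsed body of the job list.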

Custom backends

EvalHub supports Bring Your Own Framework (BYOF). Extend the FrameworkAdapter class from the eval-hub-sdk and implement a single method -- EvalHub handles scheduling, status reporting, and result aggregation.

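import time

# Assumption: JobStatus and JobStatusUpdate are exported from evalhub.adapter
# alongside the other types; adjust the import if the SDK places them elsewhere.
from evalhub.adapter import (
    FrameworkAdapter, JobSpec, JobCallbacks, JobResults, EvaluationResult,
    JobStatus, JobStatusUpdate,
)

class MyAdapter(FrameworkAdapter):
    def run_benchmark_job(self, config: JobSpec, callbacks: JobCallbacks) -> JobResults:
        start = time.monotonic()
        # Report progress via callbacks while the evaluation runs.
        callbacks.report_status(JobStatusUpdate(status=JobStatus.RUNNING, progress=0.5))
        # evaluate() is a placeholder for your own evaluation logic.
        score = evaluate(config.model, config.parameters)
        return JobResults(
            id=config.id,
            benchmark_id=config.benchmark_id,
            model_name=config.model.name,
            results=[EvaluationResult(metric_name="accuracy", metric_value=score)],
            num_examples_evaluated=100,  # replace with the real example count
            duration_seconds=time.monotonic() - start,
        )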

Register the new provider by adding a YAML entry to the providers ConfigMap. No additional services or TCP listeners are required -- adapters run as jobs, not servers. Once registered, the provider and its benchmarks are available through the standard /api/v1/evaluations/providers endpoint.
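
Before wiring an adapter into the SDK, the evaluation loop and its progress reporting can be exercised on their own. The sketch below is SDK-free: RecordingCallbacks is a hypothetical stand-in for JobCallbacks that only mimics a report_status method, and plain dictionaries stand in for JobStatusUpdate. It shows one way to emit periodic progress and return a mean score, not how the SDK itself behaves.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Sequence


@dataclass
class RecordingCallbacks:
    """Hypothetical stand-in for JobCallbacks that records each status update."""
    updates: List[Any] = field(default_factory=list)

    def report_status(self, update: Any) -> None:
        self.updates.append(update)


def run_with_progress(examples: Sequence[Any],
                      score_fn: Callable[[Any], float],
                      callbacks: Any,
                      report_every: int = 10) -> float:
    """Score every example, reporting fractional progress, and return the mean score."""
    total = len(examples)
    if total == 0:
        raise ValueError("no examples to evaluate")
    score_sum = 0.0
    for i, example in enumerate(examples, start=1):
        score_sum += score_fn(example)
        # Report every `report_every` examples and once more at the end.
        if i % report_every == 0 or i == total:
            # In a real adapter this would be a JobStatusUpdate instance.
            callbacks.report_status({"status": "running", "progress": i / total})
    return score_sum / total
```

Inside run_benchmark_job, the same loop would receive the real callbacks object, and its return value would feed the EvaluationResult.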

Project structure

eval-hub/
├── cmd/eval_hub/          # Entry point (main binary)
├── internal/
│   ├── handlers/          # HTTP request handlers
│   ├── storage/           # Database abstraction (SQLite, PostgreSQL)
│   ├── mlflow/            # MLflow client
│   ├── runtimes/          # Backend execution adapters
│   ├── config/            # Viper-based configuration
│   ├── validation/        # Request validation
│   ├── metrics/           # Prometheus instrumentation
│   └── logging/           # Structured logging (zap)
├── config/                # config.yaml and provider definitions
├── docs/src/              # OpenAPI 3.1.0 specification (source of truth)
├── tests/features/        # BDD tests (godog)
├── Containerfile          # Multi-stage UBI9 container build
└── Makefile               # Build, test, and dev targets

Licence

Apache 2.0 -- see LICENSE.
