diff --git a/idea_doc_gsoc_microservices.md b/idea_doc_gsoc_microservices.md
new file mode 100644
index 000000000..7fdb85678
--- /dev/null
+++ b/idea_doc_gsoc_microservices.md
@@ -0,0 +1,498 @@
# Idea Doc: End-to-End AI and Agent API Evaluation Framework (Microservices)
**Author:** Mohamed Salah
**Project:** Google Summer of Code — End-to-End AI & Agent API Evaluation Framework
**Date:** March 2026

---

## 1. Problem Statement

Evaluating modern AI models and agent systems is fragmented. Researchers manually stitch together benchmark tools (lm-harness, lighteval), hand-craft API calls, and build one-off scripts for result analysis. There is no unified, UI-driven framework that handles:

- Standard LLM benchmarks via existing tools
- Custom dataset evaluation across providers (OpenAI, Anthropic, HuggingFace, etc.)
- Multi-modal evaluation (text, image, voice)
- Agent evaluation with intermediate action tracing

This project builds that unified framework — designed as a **microservices system** so each concern scales, deploys, and fails independently.

---

## 2. Microservices Architecture Overview

```
                  ┌──────────────┐
                  │  UI Backend  │  ← BFF (Backend for Frontend)
                  │   Service    │    Aggregates data for the UI
                  └──────┬───────┘
                         │ REST
                  ┌──────▼───────┐
                  │ API Gateway  │  ← Auth, routing, rate limiting
                  └──┬───┬───┬───┘
                     │   │   │
       ┌─────────────┘   │   └────────────────┐
       │                 │                    │
┌──────▼──────┐   ┌──────▼───────┐   ┌────────▼────────┐
│   Dataset   │   │ Eval Engine  │   │  Agent Tracer   │
│   Service   │   │   Service    │   │     Service     │
└─────────────┘   └──────┬───────┘   └────────┬────────┘
                         │                    │
          ┌──────────────▼────────────────────▼──┐
          │        Message Queue (Kafka)         │
          └──────────────┬───────────────────────┘
                         │
                ┌────────▼────────┐
                │    Benchmark    │
                │  Runner Service │  ← lm-harness, lighteval
                └─────────────────┘

   Shared: PostgreSQL · Redis · Object Storage (S3-compatible)
```

---

## 3. 
Service-by-Service Design + +### 3.1 API Gateway — Port 8000 + +Single entry point for all clients. Handles auth (API key / OAuth2), routes requests to downstream services, enforces rate limits, and aggregates errors. + +| Method | Endpoint | Description | +|---|---|---| +| POST | `/api/experiments` | Create and submit a new evaluation experiment. Routes to Eval Engine. | +| GET | `/api/experiments/{id}` | Fetch the full status and results of a single experiment. | +| GET | `/api/experiments` | List all experiments for the authenticated user (filterable by status, date). | +| DELETE | `/api/experiments/{id}` | Cancel a running experiment or delete a completed one. | +| POST | `/api/datasets` | Upload a new dataset file. Routes to Dataset Service. | +| GET | `/api/datasets` | List all datasets uploaded by the authenticated user. | +| GET | `/api/benchmarks` | List all available benchmark tasks. Routes to Benchmark Runner. | +| POST | `/api/benchmarks/run` | Trigger a benchmark run for a given task and model. | +| POST | `/api/agents/trace` | Submit an agent for evaluation. Routes to Agent Tracer. | +| GET | `/api/agents/trace/{id}` | Retrieve a completed agent trace and its metrics. | +| POST | `/api/auth/token` | Issue an API token (API key or OAuth2 flow). | +| GET | `/health` | Gateway health check. Returns status of all downstream services. | + +--- + +### 3.2 Eval Engine — Port 8001 + +Core runner. Loads datasets, dispatches async requests to AI provider adapters, computes metrics, and stores results. + +**Kafka topics — publishes:** `eval.completed`, `eval.progress` +**Kafka topics — consumes:** `benchmark.result` + +| Method | Endpoint | Description | +|---|---|---| +| POST | `/experiments` | Start a new evaluation run. Loads dataset, dispatches provider requests, computes metrics, and stores results. | +| GET | `/experiments/{id}` | Return the current state of an experiment (pending / running / completed) with partial results if still in progress. 
| GET | `/experiments/{id}/results` | Return the full result set for a completed experiment, including per-sample scores and aggregate metrics. |
| POST | `/experiments/{id}/cancel` | Stop a running experiment. Ongoing API calls are aborted. |
| GET | `/experiments/{id}/cost-estimate` | Return a token-count and cost estimate before the experiment runs. |
| GET | `/adapters` | List all registered model adapters (OpenAI, Anthropic, HuggingFace, Custom Agent) and their supported parameters. |
| GET | `/metrics` | List all available evaluation metrics (exact match, BLEU, ROUGE, SBERT, CLIP, WER) with their applicable modalities. |
| GET | `/health` | Service health and Redis / PostgreSQL connectivity status. |

---

### 3.3 Benchmark Runner — Port 8002

Wraps lm-harness and lighteval so standard benchmarks run through the system without modification. Intentionally isolated: lm-harness is CPU- and memory-heavy, so the runner can scale independently of the other services.

**Kafka topics — consumes:** `benchmark.run`
**Kafka topics — publishes:** `benchmark.result`

| Method | Endpoint | Description |
|---|---|---|
| GET | `/tasks` | List all available benchmark tasks across lm-harness and lighteval, plus any custom YAML tasks. |
| GET | `/tasks/{name}` | Return metadata for a single task — description, sample count, expected metrics, and data source. |
| POST | `/runs` | Queue a new benchmark run for a given task and model adapter. Publishes a job to the `benchmark.run` Kafka topic. |
| GET | `/runs/{id}` | Return the status and results of a benchmark run, with scores per subtask. |
| GET | `/runs/{id}/logs` | Stream the raw stdout/stderr of the underlying lm-harness or lighteval process for debugging. |
| POST | `/tasks/custom` | Register a new custom benchmark task by uploading a YAML config file. |
| GET | `/health` | Service health and Kafka consumer status. |

---

### 3.4 Agent Tracer — Port 8003

Evaluates tool-calling and multi-step agents. 
Calls the agent endpoint, intercepts each tool call and LLM response turn, logs the full trace, and computes agent-specific metrics. + +**Kafka topics — publishes:** `agent.trace.completed` + +| Method | Endpoint | Description | +|---|---|---| +| POST | `/traces` | Submit an agent endpoint and a task. The service calls the agent, intercepts each step, and records the full execution trace. | +| GET | `/traces/{id}` | Return a completed trace — all steps, tool calls, LLM responses, final answer, and computed metrics. | +| GET | `/traces` | List all traces with filters by agent endpoint, success status, or date range. | +| GET | `/traces/{id}/steps` | Return only the intermediate steps of a trace — for debugging agent reasoning without the full metric payload. | +| GET | `/traces/{id}/metrics` | Return just the computed agent metrics — task success, steps taken, tool accuracy, hallucinated actions, latency. | +| POST | `/traces/batch` | Submit multiple agent tasks in a single batch. Each task gets an independent trace and metric computation. | +| GET | `/health` | Service health and Kafka producer status. | + +**Agent metrics computed:** +| Metric | Description | +|---|---| +| Task success rate | Did the agent complete the goal? | +| Steps to completion | Fewer = better | +| Tool accuracy | Correct tool called with valid args? | +| Hallucinated actions | Non-existent or malformed tool calls | +| Latency | Total wall-clock trace time | + +--- + +### 3.5 Dataset Service — Port 8004 + +Handles dataset upload, validation, storage in MinIO, and streaming to the Eval Engine during active runs. + +**Supported formats:** CSV, JSONL, HuggingFace Hub +**Column mapping:** `input`, `expected_output`, `context`, `image` (path/base64), `audio` (path/base64) + +| Method | Endpoint | Description | +|---|---|---| +| POST | `/datasets/upload` | Upload a CSV or JSONL file. Validates structure, stores in MinIO, returns a `dataset_id`. 
| +| POST | `/datasets/import` | Import a dataset from HuggingFace Hub by name and split (e.g. `squad/validation`). | +| GET | `/datasets` | List all datasets owned by the user with metadata (row count, columns, modality, upload date). | +| GET | `/datasets/{id}` | Return full metadata for a single dataset including schema and detected modality. | +| GET | `/datasets/{id}/sample` | Return a small preview of N rows. Used by the UI before the user launches a run. | +| GET | `/datasets/{id}/stream` | Stream dataset rows to the Eval Engine in batches during an active experiment. | +| POST | `/datasets/{id}/validate` | Validate a column mapping config — checks that required fields exist and are non-empty. | +| DELETE | `/datasets/{id}` | Permanently delete a dataset and its stored file from MinIO. | +| GET | `/health` | Service health and MinIO connectivity status. | + +--- + +### 3.6 UI Backend (BFF) — Port 8005 + +Backend for Frontend. Aggregates data from multiple services into UI-ready responses and relays real-time Kafka events to the browser over WebSocket. + +| Method | Endpoint | Description | +|---|---|---| +| GET | `/dashboard/experiments` | Return a combined experiment list with status, dataset name, model, and top-level metrics — aggregated from Eval Engine and Dataset Service in one call. | +| GET | `/dashboard/experiments/{id}` | Return a fully assembled detail view — config, dataset info, results, and linked agent traces — ready for the results page. | +| GET | `/dashboard/compare` | Return a side-by-side comparison of two or more experiments. Accepts a list of experiment IDs and returns aligned metrics. | +| GET | `/dashboard/providers` | Return all supported model providers and available models, pulled from the Eval Engine adapter registry. | +| GET | `/dashboard/cost-estimate` | Return a cost and token-count estimate for a proposed experiment config before submission. | +| WS | `/ws/experiments/{id}/progress` | WebSocket. 
Relays real-time `eval.progress` Kafka events to the browser — sample count, current score, estimated time remaining. |
| WS | `/ws/benchmarks/{id}/progress` | WebSocket. Relays benchmark progress events from the Benchmark Runner — current task, subtask scores as they complete. |
| GET | `/exports/experiments/{id}` | Download the full results of an experiment as a CSV or JSON file. |
| GET | `/health` | BFF health check and Kafka consumer connectivity. |

---

## 4. Shared Infrastructure

| Component | Purpose | Technology |
|---|---|---|
| Message Bus | Async job dispatch, event streaming, replayable logs | Kafka |
| Cache | Response caching, rate-limit buckets | Redis |
| Database | Experiments, results, traces, dataset metadata | PostgreSQL |
| Object Storage | Dataset files, audio/image blobs | MinIO (S3-compatible) |

---

## 5. Multi-Modal Support

| Modality | Tasks | Metrics |
|---|---|---|
| **Text** | QA, summarization, classification, translation | Exact match, BLEU, ROUGE, SBERT similarity |
| **Image** | Image captioning, Visual QA | CLIP similarity, CIDEr |
| **Voice** | Speech-to-text, voice assistants | Word Error Rate (WER), latency |

Image and audio fields are uploaded via the Dataset Service and passed as base64 blobs in provider API calls.

---

## 6. 
Deployment

### Docker Compose (self-hosted / development)
```yaml
services:
  api-gateway: { build: ./services/gateway, ports: ["8000:8000"] }
  eval-engine: { build: ./services/eval, ports: ["8001:8001"] }
  benchmark-runner: { build: ./services/benchmark, ports: ["8002:8002"] }
  agent-tracer: { build: ./services/agent, ports: ["8003:8003"] }
  dataset-service: { build: ./services/dataset, ports: ["8004:8004"] }
  ui-backend: { build: ./services/ui-backend, ports: ["8005:8005"] }
  ui: { build: ./ui, ports: ["3000:3000"] }
  # Kafka/ZooKeeper listener env vars omitted here for brevity.
  zookeeper: { image: confluentinc/cp-zookeeper:7.6.0, ports: ["2181:2181"] }
  kafka: { image: confluentinc/cp-kafka:7.6.0, ports: ["9092:9092"] }
  redis: { image: redis:7 }
  postgres: { image: postgres:16, environment: { POSTGRES_PASSWORD: example } }
  minio: { image: minio/minio, command: server /data }
```

### Kubernetes (production)
Each service gets its own `Deployment` + `Service` manifest. The Eval Engine and Benchmark Runner additionally get a `HorizontalPodAutoscaler`, since large eval batches and benchmark jobs are CPU-heavy and bursty. Kafka and PostgreSQL run as StatefulSets.

```
k8s/
  ├── gateway/        deployment.yaml, service.yaml
  ├── eval-engine/    deployment.yaml, service.yaml, hpa.yaml
  ├── benchmark/      deployment.yaml, service.yaml, hpa.yaml
  ├── agent-tracer/   deployment.yaml, service.yaml
  ├── dataset/        deployment.yaml, service.yaml
  ├── ui-backend/     deployment.yaml, service.yaml
  ├── infra/          kafka.yaml, zookeeper.yaml, redis.yaml, postgres.yaml, minio.yaml
  └── ingress.yaml
```

---

## 7. 
Proposed Timeline (GSoC ~12 weeks) + +| Phase | Weeks | Deliverable | +|---|---|---| +| Scaffold | 1 | Repo structure, Docker Compose, shared DB schema, Kafka setup | +| Adapters + Eval Engine | 2–3 | Adapter layer (OpenAI, Anthropic, HF), Eval Engine Service end-to-end | +| Dataset Service | 4 | Upload, validation, streaming, MinIO integration | +| Benchmark Runner | 5–6 | lm-harness + lighteval bridges; MMLU / GSM8K working | +| Agent Tracer | 7–8 | Trace schema, agent metrics, CustomAgentAPIAdapter | +| Multi-Modal | 9 | Image + voice adapters, CLIP / WER metrics | +| UI + BFF | 10–11 | React UI: experiment builder, live run view, results dashboard, WebSocket progress | +| K8s + Docs | 12 | K8s manifests, full docs, ≥80% test coverage, example notebooks | + +--- + +## 8. Open Questions for Maintainer + +1. Should the Benchmark Runner spawn lm-harness via subprocess or use its Python API directly? +2. Are there specific agent benchmarks (WebArena, AgentBench, τ-bench) to prioritize? +3. Is MinIO acceptable for object storage, or is there a preferred S3-compatible solution? +4. Should inter-service auth (service-to-service JWT) be in scope for GSoC, or is internal trust assumed? + +--- + +## 9. File Architecture + +### Monorepo Root + +``` +ai-eval-framework/ +├── services/ ← one folder per microservice +├── ui/ ← React frontend +├── k8s/ ← all Kubernetes manifests +├── shared/ ← shared Pydantic schemas + Kafka event types +│ ├── schemas.py ← RequestConfig, ModelResponse, ExperimentResult … +│ └── events.py ← Kafka topic names + event payload models +├── docker-compose.yml +├── .env.example +└── README.md +``` + +> `shared/` is imported by every service. `schemas.py` holds models that cross service boundaries. `events.py` holds Kafka topic names and payload shapes — both producer and consumer import from the same file so schema mismatches are caught at import time, not at runtime. 
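
To make the shared-contract idea concrete, here is a minimal sketch of what `shared/events.py` could contain. The topic names come from the design above; everything else (model names, field names) is an illustrative assumption, not a settled schema:

```python
# shared/events.py — illustrative sketch. Topic names are from this doc;
# all model and field names below are assumptions, not a final schema.
from enum import Enum
from typing import Optional
from pydantic import BaseModel


class Topic(str, Enum):
    EVAL_PROGRESS = "eval.progress"
    EVAL_COMPLETED = "eval.completed"
    BENCHMARK_RUN = "benchmark.run"
    BENCHMARK_RESULT = "benchmark.result"
    AGENT_TRACE_COMPLETED = "agent.trace.completed"


class EvalProgress(BaseModel):
    """Published by the Eval Engine; relayed to the browser by the BFF."""
    experiment_id: str
    samples_done: int
    samples_total: int
    running_score: Optional[float] = None  # partial aggregate, if available


class BenchmarkRunRequested(BaseModel):
    """Published on benchmark.run; consumed by the Benchmark Runner."""
    run_id: str
    task: str      # e.g. "mmlu"
    adapter: str   # e.g. "openai"
```

Because producers and consumers import the same models, a renamed field surfaces as a validation or import error in both services at once, rather than as a silent payload mismatch at runtime.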
+ +--- + +### API Gateway — `:8000` + +``` +services/gateway/ +├── app/ +│ ├── main.py ← FastAPI app, mounts all routers +│ ├── routers/ +│ │ ├── experiments.py ← proxies to eval-engine +│ │ ├── datasets.py ← proxies to dataset-service +│ │ ├── benchmarks.py ← proxies to benchmark-runner +│ │ ├── agents.py ← proxies to agent-tracer +│ │ └── auth.py ← token issuance + validation +│ ├── middleware/ +│ │ ├── auth.py ← API key / JWT verification on every request +│ │ └── rate_limit.py ← per-user rate limiting via Redis +│ └── config.py ← downstream service URLs from env vars +├── tests/ +│ └── test_routing.py +├── Dockerfile +└── requirements.txt +``` + +--- + +### Eval Engine — `:8001` + +``` +services/eval-engine/ +├── app/ +│ ├── main.py +│ ├── routers/ +│ │ ├── experiments.py ← POST/GET /experiments +│ │ ├── adapters.py ← GET /adapters +│ │ └── metrics.py ← GET /metrics +│ ├── adapters/ ← one file per AI provider +│ │ ├── base.py ← BaseModelAdapter ABC +│ │ ├── openai.py +│ │ ├── anthropic.py +│ │ ├── huggingface.py +│ │ └── custom_agent.py +│ ├── engine/ +│ │ ├── runner.py ← async dispatcher, rate-limit, retries, caching +│ │ ├── dataset_client.py ← streams rows from dataset-service +│ │ └── cost_estimator.py +│ ├── metrics/ +│ │ ├── text.py ← BLEU, ROUGE, exact match, SBERT +│ │ ├── image.py ← CLIP similarity, CIDEr +│ │ ├── voice.py ← WER, latency +│ │ └── registry.py ← maps metric names → functions +│ ├── kafka/ +│ │ ├── producer.py ← publishes eval.progress, eval.completed +│ │ └── consumer.py ← consumes benchmark.result +│ ├── db/ +│ │ ├── models.py ← SQLAlchemy Experiment, Result models +│ │ └── crud.py +│ └── config.py +├── tests/ +│ ├── test_adapters.py +│ ├── test_runner.py +│ └── test_metrics.py +├── Dockerfile +└── requirements.txt +``` + +--- + +### Benchmark Runner — `:8002` + +``` +services/benchmark-runner/ +├── app/ +│ ├── main.py +│ ├── routers/ +│ │ ├── tasks.py ← GET /tasks, GET /tasks/{name} +│ │ └── runs.py ← POST /runs, GET /runs/{id}, GET 
/runs/{id}/logs +│ ├── bridges/ ← wraps external benchmark tools +│ │ ├── base.py ← BenchmarkBridge ABC +│ │ ├── lm_harness.py ← wraps lm-harness CLI / Python API +│ │ └── lighteval.py ← wraps lighteval +│ ├── tasks/ ← YAML task definitions (mounted as a k8s volume) +│ │ ├── mmlu.yaml +│ │ ├── gsm8k.yaml +│ │ └── truthfulqa.yaml +│ ├── kafka/ +│ │ ├── consumer.py ← consumes benchmark.run +│ │ └── producer.py ← publishes benchmark.result +│ ├── db/ +│ │ ├── models.py ← BenchmarkRun model +│ │ └── crud.py +│ └── config.py +├── tests/ +│ ├── test_lm_harness_bridge.py +│ └── test_lighteval_bridge.py +├── Dockerfile +└── requirements.txt +``` + +--- + +### Agent Tracer — `:8003` + +``` +services/agent-tracer/ +├── app/ +│ ├── main.py +│ ├── routers/ +│ │ └── traces.py ← all /traces endpoints +│ ├── tracer/ +│ │ ├── runner.py ← calls agent endpoint, intercepts each step +│ │ ├── interceptor.py ← captures tool calls + LLM responses mid-trace +│ │ └── metrics.py ← task success, tool accuracy, hallucination detection +│ ├── kafka/ +│ │ └── producer.py ← publishes agent.trace.completed +│ ├── db/ +│ │ ├── models.py ← Trace, TraceStep models +│ │ └── crud.py +│ └── config.py +├── tests/ +│ ├── test_tracer_runner.py +│ └── test_metrics.py +├── Dockerfile +└── requirements.txt +``` + +--- + +### Dataset Service — `:8004` + +``` +services/dataset-service/ +├── app/ +│ ├── main.py +│ ├── routers/ +│ │ └── datasets.py ← all /datasets endpoints +│ ├── storage/ +│ │ └── minio_client.py ← upload, download, delete from MinIO +│ ├── parsers/ +│ │ ├── csv_parser.py +│ │ ├── jsonl_parser.py +│ │ └── huggingface_loader.py ← pulls from HuggingFace Hub +│ ├── validators/ +│ │ └── schema_validator.py ← checks column mapping, non-empty fields +│ ├── db/ +│ │ ├── models.py ← Dataset metadata model +│ │ └── crud.py +│ └── config.py +├── tests/ +│ ├── test_parsers.py +│ └── test_validators.py +├── Dockerfile +└── requirements.txt +``` + +--- + +### UI Backend (BFF) — `:8005` + +``` 
services/ui-backend/
├── app/
│   ├── main.py
│   ├── routers/
│   │   ├── dashboard.py ← all /dashboard/* REST endpoints
│   │   ├── websocket.py ← /ws/experiments/{id} and /ws/benchmarks/{id}
│   │   └── exports.py ← CSV / JSON download endpoints
│   ├── aggregators/ ← calls multiple services, merges responses
│   │   ├── experiment_aggregator.py ← merges eval + dataset + trace data
│   │   └── compare_aggregator.py ← aligns metrics across experiments
│   ├── kafka/
│   │   └── consumer.py ← reads eval.progress → forwards to WebSocket
│   ├── clients/ ← typed HTTP clients for each upstream service
│   │   ├── eval_client.py
│   │   ├── dataset_client.py
│   │   ├── benchmark_client.py
│   │   └── agent_client.py
│   └── config.py
├── tests/
│   ├── test_aggregators.py
│   └── test_websocket.py
├── Dockerfile
└── requirements.txt
```

---

### React UI — `:3000`

```
ui/
├── src/
│   ├── pages/
│   │   ├── ExperimentBuilder.tsx ← configure provider, dataset, metrics
│   │   ├── LiveRun.tsx ← progress bar, streaming results
│   │   ├── Results.tsx ← scores, failure cases, charts
│   │   └── Compare.tsx ← side-by-side model comparison
│   ├── components/ ← reusable UI components
│   ├── hooks/
│   │   └── useExperimentSocket.ts ← manages WebSocket connection
│   ├── api/ ← typed fetch wrappers for every BFF endpoint
│   └── main.tsx
├── Dockerfile
├── package.json
└── vite.config.ts
```

---

## 10. Summary

This design decomposes the framework into **6 independent microservices**, each with a clear single responsibility. The Benchmark Runner is isolated because lm-harness is CPU-heavy. The Agent Tracer is isolated because agent evaluation has a fundamentally different execution model from standard inference. The Dataset Service is isolated because dataset handling is I/O-bound rather than compute-bound, so it scales on a different axis from the inference-heavy services.

Services communicate synchronously over REST for request/response flows and asynchronously over **Kafka** for long-running jobs (benchmark runs, large eval batches). 
Kafka's log-based model is a deliberate choice over a traditional message queue — it allows event replay for debugging failed runs, reprocessing results with updated metrics, and auditing the full history of an experiment without re-running it. This makes the system resilient and observable: a slow benchmark run does not block the UI or the Eval Engine, and progress events are retained in the log (within the configured retention window) rather than consumed destructively, so a consumer that restarts or falls behind can catch up without losing events.
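
To make the replay argument concrete, the sketch below models a Kafka-style topic as an in-memory append-only list (no real broker or client library; the per-sample `eval.sample` topic and all field names are hypothetical). Because events are retained rather than consumed destructively, an updated metric can be recomputed by simply re-reading the log, with no model calls repeated:

```python
# Toy model of a Kafka-style log — illustrative only, no real broker.
# An append-only list stands in for a topic partition; "replaying" is
# just re-reading from a chosen offset with a new processing function.
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    events: list = field(default_factory=list)

    def publish(self, event: dict) -> int:
        """Append an event; it is retained, not removed on read."""
        self.events.append(event)
        return len(self.events) - 1  # the event's offset

    def replay(self, from_offset: int = 0):
        """Re-read events from any offset — the basis for reprocessing."""
        return iter(self.events[from_offset:])


# Per-sample results published on a hypothetical eval.sample topic.
log = TopicLog()
log.publish({"sample": 1, "expected": "paris", "output": "Paris"})
log.publish({"sample": 2, "expected": "berlin", "output": "Munich"})

# First pass: strict exact match.
strict = [e["expected"] == e["output"] for e in log.replay()]

# Later: the metric is updated to be case-insensitive. Recomputed by
# replaying the same log — no provider API calls are repeated.
relaxed = [e["expected"] == e["output"].lower() for e in log.replay()]

print(sum(strict) / len(strict))    # 0.0
print(sum(relaxed) / len(relaxed))  # 0.5
```

A classic work queue would have deleted each message on acknowledgement, making this kind of after-the-fact reprocessing impossible without re-running the experiment.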