A local-first, end-to-end web interface for running LLM benchmarks via the EleutherAI lm-eval harness, built as a proof of concept for the AI eval project. It targets cloud APIs such as Gemini and Groq models, but the same proxy pattern can be used for local model testing too.
The core challenge of this PoC was that lm-eval expects every API to behave exactly like OpenAI. In reality, each provider ships its own strict (and often idiosyncratic) schema validator.
For example:

- Google Gemini instantly throws a `400 Bad Request` if it receives OpenAI-specific parameters like `seed`.
- Groq crashes if it receives unexpected `type` properties inside the `messages` array.
(this was the coolest part :)
Instead of running evaluations directly from the React UI, we trigger an HTTP POST to a decoupled FastAPI backend. This backend runs lm-eval safely in a background thread.
To fix the schema crashes, we built a Native Middleware Pattern:
- `lm-eval` is instructed to send its API requests to our local FastAPI proxy routes (e.g., `http://localhost:8000/proxy/v1/...`) instead of the cloud provider.
- The proxy intercepts the payload and strips out the vendor-incompatible parameters (like removing `seed` for Gemini, or popping `type` for Groq).
- It seamlessly forwards the sanitized request to the actual provider, gets the response, and hands it back to `lm-eval`.
To get a live terminal feel in the browser, we attached a custom `logging.Handler` directly to Python's logging module inside the background thread. This intercepts the logs as lm-eval generates them and streams them to the React frontend in real time using Server-Sent Events (SSE).
- Zero-File I/O: Evaluations run entirely in-memory using background threads, requiring no temp file generation.
- Live Log Streaming: Real-time execution logs piped directly to a custom web terminal via SSE.
- Multiple Vendors: Demonstrated compatibility with strict APIs (Gemini, Groq) via custom payload sanitization routes.
- Modern Stack: FastAPI backend + Vite/React (TypeScript, SWC, Tailwind v4) frontend.
Prerequisites: Ensure you have Python 3.10+ and Node.js 18+ installed on your machine.
Note: The first boot might take 2-5 minutes as it downloads lm-eval and the Node modules.
Clone the repository and run the startup script in your terminal (like the VS Code integrated terminal). This will automatically create the Python virtual environment, install all dependencies, and boot both servers.
For Windows:

```
.\start.bat
```

For Mac/Linux:

```
chmod +x start.sh
./start.sh
```

If the automated scripts fail due to strict execution policies on your machine, you can start the servers manually using two terminals.
Terminal 1: Backend

```
cd backend
python -m venv venv
# On Windows use: venv\Scripts\activate
# On Mac/Linux use: source venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --reload
```

Terminal 2: Frontend

```
cd frontend
npm install
npm run dev
```

Open http://localhost:5173 in your browser once both are running.
1. Open your browser to http://localhost:5173.
2. Enter your own API key. (Note: this PoC is verified specifically for Gemini and Groq keys; standard OpenAI keys may fail until a dedicated OpenAI passthrough route is added.)
3. Enter the corresponding model name (e.g., `gemini-flash-latest` or `llama-3.3-70b-versatile`).
4. Enter your desired task (e.g., `gsm8k`) and sample limit.
5. Click Run Evaluation and watch the logs stream in real time.

Test results: in local testing using this framework, I tested these:

- Gemini 3.0 Flash on GSM8K (5 samples): 0.2000 (20%) exact match
- Llama 3.3 70B on GSM8K (5 samples): 1.0000 (100%) exact match
