87 changes: 87 additions & 0 deletions poc_ai_eval_framework/README.md
@@ -0,0 +1,87 @@

# API Dash Eval PoC

A local-first, end-to-end web interface for running LLM benchmarks via the EleutherAI `lm-eval` harness, built as a proof of concept for the AI eval project. It targets cloud APIs such as Gemini and Groq models, but the same proxy pattern can be reused for local model testing too.

![App Demo](./demo.gif)

## How It Works (The Architecture)

The core challenge of this PoC is that `lm-eval` expects every API to behave exactly like OpenAI's. In reality, different providers enforce strict and idiosyncratic schema validation.

For example:
* **Google Gemini** instantly throws a `400 Bad Request` if it receives OpenAI-specific parameters like `seed`.
* **Groq** crashes if it receives unexpected `type` properties inside the `messages` array.

### The Proxy Middleware Solution
Instead of running evaluations directly from the React UI, the frontend triggers an HTTP POST to a decoupled FastAPI backend. This backend runs `lm-eval` safely in a background thread.

To fix the schema crashes, we built a **Native Middleware Pattern**:
1. `lm-eval` is instructed to send its API requests to our local FastAPI proxy routes (e.g., `http://localhost:8000/proxy/v1/...`) instead of the cloud provider.
2. The proxy intercepts the payload, strips out the vendor-incompatible parameters (like removing `seed` for Gemini, or popping `type` for Groq).
3. It seamlessly forwards the sanitized request to the actual provider, gets the response, and hands it back to `lm-eval`.
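Step 2 is plain dictionary surgery on the JSON payload before it is forwarded. A minimal sketch of the two sanitizers (the function names are illustrative; the actual logic lives in the proxy routes in `backend/main.py`):

```python
def sanitize_for_gemini(payload: dict) -> dict:
    # Gemini's OpenAI-compatible endpoint rejects these OpenAI-only parameters.
    for key in ("seed", "presence_penalty", "frequency_penalty"):
        payload.pop(key, None)
    return payload

def sanitize_for_groq(payload: dict) -> dict:
    # Groq rejects a 'type' property inside message objects; also flatten
    # the list-of-parts "vision" content format down to a plain string.
    for msg in payload.get("messages", []):
        msg.pop("type", None)
        if isinstance(msg.get("content"), list):
            msg["content"] = " ".join(
                part.get("text", "")
                for part in msg["content"]
                if part.get("type") == "text"
            )
    # Groq's chat completions endpoint also doesn't accept logprobs yet.
    payload.pop("logprobs", None)
    payload.pop("top_logprobs", None)
    return payload
```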

### Real-Time Log Streaming
To get a live terminal feel in the browser, we attach a custom `logging.Handler` to Python's root logger inside the background thread. It intercepts log records as `lm-eval` emits them and streams them to the React frontend in real time using Server-Sent Events (SSE).
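The interception itself is just the standard `logging.Handler` extension point. A simplified sketch, using a plain list where the real backend pushes into an `asyncio.Queue`:

```python
import logging

class CaptureHandler(logging.Handler):
    """Collects formatted log lines.

    The real backend's handler forwards each line into an asyncio.Queue
    instead of a list, so the SSE endpoint can await new messages.
    """
    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))

logger = logging.getLogger("demo")
handler = CaptureHandler()
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Running gsm8k...")

# Detach the handler afterwards so it doesn't leak across runs.
logger.removeHandler(handler)
```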

## Key Features
* **Zero-File I/O:** Evaluations run entirely in-memory using background threads, requiring no temp file generation.
* **Live Log Streaming:** Real-time execution logs piped directly to a custom web terminal via SSE.
* **Multiple Vendors:** Demonstrated compatibility with strict APIs (Gemini, Groq) via custom payload sanitization routes.
* **Modern Stack:** FastAPI backend + Vite/React (TypeScript, SWC, Tailwind v4) frontend.
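The zero-file-I/O claim rests on one idiom: the eval worker runs in a plain thread, and every message crosses back into the async world via `asyncio.run_coroutine_threadsafe`. A stripped-down sketch of that bridge (the `[EVAL_DONE]` sentinel mirrors the one in `backend/eval_runner.py`; the `worker`/`consume` names are illustrative):

```python
import asyncio
import threading

def worker(queue: asyncio.Queue, loop: asyncio.AbstractEventLoop):
    # Synchronous thread: hand each message to the event loop thread-safely.
    for line in ["step 1", "step 2", "[EVAL_DONE]"]:
        asyncio.run_coroutine_threadsafe(queue.put(line), loop)

async def consume() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    loop = asyncio.get_running_loop()
    threading.Thread(target=worker, args=(queue, loop), daemon=True).start()
    logs = []
    while True:
        msg = await queue.get()
        if msg == "[EVAL_DONE]":  # completion sentinel
            return logs
        logs.append(msg)

logs = asyncio.run(consume())
```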

---
The steps below cover the standard repository installation; use your own preferred workflow for testing if you like.
## One-Command Setup

**Prerequisites:** Ensure you have **Python 3.10+** and **Node.js 18+** installed on your machine.

**Note:** The first boot might take 2-5 minutes as it downloads `lm-eval` and the Node modules.

Clone the repository and run the startup script in your terminal (like the VS Code integrated terminal). This will automatically create the Python virtual environment, install all dependencies, and boot both servers.

**For Windows:**
```cmd
.\start.bat
```

**For Mac/Linux:**
```bash
chmod +x start.sh
./start.sh
```

### Manual Setup (Fallback)
If the automated scripts fail due to strict execution policies on your machine, you can start the servers manually using two terminals.

**Terminal 1: Backend**
```bash
cd backend
python -m venv venv
# On Windows use: venv\Scripts\activate
# On Mac/Linux use: source venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --reload
```

**Terminal 2: Frontend**
```bash
cd frontend
npm install
npm run dev
```
Open http://localhost:5173 in your browser once both are running.

## How to Use & Concrete Results

1. Open your browser to http://localhost:5173.
2. Enter your own API key. (Note: this PoC is verified specifically for Gemini and Groq keys; standard OpenAI keys may fail until a dedicated OpenAI passthrough route is added.)
3. Enter the corresponding model name (e.g., `gemini-flash-latest` or `llama-3.3-70b-versatile`).
4. Enter your desired task (e.g., `gsm8k`) and sample limit.
5. Click **Run Evaluation** and watch the logs stream in real time.

**Test results** from local runs with this framework:

* Gemini 3.0 Flash on GSM8K (5 samples): 0.2000 (20%) exact match
* Llama 3.3 70B on GSM8K (5 samples): 1.0000 (100%) exact match
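You can also drive the backend without the UI. The stream at `/api/stream/{run_id}` emits standard SSE `data:` frames whose JSON payload carries one of three keys (`log`, `done`, or `error`). A small sketch of decoding one frame (the `parse_sse_line` helper is hypothetical, not part of the codebase):

```python
import json

def parse_sse_line(line: str):
    """Decode one 'data: {...}' SSE frame from /api/stream/{run_id}.

    Returns the parsed JSON object, or None for non-data lines
    (comments, keep-alives, blank separators).
    """
    if not line.startswith("data: "):
        return None
    return json.loads(line[len("data: "):])

event = parse_sse_line('data: {"log": "Requesting API..."}')
```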
2 changes: 2 additions & 0 deletions poc_ai_eval_framework/backend/.env
@@ -0,0 +1,2 @@
# Not necessary for testing right now; the code sets the key via os.environ directly instead.
GEMINI_API_KEY=
57 changes: 57 additions & 0 deletions poc_ai_eval_framework/backend/eval_runner.py
@@ -0,0 +1,57 @@
import lm_eval
import logging
import asyncio
import os

class QueueLogHandler(logging.Handler):
def __init__(self, queue: asyncio.Queue, loop: asyncio.AbstractEventLoop):
super().__init__()
self.queue = queue
self.loop = loop

def emit(self, record):
msg = self.format(record)
# Safely push the log into the async queue from this synchronous thread
asyncio.run_coroutine_threadsafe(self.queue.put(msg), self.loop)

def run_evaluation_thread(run_id: str, model_name: str, api_key: str, queue: asyncio.Queue, loop: asyncio.AbstractEventLoop, results_store: dict):
# 1. Attach our custom handler to the root logger so we catch everything
logger = logging.getLogger()
handler = QueueLogHandler(queue, loop)
handler.setFormatter(logging.Formatter('%(message)s'))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

try:
os.environ["OPENAI_API_KEY"] = api_key

# 2. For Gemini models, point base_url at our FastAPI proxy route so it can strip the 'seed' parameter that Gemini rejects.
if "gemini" in model_name.lower():
# Route through our Gemini proxy
args_string = f"model={model_name},base_url=http://localhost:8000/proxy/v1/chat/completions"
elif "llama" in model_name.lower() or "mixtral" in model_name.lower() or "gemma" in model_name.lower():
# THE FIX: Route through our new Groq proxy to sanitize the message array
args_string = f"model={model_name},base_url=http://localhost:8000/proxy/groq/v1/chat/completions"
else:
# Default behavior: Let lm-eval talk straight to OpenAI
args_string = f"model={model_name}"

results = lm_eval.simple_evaluate(
model="local-chat-completions",
model_args=args_string,
tasks=["gsm8k"],
limit=5,
log_samples=False,
apply_chat_template=True
)

# 3. Store the final results and signal completion
results_store[run_id] = {"status": "completed", "data": results['results']}
asyncio.run_coroutine_threadsafe(queue.put("[EVAL_DONE]"), loop)

except Exception as e:
results_store[run_id] = {"status": "error", "error": str(e)}
asyncio.run_coroutine_threadsafe(queue.put(f"[EVAL_ERROR] {str(e)}"), loop)
finally:
# Clean up the handler so it doesn't leak across different runs
logger.removeHandler(handler)
117 changes: 117 additions & 0 deletions poc_ai_eval_framework/backend/main.py
@@ -0,0 +1,117 @@
from fastapi import FastAPI, Request, Response
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse
import uuid
import asyncio
import httpx
import json
import threading
from eval_runner import run_evaluation_thread

app = FastAPI()

# Enable CORS for the Vite frontend (port 5173)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)

# In-memory state management
queues = {}
results_store = {}

@app.post("/api/evaluate")
async def start_eval(request: Request):
data = await request.json()
run_id = str(uuid.uuid4())

queues[run_id] = asyncio.Queue()
results_store[run_id] = {"status": "running"}

# Spawn the background thread
loop = asyncio.get_running_loop()
threading.Thread(
target=run_evaluation_thread,
args=(run_id, data['model'], data['api_key'], queues[run_id], loop, results_store),
daemon=True
).start()

return {"run_id": run_id}

@app.get("/api/stream/{run_id}")
async def stream_logs(run_id: str):
async def event_generator():
queue = queues.get(run_id)
if not queue:
yield f"data: {json.dumps({'error': 'Invalid run_id'})}\n\n"
return

while True:
msg = await queue.get()
if msg == "[EVAL_DONE]":
final_data = results_store.get(run_id, {})
yield f"data: {json.dumps({'done': True, 'results': final_data.get('data')})}\n\n"
break
elif msg.startswith("[EVAL_ERROR]"):
yield f"data: {json.dumps({'error': msg})}\n\n"
break
else:
yield f"data: {json.dumps({'log': msg})}\n\n"

return StreamingResponse(event_generator(), media_type="text/event-stream")

#this is to strip the 'seed' parameter, this is for gemini, this can all be moved to a separate file in the future.
@app.post("/proxy/v1/chat/completions")
async def gemini_proxy(request: Request):
payload = await request.json()

payload.pop('seed', None)
payload.pop('presence_penalty', None)
payload.pop('frequency_penalty', None)

auth_header = request.headers.get('Authorization')
url = "https://generativelanguage.googleapis.com/v1beta/openai/chat/completions"

async with httpx.AsyncClient() as client:
resp = await client.post(
url,
json=payload,
headers={'Authorization': auth_header},
timeout=60.0
)
return Response(content=resp.content, status_code=resp.status_code, media_type="application/json")

#this is the groq adapter, to remove type from messages array as groq doesn't accept it
@app.post("/proxy/groq/v1/chat/completions")
async def groq_proxy(request: Request):
payload = await request.json()

# 1. Groq rejects the 'type' property inside message objects
if 'messages' in payload:
for msg in payload['messages']:
msg.pop('type', None) # Strip the unsupported 'type' key

# If lm-eval accidentally sends the new vision format (list of dicts),
# flatten it down to a standard string for Groq.
if isinstance(msg.get('content'), list):
text_content = " ".join([item.get('text', '') for item in msg['content'] if item.get('type') == 'text'])
msg['content'] = text_content

# 2. Groq also doesn't support logprobs yet on chat completions
payload.pop('logprobs', None)
payload.pop('top_logprobs', None)

auth_header = request.headers.get('Authorization')
url = "https://api.groq.com/openai/v1/chat/completions"

async with httpx.AsyncClient() as client:
resp = await client.post(
url,
json=payload,
headers={'Authorization': auth_header},
timeout=60.0
)
return Response(content=resp.content, status_code=resp.status_code, media_type="application/json")
Binary file added poc_ai_eval_framework/backend/requirements.txt
89 changes: 89 additions & 0 deletions poc_ai_eval_framework/backend/test.py
@@ -0,0 +1,89 @@
from http.server import BaseHTTPRequestHandler, HTTPServer
import json
import os
import threading
from dotenv import load_dotenv
import lm_eval
import urllib.request
import urllib.error

load_dotenv()

# 1. Load the Gemini API key from .env.
# We set it as OPENAI_API_KEY because we are utilizing the OpenAI-compatible wrapper.
# Ensure GEMINI_API_KEY is set in backend/.env before running.
os.environ["OPENAI_API_KEY"] = os.getenv("GEMINI_API_KEY")
api_key = os.getenv("OPENAI_API_KEY")

class GeminiProxyHandler(BaseHTTPRequestHandler):
def log_message(self, format, *args):
pass # Suppress proxy logs to keep your terminal clean

def do_POST(self):
content_length = int(self.headers['Content-Length'])
post_data = self.rfile.read(content_length)
payload = json.loads(post_data)

# THE FIX: Strip 'seed' (and other penalty params) that Gemini rejects
payload.pop('seed', None)
payload.pop('presence_penalty', None)
payload.pop('frequency_penalty', None)

# Forward the sanitized request directly to Google
url = "https://generativelanguage.googleapis.com/v1beta/openai/chat/completions"
req = urllib.request.Request(
url,
data=json.dumps(payload).encode('utf-8'),
headers={
'Content-Type': 'application/json',
'Authorization': f'Bearer {api_key}'
}
)
try:
with urllib.request.urlopen(req) as response:
self.send_response(response.getcode())
self.send_header('Content-Type', 'application/json')
self.end_headers()
self.wfile.write(response.read())
except urllib.error.HTTPError as e:
self.send_response(e.code)
self.end_headers()
self.wfile.write(e.read())

def start_proxy():
server = HTTPServer(('localhost', 8080), GeminiProxyHandler)
server.serve_forever()

# Start the inline proxy in a daemon background thread
threading.Thread(target=start_proxy, daemon=True).start()




def run_raw_logic_prover():
print("Initializing Phase 1: Raw Logic Prover...")
print("Booting lm-eval with local/OpenAI-compatible syntax targeting Gemini...")

try:
# 2. Execute the blocking evaluation call
# We use the 'local-chat-completions' model type and point the base_url
# at the local proxy, which forwards to Google's OpenAI-compatible endpoint.
results = lm_eval.simple_evaluate(
model="local-chat-completions",
model_args="model=gemini-flash-latest,base_url=http://localhost:8080/v1",
tasks=["gsm8k"],
limit=5,
log_samples=False,
apply_chat_template=True
)

print("\n--- Evaluation Successful ---")

# 3. Dump the raw JSON output to map the UI extraction
print("\n--- Raw JSON Payload ---")
print(json.dumps(results['results'], indent=2))

except Exception as e:
print(f"\n[ERROR] Evaluation failed: {e}")

if __name__ == "__main__":
run_raw_logic_prover()
Binary file added poc_ai_eval_framework/demo.gif
24 changes: 24 additions & 0 deletions poc_ai_eval_framework/frontend/.gitignore
@@ -0,0 +1,24 @@
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
lerna-debug.log*

node_modules
dist
dist-ssr
*.local

# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
.DS_Store
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?