87 changes: 87 additions & 0 deletions poc_ai_eval_framework/README.md
@@ -0,0 +1,87 @@

# API Dash Eval PoC

A local-first, end-to-end web interface for running LLM benchmarks via the EleutherAI `lm-eval` harness, built as a proof of concept for the AI eval project. It targets cloud APIs such as Gemini and Groq models, but the same proxy pattern can be reused for local model testing too.

![App Demo](./demo.gif)

## How It Works (The Architecture)

The core challenge of this PoC is that `lm-eval` expects every API to behave exactly like OpenAI's. In reality, different providers enforce strict and idiosyncratic schema validation.

For example:
* **Google Gemini** instantly throws a `400 Bad Request` if it receives OpenAI-specific parameters like `seed`.
* **Groq** crashes if it receives unexpected `type` properties inside the `messages` array.

### The Proxy Middleware Solution
Instead of running evaluations directly from the React UI, the frontend triggers an HTTP POST to a decoupled FastAPI backend. This backend runs `lm-eval` safely in a background thread.

To fix the schema crashes, we built a **Native Middleware Pattern**:
1. `lm-eval` is instructed to send its API requests to our local FastAPI proxy routes (e.g., `http://localhost:8000/proxy/v1/...`) instead of the cloud provider.
2. The proxy intercepts the payload, strips out the vendor-incompatible parameters (like removing `seed` for Gemini, or popping `type` for Groq).
3. It seamlessly forwards the sanitized request to the actual provider, gets the response, and hands it back to `lm-eval`.
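Step 2 is plain dictionary surgery on the JSON payload before it is forwarded. A minimal sketch of the two sanitizers (the function names are illustrative; the actual logic lives in the proxy routes in `backend/main.py`):

```python
def sanitize_for_gemini(payload: dict) -> dict:
    # Gemini's OpenAI-compatible endpoint rejects these OpenAI-only parameters.
    for key in ("seed", "presence_penalty", "frequency_penalty"):
        payload.pop(key, None)
    return payload

def sanitize_for_groq(payload: dict) -> dict:
    # Groq rejects a 'type' property inside message objects; also flatten
    # the list-of-parts "vision" content format down to a plain string.
    for msg in payload.get("messages", []):
        msg.pop("type", None)
        if isinstance(msg.get("content"), list):
            msg["content"] = " ".join(
                part.get("text", "")
                for part in msg["content"]
                if part.get("type") == "text"
            )
    # Groq's chat completions endpoint also doesn't accept logprobs yet.
    payload.pop("logprobs", None)
    payload.pop("top_logprobs", None)
    return payload
```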

### Real-Time Log Streaming
To get a live terminal feel in the browser, we attach a custom `logging.Handler` to Python's root logger inside the background thread. It intercepts log records as `lm-eval` emits them and streams them to the React frontend in real time using Server-Sent Events (SSE).
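The interception itself is just the standard `logging.Handler` extension point. A simplified sketch, using a plain list where the real backend pushes into an `asyncio.Queue`:

```python
import logging

class CaptureHandler(logging.Handler):
    """Collects formatted log lines.

    The real backend's handler forwards each line into an asyncio.Queue
    instead of a list, so the SSE endpoint can await new messages.
    """
    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))

logger = logging.getLogger("demo")
handler = CaptureHandler()
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Running gsm8k...")

# Detach the handler afterwards so it doesn't leak across runs.
logger.removeHandler(handler)
```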

## Key Features
* **Zero-File I/O:** Evaluations run entirely in-memory using background threads, requiring no temp file generation.
* **Live Log Streaming:** Real-time execution logs piped directly to a custom web terminal via SSE.
* **Multiple Vendors:** Demonstrated compatibility with strict APIs (Gemini, Groq) via custom payload sanitization routes.
* **Modern Stack:** FastAPI backend + Vite/React (TypeScript, SWC, Tailwind v4) frontend.
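The zero-file-I/O claim rests on one idiom: the eval worker runs in a plain thread, and every message crosses back into the async world via `asyncio.run_coroutine_threadsafe`. A stripped-down sketch of that bridge (the `[EVAL_DONE]` sentinel mirrors the one in `backend/eval_runner.py`; the `worker`/`consume` names are illustrative):

```python
import asyncio
import threading

def worker(queue: asyncio.Queue, loop: asyncio.AbstractEventLoop):
    # Synchronous thread: hand each message to the event loop thread-safely.
    for line in ["step 1", "step 2", "[EVAL_DONE]"]:
        asyncio.run_coroutine_threadsafe(queue.put(line), loop)

async def consume() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    loop = asyncio.get_running_loop()
    threading.Thread(target=worker, args=(queue, loop), daemon=True).start()
    logs = []
    while True:
        msg = await queue.get()
        if msg == "[EVAL_DONE]":  # completion sentinel
            return logs
        logs.append(msg)

logs = asyncio.run(consume())
```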

---
The steps below cover the standard repository installation; use your own preferred workflow for testing if you like.
## One-Command Setup

**Prerequisites:** Ensure you have **Python 3.10+** and **Node.js 18+** installed on your machine.

**Note:** The first boot might take 2-5 minutes as it downloads `lm-eval` and the Node modules.

Clone the repository and run the startup script in your terminal (like the VS Code integrated terminal). This will automatically create the Python virtual environment, install all dependencies, and boot both servers.

**For Windows:**
```cmd
.\start.bat
```

**For Mac/Linux:**
```bash
chmod +x start.sh
./start.sh
```

### Manual Setup (Fallback)
If the automated scripts fail due to strict execution policies on your machine, you can start the servers manually using two terminals.

**Terminal 1: Backend**
```bash
cd backend
python -m venv venv
# On Windows use: venv\Scripts\activate
# On Mac/Linux use: source venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --reload
```

**Terminal 2: Frontend**
```bash
cd frontend
npm install
npm run dev
```
Open http://localhost:5173 in your browser once both are running.

## How to Use & Concrete Results

1. Open your browser to http://localhost:5173.
2. Enter your own API key. (Note: this PoC is verified specifically for Gemini and Groq keys; standard OpenAI keys may fail until a dedicated OpenAI passthrough route is added.)
3. Enter the corresponding model name (e.g., `gemini-flash-latest` or `llama-3.3-70b-versatile`).
4. Enter your desired task (e.g., `gsm8k`) and sample limit.
5. Click **Run Evaluation** and watch the logs stream in real time.

**Test results** from local runs with this framework:

* Gemini 3.0 Flash on GSM8K (5 samples): 0.2000 (20%) exact match
* Llama 3.3 70B on GSM8K (5 samples): 1.0000 (100%) exact match
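You can also drive the backend without the UI. The stream at `/api/stream/{run_id}` emits standard SSE `data:` frames whose JSON payload carries one of three keys (`log`, `done`, or `error`). A small sketch of decoding one frame (the `parse_sse_line` helper is hypothetical, not part of the codebase):

```python
import json

def parse_sse_line(line: str):
    """Decode one 'data: {...}' SSE frame from /api/stream/{run_id}.

    Returns the parsed JSON object, or None for non-data lines
    (comments, keep-alives, blank separators).
    """
    if not line.startswith("data: "):
        return None
    return json.loads(line[len("data: "):])

event = parse_sse_line('data: {"log": "Requesting API..."}')
```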
2 changes: 2 additions & 0 deletions poc_ai_eval_framework/backend/.env
@@ -0,0 +1,2 @@
# Not necessary for testing right now; the code sets the key via os.environ directly instead.
GEMINI_API_KEY=
57 changes: 57 additions & 0 deletions poc_ai_eval_framework/backend/eval_runner.py
@@ -0,0 +1,57 @@
import lm_eval
import logging
import asyncio
import os

class QueueLogHandler(logging.Handler):
def __init__(self, queue: asyncio.Queue, loop: asyncio.AbstractEventLoop):
super().__init__()
self.queue = queue
self.loop = loop

def emit(self, record):
msg = self.format(record)
# Safely push the log into the async queue from this synchronous thread
asyncio.run_coroutine_threadsafe(self.queue.put(msg), self.loop)

def run_evaluation_thread(run_id: str, model_name: str, api_key: str, queue: asyncio.Queue, loop: asyncio.AbstractEventLoop, results_store: dict):
# 1. Attach our custom handler to the root logger so we catch everything
logger = logging.getLogger()
handler = QueueLogHandler(queue, loop)
handler.setFormatter(logging.Formatter('%(message)s'))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

try:
os.environ["OPENAI_API_KEY"] = api_key

# 2. For Gemini models, point base_url at our FastAPI proxy route so it can strip the 'seed' parameter that Gemini rejects.
if "gemini" in model_name.lower():
# Route through our Gemini proxy
args_string = f"model={model_name},base_url=http://localhost:8000/proxy/v1/chat/completions"
elif "llama" in model_name.lower() or "mixtral" in model_name.lower() or "gemma" in model_name.lower():
# THE FIX: Route through our new Groq proxy to sanitize the message array
args_string = f"model={model_name},base_url=http://localhost:8000/proxy/groq/v1/chat/completions"
else:
# Default behavior: Let lm-eval talk straight to OpenAI
args_string = f"model={model_name}"

results = lm_eval.simple_evaluate(
model="local-chat-completions",
model_args=args_string,
tasks=["gsm8k"],
limit=5,
log_samples=False,
apply_chat_template=True
)

# 3. Store the final results and signal completion
results_store[run_id] = {"status": "completed", "data": results['results']}
asyncio.run_coroutine_threadsafe(queue.put("[EVAL_DONE]"), loop)

except Exception as e:
results_store[run_id] = {"status": "error", "error": str(e)}
asyncio.run_coroutine_threadsafe(queue.put(f"[EVAL_ERROR] {str(e)}"), loop)
finally:
# Clean up the handler so it doesn't leak across different runs
logger.removeHandler(handler)
117 changes: 117 additions & 0 deletions poc_ai_eval_framework/backend/main.py
@@ -0,0 +1,117 @@
from fastapi import FastAPI, Request, Response
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse
import uuid
import asyncio
import httpx
import json
import threading
from eval_runner import run_evaluation_thread

app = FastAPI()

# Enable CORS for the Vite frontend (port 5173)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)

# In-memory state management
queues = {}
results_store = {}

@app.post("/api/evaluate")
async def start_eval(request: Request):
data = await request.json()
run_id = str(uuid.uuid4())

queues[run_id] = asyncio.Queue()
results_store[run_id] = {"status": "running"}

# Spawn the background thread
loop = asyncio.get_running_loop()
threading.Thread(
target=run_evaluation_thread,
args=(run_id, data['model'], data['api_key'], queues[run_id], loop, results_store),
daemon=True
).start()

return {"run_id": run_id}

@app.get("/api/stream/{run_id}")
async def stream_logs(run_id: str):
async def event_generator():
queue = queues.get(run_id)
if not queue:
yield f"data: {json.dumps({'error': 'Invalid run_id'})}\n\n"
return

while True:
msg = await queue.get()
if msg == "[EVAL_DONE]":
final_data = results_store.get(run_id, {})
yield f"data: {json.dumps({'done': True, 'results': final_data.get('data')})}\n\n"
break
elif msg.startswith("[EVAL_ERROR]"):
yield f"data: {json.dumps({'error': msg})}\n\n"
break
else:
yield f"data: {json.dumps({'log': msg})}\n\n"

return StreamingResponse(event_generator(), media_type="text/event-stream")

#this is to strip the 'seed' parameter, this is for gemini, this can all be moved to a separate file in the future.
@app.post("/proxy/v1/chat/completions")
async def gemini_proxy(request: Request):
payload = await request.json()

payload.pop('seed', None)
payload.pop('presence_penalty', None)
payload.pop('frequency_penalty', None)

auth_header = request.headers.get('Authorization')
url = "https://generativelanguage.googleapis.com/v1beta/openai/chat/completions"

async with httpx.AsyncClient() as client:
resp = await client.post(
url,
json=payload,
headers={'Authorization': auth_header},
timeout=60.0
)
return Response(content=resp.content, status_code=resp.status_code, media_type="application/json")

#this is the groq adapter, to remove type from messages array as groq doesn't accept it
@app.post("/proxy/groq/v1/chat/completions")
async def groq_proxy(request: Request):
payload = await request.json()

# 1. Groq rejects the 'type' property inside message objects
if 'messages' in payload:
for msg in payload['messages']:
msg.pop('type', None) # Strip the unsupported 'type' key

# If lm-eval accidentally sends the new vision format (list of dicts),
# flatten it down to a standard string for Groq.
if isinstance(msg.get('content'), list):
text_content = " ".join([item.get('text', '') for item in msg['content'] if item.get('type') == 'text'])
msg['content'] = text_content

# 2. Groq also doesn't support logprobs yet on chat completions
payload.pop('logprobs', None)
payload.pop('top_logprobs', None)

auth_header = request.headers.get('Authorization')
url = "https://api.groq.com/openai/v1/chat/completions"

async with httpx.AsyncClient() as client:
resp = await client.post(
url,
json=payload,
headers={'Authorization': auth_header},
timeout=60.0
)
return Response(content=resp.content, status_code=resp.status_code, media_type="application/json")
Binary file added poc_ai_eval_framework/backend/requirements.txt
89 changes: 89 additions & 0 deletions poc_ai_eval_framework/backend/test.py
@@ -0,0 +1,89 @@
from http.server import BaseHTTPRequestHandler, HTTPServer
import json
import os
import threading
from dotenv import load_dotenv
import lm_eval
import urllib.request
import urllib.error

load_dotenv()

# 1. Load the Gemini API key from .env.
# We set it as OPENAI_API_KEY because we are utilizing the OpenAI-compatible wrapper.
# Ensure GEMINI_API_KEY is set in backend/.env before running.
os.environ["OPENAI_API_KEY"] = os.getenv("GEMINI_API_KEY")
api_key = os.getenv("OPENAI_API_KEY")

class GeminiProxyHandler(BaseHTTPRequestHandler):
def log_message(self, format, *args):
pass # Suppress proxy logs to keep your terminal clean

def do_POST(self):
content_length = int(self.headers['Content-Length'])
post_data = self.rfile.read(content_length)
payload = json.loads(post_data)

# THE FIX: Strip 'seed' (and other penalty params) that Gemini rejects
payload.pop('seed', None)
payload.pop('presence_penalty', None)
payload.pop('frequency_penalty', None)

# Forward the sanitized request directly to Google
url = "https://generativelanguage.googleapis.com/v1beta/openai/chat/completions"
req = urllib.request.Request(
url,
data=json.dumps(payload).encode('utf-8'),
headers={
'Content-Type': 'application/json',
'Authorization': f'Bearer {api_key}'
}
)
try:
with urllib.request.urlopen(req) as response:
self.send_response(response.getcode())
self.send_header('Content-Type', 'application/json')
self.end_headers()
self.wfile.write(response.read())
except urllib.error.HTTPError as e:
self.send_response(e.code)
self.end_headers()
self.wfile.write(e.read())

def start_proxy():
server = HTTPServer(('localhost', 8080), GeminiProxyHandler)
server.serve_forever()

# Start the inline proxy in a daemon background thread
threading.Thread(target=start_proxy, daemon=True).start()




def run_raw_logic_prover():
print("Initializing Phase 1: Raw Logic Prover...")
print("Booting lm-eval with local/OpenAI-compatible syntax targeting Gemini...")

try:
# 2. Execute the blocking evaluation call
# We use the 'local-chat-completions' model type and point the base_url
# at the local proxy, which forwards to Google's OpenAI-compatible endpoint.
results = lm_eval.simple_evaluate(
model="local-chat-completions",
model_args="model=gemini-flash-latest,base_url=http://localhost:8080/v1",
tasks=["gsm8k"],
limit=5,
log_samples=False,
apply_chat_template=True
)

print("\n--- Evaluation Successful ---")

# 3. Dump the raw JSON output to map the UI extraction
print("\n--- Raw JSON Payload ---")
print(json.dumps(results['results'], indent=2))

except Exception as e:
print(f"\n[ERROR] Evaluation failed: {e}")

if __name__ == "__main__":
run_raw_logic_prover()
Binary file added poc_ai_eval_framework/demo.gif
24 changes: 24 additions & 0 deletions poc_ai_eval_framework/frontend/.gitignore
@@ -0,0 +1,24 @@
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
lerna-debug.log*

node_modules
dist
dist-ssr
*.local

# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
.DS_Store
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?