PR Description
Hi team,
Following up on my initial idea submission (#1136), where a PoC was requested,
I created a decoupled FastAPI + React pipeline that runs lm-eval safely in background threads.
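The background-thread idea can be sketched roughly like this. This is a minimal illustration, not the PoC's actual code: `run_eval` is a hypothetical stand-in for the real lm-eval invocation, and `results` is a placeholder for whatever job store the PoC uses.

```python
# Sketch: run a long evaluation off the request thread so the FastAPI
# endpoint can return immediately. run_eval() is a hypothetical stand-in
# for the real lm-eval call; results is an illustrative job store.
import threading

results = {}

def start_eval(job_id: str, run_eval, task: str) -> threading.Thread:
    """Launch the evaluation in a daemon thread and record its result."""
    def worker():
        results[job_id] = run_eval(task)
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t
```

The endpoint would return the `job_id` right away and the frontend would poll (or, as below, stream) for progress.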
The coolest part: I built a proxy middleware layer that intercepts lm-eval payloads and sanitizes them. This completely solves the vendor-specific schema crashes we see with strict APIs like Gemini and Groq (e.g., Gemini instantly throws a 400 Bad Request if it sees a `seed` parameter). This proves we can make the tool truly vendor-neutral!
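The sanitizing step could look something like the sketch below. The per-vendor blocklists here are illustrative assumptions (only the Gemini/`seed` case is from the description above), not the PoC's actual lists:

```python
# Sketch of the payload-sanitizing proxy idea: strip request-body keys that
# a given provider's strict schema rejects. The blocklists are illustrative;
# only Gemini rejecting `seed` with a 400 is taken from the PR description.
UNSUPPORTED_PARAMS = {
    "gemini": {"seed"},          # Gemini returns 400 Bad Request on `seed`
    "groq": {"logit_bias"},      # hypothetical example entry
}

def sanitize_payload(payload: dict, vendor: str) -> dict:
    """Return a copy of the lm-eval request body with blocked keys removed."""
    blocked = UNSUPPORTED_PARAMS.get(vendor, set())
    return {k: v for k, v in payload.items() if k not in blocked}
```

Keeping the mapping in one table is what makes the layer vendor-neutral: adding a new strict API is just a new entry, with no changes to lm-eval itself.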
Piped real-time execution logs from Python directly to the frontend using Server-Sent Events (SSE).
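The SSE piping can be sketched as a generator that drains a log queue and emits `data:` frames; in the real app this generator would be wrapped in a FastAPI `StreamingResponse` with `media_type="text/event-stream"`. The queue/sentinel wiring here is an assumption for illustration:

```python
# Sketch: format background-thread log lines as Server-Sent Events.
# Assumes the lm-eval worker pushes lines into a queue.Queue and enqueues
# None when finished (an illustrative convention, not the PoC's actual one).
import queue

def sse_event(data: str) -> str:
    """Format one log line as an SSE 'data:' frame (blank line terminates it)."""
    return f"data: {data}\n\n"

def stream_logs(log_queue: queue.Queue):
    """Yield SSE frames until the worker signals completion with None."""
    while True:
        line = log_queue.get()
        if line is None:  # sentinel: evaluation finished
            break
        yield sse_event(line)
```

On the React side, a plain `EventSource` pointed at this endpoint receives each frame as a `message` event, which is what makes the live log view possible without WebSockets.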
PoC Repository: https://github.com/Spark960/ai-eval
Demo: A GIF demonstration of the live evaluation pipeline streaming via SSE is available in the PoC README. I've also attached it below for your reference.
This PoC targets cloud APIs like Gemini and Groq, but the same proxy middleware layer can also be used to test local models. So, at the end of all this, we would be able to test local models, cloud models, and agents too (through LightEval). This is turning out to be a very interesting project. I'm currently working on the proposal too and will submit it soon.
Related Issues
Checklist
- [ ] I have pulled the latest `main` branch before making this PR
- [ ] I am on the latest Flutter version (run `flutter upgrade` and verify)
- [ ] I have run the tests (`flutter test`) and all tests are passing

Added/updated tests?
We encourage you to add relevant test cases.
OS on which you have developed and tested the feature?