Switch to Gemini 2.5 Pro model and add evaluation tools #10
Conversation
Updated LLM model references in curriculum_gateway.py, gateway.py, and guardrails.py to use 'models/gemini-2.5-pro' for improved output quality. Added a comprehensive evaluation framework, experimentation guide, batch and API test scripts, realistic test data, and supporting files to enable systematic testing and prompt improvement for SEAL project endpoints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request updates the SEAL project to use the Gemini 2.5 Pro model and adds a comprehensive evaluation framework for testing and improving prompt outputs. The changes include model reference updates across three core files and the addition of extensive testing, evaluation, and data generation tools.
Key Changes:
- Updated LLM model from 'gemini-1.5-flash-002' to 'models/gemini-2.5-pro' in gateway files and guardrails
- Added comprehensive evaluation framework with automated and human evaluation capabilities
- Introduced realistic test data generation based on educational research patterns
- Created extensive documentation and experimentation guides
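The model swap amounts to replacing one string constant in each gateway file; a minimal sketch, assuming the gateways reference the model by name (the constant names here are hypothetical, not identifiers from the repository):

```python
# Illustrative sketch of the model-reference change described above;
# OLD_MODEL / NEW_MODEL are assumptions for illustration, not the
# repository's actual constant names.
OLD_MODEL = "gemini-1.5-flash-002"
NEW_MODEL = "models/gemini-2.5-pro"

def uses_new_model(model_ref: str) -> bool:
    """Return True when a reference uses the fully qualified 2.5 Pro name."""
    return model_ref == NEW_MODEL
```

Note the new reference carries the `models/` prefix while the old one did not, so a plain string comparison like the one above catches partially updated references.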
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 18 comments.
Summary per file:
| File | Description |
|---|---|
| app/llm/gateway.py | Updated Gemini model version to 2.5 Pro |
| app/llm/curriculum_gateway.py | Updated Gemini model version to 2.5 Pro |
| app/safety/guardrails.py | Updated Gemini model version to 2.5 Pro |
| test/evaluation_framework.py | Added automated evaluation framework with quality scoring |
| test/realistic_data_generator.py | Added realistic test data generation with educational patterns |
| test/human_evaluation_framework.py | Added HTML-based human evaluation interface |
| test/run_comprehensive_evaluation.py | Added comprehensive evaluation runner integrating all components |
| test/quick_start_evaluation.py | Added quick start tools and test case generation |
| test/manual_prompt_runner.py | Added manual prompt testing utility |
| test/batch_prompt_runner.py | Added batch processing for test cases |
| test/api_test_script.py | Added API endpoint testing script |
| test/quick_test_cases.json | Added sample test cases for EMT and curriculum |
| test/results/batch_results.csv | Added batch test results with sample responses |
| test/data/realistic_profiles.json | Added realistic class profile data |
| test/data/curriculum_test_cases.json | Added curriculum intervention test cases |
| test/EXPERIMENTATION_GUIDE.md | Added comprehensive experimentation guide |
| SEAL_EVALUATION_FRAMEWORK_SUMMARY.md | Added framework summary documentation |
```python
print("\n" + "=" * 50)

# Load test cases
test_cases = create_simple_test_cases()
```
**Copilot AI** · Dec 17, 2025
The function `create_simple_test_cases()` is called but never defined in this file, which will cause a `NameError` at runtime when `main()` is executed.
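One way to resolve this would be to define the missing function; a hypothetical minimal stub (the test-case fields are illustrative assumptions, not the SEAL project's schema):

```python
# Hypothetical stub for the undefined function; the field names below
# are illustrative assumptions, not taken from the repository.
def create_simple_test_cases():
    """Return a small list of test-case dicts for a quick smoke run."""
    return [
        {"id": "emt_basic", "endpoint": "emt", "num_students": 19},
        {"id": "curriculum_basic", "endpoint": "curriculum", "num_students": 19},
    ]
```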
| "num_students": 19, | ||
| "emt_scores": { | ||
| "EMT1": [ | ||
| 100, |
**Copilot AI** · Dec 17, 2025
In realistic_profiles.json, line 1088 shows a value of `100`, which should be written as `100.0` for consistency with the other float values in the EMT score arrays throughout the file.
Suggested change:
```diff
- 100,
+ 100.0,
```
```json
67.1,
94.1,
67.6,
100,
```
**Copilot AI** · Dec 17, 2025
The line shows a value of `100`, which should be written as `100.0` for consistency with the other float values in the EMT score arrays throughout the file.
Suggested change:
```diff
- 100,
+ 100.0,
```
```python
def create_test_cases(self) -> List[TestCase]:
    """Create comprehensive test cases for both prompt types"""
    test_cases = []
```
**Copilot AI** · Dec 17, 2025
The variable `test_cases` is assigned here but never used.
Suggested change:
```diff
- test_cases = []
```
```python
test_cases = evaluator.generate_realistic_test_data(args.num_classes)

# Step 2: Run automated evaluation
automated_results = evaluator.run_automated_evaluation(test_cases)
```
**Copilot AI** · Dec 17, 2025
The variable `automated_results` is assigned but never used.
Suggested change:
```diff
- automated_results = evaluator.run_automated_evaluation(test_cases)
+ evaluator.run_automated_evaluation(test_cases)
```
```python
Simplified script to get Mayur started with prompt testing
"""

import os
```
**Copilot AI** · Dec 17, 2025
The import of `os` is unused.
Suggested change:
```diff
- import os
```
```python
import random
import json
from typing import Dict, List, Any, Tuple
```
**Copilot AI** · Dec 17, 2025
The import of `Tuple` is unused.
Suggested change:
```diff
- from typing import Dict, List, Any, Tuple
+ from typing import Dict, List, Any
```
```python
from typing import Dict, List, Any, Tuple
from dataclasses import dataclass
from pathlib import Path
import numpy as np
```
**Copilot AI** · Dec 17, 2025
The import of `numpy` (as `np`) is unused.
Suggested change:
```diff
- import numpy as np
```
```python
Integrates automated testing, realistic data generation, and human evaluation
"""

import os
```
**Copilot AI** · Dec 17, 2025
The import of `os` is unused.
Suggested change:
```diff
- import os
```
```python
parser.add_argument("--out", default=None, help="Output CSV path (optional)")
args = parser.parse_args()

cases = json.load(open(args.input, "r", encoding="utf-8"))
```
**Copilot AI** · Dec 17, 2025
The file is opened but never closed; a context manager releases the handle even if parsing fails.
Suggested change:
```diff
- cases = json.load(open(args.input, "r", encoding="utf-8"))
+ with open(args.input, "r", encoding="utf-8") as f:
+     cases = json.load(f)
```
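The context-manager form guarantees the handle is closed as soon as the block exits, even when `json.load` raises. A small self-contained demonstration (the `load_cases` helper and the temp-file setup are illustrative, not the script's actual code):

```python
import json
import os
import tempfile

def load_cases(path):
    """Load JSON test cases, closing the file even if parsing fails."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Demonstrate with a throwaway temp file.
fd, tmp_path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w", encoding="utf-8") as tmp:
    json.dump([{"id": "case_1"}], tmp)

cases = load_cases(tmp_path)
print(cases)  # [{'id': 'case_1'}]
os.remove(tmp_path)
```

The bare `json.load(open(...))` form relies on the garbage collector to close the handle, which CPython happens to do promptly but other runtimes do not guarantee.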