Add tests #12

Conversation
Walkthrough

The updates remove the

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLIInput
    participant ScenarioEvaluationService
    participant OutputFile
    User->>CLIInput: Provide input (auth, scenarios)
    CLIInput->>CLIInput: Validate credentials and parse scenarios
    CLIInput->>ScenarioEvaluationService: Pass validated scenarios
    ScenarioEvaluationService->>ScenarioEvaluationService: Evaluate scenarios
    ScenarioEvaluationService->>ScenarioEvaluationService: _dump_results()
    ScenarioEvaluationService->>OutputFile: Write JSON results
```
Actionable comments posted: 1
🧹 Nitpick comments (1)
rogue/tests/models/test_evaluation_result.py (1)
12-36: Consider using pytest fixtures for better test organization. The test data is well-structured, but using pytest fixtures would improve maintainability and follow best practices.
```diff
-class TestEvaluationResults:
-    scenario_1 = Scenario(scenario="Scenario 1")
-    scenario_2 = Scenario(scenario="Scenario 2")
-
-    conversation_1_passed = ConversationEvaluation(
-        messages=ChatHistory(messages=[Message(role="user", content="message 1")]),
-        passed=True,
-        reason="reason",
-    )
-    conversation_2_passed = ConversationEvaluation(
-        messages=ChatHistory(messages=[Message(role="user", content="message 2")]),
-        passed=True,
-        reason="reason",
-    )
-
-    conversation_1_failed = ConversationEvaluation(
-        messages=ChatHistory(messages=[Message(role="user", content="message 1")]),
-        passed=False,
-        reason="reason",
-    )
-    conversation_2_failed = ConversationEvaluation(
-        messages=ChatHistory(messages=[Message(role="user", content="message 2")]),
-        passed=False,
-        reason="reason",
-    )
+@pytest.fixture
+def scenario_1():
+    return Scenario(scenario="Scenario 1")
+
+@pytest.fixture
+def scenario_2():
+    return Scenario(scenario="Scenario 2")
+
+@pytest.fixture
+def conversation_1_passed():
+    return ConversationEvaluation(
+        messages=ChatHistory(messages=[Message(role="user", content="message 1")]),
+        passed=True,
+        reason="reason",
+    )
+
+@pytest.fixture
+def conversation_2_passed():
+    return ConversationEvaluation(
+        messages=ChatHistory(messages=[Message(role="user", content="message 2")]),
+        passed=True,
+        reason="reason",
+    )
+
+@pytest.fixture
+def conversation_1_failed():
+    return ConversationEvaluation(
+        messages=ChatHistory(messages=[Message(role="user", content="message 1")]),
+        passed=False,
+        reason="reason",
+    )
+
+@pytest.fixture
+def conversation_2_failed():
+    return ConversationEvaluation(
+        messages=ChatHistory(messages=[Message(role="user", content="message 2")]),
+        passed=False,
+        reason="reason",
+    )
+
+class TestEvaluationResults:
```
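If the fixture-based layout is adopted, individual tests would request the fixtures by name instead of reading class attributes. A minimal usage sketch (the test name and assertions below are illustrative, not part of the suggested diff):

```python
# Illustrative only: how a test could consume the proposed fixtures.
def test_add_single_passed_result(scenario_1, conversation_1_passed):
    result = EvaluationResult(
        scenario=scenario_1,
        conversations=[conversation_1_passed],
        passed=conversation_1_passed.passed,
    )
    results = EvaluationResults()
    results.add_result(result)
    assert results == EvaluationResults(results=[result])
```

Note that the existing `pytest.mark.parametrize` cases are built at class-definition time from the class attributes, so switching to fixtures would also mean constructing those cases inside the test body or using indirect parameterization.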
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite
📒 Files selected for processing (7)
- rogue/models/__init__.py (1 hunks)
- rogue/models/user_config.py (0 hunks)
- rogue/services/scenario_evaluation_service.py (1 hunks)
- rogue/tests/models/test_cli_input.py (1 hunks)
- rogue/tests/models/test_config.py (1 hunks)
- rogue/tests/models/test_evaluation_result.py (1 hunks)
- rogue/tests/models/test_scenario.py (1 hunks)
💤 Files with no reviewable changes (1)
- rogue/models/user_config.py
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/clean-code.mdc:0-0
Timestamp: 2025-07-01T11:52:43.736Z
Learning: Keep tests readable and maintainable
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/tailwind.mdc:0-0
Timestamp: 2025-07-01T11:55:18.381Z
Learning: Implement proper testing
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/clean-code.mdc:0-0
Timestamp: 2025-07-01T11:52:43.736Z
Learning: Write tests before fixing bugs
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/clean-code.mdc:0-0
Timestamp: 2025-07-01T11:52:43.736Z
Learning: Test edge cases and error conditions
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/python.mdc:0-0
Timestamp: 2025-07-01T11:54:30.386Z
Learning: Applies to tests/**/*.py : Implement proper fixtures
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/python.mdc:0-0
Timestamp: 2025-07-01T11:54:30.386Z
Learning: Place tests in `tests/` directory parallel to `src/`
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/self_improve.mdc:0-0
Timestamp: 2025-07-01T11:54:49.379Z
Learning: Update rules after major refactors
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/self_improve.mdc:0-0
Timestamp: 2025-07-01T11:54:49.379Z
Learning: Document breaking changes
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/dev_workflow.mdc:0-0
Timestamp: 2025-07-01T11:53:32.567Z
Learning: Applies to {.env,.cursor/mcp.json} : Non-API key settings (like model selections, MAX_TOKENS, LOG_LEVEL) are no longer configured via environment variables. Use the task-master models command or models MCP tool.
rogue/services/scenario_evaluation_service.py (1)
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/python.mdc:0-0
Timestamp: 2025-07-01T11:54:30.386Z
Learning: Applies to tests/**/*.py : Test all error scenarios
rogue/tests/models/test_evaluation_result.py (1)
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/python.mdc:0-0
Timestamp: 2025-07-01T11:54:30.386Z
Learning: Applies to tests/**/*.py : Test all error scenarios
rogue/tests/models/test_cli_input.py (3)
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/python.mdc:0-0
Timestamp: 2025-07-01T11:54:30.386Z
Learning: Applies to tests/**/*.py : Test all error scenarios
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/python.mdc:0-0
Timestamp: 2025-07-01T11:54:30.386Z
Learning: Applies to tests/**/*.py : Use proper mocking with pytest-mock
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/python.mdc:0-0
Timestamp: 2025-07-01T11:54:30.386Z
Learning: Applies to tests/**/*.py : Implement proper fixtures
rogue/tests/models/test_scenario.py (2)
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/python.mdc:0-0
Timestamp: 2025-07-01T11:54:30.386Z
Learning: Applies to tests/**/*.py : Test all error scenarios
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/python.mdc:0-0
Timestamp: 2025-07-01T11:54:30.386Z
Learning: Applies to tests/**/*.py : Implement proper fixtures
🧬 Code Graph Analysis (4)
rogue/tests/models/test_evaluation_result.py (3)
- rogue/models/chat_history.py (2): ChatHistory (9-13), Message (4-6)
- rogue/models/evaluation_result.py (3): EvaluationResults (21-38), ConversationEvaluation (9-12), add_result (24-30)
- rogue/models/scenario.py (1): Scenario (12-50)

rogue/tests/models/test_cli_input.py (2)
- rogue/models/cli_input.py (2): CLIInput (9-46), get_scenarios_from_file (25-35)
- rogue/models/config.py (1): AuthType (7-11)

rogue/tests/models/test_scenario.py (1)
- rogue/models/scenario.py (6): Scenario (12-50), ScenarioType (7-9), Scenarios (53-69), get_scenarios_by_type (56-63), get_policy_scenarios (65-66), get_prompt_injection_scenarios (68-69)

rogue/tests/models/test_config.py (1)
- rogue/models/config.py (3): AgentConfig (14-45), AuthType (7-11), get_auth_header (48-60)
🔇 Additional comments (14)
rogue/models/__init__.py (1)
2-5: LGTM! Import changes align with model restructuring. The new imports for the `cli_input`, `config`, `evaluation_result`, and `prompt_injection` modules are appropriate and follow Python conventions. This properly exposes the new model structure.

rogue/services/scenario_evaluation_service.py (2)
131-135: Good refactoring to improve code organization. Extracting the result dumping logic into a private method `_dump_results` improves readability and follows the single responsibility principle. The method properly encapsulates the JSON serialization with consistent formatting.

144-144: Proper method call placement. The `_dump_results()` call is correctly placed after all scenario evaluations are complete and before yielding the final results.
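For readers unfamiliar with the extract-method shape being praised here, a minimal self-contained sketch follows. Only `_dump_results` and the JSON-with-consistent-formatting behavior come from the review; every other name (`_results`, `_output_path`, `evaluate_all`, the placeholder evaluation) is an assumption for illustration, not the service's real API.

```python
import json
from pathlib import Path


class ScenarioEvaluationService:  # simplified stand-in for the real service
    def __init__(self, scenarios: list[str], output_path: Path) -> None:
        self._scenarios = scenarios          # assumed attribute name
        self._output_path = output_path      # assumed attribute name
        self._results: list[dict] = []       # assumed in-memory results store

    def _evaluate_one(self, scenario: str) -> dict:
        # Placeholder; the real service drives an agent conversation per scenario.
        return {"scenario": scenario, "passed": True}

    def _dump_results(self) -> None:
        # Encapsulates JSON serialization with consistent formatting.
        self._output_path.write_text(json.dumps(self._results, indent=2))

    def evaluate_all(self) -> list[dict]:
        for scenario in self._scenarios:
            self._results.append(self._evaluate_one(scenario))
        self._dump_results()  # invoked once, after all scenario evaluations complete
        return self._results
```

rogue/tests/models/test_cli_input.py (2)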
13-44: Excellent test coverage for authentication validation. The parameterized test comprehensively covers all authentication scenarios, including edge cases. The test structure follows best practices with proper error message matching and validation assertions.

46-96: Well-designed test for file parsing with proper mocking. The test effectively covers file existence, type validation, and content parsing scenarios. The use of a `CLIInputWithMockScenarios` subclass to avoid pydantic validation on the mock is a clever solution. The parameterization covers all relevant error conditions.
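The subclass trick is a general way to hand a `MagicMock` to a pydantic model without tripping field validation. A generic sketch of the pattern, using made-up model and field names rather than the repo's actual `CLIInput`:

```python
from typing import Any
from unittest.mock import MagicMock

from pydantic import BaseModel


class CLIInputLike(BaseModel):  # stand-in model; not the real CLIInput
    scenarios_file: str
    scenarios: list[str] = []


class CLIInputLikeWithMockScenarios(CLIInputLike):
    # Widen the field annotation so a MagicMock passes pydantic validation.
    scenarios: Any = None


def test_mock_scenarios_are_accepted():
    mock_scenarios = MagicMock()
    cli_input = CLIInputLikeWithMockScenarios(
        scenarios_file="scenarios.json",
        scenarios=mock_scenarios,
    )
    assert cli_input.scenarios is mock_scenarios
```

rogue/tests/models/test_config.py (2)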
11-44: Comprehensive authentication validation testing. The parameterized test thoroughly covers all authentication scenarios for `AgentConfig`. The test structure mirrors the CLI input tests, providing consistent validation patterns across the codebase.

47-62: Complete coverage of auth header generation. The test validates all authentication types and edge cases for the `get_auth_header` function. The parameterization ensures all header formats are correctly generated.
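For context, the parameterized style described above generally looks like the sketch below. The enum members, header names, and `get_auth_header` body here are invented for illustration; the actual `AuthType` and `get_auth_header` in `rogue/models/config.py` will differ.

```python
from enum import Enum
from typing import Optional

import pytest


class AuthType(str, Enum):  # hypothetical members, for illustration only
    NO_AUTH = "no_auth"
    API_KEY = "api_key"
    BEARER = "bearer"


def get_auth_header(auth_type: AuthType, credential: Optional[str]) -> dict:
    # Hypothetical implementation standing in for the real helper.
    if auth_type is AuthType.NO_AUTH or not credential:
        return {}
    if auth_type is AuthType.API_KEY:
        return {"x-api-key": credential}
    return {"Authorization": f"Bearer {credential}"}


@pytest.mark.parametrize(
    "auth_type, credential, expected",
    [
        (AuthType.NO_AUTH, None, {}),
        (AuthType.API_KEY, "secret", {"x-api-key": "secret"}),
        (AuthType.BEARER, "token", {"Authorization": "Bearer token"}),
    ],
)
def test_get_auth_header(auth_type, credential, expected):
    assert get_auth_header(auth_type, credential) == expected
```

rogue/tests/models/test_scenario.py (4)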
11-37: Thorough validation testing for scenario types. The parameterized test effectively validates the dataset requirement logic for different scenario types. The test correctly verifies that datasets are required for `PROMPT_INJECTION` scenarios but ignored for `POLICY` scenarios.

39-70: Comprehensive dataset sample size validation. The test covers all scenarios for dataset sample size validation, ensuring that the size is required when a dataset is provided and properly handled when the dataset is None.

73-104: Well-structured filtering test with good coverage. The parameterized test efficiently validates the scenario filtering logic with various combinations of scenario types and expected results.

106-136: Complete testing of convenience filtering methods. Both `test_get_policy_scenarios` and `test_get_prompt_injection_scenarios` properly validate the filtering convenience methods, ensuring they return the correct scenario types.
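The convenience methods exercised by these tests presumably delegate to the type filter. A hedged sketch of that relationship follows; the method and enum member names come from the review context, while the field names and implementations are assumptions.

```python
from enum import Enum
from typing import List

from pydantic import BaseModel


class ScenarioType(str, Enum):
    POLICY = "policy"
    PROMPT_INJECTION = "prompt_injection"


class Scenario(BaseModel):  # trimmed-down stand-in for the real model
    scenario: str
    scenario_type: ScenarioType = ScenarioType.POLICY  # assumed field name and default


class Scenarios(BaseModel):
    scenarios: List[Scenario] = []  # assumed field name

    def get_scenarios_by_type(self, scenario_type: ScenarioType) -> "Scenarios":
        return Scenarios(
            scenarios=[s for s in self.scenarios if s.scenario_type == scenario_type]
        )

    def get_policy_scenarios(self) -> "Scenarios":
        return self.get_scenarios_by_type(ScenarioType.POLICY)

    def get_prompt_injection_scenarios(self) -> "Scenarios":
        return self.get_scenarios_by_type(ScenarioType.PROMPT_INJECTION)
```

rogue/tests/models/test_evaluation_result.py (3)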
1-9: Imports look good and necessary. The imports are appropriate for testing the `EvaluationResults` model and its dependencies.

38-47: Helper method is clean and well-designed. The static method provides a clear way to create `EvaluationResult` instances for testing.

49-161: Comprehensive test coverage for the `add_result` method. The parameterized test covers all the important scenarios, including empty results, non-overlapping additions, and various scenario overlap cases with different passed-status combinations.
rogue/tests/models/test_evaluation_result.py

```python
class TestEvaluationResults:
    scenario_1 = Scenario(scenario="Scenario 1")
    scenario_2 = Scenario(scenario="Scenario 2")

    conversation_1_passed = ConversationEvaluation(
        messages=ChatHistory(messages=[Message(role="user", content="message 1")]),
        passed=True,
        reason="reason",
    )
    conversation_2_passed = ConversationEvaluation(
        messages=ChatHistory(messages=[Message(role="user", content="message 2")]),
        passed=True,
        reason="reason",
    )

    conversation_1_failed = ConversationEvaluation(
        messages=ChatHistory(messages=[Message(role="user", content="message 1")]),
        passed=False,
        reason="reason",
    )
    conversation_2_failed = ConversationEvaluation(
        messages=ChatHistory(messages=[Message(role="user", content="message 2")]),
        passed=False,
        reason="reason",
    )

    @staticmethod
    def get_evaluation_result(
        scenario: Scenario,
        conversation: ConversationEvaluation,
    ) -> EvaluationResult:
        return EvaluationResult(
            scenario=scenario,
            conversations=[conversation],
            passed=conversation.passed,
        )

    @pytest.mark.parametrize(
        "existing_results, new_result, expected_results",
        [
            # no overlap from empty results
            (
                EvaluationResults(),
                get_evaluation_result(scenario_1, conversation_1_passed),
                EvaluationResults(
                    results=[get_evaluation_result(scenario_1, conversation_1_passed)]
                ),
            ),
            # no overlap from non-empty results
            (
                EvaluationResults(
                    results=[get_evaluation_result(scenario_1, conversation_1_passed)]
                ),
                get_evaluation_result(scenario_2, conversation_1_failed),
                EvaluationResults(
                    results=[
                        get_evaluation_result(scenario_1, conversation_1_passed),
                        get_evaluation_result(scenario_2, conversation_1_failed),
                    ]
                ),
            ),
            # scenario overlap with passed unchanged True -> True
            (
                EvaluationResults(
                    results=[get_evaluation_result(scenario_1, conversation_1_passed)]
                ),
                get_evaluation_result(scenario_1, conversation_2_passed),
                EvaluationResults(
                    results=[
                        EvaluationResult(
                            scenario=scenario_1,
                            conversations=[
                                conversation_1_passed,
                                conversation_2_passed,
                            ],
                            passed=True,
                        ),
                    ]
                ),
            ),
            # scenario overlap with passed changed True -> False
            (
                EvaluationResults(
                    results=[get_evaluation_result(scenario_1, conversation_1_passed)]
                ),
                get_evaluation_result(scenario_1, conversation_2_failed),
                EvaluationResults(
                    results=[
                        EvaluationResult(
                            scenario=scenario_1,
                            conversations=[
                                conversation_1_passed,
                                conversation_2_failed,
                            ],
                            passed=False,
                        ),
                    ]
                ),
            ),
            # scenario overlap with passed unchanged False -> False (#1)
            (
                EvaluationResults(
                    results=[get_evaluation_result(scenario_1, conversation_1_failed)]
                ),
                get_evaluation_result(scenario_1, conversation_2_failed),
                EvaluationResults(
                    results=[
                        EvaluationResult(
                            scenario=scenario_1,
                            conversations=[
                                conversation_1_failed,
                                conversation_2_failed,
                            ],
                            passed=False,
                        ),
                    ]
                ),
            ),
            # scenario overlap with passed unchanged False -> False (#2)
            (
                EvaluationResults(
                    results=[get_evaluation_result(scenario_1, conversation_1_failed)]
                ),
                get_evaluation_result(
                    scenario_1,
                    conversation_2_passed,
                ),  # passed is ignored here because we already failed
                EvaluationResults(
                    results=[
                        EvaluationResult(
                            scenario=scenario_1,
                            conversations=[
                                conversation_1_failed,
                                conversation_2_passed,
                            ],
                            passed=False,
                        ),
                    ]
                ),
            ),
        ],
    )
    def test_add_result(
        self,
        existing_results: EvaluationResults,
        new_result: EvaluationResult,
        expected_results: EvaluationResults,
    ):
        existing_results.add_result(new_result)
        assert existing_results == expected_results
```
🛠️ Refactor suggestion
Expand test coverage to include other methods and edge cases.
The test file only covers the `add_result` method but doesn't test other methods like `combine` and `__bool__`. Consider adding tests for these methods and edge cases.
```python
def test_combine_method(self):
    """Test the combine method with various scenarios."""
    existing_results = EvaluationResults(
        results=[self.get_evaluation_result(self.scenario_1, self.conversation_1_passed)]
    )
    other_results = EvaluationResults(
        results=[self.get_evaluation_result(self.scenario_2, self.conversation_1_failed)]
    )
    existing_results.combine(other_results)
    assert len(existing_results.results) == 2
    assert existing_results.results[0].scenario == self.scenario_1
    assert existing_results.results[1].scenario == self.scenario_2

def test_combine_with_empty_results(self):
    """Test combining with empty results."""
    existing_results = EvaluationResults()
    other_results = EvaluationResults(
        results=[self.get_evaluation_result(self.scenario_1, self.conversation_1_passed)]
    )
    existing_results.combine(other_results)
    assert len(existing_results.results) == 1

def test_combine_with_none(self):
    """Test combining with None."""
    existing_results = EvaluationResults()
    existing_results.combine(None)
    assert len(existing_results.results) == 0

def test_bool_method(self):
    """Test the __bool__ method."""
    empty_results = EvaluationResults()
    assert not bool(empty_results)
    non_empty_results = EvaluationResults(
        results=[self.get_evaluation_result(self.scenario_1, self.conversation_1_passed)]
    )
    assert bool(non_empty_results)
```

🤖 Prompt for AI Agents
In rogue/tests/models/test_evaluation_result.py around lines 12 to 161, the
current tests only cover the add_result method. To improve coverage, add new
test methods for combine and __bool__ methods. Implement tests for combine with
normal, empty, and None inputs to verify merging behavior, and tests for
__bool__ to check truthiness of empty and non-empty EvaluationResults instances.
Add these as separate test functions within the TestEvaluationResults class.
Summary by CodeRabbit

- Bug Fixes
- Tests
- Chores