[LEADS-349] Calculate aggregated score from key metrics #227
xmican10 wants to merge 1 commit into lightspeed-core:main
Conversation
Walkthrough
This pull request introduces a `quality_score` configuration section that aggregates selected evaluation metrics into a single weighted quality score and writes it out as a dedicated `quality_report.json` alongside the standard reports.
Sequence Diagram(s)
```mermaid
sequenceDiagram
actor User
participant ConfigLoader
participant SystemConfig
participant OutputHandler
participant QualityReport
participant Storage
User->>ConfigLoader: Load system.yaml with quality_score
ConfigLoader->>ConfigLoader: Parse quality_score section
ConfigLoader->>SystemConfig: Create config with QualityScoreConfig
SystemConfig->>SystemConfig: Validate metrics exist in metadata
ConfigLoader-->>User: Return validated config
User->>OutputHandler: Generate reports
OutputHandler->>OutputHandler: Check system_config.quality_score
alt quality_score configured
OutputHandler->>QualityReport: create_report(by_metric, metrics_list)
QualityReport->>QualityReport: Partition into quality/extra metrics
QualityReport->>QualityReport: Calculate weighted average (mean × weight)
QualityReport-->>OutputHandler: Return QualityReport
OutputHandler->>OutputHandler: _generate_quality_score_report()
OutputHandler->>Storage: Write quality_report.json (timestamp, aggregated_score, warnings)
else no quality_score
OutputHandler->>Storage: Generate standard reports only
end
OutputHandler-->>User: Reports generated
```
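For intuition, here is a minimal sketch of the weighted-average step from the diagram, under one plausible reading of "mean × weight" (the `MetricStats` type and the function name are illustrative, not the PR's actual API):

```python
from dataclasses import dataclass

@dataclass
class MetricStats:
    mean: float    # mean score for this metric across evaluated samples
    weight: float  # weight configured under quality_score.metrics

def aggregate_quality_score(stats: list[MetricStats]) -> float:
    """Weighted average of metric means: sum(mean * weight) / sum(weight)."""
    total_weight = sum(s.weight for s in stats)
    if total_weight == 0:
        return 0.0  # no usable metrics; real code may warn instead
    return sum(s.mean * s.weight for s in stats) / total_weight

print(aggregate_quality_score([MetricStats(0.9, 2.0), MetricStats(0.6, 1.0)]))  # 0.8
```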
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 5 passed checks
Warning: Review ran into problems.
🔥 Problems: Git: Failed to clone repository.
Force-pushed from dd37457 to 53a55e5 (Compare)
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_evaluation/core/models/quality.py`:
- Around line 116-118: The loop currently skips when score_stats is None without
any signal; update the block where "if score_stats is None: continue" to emit a
warning before continuing so missing metric data is visible. Use the module
logger (e.g., logger or logging.getLogger(__name__)) and include identifying
info about the metric (the variable that refers to the quality metric — e.g.,
metric.name, metric.key, or metric_id used in the loop) and any relevant context
(evaluation id or config name) in the warning message, then continue unchanged;
keep the rest of the aggregation logic intact.
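A minimal runnable sketch of that fix, assuming the loop iterates over a metric-id → stats mapping (names like `collect_usable_stats` and `by_metric` are illustrative, not the PR's actual code):

```python
import logging

logger = logging.getLogger(__name__)

def collect_usable_stats(by_metric: dict[str, dict | None]) -> dict[str, dict]:
    """Keep metrics that have stats; warn loudly about the ones that don't."""
    usable = {}
    for metric_id, score_stats in by_metric.items():
        if score_stats is None:
            # Emit a visible signal instead of silently skipping missing metric data.
            logger.warning(
                "No score statistics for quality metric %s; excluding it from the aggregated score",
                metric_id,
            )
            continue
        usable[metric_id] = score_stats
    return usable
```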
In `@src/lightspeed_evaluation/core/system/loader.py`:
- Around line 131-141: The code treats an explicitly empty quality_score: {} as
absent because it uses truthiness checks; update the checks to use explicit None
checks so empty dicts still trigger validation and defaults processing—replace
both "if quality_score_data:" and the ternary constructing
QualityScoreConfig(**quality_score_data) if quality_score_data else None with
checks like "if quality_score_data is not None:" and construct
QualityScoreConfig(**quality_score_data) when quality_score_data is not None,
ensuring _process_quality_score_defaults(quality_score_data,
turn_level_metadata, conversation_level_metadata) and QualityScoreConfig(...)
run for empty dicts as well.
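The truthiness-vs-None distinction at the heart of this comment, as a runnable sketch (both helper names are hypothetical):

```python
def quality_score_enabled_truthy(data: dict | None) -> bool:
    return bool(data)        # bug: `quality_score: {}` -> False, section silently dropped

def quality_score_enabled_explicit(data: dict | None) -> bool:
    return data is not None  # fix: {} -> True, defaults and validation still run

assert quality_score_enabled_truthy({}) is False
assert quality_score_enabled_explicit({}) is True
assert quality_score_enabled_explicit(None) is False  # key truly absent -> disabled
```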
- Around line 196-205: The code indexes
turn_level_metadata[metric_id]["default"] (and the conversation equivalent)
which raises if the "default" key is missing; change the assignment to use
dict.get and respect an explicit default_flag by setting
turn_level_metadata[metric_id]["default"] = default_flag if default_flag is not
None else turn_level_metadata[metric_id].get("default", False) (and do the
analogous change for conversation_level_metadata) so missing keys default to
False instead of causing a KeyError.
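A sketch of the suggested lookup, with a hypothetical `resolve_default` helper standing in for the inline assignment:

```python
def resolve_default(metadata: dict, metric_id: str, default_flag: bool | None) -> bool:
    """Prefer an explicit flag; otherwise keep the existing value; otherwise False."""
    if default_flag is not None:
        return default_flag
    return metadata.get(metric_id, {}).get("default", False)

meta = {"ragas:faithfulness": {"threshold": 0.8}}  # note: no "default" key
assert resolve_default(meta, "ragas:faithfulness", None) is False   # no KeyError
assert resolve_default(meta, "ragas:faithfulness", True) is True    # explicit flag wins
```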
In `@tests/unit/core/models/test_quality.py`:
- Around line 105-120: The test test_quality_report_sample_size_zero relies on
the first warning entry which is order-sensitive; update the assertions to
search the report.warnings list for an entry containing both
"ragas:faithfulness" and "excluded" instead of indexing report.warnings[0].
Locate the call to QualityReport.create_report and the subsequent assertions
that reference report.warnings, and replace the indexed checks with a
membership/search check (e.g., any(...) over report.warnings) that verifies at
least one warning contains the required substrings.
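A sketch of the order-insensitive check, using a hypothetical `has_warning` helper and an illustrative warning string:

```python
def has_warning(warnings: list[str], *substrings: str) -> bool:
    """True if at least one warning contains all the given substrings."""
    return any(all(s in w for s in substrings) for w in warnings)

# Instead of asserting on report.warnings[0], which breaks if warning order changes:
warnings = ["other metric skipped", "ragas:faithfulness excluded: sample size is 0"]
assert has_warning(warnings, "ragas:faithfulness", "excluded")
```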
In `@tests/unit/core/system/test_loader.py`:
- Around line 520-566: The test should add an unrelated turn-level metric in the
YAML (e.g., "other:timeliness" with threshold and default:false) that is not
listed under quality_score.metrics, then after loading via
ConfigLoader().load_system_config assert that
config.default_turn_metrics_metadata["other:timeliness"]["default"] is still
False to prove only the targeted metrics were flipped; update the YAML in
test_quality_score_default_true_sets_default_on_metrics and add the
corresponding assertion checking the unrelated metric remains unchanged.
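A sketch of the suggested test, assuming a plausible YAML schema and import path inferred from the comment (both are assumptions, not the repository's actual layout):

```python
from lightspeed_evaluation.core.system.loader import ConfigLoader  # assumed import path

def test_unrelated_metric_keeps_default_false(tmp_path):
    system_yaml = tmp_path / "system.yaml"
    system_yaml.write_text(
        """
quality_score:
  default: true
  metrics:
    - ragas:faithfulness
metrics_metadata:
  turn_level:
    "ragas:faithfulness":
      threshold: 0.8
    "other:timeliness":    # unrelated: not listed under quality_score.metrics
      threshold: 0.7
      default: false
"""
    )
    config = ConfigLoader().load_system_config(str(system_yaml))
    # Only targeted metrics should be flipped; the unrelated one stays False.
    assert config.default_turn_metrics_metadata["other:timeliness"]["default"] is False
```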
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0457fd03-1467-4a4e-b294-4b1d3fa6abf9
📒 Files selected for processing (12)
- config/system.yaml
- docs/configuration.md
- src/lightspeed_evaluation/core/models/quality.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/system/loader.py
- tests/unit/core/models/conftest.py
- tests/unit/core/models/test_quality.py
- tests/unit/core/models/test_system.py
- tests/unit/core/output/conftest.py
- tests/unit/core/output/test_generator.py
- tests/unit/core/system/test_loader.py
Force-pushed from 53a55e5 to fac6a29 (Compare)
@coderabbitai full review
✅ Actions performed: Full review triggered.
VladimirKadlec left a comment:
LGTM, thank you 👍
I have doubts about the weighting strategy, but this PR implements the current one.
asamal4 left a comment:
Thank you!! Minor comments.
Note that the quality score JSON should always get created, but with the current flow it won't be created if we remove json from the enabled output, or if in the future we make other file storage optional. But it is okay for now; we will handle this in the future if required.
```python
class QualityReport(BaseModel):
    """Aggregated quality score from selected metrics."""

    aggregated_quality_score: float = Field(
```

Suggested change:

```diff
-    aggregated_quality_score: float = Field(
+    quality_score: float = Field(
```
```python
        default_factory=list,
        description="Warnings about quality metrics configuration or usage",
    )
    api_latency: float = Field(
```

Suggested change:

```diff
-    api_latency: float = Field(
+    agent_latency: float = Field(
```
```python
    api_latency: float = Field(
        default=0.0, description="[Placeholder] Average API response time in seconds"
    )
    api_tokens: int = Field(
```

Suggested change:

```diff
-    api_tokens: int = Field(
+    agent_token_usage: int = Field(
```
```python
        judge_panel_data = config_data.get("judge_panel")
        judge_panel = JudgePanelConfig(**judge_panel_data) if judge_panel_data else None

        # Parse storage backends with backward compatibility for legacy 'output' section
```

Why change this for this work?
Description
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
Release Notes
New Features
Documentation