
[LEADS-349] Calculate aggregated score from key metrics#227

Open
xmican10 wants to merge 1 commit into lightspeed-core:main from xmican10:LEADS-349-calculate-aggregated-score-from-key-metrics

Conversation


@xmican10 xmican10 commented Apr 28, 2026

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude (e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used)
  • Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

Release Notes

  • New Features

    • Quality scoring system now aggregates selected metrics into an overall quality score using weighted averaging.
    • JSON quality reports generated with aggregated scores and metric breakdowns.
  • Documentation

    • Added configuration guide for quality score setup and metric selection.


coderabbitai Bot commented Apr 28, 2026

Walkthrough

This pull request introduces a quality_score feature that aggregates selected evaluation metrics into a weighted average quality score. The implementation includes configuration handling, Pydantic models for quality reports, validation of metric references, integration into the output generation pipeline, and comprehensive test coverage.
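
To make the aggregation concrete, here is a minimal sketch of a weighted average over per-metric means; the `MetricStats` class, field names, and the skip-on-zero-samples rule are assumptions for illustration, not the PR's actual `QualityReport` logic:

```python
# Hypothetical sketch of weighted-average aggregation; names and the
# zero-sample exclusion rule are assumptions, not code from this PR.
from dataclasses import dataclass


@dataclass
class MetricStats:
    mean: float       # mean score for this metric across evaluated samples
    weight: float     # configured weight for this metric
    sample_size: int  # number of samples that produced a score


def aggregate_quality_score(by_metric: dict[str, MetricStats]) -> float:
    """Weighted average of metric means, excluding metrics with no samples."""
    usable = {name: s for name, s in by_metric.items() if s.sample_size > 0}
    total_weight = sum(s.weight for s in usable.values())
    if total_weight == 0:
        return 0.0  # nothing usable; a real implementation might warn here
    return sum(s.mean * s.weight for s in usable.values()) / total_weight


score = aggregate_quality_score({
    "ragas:faithfulness": MetricStats(mean=0.9, weight=1.0, sample_size=10),
    "ragas:response_relevancy": MetricStats(mean=0.7, weight=1.0, sample_size=10),
})
print(score)  # 0.8
```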

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration & Models**<br>`config/system.yaml`, `src/lightspeed_evaluation/core/models/quality.py`, `src/lightspeed_evaluation/core/models/system.py` | Adds a `quality_score` YAML configuration block with a metrics list and default flag (a hedged config example follows this table). Introduces `QualityMetricResult` and `QualityReport` Pydantic models with weighted-average aggregation logic and validation. Adds a `QualityScoreConfig` model with duplicate detection and `SystemConfig` post-validation to ensure configured metrics exist in metadata. |
| **Documentation**<br>`docs/configuration.md` | Adds documentation for the `quality_score` configuration section, including example usage, field definitions, and validation constraints. |
| **Core Implementation**<br>`src/lightspeed_evaluation/core/output/generator.py`, `src/lightspeed_evaluation/core/system/loader.py` | Integrates the quality score into the output handler to generate `_quality_report.json` with ISO timestamps, aggregated scores, metric weights, and warnings. Updates the config loader to parse and apply the `quality_score` section, mutating metric metadata defaults and validating metric references. |
| **Test Fixtures & Models**<br>`tests/unit/core/models/conftest.py`, `tests/unit/core/models/test_quality.py`, `tests/unit/core/models/test_system.py` | Adds `quality_by_metric` and `quality_by_metric_zero` fixtures; introduces a `TestQualityReport` suite covering the happy path, missing metrics, zero samples, and partial zero-sample scenarios. Extends `test_system.py` with `QualityScoreConfig` validation and `SystemConfig` integration tests. |
| **Test Integration & Output**<br>`tests/unit/core/output/conftest.py`, `tests/unit/core/output/test_generator.py`, `tests/unit/core/system/test_loader.py` | Updates the mock config fixture with a `quality_score` attribute. Adds a `TestQualityReportGeneration` suite validating JSON output structure and partial metric handling. Introduces `TestConfigLoaderQualityScore` covering config parsing, default mutation, and error cases. |
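
As flagged in the first row above, a hedged sketch of what the `quality_score` block in `config/system.yaml` might look like; the exact keys (`metrics`, `default`) are inferred from the summary, not copied from the PR:

```python
# Hypothetical quality_score YAML block, parsed with PyYAML; the key names
# are guesses based on the change summary above.
import yaml

config_text = """
quality_score:
  default: true
  metrics:
    - ragas:faithfulness
    - ragas:response_relevancy
"""

quality_score = yaml.safe_load(config_text)["quality_score"]
print(quality_score["metrics"])  # ['ragas:faithfulness', 'ragas:response_relevancy']
```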

Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant ConfigLoader
    participant SystemConfig
    participant OutputHandler
    participant QualityReport
    participant Storage

    User->>ConfigLoader: Load system.yaml with quality_score
    ConfigLoader->>ConfigLoader: Parse quality_score section
    ConfigLoader->>SystemConfig: Create config with QualityScoreConfig
    SystemConfig->>SystemConfig: Validate metrics exist in metadata
    ConfigLoader-->>User: Return validated config

    User->>OutputHandler: Generate reports
    OutputHandler->>OutputHandler: Check system_config.quality_score
    alt quality_score configured
        OutputHandler->>QualityReport: create_report(by_metric, metrics_list)
        QualityReport->>QualityReport: Partition into quality/extra metrics
        QualityReport->>QualityReport: Calculate weighted average (mean × weight)
        QualityReport-->>OutputHandler: Return QualityReport
        OutputHandler->>OutputHandler: _generate_quality_score_report()
        OutputHandler->>Storage: Write quality_report.json (timestamp, aggregated_score, warnings)
    else no quality_score
        OutputHandler->>Storage: Generate standard reports only
    end
    OutputHandler-->>User: Reports generated
```
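The `_quality_report.json` payload itself is not shown in this thread; the sketch below shows one plausible way the output handler could serialize the fields the diagram mentions (timestamp, aggregated score, weights, warnings), with every field name an assumption:

```python
# Hypothetical serialization of the quality report; the field names mirror
# the walkthrough's description but are not taken from the PR's generator.
import json
from datetime import datetime, timezone

report = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "aggregated_quality_score": 0.8,
    "metrics": {
        "ragas:faithfulness": {"mean": 0.9, "weight": 1.0},
        "ragas:response_relevancy": {"mean": 0.7, "weight": 1.0},
    },
    "warnings": [],
}

with open("eval_quality_report.json", "w", encoding="utf-8") as fh:
    json.dump(report, fh, indent=2)
```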

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A quality score hops into view,
Weighted metrics blended just right, it's true!
Aggregation magic, from mean to mean,
The finest report you have ever seen! ✨
Config now flows through validation's gate,
Quality scores celebrate! 🎉

🚥 Pre-merge checks: ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[LEADS-349] Calculate aggregated score from key metrics' accurately and specifically summarizes the main change in this pull request, which adds functionality to compute an aggregated quality score from selected metrics. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |



Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



@xmican10 xmican10 force-pushed the LEADS-349-calculate-aggregated-score-from-key-metrics branch 2 times, most recently from dd37457 to 53a55e5, April 30, 2026 11:41
@xmican10 xmican10 marked this pull request as ready for review April 30, 2026 11:49
@coderabbitai coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_evaluation/core/models/quality.py`:
- Around line 116-118: The loop currently skips when score_stats is None without
any signal; update the block where "if score_stats is None: continue" to emit a
warning before continuing so missing metric data is visible. Use the module
logger (e.g., logger or logging.getLogger(__name__)) and include identifying
info about the metric (the variable that refers to the quality metric — e.g.,
metric.name, metric.key, or metric_id used in the loop) and any relevant context
(evaluation id or config name) in the warning message, then continue unchanged;
keep the rest of the aggregation logic intact.
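
A minimal sketch of the fix this comment asks for; the loop shape and variable names paraphrase the comment rather than quoting the PR:

```python
# Sketch: warn instead of silently skipping metrics with no score data.
# by_metric and the loop structure are stand-ins, not the PR's code.
import logging

logger = logging.getLogger(__name__)

by_metric = {"ragas:faithfulness": None, "ragas:response_relevancy": {"mean": 0.7}}

for metric_id, score_stats in by_metric.items():
    if score_stats is None:
        logger.warning("No score data for quality metric %s; excluding it", metric_id)
        continue
    # ... rest of the aggregation logic unchanged ...
```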

In `@src/lightspeed_evaluation/core/system/loader.py`:
- Around line 131-141: The code treats an explicitly empty quality_score: {} as
absent because it uses truthiness checks; update the checks to use explicit None
checks so empty dicts still trigger validation and defaults processing—replace
both "if quality_score_data:" and the ternary constructing
QualityScoreConfig(**quality_score_data) if quality_score_data else None with
checks like "if quality_score_data is not None:" and construct
QualityScoreConfig(**quality_score_data) when quality_score_data is not None,
ensuring _process_quality_score_defaults(quality_score_data,
turn_level_metadata, conversation_level_metadata) and QualityScoreConfig(...)
run for empty dicts as well.
- Around line 196-205: The code indexes
turn_level_metadata[metric_id]["default"] (and the conversation equivalent)
which raises if the "default" key is missing; change the assignment to use
dict.get and respect an explicit default_flag by setting
turn_level_metadata[metric_id]["default"] = default_flag if default_flag is not
None else turn_level_metadata[metric_id].get("default", False) (and do the
analogous change for conversation_level_metadata) so missing keys default to
False instead of causing a KeyError.
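
A self-contained stand-in for the two loader fixes requested above (explicit `is not None` check and `dict.get` for the missing `"default"` key); the real loader's helpers are not reproduced here:

```python
# Standalone demonstration of the two requested fixes; function names are
# illustrative, not the loader's actual API.
def parse_quality_score(config_data: dict):
    quality_score_data = config_data.get("quality_score")
    # Explicit None check so an empty `quality_score: {}` still counts as configured.
    if quality_score_data is not None:
        return dict(quality_score_data)  # stand-in for QualityScoreConfig(**...)
    return None


def apply_default_flag(metadata: dict, metric_id: str, default_flag) -> None:
    # dict.get avoids a KeyError when the metric has no "default" key yet.
    metadata[metric_id]["default"] = (
        default_flag
        if default_flag is not None
        else metadata[metric_id].get("default", False)
    )


assert parse_quality_score({"quality_score": {}}) == {}
meta = {"ragas:faithfulness": {"threshold": 0.8}}
apply_default_flag(meta, "ragas:faithfulness", None)
assert meta["ragas:faithfulness"]["default"] is False
```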

In `@tests/unit/core/models/test_quality.py`:
- Around line 105-120: The test test_quality_report_sample_size_zero relies on
the first warning entry which is order-sensitive; update the assertions to
search the report.warnings list for an entry containing both
"ragas:faithfulness" and "excluded" instead of indexing report.warnings[0].
Locate the call to QualityReport.create_report and the subsequent assertions
that reference report.warnings, and replace the indexed checks with a
membership/search check (e.g., any(...) over report.warnings) that verifies at
least one warning contains the required substrings.
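
The order-insensitive assertion the comment describes, in miniature; `warnings` stands in for `report.warnings` from the real test:

```python
# Membership check over all warnings instead of indexing warnings[0].
warnings = ["metric ragas:faithfulness excluded: sample size is zero"]

assert any("ragas:faithfulness" in w and "excluded" in w for w in warnings)
```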

In `@tests/unit/core/system/test_loader.py`:
- Around line 520-566: The test should add an unrelated turn-level metric in the
YAML (e.g., "other:timeliness" with threshold and default:false) that is not
listed under quality_score.metrics, then after loading via
ConfigLoader().load_system_config assert that
config.default_turn_metrics_metadata["other:timeliness"]["default"] is still
False to prove only the targeted metrics were flipped; update the YAML in
test_quality_score_default_true_sets_default_on_metrics and add the
corresponding assertion checking the unrelated metric remains unchanged.
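
A self-contained stand-in for the suggested guard: flipping defaults for the configured metrics must leave an unrelated metric untouched (`other:timeliness` is taken from the comment's wording):

```python
# Simulated default flipping; only metrics listed under quality_score.metrics
# should change, so the unrelated metric's flag must stay False.
turn_metadata = {
    "ragas:faithfulness": {"threshold": 0.8, "default": False},
    "other:timeliness": {"threshold": 0.5, "default": False},  # not configured
}
quality_metrics = ["ragas:faithfulness"]

for metric_id in quality_metrics:
    turn_metadata[metric_id]["default"] = True

assert turn_metadata["other:timeliness"]["default"] is False  # untouched
```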
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0457fd03-1467-4a4e-b294-4b1d3fa6abf9

📥 Commits

Reviewing files that changed from the base of the PR and between 8762b7f and 53a55e5.

📒 Files selected for processing (12)
  • config/system.yaml
  • docs/configuration.md
  • src/lightspeed_evaluation/core/models/quality.py
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/system/loader.py
  • tests/unit/core/models/conftest.py
  • tests/unit/core/models/test_quality.py
  • tests/unit/core/models/test_system.py
  • tests/unit/core/output/conftest.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/core/system/test_loader.py

Comment thread src/lightspeed_evaluation/core/models/quality.py
Comment thread src/lightspeed_evaluation/core/system/loader.py Outdated
Comment thread src/lightspeed_evaluation/core/system/loader.py
Comment thread tests/unit/core/models/test_quality.py Outdated
Comment thread tests/unit/core/system/test_loader.py
@xmican10 xmican10 force-pushed the LEADS-349-calculate-aggregated-score-from-key-metrics branch from 53a55e5 to fac6a29, April 30, 2026 12:22
@xmican10 xmican10 (Collaborator, Author) commented:

@coderabbitai full review


coderabbitai Bot commented Apr 30, 2026

✅ Actions performed

Full review triggered.

@VladimirKadlec VladimirKadlec (Member) left a comment

LGTM, thank you 👍

I have doubts about the weighting strategy, but this PR implements the current one.

@asamal4 asamal4 (Collaborator) left a comment

Thank you!! Minor comments.

Note that the quality score JSON should always get created, but with the current flow it won't be created if json is removed from the enabled outputs, or if other file storage is made optional in the future. But it is okay for now; we will handle this later if required. (One possible decoupling is sketched below.)
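
One possible shape of that decoupling, sketched as an assumption rather than the PR's actual flow; the handler methods are invented for illustration:

```python
# Hypothetical restructuring: the quality report is written unconditionally,
# outside the enabled-outputs check. Method names are illustrative only.
class Handler:
    def write_json_results(self) -> None: print("json results")
    def write_csv_results(self) -> None: print("csv results")
    def write_quality_report(self) -> None: print("quality report")


def generate_reports(handler: Handler, enabled_outputs: set[str]) -> None:
    if "json" in enabled_outputs:
        handler.write_json_results()
    if "csv" in enabled_outputs:
        handler.write_csv_results()
    # Written regardless of formats, so the report exists even without json output.
    handler.write_quality_report()


generate_reports(Handler(), {"csv"})  # quality report still produced
```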

```python
class QualityReport(BaseModel):
    """Aggregated quality score from selected metrics."""

    aggregated_quality_score: float = Field(
```

**Suggested change**

```diff
-    aggregated_quality_score: float = Field(
+    quality_score: float = Field(
```

```python
        default_factory=list,
        description="Warnings about quality metrics configuration or usage",
    )
    api_latency: float = Field(
```

**Suggested change**

```diff
-    api_latency: float = Field(
+    agent_latency: float = Field(
```

```python
    api_latency: float = Field(
        default=0.0, description="[Placeholder] Average API response time in seconds"
    )
    api_tokens: int = Field(
```

**Suggested change**

```diff
-    api_tokens: int = Field(
+    agent_token_usage: int = Field(
```

```python
judge_panel_data = config_data.get("judge_panel")
judge_panel = JudgePanelConfig(**judge_panel_data) if judge_panel_data else None

# Parse storage backends with backward compatibility for legacy 'output' section
```

Why change this for this work?
