
[LEADS-349] Calculate aggregated score from key metrics#227

Open
xmican10 wants to merge 1 commit into lightspeed-core:main from xmican10:LEADS-349-calculate-aggregated-score-from-key-metrics

Conversation


@xmican10 xmican10 commented Apr 28, 2026

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude (e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used)
  • Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

Release Notes

  • New Features

    • Quality scoring system now aggregates selected metrics into an overall quality score using weighted averaging.
    • JSON quality reports generated with aggregated scores and metric breakdowns.
  • Documentation

    • Added configuration guide for quality score setup and metric selection.


coderabbitai Bot commented Apr 28, 2026

Walkthrough

This pull request introduces a quality_score feature that aggregates selected evaluation metrics into a weighted average quality score. The implementation includes configuration handling, Pydantic models for quality reports, validation of metric references, integration into the output generation pipeline, and comprehensive test coverage.
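
To make the aggregation concrete, here is a minimal sketch of a weighted average over per-metric means; the `MetricStats` class, field names, and the skip-on-zero-samples rule are assumptions for illustration, not the PR's actual `QualityReport` logic:

```python
# Hypothetical sketch of weighted-average aggregation; names and the
# zero-sample exclusion rule are assumptions, not code from this PR.
from dataclasses import dataclass


@dataclass
class MetricStats:
    mean: float       # mean score for this metric across evaluated samples
    weight: float     # configured weight for this metric
    sample_size: int  # number of samples that produced a score


def aggregate_quality_score(by_metric: dict[str, MetricStats]) -> float:
    """Weighted average of metric means, excluding metrics with no samples."""
    usable = {name: s for name, s in by_metric.items() if s.sample_size > 0}
    total_weight = sum(s.weight for s in usable.values())
    if total_weight == 0:
        return 0.0  # nothing usable; a real implementation might warn here
    return sum(s.mean * s.weight for s in usable.values()) / total_weight


score = aggregate_quality_score({
    "ragas:faithfulness": MetricStats(mean=0.9, weight=1.0, sample_size=10),
    "ragas:response_relevancy": MetricStats(mean=0.7, weight=1.0, sample_size=10),
})
print(score)  # 0.8
```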

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration & Models**<br>`config/system.yaml`, `src/lightspeed_evaluation/core/models/quality.py`, `src/lightspeed_evaluation/core/models/system.py` | Adds a `quality_score` YAML configuration block with a metrics list and default flag (a hedged config example follows this table). Introduces `QualityMetricResult` and `QualityReport` Pydantic models with weighted-average aggregation logic and validation. Adds a `QualityScoreConfig` model with duplicate detection and `SystemConfig` post-validation to ensure configured metrics exist in metadata. |
| **Documentation**<br>`docs/configuration.md` | Adds documentation for the `quality_score` configuration section, including example usage, field definitions, and validation constraints. |
| **Core Implementation**<br>`src/lightspeed_evaluation/core/output/generator.py`, `src/lightspeed_evaluation/core/system/loader.py` | Integrates the quality score into the output handler to generate `_quality_report.json` with ISO timestamps, aggregated scores, metric weights, and warnings. Updates the config loader to parse and apply the `quality_score` section, mutating metric metadata defaults and validating metric references. |
| **Test Fixtures & Models**<br>`tests/unit/core/models/conftest.py`, `tests/unit/core/models/test_quality.py`, `tests/unit/core/models/test_system.py` | Adds `quality_by_metric` and `quality_by_metric_zero` fixtures; introduces a `TestQualityReport` suite covering the happy path, missing metrics, zero samples, and partial zero-sample scenarios. Extends `test_system.py` with `QualityScoreConfig` validation and `SystemConfig` integration tests. |
| **Test Integration & Output**<br>`tests/unit/core/output/conftest.py`, `tests/unit/core/output/test_generator.py`, `tests/unit/core/system/test_loader.py` | Updates the mock config fixture with a `quality_score` attribute. Adds a `TestQualityReportGeneration` suite validating JSON output structure and partial metric handling. Introduces `TestConfigLoaderQualityScore` covering config parsing, default mutation, and error cases. |
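
As flagged in the first row above, a hedged sketch of what the `quality_score` block in `config/system.yaml` might look like; the exact keys (`metrics`, `default`) are inferred from the summary, not copied from the PR:

```python
# Hypothetical quality_score YAML block, parsed with PyYAML; the key names
# are guesses based on the change summary above.
import yaml

config_text = """
quality_score:
  default: true
  metrics:
    - ragas:faithfulness
    - ragas:response_relevancy
"""

quality_score = yaml.safe_load(config_text)["quality_score"]
print(quality_score["metrics"])  # ['ragas:faithfulness', 'ragas:response_relevancy']
```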

Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant ConfigLoader
    participant SystemConfig
    participant OutputHandler
    participant QualityReport
    participant Storage

    User->>ConfigLoader: Load system.yaml with quality_score
    ConfigLoader->>ConfigLoader: Parse quality_score section
    ConfigLoader->>SystemConfig: Create config with QualityScoreConfig
    SystemConfig->>SystemConfig: Validate metrics exist in metadata
    ConfigLoader-->>User: Return validated config

    User->>OutputHandler: Generate reports
    OutputHandler->>OutputHandler: Check system_config.quality_score
    alt quality_score configured
        OutputHandler->>QualityReport: create_report(by_metric, metrics_list)
        QualityReport->>QualityReport: Partition into quality/extra metrics
        QualityReport->>QualityReport: Calculate weighted average (mean × weight)
        QualityReport-->>OutputHandler: Return QualityReport
        OutputHandler->>OutputHandler: _generate_quality_score_report()
        OutputHandler->>Storage: Write quality_report.json (timestamp, aggregated_score, warnings)
    else no quality_score
        OutputHandler->>Storage: Generate standard reports only
    end
    OutputHandler-->>User: Reports generated
```
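The `_quality_report.json` payload itself is not shown in this thread; the sketch below shows one plausible way the output handler could serialize the fields the diagram mentions (timestamp, aggregated score, weights, warnings), with every field name an assumption:

```python
# Hypothetical serialization of the quality report; the field names mirror
# the walkthrough's description but are not taken from the PR's generator.
import json
from datetime import datetime, timezone

report = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "aggregated_quality_score": 0.8,
    "metrics": {
        "ragas:faithfulness": {"mean": 0.9, "weight": 1.0},
        "ragas:response_relevancy": {"mean": 0.7, "weight": 1.0},
    },
    "warnings": [],
}

with open("eval_quality_report.json", "w", encoding="utf-8") as fh:
    json.dump(report, fh, indent=2)
```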

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A quality score hops into view,
Weighted metrics blended just right, it's true!
Aggregation magic, from mean to mean,
The finest report you have ever seen! ✨
Config now flows through validation's gate,
Quality scores celebrate! 🎉

🚥 Pre-merge checks: ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[LEADS-349] Calculate aggregated score from key metrics' accurately and specifically summarizes the main change in this pull request, which adds functionality to compute an aggregated quality score from selected metrics. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |



Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



@xmican10 xmican10 force-pushed the LEADS-349-calculate-aggregated-score-from-key-metrics branch 2 times, most recently from dd37457 to 53a55e5, April 30, 2026 11:41
@xmican10 xmican10 marked this pull request as ready for review April 30, 2026 11:49
@coderabbitai coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_evaluation/core/models/quality.py`:
- Around line 116-118: The loop currently skips when score_stats is None without
any signal; update the block where "if score_stats is None: continue" to emit a
warning before continuing so missing metric data is visible. Use the module
logger (e.g., logger or logging.getLogger(__name__)) and include identifying
info about the metric (the variable that refers to the quality metric — e.g.,
metric.name, metric.key, or metric_id used in the loop) and any relevant context
(evaluation id or config name) in the warning message, then continue unchanged;
keep the rest of the aggregation logic intact.
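
A minimal sketch of the fix this comment asks for; the loop shape and variable names paraphrase the comment rather than quoting the PR:

```python
# Sketch: warn instead of silently skipping metrics with no score data.
# by_metric and the loop structure are stand-ins, not the PR's code.
import logging

logger = logging.getLogger(__name__)

by_metric = {"ragas:faithfulness": None, "ragas:response_relevancy": {"mean": 0.7}}

for metric_id, score_stats in by_metric.items():
    if score_stats is None:
        logger.warning("No score data for quality metric %s; excluding it", metric_id)
        continue
    # ... rest of the aggregation logic unchanged ...
```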

In `@src/lightspeed_evaluation/core/system/loader.py`:
- Around line 131-141: The code treats an explicitly empty quality_score: {} as
absent because it uses truthiness checks; update the checks to use explicit None
checks so empty dicts still trigger validation and defaults processing—replace
both "if quality_score_data:" and the ternary constructing
QualityScoreConfig(**quality_score_data) if quality_score_data else None with
checks like "if quality_score_data is not None:" and construct
QualityScoreConfig(**quality_score_data) when quality_score_data is not None,
ensuring _process_quality_score_defaults(quality_score_data,
turn_level_metadata, conversation_level_metadata) and QualityScoreConfig(...)
run for empty dicts as well.
- Around line 196-205: The code indexes
turn_level_metadata[metric_id]["default"] (and the conversation equivalent)
which raises if the "default" key is missing; change the assignment to use
dict.get and respect an explicit default_flag by setting
turn_level_metadata[metric_id]["default"] = default_flag if default_flag is not
None else turn_level_metadata[metric_id].get("default", False) (and do the
analogous change for conversation_level_metadata) so missing keys default to
False instead of causing a KeyError.
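
A self-contained stand-in for the two loader fixes requested above (explicit `is not None` check and `dict.get` for the missing `"default"` key); the real loader's helpers are not reproduced here:

```python
# Standalone demonstration of the two requested fixes; function names are
# illustrative, not the loader's actual API.
def parse_quality_score(config_data: dict):
    quality_score_data = config_data.get("quality_score")
    # Explicit None check so an empty `quality_score: {}` still counts as configured.
    if quality_score_data is not None:
        return dict(quality_score_data)  # stand-in for QualityScoreConfig(**...)
    return None


def apply_default_flag(metadata: dict, metric_id: str, default_flag) -> None:
    # dict.get avoids a KeyError when the metric has no "default" key yet.
    metadata[metric_id]["default"] = (
        default_flag
        if default_flag is not None
        else metadata[metric_id].get("default", False)
    )


assert parse_quality_score({"quality_score": {}}) == {}
meta = {"ragas:faithfulness": {"threshold": 0.8}}
apply_default_flag(meta, "ragas:faithfulness", None)
assert meta["ragas:faithfulness"]["default"] is False
```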

In `@tests/unit/core/models/test_quality.py`:
- Around line 105-120: The test test_quality_report_sample_size_zero relies on
the first warning entry which is order-sensitive; update the assertions to
search the report.warnings list for an entry containing both
"ragas:faithfulness" and "excluded" instead of indexing report.warnings[0].
Locate the call to QualityReport.create_report and the subsequent assertions
that reference report.warnings, and replace the indexed checks with a
membership/search check (e.g., any(...) over report.warnings) that verifies at
least one warning contains the required substrings.
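
The order-insensitive assertion the comment describes, in miniature; `warnings` stands in for `report.warnings` from the real test:

```python
# Membership check over all warnings instead of indexing warnings[0].
warnings = ["metric ragas:faithfulness excluded: sample size is zero"]

assert any("ragas:faithfulness" in w and "excluded" in w for w in warnings)
```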

In `@tests/unit/core/system/test_loader.py`:
- Around line 520-566: The test should add an unrelated turn-level metric in the
YAML (e.g., "other:timeliness" with threshold and default:false) that is not
listed under quality_score.metrics, then after loading via
ConfigLoader().load_system_config assert that
config.default_turn_metrics_metadata["other:timeliness"]["default"] is still
False to prove only the targeted metrics were flipped; update the YAML in
test_quality_score_default_true_sets_default_on_metrics and add the
corresponding assertion checking the unrelated metric remains unchanged.
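
A self-contained stand-in for the suggested guard: flipping defaults for the configured metrics must leave an unrelated metric untouched (`other:timeliness` is taken from the comment's wording):

```python
# Simulated default flipping; only metrics listed under quality_score.metrics
# should change, so the unrelated metric's flag must stay False.
turn_metadata = {
    "ragas:faithfulness": {"threshold": 0.8, "default": False},
    "other:timeliness": {"threshold": 0.5, "default": False},  # not configured
}
quality_metrics = ["ragas:faithfulness"]

for metric_id in quality_metrics:
    turn_metadata[metric_id]["default"] = True

assert turn_metadata["other:timeliness"]["default"] is False  # untouched
```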
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0457fd03-1467-4a4e-b294-4b1d3fa6abf9

📥 Commits

Reviewing files that changed from the base of the PR and between 8762b7f and 53a55e5.

📒 Files selected for processing (12)
  • config/system.yaml
  • docs/configuration.md
  • src/lightspeed_evaluation/core/models/quality.py
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/system/loader.py
  • tests/unit/core/models/conftest.py
  • tests/unit/core/models/test_quality.py
  • tests/unit/core/models/test_system.py
  • tests/unit/core/output/conftest.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/core/system/test_loader.py

Comment thread src/lightspeed_evaluation/core/models/quality.py
Comment thread src/lightspeed_evaluation/core/system/loader.py Outdated
Comment thread src/lightspeed_evaluation/core/system/loader.py
Comment thread tests/unit/core/models/test_quality.py Outdated
Comment thread tests/unit/core/system/test_loader.py
@xmican10 xmican10 force-pushed the LEADS-349-calculate-aggregated-score-from-key-metrics branch from 53a55e5 to fac6a29, April 30, 2026 12:22
@xmican10 xmican10 (Collaborator, Author) commented:

@coderabbitai full review


coderabbitai Bot commented Apr 30, 2026

✅ Actions performed

Full review triggered.

@VladimirKadlec VladimirKadlec (Member) left a comment

LGTM, thank you 👍

I have doubts about the weighting strategy, but this PR implements the current one.

@asamal4 asamal4 (Collaborator) left a comment

Thank you!! Minor comments.

Note that the quality score JSON should always get created, but with the current flow it won't be created if json is removed from the enabled outputs, or if other file storage is made optional in the future. But it is okay for now; we will handle this later if required. (One possible decoupling is sketched below.)
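
One possible shape of that decoupling, sketched as an assumption rather than the PR's actual flow; the handler methods are invented for illustration:

```python
# Hypothetical restructuring: the quality report is written unconditionally,
# outside the enabled-outputs check. Method names are illustrative only.
class Handler:
    def write_json_results(self) -> None: print("json results")
    def write_csv_results(self) -> None: print("csv results")
    def write_quality_report(self) -> None: print("quality report")


def generate_reports(handler: Handler, enabled_outputs: set[str]) -> None:
    if "json" in enabled_outputs:
        handler.write_json_results()
    if "csv" in enabled_outputs:
        handler.write_csv_results()
    # Written regardless of formats, so the report exists even without json output.
    handler.write_quality_report()


generate_reports(Handler(), {"csv"})  # quality report still produced
```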

```python
class QualityReport(BaseModel):
    """Aggregated quality score from selected metrics."""

    aggregated_quality_score: float = Field(
```

**Suggested change**

```diff
-    aggregated_quality_score: float = Field(
+    quality_score: float = Field(
```

```python
        default_factory=list,
        description="Warnings about quality metrics configuration or usage",
    )
    api_latency: float = Field(
```

**Suggested change**

```diff
-    api_latency: float = Field(
+    agent_latency: float = Field(
```

```python
    api_latency: float = Field(
        default=0.0, description="[Placeholder] Average API response time in seconds"
    )
    api_tokens: int = Field(
```

**Suggested change**

```diff
-    api_tokens: int = Field(
+    agent_token_usage: int = Field(
```

```python
judge_panel_data = config_data.get("judge_panel")
judge_panel = JudgePanelConfig(**judge_panel_data) if judge_panel_data else None

# Parse storage backends with backward compatibility for legacy 'output' section
```

Why change this for this work?
