
[LEADS-192] feat: integrate panel of judges#184

Merged
asamal4 merged 1 commit into lightspeed-core:main from asamal4:integrate-judge-panel
Mar 12, 2026

Conversation

@asamal4 (Collaborator) commented Mar 10, 2026

Description

  • Judge panel integration driven by the new config
  • Backward compatibility when an existing llm config is given
  • Maintains one row per metric with an aggregated (currently max) result
  • New column to track each individual judge LLM's score/response/token usage
  • Only one expected response is considered

There is room for further improvement, but this is already a large change.
Documentation will be updated once all the basic requirements are in place.
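For context, here is one possible shape of the new configuration. The top-level keys llm_pool, judge_panel, and aggregation_strategy appear in this PR; the nested provider/model fields and the judges list are placeholders, not the project's actual schema:

```python
# Illustrative only: a possible shape of the judge-panel configuration.
# Nested fields are assumptions inferred from the PR summary.
system_config = {
    "llm_pool": {
        "primary": {"provider": "openai", "model": "gpt-4o-mini"},
        "secondary": {"provider": "watsonx", "model": "granite-3"},
    },
    "judge_panel": {
        "judges": ["primary", "secondary"],
        "aggregation_strategy": "max",  # current default per the description
    },
}
```

When judge_panel is omitted, the existing single-judge llm config keeps working, which is the backward-compatibility path described above.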

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Cursor

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Multi-judge panel evaluation with per-judge scoring, aggregation (default "max"), and metric-based judge selection.
    • Per-judge score model exposed and system config now supports judge panels and LLM pools.
  • Bug Fixes / Outputs

    • CSV and JSON summaries now include serialized per-judge scores.
  • Refactor

    • Evaluation flow split into panel vs single-judge paths with orchestrated per-judge handlers.
  • Tests

    • Added/updated tests for judge panels, scoring, evaluator flows, and backward compatibility.

@asamal4 force-pushed the integrate-judge-panel branch from c545958 to 2ccbdc8 on March 10, 2026 at 15:24
@coderabbitai Bot (Contributor) commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds multi-judge panel evaluation: new JudgeScore model and CSV/JSON field, judge-panel-aware LLMManager (judge_id, get_judges_for_metric), JudgeOrchestrator to run per-judge evaluations and aggregate scores (max), SystemConfig parsing for llm_pool/judge_panel, evaluator routing for panel/non‑LLM flows, and tests/serialization.

Changes

  • Constants & CSV output (src/lightspeed_evaluation/core/constants.py, src/lightspeed_evaluation/core/output/generator.py): Added NON_LLM_FRAMEWORKS = frozenset({"nlp", "script"}); added "judge_scores" to the supported CSV columns; judge_scores is serialized as JSON in CSV/JSON outputs.
  • Data models & exports (src/lightspeed_evaluation/core/models/data.py, src/lightspeed_evaluation/core/models/__init__.py): Added the JudgeScore Pydantic model and judge_scores: Optional[list[JudgeScore]] on MetricResult; re-exported JudgeScore in the package __all__.
  • LLM manager & system model (src/lightspeed_evaluation/core/llm/manager.py, src/lightspeed_evaluation/core/models/system.py, src/lightspeed_evaluation/core/system/loader.py): LLMManager gains judge_id (default "primary"), constructs per-pool judge managers keyed by pool IDs, and adds get_judges_for_metric(). SystemConfig.get_judge_configs() now returns list[tuple[str, LLMConfig]]; the loader parses optional llm_pool and judge_panel.
  • Judge orchestration (src/lightspeed_evaluation/pipeline/evaluation/judges.py): New JudgeOrchestrator class that orchestrates per-judge evaluations, caches/creates per-judge handlers via a factory, handles per-judge errors (score=None), accumulates tokens, and aggregates scores using the max strategy.
  • Evaluator integration (src/lightspeed_evaluation/pipeline/evaluation/evaluator.py): Integrated judge orchestration: added system_config, llm_manager, embedding_manager, judge_orchestrator, _create_handler_for_judge, _will_use_panel, and panel-aware evaluation paths, including panel-only single-expected-response handling and non-LLM behavior adjustments.
  • System config defaults & validation (src/lightspeed_evaluation/core/models/system.py): Default aggregation_strategy changed to "max"; get_judge_configs() now yields (pool_key, LLMConfig) tuples and raises when judge_panel exists without llm_pool.
  • Tests & fixtures (tests/unit/core/llm/test_llm_manager.py, tests/unit/core/models/test_data.py, tests/unit/core/models/test_system.py, tests/unit/pipeline/evaluation/conftest.py, tests/unit/pipeline/evaluation/test_evaluator.py, tests/unit/pipeline/evaluation/test_judges.py): Added tests for LLMManager backward compatibility and for JudgeScore validation and MetricResult integration; adjusted get_judge_configs tests for tuple returns; added a create_mock_llm_manager helper; added new JudgeOrchestrator tests (aggregation, caching); updated evaluator tests to use the new mocking helper with adjusted assertions.
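The per-judge score shape and its CSV serialization described above can be sketched roughly as follows, using a plain dataclass stand-in for the actual Pydantic JudgeScore model (field names beyond judge_id and score are assumptions):

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class JudgeScore:
    """Stand-in for the Pydantic model added in core/models/data.py."""
    judge_id: str
    score: Optional[float] = None  # None when this judge errored
    reason: str = ""

# One metric row keeps an aggregated score plus the individual judge scores;
# the list is serialized as a JSON string in the new "judge_scores" CSV column.
judge_scores = [
    JudgeScore(judge_id="primary", score=0.9, reason="grounded"),
    JudgeScore(judge_id="secondary", score=None, reason="judge errored"),
]
csv_cell = json.dumps([asdict(js) for js in judge_scores])
```

Serializing to a single JSON cell keeps the one-row-per-metric CSV layout while still exposing each judge's result.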

Sequence Diagram(s)

sequenceDiagram
    participant Evaluator
    participant LLMManager
    participant JudgeOrchestrator
    participant MetricHandler
    participant OutputGenerator

    Evaluator->>LLMManager: get_judges_for_metric(metric_id)
    LLMManager->>LLMManager: should_use_panel_for_metric(metric_id)?
    alt panel enabled
        LLMManager-->>Evaluator: [judge_manager...]
        Evaluator->>JudgeOrchestrator: evaluate_with_judges(request, scope, token_tracker, threshold)
        JudgeOrchestrator->>JudgeOrchestrator: for each judge_manager
        loop per judge
            JudgeOrchestrator->>MetricHandler: get/instantiate handler (handler_factory)
            MetricHandler->>JudgeOrchestrator: evaluate(single_response)
            JudgeOrchestrator->>JudgeOrchestrator: record JudgeScore (score, reason, tokens)
        end
        JudgeOrchestrator->>JudgeOrchestrator: aggregate_scores(judge_scores)  %% uses max
        JudgeOrchestrator-->>Evaluator: MetricResult (aggregated score + judge_scores)
    else no panel / non-LLM
        LLMManager-->>Evaluator: primary judge manager
        Evaluator->>MetricHandler: evaluate(standard or non-LLM path)
        MetricHandler-->>Evaluator: MetricResult
    end
    Evaluator->>OutputGenerator: serialize MetricResult (include judge_scores)
    OutputGenerator-->>Evaluator: CSV/JSON output

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks: ✅ 3 passed

  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'feat: integrate panel of judges' directly and clearly summarizes the main objective of the changeset: integrating a judge panel feature while maintaining backward compatibility.


@asamal4 (Collaborator, Author) commented Mar 10, 2026

@coderabbitai review

@coderabbitai Bot (Contributor) commented Mar 10, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 778-804: The _aggregate_scores method currently always computes an
average and ignores JudgePanelConfig.aggregation_strategy, causing non-average
strategies (max, majority_vote) to be silently ignored; update _aggregate_scores
to branch on the aggregation_strategy value from the panel config (e.g.,
"average", "max", "majority_vote") and implement: for "average" compute mean of
non-None JudgeScore.score entries, for "max" return the maximum score and a
reason indicating which judge(s) produced it, and for "majority_vote" compute
the rounded modal score (or None if no consensus) with an explanatory reason;
ensure you still handle the single-judge shortcut and the case of no valid
scores, and reference JudgeScore.score and JudgeScore.reason when building
aggregated reason strings.
- Around line 680-698: The code uses judge_manager.config.model as the
cache/output identifier which can collide after resolve_llm_config(); change all
uses that identify/cached judges (including the judge_id assigned to JudgeScore
and any cache key inside _get_handler_for_judge) to use the llm_pool key from
the judge config (e.g., judge_manager.config.llm_pool or the config property
that represents the pool key) instead of judge_manager.config.model so handlers
are keyed by the unique llm_pool entry and scores remain distinct.
- Around line 271-281: The current branch that handles uses_panel and
multiple_expected_responses silently drops extra references by calling
_evaluate_panel_single_expected_response; instead, fail fast or preserve
multi-reference semantics: modify the branch in evaluator.py (the if uses_panel
and multiple_expected_responses block) to raise a clear exception (e.g.,
NotImplementedError or RuntimeError) when a judge panel is enabled but multiple
expected_responses are present, with a message saying NxM evaluation for panels
is not implemented, or alternatively call/implement the existing multi-reference
evaluation path (the multi-reference evaluator) so metrics that accept lists
keep their “best/pass across references” behavior rather than truncating to the
first reference. Ensure the error message mentions
request.conv_data.conversation_group_id and the conflicting flags so callers can
fix their config.
- Around line 800-801: The comprehension currently leaves js.score as
Optional[float], so remove the "# type: ignore" by filtering out None values in
the comprehension: build scores as a list of float by using "scores = [js.score
for js in valid_judges if js.score is not None]" (this narrows the type to
list[float]) and then compute aggregated from scores as before; adjust any
surrounding code that assumed Optional to rely on the narrowed scores list and
aggregated variable.
- Around line 711-724: The current broad except in evaluator.py swallows all
Exceptions and treats configuration/handler bugs as per-judge failures; change
the except to only catch the expected custom exceptions from
core.system.exceptions (e.g., the specific exception classes used for
judge/handler errors) around the judge evaluation block (the code invoking
_get_handler_for_judge() and the JudgeScore creation), logging and recording a
JudgeScore only for those expected exceptions, and re-raise any other unexpected
exceptions so they surface to the caller; ensure you import the required
exception classes from core.system.exceptions and keep token counts and
JudgeScore behavior identical when handling the expected exceptions.
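The strategy-aware aggregation the first comment asks for could look roughly like the sketch below. This is not the project's code; it also demonstrates the None-filtering from the fourth comment, which narrows Optional[float] to float without a `# type: ignore`:

```python
from statistics import multimode
from typing import Optional

def aggregate(scores: list[Optional[float]], strategy: str = "average") -> Optional[float]:
    """Aggregate judge scores per strategy; returns None when no judge scored."""
    valid = [s for s in scores if s is not None]  # narrows Optional[float] to float
    if not valid:
        return None  # all judges failed
    if strategy == "average":
        return sum(valid) / len(valid)
    if strategy == "max":
        return max(valid)
    if strategy == "majority_vote":
        # Rounded modal score, or None when there is no single consensus value.
        modes = multimode(round(s, 1) for s in valid)
        return modes[0] if len(modes) == 1 else None
    raise ValueError(f"Unsupported aggregation_strategy: {strategy}")
```

Branching on the configured strategy here is what keeps non-default panel configs from silently falling back to averaging.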

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ebe174e9-2ebb-4dba-b455-501507ba9111

📥 Commits

Reviewing files that changed from the base of the PR and between 34cbae4 and 2ccbdc8.

📒 Files selected for processing (11)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/llm/manager.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/llm/test_llm_manager.py
  • tests/unit/core/models/test_data.py
  • tests/unit/pipeline/evaluation/conftest.py
  • tests/unit/pipeline/evaluation/test_evaluator.py

@asamal4 force-pushed the integrate-judge-panel branch from 2ccbdc8 to aa4f415 on March 10, 2026 at 23:42
@asamal4 marked this pull request as ready for review on March 10, 2026 at 23:54
@asamal4 changed the title from feat: integrate panel of judges to [DRAFT] feat: integrate panel of judges on Mar 10, 2026
@coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

291-301: ⚠️ Potential issue | 🟠 Major

Don’t truncate expected_response lists when a panel is active.

This still changes list-based metrics from “best/pass across references” to “first reference only” whenever panel evaluation is enabled, which can flip PASS/FAIL on existing datasets. Preserve the existing multi-reference flow or fail fast until NxM evaluation is implemented.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py` around lines 291
- 301, The current branch in evaluate where uses_panel and
multiple_expected_responses causes truncation to the first expected_response via
_evaluate_panel_single_expected_response; instead preserve multi-reference
semantics by either calling the existing multi-reference evaluation path (the
same flow used when multiple_expected_responses is true and uses_panel is false)
so metrics remain “best/pass across references”, or, if NxM panel-vs-references
evaluation isn't implemented, fail fast by raising a clear exception (e.g.,
NotImplementedError/ValueError) rather than silently using only the first
reference; update the conditional around uses_panel &
multiple_expected_responses and remove the fallback to
_evaluate_panel_single_expected_response or replace it with a fail-fast error
until proper NxM support is added.
src/lightspeed_evaluation/pipeline/evaluation/judges.py (1)

255-266: ⚠️ Potential issue | 🟠 Major

Honor aggregation_strategy or reject non-average configs.

JudgePanelConfig accepts max and majority_vote, but this path still averages every multi-judge result after only logging a warning. Any non-default panel config will therefore return the wrong score/status while looking supported.

Also applies to: 279-282

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py` around lines 255 -
266, The current code merely logs a warning for non-"average"
aggregation_strategy but continues to average results; update the checks in
judges.py (the block using self.llm_manager.system_config.judge_panel and the
similar block at 279-282) to reject unsupported strategies instead of silently
falling back: if aggregation_strategy is not "average", raise a clear
configuration error (e.g., ValueError) that includes the invalid strategy and
acceptable options ("average", "max", "majority_vote"), so callers cannot assume
unsupported strategies are honored; alternatively, implement proper handling for
"max" and "majority_vote" and wire that behaviour into the existing aggregation
code paths (refer to the aggregation_strategy variable and any aggregation
functions used thereafter).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 95-101: Replace the builtin exceptions with the project's custom
exception from core.system.exceptions: import and raise the appropriate
exception (e.g., UnsupportedLLMFrameworkError) instead of ValueError in the
evaluator selection branch that returns RagasMetrics / DeepEvalMetrics /
CustomMetrics (the "framework" check and the raise ValueError(f\"Unsupported LLM
framework for panel: {framework}\") line), and likewise replace any RuntimeError
uses mentioned (e.g., at the other occurrence around lines 555-557) with the
same project exception so failures are classifiable by the rest of the package.

In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py`:
- Around line 198-212: The current except block in judges.py that catches
Exception and returns a synthetic JudgeScore (using judge_id,
token_tracker.get_counts(), and JudgeScore) should instead re-raise the
exception for single-judge failures so upstream JudgeOrchestrator/
MetricsEvaluator can stop on errors; only collapse the exception into a
JudgeScore when running in panel mode (i.e., check the panel-mode flag or
context provided to the evaluator). Update the except Exception as e block to
call token_tracker.get_counts() to capture counts, and then: if panel_mode (or
the existing panel indicator) is True, return the synthetic JudgeScore as
before; otherwise re-raise e (or wrap/raise a clear EvaluationError that
includes the token counts) so the error is propagated.
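The error policy described for judges.py, re-raise for a single judge but collapse to a failed score only in panel mode, can be sketched as below (EvaluationError, run_judge, and panel_mode are hypothetical names, not the project's API):

```python
from typing import Callable, Optional

class EvaluationError(Exception):
    """Stand-in for the expected custom exceptions in core.system.exceptions."""

def run_judge(
    evaluate: Callable[[], float], judge_id: str, panel_mode: bool
) -> tuple[str, Optional[float]]:
    """Run one judge; collapse expected errors to score=None only in panel mode."""
    try:
        return judge_id, evaluate()
    except EvaluationError:
        if panel_mode:
            return judge_id, None  # record a failed JudgeScore, keep the panel going
        raise  # single judge: propagate so the evaluator stops on the error
```

Catching only the expected custom exception type (rather than bare Exception) is what lets genuine configuration or handler bugs still surface to the caller.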

---

Duplicate comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 291-301: The current branch in evaluate where uses_panel and
multiple_expected_responses causes truncation to the first expected_response via
_evaluate_panel_single_expected_response; instead preserve multi-reference
semantics by either calling the existing multi-reference evaluation path (the
same flow used when multiple_expected_responses is true and uses_panel is false)
so metrics remain “best/pass across references”, or, if NxM panel-vs-references
evaluation isn't implemented, fail fast by raising a clear exception (e.g.,
NotImplementedError/ValueError) rather than silently using only the first
reference; update the conditional around uses_panel &
multiple_expected_responses and remove the fallback to
_evaluate_panel_single_expected_response or replace it with a fail-fast error
until proper NxM support is added.

In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py`:
- Around line 255-266: The current code merely logs a warning for non-"average"
aggregation_strategy but continues to average results; update the checks in
judges.py (the block using self.llm_manager.system_config.judge_panel and the
similar block at 279-282) to reject unsupported strategies instead of silently
falling back: if aggregation_strategy is not "average", raise a clear
configuration error (e.g., ValueError) that includes the invalid strategy and
acceptable options ("average", "max", "majority_vote"), so callers cannot assume
unsupported strategies are honored; alternatively, implement proper handling for
"max" and "majority_vote" and wire that behaviour into the existing aggregation
code paths (refer to the aggregation_strategy variable and any aggregation
functions used thereafter).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c3eae248-6c4e-4e7b-8c61-0d3686820d26

📥 Commits

Reviewing files that changed from the base of the PR and between 2ccbdc8 and aa4f415.

📒 Files selected for processing (14)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/llm/manager.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/pipeline/evaluation/judges.py
  • tests/unit/core/llm/test_llm_manager.py
  • tests/unit/core/models/test_data.py
  • tests/unit/core/models/test_system.py
  • tests/unit/pipeline/evaluation/conftest.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
🚧 Files skipped from review as they are similar to previous changes (8)
  • tests/unit/pipeline/evaluation/conftest.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/constants.py
  • tests/unit/core/llm/test_llm_manager.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • tests/unit/core/models/test_data.py
  • src/lightspeed_evaluation/core/llm/manager.py

@asamal4 force-pushed the integrate-judge-panel branch from aa4f415 to 460ecaf on March 11, 2026 at 11:16

@coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit/pipeline/evaluation/test_evaluator.py (1)

767-778: ⚠️ Potential issue | 🟡 Minor

Lock in the partial-error status explicitly.

This test now validates the score/token aggregation path, but it no longer proves the final status stays FAIL. It also still reads like an ERROR case in the name/docstring, which makes the contract misleading.

💡 Suggested tightening
-    def test_evaluate_multiple_expected_responses_error_preserves_tokens(
+    def test_evaluate_multiple_expected_responses_partial_error_accumulates_tokens(
         self,
         config_loader: ConfigLoader,
         mock_metric_manager: MetricManager,
         mock_script_manager: ScriptExecutionManager,
         mocker: MockerFixture,
     ) -> None:
-        """Test token preservation when error occurs during multiple expected responses evaluation.
+        """Test token accumulation when a later expected-response iteration errors.

         Scenario: First iteration succeeds with tokens, second iteration fails.
-        Expected: Error result should preserve tokens from first iteration.
+        Expected: The final result stays FAIL, preserves the best score, and
+        accumulates tokens from both iterations.
         """
         assert result is not None
+        assert result.result == "FAIL"
         assert result.score == 0.3  # From first iteration

Also applies to: 843-851

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/pipeline/evaluation/test_evaluator.py` around lines 767 - 778,
Rename/update the test
test_evaluate_multiple_expected_responses_error_preserves_tokens (and the
similar sibling test around the other block) and its docstring to reflect a
partial-failure/FAIL final status, and explicitly assert that the final
evaluation status equals FAIL after the second iteration fails; ensure the test
sets/locks the partial-error state (e.g., by simulating the second iteration
error via the mocked evaluator/ScriptExecutionManager) and then checks tokens
were preserved while also asserting result.status == FAIL to guarantee the
contract.
🧹 Nitpick comments (1)
src/lightspeed_evaluation/pipeline/evaluation/judges.py (1)

276-284: Remove redundant filtering of valid scores.

Line 276 filters for js.score is not None, then line 283 re-applies the same filter. Simplify by extracting scores directly from valid_scores.

♻️ Suggested simplification
         # Multiple judges: filter valid scores and aggregate
         valid_scores = [js for js in judge_scores if js.score is not None]
 
         if not valid_scores:
             # All judges failed
             return None, "All judges failed to produce a score"
 
         # Multiple judges: average (type-safe extraction)
-        scores = [js.score for js in valid_scores if js.score is not None]
+        scores: list[float] = [js.score for js in valid_scores]  # type: ignore[misc]
         avg_score = sum(scores) / len(scores)

Alternatively, use a type guard or cast since valid_scores already guarantees non-None scores.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py` around lines 276 -
284, valid_scores already contains only entries with non-None scores, so remove
the redundant filtering when building scores: replace the list comprehension
that re-checks "if js.score is not None" with one that extracts scores directly
from valid_scores (e.g., scores = [js.score for js in valid_scores]), and ensure
any necessary type hint/cast is applied so avg_score = sum(scores) / len(scores)
remains type-safe; update references in the same function where valid_scores,
scores, and avg_score are used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@tests/unit/pipeline/evaluation/test_evaluator.py`:
- Around line 767-778: Rename/update the test
test_evaluate_multiple_expected_responses_error_preserves_tokens (and the
similar sibling test around the other block) and its docstring to reflect a
partial-failure/FAIL final status, and explicitly assert that the final
evaluation status equals FAIL after the second iteration fails; ensure the test
sets/locks the partial-error state (e.g., by simulating the second iteration
error via the mocked evaluator/ScriptExecutionManager) and then checks tokens
were preserved while also asserting result.status == FAIL to guarantee the
contract.

---

Nitpick comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py`:
- Around line 276-284: valid_scores already contains only entries with non-None
scores, so remove the redundant filtering when building scores: replace the list
comprehension that re-checks "if js.score is not None" with one that extracts
scores directly from valid_scores (e.g., scores = [js.score for js in
valid_scores]), and ensure any necessary type hint/cast is applied so avg_score
= sum(scores) / len(scores) remains type-safe; update references in the same
function where valid_scores, scores, and avg_score are used.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 654e25a6-1e9c-49b5-b58b-ce7c1074c20f

📥 Commits

Reviewing files that changed from the base of the PR and between aa4f415 and 460ecaf.

📒 Files selected for processing (14)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/llm/manager.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/pipeline/evaluation/judges.py
  • tests/unit/core/llm/test_llm_manager.py
  • tests/unit/core/models/test_data.py
  • tests/unit/core/models/test_system.py
  • tests/unit/pipeline/evaluation/conftest.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/system/loader.py
  • tests/unit/core/llm/test_llm_manager.py
  • src/lightspeed_evaluation/core/models/__init__.py

@asamal4 force-pushed the integrate-judge-panel branch from 460ecaf to 5985708 on March 11, 2026 at 12:42

@coderabbitai Bot left a comment

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_evaluation/core/models/system.py`:
- Around line 583-589: The aggregation_strategy Field default should be reverted
to "average" (change the Field default in the aggregation_strategy declaration)
so configs match the intended rollout; update any docstring to reflect that
"average" is the default and that "max" and "majority_vote" are not yet
implemented. Also ensure evaluator.py::_aggregate_scores continues to compute
with average by default and only emits a warning (not switch behavior) when
aggregation_strategy is "max" or "majority_vote" so those branches do not alter
the current averaging logic.

In `@src/lightspeed_evaluation/core/system/loader.py`:
- Around line 164-172: The loader currently does **config_data["llm_pool"] and
**config_data["judge_panel"] without guarding, which will raise TypeError if
those keys exist but are None or non-mapping; update the parsing in loader.py so
that before calling LLMPoolConfig(**...) and JudgePanelConfig(**...) you check
that config_data.get("llm_pool") and config_data.get("judge_panel") are mappings
(e.g., isinstance(value, dict) or collections.abc.Mapping); if they are
mappings, pass them into LLMPoolConfig and JudgePanelConfig respectively,
otherwise leave llm_pool / judge_panel as None (or raise a clear validation
error) so the loader doesn't crash on **None or non-mapping values.

In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 552-559: Replace the RuntimeError in the validation block inside
evaluator.py with a custom exception from core.system.exceptions (e.g.,
EvaluationError or DataValidationError): import the chosen exception at the top
of the file, then raise that exception when evaluation_scope.turn_data or
evaluation_scope.turn_data.expected_response is missing, including the same
message (use request.metric_identifier in the message) to preserve context;
update any tests or callers if they assert RuntimeError to expect the new custom
exception.

In `@tests/unit/core/models/test_data.py`:
- Around line 579-584: The inline pylint suppression should be removed; instead,
after asserting result.judge_scores is not None in the test, assign a typed
local like judge_scores: Sequence[YourJudgeScoreType] = result.judge_scores (or
use the concrete type used elsewhere) and replace subsequent references
(result.judge_scores[0], result.judge_scores[1], etc.) with judge_scores[0],
judge_scores[1], ensuring imports/types are available so the linter is
satisfied; remove the "# pylint: disable=unsubscriptable-object" comment.
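The narrow-once pattern suggested above, sketched with assumed model shapes (the real `JudgeScore` and result types may differ):

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class JudgeScore:
    """Assumed shape of the per-judge score model."""
    score: Optional[float]

@dataclass
class Result:
    judge_scores: Optional[Sequence[JudgeScore]] = None

result = Result(judge_scores=[JudgeScore(0.9), JudgeScore(0.7)])

assert result.judge_scores is not None
# Narrow once into a typed local; indexing it no longer trips the linter,
# so the inline pylint disable can go.
judge_scores: Sequence[JudgeScore] = result.judge_scores
assert judge_scores[0].score == 0.9
assert judge_scores[1].score == 0.7
```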

In `@tests/unit/pipeline/evaluation/test_judges.py`:
- Around line 82-110: Update the tests to reflect that
JudgeOrchestrator.aggregate_scores uses the "average" aggregation by default: in
test_multiple_judges_returns_max change expected score from 0.9 to the average
of the three scores (0.75) and assert the reason contains "Average of 3 judges"
and the formatted average (e.g., "0.750"); in
test_multiple_judges_with_one_failure change expected score from 0.85 to the
average of the two valid scores ((0.7+0.85)/2 = 0.775) and assert the reason
contains "Average of 2 judges". Also apply the same adjustment to the other
related assertions mentioned in the review (the later tests covering lines
127-186) so they check for average-based scores/reasoning rather than
max/majority behavior.
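The arithmetic behind the updated assertions can be checked with a small sketch; the reason format mirrors the review's expected strings but the real implementation may word it differently:

```python
def aggregate_average(scores):
    """Average the valid (non-None) judge scores; assumed reason format."""
    valid = [s for s in scores if s is not None]
    avg = sum(valid) / len(valid)
    return avg, f"Average of {len(valid)} judges: {avg:.3f}"

# One judge failed (None): the average covers only the two valid scores,
# matching the review's expected (0.7 + 0.85) / 2 = 0.775.
score, reason = aggregate_average([0.7, None, 0.85])
```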

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3f7661c1-2815-496a-8022-6b2528be748b

📥 Commits

Reviewing files that changed from the base of the PR and between 460ecaf and 5985708.

📒 Files selected for processing (15)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/llm/manager.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/pipeline/evaluation/judges.py
  • tests/unit/core/llm/test_llm_manager.py
  • tests/unit/core/models/test_data.py
  • tests/unit/core/models/test_system.py
  • tests/unit/pipeline/evaluation/conftest.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/pipeline/evaluation/test_judges.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/lightspeed_evaluation/core/constants.py
  • tests/unit/pipeline/evaluation/conftest.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • src/lightspeed_evaluation/core/output/generator.py

Comment thread src/lightspeed_evaluation/core/models/system.py
Comment thread src/lightspeed_evaluation/core/system/loader.py Outdated
Comment thread src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
Comment thread tests/unit/core/models/test_data.py
Comment thread tests/unit/pipeline/evaluation/test_judges.py
Comment thread tests/unit/pipeline/evaluation/test_judges.py
@asamal4 asamal4 force-pushed the integrate-judge-panel branch from 5985708 to 7a57043 on March 11, 2026 13:32
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_evaluation/core/llm/manager.py`:
- Around line 170-184: The current get_judges_for_metric returns the first panel
judge when should_use_panel_for_metric is false, causing non-panel metrics to
use judge_managers[0]; change get_judges_for_metric so that when
should_use_panel_for_metric(metric_identifier) is false it returns the legacy
top-level LLM from system_config (e.g., [self.system_config.llm]) instead of
[self.get_primary_judge()]; keep the panel branch returning
self.get_judge_managers() and ensure the return type remains a list of
LLMManager instances.
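A hypothetical sketch of the suggested fallback; class, attribute, and method names are assumptions modeled on the review's description:

```python
from types import SimpleNamespace

class LLMManagerPool:
    """Selects judges per metric, falling back to the legacy top-level LLM."""

    def __init__(self, legacy_llm, panel_judges, panel_metrics):
        self.legacy_llm = legacy_llm        # top-level llm from system config
        self.panel_judges = panel_judges
        self.panel_metrics = set(panel_metrics)

    def should_use_panel_for_metric(self, metric_identifier):
        return metric_identifier in self.panel_metrics

    def get_judges_for_metric(self, metric_identifier):
        if self.should_use_panel_for_metric(metric_identifier):
            return list(self.panel_judges)
        # Non-panel metrics fall back to the legacy top-level LLM,
        # not to the first panel judge.
        return [self.legacy_llm]

legacy = SimpleNamespace(name="legacy-llm")
panel = [SimpleNamespace(name="gpt-4o"), SimpleNamespace(name="claude")]
pool = LLMManagerPool(legacy, panel, {"ragas:answer_correctness"})
```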

In `@src/lightspeed_evaluation/core/system/loader.py`:
- Around line 164-174: The loader currently ignores non-dict non-None values for
"llm_pool" and "judge_panel"; change it to validate types and raise a custom
configuration exception from core.system.exceptions instead of silently dropping
them: import the appropriate exception (e.g., InvalidConfigError or
ConfigValidationError) from core.system.exceptions at the top of the file, then
for each value use config_data.get("llm_pool") and
config_data.get("judge_panel") and if the value is None keep behaving as before,
if isinstance(..., dict) instantiate LLMPoolConfig(**llm_pool_data) /
JudgePanelConfig(**judge_panel_data), but if the value is not None and not a
dict raise the imported exception with a clear message referencing the offending
key ("llm_pool" or "judge_panel") and the actual type/value so callers fail fast
on typos or wrong types.
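A fail-fast sketch of the validation above; `ConfigValidationError` and `JudgePanelConfig` are hypothetical stand-ins for the project's real exception and model:

```python
from dataclasses import dataclass, field

class ConfigValidationError(Exception):
    """Stand-in for the exception exported by core.system.exceptions."""

@dataclass
class JudgePanelConfig:
    """Hypothetical stand-in; field names are assumptions."""
    judges: list = field(default_factory=list)

def parse_judge_panel(config_data: dict):
    value = config_data.get("judge_panel")
    if value is None:
        return None                        # absent/null keeps prior behavior
    if isinstance(value, dict):
        return JudgePanelConfig(**value)
    # Anything else is a config typo: fail fast with a clear message.
    raise ConfigValidationError(
        f"'judge_panel' must be a mapping, got {type(value).__name__}: {value!r}"
    )
```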

In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py`:
- Around line 89-93: The debug log that currently formats the message string
should be converted to structured logging by passing those values as named
fields rather than embedding them in the message: include metric_identifier from
request.metric_identifier, judge_count from len(judge_managers), judge_ids (map
of each judge_manager.id or .judge.id), per-judge input_tokens/output_tokens
(from each judge_manager or its token counters), and aggregation_strategy (the
variable used where aggregation is chosen) as explicit log attributes; update
the logger.debug call(s) around the evaluation entry (the current logger.debug
using request.metric_identifier and judge_managers) and the other noted logging
sites (the blocks that reference judge managers, token usage, and aggregation)
to emit these structured fields via your logging system’s structured API (e.g.,
logger.debug(..., extra={...}) or logger.bind(...)). Make sure field names match
metric_identifier, judge_id(s), judge_count, input_tokens, output_tokens, and
aggregation_strategy so downstream log processors can parse them.
- Around line 202-218: The except block in _evaluate_single_judge currently only
catches EvaluationError so other exceptions (e.g., ConfigurationError or
unexpected bugs) escape; change it to catch Exception (or a broader base) so all
failures are converted into a JudgeScore(score=None) and returned to
aggregate_scores for uniform handling; update the except clause around the
logger.error and the JudgeScore(...) return to handle Exception instead of
EvaluationError, keeping the logging of judge_id and the error and preserving
token_tracker.get_counts(), reason, input_tokens, and output_tokens.
- Around line 123-149: The helper signatures currently use inline pylint
disables in _evaluate_all_judges and _evaluate_single_judge because they accept
separate framework and metric_name arguments; remove the pylint disable by
shrinking the parameter list: stop passing framework and metric_name and instead
derive them inside the helpers from request.metric_identifier (or create and
pass a tiny context object that contains both values) so the signatures only
take (self, judge_manager(s), request, evaluation_scope, token_tracker) and
update all call sites accordingly; remove the "# pylint:
disable=too-many-arguments,too-many-positional-arguments" comments from the
function definitions after the change.
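The three suggestions above (structured log fields, a broad except that normalizes failures to `JudgeScore(score=None)`, and a slimmer signature that derives framework/metric from the identifier) can be sketched in one helper; all names and the "framework:metric" identifier format are assumptions:

```python
import logging
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional

logger = logging.getLogger("judges")

@dataclass
class JudgeScore:
    """Assumed shape of the per-judge result model."""
    score: Optional[float]
    reason: str = ""

def split_metric_identifier(metric_identifier: str):
    """Derive (framework, metric_name) from 'framework:metric'."""
    framework, _, metric_name = metric_identifier.partition(":")
    return framework, metric_name

def evaluate_single_judge(judge_manager, request, evaluation_scope, token_tracker):
    # Slim signature: framework/metric_name are derived, not passed in.
    framework, metric_name = split_metric_identifier(request.metric_identifier)
    logger.debug(
        "evaluating judge",
        extra={  # structured fields instead of a formatted message
            "metric_identifier": request.metric_identifier,
            "judge_id": judge_manager.judge_id,
        },
    )
    try:
        score = judge_manager.evaluate(framework, metric_name, evaluation_scope)
        return JudgeScore(score=score, reason="ok")
    except Exception as exc:  # pylint: disable=broad-except
        # Broad catch: any failure, not just EvaluationError, becomes
        # JudgeScore(score=None) so aggregate_scores handles judges uniformly.
        logger.error("judge %s failed: %s", judge_manager.judge_id, exc)
        return JudgeScore(score=None, reason=str(exc))

ok_judge = SimpleNamespace(judge_id="gpt-4o", evaluate=lambda f, m, s: 0.9)
bad_judge = SimpleNamespace(judge_id="broken", evaluate=lambda f, m, s: 1 / 0)
request = SimpleNamespace(metric_identifier="ragas:answer_correctness")
```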

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d85b507f-2658-4687-87e1-110a660c3721

📥 Commits

Reviewing files that changed from the base of the PR and between 5985708 and 7a57043.

📒 Files selected for processing (15)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/llm/manager.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/pipeline/evaluation/judges.py
  • tests/unit/core/llm/test_llm_manager.py
  • tests/unit/core/models/test_data.py
  • tests/unit/core/models/test_system.py
  • tests/unit/pipeline/evaluation/conftest.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/pipeline/evaluation/test_judges.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • tests/unit/pipeline/evaluation/conftest.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/models/data.py
  • tests/unit/pipeline/evaluation/test_judges.py
  • src/lightspeed_evaluation/core/models/system.py

Comment thread src/lightspeed_evaluation/core/llm/manager.py
Comment thread src/lightspeed_evaluation/core/system/loader.py Outdated
Comment thread src/lightspeed_evaluation/pipeline/evaluation/judges.py
Comment thread src/lightspeed_evaluation/pipeline/evaluation/judges.py
Comment thread src/lightspeed_evaluation/pipeline/evaluation/judges.py
@asamal4 asamal4 force-pushed the integrate-judge-panel branch from 7a57043 to 8d603e3 on March 11, 2026 14:21
@asamal4 asamal4 changed the title [DRAFT] feat: integrate panel of judges feat: integrate panel of judges Mar 11, 2026
Collaborator

@bsatapat-jpg bsatapat-jpg left a comment


LGTM. Thank you

Member

@VladimirKadlec VladimirKadlec left a comment


LGTM, nice work 💪 , thank you.
One non-blocking comment.

Comment thread src/lightspeed_evaluation/pipeline/evaluation/judges.py
@asamal4 asamal4 merged commit 17da61a into lightspeed-core:main Mar 12, 2026
16 checks passed
@asamal4 asamal4 changed the title feat: integrate panel of judges [LEADS-192] feat: integrate panel of judges Mar 12, 2026
emac-E pushed a commit to emac-E/lightspeed-evaluation that referenced this pull request Apr 10, 2026