
[LEADS-192] feat: integrate panel of judges#184

Merged
asamal4 merged 1 commit into lightspeed-core:main from asamal4:integrate-judge-panel
Mar 12, 2026

Conversation

@asamal4 (Collaborator) commented Mar 10, 2026

Description

  • Judge panel integration driven by the new config
  • Backward compatibility when an existing llm config is given
  • Maintains one row per metric with an aggregated (currently max) result
  • New column to track each individual judge LLM's score/response/token usage
  • Only one expected response is considered

There is room for further improvement, but this is already a large change.
Documentation will be updated once all the basic requirements are in place.
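For context, here is one possible shape of the new configuration. The top-level keys llm_pool, judge_panel, and aggregation_strategy appear in this PR; the nested provider/model fields and the judges list are placeholders, not the project's actual schema:

```python
# Illustrative only: a possible shape of the judge-panel configuration.
# Nested fields are assumptions inferred from the PR summary.
system_config = {
    "llm_pool": {
        "primary": {"provider": "openai", "model": "gpt-4o-mini"},
        "secondary": {"provider": "watsonx", "model": "granite-3"},
    },
    "judge_panel": {
        "judges": ["primary", "secondary"],
        "aggregation_strategy": "max",  # current default per the description
    },
}
```

When judge_panel is omitted, the existing single-judge llm config keeps working, which is the backward-compatibility path described above.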

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Cursor

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Multi-judge panel evaluation with per-judge scoring, aggregation (default "max"), and metric-based judge selection.
    • Per-judge score model exposed and system config now supports judge panels and LLM pools.
  • Bug Fixes / Outputs

    • CSV and JSON summaries now include serialized per-judge scores.
  • Refactor

    • Evaluation flow split into panel vs single-judge paths with orchestrated per-judge handlers.
  • Tests

    • Added/updated tests for judge panels, scoring, evaluator flows, and backward compatibility.

@asamal4 force-pushed the integrate-judge-panel branch from c545958 to 2ccbdc8 on March 10, 2026 at 15:24
@coderabbitai Bot (Contributor) commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds multi-judge panel evaluation: new JudgeScore model and CSV/JSON field, judge-panel-aware LLMManager (judge_id, get_judges_for_metric), JudgeOrchestrator to run per-judge evaluations and aggregate scores (max), SystemConfig parsing for llm_pool/judge_panel, evaluator routing for panel/non‑LLM flows, and tests/serialization.

Changes

  • Constants & CSV output (src/lightspeed_evaluation/core/constants.py, src/lightspeed_evaluation/core/output/generator.py): Added NON_LLM_FRAMEWORKS = frozenset({"nlp", "script"}); added "judge_scores" to the supported CSV columns; judge_scores is serialized as JSON in CSV/JSON outputs.
  • Data models & exports (src/lightspeed_evaluation/core/models/data.py, src/lightspeed_evaluation/core/models/__init__.py): Added the JudgeScore Pydantic model and judge_scores: Optional[list[JudgeScore]] on MetricResult; re-exported JudgeScore in the package __all__.
  • LLM manager & system model (src/lightspeed_evaluation/core/llm/manager.py, src/lightspeed_evaluation/core/models/system.py, src/lightspeed_evaluation/core/system/loader.py): LLMManager gains judge_id (default "primary"), constructs per-pool judge managers keyed by pool IDs, and adds get_judges_for_metric(). SystemConfig.get_judge_configs() now returns list[tuple[str, LLMConfig]]; the loader parses optional llm_pool and judge_panel.
  • Judge orchestration (src/lightspeed_evaluation/pipeline/evaluation/judges.py): New JudgeOrchestrator class that orchestrates per-judge evaluations, caches/creates per-judge handlers via a factory, handles per-judge errors (score=None), accumulates tokens, and aggregates scores using the max strategy.
  • Evaluator integration (src/lightspeed_evaluation/pipeline/evaluation/evaluator.py): Integrated judge orchestration: added system_config, llm_manager, embedding_manager, judge_orchestrator, _create_handler_for_judge, _will_use_panel, and panel-aware evaluation paths, including panel-only single-expected-response handling and non-LLM behavior adjustments.
  • System config defaults & validation (src/lightspeed_evaluation/core/models/system.py): Default aggregation_strategy changed to "max"; get_judge_configs() now yields (pool_key, LLMConfig) tuples and raises when judge_panel exists without llm_pool.
  • Tests & fixtures (tests/unit/core/llm/test_llm_manager.py, tests/unit/core/models/test_data.py, tests/unit/core/models/test_system.py, tests/unit/pipeline/evaluation/conftest.py, tests/unit/pipeline/evaluation/test_evaluator.py, tests/unit/pipeline/evaluation/test_judges.py): Added tests for LLMManager backward compatibility and for JudgeScore validation and MetricResult integration; adjusted get_judge_configs tests for tuple returns; added a create_mock_llm_manager helper; added new JudgeOrchestrator tests (aggregation, caching); updated evaluator tests to use the new mocking helper with adjusted assertions.
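The per-judge score shape and its CSV serialization described above can be sketched roughly as follows, using a plain dataclass stand-in for the actual Pydantic JudgeScore model (field names beyond judge_id and score are assumptions):

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class JudgeScore:
    """Stand-in for the Pydantic model added in core/models/data.py."""
    judge_id: str
    score: Optional[float] = None  # None when this judge errored
    reason: str = ""

# One metric row keeps an aggregated score plus the individual judge scores;
# the list is serialized as a JSON string in the new "judge_scores" CSV column.
judge_scores = [
    JudgeScore(judge_id="primary", score=0.9, reason="grounded"),
    JudgeScore(judge_id="secondary", score=None, reason="judge errored"),
]
csv_cell = json.dumps([asdict(js) for js in judge_scores])
```

Serializing to a single JSON cell keeps the one-row-per-metric CSV layout while still exposing each judge's result.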

Sequence Diagram(s)

sequenceDiagram
    participant Evaluator
    participant LLMManager
    participant JudgeOrchestrator
    participant MetricHandler
    participant OutputGenerator

    Evaluator->>LLMManager: get_judges_for_metric(metric_id)
    LLMManager->>LLMManager: should_use_panel_for_metric(metric_id)?
    alt panel enabled
        LLMManager-->>Evaluator: [judge_manager...]
        Evaluator->>JudgeOrchestrator: evaluate_with_judges(request, scope, token_tracker, threshold)
        JudgeOrchestrator->>JudgeOrchestrator: for each judge_manager
        loop per judge
            JudgeOrchestrator->>MetricHandler: get/instantiate handler (handler_factory)
            MetricHandler->>JudgeOrchestrator: evaluate(single_response)
            JudgeOrchestrator->>JudgeOrchestrator: record JudgeScore (score, reason, tokens)
        end
        JudgeOrchestrator->>JudgeOrchestrator: aggregate_scores(judge_scores)  %% uses max
        JudgeOrchestrator-->>Evaluator: MetricResult (aggregated score + judge_scores)
    else no panel / non-LLM
        LLMManager-->>Evaluator: primary judge manager
        Evaluator->>MetricHandler: evaluate(standard or non-LLM path)
        MetricHandler-->>Evaluator: MetricResult
    end
    Evaluator->>OutputGenerator: serialize MetricResult (include judge_scores)
    OutputGenerator-->>Evaluator: CSV/JSON output

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks: ✅ 3 passed

  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'feat: integrate panel of judges' directly and clearly summarizes the main objective of the changeset: integrating a judge panel feature while maintaining backward compatibility.


@asamal4 (Collaborator, Author) commented Mar 10, 2026

@coderabbitai review

@coderabbitai Bot (Contributor) commented Mar 10, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 778-804: The _aggregate_scores method currently always computes an
average and ignores JudgePanelConfig.aggregation_strategy, causing non-average
strategies (max, majority_vote) to be silently ignored; update _aggregate_scores
to branch on the aggregation_strategy value from the panel config (e.g.,
"average", "max", "majority_vote") and implement: for "average" compute mean of
non-None JudgeScore.score entries, for "max" return the maximum score and a
reason indicating which judge(s) produced it, and for "majority_vote" compute
the rounded modal score (or None if no consensus) with an explanatory reason;
ensure you still handle the single-judge shortcut and the case of no valid
scores, and reference JudgeScore.score and JudgeScore.reason when building
aggregated reason strings.
- Around line 680-698: The code uses judge_manager.config.model as the
cache/output identifier which can collide after resolve_llm_config(); change all
uses that identify/cached judges (including the judge_id assigned to JudgeScore
and any cache key inside _get_handler_for_judge) to use the llm_pool key from
the judge config (e.g., judge_manager.config.llm_pool or the config property
that represents the pool key) instead of judge_manager.config.model so handlers
are keyed by the unique llm_pool entry and scores remain distinct.
- Around line 271-281: The current branch that handles uses_panel and
multiple_expected_responses silently drops extra references by calling
_evaluate_panel_single_expected_response; instead, fail fast or preserve
multi-reference semantics: modify the branch in evaluator.py (the if uses_panel
and multiple_expected_responses block) to raise a clear exception (e.g.,
NotImplementedError or RuntimeError) when a judge panel is enabled but multiple
expected_responses are present, with a message saying NxM evaluation for panels
is not implemented, or alternatively call/implement the existing multi-reference
evaluation path (the multi-reference evaluator) so metrics that accept lists
keep their “best/pass across references” behavior rather than truncating to the
first reference. Ensure the error message mentions
request.conv_data.conversation_group_id and the conflicting flags so callers can
fix their config.
- Around line 800-801: The comprehension currently leaves js.score as
Optional[float], so remove the "# type: ignore" by filtering out None values in
the comprehension: build scores as a list of float by using "scores = [js.score
for js in valid_judges if js.score is not None]" (this narrows the type to
list[float]) and then compute aggregated from scores as before; adjust any
surrounding code that assumed Optional to rely on the narrowed scores list and
aggregated variable.
- Around line 711-724: The current broad except in evaluator.py swallows all
Exceptions and treats configuration/handler bugs as per-judge failures; change
the except to only catch the expected custom exceptions from
core.system.exceptions (e.g., the specific exception classes used for
judge/handler errors) around the judge evaluation block (the code invoking
_get_handler_for_judge() and the JudgeScore creation), logging and recording a
JudgeScore only for those expected exceptions, and re-raise any other unexpected
exceptions so they surface to the caller; ensure you import the required
exception classes from core.system.exceptions and keep token counts and
JudgeScore behavior identical when handling the expected exceptions.
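The strategy-aware aggregation the first comment asks for could look roughly like the sketch below. This is not the project's code; it also demonstrates the None-filtering from the fourth comment, which narrows Optional[float] to float without a `# type: ignore`:

```python
from statistics import multimode
from typing import Optional

def aggregate(scores: list[Optional[float]], strategy: str = "average") -> Optional[float]:
    """Aggregate judge scores per strategy; returns None when no judge scored."""
    valid = [s for s in scores if s is not None]  # narrows Optional[float] to float
    if not valid:
        return None  # all judges failed
    if strategy == "average":
        return sum(valid) / len(valid)
    if strategy == "max":
        return max(valid)
    if strategy == "majority_vote":
        # Rounded modal score, or None when there is no single consensus value.
        modes = multimode(round(s, 1) for s in valid)
        return modes[0] if len(modes) == 1 else None
    raise ValueError(f"Unsupported aggregation_strategy: {strategy}")
```

Branching on the configured strategy here is what keeps non-default panel configs from silently falling back to averaging.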

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ebe174e9-2ebb-4dba-b455-501507ba9111

📥 Commits

Reviewing files that changed from the base of the PR and between 34cbae4 and 2ccbdc8.

📒 Files selected for processing (11)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/llm/manager.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/llm/test_llm_manager.py
  • tests/unit/core/models/test_data.py
  • tests/unit/pipeline/evaluation/conftest.py
  • tests/unit/pipeline/evaluation/test_evaluator.py

@asamal4 force-pushed the integrate-judge-panel branch from 2ccbdc8 to aa4f415 on March 10, 2026 at 23:42
@asamal4 marked this pull request as ready for review on March 10, 2026 at 23:54
@asamal4 changed the title from feat: integrate panel of judges to [DRAFT] feat: integrate panel of judges on Mar 10, 2026
@coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

291-301: ⚠️ Potential issue | 🟠 Major

Don’t truncate expected_response lists when a panel is active.

This still changes list-based metrics from “best/pass across references” to “first reference only” whenever panel evaluation is enabled, which can flip PASS/FAIL on existing datasets. Preserve the existing multi-reference flow or fail fast until NxM evaluation is implemented.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py` around lines 291
- 301, The current branch in evaluate where uses_panel and
multiple_expected_responses causes truncation to the first expected_response via
_evaluate_panel_single_expected_response; instead preserve multi-reference
semantics by either calling the existing multi-reference evaluation path (the
same flow used when multiple_expected_responses is true and uses_panel is false)
so metrics remain “best/pass across references”, or, if NxM panel-vs-references
evaluation isn't implemented, fail fast by raising a clear exception (e.g.,
NotImplementedError/ValueError) rather than silently using only the first
reference; update the conditional around uses_panel &
multiple_expected_responses and remove the fallback to
_evaluate_panel_single_expected_response or replace it with a fail-fast error
until proper NxM support is added.
src/lightspeed_evaluation/pipeline/evaluation/judges.py (1)

255-266: ⚠️ Potential issue | 🟠 Major

Honor aggregation_strategy or reject non-average configs.

JudgePanelConfig accepts max and majority_vote, but this path still averages every multi-judge result after only logging a warning. Any non-default panel config will therefore return the wrong score/status while looking supported.

Also applies to: 279-282

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py` around lines 255 -
266, The current code merely logs a warning for non-"average"
aggregation_strategy but continues to average results; update the checks in
judges.py (the block using self.llm_manager.system_config.judge_panel and the
similar block at 279-282) to reject unsupported strategies instead of silently
falling back: if aggregation_strategy is not "average", raise a clear
configuration error (e.g., ValueError) that includes the invalid strategy and
acceptable options ("average", "max", "majority_vote"), so callers cannot assume
unsupported strategies are honored; alternatively, implement proper handling for
"max" and "majority_vote" and wire that behaviour into the existing aggregation
code paths (refer to the aggregation_strategy variable and any aggregation
functions used thereafter).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 95-101: Replace the builtin exceptions with the project's custom
exception from core.system.exceptions: import and raise the appropriate
exception (e.g., UnsupportedLLMFrameworkError) instead of ValueError in the
evaluator selection branch that returns RagasMetrics / DeepEvalMetrics /
CustomMetrics (the "framework" check and the raise ValueError(f\"Unsupported LLM
framework for panel: {framework}\") line), and likewise replace any RuntimeError
uses mentioned (e.g., at the other occurrence around lines 555-557) with the
same project exception so failures are classifiable by the rest of the package.

In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py`:
- Around line 198-212: The current except block in judges.py that catches
Exception and returns a synthetic JudgeScore (using judge_id,
token_tracker.get_counts(), and JudgeScore) should instead re-raise the
exception for single-judge failures so upstream JudgeOrchestrator/
MetricsEvaluator can stop on errors; only collapse the exception into a
JudgeScore when running in panel mode (i.e., check the panel-mode flag or
context provided to the evaluator). Update the except Exception as e block to
call token_tracker.get_counts() to capture counts, and then: if panel_mode (or
the existing panel indicator) is True, return the synthetic JudgeScore as
before; otherwise re-raise e (or wrap/raise a clear EvaluationError that
includes the token counts) so the error is propagated.
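The error policy described for judges.py, re-raise for a single judge but collapse to a failed score only in panel mode, can be sketched as below (EvaluationError, run_judge, and panel_mode are hypothetical names, not the project's API):

```python
from typing import Callable, Optional

class EvaluationError(Exception):
    """Stand-in for the expected custom exceptions in core.system.exceptions."""

def run_judge(
    evaluate: Callable[[], float], judge_id: str, panel_mode: bool
) -> tuple[str, Optional[float]]:
    """Run one judge; collapse expected errors to score=None only in panel mode."""
    try:
        return judge_id, evaluate()
    except EvaluationError:
        if panel_mode:
            return judge_id, None  # record a failed JudgeScore, keep the panel going
        raise  # single judge: propagate so the evaluator stops on the error
```

Catching only the expected custom exception type (rather than bare Exception) is what lets genuine configuration or handler bugs still surface to the caller.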

---

Duplicate comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 291-301: The current branch in evaluate where uses_panel and
multiple_expected_responses causes truncation to the first expected_response via
_evaluate_panel_single_expected_response; instead preserve multi-reference
semantics by either calling the existing multi-reference evaluation path (the
same flow used when multiple_expected_responses is true and uses_panel is false)
so metrics remain “best/pass across references”, or, if NxM panel-vs-references
evaluation isn't implemented, fail fast by raising a clear exception (e.g.,
NotImplementedError/ValueError) rather than silently using only the first
reference; update the conditional around uses_panel &
multiple_expected_responses and remove the fallback to
_evaluate_panel_single_expected_response or replace it with a fail-fast error
until proper NxM support is added.

In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py`:
- Around line 255-266: The current code merely logs a warning for non-"average"
aggregation_strategy but continues to average results; update the checks in
judges.py (the block using self.llm_manager.system_config.judge_panel and the
similar block at 279-282) to reject unsupported strategies instead of silently
falling back: if aggregation_strategy is not "average", raise a clear
configuration error (e.g., ValueError) that includes the invalid strategy and
acceptable options ("average", "max", "majority_vote"), so callers cannot assume
unsupported strategies are honored; alternatively, implement proper handling for
"max" and "majority_vote" and wire that behaviour into the existing aggregation
code paths (refer to the aggregation_strategy variable and any aggregation
functions used thereafter).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c3eae248-6c4e-4e7b-8c61-0d3686820d26

📥 Commits

Reviewing files that changed from the base of the PR and between 2ccbdc8 and aa4f415.

📒 Files selected for processing (14)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/llm/manager.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/pipeline/evaluation/judges.py
  • tests/unit/core/llm/test_llm_manager.py
  • tests/unit/core/models/test_data.py
  • tests/unit/core/models/test_system.py
  • tests/unit/pipeline/evaluation/conftest.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
🚧 Files skipped from review as they are similar to previous changes (8)
  • tests/unit/pipeline/evaluation/conftest.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/constants.py
  • tests/unit/core/llm/test_llm_manager.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • tests/unit/core/models/test_data.py
  • src/lightspeed_evaluation/core/llm/manager.py

@asamal4 force-pushed the integrate-judge-panel branch from aa4f415 to 460ecaf on March 11, 2026 at 11:16

@coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit/pipeline/evaluation/test_evaluator.py (1)

767-778: ⚠️ Potential issue | 🟡 Minor

Lock in the partial-error status explicitly.

This test now validates the score/token aggregation path, but it no longer proves the final status stays FAIL. It also still reads like an ERROR case in the name/docstring, which makes the contract misleading.

💡 Suggested tightening
-    def test_evaluate_multiple_expected_responses_error_preserves_tokens(
+    def test_evaluate_multiple_expected_responses_partial_error_accumulates_tokens(
         self,
         config_loader: ConfigLoader,
         mock_metric_manager: MetricManager,
         mock_script_manager: ScriptExecutionManager,
         mocker: MockerFixture,
     ) -> None:
-        """Test token preservation when error occurs during multiple expected responses evaluation.
+        """Test token accumulation when a later expected-response iteration errors.

         Scenario: First iteration succeeds with tokens, second iteration fails.
-        Expected: Error result should preserve tokens from first iteration.
+        Expected: The final result stays FAIL, preserves the best score, and
+        accumulates tokens from both iterations.
         """
         assert result is not None
+        assert result.result == "FAIL"
         assert result.score == 0.3  # From first iteration

Also applies to: 843-851

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/pipeline/evaluation/test_evaluator.py` around lines 767 - 778,
Rename/update the test
test_evaluate_multiple_expected_responses_error_preserves_tokens (and the
similar sibling test around the other block) and its docstring to reflect a
partial-failure/FAIL final status, and explicitly assert that the final
evaluation status equals FAIL after the second iteration fails; ensure the test
sets/locks the partial-error state (e.g., by simulating the second iteration
error via the mocked evaluator/ScriptExecutionManager) and then checks tokens
were preserved while also asserting result.status == FAIL to guarantee the
contract.
🧹 Nitpick comments (1)
src/lightspeed_evaluation/pipeline/evaluation/judges.py (1)

276-284: Remove redundant filtering of valid scores.

Line 276 filters for js.score is not None, then line 283 re-applies the same filter. Simplify by extracting scores directly from valid_scores.

♻️ Suggested simplification
         # Multiple judges: filter valid scores and aggregate
         valid_scores = [js for js in judge_scores if js.score is not None]
 
         if not valid_scores:
             # All judges failed
             return None, "All judges failed to produce a score"
 
         # Multiple judges: average (type-safe extraction)
-        scores = [js.score for js in valid_scores if js.score is not None]
+        scores: list[float] = [js.score for js in valid_scores]  # type: ignore[misc]
         avg_score = sum(scores) / len(scores)

Alternatively, use a type guard or cast since valid_scores already guarantees non-None scores.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py` around lines 276 -
284, valid_scores already contains only entries with non-None scores, so remove
the redundant filtering when building scores: replace the list comprehension
that re-checks "if js.score is not None" with one that extracts scores directly
from valid_scores (e.g., scores = [js.score for js in valid_scores]), and ensure
any necessary type hint/cast is applied so avg_score = sum(scores) / len(scores)
remains type-safe; update references in the same function where valid_scores,
scores, and avg_score are used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@tests/unit/pipeline/evaluation/test_evaluator.py`:
- Around line 767-778: Rename/update the test
test_evaluate_multiple_expected_responses_error_preserves_tokens (and the
similar sibling test around the other block) and its docstring to reflect a
partial-failure/FAIL final status, and explicitly assert that the final
evaluation status equals FAIL after the second iteration fails; ensure the test
sets/locks the partial-error state (e.g., by simulating the second iteration
error via the mocked evaluator/ScriptExecutionManager) and then checks tokens
were preserved while also asserting result.status == FAIL to guarantee the
contract.

---

Nitpick comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py`:
- Around line 276-284: valid_scores already contains only entries with non-None
scores, so remove the redundant filtering when building scores: replace the list
comprehension that re-checks "if js.score is not None" with one that extracts
scores directly from valid_scores (e.g., scores = [js.score for js in
valid_scores]), and ensure any necessary type hint/cast is applied so avg_score
= sum(scores) / len(scores) remains type-safe; update references in the same
function where valid_scores, scores, and avg_score are used.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 654e25a6-1e9c-49b5-b58b-ce7c1074c20f

📥 Commits

Reviewing files that changed from the base of the PR and between aa4f415 and 460ecaf.

📒 Files selected for processing (14)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/llm/manager.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/pipeline/evaluation/judges.py
  • tests/unit/core/llm/test_llm_manager.py
  • tests/unit/core/models/test_data.py
  • tests/unit/core/models/test_system.py
  • tests/unit/pipeline/evaluation/conftest.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/system/loader.py
  • tests/unit/core/llm/test_llm_manager.py
  • src/lightspeed_evaluation/core/models/__init__.py

@asamal4 force-pushed the integrate-judge-panel branch from 460ecaf to 5985708 on March 11, 2026 at 12:42

@coderabbitai Bot left a comment

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_evaluation/core/models/system.py`:
- Around line 583-589: The aggregation_strategy Field default should be reverted
to "average" (change the Field default in the aggregation_strategy declaration)
so configs match the intended rollout; update any docstring to reflect that
"average" is the default and that "max" and "majority_vote" are not yet
implemented. Also ensure evaluator.py::_aggregate_scores continues to compute
with average by default and only emits a warning (not switch behavior) when
aggregation_strategy is "max" or "majority_vote" so those branches do not alter
the current averaging logic.

In `@src/lightspeed_evaluation/core/system/loader.py`:
- Around line 164-172: The loader currently does **config_data["llm_pool"] and
**config_data["judge_panel"] without guarding, which will raise TypeError if
those keys exist but are None or non-mapping; update the parsing in loader.py so
that before calling LLMPoolConfig(**...) and JudgePanelConfig(**...) you check
that config_data.get("llm_pool") and config_data.get("judge_panel") are mappings
(e.g., isinstance(value, dict) or collections.abc.Mapping); if they are
mappings, pass them into LLMPoolConfig and JudgePanelConfig respectively,
otherwise leave llm_pool / judge_panel as None (or raise a clear validation
error) so the loader doesn't crash on **None or non-mapping values.

In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 552-559: Replace the RuntimeError in the validation block inside
evaluator.py with a custom exception from core.system.exceptions (e.g.,
EvaluationError or DataValidationError): import the chosen exception at the top
of the file, then raise that exception when evaluation_scope.turn_data or
evaluation_scope.turn_data.expected_response is missing, including the same
message (use request.metric_identifier in the message) to preserve context;
update any tests or callers if they assert RuntimeError to expect the new custom
exception.

In `@tests/unit/core/models/test_data.py`:
- Around line 579-584: The inline pylint suppression should be removed; instead,
after asserting result.judge_scores is not None in the test, assign a typed
local like judge_scores: Sequence[YourJudgeScoreType] = result.judge_scores (or
use the concrete type used elsewhere) and replace subsequent references
(result.judge_scores[0], result.judge_scores[1], etc.) with judge_scores[0],
judge_scores[1], ensuring imports/types are available so the linter is
satisfied; remove the "# pylint: disable=unsubscriptable-object" comment.
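The narrow-once pattern suggested above, sketched with assumed model shapes (the real `JudgeScore` and result types may differ):

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class JudgeScore:
    """Assumed shape of the per-judge score model."""
    score: Optional[float]

@dataclass
class Result:
    judge_scores: Optional[Sequence[JudgeScore]] = None

result = Result(judge_scores=[JudgeScore(0.9), JudgeScore(0.7)])

assert result.judge_scores is not None
# Narrow once into a typed local; indexing it no longer trips the linter,
# so the inline pylint disable can go.
judge_scores: Sequence[JudgeScore] = result.judge_scores
assert judge_scores[0].score == 0.9
assert judge_scores[1].score == 0.7
```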

In `@tests/unit/pipeline/evaluation/test_judges.py`:
- Around line 82-110: Update the tests to reflect that
JudgeOrchestrator.aggregate_scores uses the "average" aggregation by default: in
test_multiple_judges_returns_max change expected score from 0.9 to the average
of the three scores (0.75) and assert the reason contains "Average of 3 judges"
and the formatted average (e.g., "0.750"); in
test_multiple_judges_with_one_failure change expected score from 0.85 to the
average of the two valid scores ((0.7+0.85)/2 = 0.775) and assert the reason
contains "Average of 2 judges". Also apply the same adjustment to the other
related assertions mentioned in the review (the later tests covering lines
127-186) so they check for average-based scores/reasoning rather than
max/majority behavior.
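The arithmetic behind the updated assertions can be checked with a small sketch; the reason format mirrors the review's expected strings but the real implementation may word it differently:

```python
def aggregate_average(scores):
    """Average the valid (non-None) judge scores; assumed reason format."""
    valid = [s for s in scores if s is not None]
    avg = sum(valid) / len(valid)
    return avg, f"Average of {len(valid)} judges: {avg:.3f}"

# One judge failed (None): the average covers only the two valid scores,
# matching the review's expected (0.7 + 0.85) / 2 = 0.775.
score, reason = aggregate_average([0.7, None, 0.85])
```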

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3f7661c1-2815-496a-8022-6b2528be748b

📥 Commits

Reviewing files that changed from the base of the PR and between 460ecaf and 5985708.

📒 Files selected for processing (15)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/llm/manager.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/pipeline/evaluation/judges.py
  • tests/unit/core/llm/test_llm_manager.py
  • tests/unit/core/models/test_data.py
  • tests/unit/core/models/test_system.py
  • tests/unit/pipeline/evaluation/conftest.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/pipeline/evaluation/test_judges.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/lightspeed_evaluation/core/constants.py
  • tests/unit/pipeline/evaluation/conftest.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • src/lightspeed_evaluation/core/output/generator.py

Comment thread src/lightspeed_evaluation/core/models/system.py
Comment thread src/lightspeed_evaluation/core/system/loader.py Outdated
Comment thread src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
Comment thread tests/unit/core/models/test_data.py
Comment thread tests/unit/pipeline/evaluation/test_judges.py
Comment thread tests/unit/pipeline/evaluation/test_judges.py
@asamal4 asamal4 force-pushed the integrate-judge-panel branch from 5985708 to 7a57043 on March 11, 2026 13:32
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_evaluation/core/llm/manager.py`:
- Around line 170-184: The current get_judges_for_metric returns the first panel
judge when should_use_panel_for_metric is false, causing non-panel metrics to
use judge_managers[0]; change get_judges_for_metric so that when
should_use_panel_for_metric(metric_identifier) is false it returns the legacy
top-level LLM from system_config (e.g., [self.system_config.llm]) instead of
[self.get_primary_judge()]; keep the panel branch returning
self.get_judge_managers() and ensure the return type remains a list of
LLMManager instances.
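A hypothetical sketch of the suggested fallback; class, attribute, and method names are assumptions modeled on the review's description:

```python
from types import SimpleNamespace

class LLMManagerPool:
    """Selects judges per metric, falling back to the legacy top-level LLM."""

    def __init__(self, legacy_llm, panel_judges, panel_metrics):
        self.legacy_llm = legacy_llm        # top-level llm from system config
        self.panel_judges = panel_judges
        self.panel_metrics = set(panel_metrics)

    def should_use_panel_for_metric(self, metric_identifier):
        return metric_identifier in self.panel_metrics

    def get_judges_for_metric(self, metric_identifier):
        if self.should_use_panel_for_metric(metric_identifier):
            return list(self.panel_judges)
        # Non-panel metrics fall back to the legacy top-level LLM,
        # not to the first panel judge.
        return [self.legacy_llm]

legacy = SimpleNamespace(name="legacy-llm")
panel = [SimpleNamespace(name="gpt-4o"), SimpleNamespace(name="claude")]
pool = LLMManagerPool(legacy, panel, {"ragas:answer_correctness"})
```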

In `@src/lightspeed_evaluation/core/system/loader.py`:
- Around line 164-174: The loader currently ignores non-dict non-None values for
"llm_pool" and "judge_panel"; change it to validate types and raise a custom
configuration exception from core.system.exceptions instead of silently dropping
them: import the appropriate exception (e.g., InvalidConfigError or
ConfigValidationError) from core.system.exceptions at the top of the file, then
for each value use config_data.get("llm_pool") and
config_data.get("judge_panel") and if the value is None keep behaving as before,
if isinstance(..., dict) instantiate LLMPoolConfig(**llm_pool_data) /
JudgePanelConfig(**judge_panel_data), but if the value is not None and not a
dict raise the imported exception with a clear message referencing the offending
key ("llm_pool" or "judge_panel") and the actual type/value so callers fail fast
on typos or wrong types.
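A fail-fast sketch of the validation above; `ConfigValidationError` and `JudgePanelConfig` are hypothetical stand-ins for the project's real exception and model:

```python
from dataclasses import dataclass, field

class ConfigValidationError(Exception):
    """Stand-in for the exception exported by core.system.exceptions."""

@dataclass
class JudgePanelConfig:
    """Hypothetical stand-in; field names are assumptions."""
    judges: list = field(default_factory=list)

def parse_judge_panel(config_data: dict):
    value = config_data.get("judge_panel")
    if value is None:
        return None                        # absent/null keeps prior behavior
    if isinstance(value, dict):
        return JudgePanelConfig(**value)
    # Anything else is a config typo: fail fast with a clear message.
    raise ConfigValidationError(
        f"'judge_panel' must be a mapping, got {type(value).__name__}: {value!r}"
    )
```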

In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py`:
- Around line 89-93: The debug log that currently formats the message string
should be converted to structured logging by passing those values as named
fields rather than embedding them in the message: include metric_identifier from
request.metric_identifier, judge_count from len(judge_managers), judge_ids (map
of each judge_manager.id or .judge.id), per-judge input_tokens/output_tokens
(from each judge_manager or its token counters), and aggregation_strategy (the
variable used where aggregation is chosen) as explicit log attributes; update
the logger.debug call(s) around the evaluation entry (the current logger.debug
using request.metric_identifier and judge_managers) and the other noted logging
sites (the blocks that reference judge managers, token usage, and aggregation)
to emit these structured fields via your logging system’s structured API (e.g.,
logger.debug(..., extra={...}) or logger.bind(...)). Make sure field names match
metric_identifier, judge_id(s), judge_count, input_tokens, output_tokens, and
aggregation_strategy so downstream log processors can parse them.
- Around line 202-218: The except block in _evaluate_single_judge currently only
catches EvaluationError so other exceptions (e.g., ConfigurationError or
unexpected bugs) escape; change it to catch Exception (or a broader base) so all
failures are converted into a JudgeScore(score=None) and returned to
aggregate_scores for uniform handling; update the except clause around the
logger.error and the JudgeScore(...) return to handle Exception instead of
EvaluationError, keeping the logging of judge_id and the error and preserving
token_tracker.get_counts(), reason, input_tokens, and output_tokens.
- Around line 123-149: The helper signatures currently use inline pylint
disables in _evaluate_all_judges and _evaluate_single_judge because they accept
separate framework and metric_name arguments; remove the pylint disable by
shrinking the parameter list: stop passing framework and metric_name and instead
derive them inside the helpers from request.metric_identifier (or create and
pass a tiny context object that contains both values) so the signatures only
take (self, judge_manager(s), request, evaluation_scope, token_tracker) and
update all call sites accordingly; remove the "# pylint:
disable=too-many-arguments,too-many-positional-arguments" comments from the
function definitions after the change.
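The three suggestions above (structured log fields, a broad except that normalizes failures to `JudgeScore(score=None)`, and a slimmer signature that derives framework/metric from the identifier) can be sketched in one helper; all names and the "framework:metric" identifier format are assumptions:

```python
import logging
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional

logger = logging.getLogger("judges")

@dataclass
class JudgeScore:
    """Assumed shape of the per-judge result model."""
    score: Optional[float]
    reason: str = ""

def split_metric_identifier(metric_identifier: str):
    """Derive (framework, metric_name) from 'framework:metric'."""
    framework, _, metric_name = metric_identifier.partition(":")
    return framework, metric_name

def evaluate_single_judge(judge_manager, request, evaluation_scope, token_tracker):
    # Slim signature: framework/metric_name are derived, not passed in.
    framework, metric_name = split_metric_identifier(request.metric_identifier)
    logger.debug(
        "evaluating judge",
        extra={  # structured fields instead of a formatted message
            "metric_identifier": request.metric_identifier,
            "judge_id": judge_manager.judge_id,
        },
    )
    try:
        score = judge_manager.evaluate(framework, metric_name, evaluation_scope)
        return JudgeScore(score=score, reason="ok")
    except Exception as exc:  # pylint: disable=broad-except
        # Broad catch: any failure, not just EvaluationError, becomes
        # JudgeScore(score=None) so aggregate_scores handles judges uniformly.
        logger.error("judge %s failed: %s", judge_manager.judge_id, exc)
        return JudgeScore(score=None, reason=str(exc))

ok_judge = SimpleNamespace(judge_id="gpt-4o", evaluate=lambda f, m, s: 0.9)
bad_judge = SimpleNamespace(judge_id="broken", evaluate=lambda f, m, s: 1 / 0)
request = SimpleNamespace(metric_identifier="ragas:answer_correctness")
```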

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d85b507f-2658-4687-87e1-110a660c3721

📥 Commits

Reviewing files that changed from the base of the PR and between 5985708 and 7a57043.

📒 Files selected for processing (15)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/llm/manager.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/pipeline/evaluation/judges.py
  • tests/unit/core/llm/test_llm_manager.py
  • tests/unit/core/models/test_data.py
  • tests/unit/core/models/test_system.py
  • tests/unit/pipeline/evaluation/conftest.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/pipeline/evaluation/test_judges.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • tests/unit/pipeline/evaluation/conftest.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/models/data.py
  • tests/unit/pipeline/evaluation/test_judges.py
  • src/lightspeed_evaluation/core/models/system.py

Comment thread src/lightspeed_evaluation/core/llm/manager.py
Comment thread src/lightspeed_evaluation/core/system/loader.py Outdated
Comment thread src/lightspeed_evaluation/pipeline/evaluation/judges.py
Comment thread src/lightspeed_evaluation/pipeline/evaluation/judges.py
Comment thread src/lightspeed_evaluation/pipeline/evaluation/judges.py
@asamal4 asamal4 force-pushed the integrate-judge-panel branch from 7a57043 to 8d603e3 on March 11, 2026 14:21
@asamal4 asamal4 changed the title [DRAFT] feat: integrate panel of judges feat: integrate panel of judges Mar 11, 2026
Collaborator

@bsatapat-jpg bsatapat-jpg left a comment


LGTM. Thank you

Member

@VladimirKadlec VladimirKadlec left a comment


LGTM, nice work 💪 , thank you.
One non-blocking comment.

Comment thread src/lightspeed_evaluation/pipeline/evaluation/judges.py
@asamal4 asamal4 merged commit 17da61a into lightspeed-core:main Mar 12, 2026
16 checks passed
@asamal4 asamal4 changed the title feat: integrate panel of judges [LEADS-192] feat: integrate panel of judges Mar 12, 2026
emac-E pushed a commit to emac-E/lightspeed-evaluation that referenced this pull request Apr 10, 2026