[LEADS-192] feat: integrate panel of judges #184
Conversation
Force-pushed from c545958 to 2ccbdc8
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the repository's CodeRabbit settings.
Walkthrough: Adds multi-judge panel evaluation: new JudgeScore model and CSV/JSON field, judge-panel-aware LLMManager (judge_id, get_judges_for_metric), JudgeOrchestrator to run per-judge evaluations and aggregate scores (max), SystemConfig parsing for llm_pool/judge_panel, evaluator routing for panel/non-LLM flows, and tests/serialization.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Evaluator
participant LLMManager
participant JudgeOrchestrator
participant MetricHandler
participant OutputGenerator
Evaluator->>LLMManager: get_judges_for_metric(metric_id)
LLMManager->>LLMManager: should_use_panel_for_metric(metric_id)?
alt panel enabled
LLMManager-->>Evaluator: [judge_manager...]
Evaluator->>JudgeOrchestrator: evaluate_with_judges(request, scope, token_tracker, threshold)
JudgeOrchestrator->>JudgeOrchestrator: for each judge_manager
loop per judge
JudgeOrchestrator->>MetricHandler: get/instantiate handler (handler_factory)
MetricHandler->>JudgeOrchestrator: evaluate(single_response)
JudgeOrchestrator->>JudgeOrchestrator: record JudgeScore (score, reason, tokens)
end
JudgeOrchestrator->>JudgeOrchestrator: aggregate_scores(judge_scores) %% uses max
JudgeOrchestrator-->>Evaluator: MetricResult (aggregated score + judge_scores)
else no panel / non-LLM
LLMManager-->>Evaluator: primary judge manager
Evaluator->>MetricHandler: evaluate(standard or non-LLM path)
MetricHandler-->>Evaluator: MetricResult
end
Evaluator->>OutputGenerator: serialize MetricResult (include judge_scores)
OutputGenerator-->>Evaluator: CSV/JSON output
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes. 🚥 Pre-merge checks: ✅ 3 passed.
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 778-804: The _aggregate_scores method currently always computes an
average and ignores JudgePanelConfig.aggregation_strategy, causing non-average
strategies (max, majority_vote) to be silently ignored; update _aggregate_scores
to branch on the aggregation_strategy value from the panel config (e.g.,
"average", "max", "majority_vote") and implement: for "average" compute mean of
non-None JudgeScore.score entries, for "max" return the maximum score and a
reason indicating which judge(s) produced it, and for "majority_vote" compute
the rounded modal score (or None if no consensus) with an explanatory reason;
ensure you still handle the single-judge shortcut and the case of no valid
scores, and reference JudgeScore.score and JudgeScore.reason when building
aggregated reason strings.
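A minimal sketch of the branching this asks for, using simplified stand-in types (this JudgeScore dataclass and the aggregate_scores helper are illustrative, not the PR's actual code):

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeScore:
    """Simplified stand-in for the PR's JudgeScore model."""

    judge_id: str
    score: Optional[float]
    reason: str = ""


def aggregate_scores(
    judge_scores: list[JudgeScore], strategy: str = "average"
) -> tuple[Optional[float], str]:
    """Branch on the panel's aggregation_strategy instead of always averaging."""
    valid = [js for js in judge_scores if js.score is not None]
    if not valid:
        return None, "All judges failed to produce a score"
    if len(valid) == 1:
        # Single-judge shortcut: nothing to aggregate.
        return valid[0].score, valid[0].reason

    scores = [js.score for js in valid if js.score is not None]
    if strategy == "average":
        return sum(scores) / len(scores), f"Average of {len(scores)} judges"
    if strategy == "max":
        best = max(valid, key=lambda js: js.score or 0.0)
        return best.score, f"Max score, from judge {best.judge_id}"
    if strategy == "majority_vote":
        # Round so near-identical scores count as the same vote.
        votes = Counter(round(s, 1) for s in scores)
        value, count = votes.most_common(1)[0]
        if count <= len(scores) // 2:
            return None, "No consensus among judges"
        return value, f"Majority vote: {count}/{len(scores)} judges agreed"
    raise ValueError(f"Unknown aggregation_strategy: {strategy}")
```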
- Around line 680-698: The code uses judge_manager.config.model as the
cache/output identifier which can collide after resolve_llm_config(); change all
uses that identify/cached judges (including the judge_id assigned to JudgeScore
and any cache key inside _get_handler_for_judge) to use the llm_pool key from
the judge config (e.g., judge_manager.config.llm_pool or the config property
that represents the pool key) instead of judge_manager.config.model so handlers
are keyed by the unique llm_pool entry and scores remain distinct.
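To see why keying by model collides, a tiny runnable illustration; llm_pool_key is a hypothetical attribute name standing in for whatever property exposes the pool key:

```python
from dataclasses import dataclass


@dataclass
class JudgeConfig:
    llm_pool_key: str  # unique key from llm_pool (hypothetical attribute name)
    model: str         # resolved model name; may collide across pool entries


# Two pool entries backed by the same underlying model:
judges = [JudgeConfig("judge_a", "gpt-4o"), JudgeConfig("judge_b", "gpt-4o")]

by_model = {j.model: j for j in judges}            # collides: only 1 entry
by_pool_key = {j.llm_pool_key: j for j in judges}  # distinct: 2 entries

assert len(by_model) == 1
assert len(by_pool_key) == 2
```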
- Around line 271-281: The current branch that handles uses_panel and
multiple_expected_responses silently drops extra references by calling
_evaluate_panel_single_expected_response; instead, fail fast or preserve
multi-reference semantics: modify the branch in evaluator.py (the if uses_panel
and multiple_expected_responses block) to raise a clear exception (e.g.,
NotImplementedError or RuntimeError) when a judge panel is enabled but multiple
expected_responses are present, with a message saying NxM evaluation for panels
is not implemented, or alternatively call/implement the existing multi-reference
evaluation path (the multi-reference evaluator) so metrics that accept lists
keep their “best/pass across references” behavior rather than truncating to the
first reference. Ensure the error message mentions
request.conv_data.conversation_group_id and the conflicting flags so callers can
fix their config.
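A hedged sketch of the fail-fast branch; the function name and argument shapes are illustrative, and the real check would live inside evaluator.py's panel routing:

```python
def check_panel_vs_references(
    uses_panel: bool,
    expected_responses: list[str],
    conversation_group_id: str,
) -> None:
    """Fail fast instead of silently truncating to the first reference."""
    if uses_panel and len(expected_responses) > 1:
        raise NotImplementedError(
            f"Conversation {conversation_group_id!r}: NxM evaluation "
            f"(judge panel x {len(expected_responses)} expected_responses) "
            "is not implemented; disable the panel or use a single reference."
        )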
- Around line 800-801: The comprehension currently leaves js.score as
Optional[float], so remove the "# type: ignore" by filtering out None values in
the comprehension: build scores as a list of float by using "scores = [js.score
for js in valid_judges if js.score is not None]" (this narrows the type to
list[float]) and then compute aggregated from scores as before; adjust any
surrounding code that assumed Optional to rely on the narrowed scores list and
aggregated variable.
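The type-narrowing trick in miniature; mypy infers list[float] because the None check guards the comprehension element:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeScore:
    score: Optional[float]


valid_judges = [JudgeScore(0.7), JudgeScore(None), JudgeScore(0.9)]

# Before: scores was list[Optional[float]], forcing a "# type: ignore".
# After: the None check inside the comprehension narrows it to list[float].
scores: list[float] = [js.score for js in valid_judges if js.score is not None]
aggregated = sum(scores) / len(scores)  # 0.8, no cast or ignore needed
```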
- Around line 711-724: The current broad except in evaluator.py swallows all
Exceptions and treats configuration/handler bugs as per-judge failures; change
the except to only catch the expected custom exceptions from
core.system.exceptions (e.g., the specific exception classes used for
judge/handler errors) around the judge evaluation block (the code invoking
_get_handler_for_judge() and the JudgeScore creation), logging and recording a
JudgeScore only for those expected exceptions, and re-raise any other unexpected
exceptions so they surface to the caller; ensure you import the required
exception classes from core.system.exceptions and keep token counts and
JudgeScore behavior identical when handling the expected exceptions.
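A sketch of the narrowed except, assuming an EvaluationError-style class exists in core.system.exceptions (the local class here is a stand-in so the snippet runs on its own):

```python
import logging

logger = logging.getLogger(__name__)


class EvaluationError(Exception):
    """Stand-in for the expected class in core.system.exceptions."""


def evaluate_one_judge(judge_id: str, run) -> tuple[float | None, str]:
    """Record expected failures per judge; let real bugs propagate."""
    try:
        return run(), "ok"
    except EvaluationError as exc:  # expected, per-judge failure
        logger.error("Judge %s failed: %s", judge_id, exc)
        return None, f"Judge error: {exc}"
    # Anything else (TypeError, KeyError, ...) is a handler/config bug
    # and is intentionally NOT caught, so it surfaces to the caller.
```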
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ebe174e9-2ebb-4dba-b455-501507ba9111
📒 Files selected for processing (11)
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/llm/manager.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/system/loader.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/core/llm/test_llm_manager.py
- tests/unit/core/models/test_data.py
- tests/unit/pipeline/evaluation/conftest.py
- tests/unit/pipeline/evaluation/test_evaluator.py
Force-pushed from 2ccbdc8 to aa4f415
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
291-301: ⚠️ Potential issue | 🟠 Major: Don’t truncate expected_response lists when a panel is active.
This still changes list-based metrics from “best/pass across references” to “first reference only” whenever panel evaluation is enabled, which can flip PASS/FAIL on existing datasets. Preserve the existing multi-reference flow or fail fast until NxM evaluation is implemented.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py` around lines 291 - 301, The current branch in evaluate where uses_panel and multiple_expected_responses causes truncation to the first expected_response via _evaluate_panel_single_expected_response; instead preserve multi-reference semantics by either calling the existing multi-reference evaluation path (the same flow used when multiple_expected_responses is true and uses_panel is false) so metrics remain “best/pass across references”, or, if NxM panel-vs-references evaluation isn't implemented, fail fast by raising a clear exception (e.g., NotImplementedError/ValueError) rather than silently using only the first reference; update the conditional around uses_panel & multiple_expected_responses and remove the fallback to _evaluate_panel_single_expected_response or replace it with a fail-fast error until proper NxM support is added.
src/lightspeed_evaluation/pipeline/evaluation/judges.py (1)
255-266: ⚠️ Potential issue | 🟠 Major: Honor aggregation_strategy or reject non-average configs.
JudgePanelConfig accepts max and majority_vote, but this path still averages every multi-judge result after only logging a warning. Any non-default panel config will therefore return the wrong score/status while looking supported. Also applies to: 279-282.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py` around lines 255 - 266, The current code merely logs a warning for non-"average" aggregation_strategy but continues to average results; update the checks in judges.py (the block using self.llm_manager.system_config.judge_panel and the similar block at 279-282) to reject unsupported strategies instead of silently falling back: if aggregation_strategy is not "average", raise a clear configuration error (e.g., ValueError) that includes the invalid strategy and acceptable options ("average", "max", "majority_vote"), so callers cannot assume unsupported strategies are honored; alternatively, implement proper handling for "max" and "majority_vote" and wire that behaviour into the existing aggregation code paths (refer to the aggregation_strategy variable and any aggregation functions used thereafter).
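If the reject-unsupported route is taken, a minimal sketch (the constant and function names are illustrative):

```python
SUPPORTED_STRATEGIES = ("average",)  # "max" and "majority_vote" not yet wired up


def validate_aggregation_strategy(strategy: str) -> None:
    """Reject unimplemented strategies instead of silently averaging."""
    if strategy not in SUPPORTED_STRATEGIES:
        raise ValueError(
            f"Unsupported aggregation_strategy {strategy!r}; "
            f"implemented: {SUPPORTED_STRATEGIES}, "
            "accepted by JudgePanelConfig: ('average', 'max', 'majority_vote')"
        )
```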
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 95-101: Replace the builtin exceptions with the project's custom
exception from core.system.exceptions: import and raise the appropriate
exception (e.g., UnsupportedLLMFrameworkError) instead of ValueError in the
evaluator selection branch that returns RagasMetrics / DeepEvalMetrics /
CustomMetrics (the "framework" check and the raise ValueError(f\"Unsupported LLM
framework for panel: {framework}\") line), and likewise replace any RuntimeError
uses mentioned (e.g., at the other occurrence around lines 555-557) with the
same project exception so failures are classifiable by the rest of the package.
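A runnable sketch of swapping the builtin for a classifiable exception; UnsupportedLLMFrameworkError is defined locally because the actual class name in core.system.exceptions is an assumption, and the backend mapping returns strings purely for illustration:

```python
class UnsupportedLLMFrameworkError(Exception):
    """Raised when a panel metric names a framework we cannot evaluate."""


def select_metrics_backend(framework: str) -> str:
    backends = {
        "ragas": "RagasMetrics",
        "deepeval": "DeepEvalMetrics",
        "custom": "CustomMetrics",
    }
    try:
        return backends[framework]
    except KeyError as exc:
        # Classifiable by the rest of the package, unlike a bare ValueError:
        raise UnsupportedLLMFrameworkError(
            f"Unsupported LLM framework for panel: {framework}"
        ) from exc
```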
In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py`:
- Around line 198-212: The current except block in judges.py that catches
Exception and returns a synthetic JudgeScore (using judge_id,
token_tracker.get_counts(), and JudgeScore) should instead re-raise the
exception for single-judge failures so upstream JudgeOrchestrator/
MetricsEvaluator can stop on errors; only collapse the exception into a
JudgeScore when running in panel mode (i.e., check the panel-mode flag or
context provided to the evaluator). Update the except Exception as e block to
call token_tracker.get_counts() to capture counts, and then: if panel_mode (or
the existing panel indicator) is True, return the synthetic JudgeScore as
before; otherwise re-raise e (or wrap/raise a clear EvaluationError that
includes the token counts) so the error is propagated.
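A sketch of the panel-aware except block this describes; the panel_mode flag and the token-bearing error message are assumptions about how the context would be threaded through:

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class EvaluationError(Exception):
    """Stand-in for the project's evaluation error."""


@dataclass
class JudgeScore:  # simplified stand-in
    judge_id: str
    score: Optional[float]
    reason: str


def run_judge(
    judge_id: str,
    evaluate: Callable[[], JudgeScore],
    panel_mode: bool,
    token_tracker,
) -> JudgeScore:
    """Collapse failures into a JudgeScore only when running as part of a panel."""
    try:
        return evaluate()
    except Exception as exc:  # noqa: BLE001 - re-raised below outside panel mode
        input_tokens, output_tokens = token_tracker.get_counts()
        if panel_mode:
            # Panel: one bad judge should not sink the others.
            logger.error("Judge %s failed: %s", judge_id, exc)
            return JudgeScore(judge_id, None, f"Judge error: {exc}")
        # Single judge: propagate, carrying token context for the caller.
        raise EvaluationError(
            f"Judge {judge_id} failed ({input_tokens}/{output_tokens} tokens): {exc}"
        ) from exc
```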
---
Duplicate comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 291-301: The current branch in evaluate where uses_panel and
multiple_expected_responses causes truncation to the first expected_response via
_evaluate_panel_single_expected_response; instead preserve multi-reference
semantics by either calling the existing multi-reference evaluation path (the
same flow used when multiple_expected_responses is true and uses_panel is false)
so metrics remain “best/pass across references”, or, if NxM panel-vs-references
evaluation isn't implemented, fail fast by raising a clear exception (e.g.,
NotImplementedError/ValueError) rather than silently using only the first
reference; update the conditional around uses_panel &
multiple_expected_responses and remove the fallback to
_evaluate_panel_single_expected_response or replace it with a fail-fast error
until proper NxM support is added.
In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py`:
- Around line 255-266: The current code merely logs a warning for non-"average"
aggregation_strategy but continues to average results; update the checks in
judges.py (the block using self.llm_manager.system_config.judge_panel and the
similar block at 279-282) to reject unsupported strategies instead of silently
falling back: if aggregation_strategy is not "average", raise a clear
configuration error (e.g., ValueError) that includes the invalid strategy and
acceptable options ("average", "max", "majority_vote"), so callers cannot assume
unsupported strategies are honored; alternatively, implement proper handling for
"max" and "majority_vote" and wire that behaviour into the existing aggregation
code paths (refer to the aggregation_strategy variable and any aggregation
functions used thereafter).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c3eae248-6c4e-4e7b-8c61-0d3686820d26
📒 Files selected for processing (14)
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/llm/manager.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/system/loader.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/pipeline/evaluation/judges.py
- tests/unit/core/llm/test_llm_manager.py
- tests/unit/core/models/test_data.py
- tests/unit/core/models/test_system.py
- tests/unit/pipeline/evaluation/conftest.py
- tests/unit/pipeline/evaluation/test_evaluator.py
🚧 Files skipped from review as they are similar to previous changes (8)
- tests/unit/pipeline/evaluation/conftest.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/constants.py
- tests/unit/core/llm/test_llm_manager.py
- src/lightspeed_evaluation/core/system/loader.py
- src/lightspeed_evaluation/core/models/__init__.py
- tests/unit/core/models/test_data.py
- src/lightspeed_evaluation/core/llm/manager.py
Force-pushed from aa4f415 to 460ecaf
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/pipeline/evaluation/test_evaluator.py (1)
767-778: ⚠️ Potential issue | 🟡 Minor: Lock in the partial-error status explicitly.
This test now validates the score/token aggregation path, but it no longer proves the final status stays FAIL. It also still reads like an ERROR case in the name/docstring, which makes the contract misleading.
💡 Suggested tightening:

```diff
-    def test_evaluate_multiple_expected_responses_error_preserves_tokens(
+    def test_evaluate_multiple_expected_responses_partial_error_accumulates_tokens(
         self,
         config_loader: ConfigLoader,
         mock_metric_manager: MetricManager,
         mock_script_manager: ScriptExecutionManager,
         mocker: MockerFixture,
     ) -> None:
-        """Test token preservation when error occurs during multiple expected responses evaluation.
+        """Test token accumulation when a later expected-response iteration errors.

         Scenario: First iteration succeeds with tokens, second iteration fails.
-        Expected: Error result should preserve tokens from first iteration.
+        Expected: The final result stays FAIL, preserves the best score, and
+        accumulates tokens from both iterations.
         """

         assert result is not None
+        assert result.result == "FAIL"
         assert result.score == 0.3  # From first iteration
```

Also applies to: 843-851
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/pipeline/evaluation/test_evaluator.py` around lines 767 - 778, Rename/update the test test_evaluate_multiple_expected_responses_error_preserves_tokens (and the similar sibling test around the other block) and its docstring to reflect a partial-failure/FAIL final status, and explicitly assert that the final evaluation status equals FAIL after the second iteration fails; ensure the test sets/locks the partial-error state (e.g., by simulating the second iteration error via the mocked evaluator/ScriptExecutionManager) and then checks tokens were preserved while also asserting result.status == FAIL to guarantee the contract.
🧹 Nitpick comments (1)
src/lightspeed_evaluation/pipeline/evaluation/judges.py (1)
276-284: Remove redundant filtering of valid scores. Line 276 filters for js.score is not None, then line 283 re-applies the same filter. Simplify by extracting scores directly from valid_scores.
♻️ Suggested simplification:

```diff
  # Multiple judges: filter valid scores and aggregate
  valid_scores = [js for js in judge_scores if js.score is not None]
  if not valid_scores:
      # All judges failed
      return None, "All judges failed to produce a score"

  # Multiple judges: average (type-safe extraction)
- scores = [js.score for js in valid_scores if js.score is not None]
+ scores: list[float] = [js.score for js in valid_scores]  # type: ignore[misc]
  avg_score = sum(scores) / len(scores)
```

Alternatively, use a type guard or cast since valid_scores already guarantees non-None scores.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py` around lines 276 - 284, valid_scores already contains only entries with non-None scores, so remove the redundant filtering when building scores: replace the list comprehension that re-checks "if js.score is not None" with one that extracts scores directly from valid_scores (e.g., scores = [js.score for js in valid_scores]), and ensure any necessary type hint/cast is applied so avg_score = sum(scores) / len(scores) remains type-safe; update references in the same function where valid_scores, scores, and avg_score are used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tests/unit/pipeline/evaluation/test_evaluator.py`:
- Around line 767-778: Rename/update the test
test_evaluate_multiple_expected_responses_error_preserves_tokens (and the
similar sibling test around the other block) and its docstring to reflect a
partial-failure/FAIL final status, and explicitly assert that the final
evaluation status equals FAIL after the second iteration fails; ensure the test
sets/locks the partial-error state (e.g., by simulating the second iteration
error via the mocked evaluator/ScriptExecutionManager) and then checks tokens
were preserved while also asserting result.status == FAIL to guarantee the
contract.
---
Nitpick comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py`:
- Around line 276-284: valid_scores already contains only entries with non-None
scores, so remove the redundant filtering when building scores: replace the list
comprehension that re-checks "if js.score is not None" with one that extracts
scores directly from valid_scores (e.g., scores = [js.score for js in
valid_scores]), and ensure any necessary type hint/cast is applied so avg_score
= sum(scores) / len(scores) remains type-safe; update references in the same
function where valid_scores, scores, and avg_score are used.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 654e25a6-1e9c-49b5-b58b-ce7c1074c20f
📒 Files selected for processing (14)
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/llm/manager.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/system/loader.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/pipeline/evaluation/judges.py
- tests/unit/core/llm/test_llm_manager.py
- tests/unit/core/models/test_data.py
- tests/unit/core/models/test_system.py
- tests/unit/pipeline/evaluation/conftest.py
- tests/unit/pipeline/evaluation/test_evaluator.py
🚧 Files skipped from review as they are similar to previous changes (4)
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/system/loader.py
- tests/unit/core/llm/test_llm_manager.py
- src/lightspeed_evaluation/core/models/__init__.py
Force-pushed from 460ecaf to 5985708
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_evaluation/core/models/system.py`:
- Around line 583-589: The aggregation_strategy Field default should be reverted
to "average" (change the Field default in the aggregation_strategy declaration)
so configs match the intended rollout; update any docstring to reflect that
"average" is the default and that "max" and "majority_vote" are not yet
implemented. Also ensure evaluator.py::_aggregate_scores continues to compute
with average by default and only emits a warning (not switch behavior) when
aggregation_strategy is "max" or "majority_vote" so those branches do not alter
the current averaging logic.
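A minimal Pydantic sketch of the reverted default (field and class names follow the review; the description text is illustrative):

```python
from typing import Literal

from pydantic import BaseModel, Field


class JudgePanelConfig(BaseModel):
    """Simplified stand-in for the PR's model."""

    aggregation_strategy: Literal["average", "max", "majority_vote"] = Field(
        default="average",
        description=(
            "How panel scores are combined. 'average' is the default; "
            "'max' and 'majority_vote' are accepted but not yet implemented."
        ),
    )
```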
In `@src/lightspeed_evaluation/core/system/loader.py`:
- Around line 164-172: The loader currently does **config_data["llm_pool"] and
**config_data["judge_panel"] without guarding, which will raise TypeError if
those keys exist but are None or non-mapping; update the parsing in loader.py so
that before calling LLMPoolConfig(**...) and JudgePanelConfig(**...) you check
that config_data.get("llm_pool") and config_data.get("judge_panel") are mappings
(e.g., isinstance(value, dict) or collections.abc.Mapping); if they are
mappings, pass them into LLMPoolConfig and JudgePanelConfig respectively,
otherwise leave llm_pool / judge_panel as None (or raise a clear validation
error) so the loader doesn't crash on **None or non-mapping values.
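A small sketch of the mapping guard, taking the leave-as-None option here (the raising variant appears in a later round; see the stricter sketch further down):

```python
from collections.abc import Mapping
from typing import Any, Optional

# model_cls stands in for LLMPoolConfig / JudgePanelConfig.


def parse_optional_section(
    config_data: dict[str, Any], key: str, model_cls: type
) -> Optional[Any]:
    """Only unpack a section into its model when it is actually a mapping."""
    value = config_data.get(key)
    if value is None:
        return None  # section absent or explicitly null: leave unset
    if isinstance(value, Mapping):
        return model_cls(**value)
    return None  # non-mapping value: skip instead of crashing on **value
```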
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 552-559: Replace the RuntimeError in the validation block inside
evaluator.py with a custom exception from core.system.exceptions (e.g.,
EvaluationError or DataValidationError): import the chosen exception at the top
of the file, then raise that exception when evaluation_scope.turn_data or
evaluation_scope.turn_data.expected_response is missing, including the same
message (use request.metric_identifier in the message) to preserve context;
update any tests or callers if they assert RuntimeError to expect the new custom
exception.
In `@tests/unit/core/models/test_data.py`:
- Around line 579-584: The inline pylint suppression should be removed; instead,
after asserting result.judge_scores is not None in the test, assign a typed
local like judge_scores: Sequence[YourJudgeScoreType] = result.judge_scores (or
use the concrete type used elsewhere) and replace subsequent references
(result.judge_scores[0], result.judge_scores[1], etc.) with judge_scores[0],
judge_scores[1], ensuring imports/types are available so the linter is
satisfied; remove the "# pylint: disable=unsubscriptable-object" comment.
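In miniature, with local stand-in types so it runs on its own; the narrowing assignment is what lets the pylint disable go:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class JudgeScore:  # stand-in for the project's model
    judge_id: str
    score: Optional[float]


@dataclass
class MetricResult:
    judge_scores: Optional[Sequence[JudgeScore]]


def test_judge_scores_serialized() -> None:
    result = MetricResult(judge_scores=[JudgeScore("a", 0.9), JudgeScore("b", 0.7)])
    assert result.judge_scores is not None
    # Typed local narrows Optional away, so no pylint/mypy suppression needed:
    judge_scores: Sequence[JudgeScore] = result.judge_scores
    assert judge_scores[0].score == 0.9
    assert judge_scores[1].score == 0.7
```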
In `@tests/unit/pipeline/evaluation/test_judges.py`:
- Around line 82-110: Update the tests to reflect that
JudgeOrchestrator.aggregate_scores uses the "average" aggregation by default: in
test_multiple_judges_returns_max change expected score from 0.9 to the average
of the three scores (0.75) and assert the reason contains "Average of 3 judges"
and the formatted average (e.g., "0.750"); in
test_multiple_judges_with_one_failure change expected score from 0.85 to the
average of the two valid scores ((0.7+0.85)/2 = 0.775) and assert the reason
contains "Average of 2 judges". Also apply the same adjustment to the other
related assertions mentioned in the review (the later tests covering lines
127-186) so they check for average-based scores/reasoning rather than
max/majority behavior.
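A worked check of the averages cited above; the three-judge scores are hypothetical values chosen so the max is 0.9 and the mean is 0.75, while the two-judge pair is the one given in the comment:

```python
scores_all = [0.6, 0.75, 0.9]  # hypothetical panel where max(...) == 0.9
avg_all = sum(scores_all) / len(scores_all)
assert abs(avg_all - 0.75) < 1e-9  # average, not max(scores_all) == 0.9

scores_valid = [0.7, 0.85]  # the two valid scores after one judge fails
avg_valid = sum(scores_valid) / len(scores_valid)
assert abs(avg_valid - 0.775) < 1e-9  # "Average of 2 judges" -> 0.775
```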
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3f7661c1-2815-496a-8022-6b2528be748b
📒 Files selected for processing (15)
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/llm/manager.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/system/loader.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/pipeline/evaluation/judges.py
- tests/unit/core/llm/test_llm_manager.py
- tests/unit/core/models/test_data.py
- tests/unit/core/models/test_system.py
- tests/unit/pipeline/evaluation/conftest.py
- tests/unit/pipeline/evaluation/test_evaluator.py
- tests/unit/pipeline/evaluation/test_judges.py
🚧 Files skipped from review as they are similar to previous changes (4)
- src/lightspeed_evaluation/core/constants.py
- tests/unit/pipeline/evaluation/conftest.py
- tests/unit/pipeline/evaluation/test_evaluator.py
- src/lightspeed_evaluation/core/output/generator.py
Force-pushed from 5985708 to 7a57043
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_evaluation/core/llm/manager.py`:
- Around line 170-184: The current get_judges_for_metric returns the first panel
judge when should_use_panel_for_metric is false, causing non-panel metrics to
use judge_managers[0]; change get_judges_for_metric so that when
should_use_panel_for_metric(metric_identifier) is false it returns the legacy
top-level LLM from system_config (e.g., [self.system_config.llm]) instead of
[self.get_primary_judge()]; keep the panel branch returning
self.get_judge_managers() and ensure the return type remains a list of
LLMManager instances.
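A sketch of the fallback the comment asks for; attribute names like judge_panel.metrics and system_config.llm are assumptions about the config shape:

```python
from typing import Any


class LLMManagerSketch:
    """Minimal sketch; attribute names follow the review, not the real class."""

    def __init__(self, system_config: Any, judge_managers: list[Any]) -> None:
        self.system_config = system_config
        self.judge_managers = judge_managers

    def should_use_panel_for_metric(self, metric_identifier: str) -> bool:
        panel = getattr(self.system_config, "judge_panel", None)
        return bool(panel) and metric_identifier in panel.metrics  # assumed shape

    def get_judges_for_metric(self, metric_identifier: str) -> list[Any]:
        if self.should_use_panel_for_metric(metric_identifier):
            return self.judge_managers  # full panel
        # Non-panel metrics keep the legacy top-level LLM, not judge_managers[0]:
        return [self.system_config.llm]
```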
In `@src/lightspeed_evaluation/core/system/loader.py`:
- Around line 164-174: The loader currently ignores non-dict non-None values for
"llm_pool" and "judge_panel"; change it to validate types and raise a custom
configuration exception from core.system.exceptions instead of silently dropping
them: import the appropriate exception (e.g., InvalidConfigError or
ConfigValidationError) from core.system.exceptions at the top of the file, then
for each value use config_data.get("llm_pool") and
config_data.get("judge_panel") and if the value is None keep behaving as before,
if isinstance(..., dict) instantiate LLMPoolConfig(**llm_pool_data) /
JudgePanelConfig(**judge_panel_data), but if the value is not None and not a
dict raise the imported exception with a clear message referencing the offending
key ("llm_pool" or "judge_panel") and the actual type/value so callers fail fast
on typos or wrong types.
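The stricter variant of the earlier loader guard, raising instead of dropping; ConfigValidationError is a local stand-in since the exact class in core.system.exceptions is an assumption:

```python
from collections.abc import Mapping
from typing import Any, Optional


class ConfigValidationError(Exception):
    """Stand-in for the class to import from core.system.exceptions."""


def parse_section(
    config_data: dict[str, Any], key: str, model_cls: type
) -> Optional[Any]:
    """None stays None, mappings are unpacked, anything else fails fast."""
    value = config_data.get(key)
    if value is None:
        return None
    if isinstance(value, Mapping):
        return model_cls(**value)
    raise ConfigValidationError(
        f"{key!r} must be a mapping, got {type(value).__name__}: {value!r}"
    )
```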
In `@src/lightspeed_evaluation/pipeline/evaluation/judges.py`:
- Around line 89-93: The debug log that currently formats the message string
should be converted to structured logging by passing those values as named
fields rather than embedding them in the message: include metric_identifier from
request.metric_identifier, judge_count from len(judge_managers), judge_ids (map
of each judge_manager.id or .judge.id), per-judge input_tokens/output_tokens
(from each judge_manager or its token counters), and aggregation_strategy (the
variable used where aggregation is chosen) as explicit log attributes; update
the logger.debug call(s) around the evaluation entry (the current logger.debug
using request.metric_identifier and judge_managers) and the other noted logging
sites (the blocks that reference judge managers, token usage, and aggregation)
to emit these structured fields via your logging system’s structured API (e.g.,
logger.debug(..., extra={...}) or logger.bind(...). Make sure field names match
metric_identifier, judge_id(s), judge_count, input_tokens, output_tokens, and
aggregation_strategy so downstream log processors can parse them.
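With stdlib logging, the structured fields can ride on extra= (they become LogRecord attributes that downstream formatters and processors can read); a sketch, assuming judge managers expose judge_id as the walkthrough describes:

```python
import logging

logger = logging.getLogger(__name__)


def log_panel_start(request, judge_managers, aggregation_strategy: str) -> None:
    """Emit values as structured fields instead of baking them into the message."""
    logger.debug(
        "Starting panel evaluation",
        extra={
            "metric_identifier": request.metric_identifier,
            "judge_count": len(judge_managers),
            "judge_ids": [jm.judge_id for jm in judge_managers],
            "aggregation_strategy": aggregation_strategy,
        },
    )
```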
- Around line 202-218: The except block in _evaluate_single_judge currently only
catches EvaluationError so other exceptions (e.g., ConfigurationError or
unexpected bugs) escape; change it to catch Exception (or a broader base) so all
failures are converted into a JudgeScore(score=None) and returned to
aggregate_scores for uniform handling; update the except clause around the
logger.error and the JudgeScore(...) return to handle Exception instead of
EvaluationError, keeping the logging of judge_id and the error and preserving
token_tracker.get_counts(), reason, input_tokens, and output_tokens.
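A sketch of the broadened except that converts any failure into a score-less JudgeScore (local stand-in types; the token_tracker.get_counts() shape follows the comment):

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class JudgeScore:  # simplified stand-in
    judge_id: str
    score: Optional[float]
    reason: str
    input_tokens: int = 0
    output_tokens: int = 0


def evaluate_single_judge(judge_id: str, evaluate, token_tracker) -> JudgeScore:
    """Convert ANY failure into JudgeScore(score=None) for uniform aggregation."""
    try:
        return evaluate()
    except Exception as exc:  # noqa: BLE001 - panel mode wants uniform handling
        logger.error("Judge %s failed: %s", judge_id, exc)
        input_tokens, output_tokens = token_tracker.get_counts()
        return JudgeScore(
            judge_id=judge_id,
            score=None,
            reason=f"Judge error: {exc}",
            input_tokens=input_tokens,
            output_tokens=output_tokens,
        )
```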
- Around line 123-149: The helper signatures currently use inline pylint
disables in _evaluate_all_judges and _evaluate_single_judge because they accept
separate framework and metric_name arguments; remove the pylint disable by
shrinking the parameter list: stop passing framework and metric_name and instead
derive them inside the helpers from request.metric_identifier (or create and
pass a tiny context object that contains both values) so the signatures only
take (self, judge_manager(s), request, evaluation_scope, token_tracker) and
update all call sites accordingly; remove the "# pylint:
disable=too-many-arguments,too-many-positional-arguments" comments from the
function definitions after the change.
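One way to shrink the signatures: derive both values from the metric identifier via a tiny context object. The "framework:metric" identifier format is an assumption; the real parsing rule may differ:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricContext:
    """Carries framework + metric_name so helpers take one argument, not two."""

    framework: str
    metric_name: str

    @classmethod
    def from_identifier(cls, metric_identifier: str) -> "MetricContext":
        # Assumes identifiers look like "ragas:answer_relevancy".
        framework, _, metric_name = metric_identifier.partition(":")
        return cls(framework=framework, metric_name=metric_name)


ctx = MetricContext.from_identifier("ragas:answer_relevancy")
assert ctx.framework == "ragas"
assert ctx.metric_name == "answer_relevancy"
```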
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d85b507f-2658-4687-87e1-110a660c3721
📒 Files selected for processing (15)
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/llm/manager.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/system/loader.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/pipeline/evaluation/judges.py
- tests/unit/core/llm/test_llm_manager.py
- tests/unit/core/models/test_data.py
- tests/unit/core/models/test_system.py
- tests/unit/pipeline/evaluation/conftest.py
- tests/unit/pipeline/evaluation/test_evaluator.py
- tests/unit/pipeline/evaluation/test_judges.py
🚧 Files skipped from review as they are similar to previous changes (6)
- tests/unit/pipeline/evaluation/conftest.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/models/data.py
- tests/unit/pipeline/evaluation/test_judges.py
- src/lightspeed_evaluation/core/models/system.py
Force-pushed from 7a57043 to 8d603e3
VladimirKadlec left a comment:
LGTM, nice work 💪, thank you.
One non-blocking comment.
feat: integrate panel of judges
Description
A lot of improvements can still be made, but this is already a big change.
Documentation changes will follow once all basic requirements are in place.