chore: refactor e2e tests #183
Conversation
Walkthrough
Consolidates CI e2e orchestration into a single Make target, adds pytest-timeout, adjusts pytest defaults to exclude integration tests, and introduces new end-to-end integration tests plus a small test-data tweak to force API calls.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test as Integration Test
    participant ConfigLoader as Config Loader
    participant EvalFramework as Evaluation Framework
    participant LSAPI as Lightspeed Stack API
    participant OpenAI as OpenAI API
    Test->>ConfigLoader: load system config (query/streaming)
    ConfigLoader-->>Test: return config
    Test->>EvalFramework: load evaluation data (YAML)
    EvalFramework-->>Test: return evaluation data
    Test->>EvalFramework: evaluate(config, eval_data)
    EvalFramework->>LSAPI: send request(s) to endpoint
    LSAPI->>OpenAI: forward requests (uses OPENAI_API_KEY)
    OpenAI-->>LSAPI: return responses & token usage
    LSAPI-->>EvalFramework: return enriched responses
    EvalFramework-->>Test: return EvaluationResult (metrics, scores)
    Test->>Test: assert results, metrics, thresholds, token usage
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 4
🧹 Nitpick comments (2)
Makefile (1)
62-63: Add `e2e_tests` to `.PHONY`.
Without this, a file or directory named `e2e_tests` will cause the recipe to be skipped. Since this is a public automation target, it should be declared phony like `help` and `test`.
♻️ Proposed fix
```diff
-.PHONY: help test
+.PHONY: help test e2e_tests
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 62 - 63, The Makefile defines the e2e_tests target but doesn't list it in the .PHONY declaration, so add "e2e_tests" to the existing .PHONY target (the line that currently contains targets like "help" and "test") to mark it as phony; edit the .PHONY entry to include e2e_tests (ensuring spacing/formatting matches the current .PHONY line) so the e2e_tests recipe always runs regardless of any file or directory named e2e_tests.

pyproject.toml (1)
70-70: Please confirm the direct `pytest-timeout` dependency is intentional.
This repo previously kept the `timeout = 300` config without declaring `pytest-timeout`, so adding it here changes the dev dependency policy for every `--group dev` sync. If that maintainer decision has not changed, this line should stay out of `pyproject.toml`.
Based on learnings: "In the lightspeed-evaluation project, the pytest.ini timeout configuration is intentionally configured without adding pytest-timeout as a declared dependency, as confirmed by the project maintainer."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` at line 70, Confirm whether adding the direct dev dependency "pytest-timeout>=2.4.0" is intentional; if it is not, remove the "pytest-timeout>=2.4.0" entry from pyproject.toml so the repo continues to rely on the existing pytest.ini "timeout = 300" configuration without declaring the plugin, and if it is intentional, leave the line but update project docs or dev-dependency policy to record the decision and reasoning for declaring pytest-timeout explicitly.
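The configuration being debated can be sketched as a minimal `pytest.ini` fragment. This is illustrative, not the repository's actual file: `timeout = 300` is stated in the comment above, while the `addopts` and `markers` lines are assumptions based on the walkthrough's mention of excluding integration tests by default.

```ini
[pytest]
# Only takes effect when the pytest-timeout plugin is installed;
# pytest silently ignores unknown ini options otherwise.
timeout = 300
# Assumed default: skip integration tests unless explicitly selected
# (the e2e target runs `pytest tests/integration -m integration`).
addopts = -m "not integration"
markers =
    integration: end-to-end tests that need a running Lightspeed Stack
```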
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/e2e_tests.yaml:
- Line 105: The workflow currently invokes the PR-controlled target "make
e2e_tests" under the "pull_request_target" context which allows the PR's
Makefile to execute with secrets (e.g. OPENAI_API_KEY) — change this by either
replacing the "make e2e_tests" step with explicit, inline test commands that run
against the trusted checkout/ref (so nothing from the PR Makefile executes), or
move the job to run on "pull_request" instead so the PR head is checked out
without exposing secrets; update the step that calls "make e2e_tests" and any
related prerequisites to reference the trusted commands or change the workflow
trigger accordingly.
In `@tests/integration/test_full_evaluation.py`:
- Line 1: Remove the file-level "# pylint: disable" and all "# type: ignore"
usages and fix the root issues: move the nested DataValidator class to module
scope (rename if needed) so it’s importable/testable, create a shared helper
function (e.g., load_validate_evaluate or evaluate_fixture) to encapsulate the
repeated load/validate/evaluate sequence used across tests to reduce local
variable churn, and replace any suppressed comparisons by asserting non-None for
score and threshold then assigning them to locals and comparing those locals
(e.g., assert score is not None; s = score; assert threshold is not None; t =
threshold; assert s >= t) instead of using type/lint ignores; update tests that
referenced the old nested class or inlined flow to use the module-level
DataValidator and the new helper.
- Around line 27-34: The check_api_available function currently probes "/" and
requires a 200 which can produce false failures; change it to request
"/v1/models" (the same endpoint the workflow waits on) and treat the service as
available for any non-5xx response (e.g., status_code < 500) so 200/401/404
still count as up; also broaden the exception catch to httpx.HTTPError (or
include RequestError/Timeout) to avoid uncaught errors. Update the logic inside
check_api_available accordingly and keep the same timeout variable and return
boolean semantics.
- Around line 173-175: The test currently checks relevancy_result.response with
`is not None`, which is ineffective because EvaluationResult.response defaults
to "" (empty string); update the assertions to verify a real API-enrichment
predicate such as a non-empty string (e.g., assert relevancy_result.response !=
"" or assert bool(relevancy_result.response)) and similarly replace the other
`is not None` checks (lines ~237-243) so the test selects the API-enriched
record instead of the first result; refer to the relevancy_result variable and
the EvaluationResult.response field when making these changes.
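The `check_api_available` rework described above can be sketched as follows. Hedges: the comment mentions `httpx`, but this dependency-free sketch uses the standard library `urllib` instead, and the `/v1/models` path plus the `base_url` parameter follow the comment's description rather than the actual test file.

```python
from urllib import error, request


def check_api_available(base_url: str, timeout: float = 5.0) -> bool:
    """Probe /v1/models and treat any non-5xx answer as 'service up'.

    A 401 or 404 still proves the server is listening, so only 5xx
    responses, connection errors, and timeouts count as unavailable.
    """
    url = f"{base_url.rstrip('/')}/v1/models"
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except error.HTTPError as exc:
        # urllib raises HTTPError for 4xx/5xx; the server still answered.
        return exc.code < 500
    except (error.URLError, TimeoutError):
        # Connection refused, DNS failure, or timeout: service is down.
        return False
```

With `httpx` the equivalent would catch `httpx.HTTPError` and compare `response.status_code < 500` directly, since `httpx` does not raise on 4xx status codes by default.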
---
Nitpick comments:
In `@Makefile`:
- Around line 62-63: The Makefile defines the e2e_tests target but doesn't list
it in the .PHONY declaration, so add "e2e_tests" to the existing .PHONY target
(the line that currently contains targets like "help" and "test") to mark it as
phony; edit the .PHONY entry to include e2e_tests (ensuring spacing/formatting
matches the current .PHONY line) so the e2e_tests recipe always runs regardless
of any file or directory named e2e_tests.
In `@pyproject.toml`:
- Line 70: Confirm whether adding the direct dev dependency
"pytest-timeout>=2.4.0" is intentional; if it is not, remove the
"pytest-timeout>=2.4.0" entry from pyproject.toml so the repo continues to rely
on the existing pytest.ini "timeout = 300" configuration without declaring the
plugin, and if it is intentional, leave the line but update project docs or
dev-dependency policy to record the decision and reasoning for declaring
pytest-timeout explicitly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 24713dcf-7a54-4f39-aed9-590845801e58
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (6)
- .github/workflows/e2e_tests.yaml
- Makefile
- pyproject.toml
- pytest.ini
- tests/integration/test_evaluation_data.yaml
- tests/integration/test_full_evaluation.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/e2e_tests.yaml (1)
11-18: ⚠️ Potential issue | 🟠 Major
`matrix.mode` is undefined; job name and env variable will be empty.
The matrix only defines `lsc_image_path`, but lines 15 and 18 reference `matrix.mode`, which doesn't exist. This will result in an empty string in the job name and the `LSC_IMAGE_NAME` environment variable.
🐛 Proposed fix
Either remove the mode reference if it's no longer needed:
```diff
 strategy:
   matrix:
     lsc_image_path: ["quay.io/lightspeed-core/lightspeed-stack:latest"]
-name: "E2E Lightspeed Evaluation Test, mode: ${{ matrix.mode }}"
+name: "E2E Lightspeed Evaluation Test"
 env:
-  LSC_IMAGE_NAME: "lightspeed-stack-test-mode-${{ matrix.mode }}"
+  LSC_IMAGE_NAME: "lightspeed-stack-test"
```
Or add the mode back to the matrix if per-variant execution is still needed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/e2e_tests.yaml around lines 11 - 18, The workflow references an undefined matrix variable matrix.mode in the job name ("name") and the environment variable LSC_IMAGE_NAME; either restore a mode entry in the strategy.matrix (e.g., add a "mode" array alongside lsc_image_path) so matrix.mode resolves, or remove the ${{ matrix.mode }} interpolation from the "name" and LSC_IMAGE_NAME env declaration so they no longer reference matrix.mode; update the "name" field and the env LSC_IMAGE_NAME accordingly.
♻️ Duplicate comments (2)
tests/integration/test_full_evaluation.py (1)
1-1: 🛠️ Refactor suggestion | 🟠 Major
Drop the lint suppressions and fix the underlying causes.
The file-level `# pylint: disable` at line 1, the method-level suppressions at line 81, and the `# type: ignore` at line 170 violate the repository's coding guidelines. Consider:
- Extract the `DataValidator` import to module level or use a helper to reduce locals
- For the type comparison at line 170, use explicit non-None assertions before the comparison

As per coding guidelines: `**/*.py`: Do not disable lint warnings with `# noqa`, `# type: ignore`, or `# pylint: disable` comments - fix the underlying issue instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/test_full_evaluation.py` at line 1, Remove the file-level pylint disable and the method-level suppressions; instead move the DataValidator import out of the local scope into module-level imports (or add a small helper function that returns the validator) to avoid redefined-outer-name/locals warnings, and replace the blind `# type: ignore` used around the type comparison by asserting the value is not None before comparing (use explicit runtime checks or typing.cast/assertIsNotNone equivalents) so the type-checker and pylint are satisfied; update any affected test functions (those referencing DataValidator and the comparison) to use the module-level import/helper and the non-None assertion.

.github/workflows/e2e_tests.yaml (1)
105-105: ⚠️ Potential issue | 🔴 Critical
Security risk: PR-controlled `make` target runs under `pull_request_target` with secrets.
The workflow checks out untrusted PR code (lines 26, 30) and then runs `make e2e_tests` with `OPENAI_API_KEY` available. A malicious fork PR could modify the Makefile to exfiltrate the secret. Consider either:
- Switching to the `pull_request` trigger (secrets won't be available to forks, but tests won't run on fork PRs)
- Keeping test commands inline rather than delegating to the Makefile
- Using a two-workflow approach where approval gates the secret-using job
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/e2e_tests.yaml at line 105, The workflow currently runs the Makefile target make e2e_tests under a pull_request_target trigger with OPENAI_API_KEY available, which lets untrusted PR code modify the Makefile and exfiltrate secrets; fix by either (A) switching the workflow trigger from pull_request_target to pull_request so secrets aren’t exposed to forked PRs, (B) inlining the e2e commands directly in the job instead of invoking make e2e_tests so the checked-out PR repo cannot alter secret-using commands, or (C) split into two workflows/jobs where the job that uses OPENAI_API_KEY (the job that runs e2e_tests) is gated/approved or runs only after a trusted merge to prevent PR-controlled Makefile execution—apply the chosen change to the job that invokes make e2e_tests and ensure OPENAI_API_KEY is not accessible to untrusted runs.
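Option (A) above can be sketched as a workflow fragment. This is a hedged illustration, not the repository's actual workflow: the job name, checkout action version, and branch filter are assumptions.

```yaml
# Sketch: pull_request instead of pull_request_target. GitHub leaves
# secrets empty for workflows triggered by fork PRs, so a malicious
# Makefile change cannot exfiltrate OPENAI_API_KEY.
on:
  pull_request:
    branches: [main]

jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # checks out the PR head; no trusted ref needed
      - name: Run e2e tests
        run: make e2e_tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # empty for fork PRs
```

The trade-off, as the comment notes, is that secret-dependent tests simply cannot run for fork PRs under this trigger.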
🧹 Nitpick comments (2)
tests/integration/test_full_evaluation.py (2)
122-128: Consider moving the `DataValidator` import to module level.
The inline import at lines 122 and 225 contributes to the high local variable count that necessitates the `too-many-locals` suppression. Moving this import to the module level would help address the underlying lint issue.
♻️ Proposed fix at module level
Add to the imports section (around line 23):
```python
from lightspeed_evaluation.core.system import DataValidator
```
Then remove the inline imports at lines 122 and 225.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/test_full_evaluation.py` around lines 122 - 128, Move the inline import of DataValidator to the module-level imports: add "from lightspeed_evaluation.core.system import DataValidator" near the top of the test module and remove the two inline imports that instantiate DataValidator (the ones before creating validator/evaluation_data at the two locations currently importing DataValidator inline). This reduces local variable count and lets you remove the related too-many-locals suppression while preserving the existing DataValidator usage when creating validator and calling validator.load_evaluation_data.
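The shared-helper consolidation suggested here can be sketched with stand-in callables. `DataValidator` and the real evaluation entry points live in `lightspeed_evaluation.core`, so everything below (`EvalRecord`, the names and signatures) is a hypothetical placeholder showing the shape of the refactor, not the project's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalRecord:
    """Hypothetical stand-in for one evaluation data entry."""
    query: str
    expected: str


def load_validate_evaluate(
    load: Callable[[str], list[EvalRecord]],
    validate: Callable[[list[EvalRecord]], None],
    evaluate: Callable[[list[EvalRecord]], dict],
    data_path: str,
) -> dict:
    """Single entry point for the repeated load -> validate -> evaluate flow.

    Each test then holds one local (the returned result) instead of the
    loader, validator, records, and evaluator it previously juggled.
    """
    records = load(data_path)
    validate(records)
    return evaluate(records)
```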
168-175: Redundant assertion at line 174.
Line 168 already asserts `relevancy_result.response.strip()` is truthy, which implies the response exists and is non-empty. The assertion at line 174 (`response is not None`) is redundant since `EvaluationResult.response` defaults to `""` (never `None`).
♻️ Proposed fix
```diff
 # Verify score is above threshold
 assert relevancy_result.response.strip(), "Should have response from API"
 assert (
     relevancy_result.score >= relevancy_result.threshold  # type: ignore
 ), f"Score {relevancy_result.score} should be >= threshold {relevancy_result.threshold}"
-# Verify we got a response from the API
-assert relevancy_result.response is not None, "Should have response from API"
 assert relevancy_result.query == "What is the capital of France?"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/test_full_evaluation.py` around lines 168 - 175, Remove the redundant assertion that checks relevancy_result.response is not None because earlier assert relevancy_result.response.strip() already guarantees a non-empty string and EvaluationResult.response defaults to ""; delete the line containing the redundant assertion and keep the existing strip() check and the query/assertion on relevancy_result.query to validate behavior of relevancy_result.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In @.github/workflows/e2e_tests.yaml:
- Around line 11-18: The workflow references an undefined matrix variable
matrix.mode in the job name ("name") and the environment variable
LSC_IMAGE_NAME; either restore a mode entry in the strategy.matrix (e.g., add a
"mode" array alongside lsc_image_path) so matrix.mode resolves, or remove the ${
{ matrix.mode }} interpolation from the "name" and LSC_IMAGE_NAME env
declaration so they no longer reference matrix.mode; update the "name" field and
the env LSC_IMAGE_NAME accordingly.
---
Duplicate comments:
In @.github/workflows/e2e_tests.yaml:
- Line 105: The workflow currently runs the Makefile target make e2e_tests under
a pull_request_target trigger with OPENAI_API_KEY available, which lets
untrusted PR code modify the Makefile and exfiltrate secrets; fix by either (A)
switching the workflow trigger from pull_request_target to pull_request so
secrets aren’t exposed to forked PRs, (B) inlining the e2e commands directly in
the job instead of invoking make e2e_tests so the checked-out PR repo cannot
alter secret-using commands, or (C) split into two workflows/jobs where the job
that uses OPENAI_API_KEY (the job that runs e2e_tests) is gated/approved or runs
only after a trusted merge to prevent PR-controlled Makefile execution—apply the
chosen change to the job that invokes make e2e_tests and ensure OPENAI_API_KEY
is not accessible to untrusted runs.
In `@tests/integration/test_full_evaluation.py`:
- Line 1: Remove the file-level pylint disable and the method-level
suppressions; instead move the DataValidator import out of the local scope into
module-level imports (or add a small helper function that returns the validator)
to avoid redefined-outer-name/locals warnings, and replace the blind `# type:
ignore` used around the type comparison by asserting the value is not None
before comparing (use explicit runtime checks or typing.cast/assertIsNotNone
equivalents) so the type-checker and pylint are satisfied; update any affected
test functions (those referencing DataValidator and the comparison) to use the
module-level import/helper and the non-None assertion.
---
Nitpick comments:
In `@tests/integration/test_full_evaluation.py`:
- Around line 122-128: Move the inline import of DataValidator to the
module-level imports: add "from lightspeed_evaluation.core.system import
DataValidator" near the top of the test module and remove the two inline imports
that instantiate DataValidator (the ones before creating
validator/evaluation_data at the two locations currently importing DataValidator
inline). This reduces local variable count and lets you remove the related
too-many-locals suppression while preserving the existing DataValidator usage
when creating validator and calling validator.load_evaluation_data.
- Around line 168-175: Remove the redundant assertion that checks
relevancy_result.response is not None because earlier assert
relevancy_result.response.strip() already guarantees a non-empty string and
EvaluationResult.response defaults to ""; delete the line containing the
redundant assertion and keep the existing strip() check and the query/assertion
on relevancy_result.query to validate behavior of relevancy_result.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fe639636-4f4c-452c-be16-b5619fb0af36
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (6)
- .github/workflows/e2e_tests.yaml
- Makefile
- pyproject.toml
- pytest.ini
- tests/integration/test_evaluation_data.yaml
- tests/integration/test_full_evaluation.py
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/integration/test_evaluation_data.yaml
- Makefile
- pyproject.toml
Force-pushed from 188d64b to 21fff16
🧹 Nitpick comments (2)
Makefile (1)
62-64: Consider adding help text for the new target.
Other Makefile targets include a `## comment` for the help system (e.g., `test: install-deps-test ## Execute tests with Pytest`). Adding similar documentation would ensure `make help` shows this new target.
📝 Suggested change
```diff
-e2e_tests: install-deps-test
+e2e_tests: install-deps-test ## Execute e2e integration tests with Pytest
 	uv run pytest tests/integration -v -m integration
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 62 - 64, The new Makefile target e2e_tests lacks the help annotation used by the project's help system; update the e2e_tests target (symbol: e2e_tests) to include a help comment using the same pattern as other targets (add a "## ..." comment on the target line or immediately after it, e.g., "## Run end-to-end integration tests") so it appears in make help, keeping the existing dependency (install-deps-test) and command (uv run pytest tests/integration -v -m integration) unchanged.

tests/integration/test_full_evaluation.py (1)
168-175: Remove the redundant `is not None` assertion.
Line 174's assertion `relevancy_result.response is not None` is redundant after line 168's `assert relevancy_result.response.strip()`. Since `EvaluationResult.response` defaults to `""` (not `None`), the `is not None` check always passes and doesn't add value.
♻️ Suggested simplification
```diff
 # Verify we got a response from the API
-assert relevancy_result.response.strip(), "Should have response from API"
-assert (
-    relevancy_result.score >= relevancy_result.threshold  # type: ignore
-), f"Score {relevancy_result.score} should be >= threshold {relevancy_result.threshold}"
-
-# Verify we got a response from the API
-assert relevancy_result.response is not None, "Should have response from API"
+assert relevancy_result.response.strip(), "Should have non-empty response from API"
+assert relevancy_result.score is not None and relevancy_result.threshold is not None, (
+    "Score and threshold should be set"
+)
+assert (
+    relevancy_result.score >= relevancy_result.threshold
+), f"Score {relevancy_result.score} should be >= threshold {relevancy_result.threshold}"
 assert relevancy_result.query == "What is the capital of France?"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/test_full_evaluation.py` around lines 168 - 175, Remove the redundant assertion that checks relevancy_result.response is not None; the earlier assertion using relevancy_result.response.strip() already guarantees a non-empty string (EvaluationResult.response defaults to ""), so delete the assert relevancy_result.response is not None line and keep the existing strip() check and other assertions (e.g., relevancy_result.query and score comparisons) intact.
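The assertion pattern recommended in this thread can be sketched against a simplified stand-in for `EvaluationResult`. Only the behavior stated above is assumed (`response` defaults to `""`, never `None`); the real model in the evaluation framework has more fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvaluationResult:
    """Simplified stand-in; response defaults to "" rather than None."""
    query: str
    response: str = ""
    score: Optional[float] = None
    threshold: Optional[float] = None


def assert_api_enriched(result: EvaluationResult) -> None:
    """Fail unless the record was actually enriched by the API."""
    # Non-empty check: `is not None` would always pass against the "" default.
    assert result.response.strip(), "Should have non-empty response from API"
    # Explicit non-None checks let the score comparison type-check
    # without a `# type: ignore`.
    assert result.score is not None and result.threshold is not None
    assert result.score >= result.threshold
```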
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@Makefile`:
- Around line 62-64: The new Makefile target e2e_tests lacks the help annotation
used by the project's help system; update the e2e_tests target (symbol:
e2e_tests) to include a help comment using the same pattern as other targets
(add a "## ..." comment on the target line or immediately after it, e.g., "##
Run end-to-end integration tests") so it appears in make help, keeping the
existing dependency (install-deps-test) and command (uv run pytest
tests/integration -v -m integration) unchanged.
In `@tests/integration/test_full_evaluation.py`:
- Around line 168-175: Remove the redundant assertion that checks
relevancy_result.response is not None; the earlier assertion using
relevancy_result.response.strip() already guarantees a non-empty string
(EvaluationResult.response defaults to ""), so delete the assert
relevancy_result.response is not None line and keep the existing strip() check
and other assertions (e.g., relevancy_result.query and score comparisons)
intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1850b091-f917-44c6-a042-d552c57f1daf
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (6)
- .github/workflows/e2e_tests.yaml
- Makefile
- pyproject.toml
- pytest.ini
- tests/integration/test_evaluation_data.yaml
- tests/integration/test_full_evaluation.py
🚧 Files skipped from review as they are similar to previous changes (1)
- pytest.ini
chore: refactor e2e tests
Description
Refactor e2e tests using pytest.
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Summary by CodeRabbit
Tests
Chores