This document serves as a guide for AI coding agents working on the LightSpeed Evaluation Framework, providing context, instructions, and best practices to help agents understand the codebase and contribute effectively.
| Directory | Status | New Features? | Bug Fixes/Security/Linting? |
|---|---|---|---|
| `src/lightspeed_evaluation/` | Active Development | ✅ Yes | ✅ Yes |
| `lsc_agent_eval/` | Deprecated - will be removed | ❌ No - add to lightspeed_evaluation instead | ✅ Yes (with confirmation) |
| `src/generate_answers/` | Moving to separate repo | ❌ No - will be relocated | ✅ Yes (with confirmation) |
| `tests/`, `config/`, `docs/` | Active | ✅ Yes | ✅ Yes |
Before making changes to deprecated/transitional directories:
- Inform the user about the directory's status
- Get explicit confirmation before proceeding
- For new features: suggest adding to `src/lightspeed_evaluation/` instead
```python
# ❌ WRONG - DO NOT USE
from unittest.mock import patch, MagicMock

# ✅ CORRECT - ALWAYS USE
def test_example(mocker):
    mocker.patch('module.function')
```

When modifying functionality, you MUST update:
- `docs/` - Update relevant guide if feature behavior changes
- `README.md` - Update when features change or adding new features
- `AGENTS.md` - Update if adding new conventions or project structure changes
If the user provides new project conventions, coding standards, or constraints:
- Add them to this `AGENTS.md` file in the appropriate section
- Important guidelines go in this section above
- Best practices go in relevant sections below
Before considering any code change complete, you MUST run:
```bash
# Format code
make black-format

# Run all pre-commit checks at once (same as CI)
make pre-commit    # Runs: bandit, check-types, pyright, docstyle, ruff, pylint, black-check

# Or run each quality check individually:
make bandit        # Security scan
make check-types   # Type check
make pyright       # Type check
make docstyle      # Docstring style
make ruff          # Lint check
make pylint        # Lint check
make black-check   # Check formatting

# Run tests
make test          # Or: uv run pytest tests
```

Git hooks are automatically installed via `make install-deps-test`. They run `make pre-commit` before commits and tests before pushes.
Do NOT skip these steps. If any check fails:
- Fix the issues in code you changed
- For pre-existing issues in unchanged code: notify the user but don't fix (to avoid scope creep)
- Re-run the checks until they pass
- Only then consider the task complete
Important: Do NOT disable lint warnings (e.g., `# noqa`, `# type: ignore`, `# pylint: disable`). Always try to fix the underlying issue. If a fix becomes too complicated, inform the user and discuss alternatives.
The LightSpeed Evaluation Framework is a comprehensive evaluation system for GenAI applications, supporting multiple evaluation frameworks (Ragas, DeepEval, custom metrics) with both turn-level and conversation-level assessments.
- Core Framework: Located in `src/lightspeed_evaluation/`
  - `core/`: Configuration, LLM management, metrics, output handling
  - `pipeline/`: Evaluation orchestration and processing
  - `runner/`: Command-line interface and main entry points
- Configuration: YAML-based system and evaluation data configs in `config/` (a quick parse check is sketched after this list)
- Testing: Comprehensive test suite in `tests/` following pytest conventions
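Since these configs are plain YAML files, a quick standalone parse check can catch syntax errors early. The sketch below uses PyYAML directly rather than the framework's own loader and validation in `core/system/`, and assumes the file is a YAML mapping:

```python
# Sketch: confirm a config file is valid YAML before running an evaluation.
# Uses PyYAML directly; the framework's loader/validation lives in core/system/.
from pathlib import Path

import yaml

config_path = Path("config/system.yaml")
with config_path.open(encoding="utf-8") as handle:
    config = yaml.safe_load(handle)

print(f"Loaded {config_path} with top-level keys: {sorted(config)}")
```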
- Python 3.11+
- `uv` package manager (preferred) or `pip`

```bash
uv sync --group dev
make install-deps-test
```

Refer to README.md for the full list of environment variables. Key variables (a quick check script is sketched after this list):

- `OPENAI_API_KEY` - Required for LLM evaluation
- `API_KEY` - Optional, for API-enabled evaluations
- `KUBECONFIG` - Optional, for script-based evaluations
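To fail fast on a misconfigured shell, a small standalone script like the following can check the variables above before an evaluation run; the script itself is illustrative and not part of the framework:

```python
# Sketch: verify expected environment variables before running an evaluation.
# Variable names come from the list above; this helper is illustrative only.
import os
import sys

REQUIRED = ["OPENAI_API_KEY"]         # required for LLM evaluation
OPTIONAL = ["API_KEY", "KUBECONFIG"]  # API-enabled and script-based evaluations

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing required environment variables: {', '.join(missing)}")

for name in OPTIONAL:
    if not os.environ.get(name):
        print(f"Note: optional variable {name} is not set")
```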
```
src/lightspeed_evaluation/
├── core/
│   ├── api/      # API client for real-time data
│   ├── llm/      # LLM provider management
│   ├── metrics/  # Evaluation metrics (Ragas, DeepEval, custom)
│   ├── models/   # Pydantic data models
│   ├── output/   # Report generation and visualization
│   ├── script/   # Script execution for environment validation
│   └── system/   # Configuration and validation
├── pipeline/
│   └── evaluation/  # Main evaluation pipeline orchestration
└── runner/          # CLI interface and main entry points
```
- `EvaluationPipeline`: Main orchestrator
- `SystemConfig` / `EvaluationData`: Pydantic config models
- `MetricManager`: Metric execution
- `OutputHandler`: Report generation
Mirror the source code structure in `tests/` (see the example mapping after this list):

- Test files: `test_*.py`
- Test functions: `test_*`
- Test classes: `Test*`
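For example, a module and its tests would typically sit at mirrored paths like these (the specific file names are illustrative, not real modules):

```
src/lightspeed_evaluation/core/llm/manager.py    ->  tests/core/llm/test_manager.py
src/lightspeed_evaluation/core/output/report.py  ->  tests/core/output/test_report.py
```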
```python
def test_llm_manager(mocker):
    """Test LLM manager with mocked provider."""
    # mock_response, config, query, and response would be provided by fixtures
    # in the real test suite.
    mock_client = mocker.patch('lightspeed_evaluation.core.llm.openai.OpenAI')
    mock_client.return_value.chat.completions.create.return_value = mock_response

    manager = LLMManager(config)
    result = manager.evaluate_response(query, response)
    assert result.score > 0.5
```

- Aim for >80% coverage on new code
- Run: `uv run pytest tests --cov=src --cov-report=html`
- Type Hints: Required for all public functions and methods
- Docstrings: Google-style docstrings for all public APIs
- Error Handling: Use custom exceptions from `core.system.exceptions`
- Logging: Use structured logging with appropriate levels (a combined example is sketched below)
- Custom Metrics: Add to `src/lightspeed_evaluation/core/metrics/custom/`
- Register: Update the `MetricManager` `supported_metrics` dictionary
- Configure: Add metadata to the `config/system.yaml` `metrics_metadata` section
- Test: Add comprehensive tests with mocked LLM calls using pytest (a rough shape of a custom metric is sketched below)
- Configuration Errors: Check `core/system/validator.py`
- Metric Failures: Enable DEBUG logging (see the snippet below)
- API Issues: Verify `API_KEY` and endpoint connectivity
- Test Failures: Run `make test` and check specific error messages
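For metric failures, DEBUG logging can be enabled with the standard library before invoking the pipeline; if the framework exposes its own logging setup, prefer that:

```python
# Enable DEBUG logging for troubleshooting metric failures (standard library only).
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
```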