- TL;DR
- Prerequisites
- Setting up your development environment
- Definition of Done
- Automation
- Testing
- Code style

## TL;DR

- Create your own fork of the repo
- Make changes to the code in your fork
- Run unit tests and verification checks
- Check the code with linters
- Submit PR from your fork to main branch of the project repo

## Prerequisites

- git
- Python 3.11 or higher
- pip
Development requires at least Python 3.11 because the project depends on modern ML/AI libraries and evaluation frameworks that rely on recent Python features for performance and compatibility.
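If a tooling script should fail fast on an unsupported interpreter, a small runtime guard can help. This is a minimal sketch, not part of the project; the `MIN_PYTHON` name and `meets_minimum` helper are illustrative:

```python
import sys

# minimum Python version required by the project (per the prerequisites above)
MIN_PYTHON = (3, 11)


def meets_minimum(version: tuple[int, ...], minimum: tuple[int, int] = MIN_PYTHON) -> bool:
    """Return True if `version` (e.g. sys.version_info) satisfies `minimum`."""
    return tuple(version)[:2] >= minimum


# usage in an entry point:
#     if not meets_minimum(sys.version_info):
#         raise SystemExit("Python 3.11+ is required")
```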

## Setting up your development environment

Install `uv`:

```shell
pip install --user uv

# verify the installation -- should return no error
uv --version
```

| Hook | What it runs | When |
|---|---|---|
| pre-commit | `make pre-commit` (all quality checks) | Before each commit |
| pre-push | `make test` | Before each push |
```shell
# clone your fork
git clone https://github.com/YOUR-GIT-PROFILE/lightspeed-evaluation.git

# move into the directory
cd lightspeed-evaluation

# set up your development environment with uv
uv sync --group dev

# now you can run commands through make targets, or prefix commands with `uv run`

# install dev dependencies and git hooks
make install-deps-test

# format code
make black-format

# run all pre-commit checks at once (same as CI)
make pre-commit  # runs: bandit, check-types, pyright, docstyle, ruff, pylint, black-check

# or run each quality check individually:
make bandit       # security scan
make check-types  # type check
make pyright      # type check
make docstyle     # docstring style
make ruff         # lint check
make pylint       # lint check
make black-check  # check formatting

# run tests
make test

# run evaluation (requires OLS API to be running)
uv run evaluate --help
```

Happy hacking!

## Definition of Done

- Code is complete, commented, and merged to the relevant release branch
- User facing documentation written (where relevant)
- Acceptance criteria in the related Jira ticket (where applicable) are verified and fulfilled
- Pull request title+commit includes Jira number
- Changes are covered by unit tests that run cleanly in the CI environment (where relevant)
- Evaluation tests pass with the updated code (where relevant)
- All linters are running cleanly in the CI environment
- Code changes reviewed by at least one peer
- Code changes acked by at least one project owner

## Automation

Code coverage tools are available through the pytest-cov plugin, which is installed as a development dependency. However, coverage measurement is not currently configured by default in the test runs. To run tests with coverage measurement, you can use:
```shell
uv run pytest tests --cov=src --cov-report=html
```

This will generate coverage reports in the `htmlcov` subdirectory.
It is possible to check whether the type hints added to the code are correct, and whether assignments, function calls, etc. use values of the right type. This check is invoked by the following command:
```shell
make check-types
```
Please note that the type hints check might be very slow on the first run. Subsequent runs are much faster thanks to the cache that Mypy uses. This check is part of a CI job that verifies the sources.
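As a hypothetical illustration (not taken from this repository) of the kind of bug this check catches: type hints are not enforced at runtime, so a mismatch only surfaces when Mypy analyzes the sources:

```python
def add_scores(a: float, b: float) -> float:
    """Add two metric scores."""
    return a + b


# Mypy would flag this call: the arguments are str, not float...
result = add_scores("3", "4")  # mypy: Argument 1 has incompatible type "str"
# ...yet at runtime the call succeeds and silently concatenates strings
print(type(result).__name__)  # prints "str": hints are not enforced at runtime
```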
Black, Ruff, Pyright, and Pylint tools are used as linters. These tools are installed as development dependencies. Currently, only basic Mypy configuration is present in pyproject.toml in the [tool.mypy] section. Additional linter configurations can be added as needed.
The list of all rules recognized by Ruff can be retrieved by:

```shell
ruff linter
```
Descriptions of all Ruff rules are available at https://docs.astral.sh/ruff/rules/
Ruff rules can be disabled in source code (for a given line or block) by using a special `noqa` comment. For example:

```python
# noqa: E501
```

The list of all Pylint rules can be retrieved by:
```shell
pylint --list-msgs
```
Descriptions of all rules are available at https://pylint.readthedocs.io/en/latest/user_guide/checkers/features.html
To disable a Pylint rule in source code, a comment line in the following format can be used:

```python
# pylint: disable=C0415
```

## Testing

Tests are used in this repository to verify the correctness of evaluation logic, data processing, and utility functions. The tests are designed to ensure that:
- Evaluation metrics are calculated correctly
- Data processing pipelines work as expected
- API interactions function properly
- Configuration parsing is robust
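For example, a unit test for a simple metric might look like the sketch below. The `exact_match_score` function is a hypothetical stand-in, not an actual project API:

```python
def exact_match_score(response: str, ground_truth: str) -> float:
    """Return 1.0 if the response matches the ground truth exactly
    (ignoring surrounding whitespace and case), 0.0 otherwise."""
    return 1.0 if response.strip().lower() == ground_truth.strip().lower() else 0.0


def test_exact_match_score():
    """The metric should be 1.0 for a match and 0.0 for a mismatch."""
    assert exact_match_score("OpenShift", " openshift ") == 1.0
    assert exact_match_score("OpenShift", "Kubernetes") == 0.0
```

Pytest would collect `test_exact_match_score` automatically based on its `test_` prefix.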
Tests can be started by using the following command:
```shell
make test
```
All tests are based on the Pytest framework. Code coverage can be measured using the pytest-cov plugin, which is available as a development dependency. For mocking and patching, the standard `unittest.mock` framework is used.
As specified in Definition of Done, new changes need to be covered by tests.
**WARNING**: Since tests are executed using Pytest, which relies heavily on fixtures, we discourage the use of patch decorators in test code, as they may interact with one another.
It is possible to use patching inside the test implementation as a context manager:
```python
def test_xyz():
    with patch("lightspeed_core_evaluation.config", new=Mock()):
        ...
        ...
        ...
```

- `new=` allows us to use a different function or class
- `return_value=` allows us to define the return value (no mock will be called)
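The two arguments behave differently, which the following self-contained sketch demonstrates by patching `json.dumps` (a stand-in target chosen for the example; real tests would patch project code instead):

```python
import json
from unittest.mock import patch

# new= replaces the target with the given object (here, a plain function)
with patch("json.dumps", new=lambda obj: "replaced"):
    assert json.dumps({"a": 1}) == "replaced"

# return_value= keeps a Mock in place and fixes what calls to it return
with patch("json.dumps", return_value="mocked") as mock_dumps:
    assert json.dumps({"a": 1}) == "mocked"
    mock_dumps.assert_called_once_with({"a": 1})

# outside the context managers, the real function is restored
assert json.dumps({"a": 1}) == '{"a": 1}'
```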
Sometimes it is necessary to test whether an exception is raised by the tested function or method. In this case `pytest.raises` can be used:
```python
def test_evaluation_with_invalid_config(invalid_config):
    """Check if wrong configuration is detected properly."""
    with pytest.raises(ValueError):
        evaluate_model(invalid_config)
```

It is also possible to check if the exception is raised with the expected message. The message (or its part) is written as a regexp:
```python
def test_constructor_no_provider():
    """Test that constructor checks for provider."""
    # we use bare Exception in the code, so need to check
    # message, at least
    with pytest.raises(Exception, match="ERROR: Missing provider"):
        load_evaluation_model(provider=None)
```

It is possible to capture stdout and stderr by using the standard fixture `capsys`:
```python
def test_evaluation_output(capsys):
    """Test the evaluation function that prints to stdout."""
    run_evaluation("test_config.yaml")

    # check captured output; read it once, as readouterr() consumes it
    captured = capsys.readouterr()
    assert "Evaluation completed" in captured.out
    assert captured.err == ""
```

Capturing logs:
```python
@patch.dict(os.environ, {"LOG_LEVEL": "INFO"})
def test_logger_show_message_flag(mock_load_dotenv, capsys):
    """Test logger set with show_message flag."""
    logger = Logger(logger_name="evaluation", log_level=logging.INFO, show_message=True)
    logger.logger.info("This is my debug message")

    # check captured output; read it once, as readouterr() consumes it
    captured = capsys.readouterr()
    # the log message should be captured
    assert "This is my debug message" in captured.out
    # error output should be empty
    assert captured.err == ""
```

## Code style

We are using Google's docstring style.
Here is a simple example:

```python
def evaluate_model_response(query: str, response: str, ground_truth: str) -> float:
    """Evaluate model response against ground truth using similarity metrics.

    Args:
        query: The input query that was sent to the model.
        response: The response generated by the model.
        ground_truth: The expected/correct response.

    Returns:
        The similarity score between response and ground truth (0.0 to 1.0).

    Raises:
        ValueError: If any of the input parameters are empty or None.
    """
```

For further guidance, see the rest of our codebase, or check sources online.