Skip to content

Latest commit

 

History

History
287 lines (204 loc) · 9.46 KB

File metadata and controls

287 lines (204 loc) · 9.46 KB

CONTRIBUTING

TLDR;

  1. Create your own fork of the repo
  2. Make changes to the code in your fork
  3. Run unit tests and verification checks
  4. Check the code with linters
  5. Submit PR from your fork to main branch of the project repo

Prerequisites

  • git
  • Python 3.11 or higher
  • pip

The development requires at least Python 3.11 due to dependencies on modern ML/AI libraries and evaluation frameworks that leverage the latest Python features for performance and compatibility.

Tooling installation

  1. pip install --user uv
  2. uv --version -- should return no error

Installed Hooks

Hook What it runs When
pre-commit make pre-commit (all quality checks) Before each commit
pre-push make test Before each push

Setting up your development environment

# clone your fork
git clone https://github.com/YOUR-GIT-PROFILE/lightspeed-evaluation.git

# move into the directory
cd lightspeed-evaluation

# setup your development environment with uv
uv sync --group dev

# Now you can run commands through make targets, or prefix commands with `uv run`

# Install dev dependencies and git hooks
make install-deps-test

# Format code
make black-format

# Run all pre-commit checks at once (same as CI)
make pre-commit      # Runs: bandit, check-types, pyright, docstyle, ruff, pylint, black-check
# or Run each quality checks individually:
make bandit          # Security scan
make check-types     # Type check
make pyright         # Type check
make docstyle        # Docstring style
make ruff            # Lint check
make pylint          # Lint check
make black-check     # Check formatting

# Run tests
make test

# run evaluation (requires OLS API to be running)
uv run evaluate --help

Happy hacking!

Definition of Done

A deliverable is to be considered "done" when

  • Code is complete, commented, and merged to the relevant release branch
  • User facing documentation written (where relevant)
  • Acceptance criteria in the related Jira ticket (where applicable) are verified and fulfilled
  • Pull request title+commit includes Jira number
  • Changes are covered by unit tests that run cleanly in the CI environment (where relevant)
  • Evaluation tests pass with the updated code (where relevant)
  • All linters are running cleanly in the CI environment
  • Code changes reviewed by at least one peer
  • Code changes acked by at least one project owner

Automation

Code coverage measurement

Code coverage tools are available through the pytest-cov plugin, which is installed as a development dependency. However, coverage measurement is not currently configured by default in the test runs. To run tests with coverage measurement, you can use:

uv run pytest tests --cov=src --cov-report=html

This will generate coverage reports in the htmlcov subdirectory.

Type hints checks

It is possible to check if type hints added into the code are correct and whether assignments, function calls etc. use values of the right type. This check is invoked by following command:

make check-types

Please note that type hints check might be very slow on the first run. Subsequent runs are much faster thanks to the cache that Mypy uses. This check is part of a CI job that verifies sources.

Linters

Black, Ruff, Pyright, and Pylint tools are used as linters. These tools are installed as development dependencies. Currently, only basic Mypy configuration is present in pyproject.toml in the [tool.mypy] section. Additional linter configurations can be added as needed.

List of all Ruff rules recognized by Ruff can be retrieved by:

ruff linter

Description of all Ruff rules are available on https://docs.astral.sh/ruff/rules/

Ruff rules can be disabled in source code (for given line or block) by using special noqa comment line. For example:

# noqa: E501

List of all Pylint rules can be retrieved by:

pylint --list-msgs

Description of all rules are available on https://pylint.readthedocs.io/en/latest/user_guide/checkers/features.html

To disable Pylint rule in source code, the comment line in following format can be used:

# pylint: disable=C0415

Testing

Tests are used in this repository to verify the correctness of evaluation logic, data processing, and utility functions. The tests are designed to ensure that:

  1. Evaluation metrics are calculated correctly
  2. Data processing pipelines work as expected
  3. API interactions function properly
  4. Configuration parsing is robust

Tests can be started by using the following command:

make test

All tests are based on the Pytest framework. Code coverage can be measured using the pytest-cov plugin, which is available as a development dependency. For mocking and patching, the unittest framework is used.

As specified in Definition of Done, new changes need to be covered by tests.

Tips and hints for developing unit tests

Patching

WARNING: Since tests are executed using Pytest, which relies heavily on fixtures, we discourage use of patch decorators in all test code, as they may interact with one another.

It is possible to use patching inside the test implementation as a context manager:

def test_xyz():
    with patch("lightspeed_core_evaluation.config", new=Mock()):
        ...
        ...
        ...
  • new= allow us to use different function or class
  • return_value= allow us to define return value (no mock will be called)

Verifying that some exception is thrown

Sometimes it is needed to test whether some exception is thrown from a tested function or method. In this case pytest.raises can be used:

def test_evaluation_with_invalid_config(invalid_config):
    """Check if wrong configuration is detected properly."""
    with pytest.raises(ValueError):
        evaluate_model(invalid_config)

It is also possible to check if the exception is thrown with the expected message. The message (or its part) is written as regexp:

def test_constructor_no_provider():
    """Test that constructor checks for provider."""
    # we use bare Exception in the code, so need to check
    # message, at least
    with pytest.raises(Exception, match="ERROR: Missing provider"):
        load_evaluation_model(provider=None)

Checking what was printed and logged to stdout or stderr by the tested code

It is possible to capture stdout and stderr by using standard fixture capsys:

def test_evaluation_output(capsys):
    """Test the evaluation function that prints to stdout."""
    run_evaluation("test_config.yaml")

    # check captured log output
    captured_out = capsys.readouterr().out
    assert "Evaluation completed" in captured_out
    captured_err = capsys.readouterr().err
    assert captured_err == ""

Capturing logs:

@patch.dict(os.environ, {"LOG_LEVEL": "INFO"})
def test_logger_show_message_flag(mock_load_dotenv, capsys):
    """Test logger set with show_message flag."""
    logger = Logger(logger_name="evaluation", log_level=logging.INFO, show_message=True)
    logger.logger.info("This is my debug message")

    # check captured log output
    # the log message should be captured
    captured_out = capsys.readouterr().out
    assert "This is my debug message" in captured_out

    # error output should be empty
    captured_err = capsys.readouterr().err
    assert captured_err == ""

Code style

Docstrings style

We are using Google's docstring style.

Here is simple example:

def evaluate_model_response(query: str, response: str, ground_truth: str) -> float:
    """Evaluate model response against ground truth using similarity metrics.
    
    Args:
        query: The input query that was sent to the model.
        response: The response generated by the model.
        ground_truth: The expected/correct response.
    
    Returns:
        The similarity score between response and ground truth (0.0 to 1.0).
    
    Raises:
        ValueError: If any of the input parameters are empty or None.
    """

For further guidance, see the rest of our codebase, or check sources online. There are many, eg. this one.