
Note: This project is under development. The API may undergo major changes between versions, so we recommend checking the CHANGELOG for any breaking changes before upgrading.

EvalSense: LLM Evaluation


About

EvalSense is a framework for systematic evaluation of large language models (LLMs) on open-ended generation tasks, with a particular focus on bespoke, domain-specific evaluations. Some of its key features include:

  • Broad model support. Out-of-the-box compatibility with a wide range of local and API-based model providers, including Ollama, Hugging Face, vLLM, OpenAI, Anthropic and others.
  • Evaluation guidance. An interactive evaluation guide and automated meta-evaluation tools assist in selecting the most appropriate evaluation methods for a specific use case, including the use of perturbed data to assess method effectiveness.
  • Interactive UI. A web-based interface enables rapid experimentation with different evaluation workflows without requiring any code.
  • Advanced evaluation methods. EvalSense incorporates recent LLM-as-a-Judge and hybrid evaluation approaches, such as G-Eval and QAGS, while also supporting more traditional metrics like BERTScore and ROUGE.
  • Efficient execution. Intelligent experiment scheduling and resource management minimise computational overhead for local models. For remote APIs, EvalSense uses asynchronous parallel calls to maximise throughput.
  • Modularity and extensibility. Key components and evaluation methods can be used independently or replaced with user-defined implementations.
  • Comprehensive logging. All key aspects of evaluation are recorded in machine-readable logs, including model parameters, prompts, model outputs, evaluation results, and other metadata.

More information about EvalSense can be found on its homepage and in its documentation.

Note: Only public or fake data are shared in this repository.

Project Structure

  • The main code for the EvalSense Python package can be found under evalsense/.
  • The accompanying documentation is available in the docs/ folder.
  • Code for the interactive LLM evaluation guide is located under guide/.
  • Jupyter notebooks with the evaluation experiments and examples are located under notebooks/.

Getting Started

Installation

You can install the project using pip by running the following command:

pip install evalsense

This will install the latest released version of the package from PyPI without any optional dependencies.

Depending on your use case, you may want to install additional dependencies from the following groups:

  • webui: For using the interactive web UI.
  • jupyter: For running experiments in Jupyter notebooks (only needed if you don't already have the necessary libraries installed).
  • transformers: For using models and metrics requiring the Hugging Face Transformers library.
  • vllm: For using models and metrics requiring vLLM.
  • interactive: For using EvalSense with interactive UI features (currently includes webui and jupyter).
  • local: For installing all local model dependencies (currently includes transformers and vllm).
  • all: For installing all optional dependencies.

For example, if you want to install EvalSense with all optional dependencies, you can run:

pip install "evalsense[all]"

If you want to use EvalSense with the interactive features (interactive) and Hugging Face Transformers (transformers), you can run:

pip install "evalsense[interactive,transformers]"

and similarly for other combinations.

Installation for Development

To install the project for local development, you can follow the steps below:

To clone the repo:

git clone git@github.com:nhsengland/evalsense.git

To set up the Python environment for the project:

  • Install uv if it's not installed already
  • uv sync --all-extras
  • source .venv/bin/activate
  • pre-commit install
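
Putting these steps together, a typical setup session (assuming uv is already installed) looks like this:

git clone git@github.com:nhsengland/evalsense.git   # clone the repository
cd evalsense                                        # enter the project directory
uv sync --all-extras                                # create the environment with all optional dependencies
source .venv/bin/activate                           # activate the virtual environment
pre-commit install                                  # install the pre-commit hooks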

Note that the code is formatted with ruff and type-checked by pyright in standard type checking mode. For the best development experience, we recommend enabling the corresponding extensions in your preferred code editor.

To set up the Node environment for the LLM evaluation guide (located under guide/):

  • Install Node.js if it's not installed already
  • Change to the guide/ directory (cd guide)
  • npm install
  • npm run start to run the development server
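
In full, assuming Node.js is already available, this amounts to:

cd guide        # switch to the guide directory
npm install     # install the JavaScript dependencies
npm run start   # launch the development server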

See also the separate README.md for the guide.

Programmatic Usage

For examples illustrating the usage of EvalSense, please check the notebooks under the notebooks/ folder:

  • The Demo notebook illustrates a basic application of EvalSense to the ACI-Bench dataset.
  • The Experiments notebook illustrates more thorough experiments on the same dataset, involving a larger number of evaluators and models.
  • The Meta-Evaluation notebook focuses on meta-evaluation on synthetically perturbed data, where the goal is to identify the most reliable evaluation methods rather than the best-performing models.

Web-Based UI

To use the interactive web-based UI implemented in EvalSense, simply run

evalsense webui

after installing the package and its dependencies. Note that you need to install EvalSense with the webui extra (pip install "evalsense[webui]") or an extra that includes it before running this command.
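
For example, starting from a fresh environment, the following two commands install the required extra and launch the UI:

pip install "evalsense[webui]"
evalsense webui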

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b amazing-feature)
  3. Commit your Changes (git commit -m 'Add some amazing feature')
  4. Push to the Branch (git push origin amazing-feature)
  5. Open a Pull Request

See CONTRIBUTING.md for detailed guidance.

License

Unless stated otherwise, the codebase is released under the MIT Licence. This covers both the codebase and any sample code in the documentation.

See LICENSE for more information.

The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.

Contact

This project is currently maintained by @adamdejl. If you have any questions, suggestions for new features or want to report a bug, please open an issue. For security concerns, please file a private vulnerability report.

To find out more about NHS England Data Science, visit our project website or get in touch at [email protected].

Acknowledgements

We thank the Inspect AI development team for their work on the Inspect AI library, which serves as a basis for EvalSense.