Evaluation utilities used as part of Survey Assist
Survey Assist evaluation functions
This repository contains utilities for evaluating how well Survey Assist's Large Language Models (LLMs) assign Standard Industrial Classification (SIC) codes. The evaluation framework includes tools for batch processing of datasets and a suite of metrics for comparing LLM performance against human coders.
- Batch Processing: Send large datasets to the API for SIC classification.
- Data Extraction and Processing: Utilities to extract survey response data from a Firestore database, reformat it, and save it in CSV format for analysis.
- Performance Evaluation: A comprehensive suite of metrics to analyze and compare LLM performance against human coders.
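As an illustration of the kind of comparison the performance evaluation makes, the sketch below computes a simple agreement rate between LLM-assigned and human-assigned SIC codes. It is not this repository's implementation: the function name agreement_rate, the digits parameter, and the sample codes are all hypothetical, and it assumes the codes are held as two equal-length lists of 5-character strings.

```python
from collections.abc import Sequence


def agreement_rate(
    llm_codes: Sequence[str],
    human_codes: Sequence[str],
    digits: int = 5,
) -> float:
    """Return the fraction of pairs whose first `digits` characters match.

    Hypothetical helper for illustration only; it is not part of this repository.
    """
    if len(llm_codes) != len(human_codes):
        raise ValueError("Both sequences must have the same length")
    if not llm_codes:
        raise ValueError("At least one pair of codes is required")
    matches = sum(
        llm[:digits] == human[:digits]
        for llm, human in zip(llm_codes, human_codes)
    )
    return matches / len(llm_codes)


# Invented sample values, used only to exercise the function.
llm = ["86101", "47110", "62012", "01110"]
human = ["86101", "47190", "62020", "10110"]

print(agreement_rate(llm, human))            # agreement on the full code
print(agreement_rate(llm, human, digits=2))  # agreement on the first two digits
```

Comparing on a shorter prefix is more lenient than comparing full codes, so reporting both gives a rough sense of whether disagreements are near misses or land in a different part of the classification.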
The Makefile defines a set of commonly used commands and workflows. Where possible, use the commands defined in the Makefile.
Ensure you have the following installed on your local machine:
- Python 3.12 (recommended: use pyenv to manage versions)
- poetry (for dependency management)
- Google Cloud SDK (gcloud) with appropriate permissions
- Colima (if running locally with containers)
- Terraform (for infrastructure management)
Clone the repository
git clone https://github.com/ONSdigital/survey-assist-eval.git
cd survey-assist-eval
Create and activate a virtual environment
Using pyenv and pyenv-virtualenv:
python3.12 -m venv .venv
source .venv/bin/activate
Install Dependencies
poetry install
Note that this installs partner repos (e.g. sic-classification-utils, at a pinned version). To evaluate concurrent changes to the codebase locally, it may be preferable to install from a local path instead. To do this:
- Clone the sic-classification-utils repository at the same directory level as this repository.
- From the root of this repository (with the virtual environment activated), run:
python -m pip install --no-deps --editable ../sic-classification-utils
Generate an API Token
The API uses Application Default Credentials to generate and authenticate tokens.
Ensure GOOGLE_APPLICATION_CREDENTIALS is not set in your environment:
unset GOOGLE_APPLICATION_CREDENTIALS
Log in to gcloud application default:
gcloud auth application-default login
Set to the correct GCP project:
gcloud auth application-default set-quota-project GCP-PROJECT-NAME
Check the project setting:
cat ~/.config/gcloud/application_default_credentials.json
Set the required environment variables:
export SA_EMAIL="SERVICE-ACCOUNT-FOR-API-ACCESS"
export API_GATEWAY="API GATEWAY URL NOT INC https://"
Then run the make command to use the default expiry (1h):
make generate-api-token
Alternatively, run the command from the CLI and pass in a chosen expiry time:
poetry run generate-api-token -e 7200
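Before generating a token, it can help to confirm that the environment matches the steps above. The following is a minimal sketch of such a check, not something shipped with this repository: the function name check_token_environment is hypothetical, and the rules it applies are only those stated in this section (GOOGLE_APPLICATION_CREDENTIALS unset, SA_EMAIL and API_GATEWAY set, and API_GATEWAY given without a URL scheme).

```python
import os


def check_token_environment(env: dict[str, str]) -> list[str]:
    """Return a list of problems found in `env`; an empty list means none found.

    Hypothetical helper for illustration only; it is not part of this repository.
    """
    problems = []
    # Application Default Credentials should be used, so this must not be set.
    if "GOOGLE_APPLICATION_CREDENTIALS" in env:
        problems.append("GOOGLE_APPLICATION_CREDENTIALS is set; unset it first")
    # Both variables are exported in the setup steps above.
    for name in ("SA_EMAIL", "API_GATEWAY"):
        if not env.get(name):
            problems.append(f"{name} is not set")
    # The gateway value is expected without a leading scheme.
    gateway = env.get("API_GATEWAY", "")
    if gateway.startswith(("http://", "https://")):
        problems.append("API_GATEWAY should not include the URL scheme")
    return problems


# Check the current shell environment and report anything that looks wrong.
for problem in check_token_environment(dict(os.environ)):
    print(problem)
```

The function takes the environment as a plain dictionary rather than reading os.environ directly, which keeps it easy to exercise in a unit test.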
Code quality and static analysis are enforced using isort, black, ruff, mypy, pylint, and bandit.
- To check for errors without auto-fixing:
make check-python-nofix
- To check and automatically fix errors:
make check-python
Pytest is used for testing.
- To run unit tests:
make unit-tests
- To run all tests:
make all-tests
Pre-commit hooks are set up to run code quality checks before each commit. They will call make check-python under the hood as well.
To install the hooks, run:
pre-commit install