
# Development Instructions

## Setup

```shell
pip install -e ".[dev]"
```

For leaderboard viewing features: `pip install -e ".[dev,leaderboard]"`

## Testing

```shell
make test
```

For all tests including the leaderboard: `make test-all` (requires `.[dev,leaderboard]`)

## Publication

To publish to PyPI:

```shell
export PYPI_TOKEN=...
bash scripts/publish.sh
```

## Schema Maintenance

The results leaderboard is a HuggingFace dataset with a fixed schema, defined in its README file. The README is generated by `scripts/update_readme.py`, which reads the schema from `src/agenteval/leaderboard/dataset_features.yml`. That file is in turn generated from the Pydantic model by `scripts/update_schema.py`.

If you change the Pydantic model, run the following commands to update the schema and README:

```shell
python scripts/update_schema.py
python scripts/update_readme.py sync-schema
```
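The model-to-schema step can be sketched as follows. This is a minimal illustration of deriving a flat HF-style feature mapping from a model's type annotations; the class, field names, and dtype mapping are assumptions for illustration, not the repo's actual code (which uses Pydantic):

```python
from typing import get_type_hints

# Hypothetical stand-in for the real Pydantic submission model;
# the field names here are illustrative only.
class LeaderboardEntry:
    agent_name: str
    score: float
    num_runs: int

# Assumed mapping from Python annotations to HF datasets Value dtypes.
DTYPE_MAP = {str: "string", float: "float64", int: "int64", bool: "bool"}

def model_to_features(model: type) -> dict:
    """Derive a flat feature mapping from the model's type annotations."""
    return {
        name: {"dtype": DTYPE_MAP[tp], "_type": "Value"}
        for name, tp in get_type_hints(model).items()
    }

features = model_to_features(LeaderboardEntry)
```

The real script serializes the resulting mapping to `dataset_features.yml`, which `update_readme.py` then embeds in the dataset README.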

When a new config version is defined (e.g. v1.1.0), you need to update the HF schema by running:

```shell
python scripts/update_readme.py add-config --config-name <config_name> --split <split1> --split <split2> ...
```

The script will enforce that all config versions use the same data model for leaderboard submissions.
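That consistency check amounts to comparing every config's declared features against a shared reference. A minimal sketch, assuming configs are represented as feature mappings (the function name and error handling are hypothetical, not the script's actual implementation):

```python
def check_configs_consistent(configs: dict) -> None:
    """Raise if any config declares features differing from the others.

    `configs` maps config-version names (e.g. "v1.0.0") to their
    feature mappings; all must use the same data model.
    """
    reference_name = None
    reference = None
    for name, features in configs.items():
        if reference is None:
            reference_name, reference = name, features
        elif features != reference:
            raise ValueError(
                f"Config {name!r} diverges from shared schema "
                f"defined by {reference_name!r}"
            )
```

Because submissions from every config version land in one dataset, a diverging schema would break readers of the combined data, hence the hard failure.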

## Bumping litellm

We take a couple of steps to compute costs consistently. One of them is constraining the litellm version we depend on (more details here).

If you need to bump litellm, also update the `model_prices_and_context_window_backup.json` file we point to in `prep_litellm_cost_map()` in `cli.py`, so that it matches the litellm version you are bumping to (if you allow a version range, use the upper limit). To do that, find the relevant release in the litellm repo, grab the corresponding SHA, and use it in `desired_model_costs_url`. In some cases it may be desirable to rescore all the results of interest after doing this. More on this coming later.
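As a rough sketch of that pinning step, assuming the cost map is fetched from a raw GitHub URL keyed by the release SHA (`desired_model_costs_url` is named in the text above, but the URL template and placeholder SHA here are assumptions):

```python
# SHA of the litellm release you are bumping to; "<release-sha>" is a
# placeholder to be replaced with the actual commit SHA from the repo.
COST_MAP_SHA = "<release-sha>"

# Pinning to a SHA (rather than a branch) keeps cost computation stable
# even if the file changes upstream later.
desired_model_costs_url = (
    "https://raw.githubusercontent.com/BerriAI/litellm/"
    f"{COST_MAP_SHA}/model_prices_and_context_window_backup.json"
)
```

Pinning by SHA rather than by branch or tag name is what makes recomputed costs reproducible across runs.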