pip install -e ".[dev]"

For leaderboard viewing features:

pip install -e ".[dev,leaderboard]"
make test

For all tests including leaderboard: make test-all (requires .[dev,leaderboard])
To publish to PyPI:
export PYPI_TOKEN=...
bash scripts/publish.sh

The results leaderboard is a HuggingFace dataset with a fixed schema, defined in its README file.
The README file is generated by the scripts/update_readme.py script,
which reads the schema from the src/agenteval/leaderboard/dataset_features.yml file. That file is
in turn generated from the Pydantic model by the scripts/update_schema.py script.
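For orientation, the flow from the Pydantic model to the generated schema looks roughly like the sketch below. This is a hedged illustration only: the LeaderboardRow class, its field names, and the TYPE_MAP conversion are made up here, and the real model and conversion logic live in src/agenteval and scripts/update_schema.py.

```python
# Hypothetical sketch of turning a Pydantic model into a HuggingFace datasets
# feature schema and dumping it to YAML. Field names and the type mapping are
# illustrative, not the project's actual code.
import yaml
from pydantic import BaseModel
from datasets import Features, Value


class LeaderboardRow(BaseModel):
    agent_name: str       # illustrative field
    submit_time: str      # illustrative field
    overall_score: float  # illustrative field


# Simplified mapping from Python annotations to datasets feature types.
TYPE_MAP = {str: Value("string"), float: Value("float64"), int: Value("int64")}

features = Features(
    {name: TYPE_MAP[field.annotation] for name, field in LeaderboardRow.model_fields.items()}
)

# Something along these lines is what the generated YAML schema captures and
# what the README regeneration step reads back.
print(yaml.safe_dump({name: str(feature) for name, feature in features.items()}))
```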
If you make changes to the Pydantic model, you should run the following commands to update the schema and README:
python scripts/update_schema.py
python scripts/update_readme.py sync-schema

When a new config version is defined (e.g. v1.1.0), you need to update the HF schema by running:
python scripts/update_readme.py add-config --config-name <config_name> --split <split1> --split <split2> ...

The script will enforce that all config versions use the same data model for leaderboard submissions.
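The consistency check amounts to something like the sketch below. This is an assumption about the shape of the enforcement, using a stand-in for the dataset README's dataset_info metadata; the actual layout is defined by the schema files described above.

```python
# Hypothetical sketch of the consistency check: every config version declared
# in the dataset README metadata must use the same feature schema.
import yaml

# Stand-in for the dataset README front matter; the real metadata is
# maintained by scripts/update_readme.py.
metadata = yaml.safe_load("""
dataset_info:
- config_name: "1.0.0"
  features:
  - {name: agent_name, dtype: string}
  - {name: overall_score, dtype: float64}
- config_name: "1.1.0"
  features:
  - {name: agent_name, dtype: string}
  - {name: overall_score, dtype: float64}
""")

configs = metadata["dataset_info"]
reference = configs[0]["features"]
for cfg in configs[1:]:
    assert cfg["features"] == reference, (
        f"config {cfg['config_name']} does not match the shared leaderboard schema"
    )
```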
We take a couple of steps to compute costs in a consistent way. One of these is putting limits on the litellm version used (more details here).
If you need to bump litellm, please also update the version of the model_prices_and_context_window_backup.json file we point to in prep_litellm_cost_map() in cli.py so it matches the litellm version you are bumping to (if you allow a range, use the version at the upper limit). To do that, find the relevant release in the litellm repo, grab the corresponding SHA, and use it in desired_model_costs_url. In some cases, it may be desirable to rescore all the results of interest after doing this. More on this coming later.
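As a rough sketch of what that pin might look like (the SHA placeholder and the variable body below are illustrative; the real logic lives in prep_litellm_cost_map() in cli.py):

```python
# Hypothetical sketch: point desired_model_costs_url at the cost-map file in
# the BerriAI/litellm repo, pinned to the SHA of the release that matches the
# litellm version bound we allow. The SHA below is a placeholder.
LITELLM_RELEASE_SHA = "<sha-of-the-target-litellm-release>"

desired_model_costs_url = (
    "https://raw.githubusercontent.com/BerriAI/litellm/"
    f"{LITELLM_RELEASE_SHA}/"
    "litellm/model_prices_and_context_window_backup.json"
)
```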