As part of the broader AutoGen project, AutoGenBench welcomes community contributions. Contributions are subject to AutoGen's contribution guidelines, as well as a few additional AutoGenBench-specific requirements outlined here. You may also wish to develop your own private benchmark scenarios; the guidance in this document will help with such efforts as well. Below you will find the general requirements, followed by a detailed technical description.
We ask that all contributions to AutoGenBench adhere to the following:
- Follow AutoGen's broader contribution guidelines.
- All AutoGenBench benchmarks should live in a subfolder of `/benchmarks`, alongside `HumanEval`, `GAIA`, etc.
- Benchmark scenarios should include a detailed README.md, in the root of their folder, describing the benchmark and providing citations where warranted.
- Benchmark data (tasks, ground truth, etc.) should be downloaded from their original sources rather than hosted in the AutoGen repository (unless the benchmark is original, and the repository is the original source).
  - You can use the `Scripts/init_tasks.py` file to automate this download.
- Basic scoring should be compatible with the `agbench tabulate` command (e.g., by outputting logs compatible with the default tabulation mechanism, or by providing a `Scripts/custom_tabulate.py` file).
These requirements are further detailed below, but if you simply copy the `HumanEval` folder, you will already be off to a great start.
At the core of any benchmark is a set of tasks. To implement tasks that are runnable by AutoGenBench, you must adhere to AutoGenBench's templating and scenario expansion algorithms, as outlined below.
All tasks are stored in JSONL files (in subdirectories under `./Tasks`). Each line of a tasks file is a JSON object with the following schema:

```
{
    "id": string,
    "template": dirname,
    "substitutions": {
        "filename1": {
            "find_string1_1": replace_string1_1,
            "find_string1_2": replace_string1_2,
            ...
            "find_string1_M": replace_string1_M
        },
        "filename2": {
            "find_string2_1": replace_string2_1,
            "find_string2_2": replace_string2_2,
            ...
            "find_string2_N": replace_string2_N
        }
    }
}
```
For example:
```json
{
    "id": "two_agent_stocks_gpt4",
    "template": "default_two_agents",
    "substitutions": {
        "scenario.py": {
            "__MODEL__": "gpt-4"
        },
        "prompt.txt": {
            "__PROMPT__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD."
        }
    }
}
```
In this example, the string `__MODEL__` will be replaced in the file `scenario.py`, while the string `__PROMPT__` will be replaced in the `prompt.txt` file.
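
Conceptually, each substitution is a plain find-and-replace over the named file's text. As a rough illustration only (not the actual AutoGenBench implementation), the replacement step could look something like the following, where `apply_substitutions` is a hypothetical helper name:

```python
from pathlib import Path


def apply_substitutions(dest_folder: str, substitutions: dict[str, dict[str, str]]) -> None:
    """Apply a task's find/replace pairs to each named file in dest_folder (illustrative sketch)."""
    for filename, replacements in substitutions.items():
        path = Path(dest_folder) / filename
        text = path.read_text()
        for find, replace in replacements.items():
            text = text.replace(find, replace)  # e.g., "__MODEL__" -> "gpt-4"
        path.write_text(text)
```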
The `template` field can also take on a list value, but this usage is considered advanced and is not described here. See the `agbench/run_cmd.py` code, or the `GAIA` benchmark task files, for additional information about this option.
Once the tasks have been defined, as per above, they must be "instantiated" before they can be run. This instantiation happens automatically when the user issues the `agbench run` command, and involves creating a local folder to share with Docker. Each instance and repetition gets its own folder along the path: `./results/[scenario]/[task_id]/[instance_id]`. For the sake of brevity we will refer to this folder as the `DEST_FOLDER`.
The algorithm for populating the `DEST_FOLDER` is as follows (a rough sketch in Python appears after this list):
- Pre-populate DEST_FOLDER with all the basic starter files for running a scenario (found in `agbench/template`).
- Recursively copy the template folder specified in the JSONL line to DEST_FOLDER (if the JSON `template` attribute points to a folder). If the JSON `template` attribute instead points to a file, copy the file, but rename it to `scenario.py`.
- Apply any string replacements, as outlined in the prior section.
- Write a `run.sh` file to DEST_FOLDER that will be executed by Docker when it is loaded. The `run.sh` is described below.
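
For orientation, the expansion could be sketched roughly as below. This is an illustrative outline of the steps listed above, not the actual `agbench run` implementation; `apply_substitutions` refers to the hypothetical helper sketched earlier, and the folder arguments are placeholders.

```python
import shutil
from pathlib import Path


def instantiate_task(task: dict, templates_root: Path, starter_files: Path, dest_folder: Path) -> None:
    """Illustrative outline of the expansion steps above (not the agbench implementation)."""
    # 1. Pre-populate DEST_FOLDER with the basic starter files (e.g., from agbench/template).
    shutil.copytree(starter_files, dest_folder, dirs_exist_ok=True)

    # 2. Copy the template folder, or copy a single template file as scenario.py.
    template = templates_root / task["template"]
    if template.is_dir():
        shutil.copytree(template, dest_folder, dirs_exist_ok=True)
    else:
        shutil.copyfile(template, dest_folder / "scenario.py")

    # 3. Apply the task's string replacements (apply_substitutions is the
    #    hypothetical helper sketched in the prior section).
    apply_substitutions(str(dest_folder), task.get("substitutions", {}))

    # 4. Finally, a run.sh (shown below) would be written into dest_folder for Docker to execute.
```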
Once the task has been instantiated, it is run (via `run.sh`). This script will execute the following steps:
- If a file named `global_init.sh` is present, run it.
- If a file named `scenario_init.sh` is present, run it.
- Install the requirements.txt file (if running in Docker).
- Run the task via `python scenario.py`.
- If scenario.py exited cleanly (exit code 0), then print "SCENARIO.PY COMPLETE !#!#".
- Clean up (delete cache, etc.).
- If a file named `scenario_finalize.sh` is present, run it.
- If a file named `global_finalize.sh` is present, run it.
- Echo "RUN.SH COMPLETE !#!#", signaling that all steps completed.
Notably, this means that scenarios can add custom init and teardown logic by including `scenario_init.sh` and `scenario_finalize.sh` files.
At the time of this writing, the `run.sh` file is as follows:

```sh
export AUTOGEN_TESTBED_SETTING="Docker"
umask 000

# Run the global init script if it exists
if [ -f global_init.sh ] ; then
    . ./global_init.sh
fi

# Run the scenario init script if it exists
if [ -f scenario_init.sh ] ; then
    . ./scenario_init.sh
fi

# Run the scenario
pip install -r requirements.txt
python scenario.py
EXIT_CODE=$?
if [ $EXIT_CODE -ne 0 ]; then
    echo SCENARIO.PY EXITED WITH CODE: $EXIT_CODE !#!#
else
    echo SCENARIO.PY COMPLETE !#!#
fi

# Clean up
if [ -d .cache ] ; then
    rm -Rf .cache
fi

# Run the scenario finalize script if it exists
if [ -f scenario_finalize.sh ] ; then
    . ./scenario_finalize.sh
fi

# Run the global finalize script if it exists
if [ -f global_finalize.sh ] ; then
    . ./global_finalize.sh
fi

echo RUN.SH COMPLETE !#!#
```
Be warned that this listing is provided here for illustration purposes, and may vary over time. The source of truth is the `run.sh` file found in each `./results/[taskset]/[task_id]/[instance_id]` folder.
The above details are sufficient for defining and running tasks, but if you wish to support the `agbench tabulate` command, a few additional steps are required.
If you wish to leverage the default tabulation logic, it is as simple as arranging your `scenario.py` file to output the string "ALL TESTS PASSED !#!#" to the console in the event that a task was solved correctly.
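
For instance, the end of a scenario might look something like the following minimal sketch, where `task_passed` is a hypothetical, benchmark-specific correctness check you would implement yourself:

```python
def task_passed() -> bool:
    """Hypothetical benchmark-specific check, e.g., comparing the agent's output to ground truth."""
    ...  # your scoring logic here
    return True


# At the end of scenario.py, signal success to the default tabulator.
if task_passed():
    print("ALL TESTS PASSED !#!#")
```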
If you wish to implement your own tabulation logic, simply create the file `Scripts/custom_tabulate.py` and include a `main(args)` method. Here, the `args` parameter will be provided by AutoGenBench, and is a drop-in replacement for `sys.argv`. In particular, `args[0]` will be the invocation command (similar to the executable or script name in `sys.argv`), and the remaining values (`args[1:]`) are the command line parameters.
Should you provide a custom tabulation script, please implement `--help` and `-h` options for documenting your interface.
The `scenarios/GAIA/Scripts/custom_tabulate.py` file is a great example of custom tabulation. It also shows how you can reuse some components of the default tabulator to speed up development.
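
As a rough starting point, a skeleton `Scripts/custom_tabulate.py` might look like the sketch below. This is illustrative only, not the GAIA implementation; the `runlogs` argument name and the scoring step are placeholders you would replace with your own logic.

```python
import argparse
import sys


def main(args):
    """Entry point invoked by AutoGenBench; args is a drop-in replacement for sys.argv."""
    parser = argparse.ArgumentParser(
        prog=args[0],
        description="Tabulate results for this benchmark (illustrative skeleton).",
    )
    parser.add_argument("runlogs", help="Path to the results folder produced by 'agbench run'.")
    parsed = parser.parse_args(args[1:])  # argparse provides -h/--help automatically

    # Placeholder: walk the run logs, score each instance, and print a summary table.
    print(f"Would tabulate results under: {parsed.runlogs}")


if __name__ == "__main__":
    main(sys.argv)
```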
Finally, you should provide a `Scripts/init_tasks.py` file, in your benchmark folder, and include a `main()` method therein. This `init_tasks.py` script is a great place to download benchmarks from their original sources and convert them to the JSONL format required by AutoGenBench:
- See `HumanEval/Scripts/init_tasks.py` for an example of how to expand a benchmark from an original GitHub repository.
- See `GAIA/Scripts/init_tasks.py` for an example of how to expand a benchmark from Hugging Face Hub.
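
The general shape of such a script might resemble the sketch below. This is illustrative only; the download step, task fields, template path, and output filename are assumptions you would adapt to your own benchmark.

```python
import json
import os


def create_jsonl(name, tasks, template):
    """Write one task per line in the JSONL schema described earlier (illustrative sketch)."""
    os.makedirs("Tasks", exist_ok=True)
    with open(os.path.join("Tasks", name + ".jsonl"), "w") as fh:
        for task in tasks:
            record = {
                "id": task["id"],
                "template": template,
                "substitutions": {
                    "prompt.txt": {"__PROMPT__": task["prompt"]},
                },
            }
            fh.write(json.dumps(record) + "\n")


def main():
    # Placeholder: download the benchmark from its original source here, then
    # convert each problem into the dict shape expected by create_jsonl.
    tasks = [{"id": "example_task", "prompt": "An example task prompt."}]
    create_jsonl("example_tasks", tasks, "Templates/TwoAgents")  # hypothetical template path


if __name__ == "__main__":
    main()
```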