As part of the broader AutoGen project, AutoGenBench welcomes community contributions. Contributions are subject to AutoGen's contribution guidelines, as well as a few additional AutoGenBench-specific requirements outlined here. You may also wish to develop your own private benchmark scenarios; the guidance in this document will help with such efforts as well. Below you will find the general requirements, followed by a detailed technical description.
We ask that all contributions to AutoGenBench adhere to the following:
- Follow AutoGen's broader contribution guidelines.
- All AutoGenBench benchmarks should live in a subfolder of `/benchmarks`, alongside `HumanEval`, `GAIA`, etc.
- Benchmark scenarios should include a detailed README.md, in the root of their folder, describing the benchmark and providing citations where warranted.
- Benchmark data (tasks, ground truth, etc.) should be downloaded from their original sources rather than hosted in the AutoGen repository (unless the benchmark is original, and the repository is the original source).
  - You can use the `Scripts/init_tasks.py` file to automate this download.
- Basic scoring should be compatible with the `agbench tabulate` command (e.g., by outputting logs compatible with the default tabulation mechanism, or by providing a `Scripts/custom_tabulate.py` file).
These requirements are further detailed below, but if you simply copy the `HumanEval` folder, you will already be off to a great start.
At the core of any benchmark is a set of tasks. To implement tasks that are runnable by AutoGenBench, you must adhere to AutoGenBench's templating and scenario expansion algorithms, as outlined below.
All tasks are stored in JSONL files (in subdirectories under `./Tasks`). Each line of a tasks file is a JSON object with the following schema:

```
{
    "id": string,
    "template": dirname,
    "substitutions": {
        "filename1": {
            "find_string1_1": replace_string1_1,
            "find_string1_2": replace_string1_2,
            ...
            "find_string1_M": replace_string1_M
        },
        "filename2": {
            "find_string2_1": replace_string2_1,
            "find_string2_2": replace_string2_2,
            ...
            "find_string2_N": replace_string2_N
        }
    }
}
```
For example:
```json
{
    "id": "two_agent_stocks_gpt4",
    "template": "default_two_agents",
    "substitutions": {
        "scenario.py": {
            "__MODEL__": "gpt-4"
        },
        "prompt.txt": {
            "__PROMPT__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD."
        }
    }
}
```
In this example, the string `__MODEL__` will be replaced in the file `scenario.py`, while the string `__PROMPT__` will be replaced in the `prompt.txt` file.
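
Conceptually, each substitution is a plain find-and-replace over the named file's text. As a rough illustration only (not the actual AutoGenBench implementation), the replacement step could look something like the following, where `apply_substitutions` is a hypothetical helper name:

```python
from pathlib import Path


def apply_substitutions(dest_folder: str, substitutions: dict[str, dict[str, str]]) -> None:
    """Apply a task's find/replace pairs to each named file in dest_folder (illustrative sketch)."""
    for filename, replacements in substitutions.items():
        path = Path(dest_folder) / filename
        text = path.read_text()
        for find, replace in replacements.items():
            text = text.replace(find, replace)  # e.g., "__MODEL__" -> "gpt-4"
        path.write_text(text)
```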
The `template` field can also take on a list value, but this usage is considered advanced and is not described here. See the `agbench/run_cmd.py` code, or the `GAIA` benchmark task files, for additional information about this option.
Once the tasks have been defined, as per above, they must be "instantiated" before they can be run. This instantiation happens automatically when the user issues the `agbench run` command, and involves creating a local folder to share with Docker. Each instance and repetition gets its own folder along the path: `./results/[scenario]/[task_id]/[instance_id]`. For the sake of brevity we will refer to this folder as the `DEST_FOLDER`.
The algorithm for populating the `DEST_FOLDER` is as follows (a rough sketch in Python appears after this list):
- Pre-populate DEST_FOLDER with all the basic starter files for running a scenario (found in `agbench/template`).
- Recursively copy the template folder specified in the JSONL line to DEST_FOLDER (if the JSON `template` attribute points to a folder). If the JSON `template` attribute instead points to a file, copy the file, but rename it to `scenario.py`.
- Apply any string replacements, as outlined in the prior section.
- Write a `run.sh` file to DEST_FOLDER that will be executed by Docker when it is loaded. The `run.sh` is described below.
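
For orientation, the expansion could be sketched roughly as below. This is an illustrative outline of the steps listed above, not the actual `agbench run` implementation; `apply_substitutions` refers to the hypothetical helper sketched earlier, and the folder arguments are placeholders.

```python
import shutil
from pathlib import Path


def instantiate_task(task: dict, templates_root: Path, starter_files: Path, dest_folder: Path) -> None:
    """Illustrative outline of the expansion steps above (not the agbench implementation)."""
    # 1. Pre-populate DEST_FOLDER with the basic starter files (e.g., from agbench/template).
    shutil.copytree(starter_files, dest_folder, dirs_exist_ok=True)

    # 2. Copy the template folder, or copy a single template file as scenario.py.
    template = templates_root / task["template"]
    if template.is_dir():
        shutil.copytree(template, dest_folder, dirs_exist_ok=True)
    else:
        shutil.copyfile(template, dest_folder / "scenario.py")

    # 3. Apply the task's string replacements (apply_substitutions is the
    #    hypothetical helper sketched in the prior section).
    apply_substitutions(str(dest_folder), task.get("substitutions", {}))

    # 4. Finally, a run.sh (shown below) would be written into dest_folder for Docker to execute.
```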
Once the task has been instantiated, it is run (via `run.sh`). This script will execute the following steps:
- If a file named `global_init.sh` is present, run it.
- If a file named `scenario_init.sh` is present, run it.
- Install the requirements.txt file (if running in Docker).
- Run the task via `python scenario.py`.
- If scenario.py exited cleanly (exit code 0), then print "SCENARIO.PY COMPLETE !#!#".
- Clean up (delete cache, etc.).
- If a file named `scenario_finalize.sh` is present, run it.
- If a file named `global_finalize.sh` is present, run it.
- Echo "RUN.SH COMPLETE !#!#", signaling that all steps completed.
Notably, this means that scenarios can add custom init and teardown logic by including `scenario_init.sh` and `scenario_finalize.sh` files.
At the time of this writing, the `run.sh` file is as follows:

```sh
export AUTOGEN_TESTBED_SETTING="Docker"
umask 000

# Run the global init script if it exists
if [ -f global_init.sh ] ; then
    . ./global_init.sh
fi

# Run the scenario init script if it exists
if [ -f scenario_init.sh ] ; then
    . ./scenario_init.sh
fi

# Run the scenario
pip install -r requirements.txt
python scenario.py
EXIT_CODE=$?
if [ $EXIT_CODE -ne 0 ]; then
    echo SCENARIO.PY EXITED WITH CODE: $EXIT_CODE !#!#
else
    echo SCENARIO.PY COMPLETE !#!#
fi

# Clean up
if [ -d .cache ] ; then
    rm -Rf .cache
fi

# Run the scenario finalize script if it exists
if [ -f scenario_finalize.sh ] ; then
    . ./scenario_finalize.sh
fi

# Run the global finalize script if it exists
if [ -f global_finalize.sh ] ; then
    . ./global_finalize.sh
fi

echo RUN.SH COMPLETE !#!#
```
Be warned that this listing is provided here for illustration purposes, and may vary over time. The source of truth is the `run.sh` file found in each `./results/[taskset]/[task_id]/[instance_id]` folder.
The above details are sufficient for defining and running tasks, but if you wish to support the `agbench tabulate` command, a few additional steps are required.
If you wish to leverage the default tabulation logic, it is as simple as arranging your `scenario.py` file to output the string "ALL TESTS PASSED !#!#" to the console in the event that a task was solved correctly.
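
For instance, the end of a scenario might look something like the following minimal sketch, where `task_passed` is a hypothetical, benchmark-specific correctness check you would implement yourself:

```python
def task_passed() -> bool:
    """Hypothetical benchmark-specific check, e.g., comparing the agent's output to ground truth."""
    ...  # your scoring logic here
    return True


# At the end of scenario.py, signal success to the default tabulator.
if task_passed():
    print("ALL TESTS PASSED !#!#")
```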
If you wish to implement your own tabulation logic, simply create the file `Scripts/custom_tabulate.py` and include a `main(args)` method. Here, the `args` parameter will be provided by AutoGenBench, and is a drop-in replacement for `sys.argv`. In particular, `args[0]` will be the invocation command (similar to the executable or script name in `sys.argv`), and the remaining values (`args[1:]`) are the command line parameters.
Should you provide a custom tabulation script, please implement `--help` and `-h` options for documenting your interface.
The `scenarios/GAIA/Scripts/custom_tabulate.py` file is a great example of custom tabulation. It also shows how you can reuse some components of the default tabulator to speed up development.
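
As a rough starting point, a skeleton `Scripts/custom_tabulate.py` might look like the sketch below. This is illustrative only, not the GAIA implementation; the `runlogs` argument name and the scoring step are placeholders you would replace with your own logic.

```python
import argparse
import sys


def main(args):
    """Entry point invoked by AutoGenBench; args is a drop-in replacement for sys.argv."""
    parser = argparse.ArgumentParser(
        prog=args[0],
        description="Tabulate results for this benchmark (illustrative skeleton).",
    )
    parser.add_argument("runlogs", help="Path to the results folder produced by 'agbench run'.")
    parsed = parser.parse_args(args[1:])  # argparse provides -h/--help automatically

    # Placeholder: walk the run logs, score each instance, and print a summary table.
    print(f"Would tabulate results under: {parsed.runlogs}")


if __name__ == "__main__":
    main(sys.argv)
```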
Finally, you should provide a `Scripts/init_tasks.py` file, in your benchmark folder, and include a `main()` method therein. This `init_tasks.py` script is a great place to download benchmarks from their original sources and convert them to the JSONL format required by AutoGenBench:
- See `HumanEval/Scripts/init_tasks.py` for an example of how to expand a benchmark from an original GitHub repository.
- See `GAIA/Scripts/init_tasks.py` for an example of how to expand a benchmark from Hugging Face Hub.
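
The general shape of such a script might resemble the sketch below. This is illustrative only; the download step, task fields, template path, and output filename are assumptions you would adapt to your own benchmark.

```python
import json
import os


def create_jsonl(name, tasks, template):
    """Write one task per line in the JSONL schema described earlier (illustrative sketch)."""
    os.makedirs("Tasks", exist_ok=True)
    with open(os.path.join("Tasks", name + ".jsonl"), "w") as fh:
        for task in tasks:
            record = {
                "id": task["id"],
                "template": template,
                "substitutions": {
                    "prompt.txt": {"__PROMPT__": task["prompt"]},
                },
            }
            fh.write(json.dumps(record) + "\n")


def main():
    # Placeholder: download the benchmark from its original source here, then
    # convert each problem into the dict shape expected by create_jsonl.
    tasks = [{"id": "example_task", "prompt": "An example task prompt."}]
    create_jsonl("example_tasks", tasks, "Templates/TwoAgents")  # hypothetical template path


if __name__ == "__main__":
    main()
```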