---

# BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

BALROG is a novel benchmark evaluating agentic LLM and VLM capabilities on long-horizon interactive tasks using reinforcement learning environments. Check out how current models fare on our [leaderboard](https://balrogai.com). You can read more about BALROG in our [paper](https://arxiv.org/abs/2411.13543).

## Features

- Comprehensive evaluation of agentic abilities
- Support for both language and vision-language models
- Integration with popular AI APIs and local deployment
- Easy integration for custom agents, new environments and new models

## Installation

We advise using conda for the installation:

```bash
conda create -n balrog python=3.10 -y
conda activate balrog

git clone https://github.com/balrog-ai/BALROG.git
cd BALROG
pip install -e .
balrog-post-install
```

On Mac, make sure you have `wget` installed before running `balrog-post-install`.
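If `wget` is missing, a typical way to install it (assuming you use [Homebrew](https://brew.sh), which BALROG itself does not require) is:

```bash
# Install wget via Homebrew (assumes Homebrew is already set up)
brew install wget
```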

## Docker

We provide Docker images. Please see the [relevant README](docker/README.md).

## ⚡️ Evaluate using vLLM locally

We support running LLMs/VLMs locally using [vLLM](https://github.com/vllm-project/vllm). You can spin up a vLLM client and evaluate your agent on BALROG in the following way:

```bash
python eval.py \
  ...
```
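As a minimal sketch of this flow, you serve a model with vLLM's OpenAI-compatible server and point `eval.py` at it. The model name, port, and the `client.client_name=vllm` and `client.base_url` settings below are illustrative assumptions, not confirmed values from this README:

```bash
# Serve a model with vLLM's OpenAI-compatible API server (model name and port are assumptions)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8080 &

# Run the BALROG evaluation against the local server (client settings below are assumptions)
python eval.py \
  agent.type=naive \
  eval.num_workers=16 \
  client.client_name=vllm \
  client.model_id=meta-llama/Llama-3.1-8B-Instruct \
  client.base_url=http://0.0.0.0:8080/v1
```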

On Mac you might have to first export the following to suppress some fork() errors:

```bash
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
```

Check out [vLLM](https://github.com/vllm-project/vllm) for more options on how to serve your models fast and efficiently.

## 🛜 Evaluate using API

We support out-of-the-box clients for the OpenAI, Anthropic and Google Gemini APIs. If you want to evaluate an agent using one of these APIs, you first have to set up your API key in one of two ways:

You can either export it directly:

```bash
export OPENAI_API_KEY=<KEY>
export ANTHROPIC_API_KEY=<KEY>
export GEMINI_API_KEY=<KEY>
```

Or you can modify the `SECRETS` file, adding your API keys.
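A rough sketch of what such entries might look like, assuming the `SECRETS` file uses simple `KEY=value` lines (the format here is an assumption; follow whatever template the file in the repository provides):

```bash
# SECRETS (format assumed; keep real keys out of version control)
OPENAI_API_KEY=<KEY>
ANTHROPIC_API_KEY=<KEY>
GEMINI_API_KEY=<KEY>
```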

You can then run the evaluation with:

```bash
python eval.py \
  agent.type=naive \
  agent.max_image_history=0 \
  agent.max_history=16 \
  eval.num_workers=16 \
  client.client_name=openai \
  client.model_id=gpt-4o-mini-2024-07-18
```

## Documentation

- [Evaluation Guide](https://github.com/balrog-ai/BALROG/blob/main/docs/evaluation.md) - Detailed instructions for various evaluation scenarios
- [Agent Development](https://github.com/balrog-ai/BALROG/blob/main/docs/agents.md) - Tutorial on creating custom agents
- [Few Shot Learning](https://github.com/balrog-ai/BALROG/blob/main/docs/few_shot_learning.md) - Instructions on how to run Few Shot Learning

We welcome contributions! Please see our [Contributing Guidelines](https://github.com/balrog-ai/BALROG/blob/main/docs/contribution.md) for details.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Citation

If you use BALROG in any of your work, please cite: