Description
webarena/evaluation_harness/helper_functions.py
Lines 196 to 203 in dce0468
```python
response = generate_from_openai_chat_completion(
    model="gpt-4-1106-preview",
    messages=messages,
    temperature=0,
    max_tokens=768,
    top_p=1.0,
    context_length=0,
).lower()
```
The model name and temperature are hard-coded. Additionally, the current design does not allow using different providers for evaluation (say, OpenAI) and for the agent's actions (say, vLLM or another provider), since both rely on the same `OPENAI_API_KEY`:
webarena/llms/providers/openai_utils.py
Lines 148 to 149 in dce0468
```python
openai.api_key = os.environ["OPENAI_API_KEY"]
openai.organization = os.environ.get("OPENAI_ORGANIZATION", "")
```
It would be good to support separate OpenAI endpoints for WebArena's internal evaluation and for the model under test (allowing different providers), and to allow setting model names through environment variables, since `gpt-4-1106-preview` is a legacy model and hard-coding it will eventually break. Additionally, GPT-5 series models do not allow non-default temperatures, which also needs to be accounted for.
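A minimal sketch of what this could look like. The environment-variable names (`EVAL_MODEL`, `EVAL_OPENAI_API_KEY`, `EVAL_OPENAI_BASE_URL`) are hypothetical and not part of the current WebArena code; the idea is that evaluation reads its own settings and falls back to the existing ones, and that `temperature` is only passed for models that accept a non-default value:

```python
import os


def build_eval_client_config() -> dict:
    """Read evaluation-specific endpoint settings, falling back to the
    action model's settings when the EVAL_* variables are unset.
    (EVAL_OPENAI_API_KEY / EVAL_OPENAI_BASE_URL are hypothetical names.)"""
    return {
        "api_key": os.environ.get(
            "EVAL_OPENAI_API_KEY", os.environ.get("OPENAI_API_KEY", "")
        ),
        "base_url": os.environ.get(
            "EVAL_OPENAI_BASE_URL", "https://api.openai.com/v1"
        ),
    }


def build_eval_chat_kwargs(messages: list[dict]) -> dict:
    """Build the chat-completion kwargs for the evaluation call.

    The model name comes from a (hypothetical) EVAL_MODEL env var instead
    of being hard-coded, and temperature is omitted for GPT-5 series
    models, which reject non-default temperatures.
    """
    model = os.environ.get("EVAL_MODEL", "gpt-4-1106-preview")
    kwargs = {
        "model": model,
        "messages": messages,
        "max_tokens": 768,
        "top_p": 1.0,
    }
    if not model.startswith("gpt-5"):
        kwargs["temperature"] = 0
    return kwargs
```

The existing `generate_from_openai_chat_completion` call site could then pass `**build_eval_chat_kwargs(messages)` and construct its client from `build_eval_client_config()`, keeping today's defaults when no `EVAL_*` variables are set.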