Description
webarena/evaluation_harness/helper_functions.py
Lines 196 to 203 in dce0468
```python
response = generate_from_openai_chat_completion(
    model="gpt-4-1106-preview",
    messages=messages,
    temperature=0,
    max_tokens=768,
    top_p=1.0,
    context_length=0,
).lower()
```
The model name and temperature are hard-coded. Additionally, the current design does not allow using different providers for evaluation (say, OpenAI) and for the agent's actions (say, vLLM or another provider), since both rely on the same `OPENAI_API_KEY`:
webarena/llms/providers/openai_utils.py
Lines 148 to 149 in dce0468
```python
openai.api_key = os.environ["OPENAI_API_KEY"]
openai.organization = os.environ.get("OPENAI_ORGANIZATION", "")
```
It would be good to support separate OpenAI endpoints for WebArena's internal evaluation and for the model under test (allowing different providers), and to allow setting model names through environment variables, since `gpt-4-1106-preview` is a legacy model and hard-coding it will eventually break. Additionally, GPT-5 series models do not allow non-default temperatures, which also needs to be accounted for.
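A minimal sketch of what this could look like. The environment-variable names (`EVAL_MODEL`, `EVAL_OPENAI_API_KEY`, `EVAL_OPENAI_BASE_URL`) are hypothetical and not part of the current WebArena code; the idea is that evaluation reads its own settings and falls back to the existing ones, and that `temperature` is only passed for models that accept a non-default value:

```python
import os


def build_eval_client_config() -> dict:
    """Read evaluation-specific endpoint settings, falling back to the
    action model's settings when the EVAL_* variables are unset.
    (EVAL_OPENAI_API_KEY / EVAL_OPENAI_BASE_URL are hypothetical names.)"""
    return {
        "api_key": os.environ.get(
            "EVAL_OPENAI_API_KEY", os.environ.get("OPENAI_API_KEY", "")
        ),
        "base_url": os.environ.get(
            "EVAL_OPENAI_BASE_URL", "https://api.openai.com/v1"
        ),
    }


def build_eval_chat_kwargs(messages: list[dict]) -> dict:
    """Build the chat-completion kwargs for the evaluation call.

    The model name comes from a (hypothetical) EVAL_MODEL env var instead
    of being hard-coded, and temperature is omitted for GPT-5 series
    models, which reject non-default temperatures.
    """
    model = os.environ.get("EVAL_MODEL", "gpt-4-1106-preview")
    kwargs = {
        "model": model,
        "messages": messages,
        "max_tokens": 768,
        "top_p": 1.0,
    }
    if not model.startswith("gpt-5"):
        kwargs["temperature"] = 0
    return kwargs
```

The existing `generate_from_openai_chat_completion` call site could then pass `**build_eval_chat_kwargs(messages)` and construct its client from `build_eval_client_config()`, keeping today's defaults when no `EVAL_*` variables are set.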