Commit 6c402f9

webvoyager runner and viewer (#73)
1 parent e3bb613 commit 6c402f9

23 files changed

Lines changed: 4203 additions & 276 deletions

bun.lock

Lines changed: 5 additions & 0 deletions
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
[
  "Allrecipes--16",
  "Allrecipes--19",
  "Allrecipes--23",
  "Allrecipes--3",
  "Allrecipes--30",
  "Allrecipes--7",
  "Amazon--16",
  "Amazon--19",
  "Amazon--4",
  "Apple--1",
  "Apple--14",
  "Apple--16",
  "Apple--2",
  "Apple--20",
  "Apple--37",
  "Apple--41",
  "Apple--42",
  "Apple--7",
  "Apple--9",
  "ArXiv--11",
  "BBC News--14",
  "BBC News--16",
  "BBC News--18",
  "BBC News--2",
  "BBC News--21",
  "BBC News--33",
  "BBC News--37",
  "Booking--11",
  "Booking--13",
  "Booking--14",
  "Booking--6",
  "Coursera--17",
  "Coursera--28",
  "ESPN--19",
  "ESPN--2",
  "ESPN--21",
  "ESPN--26",
  "GitHub--22",
  "Google Flights--0",
  "Google Flights--20",
  "Google Flights--7",
  "Google Map--13",
  "Google Map--18",
  "Google Map--26",
  "Google Search--15",
  "Google Search--16",
  "Google Search--22",
  "Huggingface--1",
  "Huggingface--10",
  "Huggingface--20",
  "Huggingface--21",
  "Huggingface--22",
  "Huggingface--23",
  "Huggingface--32",
  "Huggingface--6"
]
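The IDs in this list appear to follow a `<site>--<index>` pattern. A minimal sketch, assuming that pattern holds for every entry, of tallying the excluded tasks per site. `count_by_site` is a hypothetical helper and is not part of this commit.

```python
from collections import Counter


def count_by_site(task_ids):
    """Count task IDs per site, assuming each ID looks like '<site>--<index>'."""
    # rpartition splits on the last '--', so site names with spaces stay whole.
    return Counter(task_id.rpartition("--")[0] for task_id in task_ids)


# A few IDs copied from the list above.
sample = ["Apple--1", "Apple--14", "BBC News--2", "Google Flights--0"]
for site, count in sorted(count_by_site(sample).items()):
    print(site, count)
```

Running the same helper over the full JSON array would show which sites lose the most tasks to the filter.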
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
Given these web tasks, which are evaluated as part of a benchmark for web LLM agents, consider any patches that need to be applied to adjust for recency.

Keep in mind that the original benchmark was created on March 2, 2024.

Today is June 26, 2025.

Tasks whose dates are affected by today's date, or that require a future date, should be adjusted so that they are effectively similar in difficulty/feasibility to the original task set as of March 2, 2024.

Any product information or other details that have changed over time should be adjusted to their modern equivalents.

Do not patch any wording or phrasing unless it is directly for the purpose of adjusting for dates/time/temporal feasibility.

Consider each task carefully, and output any required patches in this format:

{
  "task--id": {
    "reason": "justification for making the patch",
    "prev": "full original task",
    "new": "full adjusted task"
  }
}
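Patches in the format above could be applied to a task list along these lines. This is only a sketch: `apply_patches` is a hypothetical helper, the task text field name (`ques`) is an assumption not confirmed by this commit, and the sample task and patch are made up for illustration.

```python
import json


def apply_patches(tasks, patches, text_key="ques"):
    """Return a new task list with patched text; `text_key` is an assumed field name."""
    patched = []
    for task in tasks:
        patch = patches.get(task["id"])
        if patch is None:
            patched.append(task)
            continue
        # Guard against stale patches: 'prev' must match the current task text.
        if task[text_key] != patch["prev"]:
            raise ValueError(f"patch for {task['id']} does not match the task text")
        # Build a copy so the original task dict is left untouched.
        patched.append({**task, text_key: patch["new"]})
    return patched


# Hypothetical task and patch, not taken from the benchmark.
tasks = [{"id": "Example--0", "ques": "Find a flight departing on March 10, 2024."}]
patches = json.loads("""
{
  "Example--0": {
    "reason": "date is in the past relative to today",
    "prev": "Find a flight departing on March 10, 2024.",
    "new": "Find a flight departing on July 4, 2025."
  }
}
""")
print(apply_patches(tasks, patches)[0]["ques"])
```

Checking `prev` against the current text before substituting means a patch written for an older wording fails loudly instead of silently overwriting the task.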
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
#!/usr/bin/env python3
import json


def filter_possible_tasks(original_file, impossible_file, output_file):
    # Load impossible task IDs
    with open(impossible_file, 'r') as f:
        impossible_ids = set(json.load(f))

    print(f"Loaded {len(impossible_ids)} impossible task IDs")

    # Process original tasks and filter out impossible ones
    possible_count = 0
    total_count = 0

    with open(original_file, 'r') as infile, open(output_file, 'w') as outfile:
        for line in infile:
            line = line.strip()
            if not line:
                continue  # Skip blank lines so json.loads does not fail on them
            total_count += 1
            task = json.loads(line)

            # Keep the task only if its ID is not in the impossible list
            if task['id'] not in impossible_ids:
                json.dump(task, outfile)
                outfile.write('\n')
                possible_count += 1

    print(f"Processed {total_count} total tasks")
    # Report the number actually removed; listed IDs may be absent from the input
    print(f"Filtered out {total_count - possible_count} impossible tasks")
    print(f"Wrote {possible_count} possible tasks to {output_file}")

    return possible_count, total_count


if __name__ == "__main__":
    original_file = "originalTasks.jsonl"
    impossible_file = "impossibleTasks.json"
    output_file = "possibleTasks.jsonl"

    possible, total = filter_possible_tasks(original_file, impossible_file, output_file)
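The filtering step can also be exercised without touching any files. A minimal sketch, where `filter_possible_lines` is a hypothetical in-memory variant of the script's loop and the `Example--N` IDs are placeholders rather than real benchmark tasks:

```python
import json


def filter_possible_lines(lines, impossible_ids):
    """Yield parsed tasks whose 'id' is not in impossible_ids, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        task = json.loads(line)
        if task["id"] not in impossible_ids:
            yield task


# Placeholder JSONL lines, including a blank one.
lines = [
    '{"id": "Example--0"}',
    '{"id": "Example--1"}',
    '',
    '{"id": "Example--2"}',
]
kept = list(filter_possible_lines(lines, {"Example--1"}))
print([task["id"] for task in kept])
```

Separating the filter from the file handling like this makes the logic easy to unit test; the file-based function could then wrap it.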

evals/webvoyager/data/impossible_filter.py

Whitespace-only changes.
