Merged
Changes from 2 commits
4 changes: 4 additions & 0 deletions .github/workflows/test-envs-docker.yml
@@ -25,6 +25,10 @@ concurrency:
  group: test-envs-docker-${{ github.ref }}-${{ github.event.inputs.site || 'all' }}
  cancel-in-progress: true

concurrency:
  group: test-envs-docker-${{ github.ref }}-${{ github.event.inputs.site || 'all' }}
  cancel-in-progress: true

jobs:
  test-all:
    if: github.event.inputs.site == '' || github.event.inputs.site == 'all'
47 changes: 47 additions & 0 deletions assets/hf_dataset/README.md.jinja2
@@ -0,0 +1,47 @@
---
configs:
- config_name: default
  data_files:
  - split: full
    path: output/build/hf_dataset/full.parquet
  - split: hard
    path: output/build/hf_dataset/hard.parquet
---

# WebArena-Verified

## Dataset description

WebArena-Verified is a curated benchmark dataset of web tasks designed for reproducible
evaluation of web agents across multiple realistic websites.

## Splits

- `full`: {{ full_count }} rows
- `hard`: {{ hard_count }} rows

## Schema notes

The table below lists the column types inferred from the `full` split during artifact generation.

| Column | Type |
| --- | --- |
{% for column_name, column_type in schema -%}
| `{{ column_name }}` | `{{ column_type }}` |
{% endfor %}

## Metadata

- Version: `{{ version }}`
- Git commit: `{{ git_commit }}`
- Generated at (UTC): `{{ generated_at }}`
- Dataset hash: `{{ dataset_hash }}`
- License: `Apache-2.0`
- Language: `en`
- Task categories: `web-navigation`, `information-retrieval`

## Notes

- Expected split counts: `full=812`, `hard=258`
- If custom split names are not supported by the Dataset Viewer, use one config per split
  mapped to `train` as a compatibility fallback.
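
The compatibility fallback in the last note could look roughly like the sketch below. It is an illustration only: `fallback_front_matter` is a hypothetical helper that is not part of this change, and it assumes the fallback reuses the same parquet paths as the `default` config above. It emits front matter with one config per split, each exposing its file under the `train` split.

```python
def fallback_front_matter(split_paths):
    """Render front matter with one config per split, each mapped to `train`.

    Hypothetical helper for illustration; `split_paths` maps a split name
    (used here as the config name) to the parquet file for that split.
    """
    lines = ["---", "configs:"]
    for name, path in split_paths.items():
        lines += [
            f"- config_name: {name}",
            "  data_files:",
            "  - split: train",  # every config exposes its data as `train`
            f"    path: {path}",
        ]
    lines.append("---")
    return "\n".join(lines) + "\n"


# Paths are the ones used by the `default` config in this template.
print(fallback_front_matter({
    "full": "output/build/hf_dataset/full.parquet",
    "hard": "output/build/hf_dataset/hard.parquet",
}))
```

With this layout a consumer would select `full` or `hard` as the config name and read the `train` split, rather than selecting a custom split of the `default` config.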