feat: add ARC-AGI-1, ARC-AGI-2, and BrowseComp benchmark tasks #1190

Merged: Luodian merged 1 commit into dev-v0d7 from feat/arc-agi-benchmarks on Feb 24, 2026

Luodian commented Feb 23, 2026

Summary

  • Add ARC-AGI-1 image benchmark task using community dataset mertaylin/arc-agi-images (400 eval samples, exact grid match)
  • Add ARC-AGI-2 image benchmark task using community dataset vincentkoc/arc-agi-2 (120 eval samples, interleaved demo images)
  • Add BrowseComp text benchmark task from OpenAI (1,266 encrypted samples, LLM-as-judge with exact match fallback)

Details

ARC-AGI-1 (lmms_eval/tasks/arc_agi_1/)

  • Dataset: mertaylin/arc-agi-images (eval split)
  • Uses stacked demo composite image + test input image
  • Metric: arc_agi_1_acc (exact grid match, JSON array-of-arrays parsing)
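
The exact-grid-match metric described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `parse_grid` and `exact_grid_match` are hypothetical, and the choice to take the last well-formed grid in the model output is an assumption.

```python
import json
import re
from typing import List, Optional

Grid = List[List[int]]


def parse_grid(text: str) -> Optional[Grid]:
    """Return the last well-formed JSON array-of-arrays found in text, else None.

    Hypothetical helper: the parsing rules in the PR may differ.
    """
    # Candidate spans run from "[[" to the nearest "]]" (whitespace allowed).
    candidates = re.findall(r"\[\s*\[.*?\]\s*\]", text, flags=re.DOTALL)
    for candidate in reversed(candidates):
        try:
            grid = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        # Accept only a non-empty list whose elements are all lists.
        if isinstance(grid, list) and grid and all(isinstance(r, list) for r in grid):
            return grid
    return None


def exact_grid_match(prediction_text: str, target: Grid) -> bool:
    """Score True only if the parsed grid equals the target cell for cell."""
    predicted = parse_grid(prediction_text)
    return predicted is not None and predicted == target
```

Under this sketch an unparseable response scores as incorrect rather than raising, which keeps a per-sample metric well defined.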

ARC-AGI-2 (lmms_eval/tasks/arc_agi_2/)

  • Dataset: vincentkoc/arc-agi-2 (evaluation split)
  • Interleaves train input/output images as few-shot demos
  • Metric: arc_agi_2_acc (exact grid match with _normalize_grid type coercion)
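
As an illustration of why type coercion matters for exact match, here is a guess at what a helper in the spirit of `_normalize_grid` might do. The body below is an assumption, not the implementation in this PR: it coerces every cell to `int` and rejects ragged or non-numeric grids, so that a prediction such as `[["1", 2.0]]` compares equal to `[[1, 2]]`.

```python
from typing import Any, List, Optional


def normalize_grid(grid: Any) -> Optional[List[List[int]]]:
    """Coerce a nested sequence to a rectangular list of int rows, else None.

    Hypothetical stand-in for the PR's _normalize_grid; actual rules may differ.
    """
    if not isinstance(grid, (list, tuple)) or not grid:
        return None
    rows: List[List[int]] = []
    for row in grid:
        if not isinstance(row, (list, tuple)):
            return None
        try:
            # int() accepts ints, floats, and digit strings such as "3".
            rows.append([int(cell) for cell in row])
        except (TypeError, ValueError):
            return None
    # Reject ragged grids: every row must have the same length.
    if len({len(r) for r in rows}) != 1:
        return None
    return rows


def grids_equal(predicted: Any, target: Any) -> bool:
    """Exact match after normalizing both sides."""
    p, t = normalize_grid(predicted), normalize_grid(target)
    return p is not None and p == t
```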

BrowseComp (lmms_eval/tasks/browsecomp/)

  • Dataset: OpenAI Azure blob CSV (XOR-encrypted to prevent contamination)
  • Decrypted at eval time using a SHA-256-derived key
  • LLM-as-Judge scoring via lmms_eval/llm_judge when API configured; falls back to exact match
  • Custom aggregation with per-topic accuracy breakdown
  • Metric: browsecomp_acc
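
The decryption and per-topic aggregation steps above could look roughly like the sketch below. Several details are assumptions rather than facts from this PR: that the ciphertext is base64-encoded, that the key is the SHA-256 digest of a password string repeated to the message length, and that each result carries `topic` and `correct` fields. All function names are hypothetical.

```python
import base64
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def derive_key(password: str, length: int) -> bytes:
    """Repeat the SHA-256 digest of password until it covers `length` bytes."""
    digest = hashlib.sha256(password.encode()).digest()
    return digest * (length // len(digest)) + digest[: length % len(digest)]


def xor_encrypt(plaintext: str, password: str) -> str:
    """Inverse of xor_decrypt, included only so the sketch can be round-tripped."""
    data = plaintext.encode()
    key = derive_key(password, len(data))
    return base64.b64encode(bytes(a ^ b for a, b in zip(data, key))).decode()


def xor_decrypt(ciphertext_b64: str, password: str) -> str:
    """Assumed scheme: base64-decode, then XOR with the derived key."""
    data = base64.b64decode(ciphertext_b64)
    key = derive_key(password, len(data))
    return bytes(a ^ b for a, b in zip(data, key)).decode()


def aggregate_by_topic(results: List[dict]) -> Tuple[float, Dict[str, float]]:
    """Return (overall accuracy, per-topic accuracy) from per-sample results.

    Each result is assumed to look like {"topic": str, "correct": bool}.
    """
    per_topic: Dict[str, List[float]] = defaultdict(list)
    for r in results:
        per_topic[r["topic"]].append(float(r["correct"]))
    breakdown = {t: sum(v) / len(v) for t, v in per_topic.items()}
    total = sum(len(v) for v in per_topic.values())
    overall = sum(sum(v) for v in per_topic.values()) / total if total else 0.0
    return overall, breakdown
```

XOR with a hash-derived key is obfuscation rather than security; its purpose here, as the PR notes, is to keep plaintext questions and answers out of scraped training data.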

Verification

  • All 3 tasks discoverable via --tasks list
  • Python syntax check passed ✅
  • pre-commit run (black + isort) passed ✅

Linear Issues

Resolves LMM-306, LMM-307, LMM-273

- ARC-AGI-1: visual reasoning using mertaylin/arc-agi-images (400 eval samples)
- ARC-AGI-2: visual reasoning using vincentkoc/arc-agi-2 (120 eval samples)
- BrowseComp: OpenAI info-finding benchmark with XOR-encrypted data and LLM judge

Resolves LMM-306, LMM-307, LMM-273
Luodian merged commit 39713c5 into dev-v0d7 on Feb 24, 2026
2 checks passed
Luodian added a commit that referenced this pull request Feb 28, 2026