feat (tasks): add AAII GPQA Diamond tasks, extraction regex, and reasoning/non-reasoning wrappers by saint1729 · Pull Request #3547 · EleutherAI/lm-evaluation-harness

saint1729 · 2026-02-03T04:14:42Z

Summary:

Adds an AAII-style zero-shot task for the GPQA Diamond dataset, plus wrapper configs, under
lm_eval/tasks/gpqa/generative/aaii/v_4_0.
Implements process_docs preprocessing to shuffle choices and set single-letter answers.
Adds generator to produce reasoning / non-reasoning wrapper YAMLs.
Extends RegexFilter in lm_eval/filters/extraction.py to support a fallback sequence
of regex patterns for robust single-letter answer extraction; task YAMLs use this.
Ran small end-to-end smoke tests (limit=5) against OpenAI chat-completions while
iterating on the YAML and filter shapes; results and samples are saved under output/.
No new third-party dependencies.
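
For reviewers, the shuffle-and-relabel step described above can be sketched roughly as follows. This is not the code in utils.py: the helper names, the seed handling, and the output keys "choices" and "answer" are assumptions made for illustration. The input keys are the GPQA dataset's column names. In the harness, the per-document function would normally be applied through the dataset's map method rather than over a plain list.

```python
import random

LETTERS = "ABCD"


def shuffle_choices(doc: dict, rng: random.Random) -> dict:
    """Shuffle the four answer options and record the correct one as a letter."""
    correct = doc["Correct Answer"].strip()
    choices = [doc[f"Incorrect Answer {i}"].strip() for i in (1, 2, 3)]
    choices.append(correct)
    rng.shuffle(choices)
    return {
        **doc,
        "choices": choices,  # hypothetical output key
        "answer": LETTERS[choices.index(correct)],  # single letter, e.g. "C"
    }


def process_docs_sketch(docs: list, seed: int = 0) -> list:
    """Apply the shuffle with one seeded RNG so repeated runs give the same order."""
    rng = random.Random(seed)
    return [shuffle_choices(doc, rng) for doc in docs]
```

Seeding a dedicated random.Random instance, instead of the module-level generator, keeps the gold letter for each question stable between runs regardless of other code that draws random numbers.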

Files added:

lm_eval/tasks/gpqa/generative/aaii/v_4_0/_gpqa_diamond_generative_zeroshot_aaii_v_4_0_yaml
lm_eval/tasks/gpqa/generative/aaii/v_4_0/gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning.yaml
lm_eval/tasks/gpqa/generative/aaii/v_4_0/gpqa_diamond_generative_zeroshot_aaii_v_4_0_reasoning.yaml
lm_eval/tasks/gpqa/generative/aaii/v_4_0/_generate_configs.py
lm_eval/tasks/gpqa/generative/aaii/v_4_0/utils.py
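
A rough sketch of what a wrapper generator along the lines of _generate_configs.py could look like. It only assumes that each wrapper includes the shared base config and sets its own task name; whatever per-variant overrides the real wrappers carry are not listed in this description, so they are left out here. The function names are hypothetical.

```python
from pathlib import Path

BASE_CONFIG = "_gpqa_diamond_generative_zeroshot_aaii_v_4_0_yaml"
TASK_PREFIX = "gpqa_diamond_generative_zeroshot_aaii_v_4_0"
VARIANTS = ("non_reasoning", "reasoning")


def render_wrapper(variant: str) -> str:
    """Return the YAML text for one wrapper: include the base, set the task name."""
    return f"include: {BASE_CONFIG}\ntask: {TASK_PREFIX}_{variant}\n"


def generate_configs(out_dir: Path) -> list:
    """Write one wrapper YAML per variant into out_dir and return the paths."""
    paths = []
    for variant in VARIANTS:
        path = out_dir / f"{TASK_PREFIX}_{variant}.yaml"
        path.write_text(render_wrapper(variant), encoding="utf-8")
        paths.append(path)
    return paths
```

Calling generate_configs(Path(".")) from the task directory would write the two wrapper files next to the base config.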

Files modified:

lm_eval/filters/extraction.py (RegexFilter: add fallback_patterns support)
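
To illustrate the fallback idea, here is a standalone sketch that tries a list of patterns in order and returns the first one that matches. It is not the RegexFilter change itself: the function name, the three patterns, and the choice to take the last match in a response are assumptions for illustration only.

```python
import re

# Hypothetical patterns, ordered from most to least specific.
FALLBACK_PATTERNS = [
    r"(?i)answer\s*:\s*\(?([A-D])\b",  # "Answer: C" or "answer: (c)"
    r"\\boxed\{([A-D])\}",             # "\boxed{C}"
    r"\(([A-D])\)",                    # "(C)"
]


def extract_letter(response: str,
                   patterns=FALLBACK_PATTERNS,
                   fallback: str = "[invalid]") -> str:
    """Return the letter found by the first pattern that matches, else fallback."""
    for pattern in patterns:
        matches = re.findall(pattern, response)
        if matches:
            # Use the last occurrence so a revised final answer wins.
            return matches[-1].upper()
    return fallback
```

Ordering the patterns from strict to loose means a well-formed "Answer: X" line is always preferred, and the looser patterns only apply when the model ignores the requested format.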

Notes:

The canonical generation kwarg is max_gen_toks (aliases are accepted).
Use --apply_chat_template when running chat-completions models.
The HF dataset Idavidrein/gpqa requires an HF token.

CLAassistant · 2026-02-03T04:14:48Z

All committers have signed the CLA.

saint1729 · 2026-02-03T04:20:58Z

Reproduced the number published by AAII for the GPT-4.1 nano model (https://artificialanalysis.ai/models/gpt-4-1-nano#intelligence-evaluations).

(.venv) (base) saint1729@saint1729 v_4_0 % lm_eval --model openai-chat-completions --model_args model=gpt-4.1-nano-2025-04-14 --tasks gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning --apply_chat_template --output output/aaii_non_reasoning_full --log_samples      
2026-02-03:18:36:00 INFO     [config.evaluate_config:301] Using default fewshot_as_multiturn=True.
2026-02-03:18:36:04 INFO     [_cli.run:376] Selected Tasks: ['gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning']
2026-02-03:18:36:04 INFO     [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-02-03:18:36:04 INFO     [evaluator:236] Initializing openai-chat-completions model, with arguments: {'model': 'gpt-4.1-nano-2025-04-14'}
2026-02-03:18:36:04 INFO     [models.api_models:172] Using max length 2048 - 1
2026-02-03:18:36:04 INFO     [models.api_models:175] Concurrent requests are disabled. To enable concurrent requests, set `num_concurrent` > 1.
2026-02-03:18:36:04 INFO     [models.api_models:193] Using tokenizer None
2026-02-03:18:36:05 INFO     [tasks:700] Selected tasks:
2026-02-03:18:36:05 INFO     [tasks:691] Task: gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning (gpqa/generative/aaii/v_4_0/gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning.yaml)
2026-02-03:18:36:05 INFO     [evaluator:314] gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning: Using gen_kwargs: {'max_gen_toks': 16384, 'temperature': 0.0, 'until': ['</s>', 'Question:', '<|im_end|>']}
2026-02-03:18:36:05 WARNING  [evaluator:490] Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details.
2026-02-03:18:36:05 INFO     [api.task:311] Building contexts for gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning on rank 0...
100%|████████████████████████████████████████████████████████████████| 198/198 [00:00<00:00, 2939.42it/s]
2026-02-03:18:36:05 INFO     [evaluator:584] Running generate_until requests
2026-02-03:18:36:05 INFO     [models.api_models:733] Tokenized requests are disabled. Context + generation length is not checked.
Requesting API:   0%|                                                            | 0/198 [00:00<?, ?it/s]
2026-02-03:18:36:05 WARNING  [models.api_models:376] Cannot determine EOS string to pass to stop sequence. Manually set by passing `eos_string` to model_args.
Requesting API: 100%|██████████████████████████████████████████████████| 198/198 [20:36<00:00,  6.24s/it]
2026-02-03:18:56:44 INFO     [loggers.evaluation_tracker:247] Saving results aggregated
2026-02-03:18:56:44 INFO     [loggers.evaluation_tracker:119] Saving per-task samples to output/aaii_non_reasoning_full/gpt-4.1-nano-2025-04-14/*.jsonl
openai-chat-completions ({'model': 'gpt-4.1-nano-2025-04-14'}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|                          Tasks                          | Version |      Filter       |n-shot|  Metric   |   |Value |   |Stderr|
|---------------------------------------------------------|---------|-------------------|-----:|-----------|---|-----:|---|-----:|
|gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning|aaii-v4.0|aaii-extract-letter|     0|exact_match|↑  |0.5152|±  |0.0356|
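
As a sanity check on the table above, the reported value and standard error are consistent with 102 correct answers out of 198. The count of 102 is inferred from 0.5152 × 198 and is not printed in the log. The sketch assumes the standard error of the mean is computed from the sample standard deviation (n - 1 denominator).

```python
import math


def mean_and_stderr(scores):
    """Mean of per-sample scores and the standard error of that mean."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance with the n - 1 denominator.
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)


# 102 correct and 96 incorrect exact-match scores (inferred, see above).
scores = [1] * 102 + [0] * 96
accuracy, stderr = mean_and_stderr(scores)
print(f"{accuracy:.4f} ± {stderr:.4f}")  # 0.5152 ± 0.0356
```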

saint1729 · 2026-02-04T02:56:59Z

@baberabb Gentle reminder on this simple CR

saint1729 requested a review from baberabb as a code owner February 3, 2026 04:14

saint1729 changed the title ~~Add AAII GPQA Diamond tasks, extraction regex, and reasoning/non-reasoning wrappers~~ feat (tasks): add AAII GPQA Diamond tasks, extraction regex, and reasoning/non-reasoning wrappers Feb 3, 2026

AAII: add GPQA diamond zero-shot task, regex extractor, and config (s…

9ecd623

…quashed)

saint1729 force-pushed the feature/aaai-index-methodology branch from 0288c02 to 9ecd623 Compare February 4, 2026 02:24

Merge branch 'EleutherAI:main' into feature/aaai-index-methodology

99931a4

saint1729 wants to merge 2 commits into EleutherAI:main from saint1729:feature/aaai-index-methodology
