feat (tasks): add AAII GPQA Diamond tasks, extraction regex, and reasoning/non-reasoning wrappers #3547

Open
saint1729 wants to merge 2 commits into EleutherAI:main from saint1729:feature/aaai-index-methodology

@saint1729

Summary:

  • Adds an AAII-style zero-shot task for the GPQA Diamond dataset, plus wrapper configs, under
    lm_eval/tasks/gpqa/generative/aaii/v_4_0.
  • Implements process_docs preprocessing to shuffle choices and set single-letter answers.
  • Adds a generator (_generate_configs.py) that produces the reasoning / non-reasoning wrapper YAMLs.
  • Extends RegexFilter in lm_eval/filters/extraction.py to support a fallback sequence
    of regex patterns for robust single-letter answer extraction; task YAMLs use this.
  • Runs small end-to-end smoke tests (limit=5) with OpenAI chat-completions while
    iterating on YAML and filter shapes; results and samples saved under output/.
  • No new third-party dependencies.
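
The process_docs step listed above (shuffle the four choices, then record the gold answer as a single letter) could look roughly like the sketch below. This is a minimal illustration, not the code in this PR's utils.py: the column names ("Correct Answer", "Incorrect Answer 1" through "Incorrect Answer 3") are assumed to match the GPQA dataset schema, and the function name, the output keys "choices" and "answer", and the seed are hypothetical.

```python
import random

LETTERS = "ABCD"


def shuffle_choices(doc: dict, rng: random.Random) -> dict:
    """Return a copy of `doc` with shuffled choices and a single-letter gold answer."""
    # Assumed GPQA column names; index 0 of `raw` is always the correct option.
    raw = [
        doc["Correct Answer"],
        doc["Incorrect Answer 1"],
        doc["Incorrect Answer 2"],
        doc["Incorrect Answer 3"],
    ]
    order = list(range(len(raw)))
    rng.shuffle(order)  # a seeded RNG keeps the permutation reproducible
    out = dict(doc)
    out["choices"] = [raw[i].strip() for i in order]
    # The gold letter is wherever original index 0 ended up after shuffling.
    out["answer"] = LETTERS[order.index(0)]
    return out


# Hypothetical toy document, used only to exercise the function.
example = {
    "Question": "What is 2 + 2?",
    "Correct Answer": "4",
    "Incorrect Answer 1": "3",
    "Incorrect Answer 2": "5",
    "Incorrect Answer 3": "22",
}
processed = shuffle_choices(example, random.Random(0))
```

Tracking the shuffled index of the correct option, rather than searching for its text afterwards, keeps the gold letter right even if two options happen to have identical text.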

Files added:

  • lm_eval/tasks/gpqa/generative/aaii/v_4_0/_gpqa_diamond_generative_zeroshot_aaii_v_4_0_yaml
  • lm_eval/tasks/gpqa/generative/aaii/v_4_0/gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning.yaml
  • lm_eval/tasks/gpqa/generative/aaii/v_4_0/gpqa_diamond_generative_zeroshot_aaii_v_4_0_reasoning.yaml
  • lm_eval/tasks/gpqa/generative/aaii/v_4_0/_generate_configs.py
  • lm_eval/tasks/gpqa/generative/aaii/v_4_0/utils.py
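
For orientation, the relationship between the shared base config and the two wrapper YAMLs listed above might be generated along these lines. This is a sketch under assumptions, not the contents of _generate_configs.py: it assumes each wrapper uses the harness's `include:` and `task:` keys to point at the base file, and it leaves out any variant-specific overrides because the PR text does not state them. The function names are hypothetical, and the sketch returns strings instead of writing files.

```python
BASE_CONFIG = "_gpqa_diamond_generative_zeroshot_aaii_v_4_0_yaml"
TASK_PREFIX = "gpqa_diamond_generative_zeroshot_aaii_v_4_0"
VARIANTS = ("reasoning", "non_reasoning")


def render_wrapper(variant: str) -> tuple[str, str]:
    """Return (filename, YAML text) for one wrapper config."""
    task_name = f"{TASK_PREFIX}_{variant}"
    # Assumption: a wrapper only needs to include the base and name the task.
    yaml_text = f"include: {BASE_CONFIG}\ntask: {task_name}\n"
    return f"{task_name}.yaml", yaml_text


def render_all() -> dict[str, str]:
    """Map each wrapper filename to its YAML text."""
    return dict(render_wrapper(variant) for variant in VARIANTS)


wrappers = render_all()
for filename, text in wrappers.items():
    print(f"# {filename}\n{text}")
```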

Files modified:

  • lm_eval/filters/extraction.py (RegexFilter: add fallback_patterns support)
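
The idea behind a fallback sequence of patterns can be illustrated outside the harness. The sketch below is not the RegexFilter change in this PR: the function name, the fallback string, and the three patterns are illustrative assumptions. It tries each pattern in order and returns the last match of the first pattern that matches at all.

```python
import re

# Illustrative patterns, ordered from most to least specific (assumptions).
PATTERNS = [
    r"(?i)answer:\s*\(?([A-D])\b\)?",  # e.g. "Answer: C" or "answer: (c)"
    r"\\boxed\{([A-D])\}",             # e.g. "\boxed{B}"
    r"\b([A-D])\b",                    # last resort: any standalone letter A-D
]


def extract_letter(response: str, patterns=PATTERNS, fallback: str = "[invalid]") -> str:
    """Return the answer letter found by the first pattern that matches."""
    for pattern in patterns:
        matches = re.findall(pattern, response)
        if matches:
            # Prefer the final occurrence, since answers usually come last.
            return matches[-1].upper()
    return fallback


print(extract_letter("Option A looks wrong.\nAnswer: B"))
print(extract_letter("The result is \\boxed{B}."))
print(extract_letter("no idea"))
```

Ordering matters here: the loose last-resort pattern would pick up the article "A" or an option mentioned in passing, so it only runs when the stricter patterns find nothing.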

Notes:

  • Canonical generation kwarg is max_gen_toks (aliases accepted).
  • Use --apply_chat_template when running chat-completions models.
  • The HF dataset Idavidrein/gpqa is gated and requires an HF token.

@saint1729 saint1729 requested a review from baberabb as a code owner February 3, 2026 04:14
CLAassistant commented Feb 3, 2026

All committers have signed the CLA.

saint1729 (Author) commented Feb 3, 2026

Reproduced the number published by AAII for the GPT-4.1 nano model (https://artificialanalysis.ai/models/gpt-4-1-nano#intelligence-evaluations).

(.venv) (base) saint1729@saint1729 v_4_0 % lm_eval --model openai-chat-completions --model_args model=gpt-4.1-nano-2025-04-14 --tasks gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning --apply_chat_template --output output/aaii_non_reasoning_full --log_samples      
2026-02-03:18:36:00 INFO     [config.evaluate_config:301] Using default fewshot_as_multiturn=True.
2026-02-03:18:36:04 INFO     [_cli.run:376] Selected Tasks: ['gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning']
2026-02-03:18:36:04 INFO     [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-02-03:18:36:04 INFO     [evaluator:236] Initializing openai-chat-completions model, with arguments: {'model': 'gpt-4.1-nano-2025-04-14'}
2026-02-03:18:36:04 INFO     [models.api_models:172] Using max length 2048 - 1
2026-02-03:18:36:04 INFO     [models.api_models:175] Concurrent requests are disabled. To enable concurrent requests, set `num_concurrent` > 1.
2026-02-03:18:36:04 INFO     [models.api_models:193] Using tokenizer None
2026-02-03:18:36:05 INFO     [tasks:700] Selected tasks:
2026-02-03:18:36:05 INFO     [tasks:691] Task: gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning (gpqa/generative/aaii/v_4_0/gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning.yaml)
2026-02-03:18:36:05 INFO     [evaluator:314] gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning: Using gen_kwargs: {'max_gen_toks': 16384, 'temperature': 0.0, 'until': ['</s>', 'Question:', '<|im_end|>']}
2026-02-03:18:36:05 WARNING  [evaluator:490] Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details.
2026-02-03:18:36:05 INFO     [api.task:311] Building contexts for gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning on rank 0...
100%|████████████████████████████████████████████████████████████████| 198/198 [00:00<00:00, 2939.42it/s]
2026-02-03:18:36:05 INFO     [evaluator:584] Running generate_until requests
2026-02-03:18:36:05 INFO     [models.api_models:733] Tokenized requests are disabled. Context + generation length is not checked.
Requesting API:   0%|                                                            | 0/198 [00:00<?, ?it/s]
2026-02-03:18:36:05 WARNING  [models.api_models:376] Cannot determine EOS string to pass to stop sequence. Manually set by passing `eos_string` to model_args.
Requesting API: 100%|██████████████████████████████████████████████████| 198/198 [20:36<00:00,  6.24s/it]
2026-02-03:18:56:44 INFO     [loggers.evaluation_tracker:247] Saving results aggregated
2026-02-03:18:56:44 INFO     [loggers.evaluation_tracker:119] Saving per-task samples to output/aaii_non_reasoning_full/gpt-4.1-nano-2025-04-14/*.jsonl
openai-chat-completions ({'model': 'gpt-4.1-nano-2025-04-14'}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|                          Tasks                          | Version |      Filter       |n-shot|  Metric   |   |Value |   |Stderr|
|---------------------------------------------------------|---------|-------------------|-----:|-----------|---|-----:|---|-----:|
|gpqa_diamond_generative_zeroshot_aaii_v_4_0_non_reasoning|aaii-v4.0|aaii-extract-letter|     0|exact_match|↑  |0.5152|±  |0.0356|
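
As a sanity check on the table above, the reported stderr is consistent with the accuracy and the sample size. The sketch below assumes 102 of the 198 questions were scored correct (the only count that rounds to 0.5152) and assumes the stderr is the sample standard deviation of the 0/1 scores divided by sqrt(n); neither is stated in the log.

```python
import math

n = 198        # number of GPQA Diamond questions in the run above
correct = 102  # assumption: the only count for which correct / n rounds to 0.5152

acc = correct / n
# Sample variance (n - 1 denominator) of a list of 0/1 scores with mean `acc`.
sample_variance = acc * (1 - acc) * n / (n - 1)
stderr = math.sqrt(sample_variance / n)

print(f"{acc:.4f} ± {stderr:.4f}")  # prints 0.5152 ± 0.0356
```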

@saint1729 saint1729 changed the title from "Add AAII GPQA Diamond tasks, extraction regex, and reasoning/non-reasoning wrappers" to "feat (tasks): add AAII GPQA Diamond tasks, extraction regex, and reasoning/non-reasoning wrappers" Feb 3, 2026
@saint1729 saint1729 force-pushed the feature/aaai-index-methodology branch from 0288c02 to 9ecd623 Compare February 4, 2026 02:24
saint1729 (Author) commented Feb 4, 2026

@baberabb Gentle reminder on this simple CR
