
feat(tasks): add LongProc benchmark (6 task types, 16 configs)#3544

Open
xiye17 wants to merge 2 commits into EleutherAI:main from xiye17:feat/longproc

Conversation


xiye17 commented Feb 1, 2026

Summary

Hi,

This is an author of LongProc, aiming to:

  • Add the LongProc long-form procedural generation benchmark
    (paper, COLM 2025) as a new task suite
  • Dataset: `PrincetonPli/LongProc` on HuggingFace Hub
  • 6 task types, 16 configs covering output lengths from 0.5k to 8k tokens:

    Task             Configs         Primary Metric
    countdown        0.5k, 2k, 8k    accuracy
    path_traversal   0.5k, 2k, 8k    accuracy
    tom_tracking     0.5k, 2k, 8k    accuracy
    html_to_tsv      0.5k, 2k, 8k    F1
    pseudo_to_code   0.5k, 2k        accuracy
    travel_planning  2k, 8k          accuracy

  • Group hierarchy: `longproc` → 6 sub-groups → 16 leaf tasks
  • All tasks aggregate via a unified `score` metric
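
Each task maps its native check (accuracy or F1) onto the shared `score` key inside its `process_results` function. A minimal sketch of that pattern, for illustration only: the `answer` field name and the exact-match placeholder are assumptions, while the real countdown checker lives in `metrics.py`.

```python
# Illustrative sketch only -- the real checkers live in
# lm_eval/tasks/longproc/metrics.py and are ported from the LongProc codebase.
def process_results_countdown(doc: dict, results: list[str]) -> dict:
    prediction = results[0].strip()  # greedy generation yields a single string
    # Placeholder check (assumed `answer` field); the actual metric verifies
    # that the generated countdown derivation is arithmetically valid.
    correct = float(prediction == doc.get("answer", "").strip())
    # Every leaf task returns the same key, so the `longproc` group can
    # aggregate one `score` across all 16 tasks.
    return {"score": correct}
```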

Files added (29)

  • `lm_eval/tasks/longproc/utils.py` — `process_docs` to parse each document's metadata JSON (see the sketch after this list)
  • `lm_eval/tasks/longproc/metrics.py` — 6 evaluation functions ported from original codebase
  • 3 base YAML configs (`default_yaml{0.5k,2k,8k}`)
  • 16 individual task YAMLs
  • 7 group YAMLs (6 sub-groups + 1 top-level)
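
`process_docs` follows the harness's usual pattern of mapping over the Hugging Face dataset. A rough sketch, assuming the raw rows carry their task-specific fields as a JSON string in a `metadata` column (the column name is an assumption, not the dataset's documented schema):

```python
import json

import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    """Decode each row's JSON-encoded metadata into native Python objects."""

    def _parse(doc: dict) -> dict:
        # Column name assumed for illustration; the actual
        # PrincetonPli/LongProc schema may differ.
        doc["metadata"] = json.loads(doc["metadata"])
        return doc

    return dataset.map(_parse)
```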

Optional dependencies

  • pandas: required only for `html_to_tsv` evaluation (guarded import; see the sketch after this list)
  • g++ (C++11): required only for `pseudo_to_code` evaluation (compiles and runs model-generated C++)
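
Because pandas is only needed at metric time, the import can be deferred until an `html_to_tsv` prediction is actually scored. A minimal sketch of such a guard; `_require_pandas` and the error message are illustrative, not the code in `metrics.py`:

```python
def _require_pandas():
    """Import pandas lazily so the rest of the suite runs without it installed."""
    try:
        import pandas as pd
    except ImportError as exc:
        raise ImportError(
            "pandas is needed for the longproc html_to_tsv tasks; "
            "install it with `pip install pandas`."
        ) from exc
    return pd
```

The `html_to_tsv` scorer would call this helper before parsing tables and computing F1, so installing pandas stays optional for every other LongProc task.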

Test plan

  • `python -c 'from lm_eval.tasks import TaskManager; tm = TaskManager(); print([t for t in tm.all_tasks if "longproc" in t])'` lists all 23 entries
  • `lm_eval --model dummy --tasks longproc_countdown_0.5k --limit 50` runs without error
  • `lm_eval --model dummy --tasks longproc --limit 50` tests full group aggregation
  • Metric values match original LongProc evaluators on sample outputs

Add the LongProc long-form procedural generation benchmark from
"LongProc: Benchmarking Long-Context Language Models on Long
Procedural Generation" (COLM 2025).

Dataset: PrincetonPli/LongProc on HuggingFace Hub

Task types and configs:
  - countdown         (0.5k, 2k, 8k)
  - path_traversal    (0.5k, 2k, 8k)
  - tom_tracking      (0.5k, 2k, 8k)
  - html_to_tsv       (0.5k, 2k, 8k)
  - pseudo_to_code    (0.5k, 2k)
  - travel_planning   (2k, 8k)

Group hierarchy:
  longproc -> 6 sub-groups -> 16 individual tasks

Evaluation logic ported from the original LongProc codebase into
metrics.py with task-specific process_results functions.

Optional dependencies:
  - pandas (html_to_tsv evaluation)
  - g++ with C++11 support (pseudo_to_code evaluation)
xiye17 requested a review from baberabb as a code owner, February 1, 2026 21:54

CLAassistant commented Feb 1, 2026

CLA assistant check
All committers have signed the CLA.

baberabb (Contributor) commented Feb 2, 2026

Hi! Thanks for the PR! LGTM, just one small addition, could you add unsafe_code: true to the configs? We use that flag to warn users when a task executes arbitrary code.

xiye17 (Author) commented Feb 2, 2026

Thanks a lot for your quick review! I just marked the two subtasks requiring code execution. Let me know if I need to add the flag more globally for the whole group.

xiye17 (Author) commented Feb 14, 2026

> Hi! Thanks for the PR! LGTM, just one small addition, could you add unsafe_code: true to the configs? We use that flag to warn users when a task executes arbitrary code.

Hi, I just marked `unsafe_code: true` in the default YAML files and consolidated the git commit logs. Thanks!
