
feat(tasks): add LongProc benchmark (6 task types, 16 configs)#3544

Open
xiye17 wants to merge 2 commits into EleutherAI:main from xiye17:feat/longproc

Conversation


xiye17 commented Feb 1, 2026

Summary

Hi,

This is an author of LongProc, aiming to:

  • Add the LongProc long-form procedural generation benchmark
    (paper, COLM 2025) as a new task suite
  • Dataset: `PrincetonPli/LongProc` on HuggingFace Hub
  • 6 task types, 16 configs covering output lengths from 0.5k to 8k tokens:

    Task             Configs         Primary Metric
    countdown        0.5k, 2k, 8k    accuracy
    path_traversal   0.5k, 2k, 8k    accuracy
    tom_tracking     0.5k, 2k, 8k    accuracy
    html_to_tsv      0.5k, 2k, 8k    F1
    pseudo_to_code   0.5k, 2k        accuracy
    travel_planning  2k, 8k          accuracy

  • Group hierarchy: `longproc` → 6 sub-groups → 16 leaf tasks
  • All tasks aggregate via a unified `score` metric
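
Each task maps its native check (accuracy or F1) onto the shared `score` key inside its `process_results` function. A minimal sketch of that pattern, for illustration only: the `answer` field name and the exact-match placeholder are assumptions, while the real countdown checker lives in `metrics.py`.

```python
# Illustrative sketch only -- the real checkers live in
# lm_eval/tasks/longproc/metrics.py and are ported from the LongProc codebase.
def process_results_countdown(doc: dict, results: list[str]) -> dict:
    prediction = results[0].strip()  # greedy generation yields a single string
    # Placeholder check (assumed `answer` field); the actual metric verifies
    # that the generated countdown derivation is arithmetically valid.
    correct = float(prediction == doc.get("answer", "").strip())
    # Every leaf task returns the same key, so the `longproc` group can
    # aggregate one `score` across all 16 tasks.
    return {"score": correct}
```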

Files added (29)

  • `lm_eval/tasks/longproc/utils.py` — `process_docs` to parse each document's metadata JSON (see the sketch after this list)
  • `lm_eval/tasks/longproc/metrics.py` — 6 evaluation functions ported from original codebase
  • 3 base YAML configs (`default_yaml{0.5k,2k,8k}`)
  • 16 individual task YAMLs
  • 7 group YAMLs (6 sub-groups + 1 top-level)
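
`process_docs` follows the harness's usual pattern of mapping over the Hugging Face dataset. A rough sketch, assuming the raw rows carry their task-specific fields as a JSON string in a `metadata` column (the column name is an assumption, not the dataset's documented schema):

```python
import json

import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    """Decode each row's JSON-encoded metadata into native Python objects."""

    def _parse(doc: dict) -> dict:
        # Column name assumed for illustration; the actual
        # PrincetonPli/LongProc schema may differ.
        doc["metadata"] = json.loads(doc["metadata"])
        return doc

    return dataset.map(_parse)
```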

Optional dependencies

  • pandas: required only for `html_to_tsv` evaluation (guarded import; see the sketch after this list)
  • g++ (C++11): required only for `pseudo_to_code` evaluation (compiles and runs model-generated C++)
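
Because pandas is only needed at metric time, the import can be deferred until an `html_to_tsv` prediction is actually scored. A minimal sketch of such a guard; `_require_pandas` and the error message are illustrative, not the code in `metrics.py`:

```python
def _require_pandas():
    """Import pandas lazily so the rest of the suite runs without it installed."""
    try:
        import pandas as pd
    except ImportError as exc:
        raise ImportError(
            "pandas is needed for the longproc html_to_tsv tasks; "
            "install it with `pip install pandas`."
        ) from exc
    return pd
```

The `html_to_tsv` scorer would call this helper before parsing tables and computing F1, so installing pandas stays optional for every other LongProc task.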

Test plan

  • `python -c 'from lm_eval.tasks import TaskManager; tm = TaskManager(); print([t for t in tm.all_tasks if "longproc" in t])'` lists all 23 entries
  • `lm_eval --model dummy --tasks longproc_countdown_0.5k --limit 50` runs without error
  • `lm_eval --model dummy --tasks longproc --limit 50` tests full group aggregation
  • Metric values match original LongProc evaluators on sample outputs

Add the LongProc long-form procedural generation benchmark from
"LongProc: Benchmarking Long-Context Language Models on Long
Procedural Generation" (COLM 2025).

Dataset: PrincetonPli/LongProc on HuggingFace Hub

Task types and configs:
  - countdown         (0.5k, 2k, 8k)
  - path_traversal    (0.5k, 2k, 8k)
  - tom_tracking      (0.5k, 2k, 8k)
  - html_to_tsv       (0.5k, 2k, 8k)
  - pseudo_to_code    (0.5k, 2k)
  - travel_planning   (2k, 8k)

Group hierarchy:
  longproc -> 6 sub-groups -> 16 individual tasks

Evaluation logic ported from the original LongProc codebase into
metrics.py with task-specific process_results functions.

Optional dependencies:
  - pandas (html_to_tsv evaluation)
  - g++ with C++11 support (pseudo_to_code evaluation)
xiye17 requested a review from baberabb as a code owner, February 1, 2026 21:54

CLAassistant commented Feb 1, 2026

CLA assistant check
All committers have signed the CLA.

baberabb (Contributor) commented Feb 2, 2026

Hi! Thanks for the PR! LGTM, just one small addition, could you add unsafe_code: true to the configs? We use that flag to warn users when a task executes arbitrary code.

xiye17 (Author) commented Feb 2, 2026

Thanks a lot for your quick review! I just marked the two subtasks requiring code execution. Let me know if I need to add the flag more globally for the whole group.

xiye17 (Author) commented Feb 14, 2026

> Hi! Thanks for the PR! LGTM, just one small addition, could you add unsafe_code: true to the configs? We use that flag to warn users when a task executes arbitrary code.

Hi, I just marked `unsafe_code: true` in the default YAML files and consolidated the git commit logs. Thanks!
