EleutherAI / lm-evaluation-harness Public

Pull requests: EleutherAI/lm-evaluation-harness

182 Open 1,776 Closed

fix: Update WatsonxLLM class mapping and errors

#3591 opened Feb 17, 2026 by Rafal-Chrzanowski-IBM

Fix correctness issues in Arabic normalization and prompt loading

#3589 opened Feb 15, 2026 by RinZ27

add GreekMMLU (official native-sourced benchmark) task configuration

#3581 opened Feb 11, 2026 by mersinkonomi

fix: clean up MasakhaNEWS prompt whitespace and typo

#3580 opened Feb 11, 2026 by Mr-Neutr0n

New NorEval tasks

#3572 opened Feb 9, 2026 by davda54

Update of NorEval implementation

#3571 opened Feb 9, 2026 by davda54 • Draft

Add jfinqa: Japanese Financial Numerical Reasoning QA (1000 questions)

#3570 opened Feb 8, 2026 by ajtgjmdjp

6 tasks done

MMLU PRO chat variant

#3568 opened Feb 6, 2026 by anmarques

fix(bigbench): add group for task discovery (bigbench)

#3556 opened Feb 4, 2026 by jayvenn21

fix(aime): ensure non-empty prompt for API backends

#3555 opened Feb 4, 2026 by jayvenn21

fix(tasks): correct MasakhaNEWS dataset field names

#3554 opened Feb 3, 2026 by jayvenn21

feat(tasks): add Persian XNLI evaluation task

#3553 opened Feb 3, 2026 by jayvenn21

Add Intel Gaudi support

#3550 opened Feb 3, 2026 by 12010486

feat (tasks): add AAII GPQA Diamond tasks, extraction regex, and reasoning/non-reasoning wrappers

#3547 opened Feb 3, 2026 by saint1729

feat(tasks): add LongProc benchmark (6 task types, 16 configs)

#3544 opened Feb 1, 2026 by xiye17

4 tasks done

feat(task): add MMLU-CF contamination-free benchmark

#3542 opened Jan 31, 2026 by fistyee

add french and korean gsm8k

#3541 opened Jan 30, 2026 by bknyaz

Adds llm-as-a-judge support via new metric

#3534 opened Jan 27, 2026 by jbross-ibm-research

Added pass@k and avg@k metrics to AIME benchmark

#3510 opened Jan 21, 2026 by annafontanaa

[TASKS] add tasks from GDN paper

#3507 opened Jan 21, 2026 by mayank31398

3 tasks done

Hineni

#3506 opened Jan 20, 2026 by Kevinobote

Presets

#3494 opened Jan 13, 2026 by baberabb

feat: support local directory as dataset_path

#3485 opened Jan 5, 2026 by fanjingxiang

Implement new translation tasks for google WMT24++ datasets

#3480 opened Dec 25, 2025 by grzegorz-aniol

Fix utils.py for MATH500 evaluation

#3478 opened Dec 24, 2025 by sheriyuo
