
feat(evaluation): Add lm-eval to Pruna Metrics#380

Open
sky-2002 wants to merge 17 commits into PrunaAI:main from sky-2002:sky-2002/metric-evalharness

Conversation


@sky-2002 sky-2002 commented Oct 2, 2025

Description

Adds evaluation harness to the metrics lineup.

Related Issue

Fixes #378

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Added basic unit tests for two tasks from the eval harness.
  • Will add more tests as needed.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

  • EvalHarnessMetric inherits from BaseMetric since eval harness metrics are model level metrics.
  • Open to discuss and update code and tests.

Note

Adds LMEvalMetric wrapping lm-evaluation-harness metrics and enables lm_eval:<task> requests in Task, with tests and optional dependency.

  • Evaluation Metrics:
    • Introduce LMEvalMetric (src/pruna/evaluation/metrics/metric_evalharness.py) wrapping lm-evaluation-harness metrics via registry/aggregation, with stateful accumulation and result reporting.
    • Register metric under MetricRegistry as lm_eval_metric.
  • Task Integration:
    • Support lm_eval:<task_name> requests in Task (src/pruna/evaluation/task.py), resolving metrics via lm_eval.tasks.get_task_dict and instantiating LMEvalMetric per task metric.
  • Tests:
    • Add tests/evaluation/test_evalharness_metrics.py covering BLEU-like scoring, empty input behavior, and preds/refs length mismatch.
  • Dependencies:
    • Add optional group evalharness with lm-eval>=0.4.0 in pyproject.toml.
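The summary above mentions registering the metric under MetricRegistry as lm_eval_metric. As a rough illustration of that registry pattern, here is a minimal, self-contained sketch; the class names come from the PR summary, but the decorator-based registry implementation is a hypothetical stand-in, not Pruna's actual API:

```python
class MetricRegistry:
    """Minimal name -> metric-class registry (hypothetical stand-in)."""

    _metrics: dict[str, type] = {}

    @classmethod
    def register(cls, name: str):
        # Decorator that records a metric class under a string key.
        def decorator(metric_cls: type) -> type:
            cls._metrics[name] = metric_cls
            return metric_cls
        return decorator

    @classmethod
    def create(cls, name: str, **kwargs):
        # Look up a registered class by name and instantiate it.
        if name not in cls._metrics:
            raise KeyError(f"Unknown metric: {name!r}")
        return cls._metrics[name](**kwargs)


@MetricRegistry.register("lm_eval_metric")
class LMEvalMetric:
    # Constructor shape follows the PR discussion (metric_name, call_type).
    def __init__(self, metric_name: str, call_type: str = "pairwise"):
        self.metric_name = metric_name
        self.call_type = call_type


metric = MetricRegistry.create("lm_eval_metric", metric_name="exact_match")
print(metric.metric_name)  # exact_match
```

With a registry like this, a Task can resolve a request string to metric instances without importing the metric classes directly.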

Written by Cursor Bugbot for commit 3767da9. This will update automatically on new commits.

@sdiazlor sdiazlor requested a review from begumcig October 6, 2025 09:40
@simlang simlang requested review from simlang and removed request for simlang October 10, 2025 14:32
Member

@begumcig begumcig left a comment


Hi @sky-2002, thank you so much for your contribution! We'd love to bring lm-eval into Pruna. Integrating the entire evaluation suite in one go is a big task, but I think with some refactoring we can still make this work! In our codebase, we usually use the StatefulMetric interface for a single metric, whereas lm-eval is more of a full benchmarking suite. To align with our structure, we might need to break it into smaller components 🔨🔨. We also have a Task interface that coordinates multiple metrics together. I was thinking we could either integrate one task from lm-eval as one Task (inheriting from our Task interface), or integrate some metrics from the lm-eval metrics registry as StatefulMetrics.

What do you think? Do you think any of these approaches would align with lm-eval interface? Thank you again!

Author

sky-2002 commented Oct 19, 2025

@begumcig Yeah, I agree; while implementing it, I was not very comfortable with this approach either.
From what I see in the Pruna code, a task has a request, which is a list of metrics, exactly like an lm-eval task, which has its metrics defined in YAML files.

alternatively, we could integrate some metrics from lm-eval metrics registry as StatefulMetrics.

Yes, this is a much cleaner way and aligns with the lm-eval interface (adding each lm-eval metric as a Pruna metric); let's do this. I will update the code.
Also, you are right, I misunderstood StatefulMetric.

I was thinking maybe we could integrate one task from lm-eval as one Task (inheriting from our Task interface)

Don't the above two need to be done together? We need to implement a task and define its metrics, or we need a way to create a task on the fly given a task name and the metrics it needs.

For example, given that we have all metrics implemented in pruna:

We can instantiate a task for a lm-eval task:

from pruna.evaluation.task import Task

squad_metrics = ["lm_eval_exact_match", "lm_eval_f1"]

squad_task = Task(
    request=squad_metrics,
    datamodule=squad_datamodule, 
    device="cuda"
)

or have a task factory:

from lm_eval import tasks as lm_tasks
from pruna.evaluation.task import Task

def make_pruna_task_from_lmeval(task_name, datamodule, model_args, device="cuda"):
    lm_task = lm_tasks.get_task(task_name)
    metrics = [f"lm_eval_{m['metric']}" for m in lm_task.config.metric_list]
    return Task(request=metrics, datamodule=datamodule, device=device)

Wdyt, how should we go about this?

@begumcig
Member

@sky-2002 Yes, that's exactly how our task and stateful metric interface work!

I was suggesting that you could:

  • Either create something like a LMEvalTask inheriting from our Task, that would also integrate the attributes and functions that are required for lm_eval task.
  • Or exactly like you suggested integrate the metrics as stateful metrics and initialize them using our own Task interface. I think the solution you suggested aligns perfectly with our interface and it would be amazing!

Once we have the metrics as Pruna metrics, we can use them with our Task interface like you have shown in the example! This all looks super good to me, and thank you so much for taking the time to look into this, you are a champ 🏆🏆🏆🏆


cursor bot commented Oct 23, 2025

Bug: Mismatched Class and Method Interfaces

The test file attempts to import and use EvalHarnessMetric class, but the actual implementation defines LMEvalMetric class. Additionally, the test uses a completely different constructor signature (expecting tasks, model_args, device parameters) and calls compute(model=None, dataloader=None) method, while the actual LMEvalMetric constructor takes metric_name and optional call_type parameters, and the compute() method takes no parameters. The test expects result structure with task_scores in params, but the implementation returns different parameter structure. This complete interface mismatch will cause the tests to fail with ImportError and TypeError.



cursor bot commented Oct 23, 2025

Bug: Error Message Mismatch with New Request Types

The error message references AVAILABLE_REQUESTS constant which only contains "image_generation_quality", but the function now supports requests starting with "lm_eval:". This creates misleading error messages that don't inform users about the newly supported lm_eval request format, potentially causing confusion when users receive errors claiming only image_generation_quality is available.



cursor bot commented Oct 23, 2025

Bug: Outdated Constants Mislead on New Features

AVAILABLE_REQUESTS constant is outdated and doesn't include the newly added lm_eval functionality. The error message on lines 272-273 references this constant, making it misleading since lm_eval requests (with "lm_eval:" prefix) are now supported but not listed in AVAILABLE_REQUESTS.



cursor bot commented Oct 23, 2025

Bug: Function Lacks Error Handling for Missing Keys and Attributes

The _get_lm_eval_task_metrics function has no error handling for potential KeyError when accessing task_dict[task_name] or AttributeError when accessing task.config.metric_list. If get_task_dict returns a dict without the requested task_name key, or if the task object doesn't have a config attribute with metric_list, this will cause unhandled exceptions.
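A defensive version of that lookup might look roughly like the sketch below. This is a hypothetical helper, not the PR's actual `_get_lm_eval_task_metrics`; the function name, signature, and error messages are illustrative only:

```python
def get_lm_eval_task_metrics(task_dict: dict, task_name: str) -> list[str]:
    """Return metric names for a task, guarding the dict/attribute access."""
    # Guard against KeyError: the requested task may not be in the dict.
    if task_name not in task_dict:
        available = ", ".join(sorted(task_dict)) or "<none>"
        raise KeyError(
            f"lm-eval task {task_name!r} not found. Available tasks: {available}"
        )
    task = task_dict[task_name]
    # Guard against AttributeError: task may lack config or metric_list.
    config = getattr(task, "config", None)
    metric_list = getattr(config, "metric_list", None) if config else None
    if not metric_list:
        raise AttributeError(
            f"lm-eval task {task_name!r} does not expose config.metric_list"
        )
    return [entry["metric"] for entry in metric_list]
```

Raising a descriptive error at the boundary keeps the failure close to its cause instead of surfacing as a bare KeyError deep inside Task.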



cursor bot commented Oct 23, 2025

Bug: Empty Task Name Validation Missing

No validation for empty task name after splitting the request. If the request is exactly "lm_eval:" with no task name following, task_name will be an empty string, which could cause issues in subsequent function calls.
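A guard for that edge case could be sketched as follows; the helper name is hypothetical and the real parsing in the PR lives inside Task, but the prefix check and empty-name validation are the point:

```python
LM_EVAL_PREFIX = "lm_eval:"

def parse_lm_eval_request(request: str) -> str:
    """Extract the task name from an 'lm_eval:<task_name>' request string."""
    if not request.startswith(LM_EVAL_PREFIX):
        raise ValueError(f"Not an lm_eval request: {request!r}")
    task_name = request[len(LM_EVAL_PREFIX):].strip()
    # Reject a bare "lm_eval:" with nothing after the colon.
    if not task_name:
        raise ValueError(
            f"Empty task name: expected 'lm_eval:<task_name>', got {request!r}"
        )
    return task_name

print(parse_lm_eval_request("lm_eval:hellaswag"))  # hellaswag
```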


Author

sky-2002 commented Oct 23, 2025

@begumcig I tried implementing. But I have a few concerns.

  • We can create wrappers around the lm-eval metrics, but then we would have to wrap everything else too: for a task such as hellaswag, we also need its dataset (specified in YAML), its pre-processing and post-processing functions, etc., which might make our code brittle. It could get confusing, with a lot of duplicated code, because all of this already exists in lm-eval.

Either create something like a LMEvalTask inheriting from our Task, that would also integrate the attributes and functions that are required for lm_eval task.

I realise this would be much cleaner and would keep the lm-eval-dependent code self-contained, so we would not need to keep updating it. But I am not sure whether it should inherit from the existing Task code; feel free to give suggestions, I am a little unsure about a good way to do this.

Or exactly like you suggested integrate the metrics as stateful metrics and initialize them using our own Task interface. I think the solution you suggested aligns perfectly with our interface and it would be amazing!

Or if we continue this way, how do we handle the datasets for each task and other related attributes and functions?


github-actions bot commented Nov 3, 2025

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Nov 3, 2025
@github-actions github-actions bot closed this Nov 10, 2025
@begumcig begumcig reopened this Nov 11, 2025
Member

@begumcig begumcig left a comment


Thank you so much @sky-2002, this is insanely good, you really dropped this 👑👑👑👑👑👑.
Sorry for the delay in reviewing; things have been pretty busy with the releases lately 🥺.
This is honestly so close to perfect, I just added a few tiny comments.
We’d really, really appreciate it if you could take a quick look when you get the chance, but if not, no stress at all! We can also happily take over from here, however you prefer 💜💜💜
Honestly can't wait to merge this into Pruna 💃🏻💃🏻


try:
# lm-eval metrics expect a list of (reference, prediction) tuples
raw_items = self.metric_fn(self.pairs)
Member


I love this new version! Being super nitpicky here, but I think it makes sense to move line 105 (self.metric_fn()) into update, so that pairs collects the calculated metric scores and compute only does the final aggregation?

Author


Done
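The suggested split might look roughly like this. Everything below is an illustrative stand-in, not the PR's actual LMEvalMetric code: the class name, the toy exact_match function, and the mean aggregation are all assumptions used to show the update/compute separation:

```python
from statistics import mean

def exact_match(pairs):
    # Stand-in for an lm-eval metric fn over (reference, prediction) pairs.
    return [1.0 if ref == pred else 0.0 for ref, pred in pairs]

class StatefulExactMatch:
    def __init__(self):
        self.scores: list[float] = []

    def update(self, references, predictions):
        # Score each batch immediately; keep only the per-item scores.
        self.scores.extend(exact_match(list(zip(references, predictions))))

    def compute(self) -> float:
        # Final aggregation only, as suggested in the review.
        return mean(self.scores) if self.scores else 0.0

m = StatefulExactMatch()
m.update(["a", "b"], ["a", "c"])
m.update(["d"], ["d"])
print(m.compute())  # 2/3 ≈ 0.667
```

This keeps compute cheap and idempotent, since all per-sample work has already been done batch by batch in update.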

pruna_logger.error(f"Failed computing lm-eval metric {self.metric_name}: {e}")
raise

score_value = float(np.mean(list(score.values()))) if isinstance(score, dict) else float(score)
Member


I am asking just to learn, if we do aggregation already on line 106 why do we take the mean here?

Author


Oh yeah, I don't think we need the mean; removed it.

CMMD(device=stateful_metric_device),
]

elif request.startswith("lm_eval:"):
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is insanely cool!

Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks 😄

@@ -0,0 +1,46 @@
import pytest
Member


I love this test suite, could we possibly also add some tests for the "lm_eval:" case, using Task?

Author


Done

Member

begumcig commented Nov 11, 2025


@sky-2002 That makes a lot of sense, I totally agree that bringing dataset handling into the evaluation part would make things brittle and messy.
I think we could handle that through the PrunaDataModule instead? We could even consider adding the required data as datasets directly in our library, so the evaluation code stays clean and focused on its purpose and people could use these datasets for any purpose they prefer :)

@github-actions github-actions bot removed the stale label Nov 12, 2025
Author

sky-2002 commented Nov 13, 2025

@begumcig I have tried to address review comments, thanks for the review.

Also, I noticed that Pruna's code structure changed a little bit (or is it just in algorithms); what changes will we need in this PR?

Edit: The code structure changes were in algorithms, ignore above question

@begumcig begumcig self-requested a review November 13, 2025 16:27

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Nov 24, 2025
@github-actions github-actions bot closed this Dec 1, 2025
@sharpenb sharpenb reopened this Dec 8, 2025
@github-actions github-actions bot removed the stale label Dec 9, 2025
@davidberenstein1957
Member

@begumcig @sky-2002 any blockers here from either side?


This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Dec 20, 2025
@github-actions github-actions bot closed this Dec 27, 2025
@github-actions github-actions bot removed the stale label Mar 11, 2026
@begumcig begumcig force-pushed the sky-2002/metric-evalharness branch 3 times, most recently from c4bb8d6 to 3fd5aaa Compare March 11, 2026 18:28
Author

sky-2002 commented Mar 12, 2026

Hey @begumcig, sorry you had to push so many more commits; I sort of lost track of this PR and got busy with work. I will check that comment (if it is still relevant) when I am free.

@begumcig begumcig force-pushed the sky-2002/metric-evalharness branch from 128a395 to d0a01fb Compare March 12, 2026 13:31
@begumcig begumcig force-pushed the sky-2002/metric-evalharness branch 2 times, most recently from 264c086 to 2ae25f8 Compare March 12, 2026 16:48
@begumcig begumcig force-pushed the sky-2002/metric-evalharness branch from 2ae25f8 to 6847699 Compare March 12, 2026 17:20
@begumcig
Member

@sky-2002 No worries at all! I was just making a few small updates. This looks really amazing, and thank you so much for this contribution! We haven’t had the chance to integrate many text evaluation metrics into Pruna yet, and this is a big step forward. Really appreciate the work you put into this 🥹

@begumcig
Member

Also @sky-2002, while taking a look at the evaluation harness I noticed that the get_task_dict interface used here has been deprecated. If you’d be interested in updating it at some point, that could be a nice improvement, but absolutely no rush, we can also handle it in a future pass 💜

@begumcig begumcig force-pushed the sky-2002/metric-evalharness branch from da0d91a to c65c555 Compare March 13, 2026 10:56


Development

Successfully merging this pull request may close these issues.

[FEATURE] Add lm-eval to our Metrics

5 participants