
feat(evaluation): Add lm-eval to Pruna Metrics#380

Open
sky-2002 wants to merge 17 commits into PrunaAI:main from sky-2002:sky-2002/metric-evalharness

Conversation


@sky-2002 sky-2002 commented Oct 2, 2025

Description

Adds evaluation harness to the metrics lineup.

Related Issue

Fixes #378

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Added basic unit tests for two tasks from the eval harness.
  • Will add more tests as needed.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

  • EvalHarnessMetric inherits from BaseMetric since eval harness metrics are model level metrics.
  • Open to discuss and update code and tests.

Note

Adds LMEvalMetric wrapping lm-evaluation-harness metrics and enables lm_eval:<task> requests in Task, with tests and optional dependency.

  • Evaluation Metrics:
    • Introduce LMEvalMetric (src/pruna/evaluation/metrics/metric_evalharness.py) wrapping lm-evaluation-harness metrics via registry/aggregation, with stateful accumulation and result reporting.
    • Register metric under MetricRegistry as lm_eval_metric.
  • Task Integration:
    • Support lm_eval:<task_name> requests in Task (src/pruna/evaluation/task.py), resolving metrics via lm_eval.tasks.get_task_dict and instantiating LMEvalMetric per task metric.
  • Tests:
    • Add tests/evaluation/test_evalharness_metrics.py covering BLEU-like scoring, empty input behavior, and preds/refs length mismatch.
  • Dependencies:
    • Add optional group evalharness with lm-eval>=0.4.0 in pyproject.toml.
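The summary above mentions registering the metric under MetricRegistry as lm_eval_metric. As a rough illustration of that registry pattern, here is a minimal, self-contained sketch; the class names come from the PR summary, but the decorator-based registry implementation is a hypothetical stand-in, not Pruna's actual API:

```python
class MetricRegistry:
    """Minimal name -> metric-class registry (hypothetical stand-in)."""

    _metrics: dict[str, type] = {}

    @classmethod
    def register(cls, name: str):
        # Decorator that records a metric class under a string key.
        def decorator(metric_cls: type) -> type:
            cls._metrics[name] = metric_cls
            return metric_cls
        return decorator

    @classmethod
    def create(cls, name: str, **kwargs):
        # Look up a registered class by name and instantiate it.
        if name not in cls._metrics:
            raise KeyError(f"Unknown metric: {name!r}")
        return cls._metrics[name](**kwargs)


@MetricRegistry.register("lm_eval_metric")
class LMEvalMetric:
    # Constructor shape follows the PR discussion (metric_name, call_type).
    def __init__(self, metric_name: str, call_type: str = "pairwise"):
        self.metric_name = metric_name
        self.call_type = call_type


metric = MetricRegistry.create("lm_eval_metric", metric_name="exact_match")
print(metric.metric_name)  # exact_match
```

With a registry like this, a Task can resolve a request string to metric instances without importing the metric classes directly.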

Written by Cursor Bugbot for commit 3767da9. This will update automatically on new commits.

@sdiazlor sdiazlor requested a review from begumcig October 6, 2025 09:40
@simlang simlang requested review from simlang and removed request for simlang October 10, 2025 14:32
Member

@begumcig begumcig left a comment


Hi @sky-2002, thank you so much for your contribution! We'd love to bring lm-eval into Pruna. Integrating the entire evaluation suite in one go is a big task, but I think with some refactoring we can still make this work! In our codebase, we usually use the StatefulMetric interface for a single metric, whereas lm-eval is more of a full benchmarking suite. To align with our structure, we might need to break it into smaller components 🔨🔨. We also have a Task interface that coordinates multiple metrics together. I was thinking we could either integrate one task from lm-eval as one Task (inheriting from our Task interface), or integrate some metrics from the lm-eval metrics registry as StatefulMetrics.

What do you think? Do you think any of these approaches would align with lm-eval interface? Thank you again!

Author

sky-2002 commented Oct 19, 2025

@begumcig Yeah, I agree; while implementing it, I was not very comfortable with this approach either.
From what I see in the Pruna code, a task has a request, which is a list of metrics, exactly like an lm-eval task, which has its metrics defined in YAML files.

alternatively, we could integrate some metrics from lm-eval metrics registry as StatefulMetrics.

Yes, this is a much cleaner way and aligns with the lm-eval interface (adding each lm-eval metric as a Pruna metric); let's do this. I will update the code.
Also, you are right, I misunderstood StatefulMetric.

I was thinking maybe we could integrate one task from lm-eval as one Task (inheriting from our Task interface)

Don't the above two need to be done together? We need to implement a task and define its metrics, or we need a way to create a task on the fly given a task name and the metrics it needs.

For example, given that we have all metrics implemented in pruna:

We can instantiate a task for a lm-eval task:

from pruna.evaluation.task import Task

squad_metrics = ["lm_eval_exact_match", "lm_eval_f1"]

squad_task = Task(
    request=squad_metrics,
    datamodule=squad_datamodule, 
    device="cuda"
)

or have a task factory:

from lm_eval import tasks as lm_tasks
from pruna.evaluation.task import Task

def make_pruna_task_from_lmeval(task_name, datamodule, model_args, device="cuda"):
    lm_task = lm_tasks.get_task(task_name)
    metrics = [f"lm_eval_{m['metric']}" for m in lm_task.config.metric_list]
    return Task(request=metrics, datamodule=datamodule, device=device)

Wdyt, how should we go about this?

@begumcig
Member

@sky-2002 Yes, that's exactly how our task and stateful metric interface work!

I was suggesting that you could:

  • Either create something like a LMEvalTask inheriting from our Task, that would also integrate the attributes and functions that are required for lm_eval task.
  • Or exactly like you suggested integrate the metrics as stateful metrics and initialize them using our own Task interface. I think the solution you suggested aligns perfectly with our interface and it would be amazing!

Once we have the metrics as Pruna metrics, we can use them with our Task interface like you have shown in the example! This all looks super good to me, and thank you so much for taking the time to look into this, you are a champ 🏆🏆🏆🏆


cursor bot commented Oct 23, 2025

Bug: Mismatched Class and Method Interfaces

The test file attempts to import and use EvalHarnessMetric class, but the actual implementation defines LMEvalMetric class. Additionally, the test uses a completely different constructor signature (expecting tasks, model_args, device parameters) and calls compute(model=None, dataloader=None) method, while the actual LMEvalMetric constructor takes metric_name and optional call_type parameters, and the compute() method takes no parameters. The test expects result structure with task_scores in params, but the implementation returns different parameter structure. This complete interface mismatch will cause the tests to fail with ImportError and TypeError.



cursor bot commented Oct 23, 2025

Bug: Error Message Mismatch with New Request Types

The error message references AVAILABLE_REQUESTS constant which only contains "image_generation_quality", but the function now supports requests starting with "lm_eval:". This creates misleading error messages that don't inform users about the newly supported lm_eval request format, potentially causing confusion when users receive errors claiming only image_generation_quality is available.



cursor bot commented Oct 23, 2025

Bug: Outdated Constants Mislead on New Features

AVAILABLE_REQUESTS constant is outdated and doesn't include the newly added lm_eval functionality. The error message on lines 272-273 references this constant, making it misleading since lm_eval requests (with "lm_eval:" prefix) are now supported but not listed in AVAILABLE_REQUESTS.



cursor bot commented Oct 23, 2025

Bug: Function Lacks Error Handling for Missing Keys and Attributes

The _get_lm_eval_task_metrics function has no error handling for potential KeyError when accessing task_dict[task_name] or AttributeError when accessing task.config.metric_list. If get_task_dict returns a dict without the requested task_name key, or if the task object doesn't have a config attribute with metric_list, this will cause unhandled exceptions.
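A defensive version of that lookup might look roughly like the sketch below. This is a hypothetical helper, not the PR's actual `_get_lm_eval_task_metrics`; the function name, signature, and error messages are illustrative only:

```python
def get_lm_eval_task_metrics(task_dict: dict, task_name: str) -> list[str]:
    """Return metric names for a task, guarding the dict/attribute access."""
    # Guard against KeyError: the requested task may not be in the dict.
    if task_name not in task_dict:
        available = ", ".join(sorted(task_dict)) or "<none>"
        raise KeyError(
            f"lm-eval task {task_name!r} not found. Available tasks: {available}"
        )
    task = task_dict[task_name]
    # Guard against AttributeError: task may lack config or metric_list.
    config = getattr(task, "config", None)
    metric_list = getattr(config, "metric_list", None) if config else None
    if not metric_list:
        raise AttributeError(
            f"lm-eval task {task_name!r} does not expose config.metric_list"
        )
    return [entry["metric"] for entry in metric_list]
```

Raising a descriptive error at the boundary keeps the failure close to its cause instead of surfacing as a bare KeyError deep inside Task.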



cursor bot commented Oct 23, 2025

Bug: Empty Task Name Validation Missing

No validation for empty task name after splitting the request. If the request is exactly "lm_eval:" with no task name following, task_name will be an empty string, which could cause issues in subsequent function calls.
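A guard for that edge case could be sketched as follows; the helper name is hypothetical and the real parsing in the PR lives inside Task, but the prefix check and empty-name validation are the point:

```python
LM_EVAL_PREFIX = "lm_eval:"

def parse_lm_eval_request(request: str) -> str:
    """Extract the task name from an 'lm_eval:<task_name>' request string."""
    if not request.startswith(LM_EVAL_PREFIX):
        raise ValueError(f"Not an lm_eval request: {request!r}")
    task_name = request[len(LM_EVAL_PREFIX):].strip()
    # Reject a bare "lm_eval:" with nothing after the colon.
    if not task_name:
        raise ValueError(
            f"Empty task name: expected 'lm_eval:<task_name>', got {request!r}"
        )
    return task_name

print(parse_lm_eval_request("lm_eval:hellaswag"))  # hellaswag
```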


Author

sky-2002 commented Oct 23, 2025

@begumcig I tried implementing. But I have a few concerns.

  • We can create wrappers around the lm-eval metrics, but then we would have to wrap everything else too: for a task such as hellaswag, we also need its dataset (specified in YAML), its pre-processing and post-processing functions, etc., which might make our code brittle. It could get confusing, with a lot of duplicated code, because all of this already exists in lm-eval.

Either create something like a LMEvalTask inheriting from our Task, that would also integrate the attributes and functions that are required for lm_eval task.

I realise this would be much cleaner and would keep the lm-eval-dependent code self-contained, so we would not need to keep updating it. But I am not sure whether it should inherit from the existing Task code; feel free to give suggestions, I am a little unsure about a good way to do this.

Or exactly like you suggested integrate the metrics as stateful metrics and initialize them using our own Task interface. I think the solution you suggested aligns perfectly with our interface and it would be amazing!

Or if we continue this way, how do we handle the datasets for each task and other related attributes and functions?


github-actions bot commented Nov 3, 2025

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Nov 3, 2025
@github-actions github-actions bot closed this Nov 10, 2025
@begumcig begumcig reopened this Nov 11, 2025
Member

@begumcig begumcig left a comment


Thank you so much @sky-2002, this is insanely good, you really dropped this 👑👑👑👑👑👑.
Sorry for the delay in reviewing; things have been pretty busy with the releases lately 🥺.
This is honestly so close to perfect, I just added a few tiny comments.
We’d really, really appreciate it if you could take a quick look when you get the chance, but if not, no stress at all! We can also happily take over from here, however you prefer 💜💜💜
Honestly can't wait to merge this into Pruna 💃🏻💃🏻


try:
# lm-eval metrics expect a list of (reference, prediction) tuples
raw_items = self.metric_fn(self.pairs)
Member


I love this new version! Being super nitpicky here, but I think it makes sense to move line 105 (self.metric_fn()) into update, so that pairs collects the calculated metric scores and compute only does the final aggregation?

Author


Done
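The suggested split might look roughly like this. Everything below is an illustrative stand-in, not the PR's actual LMEvalMetric code: the class name, the toy exact_match function, and the mean aggregation are all assumptions used to show the update/compute separation:

```python
from statistics import mean

def exact_match(pairs):
    # Stand-in for an lm-eval metric fn over (reference, prediction) pairs.
    return [1.0 if ref == pred else 0.0 for ref, pred in pairs]

class StatefulExactMatch:
    def __init__(self):
        self.scores: list[float] = []

    def update(self, references, predictions):
        # Score each batch immediately; keep only the per-item scores.
        self.scores.extend(exact_match(list(zip(references, predictions))))

    def compute(self) -> float:
        # Final aggregation only, as suggested in the review.
        return mean(self.scores) if self.scores else 0.0

m = StatefulExactMatch()
m.update(["a", "b"], ["a", "c"])
m.update(["d"], ["d"])
print(m.compute())  # 2/3 ≈ 0.667
```

This keeps compute cheap and idempotent, since all per-sample work has already been done batch by batch in update.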

pruna_logger.error(f"Failed computing lm-eval metric {self.metric_name}: {e}")
raise

score_value = float(np.mean(list(score.values()))) if isinstance(score, dict) else float(score)
Member


I am asking just to learn, if we do aggregation already on line 106 why do we take the mean here?

Author


Oh yeah, I don't think we need the mean; removed it.

CMMD(device=stateful_metric_device),
]

elif request.startswith("lm_eval:"):
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is insanely cool!

Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks 😄

@@ -0,0 +1,46 @@
import pytest
Member


I love this test suite, could we possibly also add some tests for the "lm_eval:" case, using Task?

Author


Done

Member

begumcig commented Nov 11, 2025


@sky-2002 That makes a lot of sense, I totally agree that bringing dataset handling into the evaluation part would make things brittle and messy.
I think we could handle that through the PrunaDataModule instead? We could even consider adding the required data as datasets directly in our library, so the evaluation code stays clean and focused on its purpose and people could use these datasets for any purpose they prefer :)

@github-actions github-actions bot removed the stale label Nov 12, 2025
Author

sky-2002 commented Nov 13, 2025

@begumcig I have tried to address review comments, thanks for the review.

Also, I noticed that Pruna's code structure changed a little bit (or is it just in algorithms); what changes will we need in this PR?

Edit: The code structure changes were in algorithms, ignore above question

@begumcig begumcig self-requested a review November 13, 2025 16:27

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Nov 24, 2025
@github-actions github-actions bot closed this Dec 1, 2025
@sharpenb sharpenb reopened this Dec 8, 2025
@github-actions github-actions bot removed the stale label Dec 9, 2025
@davidberenstein1957
Member

@begumcig @sky-2002 any blockers here from either side?


This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Dec 20, 2025
@github-actions github-actions bot closed this Dec 27, 2025
@github-actions github-actions bot removed the stale label Mar 11, 2026
@begumcig begumcig force-pushed the sky-2002/metric-evalharness branch 3 times, most recently from c4bb8d6 to 3fd5aaa Compare March 11, 2026 18:28
Author

sky-2002 commented Mar 12, 2026

Hey @begumcig, sorry you had to push so many more commits; I sort of lost track of this PR and got busy with work. I will check that comment (if it is still relevant) when I am free.

@begumcig begumcig force-pushed the sky-2002/metric-evalharness branch from 128a395 to d0a01fb Compare March 12, 2026 13:31
@begumcig begumcig force-pushed the sky-2002/metric-evalharness branch 2 times, most recently from 264c086 to 2ae25f8 Compare March 12, 2026 16:48
@begumcig begumcig force-pushed the sky-2002/metric-evalharness branch from 2ae25f8 to 6847699 Compare March 12, 2026 17:20
@begumcig
Member

@sky-2002 No worries at all! I was just making a few small updates. This looks really amazing, and thank you so much for this contribution! We haven’t had the chance to integrate many text evaluation metrics into Pruna yet, and this is a big step forward. Really appreciate the work you put into this 🥹

@begumcig
Member

Also @sky-2002, while taking a look at the evaluation harness I noticed that the get_task_dict interface used here has been deprecated. If you’d be interested in updating it at some point, that could be a nice improvement, but absolutely no rush, we can also handle it in a future pass 💜

@begumcig begumcig force-pushed the sky-2002/metric-evalharness branch from da0d91a to c65c555 Compare March 13, 2026 10:56


Development

Successfully merging this pull request may close these issues.

[FEATURE] Add lm-eval to our Metrics

5 participants