feat(evaluation): Add lm-eval to Pruna Metrics #380
sky-2002 wants to merge 17 commits into PrunaAI:main
Conversation
begumcig
left a comment
Hi @sky-2002, thank you so much for your contribution! We'd love to bring lm-eval into Pruna. Integrating the entire evaluation suite in one go is a big task, but I think with some refactoring we can still make this work! In our codebase we usually use the StatefulMetric interface for a single metric, whereas lm-eval is more of a full benchmarking suite, so to align with our structure we might need to break it into smaller components 🔨🔨. We also have a Task interface that coordinates multiple metrics together. I was thinking we could either integrate one task from lm-eval as one Task (inheriting from our Task interface), or integrate some metrics from lm-eval's metrics registry as StatefulMetrics.
What do you think? Do you think any of these approaches would align with lm-eval interface? Thank you again!
@begumcig Yeah, I agree; while implementing it I was not very comfortable with this approach either.
Yes, this is a much cleaner way and aligns with
The above two need to be done together, no? We need to implement a task and define its metrics, or we need a way to create a task on the fly given a task name and the metrics needed for it. For example, given that we have all metrics implemented in Pruna, we can instantiate a Task for an lm-eval task:

```python
from pruna.evaluation.task import Task

squad_metrics = ["lm_eval_exact_match", "lm_eval_f1"]
squad_task = Task(
    request=squad_metrics,
    datamodule=squad_datamodule,
    device="cuda",
)
```

or have a task factory:

```python
def make_pruna_task_from_lmeval(task_name, datamodule, model_args, device="cuda"):
    lm_task = lm_tasks.get_task(task_name)
    metrics = [f"lm_eval_{m['metric']}" for m in lm_task.config.metric_list]
    return Task(request=metrics, datamodule=datamodule, device=device)
```

Wdyt, how should we go about this?
@sky-2002 Yes, that's exactly how our task and stateful metric interface work! I was suggesting that you could:
Once we have the metrics as Pruna metrics, we can use them with our Task interface like you have shown in the example! This all looks super good to me, and thank you so much for taking the time to look into this, you are a champ 🏆🏆🏆🏆
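To make the shape of this concrete, here is a minimal sketch of one lm-eval metric wrapped as a Pruna-style stateful metric, matching the accumulate-then-score structure discussed above. All names here (`metric_fn`, `agg_fn`, the class itself) are illustrative assumptions, not Pruna's actual API:

```python
# Hypothetical sketch: update() collects (reference, prediction) pairs,
# compute() runs the lm-eval-style scoring and aggregation at the end.
class LMEvalStatefulMetric:
    def __init__(self, metric_fn, agg_fn):
        self.metric_fn = metric_fn  # per-item scorer, e.g. from lm-eval's registry
        self.agg_fn = agg_fn        # aggregation over item scores, e.g. mean
        self.pairs = []

    def update(self, references, predictions):
        # accumulate pairs batch by batch
        self.pairs.extend(zip(references, predictions))

    def compute(self):
        # lm-eval metrics expect a list of (reference, prediction) tuples
        raw_items = self.metric_fn(self.pairs)
        return self.agg_fn(raw_items)


# toy usage with a hypothetical exact-match scorer
em = LMEvalStatefulMetric(
    metric_fn=lambda pairs: [float(r == p) for r, p in pairs],
    agg_fn=lambda items: sum(items) / len(items),
)
em.update(["paris", "rome"], ["paris", "berlin"])
print(em.compute())  # 0.5
```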
Bug: Mismatched Class and Method Interfaces
The test file attempts to import and use …
Bug: Error Message Mismatch with New Request Types
The error message references …
Bug: Outdated Constants Mislead on New Features
The AVAILABLE_REQUESTS constant is outdated and doesn't include the newly added lm_eval functionality. The error message on lines 272-273 references this constant, making it misleading, since lm_eval requests (with the "lm_eval:" prefix) are now supported but not listed in AVAILABLE_REQUESTS.
Bug: Function Lacks Error Handling for Missing Keys and Attributes
The _get_lm_eval_task_metrics function has no error handling for a potential KeyError when accessing task_dict[task_name] or an AttributeError when accessing task.config.metric_list. If get_task_dict returns a dict without the requested task_name key, or if the task object doesn't have a config attribute with metric_list, this will cause unhandled exceptions.
@begumcig I tried implementing. But I have a few concerns.
I realise that this would be much cleaner and would leave
Or if we continue this way, how do we handle the datasets for each task and other related attributes and functions?
This PR has been inactive for 10 days and is now marked as stale.
begumcig
left a comment
Thank you so much @sky-2002, this is insanely good, you really dropped this 👑👑👑👑👑👑.
Sorry for the delay in reviewing; things have been pretty busy with the releases lately 🥺.
This is honestly so close to perfect, I just added a few tiny comments.
We’d really, really appreciate it if you could take a quick look when you get the chance, but if not, no stress at all! We can also happily take over from here, however you prefer 💜💜💜
Honestly can't wait to merge this into Pruna 💃🏻💃🏻
```python
try:
    # lm-eval metrics expect a list of (reference, prediction) tuples
    raw_items = self.metric_fn(self.pairs)
```
I love this new version! Being super nitpicky here, but I think it makes sense to move line 105 (self.metric_fn()) to update, so the pairs collect the calculated metric scores and in compute we only do the final aggregation?
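The suggested refactor would look roughly like this: score items as they arrive in `update`, leaving `compute` as pure aggregation. The names (`metric_fn`, `agg_fn`, `scores`) are illustrative assumptions, not the PR's actual code:

```python
# Sketch of the suggested split: per-batch scoring moves into update(),
# so compute() does only the final aggregation.
class LMEvalStatefulMetric:
    def __init__(self, metric_fn, agg_fn):
        self.metric_fn = metric_fn
        self.agg_fn = agg_fn
        self.scores = []  # per-item metric scores, not raw pairs

    def update(self, references, predictions):
        # score each batch as it arrives
        self.scores.extend(self.metric_fn(list(zip(references, predictions))))

    def compute(self):
        # no scoring work left here; aggregation only
        return self.agg_fn(self.scores)


m = LMEvalStatefulMetric(
    metric_fn=lambda pairs: [float(r == p) for r, p in pairs],
    agg_fn=lambda s: sum(s) / len(s),
)
m.update(["a", "b"], ["a", "c"])
m.update(["d"], ["d"])
print(m.compute())
```

One upside of this shape is that a failed scoring call surfaces during evaluation of the offending batch, not at the very end of a long run.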
```python
pruna_logger.error(f"Failed computing lm-eval metric {self.metric_name}: {e}")
raise
```
```python
score_value = float(np.mean(list(score.values()))) if isinstance(score, dict) else float(score)
```
I am asking just to learn, if we do aggregation already on line 106 why do we take the mean here?
Oh yeah, I don't think we need the mean; removed it.
```python
CMMD(device=stateful_metric_device),
]
```
```python
elif request.startswith("lm_eval:"):
```
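For reference, the `lm_eval:` prefix dispatch can be sketched in isolation like this. The `AVAILABLE_REQUESTS` contents and branch structure are assumptions based on the review comments, not the actual `task.py` code:

```python
# Stand-alone illustration of dispatching on the "lm_eval:" request prefix.
AVAILABLE_REQUESTS = ["image_generation_quality"]  # hypothetical existing entries


def resolve_request(request):
    if request in AVAILABLE_REQUESTS:
        return ("builtin", request)
    elif request.startswith("lm_eval:"):
        # everything after the prefix is the lm-eval task name
        task_name = request.split(":", 1)[1]
        return ("lm_eval", task_name)
    raise ValueError(
        f"Unknown request {request!r}; expected one of {AVAILABLE_REQUESTS} "
        "or an 'lm_eval:<task_name>' request."
    )


print(resolve_request("lm_eval:squad"))  # ('lm_eval', 'squad')
```

Note that the error message mentions the `lm_eval:` form alongside `AVAILABLE_REQUESTS`, which addresses the "outdated constants" finding above.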
```
@@ -0,0 +1,46 @@
import pytest
```
I love this test suite, could we possibly also add some tests for the "lm_eval:" case, using Task?
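The requested tests might take roughly this shape. `Task` is stubbed here so the sketch is self-contained; the real tests would import Pruna's `Task` and a datamodule, and the parsing behavior shown is an assumption:

```python
# Hypothetical shape for "lm_eval:" Task tests, with Task stubbed in
# so the sketch runs without pruna installed.
class Task:
    def __init__(self, request, datamodule=None, device="cpu"):
        if not (isinstance(request, str) and request.startswith("lm_eval:")):
            raise ValueError("expected an 'lm_eval:<task_name>' request")
        self.task_name = request.split(":", 1)[1]


def test_lm_eval_request_builds_task():
    task = Task("lm_eval:squad")
    assert task.task_name == "squad"


def test_non_lm_eval_request_raises():
    try:
        Task("exact_match")
    except ValueError:
        return
    assert False, "expected ValueError"


test_lm_eval_request_builds_task()
test_non_lm_eval_request_raises()
print("ok")
```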
@sky-2002 That makes a lot of sense, I totally agree that bringing dataset handling into the evaluation part would make things brittle and messy.
@begumcig I have tried to address review comments, thanks for the review.
Edit: the code structure changes were in algorithms; ignore the above question.
Hey @begumcig, sorry you had to push so many more commits; I sort of lost track of this PR and got busy with work. Will check that comment (if it's still relevant) when I'm free.
@sky-2002 No worries at all! I was just making a few small updates. This looks really amazing, and thank you so much for this contribution! We haven’t had the chance to integrate many text evaluation metrics into Pruna yet, and this is a big step forward. Really appreciate the work you put into this 🥹
Also @sky-2002, while taking a look at the evaluation harness I noticed that the get_task_dict interface used here has been deprecated. If you’d be interested in updating it at some point, that could be a nice improvement, but absolutely no rush, we can also handle it in a future pass 💜 |
… we can continue using torchmetrics for that
Description
Adds evaluation harness to the metrics lineup.
Related Issue
Fixes #378
Type of Change
How Has This Been Tested?
Checklist
Additional Notes
`EvalHarnessMetric` inherits from `BaseMetric`, since eval harness metrics are model-level metrics.

Note
Adds `LMEvalMetric` wrapping lm-evaluation-harness metrics and enables `lm_eval:<task>` requests in `Task`, with tests and an optional dependency.

- `LMEvalMetric` (`src/pruna/evaluation/metrics/metric_evalharness.py`) wraps lm-evaluation-harness metrics via the metric registry/aggregation, with stateful accumulation and result reporting; registered in `MetricRegistry` as `lm_eval_metric`.
- Supports `lm_eval:<task_name>` requests in `Task` (`src/pruna/evaluation/task.py`), resolving metrics via `lm_eval.tasks.get_task_dict` and instantiating an `LMEvalMetric` per task metric.
- Adds `tests/evaluation/test_evalharness_metrics.py` covering BLEU-like scoring, empty-input behavior, and preds/refs length mismatch.
- Adds optional dependency group `evalharness` with `lm-eval>=0.4.0` in `pyproject.toml`.

Written by Cursor Bugbot for commit 3767da9. This will update automatically on new commits.