Investigate and run lm-eval with few-shot evaluations

## **Description**

We currently report only zero-shot metrics for checkpoints. To better understand model performance, we should investigate and run **lm-evaluation-harness** in few-shot settings (k in-context examples per prompt, with k > 0) on the latest checkpoints. The goal is to report few-shot results on the same benchmark tasks we already track.

## **Tasks**

* Run lm-eval with few-shot configurations on the latest checkpoints.
* Collect and summarize few-shot metrics (acc, acc_norm, etc.) for standard tasks (arc, hellaswag, winogrande, …).
* Log results in a format consistent with existing evaluation reports.
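
As a starting point for the first task, here is a minimal sketch of how a few-shot run could be launched. It only builds the `lm_eval` command line, using flags the harness CLI provides (`--model`, `--model_args`, `--tasks`, `--num_fewshot`, `--output_path`). The checkpoint path, the output directory, the choice of 5 shots, and the exact task names (`arc_easy`, `arc_challenge` standing in for "arc") are assumptions for illustration, not our actual configuration.

```python
import shlex


def build_lm_eval_command(checkpoint_path, tasks, num_fewshot, output_dir):
    """Return the argv list for one lm-eval run at a given shot count."""
    if num_fewshot < 0:
        raise ValueError("num_fewshot must be non-negative")
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={checkpoint_path}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        # Keep each shot count in its own directory so few-shot output
        # never overwrites or mixes with zero-shot output.
        "--output_path", f"{output_dir}/{num_fewshot}shot",
    ]


# Hypothetical values: replace with the real checkpoint and tracked tasks.
TASKS = ["arc_easy", "arc_challenge", "hellaswag", "winogrande"]
cmd = build_lm_eval_command("checkpoints/latest", TASKS, 5, "eval_results")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where lm-evaluation-harness is installed. Running the same function with `num_fewshot=0` gives a matching zero-shot baseline for comparison.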

## **Acceptance Criteria**

* Few-shot metrics are produced and available for recent checkpoints.
* Results are clearly distinguishable from zero-shot runs.
* Report covers the same benchmark tasks as our existing evaluation flow.
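
One possible way to keep few-shot results distinguishable from zero-shot runs is to attach the shot count to every reported row. The sketch below assumes per-task metrics are already loaded as a plain dict of `{task: {metric: value}}`. That shape, the function name `tag_results`, and the row fields are hypothetical and would need adapting to our existing report format. The numbers in the example are placeholders, not measured results.

```python
def tag_results(task_metrics, num_fewshot):
    """Flatten per-task metrics into rows labeled with the shot count."""
    setting = "zero-shot" if num_fewshot == 0 else f"{num_fewshot}-shot"
    rows = []
    for task, metrics in sorted(task_metrics.items()):
        for name, value in sorted(metrics.items()):
            # Skip non-numeric entries such as display aliases.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            rows.append(
                {
                    "task": task,
                    "num_fewshot": num_fewshot,
                    "setting": setting,
                    "metric": name,
                    "value": value,
                }
            )
    return rows


# Placeholder numbers for illustration only.
example = {"hellaswag": {"acc": 0.5, "acc_norm": 0.5, "alias": "hellaswag"}}
for row in tag_results(example, 5):
    print(row)
```

Carrying both `num_fewshot` and a readable `setting` label in each row lets zero-shot and few-shot results share one table while remaining easy to filter.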
