Open
Description
We currently report only zero-shot metrics for checkpoints. To better understand model performance, we should run lm-evaluation-harness in few-shot settings (k in-context examples per task) on the latest checkpoints. The goal is to report few-shot results on the same benchmark tasks we already track.
Tasks
- Run lm-eval with few-shot configurations on the latest checkpoints.
- Collect and summarize few-shot metrics (accuracy, acc_norm, etc.) for standard tasks (arc, hellaswag, winogrande, …).
- Log results in a format consistent with existing evaluation reports.
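
As a starting point, the few-shot runs could be driven by a small wrapper around the `lm_eval` CLI. The sketch below only builds the command line; it assumes the `hf` model backend and the `--num_fewshot`, `--tasks`, `--model_args`, and `--output_path` flags of recent lm-evaluation-harness releases. The checkpoint path, output directory layout, shot counts, and the exact task names (`arc_easy`, `arc_challenge`) are placeholders, not values taken from our existing evaluation flow.

```python
import shlex


def build_lm_eval_command(checkpoint, tasks, num_fewshot, output_dir):
    """Return the argv list for one lm-eval run at a given shot count.

    Each shot count gets its own output subdirectory (hypothetical layout),
    so few-shot results never overwrite zero-shot ones.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={checkpoint}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--output_path", f"{output_dir}/{num_fewshot}shot",
    ]


if __name__ == "__main__":
    # Placeholder task list and checkpoint path; replace with the tracked ones.
    tasks = ["arc_easy", "arc_challenge", "hellaswag", "winogrande"]
    for k in (0, 5):  # assumed shot counts, for illustration only
        cmd = build_lm_eval_command("path/to/checkpoint", tasks, k, "eval_results")
        print(shlex.join(cmd))
```

Printing the commands rather than executing them makes it easy to review the exact settings before spending compute; passing each list to `subprocess.run` would launch the runs.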
Acceptance Criteria
- Few-shot metrics are produced and available for recent checkpoints.
- Results are clearly distinguishable from zero-shot runs.
- Report covers the same benchmark tasks as our existing evaluation flow.
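
One way to keep few-shot and zero-shot results distinguishable is to attach the shot count to every reported row. The sketch below is a hypothetical helper, not part of lm-evaluation-harness: it assumes the per-task metrics arrive as a dict keyed by `"<metric>,<filter>"` strings (for example `"acc,none"`), which is the layout recent harness versions appear to write under the `results` key. The numbers in the usage example are dummy values, not measurements.

```python
def summarize(results, num_fewshot, metrics=("acc", "acc_norm")):
    """Flatten per-task metrics into rows tagged with the shot count.

    `results` maps task name -> {"<metric>,<filter>": value} (assumed layout).
    Only the metric names listed in `metrics` are kept; stderr entries and
    other keys are skipped.
    """
    rows = []
    for task, task_metrics in sorted(results.items()):
        for key, value in task_metrics.items():
            name = key.split(",")[0]  # drop the filter suffix, e.g. ",none"
            if name in metrics:
                rows.append(
                    {
                        "task": task,
                        "num_fewshot": num_fewshot,
                        "metric": name,
                        "value": value,
                    }
                )
    return rows


if __name__ == "__main__":
    # Dummy values for illustration only; not real evaluation results.
    dummy = {"hellaswag": {"acc,none": 0.5, "acc_stderr,none": 0.01, "acc_norm,none": 0.6}}
    for row in summarize(dummy, num_fewshot=5):
        print(row)
```

Because `num_fewshot` is an explicit column, zero-shot rows (`num_fewshot=0`) and few-shot rows can share one table without ambiguity.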