Work in progress. Scheduled for release in November.
We evaluated over 1,000 models on downstream tasks using OpenCompass, an LLM evaluation tool.
For details on the datasets and metrics, please refer to our paper.
This repository provides a crawled and consolidated dataframe of the evaluation results.
The dataframe consists of metadata about each evaluated model along with scores for downstream tasks:
- `model_name`: The name of the model family
- `data_name`: The name of the dataset used for SFT. `all` refers to a merged dataset containing all 10 datasets
- `lora`: `true` for LoRA tuning, `false` for full-parameter tuning
- `epoch`: The number of epochs each trained model was run on the dataset specified in `data_name`
- `data_size`: The number of samples in the training dataset
- `lr_strategy`: Indicates whether the learning rate schedule used cosine decay or was kept constant
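To illustrate how the dataframe can be used, here is a minimal sketch that loads the consolidated results and filters them by the metadata columns above. The file name `results.csv` and the score column `average_score` are placeholders for illustration, not the actual names in this repository.

```python
import pandas as pd

# Load the consolidated evaluation results.
# NOTE: "results.csv" is a placeholder; substitute the actual file in this repository.
df = pd.read_csv("results.csv")

# Select LoRA-tuned Qwen2.5 runs trained on the merged dataset.
subset = df[
    (df["model_name"] == "Qwen2.5 (7B)")
    & (df["lora"] == True)
    & (df["data_name"] == "all")
]

# Compare cosine-decay vs. constant learning-rate runs.
# NOTE: "average_score" is a hypothetical score column used for illustration.
print(subset.groupby("lr_strategy")["average_score"].mean())
```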
Evaluating this many models took a tremendous amount of time.
Unfortunately, due to unexpected runtime errors and GPU server failures, it is difficult to report the exact evaluation time.
This repository therefore provides an approximate estimate instead.
From our experience, evaluation time is largely affected by two factors:
- The model family, and
- Whether inference uses a LoRA adapter.
To better estimate this, we conducted additional verification by weight-merging LoRA-tuned and fully tuned models for each model family, thereby constructing representative models under each condition.
By measuring the evaluation time of these representative models, we can approximate the expected runtime.
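For reference, merging a LoRA adapter into its base model can be done with the `peft` library roughly as follows. This is a generic sketch rather than our exact merging script; the model and adapter paths are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model (model ID is a placeholder example).
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16
)

# Attach the LoRA adapter and fold its weights into the base model,
# producing a plain checkpoint that can be served like a fully tuned model.
merged = PeftModel.from_pretrained(base, "path/to/lora_adapter").merge_and_unload()

# Save the merged weights (output path is a placeholder).
merged.save_pretrained("path/to/merged_model")
```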
| Model Family | 0-shot full (s) | few-shot full (s) | 0-shot LoRA (s) | few-shot LoRA (s) | Approx. Total, s (h) |
|---|---|---|---|---|---|
| Chinese-Llama3 (8B) | 5407 | 36845 | 6648 | 38159 | 1,741,180 (484h) |
| Chinese-Mistral (7B) | 10950 | 45861 | 7037 | 37715 | 2,031,260 (564h) |
| Gemma2 (9B) | 4171 | 16287 | 9790 | 63519 | 1,650,032 (458h) |
| Llama3 (8B) | 4890 | 26568 | 6473 | 37426 | 1,507,140 (419h) |
| LLMjp-3 (7B) | 2505 | 15366 | 7453 | 43886 | 4,761,819 (1323h) |
| Mistral (7B) | 5937 | 40976 | 174795 | OOM | 4,434,160 (1232h) |
| OLMo (7B) | 2776 | 13565 | 34200* | 120600* | 6,674,679 (1854h) |
| Qwen2.5 (7B) | 1809 | 8032 | 6106 | 33204 | 2,842,969 (790h) |
| Sarashina2 (7B) | 2757 | 12163 | 7728 | 43305 | 1,319,060 (366h) |
| Swallow-Mistral (7B) | 8451 | 26000 | 6669 | 36878 | 1,559,960 (433h) |
| Yi1.5 (9B) | 2888 | 11119 | 8693 | 53763 | 1,529,260 (425h) |
* For OLMo (7B), LoRA values were estimated from logs, since weight merging was not supported (evaluated without vLLM).
* Each cell value is the time, in seconds, measured from the start to the completion of the evaluation process for a single model.
* Since the number of trained models varies across model families, the total time is not directly comparable under the same conditions.
* OOM = Out of Memory.
* All times above were measured with a single GPU. In practice, evaluation can be accelerated by roughly a factor of #nodes × #GPUs.
* These results are rough estimates and are extremely difficult to reproduce exactly in subsequent experiments.
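As a worked example of the scaling note above, the sketch below converts the approximate single-GPU totals into an expected wall-clock time for a given number of nodes and GPUs. The linear speedup factor of nodes × GPUs is the idealized assumption stated above, not a measured result.

```python
# Approximate single-GPU evaluation totals in seconds (taken from the table above).
TOTAL_SECONDS = {
    "Chinese-Llama3 (8B)": 1_741_180,
    "Qwen2.5 (7B)": 2_842_969,
    "OLMo (7B)": 6_674_679,
}

def estimated_wallclock_hours(family: str, nodes: int = 1, gpus_per_node: int = 1) -> float:
    """Idealized wall-clock estimate assuming linear speedup of nodes x GPUs."""
    return TOTAL_SECONDS[family] / (nodes * gpus_per_node) / 3600

# e.g., Qwen2.5 (7B) on 2 nodes with 4 GPUs each: ~790h / 8 ≈ 99h.
print(f"{estimated_wallclock_hours('Qwen2.5 (7B)', nodes=2, gpus_per_node=4):.0f}h")
```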
We hope this information will be useful as a reference.