Add llm-benchmarks proposal #113
Merged
Changes from all commits (6 commits, all by IcyFeather233):

- 8905586 add llm-benchmarks proposal
- 7754088 update llm benchmark proposal
- f5d74a1 update llm benchmark proposal
- 6862f88 update llm benchmark proposal
- 4b6afa1 translate llm-benchmark proposal
- 8115c14 update proposal, add opencompass tutorial
Binary file added (+53.6 KB): docs/proposals/scenarios/llm-benchmarks/images/data_process_change.png
475 additions, 0 deletions: docs/proposals/scenarios/llm-benchmarks/llm-benchmarks-zh.md (large diff not rendered by default)
510 additions, 0 deletions: docs/proposals/scenarios/llm-benchmarks/llm-benchmarks.md (large diff not rendered by default)
75 additions, 0 deletions: docs/proposals/scenarios/llm-benchmarks/opencompass-tutorial.md
# OpenCompass Tutorial

GitHub Repo: https://github.com/open-compass/opencompass/

## Introduction

![]()

OpenCompass is a one-stop platform for large model evaluation, aiming to provide a fair, open, and reproducible benchmark. Its main features include:

- Comprehensive support for models and datasets: pre-configured support for 20+ HuggingFace and API models, plus an evaluation scheme covering 70+ datasets with about 400,000 questions, assessing model capabilities across five dimensions.

- Efficient distributed evaluation: a single command handles task division and distributed evaluation, completing a full evaluation of billion-scale models in just a few hours.

- Diversified evaluation paradigms: support for zero-shot, few-shot, and chain-of-thought evaluation, combined with standard or dialogue-style prompt templates, to easily elicit the best performance from various models.

- Modular design with high extensibility: want to add new models or datasets, customize an advanced task-division strategy, or even support a new cluster management system? Everything in OpenCompass can be easily extended.

- Experiment management and reporting: config files fully record each experiment, with support for real-time reporting of results.

In a nutshell, OpenCompass supports evaluating most mainstream large models on mainstream benchmarks, and it makes it very convenient to configure and run evaluations of multiple models on multiple datasets.

## QuickStart

An OpenCompass evaluation is driven by a configuration file, which contains a model section and a dataset (i.e., benchmark) section. The following example illustrates this.

[`configs/eval_chat_demo.py`](https://github.com/open-compass/opencompass/blob/main/configs/eval_chat_demo.py) contains:

```python
from mmengine.config import read_base

with read_base():
    from .datasets.demo.demo_gsm8k_chat_gen import gsm8k_datasets
    from .datasets.demo.demo_math_chat_gen import math_datasets
    from .models.qwen.hf_qwen2_1_5b_instruct import models as hf_qwen2_1_5b_instruct_models
    from .models.hf_internlm.hf_internlm2_chat_1_8b import models as hf_internlm2_chat_1_8b_models

datasets = gsm8k_datasets + math_datasets
models = hf_qwen2_1_5b_instruct_models + hf_internlm2_chat_1_8b_models
```

This means the benchmarks are gsm8k_datasets and math_datasets, and the models are hf_qwen2_1_5b_instruct_models and hf_internlm2_chat_1_8b_models.

For the detailed configurations, look under `configs/datasets` and `configs/models`.
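For illustration, a dataset entry follows a similar dict-based pattern to the model configs. The sketch below is a simplified assumption of its shape: the `path` value is a placeholder, and `infer_cfg`/`eval_cfg` are omitted. Treat the real files under `configs/datasets` as authoritative.

```python
# Hedged sketch of a dataset config entry; field values are assumptions.
from opencompass.datasets import GSM8KDataset

gsm8k_datasets = [
    dict(
        abbr='gsm8k',                    # short name used in result tables
        type=GSM8KDataset,               # loader class for this benchmark
        path='./data/gsm8k',             # assumed local path to the data
        reader_cfg=dict(
            input_columns=['question'],  # column(s) fed into the prompt
            output_column='answer',      # column holding the reference answer
        ),
        # A real config also defines infer_cfg (prompt template and inference
        # strategy) and eval_cfg (metric); they are omitted in this sketch.
    )
]
```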
For example, `configs/models/qwen/hf_qwen2_1_5b_instruct.py` contains:

```python
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='qwen2-1.5b-instruct-hf',
        path='Qwen/Qwen2-1.5B-Instruct',
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]
```

It specifies the model abbreviation, the HuggingFace model path, the maximum output length, the inference batch size, and the number of GPUs. You can modify these fields as needed.
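Because configs are plain Python files, overrides are ordinary assignments. Below is a minimal sketch, assuming it is saved alongside the other configs so the relative imports from the demo above resolve:

```python
# Sketch: a custom eval config reusing stock entries but overriding fields.
from mmengine.config import read_base

with read_base():
    from .datasets.demo.demo_gsm8k_chat_gen import gsm8k_datasets
    from .models.qwen.hf_qwen2_1_5b_instruct import models as qwen_models

# Config entries are ordinary Python dicts, so fields can be tweaked in place.
for m in qwen_models:
    m['batch_size'] = 16      # run larger inference batches
    m['max_out_len'] = 512    # cap generation length to speed up the run

datasets = gsm8k_datasets
models = qwen_models
```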
You can run OpenCompass with a single command: `python run.py configs/eval_demo.py -w outputs/demo --debug`

This runs the `configs/eval_demo.py` config file and writes the outputs to `outputs/demo`.

To evaluate different benchmarks or models, simply change the config; it is very simple to use. For more details, see the [official documentation](https://opencompass.readthedocs.io/).
This proposal is related to #95.
Great to see a comprehensive proposal. This one is close to the final version. As discussed in the routine meeting, there are a few suggestions.
I have now added my changes to the ianvs core, including two flowcharts showing the architecture, and the doc has been updated with a more complete benchmark format.