Add GSM8K evaluation script and AWQ+FP8 results #2330

rtj1 wants to merge 12 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes

Hello @rtj1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces a new evaluation script and comprehensive results for AWQ+FP8 quantization on the GSM8K benchmark, specifically targeting the Qwen2.5-0.5B-Instruct model. It demonstrates the superior performance of the FP8_DYNAMIC scheme over FP8_BLOCK in terms of accuracy preservation, while maintaining efficient model compression. The changes aim to provide clear, reproducible benchmarks for quantized models.
Code Review
This pull request introduces a valuable evaluation script for GSM8K and provides detailed results for AWQ+FP8 quantization in RESULTS.md. The documentation is clear and the script is a useful addition for benchmarking. I have a couple of suggestions to enhance the robustness of the reproduction steps and the new evaluation script, mainly by making a shell command more specific and adding input validation to the Python script for better error handling.
examples/awq/RESULTS.md
Outdated
Used the existing example scripts from the repo:

```bash
cd examples/awq
sed -i 's/meta-llama\/Meta-Llama-3-8B-Instruct/Qwen\/Qwen2.5-0.5B-Instruct/g' *.py
```
Using a broad wildcard like *.py with sed -i can be risky, as it might unintentionally modify other Python files in the directory. It would be safer and more robust to use a more specific pattern that targets only the intended example scripts.
```diff
- sed -i 's/meta-llama\/Meta-Llama-3-8B-Instruct/Qwen\/Qwen2.5-0.5B-Instruct/g' *.py
+ sed -i 's/meta-llama\/Meta-Llama-3-8B-Instruct/Qwen\/Qwen2.5-0.5B-Instruct/g' fp8_*_llama_example.py
```
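One way to bound the blast radius of an in-place `sed -i` is to list the files that actually contain the target string first, and feed only those to sed. This is a minimal sketch in a scratch directory; the file names here are made up to mirror the example-script naming, and `sed -i` without a suffix assumes GNU sed:

```shell
# Demo in a scratch directory standing in for examples/awq
dir=$(mktemp -d); cd "$dir"
printf 'MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"\n' > fp8_dynamic_llama_example.py
printf 'print("unrelated helper")\n' > helper.py

# Preview which .py files the in-place edit would touch
grep -l 'meta-llama/Meta-Llama-3-8B-Instruct' *.py
# prints: fp8_dynamic_llama_example.py

# Restrict sed to exactly those files instead of a blanket *.py
grep -l 'meta-llama/Meta-Llama-3-8B-Instruct' *.py \
  | xargs sed -i 's/meta-llama\/Meta-Llama-3-8B-Instruct/Qwen\/Qwen2.5-0.5B-Instruct/g'

cat fp8_dynamic_llama_example.py
```

The `grep -l` preview costs nothing and makes the eventual edit auditable; `helper.py` is left untouched.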
Closes vllm-project#2305

This PR adds:
- gsm8k_eval.py: Evaluation script for running GSM8K benchmarks on quantized models
- RESULTS.md: Quantization and evaluation results for Qwen2.5-0.5B-Instruct with FP8_DYNAMIC and FP8_BLOCK schemes

Key findings:
- FP8_DYNAMIC achieves 22.67% strict match vs 17.97% for FP8_BLOCK on GSM8K
- Both schemes achieve ~1.2x compression (1.1GB -> 0.92GB)
- Quantized models uploaded to HuggingFace Hub for reproducibility

Evaluated on Google Colab L4 GPU (22.5GB) using the existing example scripts.

Signed-off-by: rtj1 <tharunjagarlamudi@gmail.com>
Signed-off-by: rtj1 <tharunjagarlamudi@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Jagarlamudi <76727507+rtj1@users.noreply.github.com>
Signed-off-by: rtj1 <tharunjagarlamudi@gmail.com>
rtj1 force-pushed the branch from 3ab5622 to 14bc4cc
So we're looking to evaluate the actual models the examples are generating. I'll run evals using your PR and we can go from there.
The quality checks have failed. Please run `make style` and `make quality`.
- Run make style to format code with ruff
- Run make quality to ensure all checks pass
- Address mergify bot feedback on quality checks

Signed-off-by: rtj1 <tharunjagarlamudi@gmail.com>
rtj1 force-pushed the branch from dabab9a to 151a5b9
Thanks for taking a look, @HDCharles! The quantized models are uploaded to HuggingFace:
I've also fixed the quality checks - all tests passing now. Let me know if you see any issues with the models or need different configs for the evaluation.
The quality checks have failed. Please run `make style` and `make quality`.
Looks like you edited a ton of files. I would undo the last commit and reinstall llm-compressor to get the correct version of ruff.
OK, here is my evaluation; see https://github.com/vllm-project/llm-compressor/pull/2347/changes. Can you finalize this PR using those numbers?

Checkpoints: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-awq-asym-fp8-block/tree/main/
- Update RESULTS.md with HDCharles's Llama-3-8B-Instruct evaluation results
- FP8_DYNAMIC: 76.42% strict match vs FP8_BLOCK: 75.21%
- Run make style with proper dev dependencies (`pip install -e .[dev]`)
- Fix code formatting per maintainer feedback

Results from: vllm-project#2347

Signed-off-by: rtj1 <tharunjagarlamudi@gmail.com>
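The quality workflow referenced in this thread boils down to the following sketch; it assumes a local clone of llm-compressor with its standard Makefile targets:

```shell
# From the llm-compressor checkout: install the dev extras so the
# pinned ruff version matches what CI uses
pip install -e .[dev]

# Auto-format with ruff, then run the same lint checks CI runs;
# both should pass before adding the `ready` label
make style
make quality
```

Reinstalling the dev extras is what keeps a local ruff from reformatting unrelated files, which is the failure mode described above.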
rtj1 force-pushed the branch from 3302cd9 to 9d425dd
Updated with Llama-3-8B evaluation results from PR #2347. Reinstalled llm-compressor with dev dependencies and re-ran the quality checks.

The PR now includes your official Llama-3-8B-Instruct evaluation results showing FP8_DYNAMIC achieves 76.42% vs FP8_BLOCK's 75.21% on GSM8K strict matching.
examples/awq/RESULTS.md
Outdated
```diff
@@ -0,0 +1,67 @@
+ # AWQ + FP8 Quantization Results
+
+ Closes #2305
```
no need to call this out in the changes, just in the PR summary
examples/awq/gsm8k_eval.py
Outdated
I think we should remove this file in favor of directly calling lm_eval as you show in the markdown file.
- Remove 'Closes vllm-project#2305' from RESULTS.md - Remove gsm8k_eval.py file (use lm_eval directly as documented) - Update RESULTS.md to reference only lm_eval command Signed-off-by: rtj1 <tharunjagarlamudi@gmail.com>
Thanks for the feedback @brian-dellabetta! I've addressed both points:
Latest commit: 002b77a
examples/awq/RESULTS.md
Outdated
| Scheme | Strict match | Flexible extract |
|---|---|---|
| **FP8_DYNAMIC** | **76.42%** | **76.19%** |
| **FP8_BLOCK** | 75.21% | 74.98% |

FP8_DYNAMIC wins by ~1.2% on strict matching. Both achieve similar performance on flexible extraction.
> FP8_DYNAMIC wins by ~1.2% on strict matching. Both achieve similar performance on flexible extraction.

this seems outdated?
Update the PR description for the targeted model.
Signed-off-by: rtj1 <tharunjagarlamudi@gmail.com>
This PR adds GSM8K evaluation results for AWQ+FP8 quantization as requested in #2305.
What's included
RESULTS.md - Evaluation results comparing FP8_DYNAMIC vs FP8_BLOCK quantization schemes on Meta-Llama-3-8B-Instruct
Results
Evaluation command
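A minimal sketch of the lm_eval invocation behind these numbers. The checkpoint id is the one @HDCharles posted above; the vllm backend, `max_model_len`, and 5-shot setting are assumptions, not taken from the thread:

```shell
# GSM8K eval of the AWQ+FP8_BLOCK checkpoint via the vllm backend
lm_eval \
  --model vllm \
  --model_args pretrained=nm-testing/Meta-Llama-3-8B-Instruct-awq-asym-fp8-block,max_model_len=4096 \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 16
```

Swap `pretrained=` for the FP8_DYNAMIC checkpoint to reproduce the other row of the results table.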
Note: `batch_size=16` is important — the default `auto` picks 1, significantly increasing evaluation time.

Model Checkpoints (from @HDCharles)